Hindawi Publishing Corporation

EURASIP Journal on Advances in Signal Processing Volume 2010, Article ID 451695, 28 pages doi:10.1155/2010/451695

Research Article

Audio Signal Processing Using Time-Frequency Approaches: Coding, Classification, Fingerprinting, and Watermarking

K. Umapathy, B. Ghoraani, and S. Krishnan

Department of Electrical and Computer Engineering, Ryerson University, 350 Victoria Street, Toronto, ON, Canada M5B 2K3

Correspondence should be addressed to S. Krishnan, krishnan@ee.ryerson.ca

Received 24 February 2010; Accepted 14 May 2010

Academic Editor: Srdjan Stankovic

Copyright © 2010 K. Umapathy et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Audio signals are information-rich nonstationary signals that play an important role in our day-to-day communication, perception of environment, and entertainment. Due to their nonstationary nature, time-only or frequency-only approaches are inadequate for analyzing these signals. A joint time-frequency (TF) approach is a better choice for processing them efficiently. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate a majority of audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above-mentioned areas. A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking are presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.

Introduction

A normal human can hear sound vibrations in the range of 20 Hz to 20 kHz. Signals that create such audible vibrations qualify as audio signals. Creating, modulating, and interpreting audio cues were among the foremost abilities that differentiated humans from the rest of the animal species. Over the years, methodical creation and processing of audio signals resulted in the development of different forms of communication, entertainment, and even biomedical diagnostic tools. With advances in technology, audio processing was automated and various enhancements were introduced. The current digital era furthered audio processing with the power of computers. Complex audio processing tasks were easily implemented and performed at blistering speeds. Digitally converted and formatted audio signals brought high levels of noise immunity with guaranteed quality of reproduction over time. However, the benefits of the digital audio format came with the penalty of huge data rates and difficulties in protecting copyrighted audio content over the Internet. On the other hand, the ability to use computers brought great power and flexibility in analyzing and extracting information from audio signals. These contrasting pros and cons of digital audio inspired the development of a variety of audio processing techniques.

In general, a majority of audio processing techniques address the following 3 application areas: (1) compression, (2) classification, and (3) security. The underlying theme (or motivation) for each of these areas is different and at times contrasting, which makes it a major challenge to arrive at a single solution. In spite of bandwidth expansion and better storage solutions, compression still plays an important role, particularly in mobile devices and content delivery over the Internet. While the requirement of compaction (in terms of retaining major audio components) drives the audio coding approaches, audio classification requires the extraction of subtle, accurate, and discriminatory information to group or index a variety of audio signals. It also covers a wide range of subapplications where the accuracy of the extracted audio information plays a vital role in content-based retrieval, sensing the auditory environment for critical applications, and biometrics. Unlike compaction in audio coding or extraction of information in classification, protecting digital audio content requires the addition of information in the form of a security key, which would then prove the ownership of the audio content. The external message (or key) should be added in such a way that it does not cause perceptual distortions and remains robust against attacks that attempt to remove it. Considering the above requirements, it would be difficult to address all of these application areas with a universal methodology unless we could model the audio signal as accurately as possible in a joint TF plane and then adaptively process the model parameters depending upon the application. In line with the above 3 application areas, this paper presents and discusses a TF-based audio coding scheme, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking.

The paper is organized as follows. Section 2 is devoted to the theories and the algorithms related to TF analysis. Section 3 deals with the use of TF analysis in audio coding and also presents comparisons among some of the audio coding technologies, including adaptive time-frequency transform (ATFT) coding, MPEG-Layer 3 (MP3) coding, and MPEG Advanced Audio Coding (AAC). In Section 4, TF analysis-based music classification and environmental sounds classification are covered. Section 5 presents fingerprinting and watermarking of audio signals using TF approaches, and Section 6 provides a summary of the paper.


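Time-Frequency Analysis

Signals can be classified into different classes based on their characteristics. One such classification is into deterministic and random signals. Deterministic signals are those that can be represented mathematically; in other words, all information about the signal is known a priori. Random signals take random values and cannot be expressed in a simple mathematical form like deterministic signals; instead, they are represented using their probabilistic statistics. When the statistics of such signals vary over time, they form another subdivision called nonstationary signals. Nonstationary signals are associated with time-varying spectral content, and most real-world signals (including audio) fall into this category. Due to this time-varying behavior, analyzing nonstationary signals is challenging.

Early signal processing techniques mainly used time-domain operations such as correlation, convolution, inner product, and signal averaging. While time-domain operations provide some information about the signal, they are limited in their ability to extract its frequency content. The introduction of Fourier theory addressed this issue by enabling the analysis of signals in the frequency domain. However, the Fourier technique provides only the global frequency content of a signal and not the times at which those frequencies occur. Hence, neither time-domain nor frequency-domain analysis is sufficient to analyze signals with time-varying frequency content. To overcome this difficulty and to analyze nonstationary signals effectively, techniques that give joint time and frequency information were needed. This gave birth to the TF transformations.

In general, TF transformations can be classified into two main categories based on (1) signal decomposition approaches and (2) bilinear TF distributions (also known as Cohen's class). In the decomposition-based approach, the signal is approximated by small TF functions derived from translating, modulating, and scaling a basis function that has a definite time and frequency localization. Distributions are two-dimensional energy representations with high TF resolution. Depending upon the application at hand and the feature extraction strategy, either the TF decomposition approach or the TF distribution approach can be used.

2.1. Adaptive Time-Frequency Transform (ATFT) Algorithm: Decomposition Approach. The ATFT technique is based on the matching pursuit algorithm with TF dictionaries [1, 2]. ATFT has excellent TF resolution properties (better than wavelets and wavelet packets), and due to its adaptive nature (handling nonstationarity), there is no need for signal segmentation. Flexible signal representations can be achieved as accurately as possible depending upon the characteristics of the TF dictionary.

In the ATFT algorithm, any signal $x(t)$ is decomposed into a linear combination of TF functions $g_{\gamma_n}(t)$ selected from a redundant dictionary of TF functions [2]. In this context, a redundant dictionary is an overcomplete collection of nonorthogonal basis functions, much larger than the minimum set of basis functions required to span the given signal space. Using ATFT, we can model any given signal $x(t)$ as

$$x(t) = \sum_{n=0}^{\infty} a_n\, g_{\gamma_n}(t), \tag{1}$$

where

$$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp\{j(2\pi f_n t + \phi_n)\}, \tag{2}$$

and $a_n$ are the expansion coefficients. The choice of the window function $g(t)$ determines the characteristics of the TF dictionary. The dictionary of TF functions can be suitably modified or selected based on the application at hand. The scale factor $s_n$, also called the octave parameter, controls the width of the window function, and the parameter $p_n$ controls the temporal placement. The parameters $f_n$ and $\phi_n$ are the frequency and phase of the exponential function, respectively. The index $\gamma_n$ represents a particular combination of the TF decomposition parameters ($s_n$, $p_n$, $f_n$, and $\phi_n$). In the TF decomposition-based works presented in the later part of this paper, a dictionary of Gabor functions was used (Gaussian functions, i.e., $g(t) = \exp(-2\pi t^2)$ in (2)), which has the best TF localization properties [3]. In the discrete ATFT algorithm implementation used in these works, the octave parameter $s_n$ could take any equivalent time-width value between 90 μs and 0.4 s; the phase parameter $\phi_n$ could take any value between 0 and 1, scaled to 0 to 180 degrees; the frequency parameter $f_n$ could take one of 8192 levels corresponding to 0 to 22,050 Hz (i.e., a sampling frequency of 44,100 Hz for wideband audio); and the temporal position parameter $p_n$ could take any value between 1 and the length of the signal.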
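To make (1) and (2) more concrete, the following is a minimal sketch, not the authors' implementation. It builds real, discrete Gabor atoms shaped like (2) and then runs a few generic matching pursuit steps, the algorithm on which ATFT is based [1, 2]: at each step the atom with the largest inner product with the current residue is selected and its contribution subtracted. The function names, the tiny parameter grid, and the two-atom test signal are all illustrative assumptions; only the sampling rate, the 8192 frequency levels, and the Gaussian window come from the text. The $1/\sqrt{s_n}$ factor is replaced here by numerical unit-norm scaling.

```python
import numpy as np

FS = 44_100        # sampling rate quoted in the text (Hz)
N_LEVELS = 8192    # frequency levels spanning 0 to FS/2, as quoted in the text


def gabor_atom(n_samples, s, p, f, phi, fs=FS):
    """Real, discrete Gabor atom shaped like (2), scaled to unit norm.

    s: time width in seconds, p: centre in samples,
    f: frequency in Hz, phi: phase in radians.
    """
    t = np.arange(n_samples) / fs
    u = (t - p / fs) / s                      # (t - p_n) / s_n
    atom = np.exp(-2.0 * np.pi * u**2) * np.cos(2.0 * np.pi * f * t + phi)
    return atom / np.linalg.norm(atom)


def build_dictionary(n_samples):
    """Tiny hypothetical grid; a practical dictionary is far denser."""
    scales = [0.002, 0.008, 0.032]            # seconds, inside 90 us .. 0.4 s
    centres = range(128, n_samples, 256)      # samples
    freqs = [k * (FS / 2) / N_LEVELS for k in (163, 372, 743, 1486)]
    phases = [0.0, np.pi / 2]
    params = [(s, p, f, ph) for s in scales for p in centres
              for f in freqs for ph in phases]
    atoms = np.stack([gabor_atom(n_samples, *g) for g in params])
    return params, atoms


def matching_pursuit(x, atoms, n_iter):
    """Greedy decomposition: pick the best-matching atom, subtract, repeat."""
    residue = np.asarray(x, dtype=float).copy()
    picks = []
    residue_energy = [float(residue @ residue)]
    for _ in range(n_iter):
        coeffs = atoms @ residue              # inner products with every atom
        k = int(np.argmax(np.abs(coeffs)))    # maximum energy capture
        residue = residue - coeffs[k] * atoms[k]
        picks.append((k, float(coeffs[k])))
        residue_energy.append(float(residue @ residue))
    return picks, residue, residue_energy


params, atoms = build_dictionary(2048)
x = 3.0 * atoms[10] + 1.5 * atoms[150]        # synthetic two-atom test signal
picks, residue, residue_energy = matching_pursuit(x, atoms, n_iter=5)
```

Because the atoms have unit norm, the energy captured at each step plus the remaining residue energy equals the residue energy before that step, so `residue_energy` never increases. Searching the dictionary by a dense matrix product is only practical for a toy grid like this one.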

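The signal $x(t)$ is projected over a redundant dictionary of TF functions with all possible combinations of scaling, translation, and modulation. When $x(t)$ is real and discrete, like the audio signals in the presented technique, we use a dictionary of real and discrete TF functions. Because the dictionary is redundant (overcomplete), it gives extreme flexibility in choosing the best fit for the local signal structures (local optimization) [2]. This flexibility makes it possible to model a signal as accurately as possible with the minimum number of TF functions, providing a compact approximation of the signal. At each iteration, the best-matched TF function (i.e., the TF function that captured the maximum fraction of signal energy) was searched for and selected from the Gabor dictionary. The best match depends on the choice function; in this work, maximum energy capture per iteration was used, as described in [1]. The remaining signal, called the residue, was further decomposed in the same way at each iteration, subdividing it into TF functions. Due to the sequential selection of the TF functions, the signal decomposition may take a long time, especially for longer signals. To overcome this, there exist faster approaches that choose multiple TF functions in each iteration [4]. After $M$ iterations, the signal $x(t)$ can be expressed as

$$x(t) = \sum_{n=0}^{M-1} \langle R^n x,\, g_{\gamma_n} \rangle\, g_{\gamma_n}(t) + R^M x(t), \tag{3}$$

where $R^n x$ is the residue after $n$ iterations, the first part of (3) is the set of TF functions decomposed up to $M$ iterations, and the second part is the residue that will be decomposed in subsequent iterations. This process is repeated until all the energy of the signal is decomposed. At each iteration, some portion of the signal energy is modeled with an optimal TF resolution in the TF plane. Over the iterations, the captured energy increases and the residue energy falls. Depending on the signal content, the value of $M$ could be very high for a complete decomposition (i.e., residue energy = 0). Examples of Gaussian TF functions with different scale and modulation parameters are shown in Figure 1. The order of computational complexity for one iteration of the ATFT algorithm is $O(N \log N)$, where $N$ is the length of the signal in samples. The time complexity of the ATFT algorithm increases with the number of iterations required to model a signal, which in turn depends on the nature of the signal. By comparison, the computational complexity of the modified discrete cosine transform (MDCT), which is used in a few of the state-of-the-art audio coders, is only $O(N \log N)$ (the same as the FFT).

Once the signal is modeled accurately or decomposed into TF functions with definite time and frequency localization, the TF parameters governing the TF functions can be analyzed to extract application-specific information. In our case, we process the TF decomposition parameters of the audio signals to perform both audio compression and classification, as will be explained in the later sections.

2.2. TF Distribution Approach. A TF distribution (TFD) is a two-dimensional energy representation of a signal in terms of the time and frequency domains. The work in the area of TFD methods is extensive [2, 5, 7]. Some well-known TFD techniques are as follows.

2.2.1. Linear TFDs. The simplest linear TFD is the squared modulus of the short-time Fourier transform (STFT) of a signal, which assumes that the signal is stationary over short durations, multiplies the signal by a window, and takes the Fourier transform of the windowed segments. This joint TF representation shows the localization of frequency in time; however, it suffers from the TF resolution tradeoff.

2.2.2. Quadratic TFDs. In quadratic TFDs, the analysis window is adapted to the analyzed signal. To achieve this, the quadratic TFD transforms the time-varying autocorrelation of the signal to obtain a representation of the signal energy distributed over time and frequency:

$$X_{WV}(t, \omega) = \int x\!\left(t + \tfrac{1}{2}\tau\right) x^{*}\!\left(t - \tfrac{1}{2}\tau\right) \exp(-j\omega\tau)\, d\tau, \tag{4}$$

where $X_{WV}$ is the Wigner-Ville distribution (WVD) of the signal, $t$ is time, and $\tau$ is the lag variable of the autocorrelation. The WVD offers higher resolution than the STFT; however, when more than one component exists in the signal, the WVD contains interference cross terms. These cross terms do not belong to the signal and are generated by the quadratic nature of the WVD. They produce highly oscillatory interference in the TFD, and their presence leads to incorrect interpretation of the signal properties. This drawback of the WVD is the motivation for introducing other TFDs, such as the pseudo Wigner-Ville distribution (PWVD), the smoothed pseudo Wigner-Ville distribution (SPWVD), the Choi-Williams distribution (CWD), and Cohen kernel distributions, which define a kernel in the ambiguity domain that can eliminate the cross terms. These distributions belong to a general class of bilinear TF representations called Cohen's class [3].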
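The cross-term behavior described above can be reproduced with a small numerical sketch. The code below is an illustration under stated assumptions rather than a reference implementation: `wigner_ville` is a hypothetical helper that evaluates a discrete WVD in which the lag at each time instant is limited to the samples available (so it is, strictly, a windowed version), and the test signal is a sum of two complex sinusoids chosen for convenience. With the lag taken in steps of two samples, frequency bin `k` of an `n`-point signal corresponds to `k / (2 n)` cycles per sample.

```python
import numpy as np


def wigner_ville(x):
    """Discrete WVD of a complex signal; rows are time, columns frequency bins.

    Bin k corresponds to a normalized frequency of k / (2 * len(x))
    cycles per sample, because the lag advances in steps of two samples.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    w = np.zeros((n, n))
    for t in range(n):
        m_max = min(t, n - 1 - t)             # largest lag that stays in range
        m = np.arange(-m_max, m_max + 1)
        kernel = np.zeros(n, dtype=complex)
        # Instantaneous autocorrelation x(t + m) x*(t - m), stored circularly.
        kernel[m % n] = x[t + m] * np.conj(x[t - m])
        # The kernel is Hermitian in m, so its FFT is real up to rounding.
        w[t] = np.fft.fft(kernel).real
    return w


n = 256
t = np.arange(n)
f1, f2 = 16 / n, 48 / n                       # two components (cycles/sample)
x = np.exp(2j * np.pi * f1 * t) + np.exp(2j * np.pi * f2 * t)
w = wigner_ville(x)

k1 = round(2 * f1 * n)                        # bin of the first auto term
k2 = round(2 * f2 * n)                        # bin of the second auto term
k_cross = round((f1 + f2) * n)                # bin midway between them
```

In this example the signal has no energy at the midpoint frequency, yet `w[:, k_cross]` is large and alternates in sign over time, oscillating at the difference frequency `f2 - f1`, while the auto terms at `k1` and `k2` stay positive near the middle of the record. This is the oscillatory interference that the smoothing kernels of the other Cohen's class distributions are designed to suppress.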
