
Hindawi Publishing Corporation

EURASIP Journal on Advances in Signal Processing Volume 2010, Article ID 451695, 28 pages doi:10.1155/2010/451695

Research Article

Audio Signal Processing Using Time-Frequency Approaches: Coding, Classification, Fingerprinting, and Watermarking

K. Umapathy, B. Ghoraani, and S. Krishnan

Department of Electrical and Computer Engineering, Ryerson University, 350 Victoria Street, Toronto, ON, Canada M5B 2K3

Correspondence should be addressed to S. Krishnan, krishnan@ee.ryerson.ca

Received 24 February 2010; Accepted 14 May 2010

Academic Editor: Srdjan Stankovic

Copyright © 2010 K. Umapathy et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Audio signals are information-rich nonstationary signals that play an important role in our day-to-day communication, perception of the environment, and entertainment. Because of their nonstationary nature, time-only or frequency-only approaches are inadequate for analyzing these signals; a joint time-frequency (TF) approach is a better choice for processing them efficiently. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate the majority of audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above-mentioned areas. A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking are presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.


A normal human can hear sound vibrations in the range of 20 Hz to 20 kHz. Signals that create such audible vibrations qualify as audio signals. Creating, modulating, and interpreting audio cues were among the foremost abilities that differentiated humans from the rest of the animal species. Over the years, the methodical creation and processing of audio signals resulted in the development of different forms of communication, entertainment, and even biomedical diagnostic tools. With advancements in technology, audio processing was automated and various enhancements were introduced. The current digital era furthered audio processing with the power of computers: complex audio processing tasks became easy to implement and could be performed at blistering speeds. Digitally converted and formatted audio signals brought high levels of noise immunity with guaranteed quality of reproduction over time. However, the benefits of the digital audio format came with the penalty of huge data rates and difficulties in protecting copyrighted audio content over the Internet. On the other hand, the ability to use computers brought great power and flexibility in analyzing and extracting information from audio signals.

These contrasting pros and cons of digital audio inspired the development of a variety of audio processing techniques.

In general, a majority of audio processing techniques address the following 3 application areas: (1) compression, (2) classification, and (3) security. The underlying theme (or motivation) for each of these areas is different and sometimes conflicting, which makes it a major challenge to arrive at a single solution. In spite of bandwidth expansion and better storage solutions, compression still plays an important role, particularly in mobile devices and content delivery over the Internet. While the requirement of compaction (in terms of retaining the major audio components) drives audio coding approaches, audio classification requires the extraction of subtle, accurate, and discriminatory information to group or index a variety of audio signals. Classification also covers a wide range of subapplications in which the accuracy of the extracted audio information plays a vital role, such as content-based retrieval, sensing the auditory environment for critical applications, and biometrics. Unlike compaction in audio coding or extraction of information in classification, protecting digital audio content requires the addition of information in the form of a security key, which can then prove the ownership of the audio content. The external message (or key) should be added in such a way that it does not cause perceptual distortions and remains robust against attacks that attempt to remove it. Considering these requirements, it would be difficult to address all of the above application areas with a universal methodology unless we could model the audio signal as accurately as possible in a joint TF plane and then adaptively process the model parameters depending upon the application. In line with the above 3 application areas, this paper presents and discusses a TF-based audio coding scheme, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking.

The paper is organized as follows. Section 2 is devoted to the theories and the algorithms related to TF analysis. Section 3 deals with the use of TF analysis in audio coding and also presents comparisons among some of the audio coding technologies, including adaptive time-frequency transform (ATFT) coding, MPEG-Layer 3 (MP3) coding, and MPEG Advanced Audio Coding (AAC). In Section 4, TF analysis-based music classification and environmental sounds classification are covered. Section 5 presents fingerprinting and watermarking of audio signals using TF approaches, and Section 6 provides a summary of the paper.




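    1. Adaptive Time-Frequency Transform (ATFT) Algorithm: Decomposition Approach. The ATFT technique is based on the matching pursuit algorithm with TF dictionaries [1, 2]. ATFT has excellent TF resolution properties (better than wavelets and wavelet packets) and, owing to its adaptive nature (it handles nonstationarity), it does not require signal segmentation. Flexible signal representations can be achieved as accurately as possible, depending upon the characteristics of the TF dictionary.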

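In the ATFT algorithm, any signal $x(t)$ is decomposed into a linear combination of TF functions $g_{\gamma_n}(t)$ selected from a redundant dictionary of TF functions [2]. In this context, redundant means that the dictionary is overcomplete: it is a collection of nonorthogonal basis functions that is much larger than the minimum set of basis functions required to span the given signal space. Using ATFT, any given signal $x(t)$ can be modeled as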





$$x(t) = \sum_{n=0}^{\infty} a_n\, g_{\gamma_n}(t), \tag{1}$$

where

$$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp\{j(2\pi f_n t + \phi_n)\}. \tag{2}$$



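Here $a_n$ are the expansion coefficients. The choice of the window function $g(t)$ determines the characteristics of the TF dictionary, and the dictionary of TF functions can be modified or selected to suit the application at hand. The scale factor $s_n$, also called the octave parameter, controls the width of the window function, and the parameter $p_n$ controls its temporal placement. The parameters $f_n$ and $\phi_n$ are the frequency and phase of the exponential function, respectively. The index $\gamma_n$ represents a particular combination of the TF decomposition parameters ($s_n$, $p_n$, $f_n$, and $\phi_n$). The TF decomposition-based works presented in the later part of this paper use a dictionary of Gabor functions (Gaussian functions, i.e., $g(t) = \exp(-2\pi t^2)$ in (2)), which has the best TF localization properties [3]. In the discrete ATFT implementation used in these works, the octave parameter $s_n$ could take any equivalent time-width value between 90 μs and 0.4 s; the phase parameter $\phi_n$ could take any value between 0 and 1, scaled to 0 to 180 degrees; the frequency parameter $f_n$ could take one of 8192 levels corresponding to 0 to 22,050 Hz (i.e., a sampling frequency of 44,100 Hz for wideband audio); and the temporal position parameter $p_n$ could take any value between 1 and the length of the signal.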
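As an illustration of (2), the following Python sketch samples a single Gabor TF function from the four parameters $s_n$, $p_n$, $f_n$, and $\phi_n$. It is a minimal sketch and not the implementation used in the works discussed here: the function names (`gaussian_window`, `gabor_atom`, `normalize`), the use of seconds and radians as units, the unit-energy normalization of the sampled atom, and the example parameter values are all assumptions made for illustration.

```python
import cmath
import math


def gaussian_window(t):
    """Window g(t) = exp(-2*pi*t^2), the Gaussian named in the text."""
    return math.exp(-2.0 * math.pi * t * t)


def gabor_atom(n_samples, s, p, f, phi, fs=44100.0):
    """Sample (1/sqrt(s)) * g((t - p)/s) * exp(j(2*pi*f*t + phi)).

    Assumed units for this sketch: s (scale) and p (temporal position)
    in seconds, f in Hz, phi in radians, fs (sampling rate) in Hz.
    """
    atom = []
    for k in range(n_samples):
        t = k / fs
        envelope = gaussian_window((t - p) / s) / math.sqrt(s)
        atom.append(envelope * cmath.exp(1j * (2.0 * math.pi * f * t + phi)))
    return atom


def normalize(atom):
    """Rescale the sampled atom to unit energy (an assumption of this sketch)."""
    norm = math.sqrt(sum(abs(v) ** 2 for v in atom))
    return [v / norm for v in atom]


# Example: a 5 ms wide atom centred on sample 512, modulated at 1 kHz.
atom = normalize(gabor_atom(1024, s=0.005, p=512 / 44100.0, f=1000.0, phi=0.0))
peak_index = max(range(len(atom)), key=lambda k: abs(atom[k]))
```

Changing `s` widens or narrows the envelope, `p` shifts it in time, and `f` and `phi` set the modulation, which is how one index $\gamma_n$ selects one member of the dictionary.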


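    1. TF Distribution Approach. A TF distribution (TFD) is a two-dimensional energy representation of a signal in terms of the time and frequency domains. The work in the area of TFD methods is extensive [2, 5, 7]. Some well-known TFD techniques are as follows.
      1. Linear TFDs. The simplest linear TFD is the squared modulus of the short-time Fourier transform (STFT) of a signal, which assumes that the signal is stationary over short durations, multiplies the signal by a window, and takes the Fourier transform of the windowed segments. This joint TF representation shows the localization of frequency in time; however, it suffers from the TF resolution tradeoff.
      2. Quadratic TFDs. In quadratic TFDs, the analysis window is adapted to the analyzed signal. To achieve this, the quadratic TFD transforms the time-varying autocorrelation of the signal to obtain a representation of the signal energy distributed over time and frequency: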

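$$X_{WV}(\tau, \omega) = \int x\!\left(t + \tfrac{1}{2}\tau\right) x^{*}\!\left(t - \tfrac{1}{2}\tau\right) \exp(-j\omega t)\, dt, \tag{4}$$

where $X_{WV}$ is the Wigner-Ville distribution (WVD) of the signal. The WVD offers higher resolution than the STFT; however, ...

... especially for longer signals. To overcome this, there exist ...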
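To make the contrast between the two families concrete, the following Python sketch computes one frame of the squared-modulus STFT and one time slice of a discrete Wigner-Ville distribution for a complex tone. It is a minimal sketch under stated assumptions: it uses the common discrete convention of a time index `n` and a lag index `m` (which may not match the variable naming in (4)), a Hann window, a plain $O(L^2)$ DFT, and hypothetical helper names (`dft`, `spectrogram_frame`, `wvd_slice`).

```python
import cmath
import math


def dft(seq):
    """Plain O(L^2) DFT; adequate for the short frames used here."""
    L = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * math.pi * k * m / L) for m in range(L))
            for k in range(L)]


def spectrogram_frame(x, start, length):
    """Squared modulus of the STFT for one Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * m / length)
              for m in range(length)]
    frame = [x[start + m] * window[m] for m in range(length)]
    return [abs(c) ** 2 for c in dft(frame)]


def wvd_slice(x, n, half_lags):
    """Discrete Wigner-Ville slice at time index n.

    Forms the kernel x[n+m] * conj(x[n-m]) for m = -half_lags .. half_lags-1
    (zero where an index falls outside the signal) and takes its DFT over m.
    Bin k corresponds to k / (4 * half_lags) cycles per sample, because the
    lag between the two factors is 2m.
    """
    L = 2 * half_lags
    kernel = []
    for m in range(-half_lags, half_lags):
        a, b = n + m, n - m
        inside = 0 <= a < len(x) and 0 <= b < len(x)
        kernel.append(x[a] * x[b].conjugate() if inside else 0j)
    spectrum = dft(kernel)
    # dft() indexes lags from 0, so compensate the shift by half_lags.
    return [(spectrum[k] * cmath.exp(2j * math.pi * k * half_lags / L)).real
            for k in range(L)]


# Complex tone at 0.125 cycles per sample.
tone = [cmath.exp(2j * math.pi * 0.125 * k) for k in range(64)]
spec = spectrogram_frame(tone, start=16, length=32)
wv = wvd_slice(tone, n=32, half_lags=16)
stft_peak = max(range(len(spec)), key=lambda k: spec[k])
wvd_peak = max(range(len(wv)), key=lambda k: wv[k])
```

For this tone the windowed STFT frame spreads energy into the bins next to its peak, whereas the WVD slice concentrates it in a single bin; the WVD bin index is doubled relative to the STFT because of the factor of two in the lag.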

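In the ATFT decomposition, after $M$ iterations the signal $x(t)$ can be expressed as

$$x(t) = \sum_{n=0}^{M-1} \langle R^{n}x,\, g_{\gamma_n} \rangle\, g_{\gamma_n}(t) + R^{M}x(t), \tag{3}$$

where the sum collects the TF functions selected in the first $M$ iterations and $R^{M}x(t)$ is the residue that remains to be decomposed.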






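... complete decomposition (i.e., residue energy = 0). Examples of Gaussian TF functions with different scale and modulation parameters are shown in Figure 1. The computational complexity of one iteration of the ATFT algorithm is of order $O(N \log N)$, where $N$ is the length of the signal in samples. The time complexity of the ATFT algorithm grows with the number of iterations required to model a signal, which in turn depends on the nature of the signal; a decomposition that runs for $M$ iterations therefore costs on the order of $O(M N \log N)$. In comparison, the modified discrete cosine transform (MDCT) used in a few state-of-the-art audio coders has a computational complexity of only $O(N \log N)$ (the same as the FFT).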
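The iterative decomposition and its residue can be sketched in a few lines of Python. This is a minimal matching pursuit over a tiny, explicitly enumerated dictionary of real Gabor atoms, intended only to show the greedy select-and-subtract loop; it is not the ATFT implementation and does not reach the $O(N \log N)$ per-iteration cost quoted above, since it correlates the residue with every atom directly. The helper names, the parameter grid, the units (samples and cycles per sample), and the test signal are all assumptions made for illustration.

```python
import math


def real_gabor_atom(n, s, p, f, phi):
    """Unit-norm real Gabor atom on n samples.

    Assumed units: s and p in samples, f in cycles per sample, phi in radians.
    """
    raw = [math.exp(-2.0 * math.pi * ((k - p) / s) ** 2)
           * math.cos(2.0 * math.pi * f * k + phi) for k in range(n)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


def inner(u, v):
    """Inner product of two real sequences."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, dictionary, max_iter):
    """Greedy decomposition: pick the best-correlated atom, subtract, repeat."""
    residue = list(x)
    picks = []
    energies = [inner(residue, residue)]
    for _ in range(max_iter):
        coeffs = [inner(residue, g) for g in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(coeffs[i]))
        a_n = coeffs[best]  # expansion coefficient <R^n x, g>
        residue = [r - a_n * g for r, g in zip(residue, dictionary[best])]
        picks.append((best, a_n))
        energies.append(inner(residue, residue))
    return picks, residue, energies


# Hypothetical parameter grid: 2 scales x 3 positions x 3 frequencies x 2 phases.
N = 64
params = [(s, p, f, phi)
          for s in (8.0, 32.0)
          for p in (16.0, 32.0, 48.0)
          for f in (0.05, 0.1, 0.2)
          for phi in (0.0, math.pi / 2.0)]
dictionary = [real_gabor_atom(N, *g) for g in params]

# Test signal built from two dictionary atoms that barely overlap in time.
atom_a = dictionary[params.index((8.0, 16.0, 0.1, 0.0))]
atom_b = dictionary[params.index((8.0, 48.0, 0.2, 0.0))]
x = [2.0 * a + 1.0 * b for a, b in zip(atom_a, atom_b)]
picks, residue, energies = matching_pursuit(x, dictionary, max_iter=2)
```

Because the test signal is built from two atoms of the dictionary, two iterations reduce the residue energy to essentially zero; for real audio the number of iterations needed depends on the signal, which is the point made about the overall time complexity.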


