



Technical Field
The invention belongs to the technical field of acoustic scene classification, and in particular relates to an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network.
Background Art
Acoustic scene classification means training a computer to assign a sound, based on the information it contains, to the scene it belongs to. Acoustic scene classification technology is widely applied in Internet of Things devices, intelligent hearing aids, autonomous driving, and other fields, so in-depth research on acoustic scene classification is of great significance.
Acoustic scene classification originated as a subfield of pattern recognition. In the 1990s, Sawhney and Maes first proposed the concept of acoustic scene classification. They recorded a dataset containing five types of acoustic scenes: sidewalk, subway, restaurant, park, and street. Sawhney extracted three types of features from the recorded audio (power spectral density, relative spectral features, and filter-bank band energies), and then used k-nearest-neighbor and recurrent neural network algorithms for classification, achieving 68% accuracy. In the early 21st century, the field of machine learning developed rapidly, and more and more researchers tried to use machine learning methods to classify acoustic scenes. Machine learning algorithms such as support vector machines and decision trees gradually replaced the traditional HMM model and were widely applied to acoustic scene classification and acoustic event detection tasks. At the same time, ensemble learning methods such as random forests and XGBoost further improved acoustic scene classification performance. In 2015, Phan et al. reformulated acoustic scene classification as a regression problem, built a model based on random forest regression, and reduced the detection error rate by 6% and 10% on the ITC-Irst and UPC-TALP databases, respectively. In 2012, Krizhevsky proposed the AlexNet model and won the ImageNet image classification competition. The great success of AlexNet triggered a boom in deep learning, and researchers gradually began to introduce deep learning methods into acoustic scene classification.
In addition, there are many acoustic features that can be used for acoustic scene classification; how to fuse these features so that they match deep learning models is an important research direction for the future.
Summary of the Invention
Purpose of the invention: In view of the problems existing in the prior art, the present invention discloses an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network. Log-Mel spectrograms and their first-order and second-order difference spectrograms are computed for three transformed signals, fused, and then separated into high-frequency and low-frequency parts, which are fed into a dual-path deep residual network with an attention mechanism. The network effectively captures the feature maps that most influence the classification result, improving the accuracy and robustness of the acoustic scene classification system.
Technical solution: The present invention adopts the following technical solution: an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network, characterized in that it comprises the following steps:
Step 1. Preprocess the original speech signal and compute the original speech spectrogram; enhance the horizontal lines and vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram, respectively; transform the horizontal spectrogram and the vertical spectrogram back into two new time-domain signals;
Step 2. Compute the log-Mel spectrograms, first-order difference log-Mel spectrograms, and second-order difference log-Mel spectrograms of the original speech signal and of the two new time-domain signals, and fuse them along the channel dimension to obtain a fused spectrogram;
Step 3. Split the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
Step 4. Build a dual-path deep residual network with an attention layer;
Step 5. Input the high-frequency spectrogram and the low-frequency spectrogram from Step 3 into the deep residual network built in Step 4, and output the acoustic scene category to which the original speech signal belongs.
Preferably, in said Step 1:
where X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
Preferably, in said Step 2:
S_a(T,F) = (S_X(T,F), S_H(T,F), S_P(T,F))
where S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram and first-order and second-order difference log-Mel spectrograms generated from the horizontal spectrogram; S_P denotes the log-Mel spectrogram and first-order and second-order difference log-Mel spectrograms generated from the vertical spectrogram; T and F denote the time axis and frequency axis, respectively.
Preferably, said Step 5 comprises the following steps:
Step 51. The high-frequency spectrogram and the low-frequency spectrogram are input into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
Step 52. The high-frequency feature map and the low-frequency feature map are fused along the frequency-axis dimension to obtain a fused feature map; a multi-channel feature map is obtained from the fused feature map, and attention coefficients are computed from the multi-channel feature map;
Step 53. The attention coefficients are applied to the multi-channel feature map to obtain a weighted feature map;
Step 54. The weighted feature map is flattened into a one-dimensional feature vector, and the acoustic scene category to which the original speech signal belongs is output from this feature vector.
Preferably, in said Step 52:
M_P(T,F) = (M_P1(T,F_L), M_P2(T,F_H))
where M_P(T,F) denotes the fused feature map; M_P1(T,F_L) and M_P2(T,F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L, and F_H denote the widths of the fused feature map, the low-frequency feature map, and the high-frequency feature map, respectively.
Preferably, in said Step 52:
α = σ(W_2 ReLU(W_1 z))
where α ∈ R^C denotes the attention coefficient vector; z ∈ R^C is the vector obtained by global average pooling of the multi-channel feature map; W_1 and W_2 denote the weights of the two fully connected layers; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scale reduction factor.
Preferably, in said Step 53:
where M̃_k(T,F) is the k-th channel of the weighted feature map M̃(T,F); M_k(T,F) is the k-th channel of the multi-channel feature map M(T,F); α_k is the k-th value of the attention coefficient vector α; T and F denote the height and width of the feature maps, respectively; C denotes the channel dimension.
Preferably, each path of the deep residual network comprises residual blocks;
Each residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolutional layer, a batch normalization layer, a ReLU activation layer, and a convolutional layer; the two paths output the low-frequency feature map and the high-frequency feature map;
After the low-frequency feature map and the high-frequency feature map are fused, the result is input into a batch normalization layer, a ReLU activation layer, and convolution blocks connected in sequence; each convolution block comprises a convolutional layer and a batch normalization layer connected in sequence, and the multi-channel feature map is output;
The multi-channel feature map is input into a global average pooling layer and fully connected layers connected in sequence, which output the attention coefficient vector;
After the attention coefficient vector and the multi-channel feature map are combined, the result is input into a flatten layer, a fully connected layer, and a Softmax layer connected in sequence, which output the classification result.
Preferably, the preprocessing in Step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, and then performing pre-emphasis, framing, and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used.
Beneficial effects: The present invention has the following beneficial effects:
(1) The present invention fuses the log-Mel spectrograms and difference spectrograms of the original audio, of the signal with enhanced horizontal spectrogram lines, and of the signal with enhanced vertical spectrogram lines, so that the fused spectrogram reflects both the static and dynamic characteristics of the audio and strengthens the expressive power of the features, effectively improving the accuracy of acoustic scene classification;
(2) The present invention separates the high-frequency and low-frequency parts of the fused spectrogram and builds a dual-path deep residual network to model them separately, reflecting the fact that the high-frequency and low-frequency regions of a spectrogram carry different physical meanings; modeling high and low frequencies separately allows the model to better capture the time-frequency characteristics of the high-frequency and low-frequency components, which makes it possible to distinguish similar acoustic scenes more accurately;
(3) The present invention introduces an attention mechanism into the deep residual network and applies attention weighting to the multi-channel fused feature map along the channel dimension, so that the feature maps that contribute positively to the final classification result receive more attention in the subsequent fully connected layers; this effectively improves the classification performance of the model and greatly increases the recognition rate of the entire system.
Brief Description of the Drawings
Fig. 1 is an overall structure diagram of the improved acoustic scene classification method of the present invention;
Fig. 2 is a structure diagram of the dual-path deep residual network with an attention layer of the present invention;
Fig. 3 is a structure diagram of the attention network of the present invention;
Fig. 4 is a comparison of the classification results of the method of the present invention and four other acoustic scene classification methods.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings.
The present invention discloses an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network, as shown in Fig. 1, comprising the following steps:
Step 1. Preprocess the original speech signal x in each audio sample, then apply the Fourier transform to obtain the original speech spectrogram X, and decompose X into a new horizontal spectrogram X_h and vertical spectrogram X_p by enhancing the horizontal lines and the vertical lines of X, respectively.
Preprocessing consists of uniformly down-sampling or up-sampling the original speech signal of every audio sample to 48 kHz, followed by pre-emphasis, framing, and windowing. During framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used.
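As an illustrative sketch of this preprocessing, assuming Python with librosa and NumPy (the pre-emphasis coefficient of 0.97 is a common default and is not specified in the text):

```python
import librosa
import numpy as np

def preprocess(path, target_sr=48000, frame_len=2048, pre_emph=0.97):
    """Resample to 48 kHz, pre-emphasize, frame with 50% overlap, and apply a Hamming window.
    The pre-emphasis coefficient 0.97 is an assumed default, not given in the description."""
    x, _ = librosa.load(path, sr=target_sr)            # resample to 48 kHz
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])      # pre-emphasis filter
    hop = frame_len // 2                                 # 50% frame overlap
    frames = librosa.util.frame(x, frame_length=frame_len, hop_length=hop)
    window = np.hamming(frame_len)
    return frames * window[:, None]                      # windowed frames, shape (2048, n_frames)
```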
Because the sum of the horizontal spectrogram X_h and the vertical spectrogram X_p equals the energy spectrum of the signal, a cost function J(X_h, X_p) can be constructed when solving for X_h and X_p:
where κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
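The patent obtains X_h and X_p by minimizing the smoothness cost J. As a rough illustrative stand-in only (not the optimization described here), librosa's median-filtering harmonic/percussive separation likewise splits the STFT into horizontal and vertical components and can be used to obtain the two time-domain signals needed in Step 2:

```python
import librosa

def split_horizontal_vertical(x, n_fft=2048, hop=1024):
    """Sketch of the horizontal/vertical split using median-filter HPSS as a stand-in
    for the cost-function minimization described in the text."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop, window="hamming")
    X_h, X_p = librosa.decompose.hpss(X)                       # horizontal / vertical components
    x_h = librosa.istft(X_h, hop_length=hop, window="hamming")  # back to the time domain
    x_p = librosa.istft(X_p, hop_length=hop, window="hamming")
    return x_h, x_p
```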
Step 2. Apply the inverse Fourier transform to the horizontal spectrogram X_h and the vertical spectrogram X_p to obtain a new time-domain signal x_h generated from the horizontal spectrogram and a time-domain signal x_p generated from the vertical spectrogram. Then extract the log-Mel spectrograms of x_h, x_p, and the original speech signal x, and further compute their respective first-order and second-order difference log-Mel spectrograms.
Step 3. Fuse the three groups of log-Mel spectrograms, first-order difference log-Mel spectrograms, and second-order difference log-Mel spectrograms obtained in Step 2, and split the result into high-frequency and low-frequency parts.
Specifically, Step 3 concatenates, along the channel dimension, the log-Mel spectrograms and their first-order and second-order difference log-Mel spectrograms of the time-domain signal x_h generated from the horizontal spectrogram, of the time-domain signal x_p generated from the vertical spectrogram, and of the original speech signal x, forming the fused spectrogram S_a(T,F):
S_a(T,F) = (S_X(T,F), S_H(T,F), S_P(T,F))
where S_X, S_H, and S_P denote the three groups of spectrograms of the original speech signal x, of the time-domain signal x_h generated from the horizontal spectrogram, and of the time-domain signal x_p generated from the vertical spectrogram, respectively; S_a denotes the fused spectrogram; T and F denote the time axis and frequency axis, respectively.
The fused spectrogram reflects both the static and dynamic characteristics of the original speech signal and has good feature expression ability.
The fused spectrogram S_a(T,F) is then split evenly along the frequency axis into a low-frequency spectrogram S_a(T,F_L) and a high-frequency spectrogram S_a(T,F_H), where F_L and F_H denote the frequency axes of the low-frequency spectrogram and the high-frequency spectrogram, respectively.
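A sketch of Steps 2 and 3 (log-Mel extraction, first- and second-order differences, channel fusion, and the even frequency split), assuming librosa; the Mel-band count of 128 is an illustrative value not stated in the text:

```python
import librosa
import numpy as np

def fused_logmel(x, x_h, x_p, sr=48000, n_fft=2048, hop=1024, n_mels=128):
    """Build the fused spectrogram S_a and split it into low/high-frequency halves."""
    def three_channel(sig):
        m = librosa.feature.melspectrogram(y=sig, sr=sr, n_fft=n_fft,
                                           hop_length=hop, n_mels=n_mels)
        logm = librosa.power_to_db(m)                    # log-Mel spectrogram
        d1 = librosa.feature.delta(logm, order=1)        # first-order difference
        d2 = librosa.feature.delta(logm, order=2)        # second-order difference
        return np.stack([logm, d1, d2], axis=-1)         # (F, T, 3)

    S_a = np.concatenate([three_channel(x), three_channel(x_h),
                          three_channel(x_p)], axis=-1)   # fuse along the channel dim -> (F, T, 9)
    half = n_mels // 2
    S_low, S_high = S_a[:half], S_a[half:]                # even split along the frequency axis
    return S_low, S_high
```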
Step 4. Build a dual-path deep residual network model with an attention layer.
Step 5. Input the high-frequency and low-frequency spectrograms obtained in Step 3 into the deep residual network built in Step 4 to obtain the final audio scene label.
The dual-path deep residual network models the high-frequency spectrogram and the low-frequency spectrogram separately, and the feature maps produced by the two paths are fused along the frequency-axis dimension:
M_P(T,F) = (M_P1(T,F_L), M_P2(T,F_H))
where M_P1(T,F_L), M_P2(T,F_H), and M_P(T,F) denote the low-frequency feature map output by the low-frequency path P1, the high-frequency feature map output by the high-frequency path P2, and the fused feature map, respectively.
In one embodiment of the present invention, Fig. 2 shows the structure of the high/low-frequency dual-path deep residual network with an attention layer. Each path of the deep residual network model contains 4 residual blocks; each residual block consists, in sequence, of a batch normalization (BN) layer, a ReLU activation layer, a convolutional layer, a BN layer, a ReLU activation layer, and a convolutional layer, where both convolutional layers use 3×3 kernels. After feature extraction by the 4 residual blocks, the feature maps obtained on the two paths are fused along the frequency-axis dimension, and the fused feature map M_P(T,F) is then passed through a BN layer, a ReLU activation layer, and two convolution blocks to obtain a multi-channel feature map M(T,F) with 768 channels; each convolution block consists, in sequence, of a convolutional layer and a BN layer, where the convolutional layer uses 1×1 kernels.
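A structural sketch of this dual-path backbone in PyTorch; the per-path stem convolution and the intermediate channel width are illustrative assumptions, while the block ordering, the 3×3 and 1×1 kernel sizes, and the final 768 channels follow the description above:

```python
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    """Pre-activation residual block: BN-ReLU-Conv3x3-BN-ReLU-Conv3x3 with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class DualPathBackbone(nn.Module):
    """Two paths of 4 residual blocks; their outputs are concatenated along the frequency
    axis, then passed through BN-ReLU and two 1x1 convolution blocks to 768 channels.
    The stem convolution and the width `ch` are assumptions; only 768 is given in the text."""
    def __init__(self, in_ch=9, ch=256, out_ch=768):
        super().__init__()
        def path():
            return nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                 *[PreActResBlock(ch) for _ in range(4)])
        self.low_path, self.high_path = path(), path()
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, out_ch, 1), nn.BatchNorm2d(out_ch),      # convolution block 1
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch),  # convolution block 2
        )

    def forward(self, s_low, s_high):             # inputs: (N, 9, T, F/2) each
        m_low = self.low_path(s_low)
        m_high = self.high_path(s_high)
        m_p = torch.cat([m_low, m_high], dim=-1)  # fuse along the frequency axis
        return self.fuse(m_p)                     # multi-channel feature map M(T, F)
```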
Fig. 3 shows the structure of the attention network. The multi-channel feature map M(T,F) is fed into the attention network, where attention weighting is performed along the channel dimension. The attention network performs the following operations in sequence:
(1) Apply channel-wise global average pooling to the input multi-channel feature map M(T,F), encoding the entire spatial feature of each channel into one global value:
That is, z_c = (1/(T·F)) Σ_t Σ_f M_c(t,f), where M denotes the multi-channel feature map, z ∈ R^C is the output vector after global average pooling, and T, F, and C denote the height, width, and number of channels of the multi-channel feature map, respectively.
(2) Feed the one-dimensional feature vector z of length C into a DNN consisting of two fully connected layers, and compute the output:
α = F_DNN(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))
In the above formula, α ∈ R^C is the output of the DNN, i.e. the attention coefficient vector; W_1 and W_2 are the weights of the two fully connected layers; C denotes the channel dimension; r denotes the scale reduction factor; σ denotes the sigmoid activation function. To reduce model complexity and improve generalization, the present invention adopts a bottleneck structure with two fully connected layers: the first fully connected layer performs dimensionality reduction (the reduction factor r is a hyperparameter) and is followed by ReLU activation; the second fully connected layer restores the original dimension.
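A sketch of this channel-attention bottleneck (operations (1) and (2)) in PyTorch; r = 16 is an illustrative value, since r is left as a hyperparameter in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling over (T, F), an FC layer reducing C to C/r with ReLU,
    and an FC layer restoring C followed by a sigmoid, as in the equations above."""
    def __init__(self, channels=768, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, m):                 # m: (N, C, T, F)
        z = m.mean(dim=(2, 3))            # global average pooling -> (N, C)
        alpha = self.fc(z)                # attention coefficient vector alpha in R^C
        return alpha
```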
(3) Apply the obtained attention coefficient vector to each channel of the multi-channel feature map, weighting the channels to obtain the weighted feature map:
M̃_k(T,F) = α_k · M_k(T,F), k = 1, ..., C, where M̃_k(T,F) is the k-th channel of the weighted feature map M̃(T,F); M_k(T,F) is the k-th channel of the multi-channel feature map M(T,F); α_k is the k-th value of the attention coefficient vector α; T and F denote the height and width of the feature maps, respectively; C denotes the channel dimension.
The weighted feature map is expanded into a one-dimensional feature vector by a Flatten layer, and the output of the model, i.e. the scene category to which the original speech signal belongs, is finally obtained through a fully connected layer and a Softmax layer.
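A sketch of operation (3) and the classification head, reusing the ChannelAttention module from the previous sketch; the 10 output classes correspond to the dataset described below, and the lazily sized fully connected layer is an illustrative choice since the flattened dimension depends on the feature-map size:

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Channel-wise attention weighting followed by Flatten, a fully connected layer,
    and Softmax, producing the scene-class probabilities."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.attn = ChannelAttention(768)                     # from the previous sketch
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_classes))

    def forward(self, m):                                     # m: (N, C, T, F)
        alpha = self.attn(m)
        m_weighted = m * alpha[:, :, None, None]              # weight each channel by alpha_k
        logits = self.head(m_weighted)
        return torch.softmax(logits, dim=-1)                  # scene-category probabilities
```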
Fig. 4 compares the classification results of the improved acoustic scene classification method of the present invention with those of four other acoustic scene classification methods. Five classification models were compared on the dataset: Gaussian mixture model (GMM), k-nearest neighbor (kNN), support vector machine (SVM), random forest (RF), and the dual-path deep residual network model proposed by the present invention. A 988-dimensional feature vector extracted from the audio was used as the input to the GMM, kNN, SVM, and RF models. The GMM used 12 Gaussian components, each with its own standard covariance matrix; the kNN classifier used k = 7; the SVM used a penalty coefficient of 1.8, a Gaussian kernel, and one-vs-one (OvO) classification; the RF contained 200 decision trees, which used the Gini index as the optimal feature selection criterion when splitting nodes. The Gini index is the probability that a randomly selected sample from the sample set is misclassified: Gini index = probability that a sample is selected × probability that it is misclassified. Like information entropy, it measures the uncertainty of a random variable: the larger the Gini index, the higher the uncertainty of the data; the smaller the Gini index, the lower the uncertainty; a Gini index of 0 means that all samples in the dataset belong to the same category. The selected original speech dataset contains 10 types of acoustic scenes (airport, bus, metro, metro station, park, public square, shopping mall, pedestrian street, traffic street, and tram), with a total of 14,400 audio clips. The experimental results show that the improved acoustic scene classification method proposed by the present invention achieves an average accuracy of 81.6% on the dataset, far higher than the other four acoustic scene recognition methods.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.