



Technical Field
The invention belongs to the technical field of acoustic scene classification, and in particular relates to an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network.
Background Art
Acoustic scene classification means training a computer to assign a sound, based on the information it contains, to the scene it belongs to. Acoustic scene classification technology is widely applied in Internet of Things devices, intelligent hearing aids, autonomous driving, and other fields, so in-depth research on acoustic scene classification is of great significance.
Acoustic scene classification originated as a subfield of pattern recognition. In the 1990s, Sawhney and Maes first proposed the concept of acoustic scene classification. They recorded a dataset containing five types of acoustic scenes: sidewalk, subway, restaurant, park, and street. Sawhney extracted three types of features from the recorded audio (power spectral density, relative spectral features, and filter-bank band energies), and then used k-nearest-neighbor and recurrent neural network algorithms for classification, achieving 68% accuracy. In the early 21st century, the field of machine learning developed rapidly, and more and more researchers tried to use machine learning methods to classify acoustic scenes. Machine learning algorithms such as support vector machines and decision trees gradually replaced the traditional HMM model and were widely applied to acoustic scene classification and acoustic event detection tasks. At the same time, ensemble learning methods such as random forests and XGBoost further improved acoustic scene classification performance. In 2015, Phan et al. reformulated acoustic scene classification as a regression problem, built a model based on random forest regression, and reduced the detection error rate by 6% and 10% on the ITC-Irst and UPC-TALP databases, respectively. In 2012, Krizhevsky proposed the AlexNet model and won the ImageNet image classification competition. The great success of AlexNet triggered a boom in deep learning, and researchers gradually began to introduce deep learning methods into acoustic scene classification.
In addition, there are many acoustic features that can be used for acoustic scene classification; how to fuse these features so that they match deep learning models is an important research direction for the future.
Summary of the Invention
Purpose of the invention: In view of the problems existing in the prior art, the present invention discloses an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network. Log-Mel spectrograms and their first-order and second-order difference spectrograms are computed for three transformed signals, fused, and then separated into high-frequency and low-frequency parts, which are fed into a dual-path deep residual network with an attention mechanism. The network effectively captures the feature maps that most influence the classification result, improving the accuracy and robustness of the acoustic scene classification system.
Technical solution: The present invention adopts the following technical solution: an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network, characterized in that it comprises the following steps:
Step 1. Preprocess the original speech signal and compute the original speech spectrogram; enhance the horizontal lines and vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram, respectively; transform the horizontal spectrogram and the vertical spectrogram back into two new time-domain signals;
Step 2. Compute the log-Mel spectrograms, first-order difference log-Mel spectrograms, and second-order difference log-Mel spectrograms of the original speech signal and of the two new time-domain signals, and fuse them along the channel dimension to obtain a fused spectrogram;
Step 3. Split the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
Step 4. Build a dual-path deep residual network with an attention layer;
Step 5. Input the high-frequency spectrogram and the low-frequency spectrogram from Step 3 into the deep residual network built in Step 4, and output the acoustic scene category to which the original speech signal belongs.
Preferably, in said Step 1:
where X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
Preferably, in said Step 2:
S_a(T,F) = (S_X(T,F), S_H(T,F), S_P(T,F))
where S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram and first-order and second-order difference log-Mel spectrograms generated from the horizontal spectrogram; S_P denotes the log-Mel spectrogram and first-order and second-order difference log-Mel spectrograms generated from the vertical spectrogram; T and F denote the time axis and frequency axis, respectively.
Preferably, said Step 5 comprises the following steps:
Step 51. The high-frequency spectrogram and the low-frequency spectrogram are input into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
Step 52. The high-frequency feature map and the low-frequency feature map are fused along the frequency-axis dimension to obtain a fused feature map; a multi-channel feature map is obtained from the fused feature map, and attention coefficients are computed from the multi-channel feature map;
Step 53. The attention coefficients are applied to the multi-channel feature map to obtain a weighted feature map;
Step 54. The weighted feature map is flattened into a one-dimensional feature vector, and the acoustic scene category to which the original speech signal belongs is output from this feature vector.
Preferably, in said Step 52:
M_P(T,F) = (M_P1(T,F_L), M_P2(T,F_H))
where M_P(T,F) denotes the fused feature map; M_P1(T,F_L) and M_P2(T,F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L, and F_H denote the widths of the fused feature map, the low-frequency feature map, and the high-frequency feature map, respectively.
Preferably, in said Step 52:
α = σ(W_2 ReLU(W_1 z))
where α ∈ R^C denotes the attention coefficient vector; z ∈ R^C is the vector obtained by global average pooling of the multi-channel feature map; W_1 and W_2 denote the weights of the two fully connected layers; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scale reduction factor.
Preferably, in said Step 53:
where M̃_k(T,F) is the k-th channel of the weighted feature map M̃(T,F); M_k(T,F) is the k-th channel of the multi-channel feature map M(T,F); α_k is the k-th value of the attention coefficient vector α; T and F denote the height and width of the feature maps, respectively; C denotes the channel dimension.
Preferably, each path of the deep residual network comprises residual blocks;
Each residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolutional layer, a batch normalization layer, a ReLU activation layer, and a convolutional layer; the two paths output the low-frequency feature map and the high-frequency feature map;
After the low-frequency feature map and the high-frequency feature map are fused, the result is input into a batch normalization layer, a ReLU activation layer, and convolution blocks connected in sequence; each convolution block comprises a convolutional layer and a batch normalization layer connected in sequence, and the multi-channel feature map is output;
The multi-channel feature map is input into a global average pooling layer and fully connected layers connected in sequence, which output the attention coefficient vector;
After the attention coefficient vector and the multi-channel feature map are combined, the result is input into a flatten layer, a fully connected layer, and a Softmax layer connected in sequence, which output the classification result.
Preferably, the preprocessing in Step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, and then performing pre-emphasis, framing, and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used.
Beneficial effects: The present invention has the following beneficial effects:
(1) The present invention fuses the log-Mel spectrograms and difference spectrograms of the original audio, of the signal with enhanced horizontal spectrogram lines, and of the signal with enhanced vertical spectrogram lines, so that the fused spectrogram reflects both the static and dynamic characteristics of the audio and strengthens the expressive power of the features, effectively improving the accuracy of acoustic scene classification;
(2) The present invention separates the high-frequency and low-frequency parts of the fused spectrogram and builds a dual-path deep residual network to model them separately, reflecting the fact that the high-frequency and low-frequency regions of a spectrogram carry different physical meanings; modeling high and low frequencies separately allows the model to better capture the time-frequency characteristics of the high-frequency and low-frequency components, which makes it possible to distinguish similar acoustic scenes more accurately;
(3) The present invention introduces an attention mechanism into the deep residual network and applies attention weighting to the multi-channel fused feature map along the channel dimension, so that the feature maps that contribute positively to the final classification result receive more attention in the subsequent fully connected layers; this effectively improves the classification performance of the model and greatly increases the recognition rate of the entire system.
Brief Description of the Drawings
Fig. 1 is an overall structure diagram of the improved acoustic scene classification method of the present invention;
Fig. 2 is a structure diagram of the dual-path deep residual network with an attention layer of the present invention;
Fig. 3 is a structure diagram of the attention network of the present invention;
Fig. 4 is a comparison of the classification results of the method of the present invention and four other acoustic scene classification methods.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings.
The present invention discloses an acoustic scene classification method based on an attention mechanism and a dual-path deep residual network, as shown in Fig. 1, comprising the following steps:
Step 1. Preprocess the original speech signal x in each audio sample, then apply the Fourier transform to obtain the original speech spectrogram X, and decompose X into a new horizontal spectrogram X_h and vertical spectrogram X_p by enhancing the horizontal lines and the vertical lines of X, respectively.
Preprocessing consists of uniformly down-sampling or up-sampling the original speech signal of every audio sample to 48 kHz, followed by pre-emphasis, framing, and windowing. During framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used.
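As an illustrative sketch of this preprocessing, assuming Python with librosa and NumPy (the pre-emphasis coefficient of 0.97 is a common default and is not specified in the text):

```python
import librosa
import numpy as np

def preprocess(path, target_sr=48000, frame_len=2048, pre_emph=0.97):
    """Resample to 48 kHz, pre-emphasize, frame with 50% overlap, and apply a Hamming window.
    The pre-emphasis coefficient 0.97 is an assumed default, not given in the description."""
    x, _ = librosa.load(path, sr=target_sr)            # resample to 48 kHz
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])      # pre-emphasis filter
    hop = frame_len // 2                                 # 50% frame overlap
    frames = librosa.util.frame(x, frame_length=frame_len, hop_length=hop)
    window = np.hamming(frame_len)
    return frames * window[:, None]                      # windowed frames, shape (2048, n_frames)
```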
Because the sum of the horizontal spectrogram X_h and the vertical spectrogram X_p equals the energy spectrum of the signal, a cost function J(X_h, X_p) can be constructed when solving for X_h and X_p:
where κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
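The patent obtains X_h and X_p by minimizing the smoothness cost J. As a rough illustrative stand-in only (not the optimization described here), librosa's median-filtering harmonic/percussive separation likewise splits the STFT into horizontal and vertical components and can be used to obtain the two time-domain signals needed in Step 2:

```python
import librosa

def split_horizontal_vertical(x, n_fft=2048, hop=1024):
    """Sketch of the horizontal/vertical split using median-filter HPSS as a stand-in
    for the cost-function minimization described in the text."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop, window="hamming")
    X_h, X_p = librosa.decompose.hpss(X)                       # horizontal / vertical components
    x_h = librosa.istft(X_h, hop_length=hop, window="hamming")  # back to the time domain
    x_p = librosa.istft(X_p, hop_length=hop, window="hamming")
    return x_h, x_p
```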
Step 2. Apply the inverse Fourier transform to the horizontal spectrogram X_h and the vertical spectrogram X_p to obtain a new time-domain signal x_h generated from the horizontal spectrogram and a time-domain signal x_p generated from the vertical spectrogram. Then extract the log-Mel spectrograms of x_h, x_p, and the original speech signal x, and further compute their respective first-order and second-order difference log-Mel spectrograms.
Step 3. Fuse the three groups of log-Mel spectrograms, first-order difference log-Mel spectrograms, and second-order difference log-Mel spectrograms obtained in Step 2, and split the result into high-frequency and low-frequency parts.
Specifically, Step 3 concatenates, along the channel dimension, the log-Mel spectrograms and their first-order and second-order difference log-Mel spectrograms of the time-domain signal x_h generated from the horizontal spectrogram, of the time-domain signal x_p generated from the vertical spectrogram, and of the original speech signal x, forming the fused spectrogram S_a(T,F):
S_a(T,F) = (S_X(T,F), S_H(T,F), S_P(T,F))
where S_X, S_H, and S_P denote the three groups of spectrograms of the original speech signal x, of the time-domain signal x_h generated from the horizontal spectrogram, and of the time-domain signal x_p generated from the vertical spectrogram, respectively; S_a denotes the fused spectrogram; T and F denote the time axis and frequency axis, respectively.
The fused spectrogram reflects both the static and dynamic characteristics of the original speech signal and has good feature expression ability.
The fused spectrogram S_a(T,F) is then split evenly along the frequency axis into a low-frequency spectrogram S_a(T,F_L) and a high-frequency spectrogram S_a(T,F_H), where F_L and F_H denote the frequency axes of the low-frequency spectrogram and the high-frequency spectrogram, respectively.
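A sketch of Steps 2 and 3 (log-Mel extraction, first- and second-order differences, channel fusion, and the even frequency split), assuming librosa; the Mel-band count of 128 is an illustrative value not stated in the text:

```python
import librosa
import numpy as np

def fused_logmel(x, x_h, x_p, sr=48000, n_fft=2048, hop=1024, n_mels=128):
    """Build the fused spectrogram S_a and split it into low/high-frequency halves."""
    def three_channel(sig):
        m = librosa.feature.melspectrogram(y=sig, sr=sr, n_fft=n_fft,
                                           hop_length=hop, n_mels=n_mels)
        logm = librosa.power_to_db(m)                    # log-Mel spectrogram
        d1 = librosa.feature.delta(logm, order=1)        # first-order difference
        d2 = librosa.feature.delta(logm, order=2)        # second-order difference
        return np.stack([logm, d1, d2], axis=-1)         # (F, T, 3)

    S_a = np.concatenate([three_channel(x), three_channel(x_h),
                          three_channel(x_p)], axis=-1)   # fuse along the channel dim -> (F, T, 9)
    half = n_mels // 2
    S_low, S_high = S_a[:half], S_a[half:]                # even split along the frequency axis
    return S_low, S_high
```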
Step 4. Build a dual-path deep residual network model with an attention layer.
Step 5. Input the high-frequency and low-frequency spectrograms obtained in Step 3 into the deep residual network built in Step 4 to obtain the final audio scene label.
The dual-path deep residual network models the high-frequency spectrogram and the low-frequency spectrogram separately, and the feature maps produced by the two paths are fused along the frequency-axis dimension:
M_P(T,F) = (M_P1(T,F_L), M_P2(T,F_H))
where M_P1(T,F_L), M_P2(T,F_H), and M_P(T,F) denote the low-frequency feature map output by the low-frequency path P1, the high-frequency feature map output by the high-frequency path P2, and the fused feature map, respectively.
In one embodiment of the present invention, Fig. 2 shows the structure of the high/low-frequency dual-path deep residual network with an attention layer. Each path of the deep residual network model contains 4 residual blocks; each residual block consists, in sequence, of a batch normalization (BN) layer, a ReLU activation layer, a convolutional layer, a BN layer, a ReLU activation layer, and a convolutional layer, where both convolutional layers use 3×3 kernels. After feature extraction by the 4 residual blocks, the feature maps obtained on the two paths are fused along the frequency-axis dimension, and the fused feature map M_P(T,F) is then passed through a BN layer, a ReLU activation layer, and two convolution blocks to obtain a multi-channel feature map M(T,F) with 768 channels; each convolution block consists, in sequence, of a convolutional layer and a BN layer, where the convolutional layer uses 1×1 kernels.
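A structural sketch of this dual-path backbone in PyTorch; the per-path stem convolution and the intermediate channel width are illustrative assumptions, while the block ordering, the 3×3 and 1×1 kernel sizes, and the final 768 channels follow the description above:

```python
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    """Pre-activation residual block: BN-ReLU-Conv3x3-BN-ReLU-Conv3x3 with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class DualPathBackbone(nn.Module):
    """Two paths of 4 residual blocks; their outputs are concatenated along the frequency
    axis, then passed through BN-ReLU and two 1x1 convolution blocks to 768 channels.
    The stem convolution and the width `ch` are assumptions; only 768 is given in the text."""
    def __init__(self, in_ch=9, ch=256, out_ch=768):
        super().__init__()
        def path():
            return nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                 *[PreActResBlock(ch) for _ in range(4)])
        self.low_path, self.high_path = path(), path()
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, out_ch, 1), nn.BatchNorm2d(out_ch),      # convolution block 1
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch),  # convolution block 2
        )

    def forward(self, s_low, s_high):             # inputs: (N, 9, T, F/2) each
        m_low = self.low_path(s_low)
        m_high = self.high_path(s_high)
        m_p = torch.cat([m_low, m_high], dim=-1)  # fuse along the frequency axis
        return self.fuse(m_p)                     # multi-channel feature map M(T, F)
```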
Fig. 3 shows the structure of the attention network. The multi-channel feature map M(T,F) is fed into the attention network, where attention weighting is performed along the channel dimension. The attention network performs the following operations in sequence:
(1) Apply channel-wise global average pooling to the input multi-channel feature map M(T,F), encoding the entire spatial feature of each channel into one global value:
That is, z_c = (1/(T·F)) Σ_t Σ_f M_c(t,f), where M denotes the multi-channel feature map, z ∈ R^C is the output vector after global average pooling, and T, F, and C denote the height, width, and number of channels of the multi-channel feature map, respectively.
(2) Feed the one-dimensional feature vector z of length C into a DNN consisting of two fully connected layers, and compute the output:
α = F_DNN(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))
In the above formula, α ∈ R^C is the output of the DNN, i.e. the attention coefficient vector; W_1 and W_2 are the weights of the two fully connected layers; C denotes the channel dimension; r denotes the scale reduction factor; σ denotes the sigmoid activation function. To reduce model complexity and improve generalization, the present invention adopts a bottleneck structure with two fully connected layers: the first fully connected layer performs dimensionality reduction (the reduction factor r is a hyperparameter) and is followed by ReLU activation; the second fully connected layer restores the original dimension.
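A sketch of this channel-attention bottleneck (operations (1) and (2)) in PyTorch; r = 16 is an illustrative value, since r is left as a hyperparameter in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling over (T, F), an FC layer reducing C to C/r with ReLU,
    and an FC layer restoring C followed by a sigmoid, as in the equations above."""
    def __init__(self, channels=768, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, m):                 # m: (N, C, T, F)
        z = m.mean(dim=(2, 3))            # global average pooling -> (N, C)
        alpha = self.fc(z)                # attention coefficient vector alpha in R^C
        return alpha
```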
(3) Apply the obtained attention coefficient vector to each channel of the multi-channel feature map, weighting the channels to obtain the weighted feature map:
M̃_k(T,F) = α_k · M_k(T,F), k = 1, ..., C, where M̃_k(T,F) is the k-th channel of the weighted feature map M̃(T,F); M_k(T,F) is the k-th channel of the multi-channel feature map M(T,F); α_k is the k-th value of the attention coefficient vector α; T and F denote the height and width of the feature maps, respectively; C denotes the channel dimension.
The weighted feature map is expanded into a one-dimensional feature vector by a Flatten layer, and the output of the model, i.e. the scene category to which the original speech signal belongs, is finally obtained through a fully connected layer and a Softmax layer.
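A sketch of operation (3) and the classification head, reusing the ChannelAttention module from the previous sketch; the 10 output classes correspond to the dataset described below, and the lazily sized fully connected layer is an illustrative choice since the flattened dimension depends on the feature-map size:

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Channel-wise attention weighting followed by Flatten, a fully connected layer,
    and Softmax, producing the scene-class probabilities."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.attn = ChannelAttention(768)                     # from the previous sketch
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_classes))

    def forward(self, m):                                     # m: (N, C, T, F)
        alpha = self.attn(m)
        m_weighted = m * alpha[:, :, None, None]              # weight each channel by alpha_k
        logits = self.head(m_weighted)
        return torch.softmax(logits, dim=-1)                  # scene-category probabilities
```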
Fig. 4 compares the classification results of the improved acoustic scene classification method of the present invention with those of four other acoustic scene classification methods. Five classification models were compared on the dataset: Gaussian mixture model (GMM), k-nearest neighbor (kNN), support vector machine (SVM), random forest (RF), and the dual-path deep residual network model proposed by the present invention. A 988-dimensional feature vector extracted from the audio was used as the input to the GMM, kNN, SVM, and RF models. The GMM used 12 Gaussian components, each with its own standard covariance matrix; the kNN classifier used k = 7; the SVM used a penalty coefficient of 1.8, a Gaussian kernel, and one-vs-one (OvO) classification; the RF contained 200 decision trees, which used the Gini index as the optimal feature selection criterion when splitting nodes. The Gini index is the probability that a randomly selected sample from the sample set is misclassified: Gini index = probability that a sample is selected × probability that it is misclassified. Like information entropy, it measures the uncertainty of a random variable: the larger the Gini index, the higher the uncertainty of the data; the smaller the Gini index, the lower the uncertainty; a Gini index of 0 means that all samples in the dataset belong to the same category. The selected original speech dataset contains 10 types of acoustic scenes (airport, bus, metro, metro station, park, public square, shopping mall, pedestrian street, traffic street, and tram), with a total of 14,400 audio clips. The experimental results show that the improved acoustic scene classification method proposed by the present invention achieves an average accuracy of 81.6% on the dataset, far higher than the other four acoustic scene recognition methods.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.