Sound scene classification method based on a multi-scale residual attention network
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a sound scene classification method based on a multi-scale residual attention network.
Background
Humans have an inherent ability to identify sound scenes: from experience, a listener can judge from audio alone the scene in which it was recorded, such as a subway or a bus. With the continuous development of signal processing and artificial intelligence technology, it has also become possible for machines to understand sound and judge its source. Acoustic scene classification (ASC) is a multi-class classification task that aims to identify, from an audio segment, the scene in which the audio was recorded. At present, sound scene classification is widely applied in fields such as intelligent wearable devices, audio archiving, interactive robots and security monitoring.
Sound scene classification methods mainly fall into two major categories. The first category comprises methods based on traditional machine learning, such as Gaussian mixture models, hidden Markov models and support vector machines, which suffer from low classification performance and poor generalization capability. The second category comprises methods based on deep learning, such as deep neural networks, convolutional neural networks and recurrent neural networks; however, these usually contain only convolution kernels of a single scale, so the mined features are not rich and comprehensive enough, and they do not consider that features of different regions have different importance.
Therefore, how to fully mine the data features and improve the accuracy of classification of sound scenes is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a sound scene classification method based on a multi-scale residual attention network, which aims to solve the problems in current sound scene classification tasks that the extracted features are of a single scale and not rich enough, and that different regions of the extracted features are not treated as having different importance.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a sound scene classification method based on a multi-scale residual attention network, comprising the steps of:
Step 1, collecting audio data, inputting the audio data into a feature extraction module for feature extraction, and extracting a logarithmic Mel spectrogram together with its first-order difference and second-order difference as input features;
Step 2, constructing a multi-scale residual attention network, and inputting the input features into the network for training to establish a classification model;
Step 3, processing the audio data with the mixup method to obtain data samples and enhance data diversity;
Step 4, inputting the data samples into the classification model for classification, and optimizing the classification model with a focal loss that focuses on hard-to-classify samples;
Step 5, acquiring new sound scene audio and inputting it into the optimized classification model to classify the sound scene, obtaining a sound scene classification result.
Preferably, the specific process of feature extraction in step 1 is as follows:
Step 1.1, performing pre-emphasis processing on the collected audio data so that the high-frequency and low-frequency parts of the sound signal are more balanced;
Step 1.2, framing the pre-emphasized audio data into a plurality of frames of audio signal;
Step 1.3, windowing each frame of the audio signal with a Hanning window function to obtain a short-time windowed signal;
Step 1.4, performing a Fourier transform on the short-time windowed signal to convert it from the time domain to the frequency domain, obtaining a frequency-domain signal;
Step 1.5, passing the obtained frequency-domain signal through a Mel filter bank to obtain a Mel spectrogram of suitable size;
Step 1.6, taking the logarithm of the Mel spectrogram to obtain a logarithmic Mel spectrogram;
Step 1.7, calculating the first-order difference and the second-order difference of the logarithmic Mel spectrogram to obtain the dynamic characteristics of the signal, and stacking the logarithmic Mel spectrogram with its first-order and second-order differences to obtain the final input features.
Preferably, in step 1 the frame overlap rate during framing is 50%, the number of FFT points in the Fourier transform is 2048, and the number of Mel filters is 128.
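Purely as an illustration and not part of the claimed method, this feature extraction can be sketched in Python with the librosa library; the sampling rate and pre-emphasis coefficient below are assumed values rather than values fixed by the method:
```python
# A minimal sketch of steps 1.1-1.7, assuming the librosa library; the sampling
# rate and pre-emphasis coefficient are illustrative choices, not values fixed
# by the method.
import numpy as np
import librosa

def extract_features(path, sr=44100, n_fft=2048, n_mels=128, preemph=0.97):
    y, sr = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=preemph)       # step 1.1: pre-emphasis
    mel = librosa.feature.melspectrogram(                  # steps 1.2-1.5: framing (50% overlap),
        y=y, sr=sr, n_fft=n_fft, hop_length=n_fft // 2,    # Hanning window, FFT,
        window="hann", n_mels=n_mels)                      # Mel filter bank
    logmel = librosa.power_to_db(mel)                      # step 1.6: logarithmic Mel spectrogram
    d1 = librosa.feature.delta(logmel, order=1)            # step 1.7: first-order difference
    d2 = librosa.feature.delta(logmel, order=2)            # second-order difference
    return np.stack([logmel, d1, d2], axis=0)              # stacked 3 x n_mels x frames feature
```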
Preferably, the specific process of the step 2 is as follows:
Step 2.1, dividing the input feature formed by the logarithmic Mel spectrogram and its first-order and second-order differences into a high-frequency part and a low-frequency part;
Step 2.2, respectively inputting the high-frequency part and the low-frequency part into a channel attention module of the multi-scale residual attention network, which assigns different weights according to the different importance of the features, highlighting important features and suppressing secondary features so as to generate new features;
Step 2.3, inputting the new features extracted by the channel attention module into a multi-scale residual module of the multi-scale residual attention network, and extracting feature information of different precision and different depth to obtain a high-frequency partial feature map and a low-frequency partial feature map;
Step 2.4, splicing the two partial feature maps obtained through the multi-scale residual module along the frequency dimension to obtain all the features;
Step 2.5, classifying all the features by passing them sequentially through a convolution block consisting of a batch normalization layer, a rectified linear unit and a 1×1 convolution layer, a convolution block consisting of a batch normalization layer and a 1×1 convolution layer, a batch normalization layer, a global average pooling layer and a softmax layer.
Preferably, the specific process of generating the new features through the channel attention module in step 2.2 includes:
Step 2.2.1, respectively performing maximum pooling and average pooling operations on the high-frequency and low-frequency input features to obtain two feature maps;
Step 2.2.2, respectively feeding the two feature maps obtained by pooling into a multi-layer perceptron to obtain two perception results;
Step 2.2.3, adding the two perception results obtained through the multi-layer perceptron to obtain a summed result;
Step 2.2.4, applying a sigmoid activation to the summed result to obtain the weight parameters of the input features;
Step 2.2.5, finally, multiplying the weight parameters with the input features to generate the new features.
Preferably, in step 2.3 the new features sequentially pass through a batch normalization layer and a convolution layer, then pass twice through the residual block Residual01 composed of convolution kernels of three different scales (1×1, 3×3 and 5×5), and then pass three times through a combined block composed of the residual block Residual02 (built from the same three convolution kernel scales together with maximum pooling, average pooling and zero padding) and the residual block Residual01, thereby obtaining the high-frequency partial feature map and the low-frequency partial feature map.
Preferably, the formulas for acquiring the data samples with the mixup method in step 3 are as follows:
x = λxi + (1-λ)xj
y = λyi + (1-λ)yj
wherein (xi, yi) and (xj, yj) are two samples selected at random from a training set divided from the collected audio data, xi and xj are the original input vectors, yi and yj are the corresponding label codes, and λ is a hyper-parameter with λ ∈ [0, 1].
Preferably, the acquired new sound scene audio is tested to obtain its true classification results, and the classification accuracy is calculated by comparing these with the sound scene classification results obtained in step 5; the classification model can then be further optimized and corrected according to the classification accuracy, thereby improving the classification accuracy.
Compared with the prior art, the invention discloses a sound scene classification method based on a multi-scale residual attention network. The method uses convolution kernels of several different scales to mine more detailed and global information, and combines an improved residual network structure to extract more semantic information at different levels. A channel attention mechanism is introduced to give the features on different channels different weights according to their importance, so that key features are learned, secondary features are suppressed, and the network's ability to learn features is enhanced. The mixup method is adopted to enhance data diversity, and a focal loss is adopted to focus on hard-to-classify samples, thereby improving the sound scene classification effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a sound scene classification method based on a multi-scale residual attention network provided by the invention;
FIG. 2 is a schematic diagram of a feature extraction process according to the present invention;
FIG. 3 is a schematic view of a channel attention module structure according to the present invention;
Fig. 4 is a schematic structural diagram of a multi-scale residual module provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a sound scene classification method based on a multi-scale residual attention network, comprising the following steps:
S1, inputting the acquired audio data into a feature extraction module, and extracting a logarithmic Mel spectrogram and its first-order and second-order differences as input features, wherein the specific flow of the feature extraction module is shown in fig. 2 and comprises the following steps:
S1.1, performing pre-emphasis processing on the collected audio data so that the high-frequency and low-frequency parts of the sound signal are more balanced, the pre-emphasis formula being:
H(z) = 1 - a·z^(-1) (1)
wherein a is the pre-emphasis coefficient;
S1.2, framing the pre-emphasized audio data into a plurality of frames of audio signal;
S1.3, windowing each frame of the audio signal with a Hanning window function to obtain a short-time windowed signal, the Hanning window being given by:
w(n) = 0.5·[1 - cos(2πn/(N-1))], 0 ≤ n ≤ N-1 (2)
wherein N is the frame length in sampling points;
S1.4, performing a Fourier transform on the short-time windowed signal to convert it from the time domain to the frequency domain and obtain a frequency-domain signal, the transform being:
X(k) = ∑(n=0 to N-1) x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1 (3)
wherein x(n) is a frame of the windowed signal and N is the number of FFT points;
S1.5, the frequency-domain signal obtained in the previous step is passed through a Mel filter bank to obtain a Mel spectrogram of suitable size, the center frequency of each triangular filter in the Mel filter bank being shown in formula (4) and the frequency response in formula (5):
f(m) = (N/fs)·M^(-1)(M(fl) + m·(M(fh) - M(fl))/(K+1)) (4)
Hm(k) = 0 for k < f(m-1); (k - f(m-1))/(f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); (f(m+1) - k)/(f(m+1) - f(m)) for f(m) < k ≤ f(m+1); 0 for k > f(m+1) (5)
wherein f(m) is the center frequency of the m-th filter, fl and fh are the lower and upper frequency limits of the filter bank respectively, N is the number of FFT points, fs is the sampling frequency, K is the number of Mel filters, M(·) is the Mel-scale transform and M^(-1)(·) is its inverse, M(·) being defined as shown in formula (6):
M(f) = 2595·lg(1 + f/700) (6)
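Purely as an illustrative check and not part of the claimed method, formulas (4) and (6) can be evaluated with a few lines of Python; the sampling rate, FFT size and band limits below are assumed values:
```python
# A small numeric illustration of formulas (4) and (6), assuming Python/NumPy;
# the sampling rate, FFT size and frequency limits are illustrative values.
import numpy as np

def mel(f):                                    # formula (6): Mel-scale transform M(f)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):                                # inverse transform M^(-1)(m)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def center_bins(K=128, f_low=0.0, f_high=22050.0, n_fft=2048, fs=44100.0):
    m = np.arange(1, K + 1)                    # filter index m = 1..K
    hz = mel_inv(mel(f_low) + m * (mel(f_high) - mel(f_low)) / (K + 1))
    return n_fft / fs * hz                     # formula (4): center frequencies in FFT bins

print(mel(1000.0))        # ~1000 mel at 1 kHz
print(center_bins()[:5])  # centers of the first few triangular filters
```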
S1.6, taking the logarithm of the Mel spectrogram to obtain a logarithmic Mel spectrogram;
S1.7, obtaining a first-order difference and a second-order difference of the logarithmic Mel spectrogram to obtain the dynamic characteristics of the voice signal, and stacking the logarithmic Mel spectrogram and the first-order and second-order differences thereof to obtain the final input characteristics;
S2, constructing a multi-scale residual attention network, and inputting the extracted input features into the multi-scale residual attention network for training to obtain a classification model;
S2.1, dividing the input feature formed by the logarithmic Mel spectrogram and its first-order and second-order differences into a high-frequency part and a low-frequency part;
S2.2, respectively inputting a high-frequency part and a low-frequency part into a channel attention module of the multi-scale residual attention network, distributing different weights according to different importance of the features, highlighting important features and suppressing secondary features, wherein the specific process can be expressed as formulas (7) and (8):
Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (7)
F' = Mc(F) ⊗ F (8)
wherein F is the input feature map of size (H×W×C), AvgPool(F) and MaxPool(F) are the average pooling and maximum pooling operations respectively, Mc(F) is the weight parameter, F' is the feature obtained through the channel attention module, σ represents the sigmoid function, and ⊗ represents the element-wise product operation;
The channel attention module structure is shown in fig. 3, and comprises the following steps:
S2.2.1, performing maximum pooling and average pooling operations on the input features respectively to obtain two feature maps;
S2.2.2, respectively feeding the two feature maps obtained by pooling into a multi-layer perceptron to obtain two perception results;
S2.2.3, adding the two perception results obtained by the multi-layer perceptron;
S2.2.4, applying a sigmoid activation to the summed result to obtain the weight parameters of the input features;
S2.2.5, finally, multiplying the weight parameters with the input features to generate the new features.
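Purely as an illustration and not part of the claimed method, the channel attention module of S2.2.1-S2.2.5 and formulas (7)-(8) can be sketched in PyTorch; the reduction ratio of the shared multi-layer perceptron is an assumed value:
```python
# A minimal sketch of the channel attention module, assuming PyTorch; the MLP
# reduction ratio is an illustrative assumption.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(               # shared multi-layer perceptron (S2.2.2)
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels))

    def forward(self, f):                        # f: (batch, C, H, W) input feature map F
        avg = self.mlp(f.mean(dim=(2, 3)))       # S2.2.1-S2.2.2: average pooling + MLP
        mx = self.mlp(f.amax(dim=(2, 3)))        # S2.2.1-S2.2.2: maximum pooling + MLP
        mc = torch.sigmoid(avg + mx)             # S2.2.3-S2.2.4: Mc(F), formula (7)
        return f * mc[:, :, None, None]          # S2.2.5: F' = Mc(F) ⊗ F, formula (8)
```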
S2.3, inputting the new features generated by the channel attention module into a Multi-Scale Residual Module (MSRM) of the multi-scale residual attention network, and extracting feature information of different precision and different depth to obtain a high-frequency partial feature map and a low-frequency partial feature map, wherein the structure of the multi-scale residual module is shown in fig. 4 and the steps are as follows:
S2.3.1, passing the generated new features through a batch normalization layer (Batch Normalization, BN) and a convolution layer;
S2.3.2, passing twice through the residual block Residual01 composed of convolution kernels of three different scales of 1×1, 3×3 and 5×5, as sketched below;
S2.3.3, passing three times through a combined block consisting of the residual block Residual02 (composed of convolution kernels of three different scales of 1×1, 3×3 and 5×5 together with maximum pooling, average pooling and zero padding) and the residual block Residual01;
S2.3.4, finally obtaining the high-frequency partial feature map and the low-frequency partial feature map.
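As an illustrative sketch only, the Residual01 block of S2.3.2 could be implemented in PyTorch as below; the text specifies parallel 1×1, 3×3 and 5×5 convolution kernels with a residual connection, while fusing the three branches by concatenation followed by a 1×1 convolution, and the BN/ReLU placement, are assumptions:
```python
# A sketch of the Residual01 building block, assuming PyTorch; branch fusion by
# concatenation + 1x1 convolution is an assumption, not specified in the text.
import torch
import torch.nn as nn

class Residual01(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([          # parallel 1x1, 3x3 and 5x5 convolutions
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)])
        self.fuse = nn.Sequential(               # fuse multi-scale branches back to `channels`
            nn.Conv2d(3 * channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU())

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale feature maps
        return x + self.fuse(multi)                               # residual connection
```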
S2.4, splicing the two partial feature maps obtained through the multi-scale residual module along the frequency dimension to obtain all the features;
S2.5, classifying all the features by passing them sequentially through a convolution block consisting of a BN layer, a rectified linear unit (Rectified Linear Unit, ReLU) and a 1×1 convolution layer, a convolution block consisting of a BN layer and a 1×1 convolution layer, a BN layer, a global average pooling layer and a softmax layer;
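As a structural illustration only, the overall flow of S2.1-S2.5 can be wired together in PyTorch as sketched below; the branch modules are supplied from outside (for example the ChannelAttention and Residual01 sketches above), and the midpoint frequency split, the branch output channel count and the 10-class output are assumptions:
```python
# A structural sketch of the multi-scale residual attention network, assuming
# PyTorch; the classification head follows the S2.5 description, everything
# else is illustrative.
import torch
import torch.nn as nn

class MultiScaleResAttentionNet(nn.Module):
    def __init__(self, high_branch, low_branch, branch_out_ch, n_classes=10):
        super().__init__()
        self.high_branch = high_branch           # channel attention + multi-scale residual module
        self.low_branch = low_branch
        self.head = nn.Sequential(               # S2.5: (BN, ReLU, 1x1 conv), (BN, 1x1 conv),
            nn.BatchNorm2d(branch_out_ch), nn.ReLU(),          # BN, global average pooling
            nn.Conv2d(branch_out_ch, branch_out_ch, 1),
            nn.BatchNorm2d(branch_out_ch), nn.Conv2d(branch_out_ch, n_classes, 1),
            nn.BatchNorm2d(n_classes), nn.AdaptiveAvgPool2d(1))

    def forward(self, x):                        # x: (batch, 3, n_mels, frames)
        mid = x.size(2) // 2
        low, high = x[:, :, :mid], x[:, :, mid:]               # S2.1: frequency-band split
        feats = torch.cat([self.high_branch(high),
                           self.low_branch(low)], dim=2)       # S2.2-S2.4: per-band processing, concat
        return torch.softmax(self.head(feats).flatten(1), dim=1)  # S2.5: softmax classification

# Example wiring (illustrative): one attention + residual branch per frequency band.
# net = MultiScaleResAttentionNet(
#     nn.Sequential(ChannelAttention(3), Residual01(3)),
#     nn.Sequential(ChannelAttention(3), Residual01(3)), branch_out_ch=3)
```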
S3, enhancing data diversity by adopting the mixup method, wherein mixup can be specifically expressed as:
x = λxi + (1-λ)xj (9)
y = λyi + (1-λ)yj (10)
wherein (xi, yi) and (xj, yj) are two samples selected at random from a training set divided from the collected audio data, xi and xj are the original input vectors, yi and yj are the corresponding label codes, and λ is a hyper-parameter with λ ∈ [0, 1] that controls the degree of mixing of the two random samples;
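Purely as an illustration, mixup per formulas (9) and (10) can be implemented in PyTorch as below; drawing λ from a Beta distribution and pairing samples by a random permutation of the batch are common practice and are assumptions here:
```python
# A minimal sketch of mixup, assuming PyTorch tensors and one-hot label codes;
# the Beta(alpha, alpha) sampling of lambda is an assumption.
import torch

def mixup(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # λ ∈ [0, 1]
    perm = torch.randperm(x.size(0))                              # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]                         # formula (9)
    y_mix = lam * y + (1 - lam) * y[perm]                         # formula (10)
    return x_mix, y_mix
```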
S4, adopting a focal loss to focus training on hard-to-classify samples, the focal loss function being specifically expressed as:
FL = -∑(i=1 to n) α·yi·(1 - pi)^λ·log(pi) (11)
wherein n represents the number of categories, yi represents the true label code of the i-th category, pi represents the probability that the sample is predicted as the i-th category, α is a weight factor, and λ is a hyper-parameter;
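As an illustration only, a per-batch focal loss consistent with formula (11) can be sketched in PyTorch as follows; integer class targets, softmax outputs, and the default values of α and the focusing exponent (written gamma in the code, λ in the text above) are assumptions:
```python
# A minimal sketch of the focal loss, assuming PyTorch logits and integer
# class targets; alpha and gamma defaults are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    log_p = F.log_softmax(logits, dim=1)                        # log class probabilities
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p of the true class
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()       # down-weights easy samples
```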
S5, acquiring new sound scene audio, and performing sound scene classification on it with the trained classification model to obtain a sound scene classification result.
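Purely for illustration, the inference of S5 could be written as below, reusing the extract_features sketch given earlier; the class-name list and its ordering are assumptions that must match the label encoding used during training:
```python
# A minimal inference sketch for S5, assuming PyTorch and the extract_features
# sketch above; the scene label list and its order are assumptions.
import torch

SCENE_LABELS = ["airport", "bus", "metro", "metro_station", "park",
                "public_square", "shopping_mall", "street_pedestrian",
                "street_traffic", "tram"]

def classify_scene(model, wav_path):
    feats = torch.from_numpy(extract_features(wav_path)).float().unsqueeze(0)  # (1, 3, mels, frames)
    model.eval()
    with torch.no_grad():
        probs = model(feats)                     # class probabilities from the softmax layer
    return SCENE_LABELS[int(probs.argmax(dim=1))]
```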
Examples
Sound scene classification is performed using the public dataset TAU Urban Acoustic Scenes 2020 Mobile Development dataset (TAU for short) from Task 1A of the DCASE2020 challenge (Detection and Classification of Acoustic Scenes and Events 2020). The dataset contains recordings of 10 different acoustic scenes recorded by 9 different devices in 10 European cities. The 10 sound scenes are: airport, shopping mall, metro station, street pedestrian, public square, street traffic, tram, bus, metro and park. The experiment adopts the classification accuracy over the sound scene categories as the criterion for judging the model; the training set is used for training the model parameters, and the test set is used for comparing the performance of the models. The experimental results are shown as the sound scene category classification results in Table 1:
TABLE 1 Sound scene class classification results
From the experimental results, the performance of the proposed multi-scale residual attention network is obviously better than that of a DCASE2020 Task1A baseline system.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments, and identical or similar parts between the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.