Complex sound identification method based on one-dimensional convolutional neural network
Technical Field
The invention belongs to the technical field of audio processing, relates to a complex sound identification technology, and particularly relates to a complex sound identification method based on a one-dimensional convolutional neural network.
Background
Complex sound refers to non-speech sound in the environment. Its sources are varied, the signal itself is non-stationary, and it is often accompanied by strongly interfering background noise, so the acoustic features of different sound scenes are either not distinctive enough or highly similar to one another. Complex sound identification automatically recognizes the specific types of complex sound in the environment, such as children playing, car horns, and street music. In fields such as speech classification and music classification, very high accuracy has already been achieved, but in complex sound recognition the non-stationarity of the signal makes speech and music classification schemes clearly unsuitable, so an effective recognition model for complex sound is needed.
At present, there are three main approaches that combine neural networks to solve the complex sound classification problem, distinguished by the input data they use: the original signal, hand-crafted features, and multiple kinds of input data. The first approach trains the network directly on the original signal; its advantages are that no manual feature extraction is required, the processing pipeline is greatly simplified, and the model is simple and easy to deploy. The second approach processes the original data and manually extracts features of the sound signal, such as a spectrogram or Mel-frequency cepstral coefficients. The third is a multi-input composite network in which both the original sound signal and the manually extracted features serve as network inputs; its advantage is that the original (time-domain) features and the frequency-domain features of the signal can be combined, compensating for the insufficiency of any single representation, but the model is complex, places high demands on platform hardware, and is inconvenient to apply.
Deep learning models based on the original audio signal have been used by many researchers to solve complex sound recognition problems, such as the complex sound recognition model based on a one-dimensional convolutional neural network proposed by Dai et al., which achieves good recognition accuracy. However, deep learning models still struggle to extract effective features from the raw signal, and the models proposed in the prior art are complex and need further optimization. Solving the complex sound problem from the original audio signal therefore remains a major challenge. To achieve a good recognition effect, the existing schemes still face the following problems:
(1) Inconsistent raw data lengths
In practical data processing, the audio durations in a data set (for example, the UrbanSound8K data set or data collected in a real environment) are often inconsistent, while a one-dimensional convolutional neural network requires a fixed input length, so data padding is needed. Common padding methods include cubic spline interpolation and zero padding. Many audio clips in a data set differ greatly from the target length; for example, the actual duration may be 1 second while the target length is 4 seconds. Cubic spline interpolation is clearly unsuitable in this case, and zero padding is too simplistic: much information is lost, and the more zeros are filled in, the more the valid information may be masked. The invention therefore provides a random completion algorithm that pads the data using the original data itself and enriches the data features at the same time.
(2) Attention mechanism
An attention mechanism enables the model to focus on useful information and can further improve model performance. The invention provides a simplified attention mechanism for the one-dimensional convolutional neural network, which computes weights from the global features and multiplies them back onto the pooled feature vector to obtain an attention feature vector.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides a complex sound recognition method based on a one-dimensional convolutional neural network. First, a random completion algorithm is provided, which pads original audio data of uneven lengths to the same length and feeds them into the network model; the network model is then optimized by introducing a pre-emphasis technique and a simplified attention mechanism into the neural network for training; finally, a complex sound recognition model is constructed.
In order to solve the technical problems, the invention adopts the technical scheme that:
A complex sound identification method based on a one-dimensional convolutional neural network uses a random completion algorithm to process complex sound, padding the original data to the same length for input to the one-dimensional convolutional neural network. A pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the one-dimensional convolutional neural network: the pre-emphasis module is placed at the input of the network to pre-emphasize the input data and participates in tuning the network model, while the simplified attention mechanism module is placed in the deep layers of the network and obtains global features with attention by using a global average pooling function and a sigmoid function.
Further, the detailed steps of the complex sound recognition method based on the one-dimensional convolutional neural network are as follows:
firstly, processing the original data: the original data are padded with the random completion algorithm to obtain clipped, randomly padded original audio of consistent length, which is used as the input data of the one-dimensional convolutional neural network;
secondly, pre-emphasis: the input data are pre-emphasized by the pre-emphasis module and then processed by a convolutional layer;
and thirdly, the one-dimensional convolutional neural network: a feature vector is obtained through the one-dimensional convolutional neural network, whose structure uses two convolutional layers with the same number of channels followed by a pooling layer, with this block stacked three times, giving six convolutional layers in total;
fourthly, attention mechanism: inputting the feature vector into a simplified attention mechanism module to obtain the feature with attention;
fifthly, output classification: finally, the final recognition result is output through two fully connected layers and a softmax classification function.
Further, the random completion algorithm specifically comprises the following steps:
(1) dividing all samples into two categories, those longer than or equal to N/2 seconds and those shorter than N/2 seconds, where the target sample length is N seconds;
(2) for samples longer than or equal to N/2 seconds, randomly selecting a starting point from which a single segment can complete the audio to N seconds, intercepting a segment of the required length from that point, and appending the intercepted audio segment to the end of the original audio to complete the padding;
(3) for samples shorter than N/2 seconds, directly copying the whole sample repeatedly until its length is greater than or equal to N seconds, and finally clipping it to N seconds.
Furthermore, the pre-emphasis module has a two-layer convolution structure: the convolution kernel of the first layer is initialized to [-0.97, 1], and the convolution kernel of the second layer is initialized to 1, so the pre-emphasis coefficient can be further adjusted during training.
Further, the number of convolution kernels of each layer of the pre-emphasis module is set to be 1.
Further, in the simplified attention mechanism, global average pooling is first used to compress the features into a one-dimensional feature whose length equals the number of channels, giving the global features of the model; the features are then input into a sigmoid function to obtain the weight of each channel; finally, the weights are multiplied by the one-dimensional features obtained from the global average pooling to obtain new global features, which are the features with attention;
the expression for the attention mechanism is as follows:
wherein F is the deep output characteristic of the one-dimensional convolution neural network, W is the weight vector, FOIs a global feature with attention.
Compared with the prior art, the invention has the advantages that:
(1) The random completion method designed by the invention pads original data of uneven lengths to the same length, which is convenient for network input. It makes up for the crudeness of zero padding by supplementing the original data with the original data itself, retains the timing and other characteristics of the original data to the greatest extent, provides more useful features, and contributes clearly to improved classification performance.
(2) The pre-emphasis module designed by the invention incorporates the pre-emphasis technique into the convolutional neural network by means of the convolution operation of the convolutional layers. Adding a convolutional layer with a kernel initialized to 1 and a kernel length of 1 provides buffer space for the preceding pre-emphasis layer, allows the network to be further fine-tuned, reduces the tuning burden of the subsequent one-dimensional convolutional neural network, and improves performance.
(3) The simplified attention mechanism designed by the invention obtains the global characteristics with attention by utilizing the global average pooling and sigmoid functions, and is beneficial to model classification.
(4) Combining the above points, an end-to-end complex sound identification model based on the one-dimensional convolutional neural network is constructed; the model captures the characteristics of the original complex sound well and achieves a good identification effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flow chart of complex sound recognition according to the present invention;
FIG. 2 is a comparison of the original data after the random completion method and the zero padding method of the present invention;
FIG. 3 is a diagram of a pre-emphasis module of the present invention;
FIG. 4 is a simplified attention mechanism model architecture diagram of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
This embodiment provides a complex sound identification method based on a one-dimensional convolutional neural network, which comprises the following two aspects. On the one hand, a random completion algorithm is used to process the complex sound, padding the original data to the same length for input to the one-dimensional convolutional neural network. On the other hand, the network model structure is optimized: a pre-emphasis module and a simplified attention mechanism module are embedded in the basic framework of the one-dimensional convolutional neural network. The pre-emphasis module is placed at the input of the network to pre-emphasize the input data and participates in tuning the network model; the simplified attention mechanism module is placed in the deep layers of the network and obtains global features with attention by using a global average pooling function and a sigmoid function.
With reference to the complex sound recognition flowchart shown in FIG. 1, the detailed steps are as follows:
Firstly, processing the original data: the original data are padded with the random completion algorithm to obtain clipped, randomly padded original audio of consistent length, which is used as the input data of the one-dimensional convolutional neural network.
The random completion algorithm comprises the following specific steps:
(1) assuming the target sample length is 4 seconds, all samples are divided into two categories: those longer than or equal to 2 seconds and those shorter than 2 seconds;
(2) for samples longer than or equal to 2 seconds, a starting point is randomly selected from which a single segment can complete the audio to 4 seconds; a segment of the required length is intercepted from that point and appended to the end of the original audio to complete the padding;
(3) for samples shorter than 2 seconds, the whole sample is copied repeatedly until its length is greater than or equal to 4 seconds, and it is finally clipped to 4 seconds.
A comparison of the raw data after the random completion method and after zero padding is shown in FIG. 2. The pseudo code of the random completion algorithm is given below.
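Since the original pseudo code is not reproduced in this text, the following is a minimal Python sketch of the random completion procedure described above; the function name `random_completion`, the use of NumPy, and the sample-level handling are illustrative assumptions rather than the original pseudo code.

```python
import numpy as np

def random_completion(audio: np.ndarray, target_len: int, rng=np.random) -> np.ndarray:
    """Pad or clip a 1-D audio signal to exactly target_len samples."""
    n = len(audio)
    if n >= target_len:
        # Already long enough: clip to the target length.
        return audio[:target_len]
    if 2 * n >= target_len:
        # At least half the target length: one randomly chosen segment of the
        # original audio, appended at the end, completes the clip.
        need = target_len - n
        start = rng.randint(0, n - need + 1)
        return np.concatenate([audio, audio[start:start + need]])
    # Less than half the target length: copy the whole sample repeatedly,
    # then clip to the target length.
    repeats = int(np.ceil(target_len / n))
    return np.tile(audio, repeats)[:target_len]

# Example: pad a 1-second clip to 4 seconds at a 16 kHz sampling rate.
padded = random_completion(np.random.randn(16000), target_len=4 * 16000)
```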
Secondly, pre-emphasis: the input data is pre-emphasized through a pre-emphasis module, and then is processed through a convolution layer with a large convolution kernel.
The pre-emphasis module has a two-layer convolution structure: the convolution kernel of the first layer is initialized to [-0.97, 1], and the convolution kernel of the second layer is initialized to 1, so the pre-emphasis coefficient can be further adjusted. FIG. 3 shows the structure of the pre-emphasis module. Its purpose is to pre-emphasize the input data rather than extract features, so the number of convolution kernels in each layer is set to 1. During model learning, the pre-emphasis module also participates in network tuning.
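As a concrete illustration of the module in FIG. 3, the following PyTorch sketch builds the two single-kernel convolutional layers with the stated initial values; the class name, the absence of bias terms, and the tensor layout are assumptions of this sketch, not details taken from the original.

```python
import torch
import torch.nn as nn

class PreEmphasis(nn.Module):
    """Two-layer, single-channel convolution that starts out as a pre-emphasis
    filter (first kernel [-0.97, 1], second kernel of length 1 initialized to 1)
    and is fine-tuned together with the rest of the network."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 1, kernel_size=2, bias=False)
        self.conv2 = nn.Conv1d(1, 1, kernel_size=1, bias=False)
        with torch.no_grad():
            self.conv1.weight.copy_(torch.tensor([[[-0.97, 1.0]]]))
            self.conv2.weight.copy_(torch.tensor([[[1.0]]]))

    def forward(self, x):        # x: (batch, 1, samples)
        return self.conv2(self.conv1(x))
```

Because both layers remain trainable, the effective pre-emphasis coefficient can move away from -0.97 during training, which is the adjustment behaviour described above.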
And thirdly, the one-dimensional convolutional neural network: a feature vector is obtained by processing with the one-dimensional convolutional neural network, whose structure uses two convolutional layers with the same number of channels followed by a pooling layer, with this block stacked three times, giving six convolutional layers in total.
Fourthly, attention mechanism: the feature vectors are input into a simplified attention mechanism module to obtain attention-bearing features.
As shown in FIG. 4, the attention mechanism is placed in the deep layers of the one-dimensional convolutional neural network. First, Global Average Pooling (GAP) compresses the features into a one-dimensional feature whose length equals the number of channels, which gives the global features of the model; the features are then input into a sigmoid function to obtain the weight of each channel; finally, the weights are multiplied by the one-dimensional features obtained from the GAP to obtain new global features, which are the features with attention.
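The following PyTorch sketch expresses the simplified attention mechanism as described (GAP followed by a sigmoid gate); the class name and the tensor layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """Compress features with global average pooling, gate them with a sigmoid,
    and return the gated global feature vector."""
    def forward(self, feat):                 # feat: (batch, channels, length)
        gap = feat.mean(dim=-1)              # global average pooling -> (batch, channels)
        weights = torch.sigmoid(gap)         # per-channel attention weights W
        return weights * gap                 # global features with attention F_O
```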
The expression of the attention mechanism is as follows:
W = sigmoid(GAP(F)), F_O = W ⊙ GAP(F)   Equation (1)
where F is the deep output feature of the one-dimensional convolutional neural network, W is the weight vector, and F_O is the global feature with attention.
Fifthly, output classification: finally, the final recognition result is output through two fully connected layers and a softmax classification function.
With reference to FIG. 1, the invention integrates the random completion algorithm, the pre-emphasis module, and the simplified attention mechanism to obtain a complex sound recognition model. The input audio data are first randomly completed to obtain clipped and completed original audio, which is used as the input data of the network; the pre-emphasis module then pre-emphasizes the input data, which next pass through a convolutional layer with a large convolution kernel. The conventional one-dimensional convolutional neural network structure then applies two convolutional layers with the same number of channels and a pooling layer, with this block stacked three times, for six convolutional layers in total. In addition, the first three layers further enlarge the receptive field of the model by using dilated convolutions with dilation coefficients of 2, 3, and 4, respectively. The feature vector is then input into the simplified attention mechanism module to obtain features with attention, and the final recognition result is finally output through two fully connected layers and a softmax classification function. A sketch assembling these components is given below.
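To show how the pieces fit together, the sketch below assembles the PreEmphasis and SimplifiedAttention modules from the previous steps into an end-to-end model. Because Table 1 is not reproduced here, every channel count, kernel size, stride, pooling size, and the exact placement of the dilation coefficients 2, 3, and 4 are assumptions of this sketch, not the parameters of the original model.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, dilation):
    """Two same-channel convolutions followed by max pooling (all sizes assumed)."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, dilation=dilation, padding=dilation),
        nn.ReLU(),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, dilation=dilation, padding=dilation),
        nn.ReLU(),
        nn.MaxPool1d(4),
    )

class ComplexSoundNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.pre_emphasis = PreEmphasis()            # sketch from the pre-emphasis step
        self.front = nn.Sequential(                  # "large kernel" front convolution (size assumed)
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
        )
        self.blocks = nn.Sequential(                 # three stacked blocks = six convolutional layers
            conv_block(32, 32, dilation=2),
            conv_block(32, 64, dilation=3),
            conv_block(64, 128, dilation=4),
        )
        self.attention = SimplifiedAttention()       # sketch from the attention step
        self.classifier = nn.Sequential(             # two fully connected layers
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, num_classes),
        )

    def forward(self, x):                            # x: (batch, 1, samples)
        x = self.front(self.pre_emphasis(x))
        x = self.blocks(x)
        x = self.attention(x)                        # (batch, channels)
        return self.classifier(x)                    # class logits; softmax is applied at classification
```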
The model structure and parameters are shown in table 1, with a sample rate of 16kHz and a sample length of 4 seconds as an example.
TABLE 1. Model structure and parameters
Experimental configuration and results:
1. Loss function
The model uses the classical cross-entropy loss function; the formula is as follows:
H(p, q) = -∑_x p(x) log q(x)   Equation (2)
where p represents the true sample distribution and q is the predicted distribution of the trained model.
2. Optimization algorithm
The optimizer uses stochastic gradient descent with a momentum of 0.9, and the momentum term is updated as follows:
v_t = γ · v_{t-1} + lr · grad   Equation (3)
where v_t is the momentum term, γ is the momentum coefficient, typically set to 0.9, lr is the learning rate, and grad is the gradient.
3. Learning rate
The learning rate is decayed in discrete steps. During model training, the batch size is set to 64 and the model is trained for 200 rounds.
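Under the stated configuration (cross-entropy loss, SGD with momentum 0.9, step-wise learning-rate decay, batch size 64, 200 training rounds), a compact training-setup sketch could look as follows; it reuses the ComplexSoundNet sketch above, and the initial learning rate and the decay step and factor are assumptions, since the original values are not reproduced here.

```python
import torch
import torch.nn as nn

model = ComplexSoundNet(num_classes=10)              # sketch from the model step above
criterion = nn.CrossEntropyLoss()                    # cross-entropy loss, Equation (2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum 0.9; lr assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)  # assumed decay schedule

# One illustrative update on a random batch (batch size 64, 4 s of audio at 16 kHz).
waveforms = torch.randn(64, 1, 4 * 16000)
labels = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = criterion(model(waveforms), labels)           # softmax is folded into the loss
loss.backward()
optimizer.step()
scheduler.step()                                     # stepped once per epoch over the 200 rounds
```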
4. Results of the experiment
The recognition accuracy of the complex sound recognition model based on the one-dimensional convolutional neural network reaches 84.4%, 73.8%, and 88.6% on the ESC10, ESC50, and UrbanSound8K data sets, respectively, which shows that the method is effective.
It should be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples; those skilled in the art may make various changes, modifications, additions, and substitutions within the spirit and scope of the present invention.