Disclosure of Invention
The invention aims to provide an unknown audio event recognition algorithm based on a pulse neural network, which can effectively recognize and distinguish unknown sound events without depending on pre-labeled unknown-category information, improves the overall recognition accuracy of the system, and provides support for subsequent analysis and processing of unknown sound events.
The invention adopts the technical scheme that the unknown audio event recognition algorithm based on the impulse neural network comprises the following steps:
S1, constructing an audio data set, and splitting the audio data set into a training set, a verification set and a test set;
S2, preprocessing each section of audio data in the audio data set to generate a 3D log-mel spectrogram;
S3, constructing a pulse neural network model, and inputting the 3D log-mel spectrograms corresponding to the training set into the pulse neural network model for classification training, wherein the pulse neural network model comprises a convolution layer, a plurality of pulse neural units, a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM;
S4, jointly training the impulse neural network model by using cross entropy loss and contrast loss;
S5, inputting the audio data of the known classes in the verification set into the pulse neural network and a self-encoder to obtain the average mean-square-error loss $L_{avg}$ inferred by the self-encoder, and setting the decision threshold $\theta$ according to $L_{avg}$, wherein the self-encoder comprises an encoder and a decoder and consists of an input layer, at least two hidden layers and an output layer;
And S6, identifying the acquired audio data by using the trained impulse neural network model, inputting the probability values output by the impulse neural network model into the self-encoder, and judging through the self-encoder whether the input data belongs to a known class or an unknown class: if the loss obtained by inference of the self-encoder is higher than the threshold $\theta$, the input is judged to be of an unknown class; otherwise, the specific known class to which the audio belongs is determined according to the probability values output by the impulse neural network model.
Further, the specific steps of the step S2 are as follows:
S201, converting the audio data into normal distribution through z-standardization, so that the audio data with different characteristics have the same dimension and distribution, and the impulse neural network model is convenient to learn, and the specific formula is as follows:
$z = \frac{x - \mu}{\sigma}$;
where x represents the original audio data, $\mu$ represents the mean of the original audio data, $\sigma$ represents the standard deviation of the original audio data, and z represents the z-normalized audio data;
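For illustration only (the function name and the use of NumPy are assumptions, not part of the claimed method), z-standardization can be sketched as:

```python
import numpy as np

def z_normalize(x: np.ndarray) -> np.ndarray:
    """Map a raw audio waveform to zero mean and unit variance."""
    mu = x.mean()      # mean of the original audio data
    sigma = x.std()    # standard deviation of the original audio data
    return (x - mu) / sigma
```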
S202, generating a log-mel spectrogram corresponding to z-standardized original audio data by using a Mel filter bank, and taking a log-mel spectral feature value in the log-mel spectrogram as an original static feature of the audio data;
S203, calculating the difference between the previous frame and the next frame for each frame log-mel spectrum to obtain a corresponding first-order time derivative, namely a first-order Delta differential feature, and capturing the change of the feature along with time by using the first-order Delta differential feature;
The specific process for solving the first-order Delta differential characteristics is as follows:
Selecting a window size N1 for calculating the first-order Delta differential feature; for each time frame t, calculating the weighted differences between the log-mel spectral feature values of the n-th frames before and after t (n = 1, 2, …, N1), and normalizing the weighted sum with the normalization factor $2\sum_{n=1}^{N_1} n^2$, wherein the specific formula is as follows:
$d_t = \frac{\sum_{n=1}^{N_1} n \left( C_{t+n} - C_{t-n} \right)}{2 \sum_{n=1}^{N_1} n^2}$;
where $d_t$ represents the first-order Delta differential feature of frame t, $C_{t+n}$ represents the log-mel spectral feature value of the n-th frame after time frame t, $C_{t-n}$ represents the log-mel spectral feature value of the n-th frame before time frame t, and n = 1, 2, …, N1.
S204, calculating the difference between the previous frame and the next frame for the first-order Delta differential feature to obtain a corresponding second-order Delta differential feature, capturing acceleration information of the feature changing along with time by using the second-order Delta differential feature, and describing the dynamic change characteristic of the signal;
the specific process for solving the second-order Delta differential characteristics is as follows:
Selecting a window size N2 for calculating the second-order Delta differential feature; for each frame, calculating the weighted differences between the first-order Delta differential features of the n-th frames before and after it (n = 1, 2, …, N2), and normalizing the weighted sum with the normalization factor $2\sum_{n=1}^{N_2} n^2$, wherein the specific formula is as follows:
$d''_t = \frac{\sum_{n=1}^{N_2} n \left( d_{t+n} - d_{t-n} \right)}{2 \sum_{n=1}^{N_2} n^2}$;
where $d''_t$ represents the second-order Delta differential feature of the t-th frame, $d_{t+n}$ represents the first-order Delta differential feature of frame t+n, $d_{t-n}$ represents the first-order Delta differential feature of frame t−n, and n = 1, 2, …, N2.
S205, stacking the log-mel spectrum characteristic value obtained in the step S202, the first-order Delta differential characteristic obtained in the step S203 and the second-order Delta differential characteristic obtained in the step S204 according to characteristic dimensions to obtain a 3D log-mel spectrum signal.
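As a minimal sketch of steps S202–S205 (assuming the librosa library, a 44.1 kHz sample rate, the 128 mel filters of the embodiment, and delta windows N1 = N2 = 2; all of these are assumptions where the text does not fix them):

```python
import numpy as np
import librosa

def make_3d_logmel(z: np.ndarray, sr: int = 44100, n_mels: int = 128) -> np.ndarray:
    """Stack static log-mel, first-order Delta and second-order Delta
    features along a new leading dimension (step S205)."""
    mel = librosa.feature.melspectrogram(y=z, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                      # static feature (S202)
    d1 = librosa.feature.delta(logmel, width=5)            # first-order Delta, N1 = 2 (S203)
    d2 = librosa.feature.delta(logmel, order=2, width=5)   # second-order Delta, N2 = 2 (S204)
    return np.stack([logmel, d1, d2], axis=0)              # shape (3, n_mels, frames)
```

Adding a singleton channel axis to each of the three maps would give the 3 × 1 × 128 × 320 shape quoted in the embodiment.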
Further, the specific steps of the step S202 are as follows:
S2021, dividing the original audio data after z-normalization into overlapped frames, wherein each frame comprises N3 samples, reducing spectrum leakage by using a Hamming window, and obtaining a windowed signal, wherein the specific formula is as follows:
$w[n] = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N_3 - 1} \right)$;
$x_w[n] = x[n] \, w[n]$;
where w[n] represents the Hamming window, $x_w[n]$ represents the signal obtained after windowing, x[n] represents the signal of each frame, and $0 \le n \le N_3 - 1$;
S2022, applying the discrete Fourier transform to the windowed signal of each frame, and calculating the power spectrum of each frame, wherein the specific formulas are as follows:
$X[k] = \sum_{n=0}^{N_3 - 1} x_w[n] \, e^{-j 2 \pi k n / N_3}$;
$P[k] = \frac{\left| X[k] \right|^2}{N_3}$;
where X[k] represents the frequency spectrum, P[k] represents the power spectrum of each frame, k represents the frequency index, and j represents the imaginary unit;
S2023, converting the frequency scale to the Mel scale, filtering the power spectrum of each frame with the Mel filter bank, and taking the logarithm of the resulting Mel spectral energies to obtain the log-Mel spectrogram, wherein the specific formulas are as follows:
$f_{mel} = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)$;
$H_m[k] = \begin{cases} 0, & k < f(m-1) \\ \frac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \frac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$;
$S_m = \sum_{k} H_m[k] \, P[k], \qquad \text{log-mel}_m = \log S_m$;
where $f_{mel}$ represents the Mel frequency corresponding to the linear frequency f, $H_m[k]$ denotes the filter response of the Mel filter bank, f(m) denotes the center frequency of the m-th Mel filter, and $S_m$ denotes the Mel spectral energy calculated by the Mel filter bank.
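A minimal per-frame sketch of S2021–S2023, assuming NumPy, a precomputed triangular filter bank over all N3 FFT bins, and a small floor constant to keep the logarithm finite (all assumptions):

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale conversion: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def logmel_frame(frame: np.ndarray, mel_fb: np.ndarray) -> np.ndarray:
    """Hamming window -> DFT -> power spectrum -> mel energies -> log.
    `mel_fb` is an (n_mels, N3) matrix holding the responses H_m[k]."""
    N3 = len(frame)
    n = np.arange(N3)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N3 - 1))  # Hamming window w[n]
    xw = frame * w                                       # windowed signal x_w[n]
    X = np.fft.fft(xw)                                   # spectrum X[k]
    P = np.abs(X) ** 2 / N3                              # power spectrum P[k]
    S = mel_fb @ P                                       # mel energies S_m
    return np.log(S + 1e-10)                             # log-mel values
```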
Further, the specific steps of the step S3 are as follows:
S301, inputting the 3D log-mel spectrum signal to the convolution layer, and reshaping it according to batch size and channel number;
S302, processing the spectrum signal output by the convolution layer through the pulse neural unit, which provides preceding and succeeding temporal information through its pulse units, with the following specific steps:
S3021, processing the spectrum signal through the pulse neurons and capturing its temporal characteristics, the specific process being as follows:
$U[t] = V[t-1] + \frac{1}{\tau} \left( I[t] - \left( V[t-1] - V_{reset} \right) \right)$;
$S[t] = \Theta\left( U[t] - V_{th} \right)$;
$V[t] = U[t] \left( 1 - S[t] \right) + V_{reset} \, S[t]$;
where U[t] represents the membrane potential before reset at time t; S[t] represents the output spike at time t, equal to 1 when there is a spike and 0 otherwise; $\tau$ represents the time constant, which affects the decay rate of the membrane potential; V[t-1] represents the membrane potential after the triggered spike at time t−1; I[t] represents the input current at time t; $\Theta$ represents the Heaviside step function; $V_{th}$ represents the threshold of the membrane potential, such that when U[t] exceeds the threshold the neuron triggers a spike; V[t] represents the membrane potential after the triggered spike at time t; and $V_{reset}$ represents the value to which the membrane potential is reset after a spike;
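A sketch of one leaky integrate-and-fire update with hard reset, matching the three equations above; the reset-to-zero behavior follows the embodiment's "hard reset" description, while the values of tau and v_th are assumptions:

```python
import numpy as np

def lif_step(v_prev, i_t, tau=2.0, v_th=1.0, v_reset=0.0):
    """One LIF neuron update: charge, fire, hard-reset."""
    u = v_prev + (i_t - (v_prev - v_reset)) / tau   # U[t]: potential before reset
    s = np.where(u >= v_th, 1.0, 0.0)               # S[t]: Heaviside spike output
    v = u * (1.0 - s) + v_reset * s                 # V[t]: back to v_reset after a spike
    return s, v
```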
S3022, extracting scale-invariant information from the spectrum signal by using the convolution layer in the pulse unit; for each frame the input is I and the convolution kernel is K, and the convolution operation is performed according to the following formula:
$S(u, v) = \sum_{a=0}^{U-1} \sum_{b=0}^{V-1} I(u+a, \, v+b) \, K(a, b)$;
where S(u, v) represents the value in row u, column v of the convolution result matrix, U represents the height of the convolution kernel K, V represents the width of the convolution kernel K, I(u+a, v+b) represents the value in row u+a, column v+b of the input matrix I, and K(a, b) represents the value in row a, column b of the convolution kernel K;
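A direct (unoptimized) rendering of this formula, for illustration; valid-mode padding is an assumption:

```python
import numpy as np

def conv2d_valid(I: np.ndarray, K: np.ndarray) -> np.ndarray:
    """S(u, v) = sum_a sum_b I(u + a, v + b) * K(a, b)."""
    U, V = K.shape                        # kernel height and width
    H, W = I.shape
    S = np.zeros((H - U + 1, W - V + 1))
    for u in range(S.shape[0]):
        for v in range(S.shape[1]):
            S[u, v] = np.sum(I[u:u + U, v:v + V] * K)
    return S
```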
S3023, standardizing the output of the convolution layer in the pulse unit with a batch normalization layer to accelerate the training of the pulse neural network model and improve its stability, wherein the specific formulas are as follows:
$\mu_B = \frac{1}{Q} \sum_{i=1}^{Q} x_i, \qquad \sigma_B^2 = \frac{1}{Q} \sum_{i=1}^{Q} \left( x_i - \mu_B \right)^2$;
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$;
$y_i = \gamma \, \hat{x}_i + \beta$;
where $y_i$ represents the output after batch normalization, $\gamma$ represents a first learnable parameter used to scale the normalized value, $x_i$ represents the input feature value of the i-th sample in the batch, $\mu_B$ represents the feature mean of each batch, $\sigma_B^2$ represents the feature variance of each batch, $\epsilon$ is a small constant for numerical stability, $\beta$ represents a second learnable parameter used to shift the normalized value, and Q represents the number of samples in the batch;
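The formulas above in a few lines of NumPy (training-mode statistics only; the epsilon value is an assumption):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of Q samples (axis 0), then scale and shift."""
    mu_b = x.mean(axis=0)                       # batch mean
    var_b = x.var(axis=0)                       # batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)   # normalized value
    return gamma * x_hat + beta                 # learnable scale and shift
```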
S3024, after the pulse units extract the specific features of the input spectrum signal, the features are fused with the input spectrum signal through residual convolution, which improves the propagation efficiency of the information flow; the output expression of the pulse neural unit is:
Output = F(w) + Conv(w);
where Output represents the output of the pulse neural unit, w represents the input features, F represents the output of the two pulse units, and Conv represents the convolution operation;
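A hedged PyTorch sketch of one pulse neural unit in the Output = F(w) + Conv(w) form; the spike threshold, the straight-through surrogate gradient, and the 1×1 residual convolution are simplifying assumptions, not details fixed by the text:

```python
import torch
import torch.nn as nn

class PulseUnit(nn.Module):
    """Conv -> batch norm -> spiking activation (one pulse unit, simplified)."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        u = self.bn(self.conv(x))
        # Heaviside spike; (u - u.detach()) passes gradients straight through.
        return (u >= 1.0).float() + (u - u.detach())

class PulseNeuralUnit(nn.Module):
    """Output = F(w) + Conv(w): two pulse units plus a residual convolution."""
    def __init__(self, ch: int):
        super().__init__()
        self.f = nn.Sequential(PulseUnit(ch), PulseUnit(ch))
        self.res_conv = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, w):
        return self.f(w) + self.res_conv(w)
```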
S303, obtaining the scale-invariant features of the spectrum signal through the pulse neural units, and processing the feature map output by the pulse neural units with spatial average pooling to obtain feature vectors;
s304, sequentially inputting the feature vectors into a multi-layer perceptron MLP, a remodelling layer and a long-term memory network LSTM for processing, wherein the long-term memory network LSTM comprises an input gateForgetful doorAnd an output doorWherein sig represents an activation function sigmoid for mapping an input value between 0 and 1, Wi represents a weight matrix of an input gate, Wf represents a weight matrix of a forgetting gate, Wo represents a weight matrix of an output gate, ht-1 represents a hidden state at a previous time, xt represents an input vector at a current time, bi represents a bias vector of an input gate, bf represents a bias vector of a forgetting gate, bo represents a bias vector of an output gate, candidate states generated from the current input and the previous hidden stateWherein, tanh represents a hyperbolic tangent function for mapping an input value between-1 and 1, Wc represents a weight matrix of candidate states, bc represents a bias vector of the candidate states, and the cell state is obtained by combining a previous cell state ct-1 and the candidate state update through the regulation of a forgetting gate and an input gate;
S305, sequentially inputting the features output by the LSTM into a reshaping layer and a multi-layer perceptron MLP for processing. The hidden layer computes the input vector $x_1$ as $h_1 = \mathrm{sig}\left( W_1 x_1 + b_1 \right)$, where $h_1$ represents the feature vector output by the hidden layer, $W_1$ represents the weight matrix from the input layer to the hidden layer, and $b_1$ represents the bias vector of the hidden layer; the hidden feature vector output by the hidden layer is then sent to the output layer, which outputs the feature vector $y = \mathrm{sig}\left( W_2 h_1 + b_2 \right)$, where $W_2$ represents the weight matrix from the hidden layer to the output layer and $b_2$ represents the bias vector of the output layer.
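A sketch of the MLP → reshape → LSTM → reshape → MLP head of steps S304–S305 in PyTorch; the 512-dimensional input and the 10-class output come from the embodiment, while the hidden width and sequence length are assumptions:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """MLP -> reshape -> LSTM -> reshape -> MLP (steps S304-S305)."""
    def __init__(self, feat_dim=512, hidden=128, seq_len=8, n_classes=10):
        super().__init__()
        self.seq_len = seq_len
        self.mlp_in = nn.Linear(feat_dim, seq_len * hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mlp_out = nn.Sequential(
            nn.Linear(seq_len * hidden, hidden), nn.Sigmoid(),  # hidden layer h1
            nn.Linear(hidden, n_classes), nn.Sigmoid(),         # output layer y
        )

    def forward(self, x):                                  # x: (batch, 512)
        b = x.size(0)
        z = self.mlp_in(x).view(b, self.seq_len, -1)       # reshaping layer
        z, _ = self.lstm(z)                                # gated temporal modelling
        return self.mlp_out(z.reshape(b, -1))              # reshape + output MLP
```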
Further, the calculation formulas of the cross entropy loss and the contrast loss are as follows:
$L_{ce} = -\frac{1}{BS} \sum_{b=1}^{BS} \log \frac{\exp\left( f\left( x_b, y_b \right) \right)}{\sum_{d=1}^{D} \exp\left( f\left( x_{b,d} \right) \right)}$;
$L_{con} = -\frac{1}{\left| P_1 \right|} \sum_{x_1^+ \in P_1} \log \frac{\exp\left( \mathrm{sim}\left( x_1, x_1^+ \right) / \tau_c \right)}{\sum_{x_1^- \in N_{neg}} \exp\left( \mathrm{sim}\left( x_1, x_1^- \right) / \tau_c \right)}$;
where $L_{ce}$ denotes the cross entropy loss, BS denotes the size of the batch, $x_b$ denotes the b-th sample, $y_b$ denotes the label of the b-th sample, $f(x_b, y_b)$ denotes the output of the model on the real label $y_b$ of sample $x_b$, D denotes the total number of categories, d denotes the category index, $x_{b,d}$ denotes the input of the b-th sample on the d-th category, $f(x_{b,d})$ denotes the output value of the d-th category for the input data, $L_{con}$ represents the contrast loss, $P_1$ represents the set of positive samples in the batch, $x_1$ represents the current sample, $x_1^+$ represents a positive sample similar to the current sample, i.e. a sample belonging to the same class as the current sample, $\tau_c$ represents the temperature parameter, sim represents cosine similarity, $x_1^-$ represents a negative sample dissimilar to the current sample, i.e. a sample belonging to a different class from the current sample, and $N_{neg}$ represents the set of negative samples in the batch.
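A joint-loss sketch in PyTorch under the assumption that the contrast loss takes the usual supervised-contrastive (InfoNCE-style) form implied by the formula above; the weighting factor lam between the two losses is also an assumption:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, features, tau=0.1, lam=1.0):
    """Cross entropy plus a supervised contrastive loss over one batch."""
    l_ce = F.cross_entropy(logits, labels)
    z = F.normalize(features, dim=1)               # unit vectors for cosine similarity
    sim = z @ z.t() / tau                          # pairwise similarities / temperature
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos.fill_diagonal_(False)                      # positives: same class, not self
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    l_con = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return l_ce + lam * l_con.mean()
```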
Further, the number of nodes in the input layer of the encoder is determined by average pooling of the feature map output by the convolution layer of the pulse unit, and the self-encoder is trained by using the mean square error, wherein the specific formula is as follows:
$MSE = \frac{1}{S} \sum_{s=1}^{S} \left( y_{ae}^{(s)} - \hat{y}_{ae}^{(s)} \right)^2$;
where MSE represents the mean square error, S represents the number of samples, $y_{ae}$ represents the true target value, and $\hat{y}_{ae}$ represents the self-encoder prediction value.
The invention has the beneficial effects that:
When preprocessing data, the method differs from traditional methods that use a 2D-mel spectrogram: it combines dynamic and original static features into a three-dimensional log-mel feature so as to capture more detailed low-frequency speech signals and better understand the characteristics of sound events. The method combines pulse neurons and a residual neural network to construct the pulse neural network model, which can effectively process the temporal correlation characteristics in the audio while extracting feature information. When training the model, cross entropy loss is used for classification and is combined with contrast loss; the contrast loss reduces the feature distance between samples of the same category and enlarges the feature distance between samples of different categories, so that the pulse neural network model learns more compact features and the recognition accuracy for both unknown and known classes is improved. In distinguishing unknown classes from known classes, the method differs from the traditional approach of judging by the maximum probability value logic in the neural network output layer: instead, a threshold is set according to the reconstruction loss of the known classes through the self-encoder.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than as described herein, and therefore the present invention is not limited to the specific embodiments disclosed below.
The embodiment of the invention provides an unknown audio event recognition algorithm based on a pulse neural network, which comprises the following steps:
S1, constructing an audio data set, and splitting the audio data set into a training set, a verification set and a test set. In the embodiment of the present invention, the DCASE2019 Subtask C Open-set Acoustic Scene Classification dataset is selected as the audio dataset; it is described in the 2018 publication by Annamaria Mesaros, Toni Heittola and Tuomas Virtanen, "A multi-device dataset for urban acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018).
S2, preprocessing each section of audio data in the audio data set to generate a 3D log-mel spectrogram, wherein the specific steps are as follows:
S201, converting the audio data into normal distribution through z-standardization, so that the audio data with different characteristics have the same dimension and distribution, and the impulse neural network model is convenient to learn, and the specific formula is as follows:
$z = \frac{x - \mu}{\sigma}$;
where x represents the original audio data, $\mu$ represents the mean of the original audio data, $\sigma$ represents the standard deviation of the original audio data, and z represents the z-normalized audio data;
S202, generating a log-mel spectrogram corresponding to the z-normalized original audio data by using a Mel filter bank, and taking the log-mel spectral feature values in the log-mel spectrogram as the original static features of the audio data. In the embodiment of the invention, the number of Mel filters is set to 128, and a log-mel spectrogram of size 1 × 128 × 320 is generated. The specific steps are as follows:
S2021, dividing the original audio data after z-normalization into overlapped frames, wherein each frame comprises N3 samples, reducing spectrum leakage by using a Hamming window, and obtaining a windowed signal, wherein the specific formula is as follows:
$w[n] = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N_3 - 1} \right)$;
$x_w[n] = x[n] \, w[n]$;
where w[n] represents the Hamming window, $x_w[n]$ represents the signal obtained after windowing, x[n] represents the signal of each frame, and $0 \le n \le N_3 - 1$.
S2022, applying the discrete Fourier transform to the windowed signal of each frame, and calculating the power spectrum of each frame, wherein the specific formulas are as follows:
$X[k] = \sum_{n=0}^{N_3 - 1} x_w[n] \, e^{-j 2 \pi k n / N_3}$;
$P[k] = \frac{\left| X[k] \right|^2}{N_3}$;
where X[k] represents the frequency spectrum, P[k] represents the power spectrum of each frame, k represents the frequency index, and j represents the imaginary unit.
S2023, converting the frequency scale to the Mel scale, filtering the power spectrum of each frame with the Mel filter bank, and taking the logarithm of the resulting Mel spectral energies to obtain the log-Mel spectrogram, wherein the specific formulas are as follows:
$f_{mel} = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)$;
$H_m[k] = \begin{cases} 0, & k < f(m-1) \\ \frac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \frac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$;
$S_m = \sum_{k} H_m[k] \, P[k], \qquad \text{log-mel}_m = \log S_m$;
where $f_{mel}$ represents the Mel frequency corresponding to the linear frequency f, $H_m[k]$ denotes the filter response of the Mel filter bank, f(m) denotes the center frequency of the m-th Mel filter, and $S_m$ denotes the Mel spectral energy calculated by the Mel filter bank.
S203, calculating the difference between the previous frame and the next frame for each frame log-mel spectrum to obtain a corresponding first-order time derivative, namely a first-order Delta differential feature, and capturing the change of the feature along with time by using the first-order Delta differential feature;
The specific process for solving the first-order Delta differential characteristics is as follows:
Selecting a window size N1 for calculating the first-order Delta differential feature; for each time frame t, calculating the weighted differences between the log-mel spectral feature values of the n-th frames before and after t (n = 1, 2, …, N1), and normalizing the weighted sum with the normalization factor $2\sum_{n=1}^{N_1} n^2$, wherein the specific formula is as follows:
$d_t = \frac{\sum_{n=1}^{N_1} n \left( C_{t+n} - C_{t-n} \right)}{2 \sum_{n=1}^{N_1} n^2}$;
where $d_t$ represents the first-order Delta differential feature of frame t, $C_{t+n}$ represents the log-mel spectral feature value of the n-th frame after time frame t, $C_{t-n}$ represents the log-mel spectral feature value of the n-th frame before time frame t, and n = 1, 2, …, N1. In the embodiment of the invention, the value of N1 is 2.
S204, calculating the difference between the previous frame and the next frame for the first-order Delta differential feature to obtain a corresponding second-order Delta differential feature, capturing acceleration information of the feature changing along with time by using the second-order Delta differential feature, and describing the dynamic change characteristic of the signal.
The specific process for solving the second-order Delta differential characteristics is as follows:
Selecting a window size N2 for calculating the second-order Delta differential feature; for each frame, calculating the weighted differences between the first-order Delta differential features of the n-th frames before and after it (n = 1, 2, …, N2), and normalizing the weighted sum with the normalization factor $2\sum_{n=1}^{N_2} n^2$, wherein the specific formula is as follows:
$d''_t = \frac{\sum_{n=1}^{N_2} n \left( d_{t+n} - d_{t-n} \right)}{2 \sum_{n=1}^{N_2} n^2}$;
where $d''_t$ represents the second-order Delta differential feature of the t-th frame, $d_{t+n}$ represents the first-order Delta differential feature of frame t+n, $d_{t-n}$ represents the first-order Delta differential feature of frame t−n, and n = 1, 2, …, N2. In the embodiment of the invention, the value of N2 is 2.
S205, stacking the log-mel spectral feature values obtained in step S202, the first-order Delta differential features obtained in step S203 and the second-order Delta differential features obtained in step S204 along the feature dimension to obtain a 3D log-mel spectrum signal of size 3 × 1 × 128 × 320.
S3, constructing a pulse neural network model, and inputting the 3D log-mel spectrograms corresponding to the training set into the pulse neural network model for classification training. As shown in FIG. 1, the pulse neural network model comprises a convolution layer, a plurality of pulse neural units, a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM; each pulse neural unit comprises two pulse units and a residual convolution, and each pulse unit consists of a convolution layer, a batch normalization layer and at least two pulse neurons. Since the audio signal is a continuous time sequence, the pulse neural network can encode temporal information through the intervals and order of the pulses, adequately capturing timing characteristics in the audio such as tempo and pitch variations. The specific training process of the pulse neural network model is as follows:
S301, inputting the 3D log-mel spectrum signal to the convolution layer, and reshaping it according to batch size and channel number.
S302, processing the spectrum signal output by the convolution layer through the pulse neural unit, which provides preceding and succeeding temporal information through its pulse units, with the following specific steps:
S3021, processing the spectrum signal through the pulse neurons and capturing its temporal characteristics, the specific process being as follows:
$U[t] = V[t-1] + \frac{1}{\tau} \left( I[t] - \left( V[t-1] - V_{reset} \right) \right)$;
$S[t] = \Theta\left( U[t] - V_{th} \right)$;
$V[t] = U[t] \left( 1 - S[t] \right) + V_{reset} \, S[t]$;
where U[t] represents the membrane potential before reset at time t; S[t] represents the output spike at time t, equal to 1 when there is a spike and 0 otherwise; $\tau$ represents the time constant, which affects the decay rate of the membrane potential; V[t-1] represents the membrane potential after the triggered spike at time t−1; I[t] represents the input current at time t; $\Theta$ represents the Heaviside step function; $V_{th}$ represents the threshold of the membrane potential, such that when U[t] exceeds the threshold the neuron triggers a spike; V[t] represents the membrane potential after the triggered spike at time t; and $V_{reset}$ represents the value to which the membrane potential is reset after a spike. The pulse neuron is designed from the perspective of biological plausibility: because it transmits sparse spikes instead of continuous representations, it ensures low energy consumption and high robustness while capturing more temporal characteristics. In the embodiment of the present invention, the "hard reset" method is used to reset the membrane potential in V[t], ensuring that after a spike is triggered (S[t] = 1) the membrane potential V[t] returns to $V_{reset} = 0$. Compared with a traditional resnet-18-based classification network, the pulse neural network used in the embodiment of the invention effectively improves the recognition rate of unknown classes by 2%.
S3022, extracting scale-invariant information from the spectrum signal by using the convolution layer in the pulse unit; for each frame the input is I and the convolution kernel is K, and the convolution operation is performed according to the following formula:
$S(u, v) = \sum_{a=0}^{U-1} \sum_{b=0}^{V-1} I(u+a, \, v+b) \, K(a, b)$;
where S(u, v) represents the value in row u, column v of the convolution result matrix, U represents the height of the convolution kernel K, V represents the width of the convolution kernel K, I(u+a, v+b) represents the value in row u+a, column v+b of the input matrix I, and K(a, b) represents the value in row a, column b of the convolution kernel K.
S3023, standardizing the output of the convolution layer in the pulse unit with a batch normalization layer to accelerate the training of the pulse neural network model and improve its stability, wherein the specific formulas are as follows:
$\mu_B = \frac{1}{Q} \sum_{i=1}^{Q} x_i, \qquad \sigma_B^2 = \frac{1}{Q} \sum_{i=1}^{Q} \left( x_i - \mu_B \right)^2$;
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$;
$y_i = \gamma \, \hat{x}_i + \beta$;
where $y_i$ represents the output after batch normalization, $\gamma$ represents a first learnable parameter used to scale the normalized value, $x_i$ represents the input feature value of the i-th sample in the batch, $\mu_B$ represents the feature mean of each batch, $\sigma_B^2$ represents the feature variance of each batch, $\epsilon$ is a small constant for numerical stability, $\beta$ represents a second learnable parameter used to shift the normalized value, and Q represents the number of samples in the batch.
S3024, as shown in FIG. 1, in the embodiment of the present invention a pulse neural unit comprises two pulse units and a residual convolution. After the pulse units extract the specific features of the input spectrum signal, the features are fused with the input spectrum signal through residual convolution, which improves the propagation efficiency of the information flow; the output expression of the pulse neural unit is:
Output = F(w) + Conv(w);
where Output represents the output of the pulse neural unit, w represents the input features, F represents the output of the two pulse units, and Conv represents the convolution operation.
S303, obtaining the scale-invariant features of the spectrum signal through the pulse neural units, and processing the feature map output by the pulse neural units with spatial average pooling to obtain feature vectors of size 512.
S304, sequentially inputting the feature vectors into a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM for processing. The LSTM comprises an input gate $i_t = \mathrm{sig}\left( W_i \left[ h_{t-1}, x_t \right] + b_i \right)$, a forget gate $f_t = \mathrm{sig}\left( W_f \left[ h_{t-1}, x_t \right] + b_f \right)$ and an output gate $o_t = \mathrm{sig}\left( W_o \left[ h_{t-1}, x_t \right] + b_o \right)$, where sig represents the sigmoid activation function, which maps an input value to between 0 and 1, $W_i$ represents the weight matrix of the input gate, $W_f$ represents the weight matrix of the forget gate, $W_o$ represents the weight matrix of the output gate, $h_{t-1}$ represents the hidden state at the previous time, $x_t$ represents the input vector at the current time, $b_i$ represents the bias vector of the input gate, $b_f$ represents the bias vector of the forget gate, and $b_o$ represents the bias vector of the output gate. The candidate state generated from the current input and the previous hidden state is $\tilde{c}_t = \tanh\left( W_c \left[ h_{t-1}, x_t \right] + b_c \right)$, where tanh represents the hyperbolic tangent function, which maps an input value to between −1 and 1, $W_c$ represents the weight matrix of the candidate state, and $b_c$ represents the bias vector of the candidate state. The cell state is obtained by combining the previous cell state $c_{t-1}$ with the candidate state update under the regulation of the forget gate and the input gate: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.
S305, sequentially inputting the features output by the LSTM into a reshaping layer and a multi-layer perceptron MLP for processing. The hidden layer computes the input vector $x_1$ as $h_1 = \mathrm{sig}\left( W_1 x_1 + b_1 \right)$, where $h_1$ represents the feature vector output by the hidden layer, $W_1$ represents the weight matrix from the input layer to the hidden layer, and $b_1$ represents the bias vector of the hidden layer; the hidden feature vector output by the hidden layer is then sent to the output layer, which outputs the feature vector $y = \mathrm{sig}\left( W_2 h_1 + b_2 \right)$, where $W_2$ represents the weight matrix from the hidden layer to the output layer and $b_2$ represents the bias vector of the output layer. In the embodiment of the invention, sig is the Sigmoid activation function, and the output feature vector has 10 entries, representing the probabilities of the 10 categories.
S4, jointly training the impulse neural network model by using cross entropy loss and contrast loss, wherein the calculation formula of the cross entropy loss and the contrast loss is as follows:
$L_{ce} = -\frac{1}{BS} \sum_{b=1}^{BS} \log \frac{\exp\left( f\left( x_b, y_b \right) \right)}{\sum_{d=1}^{D} \exp\left( f\left( x_{b,d} \right) \right)}$;
$L_{con} = -\frac{1}{\left| P_1 \right|} \sum_{x_1^+ \in P_1} \log \frac{\exp\left( \mathrm{sim}\left( x_1, x_1^+ \right) / \tau_c \right)}{\sum_{x_1^- \in N_{neg}} \exp\left( \mathrm{sim}\left( x_1, x_1^- \right) / \tau_c \right)}$;
where $L_{ce}$ denotes the cross entropy loss, BS denotes the size of the batch, $x_b$ denotes the b-th sample, $y_b$ denotes the label of the b-th sample, $f(x_b, y_b)$ denotes the output of the model on the real label $y_b$ of sample $x_b$, D denotes the total number of categories, d denotes the category index, $x_{b,d}$ denotes the input of the b-th sample on the d-th category, $f(x_{b,d})$ denotes the output value of the d-th category for the input data, $L_{con}$ represents the contrast loss, $P_1$ represents the set of positive samples in the batch, $x_1$ represents the current sample, $x_1^+$ represents a positive sample similar to the current sample, i.e. a sample belonging to the same class as the current sample, $\tau_c$ represents the temperature parameter, sim represents cosine similarity, $x_1^-$ represents a negative sample dissimilar to the current sample, i.e. a sample belonging to a different class from the current sample, and $N_{neg}$ represents the set of negative samples in the batch.
S5, inputting the audio data of the known classes in the verification set into the pulse neural network and the self-encoder to obtain the average mean-square-error loss $L_{avg}$ inferred by the self-encoder, and setting the decision threshold $\theta$ according to $L_{avg}$. The self-encoder comprises an encoder and a decoder, and consists of an input layer, at least two hidden layers and an output layer.
In the embodiment of the invention, the number of nodes in the input layer of the encoder is determined by average pooling of the feature map output by the convolution layer of the pulse unit. The encoder and the decoder each have three hidden layers; experimental tests show that the preferred numbers of neurons for the input layer, hidden layers and output layer of the encoder are 256, [128, 64] and 8, and the neuron numbers of the decoder are consistent with those of the encoder. When training the self-encoder, the mean square error is used, with the specific formula as follows:
$MSE = \frac{1}{S} \sum_{s=1}^{S} \left( y_{ae}^{(s)} - \hat{y}_{ae}^{(s)} \right)^2$;
where MSE represents the mean square error, S represents the number of samples, $y_{ae}$ represents the true target value, and $\hat{y}_{ae}$ represents the self-encoder prediction value.
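A sketch of the self-encoder with the layer widths stated in the embodiment (encoder 256 → 128 → 64 → 8, decoder mirrored back to 256); the ReLU activations and the mirrored decoder ordering are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdAE(nn.Module):
    """Self-encoder used to derive the open-set decision threshold (step S5)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 8),
        )
        self.decoder = nn.Sequential(
            nn.Linear(8, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Threshold from the known-class verification set (hypothetical variable names):
# losses = [F.mse_loss(ae(v), v).item() for v in known_val_features]
# theta = sum(losses) / len(losses)    # L_avg used as the threshold
```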
And S6, identifying the acquired audio data by using the trained impulse neural network model, inputting the probability values output by the impulse neural network model into the self-encoder, and judging through the self-encoder whether the input data belongs to a known class or an unknown class: if the loss obtained by inference of the self-encoder is higher than the threshold $\theta$, the input is judged to be of an unknown class; otherwise, the specific known class to which the audio belongs is determined according to the probability values output by the impulse neural network model.
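The decision rule of step S6 as a short sketch; following the text, the probability vector produced by the trained network is fed to the self-encoder, and its reconstruction loss is compared with the threshold theta obtained in step S5 (the model objects snn and ae are hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_open_set(snn, ae, x, theta: float):
    """Return 'unknown' if the self-encoder loss exceeds theta,
    otherwise the index of the most probable known class."""
    probs = snn(x)                              # class probabilities from the network
    loss = F.mse_loss(ae(probs), probs).item()  # self-encoder reconstruction loss
    if loss > theta:
        return "unknown"
    return int(probs.argmax(dim=-1).item())
```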
When preprocessing data, the embodiment of the invention differs from traditional methods that use a 2D-mel spectrogram: it combines dynamic and original static features into a three-dimensional log-mel feature so as to capture more detailed low-frequency speech signals together with their temporal context and dynamic changes, which allows the characteristics of sound events to be better understood and the dependency between speech and high-frequency dynamic range information to be extracted. The pulse neural network model constructed from pulse neurons and a residual neural network can effectively process the temporal correlation characteristics in the audio while extracting feature information. Pulse neural network activity is typically sparse, meaning that at any point in time only a small number of neurons are active; this sparsity not only reduces the computational burden but also preserves the ability to extract important features. When training the model, cross entropy loss is used for classification and is combined with contrast loss; the contrast loss reduces the feature distance between samples of the same category and enlarges the feature distance between samples of different categories, so that the impulse neural network model learns more compact features and the recognition accuracy for both unknown and known classes is improved. In distinguishing unknown classes from known classes, the embodiment of the invention differs from the traditional method of judging by the maximum probability value logic in the neural network output layer: instead, the threshold is set according to the reconstruction loss of the known classes through the self-encoder, which avoids the performance loss that occurs when the model is poorly calibrated.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.