CN119252276B - Unknown audio event recognition algorithm based on impulse neural network - Google Patents

Unknown audio event recognition algorithm based on impulse neural network

Info

Publication number
CN119252276B
CN119252276B (application CN202411764748.9A)
Authority
CN
China
Prior art keywords
feature
input
output
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411764748.9A
Other languages
Chinese (zh)
Other versions
CN119252276A (en)
Inventor
游捷
蔡瑞泽
蔡体健
熊汉卿
阙越
谭林丰
刘文涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202411764748.9A (CN119252276B/en)
Publication of CN119252276A/en
Application granted
Publication of CN119252276B/en
Active legal status (current)
Anticipated expiration


Abstract

Translated from Chinese


The present application relates to an unknown audio event recognition algorithm based on a pulse neural network, which includes the following steps: constructing an audio dataset and splitting it into a training set, a validation set, and a test set; preprocessing each audio segment in the dataset to generate a 3D log-mel spectrogram; constructing a pulse neural network model and performing classification training; jointly training the pulse neural network model with cross-entropy loss and contrastive loss; inputting audio data of known categories from the validation set into the pulse neural network and an autoencoder to obtain a threshold for distinguishing known categories from unknown audio categories; and using the trained pulse neural network model to recognize the collected audio data. The invention can effectively identify and distinguish unknown sound events without relying on pre-labeled unknown-category information, improves the overall recognition accuracy of the system, and provides support for subsequent analysis and processing of unknown sound events.

Description

Unknown audio event recognition algorithm based on impulse neural network
Technical Field
The application relates to the technical field of audio data processing, and in particular to an unknown audio event recognition algorithm based on a pulse neural network (i.e., a spiking neural network, SNN).
Background
Sound event classification is an application of audio information retrieval that automatically identifies and annotates specific sound events in an audio signal, such as speech, music, animal sounds, etc. Audio information retrieval technology has made remarkable progress in the indexing and querying of audio data, and is widely applied in fields such as speech recognition, music retrieval, and sound event detection. Existing audio recognition methods typically rely on deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which can effectively identify and classify known classes of sound events after training on large amounts of labeled data. However, the effectiveness of these methods depends on the richness and diversity of the training data: when the algorithm encounters a sound event that does not occur in the training set, it is often incorrectly classified as the closest known class. Such misclassification not only affects recognition accuracy, but may also reduce the robustness of the system when processing audio events. For example, in practice, the system may misidentify an engine's abnormal sound as background noise, thereby ignoring a potential fault warning.
Disclosure of Invention
The invention aims to provide an unknown audio event recognition algorithm based on a pulse neural network which can effectively recognize and distinguish unknown sound events without depending on pre-labeled unknown-category information, improves the overall recognition accuracy of the system, and provides support for subsequent analysis and processing of unknown sound events.
The invention adopts the technical scheme that the unknown audio event recognition algorithm based on the impulse neural network comprises the following steps:
S1, constructing an audio data set, and splitting the audio data set into a training set, a verification set and a test set;
S2, preprocessing each section of audio data in the audio data set to generate a 3D log-mel spectrogram;
S3, constructing a pulse neural network model, and inputting the 3D log-mel spectrograms corresponding to the training set into the pulse neural network model for classification training, wherein the pulse neural network model comprises a convolution layer, a plurality of pulse neural units, a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM;
S4, jointly training the pulse neural network model by using cross-entropy loss and contrastive loss;
S5, inputting the audio data of the known classes in the verification set into the pulse neural network and an autoencoder to obtain the average mean-square-error loss $L_{mse}$ inferred by the autoencoder, and setting $L_{mse}$ as the threshold for distinguishing known classes from unknown audio classes, wherein the autoencoder comprises an encoder and a decoder, each of which comprises an input layer, at least two hidden layers and an output layer;
And S6, recognizing the acquired audio data by using the trained pulse neural network model, inputting the probability values output by the pulse neural network model into the autoencoder, and judging through the autoencoder whether the input data belongs to a known class or an unknown class: if the loss obtained by the autoencoder's inference is higher than the threshold, the input is judged to be of an unknown class; otherwise, the specific known class of the audio is judged according to the probability values output by the pulse neural network model.
Further, the specific steps of the step S2 are as follows:
S201, converting the audio data into a normal distribution through z-normalization, so that audio data with different characteristics have the same scale and distribution, which facilitates learning by the pulse neural network model. The specific formula is:

$z = \dfrac{x - \mu}{\sigma}$

where $x$ represents the original audio data, $\mu$ represents the mean of the original audio data, $\sigma$ represents the standard deviation of the original audio data, and $z$ represents the z-normalized audio data;
S202, generating a log-mel spectrogram corresponding to the z-normalized audio data by using a Mel filter bank, and taking the log-mel spectral feature values in the log-mel spectrogram as the original static features of the audio data;
S203, calculating, for each frame of the log-mel spectrum, the difference between preceding and following frames to obtain the corresponding first-order time derivative, namely the first-order Delta differential feature, which captures the change of the features over time;
The specific process for computing the first-order Delta differential features is as follows:
Select a window size $N_1$ for computing the first-order Delta differential features. For each time $t$, compute the weighted differences between the log-mel spectral feature values of the $1$ to $N_1$ frames before and after the current frame, and normalize the weighted-difference result with $2\sum_{n=1}^{N_1} n^2$ as the normalization factor. The specific formula is:

$d_t = \dfrac{\sum_{n=1}^{N_1} n\,(C_{t+n} - C_{t-n})}{2\sum_{n=1}^{N_1} n^2}$

where $d_t$ represents the first-order Delta differential feature, $C_{t+n}$ represents the log-mel spectral feature value $n$ frames after time frame $t$, $C_{t-n}$ represents the log-mel spectral feature value $n$ frames before time frame $t$, and $n = 1, 2, \ldots, N_1$;
S204, calculating, for the first-order Delta differential features, the difference between preceding and following frames to obtain the corresponding second-order Delta differential feature, which captures the acceleration of the features over time and describes the dynamic characteristics of the signal;
The specific process for computing the second-order Delta differential features is as follows:
Select a window size $N_2$ for computing the second-order Delta differential features. For each frame, compute the weighted differences between the first-order Delta differential features of the $1$ to $N_2$ frames before and after the current frame, and normalize the weighted-difference result with $2\sum_{n=1}^{N_2} n^2$ as the normalization factor. The specific formula is:

$d^{(2)}_t = \dfrac{\sum_{n=1}^{N_2} n\,(d_{t+n} - d_{t-n})}{2\sum_{n=1}^{N_2} n^2}$

where $d^{(2)}_t$ represents the second-order Delta differential feature of the $t$-th frame, $d_{t+n}$ represents the first-order Delta differential feature of the $(t+n)$-th frame, $d_{t-n}$ represents the first-order Delta differential feature of the $(t-n)$-th frame, and $n = 1, 2, \ldots, N_2$;
S205, stacking the log-mel spectral feature values obtained in step S202, the first-order Delta differential features obtained in step S203 and the second-order Delta differential features obtained in step S204 along the feature dimension to obtain a 3D log-mel spectrum signal.
Further, the specific steps of the step S202 are as follows:
S2021, dividing the z-normalized audio data into overlapping frames, each frame containing $N_3$ samples, and applying a Hamming window to reduce spectral leakage, obtaining the windowed signal. The specific formulas are:

$w[n] = 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{N_3 - 1}\right)$

$x_w[n] = x[n]\, w[n]$

where $w[n]$ represents the Hamming window, $x_w[n]$ represents the signal obtained after windowing, $x[n]$ represents the signal of each frame, and $0 \le n \le N_3 - 1$;
S2022, applying the discrete Fourier transform to the windowed signal of each frame, and calculating the power spectrum of each frame's discrete signal. The specific formulas are:

$X[k] = \sum_{n=0}^{N_3 - 1} x_w[n]\, e^{-j 2\pi k n / N_3}$

$P[k] = \dfrac{1}{N_3}\,\lvert X[k] \rvert^2$

where $X[k]$ represents the spectrum, $P[k]$ represents the power spectrum of each frame's discrete signal, $k$ represents the frequency index, and $j$ represents the imaginary unit;
S2023, converting the frequency scale of each frame's power spectrum to the Mel scale, and taking the logarithm of the Mel spectrum to obtain the log-Mel spectrogram. The specific formulas are:

$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \dfrac{f}{700}\right)$

$H_m[k] = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$

$S_m = \sum_{k} H_m[k]\, P[k]$

where $H_m[k]$ represents the filter response of the Mel filter bank, $f(m)$ represents the center frequency of the $m$-th Mel filter, and $S_m$ represents the Mel spectral energy computed through the Mel filter bank; the log-Mel spectrogram is obtained as $\log(S_m)$.
Further, the specific steps of the step S3 are as follows:
S301, inputting the 3D log-mel spectrum signal into a convolution layer, and reshaping it according to batch size and channel number;
S302, processing the spectrum signal output by the convolution layer through the pulse neural units, which provide preceding and following time-series information. The specific steps are as follows:
S3021, processing the spectrum signal through small pulse neurons to capture its temporal characteristics. The specific process is as follows:

$U[t] = V[t-1] + \dfrac{1}{\tau}\left(-(V[t-1] - V_{reset}) + I[t]\right)$

$S[t] = \Theta(U[t] - V_{th})$

$V[t] = U[t]\,(1 - S[t]) + V_{reset}\, S[t]$

where $U[t]$ represents the membrane potential before reset at time $t$; $S[t]$ represents the output spike at time $t$, equal to 1 when there is a spike and 0 otherwise; $\tau$ represents the time constant, which affects the decay rate of the membrane potential; $V[t-1]$ represents the membrane potential after a triggered spike at time $t-1$; $I[t]$ represents the input current at time $t$; $\Theta$ represents the Heaviside step function; $V_{th}$ represents the membrane potential threshold, and when $U[t]$ exceeds this value the neuron triggers a spike; $V[t]$ represents the membrane potential after a triggered spike at time $t$; and $V_{reset}$ represents the value to which the membrane potential is reset after a spike;
S3022, extracting scale-invariant information from the spectrum signal using the convolution layer in the pulse unit: for each frame input $I$, a convolution kernel of size $K$ is set and the convolution operation is performed. The specific formula is:

$S(u, v) = \sum_{a=0}^{U-1} \sum_{b=0}^{V-1} I(u+a,\, v+b)\, K(a, b)$

where $S(u, v)$ represents the value in the $u$-th row and $v$-th column of the convolution result matrix, $U$ represents the height of the kernel $K$, $V$ represents the width of the kernel $K$, $I(u+a, v+b)$ represents the value in the $(u+a)$-th row and $(v+b)$-th column of the input matrix $I$, and $K(a, b)$ represents the value in the $a$-th row and $b$-th column of the convolution kernel $K$;
S3023, normalizing the output of the convolution layer in the pulse unit with a batch normalization layer to accelerate training of the pulse neural network model and improve its stability. The specific formulas are:

$\mu_B = \dfrac{1}{Q} \sum_{i=1}^{Q} x_i$

$\sigma_B^2 = \dfrac{1}{Q} \sum_{i=1}^{Q} (x_i - \mu_B)^2$

$y_i = \gamma\, \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$

where $y_i$ represents the output after batch normalization; $\gamma$ represents the first learnable parameter, used to scale the normalized value; $x_i$ represents the input feature value of the $i$-th sample in the batch; $\mu_B$ represents the feature mean of each batch; $\sigma_B^2$ represents the feature variance of each batch; $\epsilon$ is a constant; $\beta$ represents the second learnable parameter, used to shift the normalized value; and $Q$ represents the number of samples in the batch;
S3024, after the pulse units extract the specific features in the input spectrum signal, fusing them with the input spectrum signal through a residual convolution, which improves the propagation efficiency of the information flow. The output expression of the pulse neural unit is:

Output = F(w) + Conv(w);

where Output represents the output of the pulse neural unit, w represents the input features, F represents the output of the two pulse units, and Conv represents the convolution operation;
S303, obtaining the scale-invariant features of the spectrum signal through the pulse neural units, and processing the feature map output by the pulse neural units with spatial average pooling to obtain a feature vector;
S304, sequentially inputting the feature vector into a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM for processing. The LSTM comprises an input gate $i_t = \mathrm{sig}(W_i [h_{t-1}, x_t] + b_i)$, a forget gate $f_t = \mathrm{sig}(W_f [h_{t-1}, x_t] + b_f)$ and an output gate $o_t = \mathrm{sig}(W_o [h_{t-1}, x_t] + b_o)$, where sig represents the sigmoid activation function, used to map an input value to between 0 and 1; $W_i$ represents the weight matrix of the input gate, $W_f$ the weight matrix of the forget gate, and $W_o$ the weight matrix of the output gate; $h_{t-1}$ represents the hidden state at the previous time; $x_t$ represents the input vector at the current time; and $b_i$, $b_f$ and $b_o$ represent the bias vectors of the input, forget and output gates. The candidate state generated from the current input and the previous hidden state is $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$, where tanh represents the hyperbolic tangent function, used to map an input value to between -1 and 1, $W_c$ represents the weight matrix of the candidate state and $b_c$ its bias vector. Through the regulation of the forget gate and the input gate, the cell state is obtained by combining the previous cell state $c_{t-1}$ with the candidate-state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$;
S305, sequentially inputting the features output by the LSTM into a reshaping layer and a multi-layer perceptron MLP for processing. The input vector $x_1$ is computed through the hidden layer as $h_1 = \mathrm{sig}(W_1 x_1 + b_1)$, where $h_1$ represents the feature vector output by the hidden layer, $W_1$ represents the input-to-hidden weight matrix and $b_1$ the hidden-layer bias vector. The hidden feature vector output by the hidden layer is then sent to the output layer, which outputs the feature vector $y = \mathrm{sig}(W_2 h_1 + b_2)$, where $W_2$ represents the hidden-to-output weight matrix and $b_2$ the output-layer bias vector.
Further, the calculation formulas of the cross-entropy loss and the contrastive loss are as follows:

$L_{ce} = -\dfrac{1}{BS} \sum_{b=1}^{BS} \log \dfrac{e^{f(x_b, y_b)}}{\sum_{d=1}^{D} e^{f(x_{b,d})}}$

$L_{con} = -\dfrac{1}{\lvert P_1 \rvert} \sum_{x_1^{+} \in P_1} \log \dfrac{e^{\mathrm{sim}(x_1, x_1^{+})/\tau}}{\sum_{x_1^{-} \in N_{neg}} e^{\mathrm{sim}(x_1, x_1^{-})/\tau}}$

where $L_{ce}$ denotes the cross-entropy loss; $BS$ denotes the batch size; $x_b$ denotes the $b$-th sample; $y_b$ denotes the label of the $b$-th sample; $f(x_b, y_b)$ denotes the output of the model on the true label $y_b$ of sample $x_b$; $D$ denotes the total number of categories; $d$ denotes the category index; $x_{b,d}$ denotes the input of the $b$-th sample on the $d$-th category; $f(x_{b,d})$ denotes the output value of the $d$-th category for the input data; $L_{con}$ denotes the contrastive loss; $P_1$ denotes the set of positive samples in the batch; $x_1$ denotes the current sample; $x_1^{+}$ denotes a positive sample similar to the current sample, i.e. a sample belonging to the same category as the current sample; $\tau$ denotes the temperature parameter; sim denotes cosine similarity; $x_1^{-}$ denotes a negative sample dissimilar to the current sample, i.e. a sample belonging to a different category from the current sample; and $N_{neg}$ denotes the set of negative samples in the batch.
Further, the number of nodes in the encoder's input layer is obtained by average-pooling the feature map output by the convolution layer of the pulse unit, and the autoencoder is trained using the mean squared error. The specific formula is:

$\mathrm{MSE} = \dfrac{1}{S} \sum_{i=1}^{S} \left(y_{ae,i} - \hat{y}_{ae,i}\right)^2$

where MSE represents the mean squared error, $S$ represents the number of samples, $y_{ae}$ represents the true value, and $\hat{y}_{ae}$ represents the autoencoder's predicted value.
The invention has the following beneficial effects:
When preprocessing the data, the method differs from traditional approaches that use a 2D mel spectrogram: it combines dynamic and original static features into a three-dimensional log-mel feature so as to capture more detailed low-frequency speech information and better understand the characteristics of sound events. The method combines pulse neurons with a residual neural network to construct the pulse neural network model, which can effectively process the temporal correlations in audio while extracting feature information. When training the model, cross-entropy loss is used for classification in combination with a contrastive loss; the contrastive loss reduces the feature distance between samples of the same category and enlarges the feature distance between samples of different categories, so that the pulse neural network model learns more compact features and the recognition accuracy for both unknown and known categories is improved. Moreover, in distinguishing unknown from known categories, the method differs from traditional approaches that judge the distribution by the maximum-probability logits of the neural network's output layer.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a pulse neural network model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a self-encoder according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than as described herein, and therefore the present invention is not limited to the specific embodiments disclosed below.
The embodiment of the invention provides an unknown audio event recognition algorithm based on a pulse neural network, which comprises the following steps:
S1, constructing an audio data set, and splitting the audio data set into a training set, a verification set and a test set. In the embodiment of the invention, the DCASE2019 Subtask C Open-set Acoustic Scene Classification dataset is selected as the audio dataset; it derives from Annamaria Mesaros, Toni Heittola and Tuomas Virtanen, "A multi-device dataset for urban acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018).
S2, preprocessing each section of audio data in the audio data set to generate a 3D log-mel spectrogram, wherein the specific steps are as follows:
S201, converting the audio data into a normal distribution through z-normalization, so that audio data with different characteristics have the same scale and distribution, which facilitates learning by the pulse neural network model. The specific formula is:

$z = \dfrac{x - \mu}{\sigma}$

where $x$ represents the original audio data, $\mu$ represents the mean of the original audio data, $\sigma$ represents the standard deviation of the original audio data, and $z$ represents the z-normalized audio data;
S202, generating a log-mel spectrogram corresponding to the z-normalized audio data by using a Mel filter bank, and taking the log-mel spectral feature values in the log-mel spectrogram as the original static features of the audio data. In the embodiment of the invention, the number of Mel filters is set to 128, generating a log-mel spectrogram of size 1 × 128 × 320. The specific steps are as follows:
S2021, dividing the z-normalized audio data into overlapping frames, each frame containing $N_3$ samples, and applying a Hamming window to reduce spectral leakage, obtaining the windowed signal. The specific formulas are:

$w[n] = 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{N_3 - 1}\right)$

$x_w[n] = x[n]\, w[n]$

where $w[n]$ represents the Hamming window, $x_w[n]$ represents the signal obtained after windowing, $x[n]$ represents the signal of each frame, and $0 \le n \le N_3 - 1$.
S2022, applying the discrete Fourier transform to the windowed signal of each frame, and calculating the power spectrum of each frame's discrete signal. The specific formulas are:

$X[k] = \sum_{n=0}^{N_3 - 1} x_w[n]\, e^{-j 2\pi k n / N_3}$

$P[k] = \dfrac{1}{N_3}\,\lvert X[k] \rvert^2$

where $X[k]$ represents the spectrum, $P[k]$ represents the power spectrum of each frame's discrete signal, $k$ represents the frequency index, and $j$ represents the imaginary unit.
S2023, converting the frequency scale of each frame's power spectrum to the Mel scale, and taking the logarithm of the Mel spectrum to obtain the log-Mel spectrogram. The specific formulas are:

$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \dfrac{f}{700}\right)$

$H_m[k] = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$

$S_m = \sum_{k} H_m[k]\, P[k]$

where $H_m[k]$ represents the filter response of the Mel filter bank, $f(m)$ represents the center frequency of the $m$-th Mel filter, and $S_m$ represents the Mel spectral energy computed through the Mel filter bank; the log-Mel spectrogram is obtained as $\log(S_m)$.
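To make steps S2021-S2023 concrete, below is a minimal Python sketch of this log-mel front end. It assumes librosa is available; the sample rate, FFT length and hop size are illustrative assumptions, while the 128 mel bands follow the embodiment.

```python
import numpy as np
import librosa

def log_mel_spectrogram(x, sr=44100, n_fft=2048, hop=1024, n_mels=128):
    """Log-mel front end sketched from steps S201 and S2021-S2023."""
    # S201: z-normalize the raw waveform
    z = (x - x.mean()) / (x.std() + 1e-8)
    # S2021-S2022: Hamming-windowed framing, DFT, and per-frame power spectrum
    stft = librosa.stft(z, n_fft=n_fft, hop_length=hop, window="hamming")
    power = (np.abs(stft) ** 2) / n_fft
    # S2023: triangular mel filter bank H_m[k], then log compression
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = mel_fb @ power          # S_m = sum_k H_m[k] * P[k]
    return np.log(mel_energy + 1e-10)    # shape: (n_mels, num_frames)
```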
S203, calculating, for each frame of the log-mel spectrum, the difference between preceding and following frames to obtain the corresponding first-order time derivative, namely the first-order Delta differential feature, which captures the change of the features over time;
The specific process for computing the first-order Delta differential features is as follows:
Select a window size $N_1$ for computing the first-order Delta differential features. For each time $t$, compute the weighted differences between the log-mel spectral feature values of the $1$ to $N_1$ frames before and after the current frame, and normalize the weighted-difference result with $2\sum_{n=1}^{N_1} n^2$ as the normalization factor. The specific formula is:

$d_t = \dfrac{\sum_{n=1}^{N_1} n\,(C_{t+n} - C_{t-n})}{2\sum_{n=1}^{N_1} n^2}$

where $d_t$ represents the first-order Delta differential feature, $C_{t+n}$ represents the log-mel spectral feature value $n$ frames after time frame $t$, $C_{t-n}$ represents the log-mel spectral feature value $n$ frames before time frame $t$, and $n = 1, 2, \ldots, N_1$. In the embodiment of the invention, the value of $N_1$ is 2.
S204, calculating, for the first-order Delta differential features, the difference between preceding and following frames to obtain the corresponding second-order Delta differential feature, which captures the acceleration of the features over time and describes the dynamic characteristics of the signal.
The specific process for computing the second-order Delta differential features is as follows:
Select a window size $N_2$ for computing the second-order Delta differential features. For each frame, compute the weighted differences between the first-order Delta differential features of the $1$ to $N_2$ frames before and after the current frame, and normalize the weighted-difference result with $2\sum_{n=1}^{N_2} n^2$ as the normalization factor. The specific formula is:

$d^{(2)}_t = \dfrac{\sum_{n=1}^{N_2} n\,(d_{t+n} - d_{t-n})}{2\sum_{n=1}^{N_2} n^2}$

where $d^{(2)}_t$ represents the second-order Delta differential feature of the $t$-th frame, $d_{t+n}$ represents the first-order Delta differential feature of the $(t+n)$-th frame, $d_{t-n}$ represents the first-order Delta differential feature of the $(t-n)$-th frame, and $n = 1, 2, \ldots, N_2$. In the embodiment of the invention, the value of $N_2$ is 2.
S205, stacking the log-mel spectral feature values obtained in step S202, the first-order Delta differential features obtained in step S203 and the second-order Delta differential features obtained in step S204 along the feature dimension to obtain a 3D log-mel spectrum signal of size 3 × 1 × 128 × 320.
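A companion sketch for steps S203-S205, again an illustration under assumptions rather than the patented code: librosa's delta filter with width 5 matches the embodiment's window $N_1 = N_2 = 2$, and the three channels are stacked along a new leading dimension.

```python
import numpy as np
import librosa

def three_d_log_mel(log_mel):
    """Stack static, first-order and second-order delta channels (S203-S205)."""
    delta1 = librosa.feature.delta(log_mel, width=5, order=1)  # N1 = 2 -> width 2*2+1
    delta2 = librosa.feature.delta(log_mel, width=5, order=2)  # N2 = 2
    return np.stack([log_mel, delta1, delta2], axis=0)         # (3, n_mels, frames)
```

For the embodiment's 1 × 128 × 320 log-mel input, this yields the 3 × 1 × 128 × 320 tensor of step S205 once a unit batch axis is added.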
S3, constructing a pulse neural network model, and inputting the 3D log-mel spectrograms corresponding to the training set into the pulse neural network model for classification training. As shown in FIG. 1, the pulse neural network model comprises a convolution layer, a plurality of pulse neural units, a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM; each pulse neural unit comprises two pulse units and a residual convolution, and each pulse unit consists of a convolution layer, a batch normalization layer and at least two pulse neurons. Since an audio signal is a continuous time sequence, the pulse neural network can encode temporal information through the intervals and order of the pulses, adequately capturing timing characteristics in the audio such as tempo and pitch variations. The specific training process of the pulse neural network model is as follows:
S301, inputting the 3D log-mel spectrum signal into the convolution layer, and reshaping it according to batch size and channel number.
S302, processing the spectrum signal output by the convolution layer through the pulse neural units, which provide preceding and following time-series information. The specific steps are as follows:
S3021, processing the spectrum signal through small pulse neurons to capture its temporal characteristics. The specific process is as follows:

$U[t] = V[t-1] + \dfrac{1}{\tau}\left(-(V[t-1] - V_{reset}) + I[t]\right)$

$S[t] = \Theta(U[t] - V_{th})$

$V[t] = U[t]\,(1 - S[t]) + V_{reset}\, S[t]$

where $U[t]$ represents the membrane potential before reset at time $t$; $S[t]$ represents the output spike at time $t$, equal to 1 when there is a spike and 0 otherwise; $\tau$ represents the time constant, which affects the decay rate of the membrane potential; $V[t-1]$ represents the membrane potential after a triggered spike at time $t-1$; $I[t]$ represents the input current at time $t$; $\Theta$ represents the Heaviside step function; $V_{th}$ represents the membrane potential threshold, and when $U[t]$ exceeds this value the neuron triggers a spike; $V[t]$ represents the membrane potential after a triggered spike at time $t$; and $V_{reset}$ represents the value to which the membrane potential is reset after a spike. The pulse neuron is designed from the perspective of biological plausibility: because it transmits sparse spikes instead of continuous representations, it ensures low energy consumption and high robustness while capturing more temporal features. In the embodiment of the present invention, the "hard reset" method is used to reset the membrane potential in $V[t]$, ensuring that after a spike is triggered ($S[t] = 1$) the value of the membrane potential $V[t]$ returns to $V_{reset} = 0$. Compared with a traditional resnet-18-based classification network, the pulse neural network used in the embodiment of the invention improves the recognition rate of unknown categories by 2%.
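The LIF dynamics of S3021 with a hard reset can be sketched as follows; $V_{reset} = 0$ follows the text above, while the tau and threshold values here are assumptions rather than the patented implementation.

```python
import numpy as np

def lif_step(v_prev, i_t, tau=2.0, v_th=1.0, v_reset=0.0):
    """One discrete LIF update with hard reset (step S3021)."""
    # U[t] = V[t-1] + (1/tau) * (-(V[t-1] - V_reset) + I[t])
    u_t = v_prev + (1.0 / tau) * (-(v_prev - v_reset) + i_t)
    # S[t] = Heaviside(U[t] - V_th): emit a spike where the threshold is crossed
    s_t = np.heaviside(u_t - v_th, 0.0)
    # Hard reset: V[t] returns to V_reset wherever a spike fired
    v_t = u_t * (1.0 - s_t) + v_reset * s_t
    return s_t, v_t
```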
S3022, extracting scale-invariant information from the spectrum signal using the convolution layer in the pulse unit: for each frame input $I$, a convolution kernel of size $K$ is set and the convolution operation is performed. The specific formula is:

$S(u, v) = \sum_{a=0}^{U-1} \sum_{b=0}^{V-1} I(u+a,\, v+b)\, K(a, b)$

where $S(u, v)$ represents the value in the $u$-th row and $v$-th column of the convolution result matrix, $U$ represents the height of the kernel $K$, $V$ represents the width of the kernel $K$, $I(u+a, v+b)$ represents the value in the $(u+a)$-th row and $(v+b)$-th column of the input matrix $I$, and $K(a, b)$ represents the value in the $a$-th row and $b$-th column of the convolution kernel $K$.
S3023, normalizing the output of the convolution layer in the pulse unit with a batch normalization layer to accelerate training of the pulse neural network model and improve its stability. The specific formulas are:

$\mu_B = \dfrac{1}{Q} \sum_{i=1}^{Q} x_i$

$\sigma_B^2 = \dfrac{1}{Q} \sum_{i=1}^{Q} (x_i - \mu_B)^2$

$y_i = \gamma\, \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$

where $y_i$ represents the output after batch normalization; $\gamma$ represents the first learnable parameter, used to scale the normalized value; $x_i$ represents the input feature value of the $i$-th sample in the batch; $\mu_B$ represents the feature mean of each batch; $\sigma_B^2$ represents the feature variance of each batch; $\epsilon$ is a constant; $\beta$ represents the second learnable parameter, used to shift the normalized value; and $Q$ represents the number of samples in the batch.
S3024, as shown in FIG. 1, in the embodiment of the present invention a pulse neural unit comprises two pulse units and a residual convolution. After the pulse units extract the specific features in the input spectrum signal, these are fused with the input spectrum signal through the residual convolution, improving the propagation efficiency of the information flow. The output expression of the pulse neural unit is:

Output = F(w) + Conv(w);

where Output represents the output of the pulse neural unit, w represents the input features, F represents the output of the two pulse units, and Conv represents the convolution operation.
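The structure of one pulse neural unit can be sketched in PyTorch as below. The two conv + batch-norm + spiking-activation "pulse units" and the residual path realize Output = F(w) + Conv(w); the 3 × 3 and 1 × 1 kernel sizes and channel widths are assumptions, and spike_fn stands in for a LIF layer such as the one sketched above.

```python
import torch
import torch.nn as nn

class PulseUnit(nn.Module):
    """Conv -> batch norm -> spiking activation (steps S3021-S3023)."""
    def __init__(self, in_ch, out_ch, spike_fn):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.spike = spike_fn

    def forward(self, w):
        return self.spike(self.bn(self.conv(w)))

class SpikingNeuralUnit(nn.Module):
    """Two pulse units plus a residual convolution (step S3024)."""
    def __init__(self, in_ch, out_ch, spike_fn):
        super().__init__()
        self.f = nn.Sequential(PulseUnit(in_ch, out_ch, spike_fn),
                               PulseUnit(out_ch, out_ch, spike_fn))
        self.residual = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # Conv(w)

    def forward(self, w):
        return self.f(w) + self.residual(w)  # Output = F(w) + Conv(w)
```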
S303, obtaining the scale-invariant features of the spectrum signal through the pulse neural units, and processing the feature map output by the pulse neural units with spatial average pooling to obtain a feature vector of size 512.
S304, sequentially inputting the feature vector into a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM for processing. The LSTM comprises an input gate $i_t = \mathrm{sig}(W_i [h_{t-1}, x_t] + b_i)$, a forget gate $f_t = \mathrm{sig}(W_f [h_{t-1}, x_t] + b_f)$ and an output gate $o_t = \mathrm{sig}(W_o [h_{t-1}, x_t] + b_o)$, where sig represents the sigmoid activation function, used to map an input value to between 0 and 1; $W_i$ represents the weight matrix of the input gate, $W_f$ the weight matrix of the forget gate, and $W_o$ the weight matrix of the output gate; $h_{t-1}$ represents the hidden state at the previous time; $x_t$ represents the input vector at the current time; and $b_i$, $b_f$ and $b_o$ represent the bias vectors of the input, forget and output gates. The candidate state generated from the current input and the previous hidden state is $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$, where tanh represents the hyperbolic tangent function, used to map an input value to between -1 and 1, $W_c$ represents the weight matrix of the candidate state and $b_c$ its bias vector. Through the regulation of the forget gate and the input gate, the cell state is obtained by combining the previous cell state $c_{t-1}$ with the candidate-state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.
S305, sequentially inputting the features output by the LSTM into a reshaping layer and a multi-layer perceptron MLP for processing. The input vector $x_1$ is computed through the hidden layer as $h_1 = \mathrm{sig}(W_1 x_1 + b_1)$, where $h_1$ represents the feature vector output by the hidden layer, $W_1$ represents the input-to-hidden weight matrix and $b_1$ the hidden-layer bias vector. The hidden feature vector output by the hidden layer is then sent to the output layer, which outputs the feature vector $y = \mathrm{sig}(W_2 h_1 + b_2)$, where $W_2$ represents the hidden-to-output weight matrix and $b_2$ the output-layer bias vector. In the embodiment of the invention, sig is the sigmoid activation function, and the number of output feature values is 10, representing the probabilities of 10 categories.
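A sketch of the S304-S305 head under stated assumptions: the pooled 512-dimensional feature passes through an MLP, is reshaped to a length-one sequence for the LSTM, and a final sigmoid MLP produces the 10 class scores. The hidden width of 128 is an assumption; 512 and 10 follow the embodiment.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """MLP -> reshape -> LSTM -> reshape -> MLP (steps S304-S305)."""
    def __init__(self, feat_dim=512, hidden=128, n_classes=10):
        super().__init__()
        self.mlp_in = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, feats):                 # feats: (batch, 512)
        h = self.mlp_in(feats).unsqueeze(1)   # reshape to (batch, seq=1, hidden)
        out, _ = self.lstm(h)
        return self.head(out.squeeze(1))      # scores for the 10 known classes
```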
S4, jointly training the pulse neural network model by using cross-entropy loss and contrastive loss, wherein the calculation formulas of the cross-entropy loss and the contrastive loss are as follows:

$L_{ce} = -\dfrac{1}{BS} \sum_{b=1}^{BS} \log \dfrac{e^{f(x_b, y_b)}}{\sum_{d=1}^{D} e^{f(x_{b,d})}}$

$L_{con} = -\dfrac{1}{\lvert P_1 \rvert} \sum_{x_1^{+} \in P_1} \log \dfrac{e^{\mathrm{sim}(x_1, x_1^{+})/\tau}}{\sum_{x_1^{-} \in N_{neg}} e^{\mathrm{sim}(x_1, x_1^{-})/\tau}}$

where $L_{ce}$ denotes the cross-entropy loss; $BS$ denotes the batch size; $x_b$ denotes the $b$-th sample; $y_b$ denotes the label of the $b$-th sample; $f(x_b, y_b)$ denotes the output of the model on the true label $y_b$ of sample $x_b$; $D$ denotes the total number of categories; $d$ denotes the category index; $x_{b,d}$ denotes the input of the $b$-th sample on the $d$-th category; $f(x_{b,d})$ denotes the output value of the $d$-th category for the input data; $L_{con}$ denotes the contrastive loss; $P_1$ denotes the set of positive samples in the batch; $x_1$ denotes the current sample; $x_1^{+}$ denotes a positive sample similar to the current sample, i.e. a sample belonging to the same category as the current sample; $\tau$ denotes the temperature parameter; sim denotes cosine similarity; $x_1^{-}$ denotes a negative sample dissimilar to the current sample, i.e. a sample belonging to a different category from the current sample; and $N_{neg}$ denotes the set of negative samples in the batch.
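The joint objective of S4 can be sketched as follows: standard cross-entropy plus a supervised contrastive (InfoNCE-style) term with temperature tau over L2-normalized embeddings. The equal weighting lam = 1 is an assumption; the patent does not state how the two terms are weighted.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, embeddings, labels, tau=0.1, lam=1.0):
    """Cross-entropy plus supervised contrastive loss (step S4)."""
    ce = F.cross_entropy(logits, labels)                    # L_ce
    z = F.normalize(embeddings, dim=1)                      # prep for cosine similarity
    sim = z @ z.t() / tau                                   # pairwise sim / tau
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                                   # drop self-pairs
    off_diag = torch.ones_like(sim).fill_diagonal_(0)
    # log-softmax over every other sample in the batch
    log_prob = sim - torch.log((off_diag * sim.exp()).sum(1, keepdim=True))
    n_pos = pos.sum(1)
    valid = n_pos > 0                                       # anchors with >= 1 positive
    contrastive = -(pos * log_prob).sum(1)[valid] / n_pos[valid]
    return ce + lam * contrastive.mean()
```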
S5, inputting the audio data of the known classes in the verification set into the pulse neural network and an autoencoder to obtain the average mean-square-error loss $L_{mse}$ inferred by the autoencoder, and setting $L_{mse}$ as the threshold for distinguishing known classes from unknown audio classes. The autoencoder includes an encoder and a decoder, each comprising an input layer, at least two hidden layers and an output layer.
In the embodiment of the invention, the number of nodes in the encoder's input layer is obtained by average-pooling the feature map output by the convolution layer of the pulse unit. The encoder and the decoder each have three hidden layers; experimental tests show that the neuron counts of the encoder's input layer, hidden layers and output layer are preferably 256, [128, 64] and 8, and the decoder mirrors the encoder, i.e. its layer widths are likewise 256, [128, 64] and 8. When training the autoencoder, the mean squared error is used. The specific formula is:

$\mathrm{MSE} = \dfrac{1}{S} \sum_{i=1}^{S} \left(y_{ae,i} - \hat{y}_{ae,i}\right)^2$

where MSE represents the mean squared error, $S$ represents the number of samples, $y_{ae}$ represents the true value, and $\hat{y}_{ae}$ represents the autoencoder's predicted value.
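A sketch of the S5 autoencoder and threshold computation. The layer widths 256, [128, 64] and 8 follow the embodiment's description and the decoder mirrors the encoder; the activation choice and training details are assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Mirrored fully connected autoencoder (step S5): 256-128-64-8-64-128-256."""
    def __init__(self, dims=(256, 128, 64, 8)):
        super().__init__()
        enc = []
        for a, b in zip(dims[:-1], dims[1:]):        # 256 -> 128 -> 64 -> 8
            enc += [nn.Linear(a, b), nn.ReLU()]
        dec = []
        rev = dims[::-1]
        for a, b in zip(rev[:-1], rev[1:]):          # 8 -> 64 -> 128 -> 256
            dec += [nn.Linear(a, b), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def mse_threshold(ae, known_val_inputs):
    """Average reconstruction MSE over known-class validation data -> L_mse."""
    recon = ae(known_val_inputs)
    return ((known_val_inputs - recon) ** 2).mean().item()
```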
And S6, recognizing the acquired audio data by using the trained pulse neural network model, inputting the probability values output by the pulse neural network model into the autoencoder, and judging through the autoencoder whether the input data belongs to a known class or an unknown class: if the loss obtained by the autoencoder's inference is higher than the threshold, the input is judged to be of an unknown class; otherwise, the specific known class of the audio is judged according to the probability values output by the pulse neural network model.
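The S6 decision rule then reduces to a threshold test on the autoencoder's reconstruction loss. Note the patent sizes the encoder input from pooled feature maps while S6 feeds the autoencoder the probability vector; this sketch follows the S6 description, and all names are hypothetical.

```python
import torch

@torch.no_grad()
def recognize(snn_model, autoencoder, x, threshold):
    """Known/unknown decision of step S6."""
    probs = torch.softmax(snn_model(x), dim=-1)              # SNN class probabilities
    loss = ((probs - autoencoder(probs)) ** 2).mean().item() # reconstruction MSE
    if loss > threshold:                                     # above L_mse -> unknown
        return "unknown"
    return int(probs.argmax())                               # index of the known class
```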
When preprocessing data, the embodiment of the invention differs from traditional methods that use a 2D mel spectrogram: it combines dynamic and original static features into a three-dimensional log-mel feature so as to capture more detailed low-frequency speech signals together with their temporal context and dynamic changes, better understand the characteristics of sound events, and extract the dependency between speech and high-frequency dynamic range information. The pulse neural network model is constructed from pulse neurons and a residual neural network, and can effectively process the temporal correlations in audio while extracting feature information. Pulse neural network activity is typically sparse, meaning that at any point in time only a small number of neurons are active; this sparsity not only reduces the computational burden but also preserves the ability to extract important features. When training the model, cross-entropy loss is used for classification in combination with a contrastive loss, which reduces the feature distance between samples of the same category and enlarges the feature distance between samples of different categories, so that the pulse neural network model learns more compact features and the recognition accuracy for both unknown and known categories improves. In distinguishing unknown from known categories, the embodiment of the invention differs from traditional methods that judge the distribution by the maximum-probability logits of the neural network's output layer: instead, the threshold is set from the reconstruction loss of known categories via the autoencoder, avoiding the performance loss incurred when the model is poorly calibrated.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

Translated from Chinese

1. An unknown audio event recognition algorithm based on a pulse neural network, characterized by comprising the following steps:

S1: constructing an audio dataset, and splitting the audio dataset into a training set, a validation set and a test set;

S2: preprocessing each segment of audio data in the audio dataset to generate a 3D log-mel spectrogram;

S3: constructing a pulse neural network model, and inputting the 3D log-mel spectrograms corresponding to the training set into the pulse neural network model for classification training; the pulse neural network model comprises a convolution layer, a plurality of pulse neural units, a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM; each pulse neural unit comprises two pulse units and a residual convolution, and each pulse unit is composed of a convolution layer, a batch normalization layer and at least two small pulse neurons;

S4: jointly training the pulse neural network model using cross-entropy loss and contrastive loss;

S5: inputting audio data of known classes from the validation set into the pulse neural network and an autoencoder to obtain the average mean-square-error loss $L_{mse}$ inferred by the autoencoder, and setting $L_{mse}$ as the threshold for distinguishing known classes from unknown audio classes; the autoencoder comprises an encoder and a decoder, the encoder comprises an input layer, at least two hidden layers and an output layer, and the decoder comprises an input layer, at least two hidden layers and an output layer;

S6: recognizing the collected audio data with the trained pulse neural network model, inputting the probability values output by the pulse neural network model into the autoencoder, and determining through the autoencoder whether the input data belongs to a known class or an unknown class; if the loss obtained by the autoencoder's inference is higher than the threshold, the input is judged to be of an unknown class; otherwise, the specific known class of the audio is determined from the probability values output by the pulse neural network model.

2. The unknown audio event recognition algorithm based on a pulse neural network according to claim 1, characterized in that the specific steps of step S2 are:

S201: converting the audio data into a normal distribution through z-normalization, so that audio data with different characteristics have the same scale and distribution, facilitating learning by the pulse neural network model, with the specific formula:

$z = \dfrac{x - \mu}{\sigma}$

where $x$ represents the original audio data, $\mu$ represents the mean of the original audio data, $\sigma$ represents the standard deviation of the original audio data, and $z$ represents the z-normalized audio data;

S202: using a Mel filter bank to generate a log-mel spectrogram corresponding to the z-normalized audio data, and taking the log-mel spectral feature values in the log-mel spectrogram as the original static features of the audio data;

S203: for each frame of the log-mel spectrum, calculating the difference between preceding and following frames to obtain the corresponding first-order time derivative, i.e. the first-order Delta differential feature, which captures the change of the features over time;

the specific process of computing the first-order Delta differential features being: selecting a window size $N_1$; for each time $t$, computing the weighted differences between the log-mel spectral feature values of the 1 to $N_1$ frames before and after the current frame, and normalizing the weighted-difference result with $2\sum_{n=1}^{N_1} n^2$ as the normalization factor, with the specific formula:

$d_t = \dfrac{\sum_{n=1}^{N_1} n\,(C_{t+n} - C_{t-n})}{2\sum_{n=1}^{N_1} n^2}$

where $d_t$ represents the first-order Delta differential feature, $C_{t+n}$ represents the log-mel spectral feature value $n$ frames after time frame $t$, $C_{t-n}$ represents the log-mel spectral feature value $n$ frames before time frame $t$, and $n = 1, 2, \ldots, N_1$;

S204: for the first-order Delta differential features, calculating the difference between preceding and following frames to obtain the corresponding second-order Delta differential feature, which captures the acceleration of the features over time and describes the dynamic characteristics of the signal;

the specific process of computing the second-order Delta differential features being: selecting a window size $N_2$; computing the weighted differences between the first-order Delta differential features of the 1 to $N_2$ frames before and after each frame, and normalizing the weighted-difference result with $2\sum_{n=1}^{N_2} n^2$ as the normalization factor, with the specific formula:

$d^{(2)}_t = \dfrac{\sum_{n=1}^{N_2} n\,(d_{t+n} - d_{t-n})}{2\sum_{n=1}^{N_2} n^2}$

where $d^{(2)}_t$ represents the second-order Delta differential feature of the $t$-th frame, $d_{t+n}$ represents the first-order Delta differential feature of the $(t+n)$-th frame, $d_{t-n}$ represents the first-order Delta differential feature of the $(t-n)$-th frame, and $n = 1, 2, \ldots, N_2$;

S205: stacking the log-mel spectral feature values obtained in step S202, the first-order Delta differential features obtained in step S203 and the second-order Delta differential features obtained in step S204 along the feature dimension to obtain the 3D log-mel spectrum signal.

3. The unknown audio event recognition algorithm based on a pulse neural network according to claim 2, characterized in that the specific steps of step S202 are:

S2021: dividing the z-normalized audio data into overlapping frames, each frame containing $N_3$ samples, and using a Hamming window to reduce spectral leakage, obtaining the windowed signal, with the specific formulas:

$w[n] = 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{N_3 - 1}\right)$

$x_w[n] = x[n]\, w[n]$

where $w[n]$ represents the Hamming window, $x_w[n]$ represents the signal obtained after windowing, $x[n]$ represents the signal of each frame, and $0 \le n \le N_3 - 1$;

S2022: applying the discrete Fourier transform to the windowed signal of each frame, and calculating the power spectrum of each frame's discrete signal, with the specific formulas:

$X[k] = \sum_{n=0}^{N_3-1} x_w[n]\, e^{-j 2\pi k n / N_3}$

$P[k] = \dfrac{1}{N_3}\,\lvert X[k]\rvert^2$

where $X[k]$ represents the spectrum, $P[k]$ represents the power spectrum of each frame's discrete signal, $k$ represents the frequency index, and $j$ represents the imaginary unit;

S2023: converting the frequency scale of each frame's power spectrum to the Mel scale, and taking the logarithm of the Mel spectrum to obtain the log-Mel spectrogram, with the specific formulas:

$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \dfrac{f}{700}\right)$

$H_m[k] = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$

$S_m = \sum_{k} H_m[k]\, P[k]$

where $H_m[k]$ represents the filter response of the Mel filter bank, $f(m)$ represents the center frequency of the $m$-th Mel filter, and $S_m$ represents the Mel spectral energy computed through the Mel filter bank.

4. The unknown audio event recognition algorithm based on a pulse neural network according to claim 3, characterized in that the specific steps of step S3 are:

S301: inputting the 3D log-mel spectrum signal into the convolution layer, and reshaping it according to batch size and channel number;

S302: processing the spectrum signal output by the convolution layer through the pulse neural units, the pulse units providing preceding and following time-series information, with the specific steps:

S3021: processing the spectrum signal through small pulse neurons to capture the temporal characteristics of the spectrum signal, with the specific process:

$U[t] = V[t-1] + \dfrac{1}{\tau}\left(-(V[t-1] - V_{reset}) + I[t]\right)$

$S[t] = \Theta(U[t] - V_{th})$

$V[t] = U[t]\,(1 - S[t]) + V_{reset}\, S[t]$

where $U[t]$ represents the membrane potential before reset at time $t$; $S[t]$ represents the output spike at time $t$, equal to 1 when there is a spike and 0 otherwise; $\tau$ represents the time constant, which affects the decay rate of the membrane potential; $V[t-1]$ represents the membrane potential after a triggered spike at time $t-1$; $I[t]$ represents the input current at time $t$; $\Theta$ represents the Heaviside step function; $V_{th}$ represents the membrane potential threshold, and when $U[t]$ exceeds this value the neuron triggers a spike; $V[t]$ represents the membrane potential after a triggered spike at time $t$; and $V_{reset}$ represents the value to which the membrane potential is reset after a spike;

S3022: using the convolution layer in the pulse unit to extract the scale-invariant information of the spectrum signal; for each frame input $I$, setting the convolution kernel size to $K$ and performing the convolution operation, with the specific formula:

$S(u, v) = \sum_{a=0}^{U-1}\sum_{b=0}^{V-1} I(u+a,\, v+b)\, K(a, b)$

where $S(u, v)$ represents the value in the $u$-th row and $v$-th column of the convolution result matrix, $U$ represents the height of the kernel $K$, $V$ represents the width of the kernel $K$, $I(u+a, v+b)$ represents the value in the $(u+a)$-th row and $(v+b)$-th column of the input matrix $I$, and $K(a, b)$ represents the value in the $a$-th row and $b$-th column of the convolution kernel $K$;

S3023: using a batch normalization layer to normalize the output of the convolution layer in the pulse unit, to accelerate training of the pulse neural network model and improve its stability, with the specific formulas:

$\mu_B = \dfrac{1}{Q}\sum_{i=1}^{Q} x_i$

$\sigma_B^2 = \dfrac{1}{Q}\sum_{i=1}^{Q}(x_i - \mu_B)^2$

$y_i = \gamma\,\dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$

where $y_i$ represents the output after batch normalization; $\gamma$ represents the first learnable parameter, used to scale the normalized value; $x_i$ represents the input feature value of the $i$-th sample in the batch; $\mu_B$ represents the feature mean of each batch; $\sigma_B^2$ represents the feature variance of each batch; $\epsilon$ is a constant; $\beta$ represents the second learnable parameter, used to shift the normalized value; and $Q$ represents the number of samples in the batch;

S3024: after the pulse units extract the specific features in the input spectrum signal, fusing them with the input spectrum signal through the residual convolution to improve the propagation efficiency of the information flow, the output expression of the pulse neural unit being:

Output = F(w) + Conv(w);

where Output represents the output of the pulse neural unit, w represents the input features, F represents the output of the two pulse units, and Conv represents the convolution operation;

S303: obtaining the scale-invariant features of the spectrum signal through the pulse neural units, and processing the feature map output by the pulse neural units with spatial average pooling to obtain a feature vector;

S304: sequentially inputting the feature vector into a multi-layer perceptron MLP, a reshaping layer and a long short-term memory network LSTM for processing, the LSTM comprising an input gate $i_t = \mathrm{sig}(W_i[h_{t-1}, x_t] + b_i)$, a forget gate $f_t = \mathrm{sig}(W_f[h_{t-1}, x_t] + b_f)$ and an output gate $o_t = \mathrm{sig}(W_o[h_{t-1}, x_t] + b_o)$, where sig represents the sigmoid activation function, used to map an input value to between 0 and 1; $W_i$ represents the weight matrix of the input gate, $W_f$ the weight matrix of the forget gate, and $W_o$ the weight matrix of the output gate; $h_{t-1}$ represents the hidden state at the previous time; $x_t$ represents the input vector at the current time; $b_i$, $b_f$ and $b_o$ represent the bias vectors of the input gate, forget gate and output gate respectively; the candidate state generated from the current input and the previous hidden state being $\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$, where tanh represents the hyperbolic tangent function, used to map an input value to between -1 and 1, $W_c$ represents the weight matrix of the candidate state and $b_c$ represents the bias vector of the candidate state; through the regulation of the forget gate and the input gate, the cell state being obtained by combining the previous cell state $c_{t-1}$ with the candidate-state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$;

S305: sequentially inputting the features output by the LSTM into the reshaping layer and the multi-layer perceptron MLP for processing; the input vector $x_1$ is computed through the hidden layer as $h_1 = \mathrm{sig}(W_1 x_1 + b_1)$, where $h_1$ represents the feature vector output by the hidden layer, $W_1$ represents the input-to-hidden weight matrix and $b_1$ represents the hidden-layer bias vector; the hidden feature vector output by the hidden layer is sent to the output layer for computation, outputting the feature vector $y = \mathrm{sig}(W_2 h_1 + b_2)$, where $W_2$ represents the hidden-to-output weight matrix and $b_2$ represents the output-layer bias vector.

5. The unknown audio event recognition algorithm based on a pulse neural network according to claim 4, characterized in that the calculation formulas of the cross-entropy loss and the contrastive loss are:

$L_{ce} = -\dfrac{1}{BS}\sum_{b=1}^{BS}\log\dfrac{e^{f(x_b, y_b)}}{\sum_{d=1}^{D} e^{f(x_{b,d})}}$

$L_{con} = -\dfrac{1}{\lvert P_1\rvert}\sum_{x_1^{+}\in P_1}\log\dfrac{e^{\mathrm{sim}(x_1, x_1^{+})/\tau}}{\sum_{x_1^{-}\in N_{neg}} e^{\mathrm{sim}(x_1, x_1^{-})/\tau}}$

where $L_{ce}$ denotes the cross-entropy loss, $BS$ denotes the batch size, $x_b$ denotes the $b$-th sample, $y_b$ denotes the label of the $b$-th sample, $f(x_b, y_b)$ denotes the output of the model on the true label $y_b$ of sample $x_b$, $D$ denotes the total number of categories, $d$ denotes the category index, $x_{b,d}$ denotes the input of the $b$-th sample on the $d$-th category, $f(x_{b,d})$ denotes the output value of the $d$-th category of the input data, $L_{con}$ denotes the contrastive loss, $P_1$ denotes the set of positive samples in the batch, $x_1$ denotes the current sample, $x_1^{+}$ denotes a positive sample similar to the current sample, i.e. a sample belonging to the same category as the current sample, $\tau$ denotes the temperature parameter, sim denotes cosine similarity, $x_1^{-}$ denotes a negative sample dissimilar to the current sample, i.e. a sample belonging to a different category from the current sample, and $N_{neg}$ denotes the set of negative samples in the batch.

6. The unknown audio event recognition algorithm based on a pulse neural network according to claim 5, characterized in that the number of nodes in the encoder's input layer is obtained by average-pooling the feature map output by the convolution layer of the pulse unit; when training the autoencoder, the mean squared error is used, with the specific formula:

$\mathrm{MSE} = \dfrac{1}{S}\sum_{i=1}^{S}\left(y_{ae,i} - \hat{y}_{ae,i}\right)^2$

where MSE represents the mean squared error, $S$ represents the number of samples, $y_{ae}$ represents the true value, and $\hat{y}_{ae}$ represents the autoencoder's predicted value.
CN202411764748.9A | 2024-12-04 | 2024-12-04 | Unknown audio event recognition algorithm based on impulse neural network | Active | CN119252276B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411764748.9A | 2024-12-04 | 2024-12-04 | Unknown audio event recognition algorithm based on impulse neural network (CN119252276B, en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411764748.9A | 2024-12-04 | 2024-12-04 | Unknown audio event recognition algorithm based on impulse neural network (CN119252276B, en)

Publications (2)

Publication Number | Publication Date
CN119252276A | 2025-01-03
CN119252276B | 2025-03-18 (granted)

Family

ID=94018915

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411764748.9A (Active, CN119252276B, en) | Unknown audio event recognition algorithm based on impulse neural network | 2024-12-04 | 2024-12-04

Country Status (1)

Country | Link
CN | CN119252276B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115762478A* | 2022-09-28 | 2023-03-07 | Xidian University | Speech recognition method based on photon pulse neural network
CN116259310A* | 2023-01-16 | 2023-06-13 | Zhejiang Lab | Hardware-oriented deep pulse neural network voice recognition method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20220024579A | 2019-07-11 | 2022-03-03 | LG Electronics Inc. | Artificial intelligence server
CN113974607B | 2021-11-17 | 2024-04-26 | Hangzhou Dianzi University | Sleep snore detecting system based on pulse neural network
KR102574165B1 | 2022-02-23 | 2023-09-01 | Korea University Industry-Academic Cooperation Foundation | Apparatus for classifying sounds based on neural code in spiking neural network and method thereof
CN117436487A | 2022-07-07 | 2024-01-23 | Huawei Technologies Co., Ltd. | Data processing method, device and storage medium based on impulse neural network
TW202437241A | 2023-02-24 | 2024-09-16 | Innatera Nanosystems B.V. (Netherlands) | Always-on neuromorphic audio processing modules and methods
US20240394932A1 | 2023-05-26 | 2024-11-28 | Snap Inc. | Text-to-image diffusion model rearchitecture
CN116849637A | 2023-06-09 | 2023-10-10 | Xidian University | An abnormal heart rate diagnosis system based on lightweight spiking neural network
CN118861805A | 2024-06-28 | 2024-10-29 | Guangdong University of Education | Multimodal emotion recognition method based on spiking neural network and attention mechanism


Also Published As

Publication number | Publication date
CN119252276A (en) | 2025-01-03

Similar Documents

Publication | Title
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE
Stöter et al. | CountNet: Estimating the number of concurrent speakers using supervised learning
Ge et al. | Deep learning approach in DOA estimation: A systematic literature review
CN109559736B (en) | A method for automatic dubbing of movie actors based on adversarial networks
CN108231067A (en) | Sound scenery recognition method based on convolutional neural networks and random forest classification
Bahari et al. | Speaker age estimation and gender detection based on supervised non-negative matrix factorization
Han et al. | Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification
WO2020181998A1 | Method for detecting mixed sound event on basis of factor decomposition of supervised variational encoder
CN114582325B (en) | Audio detection method, device, computer equipment and storage medium
CN115101077A (en) | Voiceprint detection model training method and voiceprint recognition method
Wang et al. | Robust speaker identification of IoT based on stacked sparse denoising auto-encoders
Liu et al. | Birdsong classification based on multi feature channel fusion
Wang et al. | A novel underground pipeline surveillance system based on hybrid acoustic features
Rahman et al. | Dynamic thresholding on speech segmentation
CN115132180 (en) | Synthetic voice detection method based on residual error network
CN119252276B (en) | Unknown audio event recognition algorithm based on impulse neural network
CN109522448 (en) | A method of robust speech gender classification based on CRBM and SNN
CN117649859 (en) | Environmental sound classification and identification method based on D-CNN and KNN technologies
CN116593980B (en) | Radar target recognition model training method, radar target recognition method and device
CN117690441 (en) | Talking scene speaker identification method
CN114267361 (en) | Speaker recognition system with high recognition degree
Xie et al. | Acoustic scene classification using deep CNNs with time-frequency representations
Deroy | Exploiting Machine Learning Techniques for Unsupervised Clustering of Speech Utterances
Kusumoputro et al. | Speaker identification in noisy environment using bispectrum analysis and probabilistic neural network
Zhou et al. | An Intelligent Speech Recognition Method Based on Stable Learning

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
