CN114373476B - Sound scene classification method based on multi-scale residual error attention network - Google Patents

Sound scene classification method based on multi-scale residual error attention network

Info

Publication number
CN114373476B
Authority
CN
China
Prior art keywords
sound scene
features
frequency
input
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210028342.9A
Other languages
Chinese (zh)
Other versions
CN114373476A (en)
Inventor
雷震春
周勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University
Priority to CN202210028342.9A
Publication of CN114373476A
Application granted
Publication of CN114373476B
Active (current legal status)
Anticipated expiration

Abstract


The present invention provides a sound scene classification method based on a multi-scale residual attention network, comprising the following steps: extracting features from collected audio data, taking the log-Mel spectrogram and its first-order and second-order differences as input features; constructing a multi-scale residual attention network and inputting the extracted log-Mel spectrogram features into the network for training to establish a classification model; employing the mixup method to enhance data diversity; employing a focal loss to focus on samples that are difficult to classify; and acquiring new sound scene speech and using the classification model to classify its sound scene, thereby obtaining a sound scene classification result. By using the log-Mel spectrogram together with its first-order and second-order differences and a multi-scale residual attention network model to classify sound scenes, the invention can mine richer and more comprehensive feature information and thus improve sound scene classification performance.

Description

Sound scene classification method based on multi-scale residual error attention network
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a sound scene classification method based on a multi-scale residual attention network.
Background
Humans have an inherent ability to identify sound scenes, i.e., the scene in which an audio clip was recorded, such as a subway or a bus, can be judged from the audio by experience. With the continuous development of signal processing and artificial intelligence technology, it has also become possible for machines to understand sound and judge its source. Acoustic scene classification (ASC) is a multi-class classification task that aims to identify, from an audio segment, the scene in which the audio was recorded. At present, sound scene classification is widely applied in fields such as intelligent wearable devices, audio archiving, interactive robots and security monitoring.
Sound scene classification methods fall mainly into two categories. Methods based on traditional machine learning, such as Gaussian mixture models, hidden Markov models and support vector machines, suffer from limited classification performance and poor generalization ability. Methods based on deep learning, such as deep neural networks, convolutional neural networks and recurrent neural networks, usually contain only convolution kernels of a single scale, so the features they mine are not rich and comprehensive enough, and they do not take into account that features in different regions have different importance.
Therefore, how to fully mine the data features and improve the accuracy of sound scene classification is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a sound scene classification method based on a multi-scale residual attention network, which aims to solve the problems that the features extracted in current sound scene classification tasks are of a single scale and not rich enough, and that different regions of the extracted features are not treated as having different importance.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a sound scene classification method based on a multi-scale residual attention network, comprising the steps of:
Step 1, collecting audio data, inputting the audio data into a feature extraction module for feature extraction, and extracting a logarithmic Mel spectrogram and its first-order and second-order differences as input features;
Step 2, constructing a multi-scale residual attention network, and inputting the input features into the network for training to establish a classification model;
Step 3, processing the audio data with the mixup method to obtain data samples and enhance data diversity;
Step 4, inputting the data samples into the classification model for classification, and optimizing the classification model by adopting a focal loss that focuses on samples which are difficult to classify;
and Step 5, acquiring new sound scene speech, inputting it into the optimized classification model for sound scene classification, and obtaining a sound scene classification result.
Preferably, the specific process of feature extraction in the step 1 is as follows:
step 1.1, pre-emphasis processing is carried out on collected voice data so that a high-frequency part and a low-frequency part of a sound signal are more balanced;
Step 1.2, framing the pre-emphasized voice data into a plurality of frames of voice signals;
step 1.3, windowing each frame of voice signal by adopting a Hanning window function to obtain a short-time windowed voice signal;
step 1.4, carrying out Fourier transform on the short-time windowed voice signal to convert the short-time windowed voice signal from a time domain to a frequency domain, and obtaining a frequency domain signal;
step 1.5, the obtained frequency domain signal is passed through a Mel filter to obtain a Mel spectrogram with proper size;
Step 1.6, taking the logarithm of the Mel spectrogram to obtain a logarithmic Mel spectrogram;
Step 1.7, the first-order difference and the second-order difference of the logarithmic Mel spectrogram are calculated to obtain the dynamic characteristics of the voice signal, and the logarithmic Mel spectrogram, the first-order difference and the second-order difference are stacked to obtain the final input characteristics.
Preferably, in the step 1, the frame overlapping rate of the voice data in framing is 50%, the number of FFT points in the Fourier transform process is 2048, and the number of Mel filters is 128.
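A minimal Python sketch of this feature-extraction pipeline, using the librosa library with the parameters stated above; the sample rate and the pre-emphasis coefficient are illustrative assumptions not fixed by the patent:

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=44100, pre_emph=0.97, n_fft=2048, n_mels=128):
    """Log-Mel spectrogram stacked with its first- and second-order differences."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Step 1.1: pre-emphasis to balance high- and low-frequency energy
    # (the coefficient 0.97 is an assumed, typical value)
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])
    # Steps 1.2-1.5: framing with 50% overlap, Hanning window, FFT, Mel filterbank
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=n_fft // 2,
                                         window="hann", n_mels=n_mels)
    # Step 1.6: take the logarithm
    log_mel = librosa.power_to_db(mel)
    # Step 1.7: first- and second-order differences, stacked as three channels
    delta1 = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, delta1, delta2], axis=0)  # shape (3, n_mels, frames)
```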
Preferably, the specific process of the step 2 is as follows:
Step 2.1, dividing an input characteristic formed by a logarithmic Mel spectrogram and a first-order difference and a second-order difference thereof into a high-frequency part and a low-frequency part;
step 2.2, respectively inputting the high-frequency part and the low-frequency part into a channel attention module of a multi-scale residual attention network, distributing different weights according to different importance of the features, highlighting important features and inhibiting secondary features so as to generate new features;
Step 2.3, inputting the new features extracted by the channel attention module into a multi-scale residual module of the multi-scale residual attention network, extracting feature information of different precisions and different depths, and obtaining a high-frequency partial feature map and a low-frequency partial feature map;
Step 2.4, concatenating the two partial feature maps obtained through the multi-scale residual module along the frequency dimension to obtain the complete features;
and Step 2.5, passing the complete features sequentially through a convolution block consisting of a batch normalization layer, a rectified linear unit and a 1×1 convolution layer, a convolution block consisting of a batch normalization layer and a 1×1 convolution layer, a batch normalization layer, a global average pooling layer and a softmax layer for classification.
Preferably, the specific process of generating the new feature through the channel attention module in the step 2.2 includes:
Step 2.2.1, performing max-pooling and average-pooling operations on the high-frequency input features and the low-frequency input features respectively to obtain two feature maps;
Step 2.2.2, sending the two feature maps obtained by pooling into a multi-layer perceptron respectively to obtain two perception results;
Step 2.2.3, adding the two perception results obtained from the multi-layer perceptron to obtain a combined result;
Step 2.2.4, applying a sigmoid activation to the combined result to obtain the weight parameters of the input features;
and Step 2.2.5, finally, multiplying the weight parameters with the input features to generate the new features.
Preferably, in step 2.3 the new features pass in sequence through a batch normalization layer and a convolution layer, then twice through a residual block Residual 01 composed of two 1×1, two 3×3 and two 5×5 convolution kernels, and then three times through a combined block formed by a residual block Residual 02 (composed of two 1×1, two 3×3 and two 5×5 convolution kernels, max pooling, average pooling and zero padding) and the residual block Residual 01, thereby obtaining the high-frequency partial feature map and the low-frequency partial feature map.
Preferably, the formula for acquiring the data sample by using the mixup method in the step 3 is as follows:
x = λx_i + (1 − λ)x_j
y = λy_i + (1 − λ)y_j
where (x_i, y_i) and (x_j, y_j) are two samples selected at random from the training set partitioned from the collected speech data, x_i and x_j are the original input vectors, y_i and y_j are the corresponding label encodings, and λ is a hyper-parameter with λ ∈ [0,1].
Preferably, the acquired new sound scene speech is tested against its true classification results, and the classification accuracy is calculated from the sound scene classification results obtained in step 5. The classification model can then be further optimized and corrected according to the classification accuracy, improving classification performance.
Compared with the prior art, the invention discloses a sound scene classification method based on a multi-scale residual attention network. The method uses convolution kernels of several different scales to mine more detailed and more global information, and combines them with an improved residual network structure to extract semantic information at different levels. A channel attention mechanism is introduced to assign different weights to the features on different channels according to their importance, so that key features are learned, secondary features are suppressed, and the network's ability to learn features is enhanced. In addition, the mixup method is adopted to enhance data diversity, and a focal loss is adopted to focus on samples that are difficult to classify, which improves the sound scene classification performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a sound scene classification method based on a multi-scale residual attention network provided by the invention;
FIG. 2 is a schematic diagram of a feature extraction process according to the present invention;
FIG. 3 is a schematic view of a channel attention module structure according to the present invention;
Fig. 4 is a schematic structural diagram of the multi-scale residual module provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a sound scene classification method based on a multi-scale residual attention network, comprising the following steps:
s1, inputting the acquired audio data into a feature extraction module, extracting a logarithmic Mel spectrogram and a first-order difference and a second-order difference thereof as input features, wherein the specific flow of the feature extraction module is shown in a figure 2, and the method comprises the following steps:
s1.1, pre-emphasis processing is carried out on the collected audio data, so that the high-frequency part and the low-frequency part of the sound signal are more balanced, and the pre-emphasis formula is as follows:
H(z) = 1 − a·z^(−1) (1)
where a is the pre-emphasis coefficient;
s1.2, framing the pre-emphasized audio data into a plurality of frames of voice signals;
S1.3, windowing each frame of the speech signal with a Hanning window function to obtain a short-time windowed speech signal, the Hanning window being defined as:
w(n) = 0.5·[1 − cos(2πn/(N − 1))], 0 ≤ n ≤ N − 1 (2)
where N is the frame length;
S1.4, performing a Fourier transform on the short-time windowed speech signal to convert it from the time domain to the frequency domain and obtain the frequency-domain signal, the transform being:
X(k) = Σ_{n=0}^{N−1} x(n)·w(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1 (3)
where x(n) is one frame of the speech signal and w(n) is the window function;
S1.5, passing the frequency-domain signal obtained in the previous step through a Mel filterbank to obtain a Mel spectrogram of suitable size; the center frequency of each triangular filter in the Mel filterbank is given by formula (4) and its frequency response by formula (5):
f(m) = (N/f_s)·B^(−1)( B(f_l) + m·(B(f_h) − B(f_l))/(M + 1) ) (4)
H_m(k) = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); H_m(k) = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 otherwise (5)
where f(m) is the center frequency of the m-th filter, f_l and f_h are the lowest and highest frequencies covered by the triangular filterbank, N is the number of sampling points (FFT points), f_s is the sampling frequency, M is the number of Mel filters, and B and B^(−1) are a pair of mutually inverse functions, B being the Mel-scale mapping defined in formula (6):
B(f) = 2595·lg(1 + f/700) (6)
with inverse B^(−1)(b) = 700·(10^(b/2595) − 1);
S1.6, taking the logarithm of the Mel spectrogram to obtain a logarithmic Mel spectrogram;
S1.7, obtaining a first-order difference and a second-order difference of the logarithmic Mel spectrogram to obtain the dynamic characteristics of the voice signal, and stacking the logarithmic Mel spectrogram and the first-order and second-order differences thereof to obtain the final input characteristics;
S2, constructing a multi-scale residual attention network, and inputting the extracted input features into the multi-scale residual attention network for training to obtain a classification model;
S2.1, dividing the input features, formed by the logarithmic Mel spectrogram and its first-order and second-order differences, into a high-frequency part and a low-frequency part;
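A minimal PyTorch sketch of this frequency-wise split and of the later re-joining in step S2.4; the input shape and the equal-halves split point are illustrative assumptions, since the patent does not specify them:

```python
import torch

# Assumed input: stacked log-Mel features of shape (batch, 3, 128, time_frames)
features = torch.randn(8, 3, 128, 431)

# Split along the Mel-frequency axis into a low- and a high-frequency half
low_freq, high_freq = torch.split(features, 128 // 2, dim=2)

# Each half, of shape (batch, 3, 64, time_frames), is processed by its own
# channel-attention + multi-scale residual branch; the branch outputs are
# later concatenated again along the same frequency dimension (step S2.4):
low_out, high_out = low_freq, high_freq            # placeholders for branch outputs
combined = torch.cat([low_out, high_out], dim=2)   # (batch, 3, 128, time_frames)
```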
S2.2, respectively inputting a high-frequency part and a low-frequency part into a channel attention module of the multi-scale residual attention network, distributing different weights according to different importance of the features, highlighting important features and suppressing secondary features, wherein the specific process can be expressed as formulas (7) and (8):
M_c(F) = σ( MLP(AvgPool(F)) + MLP(MaxPool(F)) ) (7)
F' = M_c(F) ⊗ F (8)
where F is the input feature map of size H×W×C, AvgPool(F) and MaxPool(F) are the average-pooling and max-pooling operations respectively, MLP is the shared multi-layer perceptron, M_c(F) is the weight parameter, F' is the feature obtained through the channel attention module, σ denotes the sigmoid function, and ⊗ denotes the product operation;
The channel attention module structure is shown in fig. 3, and comprises the following steps:
s2.2.1, carrying out maximum pooling and average pooling operation on the input features respectively to obtain two feature graphs;
s2.2.2, respectively sending the two feature maps obtained by pooling treatment into a multi-layer perceptron to obtain two perception results;
s2.2.3, adding two perception results obtained by the multi-layer perceptron;
S2.2.4, obtaining weight parameters of input features by performing sigmoid activation operation on the obtained result;
s2.2.5 finally, carrying out product operation on the weight parameters and the input features to generate new features.
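A minimal PyTorch sketch of such a channel attention block following Eqs. (7) and (8); the hidden size (reduction ratio) of the shared multi-layer perceptron is an illustrative assumption:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention per Eqs. (7)-(8): shared MLP over avg- and max-pooled features."""
    def __init__(self, channels, reduction=16):   # reduction ratio is assumed
        super().__init__()
        self.mlp = nn.Sequential(                  # shared multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))         # average-pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))          # max-pooling branch
        w = self.sigmoid(avg + mx).view(x.size(0), -1, 1, 1)  # M_c(F), Eq. (7)
        return x * w                               # F' = M_c(F) ⊗ F, Eq. (8)
```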
S2.3, inputting the new features produced by the channel attention module into a Multi-Scale Residual Module (MSRM) of the multi-scale residual attention network, extracting feature information of different precisions and different depths, and obtaining a high-frequency partial feature map and a low-frequency partial feature map; the structure of the multi-scale residual module is shown in Fig. 4, and the steps are as follows:
S2.3.1, passing the generated new features through a batch normalization (BN) layer and a convolution layer;
S2.3.2, passing them twice through a residual block Residual 01 composed of convolution kernels at three different scales (1×1, 3×3 and 5×5);
S2.3.3, passing them three times through a combined block formed by a residual block Residual 02 (composed of convolution kernels at the three scales 1×1, 3×3 and 5×5, max pooling, average pooling and zero padding) and the residual block Residual 01;
S2.3.4, finally obtaining the high-frequency partial feature map and the low-frequency partial feature map.
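A minimal PyTorch sketch in the spirit of the residual block Residual 01, with parallel 1×1, 3×3 and 5×5 branches of two convolutions each added back to the input; the channel counts and the way the three branches are fused are illustrative assumptions not fixed by the patent:

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Residual block with parallel 1x1, 3x3 and 5x5 convolution branches."""
    def __init__(self, channels):
        super().__init__()
        def branch(k):
            # two convolutions of the same kernel size per branch
            return nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(channels))
        self.b1, self.b3, self.b5 = branch(1), branch(3), branch(5)
        self.fuse = nn.Conv2d(3 * channels, channels, 1, bias=False)  # fuse the 3 scales
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        multi = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.relu(x + self.fuse(multi))     # residual connection
```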
S2.4, concatenating the two partial feature maps obtained through the multi-scale residual module along the frequency dimension to obtain the complete features;
S2.5, passing the complete features sequentially through a convolution block consisting of a BN layer, a rectified linear unit (ReLU) and a 1×1 convolution layer, a convolution block consisting of a BN layer and a 1×1 convolution layer, a BN layer, a global average pooling layer and a softmax layer for classification;
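A minimal PyTorch sketch of the classification head of S2.5 (BN-ReLU-1×1 convolution block, BN-1×1 convolution block, BN, global average pooling and softmax); the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """BN-ReLU-1x1 conv, BN-1x1 conv, BN, global average pooling, softmax."""
    def __init__(self, in_channels, mid_channels, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, mid_channels, kernel_size=1))
        self.block2 = nn.Sequential(
            nn.BatchNorm2d(mid_channels),
            nn.Conv2d(mid_channels, num_classes, kernel_size=1))
        self.bn = nn.BatchNorm2d(num_classes)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.bn(self.block2(self.block1(x)))
        x = x.mean(dim=(2, 3))                  # global average pooling
        return torch.softmax(x, dim=1)          # class probabilities
```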
S3, enhancing data diversity by adopting the mixup method, where mixup can be expressed as:
x = λx_i + (1 − λ)x_j (9)
y = λy_i + (1 − λ)y_j (10)
where (x_i, y_i) and (x_j, y_j) are two samples selected at random from the training set partitioned from the collected speech data, x_i and x_j are the original input vectors, y_i and y_j are the corresponding label encodings, and λ is a hyper-parameter with λ ∈ [0,1] that controls the degree of mixing between the two samples;
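A minimal sketch of the mixup step of Eqs. (9) and (10); drawing λ from a Beta distribution is a common convention and an assumption here, since the patent only requires λ ∈ [0,1]:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix two samples and their one-hot labels per Eqs. (9)-(10)."""
    # Drawing lambda from Beta(alpha, alpha) is an assumed, typical choice.
    lam = np.random.beta(alpha, alpha)
    x = lam * x_i + (1 - lam) * x_j          # Eq. (9)
    y = lam * y_i + (1 - lam) * y_j          # Eq. (10)
    return x, y
```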
S4, adopting a focal loss to focus on samples that are difficult to classify; the focal loss function can be expressed as:
L_fl = −Σ_{i=1}^{n} α·y_i·(1 − p_i)^λ·log(p_i)
where n denotes the number of categories, y_i denotes the true label code for category i, p_i denotes the predicted probability that the sample belongs to category i, α is a weighting factor and λ is a hyper-parameter;
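A minimal PyTorch sketch of a multi-class focal loss consistent with the description above (γ plays the role of the exponent λ, renamed to avoid clashing with the mixup parameter; the exact per-class weighting is an assumption):

```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Multi-class focal loss: down-weights easy samples, focuses on hard ones."""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma   # gamma corresponds to the lambda exponent above

    def forward(self, probs, targets):
        # probs: (B, n) softmax outputs; targets: (B,) integer class labels
        p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-8)
        loss = -self.alpha * (1 - p_t) ** self.gamma * torch.log(p_t)
        return loss.mean()
```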
s5, acquiring new sound scene voices, and performing sound scene classification on the new sound scene voices by using the trained classification model to obtain sound scene classification results.
Examples
Sound scene classification is performed on the public TAU Urban Acoustic Scenes 2020 Mobile Development dataset (TAU for short) from Task 1A of the DCASE2020 challenge (Detection and Classification of Acoustic Scenes and Events 2020). The dataset contains recordings of 10 different acoustic scenes captured by 9 different devices in 10 European cities. The 10 sound scenes are airport (Airport), shopping mall (Shopping mall), metro station (Metro station), pedestrian street (Street pedestrian), public square (Public square), traffic street (Street traffic), tram (Tram), bus (Bus), metro (Metro) and park (Park). The experiments use the classification accuracy over the sound scene categories as the evaluation criterion; the training set is used to train the model parameters and the test set is used to compare model performance. The experimental results are shown in Table 1:
TABLE 1 Sound scene class classification results
The experimental results show that the performance of the proposed multi-scale residual attention network is clearly better than that of the DCASE2020 Task 1A baseline system.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; identical and similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

Translated from Chinese
1. A sound scene classification method based on a multi-scale residual attention network, characterized by comprising the following steps:
Step 1: collect audio data for feature extraction, and extract the logarithmic Mel spectrogram and its first-order and second-order differences as input features;
Step 2: construct a multi-scale residual attention network and input the input features into the network for training to establish a classification model;
Step 3: process the audio data with the mixup method to obtain data samples;
Step 4: input the data samples into the classification model for classification, use a focal loss to focus on samples that are difficult to classify, and optimize the classification model;
Step 5: obtain new sound scene speech, input it into the optimized classification model for sound scene classification, and obtain the sound scene classification result;
the specific process of feature extraction in step 1 is as follows:
Step 1.1: pre-emphasize the collected speech data;
Step 1.2: divide the pre-emphasized speech data into several frames of speech signals;
Step 1.3: window each frame of the speech signal with a Hanning window function to obtain a short-time windowed speech signal;
Step 1.4: perform a Fourier transform on the short-time windowed speech signal to convert it from the time domain to the frequency domain and obtain a frequency-domain signal;
Step 1.5: pass the obtained frequency-domain signal through a Mel filterbank to obtain a Mel spectrogram;
Step 1.6: take the logarithm of the Mel spectrogram to obtain the logarithmic Mel spectrogram;
Step 1.7: calculate the first-order and second-order differences of the logarithmic Mel spectrogram, then stack the logarithmic Mel spectrogram and its first-order and second-order differences to obtain the final input features;
the specific process of step 2 is as follows:
Step 2.1: divide the input features into a high-frequency part and a low-frequency part;
Step 2.2: input the high-frequency part and the low-frequency part respectively into the channel attention module of the multi-scale residual attention network, assign different weights according to the importance of the features, and generate new features; the specific process of generating new features through the channel attention module in step 2.2 includes:
Step 2.2.1: perform max-pooling and average-pooling operations on the high-frequency input features and the low-frequency input features respectively to obtain two feature maps;
Step 2.2.2: send the two feature maps obtained by pooling into a multi-layer perceptron respectively to obtain two perception results;
Step 2.2.3: add the two perception results obtained from the multi-layer perceptron to obtain a combined result;
Step 2.2.4: apply a sigmoid activation to the combined result to obtain the weight parameters of the input features;
Step 2.2.5: finally, multiply the weight parameters with the input features to generate the new features;
Step 2.3: input the new features into the multi-scale residual module of the multi-scale residual attention network, extract feature information of different precisions and different depths, and obtain a high-frequency partial feature map and a low-frequency partial feature map; in step 2.3 the new features pass in sequence through a batch normalization layer and a convolution layer, then twice through a residual block Residual 01 composed of two 1×1, two 3×3 and two 5×5 convolution kernels, and then three times through a combined block formed by a residual block Residual 02 (composed of two 1×1, two 3×3 and two 5×5 convolution kernels, max pooling, average pooling and zero padding) and the residual block Residual 01, thereby obtaining the high-frequency partial feature map and the low-frequency partial feature map;
Step 2.4: concatenate the high-frequency partial feature map and the low-frequency partial feature map along the frequency dimension to obtain the complete features;
Step 2.5: pass the complete features sequentially through a convolution block consisting of a batch normalization layer, a rectified linear unit and a 1×1 convolution layer, a convolution block consisting of a BN layer and a 1×1 convolution layer, a batch normalization layer, a global average pooling layer and a softmax layer for classification, thereby obtaining the classification model.
2. The sound scene classification method based on a multi-scale residual attention network according to claim 1, characterized in that, in step 1, the frame overlap rate when framing the speech data is 50%, the number of FFT points in the Fourier transform is 2048, and the number of Mel filters is 128.
3. The sound scene classification method based on a multi-scale residual attention network according to claim 1, characterized in that the formula used by the mixup method in step 3 to obtain the data samples is:
x = λx_i + (1 − λ)x_j
y = λy_i + (1 − λ)y_j
where (x_i, y_i) and (x_j, y_j) are two samples selected at random from the training set partitioned from the collected speech data, x_i and x_j are the original input vectors, y_i and y_j are the corresponding label encodings, and λ is a hyper-parameter with λ ∈ [0,1].
4. The sound scene classification method based on a multi-scale residual attention network according to claim 1, characterized in that the acquired new sound scene speech is tested to obtain the true classification results, and the classification accuracy is calculated based on the sound scene classification results obtained in step 5.
CN202210028342.9A | 2022-01-11 | 2022-01-11 | Sound scene classification method based on multi-scale residual error attention network | Active | CN114373476B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210028342.9A (CN114373476B) | 2022-01-11 | 2022-01-11 | Sound scene classification method based on multi-scale residual error attention network

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210028342.9A (CN114373476B) | 2022-01-11 | 2022-01-11 | Sound scene classification method based on multi-scale residual error attention network

Publications (2)

Publication Number | Publication Date
CN114373476A (en) | 2022-04-19
CN114373476B (en) | 2025-09-19

Family

ID=81144196

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210028342.9A (CN114373476B, Active) | Sound scene classification method based on multi-scale residual error attention network | 2022-01-11 | 2022-01-11

Country Status (1)

Country | Link
CN (1) | CN114373476B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115101091B (en)* | 2022-05-11 | 2025-03-07 | 上海事凡物联网科技有限公司 | Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion
CN114898778A (en)* | 2022-05-17 | 2022-08-12 | 东南大学 | Voice emotion recognition method and system based on attention time-frequency network
CN114863938B (en)* | 2022-05-24 | 2024-09-13 | 西南石油大学 | Method and system for identifying bird language based on attention residual error and feature fusion
CN114863112B (en)* | 2022-05-27 | 2025-07-08 | 江苏大学 | Tea tender bud identification and picking point positioning method and system based on U-net semantic segmentation
CN115329893B (en)* | 2022-09-01 | 2025-08-05 | 无锡致同知新科技有限公司 | Acoustic scene classification method based on paired feature fusion
CN115565538A (en)* | 2022-09-13 | 2023-01-03 | 山东省计算中心(国家超级计算济南中心) | Voice counterfeit distinguishing method and system based on single-classification multi-scale residual error network
CN116013361B (en)* | 2022-12-08 | 2025-09-16 | 武汉大学 | Sound event sample mixing method and device based on attention mechanism
CN116416997A (en)* | 2023-03-10 | 2023-07-11 | 华中科技大学 | Intelligent Voice Forgery Attack Detection Method Based on Attention Mechanism
CN116030800A (en)* | 2023-03-30 | 2023-04-28 | 南昌航天广信科技有限责任公司 | Audio classification recognition method, system, computer and readable storage medium
CN116597822B (en)* | 2023-05-16 | 2025-10-03 | 江苏大学 | A network model for acoustic scene classification based on hierarchical information fusion
CN117009781A (en)* | 2023-07-04 | 2023-11-07 | 海纳科德(湖北)科技有限公司 | Method and system for identifying passive underwater acoustic signals based on MFCC (multi-frequency component carrier) characteristics
CN117238320B (en)* | 2023-11-16 | 2024-01-09 | 天津大学 | A noise classification method based on multi-feature fusion convolutional neural network
CN119420599A (en)* | 2024-11-20 | 2025-02-11 | 重庆邮电大学 | A channel estimation method for RIS-assisted communication system based on deep learning
CN119181388B (en)* | 2024-11-22 | 2025-03-14 | 浙江大学 | A method and system for classifying respiratory sounds based on mel-spectrogram

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111754988A (en)* | 2020-06-23 | 2020-10-09 | 南京工程学院 | Acoustic scene classification method based on attention mechanism and dual-path deep residual network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109766546A (en)* | 2018-12-25 | 2019-05-17 | 华东师范大学 | A Natural Language Inference Method Based on Neural Network
CN111062278B (en)* | 2019-12-03 | 2023-04-07 | 西安工程大学 | Abnormal behavior identification method based on improved residual error network
CN110992270A (en)* | 2019-12-19 | 2020-04-10 | 西南石油大学 | Multi-scale residual attention network image super-resolution reconstruction method based on attention
KR20210131067A (en)* | 2020-04-23 | 2021-11-02 | 한국전자통신연구원 | Method and appratus for training acoustic scene recognition model and method and appratus for reconition of acoustic scene using acoustic scene recognition model
CN112750462B (en)* | 2020-08-07 | 2024-06-21 | 腾讯科技(深圳)有限公司 | Audio processing method, device and equipment
CN112149504B (en)* | 2020-08-21 | 2024-03-26 | 浙江理工大学 | Motion video identification method combining mixed convolution residual network and attention
CN112487939A (en)* | 2020-11-26 | 2021-03-12 | 深圳市热丽泰和生命科技有限公司 | Pure vision light weight sign language recognition system based on deep learning
CN112581979B (en)* | 2020-12-10 | 2022-07-12 | 重庆邮电大学 | A Spectrogram-Based Speech Emotion Recognition Method
CN113852432B (en)* | 2021-01-07 | 2023-08-25 | 上海应用技术大学 | Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN112906591A (en)* | 2021-03-02 | 2021-06-04 | 中国人民解放军海军航空大学航空作战勤务学院 | Radar radiation source identification method based on multi-stage jumper residual error network
CN112700794B (en)* | 2021-03-23 | 2021-06-22 | 北京达佳互联信息技术有限公司 | Audio scene classification method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111754988A (en)* | 2020-06-23 | 2020-10-09 | 南京工程学院 | Acoustic scene classification method based on attention mechanism and dual-path deep residual network

Also Published As

Publication number | Publication date
CN114373476A (en) | 2022-04-19

Similar Documents

Publication | Title
CN114373476B (en) | Sound scene classification method based on multi-scale residual error attention network
Qamhan et al. | Digital audio forensics: microphone and environment classification using deep learning
CN112562698B (en) | A defect diagnosis method for power equipment based on fusion of sound source information and thermal imaging features
CN111488486B (en) | A kind of electronic music classification method and system based on multi-sound source separation
CN110718234A (en) | Acoustic scene classification method based on semantic segmentation encoder-decoder network
CN111754988A (en) | Acoustic scene classification method based on attention mechanism and dual-path deep residual network
CN110600059B (en) | Acoustic event detection method, device, electronic device and storage medium
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics
Qais et al. | Deepfake audio detection with neural networks using audio features
Waldekar et al. | Two-level fusion-based acoustic scene classification
López-Espejo et al. | Improved external speaker-robust keyword spotting for hearing assistive devices
Fan et al. | Discriminative learning for monaural speech separation using deep embedding features
Bai et al. | A squeeze-and-excitation and transformer-based cross-task model for environmental sound recognition
CN117851936A (en) | A network model for acoustic scene classification based on multi-dimensional weighted fusion
Jin et al. | Speech separation and emotion recognition for multi-speaker scenarios
Song et al. | A compact and discriminative feature based on auditory summary statistics for acoustic scene classification
Ding et al. | Acoustic scene classification based on ensemble system
Bhavya et al. | Deep learning approach for sound signal processing
CN118098247A (en) | Voiceprint recognition method and system based on parallel feature extraction model
Kek et al. | Acoustic scene classification using bilinear pooling on time-liked and frequency-liked convolution neural network
CN117594061A (en) | A sound detection and localization method based on multi-scale feature attention network
Chen et al. | Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification
Sharma et al. | Sound event separation and classification in domestic environment using mean teacher
CN114898778A (en) | Voice emotion recognition method and system based on attention time-frequency network
Shariff et al. | Comparison of Spectrograms for Classification of Vehicles from Traffic Audio

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
