CN117109726A - A single-channel noise detection method and device - Google Patents

A single-channel noise detection method and device

Info

Publication number
CN117109726A
Authority
CN
China
Prior art keywords
noise, time, source, signal, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311015701.8A
Other languages
Chinese (zh)
Other versions
CN117109726B (en)
Inventor
伍世丰
陈庆春
陈多宏
李钦诚
黄国锋
吴科毅
刘军
孙陈林
廖彤
蔡林羲
周炳朋
郑蕾
张承云
林子锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ecological Environment Monitoring Center
Guangzhou University
Original Assignee
Guangdong Ecological Environment Monitoring Center
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ecological Environment Monitoring Center and Guangzhou University
Priority to CN202311015701.8A
Publication of CN117109726A
Application granted
Publication of CN117109726B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a single-channel noise detection method and device. The method comprises the following steps: inputting a mixed sound signal into a pre-trained noise separation model to obtain a plurality of single-source noise signals, the noise separation model being constructed from a time-domain encoder and a dual-path recurrent neural network; inputting the separated single-source noise signals into a pre-trained noise enhancement model and enhancing each single-source noise signal, the noise enhancement model being constructed from a time-frequency-domain generative adversarial network; and acquiring each enhanced single-source noise signal and evaluating it against preset indicators to complete noise detection. By building the noise separation model from a time-domain encoder and a dual-path recurrent neural network and capturing the feature information of the sound signal through the dual paths, the invention separates the noise sources in a mixed sound signal quickly and effectively, improving the accuracy of noise monitoring.

Description

Translated from Chinese
A single-channel noise detection method and device

Technical Field

The present invention relates to the field of data processing technology, and in particular to a single-channel noise detection method and device.

Background Art

Noise pollution prevention and control concerns people's physical and mental health; it is a livelihood project of broad public benefit and an important part of building an ecological civilization and protecting the ecological environment. At present, complex acoustic scenes contain many types of noise, and effective intelligent noise quantification and analysis technology is lacking, so it is impossible to determine which type of noise in a scene exceeds the permitted level. Existing noise detection methods usually identify the various common noises in a scene through sound separation.

Existing sound separation methods generally use neural networks for local and global feature modeling, and use multi-scale time-delay sampling to compensate for the information lost in segmented modeling. Although this improves the accuracy of sound separation to a certain extent, the large number of model parameters and the long processing time of segmented modeling make real-time detection impossible.

Summary of the Invention

The present invention provides a single-channel noise detection method and device to solve the technical problem that existing sound detection methods cannot achieve both detection accuracy and real-time sound separation.

To solve the above technical problem, an embodiment of the present invention provides a single-channel noise detection method, comprising:

inputting a mixed sound signal into a pre-trained noise separation model to obtain several single-source noise signals, the noise separation model being constructed from a time-domain encoder and a dual-path recurrent neural network;

inputting the separated single-source noise signals into a pre-trained noise enhancement model and enhancing each single-source noise signal, the noise enhancement model being constructed from a time-frequency-domain generative adversarial network;

acquiring each enhanced single-source noise signal, evaluating each single-source noise signal against preset indicators, and completing noise detection.

The present invention builds a noise separation model from a time-domain encoder and a dual-path recurrent neural network to separate the mixed sound signal; by capturing the feature information of the sound signal through the dual paths, the noise sources in the mixed sound signal can be separated quickly and effectively. At the same time, a time-frequency-domain generative adversarial network enhancement module enhances the separated single-source noise, making it clearer and easier to evaluate against the indicators, thereby completing the quantitative analysis of the noise and improving the accuracy of noise monitoring.

Further, inputting the mixed sound signal into the pre-trained noise separation model to obtain several single-source noise signals is specifically:

performing feature extraction on the mixed sound signal with the time-domain encoder to obtain a time-frequency feature representation of the mixed sound signal;

inputting the time-frequency feature representation into the dual-path recurrent neural network, extracting the time-domain features and frequency features of the representation, and obtaining time-series information from them, wherein the forward path of the dual-path recurrent neural network extracts the time-domain information of the audio signal and the backward path extracts its frequency-domain information;

inputting the time-series information into a first modeling layer and a second modeling layer to obtain local features and contextual information of the time-series information;

obtaining several single-source masks from the local features and the contextual information, and converting the single-source masks into single-source noise signals through a decoder.

Further, inputting the time-series information into the first and second modeling layers to obtain its local features and contextual information is specifically:

performing feature extraction on the time-series information within a local time range with the first modeling layer to obtain the local features of the time-series information;

modeling and integrating the local features and the time-series information with the second modeling layer to obtain the contextual information of the mixed sound signal.

Further, before inputting the separated single-source noise signals into the pre-trained noise enhancement model, the method also includes:

adjusting the weight parameters of the noise enhancement model according to a discriminator network, the noise enhancement model comprising an encoder and a time-frequency-domain generative adversarial network.

Further, adjusting the weight parameters of the noise enhancement model according to the discriminator network is specifically:

inputting first audio data into the encoder and then the time-frequency-domain generative adversarial network to obtain second audio data, the first audio data being each single-source noise signal obtained after the original noise data has been processed by the noise separation model;

calculating a similarity index value between the second audio data and the original audio data with the discriminator network;

performing a discrimination operation on the second audio data according to a preset threshold and the similarity index value, and adjusting the weight parameters of the noise enhancement model according to the discrimination result.

Further, inputting the separated single-source noise signals into the pre-trained noise enhancement model and enhancing each single-source noise signal is specifically:

converting each single-source noise signal with the short-time Fourier transform to obtain the time-frequency spectrogram of each single-source noise signal;

inputting the spectrograms into the encoder and then the time-frequency-domain generative adversarial network for enhancement to obtain the time-frequency features of each spectrogram, the time-frequency-domain generative adversarial network comprising several TS-Conformer layers;

processing the time-frequency features with a mask decoder and a complex decoder to generate the second audio data of each single-source noise signal, the second audio data being the enhanced audio segment of each single-source noise signal.
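The first step above, converting each signal to a time-frequency spectrogram, can be sketched in plain NumPy. This is an illustrative implementation only: the frame length, hop size, and Hann window below are common defaults, not values specified by the patent.

```python
import numpy as np

def stft_magnitude(x, frame_len=512, hop=128):
    """Short-time Fourier transform magnitude spectrogram.

    Frames the signal with a Hann window and takes a one-sided FFT
    per frame; frame_len/hop are illustrative, not from the patent.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)      # bins 0 .. frame_len/2
    return np.abs(spec)                      # (n_frames, frame_len//2 + 1)

# Usage: a one-second 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = stft_magnitude(tone)
print(S.shape)
```

The resulting magnitude (or a complex variant of it) is what the encoder and TS-Conformer stack would consume.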

Further, acquiring each enhanced single-source noise signal, evaluating each single-source noise signal against preset indicators, and completing noise detection is specifically:

acquiring each enhanced single-source noise signal and calculating its A-weighted sound pressure level;

grading and classifying each single-source noise signal according to its A-weighted sound pressure level, and outputting a detection report to complete noise monitoring.
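The patent does not give formulas for the A-weighted sound pressure level, so the sketch below uses the standard IEC 61672 analytic A-weighting curve as an assumed implementation; the level returned is relative to full scale, not a calibrated absolute level.

```python
import numpy as np

def a_weighting_db(f):
    """A-weighting gain in dB at frequency f (Hz), per the standard
    IEC 61672 analytic curve (approximately 0 dB at 1 kHz)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * np.log10(ra) + 2.0

def a_weighted_level(x, sr):
    """A-weighted level of a signal in dB relative to full scale:
    weight the spectrum, sum the energy, convert back to decibels.
    Illustrative only -- a calibrated meter needs a reference pressure."""
    spec = np.fft.rfft(x * np.hanning(len(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    weights = 10.0 ** (a_weighting_db(freqs[1:]) / 20.0)  # skip DC bin
    energy = np.sum((np.abs(spec[1:]) * weights) ** 2)
    return 10.0 * np.log10(energy / len(x) ** 2 + 1e-20)

# The curve is ~0 dB at 1 kHz and strongly attenuates low frequencies
print(round(float(a_weighting_db(1000.0)), 2), round(float(a_weighting_db(100.0)), 1))
```

Grading then reduces to comparing the returned level against the preset thresholds for each noise class.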

Further, before inputting the mixed sound signal into the pre-trained noise separation model, the method also includes:

adjusting the sampling rate of each sound signal in the mixed sound signal according to a preset sampling rate to obtain a mixed sound signal with a consistent sampling rate;

calculating a normalization ratio according to the maximum dBFS value of the mixed sound signal, and adjusting the waveform of each audio signal in the mixed sound signal according to the normalization ratio;

dynamically adjusting the mixing ratio of the mixed sound signal according to the noise detection scene, and performing data augmentation on the mixed sound signal to construct a mixed sound signal set, the data augmentation comprising signal-to-noise-ratio adjustment, time-shift operations, and multi-source mixing operations.

In a second aspect, the present invention provides a single-channel noise detection device, comprising a noise separation module, a noise enhancement module, and an evaluation module;

the noise separation module is used to input a mixed sound signal into a pre-trained noise separation model to obtain several single-source noise signals, the noise separation model being constructed from a time-domain encoder and a dual-path recurrent neural network;

the noise enhancement module is used to input the separated single-source noise signals into a pre-trained noise enhancement model and enhance each single-source noise signal, the noise enhancement model being constructed from a time-frequency-domain generative adversarial network;

the evaluation module is used to acquire each enhanced single-source noise signal, evaluate each single-source noise signal against preset indicators, and complete noise detection.

In a third aspect, the present invention provides a computer device comprising a processor, a communication interface, and a memory connected to one another, wherein the memory stores executable program code and the processor is configured to call the executable program code to execute the single-channel noise detection method.

Brief Description of the Drawings

Figure 1 is a schematic flow chart of the single-channel noise detection method provided by an embodiment of the present invention;

Figure 2 is a schematic diagram of a training process of the noise separation model provided by an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a time-domain encoder provided by an embodiment of the present invention;

Figure 4 is a schematic structural diagram of the dual-path recurrent neural network provided by an embodiment of the present invention;

Figure 5 is a schematic structural diagram of a decoder of the single-channel noise detection method provided by an embodiment of the present invention;

Figure 6 is a schematic structural diagram of an encoder of the noise enhancement model provided by an embodiment of the present invention;

Figure 7 is a schematic structural diagram of the time-frequency-domain generative adversarial network provided by an embodiment of the present invention;

Figure 8 is a schematic diagram of a mask decoder of the noise enhancement model provided by an embodiment of the present invention;

Figure 9 is a schematic diagram of a complex decoder of the noise enhancement model provided by an embodiment of the present invention;

Figure 10 is a schematic diagram of the discriminator network structure of the single-channel noise detection method provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

Embodiment 1

Please refer to Figure 1, a schematic flow chart of the single-channel noise detection method provided by an embodiment of the present invention, comprising steps 101 to 103, as follows:

Step 101: input the mixed sound signal into a pre-trained noise separation model to obtain several single-source noise signals; the noise separation model is constructed from a time-domain encoder and a dual-path recurrent neural network.

In this embodiment, before the mixed sound signal is input into the pre-trained noise separation model, the method also includes:

adjusting the sampling rate of each sound signal in the mixed sound signal according to a preset sampling rate to obtain a mixed sound signal with a consistent sampling rate;

calculating a normalization ratio according to the maximum dBFS value of the mixed sound signal, and adjusting the waveform of each audio signal in the mixed sound signal according to the normalization ratio;

dynamically adjusting the mixing ratio of the mixed sound signal according to the noise detection scene, and performing data augmentation on the mixed sound signal to construct a mixed sound signal set; the data augmentation comprises signal-to-noise-ratio adjustment, time-shift operations, and multi-source mixing operations.

In this embodiment, a clean noise database is established by collecting pure noise sources and preprocessing them. The preprocessing specifically comprises standardization and augmentation-based expansion, so that the mixed noise signals in the database achieve sampling-rate matching, energy normalization, duration alignment, volume balance, mixing-ratio control, and data augmentation.

In this embodiment, if the mixed speech signal set does not match the preset sampling rate, it can be adjusted accordingly to ensure that all speech signals share the same sampling rate; the sampling rate is preset according to the requirements of the model or algorithm.
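The sampling-rate matching step can be sketched as a simple linear-interpolation resampler. This is an assumed, minimal implementation; production preprocessing would normally use a polyphase or windowed-sinc resampler instead.

```python
import numpy as np

def resample_linear(x, sr_in, sr_out):
    """Resample a waveform from sr_in to the preset sr_out by
    linear interpolation over a common time axis."""
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

# Usage: bring a one-second 44.1 kHz clip down to a preset 16 kHz
clip = np.random.randn(44100)
clip16k = resample_linear(clip, 44100, 16000)
print(len(clip16k))
```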

In this embodiment, the volume of the mixed sound signal is standardized according to dBFS (decibels relative to full scale). Specifically, the dBFS value of the input is calculated as:

dBFS_input = 20 × log10(max |waveform_old|)   (1)

where waveform_old is the input waveform before normalization and dBFS_input is the dBFS value of the input audio waveform signal.

Each sound signal in the mixed sound signal is then normalized according to the normalization ratio, specifically:

waveform_new = waveform_old × 10^((dBFS_max − dBFS_input) / 20)   (2)

where waveform_new is the normalized waveform, waveform_old is the input waveform before normalization, and dBFS_max is the maximum dBFS value of the signals in the mix.

In this embodiment, volume standardization ensures the energy balance and consistency of the mixed speech and prevents some source signals from being excessively loud, which would affect the performance and results of the noise separation model.
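Equations (1) and (2) can be sketched directly in NumPy. The target level is passed as an argument here, since the exact reference (the maximum dBFS in the mix, per the text) is a configuration choice of the preprocessing.

```python
import numpy as np

def dbfs(x):
    """Peak dBFS of a waveform, equation (1): 20*log10(max |x|)."""
    return 20.0 * np.log10(np.max(np.abs(x)))

def normalize_to(x, target_dbfs):
    """Scale a waveform so its peak dBFS matches target_dbfs,
    equation (2) with target_dbfs in the role of dBFS_max."""
    gain = 10.0 ** ((target_dbfs - dbfs(x)) / 20.0)
    return x * gain

# Usage: bring two sources of very different loudness to the same level
a = 0.5 * np.sin(np.linspace(0, 100, 16000))
b = 0.05 * np.sin(np.linspace(0, 150, 16000))
a_n, b_n = normalize_to(a, -3.0), normalize_to(b, -3.0)
print(round(dbfs(a_n), 2), round(dbfs(b_n), 2))
```

After this step every source peaks at the same dBFS, which is the energy-balance property the paragraph above describes.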

In this embodiment, the mixing ratio is dynamically adjusted during training so that the noise separation model and the noise enhancement model can adapt to speech separation tasks with different mixing ratios. By introducing samples with different mixing ratios into the training data according to the detection scene or detection task, the contributions of the different audio sources are balanced in the mixed speech, ensuring that the signal of the target audio source can be accurately extracted and separated and improving the robustness and generalization ability of the model.

In this embodiment, new samples are generated by performing data augmentation on the mixed sound signal; the augmentation comprises signal-to-noise-ratio adjustment, time-shift operations, and multi-source mixing operations.

In this embodiment, adjusting the SNR value of the mixed sound signal simulates different noise levels and degrees of noise interference. Randomly applying different SNR values simulates different noise levels, enlarges the set of data samples, and adds noise separation scenarios. Training the model under different noise levels and signal-to-noise-ratio conditions in this way improves its robustness and generalization ability.

In this embodiment, applying a time shift to the mixed speech signal increases the diversity of the training data, improving the model's ability to separate sound signals under different temporal variations and time offsets. In addition, superimposing several clean noise signals simulates real scenes in which multiple noise sources are present, increasing the sample diversity of the mixed sound signal set and improving the accuracy of the model's noise detection.
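The SNR-adjustment and time-shift augmentations above can be sketched as follows; the function and argument names are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db,
    then superimpose the two signals (SNR-adjustment augmentation)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def time_shift(x, max_shift):
    """Circularly shift a waveform by a random offset
    (time-shift augmentation)."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1))

# Usage: mix two sources at 5 dB SNR and verify the achieved ratio
s = rng.standard_normal(16000)
n = rng.standard_normal(16000)
mixed = mix_at_snr(s, n, 5.0)
achieved = 10 * np.log10(np.mean(s ** 2) / np.mean((mixed - s) ** 2))
print(round(float(achieved), 2))  # prints 5.0
```

Multi-source mixing is the same superposition applied to more than one noise signal, each scaled to its own target SNR.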

The present invention adopts data preprocessing and data augmentation: by processing and expanding the existing pure noise sources, it improves the richness and diversity of the data, so that the noise separation model and the noise enhancement model can be trained on varied data samples, improving the accuracy and stability of noise separation and quantitative analysis.

In this embodiment, the noise separation model is constructed and trained on the audio data in the mixed sound signal set.

Please refer to Figure 2, a schematic diagram of a training process of the noise separation model provided by an embodiment of the present invention.

In this embodiment, a mixed sound signal is taken from the mixed sound signal set and its feature representation is extracted by the encoder. The separation masks of the mixed sound signal are then obtained by the dual-path recurrent neural network, and the decoder decodes the separation masks to generate the separated noise sources.

In this embodiment, the encoder is specifically a time-domain encoder, which converts the input mixed noise signal into a time-domain feature representation.

Please refer to Figure 3, a schematic structural diagram of a time-domain encoder provided by an embodiment of the present invention.

In this embodiment, the time-domain encoder comprises a 1D convolution layer, an activation function, and a fully connected layer. The 1D convolution layer filters the input speech signal and extracts features in the time domain; the activation function is applied to the output of the convolution layer to introduce a nonlinear transformation.

Please refer to Figure 4, a schematic structural diagram of the dual-path recurrent neural network provided by an embodiment of the present invention.

In this embodiment, the dual-path recurrent neural network is composed of a series of bidirectional recurrent units, each with a forward path and a backward path. The forward and backward paths extract time-domain features and frequency features, respectively, from the output of the time-domain encoder, and cross-path connections enhance the expressive power of the features. In each bidirectional unit, the input features are processed along the forward and backward paths to obtain the unit's output features; by propagating information forward and backward along the time dimension, the bidirectional recurrent network captures the contextual information of the input sequence.

In this embodiment, the dual-path recurrent neural network comprises a local feature modeling layer and a global feature modeling layer. During training, the mixed sound signal is encoded to extract features and then cut into chunks; the chunks are stacked into a 3D tensor and fed to the local feature modeling layer, which models and enhances local features and comprises a 1D convolution layer, a nonlinear activation function, and a normalization layer. Its output is then fed to the global feature modeling layer, which models and integrates the global features. The final output is processed and overlap-added to obtain the separation masks.
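The chunking into a 3D tensor and the overlap-add back to a sequence can be sketched as below; the chunk and hop sizes are illustrative, and the modeling layers between the two steps are omitted.

```python
import numpy as np

def segment(x, chunk, hop):
    """Cut a feature sequence of shape (frames, feat) into overlapping
    chunks and stack them into a 3D tensor (n_chunks, chunk, feat)."""
    n = 1 + max(0, x.shape[0] - chunk) // hop
    return np.stack([x[i * hop : i * hop + chunk] for i in range(n)])

def overlap_add(chunks, hop):
    """Inverse of `segment`: overlap-add the chunks back into a sequence,
    dividing by the overlap count so a round trip is an identity."""
    n, chunk, feat = chunks.shape
    length = (n - 1) * hop + chunk
    out = np.zeros((length, feat))
    norm = np.zeros((length, 1))
    for i in range(n):
        out[i * hop : i * hop + chunk] += chunks[i]
        norm[i * hop : i * hop + chunk] += 1.0
    return out / norm

feats = np.random.randn(100, 64)         # 100 frames of 64-dim features
blocks = segment(feats, chunk=20, hop=10)
recon = overlap_add(blocks, hop=10)
print(blocks.shape, np.allclose(recon, feats))
```

In the model proper, the local and global modeling layers transform `blocks` before the overlap-add produces the separation masks.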

Please refer to Figure 5, a schematic structural diagram of a decoder of the single-channel noise detection method provided by an embodiment of the present invention.

In this embodiment, the decoder further processes the extracted features to restore the time-domain and frequency-domain details of the source audio. Specifically, the decoder captures the time-frequency feature differences between the source audios through operations such as convolution and pooling, and processes the separation features through operations such as deconvolution to restore the frequency-domain details of the source audio. The deconvolution expands the size of the feature map so that it has the same time and frequency resolution as the original audio, and activation functions introduce nonlinearity to increase the expressive power and adaptability of the model.

In this embodiment, during the training stage the noise separation model is also jointly optimized with several loss functions: the scale-invariant signal-to-noise ratio, a waveform loss function, and a mask-sum loss function.

In this embodiment, the scale-invariant signal-to-noise ratio loss is specifically:

Loss1 = −∑ 10 × log10(||s_i||² / ||e_i||²)   (3)

where s_i is the speech component of the i-th noise source in the original mixed signal, and e_i is the residual (i.e., the separation error) between the estimated signal of the i-th noise source obtained after separation and its original speech signal.
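A per-source sketch of equation (3) follows. The patent only gives the s_i / e_i form, so the zero-mean and scaling projection below are an assumption taken from the usual SI-SNR definition.

```python
import numpy as np

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between an estimated and a
    reference source: project the estimate onto the reference to get
    s_target (s_i), take the residual as e_noise (e_i), and return
    -10*log10(||s_i||^2 / ||e_i||^2)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return -10.0 * np.log10(np.dot(s_target, s_target) /
                            (np.dot(e_noise, e_noise) + eps) + eps)

# A rescaled but otherwise perfect estimate still scores very well,
# which is exactly the scale invariance the loss is chosen for
ref = np.sin(np.linspace(0, 20, 4000))
print(si_snr_loss(0.3 * ref, ref) < -40)
```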

In this embodiment, the waveform loss function is specifically:

Loss2 = MSE(waveform_estimated, waveform_origin)   (4)

where waveform_estimated is the waveform estimated by the separation network and waveform_origin is the corresponding original single-source waveform.

In this embodiment, the mask-sum loss function is specifically:

Loss3 = MSE(∑mask_i, 1)   (5)

where mask_i is the estimated mask of the i-th source and ∑mask_i is the element-wise sum of the masks of all sources.

In this embodiment, the noise separation model is jointly trained with the three loss functions; the total loss function of the noise separation model is:

Loss_total = Loss1 + 10 × Loss2 + 100 × Loss3   (6)
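Equations (4)–(6) combine as sketched below; the array shapes and toy values are illustrative, while the weights 10 and 100 are those given in equation (6).

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def total_loss(loss1, est_wave, ref_wave, masks):
    """Total loss of equation (6): Loss1 + 10*Loss2 + 100*Loss3.
    `masks` has shape (n_sources, frames, bins); `loss1` is the
    SI-SNR term computed separately."""
    loss2 = mse(est_wave, ref_wave)         # waveform loss, eq. (4)
    loss3 = mse(masks.sum(axis=0), 1.0)     # mask-sum loss, eq. (5)
    return loss1 + 10.0 * loss2 + 100.0 * loss3

# Usage with toy values: perfect waveform, masks that sum to one,
# so only the SI-SNR term survives
masks = np.full((2, 5, 4), 0.5)
wave = np.ones(100)
print(total_loss(-20.0, wave, wave, masks))  # prints -20.0
```

The heavy weight on Loss3 reflects the text's emphasis on keeping the masks physically feasible (summing to 1).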

In this embodiment, the network parameters are updated iteratively from the single-source noise signals separated by the noise separation model and the original pure single-source signals, using the total loss function as the optimization criterion. The iteration stops when the model converges or reaches the target performance, completing the training of the noise separation model.

In this embodiment, the scale-invariant signal-to-noise ratio, the waveform loss function, and the mask-sum loss function improve the robustness of the noise separation model to signal amplitude and adapt the model to the data distributions and variations of different sound categories, so that it can handle diverse sound scenes. Minimizing the gap between the sum of the single-source masks and 1 as a loss function ensures the physical feasibility of the separation results, improves separation performance, avoids over- or under-separation, and improves numerical stability.

在本实施例中,所述将混合声音信号输入预先训练好的噪声分离模型,获取若干个单源噪声信号,具体为:In this embodiment, the mixed sound signal is input into a pre-trained noise separation model to obtain several single-source noise signals, specifically:

根据时域编码器对所述混合声音信号进行特征提取,获取所述混合声音信号的时频特征表示;Perform feature extraction on the mixed sound signal according to the time domain encoder to obtain the time-frequency feature representation of the mixed sound signal;

将所述时频特征表示输入所述双路径神经循环网络,提取所述时域特征表示的时域特征和频率特征,并根据所述时域特征和频率特征获取时间序列信息;其中,所述双路径神经循环网络的前向路径提取音频信号的时域信息,所述双路径神经循环网络的后向路径提取音频信号的频域信息;Input the time-frequency feature representation into the dual-path recurrent neural network, extract the time-domain features and frequency features of the feature representation, and obtain time series information from them; the forward path of the dual-path recurrent network extracts the time-domain information of the audio signal, and the backward path extracts its frequency-domain information;

将所述时间序列信息输入第一建模层和第二建模层,获取所述时间序列信息的局部特征和上下文信息;Input the time series information into the first modeling layer and the second modeling layer, and obtain the local features and contextual information of the time series information;

根据所述局部特征和所述上下文信息获取若干个单源掩码,并通过解码器将所述单源掩码转换为单源噪声信号。Several single-source masks are obtained according to the local features and the context information, and the single-source masks are converted into single-source noise signals through a decoder.

在本实施例中,根据时域编码器提取所述混合声音信号的时频特征表示,并将所述时域特征输入双路径循环神经网络。其中,时域特征经过前向和后向路径的处理后,得到该单元的输出特征,通过在时间维度上前向和后向传播信息,从而获取时间序列信息,并将所述时间序列信息输入第一建模层和第二建模层,从而获取局部特征和上下文信息,最后对所述局部特征信息和上下文信息进行建模和整合,输出结果进行处理以及重叠相加得到分离掩码。In this embodiment, the time-frequency feature representation of the mixed sound signal is extracted by the time-domain encoder, and the features are fed into the dual-path recurrent neural network. After the features are processed by the forward and backward paths, the output features of the unit are obtained; by propagating information forward and backward along the time dimension, time series information is obtained and fed into the first and second modeling layers to extract local features and context information. Finally, the local features and context information are modeled and integrated, and the outputs are processed and overlap-added to obtain the separation masks.

在本实施例中,所述将所述时间序列信息输入第一建模层和第二建模层,获取所述时间序列信息的局部特征和上下文信息,具体为:In this embodiment, the time series information is input into the first modeling layer and the second modeling layer to obtain the local features and context information of the time series information, specifically:

根据第一建模层对所述时间序列信息在局部时间范围内进行特征提取,获取所述时间序列信息的局部特征;Perform feature extraction on the time series information within a local time range according to the first modeling layer to obtain local features of the time series information;

根据所述第二建模层对所述局部特征和时间序列信息进行建模整合,获取所述混合声音信号的上下文信息。The local features and time series information are modeled and integrated according to the second modeling layer to obtain the context information of the mixed sound signal.

在本实施例中,所述第一建模层为局部特征建模层,第二建模层为全局特征建模层。所述局部特征建模层中根据一维卷积层、非线性激活函数、规范化层对局部特征进行建模和增强,通过1D卷积操作对时间序列信息进行滑动窗口处理,并利用卷积核在局部时间范围内提取特征,以捕获时间序列信息的局部结构和相关信息。将所述局部结构和相关信息输入所述全局特征建模层,在所述全局特征建模层中根据循环神经网络对整个时间序列信息进行建模,以获取全局的上下文信息。In this embodiment, the first modeling layer is a local feature modeling layer and the second is a global feature modeling layer. The local feature modeling layer models and enhances local features with a one-dimensional convolution layer, a nonlinear activation function, and a normalization layer: the 1D convolution slides a window over the time series information and uses its kernel to extract features within a local time range, capturing the local structure and related information of the sequence. The local structure and related information are then fed into the global feature modeling layer, where a recurrent neural network models the entire time series to obtain global context information.
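As a rough, framework-free illustration of the local feature modeling layer (1D convolution, nonlinear activation, normalization), a numpy sketch might look like this (function names are our own):

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1-D convolution via a sliding window over the sequence."""
    K = len(kernel)
    n = (len(x) - K) // stride + 1
    return np.array([np.dot(x[i * stride: i * stride + K], kernel)
                     for i in range(n)])

def local_block(x, kernel):
    """Conv -> ReLU -> normalization, a minimal stand-in for the
    local feature modeling layer described above."""
    h = conv1d(x, kernel)
    h = np.maximum(h, 0.0)                    # nonlinear activation
    return (h - h.mean()) / (h.std() + 1e-8)  # normalization layer
```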

在本实施例中,通过对编码器输出的特征表示进行切块、块内建模、块间建模,最后重叠相加后与原始混合声音运算得到估计的单源掩码,再通过解码器将单源掩码恢复成分离出的单源噪声信号。In this embodiment, the feature representation output by the encoder is chunked, modeled within chunks and across chunks, and finally overlap-added; combining the result with the original mixture yields the estimated single-source masks, which the decoder then converts back into the separated single-source noise signals.
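The chunking and overlap-add steps can be sketched as follows; with 50% overlap, overlap-add exactly inverts the chunking (a numpy illustration with an assumed chunk geometry):

```python
import numpy as np

def segment(x, K):
    """Split a sequence into 50%-overlapping chunks of length K
    (the chunking step of dual-path processing; zero-padded at both ends)."""
    P = K // 2
    x = np.pad(x, (P, P))
    n_chunks = (len(x) - K) // P + 1
    return np.stack([x[i * P: i * P + K] for i in range(n_chunks)])

def overlap_add(chunks, length):
    """Inverse of segment(): overlap-add the chunks, renormalize by the
    overlap count, and trim the padding."""
    n, K = chunks.shape
    P = K // 2
    out = np.zeros(n * P + P)
    count = np.zeros(n * P + P)
    for i, c in enumerate(chunks):
        out[i * P: i * P + K] += c
        count[i * P: i * P + K] += 1.0
    return (out / np.maximum(count, 1.0))[P: P + length]
```

In the full model the intra-chunk and inter-chunk RNNs operate on the stacked chunks between these two steps; here only the reversible chunk geometry is shown.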

步骤102:将分离到的单源噪声信号输入预先训练好的噪声增强模型,分别对各个单源噪声信号进行增强操作;所述噪声增强模型根据时频域生成对抗网络构建;Step 102: Input the separated single-source noise signal into the pre-trained noise enhancement model, and perform enhancement operations on each single-source noise signal respectively; the noise enhancement model is constructed based on the time-frequency domain generative adversarial network;

在本实施例中,所述将分离到的单源噪声信号输入预先训练好的噪声增强模型,分别对各个单源噪声信号进行增强操作,具体为:In this embodiment, the separated single-source noise signal is input into a pre-trained noise enhancement model, and enhancement operations are performed on each single-source noise signal, specifically as follows:

分别根据短时傅里叶变换对各个单源噪声信号进行转换操作,获取各个单源噪声信号的时频谱图;Convert each single-source noise signal according to the short-time Fourier transform to obtain the time-frequency spectrum diagram of each single-source noise signal;
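The conversion step can be sketched with a minimal hand-rolled STFT (Hann window; the frame sizes are assumed values, not taken from the patent):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal short-time Fourier transform producing the complex
    time-frequency spectrogram used as the enhancer's input."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames]).T   # (freq, time)

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone
spec = stft(tone)
peak_bin = int(np.abs(spec).mean(axis=1).argmax())
print(peak_bin * fs / 256)                  # 1000.0
```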

将所述时频谱图依次输入编码器和时频域生成对抗网络进行增强操作,获取所述各个时频谱图的时频特征;所述时频域生成对抗网络包括若干个TS-Conformer层;The time-frequency spectrum diagrams are sequentially input into the encoder and the time-frequency domain generative adversarial network for enhancement operations to obtain the time-frequency characteristics of each time-frequency spectrum diagram; the time-frequency domain generation adversarial network includes several TS-Conformer layers;

根据掩码解码器和复数解码器对所述时频特征进行处理,生成各个单源噪声信号的第二音频数据,所述第二音频数据为各个单源噪声信号增强后的音频片段。The time-frequency features are processed according to the mask decoder and the complex decoder to generate second audio data of each single-source noise signal, where the second audio data is an enhanced audio segment of each single-source noise signal.

在本实施例中,所述噪声增强模型由时频域生成对抗网络构建,在训练时获取噪声分离模型输出的单源噪声信号,将语音波形转换为复时频谱图,然后通过幂律压缩的方法获得压缩频谱,分别得到压缩谱图的幅度、相位、实部和虚部。然后将压缩谱图的幅度、实部和虚部拼接在一起作为编码器的输入,并通过编码器提取所述时频谱图的特征表示,从而输入所述生成器网络,以根据解码器生成预测的干净音频。如图6所示,图6为本发明实施例提供的噪声增强模型的一种编码器结构示意图。In this embodiment, the noise enhancement model is built from a time-frequency domain generative adversarial network. During training, the single-source noise signals output by the noise separation model are obtained, each waveform is converted into a complex time-frequency spectrogram, and power-law compression is applied to obtain the compressed spectrum, from which the magnitude, phase, real part, and imaginary part are extracted. The magnitude, real part, and imaginary part of the compressed spectrogram are then concatenated as the encoder input; the encoder extracts a feature representation of the spectrogram, which is fed into the generator network so that the decoders can generate the predicted clean audio. As shown in Figure 6, Figure 6 is a schematic structural diagram of an encoder of the noise enhancement model provided by an embodiment of the present invention.
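The power-law compression and the stacking of the encoder input might look like the following (the exponent 0.3 is an assumed value; the patent does not specify it):

```python
import numpy as np

def power_law_compress(spec, c=0.3):
    """Compress the magnitude of a complex spectrogram, keep its phase."""
    return np.abs(spec) ** c * np.exp(1j * np.angle(spec))

spec = np.array([[3 + 4j, -2j]])
comp = power_law_compress(spec)
# Encoder input: magnitude, real part, and imaginary part stacked together
enc_in = np.stack([np.abs(comp), comp.real, comp.imag])
```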

请参照图7,图7为本发明实施例提供的时频域生成对抗网络的一种结构示意图。Please refer to FIG. 7 , which is a schematic structural diagram of a time-frequency domain generative adversarial network provided by an embodiment of the present invention.

在本实施例中,所述时频域生成对抗网络即生成器网络,所述生成器网络包括四个TS-Conformer层。TS-Conformer是一种结合了Transformer和Conformer的模型,其包括两个Conformer,所述Conformer能够捕获长距离的依赖关系,使得模型能够更好地理解上下文信息,提高序列建模的准确性。In this embodiment, the time-frequency domain generative adversarial network is the generator network, which includes four TS-Conformer layers. TS-Conformer is a model that combines Transformer and Conformer; each layer contains two Conformers, which capture long-distance dependencies, allowing the model to better understand contextual information and improving the accuracy of sequence modeling.

请参照图9和图10,图9为本发明实施例提供的噪声增强模型的一种掩码解码器示意图;图10为本发明实施例提供的噪声增强模型的一种复数解码器示意图。Please refer to Figures 9 and 10. Figure 9 is a schematic diagram of a mask decoder of the noise enhancement model provided by an embodiment of the present invention; Figure 10 is a schematic diagram of a complex decoder of the noise enhancement model provided by an embodiment of the present invention.

在本实施例中,所述解码器以解耦的方式从4层的TS-Conformer模块中提取输出作为解码器的输入。解码器包括掩码解码器和复数解码器。In this embodiment, the decoder takes the output of the four-layer TS-Conformer module as its input in a decoupled manner. The decoder comprises a mask decoder and a complex decoder.

在本实施例中,所述掩码解码器包括一个扩张的Dense Net模块、一个子像素卷积层和两个卷积层。所述扩张的Dense Net模块与编码器相同,通过扩张卷积、密集连接和特征重用等机制,从而增强解码器的感受野和上下文理解能力,提高解码器的性能,从而产生更准确和具有高质量的语音重建结果。所述子像素卷积层用于将频率维度变回初始大小。In this embodiment, the mask decoder includes a dilated DenseNet module, a sub-pixel convolution layer, and two convolution layers. The dilated DenseNet module is the same as in the encoder; through mechanisms such as dilated convolution, dense connections, and feature reuse, it enlarges the decoder's receptive field and context understanding, improving decoder performance and producing more accurate, higher-quality speech reconstruction results. The sub-pixel convolution layer restores the frequency dimension to its original size.

在本实施例中,所述复数解码器将掩码解码器的输出幅度与压缩谱图的相位组合得到幅度增强的复谱图,然后与复数解码器的实部和虚部输出分别逐元素相加得到最终的复谱图。将最终的复谱图经过反转幂律压缩和逆短时傅里叶变换(ISTFT)后得到最终的增强后的音频片段。In this embodiment, the complex decoder combines the output magnitude of the mask decoder with the phase of the compressed spectrogram to obtain a magnitude-enhanced complex spectrogram, which is then added element-wise to the real-part and imaginary-part outputs of the complex decoder to obtain the final complex spectrogram. Applying inverse power-law compression and the inverse short-time Fourier transform (ISTFT) to the final complex spectrogram yields the final enhanced audio segment.
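The decoupled reconstruction described above can be sketched as follows (the compression exponent 0.3 is an assumed value):

```python
import numpy as np

def reconstruct(mag_masked, phase, real_res, imag_res, c=0.3):
    """Combine the mask decoder's magnitude with the compressed phase,
    add the complex decoder's real/imaginary outputs element-wise,
    then invert the power-law compression (exponent c assumed)."""
    spec = mag_masked * np.exp(1j * phase)       # magnitude-enhanced spectrum
    spec = spec + real_res + 1j * imag_res       # element-wise residual
    return np.abs(spec) ** (1.0 / c) * np.exp(1j * np.angle(spec))
```

With zero residuals and an exact compressed magnitude and phase, this inverts the compression exactly; an inverse STFT of the result would then yield the enhanced waveform.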

在本实施例中,在所述将分离到的单源噪声信号输入预先训练好的噪声增强模型之前,还包括:In this embodiment, before inputting the separated single-source noise signal into the pre-trained noise enhancement model, it also includes:

根据鉴别器网络调整所述噪声增强模型的权重参数,所述噪声增强模型包括编码器和时频域生成对抗网络。The weight parameters of the noise enhancement model are adjusted according to the discriminator network, and the noise enhancement model includes an encoder and a time-frequency domain generative adversarial network.

在本实施例中,所述根据鉴别器网络调整所述噪声增强模型的权重参数,具体为:In this embodiment, the weight parameters of the noise enhancement model are adjusted according to the discriminator network, specifically:

将第一音频数据依次输入编码器和时频域生成对抗网络,获取第二音频数据;所述第一音频数据为原始噪声数据经过所述噪声分离模型处理后的各个单源噪声信号;The first audio data is sequentially input into the encoder and the time-frequency domain generative adversarial network to obtain the second audio data; the first audio data is each single-source noise signal after the original noise data has been processed by the noise separation model;

根据鉴别器网络计算所述第二音频数据和原始音频数据的相似度指标值;Calculate the similarity index value of the second audio data and the original audio data according to the discriminator network;

根据预设阈值和所述相似度指标值对所述第二音频数据进行鉴别操作,并根据鉴别结果调整所述噪声增强模型的权重参数。Perform an identification operation on the second audio data according to the preset threshold and the similarity index value, and adjust the weight parameter of the noise enhancement model according to the identification result.

请参照图10,图10为本发明实施例提供的单通道噪声检测方法的鉴别器网络结构示意图。Please refer to FIG. 10 , which is a schematic diagram of the discriminator network structure of the single-channel noise detection method provided by an embodiment of the present invention.

在本实施例中,通过计算生成器生成语音和原始音频之间的相似度评价分数来判断所述生成器网络生成的语音片段的优劣。所述鉴别器由四个卷积块组成,在卷积块之后进行全局平均池化,然后通过两个前馈层,最后经过一个Sigmoid激活函数得到最后的结果。In this embodiment, the quality of the speech clips generated by the generator network is judged by computing a similarity evaluation score between the generated speech and the original audio. The discriminator consists of four convolution blocks, followed by global average pooling, two feedforward layers, and finally a Sigmoid activation function that produces the final result.

在本实施例中,具体的,以PESQ分数作为鉴别器的评价指标,将原始音频数据和第二音频数据作为输入来估计最大归一化的PESQ分数。鉴别器预设一个阈值,当原始音频数据和第二音频数据之间的PESQ得分达到阈值就认为此生成语音足够优秀,最终输出该生成语音;当原始音频数据和第二音频数据之间的PESQ得分低于此阈值时就要将此语音重新输入生成器网络继续训练,直到评价结果满足预设阈值再输出这部分语音。In this embodiment, specifically, the PESQ score is used as the discriminator's evaluation index, and the original audio data and the second audio data are taken as input to estimate the maximum normalized PESQ score. The discriminator presets a threshold: when the PESQ score between the original audio data and the second audio data reaches the threshold, the generated speech is considered good enough and is output; when the score is below the threshold, the speech is fed back into the generator network for further training until the evaluation result meets the preset threshold, at which point the speech is output.
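The threshold gate can be sketched as a simple control loop; `enhance` and `score` are caller-supplied stand-ins for the generator pass and the discriminator's normalized PESQ estimate (both names are hypothetical):

```python
def refine_until_pass(clip, enhance, score, threshold=0.8, max_iters=10):
    """Keep re-enhancing a clip until the discriminator's quality
    estimate reaches the threshold (or an iteration cap is hit)."""
    for _ in range(max_iters):
        if score(clip) >= threshold:
            return clip
        clip = enhance(clip)
    return clip

# Toy stand-ins: 'quality' is a scalar and each pass adds 0.2 to it
out = refine_until_pass(0.1, enhance=lambda a: a + 0.2, score=lambda a: a)
```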

在本实施例中,将未通过鉴别器鉴别的声音信号重新传递给生成器,生成器根据联合Loss函数重新调整网络模型的参数,让生成器输出更加接近于真实的噪声分布,即将之前未通过鉴别器鉴别的声音信号再通过生成器,直到该生成噪声信号可以通过鉴别器的鉴别。与此同时生成器也会对应地不断更新鉴别器,调整鉴别器参数,提高鉴别器识别噪声信号的能力。In this embodiment, sound signals that fail the discriminator's check are passed back to the generator, which readjusts the network parameters according to the joint loss function so that its output moves closer to the true noise distribution; that is, signals previously rejected by the discriminator pass through the generator again until the generated noise signal can pass the discriminator's check. At the same time, the discriminator is correspondingly and continuously updated and its parameters adjusted, improving its ability to identify noise signals.

在本实施例中,所述联合损失函数包括时频方面的联合误差、波形信号之间的误差和鉴别器得到幅度信息之间的误差。具体的:In this embodiment, the joint loss function includes the joint time-frequency error, the error between waveform signals, and the error between the amplitude information obtained by the discriminator. Specifically:

所述时频方面的联合误差包括幅度误差和实虚部误差的线性组合,表达式如下:The joint time-frequency error is a linear combination of the magnitude error and the real-imaginary part error, expressed as follows:

L_TF = α·L_Mag + (1−α)·L_RI  (7)

其中,α是一个超参数,我们这里将α的值设为0.7。L_Mag为预测结果和原始结果的幅度部分误差,L_RI为预测结果和原始结果的实虚部之间误差。L_Mag、L_RI用均方误差来计算,各自表达式如下:Among them, α is a hyperparameter, and we set its value to 0.7 here. L_Mag is the error between the magnitude parts of the predicted and original results, and L_RI is the error between their real and imaginary parts. L_Mag and L_RI are computed with the mean square error, with the following expressions:

L_Mag = E[(X_m − X̂_m)²]  (8)

其中,X_m代表原始语音的幅度信息,X̂_m代表预测语音的幅度信息,E表示均方误差操作。Among them, X_m represents the amplitude information of the original speech, X̂_m represents the amplitude information of the predicted speech, and E represents the mean square error operation.

L_RI = E[(X_r − X̂_r)²] + E[(X_i − X̂_i)²]  (9)

其中,X_r代表原始语音的实部信息,X̂_r代表预测语音的实部信息;X_i代表原始语音的虚部信息,X̂_i代表预测语音的虚部信息,E表示均方误差操作。Among them, X_r represents the real part information of the original speech and X̂_r that of the predicted speech; X_i represents the imaginary part information of the original speech and X̂_i that of the predicted speech; E represents the mean square error operation.
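The joint time-frequency loss of Eq. (7), with L_Mag and L_RI computed as mean squared errors as described, can be sketched in numpy with α = 0.7 as stated:

```python
import numpy as np

def l_tf(spec_pred, spec_true, alpha=0.7):
    """Joint time-frequency loss: alpha * L_Mag + (1 - alpha) * L_RI,
    both computed as mean squared errors."""
    l_mag = np.mean((np.abs(spec_true) - np.abs(spec_pred)) ** 2)
    l_ri = (np.mean((spec_true.real - spec_pred.real) ** 2)
            + np.mean((spec_true.imag - spec_pred.imag) ** 2))
    return alpha * l_mag + (1 - alpha) * l_ri
```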

所述波形信号之间的误差为预测的时域波形信号与原始波形信号之间的误差。具体公式如下:The error between the waveform signals is the error between the predicted time-domain waveform signal and the original waveform signal. The specific formula is as follows:

L_Time = E[|x − x̂|]  (10)

其中,x代表原始语音的时域信号,x̂代表预测信号的时域信号,时域波形信号之间的误差通过计算二者的平均绝对差得到。Among them, x represents the time-domain signal of the original speech and x̂ represents the time-domain signal of the predicted speech; the error between the time-domain waveform signals is obtained by computing the mean absolute difference between the two.

所述鉴别器得到幅度信息之间的误差用于预测的语音信号和原始语音信号之间的鉴别器部分误差:The error between the amplitude information obtained by the discriminator serves as the discriminator part of the error between the predicted speech signal and the original speech signal:

L_D = E[(D(X_m, X_m) − 1)²] + E[(D(X̂_m, X_m) − Q_PESQ)²]  (11)

其中,X_m代表原始语音的幅度信息,X̂_m代表预测语音的幅度信息,D表示鉴别器,Q_PESQ代表最大归一化的PESQ分数,取值在(0,1)之间,E表示均方误差操作。Among them, X_m represents the amplitude information of the original speech, X̂_m represents the amplitude information of the predicted speech, D denotes the discriminator, Q_PESQ represents the maximum normalized PESQ score with values in (0, 1), and E represents the mean square error operation.

所述生成器部分的误差,公式如下:The error of the generator part is given by the following formula:

L_GAN = E[(D(X̂_m, X_m) − 1)²]  (12)

其中,X_m代表原始语音的幅度信息,X̂_m代表预测语音的幅度信息,D表示鉴别器,E表示均方误差操作。Among them, X_m represents the amplitude information of the original speech, X̂_m represents the amplitude information of the predicted speech, D denotes the discriminator, and E represents the mean square error operation.

所述生成器的联合损失函数由这三部分的线性组合联合构成,具体如下:The joint loss function of the generator is composed of a linear combination of these three parts, as follows:

L_G = γ1·L_TF + γ2·L_GAN + γ3·L_Time  (13)

其中,γ1、γ2、γ3也是超参数,根据经验可将γ1、γ2、γ3分别设为0.9、0.2和0.05。Among them, γ1, γ2, and γ3 are also hyperparameters; empirically they can be set to 0.9, 0.2, and 0.05 respectively.
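The generator's joint loss of Eq. (13) can be assembled as below. The adversarial term is written in the least-squares-GAN form common to metric-GAN enhancers (an assumption; the patent's image formula is not reproduced here), with `d_score` standing for the discriminator's normalized quality estimate of the enhanced speech:

```python
import numpy as np

def generator_loss(spec_pred, spec_true, wav_pred, wav_true, d_score,
                   alpha=0.7, gammas=(0.9, 0.2, 0.05)):
    """L_G = g1*L_TF + g2*L_GAN + g3*L_Time with the weights from the text."""
    l_mag = np.mean((np.abs(spec_true) - np.abs(spec_pred)) ** 2)
    l_ri = (np.mean((spec_true.real - spec_pred.real) ** 2)
            + np.mean((spec_true.imag - spec_pred.imag) ** 2))
    l_tf = alpha * l_mag + (1 - alpha) * l_ri      # joint time-frequency error
    l_time = np.mean(np.abs(wav_pred - wav_true))  # mean absolute waveform error
    l_gan = (d_score - 1.0) ** 2                   # assumed least-squares GAN term
    g1, g2, g3 = gammas
    return g1 * l_tf + g2 * l_gan + g3 * l_time
```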

在本实施例中,基于Conformer的时频域生成对抗网络增强模块,该模块结合了卷积层和Transformer层的优势,既可以捕获局部关系,又可以捕获长期的依赖关系。这样可以有效地提取和增强单源噪声的时频特征,使其更加清晰和易于分析。In this embodiment, the Conformer-based time-frequency domain generative adversarial network enhancement module combines the advantages of the convolutional layer and the Transformer layer, and can capture both local relationships and long-term dependencies. This can effectively extract and enhance the time-frequency characteristics of single-source noise, making it clearer and easier to analyze.

步骤103:获取增强后的各个单源噪声信号,根据预设指标评估各个单源噪声信号,完成噪声检测。Step 103: Obtain the enhanced single-source noise signals, evaluate each single-source noise signal according to the preset indicators, and complete the noise detection.

在本实施例中,所述获取增强后的各个单源噪声信号,根据预设指标评估各个单源噪声信号,完成噪声检测,具体为:In this embodiment, the steps of obtaining enhanced single-source noise signals, evaluating each single-source noise signal according to preset indicators, and completing noise detection are as follows:

获取增强后的各个单源噪声信号,计算各个单源噪声信号的A计权声压级;Obtain the enhanced single-source noise signals and calculate the A-weighted sound pressure level of each single-source noise signal;

根据所述A计权声压级对各个单源噪声信号进行分级和分类,并输出检测报告,完成噪声监测。Grade and classify each single-source noise signal according to its A-weighted sound pressure level, and output a detection report to complete the noise monitoring.

在本实施例中,使用A计权声压级作为量化指标进行分析。通过对分离增强后的单源噪声进行量化解析,可以更加准确地评估噪声的级别和影响程度,为后续的处理和决策提供依据。In this embodiment, A-weighted sound pressure level is used as a quantitative index for analysis. By quantitatively analyzing the separated and enhanced single-source noise, the level and impact of the noise can be more accurately assessed, providing a basis for subsequent processing and decision-making.
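A sketch of the A-weighting computation used as the quantitative index (the IEC 61672 curve itself is standard; the per-band summation in the second function is an assumed simplification):

```python
import numpy as np

def a_weight_db(f):
    """IEC 61672 A-weighting curve in dB (approximately 0 dB at 1 kHz)."""
    f = np.asarray(f, dtype=float)
    ra = (12194.0 ** 2 * f ** 4) / (
        (f ** 2 + 20.6 ** 2)
        * np.sqrt((f ** 2 + 107.7 ** 2) * (f ** 2 + 737.9 ** 2))
        * (f ** 2 + 12194.0 ** 2))
    return 20 * np.log10(ra) + 2.00

def a_weighted_level(band_db, freqs):
    """A-weighted sound pressure level from per-band levels: weight each
    band, convert to power, sum, and convert back to dB (band layout assumed)."""
    weighted = band_db + a_weight_db(freqs)
    return 10 * np.log10(np.sum(10 ** (weighted / 10)))
```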

本发明还提供了一种单通道噪声检测装置,包括:噪声分离模块、噪声增强模块和评估模块;The invention also provides a single-channel noise detection device, including: a noise separation module, a noise enhancement module and an evaluation module;

所述噪声分离模块,用于将混合声音信号输入预先训练好的噪声分离模型,获取若干个单源噪声信号;所述噪声分离模型根据时域编码器和双路径循环神经网络构建;The noise separation module is used to input mixed sound signals into a pre-trained noise separation model to obtain several single-source noise signals; the noise separation model is constructed based on a time domain encoder and a dual-path recurrent neural network;

所述噪声增强模块,用于将分离到的单源噪声信号输入预先训练好的噪声增强模型,分别对各个单源噪声信号进行增强操作;所述噪声增强模型根据时频域生成对抗网络构建;The noise enhancement module is used to input the separated single-source noise signal into a pre-trained noise enhancement model, and perform enhancement operations on each single-source noise signal respectively; the noise enhancement model is constructed based on the time-frequency domain generative adversarial network;

所述评估模块,用于获取增强后的各个单源噪声信号,根据预设指标评估各个单源噪声信号,完成噪声检测。The evaluation module is used to obtain each enhanced single-source noise signal, evaluate each single-source noise signal according to preset indicators, and complete noise detection.

本发明还提供了一种计算机设备,包括:处理器、通信接口和存储器,所述处理器、所述通信接口和所述存储器相互连接,其中,所述存储器存储有可执行程序代码,所述处理器用于调用所述可执行程序代码,执行所述的单通道噪声检测方法。The present invention also provides a computer device, including a processor, a communication interface, and a memory, which are connected to each other; the memory stores executable program code, and the processor is configured to call the executable program code to execute the single-channel noise detection method.

在本实施例中,通过时域编码器和双路径循环神经网络构建噪声分离模型对混合声音信号进行分离,通过双路径捕获声音信号的特征信息,可以快速、有效地将混合声音信号中的噪声源分离出来。同时,使用时频域生成对抗网络增强模块对分离得到的单源噪声进行增强处理,使单源噪声更加清晰,便于后续根据指标评估所述单源噪声,提高噪声监测的准确性。In this embodiment, a noise separation model built from a time-domain encoder and a dual-path recurrent neural network separates the mixed sound signal; by capturing the characteristic information of the sound signal through the two paths, the noise sources in the mixture can be separated quickly and effectively. Meanwhile, the time-frequency domain generative adversarial network enhancement module enhances the separated single-source noise, making it clearer and easier to evaluate against the subsequent indicators, which improves the accuracy of noise monitoring.

以上所述的具体实施例,对本发明的目的、技术方案和有益效果进行了进一步的详细说明,应当理解,以上所述仅为本发明的具体实施例而已,并不用于限定本发明的保护范围。特别指出,对于本领域技术人员来说,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection. It is particularly pointed out that, for those skilled in the art, any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

CN202311015701.8A2023-08-112023-08-11Single-channel noise detection method and deviceActiveCN117109726B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202311015701.8ACN117109726B (en)2023-08-112023-08-11Single-channel noise detection method and device

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202311015701.8ACN117109726B (en)2023-08-112023-08-11Single-channel noise detection method and device

Publications (2)

Publication NumberPublication Date
CN117109726Atrue CN117109726A (en)2023-11-24
CN117109726B CN117109726B (en)2025-03-14

Family

ID=88808365

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202311015701.8AActiveCN117109726B (en)2023-08-112023-08-11Single-channel noise detection method and device

Country Status (1)

CountryLink
CN (1)CN117109726B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN118193955A (en)*2024-04-112024-06-14哈尔滨工程大学Method, device, medium and product for acquiring pneumatic noise of air compressor
CN118329188A (en)*2024-03-042024-07-12珠海高凌信息科技股份有限公司Noise monitoring data analysis method and device based on threshold linkage triggering strategy
CN118376984A (en)*2024-06-262024-07-23成都锐新科技有限公司Digital throwing type radar active interference bomb

Citations (14)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US20170162194A1 (en)*2015-12-042017-06-08Conexant Systems, Inc.Semi-supervised system for multichannel source enhancement through configurable adaptive transformations and deep neural network
US20190066713A1 (en)*2016-06-142019-02-28The Trustees Of Columbia University In The City Of New YorkSystems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
US20210012767A1 (en)*2020-09-252021-01-14Intel CorporationReal-time dynamic noise reduction using convolutional networks
US20210074282A1 (en)*2019-09-112021-03-11Massachusetts Institute Of TechnologySystems and methods for improving model-based speech enhancement with neural networks
CN112992172A (en)*2021-01-282021-06-18广州大学Single-channel time domain bird song separating method based on attention mechanism
CN113113049A (en)*2021-03-182021-07-13西北工业大学Voice activity detection method combined with voice enhancement
CN113327624A (en)*2021-05-252021-08-31西北工业大学Method for intelligently monitoring environmental noise by adopting end-to-end time domain sound source separation system
CN114360567A (en)*2022-02-162022-04-15东北大学 A single-channel speech enhancement method based on deep complex convolutional network
CN114446314A (en)*2021-12-312022-05-06中国人民解放军陆军工程大学Voice enhancement method for deeply generating confrontation network
CN115273885A (en)*2022-06-172022-11-01南京大学 Full-band speech enhancement method based on spectral compression and self-attention neural network
CN115273882A (en)*2022-07-292022-11-01新疆大学Speech enhancement method for simultaneously modeling speech and noise in time domain
CN115602183A (en)*2022-09-232023-01-13广州博冠信息科技有限公司(Cn)Audio enhancement method and device, electronic equipment and storage medium
CN115731924A (en)*2022-11-012023-03-03广州大学Single-channel time domain birdsound separation method and device and computer readable storage medium
CN116013344A (en)*2022-12-172023-04-25西安交通大学 A Speech Enhancement Method in Multiple Noise Environments

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CAO RZ, ET AL: "CMGAN: Conformer-based Metric GAN for Speech Enhancement", 《INTERSPEECH 2022》, 31 December 2022 (2022-12-31), pages 937 - 939*
LI Q, ET AL: "Two-Stage Noise Origin-Source Tracking and Quantitative Analysis for Intelligent Urban Environment Monitoring", 《 2024 9TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION SYSTEMS》, 15 August 2024 (2024-08-15), pages 507 - 512*
LUO Y, ET AL: "DUAL-PATH RNN:EFFICIENT LONG SEQUENCE MODELING FOR TIME-DOMAIN SINGLE-CHANNEL SPEECH SEPARATION", 《2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING》, 31 December 2020 (2020-12-31), pages 47 - 49*
张明亮等: "基于全卷积神经网络的语音增强算法", 计算机应用研究, vol. 37, no. 1, 30 June 2020 (2020-06-30), pages 135 - 137*


Also Published As

Publication numberPublication date
CN117109726B (en)2025-03-14

Similar Documents

PublicationPublication DateTitle
CN111247585B (en)Voice conversion method, device, equipment and storage medium
CN110136731B (en)Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
Yuliani et al.Speech enhancement using deep learning methods: A review
US11024324B2 (en)Methods and devices for RNN-based noise reduction in real-time conferences
CN117109726A (en) A single-channel noise detection method and device
US20230245674A1 (en)Method for learning an audio quality metric combining labeled and unlabeled data
CN113488063B (en)Audio separation method based on mixed features and encoding and decoding
CN115457980B (en) A method and system for automatic speech quality assessment without reference speech
Zhu et al.FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions
CN112289338B (en)Signal processing method and device, computer equipment and readable storage medium
CN115273874A (en) Compression Method of Speech Enhancement Model Based on Recurrent Neural Network
Jin et al.Speech separation and emotion recognition for multi-speaker scenarios
CN118197353A (en)Target noise extraction and evaluation method and device
Guo et al.DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition
Li et al.A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN114049887B (en) Real-time voice activity detection method and system for audio and video conferencing
Tiwari et al.Real-time audio visual speech enhancement: integrating visual cues for improved performance
Feng et al.Noise classification speech enhancement generative adversarial network
Gudepu et al.Dynamic encoder RNN for online voice activity detection in adverse noise conditions
CN115881157A (en)Audio signal processing method and related equipment
Wen et al.Multi-stage progressive audio bandwidth extension
Zhao et al.Speech Enhancement Based on Dual-Path Cross-Parallel Conformer Network
Shiroma et al.Missing data completion of multi-channel signals using autoencoder for acoustic scene classification
Yuan et al.An Improved Optimal Transport Kernel Embedding Method with Gating Mechanism for Singing Voice Separation and Speaker Identification
Alim et al.Open-Source Pipeline for Noise-Resilient Voice Data Preparation

Legal Events

DateCodeTitleDescription
PB01Publication
SE01Entry into force of request for substantive examination
GR01Patent grant
