CN116312616A - A processing recovery method and control system for noisy speech signals - Google Patents

A processing recovery method and control system for noisy speech signals

Info

Publication number
CN116312616A
Authority
CN
China
Prior art keywords
learning network
noise suppression
frequency
noise
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211678470.4A
Other languages
Chinese (zh)
Inventor
李倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bestechnic Shanghai Co Ltd
Original Assignee
Bestechnic Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bestechnic Shanghai Co Ltd
Priority to CN202211678470.4A
Publication of CN116312616A
Priority to PCT/CN2023/103754 (WO2024139120A1)
Legal status: Pending

Abstract

The application relates to a processing and restoration method and a control system for noisy speech signals. The method acquires a noisy speech signal and applies an STFT (short-time Fourier transform) to obtain a spectrogram. Time-frequency speech features are determined from the spectrogram, and a noise-suppression learning network estimates a masking value for each frequency point. The noise-suppressed frequency-domain speech signal is determined from the masking values and the spectrogram; LPC processing is then performed to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. An ISTFT converts the frequency-domain speech signal into the noise-suppressed time-domain speech signal. Based on the noise-suppressed time-domain signal, the linear part, and the residual part, a restoration learning network recovers an enhanced residual part. The predicted linear part and the enhanced residual part are summed to obtain the restored speech signal. In this way, LPC can be combined effectively with adaptive learning networks on a small chip, achieving efficient and fast noise reduction and restoration of noisy speech signals in variable noise environments.

Description

Translated from Chinese
A processing recovery method and control system for noisy speech signals

Technical Field

The present application relates to the field of wireless communication, and more specifically, to a method and control system for processing and restoring noisy speech signals in wireless communication.

Background

With the development of the Internet of Things, people frequently and widely use, in addition to mobile phones, various miniaturized portable smart devices such as smart glasses, wireless Bluetooth earphones, and wireless Bluetooth speakers to make voice calls against varied noise backgrounds, for example in subways, crowded shopping districts, stadiums, and outdoor venues. Unlike mobile phones, these miniaturized portable smart devices usually face strict cost and size requirements and are equipped with relatively small chips with limited storage space and computing power, a setting also known as "edge computing".

Although some noise-reduction technologies for voice communication are in use, they typically suppress the intensity of frequency components with high noise energy in the frequency domain, which tends to sacrifice speech intelligibility under heavy noise: the denoised speech quality is poor, the speech itself is inevitably damaged, and the user's listening experience suffers. In addition, these noise-reduction technologies are constrained by the chip configurations of miniaturized portable smart devices; the algorithms are usually coarse, or the computation is slow enough to cause audible lag, and they cannot meet users' demands for high speech quality and real-time performance.

Summary of the Invention

The present application is provided to address the above deficiencies in the prior art. What is needed is a processing and restoration method and control system for noisy speech signals that can effectively deploy an adaptive learning network combined with LPC (linear predictive coding) on a small edge-computing chip, achieving efficient and fast noise reduction for noisy speech signals in variable noise environments and restoring lossless, high-clarity speech signals with good real-time performance.

According to a first aspect of the present application, a processing and restoration method for noisy speech signals is provided. The method includes the following steps. A noisy speech signal to be processed is acquired. An STFT is applied to the noisy speech signal to obtain a spectrogram. Based on the spectrogram, time-frequency speech features are determined. Based on the time-frequency speech features, a noise-suppression learning network estimates a masking value for each frequency point as that frequency point's noise-suppression amount. Based on the masking values and the spectrogram, the noise-suppressed frequency-domain speech signal is determined. Based on the frequency-domain speech signal, the power spectral density is calculated. Based on the power spectral density, LPC processing is performed to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. An ISTFT is applied to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal. Based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, a restoration learning network recovers an enhanced residual part. The predicted linear part and the enhanced residual part are summed to obtain a restored speech signal whose speech intelligibility is above a predetermined threshold.

According to a second aspect of the present application, a control system for processing and restoring noisy speech signals is provided. The control system includes an interface, a processing unit, and a memory. The interface is configured to acquire a noisy speech signal to be processed. The processing unit is configured to perform the processing and restoration method according to the various embodiments of the present application, that is, the steps listed for the first aspect above. The memory is configured to store the trained noise-suppression learning network and restoration learning network.

The processing and restoration method and control system for noisy speech signals provided by the embodiments of the present application can effectively deploy an adaptive learning network combined with LPC (linear predictive coding) on a small edge-computing chip, achieving efficient and fast noise reduction for noisy speech signals in variable noise environments and restoring lossless, high-clarity speech signals with good real-time performance.

Brief Description of the Drawings

The features, advantages, and technical and industrial significance of exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which like reference numerals denote like elements, and in which:

Fig. 1 shows a flowchart of a processing and restoration method for noisy speech signals according to an embodiment of the present application;

Fig. 2 shows a structural diagram of a control system for processing and restoring noisy speech signals according to an embodiment of the present application; and

Fig. 3 shows a flowchart of an example of the processing and restoration method for noisy speech signals according to an embodiment of the present application.

Detailed Description

To help those skilled in the art better understand the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings and specific embodiments; these embodiments are not intended to limit the present application.

As used in this application, "first", "second", and similar words do not denote any order, quantity, or importance, but are used only for distinction. Words such as "comprising" or "including" mean that the element preceding the word covers the elements listed after the word, without excluding other elements. The order of steps indicated by arrows in the drawings is only an example and does not mean the steps must be executed in that order; unless otherwise specified, steps may be combined or reordered, as long as their logical relationships are not affected. A "group" of fully connected layers in this application may be one layer or several layers, which is not specifically limited here. The technical term "residual" in this application denotes the small remaining part of the speech signal after the predicted part is removed.

Fig. 1 shows a flowchart of a processing and restoration method for noisy speech signals according to an embodiment of the present application. This method is especially suitable for the various small chips that perform edge computing, which are usually small in size with limited storage space and computing power. Referring to Fig. 2, such chips (also referred to as control systems) are commonly used in miniaturized portable smart devices such as smart glasses, wireless Bluetooth earphones, wireless Bluetooth speakers, multifunctional smart charging cases (e.g., Bluetooth earphone charging cases), and smart watches (such as, but not limited to, children's multifunctional positioning and monitoring watches). Note that although the method is especially suitable for small edge-computing chips, it is not limited to them: it can of course also run on chips with greater processing power and storage, such as those in mobile phones, or on still more capable processors such as CPUs. Its processing steps are simply especially friendly to small edge-computing chips, overcoming their limited computing power and storage to ensure efficient and fast noise reduction for noisy speech signals in variable noise environments.

As shown in Fig. 1, the method begins at step 101, in which a noisy speech signal to be processed is acquired. The noisy speech signal may be collected by a microphone and analog-to-digital converted.

In step 102, an STFT is applied to the noisy speech signal to obtain a spectrogram. The STFT (short-time Fourier transform) first divides the noisy speech signal into frames. For example, at a 16 kHz sampling rate, each frame is 8 ms long and the frame interval is 8 ms. Each frame of noisy speech data is then windowed and Fourier transformed (FFT), and the per-frame transform results are stitched together to obtain the spectrogram. For example, each segment fed to the Fourier transform is 16 ms long, i.e., 256 samples in total. The components of a noisy speech signal usually vary in both the time and frequency domains rather than remaining stationary; the STFT yields a spectrogram that shows which frequencies occur at which times, that is, the spectrogram reflects the joint distribution of the noisy speech signal over time and frequency.
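
For illustration, the framing and per-frame FFT of step 102 can be sketched in a few lines of NumPy; the Hann window and the exact boundary handling are assumptions not fixed by the text:

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Frame, window, FFT, and stack: the step-102 pipeline.

    frame_len=256 and hop=128 match the 16 ms FFT segments and 8 ms frame
    interval at a 16 kHz sampling rate given above.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=np.complex64)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window
        spec[m] = np.fft.rfft(frame)   # one time slice of the spectrogram
    return spec
```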

In step 103, time-frequency speech features are determined from the spectrogram. Extracting time-frequency features on top of a spectrogram that reflects the joint time-frequency distribution makes it possible to account for the human auditory mechanism and extract feature parameters better matched to human hearing, so that recognition performance remains good even at lower signal-to-noise ratios. In some embodiments, these perceptually motivated time-frequency features include at least one of MFCC features, BFCC (Bark-frequency cepstral coefficient) features, and Fbank features (mel filterbank features), in particular MFCC features. MFCC stands for mel-frequency cepstral coefficients; MFCC extraction involves pre-emphasis filtering, framing, windowing, FFT, mel filterbank filtering, a logarithm, a discrete cosine transform (DCT), and dynamic (differential) feature extraction. The mel filterbank mimics the auditory mechanism of the cochlear hair cells: high resolution at low frequencies, low resolution at high frequencies, with an approximately logarithmic mapping to linear frequency; the details are not repeated here. Step 102 has already produced FFT results for each segment of noisy speech data, so in step 103 the magnitudes can be taken and filtered by the mel filterbank, the natural logarithm applied to the filter outputs, and a DCT then performed to obtain the MFCC parameters and MFCC differential parameters, which together form the MFCC features.
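
A sketch of the log-mel-plus-DCT computation described above, starting from the STFT output. The mel filterbank construction is omitted, and the simple frame-difference deltas and the 22 + 6 + 4 = 32 dimension split (taken from the example described later with reference to Fig. 3) are assumptions about the exact recipe:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(spec, mel_fbank, n_mfcc=22):
    """Log-mel + DCT cepstra per frame, then crude difference features.

    `mel_fbank` is an (n_mels, n_bins) mel filterbank matrix; building it
    (e.g. with librosa.filters.mel) is left out here for brevity.
    """
    power = np.abs(spec) ** 2                 # per-frame power spectrum
    mel = power @ mel_fbank.T                 # mel filterbank energies
    log_mel = np.log(mel + 1e-10)             # natural log, as in step 103
    ceps = dct(log_mel, type=2, axis=-1, norm='ortho')[:, :n_mfcc]
    d1 = np.diff(ceps, axis=0, prepend=ceps[:1])   # first-order differences
    d2 = np.diff(d1, axis=0, prepend=d1[:1])       # second-order differences
    return np.concatenate([ceps, d1[:, :6], d2[:, :4]], axis=1)  # 32-dim
```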

In step 104, based on the time-frequency speech features, a noise-suppression learning network estimates a masking value for each frequency point as that frequency point's noise-suppression amount. The masking value, also called a mask, represents how much noise to suppress at each frequency point at each moment. In step 105, the noise-suppressed frequency-domain speech signal is determined from the masking values and the spectrogram. In some embodiments, the mask serves as a noise-suppression coefficient for the frequency-domain components at each moment of the spectrogram; multiplying the estimated mask with those components yields a frequency-domain speech signal with noise suppressed across frequency at every moment. The noise-suppression learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. These RNNs account for interactions between neighboring points of the time-frequency features in both time and frequency, giving a more accurate mask estimate. An LSTM, while modeling those neighboring interactions, can also forget the influence of points far away in time and frequency, matching the randomness with which noise enters noisy speech; this makes the mask estimate more accurate and its computation converge faster. The inventors found that the LSTM can be kept to 2-4 layers: an LSTM of this size fits in the memory of a small chip, and the chip's processing unit can readily bear the workload of the mask-estimation computation.
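
Step 105's multiplicative masking then reduces to an element-wise product, sketched below under the assumption that the mask is a real-valued per-bin gain (the text does not fix the mask's exact form):

```python
import numpy as np

def apply_mask(spec, mask):
    """Step 105: element-wise multiplicative masking.

    `spec` is the complex spectrogram from the STFT; `mask` holds one
    real-valued gain per time-frequency bin, as estimated by the
    noise-suppression network of step 104.
    """
    return mask * spec   # noise-suppressed frequency-domain speech signal
```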

In step 106, the power spectral density is calculated from the frequency-domain speech signal. For example, the frequency-domain speech signal contains the amplitude and phase for each frequency, so squaring the amplitudes yields the power spectral density.

In step 107, LPC processing is performed based on the power spectral density to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. Performing LPC (linear predictive coding) from the power spectral density is a conventional noise-reduction technique: the linear part of the speech is predicted, yielding the linear part and the residual part; the details are not repeated here.

In step 108, an ISTFT (inverse STFT) is applied to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal.
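
A matching overlap-add inverse for the stft() sketch above; the synthesis window and the overlap normalization are likewise assumptions:

```python
import numpy as np

def istft(spec, frame_len=256, hop=128):
    """Inverse of the stft() sketch via windowed overlap-add."""
    window = np.hanning(frame_len)
    n_frames = len(spec)
    y = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(y)
    for m in range(n_frames):
        seg = np.fft.irfft(spec[m], n=frame_len) * window
        y[m * hop : m * hop + frame_len] += seg
        norm[m * hop : m * hop + frame_len] += window ** 2
    return y / np.maximum(norm, 1e-8)   # compensate the overlapped windows
```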

In step 109, based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, a restoration learning network recovers an enhanced residual part. The restoration learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. Typically, when computing power (a single core) and storage are limited, as on a small chip, a 2-4 layer GRU can be used to save both, giving priority to an adequately sized LSTM for the noise-suppression network of step 104. The inventors found that the GRU can be kept to 2-4 layers: a GRU of this size fits in the memory of a small chip, and the chip's processing unit can readily bear the residual recovery and enhancement computation. Further, on a single-core chip, a GRU of this size and a 2-4 layer LSTM can cooperate, executing the noise suppression and the residual recovery and enhancement in a streaming fashion.
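
A minimal sketch of such a restoration network, assuming the per-step input is the triple (y(n), z(n), e(n)) and a final linear layer produces the enhanced residual; neither detail is fixed by the text:

```python
import torch
import torch.nn as nn

class RecoveryNet(nn.Module):
    """Sketch of the step-109 restoration network: a small (here 2-layer)
    GRU mapping [y(n), z(n), e(n)] at each step to an enhanced residual
    L(n). Hidden size and output projection are assumptions."""

    def __init__(self, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size=3, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, y, z, e):                # each: (batch, time)
        x = torch.stack([y, z, e], dim=-1)     # (batch, time, 3)
        h, _ = self.gru(x)
        return self.proj(h).squeeze(-1)        # L(n): (batch, time)
```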

In step 110, the predicted linear part and the enhanced residual part are summed to obtain a restored speech signal whose speech intelligibility is above a predetermined threshold. With the above processing, a noise-suppression network with learning capability first removes as much noise as possible in both the time and frequency domains; LPC then predicts the linear part of the clean speech, which is its main component, leaving a relatively small nonlinear residual to be recovered and enhanced by the restoration network. Thus, even with a small restoration network, an edge-computing system-on-chip (even a single-core design) can perform efficient and fast noise reduction on noisy speech signals in variable noise environments (especially heavy or irregular noise) and restore lossless, high-clarity speech with good real-time performance. The processing and restoration method of the present application thereby improves the speech intelligibility of noisy speech signals in variable noise environments (especially heavy noise and complex multi-source noise).

An example of the processing and restoration method for noisy speech signals is described in detail below with reference to Fig. 3. As shown in Fig. 3, a microphone collects the noisy speech signal x(n), where n denotes the current sampling instant. The noisy speech signal first undergoes an STFT (step 301), and then the MFCC features are computed (step 302). These MFCC features serve as feature values; after the noisy speech is noise-suppressed according to them, y(n) is generated, which further reduces the amount of data to process. If data at a 16 kHz sampling rate are processed with a frame length of 8 ms, a frame interval of 8 ms, and FFT segments of 16 ms (256 samples), MFCC extraction yields 32-dimensional features: 22 MFCC dimensions, 6 first-order MFCC difference dimensions, and 4 second-order MFCC difference dimensions.

Next, based on the MFCC features, the noise-suppression learning network estimates the masking value of each frequency point. Specifically, the MFCC features can first be fed to a first group of fully connected layers 303 for dimension reduction and then to the noise-suppression learning network, i.e., the RNN/GRU/LSTM network 304. The first group of fully connected layers 303 reduces the 32-dimensional input to a 16-dimensional output, shrinking the input dimension of the LSTM/RNN/GRU network 304 and thus the network size. After this dimension reduction, the noise-suppression learning network (RNN/GRU/LSTM network 304) makes a preliminary estimate, which is fed to a second group of fully connected layers 305 for dimension expansion to obtain the masking value of each frequency point, yielding the masking value estimate 306. For example, the output of the RNN/GRU/LSTM network 304 is fed to the second group of fully connected layers 305, which expands the 16-dimensional input to a 127-dimensional output, i.e., mask values (noise-suppression amounts) for 127 frequency points, from which the denoised frequency-domain speech signal Y(w) is obtained.
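
The dimensions just described suggest the following sketch; the recurrent hidden size (16) and the sigmoid bounding the mask to (0, 1) are assumptions beyond the stated 32-to-16 and 16-to-127 fully connected layers:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Sketch of blocks 303-306: FC 32->16 (dimension reduction), a small
    recurrent core, FC 16->127 (dimension expansion), one mask value per
    frequency point."""

    def __init__(self, num_layers=2):
        super().__init__()
        self.reduce = nn.Linear(32, 16)        # first group of FC layers 303
        self.rnn = nn.LSTM(input_size=16, hidden_size=16,
                           num_layers=num_layers, batch_first=True)  # block 304
        self.expand = nn.Linear(16, 127)       # second group of FC layers 305

    def forward(self, feats):                  # feats: (batch, time, 32)
        h = torch.relu(self.reduce(feats))
        h, _ = self.rnn(h)
        return torch.sigmoid(self.expand(h))   # (batch, time, 127) masks 306
```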

The denoised frequency-domain speech signal Y(w) is processed along two paths. One path goes to the LPC computation module, which performs LPC processing to compute the LPC coefficients and linearly predict the speech signal: from the already-enhanced samples of the preceding period, s(n-1), s(n-2), ..., s(n-16), combined with the linear prediction coefficients, it predicts the linear part z(n) of the final enhanced signal s(n) at the current instant. The other path undergoes an ISTFT 307 to obtain the noise-suppressed time-domain signal y(n); this y(n), together with the linear part z(n) and the residual part e(n) (the nonlinear part) from the LPC path, is fed to the restoration network, e.g., a GRU/RNN, which corrects the speech residual to produce the recovered and enhanced residual L(n). L(n) is then added 313 to the linear part z(n) to obtain the final enhanced speech signal s(n).
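
The per-sample recombination of the two paths can be sketched as follows; `recovery_net_step`, a hypothetical per-sample wrapper around the restoration network 312, and the coefficient vector `a` from step 310 are assumed inputs:

```python
import numpy as np

def synthesize(y, a, recovery_net_step, p=16):
    """Recombine the two paths sample by sample (summation 313).

    y: noise-suppressed time-domain signal from the ISTFT 307;
    a: LPC coefficients a_1..a_p from step 310;
    recovery_net_step: hypothetical callable producing L(n) from
    (y(n), z(n), e(n)).
    """
    s = np.zeros_like(y)
    for n in range(p, len(y)):
        z_n = np.dot(a, s[n - p : n][::-1])   # z(n) from s(n-1)..s(n-16)
        e_n = y[n] - z_n                      # residual before enhancement
        s[n] = z_n + recovery_net_step(y[n], z_n, e_n)   # z(n) + L(n)
    return s
```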

The noise-suppression learning network used in block 304 and the restoration learning network used in block 312 can be trained on a server, and the trained parameters can be transmitted from the server to the control system, which is based on an edge-computing system-on-chip. In some embodiments, the control system can request updated parameters of the noise-suppression and restoration learning networks from the server via a communication interface. In some embodiments, the trained parameters can instead be stored in memory before the control system leaves the factory and used thereafter.

In some embodiments, when the server's computing load permits, the noise-suppression and restoration learning networks can be trained jointly to obtain better-coordinated noise suppression and residual correction. In some embodiments, when the server's computing load is limited, the noise-suppression learning network can be trained first on the server, and its output then used as training data for the restoration learning network, balancing training speed and quality. For the noise-suppression network, MFCC features extracted from speech with preset noise serve as the network input, and mask values verified by testing to suppress noise well serve as the target output, forming one training sample; the network is trained on the samples of the training set, for example by batch gradient descent or stochastic gradient descent. For the restoration learning network, the noise-suppressed speech y(n) of the preset-noise speech signal, the linearly predicted z(n), and e(n) serve as the network input, and the ground-truth residual serves as the network output, together forming one training sample; training again uses, for example, batch or stochastic gradient descent. When the noise-suppression network is trained first and its output used to train the restoration network, then for a preset-noise speech signal the trained noise-suppression network performs steps 301-307 in order to produce y(n), and steps 308-311 (detailed below) produce e(n) and z(n); the output y(n), the linearly predicted z(n), and e(n) form the network input, the ground-truth residual forms the network output, training data are generated accordingly, and the restoration learning network is trained on them.
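
As an illustration of the separate training mode, a minimal loop for the noise-suppression network might look like the following; the MSE loss and the PyTorch framing are assumptions, since the text only names batch/stochastic gradient descent:

```python
import torch

def train_mask_net(model, loader, epochs=10, lr=1e-3):
    """Sketch: pairs of (MFCC features, target mask) drive SGD training."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, target_mask in loader:      # one training sample per pair
            loss = torch.mean((model(feats) - target_mask) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
```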

Returning to Fig. 3, the LPC processing is described in detail.

In step 308, the power spectral density of the denoised frequency-domain signal Y(w) is computed.

In step 309, an IFFT is applied to the power spectral density to obtain the autocorrelation coefficients.
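
Steps 308 and 309 together amount to the Wiener-Khinchin relation: the inverse FFT of a frame's power spectrum is its autocorrelation sequence. A sketch, assuming the spectrum holds the non-negative-frequency bins of an rfft:

```python
import numpy as np

def autocorr_from_frame(Y_frame, p=16):
    """Steps 308-309 for one frame: PSD, then IFFT -> autocorrelation.

    p=16 matches the prediction order used in the example above.
    """
    psd = np.abs(Y_frame) ** 2                    # step 308
    full = np.concatenate([psd, psd[-2:0:-1]])    # rebuild symmetric spectrum
    r = np.fft.ifft(full).real                    # step 309
    return r[: p + 1]                             # R(0), ..., R(p)
```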

In step 310, based on the autocorrelation coefficients, the Levinson-Durbin algorithm computes the LPC linear prediction coefficients, thereby predicting the linear part of the noise-suppressed time-domain speech signal, that is, predicting the speech linear part z(n) (step 311), and obtaining the residual part e(n). The residual part e(n) is enhanced by the nonlinear prediction of the restoration network, and the enhanced residual data L(n) and the linear prediction data z(n) are summed to obtain the final speech signal s(n).

Specifically, the LPC linear prediction coefficients are solved from the following equation (1):

$$
\begin{bmatrix}
R_n(0) & R_n(1) & \cdots & R_n(p-1) \\
R_n(1) & R_n(0) & \cdots & R_n(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
R_n(p-1) & R_n(p-2) & \cdots & R_n(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}
\qquad (1)
$$

where $R_n(j)$, $j = 1, \ldots, p$, is the autocorrelation function of the speech signal, $p$ is the number of LPC linear prediction coefficients, and $a_1, \ldots, a_p$ are the LPC linear prediction coefficients. This system is also known as the Yule-Walker equations; the matrix on the left-hand side is a Toeplitz matrix, symmetric about the main diagonal with equal elements along every diagonal parallel to it, and the system can be solved by the Levinson-Durbin recursion.
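
A compact Levinson-Durbin recursion for this system, included for illustration; the sign convention is chosen to match formula (2) below:

```python
import numpy as np

def levinson_durbin(r, p=16):
    """Solve the Yule-Walker system (1) for a_1..a_p, exploiting the
    Toeplitz structure. `r` is the autocorrelation sequence R(0..p)."""
    a = np.zeros(p + 1)   # prediction-error filter, a[0] fixed at 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coeff.
        a[1:i + 1] += k * a[i - 1::-1][:i]                  # update filter
        err *= 1.0 - k * k                                  # shrink error power
    return -a[1:]   # prediction coefficients so that z(n) = sum_i a_i s(n-i)
```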

The speech linear part z(n) can be calculated from the prediction coefficients and the previous enhanced samples according to formula (2):

$$z(n) = \sum_{i=1}^{p} a_i \, s(n-i) \qquad (2)$$

The residual part e(n), the part of the noise-suppressed signal y(n) not explained by the linear prediction, can be calculated according to formula (3):

$$e(n) = y(n) - z(n) \qquad (3)$$

In this way, the computed speech linear part z(n) and residual part e(n) can be fed to the restoration network to repair and improve the nonlinear residual.

Returning to Fig. 2, a control system 200 for processing and restoring noisy speech signals according to embodiments of the present application is described. The control system 200 may include an interface 201, a processing unit 202, and a memory 203. The interface 201 may be configured to acquire the noisy speech signal to be processed.

The processing unit 202 may be configured to execute the processing and restoration method for noisy speech signals according to the various embodiments of the present application. Referring back to Fig. 1, the method may include steps 102 through 110 as described above.

That is, the STFT of the noisy speech signal, the determination of time-frequency speech features, the mask estimation and multiplicative noise suppression, the power-spectral-density calculation, the LPC prediction of the linear and residual parts, the ISTFT, the recovery of the enhanced residual by the restoration learning network, and the final summation are all performed as set out for steps 102-110 above, including the considerations on network type and size (a 2-4 layer LSTM for noise suppression, a 2-4 layer GRU for restoration) and the suitability for edge-computing control systems, even ones realized as single-core system-on-chip designs. Note that the examples of the individual steps of the processing and restoration method according to the embodiments of the present application can all be combined here and are not repeated.

The memory 203 may be configured to store the trained noise-suppression learning network and restoration learning network.

In some embodiments, when the processing unit 202 is single-core, it is configured to execute the processing of the noise-suppression learning network and the processing of the restoration learning network in a streaming fashion; when the processing unit is dual-core, the two are executed in parallel. Thus, when a single-core design has limited computing power, the noise-suppression network's processing can run first and the restoration network's processing afterwards; the restoration network's input depends on the noise-suppression network's output in any case. The noise-suppression architecture adopted in this application copes with limited computing power, and dimension reduction such as the fully connected layers further cuts the amount of data to process, so even streamed, alternating execution does not compromise real-time restoration, preserving the user's listening experience of lossless, high-clarity, real-time speech.

In some embodiments, various RISC (reduced instruction set computer) processor IP purchased from companies such as ARM can serve as the processing unit 202 of the control system of the present application to perform the corresponding functions, with an embedded system (such as, but not limited to, an SoC) implementing the processing and restoration of noisy speech signals. Specifically, commercially available IP offers many modules, such as, but not limited to, memory (the memory 203 may be on-chip memory or externally extended memory attached to the IP), various communication modules (e.g., a Bluetooth module), codecs, and buffers. Other components, such as the antenna, microphone, and speaker, can be connected externally to the chip. The interface 201 can connect an external microphone that collects the noisy speech signal. Users can build an ASIC (application-specific integrated circuit) from purchased IP or self-developed modules to implement the communication modules, codecs, and the steps of the processing and restoration method of this application, reducing power consumption and cost. Note that "control system" in this application means a system that controls the target device in which it resides; it can typically be, for example, a chip, such as an ASIC implemented as an SoC, but it is not limited to this: any hardware circuit, software-processor configuration, or combined hardware-software firmware capable of such control can implement the control system. For example, the processing performed by the processing unit 202 can be implemented as executable instructions run by a RISC processor, as distinct hardware circuit modules, or as combined hardware-software firmware; the details are not repeated here.

Furthermore, although exemplary embodiments have been described herein, the scope includes any and all embodiments based on this application having equivalent elements, modifications, omissions, combinations (e.g., schemes crossing various embodiments), adaptations, or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in this specification or during the prosecution of the application, which examples are to be construed as non-exclusive. Therefore, the specification and examples are intended to be regarded as examples only, with the true scope and spirit indicated by the following claims and the full scope of their equivalents.

The above description is intended to be illustrative, not restrictive. For example, the above examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be devised by those of ordinary skill in the art upon reading the above description. In addition, in the detailed description above, various features may be grouped together to streamline the application; this should not be interpreted as an intent that an unclaimed feature is essential to any claim. Rather, the subject matter of the application may lie in less than all features of a particular disclosed embodiment. Thus, the claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

The above embodiments are only exemplary embodiments of the present application and are not intended to limit the present invention, whose scope of protection is defined by the claims. Those skilled in the art may make various modifications or equivalent replacements to the present invention within the spirit and scope of the present application, and such modifications or equivalent replacements shall also be deemed to fall within the scope of protection of the present invention.

Claims (10)

1. A processing and restoration method for a noisy speech signal, comprising:
acquiring a noisy speech signal to be processed;
performing an STFT (short-time Fourier transform) on the noisy speech signal to obtain a spectrogram;
determining time-frequency speech features based on the spectrogram;
estimating masking values of the frequency points by using a noise suppression learning network based on the time-frequency speech features, as the noise suppression amounts of the frequency points;
determining a noise-suppressed frequency-domain speech signal based on the masking values and the spectrogram;
calculating a power spectral density based on the frequency-domain speech signal;
predicting a linear part and a residual part of the noise-suppressed time-domain speech signal by performing LPC processing based on the power spectral density;
performing an ISTFT on the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal;
recovering an enhanced residual part by using a restoration learning network based on the noise-suppressed time-domain speech signal, the linear part, and the residual part; and
summing the predicted linear part and the enhanced residual part to obtain a recovered speech signal having a speech intelligibility above a predetermined threshold.
2. The process restoration method according to claim 1, wherein the voice feature includes an MFCC feature.
3. The process restoration method according to claim 1, wherein the noise suppression learning network is an LSTM neural network and the restoration learning network is a GRU neural network.
4. The processing recovery method according to claim 1, wherein estimating the masking values of the respective frequency points using a noise suppression learning network based on the time-frequency speech features specifically comprises: feeding the speech features to a first group of fully connected layers for dimension-reduction processing, and then feeding the result to the noise suppression learning network.
5. The processing recovery method according to claim 1 or 4, wherein estimating the masking values of the respective frequency points using a noise suppression learning network based on the time-frequency speech features specifically further comprises: performing estimation with the noise suppression learning network based on the time-frequency speech features, and feeding the estimation result to a second group of fully connected layers for dimension-lifting processing to obtain the masking values of the respective frequency points.
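Claims 3 to 5 together describe a bottleneck-shaped mask estimator: dimension-reducing fully connected layers, an LSTM, and dimension-lifting fully connected layers whose outputs lie in [0, 1]. A minimal PyTorch sketch follows; the layer widths and the feature/frequency-bin counts are illustrative assumptions, not values taken from the patent:

import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_bins=257):
        super().__init__()
        # first group of fully connected layers: dimension reduction (claim 4)
        self.reduce = nn.Sequential(nn.Linear(n_feats, 64), nn.ReLU())
        # noise suppression learning network: LSTM (claim 3)
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        # second group of fully connected layers: dimension lifting (claim 5)
        self.lift = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, feats):                  # feats: (batch, frames, n_feats)
        x = self.reduce(feats)
        x, _ = self.lstm(x)
        return self.lift(x)                    # masking values in [0, 1] per frequency point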
6. The processing recovery method according to claim 1, further comprising: first training the noise suppression learning network on a server, and then training the recovery learning network using the output of the trained noise suppression learning network as training data.
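A hedged sketch of this two-stage training; the mask-MSE and waveform-MSE losses, the Adam optimizer, and the loader format are assumptions, as the claim does not specify them:

import torch

def train_two_stage(suppress_net, recovery_net, loader, epochs=10):
    # Stage 1: train the noise suppression network (on the server)
    opt1 = torch.optim.Adam(suppress_net.parameters())
    for _ in range(epochs):
        for noisy_feats, ideal_mask, clean in loader:
            loss = torch.nn.functional.mse_loss(suppress_net(noisy_feats), ideal_mask)
            opt1.zero_grad(); loss.backward(); opt1.step()
    # Stage 2: the trained network's output becomes the recovery network's training data
    suppress_net.eval()
    opt2 = torch.optim.Adam(recovery_net.parameters())
    for _ in range(epochs):
        for noisy_feats, _, clean in loader:
            with torch.no_grad():
                denoised = suppress_net(noisy_feats)
            loss = torch.nn.functional.mse_loss(recovery_net(denoised), clean)
            opt2.zero_grad(); loss.backward(); opt2.step()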
7. The processing recovery method according to claim 1, wherein performing the LPC processing specifically comprises:
performing an IFFT on the power spectral density to obtain autocorrelation coefficients; and
calculating the LPC linear prediction coefficients from the autocorrelation coefficients using the Levinson-Durbin algorithm, thereby predicting the linear portion of the noise-suppressed time-domain speech signal and obtaining the residual portion.
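For reference, a compact numpy sketch of this LPC step: by the Wiener-Khinchin theorem, the IFFT of the power spectral density yields the autocorrelation sequence, and the Levinson-Durbin recursion then solves for the prediction coefficients. The LPC order of 16 is an assumption:

import numpy as np

def lpc_from_psd(psd, order=16):
    # IFFT of the (one-sided) power spectral density -> autocorrelation coefficients
    r = np.fft.irfft(psd)[:order + 1]
    # Levinson-Durbin recursion for the LPC linear prediction coefficients
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                         # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)                   # remaining prediction-error energy
    return a, err                              # a -> linear portion; err -> residual energy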
8. A control system for processing recovery of a noisy speech signal, comprising:
an interface configured to acquire the noisy speech signal to be processed;
a processing unit configured to perform the processing recovery method for a noisy speech signal according to any one of claims 1 to 7; and
a memory configured to store the trained noise suppression learning network and the trained recovery learning network.
9. The control system according to claim 8, wherein, when the processing unit is single-core, it is configured to process the noise suppression learning network and the recovery learning network in a pipelined manner; and when the processing unit is dual-core, the processing of the noise suppression learning network and the processing of the recovery learning network are executed in parallel.
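The scheduling distinction in claim 9 can be pictured with the following sketch; the queue-based hand-off between two threads is an illustrative assumption about how a two-core pipeline could be wired, not the chip's actual scheduler:

import queue
import threading

def single_core(frames, suppress, recover):
    # One core: the two networks run back to back on each frame
    return [recover(suppress(f)) for f in frames]

def dual_core(frames, suppress, recover):
    # Two cores: while core 0 suppresses frame n+1, core 1 recovers frame n
    q, out = queue.Queue(maxsize=1), []

    def stage2():
        while True:
            x = q.get()
            if x is None:
                break
            out.append(recover(x))             # recovery network on core 1

    t = threading.Thread(target=stage2)
    t.start()
    for f in frames:
        q.put(suppress(f))                     # noise suppression network on core 0
    q.put(None)
    t.join()
    return out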
10. The control system according to claim 8, wherein the control system is implemented based on a system-on-chip that performs edge computing.

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN202211678470.4A (CN116312616A) | 2022-12-26 | 2022-12-26 | A processing recovery method and control system for noisy speech signals
PCT/CN2023/103754 (WO2024139120A1) | 2022-12-26 | 2023-06-29 | Noisy voice signal processing recovery method and control system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211678470.4A | 2022-12-26 | 2022-12-26 | A processing recovery method and control system for noisy speech signals

Publications (1)

Publication Number | Publication Date
CN116312616A (en) | 2023-06-23

Family

ID=86782312

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202211678470.4A (CN116312616A) | A processing recovery method and control system for noisy speech signals | 2022-12-26 | 2022-12-26 | Pending

Country Status (2)

Country | Link
CN | CN116312616A (en)
WO | WO2024139120A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120412619B (en)* | 2025-07-02 | 2025-10-03 | 贵州理工学院 | Audio noise suppression method based on deep learning and intelligent sound equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP2004219757A (en)* | 2003-01-15 | 2004-08-05 | Fujitsu Ltd | Voice enhancement device, voice enhancement method, and portable terminal
US20050207583A1 (en)* | 2004-03-19 | 2005-09-22 | Markus Christoph | Audio enhancement system and method
US20080137874A1 (en)* | 2005-03-21 | 2008-06-12 | Markus Christoph | Audio enhancement system and method
CN110808063A (en)* | 2019-11-29 | 2020-02-18 | 北京搜狗科技发展有限公司 | Voice processing method and device for processing voice
CN112581973A (en)* | 2020-11-27 | 2021-03-30 | 深圳大学 | Voice enhancement method and system
CN113096682A (en)* | 2021-03-20 | 2021-07-09 | 杭州知存智能科技有限公司 | Real-time voice noise reduction method and device based on mask time domain decoder
CN114694674A (en)* | 2022-03-10 | 2022-07-01 | 深圳市友杰智新科技有限公司 | Artificial intelligence-based voice noise reduction method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109065067B (en)* | 2018-08-16 | 2022-12-06 | 福建星网智慧科技有限公司 | Conference terminal voice noise reduction method based on neural network model
CN111223493B (en)* | 2020-01-08 | 2022-08-02 | 北京声加科技有限公司 | Voice signal noise reduction processing method, microphone and electronic equipment
US12062369B2 (en)* | 2020-09-25 | 2024-08-13 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks
CN112750451A (en)* | 2020-12-17 | 2021-05-04 | 云知声智能科技股份有限公司 | Noise reduction method for improving voice listening feeling
CN113571081B (en)* | 2021-02-08 | 2025-05-30 | 腾讯科技(深圳)有限公司 | Speech enhancement method, device, equipment and storage medium
CN113838471A (en)* | 2021-08-10 | 2021-12-24 | 北京塞宾科技有限公司 | Noise reduction method and system based on neural network, electronic device and storage medium
CN116312616A (en)* | 2022-12-26 | 2023-06-23 | 恒玄科技(上海)股份有限公司 | A processing recovery method and control system for noisy speech signals


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2024139120A1 (en)* | 2022-12-26 | 2024-07-04 | — | Noisy voice signal processing recovery method and control system
CN117690421A (en)* | 2024-02-02 | 2024-03-12 | 深圳市友杰智新科技有限公司 | Speech recognition method, device, equipment and medium of noise reduction recognition combined network
CN117690421B (en)* | 2024-02-02 | 2024-06-04 | 深圳市友杰智新科技有限公司 | Speech recognition method, device, equipment and medium of noise reduction recognition combined network

Also Published As

Publication number | Publication date
WO2024139120A1 (en) | 2024-07-04

Similar Documents

Publication | Title
CN111223493B (en) | Voice signal noise reduction processing method, microphone and electronic equipment
CN106486131B (en) | Method and device for voice denoising
CN109767783B (en) | Voice enhancement method, device, equipment and storage medium
CN109597022B (en) | Method, device and equipment for sound source azimuth calculation and target audio positioning
CN109065067B (en) | Conference terminal voice noise reduction method based on neural network model
CN116312616A (en) | A processing recovery method and control system for noisy speech signals
CN110767244B (en) | Speech enhancement method
WO2019113130A1 (en) | Voice activity detection systems and methods
CN112735456A (en) | Speech enhancement method based on DNN-CLSTM network
CN108172231A (en) | A method and system for removing reverberation based on Kalman filter
WO2018223727A1 (en) | Voiceprint recognition method, apparatus and device, and medium
CN114822569B (en) | Audio signal processing method, device, equipment and computer readable storage medium
CN114566179B (en) | Time delay controllable voice noise reduction method
CN118899005B (en) | Audio signal processing method, device, computer equipment and storage medium
CN112885375A (en) | Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN113611321A (en) | Voice enhancement method and system
CN113345460A (en) | Audio signal processing method, device, equipment and storage medium
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones
CN117219102A (en) | Low-complexity voice enhancement method based on auditory perception
CN110970044B (en) | A speech enhancement method for speech recognition
CN112652321B (en) | Deep learning phase-based more friendly voice noise reduction system and method
WO2025007866A1 (en) | Speech enhancement method and apparatus, electronic device and storage medium
CN109801643B (en) | Reverberation suppression processing method and device
WO2020015546A1 (en) | Far-field speech recognition method, speech recognition model training method, and server
CN110875037A (en) | Voice data processing method and device and electronic equipment

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
