


Technical Field

The present application relates to the field of wireless communication, and more specifically, to a method and control system for processing and restoring noisy speech signals in wireless communication.

Background

With the development of the Internet of Things, people frequently and widely use, in addition to mobile phones, various miniaturized portable smart devices, such as smart glasses, wireless Bluetooth earphones, and wireless Bluetooth speakers, to make voice calls against varied noise backgrounds, for example in subways, crowded shopping districts, stadiums, and outdoor venues. Unlike mobile phones, these miniaturized portable smart devices are usually subject to strict cost and size requirements and are equipped with relatively small chips with limited storage space and computing power, a setting also known as "edge computing".

Although some noise-reduction technologies for voice calls are in use today, they usually suppress, in the frequency domain, the intensity of frequency components with high noise energy. Under heavy noise this often sacrifices speech intelligibility, so that the speech quality after noise reduction is very poor: the speech is inevitably damaged and the user's listening experience suffers. In addition, constrained by the chip configurations of miniaturized portable smart devices, these noise-reduction algorithms are usually rather crude, or compute so slowly that playback lags, and thus cannot meet users' demand for high speech quality and real-time performance.
Summary of the Invention

The present application is provided to address the above deficiencies in the prior art. What is needed is a processing and restoration method and control system for noisy speech signals that can efficiently deploy, on a small edge-computing chip, an adaptive learning network combined with LPC (Linear Predictive Coding) technology, so as to achieve efficient and fast noise reduction of noisy speech signals in varied noise environments and to restore lossless, high-intelligibility speech signals with good real-time performance.

According to a first aspect of the present application, a method for processing and restoring a noisy speech signal is provided. The method includes the following steps. Acquire the noisy speech signal to be processed. Apply an STFT to the noisy speech signal to obtain a spectrogram. Based on the spectrogram, determine time-frequency speech features. Based on the time-frequency speech features, use a noise-suppression learning network to estimate a masking value for each frequency bin as the amount of noise suppression at that bin. Based on the masking values of the frequency bins and the spectrogram, determine the noise-suppressed frequency-domain speech signal. Based on the frequency-domain speech signal, compute the power spectral density. Based on the power spectral density, perform LPC processing to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. Apply an ISTFT to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal. Based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, use a restoration learning network to recover an enhanced residual part. Sum the predicted linear part and the enhanced residual part to obtain the restored speech signal, whose speech intelligibility is higher than a predetermined threshold.

According to a second aspect of the present application, a control system for processing and restoring a noisy speech signal is provided. The control system includes an interface, a processing unit, and a memory. The interface is configured to acquire the noisy speech signal to be processed. The processing unit is configured to execute the method for processing and restoring a noisy speech signal according to the various embodiments of the present application, which includes the following steps. Acquire the noisy speech signal to be processed. Apply an STFT to the noisy speech signal to obtain a spectrogram. Based on the spectrogram, determine time-frequency speech features. Based on the time-frequency speech features, use a noise-suppression learning network to estimate a masking value for each frequency bin as the amount of noise suppression at that bin. Based on the masking values of the frequency bins and the spectrogram, determine the noise-suppressed frequency-domain speech signal. Based on the frequency-domain speech signal, compute the power spectral density. Based on the power spectral density, perform LPC processing to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. Apply an ISTFT to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal. Based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, use a restoration learning network to recover an enhanced residual part. Sum the predicted linear part and the enhanced residual part to obtain the restored speech signal, whose speech intelligibility is higher than a predetermined threshold. The memory is configured to store the trained noise-suppression learning network and restoration learning network.

The processing and restoration method and control system for noisy speech signals provided by the various embodiments of the present application can efficiently deploy, on a small edge-computing chip, an adaptive learning network combined with LPC (Linear Predictive Coding) technology, achieve efficient and fast noise reduction of noisy speech signals in varied noise environments, and restore lossless, high-intelligibility speech signals with good real-time performance.
Brief Description of the Drawings

The features, advantages, and technical and industrial significance of exemplary embodiments of the invention will be described below with reference to the accompanying drawings, in which like reference numerals denote like elements, and in which:

Fig. 1 shows a flowchart of a method for processing and restoring a noisy speech signal according to an embodiment of the present application;

Fig. 2 shows a block diagram of a control system for processing and restoring a noisy speech signal according to an embodiment of the present application; and

Fig. 3 shows a flowchart of an example of a method for processing and restoring a noisy speech signal according to an embodiment of the present application.
Detailed Description

To enable those skilled in the art to better understand the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the present application.

As used in this application, "first", "second", and similar words do not denote any order, quantity, or importance, but are merely used for distinction. Words such as "include" or "comprise" mean that the element preceding the word covers the elements listed after the word, without excluding other elements. The order of the steps indicated by arrows in the drawings of this application is only an example and does not mean that the steps must be executed in that order. Unless otherwise specified, steps may be merged or reordered and executed in an order different from that shown by the arrows, as long as the logical relationships among the steps are not affected. A group of fully connected layers in this application may be one layer or several layers, which is not specifically limited here. The technical term "residual" in this application denotes the small remaining part of the speech signal after the predicted part has been removed.

Fig. 1 shows a flowchart of a method for processing and restoring a noisy speech signal according to an embodiment of the present application. This method is particularly suitable for the various small chips that perform edge computing, which are typically small in size and limited in storage space and computing power. Referring to Fig. 2, such chips (also referred to as control systems) are commonly used in various miniaturized portable smart devices, such as smart glasses, wireless Bluetooth earphones, wireless Bluetooth speakers, multifunctional smart charging cases (for example, Bluetooth earphone charging cases), and smart watches (such as, but not limited to, children's multifunctional positioning and monitoring watches). Note that while the method of the various embodiments of the present application is particularly suitable for small edge-computing chips, this does not mean it can only run on such chips: it can of course also run on chips with greater processing power and storage space, such as those in mobile phones, or even on processors such as CPUs with still greater processing power and storage space. Its processing steps are simply especially friendly to small edge-computing chips, overcoming their limited computing power and storage space and ensuring efficient and fast noise reduction of noisy speech signals in varied noise environments.

As shown in Fig. 1, the method begins at step 101 by acquiring the noisy speech signal to be processed. The noisy speech signal may be captured by a microphone and converted from analog to digital.
In step 102, an STFT is applied to the noisy speech signal to obtain a spectrogram. The STFT (short-time Fourier transform) first divides the noisy speech signal into frames. For example, at a 16 kHz sampling rate, each frame is 8 ms long and the frame interval is 8 ms. Each frame of noisy speech data is then windowed and Fourier-transformed (FFT), and the per-frame transform results are concatenated to form the spectrogram. For example, each segment fed to the Fourier transform is 16 ms long, i.e. 256 samples. The components of a noisy speech signal usually vary in both the time and frequency domains rather than remaining stationary; the STFT yields a spectrogram that reflects which frequency states occur at which times, that is, the spectrogram reflects the joint distribution of the noisy speech signal over the time and frequency domains.
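As an illustration of this framing-and-transform step, the following is a minimal numpy sketch using the 256-sample FFT length and 8 ms (128-sample) frame interval described above; the Hann window and the function name are assumptions made for the sketch, not details fixed by this application.

```python
import numpy as np

def stft_spectrogram(x, frame_len=256, hop=128):
    """Frame the signal, window each frame, and take one rFFT per frame.

    At 16 kHz, frame_len=256 corresponds to the 16 ms FFT length and
    hop=128 to the 8 ms frame interval mentioned above."""
    win = np.hanning(frame_len)  # window choice is an assumption
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([
        np.fft.rfft(win * x[i * hop : i * hop + frame_len])
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len // 2 + 1), the complex spectrogram
```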
In step 103, time-frequency speech features are determined based on the spectrogram. Extracting time-frequency speech features on top of the spectrogram, which reflects the joint time-frequency distribution of the noisy speech signal, makes it possible to account for the human auditory mechanism and extract feature parameters that better match the characteristics of human hearing, so that good recognition performance is retained even at reduced signal-to-noise ratios. In some embodiments, these auditorily motivated time-frequency speech features include at least one of MFCC features, BFCC (Bark-frequency cepstral coefficient) features, and Fbank features (features based on a Mel filter bank), in particular MFCC features. MFCC stands for Mel-frequency cepstral coefficients; MFCC extraction involves pre-emphasis filtering, framing, windowing, FFT, Mel filter-bank filtering, a logarithm operation, a discrete cosine transform (DCT), and dynamic (delta) feature extraction. The Mel filter bank mimics the auditory mechanism of the cochlear hair-cell acoustic receptors: resolution is high at low frequencies and low at high frequencies, with an approximately logarithmic mapping to linear frequency, which is not elaborated here. In step 102, FFT results have already been obtained for each segment of noisy speech data. In step 103, their magnitudes can then be taken and filtered by the Mel filter bank, and a natural logarithm is applied to the filter outputs. A DCT can then be performed to obtain the MFCC parameters and the MFCC delta parameters, which together form the MFCC features.
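This pipeline can be sketched compactly with librosa; the 22/6/4 split into static, first-order, and second-order delta coefficients follows the example given later in this description, while the input file name and the exact slicing of the delta dimensions are assumptions for illustration.

```python
import numpy as np
import librosa

# "noisy.wav" is a hypothetical input file at the 16 kHz rate used above
y, sr = librosa.load("noisy.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=22, n_fft=256, hop_length=128)
d1 = librosa.feature.delta(mfcc, order=1)   # first-order dynamic features
d2 = librosa.feature.delta(mfcc, order=2)   # second-order dynamic features
# keep 6 first-order and 4 second-order dimensions -> 32-dim frame vectors
features = np.concatenate([mfcc, d1[:6], d2[:4]], axis=0)
```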
In step 104, based on the time-frequency speech features, a noise-suppression learning network is used to estimate a masking value for each frequency bin as the amount of noise suppression at that bin. The masking value, also called the Mask, represents the amount of noise suppression at each frequency bin at each time instant. In step 105, the noise-suppressed frequency-domain speech signal is determined based on the masking values of the frequency bins and the spectrogram. In some embodiments, the Mask can serve as the suppression coefficient for the frequency-domain components of the spectrogram at each time instant: multiplying the estimated Mask by the spectrogram's frequency-domain components at each instant yields the frequency-domain speech signal fully noise-suppressed in the frequency domain at every instant. The noise-suppression learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. By adopting such RNNs, the interactions between neighboring points of the time-frequency speech features in both the time and frequency domains can be taken into account, yielding a more accurate Mask estimate. An LSTM, while accounting for these neighboring-point interactions, can also forget the influence of points far away in time and frequency, matching the random way in which noise enters noisy speech; this makes the Mask estimate more accurate and its computation converge faster. The inventors found that the LSTM can be kept to 2-4 layers: an LSTM of this size fits in the memory of a small chip, and the processing unit of a small chip can fully carry the workload of the Mask estimation.

In step 106, the power spectral density is computed based on the frequency-domain speech signal. For example, the frequency-domain speech signal contains the signal amplitude and phase at each frequency; squaring the amplitudes yields the power spectral density.

In step 107, based on the power spectral density, LPC processing is performed to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. Performing LPC (Linear Predictive Coding) on the basis of the power spectral density is a conventional noise-reduction technique: the linear part of the speech is predicted, yielding the linear part and the residual part, which is not elaborated here.

In step 108, an ISTFT (inverse STFT) is applied to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal.
In step 109, based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, a restoration learning network is used to recover an enhanced residual part. The restoration learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. Typically, when computing power (a single core) and storage are limited, as on small chips, a 2-4 layer GRU can be used to save computing power and storage, giving priority to an adequately sized LSTM for the noise-suppression learning network in step 104. The inventors found that the GRU can be kept to 2-4 layers: a GRU of this size fits in the memory of a small chip, and the processing unit of a small chip can fully carry the workload of recovering and enhancing the residual part. Furthermore, on a single-core chip, a GRU of this size and a 2-4 layer LSTM can work together, executing the noise suppression and the recovery and enhancement of the residual part in a streaming fashion.
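For illustration, here is a minimal PyTorch sketch of such a 2-4 layer GRU restoration network; framing the inputs as per-sample triples (y(n), z(n), e(n)) and the hidden size are assumptions made for the sketch, not details fixed by this application.

```python
import torch
import torch.nn as nn

class ResidualEnhancer(nn.Module):
    """Sketch of the restoration network: a small GRU stack that maps the
    triple (y(n), z(n), e(n)) to an enhanced residual L(n)."""
    def __init__(self, hidden=64, layers=2):
        super().__init__()
        self.rnn = nn.GRU(3, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, y, z, e):              # each: (batch, samples)
        x = torch.stack([y, z, e], dim=-1)   # (batch, samples, 3)
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)       # enhanced residual L(n)
```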
In step 110, the predicted linear part and the enhanced residual part are summed to obtain the restored speech signal, whose speech intelligibility is higher than a predetermined threshold. Through the above processing, a noise-suppression neural network with learning capability first removes as much of the noise as possible at both the time-domain and frequency-domain levels; LPC technology then predicts the linear part of the clean speech, which is its main component, leaving a relatively small nonlinear residual part to be recovered and enhanced by the restoration neural network. Thus, even while keeping the restoration network small, an edge-computing system-on-chip (even a single-core design) can perform efficient and fast noise reduction of noisy speech signals in varied noise environments (especially heavy noise, poorly structured noise, and the like) and restore lossless, high-intelligibility speech signals with good real-time performance. The processing and restoration method of the present application thereby improves the speech intelligibility of noisy speech signals in varied noise environments (especially heavy noise and complex multi-source noise).

An example of the method for processing and restoring a noisy speech signal is described in detail below with reference to Fig. 3. As shown in Fig. 3, a microphone captures the noisy speech signal x(n), where n denotes the current sampling instant. The noisy speech signal first undergoes an STFT (step 301), and MFCC features are then computed (step 302). The MFCC features serve as feature values; after the noisy speech is noise-suppressed according to these feature values, y(n) is generated, which further reduces the amount of data to be processed. For data at a 16 kHz sampling rate, with a frame length of 8 ms, a frame interval of 8 ms, and an FFT data length of 16 ms (256 samples), MFCC feature extraction yields a 32-dimensional feature vector: 22 dimensions of MFCC features, 6 dimensions of first-order MFCC delta features, and 4 dimensions of second-order MFCC delta features.

Next, the masking value of each frequency bin can be estimated from the MFCC features using the noise-suppression learning network. Specifically, the MFCC features may first be fed to a first group of fully connected layers 303 for dimensionality reduction and then to the noise-suppression learning network, i.e., the RNN/GRU/LSTM network 304. The first group of fully connected layers 303 reduces the 32-dimensional input to a 16-dimensional output, further shrinking the input dimensionality, and hence the size, of the LSTM/RNN/GRU network 304. After this dimensionality reduction, the noise-suppression learning network (the RNN/GRU/LSTM network 304) produces a preliminary estimate, which is fed to a second group of fully connected layers 305 for dimensionality expansion to obtain the masking value of each frequency bin, yielding the mask estimate 306. For example, the output of the RNN/GRU/LSTM network 304 is fed to the second group of fully connected layers 305, which expands its 16-dimensional input to a 127-dimensional output, i.e., Mask values (noise-suppression amounts) for 127 frequency bins, from which the noise-reduced frequency-domain speech signal Y(w) is obtained.
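A minimal PyTorch sketch of the 303-304-305 chain just described, assuming the 32-to-16-to-127 dimensions given in the example; the LSTM depth chosen within the 2-4 layer range mentioned earlier, the ReLU after the reducing layer, and the sigmoid bounding the mask to [0, 1] are our assumptions, not details stated in the text.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Sketch of the suppression chain: FC dimensionality reduction (303),
    recurrent stack (304), FC expansion to one mask per bin (305-306)."""
    def __init__(self, n_in=32, hidden=16, n_bins=127, layers=3):
        super().__init__()
        self.reduce = nn.Linear(n_in, hidden)   # 32 -> 16
        self.rnn = nn.LSTM(hidden, hidden, num_layers=layers,
                           batch_first=True)
        self.expand = nn.Linear(hidden, n_bins)  # 16 -> 127

    def forward(self, feats):                    # feats: (batch, frames, 32)
        h = torch.relu(self.reduce(feats))
        h, _ = self.rnn(h)
        return torch.sigmoid(self.expand(h))     # masks: (batch, frames, 127)
```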
The noise-reduced frequency-domain speech signal Y(w) is then processed along two paths. One path is fed to the LPC computation module, which performs LPC processing to compute the LPC coefficients and linearly predict the speech signal: from the previously enhanced samples s(n-1), s(n-2), ..., s(n-16), combined with the linear prediction coefficients, it predicts the linear part z(n) of the final enhanced speech signal s(n) at the current instant. The other path undergoes an ISTFT (step 307) to obtain the noise-suppressed time-domain signal y(n). This time-domain signal y(n), together with the linear part z(n) and the residual part e(n) (the nonlinear part) obtained on the LPC path, is fed to the restoration neural network, e.g., a GRU/RNN, which corrects the speech residual; the corrected, recovered-and-enhanced residual L(n) is then added (313) to the linear part z(n) to obtain the final enhanced speech signal s(n).

The noise-suppression learning network used in step 304 and the restoration learning network used in step 312 can be trained on a server, and the parameters of the trained networks can be transmitted from the server to the control system based on the edge-computing system-on-chip. In some embodiments, the control system can request updated parameters of the noise-suppression and restoration learning networks from the server via a communication interface. In some embodiments, the trained parameters of the two networks can also be stored in the memory before the control system leaves the factory and used thereafter.

In some embodiments, if the server's computing load permits, the noise-suppression learning network and the restoration learning network can be trained jointly to obtain better-coordinated noise suppression and residual correction. In some embodiments, if the server's computing load is limited, the noise-suppression learning network can be trained on the server first, and the output of the trained noise-suppression learning network can then be used as training data to train the restoration learning network, balancing training speed and quality. For the noise-suppression learning network, a training sample can be formed by extracting MFCC features from a speech signal with preset noise as the network input and obtaining, through testing, Mask values with good suppression performance as the network output; the network is then trained on the samples of the training set, for example using batch gradient descent or stochastic gradient descent. For the training of the restoration learning network, a training sample consists of the noise-suppressed speech signal y(n) of the preset-noise speech signal together with the linearly predicted z(n) and e(n) as the network input, and the ground-truth residual as the network output; the restoration learning network is then trained on the samples of the training set, again for example using batch gradient descent or stochastic gradient descent. When the noise-suppression learning network is trained first and its output is used as training data for the restoration learning network, for a speech signal with preset noise, steps 301-307 can be executed in sequence with the trained noise-suppression learning network to obtain y(n), and steps 308-311 (detailed below) executed in sequence to obtain e(n) and z(n); training data are then generated with the output y(n), the linearly predicted z(n), and e(n) as the network input and the ground-truth residual as the network output, and the restoration learning network is trained accordingly.
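As an illustration of the sequential scheme, here is a minimal PyTorch training-loop sketch for the noise-suppression network; the MSE loss, the SGD optimizer and learning rate, and the data loader of (MFCC features, verified masks) pairs are assumptions for illustration, not choices prescribed by this application.

```python
import torch
import torch.nn.functional as F

model = MaskEstimator()                      # from the earlier sketch
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for feats, target_mask in train_loader:      # hypothetical DataLoader of
    opt.zero_grad()                          # (features, verified masks)
    loss = F.mse_loss(model(feats), target_mask)
    loss.backward()                          # one gradient-descent step
    opt.step()
```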
Returning to Fig. 3, the LPC processing is described in detail.

In step 308, the power spectral density of the noise-reduced frequency-domain signal Y(w) is computed.

In step 309, an IFFT is applied to the power spectral density to obtain the autocorrelation coefficients.
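Steps 308-309 follow the Wiener-Khinchin theorem: the autocorrelation of a signal is the inverse Fourier transform of its power spectral density. A minimal per-frame numpy sketch, with the function name chosen for illustration:

```python
import numpy as np

def autocorrelation_from_spectrum(Y):
    """Y: one rFFT frame of the noise-reduced frequency-domain signal.
    Returns r, where r[j] is the (circular) autocorrelation at lag j."""
    psd = np.abs(Y) ** 2        # step 308: power spectral density
    return np.fft.irfft(psd)    # step 309: Wiener-Khinchin
```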
In step 310, based on the autocorrelation coefficients, the LPC linear prediction coefficients are computed with the Levinson-Durbin algorithm, thereby predicting the linear part of the noise-suppressed time-domain speech signal: the speech linear part z(n) is predicted (step 311) and the residual part e(n) is obtained. The residual part e(n) is enhanced by the nonlinear prediction of the restoration neural network, and the enhanced residual data L(n) and the linearly predicted data z(n) are summed to obtain the final speech signal s(n).

Specifically, the LPC linear prediction coefficients are solved according to the following formula (1):

$$
\begin{bmatrix}
R_n(0) & R_n(1) & \cdots & R_n(p-1) \\
R_n(1) & R_n(0) & \cdots & R_n(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
R_n(p-1) & R_n(p-2) & \cdots & R_n(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}
\tag{1}
$$

where $R_n(j)$, $j = 0, 1, \ldots, p$, is the autocorrelation function of the speech signal, $p$ is the number of LPC linear prediction coefficients, and $a_1, \ldots, a_p$ are the LPC linear prediction coefficients. This equation is also known as the Yule-Walker equation; the matrix on its left-hand side is a Toeplitz matrix, symmetric about the main diagonal, with equal element values along each axis parallel to the main diagonal. The equation can be solved with the Levinson-Durbin recursion.
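Because the coefficient matrix is Toeplitz, the system can be solved in O(p²) operations rather than by general matrix inversion. Below is a minimal numpy sketch of the Levinson-Durbin recursion; the function and variable names are illustrative, not part of this application.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker system (formula (1)) for the LPC
    coefficients a_1..a_p via the Levinson-Durbin recursion.

    r: autocorrelation sequence containing at least r[0..p].
    Returns (a, err): a[i-1] holds a_i; err is the final prediction
    error power."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # reflection coefficient k_i = (r[i] - sum_j a_j r[i-j]) / err
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        for j in range(1, i):          # order-update a_1..a_{i-1}
            a[j] = a_prev[j] - k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k             # shrink the prediction error
    return a[1:], err
```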
The speech linear part z(n) can be computed according to formula (2):

$$z(n) = \sum_{i=1}^{p} a_i \, s(n-i) \tag{2}$$

The residual part e(n) can be computed according to formula (3):

$$e(n) = y(n) - z(n) \tag{3}$$

In this way, the computed speech linear part z(n) and residual part e(n) can be fed to the restoration neural network to repair and improve the residual of the nonlinear part.
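Putting the pieces together, here is a sketch of formulas (2) and (3) with scipy, reusing the helpers from the earlier sketches; treating the enhanced history s(n-1)...s(n-p) as a precomputed array aligned with y is a simplification made for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def predict_linear_and_residual(y, s_hist, r, p=16):
    """y: noise-suppressed time-domain samples; s_hist: previously
    enhanced samples aligned with y; r: autocorrelation from the
    earlier autocorrelation_from_spectrum sketch."""
    a, _ = levinson_durbin(r, p)             # from the earlier sketch
    # formula (2): z(n) = sum_i a_i * s(n-i), an FIR filter with no
    # zero-lag tap applied to the enhanced history
    z = lfilter(np.concatenate(([0.0], a)), [1.0], s_hist)
    e = y - z                                # formula (3)
    return z, e
```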
Returning to Fig. 2, the control system 200 for processing and restoring a noisy speech signal according to the various embodiments of the present application is described. The control system 200 may include an interface 201, a processing unit 202, and a memory 203. The interface 201 may be configured to acquire the noisy speech signal to be processed.

The processing unit 202 may be configured to execute the method for processing and restoring a noisy speech signal according to the various embodiments of the present application. Returning to Fig. 1, the method may include the following steps.
In step 102, an STFT is applied to the noisy speech signal to obtain a spectrogram. The STFT (short-time Fourier transform) first divides the noisy speech signal into frames. For example, at a 16 kHz sampling rate, each frame is 8 ms long and the frame interval is 8 ms. Each frame of noisy speech data is then windowed and Fourier-transformed (FFT), and the per-frame transform results are concatenated to form the spectrogram. For example, each segment fed to the Fourier transform is 16 ms long, i.e. 256 samples. The components of a noisy speech signal usually vary in both the time and frequency domains rather than remaining stationary; the STFT yields a spectrogram that reflects which frequency states occur at which times, that is, the spectrogram reflects the joint distribution of the noisy speech signal over the time and frequency domains.

In step 103, time-frequency speech features are determined based on the spectrogram. Extracting time-frequency speech features on top of the spectrogram, which reflects the joint time-frequency distribution of the noisy speech signal, makes it possible to account for the human auditory mechanism and extract feature parameters that better match the characteristics of human hearing, so that good recognition performance is retained even at reduced signal-to-noise ratios. In some embodiments, these auditorily motivated time-frequency speech features include at least one of MFCC features, BFCC (Bark-frequency cepstral coefficient) features, and Fbank features (features based on a Mel filter bank), in particular MFCC features. MFCC stands for Mel-frequency cepstral coefficients; MFCC extraction involves pre-emphasis filtering, framing, windowing, FFT, Mel filter-bank filtering, a logarithm operation, a discrete cosine transform (DCT), and dynamic (delta) feature extraction. The Mel filter bank mimics the auditory mechanism of the cochlear hair-cell acoustic receptors: resolution is high at low frequencies and low at high frequencies, with an approximately logarithmic mapping to linear frequency, which is not elaborated here. In step 102, FFT results have already been obtained for each segment of noisy speech data. In step 103, their magnitudes can then be taken and filtered by the Mel filter bank, and a natural logarithm is applied to the filter outputs. A DCT can then be performed to obtain the MFCC parameters and the MFCC delta parameters, which together form the MFCC features.

In step 104, based on the time-frequency speech features, a noise-suppression learning network is used to estimate a masking value for each frequency bin as the amount of noise suppression at that bin. The masking value, also called the Mask, represents the amount of noise suppression at each frequency bin at each time instant. In step 105, the noise-suppressed frequency-domain speech signal is determined based on the masking values of the frequency bins and the spectrogram. In some embodiments, multiplying the estimated Mask by the spectrogram's frequency-domain components at each instant yields the frequency-domain speech signal fully noise-suppressed in the frequency domain at every instant. The noise-suppression learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. By adopting such RNNs, the interactions between neighboring points of the time-frequency speech features in both the time and frequency domains can be taken into account, yielding a more accurate Mask estimate. An LSTM, while accounting for these neighboring-point interactions, can also forget the influence of points far away in time and frequency, matching the random way in which noise enters noisy speech; this makes the Mask estimate more accurate and its computation converge faster. The inventors found that the LSTM can be kept to 2-4 layers: an LSTM of this size fits in the memory of a small chip, and the processing unit of a small chip can fully carry the workload of the Mask estimation.

In step 106, the power spectral density is computed based on the frequency-domain speech signal. For example, the frequency-domain speech signal contains the signal amplitude and phase at each frequency; squaring the amplitudes yields the power spectral density.

In step 107, based on the power spectral density, LPC processing is performed to predict the linear part and the residual part of the noise-suppressed time-domain speech signal. Performing LPC (Linear Predictive Coding) on the basis of the power spectral density is a conventional noise-reduction technique: the linear part of the speech is predicted, yielding the linear part and the residual part, which is not elaborated here.

In step 108, an ISTFT (inverse STFT) is applied to the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal.

In step 109, based on the noise-suppressed time-domain speech signal, the linear part, and the residual part, a restoration learning network is used to recover an enhanced residual part. The restoration learning network can be implemented with various RNNs, such as, but not limited to, GRU and LSTM networks. Typically, when computing power (a single core) and storage are limited, as on small chips, a 2-4 layer GRU can be used to save computing power and storage, giving priority to an adequately sized LSTM for the noise-suppression learning network in step 104. The inventors found that the GRU can be kept to 2-4 layers: a GRU of this size fits in the memory of a small chip, and the processing unit of a small chip can fully carry the workload of recovering and enhancing the residual part. Furthermore, on a single-core chip, a GRU of this size and a 2-4 layer LSTM can work together, executing the noise suppression and the recovery and enhancement of the residual part in a streaming fashion.

In step 110, the predicted linear part and the enhanced residual part are summed to obtain the restored speech signal, whose speech intelligibility is higher than a predetermined threshold. Through the above processing, a noise-suppression neural network with learning capability first removes as much of the noise as possible at both the time-domain and frequency-domain levels; LPC technology then predicts the linear part of the clean speech, which is its main component, leaving a relatively small nonlinear residual part to be recovered and enhanced by the restoration neural network. Thus, even while keeping the restoration network small, an edge-computing control system (even implemented as a system-on-chip based on a single-core design) can perform efficient and fast noise reduction of noisy speech signals in varied noise environments (especially heavy noise, poorly structured noise, and the like) and restore lossless, high-intelligibility speech signals with good real-time performance. The processing and restoration method of the present application thereby improves the speech intelligibility of noisy speech signals in varied noise environments (especially heavy noise and complex multi-source noise). Note that the examples of the various steps of the processing and restoration method according to the embodiments of the present application may all be incorporated here and are not repeated.
The memory 203 may be configured to store the trained noise-suppression learning network and restoration learning network.

In some embodiments, when the processing unit 202 is single-core, it is configured to execute the processing of the noise-suppression learning network and the processing of the restoration learning network in a streaming fashion; when the processing unit is dual-core, the two are executed in parallel. Thus, when a single-core design has limited computing power, the noise-suppression network's processing can be executed first, followed by the restoration network's processing; the restoration network's input depends on the noise-suppression network's output in any case. The noise-suppression network architecture adopted in this application can cope with limited computing power, and together with dimensionality reduction such as the fully connected layers it further reduces the amount of data to process, so that even streamed, alternating execution does not compromise the real-time behavior of the restoration, ensuring that the user still enjoys a lossless, high-intelligibility listening experience with good real-time performance.

In some embodiments, various RISC (Reduced Instruction Set Computer) processor IP purchased from companies such as ARM can serve as the processing unit 202 of the control system of this application to perform the corresponding functions, and an embedded system (such as, but not limited to, an SoC) can implement the processing and restoration of noisy speech signals. Specifically, commercially available IP offers many modules, such as, but not limited to, memory (the memory 203 can be on-chip memory or an expansion memory attached externally to the IP), various communication modules (e.g., a Bluetooth module), codecs, and buffers. Other components such as antennas, microphones, and speakers can be attached externally to the chip. The interface 201 can connect an external microphone that captures the noisy speech signal. A user can build an ASIC (application-specific integrated circuit) from purchased IP or self-developed modules to implement the communication modules, the codecs, the steps of the processing and restoration method of this application, and so on, in order to reduce power consumption and cost. Note that "control system" in this application denotes a system that controls the target device in which it resides; it may typically denote, for example, a chip, such as an ASIC implemented as an SoC, but is not limited thereto: any hardware circuit, software-processor configuration, or combined hardware-software firmware capable of such control can implement the control system. For example, the processing executed by the processing unit 202 can be implemented as executable instructions run by a RISC processor, as distinct hardware circuit modules, or as combined hardware-software firmware, which is not elaborated here.
Furthermore, although exemplary embodiments have been described herein, the scope includes any and all embodiments based on this application having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in this specification or during the prosecution of this application, which examples are to be construed as non-exclusive. The specification and examples are therefore intended to be regarded as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.

The above description is intended to be illustrative, not restrictive. For example, the above examples (or one or more aspects thereof) may be used in combination with one another. Other embodiments may be used, for example, by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline this application. This should not be interpreted as an intention that an unclaimed feature is essential to any claim. Rather, the subject matter of this application may lie in fewer than all features of a particular disclosed embodiment. Thus, the claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

The above embodiments are merely exemplary embodiments of the present application and are not intended to limit the invention, whose scope of protection is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the invention within the spirit and scope of the present application, and such modifications or equivalent substitutions shall also be deemed to fall within the scope of the invention.