Technical Field
The present application relates to the field of Bluetooth technology, and in particular to a keyword recognition method, device, and medium.
Background Art
With the development of science and technology, keyword recognition for wireless audio is being applied in more and more scenarios, for example in Bluetooth remote controls. As shown in Figure 1, the process of controlling a home appliance with a Bluetooth remote control is as follows: the user issues a voice command, such as "turn on the air conditioner"; the remote control's microphone captures the speech, which passes through analog-to-digital conversion (ADC), audio preprocessing, and an audio encoder to produce a compressed audio packet that is finally sent out through the wireless communication module. The wireless communication module at the receiving end receives the compressed packet and invokes the audio decoder to produce audio PCM data; the audio post-processing module and the keyword recognition module then identify the keyword, such as "turn on the air conditioner", which is converted into the corresponding control signal to control the appliance.
In the keyword recognition process of the prior art, the receiving end must perform multiple time-frequency conversions. As shown in Figure 1, the wireless communication module at the receiving end converts the received frequency-domain code stream into time-domain PCM data. When the audio post-processing module performs noise reduction, however, the PCM data must be converted back into frequency-domain data so that the noise reduction can be carried out in the frequency domain, and the denoised audio is then passed to the keyword recognition module as PCM data, where yet another time-frequency conversion is required during recognition. The prior art therefore performs multiple time-frequency conversions during keyword recognition, which wastes computing power and storage and introduces system latency. Moreover, because existing noise reduction techniques are designed mainly to improve subjective listening quality rather than to support keyword recognition, they introduce nonlinear processing during noise reduction, which degrades the accuracy of keyword recognition.
Summary of the Invention
In view of the problems in the prior art, such as the waste of computing power and the system latency caused by performing multiple time-frequency conversions during keyword recognition, the present application mainly provides a keyword recognition method, device, and storage medium.
To achieve the above purpose, one technical solution adopted by the present application is to provide a keyword recognition method, which includes: at the signal receiving end, partially decoding the current frame of the code stream until the discrete cosine transform (DCT) spectral coefficients of the current frame are obtained; computing the noisy speech features of the current frame from the DCT spectral coefficients, and feeding the noisy speech features into a pre-trained noise reduction neural network model to compute a noise reduction gain; applying the noise reduction gain to the DCT spectral coefficients to obtain denoised DCT spectral coefficients; extracting denoised spectral coefficient features from the denoised DCT spectral coefficients; and, in a pre-trained recognition neural network model, performing keyword recognition based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames, to obtain the keywords in the speech code stream.
Optionally, a mixed speech code stream and a clean keyword speech code stream are each partially decoded to obtain first DCT spectral coefficients of the mixed speech code stream and second DCT spectral coefficients of the keyword speech code stream; noisy speech features are computed from the first DCT spectral coefficients, and noise-free speech features are computed from the second DCT spectral coefficients; and the noisy and noise-free speech features are fed into the noise reduction neural network model for training, so that the model produces a noise reduction gain from the input noisy speech features.
Optionally, the weights and biases of the noise reduction neural network model and the recognition neural network model are adjusted according to the recognized keywords and the keywords in the keyword speech, until the accuracy of the recognized keywords exceeds a predetermined threshold, at which point training of both models ends.
Optionally, a cross-entropy loss function is used to train the noise reduction neural network model and the recognition neural network model.
Optionally, the subband noisy speech features of each frame of the first DCT spectral coefficients of the mixed speech code stream, and the subband noise-free speech features of the corresponding frame of the second DCT spectral coefficients of the keyword speech code stream, are obtained; the subband noisy and noise-free speech features are fed into the noise reduction neural network model for training, so that the trained model produces, from an input frame of spectral coefficients, the subband gain for each subband of that frame.
Optionally, the first and second DCT spectral coefficients are used to divide the current frame of the mixed speech code stream and the corresponding frame of the keyword speech code stream into subbands; the subband energy of each subband is computed from the pseudo-spectral coefficients of the subband, and the subband energy is used to compute the corresponding subband noisy or noise-free speech features.
Optionally, the denoised DCT spectral coefficients of each subband are computed as the product of the subband gain and the DCT spectral coefficients of that subband; and the denoised DCT spectral coefficients of all subbands are concatenated to obtain the denoised DCT spectral coefficients of the current frame.
Optionally, the denoised DCT spectral coefficients are pre-emphasized to obtain emphasized denoised DCT spectral coefficients; the energy spectrum of the emphasized coefficients is generated; the channel energies of the energy spectrum after passing through a Mel filter bank are computed; and the channel energies are logarithmically transformed and a discrete cosine transform is applied to the result, with the resulting Mel-frequency cepstral coefficients used as the denoised spectral coefficient features.
Another technical solution adopted by the present application is to provide a keyword recognition device, which includes: a semi-decoding module, configured to partially decode, at the signal receiving end, the current frame of the code stream until the DCT spectral coefficients of the current frame are obtained; a noise reduction gain acquisition module, configured to compute the noisy speech features of the current frame from the DCT spectral coefficients and feed them into a pre-trained noise reduction neural network model to compute a noise reduction gain; a noise reduction module, configured to apply the noise reduction gain to the DCT spectral coefficients to obtain denoised DCT spectral coefficients; a denoised spectral coefficient feature extraction module, configured to extract denoised spectral coefficient features from the denoised DCT spectral coefficients; and a keyword recognition module, configured to perform keyword recognition in a pre-trained recognition neural network model based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames, to obtain the keywords in the speech code stream.
Another technical solution adopted by the present application is to provide a computer-readable storage medium storing computer instructions that, when executed, perform the keyword recognition method of the first solution.
The beneficial effect of the technical solution of the present application is that the method does not need to perform multiple time-frequency conversions during keyword recognition, reducing the waste of computing power and the system latency of the recognition process, while improving the noise reduction technique so that it is better suited to keyword recognition.
Brief Description of the Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the prior-art keyword recognition process referenced by a keyword recognition method of the present application;
FIG. 2 is a schematic diagram of a specific implementation of a keyword recognition method of the present application;
FIG. 3 is a schematic diagram of a pre-emphasis frequency response in a keyword recognition method of the present application;
FIG. 4 is a schematic diagram of a keyword recognition process and a training process in a keyword recognition method of the present application;
FIG. 5 is a schematic diagram of a specific implementation of a keyword recognition device of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. These drawings and the accompanying text are not intended to limit the scope of the present application in any way, but rather to illustrate its concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
The preferred embodiments of the present application are described in detail below in conjunction with the accompanying drawings, so that the advantages and features of the present application can be more easily understood by those skilled in the art, thereby defining the protection scope of the present application more clearly.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes it.
In the prior art, the wireless communication module at the Bluetooth receiving end receives the audio code stream transmitted by the Bluetooth transmitter and uses the audio decoder to obtain PCM data through code stream parsing, arithmetic and residual decoding, noise filling and global gain, time-domain noise shaping decoding, transform-domain noise shaping decoding, frequency-domain-to-time-domain conversion (usually a low-delay inverse modified discrete cosine transform), and long-term post-filter processing. The audio processing module then denoises the PCM data by converting it from the time domain to the frequency domain, estimating the noise and the audio signal in the frequency domain, computing a gain, applying the gain to the spectral coefficients, and converting back from the frequency domain to the time domain, yielding denoised PCM data. The denoised PCM data is processed by the keyword recognition module through feature extraction, a deep neural network, and post-processing to obtain the keywords to be recognized, where the feature extraction step includes pre-emphasis, windowing, time-domain-to-frequency-domain conversion (usually a discrete Fourier transform), energy spectrum computation, a Mel filter bank, a logarithmic transform, a discrete cosine transform, and Mel-frequency cepstral coefficients (MFCC).
The keyword recognition process of the prior art requires multiple time-frequency conversions, which introduces algorithmic and processing delays. The main function of the time-frequency modules is to convert between the time and frequency domains; they are computationally expensive and demand substantial storage, which poses a serious challenge to deploying a keyword recognition module on an embedded Bluetooth receiver. In addition, the long-term post-filter in the audio decoder serves only to improve subjective listening quality; it brings no benefit to the noise reduction or keyword recognition modules and is computationally expensive, so in the above application scenario full decoding is unnecessary. At the same time, the prior art introduces nonlinear processing during noise reduction, which degrades the accuracy of keyword recognition.
The technical solution of the present application, and how it solves the above technical problems, are described in detail below with specific embodiments. The specific embodiments described below may be combined with one another to form new embodiments, and the same or similar ideas or processes described in one embodiment may not be repeated in others. The embodiments of the present application are described below in conjunction with the accompanying drawings.
FIG. 2 shows an implementation of a keyword recognition method of the present application.
The keyword recognition method shown in FIG. 2 includes: step S201, at the signal receiving end, partially decoding the current frame of the code stream until the DCT spectral coefficients of the current frame are obtained;
step S202, computing the noisy speech features of the current frame from the DCT spectral coefficients, and feeding the noisy speech features into a pre-trained noise reduction neural network model to compute a noise reduction gain;
step S203, applying the noise reduction gain to the DCT spectral coefficients to obtain denoised DCT spectral coefficients;
step S204, extracting denoised spectral coefficient features from the denoised DCT spectral coefficients; and,
step S205, in a pre-trained recognition neural network model, performing keyword recognition based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames, to obtain the keywords in the speech code stream.
This method does not require multiple time-frequency conversions during keyword recognition, which reduces the waste of computing power, the storage requirements, the algorithmic delay, and the system latency of the recognition process; it improves the noise reduction technique so that it is better suited to keyword recognition, and it also eliminates unnecessary decoding steps to save computing power.
In the implementation shown in FIG. 2, the keyword recognition method includes step S201: at the signal receiving end, partially decoding the current frame of the code stream until the DCT spectral coefficients of the current frame are obtained. This step eliminates unnecessary decoding, saving computing power during keyword recognition and reducing storage requirements.
Specifically, during keyword recognition, the input PCM signal is first divided into frames, and decoding and subsequent processing are performed frame by frame. Taking the LC3 decoder as an example, each frame is processed through arithmetic and residual decoding, noise filling, global gain, time-domain noise shaping decoding, and transform-domain noise shaping decoding to obtain the DCT spectral coefficients X(k) of the current frame. For example, if the PCM signal is divided into 10 ms frames of 160 samples each, partial decoding yields the DCT spectral coefficients X(k), k = 0, 1, 2, ..., 159. Because the long-term post-filter decoding step is omitted, the computing power and storage requirements are reduced without affecting the audio quality needed here; the PCM signal may be speech data containing noise or clean speech data.
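To make the framing concrete, the following minimal Python sketch splits a 16 kHz signal into the 10 ms, 160-sample frames used above; the partial-decode routine is shown only as a hypothetical stub, since it depends on the specific LC3 decoder implementation:

```python
import numpy as np

FRAME_LEN = 160  # 10 ms at 16 kHz, as in the example above

def split_into_frames(pcm: np.ndarray) -> np.ndarray:
    """Split a 16 kHz PCM signal into non-overlapping 10 ms frames."""
    n_frames = len(pcm) // FRAME_LEN
    return pcm[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

def partial_decode(frame_payload: bytes) -> np.ndarray:
    """Hypothetical stub: decode one frame of the code stream only as far as the
    160 MDCT spectral coefficients X(k), skipping the long-term post filter."""
    raise NotImplementedError("depends on the LC3 decoder in use")
```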
In the specific implementation shown in FIG. 2, the keyword recognition method further includes step S202: computing the noisy speech features of the current frame from the DCT spectral coefficients, and feeding the noisy speech features into a pre-trained noise reduction neural network model to compute a noise reduction gain. This step improves the noise reduction technique so that it is better suited to keyword recognition; since no additional time-frequency conversion is required, it also reduces computing power and storage requirements and improves the processing efficiency of the system.
In a specific embodiment of the present application, the training of the noise reduction neural network model includes: partially decoding a mixed speech code stream and a clean keyword speech code stream to obtain first DCT spectral coefficients of the mixed speech code stream and second DCT spectral coefficients of the keyword speech code stream; computing noisy speech features from the first DCT spectral coefficients and noise-free speech features from the second DCT spectral coefficients; and feeding the noisy and noise-free speech features into the noise reduction neural network model for training, so that the model produces a noise reduction gain from the input noisy speech features.
Specifically, in the offline training process of FIG. 4, an encoder encodes the clean keyword speech, and the mixed speech obtained by adding noise to that keyword speech, to produce a keyword speech code stream and a mixed speech code stream; both are partially decoded to obtain the corresponding first and second DCT spectral coefficients. In the feature-1 extraction step of FIG. 4, the noisy speech features of the mixed audio are computed from the first DCT spectral coefficients; in the feature-2 extraction step, the noise-free speech features of the clean keyword speech are computed from the second DCT spectral coefficients. The noise reduction gain applied to the mixed audio is obtained in the noise reduction step by comparing the noisy and noise-free speech features. The present application does not restrict the choice of noise reduction neural network model; considering the temporal correlation between speech frames, a recurrent neural network (RNN) is preferably selected as the noise reduction neural network model.
In a specific embodiment of the present application, the subband noisy speech features of each frame of the first DCT spectral coefficients of the mixed speech code stream, and the subband noise-free speech features of the corresponding frame of the second DCT spectral coefficients of the keyword speech code stream, are obtained; the subband noisy and noise-free speech features are fed into the noise reduction neural network model for training, so that the trained model produces, from an input frame of spectral coefficients, the subband gain for each subband of that frame.
Further, the first and second DCT spectral coefficients are used to divide the current frame of the mixed speech code stream and the corresponding frame of the keyword speech code stream into subbands; the subband energy of each subband is computed from the pseudo-spectral coefficients of the subband, and the subband energy is used to compute the corresponding subband noisy or noise-free speech features.
Specifically, the subband noisy speech features and the subband noise-free speech features are computed in the same way, as follows.
For example, when the speech frames obtained by framing have a sampling rate of 16 kHz and a frame length of 10 ms, the 160 spectral coefficients of each frame are divided into 17 subbands according to the Bark scale, with subband indices subband = 0, 1, 2, ..., 16 and per-subband coefficient counts of 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 16, 16, 16, 24, 24.
Because the pseudo spectrum corresponds more closely to the actual frequency distribution, the pseudo-spectral coefficient of each coefficient in a subband is computed, using the usual MDCT pseudo-spectrum approximation X_pseudo(k) = sqrt(X(k)^2 + ((X(k-1) - X(k+1))/2)^2), where X_pseudo(k) is the pseudo-spectral coefficient, X(k) is the DCT spectral coefficient of the subband, N_E is the frame length, and X(-1) = X(N_E) = 0 at the boundaries. The pseudo-spectral coefficients of each subband are then used to compute the pseudo-spectral subband energy Energy_subband(b) = ∑_k X_pseudo(k), where b is the subband index and k runs over the pseudo-spectral coefficient indices of that subband. For example, for subband b = 0, the subband energy is Energy_subband(0) = X_pseudo(0) + X_pseudo(1) + X_pseudo(2) + X_pseudo(3).
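The subband bookkeeping just described can be sketched as follows (a minimal numpy sketch, assuming the pseudo-spectrum approximation given above):

```python
import numpy as np

# Coefficients per Bark-scale subband for a 160-coefficient frame (sums to 160).
SUBBAND_SIZES = [4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 16, 16, 16, 24, 24]
SUBBAND_EDGES = np.concatenate(([0], np.cumsum(SUBBAND_SIZES)))

def pseudo_spectrum(x: np.ndarray) -> np.ndarray:
    """Pseudo-spectral coefficients of one frame of MDCT coefficients,
    assuming the neighbor-difference approximation with zero boundary padding."""
    padded = np.concatenate(([0.0], x, [0.0]))      # X(-1) = X(N) = 0
    mdst_est = (padded[:-2] - padded[2:]) / 2.0     # (X(k-1) - X(k+1)) / 2
    return np.sqrt(x ** 2 + mdst_est ** 2)

def subband_energies(x: np.ndarray) -> np.ndarray:
    """Energy_subband(b) = sum of the pseudo-spectral coefficients in subband b."""
    xp = pseudo_spectrum(x)
    return np.array([xp[a:b].sum()
                     for a, b in zip(SUBBAND_EDGES[:-1], SUBBAND_EDGES[1:])])
```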
The energy of each pseudo-spectral subband is logarithmically transformed: Energy_log(m) = log(Energy_subband(m)), m = 0, 1, ..., 16. A discrete cosine transform of the log subband energies then yields the Bark-frequency cepstral coefficients (BFCC), following the standard DCT-II form BFCC(i) = ∑_{m=0}^{16} Energy_log(m) · cos(π·i·(m + 0.5)/17), i = 0, 1, ..., 16.
Then the BFCC are used to compute the speech features of the current frame: the DCT of the pseudo-spectral subband energies of the 17 subbands is computed, and the temporal differences of the first 6 of those 17 DCT coefficients are computed; these 6 temporal differences together with the 17 DCT coefficients form the feature data extracted for the current frame. The noisy speech features and the noise-free speech features are computed by the same process.
Finally, the DCT results and the temporal differences of the pseudo-spectral subband energies are used by the noise reduction neural network model to compute the gains output by the noise reduction module. The first-order difference is BFCC_diff(k) = BFCC_curr(k) - BFCC_lastlast(k), k = 0, 1, 2, 3, 4, 5, where BFCC_diff(k) is the first-order BFCC difference, BFCC_curr(k) is the k-th BFCC coefficient of the current frame, and BFCC_lastlast(k) is the k-th BFCC coefficient of the frame two frames before the current one; the second-order difference is BFCC_diff2(k) = BFCC_curr(k) - 2·BFCC_last(k) + BFCC_lastlast(k), k = 0, 1, 2, 3, 4, 5, where BFCC_diff2(k) is the k-th second-order BFCC difference and BFCC_last(k) is the k-th BFCC coefficient of the previous frame. The output of the noise reduction neural network model is the subband gain for each subband of the current frame, Gain_nr(b), b = 0, 1, 2, ..., 16.
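A compact sketch of this feature assembly, assuming the DCT-II convention for the BFCC (the source does not fix the DCT normalization) and keeping both difference orders listed above (17 + 6 + 6 = 29 features per frame):

```python
import numpy as np
from scipy.fftpack import dct

N_BANDS = 17
N_DIFF = 6  # temporal differences are kept for the first six BFCCs

def bfcc(energies: np.ndarray) -> np.ndarray:
    """DCT-II of the log pseudo-spectral subband energies (Bark-frequency cepstrum)."""
    return dct(np.log(energies + 1e-12), type=2)

def frame_features(e_curr, e_last, e_lastlast) -> np.ndarray:
    """17 BFCCs plus 6 first-order and 6 second-order temporal differences."""
    c, c1, c2 = bfcc(e_curr), bfcc(e_last), bfcc(e_lastlast)
    diff1 = c[:N_DIFF] - c2[:N_DIFF]                      # BFCC_diff
    diff2 = c[:N_DIFF] - 2 * c1[:N_DIFF] + c2[:N_DIFF]    # BFCC_diff2
    return np.concatenate([c, diff1, diff2])              # feeds the gain RNN
```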
In the specific implementation shown in FIG. 2, the keyword recognition method further includes step S203: applying the noise reduction gain to the DCT spectral coefficients to obtain denoised DCT spectral coefficients. This step denoises the audio without any time-frequency conversion, while making the noise reduction better suited to keyword recognition.
In a specific embodiment of the present application, the noise reduction process includes: computing the denoised DCT spectral coefficients of each subband as the product of the subband gain and the DCT spectral coefficients of that subband; and concatenating the denoised DCT spectral coefficients of all subbands to obtain the denoised DCT spectral coefficients of the current frame. This embodiment performs noise reduction without time-frequency conversion, reducing the amount of computation while preserving audio quality and laying the foundation for subsequent keyword recognition.
For example, if the four spectral coefficients of the first subband are X(0), X(1), X(2), and X(3), and the subband gain computed by the noise reduction module for the first subband is Gain_nr(0), then applying the subband gain is computed as: X′(0) = X(0)·Gain_nr(0); X′(1) = X(1)·Gain_nr(0); X′(2) = X(2)·Gain_nr(0); X′(3) = X(3)·Gain_nr(0), where X′(0), X′(1), X′(2), and X′(3) are the denoised spectral coefficients.
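Since every coefficient in a subband is scaled by the same gain, the per-subband products and the final concatenation reduce to one vectorized multiply (a self-contained sketch repeating the subband size table from above):

```python
import numpy as np

SUBBAND_SIZES = [4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 16, 16, 16, 24, 24]

def apply_subband_gains(x: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """Multiply each MDCT coefficient by the gain of the subband it belongs to;
    the per-subband results are thereby concatenated into one denoised frame."""
    per_coeff_gain = np.repeat(gains, SUBBAND_SIZES)  # expand 17 gains to 160 coefficients
    return x * per_coeff_gain
```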
In the specific implementation shown in FIG. 2, the keyword recognition method further includes step S204: extracting denoised spectral coefficient features from the denoised DCT spectral coefficients. This step lays the foundation for keyword recognition and reduces system latency and the waste of computing power and storage space.
In a specific embodiment of the present application, step S204 includes: pre-emphasizing the denoised DCT spectral coefficients to obtain emphasized denoised DCT spectral coefficients; generating the energy spectrum of the emphasized coefficients; computing the channel energies of the energy spectrum after passing through a Mel filter bank; and logarithmically transforming the channel energies and applying a discrete cosine transform to the result, with the resulting Mel-frequency cepstral coefficients used as the denoised spectral coefficient features. This embodiment requires no repeated time-frequency conversions, reducing the waste of computing power, the storage requirements, the algorithmic delay, and the system latency of keyword recognition.
Specifically, when extracting the denoised spectral coefficient features, the denoised DCT spectral coefficients must first be pre-emphasized. In the prior art, pre-emphasis is usually performed in the time domain, i.e. in the audio preprocessing or audio post-processing module of Figure 1, as y[n] = x[n] - 0.95·x[n-1], where x[n] is the current PCM sample of the input audio, x[n-1] is the previous sample, and y[n] is the pre-emphasis (time-domain filtering) result; the frequency response of this prior-art filter is shown in Figure 3. The pre-emphasis of the present application is implemented in the frequency domain. Taking a 16 kHz sampling rate as an example, the pre-emphasis frequency response is sampled at 50 Hz intervals and stored as a pre-emphasis frequency response table p(0), p(1), p(2), ..., p(159). With 160 MDCT spectral coefficients in total, N_F = 160, and the pre-emphasis of the present application is X_pre(k) = X′(k)·p(k), k = 0, 1, ..., N_F - 1, where X_pre(k) is the pre-emphasis result, X′(k) is the denoised DCT spectral coefficient of the current frame, and p(k) is the pre-emphasis frequency response table.
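A sketch of the table-based frequency-domain pre-emphasis follows; building the table by sampling the magnitude response of the prior-art filter y[n] = x[n] - 0.95·x[n-1] at each 50 Hz bin center is an assumption here, since the patent only states that the table is stored:

```python
import numpy as np

FS = 16000
N_F = 160
BIN_HZ = FS / 2 / N_F  # 50 Hz per MDCT bin at 16 kHz

# Hypothetical construction of the table: magnitude response of the prior-art
# time-domain filter 1 - 0.95 z^-1, evaluated at each bin center.
freqs = (np.arange(N_F) + 0.5) * BIN_HZ
p = np.abs(1.0 - 0.95 * np.exp(-2j * np.pi * freqs / FS))

def pre_emphasize(x_denoised: np.ndarray) -> np.ndarray:
    """Frequency-domain pre-emphasis: one multiply per denoised MDCT coefficient."""
    return x_denoised * p
```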
The pre-emphasis result is used to compute the energy spectrum. Specifically, pseudo-spectral coefficients are first generated from the pre-emphasis result as X_pseudo(k) = sqrt(X_pre(k)^2 + ((X_pre(k-1) - X_pre(k+1))/2)^2), where k = 0, ..., N_F - 1 and X_pre(-1) = X_pre(N_F) = 0. The present application does not restrict whether pseudo-spectral coefficients are used; for example, an energy spectrum generated directly from the MDCT spectral coefficients can also be used for keyword recognition. However, because the energy distribution of the MDCT pseudo-spectral coefficients corresponds better to that of Fourier-transform spectral coefficients, using the pseudo spectrum to compute the energy spectrum improves training and recognition performance. The energy spectrum is then computed from the pseudo-spectral coefficients as Energy(k) = X_pseudo(k)^2.
The computed result is filtered through a Mel filter bank, i.e. the spectral energy is passed through the Mel filters and the energy of each channel is computed as Energy_mel(m) = ∑_k H_m(k)·Energy(k), where H_m is the m-th Mel filter.
The filtered channel energies are logarithmically transformed as Mel_db(m) = log(|Energy_mel(m)|), m = 0, 1, ..., M - 1. A discrete cosine transform of the log result yields the Mel-frequency cepstral coefficients (MFCC), which are the denoised spectral coefficient features: MFCC(d) = ∑_{m=0}^{M-1} Mel_db(m)·cos(π·d·(m + 0.5)/M), d = 0, 1, ..., D - 1, where D is the dimension of the MFCC features.
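Putting the operations of step S204 together, the feature extraction chain can be sketched as below; the triangular Mel filter construction, M = 40 channels, and D = 13 cepstra are assumptions, as the patent does not fix them:

```python
import numpy as np
from scipy.fftpack import dct

FS, N_F, M, D = 16000, 160, 40, 13  # M mel channels and D cepstra are assumed values

def mel_filterbank(n_bins=N_F, n_mels=M, fs=FS) -> np.ndarray:
    """Triangular Mel filters over the MDCT bin centers (one common construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    centers = inv(np.linspace(mel(0.0), mel(fs / 2), n_mels + 2))
    bins = (np.arange(n_bins) + 0.5) * (fs / 2 / n_bins)
    H = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        l, c, r = centers[m], centers[m + 1], centers[m + 2]
        H[m] = np.clip(np.minimum((bins - l) / (c - l), (r - bins) / (r - c)), 0, None)
    return H

H_MEL = mel_filterbank()

def denoised_features(x_emphasized: np.ndarray) -> np.ndarray:
    """Pseudo spectrum -> energy spectrum -> Mel energies -> log -> DCT -> D MFCCs."""
    padded = np.concatenate(([0.0], x_emphasized, [0.0]))
    pseudo = np.sqrt(x_emphasized ** 2 + ((padded[:-2] - padded[2:]) / 2.0) ** 2)
    energy = pseudo ** 2
    mel_db = np.log(np.abs(H_MEL @ energy) + 1e-12)
    return dct(mel_db, type=2)[:D]
```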
In the specific implementation shown in FIG. 2, the keyword recognition method further includes step S205: in a pre-trained recognition neural network model, performing keyword recognition based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames, to obtain the keywords in the speech code stream. This step broadens the applicability of the keyword recognition process: it can be used for low-power Bluetooth audio, for classic Bluetooth, or in other wireless communication fields. The joint processing of the noise reduction module and the keyword recognition module improves recognition performance in noisy environments; the method makes full use of the audio decoder and the existing algorithm modules, avoiding a large number of time-frequency conversion operations, lowering system complexity, and making the method easy to implement in an embedded system, thereby saving power, extending device battery life, saving the storage space required by the relevant modules, and reducing device cost.
Specifically, in the online inference process of FIG. 4, the trained recognition neural network model processes the current frame of the code stream to extract its keywords, and the recognized keywords are used to control products such as household appliances. Both the keyword recognition process and the noise reduction process are performed by neural networks; preferably, the neural network of the keyword recognition module is a convolutional neural network (CNN).
In a specific embodiment of the present application, a cross-entropy loss function is used to train the noise reduction neural network model and the recognition neural network model. By using the cross-entropy loss function in both modules, this embodiment makes keyword detection and recognition in the overall system more efficient and more accurate.
Specifically, noise reduction in the prior art usually uses a mean-squared-error loss function, which maximizes the signal-to-noise ratio of the output audio signal and thus achieves noise reduction, but this does little to improve the accuracy or efficiency of keyword recognition. The keyword recognition process in the prior art, by contrast, often uses a cross-entropy loss function, so that the recognition results improve the efficiency and accuracy of keyword recognition. The present application therefore uses a cross-entropy loss function for both the keyword recognition process and the noise reduction process, which improves the accuracy and efficiency of keyword recognition. In the present application the loss function is the standard cross entropy, Loss = -∑_i y_i·log(ŷ_i), where y is the true label of the keyword and ŷ is the output of the keyword recognition process.
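A minimal sketch of this loss, assuming one-hot keyword labels over a small set of classes (the exact label encoding is not given in the source):

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Loss = -sum_i y_i * log(y_hat_i) for one-hot labels y_true and
    softmax-normalized predictions y_pred."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

# Example: 3 keyword classes plus a "no keyword" class, true class = 1.
y = np.array([0.0, 1.0, 0.0, 0.0])
y_hat = np.array([0.1, 0.7, 0.1, 0.1])
print(cross_entropy(y, y_hat))  # ~0.357
```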
In a specific embodiment of the present application, the training process includes: adjusting the weights and biases of the noise reduction neural network model and the recognition neural network model according to the recognized keywords and the keywords in the keyword speech, until the accuracy of the recognized keywords exceeds a predetermined threshold, at which point training of both models ends.
Specifically, when the accuracy of the recognition results reaches or exceeds the predetermined threshold, keyword recognition is performed using the weight and bias configuration of the keyword detection module and the noise reduction module at that point.
In a specific embodiment of the present application, the processing is as shown in FIG. 4. In the offline training process, the noise-mixed keyword speech and the clean keyword speech are first framed separately, and each frame is fully encoded and then partially decoded, yielding for each frame the DCT spectral coefficients of the noise-mixed keyword speech and of the clean keyword speech. These are processed by the feature extraction steps described above to obtain the noisy speech features (feature 1) and the noise-free speech features (feature 2). The noise reduction module is trained on the noisy and noise-free speech features to produce the subband gains for the noise-mixed audio; in the apply-gain module these subband gains denoise the DCT spectral coefficients of the noise-mixed keyword speech, and the denoised spectral coefficient features are extracted from the denoised DCT spectral coefficients. The keyword recognition module is trained on the denoised spectral coefficient features, and performs keyword recognition based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames to obtain the keywords in the speech code stream. To speed up offline training, the spectral coefficients of the audio data after the LD-MDCT transform may also be used; preferably, the DCT spectral coefficients obtained after encoding and partial decoding (LD-IMDCT spectral coefficients) are used for training, as they better match the actual application scenario. The noisy speech features of the current frame are obtained by feature extraction from the noise-mixed speech, and the noise-free speech features by feature extraction from the keyword speech.
As indicated by the dotted line in FIG. 4, after keyword recognition is completed, the recognition result is fed back to the keyword recognition module, which propagates the system's neural-network gradient information to the noise reduction module; the weights and biases of the parameters in both the keyword recognition module and the noise reduction module are thus adjusted according to the recognition result, so that the output of the noise reduction module is more easily detected by the keyword detection module.
During online inference, at the audio receiving end the Bluetooth communication module passes the audio code stream of the current frame to the partial decoding module, which decodes it as far as the DCT spectral coefficients. The feature-4 extraction module (using the same method as feature 1) extracts the noisy speech features, and the noise reduction module computes the corresponding noise reduction gain from them. The apply-gain module applies the computed gain to the DCT spectral coefficients of the current frame to obtain denoised DCT spectral coefficients; the feature-5 extraction module uses the denoised DCT spectral coefficients of the current frame and of a predetermined number of historical frames to produce the denoised spectral coefficient features (same method and content as feature 3), which are used for keyword recognition. A control signal is then issued according to the recognized keyword to make the home appliance perform the corresponding action. Since a keyword generally consists of 3 to 5 characters and typically lasts 1 to 1.5 seconds, taking a 1-second keyword as an example with a 10 ms frame length, there are 100 frames in total, so the predetermined number of historical frames of denoised DCT spectral coefficients is 99.
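The per-frame recognition over this sliding one-second context can be sketched with a simple history buffer; the feature dimension and the classifier call are hypothetical placeholders:

```python
import numpy as np
from collections import deque

N_HISTORY = 99  # 99 historical frames + the current frame = 100 frames = 1 s at 10 ms/frame
D = 13          # assumed per-frame feature dimension

class KeywordSpotter:
    def __init__(self):
        self.history = deque(maxlen=N_HISTORY)

    def push_frame(self, features: np.ndarray):
        """Accumulate one frame of denoised spectral-coefficient features (shape (D,)),
        classifying once a full one-second window is available."""
        if len(self.history) == N_HISTORY:
            window = np.stack(list(self.history) + [features])  # shape (100, D)
            keyword = self._classify(window)
            if keyword is not None:
                print("detected:", keyword)
        self.history.append(features)

    def _classify(self, window: np.ndarray):
        # Placeholder: a real system would run the pre-trained recognition CNN here.
        return None
```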
FIG. 5 shows a specific implementation of a keyword recognition device of the present application.
In the specific implementation shown in FIG. 5, the keyword recognition device mainly includes: a semi-decoding module 501, configured to partially decode, at the signal receiving end, the current frame of the code stream until the DCT spectral coefficients of the current frame are obtained;
a noise reduction gain acquisition module 502, configured to compute the noisy speech features of the current frame from the DCT spectral coefficients and feed them into a pre-trained noise reduction neural network model to compute a noise reduction gain;
a noise reduction module 503, configured to apply the noise reduction gain to the DCT spectral coefficients to obtain denoised DCT spectral coefficients;
a denoised spectral coefficient feature extraction module 504, configured to extract denoised spectral coefficient features from the denoised DCT spectral coefficients; and,
a keyword recognition module 505, configured to perform keyword recognition in a pre-trained recognition neural network model based on the denoised spectral coefficient features of the current frame and those of a predetermined number of historical frames, to obtain the keywords in the speech code stream.
The keyword recognition device provided by the present application can be used to perform the keyword recognition method described in any of the above embodiments; its implementation principles and technical effects are similar and are not repeated here.
In a specific embodiment of the present application, each functional module of the keyword recognition device may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors together with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In another specific implementation of the present application, a computer-readable storage medium stores computer instructions that, when executed, perform the keyword recognition method described in the above embodiments.
In a specific implementation of the present application, a computer device includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores computer instructions executable by the at least one processor, and the at least one processor executes the computer instructions to perform the keyword recognition method described in the above embodiments.
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
以上所述仅为本申请的实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。The above description is only of embodiments of the present application and is not intended to limit the patent scope of the present application. Any equivalent structural transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.