CN115620737A

Info

Publication number: CN115620737A
Application number: CN202211193923.4A
Authority: CN
Inventors: 徐友聚; 朱福国; 秦亚光; 尹悦
Original assignee: Beijing Eswin Computing Technology Co Ltd
Current assignee: Beijing Eswin Computing Technology Co Ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-01-17

Abstract

The invention provides a speech signal processing device, a method, an electronic device, and a sound reinforcement system, and relates to the technical field of speech processing. The device includes: a first echo cancellation unit, configured to input a reference signal and a current speech signal collected by a speech collection device into an adaptive filter to obtain a target residual signal; and a second echo cancellation unit, configured to input the target residual signal and a far-end speech signal into a preset speech processing model to obtain a current-frame near-end speech processing signal. The invention applies echo removal to the target residual signal through the speech processing model. Because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter could not fully eliminate, the speech processing model can cancel the nonlinear echo components in the target residual signal, thereby improving the accuracy of speech signal processing.

Description

Speech signal processing device, method, electronic device, and sound reinforcement system

Technical Field

The present invention relates to the technical field of speech processing, and in particular to a speech signal processing device, a method, an electronic device, and a sound reinforcement system.

Background

Echo is the signal captured when the far-end speech signal played by the near-end loudspeaker is reflected by walls and other objects and propagates to the near-end microphone. The echo signal is mixed with the near-end speech and transmitted to the far-end loudspeaker, so that the far-end user hears their own voice through the far-end loudspeaker. To improve communication quality, echo cancellation is widely used in scenarios such as telephony and video conferencing, and the most common approach is echo cancellation with an adaptive filter.

In the related art, the near-end speech signal, serving as the input signal, is assumed to be uncorrelated with the far-end speech signal, serving as the reference signal. With the goal of minimizing the correlation between the output signal of the adaptive filter and the far-end speech signal already played by the loudspeaker, the adaptive filter coefficients are updated iteratively. The updated coefficients model the transmission path from the near-end loudspeaker to the near-end microphone, the echo signal is estimated from the far-end speech signal, and the estimated echo is subtracted from the speech signal captured by the near-end microphone to output an echo-free speech signal.

However, in the above related art the adaptive filter is a linear operation and cannot remove the nonlinear echo components introduced by the loudspeaker, the acoustic channel, the microphone, and so on, which reduces the accuracy of speech signal processing.

Summary of the Invention

In view of the problems in the prior art, embodiments of the present invention provide a speech signal processing device, a method, an electronic device, and a sound reinforcement system.

The present invention provides a speech signal processing device, including:

a first echo cancellation unit, configured to input a reference signal and a current speech signal collected by a speech collection device into an adaptive filter, and to process the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal, where the reference signal includes a received far-end speech signal;

a second echo cancellation unit, configured to input the target residual signal and the far-end speech signal into a preset speech processing model to obtain a current-frame near-end speech processing signal that is output by the speech processing model and used for transmission;

where the speech processing model is used to perform echo removal on the target residual signal, and is trained on far-end speech signal samples and residual signal samples.

Further, the reference signal also includes the previous-frame near-end speech processing signal output by the speech processing model.

Further, the speech processing model includes a first feature extraction network, a second feature extraction network, a third feature extraction network, and a multi-layer artificial neural network;

the second echo cancellation unit is specifically configured to:

input the target residual signal into the first feature extraction network, and map the target residual signal from the time domain to a transform domain through the first feature extraction network to obtain a first feature;

input the far-end speech signal into the second feature extraction network, and map the far-end speech signal from the time domain to the transform domain through the second feature extraction network to obtain a second feature;

input both the first feature and the second feature into the multi-layer artificial neural network, which, based on the first feature and the second feature, extracts from the first feature a mask of the current-frame near-end speech signal;

determine a third feature of the current-frame near-end speech signal in the transform domain based on the mask and the first feature;

input the third feature into the third feature extraction network, and map the third feature from the transform domain to the time domain through the third feature extraction network to obtain the current-frame near-end speech processing signal.
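
The five steps above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the trained model of the invention: each feature extraction network is reduced to one linear map, the multi-layer artificial neural network is reduced to one linear layer followed by a sigmoid, the mask is applied by element-wise multiplication, and all names (`process_frame`, `enc_res`, `enc_far`, `mask_w`, `mask_b`, `dec`) are hypothetical.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def process_frame(residual, far_end, enc_res, enc_far, mask_w, mask_b, dec):
    """One frame of the mask-based second stage (illustrative only)."""
    # Steps 1 and 2: map both inputs from the time domain to a transform domain.
    first = matvec(enc_res, residual)
    second = matvec(enc_far, far_end)
    # Step 3: estimate a mask in [0, 1] from the concatenated features.
    logits = matvec(mask_w, first + second)
    mask = [sigmoid(z + b) for z, b in zip(logits, mask_b)]
    # Step 4: the third feature is the masked first feature (assumed element-wise).
    third = [m * f for m, f in zip(mask, first)]
    # Step 5: map the third feature back to the time domain.
    return matvec(dec, third), mask


# Toy weights: identity encoders/decoder, and a mask layer that ignores its inputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0] * 4, [0.0] * 4]
frame = [0.2, -0.4]
far = [1.0, 0.5]
passed, open_mask = process_frame(frame, far, identity, identity, zeros, [50.0, 50.0], identity)
blocked, closed_mask = process_frame(frame, far, identity, identity, zeros, [-50.0, -50.0], identity)
```

With a mask close to 1 the residual frame passes through unchanged, and with a mask close to 0 it is suppressed; a trained network would produce intermediate values per feature bin.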

Further, the speech processing model is obtained as follows:

inputting the residual signal samples and the far-end speech signal samples into an initial network model to obtain near-end speech processing samples output by the initial network model;

determining a loss function based on the near-end speech processing samples and a desired signal, where the desired signal includes near-end residual signal samples;

optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached, to obtain the speech processing model.
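
The training procedure above can be illustrated with a deliberately tiny stand-in. The sketch below assumes a mean-squared-error loss and plain gradient descent, neither of which is specified in the text, and replaces the network with a single hypothetical gain `g` that subtracts a scaled copy of the far-end sample from the residual sample. It only shows the loop structure: forward pass, loss against the desired signal, parameter update, and a convergence check.

```python
import math


def mse(output, desired):
    """Mean squared error between two equal-length sequences."""
    return sum((o - d) ** 2 for o, d in zip(output, desired)) / len(output)


def train_gain(residual, far_end, desired, lr=0.5, tol=1e-12, max_steps=500):
    """Fit the toy model u(n) = e(n) - g * f(n) by gradient descent."""
    g = 0.0
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_steps):
        output = [e - g * f for e, f in zip(residual, far_end)]
        loss = mse(output, desired)
        if abs(previous_loss - loss) < tol:  # assumed convergence condition
            break
        # d(loss)/dg = mean(2 * (u - d) * (-f))
        grad = sum(-2.0 * (u - d) * f
                   for u, d, f in zip(output, desired, far_end)) / len(output)
        g -= lr * grad
        previous_loss = loss
    return g, loss


# Synthetic samples: the residual holds the desired signal plus leaked far-end signal.
far_end_sample = [math.sin(0.7 * n) for n in range(200)]
desired_sample = [0.5 * math.cos(0.31 * n) for n in range(200)]
residual_sample = [d + 0.3 * f for d, f in zip(desired_sample, far_end_sample)]

learned_gain, final_loss = train_gain(residual_sample, far_end_sample, desired_sample)
```

In this toy setup the learned gain approaches the leakage factor 0.3; a real model would have many parameters and would typically be optimized with an automatic-differentiation framework.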

Further, the device also includes:

a sample acquisition unit, configured to acquire a near-end noisy signal sample and an impulse response sample from the speech playback device to the speech collection device;

a sample delay unit, configured to delay the near-end noisy signal sample by a preset time to obtain a target near-end noisy signal sample;

an echo sample determination unit, configured to determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample;

an input signal sample determination unit, configured to determine an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample;

a near-end residual signal sample determination unit, configured to input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and to process the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample;

a residual signal sample determination unit, configured to determine the residual signal sample based on the near-end residual signal sample.
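
The first four units above describe how a training input is synthesized. A minimal sketch follows, assuming that the echo sample is obtained by convolving the delayed signal with the impulse response and that the input sample is the sum of the echo and the undelayed signal; the text only says these quantities are determined "based on" one another. The function names are hypothetical, and the subsequent adaptive filtering step is omitted.

```python
def delay(signal, num_samples):
    """Delay a signal by num_samples, padding with zeros and keeping its length."""
    return ([0.0] * num_samples + list(signal))[:len(signal)]


def convolve(signal, impulse_response):
    """Causal FIR filtering, truncated to the length of the input signal."""
    return [sum(h * signal[n - k]
                for k, h in enumerate(impulse_response) if n - k >= 0)
            for n in range(len(signal))]


def make_input_sample(near_noisy, impulse_response, delay_samples):
    """Return (target, echo, mixed) for one synthetic training example."""
    # Sample delay unit: the delayed copy is what the loudspeaker will play.
    target = delay(near_noisy, delay_samples)
    # Echo sample determination unit: playback shaped by the acoustic path.
    echo = convolve(target, impulse_response)
    # Input signal sample determination unit: microphone picks up speech plus echo.
    mixed = [s + e for s, e in zip(near_noisy, echo)]
    return target, echo, mixed


target, echo, mixed = make_input_sample([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.5, 0.25], 2)
```

For a unit impulse, a two-tap path, and a delay of two samples, the echo appears two samples later with the shape of the impulse response, and the mixed signal contains both the original impulse and its echo.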

Further, the residual signal sample determination unit is specifically configured to:

determine a far-end speech echo sample based on the far-end speech signal sample and the impulse response sample;

input the far-end speech echo sample and the far-end speech signal sample into the adaptive filter, and process the far-end speech echo sample based on the far-end speech signal sample through the adaptive filter to obtain a far-end residual signal sample;

determine the residual signal sample based on the near-end residual signal sample and the far-end residual signal sample.

Further, the echo sample determination unit is specifically configured to:

determine a reference near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample, delay the reference near-end speech echo sample by the preset time to obtain a delayed near-end speech echo sample, take the delayed near-end speech echo sample as the new target near-end noisy signal sample, and repeat the above steps until the number of delays reaches a preset number;

determine the near-end speech echo sample based on the reference near-end speech echo samples obtained in each repetition.
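
The repeated delay-and-echo procedure above models sound travelling around the loudspeaker-to-microphone loop several times. A minimal sketch follows, assuming that each reference echo is a convolution with the impulse response and that the final near-end speech echo sample is the sum of all reference echoes; the summation is an assumption, since the text only says the result is determined "based on" them. All function names are hypothetical.

```python
def delay(signal, num_samples):
    """Delay a signal by num_samples, padding with zeros and keeping its length."""
    return ([0.0] * num_samples + list(signal))[:len(signal)]


def convolve(signal, impulse_response):
    """Causal FIR filtering, truncated to the length of the input signal."""
    return [sum(h * signal[n - k]
                for k, h in enumerate(impulse_response) if n - k >= 0)
            for n in range(len(signal))]


def recursive_echo(target, impulse_response, delay_samples, rounds):
    """Accumulate the echoes produced by `rounds` trips around the loop."""
    total = [0.0] * len(target)
    current = list(target)
    for _ in range(rounds):
        # Reference echo for this trip through the acoustic path.
        reference_echo = convolve(current, impulse_response)
        total = [t + r for t, r in zip(total, reference_echo)]
        # The delayed echo becomes the signal played on the next trip.
        current = delay(reference_echo, delay_samples)
    return total


loop_echo = recursive_echo([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.5], 2, 3)
```

With a single-tap path of gain 0.5 and a delay of two samples, an impulse produces echoes of 0.5, 0.25, and 0.125 spaced two samples apart, which is the decaying feedback pattern that becomes howling when the loop gain is too high.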

Further, the first echo cancellation unit is specifically configured to:

process the current speech signal based on the reference signal through the adaptive filter to obtain an output signal;

update the current impulse response of the adaptive filter, with the goal of minimizing the correlation between the output signal and the reference signal, to obtain a target impulse response;

determine a target echo signal based on the target impulse response and the reference signal;

process the current speech signal based on the target echo signal to obtain the target residual signal.

The present invention provides a speech signal processing method, including:

inputting a reference signal and a current speech signal collected by a speech collection device into an adaptive filter, and processing the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal, where the reference signal includes a received far-end speech signal;

inputting the target residual signal and the far-end speech signal into a preset speech processing model to obtain a current-frame near-end speech processing signal that is output by the speech processing model and used for transmission;

where the speech processing model is used to perform echo removal on the target residual signal, and is trained on far-end speech signal samples and residual signal samples.

The present invention also provides a sound reinforcement system, including:

a speech collection device, configured to collect a current speech signal and input the current speech signal into a speech signal processing device;

the speech signal processing device, which is any one of the speech signal processing devices described above.

The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements any one of the speech signal processing methods described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the speech signal processing methods described above.

The present invention also provides a computer program product, including a computer program which, when executed by a processor, implements any one of the speech signal processing methods described above.

In the speech signal processing device, method, electronic device, and sound reinforcement system provided by the present invention, the far-end speech signal and the target residual signal output by the adaptive filter are input into a pre-trained speech processing model to obtain the current-frame near-end speech processing signal that is output by the model and used for transmission. The target residual signal output by the adaptive filter thus undergoes further echo removal in the speech processing model. Because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter could not fully eliminate, the speech processing model can cancel the nonlinear echo components in the target residual signal, thereby improving the accuracy of speech signal processing.

Brief Description of the Drawings

To describe the technical solutions of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is the first schematic flowchart of the speech signal processing method provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of a voice interaction scenario provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the principle of the adaptive filter provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the speech processing model provided by an embodiment of the present invention;

FIG. 5 is the second schematic flowchart of the speech signal processing method provided by an embodiment of the present invention;

FIG. 6 is the third schematic flowchart of the speech signal processing method provided by an embodiment of the present invention;

FIG. 7 is the fourth schematic flowchart of the speech signal processing method provided by an embodiment of the present invention;

FIG. 8 is the fifth schematic flowchart of the speech signal processing method provided by an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of the speech signal processing device provided by an embodiment of the present invention;

FIG. 10 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The speech signal processing method of the present invention is described below with reference to FIG. 1 to FIG. 8.

FIG. 1 is the first schematic flowchart of the speech signal processing method provided by an embodiment of the present invention. As shown in FIG. 1, the speech signal processing method includes the following steps:

Step 101: Input a reference signal and a current speech signal collected by a speech collection device into an adaptive filter, and process the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal.

The speech collection device includes a microphone, and the reference signal includes the received far-end speech signal. The current speech signal includes the near-end speech signal (the speech uttered by the near-end user), ambient noise, and a target signal. The target signal may include a far-end speech echo signal and a near-end speech echo signal: the far-end speech echo signal is the far-end speech signal played by the speech playback device and captured by the speech collection device, and the near-end speech echo signal is the near-end speech signal played by the speech playback device and captured by the speech collection device. The speech playback device includes a loudspeaker.

For example, FIG. 2 is a schematic diagram of a voice interaction scenario provided by an embodiment of the present invention. As shown in FIG. 2, user A and user B are in a conference call. User A's voice is collected by speech collection device A1 on user A's side and is transmitted over the network by electronic device A2 on user A's side to electronic device B1 on user B's side, and electronic device B1 plays user A's voice through speech playback device B2 on user B's side. User B is speaking at the same time, so when speech collection device B3 on user B's side collects the speech signal, it captures not only user B's voice but also user A's voice being played by speech playback device B2. If the speech signal collected by speech collection device B3 were transmitted to speech playback device A3 on user A's side and played without echo cancellation, user A would hear both user B's voice and user A's own voice from speech playback device A3. This phenomenon is echo. The far-end speech signal in step 101 above can be understood as user A's voice transmitted to user B's side.

Specifically, the current speech signal is processed based on the reference signal through the adaptive filter to obtain an output signal; the current impulse response of the adaptive filter is updated, with the goal of minimizing the correlation between the output signal and the reference signal, to obtain a target impulse response; a target echo signal is determined based on the target impulse response and the reference signal; and the current speech signal is processed based on the target echo signal to obtain the target residual signal.

For example, once the reference signal and the current speech signal collected by the speech collection device are obtained, the current speech signal is taken as the input signal, and both the input signal and the reference signal are fed into the adaptive filter. Based on the reference signal, the adaptive filter filters the far-end speech echo signal and the near-end speech echo signal in the current speech signal to obtain an output signal. An adaptive algorithm then computes the correlation between the output signal and the reference signal and, with the goal of minimizing that correlation, updates the current impulse response of the adaptive filter, thereby modeling the transmission path from the speech playback device to the speech collection device and yielding the target impulse response. The target impulse response is convolved with the reference signal to obtain the target echo signal, and the target echo signal is subtracted from the current speech signal to obtain the target residual signal. The target residual signal contains the current-frame near-end speech signal as well as other signals, such as nonlinear echo signals.
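
As an illustration of step 101, the sketch below uses a normalized least-mean-squares (NLMS) update, a common time-domain adaptive algorithm. The text does not name a specific algorithm, so NLMS is a swapped-in assumption, as are the function names, the tap count, and the step size. The filter convolves its current impulse response estimate with the reference signal to estimate the echo, subtracts the estimate from the microphone signal, and uses the residual to update the estimate.

```python
import random


def convolve(signal, impulse_response):
    """Causal FIR filtering, truncated to the length of the input signal."""
    return [sum(h * signal[n - k]
                for k, h in enumerate(impulse_response) if n - k >= 0)
            for n in range(len(signal))]


def nlms_echo_canceller(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Return (residual, w), where w is the estimated impulse response."""
    w = [0.0] * taps    # current impulse response estimate
    buf = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for y, r in zip(mic, ref):
        buf = [r] + buf[:-1]
        echo_estimate = sum(wk * bk for wk, bk in zip(w, buf))
        e = y - echo_estimate  # residual after subtracting the estimated echo
        power = sum(b * b for b in buf) + eps
        # Normalized update: drives the residual toward being uncorrelated
        # with the reference samples in the buffer.
        w = [wk + mu * e * bk / power for wk, bk in zip(w, buf)]
        residual.append(e)
    return residual, w


# Demo with a purely linear, hypothetical loudspeaker-to-microphone path.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_path = [0.6, -0.3, 0.1]
mic = convolve(ref, true_path)  # echo only: no near-end speech, no noise
residual, w = nlms_echo_canceller(mic, ref)
```

In this linear, noise-free demo the estimate converges to the true path and the residual vanishes. With a nonlinear loudspeaker or microphone, a linear filter of this kind leaves residual echo, which is the case the second stage of the method addresses.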

It should be noted that the adaptive filter may use any suitable structure, for example a single-filter or dual-filter adaptive filter, and may be a time-domain filter or a frequency-domain filter; the present invention does not limit this.

Step 102: Input the target residual signal and the far-end speech signal into a preset speech processing model to obtain a current-frame near-end speech processing signal that is output by the speech processing model and used for transmission.

The speech processing model is used to perform echo removal on the target residual signal, and is trained on far-end speech signal samples and residual signal samples.

For example, once the target residual signal is obtained, it is fed into the speech processing model as the model's input signal, and the far-end speech signal is fed into the model as its reference signal. Based on the far-end speech signal, the model further processes the signals in the target residual signal other than the current-frame near-end speech signal, and finally outputs the current-frame near-end speech processing signal, which is the speech signal transmitted to the peer end.

In the speech signal processing method provided by the present invention, the far-end speech signal and the target residual signal output by the adaptive filter are input into a pre-trained speech processing model to obtain the current-frame near-end speech processing signal that is output by the model and used for transmission. The target residual signal output by the adaptive filter thus undergoes further echo removal in the speech processing model. Because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter could not fully eliminate, the speech processing model can cancel the nonlinear echo components in the target residual signal, thereby improving the accuracy of speech signal processing. In addition, the reference signals of both the adaptive filter and the speech processing model do not need to be processed by a hardware analog-to-digital converter; they are soft references, which reduces hardware cost and circuit board routing complexity and makes deployment of the electronic device more flexible.

Optionally, the reference signal further includes the previous-frame near-end speech processing signal output by the speech processing model.

For example, in the voice interaction scenario shown in FIG. 2, the previous-frame near-end speech processing signal can be understood as the speech signal, to be transmitted to user A's side, that the speech processing model produces by processing the previous frame of the speech signal collected by speech collection device B3 on user B's side.

For example, FIG. 3 is a schematic diagram of the principle of the adaptive filter provided by an embodiment of the present invention. As shown in FIG. 3, once the far-end speech signal f(n) and the previous-frame near-end speech processing signal u(n-1) are obtained, they can be superimposed to form the reference signal r1(n). The reference signal r1(n) and the current speech signal y(n) collected by the speech collection device are then input into the adaptive filter, where y(n) = x1(n) + s(n) + v(n) + x2(n), x1(n) denotes the far-end speech echo signal, s(n) denotes the near-end speech signal, v(n) denotes the ambient noise signal, and x2(n) denotes the near-end speech echo signal. The adaptive filter processes y(n) based on r1(n) to obtain the target residual signal e(n), and e(n) and the far-end speech signal f(n) are input into the speech processing model to obtain the current-frame near-end speech processing signal u(n) output by the model. After the near-end speech signal s(n) is amplified by the local speech playback device, it is captured by the speech collection device, amplified again by the local speech playback device, and captured again; this loop produces howling. Adding the previous-frame near-end speech processing signal u(n-1) to the reference signal r1(n) is intended to break this loop and thereby suppress howling.
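
The signal flow described for FIG. 3 can be collected into one set of relations. The first two lines restate the text; the third and fourth follow the convolution and subtraction described for step 101. The symbols \hat{h} (target impulse response), \hat{x}(n) (target echo signal), and \mathcal{M} (speech processing model) are notation introduced here for illustration and do not appear in the original; * denotes convolution.

```latex
\begin{aligned}
r_1(n)     &= f(n) + u(n-1) \\
y(n)       &= x_1(n) + s(n) + v(n) + x_2(n) \\
\hat{x}(n) &= (\hat{h} * r_1)(n) \\
e(n)       &= y(n) - \hat{x}(n) \\
u(n)       &= \mathcal{M}\bigl(e(n),\, f(n)\bigr)
\end{aligned}
```

Because r_1(n) contains both f(n) and u(n-1), a single estimate \hat{h} of the loudspeaker-to-microphone path is used to remove both the far-end echo x_1(n) and the near-end echo x_2(n).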

In addition, when the reference signal includes the previous-frame near-end speech processing signal and the far-end speech signal, and the input signal includes the current speech signal, the adaptive filter weights the update of the current impulse response as follows. When the far-end speech signal, which satisfies the assumption that the input signal is uncorrelated with the reference signal, accounts for a high share of the reference signal power, a higher weight is applied, which preserves modeling accuracy. When the howling suppression component, which is strongly correlated with the input signal, accounts for a high share of the power, a lower weight is applied. This ensures that, even without a far-end speech signal, the adaptive filter can still converge normally on the near-end speech signal alone and remain in a stable working state.
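
One way to picture this weighting is as a factor that scales the adaptive filter update according to the share of reference power contributed by the far-end signal. The text gives no formula, so the linear interpolation below, the bounds `w_low` and `w_high`, and the function names are all hypothetical choices made only for illustration.

```python
def frame_power(frame):
    """Mean power of one frame; zero for an empty frame."""
    return sum(v * v for v in frame) / len(frame) if frame else 0.0


def update_weight(far_end, prev_output, w_low=0.1, w_high=1.0, eps=1e-12):
    """Weight for the impulse response update, based on reference power shares."""
    p_far = frame_power(far_end)       # part uncorrelated with near-end speech
    p_prev = frame_power(prev_output)  # howling suppression part, u(n-1)
    far_share = p_far / (p_far + p_prev + eps)
    # High far-end share gives a weight near w_high; a reference dominated by
    # the previous output gives a weight near w_low.
    return w_low + (w_high - w_low) * far_share


weight_far_only = update_weight([1.0, -1.0], [0.0, 0.0])
weight_prev_only = update_weight([0.0, 0.0], [1.0, -1.0])
weight_balanced = update_weight([1.0, -1.0], [1.0, -1.0])
```

The lower bound is kept above zero so that the filter keeps adapting, slowly, when only the previous output is present in the reference, which matches the stated goal of normal convergence without a far-end signal.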

本发明实施例提供的语音信号处理方法，将上一帧近端语音处理信号加入参考信号，通过自适应滤波器基于包括上一帧近端语音处理信号和远端语音信号的参考信号对当前语音信号进行处理，实现了远端回声的处理和啸叫的抑制。即本发明的回声处理和啸叫抑制是对相同的语音播放装置到语音采集装置的传输路径进行建模，在同一自适应滤波器中实现回声处理和啸叫抑制。In the speech signal processing method provided by the embodiment of the present invention, the last frame of the near-end speech processing signal is added to the reference signal, and the current speech is processed by an adaptive filter based on the reference signal including the last frame of near-end speech processing signal and the far-end speech signal. The signal is processed to realize the processing of the far-end echo and the suppression of howling. That is, the echo processing and howling suppression in the present invention model the transmission path from the same voice playing device to the voice collecting device, and realize echo processing and howling suppression in the same adaptive filter.

Optionally, FIG. 4 is a schematic structural diagram of the speech processing model provided by an embodiment of the present invention. As shown in FIG. 4, the speech processing model includes a first feature extraction network 401, a second feature extraction network 402, a third feature extraction network 403 and a multi-layer artificial neural network 404. The input of the first feature extraction network 401 serves as the input of the speech processing model; the output of the first feature extraction network 401 is connected to the input of the second feature extraction network 402; the output of the second feature extraction network 402 is connected to the input of the multi-layer artificial neural network 404; the output of the multi-layer artificial neural network 404 is connected to the input of the third feature extraction network 403; and the output of the third feature extraction network 403 serves as the output of the speech processing model. Each of the first feature extraction network 401, the second feature extraction network 402 and the third feature extraction network 403 may consist of a single fully connected layer, a one-dimensional convolutional layer, or multiple fully connected layers.

FIG. 5 is the second schematic flowchart of the speech signal processing method provided by an embodiment of the present invention. As shown in FIG. 5, step 102 above can be implemented through the following steps:

Step 1021: input the target residual signal into the first feature extraction network, and map the target residual signal from the time domain to the transform domain through the first feature extraction network to obtain a first feature.

The first feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.

Exemplarily, when the adaptive filter is a time-domain filter, the target residual signal is a time-domain signal. When the adaptive filter is a frequency-domain filter, the target residual signal is a frequency-domain signal and must first be converted into a time-domain target residual signal. The time-domain target residual signal is then input into the first feature extraction network, which maps it from the time domain to the transform domain learned by the speech processing model, yielding the first feature corresponding to the target residual signal.

It should be noted that, once the time-domain target residual signal is obtained, it can also be segmented according to a preset length and a preset overlap to obtain multiple time-domain residual signal segments, and one residual signal segment is input into the first feature extraction network at a time. The preset length and the preset overlap can be set according to actual needs, for example a preset length of 256 sampling points and a preset overlap of 50%; the present invention does not limit this.
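The segmentation described above can be sketched as follows, using the example values of 256 sampling points and 50% overlap. The function name and the handling of the signal tail (trailing samples that do not fill a whole segment are dropped, and a signal shorter than one segment is zero-padded) are assumptions; the description does not say how the tail is treated.

```python
def segment(signal, length=256, overlap=0.5):
    """Split a time-domain signal into fixed-length segments with the given overlap."""
    hop = int(length * (1 - overlap))   # 128 samples for length 256 and 50% overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than 1")
    segments = []
    for start in range(0, max(len(signal) - length, 0) + 1, hop):
        seg = list(signal[start:start + length])
        seg += [0.0] * (length - len(seg))   # zero-pad only if the signal is too short
        segments.append(seg)
    return segments

residual = [float(n) for n in range(1024)]
segments = segment(residual)
```

Each element of `segments` would then be fed to the first feature extraction network in turn; the same routine applies to the far-end speech signal before the second feature extraction network.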

Step 1022: input the far-end speech signal into the second feature extraction network, and map the far-end speech signal from the time domain to the transform domain through the second feature extraction network to obtain a second feature.

The second feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.

Exemplarily, when the adaptive filter is a time-domain filter, the far-end speech signal is a time-domain signal. When the adaptive filter is a frequency-domain filter, the far-end speech signal is a frequency-domain signal and must first be converted into a time-domain far-end speech signal. The far-end speech signal is then input into the second feature extraction network, which maps it from the time domain to the transform domain learned by the speech processing model, yielding the second feature corresponding to the far-end speech signal.

It should be noted that, once the time-domain far-end speech signal is obtained, it can also be segmented according to the preset length and the preset overlap to obtain multiple time-domain far-end speech signal segments, and one far-end speech signal segment is input into the second feature extraction network at a time.

Step 1023: input both the first feature and the second feature into the multi-layer artificial neural network, which, based on the first feature and the second feature, extracts from the first feature a mask of the current-frame near-end speech signal.

The multi-layer artificial neural network may be a recurrent neural network based on long short-term memory (LSTM) networks and gated recurrent units (GRU), or a multi-layer convolutional neural network.

Exemplarily, once the first feature and the second feature are obtained, they are input into the multi-layer artificial neural network as input features, and the network extracts from the first feature the mask of the current-frame near-end speech signal based on both features. The mask is a set of coefficients; each coefficient is the weight by which the corresponding point of each frame of input data in the transform domain must be multiplied to obtain the near-end speech.

It should be noted that, once the first feature and the second feature are obtained, each can also be normalized, and the normalized first feature and the normalized second feature are then input into the multi-layer artificial neural network. This further improves the accuracy with which the network extracts the mask of the current-frame near-end speech signal, and hence the accuracy of the speech processing model.

Step 1024: determine a third feature of the current-frame near-end speech signal in the transform domain based on the mask and the first feature.

Exemplarily, once the mask of the current-frame near-end speech signal is obtained, the mask is multiplied by the first feature to obtain the third feature of the current-frame near-end speech signal in the transform domain.
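Step 1024 amounts to a point-by-point product, sketched below. The function name and the example values are hypothetical; in the described method the mask would come from the multi-layer artificial neural network rather than being written by hand.

```python
def apply_mask(mask, first_feature):
    """Third feature = mask * first feature, point by point in the transform domain."""
    if len(mask) != len(first_feature):
        raise ValueError("mask and feature must have the same length")
    return [m * x for m, x in zip(mask, first_feature)]

# Hypothetical values: a coefficient of 1.0 keeps a point, 0.0 removes it.
first_feature = [2.0, 4.0, 6.0]
mask = [1.0, 0.5, 0.0]
third_feature = apply_mask(mask, first_feature)
```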

Step 1025: input the third feature into the third feature extraction network, and map the third feature from the transform domain to the time domain through the third feature extraction network to obtain the current-frame near-end speech processing signal.

The third feature extraction network may be a fully connected layer or a one-dimensional convolutional layer.

Exemplarily, once the third feature in the transform domain is obtained, it is input into the third feature extraction network, which maps it from the transform domain to the time domain to obtain the current-frame near-end speech processing signal.

It should be noted that when one residual signal segment is input into the first feature extraction network at a time and one far-end speech signal segment is input into the second feature extraction network at a time, the output obtained is likewise one near-end speech processing signal segment. Once all near-end speech processing signal segments are obtained, they are combined in time order according to the preset overlap to obtain the current-frame near-end speech processing signal.
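Recombining the output segments can be sketched as below. The description only says the segments are combined by time order and preset overlap; averaging the samples where segments overlap is an assumption made here (a windowed overlap-add would be another common choice), and the function name is hypothetical.

```python
def overlap_combine(segments, overlap=0.5):
    """Place segments in time order and average samples where they overlap."""
    if not segments:
        return []
    length = len(segments[0])
    hop = int(length * (1 - overlap))
    total = hop * (len(segments) - 1) + length
    acc = [0.0] * total     # running sum of contributions per output sample
    count = [0] * total     # number of segments covering each output sample
    for i, seg in enumerate(segments):
        for j, v in enumerate(seg):
            acc[i * hop + j] += v
            count[i * hop + j] += 1
    return [a / c for a, c in zip(acc, count)]

pieces = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
combined = overlap_combine(pieces)
```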

In the speech signal processing method provided by the embodiment of the present invention, the first feature extraction network extracts features from the target residual signal to obtain the first feature, the second feature extraction network extracts features from the far-end speech signal to obtain the second feature, and the multi-layer artificial neural network and the third feature extraction network then produce the final current-frame near-end speech processing signal. This eliminates the signals in the target residual signal other than the current-frame near-end speech processing signal, and uses the nonlinear processing capability of the speech processing model to handle the nonlinear components of the echo. In addition, because the speech processing model has memory, it can retain the reference signals of a preceding period and process heavily reverberant signals based on multiple reference signals, which further improves the accuracy of speech signal processing.

Optionally, FIG. 6 is the third schematic flowchart of the speech signal processing method provided by an embodiment of the present invention. As shown in FIG. 6, the speech processing model is trained through the following steps:

Step 601: input the residual signal sample and the far-end speech signal sample into an initial network model to obtain a near-end speech processing sample output by the initial network model.

Exemplarily, before the speech processing model is trained, multiple far-end speech signal samples are first acquired to form a far-end speech signal dataset, the residual signal sample corresponding to each far-end speech signal sample is acquired to form a residual signal sample dataset, and an initial network model is constructed. A far-end speech signal sample is randomly selected from the far-end speech signal dataset and input into the initial network model as the reference sample, and the residual signal sample corresponding to that far-end speech signal sample is selected from the residual signal sample dataset and input into the initial network model as the input sample. Based on the far-end speech signal sample, the initial network model processes the signals in the residual signal sample other than the near-end speech processing sample, and outputs the near-end speech processing sample.

Step 602: determine a loss function based on the near-end speech processing sample and a desired signal.

The desired signal includes the near-end residual signal sample.

Specifically, a first loss sub-function and a second loss sub-function can be constructed based on the near-end speech processing sample and the desired signal, and the weighted combination of the two is taken as the loss function. The first loss sub-function mainly ensures the accuracy of model convergence, and the second loss sub-function mainly ensures the stability of model convergence.

The first loss sub-function is expressed by the following formula (1):

where Loss_SNR denotes the first loss sub-function, u(n) denotes the near-end speech processing sample, and gt(n) denotes the desired signal.

The second loss sub-function is expressed by the following formula (2):

where Loss_SmoothL1 denotes the second loss sub-function and x = u(n) - gt(n).

It should be noted that the loss function may also be a loss function related to Perceptual Evaluation of Speech Quality (PESQ) or Short-Time Objective Intelligibility (STOI), used to improve perceived sound quality; the present invention does not limit this.
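Formulas (1) and (2) are not reproduced in this text, so the sketch below does not claim to match them. It assumes two widely used forms that fit the names Loss_SNR and Loss_SmoothL1: a negative signal-to-noise ratio in decibels and the standard smooth L1 loss with threshold `beta`. The weights `w_snr` and `w_l1`, the threshold, and the function names are all assumptions.

```python
import math

def loss_snr(u, gt, eps=1e-8):
    """Negative SNR in dB between output u(n) and desired signal gt(n) (assumed form)."""
    signal = sum(g * g for g in gt)
    noise = sum((g - x) ** 2 for g, x in zip(gt, u))
    return -10.0 * math.log10((signal + eps) / (noise + eps))

def loss_smooth_l1(u, gt, beta=1.0):
    """Standard smooth L1 on x = u(n) - gt(n), averaged over samples (assumed form)."""
    total = 0.0
    for x_u, g in zip(u, gt):
        x = abs(x_u - g)
        total += 0.5 * x * x / beta if x < beta else x - 0.5 * beta
    return total / len(u)

def total_loss(u, gt, w_snr=1.0, w_l1=1.0):
    """Weighted combination of the two sub-functions; the weights are hypothetical."""
    return w_snr * loss_snr(u, gt) + w_l1 * loss_smooth_l1(u, gt)

smooth = loss_smooth_l1([0.5, 3.0], [0.0, 1.0])
snr_zero_db = loss_snr([1.0, 1.0], [1.0, 0.0])
```

Under these assumed forms, the SNR term rewards accurate reconstruction on a logarithmic scale, while smooth L1 is quadratic for small errors and linear for large ones, which limits the influence of outliers on the gradient.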

Step 603: optimize the model parameters of the initial network model based on the loss function until a convergence condition is reached, to obtain the speech processing model.

Exemplarily, once the loss function is obtained, the model parameters of the initial network model are optimized iteratively based on it. When the number of iterations reaches a preset number, the convergence condition is deemed to be met and the speech processing model is obtained.

In the speech signal processing method provided by the embodiment of the present invention, the initial network model is trained on the residual signal samples and the far-end speech signal samples, and its model parameters are optimized based on the loss function to obtain the trained speech processing model. The trained model can subsequently be used to further process the target residual signal, improving the accuracy of speech signal processing.

Optionally, FIG. 7 is the fourth schematic flowchart of the speech signal processing method provided by an embodiment of the present invention. As shown in FIG. 7, before step 601 above, the speech signal processing method further includes the following steps:

Step 604: acquire a near-end noisy signal sample and an impulse response sample from the speech playback device to the speech collection device.

The near-end noisy signal sample is the superposition of a near-end speech signal sample and an ambient noise sample. That is, near-end speech signal samples and ambient noise samples are collected in the actual application scenario and superimposed to form near-end noisy signal samples, and multiple near-end noisy signal samples form a near-end noisy signal dataset. In addition, the real impulse response from the speech playback device to the speech collection device is collected in the actual application scenario, and is analyzed and synthesized to obtain impulse response samples; multiple impulse response samples form an impulse response dataset.

Step 605: delay the near-end noisy signal sample by a preset time to obtain a target near-end noisy signal sample.

The value range of the preset time is slightly larger than the fluctuation range of the algorithm's processing time, where the processing time is the time from the speech collection device collecting the current speech signal to the speech processing model outputting the near-end speech processing signal. Making the value range of the preset time slightly larger than this fluctuation range brings the processing time seen by the trained speech processing model closer to the algorithm's actual processing time.

Exemplarily, a near-end noisy signal sample randomly selected from the near-end noisy signal dataset is delayed by the preset time, and the resulting delayed sample is taken as the target near-end noisy signal sample.

Step 606: determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample.

Exemplarily, the target near-end noisy signal sample is convolved with an impulse response sample randomly selected from the impulse response dataset to obtain the near-end speech echo sample after local sound reinforcement. Here, local sound reinforcement means that the near-end speech signal sample collected by the speech collection device is played through the local speech playback device, and the near-end speech echo sample is the near-end speech signal sample as collected again by the speech collection device after being played through the local speech playback device.

Step 607: determine an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample.

Exemplarily, the near-end speech echo sample and the near-end noisy signal sample are superimposed to obtain the input signal sample.
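Steps 605 to 607 can be sketched as a delay, a convolution and a superposition. The function names, the delay expressed in samples, truncating the convolution to the input length, and the toy impulse response are assumptions for illustration only.

```python
def delay(signal, samples):
    """Delay a signal by a whole number of samples, keeping its length."""
    samples = min(samples, len(signal))
    return [0.0] * samples + list(signal[:len(signal) - samples])

def convolve(signal, impulse_response):
    """Direct-form convolution, truncated to the length of the input signal."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out.append(acc)
    return out

def make_input_sample(near_noisy, impulse_response, delay_samples):
    target = delay(near_noisy, delay_samples)             # step 605
    echo = convolve(target, impulse_response)             # step 606
    mixed = [a + b for a, b in zip(echo, near_noisy)]     # step 607
    return target, echo, mixed

near_noisy = [1.0, 0.0, 0.0, 0.0]        # toy near-end speech plus noise
target, echo, mixed = make_input_sample(near_noisy, [0.5, 0.25], delay_samples=1)
```

In a training pipeline, `delay_samples` would be drawn from a range slightly wider than the fluctuation of the algorithm's processing time, as described for step 605.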

Step 608: input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and process the input signal sample through the adaptive filter based on the target near-end noisy signal sample to obtain a near-end residual signal sample.

Exemplarily, the target near-end noisy signal sample is used as the reference sample and input into the adaptive filter together with the input signal sample. Based on the target near-end noisy signal sample, the adaptive filter processes the near-end speech echo sample contained in the input signal sample to obtain the near-end residual signal sample.

Step 609: determine the residual signal sample based on the near-end residual signal sample.

Exemplarily, once the near-end residual signal sample is obtained, it is used as the residual signal sample input into the initial network model.

In the speech signal processing method provided by the embodiment of the present invention, the adaptive filter processes the near-end speech echo sample contained in the input signal sample to obtain the near-end residual signal sample, so that the residual signal sample input into the initial network model is more accurate.

Optionally, FIG. 8 is the fifth schematic flowchart of the speech signal processing method provided by an embodiment of the present invention. As shown in FIG. 8, step 609 above can be implemented through the following steps:

Step 6091: determine a far-end speech echo sample based on the far-end speech signal sample and the impulse response sample.

The far-end speech echo sample is the far-end speech signal sample as collected by the speech collection device after being played through the local speech playback device.

Exemplarily, the far-end speech signal sample is convolved with the impulse response sample to obtain the far-end speech echo sample.

Step 6092: input the far-end speech echo sample and the far-end speech signal sample into the adaptive filter, and process the far-end speech echo sample through the adaptive filter based on the far-end speech signal sample to obtain a far-end residual signal sample.

Exemplarily, the far-end speech echo sample is input into the adaptive filter as the input signal, and the far-end speech signal sample is input into the adaptive filter as the reference signal. Based on the far-end speech signal sample, the adaptive filter processes the far-end speech echo sample to obtain the far-end residual signal sample.

Step 6093: determine the residual signal sample based on the near-end residual signal sample and the far-end residual signal sample.

Exemplarily, the near-end residual signal sample and the far-end residual signal sample are superimposed to form the residual signal sample input into the initial network model.

In the speech signal processing method provided by the embodiment of the present invention, the adaptive filter processes the near-end speech echo sample contained in the input signal sample to obtain the near-end residual signal sample, and processes the far-end speech echo sample to obtain the far-end residual signal sample. Superimposing the two as the residual signal sample further improves the accuracy of the residual signal sample input into the initial network model.

Optionally, step 606 above can be implemented as follows:

The preset number of times can be chosen according to actual needs; for example, the preset number of times is 3.

Exemplarily, to cover the case where the near-end speech signal is incompletely suppressed and is therefore collected cyclically, the reference near-end speech echo sample is processed iteratively a preset number of times. Assume the preset number of times is 3 and the reference near-end speech echo sample is denoted near_noisy_echo0(n). Then near_noisy_echo0(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noisy_echo1(n); near_noisy_echo1(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noisy_echo2(n); and near_noisy_echo2(n) is delayed by the preset time and convolved with the impulse response sample to obtain near_noisy_echo3(n). The number of delays has now reached the preset number, so near_noisy_echo0(n) + near_noisy_echo1(n) + near_noisy_echo2(n) + near_noisy_echo3(n) is taken as the near-end speech echo sample.
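The iteration can be sketched as a loop that repeatedly delays and convolves the most recent echo and accumulates the results. The helper names, the delay in samples, truncation to the input length and the single-tap impulse response are assumptions chosen to keep the arithmetic easy to follow.

```python
def delay(signal, samples):
    """Delay a signal by a whole number of samples, keeping its length."""
    samples = min(samples, len(signal))
    return [0.0] * samples + list(signal[:len(signal) - samples])

def convolve(signal, impulse_response):
    """Direct-form convolution, truncated to the length of the input signal."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out.append(acc)
    return out

def iterated_echo(reference_echo, impulse_response, delay_samples, iterations=3):
    """Sum near_noisy_echo0 ... near_noisy_echoN, each derived from the previous one."""
    total = list(reference_echo)      # near_noisy_echo0(n)
    current = list(reference_echo)
    for _ in range(iterations):
        current = convolve(delay(current, delay_samples), impulse_response)
        total = [t + c for t, c in zip(total, current)]
    return total

echo0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
near_echo = iterated_echo(echo0, [0.5], delay_samples=1, iterations=3)
```

With a playback-to-microphone gain below one, as in this toy case, each pass contributes a weaker, later copy, so a small preset number of iterations already captures most of the recirculated energy.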

In the speech signal processing method provided by the embodiment of the present invention, the reference near-end speech echo sample is processed iteratively a preset number of times to cover the case where signals other than the near-end speech signal are incompletely suppressed and are collected cyclically by the speech collection device, which improves the accuracy of the near-end speech echo sample.

The speech signal processing device provided by the present invention is described below. The speech signal processing device described below and the speech signal processing method described above may be referred to in correspondence with each other.

FIG. 9 is a schematic structural diagram of the speech signal processing device provided by an embodiment of the present invention. As shown in FIG. 9, the speech signal processing device 900 includes a first echo cancellation unit 901 and a second echo cancellation unit 902, where:

the first echo cancellation unit 901 is configured to input a reference signal and the current speech signal collected by the speech collection device into the adaptive filter, and to process the current speech signal through the adaptive filter based on the reference signal to obtain a target residual signal, the reference signal including the received far-end speech signal;

the second echo cancellation unit 902 is configured to input the target residual signal and the far-end speech signal into a preset speech processing model to obtain the current-frame near-end speech processing signal, output by the speech processing model, for transmission.

The speech signal processing device provided by the present invention inputs the far-end speech signal and the target residual signal output by the adaptive filter into the pre-trained speech processing model, and obtains the current-frame near-end speech processing signal, output by the speech processing model, for transmission. The target residual signal output by the adaptive filter thus undergoes further echo removal in the speech processing model. Because the speech processing model is trained on residual signal samples and far-end speech signal samples, and both the residual signal samples and the target residual signal contain signals, such as nonlinear echo, that the adaptive filter could not completely eliminate, the speech processing model can eliminate the nonlinear echo components in the target residual signal, thereby improving the accuracy of speech signal processing.

Based on any of the above embodiments, the reference signal further includes the previous-frame near-end speech processing signal output by the speech processing model.

Based on any of the above embodiments, the speech processing model includes a first feature extraction network, a second feature extraction network, a third feature extraction network and a multi-layer artificial neural network; the second echo cancellation unit 902 is specifically configured to:

input the target residual signal into the first feature extraction network, and map the target residual signal from the time domain to the transform domain through the first feature extraction network to obtain a first feature;

input the far-end speech signal into the second feature extraction network, and map the far-end speech signal from the time domain to the transform domain through the second feature extraction network to obtain a second feature;

Based on any of the above embodiments, the speech processing model is obtained as follows:

Based on any of the above embodiments, the speech signal processing device 900 further includes:

Based on any of the above embodiments, the residual signal sample determining unit is specifically configured to:

Based on any of the above embodiments, the echo sample determining unit is specifically configured to:

Based on any of the above embodiments, the first echo cancellation unit 901 is specifically configured to:

本发明实施例提供一种扩音系统，包括语音采集装置和语音信号处理装置；其中：An embodiment of the present invention provides a sound amplification system, including a voice collection device and a voice signal processing device; wherein:

所述语音信号处理装置，采用上述任一实施例所述的语音信号处理装置。The speech signal processing device adopts the speech signal processing device described in any one of the above-mentioned embodiments.

进一步地，扩音系统还包括语音播放装置，用于播放当前语音信号和/或远端语音信号。Further, the public address system further includes a voice playing device, which is used to play the current voice signal and/or the remote voice signal.

FIG. 10 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present invention. As shown in FIG. 10, the electronic device 1000 may include a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communications interface 1020, and the memory 1030 communicate with one another through the communication bus 1040. The processor 1010 can call logic instructions in the memory 1030 to execute the speech signal processing method, which includes: inputting a reference signal and a current speech signal collected by the voice collection device into an adaptive filter, and processing the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal, where the reference signal includes a received far-end speech signal;

inputting the target residual signal and the far-end speech signal into a preset speech processing model to obtain a current-frame near-end speech processing signal that is output by the speech processing model and used for transmission;

In addition, when the logic instructions in the memory 1030 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the speech signal processing method provided by the methods above, which includes: inputting a reference signal and a current speech signal collected by the voice collection device into an adaptive filter, and processing the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal, where the reference signal includes a received far-end speech signal;

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the speech signal processing method provided by the methods above, which includes: inputting a reference signal and a current speech signal collected by the voice collection device into an adaptive filter, and processing the current speech signal based on the reference signal through the adaptive filter to obtain a target residual signal, where the reference signal includes a received far-end speech signal;

The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the implementations, those skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech signal processing apparatus, comprising:

a first echo cancellation unit, configured to input a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and to process the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;

a second echo cancellation unit, configured to input the target residual signal and the far-end voice signal into a preset speech processing model to obtain a current frame near-end speech processing signal which is output by the speech processing model and used for transmission;

wherein the speech processing model is used for performing echo removal processing on the target residual signal, and the speech processing model is trained based on far-end voice signal samples and residual signal samples.

2. The speech signal processing apparatus of claim 1, wherein the reference signal further comprises a last frame of near-end speech processing signal output by the speech processing model.
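The data flow of claims 1 and 2 can be sketched as a per-frame loop. This is only an illustration: `run_pipeline`, `toy_filter`, and `toy_model` are hypothetical names, the two stages are trivial stand-ins, and summing the far-end frame with the previous output frame to form a single reference is an assumption (the claims do not say how the two are combined).

```python
def run_pipeline(mic_frames, far_frames, adaptive_filter, model):
    """Two-stage echo removal, frame by frame.

    The reference for each frame is the far-end frame plus the previous
    output frame (assumed combination; claim 2 only says the reference
    'further comprises' the last output frame).
    """
    if not mic_frames:
        return []
    prev_out = [0.0] * len(mic_frames[0])
    outputs = []
    for mic, far in zip(mic_frames, far_frames):
        reference = [f + p for f, p in zip(far, prev_out)]
        residual = adaptive_filter(mic, reference)   # first stage
        prev_out = model(residual, far)              # second stage
        outputs.append(prev_out)
    return outputs

# Trivial stand-ins so the loop can be exercised; not the patented stages.
def toy_filter(mic, reference):
    return [m - 0.5 * r for m, r in zip(mic, reference)]

def toy_model(residual, far):
    return list(residual)

mic_frames = [[1.0, 1.0], [1.0, 1.0]]
far_frames = [[0.0, 0.0], [0.0, 0.0]]
outputs = run_pipeline(mic_frames, far_frames, toy_filter, toy_model)
```

In the second frame the previous output re-enters through the reference, which is the feedback path a sound amplification setup has to account for.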

3. The speech signal processing apparatus of claim 1, wherein the speech processing model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, and a multi-layer artificial neural network;

the second echo cancellation unit is specifically configured to:

inputting the target residual signal into the first feature extraction network, and mapping the target residual signal from a time domain to a transform domain through the first feature extraction network to obtain a first feature;

inputting the far-end voice signal into the second feature extraction network, and mapping the far-end voice signal from a time domain to a transform domain through the second feature extraction network to obtain a second feature;

inputting the first feature and the second feature into the multilayer artificial neural network, and extracting a mask of a near-end speech signal of the current frame from the first feature through the multilayer artificial neural network based on the first feature and the second feature;

determining a third feature of the current frame near-end speech signal in a transform domain based on the mask and the first feature;

inputting the third feature into the third feature extraction network, and mapping the third feature from a transform domain to a time domain through the third feature extraction network to obtain the current frame near-end speech processing signal.
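The steps of claim 3 can be sketched as an encode, mask, decode chain. The sketch below makes several assumptions that are not in the claims: a fixed orthonormal DCT stands in for the learned first and second feature extraction networks, its transpose stands in for the third, the multi-layer network is a two-layer perceptron with random untrained weights, and the third feature is the element-wise product of the mask and the first feature. All function names are hypothetical.

```python
import math
import random

def dct_matrix(n):
    """Orthonormal DCT-II matrix; a fixed stand-in for a learned encoder."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mlp_mask(f1, f2, w1, w2):
    """Concatenated features -> tanh hidden layer -> sigmoid mask in (0, 1)."""
    hidden = [math.tanh(h) for h in matvec(w1, f1 + f2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]

def process_frame(residual, far_end, enc1, enc2, dec, w1, w2):
    f1 = matvec(enc1, residual)              # first feature (transform domain)
    f2 = matvec(enc2, far_end)               # second feature (transform domain)
    mask = mlp_mask(f1, f2, w1, w2)          # mask of the near-end speech
    f3 = [m * a for m, a in zip(mask, f1)]   # third feature (assumed product)
    return matvec(dec, f3), mask             # back to the time domain

N, H = 16, 8                                 # frame length, hidden width
random.seed(0)
enc = dct_matrix(N)
dec = transpose(enc)
w1 = [[random.gauss(0, 0.1) for _ in range(2 * N)] for _ in range(H)]
w2 = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(N)]
residual = [math.sin(0.3 * i) for i in range(N)]
far_end = [math.cos(0.2 * i) for i in range(N)]
out, mask = process_frame(residual, far_end, enc, enc, dec, w1, w2)
```

Because the weights are untrained, the output here carries no echo suppression; the sketch only shows how the three networks and the mask fit together.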

4. The speech signal processing apparatus of claim 1, wherein the speech processing model is obtained based on:

inputting the residual signal sample and the far-end voice signal sample into an initial network model to obtain a near-end voice processing sample output by the initial network model;

determining a loss function based on the near-end speech processing samples and a desired signal; the desired signal comprises near-end residual signal samples;

and optimizing the model parameters of the initial network model based on the loss function until a convergence condition is reached to obtain the voice processing model.
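The training procedure of claim 4 can be illustrated with a deliberately small stand-in. Assumptions: the "initial network model" is replaced by a two-parameter linear model, the loss is mean squared error against the desired signal, the optimizer is plain gradient descent, and the convergence condition is a loss threshold. None of these choices comes from the claims, and the data is synthetic.

```python
import random

def forward(params, residual, far_end):
    """Toy model: output = a * residual + b * far_end."""
    a, b = params
    return [a * r + b * f for r, f in zip(residual, far_end)]

def mse(pred, desired):
    return sum((p - d) ** 2 for p, d in zip(pred, desired)) / len(desired)

def train(residual, far_end, desired, lr=0.5, tol=1e-12, max_epochs=1000):
    """Gradient descent until the loss drops below tol or epochs run out."""
    a, b = 0.0, 0.0
    n = len(desired)
    loss = float("inf")
    for _ in range(max_epochs):
        pred = forward((a, b), residual, far_end)
        loss = mse(pred, desired)            # loss from output vs desired signal
        if loss < tol:                       # assumed convergence condition
            break
        err = [p - d for p, d in zip(pred, desired)]
        a -= lr * 2.0 * sum(e * r for e, r in zip(err, residual)) / n
        b -= lr * 2.0 * sum(e * f for e, f in zip(err, far_end)) / n
    return (a, b), loss

random.seed(0)
desired = [random.uniform(-1, 1) for _ in range(500)]    # synthetic desired signal
far_end = [random.uniform(-1, 1) for _ in range(500)]    # synthetic far-end sample
# Synthetic residual: desired signal plus leftover far-end leakage.
residual = [d + 0.3 * f for d, f in zip(desired, far_end)]
params, final_loss = train(residual, far_end, desired)
```

On this synthetic data the parameters approach a = 1 and b = -0.3, that is, the model learns to subtract the leftover far-end component from the residual.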

5. The speech signal processing apparatus of claim 4, wherein the apparatus further comprises:

a sample acquisition unit, configured to acquire a near-end noisy signal sample and an impulse response sample from a voice playing device to a voice acquisition device;

the sample delay unit is used for delaying the near-end noisy signal sample for a preset time to obtain a target near-end noisy signal sample;

an echo sample determination unit, configured to determine a near-end speech echo sample based on the target near-end noisy signal sample and the impulse response sample;

an input signal sample determination unit for determining an input signal sample based on the near-end speech echo sample and the near-end noisy signal sample;

a near-end residual signal sample determining unit, configured to input the input signal sample and the target near-end noisy signal sample into the adaptive filter, and process the input signal sample based on the target near-end noisy signal sample through the adaptive filter to obtain a near-end residual signal sample;

a residual signal sample determination unit for determining the residual signal samples based on the near-end residual signal samples.
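The sample construction in claim 5 can be sketched end to end. Assumptions: white noise stands in for a recorded near-end noisy signal, the impulse response and the delay of 4 samples are made-up values, the input signal sample is the sum of the echo and the undelayed near-end signal, and the adaptive filter is a normalized LMS (NLMS) filter, which the claims do not name. Function names are hypothetical.

```python
import random

def delay(signal, n):
    """Delay by n samples, zero-padded, same length as the input."""
    n = min(n, len(signal))
    return [0.0] * n + signal[:len(signal) - n]

def convolve(signal, h):
    """Causal convolution truncated to len(signal)."""
    return [sum(h[k] * signal[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(signal))]

def nlms_residual(mic, ref, taps=16, mu=0.5, eps=1e-8):
    """NLMS stand-in for the adaptive filter; returns the residual signal."""
    w, buf, out = [0.0] * taps, [0.0] * taps, []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        e = d - sum(wi * bi for wi, bi in zip(w, buf))
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out

random.seed(1)
near_noisy = [random.uniform(-1, 1) for _ in range(2000)]  # synthetic stand-in
impulse = [0.5, 0.25, 0.1]      # hypothetical speaker-to-microphone response
PRESET_DELAY = 4                # hypothetical preset time, in samples

target_near = delay(near_noisy, PRESET_DELAY)        # target near-end noisy sample
near_echo = convolve(target_near, impulse)           # near-end speech echo sample
input_sample = [e + s for e, s in zip(near_echo, near_noisy)]  # assumed sum
near_residual = nlms_residual(input_sample, target_near)       # near-end residual
```

The delayed copy plays the role of the loudspeaker feed: it is both the source of the echo and the reference handed to the adaptive filter.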

6. The speech signal processing apparatus of claim 5, wherein the residual signal sample determination unit is specifically configured to:

determining a far-end speech echo sample based on the far-end speech signal sample and the impulse response sample;

inputting the far-end voice echo sample and the far-end voice signal sample into the adaptive filter, and processing the far-end voice echo sample based on the far-end voice signal sample through the adaptive filter to obtain a far-end residual signal sample;

determining the residual signal samples based on the near-end residual signal samples and the far-end residual signal samples.
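A minimal sketch of claim 6 follows, under stated assumptions: NLMS again stands in for the adaptive filter, the far-end signal is synthetic white noise, the impulse response is made up, `near_residual` is a placeholder for the output of the claim 5 procedure, and the two residuals are combined by addition (the claim only says "based on").

```python
import random

def convolve(signal, h):
    """Causal convolution truncated to len(signal)."""
    return [sum(h[k] * signal[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(signal))]

def nlms_residual(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """NLMS stand-in for the adaptive filter; returns the residual signal."""
    w, buf, out = [0.0] * taps, [0.0] * taps, []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        e = d - sum(wi * bi for wi, bi in zip(w, buf))
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out

random.seed(2)
far_end = [random.uniform(-1, 1) for _ in range(1500)]   # synthetic far-end sample
impulse = [0.5, 0.25, 0.1]                               # hypothetical response

far_echo = convolve(far_end, impulse)                    # far-end speech echo sample
far_residual = nlms_residual(far_echo, far_end)          # far-end residual sample

# Placeholder for the near-end residual produced by the claim 5 procedure.
near_residual = [random.uniform(-0.1, 0.1) for _ in far_end]
residual_sample = [n + f for n, f in zip(near_residual, far_residual)]
```

With a purely linear synthetic echo path the far-end residual decays toward zero; in a real system the residual would retain the nonlinear components that the second stage is trained to remove.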

7. The speech signal processing apparatus of claim 5, wherein the echo sample determination unit is specifically configured to:

determining a reference near-end voice echo sample based on the target near-end noisy signal sample and the impulse response sample, delaying the reference near-end voice echo sample by the preset time to obtain a delayed near-end voice echo sample, taking the delayed near-end voice echo sample as a new target near-end noisy signal sample, and repeating these steps until the number of delays reaches a preset number;

determining the near-end voice echo sample based on the reference near-end voice echo sample obtained each time.
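The repeated delay-and-convolve loop of claim 7 models sound that travels around the loudspeaker-to-microphone path several times. In the sketch below, `multi_pass_echo` is a hypothetical name and summing the per-pass reference echoes is an assumption; the claim only says the final echo sample is determined "based on" them.

```python
def delay(signal, n):
    """Delay by n samples, zero-padded, same length as the input."""
    n = min(n, len(signal))
    return [0.0] * n + signal[:len(signal) - n]

def convolve(signal, h):
    """Causal convolution truncated to len(signal)."""
    return [sum(h[k] * signal[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(signal))]

def multi_pass_echo(target, h, preset_delay, preset_count):
    """Accumulate echoes over preset_count trips around the acoustic loop."""
    passes = []
    current = target
    for _ in range(preset_count):
        ref_echo = convolve(current, h)          # reference near-end echo
        passes.append(ref_echo)
        current = delay(ref_echo, preset_delay)  # new target for the next pass
    return [sum(vals) for vals in zip(*passes)]  # assumed combination: sum

# A unit impulse through a one-tap path with gain 0.5 and a 2-sample delay.
echo = multi_pass_echo([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.5], 2, 3)
# echo == [0.5, 0.0, 0.25, 0.0, 0.125, 0.0]
```

Each pass attenuates and delays the previous one, so the impulse shows up as a decaying train at samples 0, 2, and 4.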

8. The speech signal processing apparatus according to any one of claims 1 to 7, wherein the first echo cancellation unit is specifically configured to:

processing the current voice signal based on the reference signal through the adaptive filter to obtain an output signal;

updating the current impulse response of the adaptive filter, with the objective of minimizing the correlation between the output signal and the reference signal, to obtain a target impulse response;

determining a target echo signal based on the target impulse response and the reference signal;

and processing the current voice signal based on the target echo signal to obtain the target residual signal.
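The four steps of claim 8 map naturally onto one iteration of a least-mean-squares style filter. The sketch below uses normalized LMS (NLMS) as an assumed update rule; the claim states only the objective (minimum correlation between output and reference), not the algorithm. The function name, step size, tap count, and the simulated room response are all hypothetical.

```python
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, w): the residual signal and the final impulse response."""
    w = [0.0] * taps            # current impulse response estimate
    buf = [0.0] * taps          # recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        # Step 1: output signal using the current impulse response.
        out = d - sum(wi * bi for wi, bi in zip(w, buf))
        # Step 2: update toward an output uncorrelated with the reference.
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * out * bi / norm for wi, bi in zip(w, buf)]
        # Step 3: target echo from the updated (target) impulse response.
        echo = sum(wi * bi for wi, bi in zip(w, buf))
        # Step 4: target residual = current signal minus target echo.
        residual.append(d - echo)
    return residual, w

random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(4000)]       # synthetic reference
true_h = [0.6, 0.3, -0.2, 0.1]                           # hypothetical echo path
mic = [sum(true_h[k] * ref[n - k] for k in range(len(true_h)) if n - k >= 0)
       for n in range(len(ref))]                         # echo only, no near-end talker
residual, w = nlms_echo_cancel(mic, ref)
```

With a purely linear echo path and no near-end speech, `w` converges to `true_h` and the residual decays toward zero. Nonlinear loudspeaker distortion would not be captured by `w`, which is the part left for the speech processing model.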

9. A speech signal processing method, comprising:

inputting a reference signal and a current voice signal acquired by a voice acquisition device into an adaptive filter, and processing the current voice signal through the adaptive filter based on the reference signal to obtain a target residual signal; the reference signal comprises a received far-end voice signal;

inputting the target residual signal and the far-end voice signal into a preset voice processing model to obtain a current frame near-end voice processing signal which is output by the voice processing model and used for transmission;

10. A sound amplification system, comprising:

a voice acquisition device, configured to acquire a current voice signal and input the current voice signal to a speech signal processing apparatus;

the speech signal processing apparatus, which is the speech signal processing apparatus according to any one of claims 1 to 8.

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech signal processing method of claim 9 when executing the program.

12. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the speech signal processing method according to claim 9.

13. A computer program product comprising a computer program, characterized in that the computer program realizes the speech signal processing method according to claim 9 when executed by a processor.