CN113299306B - Echo cancellation method, apparatus, electronic device, and computer-readable storage medium - Google Patents

Echo cancellation method, apparatus, electronic device, and computer-readable storage medium
Info

Publication number
CN113299306B
Authority
CN
China
Prior art keywords
spectrogram
feature
signal
features
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110847066.4A
Other languages
Chinese (zh)
Other versions
CN113299306A (en)
Inventor
马路
杨嵩
王心恬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110847066.4A
Publication of CN113299306A
Application granted
Publication of CN113299306B
Legal status: Active (current)
Anticipated expiration

Abstract

The present disclosure provides an echo cancellation method, apparatus, electronic device, and computer-readable storage medium. A near-end mixed signal and the far-end signal of the corresponding reference channel are received. The near-end mixed signal and the far-end signal are encoded separately to obtain an encoded near-end mixed-signal spectrogram and an encoded far-end signal spectrogram, which are then concatenated to obtain a concatenated spectrogram. Multi-scale features are extracted from the concatenated spectrogram; deep features are extracted from the encoded near-end mixed-signal spectrogram; the weight of each layer of the multi-scale features is calculated from the deep features; each layer's features are weighted by the corresponding weight to obtain merged multi-scale features; and a near-end signal estimate is obtained from the merged multi-scale features and the deep features. The present disclosure enables effective echo cancellation in scenarios such as voice interaction and voice calls.

Description

Echo cancellation method, apparatus, electronic device, and computer-readable storage medium

Technical Field

The present invention relates to the technical field of speech processing, and in particular to an echo cancellation method, apparatus, electronic device, and computer-readable storage medium.

Background

Echo cancellation was first applied in audio call systems. During a call, the sound from one end travels over the line to the other end and is played through that end's loudspeaker, where the local microphone picks it up. Because sound is also reflected by the floor, walls, and other objects in the room, the microphone receives various reflected sounds in addition to the direct sound from the loudspeaker. This mixed sound is transmitted back to the talker. This is the acoustic echo problem: it disturbs the conversation and degrades system quality, and it is common in communication networks. In smart voice devices, the audio played by the device is likewise picked up by the device's own microphone. If this echo is not cancelled, audio quality suffers, which lowers the speech recognition rate and degrades the user experience.

In scenarios such as voice interaction and voice calls, echo cancellation performance directly affects the back-end speech recognition rate and the user's listening experience, making it a key core technology in speech processing.

Summary of the Invention

According to one aspect of the present disclosure, an echo cancellation method is provided, comprising:

receiving a near-end mixed signal and the far-end signal of the corresponding reference channel;

encoding the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed-signal spectrogram and an encoded far-end signal spectrogram, and concatenating the two encoded spectrograms to obtain a concatenated spectrogram;

extracting multi-scale features from the concatenated spectrogram;

extracting deep features from the encoded near-end mixed-signal spectrogram;

calculating, from the deep features, the weight of each layer of the multi-scale features;

weighting each layer's features by the corresponding weight to obtain merged multi-scale features;

obtaining a near-end signal estimate from the merged multi-scale features and the deep features.

According to another aspect of the present disclosure, an echo cancellation apparatus is provided, comprising:

a receiving module, configured to receive a near-end mixed signal and the far-end signal of the corresponding reference channel;

an encoding module, configured to encode the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed-signal spectrogram and an encoded far-end signal spectrogram, and to concatenate the two encoded spectrograms to obtain a concatenated spectrogram;

a first extraction module, configured to extract multi-scale features from the concatenated spectrogram;

a second extraction module, configured to extract deep features from the encoded near-end mixed-signal spectrogram;

a calculation module, configured to calculate, from the deep features, the weight of each layer of the multi-scale features;

a weighting module, configured to weight each layer's features by the corresponding weight to obtain merged multi-scale features;

an obtaining module, configured to obtain a near-end signal estimate from the merged multi-scale features and the deep features.

According to another aspect of the present disclosure, an electronic device is provided, comprising:

a processor; and

a memory storing a program,

wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any one of the above aspects.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions cause a computer to perform the method according to any one of the above aspects.

One or more technical solutions provided in the embodiments of the present disclosure can effectively cancel echo in scenarios such as voice interaction and voice calls.

Brief Description of the Drawings

Further details, features, and advantages of the present disclosure are disclosed in the following description of exemplary embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 shows a flowchart of an echo cancellation method according to an exemplary embodiment of the present disclosure;

FIG. 2 shows a schematic structural diagram of an echo cancellation network according to an exemplary embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of echo cancellation data preparation according to an exemplary embodiment of the present disclosure;

FIG. 4 shows a schematic structural diagram of a 1-D Conv Block model according to an exemplary embodiment of the present disclosure;

FIG. 5 shows a schematic block diagram of an echo cancellation apparatus according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. The drawings and embodiments of the present disclosure are for illustration only and are not intended to limit its scope of protection.

It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below. Note that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and not to limit the order or interdependence of the functions they perform.

Note that the modifiers "a" and "a plurality of" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.

The names of messages or information exchanged between apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.

In scenarios such as voice interaction and voice calls, echo cancellation performance directly affects the back-end speech recognition rate and the user's listening experience, making it a key core technology in speech processing. The approach commonly used at present is that of WebRTC: first, a delay estimation algorithm aligns the near-end and far-end data; next, an adaptive filter estimates the echo and thereby removes the linear echo; finally, nonlinear processing suppresses the residual echo. Although nonlinear processing can suppress residual echo to some extent, the suppression is limited and some residual echo remains, especially in complex environments. In addition, the filter cannot quickly track changes in the room impulse response, which degrades the final echo cancellation result and, in turn, the performance of the whole audio signal processing chain.

To address these problems, this embodiment provides an echo cancellation method that can be used on smartphones and on other smart devices (electronic devices) with speech processing functions, such as portable tablet computers. FIG. 1 shows a flowchart of the echo cancellation method according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the flow includes the following steps:

Step S101: receive a near-end mixed signal and the far-end signal of the corresponding reference channel. Specifically, the near-end mixed signal received by the near-end microphone and the far-end signal of the reference channel can be input directly.

Step S102: encode the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed-signal spectrogram and an encoded far-end signal spectrogram, and concatenate the two encoded spectrograms to obtain a concatenated spectrogram.

Step S103: extract multi-scale features from the concatenated spectrogram. For example, a dilated temporal convolutional network can be used to extract the multi-scale features. Those skilled in the art will appreciate that this way of extracting multi-scale features does not limit the present embodiment; other ways adopted according to actual needs also fall within its scope of protection.

Step S104: extract deep features from the encoded near-end mixed-signal spectrogram.

Step S105: calculate, from the deep features, the weight of each layer of the multi-scale features. The weights are used to emphasize the important features among the multi-scale features while suppressing the unimportant ones.

Step S106: weight each layer's features by the corresponding weight to obtain merged multi-scale features, thereby emphasizing the important features among the multi-scale features and suppressing the unimportant ones.

Step S107: obtain a near-end signal estimate from the merged multi-scale features and the deep features.

These steps are implemented end to end: multi-scale features of the input audio are extracted and used for echo cancellation, giving strong echo suppression with little spectral loss.

Step S103 above extracts multi-scale features from the concatenated spectrogram. In some optional embodiments, as shown in FIG. 2, the concatenated spectrogram is input to the multi-scale feature extraction module in the canceller module. This module consists of multiple groups of dilated convolutions, each group containing multiple convolution blocks, and it extracts the multi-scale features of each layer from the concatenated spectrogram.

Step S104 above extracts deep features from the encoded near-end mixed-signal spectrogram. In some optional embodiments, as shown in FIG. 2, the encoded near-end mixed-signal spectrogram is input to the first long short-term memory (LSTM) network in the canceller module, which extracts the deep features from it.

Step S105 above calculates the weight of each layer of the multi-scale features from the deep features. In some optional embodiments, as shown in FIG. 2, the deep features serve as the query and each layer of the multi-scale features serves as the key and value; a multi-head attention mechanism then computes the weight of each layer of the multi-scale features.

Step S106 above weights each layer's features by the corresponding weight to obtain merged multi-scale features. In some optional embodiments, as shown in FIG. 2, the multi-head attention mechanism multiplies each layer's weight by the corresponding features and sums the results to obtain the merged multi-scale features.
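The weighting in steps S105 and S106 can be pictured with a minimal single-head sketch for one time step. It is an illustration under stated assumptions rather than the patented network: the learned projection matrices and the multiple heads are omitted, the scaled dot product is assumed as the similarity measure, and the names `softmax` and `merge_layers` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_layers(deep_feature, layer_features):
    """Weight each layer's feature by its similarity to the deep feature, then sum.

    deep_feature:   query vector for one time step (length S)
    layer_features: J per-layer vectors (each of length S), used as key and value
    """
    scale = math.sqrt(len(deep_feature))
    # Assumption: scaled dot product as the similarity between query and key.
    scores = [sum(q * k for q, k in zip(deep_feature, layer)) / scale
              for layer in layer_features]
    weights = softmax(scores)
    # Weighted sum over layers gives one merged feature vector.
    merged = [sum(w * layer[d] for w, layer in zip(weights, layer_features))
              for d in range(len(deep_feature))]
    return weights, merged


query = [1.0, 0.0]
layers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, merged = merge_layers(query, layers)
```

Layers whose features align with the query receive larger weights, which is the sense in which important features are emphasized and unimportant ones suppressed.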

Step S107 above obtains the near-end signal estimate from the merged multi-scale features and the deep features. In some optional embodiments, as shown in FIG. 2, the merged multi-scale features and the deep features are concatenated and input to the second LSTM network in the canceller module to obtain the near-end signal estimate.

In some optional embodiments, as shown in FIG. 2, the merged multi-scale features and the near-end signal estimate are concatenated and input to a classifier, which determines whether a far-end signal or a near-end signal is present.

The data preparation required to train the canceller module is shown in FIG. 3. The canceller module is trained as follows. Speech from different speakers is selected from a database as the near-end signal sample (near-end) and the far-end signal sample (far-end). The far-end signal sample is passed in turn through a non-linear processing module (NLP) and a room impulse response (RIR), which simulate the nonlinearity introduced by the loudspeaker and the reverberation introduced by the environment, respectively, yielding the echo signal sample (echo). The near-end signal sample and the echo signal sample are added together, along with a certain amount of noise, to obtain the near-end mixed signal sample (mixture) as received by the near-end microphone. The near-end mixed signal sample and the far-end signal sample are used as the inputs of the canceller module, the near-end signal sample is used as the learning target of the canceller module's mean square error loss function, and the canceller module is trained accordingly.
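The data preparation can be sketched as follows. This is a minimal illustration, not the procedure of FIG. 3 itself: the `tanh` soft clipping stands in for the unspecified loudspeaker nonlinearity, the noise is assumed Gaussian, the near-end and far-end samples are assumed to have equal length, and the names `convolve` and `make_training_example` are hypothetical.

```python
import math
import random


def convolve(signal, impulse_response):
    """Linear convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                out[n] += h * signal[n - k]
    return out


def make_training_example(near_end, far_end, rir, noise_std=0.01, seed=0):
    """Return (mixture, echo) for one near-end/far-end pair of equal length."""
    rng = random.Random(seed)
    # Assumption: soft clipping models the nonlinearity of the loudspeaker.
    distorted = [math.tanh(2.0 * x) for x in far_end]
    # The room impulse response adds the reverberation of the environment.
    echo = convolve(distorted, rir)
    # Microphone signal: near-end speech plus echo plus additive noise.
    mixture = [s + e + rng.gauss(0.0, noise_std)
               for s, e in zip(near_end, echo)]
    return mixture, echo


near = [0.0] * 8
far = [1.0] + [0.0] * 7
mixture, echo = make_training_example(near, far, rir=[0.5, 0.25], noise_std=0.0)
```

During training, `mixture` and `far` would be the network inputs and `near` the regression target.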

As shown in FIG. 3, in some optional embodiments, training of the canceller module continues as follows. The energy of the echo signal sample and the energy of the near-end signal sample are calculated, and each is compared with a predetermined threshold to obtain a first value and a second value, which together form the double-talk detection label. For example, an energy above the predetermined threshold gives "1" and an energy below it gives "0", yielding the double-talk detection result class: silence only ("00"), far-end signal only ("01"), near-end signal only ("10"), or signals present at both ends ("11"). The near-end mixed signal sample and the far-end signal sample are used as the inputs of the canceller module, and the double-talk detection label class is used as the learning target of the canceller module's cross-entropy loss function.
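A minimal sketch of the labeling rule follows. The bit order is an assumption chosen to match the listed classes ("10" for near-end only, "01" for far-end only), so the near-end bit is written first; the threshold value and the names `energy` and `double_talk_label` are hypothetical.

```python
def energy(frame):
    """Sum of squared samples in a frame."""
    return sum(x * x for x in frame)


def double_talk_label(near_frame, echo_frame, threshold=1e-3):
    """Two-character label: near-end bit followed by far-end (echo) bit."""
    near_bit = "1" if energy(near_frame) > threshold else "0"
    far_bit = "1" if energy(echo_frame) > threshold else "0"
    return near_bit + far_bit


silent = [0.0, 0.0, 0.0]
active = [0.5, -0.4, 0.3]
labels = [
    double_talk_label(silent, silent),
    double_talk_label(silent, active),
    double_talk_label(active, silent),
    double_talk_label(active, active),
]
```

The four possible labels map naturally onto a classifier with four output classes.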

The network has two training objectives. The first concerns the accuracy of the near-end signal estimate: the goal is to minimize the mean square error (MSE) between the near-end signal estimate and the true near-end signal, defined as follows:

Figure 846081DEST_PATH_IMAGE001

where
Figure 950172DEST_PATH_IMAGE002
and
Figure 198751DEST_PATH_IMAGE003
are the estimated near-end speech signal and the true near-end signal, respectively.
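The defining formula survives only as an image reference here, so the sketch below assumes the usual form of the mean square error, averaged over all samples. The name `mse_loss` is hypothetical.

```python
def mse_loss(estimate, target):
    """Mean of squared differences between estimated and true near-end signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


loss = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```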

The second learning objective is classification: the goal is to minimize the cross-entropy loss between the estimated class and the true class label, namely:

Figure 859539DEST_PATH_IMAGE004

where
Figure 142753DEST_PATH_IMAGE005
denotes the class probability distribution estimated by the network after Softmax,
Figure 168478DEST_PATH_IMAGE006
denotes the true class probability distribution, i.e., the label distribution, and C denotes the number of classes.

The total loss of the network is the weighted average of the classification cross-entropy loss and the MSE loss, namely:

Figure 701090DEST_PATH_IMAGE007

where
Figure 618100DEST_PATH_IMAGE008
is the weight coefficient that balances the classification and separation tasks. The logarithm of the classification cross-entropy is taken to keep the two loss functions on the same order of magnitude.
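Because the loss formulas are available only as image references, the following sketch assumes the standard cross-entropy over C classes and one plausible reading of the "weighted average": a coefficient `alpha` on the logarithm of the cross-entropy and `1 - alpha` on the MSE. Which term carries which weight is an assumption, as are the names `cross_entropy` and `total_loss` and the small constant `eps`.

```python
import math


def cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy between a softmax output and a label distribution."""
    return -sum(p * math.log(q + eps) for q, p in zip(predicted, target))


def total_loss(ce, mse, alpha=0.5, eps=1e-12):
    """Assumed combination: alpha * log(CE) + (1 - alpha) * MSE."""
    return alpha * math.log(ce + eps) + (1.0 - alpha) * mse


ce = cross_entropy([0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 1.0, 0.0])
combined = total_loss(math.e, 2.0, alpha=0.5)
```

With a uniform prediction over four classes the cross-entropy equals log 4, the value expected from an untrained classifier.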

In some optional embodiments, as shown in FIG. 2, the near-end signal estimate is input to the mask estimation module to obtain the mask value of the pure near-end signal at each time-frequency point of the near-end mixed signal. The mask value at each time-frequency point is multiplied by the encoded near-end mixed-signal spectrogram to obtain the near-end signal spectrogram, which is input to a one-dimensional convolutional decoder to obtain the time-domain waveform of the near-end signal.
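The masking step is an element-wise product, sketched below for a spectrogram stored as a list of channels, each a list of frames. The storage layout and the name `apply_mask` are assumptions for illustration.

```python
def apply_mask(mask, encoded_mixture):
    """Multiply a [channels][frames] mask with a spectrogram of the same shape."""
    if len(mask) != len(encoded_mixture):
        raise ValueError("mask and spectrogram must have the same channel count")
    return [[m * x for m, x in zip(mask_row, spec_row)]
            for mask_row, spec_row in zip(mask, encoded_mixture)]


mixture_spec = [[2.0, 4.0], [6.0, 8.0]]
mask = [[1.0, 0.5], [0.0, 0.25]]
near_end_spec = apply_mask(mask, mixture_spec)
```

A mask value near 1 keeps a time-frequency point, and a value near 0 removes it as echo or noise.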

A detailed description is given below with reference to FIG. 2, in conjunction with some complete optional embodiments.

The echo cancellation network and its main functional modules are shown in FIG. 2. It comprises four modules: an audio encoding module (Encoder), an audio decoding module (Decoder), a canceller module (Canceller), and a classifier module (Classifier).

The audio encoding module (Encoder) is a one-dimensional convolution module.

Canceller: includes one layer normalization, one one-dimensional convolution, two LSTM layers, multiple groups of dilated convolution layers, and one attention module. Each group of dilated convolutions contains X one-dimensional convolution blocks (1-D Conv Block), and the dilation of each block grows as a power of 2, i.e., 2^(i-1), where i = 1, …, X is the index of the convolution block. Depending on whether the convolution is causal, the amount of zero padding is (dilation*(kernel_size-1))/2 in the causal case and dilation*(kernel_size-1) in the non-causal case. The structure of each 1-D Conv Block is shown in FIG. 4. The features produced by multi-scale feature extraction are assumed to be represented as follows:

Figure 490241DEST_PATH_IMAGE009

where S is the feature dimension of each layer's output, T is the number of time steps, and J = M*R is the total number of layers; M is the number of layers in each group of stacked dilated convolutions, and R is the number of times the group is repeated (each group contains M layers).
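The dilation schedule and the resulting temporal context can be sketched as below. Note one deliberate difference: the padding helper follows the convention commonly seen in length-preserving dilated convolutions, where the causal case pads the full span dilation*(kernel_size-1) on the past side and the non-causal case pads half of it on each side. That assignment is the reverse of the one stated in the text and should be treated as an assumption. All function names are hypothetical.

```python
def dilation_schedule(blocks_per_group, groups):
    """Dilations 2**(i-1) for i = 1..M, repeated for R groups (J = M*R layers)."""
    return [2 ** (i - 1)
            for _ in range(groups)
            for i in range(1, blocks_per_group + 1)]


def zero_padding(dilation, kernel_size, causal):
    """Zeros needed to keep the output length equal to the input length.

    Causal: the whole span is padded on the past side.
    Non-causal: half of the span is padded on each side.
    """
    span = dilation * (kernel_size - 1)
    return span if causal else span // 2


def receptive_field(dilations, kernel_size):
    """Number of input frames seen by one output frame after all layers."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


dilations = dilation_schedule(blocks_per_group=3, groups=2)
context = receptive_field(dilations, kernel_size=3)
```

Doubling the dilation in each block lets the temporal context grow exponentially with depth inside a group, which is what gives the layers their different scales.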

Attention mechanism: computes the similarity between the deep features of the near-end mixed signal extracted by the LSTM and the features extracted by each layer of the dilated convolution groups to obtain a weight for each layer. Each layer's features are multiplied by that weight and summed directly to give a weighted deep feature. The attention mechanism is the standard multi-head attention mechanism, namely:

Figure 749184DEST_PATH_IMAGE010

Figure 972355DEST_PATH_IMAGE011

Figure 178208DEST_PATH_IMAGE012

Figure 154123DEST_PATH_IMAGE013

where Q, K, and V denote the query, key, and value of the attention mechanism, respectively;
Figure 849547DEST_PATH_IMAGE014
is the feature extracted by the LSTM;
Figure 560014DEST_PATH_IMAGE015
is the multi-layer feature extracted by the multi-scale feature extraction module;
Figure 038400DEST_PATH_IMAGE016
are the mapping matrices of the attention mechanism; F is the dimension used in the attention computation; and h is the number of heads of the multi-head attention mechanism.
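The four equations are available only as image references. For orientation, the standard multi-head attention formulation that the text refers to is usually written as below. The symbols $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$ and $d_k$ follow the common notation and are assumptions here; they are not recovered from the original figures, and the per-head dimension $d_k$ may correspond to F/h in this document's notation.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right), \quad i = 1, \ldots, h \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
\end{aligned}
```

In this setting Q would be derived from the LSTM features, while K and V would both be derived from the per-layer multi-scale features.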

LSTM layers: the first LSTM extracts the deep features of the near-end mixed signal; the second LSTM computes the deep features of the near-end speech signal from the deep features extracted by the first LSTM and the deep features produced by the attention mechanism.

Mask module: consists of a PReLU activation function, a one-dimensional convolution layer (1x1 Conv), and a Sigmoid activation function. From the deep features of the near-end speech signal estimated by the LSTM, it produces the mask of the near-end speech within the near-end mixed signal.
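A minimal sketch of the mask module's forward pass follows, treating the 1x1 convolution as a per-frame linear map across channels. The fixed PReLU slope of 0.25 is a placeholder for what would be a learned parameter, and the weights, bias, and function names are hypothetical.

```python
import math


def prelu(x, slope=0.25):
    """Parametric ReLU with a fixed slope for negative inputs."""
    return x if x >= 0.0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mask_module(features, weights, bias, slope=0.25):
    """PReLU, then 1x1 convolution, then Sigmoid.

    features: [in_channels][frames]
    weights:  [out_channels][in_channels]
    bias:     [out_channels]
    """
    frames = len(features[0])
    mask = []
    for w_row, b in zip(weights, bias):
        row = []
        for t in range(frames):
            # A 1x1 convolution mixes channels independently at each frame.
            z = b + sum(w * prelu(channel[t], slope)
                        for w, channel in zip(w_row, features))
            row.append(sigmoid(z))
        mask.append(row)
    return mask


mask = mask_module([[1.0, -4.0]], weights=[[1.0], [-1.0]], bias=[0.0, 0.0])
```

The Sigmoid keeps every mask value strictly between 0 and 1, so the mask can only attenuate the encoded mixture.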

Classifier: consists of a linear layer and a Softmax layer. From the deep features produced by the attention mechanism and the near-end speech features produced by the LSTM, it estimates, for each time step, the probability that a signal is present at the near end and at the far end.
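The classifier's per-frame computation can be sketched as a linear layer followed by Softmax. The weights below are arbitrary illustrative values and `classify_frame` is a hypothetical name; in the network the input would be the concatenated attention and LSTM features.

```python
import math


def classify_frame(features, weights, bias):
    """Linear layer followed by Softmax for one time step.

    features: input vector for the frame
    weights:  [classes][len(features)]
    bias:     [classes]
    """
    logits = [b + sum(w * x for w, x in zip(row, features))
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = classify_frame(
    [1.0, 0.0],
    weights=[[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    bias=[0.0, 0.0, 0.0, 0.0],
)
predicted_class = probs.index(max(probs))
```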

Decoder: consists of a transposed convolutional network that deconvolves its input to obtain the time-domain signal.

The network configuration is shown in Table 1, where F is the number of output channels of the Encoder; L is the kernel size of the Encoder; E is the number of output channels of the bottleneck layer; M is the number of 1-D Conv Blocks in each group of the multi-scale feature extractor, with R groups stacked in total; the classifier has 2*E input channels and C output channels, i.e., the audio is divided into C classes; and the Masking module has F output channels, matching the number of output channels of the Encoder.

Table 1

[Table 1 is an image in the source (Figure 150712DEST_PATH_IMAGE017) and is not reproduced in the text.]
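To make the classifier dimensions concrete (2*E input channels, C output channels), here is a minimal pure-Python sketch of a linear layer followed by Softmax, applied per time step. The function names and the data layout are assumptions made for illustration; the actual values of E and C are given only in Table 1, which is not reproduced.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_frames(attention_feats, near_end_feats, weight, bias):
    """Per-frame class probabilities.

    attention_feats, near_end_feats: T frames x E channels each.
    weight: C x (2*E); bias: C. Returns T x C probabilities.
    """
    probs = []
    for att, near in zip(attention_feats, near_end_feats):
        joint = list(att) + list(near)  # concatenation gives 2*E inputs
        logits = [sum(w * x for w, x in zip(row, joint)) + b
                  for row, b in zip(weight, bias)]
        probs.append(softmax(logits))
    return probs
```

Each output row sums to 1, so it can be read as a distribution over the C signal-presence categories for that frame.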

The structure of the 1-D Conv Block is shown in FIG. 4. A conventional convolution is split into a pointwise convolution and a depthwise convolution, and the parametric rectified linear unit (PReLU), whose expression is given below, is used as the activation function. The data are normalized after each convolution. The final output is divided into two branches, each passing through a 1x1 Conv for dimension transformation. The output branch is added to the block input, which increases the network depth; the output of the Skip-out branch serves as the output feature of the block and is concatenated with the features of the subsequently stacked blocks and sent to the classifier.

[The PReLU expression is an image in the source (Figure 751458DEST_PATH_IMAGE018) and is not reproduced in the text.]
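For orientation, the standard definition of PReLU is given below. The symbols $x$ (input) and $\alpha$ (trainable slope for negative inputs) are the conventional ones and are an assumption here; the notation in the original figure may differ.

```latex
\mathrm{PReLU}(x) =
\begin{cases}
x, & x \ge 0 \\
\alpha x, & x < 0
\end{cases}
```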

To make the separation network insensitive to the amplitude of the input speech, the input features are normalized before the multi-scale mapping.

In non-real-time scenarios, layer normalization can use global layer normalization, in which the features are normalized over both the channel and time dimensions. The expression is as follows:

[The global layer normalization expression is an image in the source (Figure 189699DEST_PATH_IMAGE019) and is not reproduced in the text.]

where the symbol in Figure 471776DEST_PATH_IMAGE020 denotes the features, the symbols in Figure 235333DEST_PATH_IMAGE021 are trainable parameters, and the symbol in Figure 475821DEST_PATH_IMAGE022 denotes the stability coefficient.
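As a rough illustration of global layer normalization as described above, the following pure-Python sketch computes a single mean and variance over all channels and all frames, then applies a per-channel trainable scale and shift. The names `global_layer_norm`, `gamma`, `beta`, and `eps`, and the choice of per-channel parameters, are assumptions, since the symbols of the original expression are available only as images.

```python
import math


def global_layer_norm(features, gamma, beta, eps=1e-8):
    """features: N channels x T frames; gamma, beta: one value per channel.

    The statistics are taken over every entry of the N x T feature map,
    so the whole utterance must be available (non-real-time use).
    """
    values = [v for channel in features for v in channel]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)  # eps plays the role of the stability coefficient
    return [[g * (v - mean) * scale + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]
```

With `gamma` set to ones and `beta` to zeros, the output has approximately zero mean and unit variance regardless of the input amplitude, which is the amplitude insensitivity the text asks for.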

In real-time scenarios, layer normalization can use cumulative layer normalization, in which layer normalization is applied to the continuously arriving input features. The expression is as follows:

[The cumulative layer normalization expression is an image in the source (Figure 629722DEST_PATH_IMAGE023) and is not reproduced in the text.]

where the symbol in Figure 230337DEST_PATH_IMAGE024 denotes the features of the k-th frame; the symbol in Figure 848400DEST_PATH_IMAGE025 denotes the features of k consecutive frames, as defined by the expression in Figure 994210DEST_PATH_IMAGE026; the symbols in Figure 900986DEST_PATH_IMAGE027 are trainable parameters; and the symbol in Figure 587183DEST_PATH_IMAGE028 denotes the stability coefficient.
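A causal counterpart can be sketched in the same way. In the version below, the statistics for frame k are taken over all channels of frames 0 through k only, so no future frames are needed. This is a minimal sketch under assumptions: the function and parameter names are hypothetical, and the running-sum formulation is one possible way to implement the accumulation, not necessarily the one used by the authors.

```python
import math


def cumulative_layer_norm(features, gamma, beta, eps=1e-8):
    """features: N channels x T frames; gamma, beta: one value per channel.

    Frame k is normalized with the mean and variance of all entries
    in frames 0..k, which keeps the operation causal.
    """
    n_channels = len(features)
    n_frames = len(features[0])
    out = [[0.0] * n_frames for _ in range(n_channels)]
    running_sum = 0.0
    running_sq = 0.0
    for k in range(n_frames):
        for c in range(n_channels):
            v = features[c][k]
            running_sum += v
            running_sq += v * v
        count = n_channels * (k + 1)
        mean = running_sum / count
        # Clamp at zero to guard against tiny negative values from rounding.
        var = max(running_sq / count - mean * mean, 0.0)
        scale = 1.0 / math.sqrt(var + eps)
        for c in range(n_channels):
            out[c][k] = gamma[c] * (features[c][k] - mean) * scale + beta[c]
    return out
```

A quick way to check causality is to change the last frame of the input and confirm that the output for earlier frames is unchanged.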

The Encoder transforms the one-dimensional time-domain input audio into a two-dimensional spectrogram. The spectrograms obtained by passing the near-end mixed signal (mixture) and the far-end signal (far-end) through the Encoder are sent to the canceller module. The canceller first applies layer normalization to normalize the input amplitude and then compresses the input dimension with a one-dimensional convolution (the bottleneck layer). The near-end mixed signal and the far-end signal are then concatenated and sent to a multi-scale feature extraction module, which extracts features from its input at multiple scales and concatenates the features extracted at each scale into a multi-scale feature group.

In parallel, after dimension compression, the near-end mixed signal is passed through a long short-term memory (LSTM) network to extract deep features. These deep features serve as the query of the attention mechanism: their similarity to the multi-scale features extracted by each layer is computed to obtain a weight for each layer's features, and each layer's features are then weighted accordingly. In the attention computation, the features extracted by the LSTM are the query, and the features extracted by each layer of the multi-scale feature extraction module are the key and the value. A standard multi-head attention mechanism computes the weight of each layer's features; each weight is multiplied by the corresponding features and the products are summed to obtain the merged multi-scale feature. The merged multi-scale feature is concatenated with the near-end mixed signal features extracted by the LSTM and sent to a second LSTM to obtain an estimate of the near-end signal features.

The near-end signal estimate and the output of the attention mechanism are concatenated and sent to the classifier, which determines whether a signal is present at the near end and at the far end. The near-end estimated features output by the LSTM are sent to the mask estimation module (a PReLU activation function, a one-dimensional convolution (1-D Conv), and a sigmoid activation function) to obtain the mask value of the pure near-end signal at each time-frequency point of the near-end mixed signal. Multiplying these mask values by the encoded spectrogram of the near-end mixed signal gives the spectrogram of the near-end signal, which is sent to the Decoder, built from one-dimensional convolutions, to obtain the corresponding time-domain waveform of the near-end signal.
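The attention step above (LSTM features as query, per-layer multi-scale features as key and value) can be illustrated for a single time step. The sketch below is a deliberate simplification: it uses one head of scaled dot-product attention without learned projections, whereas the text specifies a standard multi-head mechanism. The function name `attend_over_layers` and the data layout are assumptions.

```python
import math


def attend_over_layers(query, layer_feats):
    """Weight and merge per-layer features for one time step.

    query: E-dimensional deep feature from the first LSTM.
    layer_feats: list with one E-dimensional vector per layer,
                 used as both key and value.
    Returns (weights, merged), where weights sum to 1.
    """
    dim = len(query)
    # Similarity between the query and each layer's feature (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
              for feat in layer_feats]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Multiply each layer's feature by its weight and sum over layers.
    merged = [sum(w * feat[i] for w, feat in zip(weights, layer_feats))
              for i in range(dim)]
    return weights, merged
```

Layers whose features resemble the near-end deep feature receive larger weights, so the merged feature emphasizes the scales most relevant to the current frame.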

This embodiment also provides an echo cancellation apparatus for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" refers to a combination of software and/or hardware that can implement a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

This embodiment provides an echo cancellation apparatus which, as shown in FIG. 5, includes:

a receiving module 51, configured to receive a near-end mixed signal and a far-end signal of a corresponding reference channel;

an encoding module 52, configured to encode the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and to concatenate the two to obtain a concatenated spectrogram;

a first extraction module 53, configured to extract multi-scale features from the concatenated spectrogram;

a second extraction module 54, configured to extract deep features from the encoded near-end mixed signal spectrogram;

a calculation module 55, configured to calculate a weight for each layer of features of the multi-scale features according to the deep features;

a weighting module 56, configured to weight the corresponding features with the weight of each layer of features to obtain merged multi-scale features; and

an obtaining module 57, configured to obtain a near-end signal estimate according to the merged multi-scale features and the deep features.
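The data flow between modules 51 to 57 can be summarized as a short wiring sketch. Everything here is hypothetical scaffolding: the function `cancel_echo` and its keyword arguments are illustrative names, and each stage is passed in as a callable so that the sketch shows only the order of operations, not any particular network.

```python
def cancel_echo(mixture, far_end, *, encode, extract_multi_scale, extract_deep,
                layer_weights, merge, estimate_near_end):
    """Chain the stages in the order of modules 51-57.

    mixture, far_end: the received signals (receiving module 51).
    The remaining arguments are callables standing in for the trained stages.
    """
    mix_spec = encode(mixture)    # encoding module 52
    far_spec = encode(far_end)
    # Concatenate the two spectrograms frame by frame along the channel axis.
    spliced = [m + f for m, f in zip(mix_spec, far_spec)]
    multi_scale = extract_multi_scale(spliced)      # first extraction module 53
    deep = extract_deep(mix_spec)                   # second extraction module 54
    weights = layer_weights(deep, multi_scale)      # calculation module 55
    merged = merge(weights, multi_scale)            # weighting module 56
    return estimate_near_end(merged, deep)          # obtaining module 57
```

Note that the deep features come from the mixture spectrogram alone, while the multi-scale features see both the mixture and the far-end reference.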

The echo cancellation apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory that execute one or more software or fixed programs, and/or other devices that can provide the above functions.

Further functional descriptions of the above modules are the same as in the corresponding embodiments above and are not repeated here.

Exemplary embodiments of the present disclosure also provide an electronic device including at least one processor and a memory communicatively connected to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed by the at least one processor, the computer program causes the electronic device to perform a method according to an embodiment of the present disclosure.

Exemplary embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to an embodiment of the present disclosure.

Exemplary embodiments of the present disclosure also provide a computer program product including a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to an embodiment of the present disclosure.

Referring to FIG. 6, a structural block diagram of an electronic device 600 that can serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that can be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the disclosure described and/or claimed herein.

As shown in FIG. 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Multiple components of the electronic device 600 are connected to the I/O interface 605, including an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600; it can receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, a magnetic disk and an optical disc. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.

The computing unit 601 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and so on. The computing unit 601 performs the methods and processes described above. For example, in some embodiments, the echo cancellation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. In some embodiments, the computing unit 601 may be configured to perform the echo cancellation method in any other appropriate manner (for example, by means of firmware).

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

As used in the present disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, a magnetic disk, an optical disc, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described here may be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other.

Claims (9)

1. An echo cancellation method, comprising:
receiving a near-end mixed signal and a far-end signal of a corresponding reference channel;
encoding the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and concatenating the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a concatenated spectrogram;
extracting multi-scale features from the concatenated spectrogram;
extracting deep features from the encoded near-end mixed signal spectrogram, which comprises: inputting the encoded near-end mixed signal spectrogram into a first long short-term memory network in a canceller module, the first long short-term memory network extracting the deep features from the encoded near-end mixed signal spectrogram;
calculating a weight for each layer of features of the multi-scale features according to the deep features, which comprises: using the deep features as the query and each layer of features of the multi-scale features as the key and the value, and calculating the weight of each layer of features of the multi-scale features with a multi-head attention mechanism;
weighting the corresponding features with the weight of each layer of features to obtain merged multi-scale features, which comprises: multiplying, through the multi-head attention mechanism, the weight of each layer of features by the corresponding features and summing the results to obtain the merged multi-scale features;
obtaining a near-end signal estimate according to the merged multi-scale features and the deep features, which comprises: concatenating the merged multi-scale features and the deep features and inputting the result into a second long short-term memory network in the canceller module to obtain the near-end signal estimate;
inputting the near-end signal estimate into a mask estimation module to obtain a mask value for each time-frequency point of the pure near-end signal in the near-end mixed signal;
multiplying the mask value of each time-frequency point by the encoded near-end mixed signal spectrogram to obtain a near-end signal spectrogram; and
inputting the near-end signal spectrogram into a one-dimensional convolutional decoder to obtain a time-domain waveform of the near-end signal.

2. The echo cancellation method of claim 1, wherein extracting multi-scale features from the concatenated spectrogram comprises:
inputting the concatenated spectrogram into a multi-scale feature extraction module in the canceller module, wherein the multi-scale feature extraction module consists of multiple groups of dilated convolutions and each group of dilated convolutions comprises multiple convolution blocks; and extracting, by the multi-scale feature extraction module, the multi-scale features of each layer from the concatenated spectrogram.

3. The echo cancellation method of claim 1, wherein obtaining a near-end signal estimate according to the merged multi-scale features and the deep features comprises:
concatenating the merged multi-scale features and the deep features and inputting the result into the second long short-term memory network in the canceller module to obtain the near-end signal estimate.

4. The echo cancellation method of claim 1, wherein the method further comprises:
concatenating the merged multi-scale features and the near-end signal estimate and inputting the result into a classifier, the classifier determining whether a far-end signal or a near-end signal is present.

5. The echo cancellation method of claim 3 or 4, wherein the canceller module is trained through the following steps:
selecting speech of different speakers from a database as near-end signal samples and far-end signal samples, respectively;
passing the far-end signal samples through a nonlinear processing module and then a room impulse response to obtain echo signal samples;
superimposing the near-end signal samples and the echo signal samples to obtain near-end mixed signal samples; and
training the canceller module with the near-end mixed signal samples and the far-end signal samples as its input and the near-end signal samples as the learning target of its minimum mean square error loss function.

6. The echo cancellation method of claim 5, wherein the canceller module is trained through the following steps:
calculating the energy of the echo signal samples and the energy of the near-end signal samples;
comparing the energy of the echo signal samples and the energy of the near-end signal samples with a predetermined threshold, respectively, to obtain a first value and a second value that serve as double-talk detection labels; and
using the near-end mixed signal samples and the far-end signal samples as the input of the canceller module and the double-talk detection labels as the learning target of its cross-entropy loss function.

7. An echo cancellation apparatus, comprising:
a receiving module, configured to receive a near-end mixed signal and a far-end signal of a corresponding reference channel;
an encoding module, configured to encode the near-end mixed signal and the far-end signal separately to obtain an encoded near-end mixed signal spectrogram and an encoded far-end signal spectrogram, and to concatenate the encoded near-end mixed signal spectrogram and the encoded far-end signal spectrogram to obtain a concatenated spectrogram;
a first extraction module, configured to extract multi-scale features from the concatenated spectrogram;
a second extraction module, configured to extract deep features from the encoded near-end mixed signal spectrogram, which comprises: inputting the encoded near-end mixed signal spectrogram into a first long short-term memory network in a canceller module, the first long short-term memory network extracting the deep features from the encoded near-end mixed signal spectrogram;
a calculation module, configured to calculate a weight for each layer of features of the multi-scale features according to the deep features, which comprises: using the deep features as the query and each layer of features of the multi-scale features as the key and the value, and calculating the weight of each layer of features of the multi-scale features with a multi-head attention mechanism;
a weighting module, configured to weight the corresponding features with the weight of each layer of features to obtain merged multi-scale features, which comprises: multiplying, through the multi-head attention mechanism, the weight of each layer of features by the corresponding features and summing the results to obtain the merged multi-scale features; and
an obtaining module, configured to obtain a near-end signal estimate according to the merged multi-scale features and the deep features, which comprises: concatenating the merged multi-scale features and the deep features and inputting the result into a second long short-term memory network in the canceller module to obtain the near-end signal estimate; inputting the near-end signal estimate into a mask estimation module to obtain a mask value for each time-frequency point of the pure near-end signal in the near-end mixed signal; multiplying the mask value of each time-frequency point by the encoded near-end mixed signal spectrogram to obtain a near-end signal spectrogram; and inputting the near-end signal spectrogram into a one-dimensional convolutional decoder to obtain a time-domain waveform of the near-end signal.

8. An electronic device, comprising:
a processor; and
a memory storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-6.

9. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-6.
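The training-data steps recited in claims 5 and 6 can be illustrated with a small pure-Python sketch. Several details are assumptions made only for this example: `tanh` soft clipping stands in for the unspecified nonlinear processing module, the room impulse response is a short list of taps, energy is computed as the mean square over the whole signal rather than per frame, and the threshold value `1e-3` is arbitrary.

```python
import math


def synthesize_example(near_end, far_end, rir, threshold=1e-3):
    """Build one training example from a near-end and a far-end sample.

    Returns (mixture, echo, labels), where labels is the pair
    (echo present, near-end present) used as double-talk detection labels.
    """
    # Assumed nonlinearity: soft clipping of the far-end signal.
    distorted = [math.tanh(v) for v in far_end]
    # Causal convolution with the room impulse response, truncated to the signal length.
    echo = [sum(rir[j] * distorted[n - j] for j in range(min(len(rir), n + 1)))
            for n in range(len(distorted))]
    # Near-end mixed signal = near-end signal + echo.
    mixture = [s + e for s, e in zip(near_end, echo)]

    def energy(x):
        return sum(v * v for v in x) / len(x)

    labels = (int(energy(echo) > threshold), int(energy(near_end) > threshold))
    return mixture, echo, labels


def mse(estimate, target):
    """Mean square error between an estimate and the clean near-end target."""
    return sum((a - b) ** 2 for a, b in zip(estimate, target)) / len(target)
```

In training, the mixture and the far-end sample would be the network inputs, the near-end sample the target of the mean square error loss, and the label pair the target of the cross-entropy loss.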
CN202110847066.4A | 2021-07-27 | 2021-07-27 | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium | Active | CN113299306B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110847066.4A, CN113299306B (en) | 2021-07-27 | 2021-07-27 | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110847066.4A, CN113299306B (en) | 2021-07-27 | 2021-07-27 | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number | Publication Date
CN113299306A (en) | 2021-08-24
CN113299306B (en) | 2021-10-15

Family

ID=77331041

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110847066.4A (Active, CN113299306B (en)) | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium | 2021-07-27 | 2021-07-27

Country Status (1)

Country | Link
CN (1) | CN113299306B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113870874B (en) * | 2021-09-23 | 2024-09-13 | 武汉大学 | Multi-feature fusion echo cancellation method and system based on self-attention transformation network
CN114242097B (en) * | 2021-12-01 | 2025-02-14 | 腾讯科技(深圳)有限公司 | Audio data processing method, device, medium, equipment and program product
CN117219107B (en) * | 2023-11-08 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and storage medium of echo cancellation model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7171008B2 (en) * | 2002-02-05 | 2007-01-30 | Mh Acoustics, Llc | Reducing noise in audio systems
CN109905793A (en) * | 2019-02-21 | 2019-06-18 | 电信科学技术研究院有限公司 | A kind of wind noise suppression method and device
CN110503972A (en) * | 2019-08-26 | 2019-11-26 | 北京大学深圳研究生院 | Speech enhancement method, system, computer equipment and storage medium
CN111862962A (en) * | 2020-07-20 | 2020-10-30 | 汪秀英 | Voice recognition method and system
CN112687288A (en) * | 2021-03-12 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium
CN112989106A (en) * | 2021-05-18 | 2021-06-18 | 北京世纪好未来教育科技有限公司 | Audio classification method, electronic device and storage medium
CN112989107A (en) * | 2021-05-18 | 2021-06-18 | 北京世纪好未来教育科技有限公司 | Audio classification and separation method and device, electronic equipment and storage medium
CN113160839A (en) * | 2021-04-16 | 2021-07-23 | 电子科技大学 | Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7171008B2 (en) * | 2002-02-05 | 2007-01-30 | Mh Acoustics, Llc | Reducing noise in audio systems
CN109905793A (en) * | 2019-02-21 | 2019-06-18 | 电信科学技术研究院有限公司 | A kind of wind noise suppression method and device
CN110503972A (en) * | 2019-08-26 | 2019-11-26 | 北京大学深圳研究生院 | Speech enhancement method, system, computer equipment and storage medium
CN111862962A (en) * | 2020-07-20 | 2020-10-30 | 汪秀英 | Voice recognition method and system
CN112687288A (en) * | 2021-03-12 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium
CN113160839A (en) * | 2021-04-16 | 2021-07-23 | 电子科技大学 | Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN112989106A (en) * | 2021-05-18 | 2021-06-18 | 北京世纪好未来教育科技有限公司 | Audio classification method, electronic device and storage medium
CN112989107A (en) * | 2021-05-18 | 2021-06-18 | 北京世纪好未来教育科技有限公司 | Audio classification and separation method and device, electronic equipment and storage medium

Also Published As

Publication number | Publication date
CN113299306A (en) | 2021-08-24

Similar Documents

Publication | Publication Date | Title
US11894014B2 (en) | | Audio-visual speech separation
Zhao et al. | | Monaural speech dereverberation using temporal convolutional networks with self attention
JP7258182B2 (en) | | Speech processing method, device, electronic device and computer program
CN113299306B (en) | | Echo cancellation method, apparatus, electronic device, and computer-readable storage medium
CN108597496B (en) | | Voice generation method and device based on generation type countermeasure network
KR20200115107A (en) | | System and method for acoustic echo cancelation using deep multitask recurrent neural networks
CN108417224B (en) | | Method and system for training and recognition of bidirectional neural network model
CN117121103A (en) | | Method and apparatus for real-time sound enhancement
CN112687288B (en) | | Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium
CN114974280B (en) | | Audio noise reduction model training method, audio noise reduction method and device
CN114333912B (en) | | Voice activation detection method, device, electronic equipment and storage medium
CN113345460B (en) | | Audio signal processing method, device, device and storage medium
CN114242098B (en) | | Voice enhancement method, device, equipment and storage medium
CN111710344A (en) | | A signal processing method, apparatus, device and computer-readable storage medium
CN118899005B (en) | | Audio signal processing method, device, computer equipment and storage medium
CN114373473A (en) | | Simultaneous noise reduction and dereverberation through low-delay deep learning
CN114242100A (en) | | Audio signal processing method, training method and device, equipment and storage medium thereof
CN114827363A (en) | | Method, device and readable storage medium for eliminating echo in call process
CN114495977A (en) | | Speech translation and model training method, device, electronic device and storage medium
CN114333893A (en) | | Voice processing method and device, electronic equipment and readable medium
CN112750469A (en) | | Method for detecting music in voice, voice communication optimization method and corresponding device
CN112687284A (en) | | Reverberation suppression method and device for reverberation voice
WO2020015546A1 (en) | | Far-field speech recognition method, speech recognition model training method, and server
US11924367B1 (en) | | Joint noise and echo suppression for two-way audio communication enhancement
CN115421099A (en) | | Voice direction of arrival estimation method and system

Legal Events

Date | Code | Title | Description
 | PB01 | Publication
 | SE01 | Entry into force of request for substantive examination
 | GR01 | Patent grant
