CN115101085A - Multi-speaker time-domain voice separation method for enhancing external attention through convolution - Google Patents


Info

Publication number
CN115101085A
Authority
CN
China
Prior art keywords
convolution
speech
module
external attention
separation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210647059.4A
Other languages
Chinese (zh)
Other versions
CN115101085B (en)
Inventor
闫河
张宇宁
李梦雪
王潇棠
刘建骐
刘宇涵
黄骏滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN202210647059.4A
Publication of CN115101085A
Application granted
Publication of CN115101085B
Legal status: Active
Anticipated expiration


Abstract

The invention relates to the field of speech processing, and in particular to a multi-speaker time-domain speech separation method based on convolution-enhanced external attention. The method comprises the following steps: S1, an encoder applies a convolution operation to the mixed speech of multiple speakers and converts it into a latent feature representation; S2, a separator built on a convolution-enhanced external attention module learns a speech mask; S3, the speech mask is multiplied by the latent feature representation output by the encoder, and the decoder reconstructs the waveform through a transposed-convolution (deconvolution) operation to obtain the separated speech. The method satisfies the requirements of speech separation for a small model and low latency, and exploits sequence modeling to achieve a better separation result; the convolution enhancement allows the external attention mechanism to learn richer features and correlations while retaining its fast separation speed; applied within a dual-path structure, the method better balances latency, model size, and separation quality.

Description

A Multi-Speaker Time-Domain Speech Separation Method with Convolution-Enhanced External Attention

Technical Field

The invention relates to the technical field of speech processing, and in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention.

Background

In practical applications, voice interaction often takes place with multiple people speaking at once. Interfering voices severely hinder a machine's ability to extract speech information; speech separation splits the mixed speech of multiple speakers so that the machine can effectively extract information for tasks such as speech recognition. Current deep-learning-based single-channel speech separation mainly adopts the time-domain audio separation network (TasNet) structure. Compared with traditional time-frequency-domain methods, TasNet uses an encoder-decoder framework to model the speech signal directly in the time domain and performs separation on the non-negative encoder output. This removes the frequency-decomposition step and reduces the separation problem to estimating a speech mask over the encoder output, which is then synthesized by the decoder, yielding better performance and lower latency.

BLSTM-TasNet is based on the LSTM network, but deep LSTM networks are computationally expensive, which limits their applicability on low-resource, low-power platforms. Chen Xiukai et al. (Luo Y, Mesgarani N. TasNet: time-domain audio separation network for real-time, single-channel speech separation [C]. 2018) replaced the LSTM with gated recurrent units (GRU), which can model long sequences and mitigate vanishing gradients but still carry a large parameter count. Luo Y et al. (Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019) then proposed the convolution-based Conv-TasNet, which computes masks with a temporal convolutional network composed of one-dimensional dilated convolution blocks, so the network can model the long-term dependencies of the speech signal while keeping a small parameter count and a faster separation speed. SSGAN performs speech separation with jointly trained generative adversarial networks; both the generator and the discriminator are built from fully convolutional neural networks, which effectively extract and restore high-dimensional features of the time-domain waveform. However, a one-dimensional convolutional neural network (CNN) cannot perform utterance-level sequence modeling when its receptive field is smaller than the sequence length, and a recurrent neural network (RNN) cannot effectively model long sequences because of optimization difficulties. Dual-Path RNN (Luo Y, Chen Z, Yoshioka T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation [C] // ICASSP 2020, IEEE, 2020: 46-50, doi: 10.1109/ICASSP40776.2020.9054266) therefore proposed a dual-path structure that divides the long audio input into smaller chunks and iteratively applies intra-chunk and inter-chunk operations to the RNN in a deep model, alleviating the optimization difficulty of RNNs and obtaining better performance with a smaller model. The modeling of speech sequences by RNN-based separation models depends on context only indirectly, and the passing of intermediate states limits further improvement of separation performance. Chen J et al. (Chen J, Mao Q, Liu D. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation [J]. arXiv preprint arXiv:2007.13975, 2020) introduced an improved Transformer into the dual-path structure: by adding an LSTM to the feed-forward layer, it learns the order information of the speech sequence without positional encoding and interacts with it directly, achieving direct context awareness of the speech sequence, but at the cost of very slow inference. For speech separation under limited computing resources, the SuDoRM-RF network (Tzinis E, Wang Z, Smaragdis P. Sudo rm -rf: Efficient networks for universal audio source separation [C] // 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020: 1-6) uses UNet-like one-dimensional convolution blocks for repeated down-sampling and up-sampling to enlarge the network's receptive field, extracts information at multiple resolutions, and achieves fast inference; the extraction of multi-scale features combined with the enlarged receptive field makes SuDoRM-RF superior to other convolutional models, but it also carries a large parameter count.

Given the prior art, it remains challenging to achieve a strong separation result with a small model size and a fast separation speed, and to better satisfy and balance the requirements of the speech-recognition front end for model size, latency, and separation quality. To this end, we propose the multi-speaker time-domain speech separation method with convolution-enhanced external attention of the present invention.

Summary of the Invention

The purpose of the present invention is to provide a multi-speaker time-domain speech separation method with convolution-enhanced external attention.

To achieve the above object, the present invention adopts the following technical solution:

A multi-speaker time-domain speech separation method with convolution-enhanced external attention, comprising the following steps:

S1. An encoder applies a convolution operation to the multi-speaker mixed speech and converts it into a latent feature representation.

Denote the multi-speaker mixed speech as x(t) ∈ ℝ^(1×T), where ℝ is the real field and T is the speech length;

denote the latent feature representation as h ∈ ℝ^(CE×L), where CE is the number of encoder channels and L is the length of the latent feature representation;

S2. A speech mask is learned by a separator built on the convolution-enhanced external attention module;

S3. The speech mask is multiplied by the latent feature representation output by the encoder, and the decoder reconstructs the waveform through a transposed-convolution (deconvolution) operation to obtain the separated speech.
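The overall flow of steps S1-S3 can be illustrated with a minimal PyTorch sketch. The kernel size, stride, channel count, and the sigmoid mask head used as a stand-in separator below are illustrative assumptions, not values taken from this patent; the real separator is the convolution-enhanced external attention network described in the following steps.

```python
import torch
import torch.nn as nn

class TasNetSkeleton(nn.Module):
    """Minimal encoder -> mask -> decoder pipeline (steps S1-S3), with a placeholder separator."""
    def __init__(self, num_sources=2, enc_channels=256, kernel_size=16, stride=8):
        super().__init__()
        self.num_sources = num_sources
        # S1: 1-D convolutional encoder turns the waveform into a latent representation h
        self.encoder = nn.Conv1d(1, enc_channels, kernel_size, stride=stride, bias=False)
        # placeholder separator: the patent uses the ExConformer dual-path separator here
        self.separator = nn.Sequential(
            nn.Conv1d(enc_channels, enc_channels * num_sources, 1),
            nn.Sigmoid(),  # produces one mask per source
        )
        # S3: transposed convolution (deconvolution) reconstructs the waveform
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):            # mixture: (batch, samples)
        x = mixture.unsqueeze(1)           # (batch, 1, T)
        h = self.encoder(x)                # S1: latent representation (batch, C_E, L)
        masks = self.separator(h)          # S2: (batch, C_E * num_sources, L)
        masks = masks.view(h.size(0), self.num_sources, h.size(1), h.size(2))
        est_sources = []
        for i in range(self.num_sources):
            d_i = masks[:, i] * h          # S3: element-wise mask application
            est_sources.append(self.decoder(d_i).squeeze(1))
        return torch.stack(est_sources, dim=1)   # (batch, num_sources, samples)

mix = torch.randn(2, 8000)                 # two 1-second mixtures at 8 kHz
print(TasNetSkeleton()(mix).shape)         # torch.Size([2, 2, 8000])
```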

Further, S2 mainly comprises the following steps:

Global normalization and convolution: map the latent feature representation h to an intermediate representation h' ∈ ℝ^(C×L), where C is the number of channels;

Segmentation and stacking: split the intermediate representation h' into S overlapping smaller chunks of length K and stack them into a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the chunk length and S is the number of overlapping chunks;
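A minimal sketch of this segmentation-and-stacking step, assuming a 50% overlap between consecutive chunks (the patent states only that the chunks overlap, so the hop size here is an assumption):

```python
import torch
import torch.nn.functional as F

def segment(h_prime: torch.Tensor, chunk_len: int) -> torch.Tensor:
    """Split h' (C, L) into S overlapping chunks of length K and stack them to (C, K, S).
    A 50% overlap (hop = K // 2) is assumed here."""
    C, L = h_prime.shape
    hop = chunk_len // 2
    # pad so that (L_padded - chunk_len) is a multiple of hop
    remainder = (L - chunk_len) % hop
    pad = (hop - remainder) % hop
    h_pad = F.pad(h_prime, (0, pad))
    # unfold the time axis into overlapping windows: (C, S, K) -> (C, K, S)
    chunks = h_pad.unfold(dimension=1, size=chunk_len, step=hop)
    return chunks.permute(0, 2, 1).contiguous()

h_prime = torch.randn(64, 999)       # (C, L)
T3d = segment(h_prime, chunk_len=100)
print(T3d.shape)                     # torch.Size([64, 100, 19])
```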

ExConformer transform: iteratively apply a transform composed of B ExConformer modules to the intra-chunk dimension K and the inter-chunk dimension S of the three-dimensional tensor T; the output Tb of the intra-chunk processing serves as the input of the inter-chunk processing;

that is, the output of the (b-1)-th ECBlock serves as the input of the b-th block, b = 1, ..., B, expressed as follows:

Tb = ECBlockintra(Ub-1)

Ub = ECBlockinter(Tb)
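The alternating intra-chunk and inter-chunk processing can be sketched as follows. The ECBlockStub used here is a deliberately trivial stand-in (a residual linear layer) so the example stays self-contained; the actual block is the ExConformer described below. The sketch only illustrates how the K and S dimensions are exchanged between the two passes.

```python
import torch
import torch.nn as nn

class ECBlockStub(nn.Module):
    """Stand-in for an ExConformer block: processes sequences of shape (batch, seq_len, C)."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(channels, channels)
    def forward(self, x):
        return x + self.proj(x)

class DualPath(nn.Module):
    def __init__(self, channels, num_blocks):
        super().__init__()
        self.intra = nn.ModuleList([ECBlockStub(channels) for _ in range(num_blocks)])
        self.inter = nn.ModuleList([ECBlockStub(channels) for _ in range(num_blocks)])

    def forward(self, u):                       # u: (batch, C, K, S)
        B_, C, K, S = u.shape
        for intra, inter in zip(self.intra, self.inter):
            # intra-chunk pass: sequences of length K, one per chunk
            t = u.permute(0, 3, 2, 1).reshape(B_ * S, K, C)
            t = intra(t).reshape(B_, S, K, C).permute(0, 3, 2, 1)
            # inter-chunk pass: sequences of length S, one per intra-chunk position
            v = t.permute(0, 2, 3, 1).reshape(B_ * K, S, C)
            u = inter(v).reshape(B_, K, S, C).permute(0, 3, 1, 2)
        return u                                # (batch, C, K, S)

u0 = torch.randn(2, 64, 100, 19)
print(DualPath(64, num_blocks=2)(u0).shape)     # torch.Size([2, 64, 100, 19])
```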

Dimension transform: apply a two-dimensional convolution to the output UB of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional tensor Y:

Y = Conv2D(UB)

where Conv2D denotes the two-dimensional convolution operation;

Multi-channel information aggregation: an overlap-add operation converts the three-dimensional tensor Y into an intermediate latent representation yi for each source; a one-dimensional convolution and PReLU are applied to y to aggregate the information across channels, and the estimated speech mask mi of the i-th source is obtained as:

mi = PReLU(Conv1D(yi))
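A minimal sketch of the overlap-add step that inverts the earlier segmentation, followed by the Conv1D + PReLU mask head; the 50% overlap and the layer sizes are assumptions carried over from the chunking sketch above.

```python
import torch
import torch.nn as nn

def overlap_add(chunks: torch.Tensor, hop: int) -> torch.Tensor:
    """Inverse of the segmentation step: chunks (C, K, S) -> sequence (C, L_padded),
    summing the overlapping regions."""
    C, K, S = chunks.shape
    out = torch.zeros(C, (S - 1) * hop + K)
    for s in range(S):
        out[:, s * hop : s * hop + K] += chunks[:, :, s]
    return out

# assumed sizes: C channels, chunk length K, S chunks, 50% overlap
y_i = overlap_add(torch.randn(64, 100, 19), hop=50)       # (64, 1000)

# mask head: Conv1D followed by PReLU, as described for the i-th source
mask_head = nn.Sequential(nn.Conv1d(64, 64, kernel_size=1), nn.PReLU())
m_i = mask_head(y_i.unsqueeze(0))                          # (1, 64, 1000)
print(m_i.shape)
```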

Further, the ExConformer module is composed of a positional convolution module, an external attention module, a convolution module, and a feed-forward network module, with a residual connection added around each module;

if the input of the i-th ExConformer module is defined as xi, its output yi is expressed as follows:

x̃i = xi + PosConv(xi)

x′i = x̃i + ExternalAttention(x̃i)

x″i = x′i + Conv(x′i)

yi = Layernorm(x″i + FFN(x″i)).
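The four residual steps above can be wired together as in the following sketch. The sub-module bodies are deliberately simple placeholders so the example stays self-contained; the positional convolution and external attention internals are sketched separately after their descriptions below, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExConformerWiring(nn.Module):
    """Residual wiring of an ExConformer block: PosConv -> ExternalAttention -> Conv -> FFN."""
    def __init__(self, d_model=64):
        super().__init__()
        # placeholder sub-modules; see the dedicated sketches for PosConv and external attention
        self.pos_conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.ext_attn = nn.Linear(d_model, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.ffn = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                                          # x: (batch, seq, d_model)
        x = x + self.pos_conv(x.transpose(1, 2)).transpose(1, 2)   # x~_i = x_i + PosConv(x_i)
        x = x + self.ext_attn(x)                                   # x'_i = x~_i + ExternalAttention(x~_i)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)       # x''_i = x'_i + Conv(x'_i)
        return self.norm(x + self.ffn(x))                          # y_i = Layernorm(x''_i + FFN(x''_i))

print(ExConformerWiring()(torch.randn(2, 100, 64)).shape)          # torch.Size([2, 100, 64])
```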

Further, the activation function of the convolution module and the feed-forward network module of the ExConformer module uses Penalized_tanh.
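Penalized_tanh is not a built-in PyTorch activation; a commonly used formulation scales the negative branch of tanh by 0.25, and a sketch under that assumption is:

```python
import torch

def penalized_tanh(x: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """Penalized tanh: tanh(x) for x > 0, alpha * tanh(x) otherwise (alpha = 0.25 assumed)."""
    t = torch.tanh(x)
    return torch.where(x > 0, t, alpha * t)

print(penalized_tanh(torch.tensor([-2.0, 0.0, 2.0])))   # tensor([-0.2410,  0.0000,  0.9640])
```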

Further, the positional convolution module is composed of several stacked one-dimensional convolutions with zero-padding, layer normalization, and a ReLU activation layer.
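A sketch of the positional convolution module following this description: stacked zero-padded one-dimensional convolutions, each followed by layer normalization and ReLU. The number of layers and the kernel size are assumptions.

```python
import torch
import torch.nn as nn

class PosConv(nn.Module):
    """Positional convolution: stacked zero-padded Conv1d + LayerNorm + ReLU layers."""
    def __init__(self, d_model=64, num_layers=2, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])

    def forward(self, x):                                    # x: (batch, seq, d_model)
        for conv, norm in zip(self.convs, self.norms):
            x = torch.relu(norm(conv(x.transpose(1, 2)).transpose(1, 2)))
        return x

print(PosConv()(torch.randn(2, 100, 64)).shape)              # torch.Size([2, 100, 64])
```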

Further, the steps of the external attention module are as follows:

adjust the number of channels of the input feature with a one-dimensional convolution;

use a linear layer to construct the memory Mk and learn the attention map A between the query vectors:

A = F Mk^T

where F is the input feature map and Mk^T is the transpose of Mk;

apply softmax and L1-norm normalization (L1_Norm) to it;

use a linear layer to construct the memory Mv and generate the refined feature map, expressed as follows:

Fout = A Mv

Apply a Dropout operation to the output.
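Following the five steps above, a minimal external attention module might look like the sketch below; the memory size, dropout rate, and channel dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External attention: channel-adjusting Conv1d, M_k memory, softmax + L1 norm, M_v memory, Dropout."""
    def __init__(self, in_channels=64, d_model=64, mem_size=16, dropout=0.1):
        super().__init__()
        self.channel_adjust = nn.Conv1d(in_channels, d_model, kernel_size=1)  # step 1
        self.mk = nn.Linear(d_model, mem_size, bias=False)                    # M_k memory
        self.mv = nn.Linear(mem_size, d_model, bias=False)                    # M_v memory
        self.dropout = nn.Dropout(dropout)

    def forward(self, f):                               # f: (batch, in_channels, seq)
        f = self.channel_adjust(f).transpose(1, 2)      # (batch, seq, d_model)
        attn = self.mk(f)                               # A = F @ M_k^T, shape (batch, seq, mem_size)
        attn = torch.softmax(attn, dim=1)               # softmax over the sequence positions
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # L1 norm over the memory slots
        f_out = self.mv(attn)                           # F_out = A @ M_v
        return self.dropout(f_out)                      # (batch, seq, d_model)

print(ExternalAttention()(torch.randn(2, 64, 100)).shape)   # torch.Size([2, 100, 64])
```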

The present invention has at least the following beneficial effects:

1. By redesigning the Conformer, the present invention meets the requirements of speech separation for a small model and low latency, and exploits its sequence-modeling strength to achieve a better separation result.

2. By introducing the external attention mechanism into the speech separation task, the present invention enables the external attention mechanism to learn more features and correlations while retaining its advantage of fast separation.

3. The present invention encodes the input with a convolutional neural network so that each frame contains contextual information, and the convolutional encoding makes the positional encoding trainable.

4. Applying the proposed multi-speaker time-domain speech separation method with convolution-enhanced external attention within a dual-path structure better balances timeliness, model size, and separation quality.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is the overall network architecture of the present invention;

Figure 2 is the ExConformer module of the present invention.

Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

Referring to Figure 1, the present invention is a multi-speaker time-domain speech separation method with convolution-enhanced external attention, comprising the following steps:

1. The encoder applies a convolution operation to the multi-speaker mixed speech and converts it into its latent feature representation h;

2. A speech mask is learned by a separator built on the convolution-enhanced external attention module;

3. The speech mask is multiplied by the latent feature representation output by the encoder, and the decoder reconstructs the waveform through a transposed-convolution operation to obtain the separated speech.

In the above step 2, the following operations are performed:

First, the latent representation h is mapped to a new feature space by global normalization and a one-dimensional convolution, yielding an intermediate representation h' ∈ ℝ^(C×L). This changes the number of channels of the latent representation h, increases the network depth without changing the receptive field, and strengthens the abstract expressive power of the network's local modules.

h' = Conv1D(GlobLN(h))

The global layer normalization GlobLN(·) defines two learnable parameters γ ∈ ℝ^(C×1) and β ∈ ℝ^(C×1). Replacing layer normalization with global layer normalization significantly improves the convergence of the model, because the gradient statistics are interdependent across channels. Applying global normalization to an input matrix F ∈ ℝ^(C×L) can be defined as:

GlobLN(F) = γ ⊙ (F - E[F]) / sqrt(Var[F] + ε) + β

where E[F] and Var[F] are the mean and variance of F computed over both the channel and time dimensions, and ε is a small constant for numerical stability.
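A sketch of global layer normalization with the statistics computed over both the channel and time axes of F, as in the formula above (the value of ε is an assumption):

```python
import torch
import torch.nn as nn

class GlobLN(nn.Module):
    """Global layer normalization: mean/variance over channel and time, learnable gamma/beta per channel."""
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(channels, 1))   # gamma in R^(C x 1)
        self.beta = nn.Parameter(torch.zeros(channels, 1))   # beta in R^(C x 1)
        self.eps = eps

    def forward(self, f):                                    # f: (batch, C, L)
        mean = f.mean(dim=(1, 2), keepdim=True)
        var = f.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (f - mean) / torch.sqrt(var + self.eps) + self.beta

x = torch.randn(2, 64, 999)
print(GlobLN(64)(x).shape)                                   # torch.Size([2, 64, 999])
```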

Next, the intermediate representation h' is split into S overlapping smaller chunks of length K and stacked into a three-dimensional tensor T ∈ ℝ^(C×K×S). A transform composed of B ExConformer modules (ECBlocks) is then iteratively applied to the intra-chunk dimension (K) and the inter-chunk dimension (S) of T; the output Tb of the intra-chunk processing serves as the input of the inter-chunk processing, and the output of the (b-1)-th ECBlock serves as the input of the b-th block, b = 1, ..., B.

Tb = ECBlockintra(Ub-1)

Ub = ECBlockinter(Tb)

See Figure 2 for the ExConformer module.

A two-dimensional convolution is applied to the output UB of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional tensor Y:

Y = Conv2D(UB)

An overlap-add operation then converts the three-dimensional tensor Y into an intermediate latent representation yi for each source; a one-dimensional convolution and PReLU are applied to y to aggregate the information across channels, and the estimated mask mi of the i-th source is obtained as:

mi = PReLU(Conv1D(yi))

Finally, the encoded latent representation h is multiplied by the corresponding mask to obtain the estimated latent representation di of each source:

di = mi ⊙ h

where a ⊙ b denotes element-wise multiplication of two tensors of the same shape.

Here, the ExConformer module is the convolution-enhanced external attention module. The Conformer learns position-based local information with a convolution-augmented self-attention mechanism and uses content-based global interaction; it performs well on sequence-modeling tasks, but it does not consider the relations between different samples, and the combination of self-attention and convolution gives the model a large parameter count and slow inference, which makes it hard to apply to speech separation. The external attention mechanism, by contrast, uses two shared, learnable external memories and can implicitly learn features from other samples; linear complexity can be achieved by controlling the memory size, so inference is fast, but because its structure is simple its learning ability falls short of self-attention. The present invention introduces external attention into the Conformer and uses convolution to enhance external attention so as to model the relations between different speech segments, reducing the parameter count while speeding up inference; it also uses Penalized_tanh, which behaves more stably on sequence tasks, as the activation function, thereby proposing a convolution-enhanced external attention module (ExConformer). The ExConformer module is composed of a positional convolution module (PosConv Module), an external attention module (External Attention Module), a convolution module (Convolution Module), and a feed-forward network module (FeedForward Module); similar to the Conformer, a residual connection is added around each module. If the input of the i-th ExConformer block is defined as xi, its output yi is obtained as follows:

x̃i = xi + PosConv(xi)

x′i = x̃i + ExternalAttention(x̃i)

x″i = x′i + Conv(x′i)

yi = Layernorm(x″i + FFN(x″i))

The Conformer uses the two half-step feed-forward layers introduced in Macaron-Net and sinusoidal positional encoding in its self-attention mechanism. Recent work on speech recognition has found that convolutional positional encoding achieves better results than sinusoidal positional encoding, so the ExConformer also adopts convolutional positional encoding, which makes the positional encoding trainable. Experiments also show that the gain from the two half-step feed-forward layers is not significant, so in the present invention the first half-step feed-forward layer of the Conformer is replaced with the convolutional positional encoding module and the second half-step feed-forward layer is replaced with an ordinary feed-forward layer. Convolution with zero-padding is used to model the intra-chunk and inter-chunk dimensions separately in the separator and to learn their positional context information, which is faster and uses fewer parameters.

Experiments on the multi-speaker time-domain speech separation technique are described as follows:

1. Datasets and Evaluation Metrics

The following experiments focus on two-speaker speech separation and use Libri2Mix, which is mixed from the public LibriSpeech dataset following the settings of the Libri2Mix paper. The Libri2Mix training set is mixed from the data in train-100 and contains 13,900 utterances; the validation set and the test set contain 3,000 utterances each; each utterance is mixed at a random signal-to-noise ratio between -5 dB and 5 dB and is down-sampled to 8 kHz during preprocessing. Compared with smaller datasets such as WSJ0 and TIMIT, LibriSpeech contains more speakers: the train-100 training set has 251 speakers, and the validation and test sets have 40 speakers each. Experimental results on the Libri2Mix dataset mixed from LibriSpeech therefore generalize more reliably to new scenarios and reveal general modeling trends.

The evaluation metric is the scale-invariant signal-to-noise ratio (SI-SNR), calculated as follows:

x_target = (⟨x̂, x⟩ · x) / ‖x‖²

e_noise = x̂ - x_target

SI-SNR = 10 · log10(‖x_target‖² / ‖e_noise‖²)

where x̂ and x are the estimated target-speaker speech and the clean target-speaker speech, respectively, and both are zero-mean normalized before the calculation. In general, a larger SI-SNR value indicates better speech separation quality.
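The SI-SNR metric can be computed in a few lines of PyTorch; both signals are zero-mean normalized first, as stated above:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB between an estimated and a clean target waveform (1-D tensors)."""
    estimate = estimate - estimate.mean()        # zero-mean normalization
    target = target - target.mean()
    # projection of the estimate onto the target
    x_target = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    e_noise = estimate - x_target
    return 10 * torch.log10(x_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))

clean = torch.randn(8000)
noisy_estimate = clean + 0.1 * torch.randn(8000)
print(si_snr(noisy_estimate, clean))             # roughly 20 dB for this noise level
```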

2. Ablation Experiments

We use the Conformer and the convolution-enhanced external attention ExConformer as the separator module of a TasNet-based separation network to verify the effect of the external attention mechanism on speech separation. Here, Conformer uses convolution-augmented self-attention and two half-step feed-forward layers (FFN); ExConformer-PosConv denotes the structure without the convolutional positional encoding; ExConformer-FFN denotes the structure without the feed-forward layer, using only the convolution and external attention modules; ExConformer with Swish uses the Swish activation function from the Conformer; and ExConformer, the model proposed here, replaces Swish with Penalized_tanh. The experimental results are shown in Table 1. All experiments were run on the same server, and separation speed is measured as the number of test-set speech segments processed per second, which reflects how fast separation is. We consider separation speed, rather than training speed, the better indicator of a model's timeliness in practice.

The results in Table 1 show that the Conformer has a larger parameter count and the slowest separation speed, while the convolution-enhanced external attention ExConformer achieves a better separation result with fewer parameters and twice the separation speed. The results in rows 2 and 3 of the table show that the convolutional positional encoding and the feed-forward layer are both necessary; these two components improve the ExConformer significantly. Replacing the Swish function with Penalized_tanh also improves the separation result.

Table 1. Comparative experiments with different separator modules


3. Comparative Experiments

We compare the present invention with several existing time-domain speech separation methods on the Libri2Mix dataset. We reproduced part of the code on our own device to test the separation speed of these models. The BLSTM-TasNet, Conv-TasNet, and SuDoRM-RF++ models do not split the latent representation of the mixed speech into smaller overlapping chunks stacked into a three-dimensional tensor for intra- and inter-chunk operations, so they separate faster but also carry more parameters. DPRNN, DPTNet, and the present invention all adopt overlapping segmentation with iterative intra- and inter-chunk operations, which, compared with the previous models, reduces the parameter count at the cost of separation speed. As the table shows, DPTNet achieves the best result with few parameters, but it is also the slowest to separate. The present invention achieves a separation result comparable to DPTNet with the smallest model size and twice its separation speed.

Table 2. Comparison of different methods on the Libri2Mix dataset


In summary, the following effects are achieved:

1. The Conformer is improved. The Conformer performs well on sequence-modeling tasks, but for speech separation, a front-end technology for speech recognition, its large parameter count and long inference time cannot satisfy the requirements on model size and timeliness. This work improves the Conformer so that it meets the requirements of speech separation for a small model and low latency, and exploits its sequence-modeling strength to achieve a better separation result.

2. The external attention mechanism is introduced into the speech separation task. The external attention mechanism replaces self-attention with two cascaded linear layers and normalization layers; unlike self-attention, it implicitly considers features from other samples through shared weight units, and linear complexity can be achieved by controlling the memory size, so separation is fast. However, the relatively simple external attention mechanism cannot learn features and correlations comparable to self-attention. The present invention combines it with the Conformer, whose convolution module enhances the external attention mechanism to learn more features and correlations while retaining its advantage of fast separation.

3. Convolutional encoding is used. The input is encoded with a convolutional neural network; after this encoding, each frame contains contextual information, and the convolutional encoding makes the positional encoding trainable.

4. Applying the newly proposed multi-speaker time-domain speech separation method with convolution-enhanced external attention in a dual-path structure better balances timeliness, model size, and separation quality.

The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the above embodiments and the description only illustrate the principles of the present invention, and various changes and improvements can be made without departing from the spirit and scope of the present invention, all of which fall within the scope of the claimed invention. The scope of protection claimed by the present invention is defined by the appended claims and their equivalents.

Claims (6)

1. A multi-speaker time-domain speech separation method with convolution-enhanced external attention, characterized by comprising the following steps:

S1. an encoder applies a convolution operation to the multi-speaker mixed speech and converts it into a latent feature representation;

denote the multi-speaker mixed speech as x(t) ∈ ℝ^(1×T), where ℝ is the real field and T is the speech length;

denote the latent feature representation as h ∈ ℝ^(CE×L), where CE is the number of encoder channels and L is the length of the latent feature representation;

S2. a speech mask is learned by a separator built on the convolution-enhanced external attention module;

S3. the speech mask is multiplied by the latent feature representation output by the encoder, and the decoder reconstructs the waveform through a transposed-convolution operation to obtain the separated speech.

2. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 1, characterized in that S2 mainly comprises the following steps:

global normalization and convolution: map the latent feature representation h to an intermediate representation h' ∈ ℝ^(C×L), where C is the number of channels;

segmentation and stacking: split the intermediate representation h' into S overlapping smaller chunks of length K and stack them into a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the chunk length and S is the number of overlapping chunks;

ExConformer transform: iteratively apply a transform composed of B ExConformer modules to the intra-chunk dimension K and the inter-chunk dimension S of the three-dimensional tensor T; the output Tb of the intra-chunk processing serves as the input of the inter-chunk processing;

that is, the output of the (b-1)-th ECBlock serves as the input of the b-th block, b = 1, ..., B, expressed as follows:

Tb = ECBlockintra(Ub-1)

Ub = ECBlockinter(Tb)

dimension transform: apply a two-dimensional convolution to the output UB of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional tensor Y:

Y = Conv2D(UB)

multi-channel information aggregation: an overlap-add operation converts the three-dimensional tensor Y into an intermediate latent representation yi for each source; a one-dimensional convolution and PReLU are applied to y to aggregate the information across channels, and the estimated speech mask mi of the i-th source is obtained as:

mi = PReLU(Conv1D(yi))

3. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 2, characterized in that the ExConformer module is composed of a positional convolution module, an external attention module, a convolution module, and a feed-forward network module, with a residual connection added around each module;

if the input of the i-th ExConformer module is defined as xi, its output yi is expressed as follows:

x̃i = xi + PosConv(xi)

x′i = x̃i + ExternalAttention(x̃i)

x″i = x′i + Conv(x′i)

yi = Layernorm(x″i + FFN(x″i)).

4. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the activation function of the convolution module and the feed-forward network module of the ExConformer module uses Penalized_tanh.

5. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the positional convolution module is composed of several stacked one-dimensional convolutions with zero-padding, layer normalization, and a ReLU activation layer.

6. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the steps of the external attention module are as follows:

adjust the number of channels of the input feature with a one-dimensional convolution;

use a linear layer to construct the memory Mk and learn the attention map A between the query vectors:

A = F Mk^T

where F is the input feature map and Mk^T is the transpose of Mk;

apply softmax and L1-norm normalization (L1_Norm) to it;

use a linear layer to construct the memory Mv and generate the refined feature map, expressed as follows:

Fout = A Mv

apply a Dropout operation to the output.
CN202210647059.4A, filed 2022-06-09: A multi-speaker time-domain speech separation method with convolution-enhanced external attention (Active, granted as CN115101085B (en))

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210647059.4A (granted as CN115101085B) | 2022-06-09 | 2022-06-09 | A multi-speaker time-domain speech separation method with convolution-enhanced external attention

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210647059.4A (granted as CN115101085B) | 2022-06-09 | 2022-06-09 | A multi-speaker time-domain speech separation method with convolution-enhanced external attention

Publications (2)

Publication Number | Publication Date
CN115101085A | 2022-09-23
CN115101085B (en) | 2024-08-30

Family

ID=83289286

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210647059.4A (Active, granted as CN115101085B) | A multi-speaker time-domain speech separation method with convolution-enhanced external attention | 2022-06-09 | 2022-06-09

Country Status (1)

Country | Link
CN (1) | CN115101085B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115881156A (en)* | 2022-12-09 | 2023-03-31 | 厦门大学 | A Multi-Scale-Based Multi-modal Time-Domain Speech Separation Method
CN115910085A (en)* | 2022-11-03 | 2023-04-04 | 山东大学 | Speech separation method and system combining time domain and time-frequency domain
CN116403599A (en)* | 2023-06-07 | 2023-07-07 | 中国海洋大学 | Efficient voice separation method and model building method thereof
CN116612779A (en)* | 2023-05-29 | 2023-08-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | A single-channel speech separation method based on deep learning
CN116805061A (en)* | 2023-05-10 | 2023-09-26 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing
CN117312946A (en)* | 2023-09-26 | 2023-12-29 | 闽江学院 | Underwater sound signal identification method based on multi-branch trunk external attention network
WO2024087128A1 (en)* | 2022-10-24 | 2024-05-02 | 大连理工大学 | Multi-scale hybrid attention mechanism modeling method for predicting remaining useful life of aero engine
WO2024140261A1 (en)* | 2022-12-28 | 2024-07-04 | 浙江阿里巴巴机器人有限公司 | Speech separation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113053407A (en)* | 2021-02-06 | 2021-06-29 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for multiple speakers
CN114495973A (en)* | 2022-01-25 | 2022-05-13 | 中山大学 | Special person voice separation method based on double-path self-attention mechanism
WO2022112594A2 (en)* | 2020-11-30 | 2022-06-02 | Dolby International AB | Robust intrusive perceptual audio quality assessment based on convolutional neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2022112594A2 (en)* | 2020-11-30 | 2022-06-02 | Dolby International AB | Robust intrusive perceptual audio quality assessment based on convolutional neural networks
CN113053407A (en)* | 2021-02-06 | 2021-06-29 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for multiple speakers
CN114495973A (en)* | 2022-01-25 | 2022-05-13 | 中山大学 | Special person voice separation method based on double-path self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUNING ZHANG et al.: "Convolution-augmented External Attention Model for Time Domain Speech Separation", Proc. of SPIE, vol. 12593, 31 December 2023 (2023-12-31), pages 125930 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2024087128A1 (en)* | 2022-10-24 | 2024-05-02 | 大连理工大学 | Multi-scale hybrid attention mechanism modeling method for predicting remaining useful life of aero engine
CN115910085A (en)* | 2022-11-03 | 2023-04-04 | 山东大学 | Speech separation method and system combining time domain and time-frequency domain
CN115881156A (en)* | 2022-12-09 | 2023-03-31 | 厦门大学 | A Multi-Scale-Based Multi-modal Time-Domain Speech Separation Method
WO2024140261A1 (en)* | 2022-12-28 | 2024-07-04 | 浙江阿里巴巴机器人有限公司 | Speech separation method
CN116805061A (en)* | 2023-05-10 | 2023-09-26 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing
CN116805061B (en)* | 2023-05-10 | 2024-04-12 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing
CN116612779A (en)* | 2023-05-29 | 2023-08-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | A single-channel speech separation method based on deep learning
CN116403599A (en)* | 2023-06-07 | 2023-07-07 | 中国海洋大学 | Efficient voice separation method and model building method thereof
CN116403599B (en)* | 2023-06-07 | 2023-08-15 | 中国海洋大学 | Efficient voice separation method and model building method thereof
CN117312946A (en)* | 2023-09-26 | 2023-12-29 | 闽江学院 | Underwater sound signal identification method based on multi-branch trunk external attention network

Also Published As

Publication number | Publication date
CN115101085B (en) | 2024-08-30

Similar Documents

Publication | Title
CN115101085B (en) | A multi-speaker time-domain speech separation method with convolution-enhanced external attention
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program
CN109919204B (en) | A Deep Learning Clustering Method for Noisy Images
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system
CN114861907B (en) | Data computing method, device, storage medium and equipment
CN111443328B (en) | Sound event detection and localization method based on deep learning
CN114067819B (en) | Speech enhancement method based on cross-layer similarity knowledge distillation
CN113836319B (en) | Knowledge Completion Method and System for Integrating Entity Neighbors
Liu et al. | Time delay recurrent neural network for speech recognition
CN114420108A (en) | A speech recognition model training method, device, computer equipment and medium
Yuan et al. | Speech separation using convolutional neural network and attention mechanism
CN112257464B (en) | Machine translation decoding acceleration method based on small intelligent mobile equipment
CN118351307A (en) | Multi-domain attention-enhanced three-dimensional point cloud semantic segmentation method and device
CN116863950A (en) | A single-channel speech enhancement method based on multi-attention mechanism
Gao et al. | Generalized pyramid co-attention with learnable aggregation net for video question answering
Udeh et al. | Improved ShuffleNet V2 network with attention for speech emotion recognition
CN119889338A (en) | Multi-channel voice enhancement method based on multi-scale space information and spectrum feature fusion
CN114564568A (en) | Dialogue state tracking method and system based on knowledge enhancement and context awareness
CN118737202A (en) | A multi-scale speech emotion recognition method, device, medium and product
Yu et al. | An end-to-end speech separation method based on features of two domains
CN118839766A (en) | Multi-edge device-oriented transform model collaborative reasoning method
CN113408539A (en) | Data identification method and device, electronic equipment and storage medium
Zeng et al. | Speech enhancement of complex convolutional recurrent network with attention
CN112396153A (en) | Accelerating device and calculating device for deep learning
CN117524243A (en) | Noise perception time domain voice separation method based on joint constraint and shared encoder

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
