


Technical Field
The present invention relates to the field of speech separation, and in particular to a speaker-specific speech separation method based on a dual-path self-attention mechanism.
Background
Speaker-specific separation technology refers to extracting the target speaker's voice from a mixed corpus containing noise and interfering speech from multiple speakers, given a reference recording of the target speaker. Current deep-learning-based speaker-specific separation algorithms fall into two main schools: time-frequency-domain and time-domain methods. Time-frequency-domain methods ignore the importance of phase information when reconstructing the speech signal and operate on pre-extracted magnitude spectra, which to some extent limits the separation network's ability to learn features directly from the raw speech signal. In time-domain methods, the signal sequence produced by the encoder is usually much longer than the sequence obtained from a conventional short-time Fourier transform (STFT), which makes modeling and learning more difficult for the network. In addition, for both time-frequency-domain and time-domain methods, separation quality degrades when the training and test conditions do not match; a mismatch between the signal-to-noise ratio of the test data and that of the training data (in the separation task, the ratio of the energy of the target speaker's voice to that of the interfering voices, hereafter SNR) is a major factor affecting separation performance.
Summary of the Invention
To solve the above technical problems, the object of the present invention is to provide a speaker-specific speech separation method based on a dual-path self-attention mechanism that can quickly and accurately extract the target speaker's voice from a mixed corpus containing noise and interfering speech from multiple speakers.
The first technical solution adopted by the present invention is a speaker-specific speech separation method based on a dual-path self-attention mechanism, comprising the following steps:
obtaining an enrollment corpus and a mixed corpus;
extracting a mel spectrogram from the enrollment corpus and feeding it to a pre-trained speaker encoder to obtain an identity feature;
processing the mixed corpus with a pre-trained speech encoder to obtain speech features;
fusing the identity feature and the speech features to obtain fused features;
processing the fused features with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate;
passing the fused features and the SNR estimate through a pre-trained speech separator and a speech decoder in turn to obtain the clean speech signal of the target speaker.
Further, the step of extracting a mel spectrogram from the enrollment corpus and feeding it to the pre-trained speaker encoder to obtain the identity feature specifically comprises:
framing, pre-emphasizing and windowing the enrollment corpus in turn to obtain a windowed signal;
applying a short-time Fourier transform to the windowed signal to obtain a linear spectrum;
converting the linear spectrum to a mel-scale (nonlinear) spectrum to obtain a mel spectrogram;
processing the mel spectrogram with the pre-trained speaker encoder to obtain the identity feature.
Further, the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer and a fully connected layer.
Further, the step of processing the mel spectrogram with the pre-trained speaker encoder to obtain the identity feature specifically comprises:
learning features for the speaker verification task from the mel spectrogram with the front-end feature extraction network to obtain front-end features;
converting the front-end features into an encoding vector with the encoding layer;
processing the encoding vector with the fully connected layer to obtain a fixed-dimensional speaker identity feature.
Further, the step of processing the mixed corpus with the pre-trained speech encoder to obtain speech features specifically comprises:
converting the mixed corpus into a speech signal sequence with the pre-trained speech encoder;
segmenting and recombining the speech signal sequence into a three-dimensional feature block to obtain the speech features.
Further, the pre-trained SNR estimation module comprises dilated convolution layers, an LSTM layer and a fully connected layer, and the step of processing the fused features with the pre-trained SNR estimation module to obtain the SNR estimate specifically comprises:
extracting features from the fused features with the dilated convolutions;
mining the temporal information between the features with the LSTM layer;
obtaining a per-frame SNR estimate with the fully connected layer;
averaging over the time dimension to obtain the SNR estimate of the speech segment.
Further, the step of passing the fused features and the SNR estimate through the pre-trained speech separator and speech decoder in turn to obtain the clean speech signal of the target speaker specifically comprises:
concatenating the fused features and the SNR estimate along the feature dimension to obtain concatenated features;
separating the concatenated features with the pre-trained dual-path self-attention speech separator to obtain a separated three-dimensional feature block;
recombining and splicing the separated three-dimensional feature block with the speech decoder to recover the clean speech signal of the target speaker.
Further, the training of the speaker encoder specifically comprises:
constructing a first training set;
sampling, framing, pre-emphasizing, windowing, Fourier-transforming and mel-filtering the data in the first training set to obtain training mel spectrograms;
training the speaker encoder with cross-entropy loss on the training mel spectrograms and the ground-truth labels in the first training set to obtain the pre-trained speaker encoder.
The beneficial effects of the method of the present invention are as follows: the present invention adds an SNR estimation module to the network to estimate the SNR between the target speech and the interfering speech, and uses the estimated SNR as one of the inputs to the separation network, so that the separation network is aware of the SNR level of the current speech segment, which improves separation performance under different SNR conditions; in addition, a separation algorithm based on a dual-path self-attention mechanism gives full play to the network's capacity for autonomous learning and further improves separation performance.
Description of the Drawings
Fig. 1 is a flow chart of the steps of the speaker-specific speech separation method based on a dual-path self-attention mechanism of the present invention;
Fig. 2 is a structural block diagram of the method according to a specific embodiment of the present invention;
Fig. 3 is a schematic diagram of the mel spectrogram extraction process according to a specific embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The step numbers in the following embodiments are provided only for convenience of explanation; they do not limit the order of the steps in any way, and the execution order of the steps in the embodiments may be adjusted according to the understanding of those skilled in the art.
Referring to Fig. 1 and Fig. 2, the present invention provides a speaker-specific speech separation method based on a dual-path self-attention mechanism, which comprises the following steps:
S1. Obtain an enrollment corpus and a mixed corpus.
S2. Extract a mel spectrogram from the enrollment corpus and feed it to a pre-trained speaker encoder to obtain an identity feature.
Specifically, the speaker encoder extracts, from the target speaker's enrollment audio, an encoding vector that characterizes the target speaker's identity. The present invention uses a text-independent speaker verification network to extract this identity encoding vector.
S2.1. Frame, pre-emphasize and window the enrollment corpus in turn to obtain a windowed signal.
Specifically, speech signals are short-time stationary and can be regarded as quasi-static within 10-30 ms, so the speech signal is usually split into frames along the time axis; to obtain a smooth transition between frames, adjacent frames usually overlap, and the step between successive frames is called the frame shift. Here all speech is assumed to be sampled at 16 kHz, with a frame length of 25 ms and a frame shift of 10 ms. Owing to the characteristics of the human vocal apparatus, the high-frequency part of the speech signal is suppressed, and high-frequency energy decays strongly during propagation; to compensate for this, pre-emphasis is applied to the high-frequency part of each frame. To reduce the Gibbs effect, the pre-emphasized signal is windowed with a Hamming window.
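The following is a minimal NumPy sketch of this framing, pre-emphasis and Hamming-window step, using the stated 16 kHz sampling rate, 25 ms frame length, 10 ms frame shift and a pre-emphasis coefficient of 0.97; the function name and padding behavior are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def frame_and_window(x, sr=16000, frame_ms=25, hop_ms=10, preemph=0.97):
    """Split a waveform into overlapping, pre-emphasized, Hamming-windowed frames.

    Assumes the signal is at least one frame long.
    """
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 160 samples at 16 kHz
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - 0.97 * x[n-1]
    y = np.append(x[0], x[1:] - preemph * x[:-1])
    n_frames = 1 + (len(y) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([y[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```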
S2.2. Apply a short-time Fourier transform to the windowed signal to obtain a linear spectrum.
Specifically, a short-time Fourier transform is applied to the windowed signal to convert it from the time domain to the frequency domain and obtain the magnitude spectrum. The number of Fourier transform points is set to 512.
S2.3. Convert the linear spectrum to a mel-scale spectrum to obtain a mel spectrogram.
Specifically, the human ear perceives sound frequency roughly logarithmically; the linear spectrogram produced by the short-time Fourier transform is equally spaced in frequency, whereas the mel spectrum matches the auditory characteristics of the human ear. The energy spectrum is multiplied by a bank of triangular band-pass filters uniformly distributed on the mel scale, i.e. the spacing between the filters' center frequencies narrows as the filter index decreases and widens as it increases; taking the logarithm of each filter's output energy then converts the linear spectrum to the mel scale and yields the required mel spectrogram. The mel spectrogram extraction process is shown in Fig. 3.
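As a rough, self-contained illustration of the triangular mel filterbank described above, the sketch below builds 64 filters uniformly spaced on the mel scale and applies them to the power spectrum of the frames from the previous sketch; the HTK-style mel formula and the exact bin mapping are common conventions assumed here, not specified by the patent:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=512, n_mels=64):
    """Triangular filters spaced uniformly on the mel scale (64 filters assumed)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(left, center):      # rising slope of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_spectrogram(frames, sr=16000, n_fft=512, n_mels=64):
    """frames: output of frame_and_window; returns (n_frames, n_mels) log-mel features."""
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # power spectrum per frame
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T           # apply triangular filters
    return np.log(mel + 1e-10)                                  # log filterbank energies
```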
S2.4. Process the mel spectrogram with the pre-trained speaker encoder to obtain the identity feature.
S2.4.1. Learn features for the speaker verification task from the mel spectrogram with the front-end feature extraction network to obtain front-end features.
Specifically, the front-end feature extraction network learns features suited to the speaker verification task from the mel spectrogram of the speech signal. ResNet-18 is used here as the front-end feature learning module.
S2.4.2. Convert the front-end features into an encoding vector with the encoding layer.
Specifically, the encoding layer converts the time-dependent features output by the front-end feature extractor into a time-independent, fixed-length encoding vector. It also performs the dimension conversion between the feature extraction layer and the classifier and reduces overfitting in the deep network.
S2.4.3. Process the encoding vector with the fully connected layer to obtain a fixed-dimensional speaker identity feature.
In addition, a classifier is included to classify the speaker identity features; each of its output nodes corresponds to one speaker in the training data. This part exists only during training; after training it is removed, and the output of the preceding fully connected layer is fed to the subsequent separation network as the speaker's identity feature.
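A minimal PyTorch sketch of a speaker encoder of this shape is given below: a ResNet-18 front end, a pooling/encoding stage, a fully connected embedding layer, and a classifier head used only during training. The layer sizes, the embedding dimension and the use of global average pooling as the encoding layer are assumptions made for illustration, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class SpeakerEncoder(nn.Module):
    """ResNet-18 front end -> pooling (encoding layer) -> FC embedding -> classifier (training only)."""
    def __init__(self, n_speakers, emb_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Replace the first conv so the network accepts a single-channel mel spectrogram.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.frontend = nn.Sequential(*list(resnet.children())[:-1])  # keep global pooling, drop the FC head
        self.embedding = nn.Linear(512, emb_dim)           # fixed-dimensional identity feature
        self.classifier = nn.Linear(emb_dim, n_speakers)   # removed after training

    def forward(self, mel):                  # mel: (batch, 1, n_mels, n_frames)
        h = self.frontend(mel).flatten(1)    # (batch, 512): time-independent encoding vector
        emb = self.embedding(h)              # speaker identity feature
        return emb, self.classifier(emb)     # classifier logits used only for the training loss
```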
S3. Process the mixed corpus with a pre-trained speech encoder to obtain speech features.
S3.1. Convert the mixed corpus into a speech signal sequence with the pre-trained speech encoder.
S3.2. Segment and recombine the speech signal sequence into a three-dimensional feature block to obtain the speech features.
Specifically, the speech encoder encodes the mixed speech, emulating an STFT to transform the speech signal into a learned domain and produce a feature sequence of length L with feature dimension H; this is implemented with a one-dimensional convolutional network. The signal sequence after convolution is usually very long, so it is segmented and recombined into a three-dimensional feature block: the feature sequence of length L and feature dimension H is cut into M short segments, each of length N, with an overlap of length P between adjacent segments and N = 2P, finally yielding a three-dimensional block of dimensions N×M×H.
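The sketch below illustrates, in PyTorch, a 1-D convolutional speech encoder followed by segmentation into overlapping chunks of length N with hop P = N/2; the kernel size, stride, feature dimension H and chunk length are illustrative values, not taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """1-D conv encoder followed by segmentation into 50%-overlapping chunks (a sketch; N = 2P assumed)."""
    def __init__(self, h_dim=64, kernel=16, stride=8, chunk_len=100):
        super().__init__()
        self.conv = nn.Conv1d(1, h_dim, kernel_size=kernel, stride=stride, bias=False)
        self.chunk_len = chunk_len          # N
        self.hop = chunk_len // 2           # P = N / 2

    def forward(self, wav):                 # wav: (batch, samples); assumes at least one chunk of output
        seq = torch.relu(self.conv(wav.unsqueeze(1)))        # (batch, H, L): learned "STFT-like" sequence
        pad = (self.hop - (seq.size(-1) - self.chunk_len) % self.hop) % self.hop
        seq = F.pad(seq, (0, pad))                           # pad so the sequence splits evenly into chunks
        chunks = seq.unfold(-1, self.chunk_len, self.hop)    # (batch, H, M, N)
        return chunks.permute(0, 3, 2, 1)                    # (batch, N, M, H) three-dimensional block
```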
S4. Fuse the identity feature and the speech features to obtain fused features.
Specifically, this step fuses the speech features produced by the speech encoder with the identity feature produced by the speaker encoder. The speech features and the identity feature are concatenated along the feature dimension and then passed through a fully connected layer to obtain the fused features. The fused features are fed into both the speech separator and the SNR estimation module.
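A minimal sketch of this fusion step, assuming the speaker embedding is broadcast to every position of the three-dimensional speech feature block before concatenation and projection:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate the speaker embedding with every time-feature position, then project back to H."""
    def __init__(self, h_dim=64, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(h_dim + emb_dim, h_dim)   # the fully connected fusion layer

    def forward(self, speech, emb):                      # speech: (B, N, M, H), emb: (B, emb_dim)
        B, N, M, _ = speech.shape
        emb = emb[:, None, None, :].expand(B, N, M, -1)  # broadcast the identity feature
        fused = torch.cat([speech, emb], dim=-1)         # concatenate along the feature dimension
        return self.proj(fused)                          # (B, N, M, H) fused features
```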
S5. Process the fused features with a pre-trained SNR estimation module to obtain an SNR estimate.
S5.1. Extract features from the fused features with dilated convolutions.
S5.2. Mine the temporal information between the features with the LSTM layer.
S5.3. Obtain a per-frame SNR estimate with the fully connected layer.
S5.4. Average over the time dimension to obtain the SNR estimate of the speech segment.
Specifically, the SNR estimation module estimates the SNR of the corpus. In the present invention it consists of three layers of two-dimensional dilated convolution, one LSTM layer and one fully connected layer. The dilated convolutions perform feature extraction, the LSTM exploits the temporal information between features, the fully connected layer produces a per-frame SNR estimate, and the per-frame estimates are finally averaged over the time dimension to obtain the SNR of the whole segment. During training, this module is trained jointly with the rest of the network in a multi-task fashion; during testing, this module, together with the preceding speaker encoder, speech encoder and feature fusion module, extracts the SNR estimate of the current speech segment from the mixed corpus, which replaces the ground-truth SNR used during training as one of the inputs to the speech separator.
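The following PyTorch sketch mirrors the described structure (three dilated 2-D convolutions, one LSTM, one fully connected layer, then a time average); the channel counts, the pooling over the intra-chunk axis and the treatment of the chunk axis as the time axis are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class SNREstimator(nn.Module):
    """Three dilated 2-D convolutions -> LSTM -> FC -> time average (channel sizes are assumptions)."""
    def __init__(self, h_dim=64, hidden=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(h_dim, 32, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, fused):                      # fused: (B, N, M, H)
        x = fused.permute(0, 3, 1, 2)              # (B, H, N, M): feature dimension as channels
        x = self.convs(x)                          # (B, 32, N, M): dilated-conv feature extraction
        x = x.mean(dim=2).transpose(1, 2)          # (B, M, 32): pool within chunks, chunks act as time steps
        h, _ = self.lstm(x)                        # temporal modelling across the segment
        snr_per_frame = self.fc(h).squeeze(-1)     # (B, M): per-frame SNR estimates
        return snr_per_frame.mean(dim=1)           # (B,): segment-level SNR estimate
```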
S6. Pass the fused features and the SNR estimate through the pre-trained speech separator and speech decoder in turn to obtain the clean speech signal of the target speaker.
S6.1. Concatenate the fused features and the SNR estimate along the feature dimension to obtain concatenated features.
S6.2. Separate the concatenated features with the pre-trained dual-path self-attention speech separator to obtain a separated three-dimensional feature block.
Specifically, the speech separator separates out the three-dimensional features corresponding to the target speaker and is implemented by stacking B dual-path self-attention modules. The fused features are first concatenated with the SNR of the current speech segment along the feature dimension, i.e. the SNR value is repeated N×M times and appended along the fused-feature dimension. Each dual-path self-attention module consists of self-attention applied along two paths, within chunks and across chunks, and all modules use the same parameter settings. Self-attention allows the network to discover correlations within its own sequence: a query matrix Q, a key matrix K and a value matrix V are first generated from the input feature vectors; Q is matrix-multiplied with K to obtain a matrix R that scores the correlation of each time step with the other time steps in the sequence; R is normalized to values between 0 and 1 by a Softmax function and then matrix-multiplied with V to give the self-attention output. The self-attention here is implemented as multi-head attention: h parallel self-attention modules are run and their outputs are concatenated to form the final output. Intra-chunk self-attention treats N as the time dimension, which amounts to computing local attention within each short segment, while inter-chunk self-attention treats M as the time dimension, which amounts to computing global attention over the whole long sequence; this combination of local and global views enables effective modeling of long sequences. During training, the SNR of the speech segment is the ground-truth SNR, i.e. the SNR value used when generating the training mixtures; during testing, the SNR value computed by the SNR estimation module is used.
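A minimal sketch of one dual-path self-attention block built from standard multi-head attention is shown below; the residual connections, layer normalization, head count and feature size (including one appended SNR channel) are illustrative assumptions rather than the patent's exact configuration:

```python
import torch
import torch.nn as nn

class DualPathSelfAttentionBlock(nn.Module):
    """One intra-chunk plus one inter-chunk multi-head self-attention pass (a sketch)."""
    def __init__(self, h_dim=65, n_heads=5):       # h_dim includes the appended SNR channel in this sketch
        super().__init__()
        self.intra = nn.MultiheadAttention(h_dim, n_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(h_dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(h_dim)
        self.norm2 = nn.LayerNorm(h_dim)

    def forward(self, x):                           # x: (B, N, M, H)
        B, N, M, H = x.shape
        # Intra-chunk path: N is the time axis (local attention within each short segment).
        intra_in = x.permute(0, 2, 1, 3).reshape(B * M, N, H)
        intra_out, _ = self.intra(intra_in, intra_in, intra_in)
        x = self.norm1(intra_in + intra_out).reshape(B, M, N, H).permute(0, 2, 1, 3)
        # Inter-chunk path: M is the time axis (global attention across the whole sequence).
        inter_in = x.reshape(B * N, M, H)
        inter_out, _ = self.inter(inter_in, inter_in, inter_in)
        return self.norm2(inter_in + inter_out).reshape(B, N, M, H)
```

A full separator would stack B such blocks and map the output back to the clean-speech feature block; that stacking is omitted here for brevity.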
S6.3. Recombine and splice the separated three-dimensional feature block with the speech decoder to recover the clean speech signal of the target speaker.
Specifically, the speech decoder restores the three-dimensional feature block produced by the speech separator into the clean audio of the target speaker and is implemented by a one-dimensional transposed convolution (deconvolution) network. The three-dimensional block is recombined and spliced by reversing the segmentation and splicing performed in the speech encoder, yielding a feature sequence of the same length as the speech encoder's output, which is then fed into the one-dimensional transposed convolution network to recover the clean speech signal of the target speaker.
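The sketch below reverses the encoder's segmentation by overlap-add and then applies a 1-D transposed convolution; it assumes the same illustrative kernel size, stride and chunk length as the encoder sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechDecoder(nn.Module):
    """Overlap-add of chunks (inverse of the encoder segmentation) followed by a 1-D transposed convolution."""
    def __init__(self, h_dim=64, kernel=16, stride=8, chunk_len=100):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(h_dim, 1, kernel_size=kernel, stride=stride, bias=False)
        self.chunk_len = chunk_len
        self.hop = chunk_len // 2

    def forward(self, chunks):                             # chunks: (B, N, M, H)
        B, N, M, H = chunks.shape
        x = chunks.permute(0, 3, 1, 2).reshape(B, H * N, M)
        # F.fold performs the overlap-add that reverses the encoder's unfold.
        seq = F.fold(x, output_size=(1, (M - 1) * self.hop + N),
                     kernel_size=(1, N), stride=(1, self.hop))
        seq = seq.reshape(B, H, -1)                        # (B, H, L): recombined feature sequence
        return self.deconv(seq).squeeze(1)                 # (B, samples): recovered waveform
```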
As a further preferred embodiment of the method, the training of the speaker encoder specifically comprises:
constructing a first training set, DatasetA;
sampling, framing, pre-emphasizing, windowing, Fourier-transforming and mel-filtering the data in DatasetA to obtain training mel spectrograms;
specifically, the frame length is set to 25 ms, the frame shift to 10 ms and the pre-emphasis coefficient to 0.97; the window function is a Hamming window, the number of Fourier transform points is 512 and the number of mel filters is 64;
training the speaker encoder with cross-entropy loss on the training mel spectrograms and the ground-truth labels in DatasetA to obtain the pre-trained speaker encoder.
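A minimal training-loop sketch for this step, using cross-entropy loss on speaker labels; the optimizer, learning rate and epoch count here are assumptions for the speaker encoder (the patent specifies them only for the joint separation training):

```python
import torch
import torch.nn as nn

def train_speaker_encoder(model, loader, epochs=50, lr=1e-3):
    """Cross-entropy training of the speaker encoder (optimizer and epoch count are assumptions)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for mel, speaker_id in loader:       # mel: (B, 1, n_mels, n_frames), speaker_id: (B,)
            _, logits = model(mel)           # classifier head is only used during training
            loss = ce(logits, speaker_id)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```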
As a further preferred embodiment of the method, the training of the speech separator specifically comprises:
constructing a second training set, DatasetB;
randomly selecting two speakers A and B from DatasetB each time, where A is the target speaker: one utterance from A's corpus is taken as the enrollment corpus and used to extract the identity feature vector E with the trained speaker encoder, and another utterance is randomly selected from A's remaining corpus as the target speech signal S1 to be recovered; B is the interfering speaker: one utterance is randomly selected from B's corpus as the interfering speech S2 and mixed with S1 at a certain SNR (-5 to 10 dB) to form the training data for separation (all speech segments used for training are cropped to 3 s), and the SNR used for mixing is also fed to the model;
training the speech separator with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function.
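As a rough illustration of the mixing and loss described above, the sketch below scales the interference so that the target-to-interference energy ratio equals a chosen SNR in dB, and computes the negative SI-SNR used as the training loss; the function names and tensor shapes are illustrative:

```python
import torch

def mix_at_snr(s1, s2, snr_db):
    """Scale interference s2 so that the target/interference energy ratio equals snr_db, then mix."""
    p1 = s1.pow(2).mean()
    p2 = s2.pow(2).mean()
    scale = torch.sqrt(p1 / (p2 * 10 ** (snr_db / 10) + 1e-8))
    return s1 + scale * s2

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms of shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scale-invariant target component.
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()   # maximize SI-SNR by minimizing its negative
```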
In addition, the other modules are trained in the same way.
SI-SNR and L1 loss are used as the training loss functions of the separation module and the SNR estimation module respectively, and their sum is used as the loss function of the whole model to jointly optimize the two modules. Adam is used as the optimizer with an initial learning rate of 0.001; the learning rate is adjusted when the validation loss has not decreased for more than 10 epochs; training runs for 50 iterations, after which training is complete and the model is saved.
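A sketch of one joint optimization step combining the SI-SNR loss for separation with an L1 loss on the SNR estimate, using Adam and a plateau-based learning-rate schedule matching the described settings; the full model's forward signature and the batch layout are assumptions:

```python
import torch
import torch.nn as nn

def joint_training_step(model, batch, optimizer):
    """One joint step: SI-SNR for separation plus L1 for SNR estimation (a sketch).

    si_snr_loss is the function from the previous sketch; the forward signature of
    `model` (mixture + enrollment mel -> estimated waveform + SNR estimate) is assumed.
    """
    mix, enroll_mel, target_wav, snr_true = batch
    est_wav, snr_est = model(mix, enroll_mel)
    loss = si_snr_loss(est_wav, target_wav) + nn.functional.l1_loss(snr_est, snr_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer and scheduler matching the described setup:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
```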
A speaker-specific speech separation system based on a dual-path self-attention mechanism, comprising:
a data acquisition module for obtaining an enrollment corpus and a mixed corpus;
an identity feature extraction module for extracting a mel spectrogram from the enrollment corpus and feeding it to a pre-trained speaker encoder to obtain an identity feature;
a speech feature extraction module for processing the mixed corpus with a pre-trained speech encoder to obtain speech features;
a fusion module for fusing the identity feature and the speech features to obtain fused features;
an SNR estimation module for processing the fused features with a pre-trained SNR estimation module to obtain an SNR estimate;
a separation module for passing the fused features and the SNR estimate through a pre-trained speech separator and a speech decoder in turn to obtain the clean speech signal of the target speaker.
As a further preferred embodiment of the system, it further comprises:
a training module for training the speaker encoder, the speech encoder, the SNR estimation module, the speech separator and the speech decoder.
The contents of the above method embodiments are all applicable to this system embodiment; the functions specifically implemented by this system embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.
A speaker-specific speech separation apparatus based on a dual-path self-attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor implements the speaker-specific speech separation method based on a dual-path self-attention mechanism described above.
The contents of the above method embodiments are all applicable to this apparatus embodiment; the functions specifically implemented by this apparatus embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.
A storage medium storing processor-executable instructions, wherein the processor-executable instructions, when executed by a processor, implement the speaker-specific speech separation method based on a dual-path self-attention mechanism described above.
The contents of the above method embodiments are all applicable to this storage-medium embodiment; the functions specifically implemented by this storage-medium embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.
The preferred embodiments of the present invention have been described in detail above, but the invention is not limited to these embodiments; those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present application.
| Publication Number | Application Number | Filing / Priority Date | Publication Date | Status |
|---|---|---|---|---|
| CN114495973A | CN202210088494.8A | 2022-01-25 | 2022-05-13 | Published |
| CN114495973B | CN202210088494.8A | 2022-01-25 | 2025-05-16 (grant) | Active |

Title: A speaker-specific speech separation method based on dual-path self-attention mechanism