
A speaker voice information decoupling method based on unsupervised learning

Info

Publication number
CN118887958A
Authority
CN
China
Prior art keywords
speaker
information
speech
features
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411390118.XA
Other languages
Chinese (zh)
Other versions
CN118887958B (en)
Inventor
张句
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN202411390118.XA
Publication of CN118887958A
Application granted
Publication of CN118887958B
Legal status: Active (current)
Anticipated expiration


Abstract

The invention relates to the technical field of speech analysis and processing, and provides a speaker voice information decoupling method based on unsupervised learning. Without requiring text labels, the method obtains a fine-grained representation of speaker information through a multi-reference acoustic prompts encoder and jointly trains it with gradient-reversal speaker classification under an encoder-decoder framework, thereby decoupling the speaker-related and speaker-independent information in the speech signal. The invention can serve as a basic module that provides effective speaker representations for related downstream applications such as speaker recognition, voice conversion, and speech synthesis.

Description

Translated from Chinese
A speaker voice information decoupling method based on unsupervised learning

Technical Field

The present invention relates to the technical field of speech analysis and processing, and discloses an unsupervised-learning-based method for decoupling speaker information in speech. Specifically, on top of an encoder-decoder framework, a multi-reference acoustic prompts encoder (MRAPE) and speaker classification through a gradient reversal layer (GRL) are introduced for joint training, so as to decouple the speaker information in the speech signal and provide a more effective speaker representation for downstream tasks.

Background Art

Effective extraction of speaker information from speech is at the technical core of speaker recognition, speech synthesis, voice conversion, and related fields. Existing speaker-information extraction methods typically use pre-trained speech encoders based on time-delay neural networks or residual networks, such as ECAPA-TDNN and ResNet, to extract a low-dimensional speaker vector directly from speech as the speaker representation. However, since speech contains not only speaker information but also linguistic content, emotion, prosody, and other information, existing methods cannot extract and represent the speaker information in a targeted and effective way, and therefore suffer from weak generalization performance.

To exclude the interference of linguistic content features in speech, existing techniques usually rely on a pre-trained automatic speech recognition (ASR) network to model the linguistic content and use it as a reference for extracting speaker information. However, because the speech of people with articulation disorders has low intelligibility, a pre-trained ASR network cannot effectively obtain the linguistic content representation of their speech, and the lack of text labels for such speech also makes training and fine-tuning the ASR network difficult. Methods that model linguistic content with a pre-trained ASR network are therefore of very limited use for extracting speaker information from pathological speech. In addition, besides low intelligibility, some pathological speech is intermittent, discontinuous, and unstable, which degrades the speaker representations produced by previous extraction methods that operate on the target speech globally.

Summary of the Invention

Based on the situation described above, the present invention develops an unsupervised-learning-based method for decoupling speaker information in speech. Without requiring text labels, the decoupling of speaker voice information is guided by jointly training two tasks, speech reconstruction and speaker classification, thereby obtaining a more effective speaker representation.

The technical solution of the present invention is an unsupervised-learning-based method for decoupling speaker information in speech, which realizes the decoupling on an autoencoder framework consisting of an encoder and a decoder. The specific implementation steps are as follows:

S1 Data preprocessing and feature extraction: before model training, the audio data is first batch-normalized to a unified format; this both speeds up training and model convergence and helps improve the model's accuracy and generalization ability. The processed audio is then converted by feature extraction into Mel filter-bank (FBank) features, which serve as the main input to the framework; in addition, for each piece of audio, other speech segments of the same speaker are also converted into FBank features and used as the same-speaker reference features;

S2 Extracting the speaker-independent representation: the FBank features extracted in S1 are first fed into the basic speech encoder to extract basic speech features that contain speaker, linguistic content, prosody, and emotion information; vocal tract length perturbation (VTLP) is then applied to obtain a speaker-independent speech representation in which the speaker information is corrupted, and speaker classification training through a gradient-reversed speaker classifier further compresses (minimizes) the speaker information remaining in this representation, yielding the optimized speaker-independent speech representation; the loss of this part during training is the gradient-reversal speaker-classification loss.

The detailed steps of S2 are as follows:

S2-1 The FBank features are first fed into the basic speech encoder, which consists of a convolutional downsampling module and a self-attention encoding module. The convolutional downsampling module is composed of two convolutional neural network (CNN) layers; the self-attention encoding module (Conformer layers) is a multi-layer structure in which each layer contains a feed-forward network, a multi-head self-attention module, and a convolution module. The basic speech features s are obtained after convolutional downsampling and self-attention encoding;

S2-2 The basic speech features are frequency-warped by vocal tract length perturbation, which initially strips the coupling between linguistic content information and speaker information and yields the speaker-independent speech representation;

S2-3 Gradient-reversal speaker classification training: the speaker-independent speech representation is sent to a speaker classifier for speaker classification training, and the loss of this part during training is the gradient-reversal speaker-classification loss. The speaker-independent representation passes through the gradient reversal layer and then through the speaker classifier to compute this loss; when the loss is minimal, the speaker information of the current speaker retained in the obtained feature is minimal, so a speaker-independent speech representation with the speaker information further stripped away is obtained;
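
As an illustration of the gradient reversal operation described above, the following is a minimal PyTorch sketch; PyTorch itself, the class name GradReverse, and the scaling factor lambd are assumptions made for illustration, not details taken from the patent.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the encoder,
        # so that minimizing the speaker-classification loss pushes the encoder to
        # remove speaker information from the representation.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: the speaker-independent representation passes through the reversal layer
# before the speaker classifier; the classifier is trained normally while the
# encoder receives inverted gradients.
#   logits = speaker_classifier(grad_reverse(z_independent))
```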

S3 Extracting speaker-related features: in parallel with step S2, the FBank features obtained in step S1 and the same-speaker reference features are fed together into the multi-reference acoustic prompts encoder to extract fine-grained speaker-related features, and speaker classification training with a speaker classifier further strengthens their ability to represent speaker information. The detailed steps of S3 are as follows:

S3-1 The same-speaker reference features are first processed by the multi-reference acoustic prompts encoder to extract a speaker-information reference representation. The multi-reference acoustic prompts encoder consists of two convolution modules and one downsampling layer: the same-speaker reference features are sent into the first convolution module and then downsampled by a factor of 16 to improve computational efficiency; the second convolution module further convolves the downsampled reference features to aggregate the speaker information shared across the different speech segments, forming the final same-speaker reference encoding;

S3-2 The same-speaker reference encoding and the FBank features from step S1 are fed together into the timbre attention module inside the multi-reference acoustic prompts encoder to encode the speaker information of the current utterance. The timbre attention module is a multi-layer attention encoding module (Conformer layers), in which each layer contains a feed-forward network, a multi-head attention module, and a convolution module. Unlike the self-attention encoding module of the basic speech encoder in S2-1, the query Q of the timbre attention module is the FBank features currently being processed, while both the key K and the value V are the same-speaker reference encoding obtained in S3-1. The feature extraction of the timbre attention module further distills the speaker information of the same speaker in the current articulation state, forming the speaker-related features.

S3-3 The speaker-related features obtained in step S3-2 are sent to a speaker classifier; the training loss of this part is the speaker-classification loss. When this loss is minimal, the speaker information contained in the obtained features is optimal, yielding further optimized speaker-related features.

S4 The speaker-independent speech representation obtained in S2 and the speaker-related features obtained in S3 are concatenated and sent to the speech decoder for decoding, producing the reconstructed FBank features, and the reconstruction loss between the reconstructed FBank features and the original FBank features is computed.

The speech decoder of the present invention is a multi-layer self-attention encoding module (Transformer layers), in which each layer contains a feed-forward network and a multi-head attention module.

The modules in S2, S3, and S4 are trained jointly, and the joint loss of the three modules, i.e. the sum of the gradient-reversal speaker-classification loss, the speaker-classification loss, and the reconstruction loss, is optimized; finally, the optimally decoupled speaker-independent speech representation and speaker-related representation are obtained from the original speech signal.

In the present invention, the speaker classifier in S3 has the same structure as the speaker classifier in S2, but their parameters are independent of each other.

The speaker-related representation extracted by the present invention can be used as a speaker embedding feature in multiple speaker-related applications such as speaker recognition, voice conversion, and voice cloning.

Beneficial Effects

The present invention addresses the shortcomings of existing speaker representation techniques when handling complex speech, such as poor speaker-information extraction performance and low generalization on the pathological speech of people with articulation disorders. The extracted speaker representation generalizes well and copes with characteristics of pathological speech such as low intelligibility and intermittent, discontinuous, and unstable speech segments.

The present invention can be used as a basic module that provides effective speaker representations for related downstream applications such as speaker recognition, voice conversion, and speech synthesis.

Compared with zero-shot speech synthesis that incorporates previous speaker representation extraction methods, the invention brings substantial improvements in target-speaker similarity, speech intelligibility, and naturalness.

Brief Description of the Drawings

Fig. 1 shows the structure of the training model of the present invention;

Fig. 2 shows the internal structure of the multi-reference acoustic prompts encoder module.

Detailed Description

The present invention is further described below with reference to the accompanying drawings. As shown in Fig. 1, the training model of the present invention implements the unsupervised-learning-based decoupling of speaker information in speech on an autoencoder framework consisting of an encoder and a decoder.

S1 Data preprocessing and feature extraction: the audio data is resampled and amplitude-normalized with the Python audio processing toolkit librosa, and the spectrum is pre-emphasized, windowed, and otherwise preprocessed. The processed audio is converted by feature extraction into 80-dimensional Mel filter-bank (FBank) features, which serve as the main input to the framework. For each piece of audio, 2000 frames are randomly extracted from other speech segments of the same speaker, converted into FBank features, and concatenated to form the same-speaker reference features.
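
A minimal preprocessing sketch under the description above (librosa, 80 Mel bins, a 2000-frame same-speaker reference); the sampling rate, frame and hop sizes, pre-emphasis coefficient, and function names are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np
import librosa

def extract_fbank(path, sr=16000, n_mels=80, n_fft=400, hop_length=160):
    """Load audio, resample, normalize amplitude, and return log Mel filter-bank features (T, 80)."""
    y, _ = librosa.load(path, sr=sr)                 # resample to a unified rate
    y = y / (np.max(np.abs(y)) + 1e-8)               # amplitude normalization
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6).T                      # (frames, n_mels)

def build_reference_features(paths, n_frames=2000):
    """Concatenate FBank frames from other utterances of the same speaker and
    crop (or tile) them to a fixed 2000-frame same-speaker reference."""
    feats = np.concatenate([extract_fbank(p) for p in paths], axis=0)
    if feats.shape[0] >= n_frames:
        start = np.random.randint(0, feats.shape[0] - n_frames + 1)
        return feats[start:start + n_frames]
    reps = int(np.ceil(n_frames / feats.shape[0]))
    return np.tile(feats, (reps, 1))[:n_frames]
```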

S2 Extracting the speaker-independent representation: the FBank features extracted in S1 are first fed into the basic speech encoder to obtain basic speech features containing speaker, linguistic content, prosody, and emotion information; after vocal tract length perturbation, a speaker-independent speech representation with corrupted speaker information is obtained, and speaker classification training through gradient reversal further compresses (minimizes) the speaker information remaining in this representation, yielding the optimized speaker-independent speech representation.

The detailed steps of S2 are as follows:

S2-1 The FBank features extracted in S1 are first fed into the basic speech encoder to obtain the basic speech features. In the basic speech encoder, the FBank features first pass through the convolutional downsampling module, which halves the number of frames; this module consists of two two-dimensional convolutional neural network (CNN) layers with a kernel size of 3 and a downsampling stride of 2. The downsampled Mel-spectral features are then converted into the basic speech features by the self-attention encoding module, which consists of 6 self-attention encoding layers, each containing a feed-forward network, a multi-head self-attention module, and a convolution module;
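
A rough sketch of the basic speech encoder under the stated configuration (two 2-D convolution layers with kernel size 3, then six attention layers). nn.TransformerEncoderLayer is used here as a stand-in for the Conformer layers, the stride placement that yields the stated two-fold frame reduction is an assumption, and the channel and model widths are illustrative.

```python
import torch
import torch.nn as nn

class BasicSpeechEncoder(nn.Module):
    """Convolutional downsampling (two Conv2d layers, kernel 3) followed by
    six attention layers producing the basic speech features."""

    def __init__(self, n_mels=80, channels=64, d_model=256, n_layers=6, n_heads=4):
        super().__init__()
        # First conv halves both time and frequency; the second keeps the frame
        # rate, matching the stated two-fold reduction in the number of frames.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(channels * (n_mels // 2), d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, fbank):                        # fbank: (B, T, n_mels)
        x = self.subsample(fbank.unsqueeze(1))       # (B, C, T/2, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.attention(self.proj(x))          # basic speech features (B, T/2, d_model)
```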

S2-2 The basic speech features are frequency-warped by vocal tract length perturbation: a warp factor is randomly generated and used to distort the frequency axis of the spectrum, removing vocal-tract differences and yielding the speaker-independent speech representation;
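
A simplified sketch of this frequency-warping step: a random warp factor stretches or compresses the frequency axis through piecewise-linear interpolation. This is a crude stand-in for VTLP, assuming an input of shape (batch, frames, bins); the warp-factor range is also an assumption.

```python
import torch

def vtlp_warp(features, alpha_range=(0.9, 1.1)):
    """Warp the frequency (last) axis of a (B, T, F) feature tensor by a random
    factor alpha: output bin i is read from input position i / alpha with linear
    interpolation, approximating vocal tract length perturbation."""
    b, t, f = features.shape
    alpha = float(torch.empty(1).uniform_(*alpha_range))
    src = torch.arange(f, dtype=torch.float32, device=features.device) / alpha
    src = src.clamp(max=f - 1)
    lo = src.floor().long()
    hi = src.ceil().long()
    frac = (src - lo.float()).view(1, 1, f)
    # Linear interpolation between the two neighbouring source bins.
    return features[..., lo] * (1.0 - frac) + features[..., hi] * frac
```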

S2-3 Gradient-reversal speaker classification training: the speaker-independent speech representation is sent to a speaker classifier for speaker classification training. The speaker classifier is a linear layer that maps the speaker-independent speech representation into an N x D feature space for speaker classification, where N is the total number of speakers used in training and D is the embedding dimension; the embedding dimension used in the current training is 192. The loss of this part during training is the gradient-reversal speaker-classification loss, computed with the additive angular margin loss (AAM-Softmax loss).
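
A sketch of an additive angular margin (AAM-Softmax) speaker classifier consistent with the description (a linear classifier over a 192-dimensional embedding); the margin, scale, speaker count, and the mean pooling of frame-level representations into an utterance-level embedding are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxClassifier(nn.Module):
    """Linear speaker classifier trained with an additive angular margin objective:
    cosine logits with a margin m added to the target-class angle and a scale s."""

    def __init__(self, emb_dim=192, n_speakers=1000, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # Add the angular margin only on the ground-truth class.
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)

# Usage (frame-level representation pooled to an utterance-level embedding):
#   emb = representation.mean(dim=1)
#   loss = classifier(grad_reverse(emb), speaker_ids)   # grad_reverse only in S2-3
```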

S3 Speaker-related feature extraction: in parallel with step S2, the FBank features obtained in step S1 and the same-speaker reference features are fed together into the multi-reference acoustic prompts encoder module to extract fine-grained speaker-related features; the detailed structure of the multi-reference acoustic prompts encoder module is shown in Fig. 2.

The detailed steps of S3 are as follows:

S3-1 The same-speaker reference features are first processed by the multi-reference acoustic prompts encoder to extract the speaker-information reference encoding. The multi-reference acoustic prompts encoder consists of two convolution modules and one downsampling layer; each convolution module contains 5 basic convolution layers, each of which is composed of 2 connected one-dimensional convolution layers, and the basic convolution layers are connected through residual connections. The speaker reference features are sent into the first convolution module and downsampled by a factor of 16, and then pass through the second convolution module to form the final same-speaker reference encoding;
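
A sketch of this module under the stated layout (two convolution modules of 5 residual blocks each, with 16x temporal downsampling in between); the channel width, kernel size, and the use of average pooling for the downsampling layer are assumptions, since the patent specifies only the block counts and the 16x factor.

```python
import torch
import torch.nn as nn

class ResidualConv1dBlock(nn.Module):
    """One 'basic convolution layer': two stacked 1-D convolutions with a residual connection."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

class MultiReferencePromptEncoder(nn.Module):
    """Two convolution modules (5 residual blocks each) with a 16x temporal
    downsampling layer in between, producing the same-speaker reference encoding."""

    def __init__(self, n_mels=80, channels=256):
        super().__init__()
        self.inp = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.module1 = nn.Sequential(*[ResidualConv1dBlock(channels) for _ in range(5)])
        self.downsample = nn.AvgPool1d(kernel_size=16, stride=16)   # 16x downsampling
        self.module2 = nn.Sequential(*[ResidualConv1dBlock(channels) for _ in range(5)])

    def forward(self, ref_fbank):              # same-speaker reference FBank (B, T_ref, n_mels)
        x = self.inp(ref_fbank.transpose(1, 2))
        x = self.module1(x)
        x = self.downsample(x)                 # aggregate over time, 16x shorter
        x = self.module2(x)
        return x.transpose(1, 2)               # reference encoding (B, T_ref/16, channels)
```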

S3-2 The same-speaker reference encoding and the FBank features obtained in step S1 are fed together into the timbre attention module of the multi-reference acoustic prompts encoder module to obtain the speaker-related features. The timbre attention module is a multi-layer structure in which each layer contains a feed-forward network, a multi-head attention module, and a convolution module; the query Q of the timbre attention module is the FBank features currently being processed, while both the key K and the value V are the same-speaker reference encoding obtained in S3-1.
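
A sketch of one timbre attention layer showing the Q/K/V wiring described above: queries come from the current utterance's FBank frames, keys and values from the same-speaker reference encoding. The patent's layers are Conformer-style (they also include a convolution module); this simplified version keeps only the cross-attention and feed-forward parts, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TimbreAttentionLayer(nn.Module):
    """Cross-attention layer: Q from the current utterance's FBank features,
    K and V from the same-speaker reference encoding."""

    def __init__(self, d_model=256, n_heads=4, n_mels=80):
        super().__init__()
        self.q_proj = nn.Linear(n_mels, d_model)      # project FBank frames to model dim
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, fbank, ref_encoding):           # (B, T, n_mels), (B, T_ref, d_model)
        q = self.q_proj(fbank)                        # Q: current utterance
        attn_out, _ = self.attn(q, ref_encoding, ref_encoding)   # K, V: reference encoding
        x = self.norm1(q + attn_out)
        return self.norm2(x + self.ffn(x))            # speaker-related features for this layer
```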

S3-3 The speaker-related features obtained in step S3-2 are sent to a speaker classifier that has the same structure as the one in S2-3 but independent parameters. The training loss of this part is the speaker-classification loss, computed with the additive angular margin loss (AAM-Softmax loss).

S4 Joint multi-module training: the speaker-independent speech representation obtained in S2 and the speaker-related features obtained in S3 are concatenated and sent to the speech decoder for decoding, producing the reconstructed Mel filter-bank features, and the reconstruction loss between the reconstructed features and the original Mel filter-bank features is computed.

The speech decoder is a multi-layer self-attention encoding module (Transformer layers), in which each layer contains a feed-forward network and a multi-head attention module.
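
A sketch of the decoder stage under the description above (concatenation of the two representations followed by Transformer-style layers and a projection back to 80-dimensional FBank features). The layer count, model width, the L1 reconstruction objective, and the assumption that both representations share the same frame rate are illustrative choices not stated in the patent.

```python
import torch
import torch.nn as nn

class SpeechDecoder(nn.Module):
    """Transformer-style decoder: the concatenated speaker-independent and
    speaker-related features are mapped back to FBank features."""

    def __init__(self, d_independent=256, d_speaker=256, d_model=256,
                 n_layers=4, n_heads=4, n_mels=80):
        super().__init__()
        self.inp = nn.Linear(d_independent + d_speaker, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, z_independent, z_speaker):      # both (B, T, C), same frame rate assumed
        x = torch.cat([z_independent, z_speaker], dim=-1)
        x = self.layers(self.inp(x))
        return self.out(x)                            # reconstructed FBank (B, T, n_mels)

# Reconstruction loss between decoded and original FBank features; the patent
# does not specify the distance, so an L1 objective is assumed here.
recon_loss = nn.L1Loss()
```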

The modules in S2, S3, and S4 are trained jointly, and the joint loss of the three modules, i.e. the sum of the gradient-reversal speaker-classification loss, the speaker-classification loss, and the reconstruction loss, is optimized; finally, the optimally decoupled speaker-independent speech representation and speaker-related representation are obtained from the original speech signal.
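
The joint objective can be summarized as the sketch below: the three losses are simply summed as stated above (no weighting is described in the patent, so none is applied here), and the sum drives a single backward pass over all modules.

```python
def joint_loss(loss_grl, loss_spk, loss_rec):
    """Joint objective of the three modules: the gradient-reversal
    speaker-classification loss from S2, the speaker-classification loss
    from S3, and the FBank reconstruction loss from S4."""
    return loss_grl + loss_spk + loss_rec

# Typical use inside a training loop (schematic):
#   loss = joint_loss(loss_grl, loss_spk, loss_rec)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```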

Based on the speaker features extracted by this framework, the present invention is combined with a speech synthesis framework to build a zero-shot voice cloning system for pathological speech. The proposed unsupervised-learning-based speaker voice information decoupling method can, on the one hand, maximally extract the effective information that characterizes the speaker and, on the other hand, exclude the interference of other irrelevant information, in particular avoiding the leakage of intermittent, discontinuous, and incomplete linguistic content from pathological speech.

Claims (5)

Translated from Chinese
1. A speaker information decoupling method based on unsupervised learning, characterized in that speaker voice information decoupling is realized on an autoencoder framework consisting of an encoder and a decoder, with the following steps:

S1: data preprocessing and feature extraction: the audio data is batch-normalized to a unified format; the processed audio data is converted by feature extraction into Mel filter-bank (FBank) features, which serve as the main input to the framework; at the same time, for each piece of audio data, the audio of other speech segments of the same speaker is converted into FBank features and used as the same-speaker reference features;

S2: extracting the speaker-independent representation: the FBank features extracted in S1 are first fed into the basic speech encoder to extract basic speech features, and vocal tract length perturbation is then applied to obtain a speaker-independent speech representation with corrupted speaker information; after gradient reversal, speaker classification training is performed by a speaker classifier to further compress the speaker information remaining in this representation, yielding the optimized speaker-independent speech representation; the loss of this part during training is the gradient-reversal speaker-classification loss;

S3: extracting speaker-related features: in parallel with step S2, the FBank features obtained in step S1 and the same-speaker reference features are fed together into the multi-reference acoustic prompts encoder to extract fine-grained speaker-related features, and speaker classification training with a speaker classifier further strengthens their ability to represent speaker information; the training loss of this part is the speaker-classification loss;

S4: the optimized speaker-independent speech representation obtained in S2 and the speaker-related representation obtained in S3 are concatenated and sent to the speech decoder for decoding, producing the reconstructed FBank features, and the reconstruction loss between the reconstructed FBank features and the original FBank features is computed;

the modules are trained jointly, and the joint loss of the three modules, i.e. the sum of the gradient-reversal speaker-classification loss, the speaker-classification loss, and the reconstruction loss, is optimized, so that the optimally decoupled speaker-independent speech representation and speaker-related representation are finally obtained from the original speech signal.

2. The method according to claim 1, characterized in that in S2 the speaker-independent speech representation with corrupted speaker information passes through gradient reversal and then through the speaker classifier to compute the loss; when this loss is minimal, the speaker information of the current speaker in the optimized speaker-independent speech representation is minimal, yielding a speaker-independent speech representation with the speaker information further stripped away.

3. The method according to claim 1, characterized in that S3 is specifically as follows:

S3-1: the speaker-information reference features of the same speaker are first extracted by the multi-reference acoustic prompts encoder to form the same-speaker reference encoding;

S3-2: the same-speaker reference encoding and the FBank features of step S1 are fed together into the timbre attention module of the multi-reference acoustic prompts encoder to encode the speaker information of the current utterance; the feature extraction of the timbre attention module further distills the speaker information of the same speaker in the current articulation state, forming the speaker-related features;

S3-3: the speaker-related features obtained in step S3-2 are sent to a speaker classifier; the training loss of this part is the speaker-classification loss, and when this loss is minimal, the speaker information contained in the obtained speaker-related features is optimal, yielding further optimized speaker-related features;

the timbre attention module is a multi-layer attention encoding module in which each layer contains a feed-forward network, a multi-head attention module, and a convolution module;

the query Q of the timbre attention module is the FBank features currently being processed, while both the key K and the value V are the same-speaker reference encoding obtained in S3-1.

4. The method according to claim 1, characterized in that the basic speech encoder comprises a convolutional downsampling module and a self-attention encoding module; the convolutional downsampling module is composed of two convolutional neural network layers; the self-attention encoding module is a multi-layer structure in which each layer contains a feed-forward network, a multi-head self-attention module, and a convolution module.

5. The method according to claim 1, characterized in that the multi-reference acoustic prompts encoder consists of two convolution modules and one downsampling layer; the multi-reference acoustic prompts encoder sends the speaker reference features into the first convolution module and then downsamples them by a factor of 16 to improve computational efficiency; the second convolution module further convolves the downsampled speaker reference features to aggregate the speaker information shared across different speech segments, forming the final same-speaker reference encoding.

Priority Applications (1)

Application Number: CN202411390118.XA (granted as CN118887958B) | Priority Date: 2024-10-08 | Filing Date: 2024-10-08 | Title: Speaker voice information decoupling method based on unsupervised learning

Applications Claiming Priority (1)

Application Number: CN202411390118.XA (granted as CN118887958B) | Priority Date: 2024-10-08 | Filing Date: 2024-10-08 | Title: Speaker voice information decoupling method based on unsupervised learning

Publications (2)

Publication Number | Publication Date
CN118887958A | 2024-11-01
CN118887958B (en) | 2024-12-03

Family

ID=93225031

Family Applications (1)

Application Number: CN202411390118.XA (Active, granted as CN118887958B) | Priority Date: 2024-10-08 | Filing Date: 2024-10-08 | Title: Speaker voice information decoupling method based on unsupervised learning

Country Status (1)

Country | Link
CN | CN118887958B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111785261A (en)* | 2020-05-18 | 2020-10-16 | Nanjing University of Posts and Telecommunications | Method and system for cross-language speech conversion based on disentanglement and interpretive representation
US11257503B1 (en)* | 2021-03-10 | 2022-02-22 | Vikram Ramesh Lakkavalli | Speaker recognition using domain independent embedding
CN114495973A (en)* | 2022-01-25 | 2022-05-13 | Sun Yat-sen University | Special person voice separation method based on double-path self-attention mechanism
CN115497449A (en)* | 2022-08-23 | 2022-12-20 | Harbin Institute of Technology (Shenzhen) | Zero-sample voice cloning method and device based on audio decoupling and fusion
CN115578996A (en)* | 2022-09-28 | 2023-01-06 | Huiyan Technology (Tianjin) Co., Ltd. | Speech synthesis method based on self-supervised learning and mutual information decoupling technology
CN118645083A (en)* | 2024-07-10 | 2024-09-13 | Nanjing University of Posts and Telecommunications | Cross-language speech conversion method based on encoder-decoder structure with multi-scale information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Rautenberg, F. et al., "On feature importance and interpretability of speaker representations", Speech Communication; 15th ITG Conference, 4 April 2024.*
Zhou Aiwen, "Speaker extraction and verification based on deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 02, 15 February 2023.*

Also Published As

Publication number | Publication date
CN118887958B (en) | 2024-12-03


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
