Disclosure of Invention
In view of this, the present invention provides a high-fidelity voice conversion system and method which customize the target timbre through a piece of reference audio, convert any input audio to the target timbre while fully preserving the content and rhythm of the input audio, and yield high-fidelity sound quality after conversion, so as to better satisfy voice-changing requirements in a variety of scenarios.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A high-fidelity voice conversion system comprises an audio acquisition module, a voice recognition module, a prosody encoder, a decoder and a vocoder;
the audio acquisition module acquires input audio of a source speaker and reference audio of a target speaker;
the voice recognition module is used for extracting content information and intonation information in the input audio;
a prosody encoder for extracting prosody information from the reference audio;
a decoder for generating a mel-frequency spectrogram according to the content information, the intonation information and the prosody information;
a vocoder for converting the mel-frequency spectrogram into audio.
Preferably, the speech recognition module comprises a content encoder and an intonation extractor. A wenet model is adopted as the content encoder, and a large-scale speech recognition dataset is collected to train it; after training converges, the input audio is fed into the content encoder to obtain phoneme posterior probabilities, which serve as the content information of the audio. The intonation extractor uses praat-parselmouth to extract the fundamental-frequency (F0) track of the input audio and quantizes it to obtain the intonation information.
Preferably, the prosody information comprises overall prosody information and hidden-feature prosody information, and the prosody encoder comprises a coarse-grained prosody encoder and a fine-grained prosody encoder. The coarse-grained prosody encoder comprises a plurality of convolution layers and pooling layers and obtains the overall prosody and timbre information of the target speaker from the reference audio; that is, a one-dimensional vector is derived from the reference audio through the convolution and pooling layers and used as the overall prosody information. The fine-grained prosody encoder likewise comprises a plurality of convolution layers and pooling layers and obtains multi-dimensional hidden features from the reference audio, which are used as the hidden-feature prosody information. The number of pooling layers in the coarse-grained prosody encoder is larger than that in the fine-grained prosody encoder. The overall prosody information covers the prosody and timbre of the target speaker, while the hidden-feature prosody information captures detail features such as the pronunciation and accent of each word of the target speaker.
Preferably, the content information, the intonation information, the overall prosody information and the hidden-feature prosody information are summed or concatenated at the decoder to generate the mel-spectrogram; the hidden-feature prosody information serves as the query of an attention mechanism and, after being attended with the content information, is fed into the decoder.
Preferably, the decoder employs a Transformer structure.
Preferably, the vocoder uses a HiFi-GAN model to map mel-spectrograms to audio.
Preferably, the content encoder includes Transformer encoder layers; the input audio is passed through these layers to obtain the phoneme posterior probabilities as the content information.
A high-fidelity speech conversion method, comprising the steps of:
Step 1, acquiring input audio of a source speaker and reference audio of a target speaker;
Step 2, extracting content information and intonation information from the input audio and extracting prosody information of the target speaker from the reference audio;
Step 3, generating a mel-spectrogram according to the content information, the intonation information and the prosody information;
Step 4, converting the mel-spectrogram into audio.
Compared with the prior art, the present invention discloses a high-fidelity voice conversion system and method. The system solves the problem of source-speaker identity leakage by using the phonetic posteriorgram (PPG) from speech recognition, designs prosody extraction modules of different granularities to make the timbre of the converted audio as similar as possible to that of the target speaker, and extracts the fundamental frequency from the input audio as additional guidance to stabilize the intonation after conversion. In this way, the system learns from the reference audio of the target speaker and converts the input audio, i.e. the audio to be converted, into audio with the target speaker's timbre. The system and method overcome the low timbre similarity and unnatural sound quality of current voice conversion techniques and can better satisfy voice-changing requirements in a variety of scenarios.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the protection scope of the invention.
Example 1
The embodiment of the invention discloses a high-fidelity voice conversion system which, as shown in figure 1, comprises an audio acquisition module, a voice recognition module, a prosody encoder, a decoder and a vocoder;
the audio acquisition module acquires input audio of a source speaker and reference audio of a target speaker;
the voice recognition module is used for extracting content information and intonation information in the input audio;
a prosody encoder for extracting prosody information from the reference audio;
a decoder for generating a mel-frequency spectrogram according to the content information, the intonation information and the prosody information;
a vocoder for converting the mel-frequency spectrogram into audio.
Further, the speech recognition module comprises a content encoder and an intonation extractor. A wenet model is adopted as the content encoder and is trained on a large-scale speech recognition dataset; after training converges, the input audio is fed into the content encoder to obtain phoneme posterior probabilities, which serve as the content information of the audio. The intonation extractor uses praat-parselmouth to extract the fundamental-frequency (F0) track of the input audio and quantizes it to obtain the intonation information.
The phoneme posterior probability (phonetic posteriorgram, PPG) from speech recognition is used as the content information; that is, the probability distribution of each speech frame over the phoneme set serves as the content information. The speech recognition algorithm wenet is pre-trained on large-scale speech data with phonemes as the recognition units.
Because the PPG is close to text, it unavoidably lacks intonation information; converting with the PPG alone makes it difficult to preserve the intonation of the original audio, and after conversion the pronunciation of certain words may even become abnormal due to misalignment. The fundamental frequency extracted by the intonation extractor compensates for this.
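In practice the F0 track may be obtained with praat-parselmouth (e.g. `parselmouth.Sound(path).to_pitch()`), and the quantization step can then be sketched in plain Python as below. The bin count and frequency range are illustrative assumptions, not values fixed by the invention.

```python
import math

def quantize_f0(f0_track, n_bins=256, f_min=50.0, f_max=600.0):
    """Quantize a fundamental-frequency (F0) track into discrete tokens.

    Unvoiced frames (f0 == 0) map to token 0; voiced frames map to
    tokens 1..n_bins-1 on a log-frequency scale, which matches pitch
    perception better than a linear scale.
    """
    log_min, log_max = math.log(f_min), math.log(f_max)
    tokens = []
    for f0 in f0_track:
        if f0 <= 0:                       # unvoiced frame
            tokens.append(0)
            continue
        f0 = min(max(f0, f_min), f_max)   # clip to the valid range
        rel = (math.log(f0) - log_min) / (log_max - log_min)
        tokens.append(1 + min(int(rel * (n_bins - 1)), n_bins - 2))
    return tokens

# Example: a short synthetic F0 track in Hz; 0.0 marks unvoiced frames.
track = [0.0, 120.0, 125.0, 0.0, 240.0]
print(quantize_f0(track))
```

The discrete tokens can then be embedded and fed to the decoder alongside the PPG.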
Further, the prosody information comprises overall prosody information and hidden-feature prosody information, and the prosody encoder comprises a coarse-grained prosody encoder and a fine-grained prosody encoder. The coarse-grained prosody encoder comprises a plurality of convolution layers and pooling layers and obtains the overall prosody and timbre information of the target speaker from the reference audio: a one-dimensional vector derived from the reference audio through the convolution and pooling layers serves as the overall prosody information. The fine-grained prosody encoder likewise comprises a plurality of convolution layers and pooling layers and obtains multi-dimensional hidden features from the reference audio as the hidden-feature prosody information. The number of pooling layers in the coarse-grained prosody encoder is larger than that in the fine-grained prosody encoder.
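The effect of the pooling-layer count on granularity can be illustrated with a minimal sketch, in which plain-Python average pooling stands in for the convolution-and-pooling stacks; the layer counts and frame length below are illustrative assumptions only.

```python
def avg_pool(seq, stride=2):
    """Average-pool a 1-D feature sequence with the given stride."""
    return [sum(seq[i:i + stride]) / len(seq[i:i + stride])
            for i in range(0, len(seq), stride)]

def encode(seq, n_pool_layers):
    """Stand-in for a conv+pool stack: each pooling layer halves the time axis."""
    for _ in range(n_pool_layers):
        seq = avg_pool(seq)
    return seq

frames = [float(i) for i in range(64)]   # 64 reference-audio frames
fine   = encode(frames, 2)   # fewer pooling layers: 16 hidden frames remain
coarse = encode(frames, 6)   # more pooling layers: reduced to a single value
print(len(fine), len(coarse))
```

With more pooling layers the coarse-grained branch collapses the time axis into a single global vector (overall prosody/timbre), while the fine-grained branch keeps a sequence of hidden frames (per-word detail).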
Further, the content information, the intonation information, the overall prosody information and the hidden-feature prosody information are summed or concatenated at the decoder to generate the mel-spectrogram; the hidden-feature prosody information serves as the query of an attention mechanism and, after being attended with the content information, is fed into the decoder.
Furthermore, the decoder adopts a Transformer structure, which integrates the input intonation, content and prosody information and finally generates the converted acoustic features, namely the mel-spectrogram.
Prosody is an implicit characteristic of speech and can be understood as the style of speaking, such as the rise and fall of the voice and the timbre style (for example hoarse, crisp, or otherwise), which is difficult to describe with specific numerical values. The invention divides prosody into two kinds, coarse-grained and fine-grained. Coarse-grained prosody represents the overall style of the audio, including the overall speaking emotion, the speech rate, and so on; fine-grained prosody represents finer information of the audio, such as the variation of stress on each word and the transition from one moment of speech to the next. The prosody information on the one hand helps the decoder reconstruct the mel-spectrum, and on the other hand better imitates the style of the target speaker, making the converted audio more similar to the target speaker.
The coarse-grained prosody encoder obtains the overall prosody and timbre information of the speaker from the reference audio: a one-dimensional vector obtained through a plurality of convolution and pooling layers, which is summed or concatenated with the content information for the later generation of the mel-spectrum. The fine-grained prosody encoder obtains multi-dimensional hidden features from the reference audio through its convolution and pooling layers; these hidden features serve as the query of an attention mechanism and, after being attended with the content information, are fed into the decoder to generate the acoustic features of the speech.
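One plausible reading of this attention step, sketched below in plain Python, is standard scaled dot-product attention with the fine-grained prosody hidden features as queries and the content features as keys and values; the toy feature vectors are hypothetical and only illustrate the shapes involved.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: 2 fine-grained prosody frames as queries,
# 3 content (PPG-derived) frames as keys/values.
prosody = [[1.0, 0.0], [0.0, 1.0]]
content = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = attend(prosody, content, content)
print(len(fused), len(fused[0]))   # one fused vector per prosody frame
```

The fused sequence is then passed to the decoder together with the other features.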
Further, the vocoder maps the mel-spectrogram to audio using a HiFi-GAN model. The vocoder is trained on data from multiple speakers so that it generalizes across speakers.
Further, the vocoder may be additionally fine-tuned on the audio of the target speaker by adjusting the vocoder's parameters.
Further, the content encoder includes Transformer encoder layers; the input audio is passed through these layers to obtain the phoneme posterior probabilities, i.e. the content information.
As shown in fig. 2, the probability distribution of each frame over the phoneme set, output by the Transformer encoder of the wenet model, is the PPG, and pre-training on large-scale data gives the PPG strong generalization to various input audio. The PPG has natural advantages as the content information for voice conversion: it fully contains all the textual content of the speech, retains the same speech-rate information as the original audio, and, being very close to text, contains almost no speaker information, so it essentially filters out the timbre of the original audio. This makes it a highly suitable content representation for voice conversion.
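The structure of a PPG can be sketched as follows: a per-frame softmax over the recognizer's phoneme logits, yielding one probability distribution per speech frame. The logit values and the four-phoneme inventory below are hypothetical, purely to show the shape.

```python
import math

def ppg_from_logits(logits):
    """Convert per-frame phoneme logits (e.g. from a recognizer's
    Transformer encoder) into a phonetic posteriorgram: one probability
    distribution over the phoneme set per speech frame."""
    ppg = []
    for frame in logits:
        m = max(frame)                     # stabilize the exponentials
        exps = [math.exp(x - m) for x in frame]
        s = sum(exps)
        ppg.append([e / s for e in exps])
    return ppg

# Hypothetical logits: 3 frames x 4 phonemes.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 3.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 2.5]]
ppg = ppg_from_logits(logits)
for row in ppg:
    print(max(range(4), key=lambda i: row[i]))  # most likely phoneme id
```

Each row sums to one, and the per-frame distributions preserve timing (speech rate) while carrying no speaker-specific timbre.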
Example 2
In one embodiment, a high-fidelity speech conversion method comprises the steps of:
s1, acquiring input audio of a source speaker and reference audio of a target speaker;
S2, extracting content information and intonation information from the input audio and extracting prosody information of a target speaker from the reference audio;
s3, generating a Mel spectrogram according to the content information, the intonation information and the prosody information;
s4, converting the Mel spectrogram into audio.
Firstly, content information and intonation information are extracted from the input audio, and the prosody information of the target speaker is extracted from the reference audio; the decoder combines the extracted content, intonation and prosody (speaker) information to generate the converted mel-spectrogram, and the vocoder converts the generated spectrogram into audio. During the training phase of the whole system, the input audio and the reference audio may be the same audio, or different audio from the same speaker. After pre-training on large-scale speech data, fine-tuning is performed on the data of the target speaker to achieve a better conversion effect. At inference, the input audio is the audio to be converted and the reference audio is the audio of the target speaker, so that the audio to be converted is converted into audio with the target speaker's timbre.
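The inference flow of steps S1-S4 can be summarized in a short orchestration sketch. All names below are illustrative stand-ins for the trained components, not an actual API; the toy lambdas merely exercise the control flow.

```python
def convert(input_audio, reference_audio,
            content_encoder, intonation_extractor,
            coarse_encoder, fine_encoder, decoder, vocoder):
    """Inference pipeline of steps S1-S4 (hypothetical interfaces)."""
    ppg = content_encoder(input_audio)        # S2: content information (PPG)
    f0 = intonation_extractor(input_audio)    # S2: intonation information
    coarse = coarse_encoder(reference_audio)  # S2: overall prosody/timbre
    fine = fine_encoder(reference_audio)      # S2: fine-grained prosody
    mel = decoder(ppg, f0, coarse, fine)      # S3: mel-spectrogram
    return vocoder(mel)                       # S4: waveform

# Toy stand-ins just to show the data flow through the modules.
wav = convert("in.wav", "ref.wav",
              content_encoder=lambda a: f"ppg({a})",
              intonation_extractor=lambda a: f"f0({a})",
              coarse_encoder=lambda a: f"coarse({a})",
              fine_encoder=lambda a: f"fine({a})",
              decoder=lambda *feats: "mel:" + "+".join(feats),
              vocoder=lambda mel: "audio<" + mel + ">")
print(wav)  # audio<mel:ppg(in.wav)+f0(in.wav)+coarse(ref.wav)+fine(ref.wav)>
```

Note that only the encoders for the reference audio see the target speaker, which is what allows timbre customization by simply swapping the reference audio.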
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.