Disclosure of Invention
In view of this, the present invention provides a high-fidelity voice conversion system and method which customize the target timbre through a piece of reference audio, convert any input audio to the target timbre while fully preserving the content and rhythm of the input audio, and yield high-fidelity sound quality after conversion, so as to better satisfy voice-changing requirements in a variety of scenarios.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A high-fidelity voice conversion system comprises an audio acquisition module, a voice recognition module, a prosody encoder, a decoder and a vocoder;
the audio acquisition module acquires input audio of a source speaker and reference audio of a target speaker;
the voice recognition module is used for extracting content information and intonation information in the input audio;
a prosody encoder for extracting prosody information from the reference audio;
a decoder for generating a mel-frequency spectrogram according to the content information, the intonation information and the prosody information;
a vocoder for converting the mel-frequency spectrogram into audio.
Preferably, the speech recognition module comprises a content encoder and an intonation extractor. A wenet model is adopted as the content encoder, and a large-scale speech recognition dataset is collected to train it; after training converges, the input audio is fed into the content encoder to obtain phoneme posterior probabilities, which serve as the content information of the audio. The intonation extractor uses praat-parselmouth to extract the fundamental-frequency (F0) track of the input audio and quantizes it to obtain the intonation information.
Preferably, the prosody information comprises overall prosody information and hidden-feature prosody information, and the prosody encoder comprises a coarse-grained prosody encoder and a fine-grained prosody encoder. The coarse-grained prosody encoder comprises a plurality of convolution layers and pooling layers and obtains the overall prosody and timbre information of the target speaker from the reference audio; that is, a one-dimensional vector is derived from the reference audio through the convolution and pooling layers and used as the overall prosody information. The fine-grained prosody encoder likewise comprises a plurality of convolution layers and pooling layers and obtains multi-dimensional hidden features from the reference audio, which are used as the hidden-feature prosody information. The number of pooling layers in the coarse-grained prosody encoder is larger than that in the fine-grained prosody encoder. The overall prosody information covers the prosody and timbre of the target speaker, while the hidden-feature prosody information captures detail features such as the pronunciation and accent of each word of the target speaker.
Preferably, the content information, the intonation information, the overall prosody information and the hidden-feature prosody information are summed or concatenated at the decoder to generate the mel-spectrogram; the hidden-feature prosody information serves as the query of an attention mechanism and, after being attended with the content information, is fed into the decoder.
Preferably, the decoder employs a Transformer structure.
Preferably, the vocoder uses a HiFi-GAN model to map mel-spectrograms to audio.
Preferably, the content encoder includes Transformer encoder layers; the input audio is passed through these layers to obtain the phoneme posterior probabilities as the content information.
A high-fidelity speech conversion method, comprising the steps of:
Step 1, acquiring input audio of a source speaker and reference audio of a target speaker;
Step 2, extracting content information and intonation information from the input audio and extracting prosody information of the target speaker from the reference audio;
Step 3, generating a mel-spectrogram according to the content information, the intonation information and the prosody information;
Step 4, converting the mel-spectrogram into audio.
Compared with the prior art, the present invention discloses a high-fidelity voice conversion system and method. The system solves the problem of source-speaker identity leakage by using the phonetic posteriorgram (PPG) from speech recognition, designs prosody extraction modules of different granularities to make the timbre of the converted audio as similar as possible to that of the target speaker, and extracts the fundamental frequency from the input audio as additional guidance to stabilize the intonation after conversion. In this way, the system learns from the reference audio of the target speaker and converts the input audio, i.e. the audio to be converted, into audio with the target speaker's timbre. The system and method overcome the low timbre similarity and unnatural sound quality of current voice conversion techniques and can better satisfy voice-changing requirements in a variety of scenarios.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the protection scope of the invention.
Example 1
The embodiment of the invention discloses a high-fidelity voice conversion system which, as shown in figure 1, comprises an audio acquisition module, a voice recognition module, a prosody encoder, a decoder and a vocoder;
the audio acquisition module acquires input audio of a source speaker and reference audio of a target speaker;
the voice recognition module is used for extracting content information and intonation information in the input audio;
a prosody encoder for extracting prosody information from the reference audio;
a decoder for generating a mel-frequency spectrogram according to the content information, the intonation information and the prosody information;
a vocoder for converting the mel-frequency spectrogram into audio.
Further, the speech recognition module comprises a content encoder and an intonation extractor. A wenet model is adopted as the content encoder and is trained on a large-scale speech recognition dataset; after training converges, the input audio is fed into the content encoder to obtain phoneme posterior probabilities, which serve as the content information of the audio. The intonation extractor uses praat-parselmouth to extract the fundamental-frequency (F0) track of the input audio and quantizes it to obtain the intonation information.
The phoneme posterior probability (phonetic posteriorgram, PPG) from speech recognition is used as the content information; that is, the probability distribution of each speech frame over the phoneme set serves as the content information. The speech recognition algorithm wenet is pre-trained on large-scale speech data with phonemes as the recognition units.
Because the PPG is close to text, it unavoidably lacks intonation information; converting with the PPG alone makes it difficult to preserve the intonation of the original audio, and after conversion the pronunciation of certain words may even become abnormal due to misalignment. The fundamental frequency extracted by the intonation extractor compensates for this.
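In practice the F0 track may be obtained with praat-parselmouth (e.g. `parselmouth.Sound(path).to_pitch()`), and the quantization step can then be sketched in plain Python as below. The bin count and frequency range are illustrative assumptions, not values fixed by the invention.

```python
import math

def quantize_f0(f0_track, n_bins=256, f_min=50.0, f_max=600.0):
    """Quantize a fundamental-frequency (F0) track into discrete tokens.

    Unvoiced frames (f0 == 0) map to token 0; voiced frames map to
    tokens 1..n_bins-1 on a log-frequency scale, which matches pitch
    perception better than a linear scale.
    """
    log_min, log_max = math.log(f_min), math.log(f_max)
    tokens = []
    for f0 in f0_track:
        if f0 <= 0:                       # unvoiced frame
            tokens.append(0)
            continue
        f0 = min(max(f0, f_min), f_max)   # clip to the valid range
        rel = (math.log(f0) - log_min) / (log_max - log_min)
        tokens.append(1 + min(int(rel * (n_bins - 1)), n_bins - 2))
    return tokens

# Example: a short synthetic F0 track in Hz; 0.0 marks unvoiced frames.
track = [0.0, 120.0, 125.0, 0.0, 240.0]
print(quantize_f0(track))
```

The discrete tokens can then be embedded and fed to the decoder alongside the PPG.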
Further, the prosody information comprises overall prosody information and hidden-feature prosody information, and the prosody encoder comprises a coarse-grained prosody encoder and a fine-grained prosody encoder. The coarse-grained prosody encoder comprises a plurality of convolution layers and pooling layers and obtains the overall prosody and timbre information of the target speaker from the reference audio: a one-dimensional vector derived from the reference audio through the convolution and pooling layers serves as the overall prosody information. The fine-grained prosody encoder likewise comprises a plurality of convolution layers and pooling layers and obtains multi-dimensional hidden features from the reference audio as the hidden-feature prosody information. The number of pooling layers in the coarse-grained prosody encoder is larger than that in the fine-grained prosody encoder.
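The effect of the pooling-layer count on granularity can be illustrated with a minimal sketch, in which plain-Python average pooling stands in for the convolution-and-pooling stacks; the layer counts and frame length below are illustrative assumptions only.

```python
def avg_pool(seq, stride=2):
    """Average-pool a 1-D feature sequence with the given stride."""
    return [sum(seq[i:i + stride]) / len(seq[i:i + stride])
            for i in range(0, len(seq), stride)]

def encode(seq, n_pool_layers):
    """Stand-in for a conv+pool stack: each pooling layer halves the time axis."""
    for _ in range(n_pool_layers):
        seq = avg_pool(seq)
    return seq

frames = [float(i) for i in range(64)]   # 64 reference-audio frames
fine   = encode(frames, 2)   # fewer pooling layers: 16 hidden frames remain
coarse = encode(frames, 6)   # more pooling layers: reduced to a single value
print(len(fine), len(coarse))
```

With more pooling layers the coarse-grained branch collapses the time axis into a single global vector (overall prosody/timbre), while the fine-grained branch keeps a sequence of hidden frames (per-word detail).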
Further, the content information, the intonation information, the overall prosody information and the hidden-feature prosody information are summed or concatenated at the decoder to generate the mel-spectrogram; the hidden-feature prosody information serves as the query of an attention mechanism and, after being attended with the content information, is fed into the decoder.
Furthermore, the decoder adopts a Transformer structure, which integrates the input intonation, content and prosody information and finally generates the converted acoustic features, namely the mel-spectrogram.
Prosody is an implicit characteristic of speech and can be understood as the style of speaking, such as the rise and fall of the voice and the timbre style (for example hoarse, crisp, or otherwise), which is difficult to describe with specific numerical values. The invention divides prosody into two kinds, coarse-grained and fine-grained. Coarse-grained prosody represents the overall style of the audio, including the overall speaking emotion, the speech rate, and so on; fine-grained prosody represents finer information of the audio, such as the variation of stress on each word and the transition from one moment of speech to the next. The prosody information on the one hand helps the decoder reconstruct the mel-spectrum, and on the other hand better imitates the style of the target speaker, making the converted audio more similar to the target speaker.
The coarse-grained prosody encoder obtains the overall prosody and timbre information of the speaker from the reference audio: a one-dimensional vector obtained through a plurality of convolution and pooling layers, which is summed or concatenated with the content information for the later generation of the mel-spectrum. The fine-grained prosody encoder obtains multi-dimensional hidden features from the reference audio through its convolution and pooling layers; these hidden features serve as the query of an attention mechanism and, after being attended with the content information, are fed into the decoder to generate the acoustic features of the speech.
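One plausible reading of this attention step, sketched below in plain Python, is standard scaled dot-product attention with the fine-grained prosody hidden features as queries and the content features as keys and values; the toy feature vectors are hypothetical and only illustrate the shapes involved.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: 2 fine-grained prosody frames as queries,
# 3 content (PPG-derived) frames as keys/values.
prosody = [[1.0, 0.0], [0.0, 1.0]]
content = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = attend(prosody, content, content)
print(len(fused), len(fused[0]))   # one fused vector per prosody frame
```

The fused sequence is then passed to the decoder together with the other features.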
Further, the vocoder maps the mel-spectrogram to audio using a HiFi-GAN model. The vocoder is trained on data from multiple speakers so that it generalizes across speakers.
Further, the vocoder may be additionally fine-tuned on the audio of the target speaker by adjusting the vocoder's parameters.
Further, the content encoder includes Transformer encoder layers; the input audio is passed through these layers to obtain the phoneme posterior probabilities, i.e. the content information.
As shown in fig. 2, the probability distribution of each frame over the phoneme set, output by the Transformer encoder of the wenet model, is the PPG, and pre-training on large-scale data gives the PPG strong generalization to various input audio. The PPG has natural advantages as the content information for voice conversion: it fully contains all the textual content of the speech, retains the same speech-rate information as the original audio, and, being very close to text, contains almost no speaker information, so it essentially filters out the timbre of the original audio. This makes it a highly suitable content representation for voice conversion.
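The structure of a PPG can be sketched as follows: a per-frame softmax over the recognizer's phoneme logits, yielding one probability distribution per speech frame. The logit values and the four-phoneme inventory below are hypothetical, purely to show the shape.

```python
import math

def ppg_from_logits(logits):
    """Convert per-frame phoneme logits (e.g. from a recognizer's
    Transformer encoder) into a phonetic posteriorgram: one probability
    distribution over the phoneme set per speech frame."""
    ppg = []
    for frame in logits:
        m = max(frame)                     # stabilize the exponentials
        exps = [math.exp(x - m) for x in frame]
        s = sum(exps)
        ppg.append([e / s for e in exps])
    return ppg

# Hypothetical logits: 3 frames x 4 phonemes.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 3.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 2.5]]
ppg = ppg_from_logits(logits)
for row in ppg:
    print(max(range(4), key=lambda i: row[i]))  # most likely phoneme id
```

Each row sums to one, and the per-frame distributions preserve timing (speech rate) while carrying no speaker-specific timbre.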
Example 2
In one embodiment, a high-fidelity speech conversion method comprises the steps of:
s1, acquiring input audio of a source speaker and reference audio of a target speaker;
S2, extracting content information and intonation information from the input audio and extracting prosody information of a target speaker from the reference audio;
s3, generating a Mel spectrogram according to the content information, the intonation information and the prosody information;
s4, converting the Mel spectrogram into audio.
Firstly, content information and intonation information are extracted from the input audio, and the prosody information of the target speaker is extracted from the reference audio; the decoder combines the extracted content, intonation and prosody (speaker) information to generate the converted mel-spectrogram, and the vocoder converts the generated spectrogram into audio. During the training phase of the whole system, the input audio and the reference audio may be the same audio, or different audio from the same speaker. After pre-training on large-scale speech data, fine-tuning is performed on the data of the target speaker to achieve a better conversion effect. At inference, the input audio is the audio to be converted and the reference audio is the audio of the target speaker, so that the audio to be converted is converted into audio with the target speaker's timbre.
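The inference flow of steps S1-S4 can be summarized in a short orchestration sketch. All names below are illustrative stand-ins for the trained components, not an actual API; the toy lambdas merely exercise the control flow.

```python
def convert(input_audio, reference_audio,
            content_encoder, intonation_extractor,
            coarse_encoder, fine_encoder, decoder, vocoder):
    """Inference pipeline of steps S1-S4 (hypothetical interfaces)."""
    ppg = content_encoder(input_audio)        # S2: content information (PPG)
    f0 = intonation_extractor(input_audio)    # S2: intonation information
    coarse = coarse_encoder(reference_audio)  # S2: overall prosody/timbre
    fine = fine_encoder(reference_audio)      # S2: fine-grained prosody
    mel = decoder(ppg, f0, coarse, fine)      # S3: mel-spectrogram
    return vocoder(mel)                       # S4: waveform

# Toy stand-ins just to show the data flow through the modules.
wav = convert("in.wav", "ref.wav",
              content_encoder=lambda a: f"ppg({a})",
              intonation_extractor=lambda a: f"f0({a})",
              coarse_encoder=lambda a: f"coarse({a})",
              fine_encoder=lambda a: f"fine({a})",
              decoder=lambda *feats: "mel:" + "+".join(feats),
              vocoder=lambda mel: "audio<" + mel + ">")
print(wav)  # audio<mel:ppg(in.wav)+f0(in.wav)+coarse(ref.wav)+fine(ref.wav)>
```

Note that only the encoders for the reference audio see the target speaker, which is what allows timbre customization by simply swapping the reference audio.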
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.