Disclosure of Invention
The invention aims to provide an end-to-end accent conversion method, which aims to solve the problem that conventional approaches must rely on native-accent reference speech when converting a non-native accent into a native accent.
In order to solve the above technical problem, the technical solution of the invention is as follows: an end-to-end accent conversion method comprising an accent conversion system for realizing the accent conversion method, wherein the accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder; the speech recognition module is used for converting the acoustic features of the input non-native-accent speech into signal parameters of the native accent, and these signal parameters are related only to the spoken content of the non-native-accent speech; the signal parameters and the speaker embedding vector are input into the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accent speech of the specific speaker.
As a preferred embodiment of the present invention, the speaker encoder is a scalable speaker-verification neural network framework that converts the acoustic frames computed from input speech of arbitrary length into a fixed-dimensional speaker embedding vector that is related only to the speaker identity.
As a preferred embodiment of the present invention, the speech synthesis module uses a mean square error loss L_TTS to train the speech synthesis model so that it generates only native-accent speech whose timbre is determined by the speaker embedding.
As a preferred embodiment of the present invention, the speech recognition module is configured to convert the acoustic features of non-native-accent speech into signal parameters of the native accent.
As a preferred embodiment of the present invention, the neural network vocoder is a WaveRNN network, an LPCNet or a WaveNet.
As a preferred embodiment of the present invention, the method comprises the steps of: a. collecting speech information and corresponding text information of a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. selecting speakers with non-native accents as target speakers; c. using the text samples and speech samples of the target speaker to train the speech recognition module.
The beneficial effects of the above technical solution are: the end-to-end accent conversion method provided by the invention can convert a non-native accent into a native accent without any guidance from native-accent reference audio during the conversion process, while preserving the original timbre of the speaker.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit the present invention. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict with each other.
The embodiment provides an end-to-end accent conversion method, comprising an accent conversion system for realizing the method. The accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder. The speech recognition module converts the acoustic features of the input non-native-accent speech into signal parameters of the native accent, and these signal parameters are related only to the spoken content of the non-native-accent speech. The signal parameters and the speaker embedding vector are input into the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accent speech of the specific speaker. The speech recognition module is thus used for converting the acoustic features of non-native-accent speech into signal parameters of the native accent.
The accent conversion method mainly comprises the following steps: a. collecting speech information and corresponding text information of a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. selecting speakers with non-native accents as target speakers; c. using the text samples and speech samples of the target speaker to train the speech recognition module.
The speaker encoder is a scalable speaker-verification neural network framework: it computes variable-length acoustic features from input speech of arbitrary length and converts them into a fixed-dimensional speaker embedding vector. The speaker encoder model used here is a scalable and highly accurate speaker-verification neural network that generates a fixed-dimensional speaker embedding vector from a sequence of acoustic frames computed from an utterance of arbitrary length. The method uses the speaker embedding to condition the TTS model on a reference speech signal of the target speaker, so that the generated speech carries the speaker identity of the target speaker. The speaker encoder is trained by optimizing a generalized end-to-end (GE2E) speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity, while embeddings of utterances from different speakers are far apart in the embedding space. The speaker encoder is expected to learn a representation relevant to speech synthesis that captures the characteristics of non-native-accent speakers not seen during training.
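As a hedged illustration only (not the exact network of the invention), the following PyTorch sketch shows how a recurrent speaker encoder can map a variable-length sequence of acoustic frames to a fixed-dimensional, L2-normalized embedding, and how cosine similarity between embeddings is measured in a GE2E-style criterion; the layer sizes, names and input dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps (batch, frames, n_mels) acoustic frames to a fixed-size speaker embedding."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):
        _, (h, _) = self.lstm(mels)          # h: (num_layers, batch, hidden)
        emb = self.proj(h[-1])               # final hidden state of the last layer
        return F.normalize(emb, dim=-1)      # L2-normalized speaker embedding

def cosine_similarity(emb_a, emb_b):
    # Embeddings of the same speaker should score high; different speakers low.
    return torch.sum(emb_a * emb_b, dim=-1)
```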
The speech synthesis module is trained with a mean square error loss L_TTS so that the speech synthesis model generates only native-accent speech in the timbre determined by the speaker embedding. The model is an attention-based encoder-decoder model extended to support multiple speakers. As shown in Fig. 1, the speaker embedding vector of the desired target speaker, computed by the speaker encoder, is concatenated with the TTS decoder output at each time step, and the attention-based TTS decoder uses it as an additional input to generate the mel spectrogram. Using the mean square error (MSE) loss L_TTS, the TTS model is trained to generate only native-accent speech whose identity/timbre is determined by the speaker embedding. We map the text transcript into a phoneme sequence as input to the TTS model, since it has been shown that using phoneme sequences speeds up convergence and improves the pronunciation of rare words and proper nouns.
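A minimal sketch of the conditioning and loss described above, under the assumption that the decoder state and mel targets are plain tensors; the function names are illustrative and do not claim to reproduce the exact TTS model.

```python
import torch
import torch.nn.functional as F

def conditioned_decoder_input(decoder_state, speaker_embedding):
    # The target speaker's embedding is concatenated at each decoder time step
    # as an additional input when generating the mel spectrogram.
    return torch.cat([decoder_state, speaker_embedding], dim=-1)

def l_tts(predicted_mel, target_mel):
    # L_TTS: mean squared error between predicted and reference mel spectrograms.
    return F.mse_loss(predicted_mel, target_mel)
```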
The multi-task ASR model is also connected to a fully-connected conversion layer for computing a connectionist temporal classification loss L_CTC. We learn an accent-independent linguistic representation from the acoustic features using the accent ASR model. The ASR model applies an end-to-end attention-based encoder-decoder framework. Given a pair of audio and its phoneme transcription, acoustic features are computed from the audio and the TTS decoder computes a linguistic representation from the phoneme sequence, respectively. We add a fully-connected (FC) conversion layer to the ASR encoder and compute the connectionist temporal classification (CTC) loss L_CTC to stabilize the training process. Since the training data of the ASR model contains accented speech, we concatenate the accent embedding with the acoustic features of each frame as input to the ASR model and add an accent classifier on top of the ASR encoder, making the model more robust to accented speech. We assume that different accents are associated with different speakers; the accent embedding of a speaker is obtained by averaging all embeddings of that speaker. The output of the accent classifier is used to compute a cross-entropy loss L_ACC. The phoneme labels and the linguistic representation are predicted in two streams by the attention-based ASR decoder. A cross-entropy loss L_CE is used for phoneme label prediction, and an MSE loss L_TTSE measures the linguistic difference between the linguistic representation Hl output by the TTS decoder and that output by the ASR decoder.
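The following sketch illustrates, under assumed tensor shapes and module names, how the accent embedding can be concatenated with the per-frame acoustic features and how the auxiliary CTC and accent-classifier losses can be computed in PyTorch; it is not the exact implementation.

```python
import torch
import torch.nn.functional as F

def asr_encoder_inputs(acoustic_feats, accent_embedding):
    # acoustic_feats: (batch, frames, feat_dim); accent_embedding: (batch, emb_dim)
    frames = acoustic_feats.size(1)
    accent = accent_embedding.unsqueeze(1).expand(-1, frames, -1)
    return torch.cat([acoustic_feats, accent], dim=-1)   # per-frame concatenation

def ctc_loss_from_encoder(encoder_out, fc_layer, targets, in_lens, tgt_lens):
    # FC conversion layer on top of the ASR encoder, followed by the CTC loss L_CTC.
    log_probs = F.log_softmax(fc_layer(encoder_out), dim=-1)        # (batch, frames, vocab)
    return F.ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)

def accent_classifier_loss(classifier, encoder_out, accent_labels):
    # Accent classifier on top of the ASR encoder; cross-entropy loss L_ACC.
    logits = classifier(encoder_out)                                # (batch, n_accents)
    return F.cross_entropy(logits, accent_labels)
```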
The neural network vocoder is a WaveRNN, an LPCNet or a WaveNet network. A WaveRNN is preferably employed here as the neural network vocoder, and training is carried out with open-source PyTorch. Since the mel spectrogram captures all the relevant details required for high-quality speech synthesis, we train the WaveRNN using only the mel spectrograms from multiple speakers, without adding any speaker embedding.
In the training phase, as shown in Fig. 1, we first train the speaker encoder model. Then, a multi-speaker TTS model is trained with the loss L_TTS using only native English speech data. After that, the accent ASR model is first pre-trained using speech data from a plurality of native speakers and one non-native speaker, and then fine-tuned using only speech data from the non-native target speaker. In both stages, the ASR model is trained with the loss L_ASR in Equation 2. The WaveRNN is trained using only speech data from native-accent speakers. In the accent conversion phase, as shown in Fig. 2, acoustic features are first computed from the non-native-accent speech. The speaker encoder then takes the acoustic features and outputs a speaker embedding vector representing the identity of the non-native-accent speaker. The accent embedding is the average embedding of the non-native-accent speakers. We concatenate the accent embedding with the acoustic features of each frame and feed them into the ASR model to generate the linguistic representation Hl; the attention-based TTS decoder then combines the linguistic representation with the speaker embedding to generate native-accent acoustic features. Finally, we use the WaveRNN model to transform the acoustic features into time-domain waveforms, which are expected to present the accent more naturally.
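For clarity, the conversion phase can be summarized by the following high-level sketch; every model and helper passed in is a placeholder assumed from the description above, not the actual code of the invention.

```python
import torch

def convert_accent(wav, extract_features, speaker_encoder, accent_embedding,
                   asr_model, tts_decoder, vocoder):
    feats = extract_features(wav)                       # acoustic features of the non-native speech
    spk_emb = speaker_encoder(feats)                    # speaker embedding (keeps the original timbre)
    frames = feats.size(1)
    accent = accent_embedding.unsqueeze(1).expand(-1, frames, -1)
    asr_in = torch.cat([feats, accent], dim=-1)         # accent embedding joined to every frame
    h_l = asr_model(asr_in)                             # accent-independent linguistic representation Hl
    mel = tts_decoder(h_l, spk_emb)                     # native-accent mel spectrogram in the target timbre
    return vocoder(mel)                                 # time-domain waveform (e.g. WaveRNN)
```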
The accent conversion method comprises the following steps: a. collecting speech information of a plurality of speakers; b. selecting the speaker with a non-native accent as the target speaker; c. using a portion of the target speaker's speech samples for training the multi-speaker TTS model and the neural network vocoder, and another portion for training the multi-task ASR model.
In the experimental stage, we collected a speech corpus containing 109 speakers and about 44 hours of clean speech. In this study, we selected speaker p248 of the open-source corpus as the target speaker. Speakers without a Hindi accent are regarded as native English speakers, and the audio is resampled to 22.05 kHz. For training the TTS model and the WaveRNN model, we use only the speech data of the 105 native speakers. The mel spectrogram has 80 channels and is computed with a 50 ms window width and a 12.5 ms frame shift. 1000 samples are randomly drawn as a validation set, and the remaining samples are used for training. For training the accent ASR model, speech data from the native-accent speakers and from p248 are used. 40-channel mel spectra, computed with a 25 ms window and a 10 ms frame shift, together with their delta features, are used as acoustic features. The numbers of p248 utterances used for training, validation and testing are 326, 25 and 25, respectively. The SI-ASR model for extracting PPGs is trained using the TIMIT dataset.
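As an illustration of the feature settings quoted above, the following sketch uses librosa (an assumed tool choice; the original implementation is not specified) to compute the 80-channel mel spectrogram for TTS/vocoder training and the 40-channel mel features with deltas for the accent ASR model.

```python
import librosa
import numpy as np

def tts_mel(wav_path):
    y, sr = librosa.load(wav_path, sr=22050)            # resample to 22.05 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048,
        win_length=int(0.050 * sr),                     # 50 ms window
        hop_length=int(0.0125 * sr),                    # 12.5 ms frame shift
        n_mels=80)                                      # 80 mel channels
    return np.log(mel + 1e-6)

def asr_features(wav_path):
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024,
        win_length=int(0.025 * sr),                     # 25 ms window
        hop_length=int(0.010 * sr),                     # 10 ms frame shift
        n_mels=40)                                      # 40 mel channels
    logmel = np.log(mel + 1e-6)
    delta = librosa.feature.delta(logmel)               # delta features
    return np.concatenate([logmel, delta], axis=0)
```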
Fig. 1 shows the training phase of the method. Hs denotes the speaker representation, and Hl_TTS and Hl_ASR denote the linguistic representations of the TTS decoder and the ASR decoder, respectively.
Fig. 2 shows the conversion phase of the method. Hs and Hl represent the speaker representation and the linguistic representation, respectively. L_TTSE is expressed as follows:

L_TTSE = (1/N) * Σ_{i=1}^{N} || Hl_TTS(i) - Hl_ASR(i) ||^2        (Equation 1)

where N is the number of training samples. The ASR model is trained with the multi-task loss:

L_ASR = λ1*L_CE + λ2*L_TTSE + λ3*L_CTC + λ4*L_ACC        (Equation 2)

where the λ terms are hyper-parameters weighting the four losses.
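A small sketch of Equations 1 and 2 in code form, assuming the linguistic representations are batched tensors of identical shape and using the heuristic lambda weights given below; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def l_ttse(hl_tts, hl_asr):
    # Equation 1: MSE between TTS-decoder and ASR-decoder linguistic representations.
    return F.mse_loss(hl_asr, hl_tts)

def l_asr(l_ce, l_ttse_val, l_ctc, l_acc, lambdas=(0.5, 0.1, 0.5, 0.1)):
    # Equation 2: weighted sum of the four losses (lambda values follow the
    # heuristic weights given in the text below).
    l1, l2, l3, l4 = lambdas
    return l1 * l_ce + l2 * l_ttse_val + l3 * l_ctc + l4 * l_acc
```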
The speaker encoder is a 3-layer LSTM with 256 hidden nodes per layer, followed by a projection layer of 256 units. The output is the L2-normalized hidden state of the last layer, which is a 256-dimensional vector. The TTS model employs the same architecture as in the cited reference. The ASR encoder is a 5-layer bidirectional LSTM (BLSTM) with 320 units per direction. A 300-dimensional location-aware attention is used in the attention layer. The ASR decoder is a single-layer LSTM with 320 units. The accent classifier is a 2-layer 1-dimensional convolutional network with 128 channels and a kernel size of 3, followed by an average pooling layer and a final FC output layer. In Equation 2, λ1 = 0.5, λ2 = 0.1, λ3 = 0.5 and λ4 = 0.1 are set heuristically so that the four loss terms have a similar numerical scale. Transform 1 and transform 2 in the baseline approach are 2-layer and 4-layer BLSTMs, respectively, with 128 units per direction.
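A hedged PyTorch sketch of the module sizes listed above; the input feature dimension, the number of accent classes and the exact wiring are assumptions for illustration only.

```python
import torch.nn as nn

N_MELS, N_ACCENTS = 40, 2   # assumed input feature size and number of accent classes

speaker_encoder = nn.LSTM(N_MELS, 256, num_layers=3, batch_first=True)  # 3-layer LSTM, 256 nodes
speaker_projection = nn.Linear(256, 256)      # 256-unit projection; output is L2-normalized

asr_encoder = nn.LSTM(N_MELS, 320, num_layers=5,
                      bidirectional=True, batch_first=True)             # 5-layer BLSTM, 320/direction

asr_decoder = nn.LSTM(input_size=640, hidden_size=320, num_layers=1,
                      batch_first=True)                                 # single-layer LSTM, 320 units

accent_classifier = nn.Sequential(            # expects encoder output transposed to (batch, 640, frames)
    nn.Conv1d(in_channels=640, out_channels=128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                  # average pooling over time
    nn.Flatten(),
    nn.Linear(128, N_ACCENTS),                # final FC output layer
)
```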
The speaker encoder model is trained with the Adam optimizer for 1000k steps, with a batch size of 640 and a learning rate of 0.0001. The Adam optimizer is also used to train the TTS model for 100k steps, with a batch size of 16 and a learning rate of 0.001. The accent ASR model is first pre-trained for 160k steps with an Adadelta optimizer and a batch size of 16, using speech data from the native speakers and p248. Then, keeping the batch size and learning rate unchanged, the model is fine-tuned for another 5k steps using only the speech data from p248.
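The optimizer settings above translate to roughly the following PyTorch configuration; the three modules here are placeholders standing in for the actual networks.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual networks.
speaker_encoder = nn.LSTM(40, 256, num_layers=3, batch_first=True)
tts_model = nn.Linear(256, 80)
asr_model = nn.Linear(80, 72)

spk_optimizer = torch.optim.Adam(speaker_encoder.parameters(), lr=1e-4)  # 1000k steps, batch size 640
tts_optimizer = torch.optim.Adam(tts_model.parameters(), lr=1e-3)        # 100k steps, batch size 16
asr_optimizer = torch.optim.Adadelta(asr_model.parameters())             # 160k pre-training steps, batch size 16,
                                                                         # then 5k fine-tuning steps on p248 only
```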
As shown in Fig. 3, (a) presents the mean opinion score results with 95% confidence intervals, and (b) presents the speaker similarity preference test results, where "AB-BL", "P-BL" and "P-AB" represent comparisons of the ablation method vs. the baseline method, the proposed method vs. the baseline method, and the proposed method vs. the ablation method, respectively.
Fig. 4 shows the accent preference test results. "AB-BL", "P-BL", "P-AB", "P-L2" and "P-L1" represent comparisons of the ablation method vs. the baseline method, the proposed method vs. the baseline method, the proposed method vs. the ablation method, the proposed method vs. the non-native-accent recordings, and the proposed method vs. the native-accent recordings, respectively.
Three perceptual listening tests were used to evaluate the conversion performance of the baseline (BL), proposed (P) and ablation (AB) systems: a mean opinion score (MOS) test of audio naturalness, a speaker-similarity XAB test, and an accent AB test. We randomly select 20 utterances from the test utterances of speaker p248 for evaluation.
Audio naturalness. In the MOS test, audio naturalness is rated on a five-point scale (1 = worst, 5 = best). The audio generated by the three systems and the reference recordings of the non-native accent ("L2-Ref") are presented to the listeners in random order. Each group of audio corresponds to the same text content, and listeners can replay the audio as often as desired. The proposed method obtains a MOS of 4.0 (L2-Ref receives a MOS of 4.4), which is statistically significantly higher than the baseline method. The MOS of the proposed method is also slightly higher than that of the ablation system, which shows that adding the accent embedding and the accent classifier helps extract accent-independent linguistic content that benefits speech synthesis. Since we use seq2seq-based ASR and TTS models, the converted speech has pronunciation patterns closer to the native accent, for example in duration and speaking rate, which differ markedly from the source-accent speech. We encourage the reader to listen to the audio samples.
Speaker similarity. We compare the speaker similarity between the converted speech and the non-native-accent speech. In the XAB test, X is a non-native-accent reference sample. We present pairs of speech samples (A and B) with the same text content as the reference and ask the listeners to decide which sample is closer to the reference in timbre. Listeners may replay the audio samples and may select "No Preference (NP)" if they cannot tell the difference. To avoid the influence of the spoken content, the audio is played in reverse. We can see that the baseline system achieves better similarity performance than the proposed system. This result is reasonable, since the speech synthesis model in the proposed approach never sees speech data from non-native-accent speakers; the synthesis model is expected to infer the timbre of a speaker from the speaker embedding generated by the speaker encoder from only one utterance (i.e., voice cloning). Access to speech data from more speakers would facilitate training a more general speaker encoder: very good voice cloning performance can be achieved by training the speaker encoder on speech data from 18K speakers, but we do not have access to such a large corpus. We found no statistically significant difference between the proposed and ablation systems (p-value 0.36).

In the accent AB test, we first let the participants listen to reference audio of the native accent and the non-native accent. Pairs of speech samples (A and B) with the same text content are then presented, and the listeners are asked to select the sample whose accent sounds closer to the native accent. The results are shown in Fig. 4. According to the preference tests "P-BL" and "P-AB", listeners are highly confident that the proposed method produces more native-sounding accents than the baseline and ablation methods (p < 0.001), and the ablation method in turn achieves better accent performance than the baseline method (p < 0.001). The DTW procedure in the baseline approach may introduce alignment errors, and the neural-network mapping from L2-PPGs to L1-PPGs may be ineffective. From the results of "P-L2" and "P-L1", we can conclude that the method removes the non-native accent from the second-language speech, so that the converted speech is perceptually close to the native accent; the p-values are 2.3e-8 and 0.06, respectively.
An end-to-end accent conversion method is presented herein; it is the first model that can convert non-native accents into native accents without any guidance from native-accent reference audio during the conversion process. It consists of four independently trained neural networks: a speaker encoder, a multi-speaker TTS model, a multi-task ASR model, and a neural network vocoder. Experimental results show that the method can convert non-native-accented English speech into English speech that is difficult to distinguish from native-accented speech in terms of accent. The synthesis model is expected to generate the desired timbre of the target speaker from the speaker embedding obtained from the speaker encoder. Although English accent conversion is taken as the example, the same framework can be used for accent conversion in any other language.
In addition, the end-to-end accent conversion method provided herein can also be used for pronunciation correction for patients with dysarthria; causes of dysarthria include brain injury, stroke, Parkinson's disease and amyotrophic lateral sclerosis. Accent correction is particularly important for such patients' daily communication. The implementation follows the same steps as above: the non-native-accent data only needs to be replaced with the patient's non-standard speech data, and the other training and conversion procedures are the same as for non-native accent conversion.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.