CN119763547B - Speech synthesis method, speech synthesis model training method, electronic device and computer program product - Google Patents

Speech synthesis method, speech synthesis model training method, electronic device and computer program product

Info

Publication number
CN119763547B
CN119763547B (application CN202411955158.4A)
Authority
CN
China
Prior art keywords
speech
target
dialect
synthesis model
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411955158.4A
Other languages
Chinese (zh)
Other versions
CN119763547A (en)
Inventor
史永方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xunfei Huanyu Technology Co ltd
Original Assignee
Anhui Xunfei Huanyu Technology Co ltd
Filing date
Publication date
Application filed by Anhui Xunfei Huanyu Technology Co ltd
Priority to CN202411955158.4A
Publication of CN119763547A
Application granted
Publication of CN119763547B
Active (current legal status)
Anticipated expiration


Abstract

The application provides a speech synthesis method, a training method for a speech synthesis model, an electronic device, and a computer program product. Text features of a target text and speech demand information may be input into a speech synthesis model to generate synthesized speech that reads the target text in a target dialect. The speech synthesis model comprises an encoding module, a dialect discrimination module, and a decoding module. The dialect discrimination module outputs judgment information to the encoding module, and the judgment information is used to correct the initial prior information output by the encoding module, or to improve its quality, yielding target prior information with more salient dialect features. The decoding module then generates the synthesized speech based on the target prior information, so that the dialect features of the synthesized speech are more pronounced and the dialect synthesis effect is improved.

Description

Speech synthesis method, training method of speech synthesis model, electronic device and computer program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech synthesis method, a training method for a speech synthesis model, an electronic device, and a computer program product.
Background
In a country with a vast territory and many ethnic groups, people communicate increasingly often, and the dialects of different regions have become an important element of that communication. Speech synthesis technology for dialects gives people a more convenient and natural mode of interaction. A dialect, also called "local speech" or "local accent", is the language of a particular region that differs from the standard language.
Currently, synthesized dialect speech is typically generated with a trained speech synthesis model. During the training phase, the model must be trained on a large amount of dialect data so that it learns each dialect's unique phonetic features and grammatical structures.
However, dialects are numerous: statistically, some countries with large territories have more than a hundred dialect types, which makes training a speech synthesis model that covers them all extremely complex. For example, different dialects differ greatly in tones, finals, initials, and so on, and handling these differences requires a large amount of data and complex algorithms. Limited by training data and algorithms, current speech synthesis models mostly use a conventional encoder and decoder to synthesize dialect speech, so the dialect synthesis effect is mediocre and the dialect features are not salient enough.
Disclosure of Invention
Based on the above state of the art, the present application proposes a speech synthesis method, a training method of a speech synthesis model, an electronic device and a computer program product.
According to a first aspect of an embodiment of the present application, there is provided a speech synthesis method, the method including:
acquiring text features of a target text and speech demand information, wherein the speech demand information comprises an identification of a target dialect;
inputting the text features and the speech demand information into a speech synthesis model to obtain synthesized speech that reads the target text in the target dialect;
wherein the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the encoding module is configured to generate initial prior information according to the text features and the speech demand information; the dialect discrimination module is configured to generate judgment information according to the initial prior information, the judgment information being used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech demand information and the judgment information; the target prior information is input into the decoding module; and the decoding module is configured to generate the synthesized speech according to the target prior information.
According to a second aspect of an embodiment of the present application, there is provided a training method of a speech synthesis model, the method including:
acquiring text features of sample data and speech demand information, wherein the speech demand information comprises an identification of a target dialect;
training a speech synthesis model according to the text features and the speech demand information;
updating model parameters of the speech synthesis model based on a model loss;
wherein the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the encoding module is configured to generate initial prior information according to the text features and the speech demand information; the dialect discrimination module is configured to generate judgment information according to the initial prior information, the judgment information being used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech demand information and the judgment information; the target prior information is input into the decoding module; and the decoding module is configured to generate synthesized speech according to the target prior information, the synthesized speech comprising synthesized speech that reads the sample data in the target dialect.
According to a third aspect of an embodiment of the present application, there is provided an electronic device, including a memory and a processor, where the memory is connected to the processor and configured to store a program, and the processor is configured to implement the speech synthesis method according to the first aspect or the training method of the speech synthesis model according to the second aspect by running the program in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method according to the first aspect or the training method of the speech synthesis model according to the second aspect.
According to a fifth aspect of an embodiment of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the speech synthesis method according to the first aspect or the training method of the speech synthesis model according to the second aspect.
According to the embodiment of the application, inputting the text features and speech demand information of the target text into the speech synthesis model generates synthesized speech that reads the target text in the target dialect. The speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the dialect discrimination module outputs judgment information to the encoding module, and the judgment information is used to correct the initial prior information output by the encoding module, or to improve its quality, yielding target prior information with more salient dialect features. The decoding module then generates the synthesized speech based on the target prior information, so that the dialect features of the synthesized speech are more pronounced and the dialect synthesis effect is improved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required in the embodiments or the description of the prior art are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a speech synthesis model according to an embodiment of the present application;
Fig. 3 is a flow chart of a training method of a speech synthesis model according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a training apparatus for a speech synthesis model according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
SUMMARY
As described in the background, with the emergence of speech synthesis technology for dialects, synthesized speech that reads a text in a dialect, i.e., dialect speech, can be output from the text. Currently, there are three main ways of synthesizing dialect speech. The first is unit-concatenation synthesis: dialect speech is divided into basic units, such as phonemes, syllables or words, and, given the input text, suitable units are selected from a prerecorded unit library and concatenated to synthesize a speech waveform. The second is statistical-parameter synthesis: an acoustic model and a language model are built by analyzing a large amount of dialect speech data. The acoustic model predicts acoustic features of the speech, such as fundamental frequency and spectrum, while the language model handles the linguistic information of the text, such as grammar and semantics. For an input text, the two models generate the corresponding acoustic feature parameters, which a synthesizer converts into a speech waveform. The third is end-to-end synthesis based on deep learning: a deep neural network converts the input text directly into a speech waveform, without an intermediate acoustic-parameter conversion step. The processing logic of the third approach is the simplest, requiring only that the corresponding samples be collected for model training.
However, dialects are numerous, and each dialect has unique phonetic features and a unique grammatical structure. Therefore, when training with a traditional neural network model (e.g., a recurrent neural network or a long short-term memory network), the dialect synthesis effect is generally mediocre and the dialect features are not salient enough.
In view of this state of the art, the inventors propose introducing a discriminator into the neural network model, so that the quality of the encoded data in the model is corrected or improved by means of the signal the discriminator feeds back, making the model's dialect features more salient. Accordingly, the embodiment of the application provides a speech synthesis method: the text features and speech demand information of a target text are input into a speech synthesis model to generate synthesized speech that reads the target text in the target dialect. The speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the dialect discrimination module outputs judgment information to the encoding module, and the judgment information is used to correct the initial prior information output by the encoding module, or to improve its quality, yielding target prior information with more salient dialect features. The decoding module then generates the synthesized speech based on the target prior information, so that the dialect features of the synthesized speech are more pronounced and the dialect synthesis effect is improved.
Exemplary method
Referring to fig. 1, in an exemplary embodiment, a method of speech synthesis is provided, which may include:
S101, acquiring text features of a target text and speech demand information, wherein the speech demand information comprises an identification of a target dialect.
In this step, the target text is the text that needs to be converted into dialect speech; its semantic content is not limited here. For example, the target text may be an article, a speech, or a news report. As another example, the target text may be text converted from a piece of speech data.
The text features of the target text may include pronunciation features of the target text in the target dialect, semantic features of the target text, and so on. In some embodiments, the text features include a target phoneme sequence of the target text with the target dialect features, where the target dialect features include the target dialect's characteristics or rules of pronunciation, prosody, and tone.
The speech demand information describes the requirements on the final synthesized speech. In some embodiments, these include a requirement on the dialect category, for example that dialect speech be synthesized in the target dialect. The target dialect can be the dialect of any region. The identification of the dialect may be a unique identifier of that dialect: during speech synthesis, the identifier determines which dialect speech needs to be synthesized, namely dialect speech in the target dialect it indicates. In some embodiments, the identification of the dialect may be a dialect tag. It is understood that the speech demand information includes, but is not limited to, the requirement on the dialect category of the final synthesized speech.
S102, inputting the text features and the speech demand information into a speech synthesis model to obtain synthesized speech that reads the target text in the target dialect.
In this step, the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module. The encoding module generates initial prior information according to the text features and the speech demand information; the dialect discrimination module generates judgment information according to the initial prior information, and the judgment information is used to correct the initial prior information; the encoding module further generates target prior information according to the text features, the speech demand information and the judgment information; the target prior information is input into the decoding module; and the decoding module generates the synthesized speech according to the target prior information.
It will be appreciated that the initial prior information is an intermediate representation generated by the encoding module based on the text features and the speech demand information. The encoding module may be, for example, an encoder in a neural network model that captures semantic, grammatical, contextual and dialect information and converts it into a form that is easy to process, i.e., the initial prior information. For example, the network architecture of the encoding module may be the same as that of the encoding part of the SOVITS model.
The dialect discrimination module judges the initial prior information generated by the encoding module to determine its quality: the higher the quality, the more accurate the represented content; conversely, the lower the quality, the larger the content deviation. It generates judgment information and feeds it back to the encoding module so as to correct the initial prior information and improve the quality of the prior information the encoding module generates; the encoding module then regenerates higher-quality target prior information based on the judgment information. In this embodiment, the more accurate the content represented by the prior information output by the encoding module, the more salient and accurate the dialect features under the target dialect. In some embodiments, the network architecture of the discrimination module may be similar to that of the SOVITS model, for example with the multi-resolution discriminator of the SOVITS model removed and the rest retained. In that arrangement, the raw audio signal first passes through six 1D convolution layers, is then converted to 2D and passes through a five-layer multi-period 2D convolution module (with multi-period scales 2, 3, 5, 7 and 11), and the returned results may include both the final judgment information and the per-layer output-feature judgment results.
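The following is a minimal sketch, assuming PyTorch, of a discriminator organized as just described: six 1D convolution layers followed by multi-period 2D convolution branches with periods 2, 3, 5, 7 and 11, returning both final judgment scores and per-layer features. All channel counts and kernel sizes are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodBranch(nn.Module):
    def __init__(self, period: int, channels: int = 32):
        super().__init__()
        self.period = period
        # Five 2D conv layers operating on the (time // period, period) grid.
        self.convs = nn.ModuleList(
            nn.Conv2d(1 if i == 0 else channels, channels,
                      kernel_size=(5, 1), stride=(3, 1), padding=(2, 0))
            for i in range(5)
        )
        self.out = nn.Conv2d(channels, 1, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x):  # x: (batch, 1, time)
        b, c, t = x.shape
        pad = (self.period - t % self.period) % self.period
        x = F.pad(x, (0, pad))              # pad time to a multiple of the period
        x = x.view(b, c, -1, self.period)   # reshape the 1D signal to 2D
        feats = []
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
            feats.append(x)                 # per-layer output-feature judgments
        return self.out(x), feats

class DialectDiscriminator(nn.Module):
    def __init__(self, periods=(2, 3, 5, 7, 11)):
        super().__init__()
        # Six 1D conv layers over the raw audio signal.
        self.pre = nn.Sequential(*[
            nn.Conv1d(1 if i == 0 else 64, 64, kernel_size=5, padding=2)
            for i in range(6)
        ])
        self.branches = nn.ModuleList(PeriodBranch(p) for p in periods)

    def forward(self, audio):  # audio: (batch, 1, time)
        x = self.pre(audio)
        x = x.mean(dim=1, keepdim=True)     # collapse channels before the 2D reshape
        scores, all_feats = [], []
        for branch in self.branches:
            s, f = branch(x)
            scores.append(s)                # final judgment information
            all_feats.append(f)             # per-layer feature judgment results
        return scores, all_feats
```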
The decoding module may be, for example, a decoder in a neural network model that generates the final synthesized speech based on the intermediate representations output by the other modules. In some embodiments, the network architecture of the decoding module may be the same as that of the decoding part of the SOVITS model.
According to the embodiment of the application, inputting the text features and speech demand information of the target text into the speech synthesis model generates synthesized speech that reads the target text in the target dialect. The dialect discrimination module outputs judgment information to the encoding module, and the judgment information is used to correct the initial prior information output by the encoding module, or to improve its quality, yielding target prior information with more salient dialect features. The decoding module then generates the synthesized speech based on the target prior information, so that the dialect features of the synthesized speech are more pronounced and the dialect synthesis effect is improved.
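As a minimal sketch of this two-pass flow, assuming PyTorch-style modules with hypothetical signatures: the encoder runs once to produce initial prior information, the dialect discrimination module produces judgment information from it, and the encoder runs again with that judgment signal to produce the target prior information that the decoder consumes.

```python
import torch
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    def __init__(self, encoder: nn.Module, discriminator: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder              # text features + demand info -> prior information
        self.discriminator = discriminator  # prior information -> judgment information
        self.decoder = decoder              # target prior information -> waveform

    def forward(self, text_feats: torch.Tensor, demand_info: torch.Tensor) -> torch.Tensor:
        # First pass: initial prior information from text features and demand info.
        initial_prior = self.encoder(text_feats, demand_info)
        # The dialect discrimination module judges the initial prior information.
        judgment = self.discriminator(initial_prior)
        # Second pass: the encoder corrects the prior using the judgment signal,
        # yielding target prior information with more salient dialect features.
        target_prior = self.encoder(text_feats, demand_info, judgment=judgment)
        # The decoder generates the synthesized dialect speech.
        return self.decoder(target_prior)
```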
To further enhance the dialect features of the synthesized speech, in some embodiments of the application the speech synthesis model further comprises a prosody encoding module, which generates a prosody code of the target text under the target dialect according to the text features and the speech demand information.
The encoding module inputting the target prior information into the decoding module then comprises:
the encoding module concatenating the prosody code with the target prior information to obtain a coding feature, and inputting the coding feature into the decoding module.
It should be noted that the prosody code carries the prosody information of the target text under the target dialect. Concatenating the target prior information with the prosody code yields an intermediate feature, the coding feature, that better matches the dialect's characteristics; synthesized speech generated from this coding feature has more salient dialect features.
In some embodiments, the prosody encoding module may be, for example, a large language model that is trained separately so that it outputs prosody codes based on the input text features and the identification of the dialect.
In the embodiment of the application, a dedicated prosody encoding module is provided in the speech synthesis model, and its processing can further enhance the dialect features of the synthesized speech, as sketched below.
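A minimal sketch of the splicing step, assuming PyTorch tensors with illustrative shapes: the prosody code is concatenated with the target prior information along the feature dimension to form the coding feature handed to the decoder.

```python
import torch

target_prior = torch.randn(1, 120, 192)   # (batch, frames, prior dim) - assumed shape
prosody_code = torch.randn(1, 120, 64)    # (batch, frames, prosody dim) - assumed shape

# Concatenate to obtain the coding feature fed into the decoding module.
coding_feature = torch.cat([target_prior, prosody_code], dim=-1)
print(coding_feature.shape)               # torch.Size([1, 120, 256])
```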
In some embodiments of the present application, obtaining the text features of the target text comprises:
arranging and combining the phonemes in an initial phoneme sequence based on the prosody control rules and/or pitch control rules of the target dialect, to generate a target phoneme sequence with the target dialect features.
It should be noted that the initial phoneme sequence and the target phoneme sequence are both sequences of phonemes. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech, analyzed by the articulatory actions within a syllable: one action constitutes one phoneme. For example, the Chinese syllable "ā" has one phoneme, "ài" has two phonemes, "dài" has three phonemes, and so on.
It will be appreciated that different dialects have different phoneme pronunciation rules, prosody control rules and pitch control rules, so these rules can be created in advance for each dialect. Then, once the target text and the target dialect are determined, a phoneme sequence with the target dialect features can be quickly generated from the corresponding rules. In some embodiments, dialect data may be collected in advance, and the pronunciation phonemes, prosody changes and pitch changes specific to each dialect may be filtered and analyzed to determine that dialect's phoneme pronunciation rules and its prosody and pitch control rules. Taking a particular dialect as an example, it may have specific initial and final changes: the characters in a text are accurately converted into the corresponding phoneme sequence through the phoneme pronunciation rules, and the sequence is then combined according to the dialect-specific prosody and pitch control rules to generate a phoneme sequence with that dialect's features.
In the embodiment of the application, a phoneme sequence with the target dialect features can be quickly generated by means of the phoneme pronunciation rules, prosody control rules and pitch control rules; using this sequence as the text feature improves the dialect features of the final synthesized speech. A rule-based sketch follows.
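Below is a minimal, rule-based sketch of such a front end in Python. The dialect name, character-to-phoneme mapping, tone shifts and pause rule are all invented for illustration; a real front end would be driven by per-dialect dictionaries built as described above.

```python
from dataclasses import dataclass

# Hypothetical per-dialect rules: pronunciation (character -> phonemes) plus
# pitch and prosody adjustments applied when combining the sequence.
DIALECT_RULES = {
    "dialect_a": {
        "pronunciation": {"你": ["n", "i3"], "好": ["h", "o3"]},  # invented mapping
        "tone_shift": {"i3": "i2", "o3": "o4"},                   # invented tone rule
        "pause_after": {"好"},                                    # invented prosody rule
    }
}

@dataclass
class FrontEndOutput:
    phonemes: list  # target phoneme sequence with dialect features

def dialect_front_end(text: str, dialect_tag: str) -> FrontEndOutput:
    rules = DIALECT_RULES[dialect_tag]
    seq = []
    for ch in text:
        # Phoneme pronunciation rule: character -> initial phonemes.
        phones = rules["pronunciation"].get(ch, [ch])
        # Pitch control rule: rewrite tones per the dialect.
        phones = [rules["tone_shift"].get(p, p) for p in phones]
        seq.extend(phones)
        # Prosody control rule: insert a pause marker where the dialect pauses.
        if ch in rules["pause_after"]:
            seq.append("<pau>")
    return FrontEndOutput(phonemes=seq)

print(dialect_front_end("你好", "dialect_a").phonemes)
# ['n', 'i2', 'h', 'o4', '<pau>']
```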
In some embodiments of the present application, the speech demand information further includes a person identification, and inputting the text features and the speech demand information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect comprises:
inputting the text features and the speech demand information into the speech synthesis model to obtain synthesized speech of a target person reading the target text in the target dialect, wherein the target person is the person indicated by the person identification.
It should be noted that the person identification identifies the target person; it may be, for example, the target person's name, but is not limited thereto. By means of this identification, the timbre of the synthesized speech can be specified. For example, if the speech demand information includes dialect A and "Zhang San", the synthesized speech will be speech data of Zhang San reading the target text in dialect A. Compared with reading the target text in a fixed timbre, this applies to more scenarios and meets users' timbre-related requirements.
It will be appreciated that during training, the identifications of the person and the dialect indicate the timbre and the dialect type of the synthesized speech. Therefore, during model inference, the speech synthesis model outputs synthesized speech of the target person reading the target text in the target dialect according to the text features, the person identification and the identification of the target dialect.
In some embodiments, when outputting this synthesized speech, the speech synthesis model first determines the audio data of the target person according to the person identification, then extracts voiceprint features from that audio data to obtain the target person's voiceprint features, and finally outputs the synthesized speech of the target person reading the target text in the target dialect based on the text features, the person identification, the target person's voiceprint features and the identification of the target dialect, as sketched below.
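A minimal sketch, under assumed helper names, of this voiceprint step: reference audio is looked up by the person identification, a speaker embedding is extracted, and synthesis is conditioned on it. `load_reference_audio` and `VoiceprintEncoder` are hypothetical stand-ins, not a real API.

```python
import torch
import torch.nn as nn

class VoiceprintEncoder(nn.Module):
    """Hypothetical speaker encoder mapping audio to a fixed-size embedding."""
    def __init__(self, dim: int = 192):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time
            nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:  # (batch, 1, time)
        return self.proj(audio)                               # (batch, dim)

def load_reference_audio(person_id: str) -> torch.Tensor:
    # Hypothetical lookup: a real system would fetch the target person's
    # recorded audio by identification; here we return a dummy 1 s clip.
    return torch.randn(1, 1, 16000)

def voiceprint_for(person_id: str, encoder: VoiceprintEncoder) -> torch.Tensor:
    audio = load_reference_audio(person_id)
    return encoder(audio)          # voiceprint feature used to condition synthesis

print(voiceprint_for("zhang_san", VoiceprintEncoder()).shape)  # torch.Size([1, 192])
```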
According to the embodiment of the application, the timbre of the synthesized speech can be specified and synthesized speech in the target person's timbre can be output, which suits more application scenarios and meets users' timbre-related requirements.
In some embodiments of the present application, the speech demand information further includes a style identification, and inputting the text features and the speech demand information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect comprises:
inputting the text features and the speech demand information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect in a target style, wherein the target style is the style indicated by the style identification.
It should be noted that the style identification identifies the target style; it may be, for example, the name of the target style, but is not limited thereto. Style relates to emotion or mood; for example, styles may include happy, sad, excited, tense, and so on, and the target style may be any style. Through the style identification, the expressive style of the speaker in the synthesized speech can be specified. For example, if the speech demand information includes dialect A and "tense", the synthesized speech will be speech data that reads the target text in dialect A in a tense style. Compared with reading the target text in a fixed style, this applies to more scenarios and meets users' style-related requirements.
It will be appreciated that during training, the style identification and the identification of the dialect indicate the style and the dialect type of the synthesized speech. Therefore, during model inference, the speech synthesis model outputs synthesized speech that reads the target text in the target dialect in the target style according to the text features, the style identification and the identification of the target dialect.
According to the embodiment of the application, the style of the speaker in the synthesized speech can be specified and synthesized speech read in the target dialect in the target style can be output, which suits more application scenarios and meets users' style-related requirements.
In some embodiments, the speech demand information includes a person identification, a style identification and an identification of the target dialect, and inputting the text features and the speech demand information into the speech synthesis model yields synthesized speech of the target person reading the target text in the target dialect in the target style. For example, as shown in FIG. 2, the speech synthesis model includes a dialect synthesis front end 201, a large language model 202, an encoder 203 and a decoder 204.
After the text data and the dialect tag are input to the dialect synthesis front end 201, the front end 201 looks up the phoneme pronunciation rules, prosody control rules and pitch control rules corresponding to the target dialect indicated by the dialect tag, and converts the text data into a target phoneme sequence with the target dialect features based on those rules.
After the emotion tag, timbre tag, dialect tag and target phoneme sequence are input into the large language model 202, the large language model 202 generates the prosody code; for the generation process, see the embodiments above, which are not repeated here.
After the emotion tag, timbre tag, dialect tag, target phoneme sequence and prosody code are input to the encoder 203, the encoder 203 generates the target prior information based on the emotion tag, timbre tag, dialect tag and target phoneme sequence. The encoder 203 is similar to the encoding module in the embodiments above, and the process of generating the target prior information is likewise similar, so it is not repeated here. The encoder 203 concatenates the generated target prior information with the prosody code to obtain the coding feature, which is then input to the decoder 204.
After receiving the coding feature, the decoder 204 performs speech decoding based on it to generate the synthesized speech, i.e., the synthesized speech of the target person reading the target text in the target dialect in the target style. Here the emotion tag corresponds to the style identification in the embodiments above, the timbre tag to the person identification, and the dialect tag to the identification of the target dialect. The target person is the person indicated by the timbre tag, the target style is the emotion or style indicated by the emotion tag, and the target dialect is the dialect indicated by the dialect tag.
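A minimal sketch of this Fig. 2 pipeline under assumed callable interfaces for the four components; the function signatures implied here are illustrative assumptions, not the patent's actual interfaces.

```python
import torch

def synthesize(text: str, emotion_tag: str, timbre_tag: str, dialect_tag: str,
               front_end, llm, encoder, decoder) -> torch.Tensor:
    # 201: convert text into a target phoneme sequence with dialect features.
    phonemes = front_end(text, dialect_tag)
    # 202: the large language model produces the prosody code from the tags
    # and the phoneme sequence.
    prosody_code = llm(emotion_tag, timbre_tag, dialect_tag, phonemes)
    # 203: the encoder produces target prior information from the tags and
    # phonemes, then the prior is concatenated with the prosody code.
    target_prior = encoder(emotion_tag, timbre_tag, dialect_tag, phonemes)
    coding_feature = torch.cat([target_prior, prosody_code], dim=-1)
    # 204: the decoder turns the coding feature into the synthesized waveform.
    return decoder(coding_feature)
```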
Referring to fig. 3, in another aspect of the present application, in an exemplary embodiment, a method for training a speech synthesis model is provided, the method may include:
S301, acquiring text features of sample data and speech demand information, wherein the speech demand information comprises an identification of a target dialect.
S302, training a speech synthesis model according to the text features and the speech demand information.
S303, updating model parameters of the speech synthesis model based on a model loss.
The speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module. The encoding module generates initial prior information according to the text features and the speech demand information; the dialect discrimination module generates judgment information according to the initial prior information, and the judgment information is used to correct the initial prior information; the encoding module further generates target prior information according to the text features, the speech demand information and the judgment information; the target prior information is input into the decoding module; and the decoding module generates synthesized speech according to the target prior information, the synthesized speech comprising synthesized speech that reads the sample data in the target dialect.
It should be noted that S101-S102 in the above embodiments describe the processing during model inference, while S301-S303 in this embodiment describe the processing during model training; what they have in common is not repeated. During model training, the sample data used is data collected in advance for training the speech synthesis model, and the model parameters are updated by back-propagating the model loss, which is not described in detail here.
In the embodiment of the application, the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the dialect discrimination module outputs judgment information to the encoding module, and the judgment information is used to correct the initial prior information output by the encoding module, or to improve its quality, yielding target prior information with more salient dialect features. The decoding module then generates the synthesized speech based on the target prior information, so that the dialect features of the synthesized speech are more pronounced and the dialect synthesis effect is improved.
In some embodiments of the application, sample data may be collected as follows. Mandarin data is collected by gathering speech data from the network, including sources such as radio stations, documentaries and open-source datasets. Dialect data is collected by expanding the dialect corpus in various ways, for example by cooperating with local cultural organizations to collect people's dialect speech, or by inviting volunteers on a crowdsourcing platform to record dialect speech. The collected speech data is scored with MOSNet, and only speech whose score is above a threshold is kept; for example, speech with a score below 3.5 is filtered out. Speaker-verification technology determines how many speakers each sentence contains, and sentences with more than one speaker are filtered out. For the screened speech data, the corresponding text is generated through an automatic speech recognition (ASR) system. A sketch of this pipeline follows.
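A minimal sketch of the filtering pipeline, with `mosnet_score`, `count_speakers` and `asr_transcribe` as hypothetical stand-ins for a MOSNet quality scorer, a speaker-verification tool and an ASR system:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    text: str = ""

def build_corpus(audio_paths, mosnet_score, count_speakers, asr_transcribe,
                 min_score: float = 3.5):
    corpus = []
    for path in audio_paths:
        # Quality gate: drop clips that MOSNet scores below the threshold.
        if mosnet_score(path) < min_score:
            continue
        # Speaker gate: keep only single-speaker sentences.
        if count_speakers(path) != 1:
            continue
        # Transcribe the remaining speech with ASR.
        corpus.append(Utterance(audio_path=path, text=asr_transcribe(path)))
    return corpus
```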
In some embodiments, speaker information can also be obtained through a speaker clustering model: from several unordered, unlabeled segments of speech data, the speech belonging to the same speaker can be identified, and voiceprint features are then extracted from each speaker's speech data.
A large number of Mandarin text corpora in various styles are collected, a corresponding style label is assigned to each text sample (the text converted by ASR), and a mapping between text content and style labels is established. The collected Mandarin and dialect data can also be used for model pre-training by extracting unsupervised HuBERT features of the speech with an open-source model.
In some embodiments of the application, the speech synthesis model further comprises a prosody encoding module, which generates a prosody code of the sample data under the target dialect according to the text features and the speech demand information.
The encoding module inputting the target prior information into the decoding module then comprises:
the encoding module concatenating the prosody code with the target prior information to obtain a coding feature, and inputting the coding feature into the decoding module.
It should be noted that the prosody code carries the prosody information of the sample data under the target dialect. Concatenating the target prior information with the prosody code yields an intermediate feature, the coding feature, that better matches the dialect's characteristics; synthesized speech generated from this coding feature has more salient dialect features.
In some embodiments, the prosody encoding module may be, for example, a large language model that is trained separately so that it outputs prosody codes based on the input text features and the identification of the dialect. For example, a GPT model based on the Transformer architecture is used as the base model: through pre-training on large-scale data it learns extensive language knowledge, and it is then fine-tuned for the specific task, where it shows excellent performance. The training process is as follows.
Stage 1: training a baseline Mandarin model.
The model is trained with Mandarin data so that it can output prosody codes according to style labels. During label encoding, the voiceprint features of the sample data may be encoded with an encoding module of size 192 x 1024, and the style label of the sample data with an encoding module of size 10 x 1024. During tokenization, a pre-trained Tokenizer may segment and encode the input sequence, converting it into an input format acceptable to the model; the input sequence may be the phoneme sequence of the sample data, or a phoneme sequence with dialect features generated by the dialect synthesis front end described above. When updating the model parameters based on the model loss, the loss may be computed under the guidance of prosody codes produced by a pre-trained codec model; that codec model can generate synthesized speech in a specified dialect from text data, and its intermediate features include prosody codes.
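A minimal sketch, assuming PyTorch, of the label-encoding step: a 10 x 1024 embedding table for style labels, a mapping of the 192-dimensional voiceprint into the 1024-dimensional model space, and a tokenized phoneme sequence, assembled into one input sequence for the language model. Interpreting "192 x 1024" as a 192-to-1024 projection, and the phoneme vocabulary size, are assumptions.

```python
import torch
import torch.nn as nn

D_MODEL = 1024

style_embedding = nn.Embedding(10, D_MODEL)      # 10 x 1024 style-label table
voiceprint_proj = nn.Linear(192, D_MODEL)        # 192-dim voiceprint -> model dim
phoneme_embedding = nn.Embedding(512, D_MODEL)   # assumed phoneme vocabulary size

style_id = torch.tensor([3])                     # e.g. one of 10 style labels
voiceprint = torch.randn(1, 192)                 # speaker voiceprint feature
phoneme_ids = torch.randint(0, 512, (1, 40))     # tokenized phoneme sequence

# Build the LLM input sequence: [style] [voiceprint] [phonemes...]
inputs = torch.cat([
    style_embedding(style_id).unsqueeze(1),      # (1, 1, 1024)
    voiceprint_proj(voiceprint).unsqueeze(1),    # (1, 1, 1024)
    phoneme_embedding(phoneme_ids),              # (1, 40, 1024)
], dim=1)
print(inputs.shape)                              # torch.Size([1, 42, 1024])
```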
Stage 2: fine-tuning with high-quality Mandarin data and dialect data.
Training continues on the basis of the stage-1 model, using the high-quality portion of the Mandarin data as samples. An encoding module for the dialect label, of size 100 x 1024, is added during label encoding. The rest of the training process is essentially the same as in stage 1 and is not repeated here.
Stage 3: fine-tuning with premium Mandarin data and dialect data.
Training continues on the basis of the stage-2 model, using the premium portion of the Mandarin data as samples. An encoding module for the timbre label, of size 100 x 1024, is additionally added during label encoding. The rest of the training process is essentially the same as in stage 1 and is not repeated here.
It will be appreciated that, depending on business requirements or the speech demand information, the training process may go through only stages 1 and 2, or through stages 1, 2 and 3. Note that the training data used includes Mandarin data, dialect data, timbre labels, style labels, dialect labels, voiceprint features, text features, etc., differing at each stage; this is not described in detail here.
In the embodiment of the application, a dedicated prosody encoding module is provided in the speech synthesis model, and its processing can further enhance the dialect features of the synthesized speech.
In some embodiments of the application, training the speech synthesis model according to the text features and the speech demand information includes:
training an initial speech synthesis model on Mandarin data to obtain an intermediate speech synthesis model, where the intermediate speech synthesis model generates speech data corresponding to the Mandarin data;
training the intermediate speech synthesis model according to the text features and the speech demand information to obtain the trained speech synthesis model.
It should be noted that a speech synthesis model needs the basic ability to generate synthesized speech from text. Since dialect data is difficult to collate and obtain, this embodiment first trains the speech synthesis model on easily available Mandarin data to give it that basic ability; how a model is trained to generate synthesized speech from input text is not described in detail here. The speech synthesis model with this basic ability, i.e., the intermediate speech synthesis model, is then used as the basis, and training with the sample data enables the trained model to generate dialect speech, that is, synthesized speech that reads text in a dialect. In some embodiments, the network architecture of the initial speech synthesis model may be the same as that of the SOVITS model, but is not limited thereto.
In the embodiment of the application, a multi-stage training approach is adopted: the model's basic ability is first trained with easily obtained Mandarin data, and dialect data is then used as samples for further training on top of that ability. This improves training efficiency and shortens training time; a sketch of the schedule follows.
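A minimal sketch of this two-stage schedule; `train_epoch`, the datasets and the epoch counts are hypothetical placeholders.

```python
def train_speech_synthesis_model(model, mandarin_data, dialect_data,
                                 train_epoch, pretrain_epochs=10, finetune_epochs=5):
    # Stage A: basic ability from easily available Mandarin data
    # (yields the intermediate speech synthesis model).
    for _ in range(pretrain_epochs):
        train_epoch(model, mandarin_data)
    # Stage B: continue training on dialect samples so the model learns
    # to read text in the target dialect.
    for _ in range(finetune_epochs):
        train_epoch(model, dialect_data)
    return model
```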
In some embodiments of the present application, training the initial speech synthesis model on Mandarin data to obtain the intermediate speech synthesis model includes:
training the initial speech synthesis model according to the Mandarin data and its style identification to obtain the intermediate speech synthesis model, where the intermediate speech synthesis model generates speech data corresponding to the Mandarin data and its style identification, including speech data that reads the Mandarin data in the style indicated by the style identification.
It should be noted that for the style identification, the description in the speech synthesis method embodiments above applies and is not repeated here. A style identification is attached to the synthesized speech the model is to output, i.e., the model is given the ability to generate speech in a specified style, and it must be so instructed through the style identification. During model training, dialect data is limited and difficult to obtain, so the ability to generate speech according to a specified style can be trained while training the intermediate speech synthesis model on Mandarin data. The Mandarin data used here may be the Mandarin data provided in the above embodiments, but is not limited thereto.
In the embodiment of the application, the ability to generate speech in a specified style, trained on easily acquired Mandarin data, can be carried over to dialect data that has no style labels, reducing the complexity of model training.
In some embodiments of the present application, the speech demand information further includes a style identification, and training the intermediate speech synthesis model according to the text features and the speech demand information to obtain the trained speech synthesis model comprises:
training the intermediate speech synthesis model according to the text features, the identification of the target dialect and the style identification to obtain the trained speech synthesis model.
It should be noted that, while the intermediate speech synthesis model can generate speech in a specified style, it does not yet involve dialect information; dialect information must be added so that it can generate speech in a specified dialect. In this training stage, continuing to supply the style identification further strengthens the model's ability to generate speech in the specified style.
In the embodiment of the application, the model's ability to generate speech in a specified style is reinforced in both training stages, so that ability can be greatly improved.
In some embodiments of the present application, training the initial speech synthesis model on Mandarin data to obtain the intermediate speech synthesis model includes:
training the initial speech synthesis model according to the Mandarin data and voiceprint features to obtain the intermediate speech synthesis model, where the intermediate speech synthesis model generates speech data corresponding to the Mandarin data and the voiceprint features, including speech data that reads the Mandarin data according to the voiceprint features.
It should be noted that a voiceprint feature can be regarded as a piece of timbre information, and different voiceprint features indicate different timbres. Speech data that reads the Mandarin data according to a voiceprint feature can thus be regarded as speech data of the Mandarin data read by a person with the timbre that the voiceprint feature indicates; the synthesized speech output by the model may therefore be dialect speech of a specified timbre.
Where the speech synthesis model must be able to generate speech in a specified timbre, voiceprint features must be supplied to instruct the model. During model training, dialect data is limited and difficult to obtain, so the ability to generate speech according to specified timbre/voiceprint features can be trained while training the intermediate speech synthesis model on Mandarin data. The Mandarin data used here may be the Mandarin data provided in the above embodiments, but is not limited thereto.
In the embodiment of the application, the ability to generate speech in a specified timbre, trained on easily acquired Mandarin data, can be carried over to dialect data that has no timbre labels, reducing the complexity of model training.
In some embodiments of the present application, the speech demand information further includes a person identification, and training the intermediate speech synthesis model according to the text features and the speech demand information to obtain the trained speech synthesis model comprises:
training the intermediate speech synthesis model according to the text features, the voiceprint features, the identification of the target dialect and the person identification to obtain the trained speech synthesis model.
It should be noted that, although the intermediate speech synthesis model can generate speech in a specified timbre, voiceprints generally give only ordinary control over timbre, so a person identification must be added to the intermediate speech synthesis model. For the person identification, the description in the speech synthesis method embodiments above applies and is not repeated here. In this training stage, adding the person identification to control timbre greatly strengthens the model's ability to generate speech in the specified timbre.
In the embodiment of the application, the model's ability to generate speech in a specified timbre is reinforced in both training stages, so that ability can be greatly improved.
For ease of understanding, a model training process is described below, using an example of how the model is given, and strengthened in, the ability to generate speech according to timbre, style and dialect.
The training process is as follows:
Stage 1: training a Mandarin base model.
The network architecture of the model is selected first; for example, the network architecture of the SOVITS model may be chosen. Model training is then performed on the Mandarin data, voiceprint features, style labels and text features collected in the previous embodiments. During encoding, the voiceprint feature is encoded with an encoding module of size 192 x 1024 and the style label with an encoding module of size 10 x 1024, converting them into an input format acceptable to the model. As for the model loss: the audio adversarial loss makes the speech produced by the generator in the network architecture closer to real speech, improving the quality and naturalness of the synthesized speech, and feedback from the discriminators in the network architecture helps the generator learn the characteristics of real speech and continually improve synthesis. The feature matching loss drives the generator to learn the feature distribution of real speech at different levels, preventing it from attending only to the final output while neglecting the intermediate process; this improves the detail and realism of the synthesized speech. The Mel-spectrogram loss computes the difference between the Mel spectra of synthesized and real speech, so the model learns the frequency characteristics of real speech, improving the timbre and sound quality of the synthesized speech. In some embodiments, the training data may be divided into mini-batches and trained iteratively to improve training efficiency and stability. In addition, when model performance is evaluated on the validation set, the similarity between the original recording and the generated speech can be measured to check the accuracy of the codes the model generates. A sketch of the combined loss follows.
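A minimal sketch, assuming PyTorch, of how the three losses named above might be combined on the generator side; the least-squares adversarial form and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_scores, real_feats, fake_feats, mel_real, mel_fake,
                   w_adv=1.0, w_fm=2.0, w_mel=45.0):
    # Adversarial loss (LSGAN form): push discriminator scores on generated
    # audio toward the "real" target of 1.
    adv = sum(torch.mean((s - 1.0) ** 2) for s in disc_fake_scores)
    # Feature matching loss: match discriminator intermediate features of
    # real and generated speech at every level.
    fm = sum(F.l1_loss(f_fake, f_real.detach())
             for f_real, f_fake in zip(real_feats, fake_feats))
    # Mel-spectrogram loss: L1 distance between Mel spectra of generated
    # and real speech.
    mel = F.l1_loss(mel_fake, mel_real)
    return w_adv * adv + w_fm * fm + w_mel * mel
```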
Stage 2: training the multi-dialect model.
The model from stage-1 training is further trained on the dialect data, voiceprint features, style labels, dialect labels and text features collected in the foregoing embodiments. In this stage, a dialect discriminator is added to the original network architecture, introducing an adversarial objective on the codes so that the codes produced by the generator come closer to coding sequences with dialect features. The structure of the dialect discriminator is broadly similar to SOVITS, except that the multi-resolution discriminator is removed and the rest retained: the raw audio signal first passes through six 1D convolution layers, is then converted to 2D and passes through a five-layer multi-period 2D convolution module (with multi-period scales 2, 3, 5, 7 and 11), and the returned results may include both the final judgment information and the per-layer output-feature judgment results. In some embodiments, the training data is divided into mini-batches and trained iteratively to improve efficiency and stability; when model performance is evaluated on the validation set, the similarity between the original recording and the generated speech is measured and the accuracy of the generated codes is checked. After training, a specified style label generates speech of the corresponding style, a specified voiceprint feature generates speech with that voiceprint's timbre, and the style label and voiceprint feature can be combined to control the generated speech.
Stage 3: fine-tuning the dialect model, i.e., fine-tuning with a small amount of high-quality target-speaker data and part of the high-quality dialect data.
The model from stage-2 training is further trained/fine-tuned on the data collected in the foregoing embodiments: a small amount of high-quality target-speaker data and part of the high-quality dialect data, together with voiceprint features, timbre labels, dialect labels and text features. During this training, the parameters of the style encoding module are frozen and a timbre-label encoding module is added on top of the initial model; because voiceprint-based control of the speaker's timbre is relatively weak, the speaker label in the fine-tuning stage gives better control over the target speaker's timbre. In some embodiments, the training data may be divided into mini-batches and trained iteratively to improve efficiency and stability; when model performance is evaluated on the validation set, the similarity between the original recording and the generated speech is measured, the accuracy of the generated codes is checked, and the generated speech is controlled through the combination of style label and timbre label. In some embodiments, the trained model, whose intermediate features include prosody codes, may be used to guide the training of the large language model in the above embodiments.
It should be noted that although the technology keeps advancing, synthesized dialect speech still falls short of real human pronunciation in naturalness: it may sound mechanical, lack fluency or lack emotional expression, and it is difficult to fully reproduce the rich variation of human speech, especially when expressing complex emotions and subtle changes of mood, where synthesized speech tends to sound stiff. To improve the accuracy and naturalness of dialect speech synthesis, several key techniques are adopted. In prosody modeling, a modeling method based on the dialect's prosodic features generates more natural dialect speech. In emotion and mood modeling, dialect speech synthesis is made to express emotion and mood, enhancing the expressiveness of the speech. A multi-stage training method learns each dialect's prosodic features: the model is pre-trained on massive data to obtain pre-trained parameters that capture basic natural pronunciation prosody, and those parameters are then optimized with a small amount of manually labeled high-quality data, yielding an optimized model that accurately learns each dialect's pronunciation and prosody.
According to the embodiment of the application, a multi-stage training scheme is adopted: the baseline Mandarin model learns basic language information on large-scale data; the high-quality Mandarin-plus-dialect stage adds dialect data, and continued training on higher-quality data keeps the model stable; and fine-tuning on premium data learns the target speaker's pronunciation details more finely, guaranteeing the synthesis effect. For timbre control, timbre is first controlled via voiceprint, and then, in the premium-data fine-tuning stage, via speaker information, so the timbre of the synthesized speech can be fully controlled, solving the timbre-control problem of big-data training. In addition, dialect-category information is added at the encoding layer, ensuring that the generated codes better match the dialect's linguistic characteristics.
In some embodiments of the application, the text feature comprises a target phoneme sequence of the sample data having a target dialect feature.
It should be noted that, after the various data are collected in the manner described in the above embodiments, a customized front end, i.e., a dialect synthesis front end, may be created from the collected data. A dialect label may be specified, and the front-end engine then outputs text features consistent with that dialect, including phoneme, prosody, and tone information.
In some embodiments, a customized mapping dictionary can be constructed from the phonemes, tones, and prosodic rhythms of the different dialects. A small amount of dialect data is collected so as to cover all pronunciation characteristics of a dialect, and the data is then annotated by combining manual and automatic annotation, so that accurate customized dictionaries for the different dialects are obtained at a low manual-annotation cost; a front end is then created for each dialect from its specific dictionary, yielding the dialect synthesis front end. The pronunciation rules of the phonemes and the prosody and tone control rules of each dialect are determined by, for example, filtering and analyzing the pronunciation phonemes, prosody variations, and tone variations specific to that dialect in the dialect data. A dialect synthesis front end is created from the pronunciation rules and control rules of the dialect, so that when a text is input to the front end and a dialect is specified, a phoneme sequence carrying the features of that dialect is output. Taking a particular dialect as an example, the characters in a text can be accurately converted into the corresponding phoneme sequence through the phoneme pronunciation rules, and the phonemes are then combined according to the dialect-specific prosody and tone control rules to generate a phoneme sequence with the features of that dialect, as shown in the sketch below.
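As a rough illustration of such a front end, the sketch below maps characters to phonemes through a dialect-specific dictionary and then applies a tone control rule. The dictionary entries and the sandhi-like rule are invented placeholders for illustration, not data from any actual dialect.

```python
from typing import List, Tuple

# Hypothetical customized mapping dictionary for one dialect:
# character -> (phoneme list, tone). Real entries would come from the
# manually/automatically annotated dialect data described above.
DIALECT_DICT = {
    "你": (["n", "i"], 3),
    "好": (["h", "ao"], 3),
}

def apply_tone_rules(units: List[Tuple[List[str], int]]) -> List[Tuple[List[str], int]]:
    """Toy tone control rule: when two tone-3 syllables are adjacent,
    raise the first one to tone 2 (a sandhi-like adjustment)."""
    adjusted = list(units)
    for i in range(len(adjusted) - 1):
        phones, tone = adjusted[i]
        if tone == 3 and adjusted[i + 1][1] == 3:
            adjusted[i] = (phones, 2)
    return adjusted

def text_to_dialect_phonemes(text: str) -> List[str]:
    """Convert characters to phonemes, then relabel them according to
    the dialect's tone control rules."""
    units = [DIALECT_DICT[ch] for ch in text if ch in DIALECT_DICT]
    units = apply_tone_rules(units)
    # Flatten to a phoneme sequence, attaching the tone to the final
    # phoneme of each syllable.
    return [f"{p}{tone}" if p == phones[-1] else p
            for phones, tone in units for p in phones]

print(text_to_dialect_phonemes("你好"))  # ['n', 'i2', 'h', 'ao3']
```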
In the embodiments of the application, the text data thus also carries dialect characteristics, which further improves the model's final dialect speech synthesis effect.
Exemplary apparatus
Correspondingly, an embodiment of the present application further provides a speech synthesis apparatus which, as shown in fig. 4, comprises:
a first obtaining module 401, configured to obtain text features of a target text and speech requirement information, where the speech requirement information includes an identifier of a target dialect;
a speech module 402, configured to input the text features and the speech requirement information into a speech synthesis model to obtain synthesized speech that reads the target text in the target dialect.
The speech synthesis model comprises an encoding module, a dialect discrimination module, and a decoding module. The encoding module is configured to generate initial prior information according to the text features and the speech requirement information; the dialect discrimination module is configured to generate decision information according to the initial prior information, and the decision information is used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech requirement information, and the decision information, and to input the target prior information into the decoding module; and the decoding module is configured to generate the synthesized speech according to the target prior information.
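A minimal sketch of this two-pass data flow is given below. All module classes, the additive form of the correction, and the tensor dimensions are assumptions made for illustration; the embodiment does not fix a concrete network structure.

```python
import torch
import torch.nn as nn

class DialectTTS(nn.Module):
    """Sketch of the encoder / dialect-discriminator / decoder flow."""

    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)     # encoding module
        self.discriminator = nn.Linear(dim, dim)              # dialect discrimination module
        self.decoder = nn.GRU(dim, n_mels, batch_first=True)  # decoding module

    def forward(self, text_feat: torch.Tensor, demand: torch.Tensor) -> torch.Tensor:
        # 1) Initial prior information from text features + speech requirement info.
        initial_prior, _ = self.encoder(text_feat + demand)
        # 2) Decision information generated from the initial prior.
        decision = self.discriminator(initial_prior)
        # 3) Target prior: the encoder runs again with the decision information
        #    (a simple additive correction stands in for the real scheme).
        target_prior, _ = self.encoder(text_feat + demand + decision)
        # 4) The decoder generates acoustic features (e.g. a mel spectrogram).
        mel, _ = self.decoder(target_prior)
        return mel

model = DialectTTS()
mel = model(torch.randn(1, 50, 256), torch.randn(1, 50, 256))
print(mel.shape)  # torch.Size([1, 50, 80])
```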
In some embodiments, the speech synthesis model further comprises a prosody encoding module, where the prosody encoding module is used for generating the prosody code of the target text in the target dialect according to the text features and the speech requirement information.
The encoding module inputs the target prior information into the decoding module, comprising:
The encoding module concatenates the prosody code and the target prior information to obtain encoding features, and inputs the encoding features into the decoding module.
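The splicing step can be as simple as a concatenation along the feature dimension, as in the following fragment (all shapes are assumed for illustration):

```python
import torch

prosody_code = torch.randn(1, 50, 64)    # prosody code of the text in the target dialect
target_prior = torch.randn(1, 50, 256)   # target prior information from the encoder

# Concatenate along the feature dimension to form the encoding features
# that are handed to the decoding module.
encoding_features = torch.cat([prosody_code, target_prior], dim=-1)
print(encoding_features.shape)  # torch.Size([1, 50, 320])
```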
In some embodiments, the text feature comprises a target phoneme sequence of the target text having target dialect features.
In some embodiments, the first obtaining module 401 is specifically configured to convert the target text into an initial phoneme sequence based on the phoneme pronunciation rules of the target dialect, and to arrange and combine the phonemes in the initial phoneme sequence based on the prosody control rules and/or tone control rules of the target dialect, generating a target phoneme sequence with the characteristics of the target dialect.
In some embodiments, the speech requirement information further includes a person identifier, and the speech module 402 is specifically configured to input the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech of a target person reading the target text in the target dialect, where the target person includes the person indicated by the person identifier.
In some embodiments, the speech requirement information further includes a style identifier, and the speech module 402 is specifically configured to input the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect in the target style, where the target style includes the style indicated by the style identifier.
The speech synthesis apparatus provided in this embodiment belongs to the same inventive concept as the speech synthesis method provided in the above embodiments of the present application; it can execute the speech synthesis method provided in any of the above embodiments and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, reference may be made to the specific processing content of the speech synthesis method provided in the above embodiments, which will not be repeated here.
Correspondingly, an embodiment of the present application further provides a training apparatus for a speech synthesis model which, as shown in fig. 5, comprises:
a second obtaining module 501, configured to obtain text features of the sample data and voice requirement information, where the voice requirement information includes an identifier of a target dialect;
The training module 502 is configured to train a speech synthesis model according to the text feature and the speech demand information;
An updating module 503, configured to update model parameters of the speech synthesis model based on the model loss;
The speech synthesis model comprises an encoding module, a dialect discrimination module, and a decoding module. The encoding module is configured to generate initial prior information according to the text features and the speech requirement information; the dialect discrimination module is configured to generate decision information according to the initial prior information, and the decision information is used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech requirement information, and the decision information, and to input the target prior information into the decoding module; and the decoding module is configured to generate synthesized speech according to the target prior information, the synthesized speech including synthesized speech that reads the sample data in the target dialect.
In some embodiments, the speech synthesis model further comprises a prosody encoding module, where the prosody encoding module is used for generating the prosody code of the sample data in the target dialect according to the text features and the speech requirement information.
The encoding module inputs the target prior information into the decoding module, comprising:
The encoding module concatenates the prosody code and the target prior information to obtain encoding features, and inputs the encoding features into the decoding module.
In some embodiments, training module 502 includes:
The first training unit is used for training an initial speech synthesis model based on Mandarin data to obtain an intermediate speech synthesis model, where the intermediate speech synthesis model is used to generate speech data corresponding to the Mandarin data;
and the second training unit is used for training the intermediate speech synthesis model according to the text features and the speech requirement information to obtain a trained speech synthesis model.
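The two training units can be viewed as consecutive stages over the same model, as in the hypothetical sketch below; the loss, the data loaders, and the learning rates are placeholders, and `model` is assumed to follow the forward signature sketched earlier.

```python
import torch
import torch.nn.functional as F

def train_stage(model, loader, epochs: int, lr: float):
    """One training stage: iterate over mini-batches and update parameters."""
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for text_feat, demand, target_mel in loader:
            mel = model(text_feat, demand)
            loss = F.l1_loss(mel, target_mel)   # placeholder reconstruction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: pre-train on large-scale Mandarin data -> intermediate model.
# Stage 2: continue on dialect text features + speech requirement info,
# with a smaller learning rate to keep the model stable.
# (`mandarin_loader` and `dialect_loader` are assumed data loaders.)
# train_stage(model, mandarin_loader, epochs=10, lr=1e-4)
# train_stage(model, dialect_loader, epochs=5, lr=1e-5)
```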
In some embodiments, the first training unit is specifically configured to train the initial speech synthesis model based on Mandarin data and its style identifier to obtain an intermediate speech synthesis model, where the intermediate speech synthesis model is configured to generate speech data corresponding to the Mandarin data and its style identifier, and that speech data includes speech data that reads the Mandarin data in the style indicated by the style identifier.
In some embodiments, the speech requirement information further includes a style identifier, and the second training unit is specifically configured to train the intermediate speech synthesis model according to the text features, the identifier of the target dialect, and the style identifier to obtain a trained speech synthesis model.
In some embodiments, the first training unit is specifically configured to train the initial speech synthesis model based on Mandarin data and voiceprint features to obtain an intermediate speech synthesis model, where the intermediate speech synthesis model is configured to generate speech data corresponding to the Mandarin data and the voiceprint features, and that speech data includes speech data that reads the Mandarin data according to the voiceprint features.
In some embodiments, the speech requirement information further includes a person identifier, and the second training unit is specifically configured to train the intermediate speech synthesis model according to the text features, the voiceprint features, the identifier of the target dialect, and the person identifier to obtain a trained speech synthesis model.
In some embodiments, the text feature comprises a target phoneme sequence of the sample data having a target dialect feature.
The training apparatus for the speech synthesis model provided in the embodiments of the application belongs to the same inventive concept as the training method for the speech synthesis model provided in the above embodiments of the application; it can execute the training method provided in any embodiment of the application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, reference may be made to the specific processing content of the training method of the speech synthesis model provided in the foregoing embodiments, which will not be repeated here.
It should be appreciated that the modules in the above apparatus may be implemented in the form of software invoked by a processor. For example, the apparatus includes a processor connected to a memory in which instructions are stored; the processor invokes the instructions stored in the memory to implement any of the methods above or the functions of the units of the apparatus. Here the processor may be a general-purpose processor, such as a CPU or a microprocessor, and the memory may be located inside or outside the apparatus. Alternatively, the units in the apparatus may be implemented as hardware circuits, where the functions of some or all of the units are realized by circuit design; such a hardware circuit may be understood as one or more processors. In one implementation, the hardware circuit is an ASIC, and the functions of some or all of the units are realized by designing the logic relationships of the elements in the circuit. In another implementation, the hardware circuit is a PLD, for example an FPGA, which may contain a large number of logic gates whose interconnections are configured through a configuration file to realize the functions of some or all of the units. All units of the above apparatus may be implemented as software invoked by a processor, entirely as hardware circuits, or partly as software invoked by a processor with the rest as hardware circuits.
In the embodiments of the present application, the processor is a circuit with signal-processing capability. In one implementation, the processor is a circuit capable of reading and executing instructions, such as a CPU, a microprocessor, a GPU, or a DSP. In another implementation, the processor realizes a given function through the logic relationships of a hardware circuit, where those logic relationships are fixed or reconfigurable; for example, the processor is a hardware circuit implemented as an ASIC or a PLD such as an FPGA. In a reconfigurable hardware circuit, the process by which the processor loads a configuration file to configure the hardware circuit may be understood as the processor loading instructions to realize the functions of some or all of the above units. Furthermore, a hardware circuit designed for artificial intelligence may be used, which can be understood as an ASIC, such as an NPU, a TPU, or a DPU.
It will be seen that each of the units in the above apparatus may be one or more processors (or processing circuits) configured to implement the above methods, e.g., CPU, GPU, NPU, TPU, DPU, a microprocessor, DSP, ASIC, FPGA, or a combination of at least two of these processor forms.
Furthermore, the units in the above apparatus may be integrated together in whole or in part, or may be implemented independently. In one implementation, these units are integrated together and implemented in the form of an SOC. The SOC may include at least one processor for implementing any of the methods above or for implementing the functions of the units of the apparatus, where the at least one processor may be of different types, including, for example, a CPU and an FPGA, a CPU and an artificial intelligence processor, a CPU and a GPU, and the like.
Exemplary electronic device
An embodiment of the present application proposes an electronic device, as shown in fig. 6, including:
a memory 600 and a processor 610;
Wherein the memory 600 is connected to the processor 610, and is used for storing a program;
the processor 610 is configured to implement the speech synthesis method or the training method of the speech synthesis model disclosed in any of the above embodiments by running the program stored in the memory 600.
In particular, the electronic device may further include a bus, a communication interface 620, an input device 630, and an output device 640.
The processor 610, the memory 600, the communication interface 620, the input device 630, and the output device 640 are connected to each other by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
The processor 610 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor; an application-specific integrated circuit (ASIC); or one or more integrated circuits for controlling the execution of programs according to the solution of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 610 may include a main processor, and may also include a baseband chip, a modem, and the like.
The memory 600 stores a program for implementing the technical solution of the present invention, and may also store an operating system and other key services. Specifically, the program may include program code, and the program code includes computer operation instructions. More specifically, the memory 600 may include a read-only memory (ROM), other types of static storage devices that can store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and the like.
The input device 630 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input means, touch screen, pedometer, or gravity sensor, among others.
Output device 640 may include means such as a display screen, printer, speakers, etc. that allow information to be output to a user.
The communication interface 620 may include any device that uses a transceiver or the like to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 610 executes the programs stored in the memory 600 and invokes other devices, which may be used to implement the steps of any speech synthesis method or training method of a speech synthesis model provided in the above embodiments of the present application.
The embodiment of the application further provides a chip, which includes a processor and a data interface; the processor reads, through the data interface, a program stored in a memory and runs the program to execute the speech synthesis method or the training method of the speech synthesis model described in any of the above embodiments. For the specific processing and its beneficial effects, reference may be made to the above embodiments of the speech synthesis method or the training method of the speech synthesis model.
Exemplary computer program product and storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when run by a processor, cause the processor to perform the steps in the speech synthesis method or the training method of the speech synthesis model according to the various embodiments of the application described in any of the embodiments of the specification.
Program code for carrying out the operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, an embodiment of the present application may also be a storage medium having stored thereon a computer program that is executed by a processor to perform the steps in the speech synthesis method or the training method of the speech synthesis model according to the various embodiments of the present application described in any of the above embodiments of the present specification.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.
The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.
The modules and the submodules in the device and the terminal of the embodiments of the application can be combined, divided and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software elements may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprise", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A speech synthesis method, characterized in that the method comprises:
acquiring text features of a target text and speech requirement information, wherein the speech requirement information comprises an identifier of a target dialect;
inputting the text features and the speech requirement information into a speech synthesis model to obtain synthesized speech that reads the target text in the target dialect;
wherein the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the encoding module is configured to generate initial prior information according to the text features and the speech requirement information; the dialect discrimination module is configured to generate decision information according to the initial prior information, the decision information being used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech requirement information and the decision information, and to input the target prior information into the decoding module; and the decoding module is configured to generate the synthesized speech according to the target prior information.
2. The method according to claim 1, wherein the speech synthesis model further comprises a prosody encoding module configured to generate a prosody code of the target text in the target dialect according to the text features and the speech requirement information; and inputting, by the encoding module, the target prior information into the decoding module comprises: concatenating, by the encoding module, the prosody code and the target prior information to obtain encoding features, and inputting the encoding features into the decoding module.
3. The method according to claim 1, wherein the text features comprise a target phoneme sequence of the target text having characteristics of the target dialect.
4. The method according to claim 3, wherein acquiring the text features of the target text comprises:
converting the target text into an initial phoneme sequence based on phoneme pronunciation rules of the target dialect; and
arranging and combining the phonemes in the initial phoneme sequence based on prosody control rules and/or tone control rules of the target dialect to generate the target phoneme sequence having the characteristics of the target dialect.
5. The method according to claim 1, wherein the speech requirement information further comprises a person identifier, and inputting the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect comprises: inputting the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech of a target person reading the target text in the target dialect, wherein the target person comprises the person indicated by the person identifier.
6. The method according to claim 1, wherein the speech requirement information further comprises a style identifier, and inputting the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect comprises: inputting the text features and the speech requirement information into the speech synthesis model to obtain synthesized speech that reads the target text in the target dialect in a target style, wherein the target style comprises the style indicated by the style identifier.
7. A method for training a speech synthesis model, characterized in that the method comprises:
acquiring text features of sample data and speech requirement information, wherein the speech requirement information comprises an identifier of a target dialect;
training a speech synthesis model according to the text features and the speech requirement information; and
updating model parameters of the speech synthesis model based on a model loss;
wherein the speech synthesis model comprises an encoding module, a dialect discrimination module and a decoding module; the encoding module is configured to generate initial prior information according to the text features and the speech requirement information; the dialect discrimination module is configured to generate decision information according to the initial prior information, the decision information being used to correct the initial prior information; the encoding module is further configured to generate target prior information according to the text features, the speech requirement information and the decision information, and to input the target prior information into the decoding module; and the decoding module is configured to generate synthesized speech according to the target prior information, the synthesized speech comprising synthesized speech that reads the sample data in the target dialect.
8. The method according to claim 7, wherein the speech synthesis model further comprises a prosody encoding module configured to generate a prosody code of the sample data in the target dialect according to the text features and the speech requirement information; and inputting, by the encoding module, the target prior information into the decoding module comprises: concatenating, by the encoding module, the prosody code and the target prior information to obtain encoding features, and inputting the encoding features into the decoding module.
9. The method according to claim 7 or 8, wherein training the speech synthesis model according to the text features and the speech requirement information comprises:
training an initial speech synthesis model based on Mandarin data to obtain an intermediate speech synthesis model, the intermediate speech synthesis model being used to generate speech data corresponding to the Mandarin data; and
training the intermediate speech synthesis model according to the text features and the speech requirement information to obtain a trained speech synthesis model.
10. The method according to claim 9, wherein training the initial speech synthesis model based on the Mandarin data to obtain the intermediate speech synthesis model comprises: training the initial speech synthesis model based on the Mandarin data and a style identifier thereof to obtain the intermediate speech synthesis model, the intermediate speech synthesis model being used to generate speech data corresponding to the Mandarin data and the style identifier, that speech data comprising speech data that reads the Mandarin data in the style indicated by the style identifier.
11. The method according to claim 10, wherein the speech requirement information further comprises a style identifier, and training the intermediate speech synthesis model according to the text features and the speech requirement information comprises: training the intermediate speech synthesis model according to the text features, the identifier of the target dialect and the style identifier to obtain the trained speech synthesis model.
12. The method according to claim 9, wherein training the initial speech synthesis model based on the Mandarin data to obtain the intermediate speech synthesis model comprises: training the initial speech synthesis model based on the Mandarin data and voiceprint features to obtain the intermediate speech synthesis model, the intermediate speech synthesis model being used to generate speech data corresponding to the Mandarin data and the voiceprint features, that speech data comprising speech data that reads the Mandarin data according to the voiceprint features.
13. The method according to claim 12, wherein the speech requirement information further comprises a person identifier, and training the intermediate speech synthesis model according to the text features and the speech requirement information comprises: training the intermediate speech synthesis model according to the text features, the voiceprint features, the identifier of the target dialect and the person identifier to obtain the trained speech synthesis model.
14. The method according to claim 7, wherein the text features comprise a target phoneme sequence of the sample data having characteristics of the target dialect.
15. An electronic device, comprising a memory and a processor; wherein the memory is connected to the processor and is configured to store a program; and the processor is configured to implement the method according to any one of claims 1 to 14 by running the program stored in the memory.
16. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 14.
CN202411955158.4A | 2024-12-27 | Speech synthesis method, speech synthesis model training method, electronic device and computer program product | Active | CN119763547B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411955158.4A (granted as CN119763547B) | 2024-12-27 | Speech synthesis method, speech synthesis model training method, electronic device and computer program product


Publications (2)

Publication Number | Publication Date
CN119763547A (en) | 2025-04-04
CN119763547B (en) | 2025-10-10


Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101739869A (en) * | 2008-11-19 | 2010-06-16 | Institute of Automation, Chinese Academy of Sciences | Priori knowledge-based pronunciation evaluation and diagnosis system
CN104391673A (en) * | 2014-11-20 | 2015-03-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice interaction method and voice interaction device


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
