Detailed Description
To make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
With the continuous development of artificial intelligence, speech synthesis technology has become increasingly mature and may now be applied in a variety of scenarios.
For example, in one possible reading scenario, speech synthesis technology is applied in various reading apps to provide users with read-aloud functions in a variety of voice libraries, freeing the user's hands and eyes and providing a better reading experience.
In another possible news broadcasting scenario, speech synthesis technology provides a voice library specially created for news broadcasting, so that the virtual assistant on mobile phones, smart speakers, and other devices can broadcast the latest news to the user anytime and anywhere.
In another possible order broadcasting scenario, speech synthesis technology can be applied to ride-hailing software, restaurant queue-calling software, and similar applications, where orders are announced through synthesized speech so that users can conveniently receive notification information.
In another possible scenario, speech synthesis technology may be integrated into intelligent hardware devices such as children's story machines, intelligent robots, and tablet devices, making user interaction with the device more natural and friendly.
Currently, much speech synthesis research focuses on single-language text, which refers to text containing only one language, and the technology for converting single-language text into synthesized speech is mature.
However, with the progress of globalization, communication between countries has increased, and mixed-language text, which refers to text containing at least two languages, has become increasingly common in daily life; a Chinese sentence with an embedded English word, such as "我不care" ("I don't care"), is a typical example. At present, however, speech synthesis for mixed-language text has been studied far less, and the technology is relatively immature.
To address the above technical problem, the present application provides a speech synthesis method that mainly includes the following steps: obtaining a target text to be synthesized that is composed of at least two languages, and inputting the target text into a text synthesis model that includes at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module, and a speech conversion module. Feature extraction is performed on the target text by the at least two feature extraction modules respectively, yielding at least two text features in one-to-one correspondence with the modules; the at least two text features are fused by the feature fusion module to obtain a fused feature; and finally the fused feature is converted by the speech conversion module to obtain synthesized speech corresponding to the target text. In this way, the embodiments of the application provide a method for synthesizing speech from mixed-language text.
Because the accuracy of different language features differs across the text features extracted by each feature extraction module, consider a target text that includes Chinese text and English text: in the text features extracted by the Chinese feature extraction module, the Chinese language features have higher accuracy and the English language features have lower accuracy, while in the text features extracted by the English feature extraction module, the English language features have higher accuracy and the Chinese language features have lower accuracy. By fusing the text features corresponding to Chinese from the Chinese feature extraction module with the text features corresponding to English from the English feature extraction module, the resulting fused feature has higher accuracy for each language. The fused feature is then subjected to speech conversion to obtain synthesized speech corresponding to the target text composed of at least two languages, thereby achieving the conversion of mixed-language text into synthesized speech.
It should be noted that the execution subject of the speech synthesis method provided in the embodiments of the present application may be a speech synthesis apparatus, which may be implemented as part or all of a computer device in software, hardware, or a combination of both. The computer device may be a server or a terminal: the server in the embodiments of the present application may be a single server or a server cluster formed by multiple servers, and the terminal may be a smart phone, a personal computer, a tablet computer, a wearable device, or an intelligent hardware device such as a children's story machine or an intelligent robot. In the following method embodiments, the execution subject is described as a computer device.
In one embodiment of the present application, as shown in fig. 1, a speech synthesis method is provided. The method is described as applied to a computer device and includes the following steps:
In step 101, the computer device obtains a target text to be synthesized.
The target text is composed of at least two languages; for example, the at least two languages may be Chinese and English, and the target text may be composed of Chinese text and English text.
In the embodiment of the application, in the case that the computer device is a server, optionally, the server may receive the target text sent by a terminal; the server may also retrieve the target text from a database of the server.
In the case that the computer device is a terminal, optionally, the terminal may receive a target text input by a user, acquire a target text displayed on its interface, or retrieve a target text from data stored on the terminal. The manner in which the computer device acquires the target text is not particularly limited in the embodiments of the present application.
At step 102, the computer device inputs the target text into the text synthesis model.
In the embodiment of the application, the text synthesis model is used to synthesize the input target text into synthesized speech corresponding to the target text. Optionally, the training process of the text synthesis model may include: acquiring a plurality of training samples, each comprising a training text and the ground-truth synthesized speech corresponding to that training text, and training the text synthesis model using the training texts and the ground-truth synthesized speech. The text synthesis model may include at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module, and a speech conversion module; its structure is shown in fig. 2.
The at least two feature extraction modules, in one-to-one correspondence with the at least two languages, are used to extract features from the input target text to obtain at least two text features in one-to-one correspondence with the modules. For example, if the at least two languages included in the target text are Chinese and English, the corresponding feature extraction modules are a Chinese feature extraction module and an English feature extraction module. The Chinese feature extraction module performs feature extraction on the target text to obtain a text feature corresponding to the Chinese feature extraction module, and the English feature extraction module performs feature extraction on the target text to obtain a text feature corresponding to the English feature extraction module.
The text features the Chinese feature extraction module extracts from the Chinese text have high accuracy, while those it extracts from the English text have low accuracy; conversely, the text features the English feature extraction module extracts from the English text have high accuracy, while those it extracts from the Chinese text have low accuracy. Therefore, among the text features extracted by the Chinese feature extraction module, the Chinese language features have higher accuracy and the English language features lower accuracy; among the text features extracted by the English feature extraction module, the English language features have higher accuracy and the Chinese language features lower accuracy.
The feature fusion module is used to fuse the text features extracted by the at least two feature extraction modules to obtain a fused feature. Based on the above example, the feature fusion module fuses the text feature corresponding to the Chinese feature extraction module with the text feature corresponding to the English feature extraction module to obtain the fused feature.
As can be seen from the above, the accuracy of the text features extracted by each feature extraction module differs across texts in different languages: among the text features extracted by the Chinese feature extraction module, the Chinese language features have higher accuracy and the English language features lower accuracy, and the reverse holds for the English feature extraction module. Therefore, optionally, the computer device can fuse the Chinese language features from the text features extracted by the Chinese feature extraction module with the English language features from the text features extracted by the English feature extraction module, so that the fused feature has higher accuracy for each language and the finally synthesized speech is accurate, clear, and natural.
The speech conversion module is used to convert the fused feature output by the feature fusion module into synthesized speech.
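The module layout described above can be sketched in code. The following is a minimal illustration only: the random projections, the averaging fusion, and all names and shapes are assumptions standing in for the trained neural modules, not the application's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURE_DIM = 8  # assumed feature dimension

class FeatureExtractor:
    """One per-language feature extraction module (e.g. Chinese or English)."""
    def __init__(self):
        # A random projection stands in for a trained network.
        self.projection = rng.standard_normal((FEATURE_DIM, FEATURE_DIM))

    def extract(self, text_embedding):
        # text_embedding: (seq_len, FEATURE_DIM) representation of the target text.
        return text_embedding @ self.projection

class TextSynthesisModel:
    def __init__(self, languages):
        # At least two feature extraction modules, one per language.
        self.extractors = {lang: FeatureExtractor() for lang in languages}

    def forward(self, text_embedding):
        # Each module extracts its own text feature from the same target text.
        text_features = [m.extract(text_embedding) for m in self.extractors.values()]
        # The feature fusion module combines them (a plain average here).
        fused = sum(text_features) / len(text_features)
        # The speech conversion module would map the fused feature to audio;
        # a stub that collapses the feature dimension stands in for it.
        return fused.mean(axis=1)

model = TextSynthesisModel(["Chinese", "English"])
speech = model.forward(rng.standard_normal((5, FEATURE_DIM)))
print(speech.shape)  # one stand-in "sample" per input position: (5,)
```

The key structural point mirrored here is that every extractor sees the whole mixed-language input, and only the fusion step reconciles their per-language strengths.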
At step 103, the computer device performs feature extraction on the target text through the at least two feature extraction modules to obtain at least two text features in one-to-one correspondence with the modules.
As described above, the computer device inputs the target text to the at least two feature extraction modules, and each module performs feature extraction on the target text, yielding at least two text features in one-to-one correspondence with the modules.
At step 104, the computer device fuses the at least two text features through the feature fusion module to obtain a fused feature.
In the embodiment of the application, as described above, the at least two feature extraction modules have extracted at least two text features from the target text. To obtain the synthesized speech output by the text synthesis model, the computer device therefore fuses the at least two text features through the feature fusion module to obtain a fused feature.
At step 105, the computer device performs speech conversion on the fused feature through the speech conversion module to obtain synthesized speech corresponding to the target text.
In the embodiment of the application, the computer device inputs the fused feature to the speech conversion module, which converts it into the synthesized speech corresponding to the target text.
According to the above speech synthesis method, the computer device acquires the target text to be synthesized and inputs it into the text synthesis model. The computer device performs feature extraction on the target text through the at least two feature extraction modules in the text synthesis model, obtaining at least two text features in one-to-one correspondence with the modules. The computer device then fuses the at least two text features through the feature fusion module to obtain a fused feature, and performs speech conversion on the fused feature through the speech conversion module to obtain synthesized speech corresponding to the target text. In this method, a target text composed of at least two languages can be input into the text synthesis model, and features are extracted from it by the at least two feature extraction modules in one-to-one correspondence with the at least two languages, yielding at least two text features corresponding to the modules.
Because the accuracy of different language features differs across the text features extracted by each feature extraction module, consider a target text that includes Chinese text and English text: in the text features extracted by the Chinese feature extraction module, the Chinese language features have higher accuracy and the English language features have lower accuracy, while in the text features extracted by the English feature extraction module, the English language features have higher accuracy and the Chinese language features have lower accuracy. By fusing the text features corresponding to Chinese from the Chinese feature extraction module with the text features corresponding to English from the English feature extraction module, the resulting fused feature has higher accuracy for each language. The fused feature is then subjected to speech conversion to obtain synthesized speech corresponding to the target text composed of at least two languages, thereby achieving the conversion of mixed-language text into synthesized speech.
In an alternative embodiment of the present application, as shown in fig. 3, the above-mentioned inputting the target text into the text synthesis model may include the following steps:
in step 301, the computer device converts the target text into target phoneme notation.
In the embodiment of the application, optionally, the computer device may display the target text to a user through a display interface; the user reads the target text, annotates it with phoneme notation, and inputs the resulting target phoneme notation to the computer device. Optionally, in the embodiment of the present application, the target phoneme notation may use the International Phonetic Alphabet.
Alternatively, the computer device may convert the target text into the target phoneme notation using a pre-trained text conversion model, which converts an input target text into the phoneme notation corresponding to it. The training process of the text conversion model may include: acquiring a plurality of text conversion training samples, each comprising a training text and the phoneme notation corresponding to that training text, and training the text conversion model based on the training texts and their corresponding phoneme notation.
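The conversion of step 301 can be illustrated with a toy lookup. The lexicon entries and the notation (pinyin-like for the Chinese tokens, ARPAbet-like for the English word) are hypothetical examples; a real system would use a trained text conversion model or a full pronouncing lexicon.

```python
# Hypothetical lexicon mapping tokens of a mixed Chinese-English target text
# (e.g. "我不care") to phoneme notation.
TOY_LEXICON = {
    "我": "wo3",
    "不": "bu4",
    "care": "K EH1 R",
}

def text_to_phoneme_notation(tokens):
    # Tokens missing from the lexicon are left unchanged rather than guessed.
    return [TOY_LEXICON.get(token, token) for token in tokens]

print(text_to_phoneme_notation(["我", "不", "care"]))  # ['wo3', 'bu4', 'K EH1 R']
```

The point the sketch makes is that both languages end up in a single, uniform symbol inventory before any acoustic modeling happens.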
In step 302, the computer device inputs the target phoneme notation into the acoustic feature recognition model to obtain an acoustic feature corresponding to the target phoneme notation.
In an embodiment of the application, the computer device inputs the target phoneme notation into the acoustic feature recognition model, which recognizes the input target phoneme notation as the corresponding acoustic feature. Optionally, in an embodiment of the present application, the acoustic feature may be a Mel-spectrogram feature.
Optionally, in an embodiment of the present application, the training process of the acoustic feature recognition model may include: acquiring a plurality of acoustic feature training samples, each comprising phoneme notation and the acoustic feature corresponding to that phoneme notation, and training the acoustic feature recognition model based on the phoneme notation and corresponding acoustic features in each sample.
In step 303, the computer device inputs the acoustic features into the text synthesis model.
In the embodiment of the application, the computer equipment inputs the acoustic features output by the acoustic feature recognition model into the text synthesis model, so that the acoustic features corresponding to the target text are synthesized into the synthesized voice.
In the embodiment of the application, the computer device converts the target text into the target phoneme notation, inputs the target phoneme notation into the acoustic feature recognition model to obtain the corresponding acoustic features, and then inputs the acoustic features into the text synthesis model. Because the target text is converted into phoneme notation, the texts of at least two languages in the target text enter the acoustic feature recognition model through phoneme notation of a uniform specification, which simplifies the training and use of the acoustic feature recognition model. Furthermore, instead of inputting the target text directly into the text synthesis model, the acoustic features corresponding to the target phoneme notation are input. Directly inputting the target text would make the text synthesis model complicated to train, and the synthesized speech output from the raw text would be stiff and unclear; inputting acoustic features instead makes the synthesized speech output by the text synthesis model more natural and clear.
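The three stages of steps 301-303 can be chained as a sketch. Every function name, the phoneme notation, and all shapes here are assumptions; in a real system each stage would be a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
N_MELS = 4  # assumed number of acoustic (e.g. Mel) channels

def text_to_phoneme_notation(tokens):
    # Step 301: the text conversion model; trivial tagging stands in for it.
    return [f"/{token}/" for token in tokens]

def acoustic_feature_recognition(phonemes):
    # Step 302: the acoustic feature recognition model; one random frame per
    # phoneme symbol stands in for a predicted Mel-spectrogram.
    return rng.standard_normal((len(phonemes), N_MELS))

def text_synthesis_model(acoustic_features):
    # Step 303: the text synthesis model consumes acoustic features, not raw
    # text; a stub returning one value per frame stands in for synthesis.
    return acoustic_features.mean(axis=1)

phonemes = text_to_phoneme_notation(["我", "不", "care"])
acoustic = acoustic_feature_recognition(phonemes)
speech = text_synthesis_model(acoustic)
print(phonemes)        # ['/我/', '/不/', '/care/']
print(acoustic.shape)  # (3, 4)
print(speech.shape)    # (3,)
```

Note that the mixed-language distinction disappears after stage 1: stages 2 and 3 operate on a single uniform representation, which is the simplification the embodiment claims.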
In an alternative embodiment of the present application, as shown in fig. 4, the above-mentioned fusing processing of at least two text features by the feature fusion module may include the following steps:
In step 401, for each text feature, the computer device determines at least two language features in the text feature, and determines the weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature.
The at least two language features are in one-to-one correspondence with the at least two languages.
In the embodiment of the application, for each text feature output by the at least two feature extraction modules in one-to-one correspondence with the at least two languages, the computer device determines at least two language features in the text feature based on the position of each language's text in the target text.
For example, the target text includes the two languages Chinese and English, and the two corresponding feature extraction modules are a Chinese feature extraction module and an English feature extraction module. The computer device determines the Chinese language features and the English language features in the text features output by the Chinese feature extraction module and the English feature extraction module according to the positions of the Chinese text and the English text in the target text.
Optionally, the target text is "我不care" and the two feature extraction modules are a Chinese feature extraction module and an English feature extraction module. The computer device determines the Chinese language feature corresponding to "我不" and the English language feature corresponding to "care" in the text features output by the two modules, according to the positions of "我不" and "care" in "我不care".
In the embodiment of the application, after determining the at least two language features in each text feature, the computer device determines the target language corresponding to the feature extraction module that extracted each text feature, and determines the weight corresponding to each language feature according to the accuracy with which that module extracts each language feature.
For example, continuing the above example, after determining the Chinese language feature and the English language feature in each text feature, the computer device determines that the target language of the Chinese feature extraction module is Chinese and that of the English feature extraction module is English. The computer device then determines the weights corresponding to the Chinese and English language features in each text feature according to the accuracy with which the Chinese and English feature extraction modules extract them.
In step 402, the computer device fuses the at least two text features according to the language features in each text feature and the weights corresponding to those language features.
In the embodiment of the application, after determining the language features in each text feature and their corresponding weights, the computer device may, optionally, multiply each language feature by its corresponding weight and combine the results, thereby fusing the at least two text features.
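A minimal numeric sketch of this weighted fusion follows. The vectors and the 0.8/0.2 weights are illustrative assumptions, not values fixed by the application.

```python
import numpy as np

# Each language feature is multiplied by its weight and the results combined.
zh_feature = np.array([1.0, 2.0])  # Chinese language feature from one text feature
en_feature = np.array([3.0, 4.0])  # English language feature from another text feature
zh_weight, en_weight = 0.8, 0.2    # hypothetical weights

fused = zh_weight * zh_feature + en_weight * en_feature
print(fused)
```

The higher weight goes to the language feature extracted by the module that matches that language, which is exactly what steps 601-604 later formalize.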
In an embodiment of the application, the computer device determines at least two language features in each acquired text feature so that different language features in each text feature can be processed differently. Because the accuracy of different language features differs across the text features extracted by different feature extraction modules, determining each language feature's weight according to the target language of the module that extracted the text feature ensures the accuracy of each language feature. The computer device then fuses the at least two text features according to the language features and their weights, ensuring the accuracy of the fused feature and, in turn, the accuracy, clarity, and naturalness of the synthesized speech output by the text synthesis model.
In an alternative embodiment of the present application, as shown in fig. 5, the above-mentioned determining of at least two language features in the text features may include the following steps:
in step 501, the computer device determines at least two language texts among the target texts.
The at least two language texts are in one-to-one correspondence with the at least two languages. For example, if the at least two languages are Chinese and English, the at least two language texts are Chinese text and English text respectively.
In the embodiment of the present application, optionally, the target text acquired by the computer device carries identification information for the at least two languages, and the computer device determines the at least two language texts in the target text based on this identification information. For example, the target text is "我不care", in which the identification information corresponding to "care" is English and the identification information corresponding to "我不" is Chinese; by reading this identification information, the computer device determines that "care" is English text and "我不" is Chinese text.
Optionally, the computer device may instead determine the at least two language texts in the target text by means of a text determination model, which determines the language texts included in a target text input to it. For example, after "我" is input into the text determination model, the model may determine that "我" is Chinese text; after "care" is input, the model may determine that "care" is English text; and after "我不care" is input, the model may determine that "我不" is Chinese text and "care" is English text.
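The identification of language texts and their positions can be approximated with Unicode ranges. This is a minimal sketch assuming only Chinese characters (the CJK Unified Ideographs block) and English letters occur; a deployed text determination model would typically be learned rather than rule-based.

```python
import re

def split_languages(text):
    """Return (language, language text, start position) for each segment."""
    segments = []
    for match in re.finditer(r"[\u4e00-\u9fff]+|[A-Za-z]+", text):
        lang = "Chinese" if re.match(r"[\u4e00-\u9fff]", match.group()) else "English"
        # The position in the target text is what step 502 uses to locate
        # the corresponding language features.
        segments.append((lang, match.group(), match.start()))
    return segments

print(split_languages("我不care"))  # [('Chinese', '我不', 0), ('English', 'care', 2)]
```

The recorded start positions are the link to the next step: they tell the model which parts of each extracted text feature belong to which language.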
At step 502, the computer device determines at least two language features from among the text features based on the location of the at least two language texts in the target text.
In the embodiment of the application, after determining the at least two language texts in the target text, the computer device determines the at least two language features in the at least two text features output by the at least two feature extraction modules according to the positions of the language texts in the target text.
In an embodiment of the application, the computer device determines the at least two language texts in the target text, and determines the at least two language features in the text features according to the positions of those language texts in the target text. Determining the at least two language features in each text feature facilitates assigning weights to them and avoids weight assignment errors.
In an optional embodiment of the present application, the language features are matrices. As shown in fig. 6, fusing the at least two text features according to the language features in each text feature and their corresponding weights may include the following steps:
In step 601, the computer device determines a first language feature and a second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature.
In the embodiment of the application, the first language feature corresponds to the target language, and the second language feature does not correspond to the target language.
In the embodiment of the application, the computer device determines the target language corresponding to each feature extraction module according to the module that extracted each text feature, and determines the first language feature and the second language feature in each text feature based on that target language.
For example, the target text includes the two languages Chinese and English, and the two corresponding feature extraction modules are a Chinese feature extraction module and an English feature extraction module. Since the target language of the Chinese feature extraction module is Chinese, the computer device determines, in the text feature output by the Chinese feature extraction module, the Chinese language feature as the first language feature and the English language feature as the second language feature; in the text feature output by the English feature extraction module, it determines the English language feature as the first language feature and the Chinese language feature as the second language feature.
In step 602, the computer device uses a first weight as the weight corresponding to the first language feature and a second weight as the weight corresponding to the second language feature, where the first weight is greater than the second weight.
In an embodiment of the present application, the first weight is greater than the second weight because each feature extraction module extracts the different language texts in the target text with different accuracies. Therefore, the computer device determines the at least two language features in the text feature output by each feature extraction module and, according to the accuracy with which each module extracts each language feature, uses the first weight as the weight of the first language feature and the second weight as the weight of the second language feature.
Optionally, based on the above example, in the text feature output by the Chinese feature extraction module, the Chinese language feature is the first language feature and the English language feature is the second language feature; the first weight is used as the weight of the Chinese language feature and the second weight as the weight of the English language feature, with the first weight greater than the second weight. For example, the first weight may be 80% and the second weight 20%; the Chinese language feature may be a first Chinese matrix and the English language feature a second English matrix. Correspondingly, in the text feature output by the English feature extraction module, the English language feature is the first language feature and the Chinese language feature is the second language feature; the first weight (e.g., 80%) is used as the weight of the English language feature and the second weight (e.g., 20%) as the weight of the Chinese language feature, the English language feature may be a first English matrix, and the Chinese language feature a second Chinese matrix.
In step 603, the computer device multiplies each language feature by its corresponding weight to obtain a plurality of corrected language features.
In the embodiment of the application, after obtaining each language feature and its corresponding weight, the computer device multiplies each language feature by that weight to obtain a plurality of corrected language features.
Continuing with the above example. Optionally, the computer device determines the Chinese language feature in the text feature corresponding to the Chinese feature extraction module to be the first language feature and the English language feature to be the second language feature, with a first weight of 80% for the first language feature and a second weight of 20% for the second language feature, where the first language feature may be the first Chinese matrix; likewise, the computer device determines the English language feature in the text feature corresponding to the English feature extraction module to be the first language feature and the Chinese language feature to be the second language feature, again with a first weight of 80% and a second weight of 20%. The computer device then multiplies the Chinese language feature in the text feature corresponding to the Chinese feature extraction module by 80% and the English language feature by 20%, and multiplies the English language feature in the text feature corresponding to the English feature extraction module by 80% and the Chinese language feature by 20%; that is, the first Chinese matrix is multiplied by 80%, the second English matrix by 20%, the first English matrix by 80%, and the second Chinese matrix by 20%, thereby obtaining a plurality of corrected language features.
In step 604, the computer device adds, for each of the at least two languages, the corrected language features corresponding to that language to obtain a candidate language feature corresponding to that language.
In the embodiment of the application, after the plurality of corrected language features is obtained, the computer device determines the language corresponding to each corrected language feature, thereby determining at least two languages from the corrected language features. For each determined language, the computer device adds the corrected language features corresponding to that language, so as to obtain the candidate language feature corresponding to that language.
For example, based on the above examples, the computer device obtains the corrected Chinese language feature and corrected English language feature corresponding to the Chinese feature extraction module, and the corrected English language feature and corrected Chinese language feature corresponding to the English feature extraction module. The computer device adds the corrected Chinese language feature corresponding to the Chinese feature extraction module and the corrected Chinese language feature corresponding to the English feature extraction module to obtain a candidate Chinese feature, and adds the corrected English language feature corresponding to the Chinese feature extraction module and the corrected English language feature corresponding to the English feature extraction module to obtain a candidate English feature. That is, the first Chinese matrix multiplied by 80% is added to the second Chinese matrix multiplied by 20% to obtain the candidate Chinese feature, and the first English matrix multiplied by 80% is added to the second English matrix multiplied by 20% to obtain the candidate English feature.
In step 605, the computer device performs splicing processing on each candidate language feature.
In the embodiment of the application, after acquiring each candidate language feature, the computer device splices the candidate language features together.
Based on the above, the computer device splices the candidate Chinese feature and the candidate English feature to obtain a spliced Chinese-English language feature.
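The arithmetic of steps 603 to 605 can be sketched as follows. The 2x2 matrices, the 80%/20% weights, and the helper names are illustrative assumptions for this example, not the claimed implementation:

```python
# Sketch of steps 603-605: multiply each language feature by its weight,
# add the corrected features per language, then splice the candidates.
# Matrices, weights, and helper names are illustrative assumptions.

def scale(matrix, weight):
    """Step 603: multiply every element of a feature matrix by a weight."""
    return [[value * weight for value in row] for row in matrix]

def add(a, b):
    """Step 604: element-wise addition of two corrected language features."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def splice(a, b):
    """Step 605: row-wise concatenation of the candidate language features."""
    return a + b

# Text features output by the two feature extraction modules (see the example
# above): each module yields a feature for its own language and for the other.
first_chinese = [[1.0, 2.0], [3.0, 4.0]]    # Chinese module, Chinese feature
second_english = [[0.5, 0.5], [0.5, 0.5]]   # Chinese module, English feature
first_english = [[2.0, 2.0], [2.0, 2.0]]    # English module, English feature
second_chinese = [[0.2, 0.4], [0.6, 0.8]]   # English module, Chinese feature

FIRST_WEIGHT, SECOND_WEIGHT = 0.8, 0.2      # first weight > second weight

# Candidate features: first matrix x 80% plus second matrix x 20%.
candidate_zh = add(scale(first_chinese, FIRST_WEIGHT),
                   scale(second_chinese, SECOND_WEIGHT))
candidate_en = add(scale(first_english, FIRST_WEIGHT),
                   scale(second_english, SECOND_WEIGHT))

# Spliced Chinese-English language feature.
fused = splice(candidate_zh, candidate_en)
```

With these toy values, `candidate_zh` works out to roughly [[0.84, 1.68], [2.52, 3.36]], every entry of `candidate_en` to 1.7, and `fused` simply stacks the two candidates; a real fusion module would operate on learned feature matrices.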
In the embodiment of the application, the computer device determines the first language feature and the second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature, so that the first language feature is the language feature corresponding to the target language. The computer device then takes the first weight as the weight corresponding to the first language feature and the second weight as the weight corresponding to the second language feature, and multiplies each language feature by its corresponding weight to obtain a plurality of corrected language features, which preserves the accuracy of the at least two language features in each text feature and thereby ensures the accuracy of the corrected language features. For each of the at least two languages, the computer device adds the corrected language features corresponding to that language to obtain a candidate language feature corresponding to that language, and performs splicing processing on the candidate language features. In this way, after the corrected language features corresponding to each language are added, the candidate language feature for that language becomes the optimal feature for that language, and splicing the candidate language features makes the overall language feature corresponding to the target text optimal, ensuring the accuracy, clarity, and naturalness of the synthesized speech output by the text synthesis model.
In an alternative embodiment of the present application, as shown in fig. 7, in the above-mentioned speech synthesis method, the training process of the text synthesis model may include the following steps:
in step 701, a computer device obtains at least two sets of training sets in one-to-one correspondence with at least two languages.
Each training set comprises a plurality of training samples, and each training sample comprises training texts and real synthesized voices corresponding to the training texts.
In the embodiment of the present application, optionally, in the case that the computer device is a server, the server may receive the at least two sets of training sets in one-to-one correspondence with the at least two languages, sent by a terminal; the server may also extract the at least two sets of training sets from a server database.
Optionally, in the case that the computer device is a terminal, the terminal may receive the at least two sets of training sets in one-to-one correspondence with the at least two languages, input by a user; the terminal may also acquire the at least two sets of training sets displayed on its interface; or the terminal may extract them from terminal data. The manner in which the computer device acquires the at least two sets of training sets in one-to-one correspondence with the at least two languages is not particularly limited.
At step 702, the computer device trains at least two single-language text synthesis models corresponding to at least two languages one-to-one, respectively, using at least two training sets.
In the embodiment of the application, after the at least two sets of training sets in one-to-one correspondence with the at least two languages are acquired, the computer device may train, using each training set, the single-language text synthesis model corresponding to the language of that training set. Each single-language text synthesis model may be trained by obtaining its corresponding training set, where each training set includes a plurality of training samples, each training sample includes a training text and the real synthesized speech corresponding to that training text, and the model is trained based on the training texts and the corresponding real synthesized speech in the training samples.
For example, optionally, the computer device obtains a Chinese training set and an English training set, and trains a Chinese text synthesis model and an English text synthesis model using the Chinese training set and the English training set, respectively. Taking the Chinese text synthesis model as an example, the training process is as follows: the computer device obtains the Chinese training set, which includes a plurality of Chinese training samples, each Chinese training sample including a Chinese text and the real synthesized speech corresponding to that Chinese text, and the Chinese text synthesis model is trained based on the Chinese texts and the corresponding real synthesized speech in the Chinese training samples.
In step 703, the computer device obtains at least two feature extraction modules based on at least two single language text synthesis models.
In the embodiment of the application, after training the at least two single-language text synthesis models, the computer device may train the at least two feature extraction modules by knowledge distillation, taking the output features of the penultimate layer of each single-language text synthesis model as the tag features of the feature extraction module corresponding to that model.
Optionally, based on the above examples, the computer device trains the Chinese text synthesis model and the English text synthesis model, and uses knowledge distillation to take the output features of the penultimate layers of the Chinese text synthesis model and the English text synthesis model as the tag features of the Chinese feature extraction module and the English feature extraction module, respectively, so as to train the Chinese feature extraction module and the English feature extraction module.
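The knowledge-distillation idea above can be sketched as follows. The scalar features, the linear student, and the training hyperparameters are illustrative assumptions, since the application does not fix a particular student architecture:

```python
# Sketch of step 703's knowledge distillation: the output of the teacher's
# penultimate layer serves as the tag feature, and a student feature
# extraction module is fitted to reproduce it. The scalar features, the
# linear student, and the hyperparameters are illustrative assumptions.

def teacher_penultimate(x):
    """Stand-in for the penultimate-layer output of a trained
    single-language text synthesis model (the tag feature)."""
    return 2.0 * x + 1.0

def distill_student(samples, epochs=2000, lr=0.01):
    """Fit a linear student w*x + b to the tag features by stochastic
    gradient descent on the squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = (w * x + b) - target
            w -= lr * error * x   # gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error       # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Distillation pairs: (module input, teacher tag feature).
samples = [(x, teacher_penultimate(x)) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = distill_student(samples)
```

After training, `w` and `b` recover the teacher's mapping almost exactly; a real student would be a neural feature extraction module matching high-dimensional penultimate-layer features, typically under a mean-squared-error distillation loss.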
In step 704, the computer device composes the obtained at least two feature extraction modules, the preset fusion module and the preset voice conversion module into a text synthesis model.
In the embodiment of the application, the computer device may train a classification model in advance according to the requirements of feature fusion. The classification model is used to classify the at least two languages in the target text, and the trained classification model is linked to the feature fusion module to guide the fusion of the text features.
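A toy stand-in for such a classification model might look as follows. The rule-based Unicode-range test and the function names are assumptions for illustration only; in practice this would be a trained model:

```python
# Toy stand-in for the classification model that labels each unit of the
# target text with its language, so the fusion module knows which language
# feature is the matching ("first") one. The Unicode-range rule and the
# function names are assumptions; a trained classifier would replace them.

def classify(token):
    """Label a token 'zh' if it contains a CJK ideograph, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in token) else "en"

def language_mask(target_text):
    """Per-character language labels, usable to locate the at least two
    language features inside each text feature."""
    return [classify(ch) for ch in target_text]
```

For instance, `language_mask("读book")` yields `["zh", "en", "en", "en", "en"]`, marking which positions in the text feature belong to which language.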
Optionally, the computer device may train a speech conversion model to convert the fusion feature into synthesized speech, based on the content of the last layer of the single-language text synthesis model.
In the embodiment of the application, the computer device composes the acquired at least two feature extraction modules, the preset fusion module, and the preset voice conversion module into a text synthesis model.
In the embodiment of the application, the computer device acquires at least two sets of training sets in one-to-one correspondence with at least two languages, and trains at least two single-language text synthesis models in one-to-one correspondence with the at least two languages using the at least two sets of training sets. Because the training set corresponding to each single-language text synthesis model is large and the training method is mature, the accuracy of each trained single-language text synthesis model can be ensured. After training the at least two single-language text synthesis models, the computer device obtains at least two feature extraction modules based on them; with the accuracy of each single-language text synthesis model ensured, the accuracy of the at least two feature extraction modules can also be ensured. The computer device then composes the acquired at least two feature extraction modules, the preset fusion module, and the preset voice conversion module into a text synthesis model. The preset fusion module and the preset voice conversion module can be trained multiple times and have high accuracy. Based on the accuracy of the at least two feature extraction modules, the preset fusion module, and the preset voice conversion module, the accuracy of the text synthesis model is ensured, and the accuracy of the synthesized speech obtained by synthesizing the target text with the text synthesis model is improved.
In order to better explain the speech synthesis method provided by the present application, the present application provides an embodiment for explaining the overall flow aspect of the speech synthesis method, as shown in fig. 8, the method includes:
in step 801, a computer device obtains at least two sets of training sets corresponding to at least two languages one to one, wherein each set of training sets includes a plurality of training samples, each training sample including training text and real synthesized speech corresponding to the training text.
At step 802, the computer device trains at least two single-language text synthesis models in one-to-one correspondence with the at least two languages, using the at least two training sets respectively.
In step 803, the computer device obtains at least two feature extraction modules based on at least two single language text synthesis models.
In step 804, the computer device composes the text synthesis model from the obtained at least two feature extraction modules, the preset fusion module and the preset voice conversion module.
In step 805, the computer device obtains a target text to be synthesized, the target text being composed of at least two languages.
At step 806, the computer device converts the target text into a target phoneme label.
In step 807, the computer device inputs the target phoneme label into the acoustic feature recognition model to obtain an acoustic feature corresponding to the target phoneme label.
At step 808, the computer device inputs acoustic features into the text-synthesis model.
In step 809, the computer device determines at least two language texts in the target text, wherein the at least two language texts correspond to the at least two languages one to one.
At step 810, the computer device determines at least two language features among the text features based on the location of the at least two language texts in the target text.
In step 811, the computer device determines a first language feature and a second language feature in at least two language features according to the target language corresponding to the feature extraction module of the extracted text feature, where the first language feature corresponds to the target language and the second language feature does not correspond to the target language.
In step 812, the computer device uses the first weight as a weight corresponding to the first language feature, and uses the second weight as a weight corresponding to the second language feature, where the first weight is greater than the second weight.
In step 813, the computer device multiplies each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features.
In step 814, the computer device adds, for each of the at least two languages, the respective modified language features corresponding to the language to obtain candidate language features corresponding to the language.
In step 815, the computer device performs a stitching process on each candidate language feature.
In step 816, the computer device performs a voice conversion process on the fusion feature through a voice conversion module, so as to obtain a synthesized voice corresponding to the target text.
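The inference portion of the flow (steps 805 to 815) can be sketched end to end as follows. Every helper and the scalar stand-in features are assumptions for illustration, not the actual model:

```python
# End-to-end sketch of the inference portion of the flow (steps 805-815).
# Helpers and scalar stand-in features are assumptions for illustration;
# the real model uses learned feature matrices, and step 816 would pass the
# fused result through the voice conversion module to produce speech.

def split_by_language(target_text):
    """Step 805: partition the target text into per-language spans
    (a Unicode-range rule stands in for a trained classifier)."""
    spans = {}
    for ch in target_text:
        lang = "zh" if "\u4e00" <= ch <= "\u9fff" else "en"
        spans.setdefault(lang, []).append(ch)
    return {lang: "".join(chars) for lang, chars in spans.items()}

def fuse(features, first_weight=0.8, second_weight=0.2):
    """Steps 811-815: weight the matching feature more heavily, add the
    corrected features per language, then splice the candidates."""
    fused = []
    for lang in sorted(features):
        own = features[lang][lang] * first_weight        # first language feature
        others = sum(f[lang] * second_weight             # second language features
                     for module, f in features.items() if module != lang)
        fused.append(own + others)
    return fused

def extract_and_fuse(target_text):
    """Steps 805-815 end to end, with a span length as a stand-in feature."""
    texts = split_by_language(target_text)
    # Steps 809-810: each module extracts one feature per language span.
    features = {module: {lang: float(len(span))
                         for lang, span in texts.items()}
                for module in texts}
    return fuse(features)
```

Here each "module" produces a scalar per language span purely so the weighting and splicing are visible; the claimed method operates on the text features extracted by the trained feature extraction modules.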
It should be understood that, although the steps in the flowcharts of fig. 1 and 3-8 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 and 3-8 may include sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a speech synthesis apparatus 900 comprising: a first acquisition module 901, an input module 902, a second acquisition module 903, a third acquisition module 904, and a fourth acquisition module 905, wherein:
The first obtaining module 901 is configured to obtain a target text to be synthesized, where the target text is composed of at least two languages.
The input module 902 is configured to input the target text into a text synthesis model, where the text synthesis model includes at least two feature extraction modules, a feature fusion module, and a speech conversion module that are in one-to-one correspondence with at least two languages.
And the second obtaining module 903 is configured to perform feature extraction processing on the target text through the at least two feature extraction modules, respectively, so as to obtain at least two text features in one-to-one correspondence with the at least two feature extraction modules.
And the third obtaining module 904 is configured to perform fusion processing on at least two text features through the feature fusion module to obtain a fused feature.
And a fourth obtaining module 905, configured to perform a voice conversion process on the fusion feature through a voice conversion module, so as to obtain a synthesized voice corresponding to the target text.
In one embodiment, the input module 902 is specifically configured to: convert the target text into a target phoneme label; input the target phoneme label into an acoustic feature recognition model to obtain an acoustic feature corresponding to the target phoneme label; and input the acoustic feature into the text synthesis model.
In one embodiment, as shown in fig. 10, the third obtaining module 904 includes: a determining unit 9041 and a fusing unit 9042, wherein:
a determining unit 9041, configured to determine, for each text feature, at least two language features in the text features, and determine weights corresponding to the language features according to the target language corresponding to the feature extraction module that has extracted the text features, where the at least two language features are in one-to-one correspondence with the at least two languages;
and a fusion unit 9042, configured to fuse at least two text features according to the language features in each text feature and weights corresponding to the language features in each text feature.
In one embodiment, the determining unit 9041 is specifically configured to: determining at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages; at least two language features are determined among the text features based on the location of the at least two language texts in the target text.
In one embodiment, the fusion unit 9042 is specifically configured to: determining a first language feature and a second language feature in at least two language features according to the target language corresponding to the feature extraction module of the extracted text feature, wherein the first language feature corresponds to the target language, and the second language feature does not correspond to the target language; and taking the first weight as the weight corresponding to the first language feature, and taking the second weight as the weight corresponding to the second language feature, wherein the first weight is larger than the second weight.
In one embodiment, the language features are matrices, and the fusion unit 9042 is specifically configured to: multiplying each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features; for each of at least two languages, adding the corrected language features corresponding to the languages to obtain candidate language features corresponding to the languages; and performing splicing processing on each candidate language feature.
In one embodiment, as shown in fig. 11, the above-mentioned voice synthesis apparatus 900 further includes: a fifth acquisition module 906, a training module 907, a sixth acquisition module 908, and a composition module 909, wherein:
a fifth obtaining module 906, configured to obtain at least two sets of training sets in one-to-one correspondence with at least two languages, where each set of training sets includes a plurality of training samples, and each training sample includes a training text and the real synthesized speech corresponding to the training text;
a training module 907, configured to train at least two single-language text synthesis models in one-to-one correspondence with the at least two languages, using the at least two sets of training sets respectively;
a sixth obtaining module 908, configured to obtain at least two feature extraction modules based on at least two single language text synthesis models;
And a composition module 909, configured to compose the obtained at least two feature extraction modules, the preset fusion module, and the preset speech conversion module into a text synthesis model.
For specific limitations of the speech synthesis apparatus, reference may be made to the above limitations of the speech synthesis method, and no further description is given here. The respective modules in the above-described speech synthesis apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech synthesis method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing speech synthesis data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech synthesis method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 13 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment of the application, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining a target text to be synthesized, wherein the target text consists of at least two languages; inputting a target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with at least two languages; performing feature extraction processing on the target text through at least two feature extraction modules respectively to obtain at least two text features corresponding to the at least two feature extraction modules one by one; at least two text features are fused through a feature fusion module, so that fusion features are obtained; and performing voice conversion processing on the fusion characteristics through a voice conversion module to obtain synthesized voice corresponding to the target text.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: converting the target text into a target phoneme label; inputting the target phoneme label into an acoustic feature recognition model to obtain an acoustic feature corresponding to the target phoneme label; and inputting the acoustic feature into the text synthesis model.
In one embodiment of the present application, the processor, when executing the computer program, further performs the steps of: for each text feature, determining at least two language features in the text feature, and determining weights corresponding to the language features according to target languages corresponding to feature extraction modules of the extracted text features, wherein the at least two language features are in one-to-one correspondence with the at least two languages; and carrying out fusion processing on at least two text features according to the language features in each text feature and the weights corresponding to the language features in each text feature.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: determining at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages; at least two language features are determined among the text features based on the location of the at least two language texts in the target text.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: determining a first language feature and a second language feature in at least two language features according to the target language corresponding to the feature extraction module of the extracted text feature, wherein the first language feature corresponds to the target language, and the second language feature does not correspond to the target language; and taking the first weight as the weight corresponding to the first language feature, and taking the second weight as the weight corresponding to the second language feature, wherein the first weight is larger than the second weight.
In one embodiment of the application, the language features are matrices, and the processor when executing the computer program further performs the steps of: multiplying each language feature by its corresponding weight to obtain a plurality of corrected language features; for each of the at least two languages, adding the corrected language features corresponding to that language to obtain a candidate language feature corresponding to that language; and performing splicing processing on each candidate language feature.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: acquiring at least two sets of training sets in one-to-one correspondence with at least two languages, wherein each set of training sets includes a plurality of training samples, and each training sample includes a training text and the real synthesized speech corresponding to the training text; training, using the at least two sets of training sets respectively, at least two single-language text synthesis models in one-to-one correspondence with the at least two languages; acquiring at least two feature extraction modules based on the at least two single-language text synthesis models; and composing the acquired at least two feature extraction modules, a preset fusion module, and a preset voice conversion module into a text synthesis model.
In one embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
obtaining a target text to be synthesized, wherein the target text consists of at least two languages; inputting a target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with at least two languages; performing feature extraction processing on the target text through at least two feature extraction modules respectively to obtain at least two text features corresponding to the at least two feature extraction modules one by one; at least two text features are fused through a feature fusion module, so that fusion features are obtained; and performing voice conversion processing on the fusion characteristics through a voice conversion module to obtain synthesized voice corresponding to the target text.
In one embodiment of the application, the computer program when executed by the processor further implements the steps of: converting the target text into a target phoneme label; inputting the target phoneme label into an acoustic feature recognition model to obtain an acoustic feature corresponding to the target phoneme label; and inputting the acoustic feature into the text synthesis model.
In one embodiment of the application, the computer program, when executed by the processor, further implements the steps of: for each text feature, determining at least two language features in the text feature, wherein the at least two language features are in one-to-one correspondence with the at least two languages, and determining a weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature; and fusing the at least two text features according to the language features in each text feature and the weights corresponding to those language features.
In one embodiment of the application, the computer program, when executed by the processor, further implements the steps of: determining at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages; and determining the at least two language features in the text feature based on the positions of the at least two language texts in the target text.
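The position-based splitting described above can be sketched as follows; the span list and the per-character feature rows are invented for illustration.

```python
# Hypothetical sketch: locate each language's spans in the target text,
# then take the text-feature rows at those positions as that language's
# language feature. Spans are (language, start, end) and are assumed known.

target_text = "hi你好ok"
language_texts = [("en", 0, 2), ("zh", 2, 4), ("en", 4, 6)]

# Toy text feature: one feature row per character of the target text.
text_feature = [[float(i)] for i in range(len(target_text))]

language_features = {}
for lang, start, end in language_texts:
    language_features.setdefault(lang, []).extend(text_feature[start:end])

print({k: len(v) for k, v in language_features.items()})   # {'en': 4, 'zh': 2}
```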
In one embodiment of the application, the computer program, when executed by the processor, further implements the steps of: determining a first language feature and a second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature, wherein the first language feature corresponds to the target language and the second language feature does not correspond to the target language; and taking a first weight as the weight corresponding to the first language feature and a second weight as the weight corresponding to the second language feature, wherein the first weight is larger than the second weight.
In one embodiment of the application, the language features are matrices, and the computer program, when executed by the processor, further performs the steps of: multiplying each language feature by the weight corresponding to that language feature to obtain a plurality of corrected language features; for each of the at least two languages, adding the corrected language features corresponding to the language to obtain a candidate language feature corresponding to the language; and splicing the candidate language features together.
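The weighted fusion of the two preceding embodiments can be sketched with small matrices. The weight values, matrix sizes and random data below are illustrative assumptions; only the structure (larger weight for the extractor's own language, per-language summation, then splicing) follows the description.

```python
import numpy as np

# Hypothetical sketch of the weighted fusion: each extractor's text
# feature is split into per-language sub-features (matrices); the
# sub-feature matching the extractor's own target language gets the
# larger first weight, the others get the smaller second weight;
# corrected sub-features of the same language are summed across
# extractors, and the per-language sums are spliced (concatenated).

W1, W2 = 0.8, 0.2            # first weight > second weight (values assumed)
L = 2                        # number of languages = number of extractors

rng = np.random.default_rng(0)
# text_feats[i][j]: language-j sub-feature inside extractor i's output
text_feats = [[rng.standard_normal((3, 4)) for _ in range(L)] for _ in range(L)]

candidates = []
for lang in range(L):
    total = np.zeros((3, 4))
    for ext in range(L):
        weight = W1 if ext == lang else W2     # own language weighted higher
        total += weight * text_feats[ext][lang]  # corrected language feature
    candidates.append(total)                   # candidate language feature

fused = np.concatenate(candidates, axis=1)     # splice the candidates
print(fused.shape)                             # (3, 8)
```

Concatenating along the feature axis preserves each language's candidate feature separately in the fused result, which matches the "splicing" step of the embodiment.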
In one embodiment of the application, the computer program, when executed by the processor, further implements the steps of: acquiring at least two groups of training sets in one-to-one correspondence with the at least two languages, wherein each group of training sets comprises a plurality of training samples, and each training sample comprises a training text and a real synthesized voice corresponding to the training text; respectively training, by utilizing the at least two training sets, at least two single-language text synthesis models in one-to-one correspondence with the at least two languages; acquiring the at least two feature extraction modules based on the at least two single-language text synthesis models; and forming the text synthesis model from the obtained at least two feature extraction modules, a preset fusion module and a preset voice conversion module.
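The training scheme above can be sketched as follows. Everything here is an illustrative stand-in: `SingleLanguageModel` does no real training, and the "extractor" it yields is a placeholder for the feature extraction module taken from a trained single-language model.

```python
# Hypothetical sketch: train one single-language model per language,
# keep each model's feature extraction part, then assemble the kept
# extractors with preset fusion and voice-conversion modules.

class SingleLanguageModel:
    def __init__(self, language):
        self.language = language

    def train(self, samples):
        # samples: (training_text, real_synthesized_voice) pairs.
        # Toy "training": record the character set seen for this language.
        vocab = sorted({c for text, _ in samples for c in text})
        return {"language": self.language, "vocab": vocab}   # the extractor

def build_text_synthesis_model(extractors):
    return {"extractors": extractors,
            "fusion": "preset fusion module",
            "conversion": "preset voice conversion module"}

training_sets = {                       # one training set per language
    "en": [("hello", b"..."), ("world", b"...")],
    "zh": [("你好", b"..."), ("世界", b"...")],
}
extractors = [SingleLanguageModel(lang).train(samples)
              for lang, samples in training_sets.items()]
model = build_text_synthesis_model(extractors)
print(len(model["extractors"]))   # 2
```

The design point the embodiment relies on is that each extractor only ever needs single-language data to train, while the assembled model handles mixed-language text.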
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may comprise the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail but are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the spirit of the application, and all such variations and modifications fall within the scope of the application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.