CN113539233A

Movatterモバイル変換

Info

Publication number: CN113539233A
Application number: CN202010301719.4A
Authority: CN
Inventors: 李栋梁; 刘恺; 周明; 陈伟
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2021-10-22
Anticipated expiration: 2040-04-16
Also published as: CN113539233B; WO2021208531A1

Abstract

The embodiment of the invention provides a voice processing method, a voice processing device and electronic equipment, wherein the method comprises the following steps: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of target user pronunciation, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1; and then, under the condition that only the target user has the voice data of a single language, the multilingual text is converted into the target voice data of the target user of the corresponding language, so that the multilingual voice conversion is realized.

Description

Voice processing method and device and electronic equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a voice processing method and apparatus, and an electronic device.

Background

With the development of speech processing technology, speech conversion technology is widely used. For example, the field of input methods adopts voice conversion technology to realize sound-changing input; for another example, the instant messaging software applies the voice conversion technology to realize the voice change of the video call or the voice call; and so on.

The voice conversion technology refers to a technology of converting a voice of one person (source user) into a voice of another person (target user). In the prior art, voice data of a target user is generally collected, and a model is trained by adopting the voice data of the target user; in the subsequent application process, after the voice data of the source user is obtained, the trained model is adopted to carry out voice conversion on the voice data of the source user, and the voice data of the target user is obtained.

However, if the data of the training model is the single-language voice data of the target user, the prior art can only convert the voice data of the source user using the language pronunciation into the voice data of the target user using the language pronunciation; the method can not convert the voice data of the source user adopting the pronunciation of other languages into the voice data of the target user adopting the pronunciation of other languages. For example, if the data of the training model is the Chinese voice data of the target user, the voice conversion can be only performed on the voice data of the source user adopting Chinese pronunciation, so as to obtain the Chinese pronunciation voice data of the target user; and the voice data of the source user adopting English pronunciation cannot be converted into the voice data of the target user adopting English pronunciation.

Disclosure of Invention

The embodiment of the invention provides a voice processing method, which is used for realizing multi-language voice conversion under the condition that only voice data of a target user in a single language exists.

Correspondingly, the embodiment of the invention also provides a voice processing device and electronic equipment, which are used for ensuring the realization and application of the method.

In order to solve the above problem, an embodiment of the present invention discloses a speech processing method, which specifically includes: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

Optionally, the acquiring the text information to be converted includes: acquiring source speech data of a source user, wherein the source user and a target user are the same user or different users; and performing voice recognition on the source voice data, and determining corresponding text information to be converted.

Optionally, the performing speech recognition on the source speech data and determining corresponding text information to be converted includes: inputting the source voice data into N voice recognizers respectively to obtain corresponding N voice recognition results, wherein one voice recognizer corresponds to one language; and splicing the N voice recognition results to obtain text information to be converted.

Optionally, the performing speech recognition on the source speech data and determining corresponding text information to be converted includes: inputting the source speech data into a speech recognizer to obtain a corresponding speech recognition result, wherein the speech recognizer corresponds to N languages; and determining the voice recognition result as text information to be converted.

Optionally, the converting the text information into target speech data that is pronounced by the target user in the source language according to the text information and the target conversion model corresponding to the target user includes: converting the text information by adopting the target conversion model, and outputting the acoustic characteristics of the target user pronunciation by adopting the source language; and synthesizing the acoustic features by adopting a synthesizer to obtain target voice data of the target user adopting the source language for pronunciation.

Optionally, the converting the text information by using the target conversion model, and outputting the acoustic feature of the target user pronouncing the text information by using the source language includes: inputting the text information, language identification corresponding to the source language and user identification corresponding to the target user into the target conversion model; the target conversion model searches for target model parameters matched with the language identification and the user identification; and the target conversion model converts the text information by adopting the target model parameters and outputs the acoustic characteristics of the pronunciation of the target user by adopting the source language.

Optionally, the method further includes the step of training the generic conversion model: collecting X pieces of first voice training data of M users, wherein one piece of first voice training data corresponds to one language, and X pieces of first voice training data correspond to N languages; respectively extracting reference acoustic features of each piece of first voice training data, and respectively labeling corresponding user identification and language identification for each piece of first voice training data and the corresponding reference acoustic features; for each piece of first voice training data, recognizing text information corresponding to the first voice training data; and training the universal conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the first voice training data.

Optionally, the method further includes the step of performing adaptive training on the trained general conversion model according to the monolingual speech data of the target user to generate a target conversion model: acquiring Y pieces of second voice training data of the target user, wherein the languages corresponding to the Y pieces of second voice training data are the same; respectively extracting reference acoustic features of each piece of second voice training data, and respectively labeling a user identifier and a language identifier of the target user for each piece of second voice training data and the corresponding reference acoustic features; for each piece of second voice training data, identifying text information corresponding to the second voice training data; and carrying out self-adaptive training on the trained general conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the second training voice data to obtain a target conversion model.

The embodiment of the invention also discloses a voice processing device, which specifically comprises: the acquisition module is used for acquiring text information to be converted; the information determining module is used for determining a source language type corresponding to the text information and a target user to be converted; the voice conversion module is used for converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

Optionally, the obtaining module includes: the voice acquisition submodule is used for acquiring source voice data of a source user, wherein the source user and a target user are the same user or different users; and the recognition submodule is used for performing voice recognition on the source voice data and determining corresponding text information to be converted.

Optionally, the identification submodule includes: the first voice recognition unit is used for respectively inputting the source voice data into N voice recognizers to obtain corresponding N voice recognition results, wherein one voice recognizer corresponds to one language; and splicing the N voice recognition results to obtain text information to be converted.

Optionally, the identification submodule includes: the second voice recognition unit is used for inputting the source voice data into a voice recognizer to obtain a corresponding voice recognition result, wherein the voice recognizer corresponds to N languages; and determining the voice recognition result as text information to be converted.

Optionally, the voice conversion module includes: the feature generation submodule is used for converting the text information by adopting the target conversion model and outputting the acoustic features of the target user pronunciation by adopting the source language; and the voice synthesis submodule is used for synthesizing the acoustic features by adopting a synthesizer to obtain target voice data of the target user adopting the source language to pronounce.

Optionally, the feature generation sub-module is configured to input the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model; the target conversion model searches for target model parameters matched with the language identification and the user identification; and the target conversion model converts the text information by adopting the target model parameters and outputs the acoustic characteristics of the pronunciation of the target user by adopting the source language.

Optionally, the method further comprises: the first training module is used for training the general conversion model; the first training module is specifically configured to collect X pieces of first speech training data of M users, where one piece of first speech training data corresponds to one language and X pieces of first speech training data correspond to N languages; respectively extracting reference acoustic features of each piece of first voice training data, and respectively labeling corresponding user identification and language identification for each piece of first voice training data and the corresponding reference acoustic features; for each piece of first voice training data, recognizing text information corresponding to the first voice training data; and training the universal conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the first voice training data.

Optionally, the method further comprises: the second training module is used for carrying out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user to generate a target conversion model; the second training module is specifically configured to acquire Y pieces of second speech training data of the target user, where the languages corresponding to the Y pieces of second speech training data are the same; respectively extracting reference acoustic features of each piece of second voice training data, and respectively labeling a user identifier and a language identifier of the target user for each piece of second voice training data and the corresponding reference acoustic features; for each piece of second voice training data, identifying text information corresponding to the second voice training data; and carrying out self-adaptive training on the trained general conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the second training voice data to obtain a target conversion model.

The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the voice processing method according to any one of the embodiments of the invention.

An embodiment of the present invention also discloses an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, and the one or more programs include instructions for: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

Optionally, the method further includes the following instructions for adaptively training the trained generic conversion model according to the monolingual speech data of the target user to generate a target conversion model: acquiring Y pieces of second voice training data of the target user, wherein the languages corresponding to the Y pieces of second voice training data are the same; respectively extracting reference acoustic features of each piece of second voice training data, and respectively labeling a user identifier and a language identifier of the target user for each piece of second voice training data and the corresponding reference acoustic features; for each piece of second voice training data, identifying text information corresponding to the second voice training data; and carrying out self-adaptive training on the trained general conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the second training voice data to obtain a target conversion model.

The embodiment of the invention has the following advantages:

in the embodiment of the invention, the text information to be converted can be obtained, and the source language corresponding to the text information and the target user to be converted are determined; then according to the text information and a target conversion model corresponding to the target user, converting the text information into target voice data which is generated by the target user through source language pronunciation; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of target user pronunciation, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1; and then, under the condition that only the target user has the voice data of a single language, the multilingual text is converted into the target voice data of the target user of the corresponding language, so that the multilingual voice conversion is realized.

Drawings

FIG. 1 is a flow chart of the steps of one embodiment of a speech processing method of the present invention;

FIG. 2 is a flow chart of the steps of one embodiment of a model training method of the present invention;

FIG. 3 is a flow chart of the steps of one embodiment of a method of model adaptive training of the present invention;

FIG. 4 is a flow chart of the steps of an alternative embodiment of a speech processing method of the present invention;

FIG. 5a is a process diagram of a speech processing method of the present invention;

FIG. 5b is a process diagram of another speech processing method of the present invention;

FIG. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present invention;

FIG. 7 is a block diagram of an alternative embodiment of a speech processing apparatus of the present invention;

FIG. 8 illustrates a block diagram of an electronic device for speech processing, according to an exemplary embodiment;

fig. 9 is a schematic structural diagram of an electronic device for speech processing according to another exemplary embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to fig. 1, a flowchart illustrating steps of an embodiment of a speech processing method according to the present invention is shown, which may specifically include the following steps:

and 102, acquiring text information to be converted.

In the embodiment of the present invention, the text information to be converted into the voice data may be acquired, and then, the text information may be subjected to voice conversion with reference tosteps 104 to 106.

Andstep 104, determining a source language corresponding to the text information and a target user to be converted.

In the embodiment of the invention, after the text information to be converted is obtained, the source language corresponding to the text information and the target user to be converted can be determined, so that the text information can be subsequently determined to be converted into the voice data which is used by the user and pronounces the voice data in which language.

Step 106, converting the text information into target voice data which is generated by a target user through source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

In the embodiment of the invention, the general conversion model can be trained in advance according to the voice data containing N languages, so as to obtain the trained general conversion model. And then carrying out self-adaptive training on the trained general conversion model according to the monolingual voice data of the pronunciation of the target user to obtain the target conversion model corresponding to the target user. The training and adaptive training processes of the model are described in the following.

Then, converting the text information by adopting a target conversion model corresponding to the target user to obtain a corresponding conversion result; and then, according to the conversion result, converting the text information into voice data (subsequently called target voice data) which is generated by a target user through pronunciation in a source language. Wherein, N is a positive integer greater than 1, and the source language corresponding to the source speech data is one of the N languages; therefore, under the condition that only the target user has the voice data of a single language, the multilingual text information can be converted into the target voice data of the target user of the corresponding language.

The source language may be the same as or different from the language corresponding to the voice data of the target user for performing the adaptive training of the trained general conversion model, which is not limited in this embodiment of the present invention.

In summary, in the embodiment of the present invention, text information to be converted may be obtained, and a source language corresponding to the text information and a target user to be converted are determined; then converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of target user pronunciation, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1; and then, under the condition that only the target user has the voice data of a single language, the multi-language text information can be converted into the target voice data of the target user of the corresponding language, so that multi-language voice conversion is realized.

How to train the generic speech synthesis model is explained below.

Referring to fig. 2, a flowchart illustrating steps of an embodiment of a model training method according to the present invention is shown, which may specifically include the following steps:

step 202, collecting X pieces of first voice training data of M users, where one piece of first voice training data corresponds to one language and X pieces of first voice training data correspond to N languages.

In the embodiment of the invention, M, X and N are positive integers, and M, X and N can be set as required; for example, M is 20, X is 1000, and N is 5 (e.g., 5 languages such as Chinese, English, Japanese, Korean, and Russian); the embodiments of the present invention are not limited in this regard.

After M, X and N are determined, x (i) pieces of speech data may be collected for each of the M users; wherein, the value of i is 1-M, and the sum of the number of the voice data collected by M users is X. The requirement for collecting the X pieces of voice data may be that each piece of voice data corresponds to one language, and the X pieces of voice data cover N languages. Each piece of voice data in the X pieces of voice data may then be determined as first voice training data, thereby obtaining X pieces of first voice training data.

And 204, respectively extracting reference acoustic features corresponding to the first voice training data, and respectively labeling corresponding user identifications and language identifications for the first voice training data and the corresponding reference acoustic features.

In the embodiment of the present invention, the output of the general conversion model is an acoustic feature, and the acoustic feature refers to a feature that can be used for synthesizing voice data. In order to train the general conversion model, corresponding acoustic features may be extracted from each piece of the first speech training data, and the extracted acoustic features may be used as reference acoustic features, so that the general conversion model may be reversely trained by comparing the reference acoustic features with the acoustic features output by the general conversion model.

Because the acoustic characteristics of different users using the same language for pronunciation are different for the same text information, and the acoustic characteristics of the same user using different languages for pronunciation are also different; in order to train the general conversion model to learn the acoustic features of different users using different languages for pronunciation, corresponding user identifiers can be respectively allocated to each user in advance, and corresponding language identifiers can be allocated to each language; the user identification is used for uniquely identifying one user, and the language identification is used for uniquely identifying one language. Then respectively labeling a user identification corresponding to the user and a language identification corresponding to the language for each piece of first voice training data and corresponding reference acoustic characteristics; then, the general conversion model is trained by using the X pieces of first speech training data labeled with the user identifier and the language identifier and the corresponding reference acoustic features, which can refer tosteps 206 to 208.

Step 206, for each piece of first voice training data, identifying text information corresponding to the first voice training data.

In the embodiment of the invention, each piece of first voice training data can be subjected to voice recognition to determine corresponding text information; and then training the universal conversion model according to the text information of each piece of first voice data.

In one example of the present invention, one way to recognize the text information corresponding to the first speech training data may refer to substeps 22-24 as follows; the following description will take a piece of first speech training data as an example.

Substep 22, inputting the first voice training data to N voice recognizers respectively to obtain corresponding N voice recognition results, wherein one voice recognizer corresponds to one language;

and a substep 24 of splicing the N voice recognition results to obtain corresponding text information.

In an embodiment of the present invention, the first speech training data may be respectively input into N speech recognizers, where each speech recognizer in the N speech recognizers is a speech recognizer of a language. Then each voice recognizer carries out voice recognition on the first voice training data and outputs a corresponding voice recognition result; the voice recognition result may be text coding information or text itself. When the voice recognition result is text coding information, splicing the text coding information output by each voice recognizer according to a preset sequence to obtain text information corresponding to the first voice training data; the preset sequence may be set as required, and the embodiment of the present invention is not limited thereto. When the speech recognition result is a text itself, the speech recognition results output by the speech recognizers can be respectively encoded (e.g., one-hot encoding, conversion into word vectors, etc.); then, the coded voice recognition results are spliced.

In one example of the present invention, still another way of identifying text information corresponding to the first speech training data may refer to sub-steps 42-44 as follows; the following description will take a piece of first speech training data as an example.

And a substep 42 of inputting the first voice training data into a voice recognizer to obtain a corresponding voice recognition result, wherein the voice recognizer corresponds to the N languages.

Substep 44, determining the speech recognition result as text information.

In the embodiment of the present invention, a speech recognizer capable of recognizing N kinds of speech may also be used to perform speech recognition on the first speech training data; namely, the first voice training data is input to the voice recognizer capable of recognizing the N languages, and a corresponding voice recognition result is obtained. The speech recognition result may be text code information, and the dimension of the text code information and the meaning corresponding to each dimension are different from each other.

In the embodiment of the invention, no matter the first voice training data is input to the N voice recognizers, text information is obtained by splicing the voice recognition results of the voice recognizers, or the text information is obtained by inputting the first voice training data to one voice recognizer; all contain the association between languages. And then, after the universal conversion model is trained by adopting the text information subsequently, the trained universal conversion model can learn the association among all languages, so that after the trained universal conversion model is adaptively trained by adopting the monolingual voice data of the target user to obtain the target conversion model, the target conversion model can realize multilingual voice conversion.

And 208, training the universal conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the first voice training data.

Now, how to train the generic transformation model will be described by taking a first speech training data as an example. In the embodiment of the present invention, the text information, the user identifier, and the language identifier corresponding to the first voice training data may be input into the general conversion model; and performing forward calculation on the text information by the general conversion model, and outputting a predicted acoustic feature corresponding to the first voice training data. In the process of forward calculation of the text information by the general conversion model, the model parameters may be associated with both the user identifier and the language identifier of the first speech training data. And then comparing the predicted acoustic features with reference acoustic features corresponding to the first voice training data, and adjusting model parameters of the universal conversion model corresponding to the user identification and the language identification of the first voice training data. The general conversion model can be continuously trained by adopting X pieces of first voice training data until the end condition is met; and then the trained general conversion model can be obtained.

In one embodiment of the invention, after the trained universal conversion model is obtained, the trained universal conversion model can be adaptively trained by adopting the monolingual voice data of the target user to obtain the target conversion model capable of predicting the multilingual acoustic characteristics of the target user; the following may be used:

referring to FIG. 3, a flow chart of steps of an embodiment of a model adaptive training method of the present invention is shown.

Step 302, obtaining Y pieces of second voice training data of the target user, where the languages corresponding to the Y pieces of second voice training data are the same.

In the embodiment of the present invention, Y is a positive integer, which may be specifically set as required, and the embodiment of the present invention is not limited thereto. After Y is determined, Y pieces of voice data can be selected from voice data of the target user with the same language pronunciation as second voice training data; and then, the trained general conversion model is adaptively trained by using the Y pieces of second voice training data, which can refer to steps 304 to 308.

And step 304, respectively extracting reference acoustic features of each piece of second voice training data, and respectively labeling the user identification and the language identification of the target user for each piece of second voice training data and the corresponding reference acoustic features.

And step 306, for each piece of second voice training data, performing voice recognition on the second voice training data to determine corresponding text information.

And 308, performing adaptive training on the trained general conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the second training voice data to obtain a target conversion model.

The steps 304-308 are similar to the steps 204-208, and are not described herein again.

How to convert the text information into the target voice data will be described below.

Referring to fig. 4, a flowchart illustrating steps of an alternative embodiment of the speech processing method of the present invention is shown, which may specifically include the following steps:

step 402, obtaining source speech data of a source user, wherein the source user and a target user are the same user or different users.

And step 404, performing voice recognition on the source voice data, and determining corresponding text information to be converted.

In the embodiment of the present invention, one way to obtain text information to be converted may be to obtain source speech data of a source user; and then, performing voice recognition on the source voice data to determine corresponding text information to be converted. The source user and the target user may be the same user or different users, which is not limited in this embodiment of the present invention.

In the embodiment of the invention, the source speech data is subjected to speech recognition, and the modes corresponding to the text information to be converted are determined to comprise multiple modes; in one example, a manner of performing speech recognition on the source speech data and determining the corresponding text information to be converted may refer to the following sub-steps:

substep 62, inputting the source speech data into N speech recognizers respectively to obtain corresponding N speech recognition results, wherein one speech recognizer corresponds to one language;

and a substep 64 of splicing the N voice recognition results to obtain text information to be converted.

The present substeps 62-64 are similar to substeps 22-24 described above and will not be described in detail herein.

In another example of the present invention, another way of performing speech recognition on the source speech data to determine the corresponding text information to be converted may include the following sub-steps:

and a substep 82, inputting the source speech data into a speech recognizer to obtain a corresponding speech recognition result, wherein the speech recognizer corresponds to N languages.

Substep 84, determining the speech recognition result as the text information to be converted.

This substep 82-substep 84 is similar to substep 42-substep 44 described above and will not be described in detail herein.

Of course, the user can also directly input the text information which needs to be converted into the voice data; further, the embodiment of the invention can acquire the text information input by the user and determine the text information input by the user as the text information to be converted.

And step 406, determining a source language corresponding to the text information and a target user to be converted.

In one example of the present invention, when a source user inputs source speech data (or input text information), the source user may configure the language corresponding to the source speech data (or input text information) and a target user to be converted. Therefore, after the source speech data is obtained, the configuration information of the source speech data can be obtained, and the source language type corresponding to the text information to be converted and the target user to be converted are determined according to the configuration information.

In another example of the present invention, when a source user is inputting source speech data (or inputting text information), a language corresponding to the source speech data (or inputting text information) is not configured. At this time, one way of determining the source language corresponding to the text information to be converted may be to directly perform language identification on the text information to be converted, and determine the source language corresponding to the text information to be converted.

In the embodiment of the present invention, since the target conversion model outputs the acoustic features, after the text information corresponding to the source speech data is converted into the acoustic features corresponding to the target user by using the target conversion model, the acoustic features may be synthesized into the target speech data by using a synthesizer. In step 106, the text information is converted into target speech data that is pronounced by the target user in the source language according to the text information and the target conversion model corresponding to the target user, which may refer tosteps 408 to 410.

And 408, converting the text information by adopting the target conversion model, and outputting the acoustic characteristics of the pronunciation of the target user by adopting the source language.

In the embodiment of the invention, the obtained target conversion model can be used for voice conversion, and acoustic characteristics of the target user pronunciation adopting the source language are output; reference may be made to the following substeps:

and a substep S2 of inputting the text information, the language identification corresponding to the source language and the user identification corresponding to the target user into the target conversion model.

And a substep S4, searching target model parameters matched with the language identifier and the user identifier by the target conversion model.

And a substep S6, converting the text information by the target conversion model by using the target model parameters, and outputting the acoustic characteristics of the pronunciation of the target user in the source language.

In the embodiment of the invention, the language identification corresponding to the source language and the user identification corresponding to the target user can be determined; and then inputting the text information, the language identification corresponding to the source language and the user identification corresponding to the target user into the target conversion model. The target conversion model can search for target model parameters which are matched with language identification corresponding to the source language and user identification corresponding to the target user; and converting the text information by adopting the target model parameters, and outputting the acoustic characteristics of the pronunciation of the target user by adopting the source language.

And step 410, synthesizing the acoustic features by adopting a synthesizer to obtain target voice data of the target user adopting the source language for pronunciation.

As an example of the present invention, referring to fig. 5a, a process diagram of a speech processing method of the present invention is shown. FIG. 5a illustrates that N speech recognition results are obtained by inputting the source speech data to N speech recognizers, respectively; and splicing the N voice recognition results to realize recognition of the text information corresponding to the source voice data.

As another example of the present invention, reference may be made to fig. 5b, which shows a process diagram of another speech processing method according to the present invention. Wherein, FIG. 5b is implemented by inputting the source speech data into a speech recognizer to recognize text information corresponding to the source speech data.

In one application of the embodiment of the invention, source speech data input by Zhang III can be obtained, speech recognition is carried out on the source speech data input by Zhang III, and the text information K corresponding to the text information to be converted is determined. Then, the language corresponding to the text information K to be converted can be determined as language a, and lie four of the target user. Then, the target conversion model is adopted to convert the text information K, and acoustic characteristics of the source language for pronouncing the text information K are output; and then, synthesizing the acoustic features of the Liqu by adopting a synthesizer to obtain target voice data of the Liqu pronouncing text information K by adopting the language A. And further, the voice data input by Zhang three (source user) in the language A is converted into the target voice data aiming at the universal pronunciation by Li four (target user) in the language A.

In summary, in the embodiment of the present invention, source speech data of a source user may be obtained, speech recognition is performed on the source speech data, and text information to be converted is determined; determining a source language type corresponding to the source speech data and a target user to be converted; then, converting the text information by adopting the target conversion model, and outputting acoustic characteristics of pronouncing the text information by the target user by adopting the source language; synthesizing the acoustic characteristic information by adopting a synthesizer to obtain target voice data of the target user adopting the source language for pronunciation; and then, under the condition that only the target user has the voice data of a single language, the multilingual source voice data can be converted into the target voice data of the target user of the corresponding language, so that multilingual voice conversion is realized.

Secondly, in the embodiment of the present invention, in the process of identifying the text information corresponding to the source speech data, the source speech data may be respectively input to N speech recognizers to obtain N corresponding speech recognition results, wherein one speech recognizer corresponds to one language; then, the N voice recognition results are spliced to obtain corresponding text information; or inputting the source speech data into a speech recognizer to obtain a corresponding speech recognition result, wherein the speech recognizer corresponds to N languages; and determining the voice recognition result as text information. And further, the accuracy of the determined text information can be improved, so that the accuracy of multi-language conversion is further improved.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Referring to fig. 6, a block diagram of a speech processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:

an obtaining module 602, configured to obtain text information to be converted;

an information determining module 604, configured to determine a source language type corresponding to the text information and a target user to be converted;

the voice conversion module 606 is configured to convert the text information into target voice data, which is generated by a target user through source language pronunciation, according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

Referring to fig. 7, a block diagram of an alternative embodiment of a speech processing apparatus of the present invention is shown.

In an optional embodiment of the present invention, the obtaining module 602 includes:

the voice obtaining submodule 6022 is configured to obtain source voice data of a source user, where the source user and a target user are the same user or different users;

and the recognition submodule 6024 is configured to perform speech recognition on the source speech data and determine corresponding text information to be converted.

In an alternative embodiment of the present invention, the identifier sub-module 6024 comprises:

a firstspeech recognition unit 60242, configured to input the source speech data to N speech recognizers respectively, to obtain N corresponding speech recognition results, where one speech recognizer corresponds to one language; and splicing the N voice recognition results to obtain text information to be converted.

a secondspeech recognition unit 60244, configured to input the source speech data into a speech recognizer to obtain a corresponding speech recognition result, where the speech recognizer corresponds to N languages; and determining the voice recognition result as text information to be converted.

In an optional embodiment of the present invention, the voice conversion module 606 includes:

a feature generation submodule 6062, configured to convert the text information using the target conversion model, and output an acoustic feature of the target user using the source language pronunciation;

and a speech synthesis submodule 6064, configured to synthesize the acoustic features by using a synthesizer, so as to obtain target speech data that is pronounced by the target user in the source language.

In an optional embodiment of the present invention, the feature generation sub-module 6062 is configured to input the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model; the target conversion model searches for target model parameters matched with the language identification and the user identification; and the target conversion model converts the text information by adopting the target model parameters and outputs the acoustic characteristics of the pronunciation of the target user by adopting the source language.

In an optional embodiment of the present invention, the method further comprises:

a first training module 608, configured to train the generic conversion model; the first training module is specifically configured to collect X pieces of first speech training data of M users, where one piece of first speech training data corresponds to one language and X pieces of first speech training data correspond to N languages; respectively extracting reference acoustic features of each piece of first voice training data, and respectively labeling corresponding user identification and language identification for each piece of first voice training data and the corresponding reference acoustic features; for each piece of first voice training data, recognizing text information corresponding to the first voice training data; and training the universal conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the first voice training data.

the second training module 610 is configured to perform adaptive training on the trained general conversion model according to the monolingual speech data of the target user, and generate a target conversion model; the second training module is specifically configured to acquire Y pieces of second speech training data of the target user, where the languages corresponding to the Y pieces of second speech training data are the same; respectively extracting reference acoustic features of each piece of second voice training data, and respectively labeling a user identifier and a language identifier of the target user for each piece of second voice training data and the corresponding reference acoustic features; for each piece of second voice training data, identifying text information corresponding to the second voice training data; and carrying out self-adaptive training on the trained general conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the second training voice data to obtain a target conversion model.

In summary, in the embodiment of the present invention, text information to be converted may be obtained, and a source language corresponding to the text information and a target user to be converted are determined; then according to the text information and a target conversion model corresponding to the target user, converting the text information into target voice data which is generated by the target user through source language pronunciation; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of target user pronunciation, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1; and then, under the condition that only the target user has the voice data of a single language, the multilingual text is converted into the target voice data of the target user of the corresponding language, so that the multilingual voice conversion is realized.

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

FIG. 8 is a block diagram illustrating a structure of anelectronic device 800 for speech processing according to an example embodiment. For example, theelectronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 8,electronic device 800 may include one or more of the following components: aprocessing component 802, amemory 804, apower component 806, amultimedia component 808, an audio component 810, an input/output (I/O)interface 812, asensor component 814, and acommunication component 816.

Theprocessing component 802 generally controls overall operation of theelectronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Theprocessing elements 802 may include one ormore processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, theprocessing component 802 can include one or more modules that facilitate interaction between theprocessing component 802 and other components. For example, theprocessing component 802 can include a multimedia module to facilitate interaction between themultimedia component 808 and theprocessing component 802.

Thememory 804 is configured to store various types of data to support operation at thedevice 800. Examples of such data include instructions for any application or method operating on theelectronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. Thememory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Thepower components 806 provide power to the various components of theelectronic device 800.Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power forelectronic device 800.

Themultimedia component 808 includes a screen that provides an output interface between theelectronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, themultimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when theelectronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when theelectronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in thememory 804 or transmitted via thecommunication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between theprocessing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

Thesensor assembly 814 includes one or more sensors for providing various aspects of state assessment for theelectronic device 800. For example, thesensor assembly 814 may detect an open/closed state of thedevice 800, the relative positioning of components, such as a display and keypad of theelectronic device 800, thesensor assembly 814 may also detect a change in the position of theelectronic device 800 or a component of theelectronic device 800, the presence or absence of user contact with theelectronic device 800, orientation or acceleration/deceleration of theelectronic device 800, and a change in the temperature of theelectronic device 800.Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. Thesensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, thesensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Thecommunication component 816 is configured to facilitate wired or wireless communication between theelectronic device 800 and other devices. Theelectronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, thecommunication component 814 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, thecommunications component 814 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, theelectronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as thememory 804 comprising instructions, executable by theprocessor 820 of theelectronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech processing, the method comprising: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

Fig. 9 is a schematic structural diagram of anelectronic device 900 for speech processing according to another exemplary embodiment of the present invention. Theelectronic device 900 may be a server, which may vary widely depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors) andmemory 932, one or more storage media 930 (e.g., one or more mass storage devices) storingapplications 942 ordata 944.Memory 932 andstorage media 930 can be, among other things, transient storage or persistent storage. The program stored on thestorage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, thecentral processor 922 may be arranged to communicate with thestorage medium 930 to execute a series of instruction operations in thestorage medium 930 on the server.

The server may also include one ormore power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, one or more keyboards 956, and/or one ormore operating systems 941, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.

An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors the one or more programs including instructions for: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The foregoing describes a speech processing method, a speech processing apparatus and an electronic device in detail, and specific examples are applied herein to explain the principles and embodiments of the present invention, and the descriptions of the foregoing examples are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of speech processing, comprising:

acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted;

converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user;

the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

2. The method of claim 1, wherein the obtaining the text information to be converted comprises:

acquiring source speech data of a source user, wherein the source user and a target user are the same user or different users;

and performing voice recognition on the source voice data, and determining corresponding text information to be converted.

3. The method of claim 2, wherein the performing speech recognition on the source speech data and determining corresponding text information to be converted comprises:

inputting the source voice data into N voice recognizers respectively to obtain corresponding N voice recognition results, wherein one voice recognizer corresponds to one language;

and splicing the N voice recognition results to obtain text information to be converted.

4. The method of claim 2, wherein the performing speech recognition on the source speech data and determining corresponding text information to be converted comprises:

inputting the source speech data into a speech recognizer to obtain a corresponding speech recognition result, wherein the speech recognizer corresponds to N languages;

and determining the voice recognition result as text information to be converted.

5. The method according to claim 1, wherein said converting the text information into target speech data that is pronounced by a target user in a source language according to the text information and a target conversion model corresponding to the target user comprises:

converting the text information by adopting the target conversion model, and outputting the acoustic characteristics of the target user pronunciation by adopting the source language;

and synthesizing the acoustic features by adopting a synthesizer to obtain target voice data of the target user adopting the source language for pronunciation.

6. The method according to claim 5, wherein said converting the text information using the target conversion model and outputting the acoustic feature of the target user speaking the text information in the source language comprises:

inputting the text information, language identification corresponding to the source language and user identification corresponding to the target user into the target conversion model;

the target conversion model searches for target model parameters matched with the language identification and the user identification;

and the target conversion model converts the text information by adopting the target model parameters and outputs the acoustic characteristics of the pronunciation of the target user by adopting the source language.

7. The method of claim 1, further comprising the step of training the generic conversion model:

collecting X pieces of first voice training data of M users, wherein one piece of first voice training data corresponds to one language, and X pieces of first voice training data correspond to N languages;

respectively extracting reference acoustic features of each piece of first voice training data, and respectively labeling corresponding user identification and language identification for each piece of first voice training data and the corresponding reference acoustic features;

for each piece of first voice training data, recognizing text information corresponding to the first voice training data;

and training the universal conversion model according to the text information, the reference acoustic feature, the user identification and the language identification corresponding to the first voice training data.

8. A speech processing apparatus, comprising:

the acquisition module is used for acquiring text information to be converted;

the information determining module is used for determining a source language type corresponding to the text information and a target user to be converted;

the voice conversion module is used for converting the text information into target voice data of a target user adopting source language pronunciation according to the text information and a target conversion model corresponding to the target user; the target conversion model carries out self-adaptive training on the trained general conversion model according to the monolingual voice data of the target user, and the general conversion model carries out training according to the voice data containing N languages; the source language is one of the N languages, and N is an integer greater than 1.

9. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors the one or more programs including instructions for:

10. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech processing method of any of method claims 1-7.