CN112331177B - Prosody-based speech synthesis method, model training method and related equipment - Google Patents

Prosody-based speech synthesis method, model training method and related equipment

Info

Publication number
CN112331177B
Authority
CN
China
Prior art keywords
text
prosodic
prosody
layer
term memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011224950.4A
Other languages
Chinese (zh)
Other versions
CN112331177A (en)
Inventor
周明康
罗超
陈子浩
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202011224950.4A
Publication of CN112331177A
Application granted
Publication of CN112331177B
Legal status: Active (Current)
Anticipated expiration

Abstract

The invention relates to the technical field of speech processing, and provides a prosody-based speech synthesis method, a model training method and related equipment. The prosody-based speech synthesis method includes: preprocessing a text to be synthesized to obtain a word segmentation text; obtaining a prosodic text containing prosodic information through a prosodic model according to the word segmentation text; converting the prosodic text into a vector containing prosodic features; obtaining mel-spectrum features through an acoustic model according to the vector containing prosodic features; and synthesizing target audio according to the mel-spectrum features. By performing speech synthesis according to the prosodic features of the text, the method and the device generate audio data with natural pronunciation, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.

Description

Prosody-based speech synthesis method, model training method and related equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a prosody-based voice synthesis method, a model training method and related equipment.
Background
With the continuous development of deep learning technology, speech synthesis based on neural networks is maturing. However, most existing neural-network-based speech synthesis models are end-to-end models that take text as input and output audio directly. Although this greatly improves synthesis speed, the resulting prosody and naturalness are poor, and sentence-break errors can occur when synthesizing many words that are not covered in the training samples.
This is especially problematic for online travel agencies, which need to place a large number of calls every day, usually by means of an outbound-call robot. Because the speech synthesis models used by existing outbound-call robots do not consider prosodic features, sentence-break errors occur easily and the user experience is poor.
It should be noted that the information disclosed in the foregoing background section is only for enhancing the understanding of the background of the invention, and may therefore include information that does not constitute prior art already known to those of ordinary skill in the art.
Disclosure of Invention
In view of the above, the invention provides a prosody-based speech synthesis method, a model training method and related equipment, which perform speech synthesis according to the prosodic features of the text to generate audio data with natural pronunciation, thereby avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
The invention provides a prosody-based speech synthesis method, which comprises the following steps: preprocessing a text to be synthesized to obtain a word segmentation text; obtaining a prosodic text containing prosodic information through a prosodic model according to the word segmentation text; converting the prosodic text into a vector comprising prosodic features; obtaining Mel spectrum characteristics through an acoustic model according to the vector containing prosodic characteristics; and synthesizing target audio according to the Mel spectrum characteristics.
In some embodiments, the prosody model includes an embedding layer, a multi-layer bidirectional long short-term memory network, and a fully connected layer, connected in sequence.
In some embodiments, the obtaining prosodic text comprising prosodic information with a prosodic model includes: inputting the word segmentation text into the prosody model, and sequentially processing the word segmentation text through the embedding layer, the multi-layer two-way long-short-term memory network and the full-connection layer to generate prosody text containing prosody information, wherein the prosody information is used for marking the pause time of each prosody phrase in the prosody text.
In some embodiments, the converting the prosodic text into a vector comprising prosodic features comprises: converting the prosodic text into a pinyin sequence comprising syllables, tones, and the prosodic information; converting the pinyin sequence into a digital sequence; and converting the digital sequence into a vector comprising prosodic features corresponding to the prosodic information.
In some embodiments, the acoustic model includes an encoder and a decoder. The encoder includes a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence, and the decoder includes a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the last bidirectional long short-term memory layer of the encoder connected to the first bidirectional long short-term memory layer of the decoder.
In some embodiments, the obtaining mel-spectrum features by an acoustic model includes: inputting the vector containing prosodic features into the acoustic model, and sequentially processing the vector by the multi-layer convolutional neural network and the multi-layer two-way long-short-term memory network of the encoder, and the multi-layer two-way long-short-term memory network and the linear layer of the decoder to generate Mel spectrum features with preset dimensions.
In some embodiments, the synthesizing the target audio includes: and inputting the Mel spectrum characteristics into a vocoder to obtain the target audio.
In some embodiments, the preprocessing the text to be synthesized includes: regularizing the text to be synthesized; adjusting the regularized text to be synthesized into a text with a uniform format; and word segmentation is carried out on the text in the unified format.
The invention also provides a model training method, which comprises the following steps: obtaining a first sample text and a target prosody text corresponding to the first sample text; constructing a prosody model based at least on the two-way long-short-term memory network; and training the prosody model by taking the first sample text as input and the target prosody text as output until the prosody model converges.
In some embodiments, constructing the prosody model based at least on the bidirectional long short-term memory network comprises: constructing an embedding layer, a multi-layer bidirectional long short-term memory network and a fully connected layer, which are connected in sequence to form the prosody model; the embedding layer is used for inputting the first sample text, and the fully connected layer is used for outputting the target prosodic text.
The invention also provides a model training method, which comprises the following steps: obtaining a second sample text and a real Mel spectrum feature and a target Mel spectrum feature corresponding to the second sample text; constructing an acoustic model based at least on the convolutional neural network and the two-way long-short-term memory network; and training the acoustic model by taking the second sample text and the real Mel-spectrum feature as inputs and the target Mel-spectrum feature as output until the acoustic model converges.
In some embodiments, constructing the acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network comprises: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the input of the first bidirectional long short-term memory layer of the decoder connected to the output of the last bidirectional long short-term memory layer of the encoder; and adding a local attention mechanism to the network structure and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model. The first convolutional neural network layer of the encoder is used for inputting a vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature.
The invention also provides a prosody-based speech synthesis device, comprising: the preprocessing module is used for preprocessing the text to be synthesized to obtain word segmentation text; the prosody prediction module is used for obtaining prosody text containing prosody information through a prosody model according to the word segmentation text; a vector conversion module for converting the prosodic text into a vector comprising prosodic features; the Mel spectrum generating module is used for obtaining Mel spectrum characteristics through an acoustic model according to the vector containing prosodic characteristics; and the voice synthesis module is used for synthesizing target audio according to the Mel spectrum characteristics.
The present invention also provides an electronic device including: a processor; and a memory having stored therein executable instructions which when executed by the processor implement the method of any of the embodiments described above.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements the method of any of the embodiments described above.
Compared with the prior art, the beneficial effects of the invention include at least the following:
When synthesizing speech from the text to be synthesized, the invention first preprocesses the text to generate a word segmentation text, and then inputs the word segmentation text into a prosody model to generate a prosodic text containing prosodic information; the prosodic text is then converted into a vector containing prosodic features and input into an acoustic model to generate mel-spectrum features; finally, prosody-based target audio is synthesized according to the mel-spectrum features.
By performing speech synthesis according to the prosodic features of the text, the method and the device generate audio data with natural pronunciation, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention. It is evident that the figures described below show only some embodiments of the invention, and a person skilled in the art can obtain other figures from them without inventive effort.
FIG. 1 is a schematic diagram showing steps of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a prosody model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of an acoustic model in an embodiment of the present invention;
FIG. 4 is a schematic diagram showing a training process of a prosody model in an embodiment of the present invention;
FIG. 5 shows a schematic diagram of an acoustic model training process in an embodiment of the invention;
FIG. 6 shows a schematic diagram of a training architecture of an acoustic model in an embodiment of the invention;
FIG. 7 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; and
Fig. 9 shows a schematic structure of a storage medium in an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The step numbers in the following embodiments merely distinguish different execution contents, and do not limit the logical relationship or execution order between steps. The use of the terms "first," "second," and the like in the description does not denote any order, quantity, or importance, but merely distinguishes one element from another. It should be noted that, where no conflict arises, the embodiments of the present invention and features of different embodiments may be combined with each other.
Fig. 1 shows main steps of a speech synthesis method in an embodiment, and referring to fig. 1, the prosody-based speech synthesis method in the present embodiment includes: in step S110, preprocessing a text to be synthesized to obtain a word segmentation text; in step S120, according to the word segmentation text, obtaining a prosodic text containing prosodic information through a prosodic model; in step S130, converting the prosodic text into a vector containing prosodic features; in step S140, obtaining mel-spectrum features by an acoustic model according to the vector containing prosodic features; and in step S150, synthesizing the target audio according to the mel-spectrum feature.
When this embodiment synthesizes speech from the text to be synthesized, the text is first preprocessed to generate a word segmentation text, which is then input into the prosody model to generate a prosodic text containing prosodic information; the prosodic text is converted into a vector containing prosodic features and input into the acoustic model to generate mel-spectrum features; finally, prosody-based target audio is synthesized according to the mel-spectrum features. With this speech synthesis method, input text is synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
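The five steps form a single pipeline from text to waveform. The sketch below only illustrates that flow; the callables preprocess, prosody_model, vectorize, acoustic_model and vocoder are hypothetical placeholders for the components described in the following sections, not names taken from the patent.

def synthesize(text, preprocess, prosody_model, vectorize, acoustic_model, vocoder):
    # Each component is supplied by the caller; this only wires steps S110-S150 together.
    segmented = preprocess(text)              # S110: normalize and segment the text
    prosodic_text = prosody_model(segmented)  # S120: insert prosody marks (#1, ",", ".")
    features = vectorize(prosodic_text)       # S130: pinyin -> ids -> prosody-feature vector
    mel = acoustic_model(features)            # S140: predict mel-spectrum frames
    return vocoder(mel)                       # S150: WaveRNN-style mel -> waveform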
The individual steps of the speech synthesis method are elucidated below in connection with specific examples.
In step S110, the process of preprocessing the text to be synthesized specifically includes: regularizing the text to be synthesized; adjusting the regularized text to be synthesized into a text with a uniform format; and word segmentation is carried out on the text in the unified format, and word segmentation text is obtained.
Regularizing the text to be synthesized yields normalized text, removing garbled characters, non-standard symbols and the like. The text to be synthesized is then adjusted into a unified format; for example, numbers such as amounts of money and times are converted into characters. Because numbers are pronounced differently in different scenes, they are replaced with different words according to the keywords of the scene in which they appear: for example, room number 203 is converted into room number "two zero three", and a room price of 423 yuan is converted into a room price of "four hundred and twenty-three".
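As a rough illustration of this scene-dependent reading (an assumption for illustration, not the patent's implementation), the following Python sketch reads digits after a room-number keyword digit by digit and digits before a price unit as a quantity; the keyword patterns and the simplified Chinese-numeral conversion are illustrative only.

import re

DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def read_digit_by_digit(num: str) -> str:
    # "203" -> "二零三", the reading used for room numbers
    return "".join(DIGITS[d] for d in num)

def read_as_quantity(num: str) -> str:
    # Simplified quantity reading for 1-999, enough for "423" -> "四百二十三"
    n = int(num)
    parts = []
    if n >= 100:
        parts.append(DIGITS[str(n // 100)] + "百")
        n %= 100
        if 0 < n < 10:
            parts.append("零")
    if n >= 10:
        parts.append(DIGITS[str(n // 10)] + "十")
        n %= 10
    if n:
        parts.append(DIGITS[str(n)])
    return "".join(parts)

def normalize_numbers(text: str) -> str:
    # Room numbers are read digit by digit ("房间号203" -> "房间号二零三")
    text = re.sub(r"(房间号)(\d+)", lambda m: m.group(1) + read_digit_by_digit(m.group(2)), text)
    # Prices are read as quantities ("423元" -> "四百二十三元")
    text = re.sub(r"(\d+)(元)", lambda m: read_as_quantity(m.group(1)) + m.group(2), text)
    return text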
In step S120, the prosody model is used to predict the prosody of the text pronunciation from the input text. The prosody model comprises an embedding layer (Embedding), a multi-layer bidirectional Long Short-Term Memory (LSTM) network and a fully connected layer, which are connected in sequence. The embedding layer, the bidirectional LSTM and the fully connected layer adopt existing network structures and are not described in detail here.
Fig. 2 illustrates a network structure of a prosody model in the embodiment, and referring to fig. 2, the prosody model 200 in the embodiment includes an embedding layer 210, a first layer bidirectional LSTM 220, a second layer bidirectional LSTM 230, a third layer bidirectional LSTM 240, and a full connection layer 250. The process of obtaining prosodic text containing prosodic information through prosodic model 200 includes: the segmented text is input into the prosodic model 200, and is processed sequentially by the embedding layer 210, the first layer bi-directional LSTM 220, the second layer bi-directional LSTM 230, the third layer bi-directional LSTM 240, and the full-connection layer 250, to generate a prosodic text containing prosodic information. Wherein the prosodic information is used to identify a pause time for each prosodic phrase in the prosodic text.
In this embodiment, the prosody model uses three prosody labels, "#1", "," and ".", which indicate different pause durations: "#1" indicates a short pause, "," indicates a longer pause, and "." indicates the end of the sentence. For example, after prosody prediction by the prosody model, the segmented text "I am an online travel agency, welcome to call the online travel agency." generates the prosodic text "I am #1 an online travel agency, welcome to call the online travel agency.".
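A minimal PyTorch sketch of this structure (embedding layer 210, three bidirectional LSTM layers 220-240, fully connected layer 250) is given below. The dimensions and the four-way label set (no mark / "#1" / "," / ".") are illustrative assumptions, not values stated in the patent.

import torch
import torch.nn as nn

class ProsodyModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 128, num_labels: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # embedding layer 210
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,  # bidirectional LSTM 220-240
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_labels)           # fully connected layer 250

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> per-token prosody-label logits
        x = self.embedding(token_ids)
        x, _ = self.lstm(x)
        return self.fc(x)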
In step S130, the process of converting prosodic text into a vector containing prosodic features specifically includes: converting the prosodic text into a pinyin sequence containing syllables, tones, and prosodic information; converting the pinyin sequence into a digital sequence; and converting the digital sequence into a vector containing prosodic features, wherein the prosodic features correspond to prosodic information.
Each word in the prosodic text corresponds to syllables and tones, and the prosodic information follows the syllables and tones of the corresponding word, so that the converted pinyin sequence fully reflects both the text content and the prosodic features of the prosodic text. The conversion of the pinyin sequence into the numeric sequence can be accomplished by dictionary lookup, and the numeric sequence is then converted by an embedding (Embedding) into a vector containing prosodic features.
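A simplified sketch of this conversion is shown below. Using the open-source pypinyin package for the syllable/tone lookup and keeping the prosody marks as extra symbols in the sequence are assumptions made for illustration; the patent only states that the conversion is done by dictionary lookup followed by an embedding.

import torch
from pypinyin import lazy_pinyin, Style

PROSODY_MARKS = {"#1", ",", "."}

def tokens_to_symbols(tokens: list[str]) -> list[str]:
    # tokens: segmented words with prosody marks interleaved (illustrative format)
    symbols = []
    for tok in tokens:
        if tok in PROSODY_MARKS:
            symbols.append(tok)                                   # mark follows its word
        else:
            symbols.extend(lazy_pinyin(tok, style=Style.TONE3))   # syllable + tone, e.g. "cheng2"
    return symbols

def symbols_to_vector(symbols: list[str], table: dict[str, int],
                      embedding: torch.nn.Embedding) -> torch.Tensor:
    ids = torch.tensor([table[s] for s in symbols], dtype=torch.long)  # dictionary lookup
    return embedding(ids)                                              # vector containing prosodic features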
In step S140, the acoustic model is used to predict mel-spectrum features from the input prosodic features. The acoustic model adopts an encoder-decoder structure: the encoder includes a multi-layer convolutional neural network (CNN) and a multi-layer bidirectional LSTM connected in sequence, and the decoder includes a multi-layer bidirectional LSTM and a linear layer (Linear) connected in sequence, with the last bidirectional LSTM layer of the encoder connected to the first bidirectional LSTM layer of the decoder. The CNN, the bidirectional LSTM and the linear layer adopt existing network structures and are not described in detail here.
Fig. 3 shows a network structure of an acoustic model in an embodiment, and referring to fig. 3, the acoustic model 300 in this embodiment includes an encoder 300a and a decoder 300b, the encoder 300a includes a first layer CNN 310, a second layer CNN 320, a first layer bidirectional LSTM 330 and a second layer bidirectional LSTM 340, and the decoder 300b includes a first layer bidirectional LSTM 360, a second layer bidirectional LSTM 370, a third layer bidirectional LSTM 380 and a linear layer 390.
The process of obtaining mel-spectrum features by the acoustic model 300 includes: inputting the vector containing prosodic features into the acoustic model 300, which is processed sequentially by the encoder 300a and the decoder 300b to generate mel-spectrum features of a preset dimension. The acoustic model 300 also uses a local attention (Local Attention) mechanism to better learn the alignment between text and acoustic features (mel-spectra). Meanwhile, a rectified linear unit (ReLU) is used between every two layers for nonlinear activation. The preset dimension of the acoustic model 300 is 160. The acoustic model 300 removes the conventional residual connection structure; the CNN layers use 256 dimensions, and the bidirectional LSTM hidden states use 128 dimensions.
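A structural PyTorch sketch of this layout is given below, following the dimensions quoted above (256-dim CNN layers, 128-dim bidirectional LSTM hidden states, 160-dim mel output, ReLU activations). The input dimension and kernel sizes are assumptions, and the local attention mechanism is omitted for brevity, so this illustrates only the layer arrangement, not the patent's full model.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim: int = 256, conv_dim: int = 256, lstm_dim: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, conv_dim, kernel_size=5, padding=2), nn.ReLU(),    # CNN 310
            nn.Conv1d(conv_dim, conv_dim, kernel_size=5, padding=2), nn.ReLU(),  # CNN 320
        )
        self.lstm = nn.LSTM(conv_dim, lstm_dim, num_layers=2,                    # bidirectional LSTM 330/340
                            bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) text-side features
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return x                              # (batch, time, 2 * lstm_dim)

class Decoder(nn.Module):
    def __init__(self, enc_dim: int = 256, lstm_dim: int = 128, mel_dim: int = 160):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim, lstm_dim, num_layers=3,                     # bidirectional LSTM 360-380
                            bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * lstm_dim, mel_dim)                           # linear layer 390

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # In the full model, local attention aligns decoder steps with enc_out.
        x, _ = self.lstm(enc_out)
        return self.linear(x)                 # (batch, time, mel_dim)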
In step S150, when synthesizing the target audio, the mel-spectrum features are input into a vocoder to obtain the target audio. The vocoder uses a WaveRNN model to transform the mel-spectrum features generated by the acoustic model into a playable audio waveform.
The invention also provides a training method of the prosody model and the acoustic model in the above embodiments.
Fig. 4 shows a training process of a prosody model, and referring to fig. 4, the model training method of the present embodiment includes: in step S410, a first sample text and a target prosody text corresponding to the first sample text are obtained; in step S420, a prosody model is constructed based at least on the two-way long-short term memory network; and in step S430, training the prosody model with the first sample text as input and the target prosody text as output until the prosody model converges.
The constructed prosody model may be as shown in fig. 2: the embedding layer 210, the multi-layer bidirectional LSTM 220-240 and the fully connected layer 250 are connected in sequence to form the prosody model 200. The embedding layer 210 receives the first sample text, and the fully connected layer 250 outputs the target prosodic text. Cross-entropy is used as the loss function.
The first sample text can be collected from call records between customer service agents, merchants and users of the online travel agency, and is preprocessed and prosody-annotated by professionals in advance to form the target prosodic text. The prosody model can be trained on a central processing unit (CPU) with a batch size of 1000 and a learning rate of 0.01 for 30 epochs in total, after which the model accuracy reaches 95.36%.
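A condensed training loop matching these settings might look as follows; the SGD optimizer and the DataLoader-based batching are assumptions, since the patent only specifies the loss, batch size, learning rate and number of epochs.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_prosody_model(model: nn.Module, dataset, epochs: int = 30) -> None:
    loader = DataLoader(dataset, batch_size=1000, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in loader:      # labels: per-token prosody classes (padded to equal length)
            logits = model(token_ids)         # (batch, seq_len, num_labels)
            loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()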
Fig. 5 shows a training process of an acoustic model in the embodiment, and referring to fig. 5, the model training method in this embodiment includes: in step S510, a second sample text and a real mel spectrum feature and a target mel spectrum feature corresponding to the second sample text are obtained; in step S520, an acoustic model is constructed based at least on the convolutional neural network and the two-way long-short-term memory network; and in step S530, training the acoustic model with the second sample text and the real mel-spectrum feature as inputs and the target mel-spectrum feature as output until the acoustic model converges.
The process of constructing the acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network includes: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the input of the first bidirectional long short-term memory layer of the decoder connected to the output of the last bidirectional long short-term memory layer of the encoder; and adding a local attention mechanism to the network structure and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model. The first convolutional neural network layer of the encoder is used for inputting the vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature.
The second sample text may use the target prosodic text described above. Professional human customer-service agents record the target audio based on the target prosodic text; 10,000 pieces of 16 kHz audio are recorded in total, with a total duration of about 10 hours, and the text corresponding to each piece of audio is checked by professionals. The real mel-spectrum features can then be extracted from the target audio.
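One common way to extract such real mel-spectrum features from the 16 kHz recordings is sketched below, using librosa with 160 mel bands to match the model's 160-dimensional output; the library choice and the FFT/hop parameters are assumptions, not values given in the patent.

import librosa
import numpy as np

def extract_mel(path: str, sr: int = 16000, n_mels: int = 160) -> np.ndarray:
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).T   # (frames, n_mels)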
Referring to the training structure of the acoustic model shown in fig. 6, the difference from the network structure used for actual prediction in fig. 3 is that, during training, the input of the decoder 300b of the acoustic model 300 includes the real mel-spectrum feature in addition to the output of the encoder 300a; the real mel-spectrum feature is input through the CNN 350 to the decoder 300b in order to accelerate convergence of the acoustic model 300 during training.
Training of the acoustic model may use a graphics processing unit (GPU), with the batch size set to 48 and the synthesized mel-spectrum features set to 160 dimensions. After the model has been trained for 100,000 iterations, testing on a validation set begins and is repeated every 10,000 iterations, observing the downward trend of the loss and checking the alignment between the text information and the mel-spectrum features. After about 400,000 iterations in total, the loss essentially stops decreasing, the text and the mel-spectrum are well aligned, and the model converges to the expected effect.
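A skeletal version of this procedure is sketched below: batch size 48 on a GPU, teacher forcing with the real mel-spectrum fed alongside the text-side vector, and a validation pass every 10,000 iterations up to 400,000 iterations. The MSE loss, the Adam optimizer and the two-argument model call are assumptions, since the patent does not specify them.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_acoustic_model(model: nn.Module, train_set, validate,
                         max_iters: int = 400_000, device: str = "cuda") -> None:
    loader = DataLoader(train_set, batch_size=48, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    model.to(device)
    step = 0
    while step < max_iters:
        for text_vec, real_mel, target_mel in loader:
            text_vec, real_mel, target_mel = (t.to(device) for t in (text_vec, real_mel, target_mel))
            pred_mel = model(text_vec, real_mel)   # teacher forcing: real mel fed to the decoder-side CNN
            loss = criterion(pred_mel, target_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % 10_000 == 0:
                validate(model, step)              # watch the loss trend and text/mel alignment
            if step >= max_iters:
                return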
Other features of the prosodic model and the acoustic model not specifically described are referred to the above-described embodiments of the speech synthesis method, and will not be repeated here.
The present invention also provides a speech synthesis apparatus, referring to fig. 7, in which a prosody-based speech synthesis apparatus 700 according to the present embodiment includes: a preprocessing module 710, configured to preprocess a text to be synthesized to obtain a word segmentation text; the prosody prediction module 720 is configured to obtain a prosody text including prosody information through a prosody model according to the word segmentation text; a vector conversion module 730 for converting prosodic text into a vector containing prosodic features; the mel-spectrum generating module 740 is configured to obtain mel-spectrum features through an acoustic model according to the vector containing prosodic features; and a speech synthesis module 750 for synthesizing the target audio according to the mel-spectrum features.
Other features of the speech synthesis apparatus not specifically described are referred to the above-described embodiments of the speech synthesis method, and will not be repeated here.
The speech synthesis apparatus of this embodiment can take text as input, perform speech synthesis according to the prosodic features of the text, generate corresponding audio with natural prosody, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.
An embodiment of the invention also provides an electronic device comprising a processor and a memory storing executable instructions. When the executable instructions are executed by the processor, the speech synthesis method and/or the model training method of any of the above embodiments is implemented: text is taken as input and synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
Fig. 8 illustrates the structure of the electronic device in the embodiment, and it should be understood that fig. 8 only schematically illustrates the respective modules, and these modules may be virtual software modules or actual hardware modules, and the combination, splitting and addition of the remaining modules are all within the scope of the present invention.
As shown in fig. 8, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 that connects the different platform components (including storage unit 820 and processing unit 810), a display unit 840, and the like.
Wherein the storage unit stores program code executable by the processing unit 810 such that the processing unit 810 performs the steps of any of the method embodiments described above. For example, the processing unit 810 may perform the steps shown in fig. 1,4, and 5.
The storage unit 820 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 8201 and/or cache memory 8202, and may further include Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having one or more program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 8000, which external devices 8000 may be one or more of a keyboard, pointing device, bluetooth device, etc. These external devices 8000 enable a user to interactively communicate with the electronic device 800. The electronic device 800 can also communicate with one or more other computing devices, including a router, modem, and the like. Such communication may occur through an input/output (I/O) interface 850. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. Network adapter 860 may communicate with other modules of electronic device 800 via bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
An embodiment of the invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed, the speech synthesis method and/or the model training method of any of the above embodiments is implemented: text is taken as input and synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
In some possible implementations, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps as described in any of the method embodiments described above, when the program product is run on the terminal device.
Fig. 9 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 9, a program product 900 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device, such as through the Internet using an Internet service provider.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (9)

Constructing an acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network, comprising: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, the input of the first bidirectional long short-term memory layer of the decoder being connected to the output of the last bidirectional long short-term memory layer of the encoder; adding a local attention mechanism into the network structure, and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model; wherein the first convolutional neural network layer of the encoder is used for inputting a vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature; and

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011224950.4A (CN112331177B) | 2020-11-05 | 2020-11-05 | Prosody-based speech synthesis method, model training method and related equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011224950.4A (CN112331177B) | 2020-11-05 | 2020-11-05 | Prosody-based speech synthesis method, model training method and related equipment

Publications (2)

Publication Number | Publication Date
CN112331177A (en) | 2021-02-05
CN112331177B (en) | 2024-07-02

Family

ID=74316325

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011224950.4A (CN112331177B, Active) | Prosody-based speech synthesis method, model training method and related equipment | 2020-11-05 | 2020-11-05

Country Status (1)

Country | Link
CN (1) | CN112331177B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113129862B (en)* | 2021-04-22 | 2024-03-12 | 合肥工业大学 | Voice synthesis method, system and server based on world-tacotron
CN113345417B (en)* | 2021-05-31 | 2024-03-01 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium
CN113436607B (en)* | 2021-06-12 | 2024-04-09 | 西安工业大学 | Quick voice cloning method
CN113436608B (en)* | 2021-06-25 | 2023-11-28 | 平安科技(深圳)有限公司 | Double-flow voice conversion method, device, equipment and storage medium
CN113628610B (en)* | 2021-08-12 | 2024-02-13 | 科大讯飞股份有限公司 | Voice synthesis method and device and electronic equipment
CN115910021A (en)* | 2021-09-22 | 2023-04-04 | 脸萌有限公司 | Speech synthesis method, device, electronic equipment and readable storage medium
CN113870835A (en)* | 2021-09-27 | 2021-12-31 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
CN114495896B (en)* | 2021-12-31 | 2024-12-20 | 深圳市优必选科技股份有限公司 | A voice playing method and computer device
CN114399991A (en)* | 2022-01-27 | 2022-04-26 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN114724538B (en)* | 2022-03-30 | 2025-01-07 | 上海工程技术大学 | A parallel speech synthesis method based on prosodic features, device, medium and equipment
CN115359780A (en)* | 2022-07-28 | 2022-11-18 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and storage medium
CN116092473A (en)* | 2023-02-10 | 2023-05-09 | 平安科技(深圳)有限公司 | Prosodic labeling model, training method of prosodic prediction model and related equipment
CN119832893A (en)* | 2024-12-12 | 2025-04-15 | 中电信人工智能科技(北京)有限公司 | Acoustic model generation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110010136A (en)* | 2019-04-04 | 2019-07-12 | 北京地平线机器人技术研发有限公司 | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110534089A (en)* | 2019-07-10 | 2019-12-03 | 西安交通大学 | A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP6436806B2 (en)* | 2015-02-03 | 2018-12-12 | 株式会社日立超エル・エス・アイ・システムズ | Speech synthesis data creation method and speech synthesis data creation device
CN105185372B (en)* | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105355193B (en)* | 2015-10-30 | 2020-09-25 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device
CN107945786B (en)* | 2017-11-27 | 2021-05-25 | 北京百度网讯科技有限公司 | Speech synthesis method and apparatus
CN110797008B (en)* | 2018-07-16 | 2024-03-29 | 阿里巴巴集团控股有限公司 | Far-field voice recognition method, voice recognition model training method and server
CN110600018B (en)* | 2019-09-05 | 2022-04-26 | 腾讯科技(深圳)有限公司 | Speech recognition method and device, neural network training method and device
CN111667816B (en)* | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium


Also Published As

Publication number | Publication date
CN112331177A (en) | 2021-02-05

Similar Documents

Publication | Title
CN112331177B (en) | Prosody-based speech synthesis method, model training method and related equipment
CN112349273B (en) | Speech synthesis method based on speaker, model training method and related equipment
CN111627418B (en) | Training method, synthesizing method, system, device and medium for speech synthesis model
CN113205817B (en) | Speech semantic recognition method, system, device and medium
CN109523989B (en) | Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
CN112233646B (en) | Voice cloning method, system, equipment and storage medium based on neural network
CN112289299B (en) | Training method and device of speech synthesis model, storage medium and electronic equipment
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN112652291B (en) | Speech synthesis method, system, equipment and storage medium based on neural network
CN111710326B (en) | English voice synthesis method and system, electronic equipment and storage medium
CN105723362A (en) | Natural expression processing method, processing and response method, device, and system
CN113380221A (en) | Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium
CN114694633A (en) | Speech synthesis method, device, equipment and storage medium
WO2022105472A1 (en) | Speech recognition method, apparatus, and electronic device
CN113362801A (en) | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment
CN112397056A (en) | Voice evaluation method and computer storage medium
CN116863912A (en) | Speech synthesis method, device, equipment and medium
CN112185340A (en) | Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN114495977B (en) | Speech translation and model training methods, devices, electronic devices and storage media
CN114360488A (en) | Speech synthesis, speech synthesis model training method, device and storage medium
CN118824233A (en) | Speech representation model training method, device and related equipment
CN118298804A (en) | Speech processing method, device, equipment and medium for intelligent customer service
CN113254606A (en) | Generative response method, and related method, apparatus, device and medium
CN117316139A (en) | Method and device for training speech synthesis model and speech synthesis
CN117174073A (en) | Emotion voice synthesis method and device based on AI large model

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
