CN112331177B - Prosody-based speech synthesis method, model training method and related equipment - Google Patents

Prosody-based speech synthesis method, model training method and related equipment

Info

Publication number
CN112331177B
Authority
CN
China
Prior art keywords
text
prosodic
prosody
layer
term memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011224950.4A
Other languages
Chinese (zh)
Other versions
CN112331177A (en)
Inventor
周明康
罗超
陈子浩
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202011224950.4A
Publication of CN112331177A
Application granted
Publication of CN112331177B
Legal status: Active (Current)
Anticipated expiration

Abstract

The invention relates to the technical field of speech processing, and provides a prosody-based speech synthesis method, a model training method and related equipment. The prosody-based speech synthesis method includes: preprocessing a text to be synthesized to obtain a word segmentation text; obtaining a prosodic text containing prosodic information through a prosodic model according to the word segmentation text; converting the prosodic text into a vector containing prosodic features; obtaining mel-spectrum features through an acoustic model according to the vector containing prosodic features; and synthesizing target audio according to the mel-spectrum features. By performing speech synthesis according to the prosodic features of the text, the method and the device generate audio data with natural pronunciation, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.

Description

Prosody-based speech synthesis method, model training method and related equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a prosody-based voice synthesis method, a model training method and related equipment.
Background
With the continuous development of deep learning technology, speech synthesis based on neural networks is maturing. However, most existing neural-network-based speech synthesis models are end-to-end models that take text as input and output audio directly. Although this greatly improves synthesis speed, the resulting prosody and naturalness are poor, and sentence-break errors can occur when synthesizing many words that are not covered in the training samples.
This is especially problematic for online travel agencies, which need to place a large number of calls every day, usually by means of an outbound-call robot. Because the speech synthesis models used by existing outbound-call robots do not consider prosodic features, sentence-break errors occur easily and the user experience is poor.
It should be noted that the information disclosed in the foregoing background section is only for enhancing the understanding of the background of the invention, and may therefore include information that does not constitute prior art already known to those of ordinary skill in the art.
Disclosure of Invention
In view of the above, the invention provides a prosody-based speech synthesis method, a model training method and related equipment, which perform speech synthesis according to the prosodic features of the text to generate audio data with natural pronunciation, thereby avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
The invention provides a prosody-based speech synthesis method, which comprises the following steps: preprocessing a text to be synthesized to obtain a word segmentation text; obtaining a prosodic text containing prosodic information through a prosodic model according to the word segmentation text; converting the prosodic text into a vector comprising prosodic features; obtaining Mel spectrum characteristics through an acoustic model according to the vector containing prosodic characteristics; and synthesizing target audio according to the Mel spectrum characteristics.
In some embodiments, the prosody model includes an embedding layer, a multi-layer bidirectional long short-term memory network, and a fully connected layer, connected in sequence.
In some embodiments, the obtaining prosodic text comprising prosodic information with a prosodic model includes: inputting the word segmentation text into the prosody model, and sequentially processing the word segmentation text through the embedding layer, the multi-layer two-way long-short-term memory network and the full-connection layer to generate prosody text containing prosody information, wherein the prosody information is used for marking the pause time of each prosody phrase in the prosody text.
In some embodiments, the converting the prosodic text into a vector comprising prosodic features comprises: converting the prosodic text into a pinyin sequence comprising syllables, tones, and the prosodic information; converting the pinyin sequence into a digital sequence; and converting the digital sequence into a vector comprising prosodic features corresponding to the prosodic information.
In some embodiments, the acoustic model includes an encoder and a decoder. The encoder includes a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence, and the decoder includes a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the last bidirectional long short-term memory layer of the encoder connected to the first bidirectional long short-term memory layer of the decoder.
In some embodiments, the obtaining mel-spectrum features by an acoustic model includes: inputting the vector containing prosodic features into the acoustic model, and sequentially processing the vector by the multi-layer convolutional neural network and the multi-layer two-way long-short-term memory network of the encoder, and the multi-layer two-way long-short-term memory network and the linear layer of the decoder to generate Mel spectrum features with preset dimensions.
In some embodiments, the synthesizing the target audio includes: and inputting the Mel spectrum characteristics into a vocoder to obtain the target audio.
In some embodiments, the preprocessing the text to be synthesized includes: regularizing the text to be synthesized; adjusting the regularized text to be synthesized into a text with a uniform format; and word segmentation is carried out on the text in the unified format.
The invention also provides a model training method, which comprises the following steps: obtaining a first sample text and a target prosody text corresponding to the first sample text; constructing a prosody model based at least on the two-way long-short-term memory network; and training the prosody model by taking the first sample text as input and the target prosody text as output until the prosody model converges.
In some embodiments, constructing the prosody model based at least on the bidirectional long short-term memory network comprises: constructing an embedding layer, a multi-layer bidirectional long short-term memory network and a fully connected layer, which are connected in sequence to form the prosody model; the embedding layer is used for inputting the first sample text, and the fully connected layer is used for outputting the target prosodic text.
The invention also provides a model training method, which comprises the following steps: obtaining a second sample text and a real Mel spectrum feature and a target Mel spectrum feature corresponding to the second sample text; constructing an acoustic model based at least on the convolutional neural network and the two-way long-short-term memory network; and training the acoustic model by taking the second sample text and the real Mel-spectrum feature as inputs and the target Mel-spectrum feature as output until the acoustic model converges.
In some embodiments, constructing the acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network comprises: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the input of the first bidirectional long short-term memory layer of the decoder connected to the output of the last bidirectional long short-term memory layer of the encoder; and adding a local attention mechanism to the network structure and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model. The first convolutional neural network layer of the encoder is used for inputting a vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature.
The invention also provides a prosody-based speech synthesis device, comprising: the preprocessing module is used for preprocessing the text to be synthesized to obtain word segmentation text; the prosody prediction module is used for obtaining prosody text containing prosody information through a prosody model according to the word segmentation text; a vector conversion module for converting the prosodic text into a vector comprising prosodic features; the Mel spectrum generating module is used for obtaining Mel spectrum characteristics through an acoustic model according to the vector containing prosodic characteristics; and the voice synthesis module is used for synthesizing target audio according to the Mel spectrum characteristics.
The present invention also provides an electronic device including: a processor; and a memory having stored therein executable instructions which when executed by the processor implement the method of any of the embodiments described above.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements the method of any of the embodiments described above.
Compared with the prior art, the beneficial effects of the invention include at least the following:
When synthesizing speech from the text to be synthesized, the invention first preprocesses the text to generate a word segmentation text, and then inputs the word segmentation text into a prosody model to generate a prosodic text containing prosodic information; the prosodic text is then converted into a vector containing prosodic features and input into an acoustic model to generate mel-spectrum features; finally, prosody-based target audio is synthesized according to the mel-spectrum features.
By performing speech synthesis according to the prosodic features of the text, the method and the device generate audio data with natural pronunciation, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention. It is evident that the figures described below show only some embodiments of the invention, and a person skilled in the art can obtain other figures from them without inventive effort.
FIG. 1 is a schematic diagram showing steps of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a prosody model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of an acoustic model in an embodiment of the present invention;
FIG. 4 is a schematic diagram showing a training process of a prosody model in an embodiment of the present invention;
FIG. 5 shows a schematic diagram of an acoustic model training process in an embodiment of the invention;
FIG. 6 shows a schematic diagram of a training architecture of an acoustic model in an embodiment of the invention;
FIG. 7 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; and
Fig. 9 shows a schematic structure of a storage medium in an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The step numbers in the following embodiments merely distinguish different execution contents, and do not limit the logical relationship or execution order between steps. The use of the terms "first," "second," and the like in the description does not denote any order, quantity, or importance, but merely distinguishes one element from another. It should be noted that, where no conflict arises, the embodiments of the present invention and features of different embodiments may be combined with each other.
Fig. 1 shows main steps of a speech synthesis method in an embodiment, and referring to fig. 1, the prosody-based speech synthesis method in the present embodiment includes: in step S110, preprocessing a text to be synthesized to obtain a word segmentation text; in step S120, according to the word segmentation text, obtaining a prosodic text containing prosodic information through a prosodic model; in step S130, converting the prosodic text into a vector containing prosodic features; in step S140, obtaining mel-spectrum features by an acoustic model according to the vector containing prosodic features; and in step S150, synthesizing the target audio according to the mel-spectrum feature.
When this embodiment synthesizes speech from the text to be synthesized, the text is first preprocessed to generate a word segmentation text, which is then input into the prosody model to generate a prosodic text containing prosodic information; the prosodic text is converted into a vector containing prosodic features and input into the acoustic model to generate mel-spectrum features; finally, prosody-based target audio is synthesized according to the mel-spectrum features. With this speech synthesis method, input text is synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
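The five steps form a single pipeline from text to waveform. The sketch below only illustrates that flow; the callables preprocess, prosody_model, vectorize, acoustic_model and vocoder are hypothetical placeholders for the components described in the following sections, not names taken from the patent.

def synthesize(text, preprocess, prosody_model, vectorize, acoustic_model, vocoder):
    # Each component is supplied by the caller; this only wires steps S110-S150 together.
    segmented = preprocess(text)              # S110: normalize and segment the text
    prosodic_text = prosody_model(segmented)  # S120: insert prosody marks (#1, ",", ".")
    features = vectorize(prosodic_text)       # S130: pinyin -> ids -> prosody-feature vector
    mel = acoustic_model(features)            # S140: predict mel-spectrum frames
    return vocoder(mel)                       # S150: WaveRNN-style mel -> waveform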
The individual steps of the speech synthesis method are elucidated below in connection with specific examples.
In step S110, the process of preprocessing the text to be synthesized specifically includes: regularizing the text to be synthesized; adjusting the regularized text to be synthesized into a text with a uniform format; and word segmentation is carried out on the text in the unified format, and word segmentation text is obtained.
Regularizing the text to be synthesized yields normalized text, removing garbled characters, non-standard symbols and the like. The text to be synthesized is then adjusted into a unified format; for example, numbers such as amounts of money and times are converted into characters. Because numbers are pronounced differently in different scenes, they are replaced with different words according to the keywords of the scene in which they appear: for example, room number 203 is converted into room number "two zero three", and a room price of 423 yuan is converted into a room price of "four hundred and twenty-three".
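As a rough illustration of this scene-dependent reading (an assumption for illustration, not the patent's implementation), the following Python sketch reads digits after a room-number keyword digit by digit and digits before a price unit as a quantity; the keyword patterns and the simplified Chinese-numeral conversion are illustrative only.

import re

DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def read_digit_by_digit(num: str) -> str:
    # "203" -> "二零三", the reading used for room numbers
    return "".join(DIGITS[d] for d in num)

def read_as_quantity(num: str) -> str:
    # Simplified quantity reading for 1-999, enough for "423" -> "四百二十三"
    n = int(num)
    parts = []
    if n >= 100:
        parts.append(DIGITS[str(n // 100)] + "百")
        n %= 100
        if 0 < n < 10:
            parts.append("零")
    if n >= 10:
        parts.append(DIGITS[str(n // 10)] + "十")
        n %= 10
    if n:
        parts.append(DIGITS[str(n)])
    return "".join(parts)

def normalize_numbers(text: str) -> str:
    # Room numbers are read digit by digit ("房间号203" -> "房间号二零三")
    text = re.sub(r"(房间号)(\d+)", lambda m: m.group(1) + read_digit_by_digit(m.group(2)), text)
    # Prices are read as quantities ("423元" -> "四百二十三元")
    text = re.sub(r"(\d+)(元)", lambda m: read_as_quantity(m.group(1)) + m.group(2), text)
    return text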
In step S120, the prosody model is used to predict the prosody of the text pronunciation from the input text. The prosody model comprises an embedding layer (Embedding), a multi-layer bidirectional Long Short-Term Memory (LSTM) network and a fully connected layer, which are connected in sequence. The embedding layer, the bidirectional LSTM and the fully connected layer adopt existing network structures and are not described in detail here.
Fig. 2 illustrates a network structure of a prosody model in the embodiment, and referring to fig. 2, the prosody model 200 in the embodiment includes an embedding layer 210, a first layer bidirectional LSTM 220, a second layer bidirectional LSTM 230, a third layer bidirectional LSTM 240, and a full connection layer 250. The process of obtaining prosodic text containing prosodic information through prosodic model 200 includes: the segmented text is input into the prosodic model 200, and is processed sequentially by the embedding layer 210, the first layer bi-directional LSTM 220, the second layer bi-directional LSTM 230, the third layer bi-directional LSTM 240, and the full-connection layer 250, to generate a prosodic text containing prosodic information. Wherein the prosodic information is used to identify a pause time for each prosodic phrase in the prosodic text.
In this embodiment, the prosody model uses three prosody labels, "#1", "," and ".", which indicate different pause durations: "#1" indicates a short pause, "," indicates a longer pause, and "." indicates the end of the sentence. For example, after prosody prediction by the prosody model, the segmented text "I am an online travel agency, welcome to call the online travel agency." generates the prosodic text "I am #1 an online travel agency, welcome to call the online travel agency.".
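A minimal PyTorch sketch of this structure (embedding layer 210, three bidirectional LSTM layers 220-240, fully connected layer 250) is given below. The dimensions and the four-way label set (no mark / "#1" / "," / ".") are illustrative assumptions, not values stated in the patent.

import torch
import torch.nn as nn

class ProsodyModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 128, num_labels: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # embedding layer 210
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,  # bidirectional LSTM 220-240
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_labels)           # fully connected layer 250

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> per-token prosody-label logits
        x = self.embedding(token_ids)
        x, _ = self.lstm(x)
        return self.fc(x)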
In step S130, the process of converting prosodic text into a vector containing prosodic features specifically includes: converting the prosodic text into a pinyin sequence containing syllables, tones, and prosodic information; converting the pinyin sequence into a digital sequence; and converting the digital sequence into a vector containing prosodic features, wherein the prosodic features correspond to prosodic information.
Each word in the prosodic text corresponds to syllables and tones, and the prosodic information follows the syllables and tones of the corresponding word, so that the converted pinyin sequence fully reflects both the text content and the prosodic features of the prosodic text. The conversion of the pinyin sequence into the numeric sequence can be accomplished by dictionary lookup, and the numeric sequence is then converted by an embedding (Embedding) into a vector containing prosodic features.
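A simplified sketch of this conversion is shown below. Using the open-source pypinyin package for the syllable/tone lookup and keeping the prosody marks as extra symbols in the sequence are assumptions made for illustration; the patent only states that the conversion is done by dictionary lookup followed by an embedding.

import torch
from pypinyin import lazy_pinyin, Style

PROSODY_MARKS = {"#1", ",", "."}

def tokens_to_symbols(tokens: list[str]) -> list[str]:
    # tokens: segmented words with prosody marks interleaved (illustrative format)
    symbols = []
    for tok in tokens:
        if tok in PROSODY_MARKS:
            symbols.append(tok)                                   # mark follows its word
        else:
            symbols.extend(lazy_pinyin(tok, style=Style.TONE3))   # syllable + tone, e.g. "cheng2"
    return symbols

def symbols_to_vector(symbols: list[str], table: dict[str, int],
                      embedding: torch.nn.Embedding) -> torch.Tensor:
    ids = torch.tensor([table[s] for s in symbols], dtype=torch.long)  # dictionary lookup
    return embedding(ids)                                              # vector containing prosodic features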
In step S140, the acoustic model is used to predict mel-spectrum features from the input prosodic features. The acoustic model adopts an encoder-decoder structure: the encoder includes a multi-layer convolutional neural network (CNN) and a multi-layer bidirectional LSTM connected in sequence, and the decoder includes a multi-layer bidirectional LSTM and a linear layer (Linear) connected in sequence, with the last bidirectional LSTM layer of the encoder connected to the first bidirectional LSTM layer of the decoder. The CNN, the bidirectional LSTM and the linear layer adopt existing network structures and are not described in detail here.
Fig. 3 shows a network structure of an acoustic model in an embodiment, and referring to fig. 3, the acoustic model 300 in this embodiment includes an encoder 300a and a decoder 300b, the encoder 300a includes a first layer CNN 310, a second layer CNN 320, a first layer bidirectional LSTM 330 and a second layer bidirectional LSTM 340, and the decoder 300b includes a first layer bidirectional LSTM 360, a second layer bidirectional LSTM 370, a third layer bidirectional LSTM 380 and a linear layer 390.
The process of obtaining mel-spectrum features by the acoustic model 300 includes: inputting the vector containing prosodic features into the acoustic model 300, which is processed sequentially by the encoder 300a and the decoder 300b to generate mel-spectrum features of a preset dimension. The acoustic model 300 also uses a local attention (Local Attention) mechanism to better learn the alignment between text and acoustic features (mel-spectra). Meanwhile, a rectified linear unit (ReLU) is used between every two layers for nonlinear activation. The preset dimension of the acoustic model 300 is 160. The acoustic model 300 removes the conventional residual connection structure; the CNN layers use 256 dimensions, and the bidirectional LSTM hidden states use 128 dimensions.
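A structural PyTorch sketch of this layout is given below, following the dimensions quoted above (256-dim CNN layers, 128-dim bidirectional LSTM hidden states, 160-dim mel output, ReLU activations). The input dimension and kernel sizes are assumptions, and the local attention mechanism is omitted for brevity, so this illustrates only the layer arrangement, not the patent's full model.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim: int = 256, conv_dim: int = 256, lstm_dim: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, conv_dim, kernel_size=5, padding=2), nn.ReLU(),    # CNN 310
            nn.Conv1d(conv_dim, conv_dim, kernel_size=5, padding=2), nn.ReLU(),  # CNN 320
        )
        self.lstm = nn.LSTM(conv_dim, lstm_dim, num_layers=2,                    # bidirectional LSTM 330/340
                            bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) text-side features
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return x                              # (batch, time, 2 * lstm_dim)

class Decoder(nn.Module):
    def __init__(self, enc_dim: int = 256, lstm_dim: int = 128, mel_dim: int = 160):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim, lstm_dim, num_layers=3,                     # bidirectional LSTM 360-380
                            bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * lstm_dim, mel_dim)                           # linear layer 390

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # In the full model, local attention aligns decoder steps with enc_out.
        x, _ = self.lstm(enc_out)
        return self.linear(x)                 # (batch, time, mel_dim)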
In step S150, when synthesizing the target audio, the mel-spectrum features are input into a vocoder to obtain the target audio. The vocoder uses a WaveRNN model to transform the mel-spectrum features generated by the acoustic model into a playable audio waveform.
The invention also provides a training method of the prosody model and the acoustic model in the above embodiments.
Fig. 4 shows a training process of a prosody model, and referring to fig. 4, the model training method of the present embodiment includes: in step S410, a first sample text and a target prosody text corresponding to the first sample text are obtained; in step S420, a prosody model is constructed based at least on the two-way long-short term memory network; and in step S430, training the prosody model with the first sample text as input and the target prosody text as output until the prosody model converges.
The constructed prosody model may be as shown in fig. 2: the embedding layer 210, the multi-layer bidirectional LSTM 220-240 and the fully connected layer 250 are connected in sequence to form the prosody model 200. The embedding layer 210 receives the first sample text, and the fully connected layer 250 outputs the target prosodic text. Cross-entropy is used as the loss function.
The first sample text can be collected from call records between customer service agents, merchants and users of the online travel agency, and is preprocessed and prosody-annotated by professionals in advance to form the target prosodic text. The prosody model can be trained on a central processing unit (CPU) with a batch size of 1000 and a learning rate of 0.01 for 30 epochs in total, after which the model accuracy reaches 95.36%.
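A condensed training loop matching these settings might look as follows; the SGD optimizer and the DataLoader-based batching are assumptions, since the patent only specifies the loss, batch size, learning rate and number of epochs.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_prosody_model(model: nn.Module, dataset, epochs: int = 30) -> None:
    loader = DataLoader(dataset, batch_size=1000, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in loader:      # labels: per-token prosody classes (padded to equal length)
            logits = model(token_ids)         # (batch, seq_len, num_labels)
            loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()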
Fig. 5 shows a training process of an acoustic model in the embodiment, and referring to fig. 5, the model training method in this embodiment includes: in step S510, a second sample text and a real mel spectrum feature and a target mel spectrum feature corresponding to the second sample text are obtained; in step S520, an acoustic model is constructed based at least on the convolutional neural network and the two-way long-short-term memory network; and in step S530, training the acoustic model with the second sample text and the real mel-spectrum feature as inputs and the target mel-spectrum feature as output until the acoustic model converges.
The process of constructing the acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network includes: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, with the input of the first bidirectional long short-term memory layer of the decoder connected to the output of the last bidirectional long short-term memory layer of the encoder; and adding a local attention mechanism to the network structure and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model. The first convolutional neural network layer of the encoder is used for inputting the vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature.
The second sample text may use the target prosodic text described above. Professional human customer-service agents record the target audio based on the target prosodic text; 10,000 pieces of 16 kHz audio are recorded in total, with a total duration of about 10 hours, and the text corresponding to each piece of audio is checked by professionals. The real mel-spectrum features can then be extracted from the target audio.
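One common way to extract such real mel-spectrum features from the 16 kHz recordings is sketched below, using librosa with 160 mel bands to match the model's 160-dimensional output; the library choice and the FFT/hop parameters are assumptions, not values given in the patent.

import librosa
import numpy as np

def extract_mel(path: str, sr: int = 16000, n_mels: int = 160) -> np.ndarray:
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).T   # (frames, n_mels)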
Referring to the training structure of the acoustic model shown in fig. 6, the difference from the network structure used for actual prediction in fig. 3 is that, during training, the input of the decoder 300b of the acoustic model 300 includes the real mel-spectrum feature in addition to the output of the encoder 300a; the real mel-spectrum feature is input through the CNN 350 to the decoder 300b in order to accelerate convergence of the acoustic model 300 during training.
Training of the acoustic model may use a graphics processing unit (GPU), with the batch size set to 48 and the synthesized mel-spectrum features set to 160 dimensions. After the model has been trained for 100,000 iterations, testing on a validation set begins and is repeated every 10,000 iterations, observing the downward trend of the loss and checking the alignment between the text information and the mel-spectrum features. After about 400,000 iterations in total, the loss essentially stops decreasing, the text and the mel-spectrum are well aligned, and the model converges to the expected effect.
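A skeletal version of this procedure is sketched below: batch size 48 on a GPU, teacher forcing with the real mel-spectrum fed alongside the text-side vector, and a validation pass every 10,000 iterations up to 400,000 iterations. The MSE loss, the Adam optimizer and the two-argument model call are assumptions, since the patent does not specify them.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_acoustic_model(model: nn.Module, train_set, validate,
                         max_iters: int = 400_000, device: str = "cuda") -> None:
    loader = DataLoader(train_set, batch_size=48, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    model.to(device)
    step = 0
    while step < max_iters:
        for text_vec, real_mel, target_mel in loader:
            text_vec, real_mel, target_mel = (t.to(device) for t in (text_vec, real_mel, target_mel))
            pred_mel = model(text_vec, real_mel)   # teacher forcing: real mel fed to the decoder-side CNN
            loss = criterion(pred_mel, target_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % 10_000 == 0:
                validate(model, step)              # watch the loss trend and text/mel alignment
            if step >= max_iters:
                return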
Other features of the prosodic model and the acoustic model not specifically described are referred to the above-described embodiments of the speech synthesis method, and will not be repeated here.
The present invention also provides a speech synthesis apparatus, referring to fig. 7, in which a prosody-based speech synthesis apparatus 700 according to the present embodiment includes: a preprocessing module 710, configured to preprocess a text to be synthesized to obtain a word segmentation text; the prosody prediction module 720 is configured to obtain a prosody text including prosody information through a prosody model according to the word segmentation text; a vector conversion module 730 for converting prosodic text into a vector containing prosodic features; the mel-spectrum generating module 740 is configured to obtain mel-spectrum features through an acoustic model according to the vector containing prosodic features; and a speech synthesis module 750 for synthesizing the target audio according to the mel-spectrum features.
Other features of the speech synthesis apparatus not specifically described are referred to the above-described embodiments of the speech synthesis method, and will not be repeated here.
The speech synthesis apparatus of this embodiment can take text as input, perform speech synthesis according to the prosodic features of the text, generate corresponding audio with natural prosody, avoid sentence-break errors, make the synthesized audio sound more natural and realistic, and improve the user experience.
An embodiment of the invention also provides an electronic device comprising a processor and a memory storing executable instructions. When the executable instructions are executed by the processor, the speech synthesis method and/or the model training method of any of the above embodiments is implemented: text is taken as input and synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
Fig. 8 illustrates the structure of the electronic device in the embodiment, and it should be understood that fig. 8 only schematically illustrates the respective modules, and these modules may be virtual software modules or actual hardware modules, and the combination, splitting and addition of the remaining modules are all within the scope of the present invention.
As shown in fig. 8, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 that connects the different platform components (including storage unit 820 and processing unit 810), a display unit 840, and the like.
Wherein the storage unit stores program code executable by the processing unit 810 such that the processing unit 810 performs the steps of any of the method embodiments described above. For example, the processing unit 810 may perform the steps shown in fig. 1,4, and 5.
The storage unit 820 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 8201 and/or cache memory 8202, and may further include Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having one or more program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 8000, which external devices 8000 may be one or more of a keyboard, pointing device, bluetooth device, etc. These external devices 8000 enable a user to interactively communicate with the electronic device 800. The electronic device 800 can also communicate with one or more other computing devices, including a router, modem, and the like. Such communication may occur through an input/output (I/O) interface 850. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. Network adapter 860 may communicate with other modules of electronic device 800 via bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
An embodiment of the invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed, the speech synthesis method and/or the model training method of any of the above embodiments is implemented: text is taken as input and synthesized into speech according to its prosodic features, generating corresponding audio with natural prosody, avoiding sentence-break errors, making the synthesized audio sound more natural and realistic, and improving the user experience.
In some possible implementations, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps as described in any of the method embodiments described above, when the program product is run on the terminal device.
Fig. 9 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 9, a program product 900 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device, such as through the Internet using an Internet service provider.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (9)

Constructing an acoustic model based at least on the convolutional neural network and the bidirectional long short-term memory network, comprising: constructing a network structure including an encoder and a decoder; constructing, in the encoder, a multi-layer convolutional neural network and a multi-layer bidirectional long short-term memory network connected in sequence; constructing, in the decoder, a convolutional neural network, a multi-layer bidirectional long short-term memory network and a linear layer connected in sequence, the input of the first bidirectional long short-term memory layer of the decoder being connected to the output of the last bidirectional long short-term memory layer of the encoder; adding a local attention mechanism into the network structure, and performing nonlinear activation between every two layers using a rectified linear unit, to form the acoustic model; wherein the first convolutional neural network layer of the encoder is used for inputting a vector corresponding to the second sample text, the convolutional neural network of the decoder is used for inputting the real mel-spectrum feature, and the linear layer is used for outputting the target mel-spectrum feature; and

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011224950.4A (CN112331177B) | 2020-11-05 | 2020-11-05 | Prosody-based speech synthesis method, model training method and related equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011224950.4A (CN112331177B) | 2020-11-05 | 2020-11-05 | Prosody-based speech synthesis method, model training method and related equipment

Publications (2)

Publication Number | Publication Date
CN112331177A (en) | 2021-02-05
CN112331177B (en) | 2024-07-02

Family

ID=74316325

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011224950.4A (CN112331177B, Active) | Prosody-based speech synthesis method, model training method and related equipment | 2020-11-05 | 2020-11-05

Country Status (1)

Country | Link
CN (1) | CN112331177B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113129862B (en)* | 2021-04-22 | 2024-03-12 | 合肥工业大学 | Voice synthesis method, system and server based on world-tacotron
CN113345417B (en)* | 2021-05-31 | 2024-03-01 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium
CN113436607B (en)* | 2021-06-12 | 2024-04-09 | 西安工业大学 | Quick voice cloning method
CN113436608B (en)* | 2021-06-25 | 2023-11-28 | 平安科技(深圳)有限公司 | Double-flow voice conversion method, device, equipment and storage medium
CN113628610B (en)* | 2021-08-12 | 2024-02-13 | 科大讯飞股份有限公司 | Voice synthesis method and device and electronic equipment
CN115910021A (en)* | 2021-09-22 | 2023-04-04 | 脸萌有限公司 | Speech synthesis method, device, electronic equipment and readable storage medium
CN113870835A (en)* | 2021-09-27 | 2021-12-31 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
CN114495896B (en)* | 2021-12-31 | 2024-12-20 | 深圳市优必选科技股份有限公司 | A voice playing method and computer device
CN114399991A (en)* | 2022-01-27 | 2022-04-26 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN114724538B (en)* | 2022-03-30 | 2025-01-07 | 上海工程技术大学 | A parallel speech synthesis method based on prosodic features, device, medium and equipment
CN115359780A (en)* | 2022-07-28 | 2022-11-18 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and storage medium
CN116092473A (en)* | 2023-02-10 | 2023-05-09 | 平安科技(深圳)有限公司 | Prosodic labeling model, training method of prosodic prediction model and related equipment
CN119832893A (en)* | 2024-12-12 | 2025-04-15 | 中电信人工智能科技(北京)有限公司 | Acoustic model generation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110010136A (en)* | 2019-04-04 | 2019-07-12 | 北京地平线机器人技术研发有限公司 | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110534089A (en)* | 2019-07-10 | 2019-12-03 | 西安交通大学 | A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP6436806B2 (en)* | 2015-02-03 | 2018-12-12 | 株式会社日立超エル・エス・アイ・システムズ | Speech synthesis data creation method and speech synthesis data creation device
CN105185372B (en)* | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105355193B (en)* | 2015-10-30 | 2020-09-25 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device
CN107945786B (en)* | 2017-11-27 | 2021-05-25 | 北京百度网讯科技有限公司 | Speech synthesis method and apparatus
CN110797008B (en)* | 2018-07-16 | 2024-03-29 | 阿里巴巴集团控股有限公司 | Far-field voice recognition method, voice recognition model training method and server
CN110600018B (en)* | 2019-09-05 | 2022-04-26 | 腾讯科技(深圳)有限公司 | Speech recognition method and device, neural network training method and device
CN111667816B (en)* | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium


Also Published As

Publication number | Publication date
CN112331177A (en) | 2021-02-05

Similar Documents

Publication | Title
CN112331177B (en) | Prosody-based speech synthesis method, model training method and related equipment
CN112349273B (en) | Speech synthesis method based on speaker, model training method and related equipment
CN111627418B (en) | Training method, synthesizing method, system, device and medium for speech synthesis model
CN113205817B (en) | Speech semantic recognition method, system, device and medium
CN109523989B (en) | Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
CN112233646B (en) | Voice cloning method, system, equipment and storage medium based on neural network
CN112289299B (en) | Training method and device of speech synthesis model, storage medium and electronic equipment
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN112652291B (en) | Speech synthesis method, system, equipment and storage medium based on neural network
CN111710326B (en) | English voice synthesis method and system, electronic equipment and storage medium
CN105723362A (en) | Natural expression processing method, processing and response method, device, and system
CN113380221A (en) | Chinese and English mixed speech synthesis method and device, electronic equipment and storage medium
CN114694633A (en) | Speech synthesis method, device, equipment and storage medium
WO2022105472A1 (en) | Speech recognition method, apparatus, and electronic device
CN113362801A (en) | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment
CN112397056A (en) | Voice evaluation method and computer storage medium
CN116863912A (en) | Speech synthesis method, device, equipment and medium
CN112185340A (en) | Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN114495977B (en) | Speech translation and model training methods, devices, electronic devices and storage media
CN114360488A (en) | Speech synthesis, speech synthesis model training method, device and storage medium
CN118824233A (en) | Speech representation model training method, device and related equipment
CN118298804A (en) | Speech processing method, device, equipment and medium for intelligent customer service
CN113254606A (en) | Generative response method, and related method, apparatus, device and medium
CN117316139A (en) | Method and device for training speech synthesis model and speech synthesis
CN117174073A (en) | Emotion voice synthesis method and device based on AI large model

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
