CN113990287B - A speech synthesis method, device and storage medium - Google Patents

A speech synthesis method, device and storage medium

Info

Publication number
CN113990287B
CN113990287B (granted publication of application CN202111369331.9A)
Authority
CN
China
Prior art keywords
speech
voice
waveform
amplitude
voice waveform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111369331.9A
Other languages
Chinese (zh)
Other versions
CN113990287A (en)
Inventor
徐东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111369331.9A
Publication of CN113990287A
Application granted
Publication of CN113990287B
Legal status: Active
Anticipated expiration

Abstract

The application discloses a speech synthesis method comprising the following steps: obtaining a target text; inputting the target text into an acoustic model to generate a target spectrum; performing inference on the target spectrum with a neural network vocoder to obtain a predicted speech waveform; performing pop-sound detection on the speech waveform; if a pop is detected, repeating the inference step until the resulting waveform contains no pop; and outputting the speech corresponding to the pop-free waveform. With this technical scheme, the finally output speech contains no pops, the speech obtained is stable and reliable, and the quality of the synthesized speech is improved. The application also discloses a corresponding speech synthesis apparatus, device and storage medium with corresponding technical effects.

Description

Speech synthesis method, device and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and apparatus for speech synthesis, and a storage medium.
Background
Speech synthesis is a technology for converting text information into speech information. It spans several disciplines, including acoustics, linguistics and computer science, and is a frontier technology in the field of Chinese information processing. Synthesized speech produced with this technology can be played in application scenarios such as user terminals and robots, and its quality directly affects the user's listening experience.
Therefore, how to perform speech synthesis so that the resulting synthesized speech is stable and reliable, and its quality improved, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a speech synthesis method, device and storage medium, so that the synthesized speech obtained is stable and reliable and the quality of the synthesized speech is improved.
In order to solve the technical problems, the application provides the following technical scheme:
a method of speech synthesis, comprising:
acquiring a target text, and inputting the target text into an acoustic model to generate a target spectrum;
performing inference on the target spectrum using a neural network vocoder to obtain a predicted speech waveform;
performing pop-sound detection on the speech waveform;
if a pop is detected in the speech waveform, repeating the step of performing inference on the target spectrum using the neural network vocoder to obtain a predicted speech waveform, until the speech waveform contains no pop;
and outputting the speech corresponding to the pop-free speech waveform.
In one embodiment of the present application, the performing pop detection on the voice waveform includes:
performing pop detection on the speech waveform in the time domain according to the absolute value of the amplitude and the relative rate of change of the amplitude of the waveform in the time domain.
In one specific embodiment of the present application, the detecting the pop sound in the time domain according to the absolute value of the amplitude and the relative change rate of the amplitude of the voice waveform in the time domain includes:
determining whether a pop exists in the speech waveform in the time domain according to the absolute value of the amplitude at each sampling point, the absolute value of the amplitude difference between every two adjacent sampling points, and/or the absolute value of the amplitude difference between two sampling points separated by one or more sampling points.
In one embodiment of the present application, after the time-domain pop detection according to the absolute value of the amplitude and its relative rate of change, the method further includes:
And if the voice waveform does not have the pop sound in the time domain, performing the pop sound detection in the frequency domain on the voice waveform according to the energy of the voice waveform in each sub-band frequency interval in the frequency domain.
In one embodiment of the present application, the detecting the pop sound in the frequency domain according to the energy of the voice waveform in each sub-band frequency interval in the frequency domain includes:
judging whether the absolute value of the energy difference value of each two adjacent sub-band frequency intervals of the voice waveform in the frequency domain is larger than a preset energy difference threshold value or not;
If yes, judging that the voice waveform has pop sound in a frequency domain;
if not, judging that the voice waveform has no pop sound in the frequency domain.
In one embodiment of the present application, before outputting the speech corresponding to the pop-free speech waveform, the method further comprises:
performing silence detection on the speech corresponding to the pop-free speech waveform;
and if a target silence whose duration exceeds a preset duration threshold is detected in the speech, processing the target silence in the speech.
In one embodiment of the present application, processing the target silence in the speech includes:
truncating the target silence in the speech,
or replacing the target silence in the speech with a silence whose duration is less than the preset duration threshold.
In one embodiment of the present application, before performing inference on the target spectrum with the neural network vocoder, the method further includes:
detecting the amplitudes of the target spectrum;
and adjusting any detected amplitude that falls outside a preset amplitude range to lie within that range.
A speech synthesis apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of any one of the above-described speech synthesis methods when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech synthesis method of any of the preceding claims.
By applying the technical scheme provided by the embodiments of the application, after the neural network vocoder infers the target spectrum to obtain a predicted speech waveform, the corresponding speech is not output directly; instead, pop detection is first performed on the waveform, and the speech is output only if no pop is present. If a pop is present, the vocoder infers the target spectrum again to obtain a new predicted waveform, which is again checked for pops, until a predicted waveform without pops is obtained and its corresponding speech is output. As a result, the finally output speech contains no pops, the speech obtained is stable and reliable, and the quality of the synthesized speech is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a speech synthesis method that can run in the background of a computer or be implemented via cloud processing, giving higher processing efficiency and faster operation.
The technical scheme provided by the embodiments of the application can be applied to speech synthesis in any scenario that requires converting text information into speech information, for example playing a popular-science article to a user as speech.
In the embodiments of the application, a target text is acquired and input into an acoustic model to generate a target spectrum; a neural network vocoder infers the target spectrum to obtain a predicted speech waveform; pop detection is performed on the waveform; if a pop is detected, the inference step is repeated until the waveform contains no pop; and the speech corresponding to the pop-free waveform is output. After the vocoder infers the target spectrum and the predicted waveform is obtained, the corresponding speech is not output directly; instead, pop detection is performed first, and the speech is output only if no pop is present. If a pop is present, the vocoder infers the target spectrum again, the newly obtained waveform is again checked for pops, and this repeats until a pop-free predicted waveform is obtained and its corresponding speech is output. Consequently, the finally output speech contains no pops, is stable and reliable, and the quality of the synthesized speech is improved.
In order to better understand the aspects of the present application, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of a speech synthesis method according to an embodiment of the present application may include the following steps:
S110, acquiring a target text, and inputting the target text into an acoustic model to generate a target spectrum;
In the embodiments of the application, the target text is the text currently to be synthesized; it may be a piece of text specified by the user or one generated automatically by a computer.
When there is a speech synthesis requirement, a target spectrum may be generated based on the target text. Specifically, the target text can be mapped to its target spectrum according to a preset mapping relation; for example, the target text can be input into a pre-trained acoustic model, which generates the corresponding target spectrum.
The spectrum in the embodiments of the application is the frequency-domain amplitude spectrum of the sound waveform and does not contain phase information. A specific form of the spectrum may be the mel spectrum.
After the target spectrum generated based on the target text is obtained, the operation of the subsequent steps may be continued.
S120, utilizing a neural network vocoder to infer a target frequency spectrum to obtain a predicted voice waveform;
In the embodiments of the application, a network model can be constructed with a deep neural network and trained on training data to obtain the neural network vocoder, which then has the ability to map a spectrum to the corresponding speech. Specifically, the vocoder may be based on a generative adversarial network (Generative Adversarial Network, GAN), such as MelGAN (a non-autoregressive feed-forward convolutional architecture) or HiFi-GAN (a GAN-based model capable of efficiently generating high-fidelity speech), or based on a recurrent neural network (Recurrent Neural Network, RNN), such as WaveRNN (a deep architecture for speech synthesis).
The training data may include a plurality of pre-collected information pairs; each pair may contain a spectrum generated from text and the corresponding speech waveform, or a spectrum converted from a speech waveform together with that waveform. For example, a section of an audiobook can be divided into audio segments and the spectrum of each segment extracted; the waveform/spectrum pairs are used to train the neural network vocoder, and the trained vocoder is then used to predict the speech waveform corresponding to the target spectrum.
During model training of the neural network vocoder, the speech can be divided into several frequency bands in a multi-subband manner, each subband trained and inferred separately, and all subbands finally merged. In addition, the speech waveforms in the training data can be amplitude-normalized, for example by bringing the maximum amplitude to 0 dB and then uniformly attenuating by a set value, e.g. 6 dB. The vocoder then normally outputs predicted speech waveforms with amplitudes below -6 dB during use. On the one hand this avoids overload in the predicted waveform; on the other hand it guarantees sufficient dynamic range for energy detection.
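As an illustration, the normalization step described above can be sketched in Python. This is a minimal sketch under the stated assumptions (peak brought to 0 dBFS, then a fixed attenuation such as the 6 dB example given in the text), not the patent's actual implementation:

```python
import numpy as np

def normalize_training_waveform(wav: np.ndarray, headroom_db: float = 6.0) -> np.ndarray:
    """Peak-normalize a training waveform to 0 dBFS, then attenuate by a
    fixed headroom (6 dB in the text's example) so the vocoder learns to
    emit waveforms whose amplitude stays below -headroom_db dBFS."""
    peak = np.max(np.abs(wav))
    if peak == 0.0:
        return wav                         # all-silent clip: nothing to scale
    gain = 10.0 ** (-headroom_db / 20.0)   # -6 dB -> factor of about 0.501
    return wav / peak * gain
```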
After training to obtain the neural network vocoder, the neural network vocoder can be utilized to infer the target frequency spectrum, so as to obtain the predicted voice waveform.
S130, performing pop detection on the speech waveform to judge whether a pop is present; if so, returning to S120; otherwise, proceeding to S140;
After the target spectrum generated from the target text has been obtained and inferred by the neural network vocoder to yield the predicted speech waveform, the waveform can further undergo pop detection.
A pop is a sudden, excessive burst in volume that carries no intelligible content.
If the predicted speech waveform obtained by vocoder inference contains a pop and the corresponding speech is output directly, the user's listening experience is impaired.
Therefore, the present application does not directly output the speech corresponding to the waveform, but first performs pop detection on it. If a pop is detected in the speech waveform, the step of inferring the target spectrum with the neural network vocoder to obtain a predicted speech waveform is repeated, as shown in fig. 1.
It will be appreciated that the neural network vocoder operates on probabilistic principles: for the same input, the sample values of the predicted waveform differ from run to run, yet the audible speech content is the same, and successive predictions are similar but not identical. In other words, the vocoder has the working characteristic that the predicted waveforms differ while the speech content is unaffected. On this basis, when a pop is detected in a predicted waveform, the inference step can simply be repeated, yielding a waveform similar to but different from the previous one, and pop detection can continue on the newly obtained prediction.
After the pop sound detection is performed on the voice waveform, if the existence of the pop sound of the voice waveform is not detected, the operation of step S140 may be performed.
And S140, outputting the voice corresponding to the voice waveform without the pop sound.
After the neural network vocoder has inferred the target spectrum to obtain the predicted speech waveform and pop detection has been performed, if no pop is detected, the pop-free waveform can be packaged into an audio file in a preset encapsulation format, and the file decoded into speech for playback in application scenarios such as user terminals and robots.
By applying the method provided by the embodiments of the application, after the neural network vocoder infers the target spectrum to obtain a predicted speech waveform, the corresponding speech is not output directly; instead, pop detection is first performed, and the speech is output only if no pop is present. If a pop is present, the vocoder infers the target spectrum again, the newly obtained waveform is again checked for pops, and this repeats until a pop-free predicted waveform is obtained and its corresponding speech is output. Consequently, the finally output speech contains no pops, is stable and reliable, and the quality of the synthesized speech is improved.
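The S110–S140 loop can be sketched as follows. Here `vocoder_infer` and `has_pop` are hypothetical stand-ins for the vocoder inference and pop-detection steps; the patent's loop has no retry cap, so one is added here only so the sketch always terminates:

```python
def synthesize_waveform(spectrum, vocoder_infer, has_pop, max_retries: int = 10):
    """Repeat vocoder inference (S120) and pop detection (S130) until a
    pop-free waveform is produced (S140). Vocoder inference is stochastic,
    so each retry yields a similar but not identical waveform."""
    for _ in range(max_retries):
        waveform = vocoder_infer(spectrum)   # S120: predict waveform
        if not has_pop(waveform):            # S130: pop detection passed
            return waveform                  # S140: output this waveform
    raise RuntimeError("pop still detected after max_retries inferences")
```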
In one embodiment of the present application, before performing inference on the target spectrum with the neural network vocoder, the method may further comprise the following steps:
step one, detecting the amplitudes of the target spectrum;
step two, adjusting any detected amplitude that falls outside a preset amplitude range to lie within that range.
For ease of description, the two steps described above are combined.
In the embodiments of the application, after the target spectrum generated from the target text is obtained, amplitude detection can further be performed on it to determine whether any amplitude falls outside a preset amplitude range. The range can be set and adjusted according to the actual situation, which the embodiments of the application do not limit.
After the amplitude detection, any detected amplitude outside the preset range may be adjusted into the range. Specifically, an amplitude exceeding the upper limit of the range may be replaced with the upper limit, and an amplitude below the lower limit with the lower limit.
After the out-of-range amplitudes in the target spectrum have been adjusted, the spectrum is input into the neural network vocoder for inference, yielding the predicted speech waveform. This effectively avoids inaccurate inference caused by excessively large or small amplitudes in the spectrum fed to the vocoder.
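A minimal sketch of this clipping step (illustrative only; the lower and upper limits of the amplitude range are application-dependent, as the text notes):

```python
import numpy as np

def clip_spectrum_amplitudes(spectrum: np.ndarray, lower: float, upper: float) -> np.ndarray:
    """Adjust out-of-range spectral amplitudes into [lower, upper]:
    values above the upper limit are replaced by the upper limit,
    values below the lower limit by the lower limit."""
    return np.clip(spectrum, lower, upper)
```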
In one embodiment of the present application, the pop detection of a voice waveform may include the steps of:
performing pop detection on the speech waveform in the time domain according to the absolute value of the amplitude and the relative rate of change of the amplitude of the waveform in the time domain.
In the embodiments of the application, after the neural network vocoder has inferred the target spectrum to obtain the predicted speech waveform, the waveform can undergo time-domain pop detection based on the absolute amplitudes and their relative rate of change, to determine whether a pop exists in the time domain.
In the time domain, a pop manifests as high energy and an abrupt change of energy over time. The absolute value of the amplitude reflects the volume of the speech, and the relative rate of change of the amplitude reflects how sharply the volume fluctuates. In the embodiments of the application, whether a pop exists in the time domain can therefore be determined by examining the waveform's amplitudes and their variation over time.
In a specific embodiment of the application, whether a pop exists in the speech waveform in the time domain is determined according to the absolute value of the amplitude at each sampling point, and/or the absolute value of the amplitude difference between every two adjacent sampling points, and/or the absolute value of the amplitude difference between two sampling points separated by one or more sampling points.
Specifically, for each sampling point of the speech waveform in the time domain: if the absolute value of its amplitude exceeds a preset first threshold, a pop is judged to exist in the time domain; otherwise, if the absolute value of the amplitude difference between the point and the immediately following point exceeds a preset second threshold, a pop is judged to exist; otherwise, if the absolute value of the amplitude difference between the point and the second following point (one point apart) exceeds a preset third threshold, a pop is judged to exist; otherwise, if the absolute value of the amplitude difference between the point and the third following point (two points apart) exceeds a preset fourth threshold, a pop is judged to exist. If none of the four conditions holds at any sampling point, the waveform is judged to have no pop in the time domain.
The above procedure can be expressed by the following time domain detection formula:
ψ=Count:{||Yn||>α};
Δ1=Count:{||Yn+1-Yn||>β};
Δ2=Count:{||Yn+2-Yn||>γ};
Δ3=Count:{||Yn+3-Yn||>δ};
where ||Yn|| denotes the absolute value of the amplitude of the n-th sampling point of the speech waveform in the time domain. The first formula counts the sampling points whose amplitude, in absolute value, exceeds the first threshold α; the second counts the pairs of adjacent sampling points whose amplitude difference exceeds the second threshold β; the third counts the pairs of sampling points one point apart whose amplitude difference exceeds the third threshold γ; and the fourth counts the pairs of sampling points two points apart whose amplitude difference exceeds the fourth threshold δ. The second to fourth formulas reflect the rate of energy change. A pop may be judged to exist in the time domain as soon as any sample exceeds the corresponding threshold, or, alternatively, only when the number of samples exceeding the corresponding threshold reaches a certain count.
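The four counting formulas can be implemented directly, for example as below. This is an illustrative sketch; the threshold values and the minimum count are configuration choices the patent leaves open:

```python
import numpy as np

def timedomain_pop_counts(y: np.ndarray, alpha: float, beta: float,
                          gamma: float, delta: float) -> dict:
    """Counts from the four time-domain detection formulas:
    psi = #{ ||Y[n]||        > alpha }   (absolute amplitude)
    d1  = #{ ||Y[n+1]-Y[n]|| > beta  }   (adjacent points)
    d2  = #{ ||Y[n+2]-Y[n]|| > gamma }   (one point apart)
    d3  = #{ ||Y[n+3]-Y[n]|| > delta }   (two points apart)"""
    return {
        "psi": int(np.sum(np.abs(y) > alpha)),
        "d1":  int(np.sum(np.abs(y[1:] - y[:-1]) > beta)),
        "d2":  int(np.sum(np.abs(y[2:] - y[:-2]) > gamma)),
        "d3":  int(np.sum(np.abs(y[3:] - y[:-3]) > delta)),
    }

def has_pop_time_domain(y, alpha, beta, gamma, delta, min_count: int = 1) -> bool:
    """Pop declared once any count reaches min_count (min_count=1 means a
    single offending sample suffices, the stricter of the two options)."""
    c = timedomain_pop_counts(np.asarray(y, dtype=float), alpha, beta, gamma, delta)
    return any(v >= min_count for v in c.values())
```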
In one embodiment of the present application, after performing the pop detection on the voice waveform in the time domain, the method may include the steps of:
And if the voice waveform does not have the pop sound in the time domain, performing the pop sound detection in the frequency domain on the voice waveform according to the energy of the voice waveform in each sub-band frequency interval in the frequency domain.
After the time-domain pop detection, if it is determined that the waveform contains no pop in the time domain, frequency-domain pop detection can further be performed according to the waveform's energy in each sub-band frequency interval, improving the accuracy of pop detection.
In the frequency domain, a pop appears as low energy in some sub-band frequency intervals but very high energy in others. Based on this, frequency-domain pop detection can be performed by analysing how the energy varies across the frequency sub-bands; examining the sub-band energy changes screens out additional potential pops.
In a specific embodiment of the application, it is judged whether the absolute value of the energy difference between every two adjacent sub-band frequency intervals of the speech waveform exceeds a preset energy-difference threshold; if yes, a pop is judged to exist in the frequency domain; if not, no pop is judged to exist in the frequency domain.
Specifically, the energy of the speech waveform in each sub-band frequency interval in the frequency domain can first be calculated. The calculation can be based on the following formula:
Hband = Σ (freq ∈ band) ||Ffreq||²;
where freq denotes any frequency within a sub-band band (for example, the range 100 Hz to 1000 Hz is one sub-band) and Ffreq is the spectral amplitude at freq within that sub-band, so Hband is the total energy of the sub-band.
After the sub-band energies have been calculated, it can further be determined whether the absolute value of the energy difference between every two adjacent sub-band frequency intervals exceeds the preset energy-difference threshold: if any difference exceeds it, a pop is judged to exist in the frequency domain; if every difference is at or below it, no pop is judged to exist. Specifically, the following frequency-domain detection formula can be used.
Δ4=Count:{||Hband+1-Hband||>ε};
This formula counts the pairs of adjacent sub-band frequency intervals whose energy difference, in absolute value, exceeds the energy-difference threshold ε, for example whether the absolute difference between the energy of the 100–1000 Hz sub-band and that of the 1000–2000 Hz sub-band exceeds ε.
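A sketch of the frequency-domain check (illustrative; the sub-band edges and the threshold ε are configuration choices, and computing sub-band energies from an FFT of the waveform is one plausible realization of the sub-band energy comparison described above):

```python
import numpy as np

def subband_energies(y: np.ndarray, sample_rate: int, band_edges) -> np.ndarray:
    """Hband = sum of squared spectral magnitudes ||Ffreq||^2 within each
    sub-band interval (edges given in Hz), from a real FFT of the waveform."""
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sample_rate)
    return np.array([np.sum(spec[(freqs >= lo) & (freqs < hi)] ** 2)
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

def has_pop_freq_domain(y, sample_rate, band_edges, energy_diff_threshold) -> bool:
    """Pop declared when the absolute energy difference between any two
    adjacent sub-bands exceeds the threshold epsilon."""
    h = subband_energies(np.asarray(y, dtype=float), sample_rate, band_edges)
    return bool(np.any(np.abs(np.diff(h)) > energy_diff_threshold))
```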
In one embodiment of the present application, when no pop exists in the speech waveform, before outputting the speech corresponding to the pop-free waveform, the method further includes:
performing silence detection on the speech corresponding to the pop-free speech waveform;
and if a target silence whose duration exceeds a preset duration threshold is detected in the speech, processing the target silence in the speech.
If silence remains in the speech corresponding to the pop-free waveform, playback will contain long silent stretches that impair the listening experience. Therefore, in the embodiments of the application, when pop detection finds no pop in the waveform, silence detection can further be performed on the corresponding speech. If silence is detected, it can be further judged whether its duration exceeds a preset duration threshold: a silence at or below the threshold can be considered tolerable, while a longer one is processed.
Specifically, the voice may be analyzed with a voice activity detection (Voice Activity Detection, VAD) technique to locate silent segments in which no one is speaking. Processing the target silence in the voice then includes truncating the target silence, or replacing it with a set silence whose duration is smaller than the preset duration threshold.
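A minimal sketch of this truncation/replacement step is shown below. A simple amplitude gate stands in for a real VAD, and the function name, silence threshold, and durations are illustrative assumptions:

```python
import numpy as np

def shorten_long_silences(samples, sample_rate, silence_thresh=1e-3,
                          max_silence_sec=0.5, keep_sec=0.2):
    """Replace any silent run longer than max_silence_sec with a shorter one.

    A sample is treated as silent when its absolute amplitude is below
    silence_thresh; a real system would use a proper VAD instead of this
    simple amplitude gate.
    """
    max_len = int(max_silence_sec * sample_rate)
    keep_len = int(keep_sec * sample_rate)   # replacement is shorter than the threshold
    silent = np.abs(samples) < silence_thresh
    out, run = [], []
    for s, is_sil in zip(samples, silent):
        if is_sil:
            run.append(s)                    # accumulate the current silent run
        else:
            if len(run) > max_len:
                out.extend([0.0] * keep_len)  # replace over-long silence
            else:
                out.extend(run)               # tolerable silence: keep as-is
            run = []
            out.append(s)
    # Handle a silent run at the very end of the signal.
    out.extend(run if len(run) <= max_len else [0.0] * keep_len)
    return np.array(out)
```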
Processing the silence in the voice corresponding to the pop-free voice waveform before outputting it effectively avoids long pauses in the final listening experience.
Of course, in practical applications, whether to perform silence processing may be controlled by an on/off switch, and the silence processing operation may also be skipped when background music is to be added to the voice.
The technical scheme provided by the embodiment of the application improves the stability of the synthesized voice in terms of sound quality and avoids the harshness caused by pop sounds; at the same time it reduces long stretches of silence, so that the synthesized voice output to the user is more reliable and the user obtains a better listening experience.
In the following, a specific application embodiment of the speech synthesis method provided by the present application is described. When a novel needs to be read aloud as speech, the text of the novel is input into an acoustic model to generate a frequency spectrum, and the frequency spectrum is then input into a neural network vocoder based on a generative adversarial network to predict a voice waveform. For the voice waveform predicted by the neural network vocoder, it is first determined whether the absolute value of the amplitude of any sampling point in the time domain is greater than a preset first threshold; if so, the voice waveform is determined to have a pop sound in the time domain. If not, it is further determined whether the absolute value of the amplitude difference between each sampling point and the adjacent subsequent sampling point is greater than a preset second threshold; if so, the voice waveform is determined to have a pop sound in the time domain. If not, it is further determined whether the absolute value of the amplitude difference between each sampling point and the second subsequent sampling point (with one sampling point between them) is greater than a preset third threshold; if so, the voice waveform is determined to have a pop sound in the time domain. If not, it is finally determined whether the absolute value of the amplitude difference between each sampling point and the third subsequent sampling point (with two sampling points between them) is greater than a preset fourth threshold; if so, the voice waveform is determined to have a pop sound in the time domain, and if not, the voice waveform is determined to have no pop sound in the time domain.
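The four time-domain checks above can be sketched as follows. All threshold values here are illustrative assumptions, since the text only says each threshold is preset:

```python
import numpy as np

def has_time_domain_pop(x, amp_thresh=0.99, d1_thresh=0.5,
                        d2_thresh=0.6, d3_thresh=0.7):
    """Apply the four time-domain checks described above, in order."""
    x = np.asarray(x, dtype=float)
    # Check 1: absolute amplitude of any sampling point exceeds the first threshold.
    if np.any(np.abs(x) > amp_thresh):
        return True
    # Check 2: amplitude jump between adjacent sampling points exceeds the second threshold.
    if x.size > 1 and np.any(np.abs(x[1:] - x[:-1]) > d1_thresh):
        return True
    # Check 3: jump between sampling points separated by one point exceeds the third threshold.
    if x.size > 2 and np.any(np.abs(x[2:] - x[:-2]) > d2_thresh):
        return True
    # Check 4: jump between sampling points separated by two points exceeds the fourth threshold.
    if x.size > 3 and np.any(np.abs(x[3:] - x[:-3]) > d3_thresh):
        return True
    return False
```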
If the voice waveform has a pop sound in the time domain, the neural network vocoder predicts the voice waveform based on the frequency spectrum again, and whether the new voice waveform has a pop sound in the time domain is detected in the above manner. If the voice waveform has no pop sound in the time domain, time-frequency transformation is performed on the voice waveform, the energy in each sub-band frequency interval of the voice waveform in the frequency domain is calculated, and it is determined whether the absolute value of the energy difference between every two adjacent sub-band frequency intervals is greater than a preset fifth threshold. If any such absolute difference is greater than the fifth threshold, the voice waveform is determined to have a pop sound in the frequency domain; if the absolute difference for every pair of adjacent sub-band frequency intervals is less than or equal to the fifth threshold, the voice waveform is determined to have no pop sound in the frequency domain.
If the voice waveform has a pop sound in the frequency domain, the neural network vocoder again predicts the voice waveform based on the frequency spectrum, and the detection described above is repeated starting from the time domain. If the voice waveform has no pop sound in the frequency domain, the voice waveform is packaged into an audio file in MP3 format, and the audio file is then decoded into PCM (Pulse Code Modulation) audio data for playback, achieving the effect of reading the novel aloud.
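The predict-and-recheck loop of this embodiment can be sketched as below. The acoustic_model, vocoder, and detect_pop callables are stand-ins for the real components, and the retry cap is an added safeguard not mentioned in the text:

```python
def synthesize(text, acoustic_model, vocoder, detect_pop, max_retries=5):
    """Re-run the vocoder until pop detection passes, then return the waveform."""
    spectrum = acoustic_model(text)          # text -> target spectrum
    for _ in range(max_retries):
        waveform = vocoder(spectrum)         # e.g. a GAN-based neural vocoder
        if not detect_pop(waveform):
            return waveform                  # pop-free: safe to output
    raise RuntimeError("could not produce a pop-free waveform")
```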
Corresponding to the above method embodiments, the embodiments of the present application further provide a speech synthesis apparatus. The speech synthesis apparatus described below and the speech synthesis method described above may be referred to in correspondence with each other.
Referring to fig. 2, the apparatus may include the following modules:
a target spectrum obtaining module 210, configured to obtain a target text, and input the target text into a target spectrum generated by an acoustic model;
a speech waveform prediction module 220, configured to infer a target spectrum by using a neural network vocoder, so as to obtain a predicted speech waveform;
the pop sound detection module 230 is used for performing pop sound detection on the voice waveform, triggering the work flow of the voice waveform prediction module 220 if a pop sound exists in the voice waveform, until no pop sound exists in the voice waveform, and triggering the work flow of the synthesized voice output module 240 if no pop sound exists in the voice waveform;
The synthesized voice output module 240 is configured to output a voice corresponding to a voice waveform without pop.
With the device provided by the embodiment of the application, after the target frequency spectrum is inferred by the neural network vocoder to obtain the predicted voice waveform, the voice corresponding to the voice waveform is not output directly; instead, pop detection is performed on the voice waveform, and the corresponding voice is output only when no pop exists. When a pop exists, the neural network vocoder is used to infer the target frequency spectrum again to obtain a new predicted voice waveform, and pop detection is performed on the newly obtained waveform, until no pop is detected in a predicted voice waveform, at which point the voice corresponding to that pop-free waveform is output. In this way the finally output voice contains no pop, the finally obtained voice is stable and reliable, and the quality of the synthesized voice is improved.
In one embodiment of the present application, the pop sound detection module 230 is configured to:
perform pop sound detection on the voice waveform in the time domain according to the absolute value of the amplitude and the relative change rate of the amplitude of the voice waveform in the time domain.
In one embodiment of the present application, the pop sound detection module 230 is configured to:
determine whether a pop sound exists in the voice waveform in the time domain according to the absolute amplitude value of each sampling point of the voice waveform in the time domain, and/or the absolute value of the amplitude difference between every two adjacent sampling points, and/or the absolute value of the amplitude difference between two sampling points separated by one or more sampling points.
In one embodiment of the present application, the pop sound detection module 230 is further configured to:
After the voice waveform is subjected to the time-domain pop sound detection according to the absolute value of the amplitude and the relative change rate of the amplitude of the voice waveform in the time domain, if the voice waveform does not have the pop sound in the time domain, the voice waveform is subjected to the frequency-domain pop sound detection according to the energy of the voice waveform in each sub-band frequency interval in the frequency domain.
In one embodiment of the present application, the pop sound detection module 230 is configured to:
Judging whether the absolute value of the energy difference value between every two adjacent sub-band frequency intervals of the voice waveform in the frequency domain is larger than a preset energy difference threshold value, if so, judging that the voice waveform has pop sound in the frequency domain, and if not, judging that the voice waveform has no pop sound in the frequency domain.
In a specific embodiment of the present application, the apparatus further includes a mute processing module, configured to:
Before outputting the voice corresponding to the voice waveform without the pop sound, performing silence detection on the voice corresponding to the voice waveform without the pop sound;
and if the target silence with the silence time length longer than the preset time length threshold value is detected to exist in the voice, processing the target silence in the voice.
In a specific embodiment of the present application, the mute processing module is configured to:
The target silence in the voice is truncated,
Or replacing the target silence in the voice with silence with a silence duration smaller than the preset duration threshold value.
In a specific embodiment of the present application, the apparatus further includes an amplitude adjustment module, configured to:
Before a target frequency spectrum is inferred by utilizing a neural network vocoder to obtain a predicted voice waveform, detecting the amplitude of the target frequency spectrum;
And adjusting the detected amplitude exceeding the preset amplitude range to be within the amplitude range.
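A minimal sketch of this adjustment, assuming the target spectrum is held in a NumPy array and that [-4, 4] stands in for the preset amplitude range (the text does not fix concrete bounds):

```python
import numpy as np

def clamp_spectrum(target_spectrum, lo=-4.0, hi=4.0):
    """Pull any spectrum amplitude outside [lo, hi] back into the range.

    The [-4, 4] default is a common normalization range for mel
    spectrograms but is an assumption here; the text only speaks of a
    preset amplitude range.
    """
    return np.clip(target_spectrum, lo, hi)
```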
Corresponding to the above method embodiment, the embodiment of the present application further provides a speech synthesis apparatus, including:
a memory for storing a computer program;
and a processor for implementing the steps of the above-described speech synthesis method when executing the computer program.
As shown in fig. 3, which is a schematic diagram of the composition structure of the speech synthesis apparatus, the speech synthesis apparatus may include a processor 10, a memory 11, a communication interface 12, and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 all complete communication with each other through a communication bus 13.
In an embodiment of the present application, the processor 10 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device, etc.
The processor 10 may call a program stored in the memory 11, and in particular, the processor 10 may perform operations in an embodiment of the speech synthesis method.
The memory 11 is used for storing one or more programs, and the programs may include program codes including computer operation instructions, and in the embodiment of the present application, at least the programs for implementing the following functions are stored in the memory 11:
acquiring a target text, and inputting the target text into a target frequency spectrum generated by an acoustic model;
Reasoning the target frequency spectrum by utilizing a neural network vocoder to obtain a predicted voice waveform;
performing pop sound detection on the voice waveform;
If the voice waveform is detected to have the pop sound, repeating the step of reasoning the target frequency spectrum by utilizing the neural network vocoder to obtain the predicted voice waveform until the voice waveform does not have the pop sound;
and outputting the voice corresponding to the voice waveform without the pop sound.
In one possible implementation, the memory 11 may include a storage program area that may store an operating system, and applications required for at least one function (e.g., spectrum function, pop detection function), etc., and a storage data area that may store data created during use, such as spectrum data, voice data, etc.
In addition, the memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for interfacing with other devices or systems.
Of course, it should be noted that the structure shown in fig. 3 does not limit the speech synthesis apparatus in the embodiment of the present application; in practical applications the speech synthesis apparatus may include more or fewer components than those shown in fig. 3, or some components may be combined.
Corresponding to the above method embodiments, the present application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the above speech synthesis method.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The principles and embodiments of the present application have been described herein with reference to specific examples, but the description of the examples above is only for aiding in understanding the technical solution of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (10)

(Translated from Chinese)

1. A speech synthesis method, comprising:
acquiring a target text, and inputting the target text into an acoustic model to generate a target spectrum;
inferring the target spectrum using a neural network vocoder to obtain a predicted speech waveform;
performing pop detection on the speech waveform;
if a pop is detected in the speech waveform, repeating the step of inferring the target spectrum using the neural network vocoder to obtain a predicted speech waveform, until no pop exists in the speech waveform;
outputting the speech corresponding to the pop-free speech waveform.

2. The speech synthesis method according to claim 1, wherein performing pop detection on the speech waveform comprises:
performing pop detection on the speech waveform in the time domain according to the absolute amplitude value and the relative amplitude change rate of the speech waveform in the time domain.

3. The speech synthesis method according to claim 2, wherein performing pop detection on the speech waveform in the time domain according to the absolute amplitude value and the relative amplitude change rate comprises:
determining whether a pop exists in the speech waveform in the time domain according to the absolute amplitude value of each sampling point of the speech waveform in the time domain, and/or the absolute value of the amplitude difference between every two adjacent sampling points, and/or the absolute value of the amplitude difference between two sampling points separated by one or more sampling points.

4. The speech synthesis method according to claim 2, further comprising, after performing pop detection on the speech waveform in the time domain:
if no pop exists in the speech waveform in the time domain, performing pop detection on the speech waveform in the frequency domain according to the energy in each sub-band frequency interval of the speech waveform in the frequency domain.

5. The speech synthesis method according to claim 4, wherein performing pop detection on the speech waveform in the frequency domain comprises:
determining whether the absolute value of the energy difference between every two adjacent sub-band frequency intervals of the speech waveform in the frequency domain is greater than a preset energy difference threshold;
if so, determining that the speech waveform has a pop in the frequency domain;
if not, determining that the speech waveform has no pop in the frequency domain.

6. The speech synthesis method according to claim 1, further comprising, before outputting the speech corresponding to the pop-free speech waveform:
performing silence detection on the speech corresponding to the pop-free speech waveform;
if target silence whose duration is greater than a preset duration threshold is detected in the speech, processing the target silence in the speech.

7. The speech synthesis method according to claim 6, wherein processing the target silence in the speech comprises:
truncating the target silence in the speech,
or replacing the target silence in the speech with silence whose duration is less than the preset duration threshold.

8. The speech synthesis method according to any one of claims 1 to 7, further comprising, before inferring the target spectrum using the neural network vocoder to obtain the predicted speech waveform:
performing amplitude detection on the target spectrum;
adjusting any detected amplitude exceeding a preset amplitude range into the amplitude range.

9. A speech synthesis device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the speech synthesis method according to any one of claims 1 to 8 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech synthesis method according to any one of claims 1 to 8.
CN202111369331.9A | priority 2021-11-18 | filed 2021-11-18 | A speech synthesis method, device and storage medium | Active | CN113990287B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111369331.9A (CN113990287B (en)) | 2021-11-18 | 2021-11-18 | A speech synthesis method, device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111369331.9A (CN113990287B (en)) | 2021-11-18 | 2021-11-18 | A speech synthesis method, device and storage medium

Publications (2)

Publication Number | Publication Date
CN113990287A (en) | 2022-01-28
CN113990287B (en) | 2025-03-07

Family

ID=79749304

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111369331.9A (Active; CN113990287B (en)) | A speech synthesis method, device and storage medium | 2021-11-18 | 2021-11-18

Country Status (1)

Country | Link
CN (1) | CN113990287B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119601038A (en)* | 2023-09-08 | 2025-03-11 | Beijing Xiaomi Mobile Software Co., Ltd. | Explosive sound detection method, device and storage medium

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN107230474A (en)* | 2017-04-18 | 2017-10-03 | Fujian Tianquan Educational Technology Co., Ltd. | A kind of method and system of Composite tone data
CN110265064A (en)* | 2019-06-12 | 2019-09-20 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Audio sonic boom detection method, device and storage medium

Family Cites Families (6)

Publication number | Priority date | Publication date | Assignee | Title
JPH0279096A (en)* | 1988-09-14 | 1990-03-19 | Oki Electric Ind Co Ltd | Sound synthesizing device
WO2019139430A1 (en)* | 2018-01-11 | 2019-07-18 | Neosapience, Inc. | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
GB201802440D0 (en)* | 2018-02-14 | 2018-03-28 | Jukedeck Ltd | A method of generating music data
CN111798831B (en)* | 2020-06-16 | 2023-11-28 | Wuhan University of Technology | Sound particle synthesis method and device
CN112382257B (en)* | 2020-11-03 | 2023-11-28 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Audio processing method, device, equipment and medium
CN113488007B (en)* | 2021-07-07 | 2024-06-11 | Beijing Lingdongyin Technology Co., Ltd. | Information processing method, information processing device, electronic equipment and storage medium


Also Published As

Publication number | Publication date
CN113990287A (en) | 2022-01-28

Similar Documents

Publication | Title
CN112767963B (en)Voice enhancement method, device and system and computer readable storage medium
CN111149370B (en) Howling Detection in Conference System
US8554550B2 (en)Systems, methods, and apparatus for context processing using multi resolution analysis
JP5453740B2 (en) Speech enhancement device
CN100490314C (en)Audio signal processing for speech communication
JP2002237785A (en)Method for detecting sid frame by compensation of human audibility
CN103177727A (en)Audio frequency band processing method and system
US20240177726A1 (en)Speech enhancement
US11727949B2 (en)Methods and apparatus for reducing stuttering
EP3113183A1 (en)Voice clarification device and computer program therefor
US8223979B2 (en)Enhancement of speech intelligibility in a mobile communication device by controlling operation of a vibrator based on the background noise
CN112565981A (en)Howling suppression method, howling suppression device, hearing aid, and storage medium
CN113990287B (en) A speech synthesis method, device and storage medium
US8165872B2 (en)Method and system for improving speech quality
JP5614261B2 (en) Noise suppression device, noise suppression method, and program
US20240363131A1 (en)Speech enhancement
CN112133320A (en)Voice processing device and voice processing method
TW201606753A (en) Method for estimating noise in an audio signal, a noise estimator, an audio encoder, an audio decoder, and a system for transmitting an audio signal
JP2005189518A (en) Sound / silence determination device and sound / silence determination method
CN1322487C (en)Apparatus and method for voice activity detection
JP3555490B2 (en) Voice conversion system
JP2009265422A (en)Information processing apparatus and information processing method
JP6930089B2 (en) Sound processing method and sound processing equipment
JP2014202777A (en)Generation device and generation method and program for masker sound signal
CN116110424B (en)Voice bandwidth expansion method and related device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
