Deep learning speech synthesis

From Wikipedia, the free encyclopedia
Method of speech synthesis that uses deep neural networks

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or from an acoustic spectrum (vocoder). Deep neural networks are trained on large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

Formulation


Given an input text or some sequence of linguistic units $Y$, the target speech $X$ can be derived by

$X = \arg\max P(X \mid Y, \theta)$

where $\theta$ is the set of model parameters.

Typically, the input text is first passed to an acoustic feature generator, and the acoustic features are then passed to the neural vocoder. For the acoustic feature generator, the loss function is typically an L1 loss (mean absolute error, MAE) or L2 loss (mean squared error, MSE). These loss functions impose a constraint that the output acoustic feature distributions must be Gaussian or Laplacian. In practice, since the human voice band ranges from approximately 300 to 4000 Hz, the loss function is designed to place a greater penalty on this range:

$\text{loss} = \alpha\,\text{loss}_{\text{human}} + (1-\alpha)\,\text{loss}_{\text{other}}$

where $\text{loss}_{\text{human}}$ is the loss over the human voice band and $\alpha$ is a scalar weight, typically around 0.5. The acoustic feature is typically a spectrogram or a mel-scale spectrogram. These features capture the time-frequency structure of the speech signal and are therefore sufficient to generate intelligible output. The mel-frequency cepstrum features used in speech recognition are not suitable for speech synthesis, as they discard too much information.
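A minimal sketch of such a band-weighted loss, assuming linear-frequency magnitude spectrograms, an L1 base loss, and an illustrative sample rate; none of these choices are prescribed by any particular system:

```python
import torch

def band_weighted_l1_loss(pred_spec, target_spec, sample_rate=16000,
                          alpha=0.5, band=(300.0, 4000.0)):
    """L1 loss that weights the ~300-4000 Hz voice band more heavily.

    pred_spec, target_spec: (batch, n_freq_bins, time) linear-frequency
    magnitude spectrograms, with frequency bins spanning 0..sample_rate/2.
    """
    n_bins = pred_spec.shape[1]
    # Center frequency (Hz) of each spectrogram bin.
    freqs = torch.linspace(0, sample_rate / 2, n_bins, device=pred_spec.device)
    in_band = (freqs >= band[0]) & (freqs <= band[1])

    per_bin_err = (pred_spec - target_spec).abs()        # (B, F, T)
    loss_human = per_bin_err[:, in_band, :].mean()       # voice band
    loss_other = per_bin_err[:, ~in_band, :].mean()      # everything else
    return alpha * loss_human + (1 - alpha) * loss_other
```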

History

For broader coverage of this topic, see History of speech synthesis.
A stack of dilated causal convolutional layers used in WaveNet[1]

In September 2016, DeepMind released WaveNet, which demonstrated that deep learning-based models are capable of modeling raw waveforms and generating speech from acoustic features such as spectrograms or mel-spectrograms. Although WaveNet was initially considered too computationally expensive and slow to be used in consumer products at the time, a year after its release DeepMind unveiled a modified version known as Parallel WaveNet, a production model 1,000 times faster than the original.[1]

A comparison of the alignments (attentions) between Tacotron and a modified variant of Tacotron

This was followed by Google AI's Tacotron 2 in 2018, which demonstrated that neural networks could produce highly natural speech synthesis but required substantial training data, typically tens of hours of audio, to achieve acceptable quality. Tacotron 2 used an encoder-decoder architecture with attention mechanisms to convert input text into mel-spectrograms, which were then converted to waveforms using a separate neural vocoder. When trained on smaller datasets, such as 2 hours of speech, the output quality degraded while remaining intelligible, and with just 24 minutes of training data, Tacotron 2 failed to produce intelligible speech.[2]

In 2019, Microsoft Research introduced FastSpeech, which addressed speed limitations in autoregressive models like Tacotron 2.[3] FastSpeech utilized a non-autoregressive architecture that enabled parallel sequence generation, significantly reducing inference time while maintaining audio quality. Its feedforward transformer network with length regulation allowed for one-shot prediction of the full mel-spectrogram sequence, avoiding the sequential dependencies that bottlenecked previous approaches.[3] The same year saw the release of HiFi-GAN, a generative adversarial network (GAN)-based vocoder that improved the efficiency of waveform generation while producing high-fidelity speech.[4] In 2020, the release of Glow-TTS introduced a flow-based approach that allowed for fast inference and voice style transfer capabilities.[5]
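A minimal sketch of the length-regulation idea behind this kind of non-autoregressive model: each phoneme's encoder state is repeated according to a predicted duration, so a decoder can emit every mel frame in parallel. The function name and shapes below are illustrative, not FastSpeech's actual code.

```python
import torch

def length_regulate(phoneme_hidden, durations):
    """Expand phoneme-level hidden states to frame-level hidden states.

    phoneme_hidden: (n_phonemes, hidden_dim) encoder outputs.
    durations:      (n_phonemes,) integer frame counts per phoneme,
                    e.g. produced by a small duration predictor.
    Returns a (sum(durations), hidden_dim) sequence that a
    non-autoregressive decoder can map to mel frames in one pass.
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: 3 phonemes with hidden size 4 and durations of 2, 3, and 1 frames.
hidden = torch.randn(3, 4)
frames = length_regulate(hidden, torch.tensor([2, 3, 1]))
print(frames.shape)  # torch.Size([6, 4])
```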

In March 2020, the free text-to-speech website 15.ai was launched. 15.ai gained widespread international attention in early 2021 for its ability to synthesize emotionally expressive speech of fictional characters from popular media with a minimal amount of data.[6][7][8] The creator of 15.ai (known pseudonymously as 15) stated that 15 seconds of training data is sufficient to perfectly clone a person's voice (hence the name "15.ai"), a significant reduction from the previously known data requirement of tens of hours.[9] 15.ai is credited as the first platform to popularize AI voice cloning in memes and content creation.[10][11][9] 15.ai used a multi-speaker model that enabled simultaneous training of multiple voices and emotions, implemented sentiment analysis using DeepMoji, and supported precise pronunciation control via ARPABET.[9][6] The 15-second data efficiency benchmark was later corroborated by OpenAI in 2024.[12]

Semi-supervised learning


Currently, self-supervised learning has gained much attention through better use of unlabelled data. Research has shown that, with the aid of a self-supervised loss, the need for paired data decreases.[13][14]

Zero-shot speaker adaptation


Zero-shot speaker adaptation is promising because a single model can generate speech with various speaker styles and characteristics. In June 2018, Google proposed using a pre-trained speaker verification model as a speaker encoder to extract speaker embeddings.[15] The speaker encoder then becomes part of the neural text-to-speech model, which uses the embedding to determine the style and characteristics of the output speech. This work showed the community that it is possible to use a single model to generate speech in multiple styles.
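A minimal sketch of this conditioning scheme, in which a frozen, pre-trained speaker encoder produces an embedding from reference audio that is concatenated to the text encoder's states. The module names and shapes are illustrative placeholders, not the actual implementation from the cited work.

```python
import torch
import torch.nn as nn

class SpeakerConditionedTTS(nn.Module):
    """Illustrative multi-speaker TTS conditioning scheme."""

    def __init__(self, text_encoder, speaker_encoder, decoder):
        super().__init__()
        self.text_encoder = text_encoder        # text tokens -> (T_text, d_text)
        self.speaker_encoder = speaker_encoder  # reference audio -> (d_spk,)
        self.decoder = decoder                  # conditioned states -> mel frames
        # The speaker encoder is pre-trained (e.g. on speaker verification)
        # and kept frozen while the text-to-speech model is trained.
        for p in self.speaker_encoder.parameters():
            p.requires_grad = False

    def forward(self, text, reference_audio):
        text_states = self.text_encoder(text)              # (T_text, d_text)
        spk_embed = self.speaker_encoder(reference_audio)  # (d_spk,)
        # Broadcast the speaker embedding across every text timestep and
        # concatenate it with the encoder states before decoding.
        spk_tiled = spk_embed.unsqueeze(0).expand(text_states.size(0), -1)
        conditioned = torch.cat([text_states, spk_tiled], dim=-1)
        return self.decoder(conditioned)                    # mel-spectrogram
```

At inference time, passing a few seconds of audio from an unseen speaker as `reference_audio` yields speech in that speaker's style without retraining, which is the zero-shot property described above.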

Neural vocoder

Speech synthesis example using the HiFi-GAN neural vocoder

In deep learning-based speech synthesis, neural vocoders play an important role in generating high-quality speech from acoustic features. The WaveNet model proposed in 2016 achieves excellent speech quality. WaveNet factorises the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of conditional probabilities as follows:

$p_{\theta}(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$

where $\theta$ denotes the model parameters, which include many dilated convolution layers. Thus, each audio sample $x_t$ is conditioned on the samples at all previous timesteps. However, the autoregressive nature of WaveNet makes inference dramatically slow. To solve this problem, Parallel WaveNet[16] was proposed. Parallel WaveNet is an inverse autoregressive flow-based model trained by knowledge distillation from a pre-trained teacher WaveNet model. Since such inverse autoregressive flow-based models are non-autoregressive at inference time, their inference speed is faster than real-time. Meanwhile, Nvidia proposed the flow-based WaveGlow[17] model, which can also generate speech faster than real-time. However, despite their high inference speed, these models have limitations: Parallel WaveNet requires a pre-trained teacher WaveNet model, and WaveGlow takes many weeks to converge on limited computing hardware. Both issues are addressed by Parallel WaveGAN,[18] which learns to produce speech through a multi-resolution spectral loss and GAN training strategies.
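A toy sketch of this autoregressive factorisation using a small stack of dilated causal convolutions over quantized samples; the layer sizes and names are illustrative and do not reproduce DeepMind's actual WaveNet architecture (which also uses gated activations, residual connections, and conditioning inputs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalWaveModel(nn.Module):
    """Toy WaveNet-style model: p(x) = prod_t p(x_t | x_1, ..., x_{t-1})."""

    def __init__(self, n_quant=256, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Embedding(n_quant, channels)
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations
        ])
        self.out = nn.Conv1d(channels, n_quant, kernel_size=1)

    def forward(self, x):
        # x: (batch, T) integer sample indices in [0, n_quant).
        h = self.embed(x).transpose(1, 2)                # (B, C, T)
        for conv in self.convs:
            # Left-pad so each convolution is causal: the output at time t
            # only depends on inputs at times <= t.
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            h = torch.relu(conv(F.pad(h, (pad, 0))))
        return self.out(h)                               # (B, n_quant, T) logits

model = TinyCausalWaveModel()
x = torch.randint(0, 256, (1, 100))
logits = model(x)          # logits[..., t] parameterises p(x_{t+1} | x_{<=t})
print(logits.shape)        # torch.Size([1, 256, 100])
```

Sampling from such a model requires one forward pass per output sample, which is why autoregressive vocoders are slow at inference and why distilled or flow-based alternatives such as Parallel WaveNet and WaveGlow were developed.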

The Chaos (short version) synthesized by VITS, a research deep-learning-based end-to-end text-to-speech method, using the LJ Speech dataset.


References

  1. ^ a b van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  2. ^ "Audio samples from "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis"". 2018-08-30. Archived from the original on 2020-11-11. Retrieved 2022-06-05.
  3. ^ a b Ren, Yi (2019). "FastSpeech: Fast, Robust and Controllable Text to Speech". arXiv:1905.09263 [cs.CL].
  4. ^ Kong, Jungil (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". arXiv:2010.05646 [cs.SD].
  5. ^ Kim, Jaehyeon (2020). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". arXiv:2005.11129 [eess.AS].
  6. ^ a b Kurosawa, Yuki (January 19, 2021). "ゲームキャラ音声読み上げソフト「15.ai」公開中。『Undertale』や『Portal』のキャラに好きなセリフを言ってもらえる" [Game Character Voice Reading Software "15.ai" Now Available. Get Characters from Undertale and Portal to Say Your Desired Lines]. AUTOMATON (in Japanese). Archived from the original on January 19, 2021. Retrieved December 18, 2024.
  7. ^ 遊戲角落 (January 20, 2021). "這個AI語音可以模仿《傳送門》GLaDOS講出任何對白!連《Undertale》都可以學" [This AI Voice Can Imitate Portal's GLaDOS Saying Any Dialog! It Can Even Learn Undertale]. United Daily News (in Chinese (Taiwan)). Archived from the original on December 19, 2024. Retrieved December 18, 2024.
  8. ^ Lamorlette, Robin (January 25, 2021). "Insolite : un site permet de faire dire ce que vous souhaitez à GlaDOS (et à d'autres personnages de jeux vidéo)" [Unusual: A site lets you make GLaDOS (and other video game characters) say whatever you want]. Clubic (in French). Archived from the original on January 19, 2025. Retrieved March 23, 2025.
  9. ^ a b c Temitope, Yusuf (December 10, 2024). "15.ai Creator reveals journey from MIT Project to internet phenomenon". The Guardian. Archived from the original on December 28, 2024. Retrieved December 25, 2024.
  10. ^ Anirudh VK (March 18, 2023). "Deepfakes Are Elevating Meme Culture, But At What Cost?". Analytics India Magazine. Archived from the original on December 26, 2024. Retrieved December 18, 2024.
  11. ^ Wright, Steven (March 21, 2023). "Why Biden, Trump, and Obama Arguing Over Video Games Is YouTube's New Obsession". Inverse. Archived from the original on December 20, 2024. Retrieved December 18, 2024.
  12. ^ "Navigating the Challenges and Opportunities of Synthetic Voices". OpenAI. March 9, 2024. Archived from the original on November 25, 2024. Retrieved December 18, 2024.
  13. ^ Chung, Yu-An (2018). "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis". arXiv:1808.10128 [cs.CL].
  14. ^ Ren, Yi (2019). "Almost Unsupervised Text to Speech and Automatic Speech Recognition". arXiv:1905.06791 [cs.CL].
  15. ^ Jia, Ye (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis". arXiv:1806.04558 [cs.CL].
  16. ^ van den Oord, Aaron (2018). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv:1711.10433 [cs.CL].
  17. ^ Prenger, Ryan (2018). "WaveGlow: A Flow-based Generative Network for Speech Synthesis". arXiv:1811.00002 [cs.SD].
  18. ^ Yamamoto, Ryuichi (2019). "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram". arXiv:1910.11480 [eess.AS].