Standard and WaveNet voices

Overview

Text-to-Speech creates raw audio data of natural, human speech. That is, it creates audio that sounds like a person talking. When you send a synthesis request to Text-to-Speech, you must specify a voice that 'speaks' the words.

There is a wide selection of voices available for you to pick from in Text-to-Speech. The voices differ by language, gender, and accent (for some languages). Some languages have multiple voices to choose from. See the Supported Voices page for a complete list of voices available in your language. You can tell Text-to-Speech to use a specific voice from this list by setting the VoiceSelectionParams fields when you send a request to the API. See the Text-to-Speech Quickstarts for details on how to send a synthesize request.
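As a rough illustration of the voice selection described above, the following Python sketch builds the JSON body for a REST synthesize request. The helper name build_synthesize_request and the example voice name are assumptions made for illustration; the field names (input, voice, languageCode, name, ssmlGender, audioConfig, audioEncoding) follow the JSON form of the v1 text:synthesize request. The sketch only constructs the body and does not call the API.

```python
import json


def build_synthesize_request(text, language_code, voice_name=None,
                             ssml_gender=None, audio_encoding="MP3"):
    """Build a JSON-serializable body for a text:synthesize request.

    Hypothetical helper: it only assembles the dictionary and sends nothing.
    """
    # The "voice" object corresponds to VoiceSelectionParams.
    voice = {"languageCode": language_code}
    if voice_name is not None:
        voice["name"] = voice_name          # a name from the Supported Voices list
    if ssml_gender is not None:
        voice["ssmlGender"] = ssml_gender   # for example "FEMALE", "MALE", "NEUTRAL"

    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {"audioEncoding": audio_encoding},
    }


if __name__ == "__main__":
    # The voice name below is an assumed example; check the Supported Voices
    # page for the names actually available in your language.
    body = build_synthesize_request(
        "Hello, world!", "en-US", voice_name="en-US-Wavenet-D")
    print(json.dumps(body, indent=2))
```

If only languageCode is set, the choice of the specific voice is left to the service, so set name when you need a particular voice from the list.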

Standard voices

The voices offered by Text-to-Speech differ in how they are produced, that is, in the synthetic speech technology used to create the machine model of the voice. One common speech technology, parametric text-to-speech, typically generates audio data by passing outputs through signal processing algorithms known as vocoders. Many of the standard voices available in Text-to-Speech use a variation of this technology.

WaveNet voices

The Text-to-Speech API also offers a group of premium voices generated using a WaveNet model, the same technology used to produce speech for Google Assistant, Google Search, and Google Translate. WaveNet technology provides more than just a series of synthetic voices: it represents a new way of creating synthetic speech.

Note: Check the table of supported voices for availability of WaveNet-generated voices in specific languages. The Text-to-Speech API does not provide access to the voice of the Google Assistant.
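One way to check availability programmatically is to filter a list of voices by language and by name. The Python sketch below works on a dictionary shaped like a voices list response (a "voices" array whose entries carry "languageCodes" and "name"). The sample entries are illustrative data, not an authoritative voice list, and matching on "-Wavenet-" in the voice name is an assumption based on the naming pattern used in the supported voices table.

```python
# Illustrative data only: shaped like a voices list response, but the
# entries here are examples, not the real catalog.
SAMPLE_VOICES_RESPONSE = {
    "voices": [
        {"languageCodes": ["en-US"], "name": "en-US-Standard-B"},
        {"languageCodes": ["en-US"], "name": "en-US-Wavenet-A"},
        {"languageCodes": ["de-DE"], "name": "de-DE-Wavenet-B"},
    ]
}


def wavenet_voices(response, language_code):
    """Return the names of voices for language_code that look like WaveNet voices.

    Assumption: WaveNet voices are recognizable by "-Wavenet-" in their name.
    """
    return [
        voice["name"]
        for voice in response.get("voices", [])
        if language_code in voice.get("languageCodes", [])
        and "-wavenet-" in voice["name"].lower()
    ]


if __name__ == "__main__":
    print(wavenet_voices(SAMPLE_VOICES_RESPONSE, "en-US"))
```

An empty result for a language suggests that no WaveNet voice is offered for it in the list you supplied.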

A WaveNet model generates speech that sounds more natural than speech from other text-to-speech systems. It synthesizes speech with more human-like emphasis and inflection on syllables, phonemes, and words. On average, a WaveNet model produces speech audio that people prefer over audio from other text-to-speech technologies.

[Chart: WaveNet has the highest preference by native speakers]

Figure 1. Chart showing comparison of WaveNet to other synthetic voices and human speech. The y-axis values represent the Mean Opinion Score (MOS) for each voice. Test subjects rated each voice on a scale of 1-5 according to how much it sounded like natural speech. For more information on MOS scores and WaveNet technology, see the DeepMind WaveNet page.
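To make the MOS figure concrete: a Mean Opinion Score is the arithmetic mean of the individual listener ratings. The Python sketch below computes it for a list of 1-5 ratings. The function name and the ratings in the usage example are hypothetical and are not data from the chart.

```python
def mean_opinion_score(ratings):
    """Return the arithmetic mean of listener ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range 1-5: {rating}")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical ratings from four listeners for one voice.
    print(mean_opinion_score([4, 5, 3, 4]))
```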

Unlike most other text-to-speech systems, a WaveNet model creates raw audio waveforms from scratch. The model uses a neural network that has been trained using a large volume of speech samples. During training, the network extracts the underlying structure of the speech, such as which tones follow each other and what a realistic speech waveform looks like. When given a text input, the trained WaveNet model can generate the corresponding speech waveforms from scratch, one sample at a time, with up to 24,000 samples per second and seamless transitions between the individual sounds.
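The "one sample at a time" idea can be sketched as a loop in which each new sample is predicted from the samples generated so far. The Python example below is a toy illustration only: toy_predictor is a stand-in function, not a neural network and not the WaveNet model, and all names are hypothetical. It also shows how many samples a clip needs at 24,000 samples per second.

```python
SAMPLE_RATE_HZ = 24_000  # upper bound mentioned in the text


def samples_needed(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples required for a clip of duration_s seconds."""
    return round(duration_s * sample_rate_hz)


def generate_autoregressive(predict_next, num_samples):
    """Generate a waveform one sample at a time.

    predict_next receives the list of samples produced so far and returns
    the next sample, so every sample depends on the ones before it.
    """
    waveform = []
    for _ in range(num_samples):
        waveform.append(predict_next(waveform))
    return waveform


def toy_predictor(history):
    """Stand-in predictor (NOT WaveNet): halve the previous sample."""
    return 1.0 if not history else 0.5 * history[-1]


if __name__ == "__main__":
    n = samples_needed(2.0)  # a 2-second clip at 24 kHz
    print(n)
    print(generate_autoregressive(toy_predictor, 4))
```

Because each step depends on the previous output, a loop like this cannot be trivially parallelized across samples, which is one reason sample-level generation is computationally demanding.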

Note: WaveNet voices are priced differently from non-WaveNet voices. For more details, see the pricing page.

To hear the difference between a WaveNet-generated audio clip and a clip generated by another text-to-speech process, compare the two audio clips below.

Example 1. High quality, non-WaveNet voice

Example 2. WaveNet voice

To learn more about WaveNet models, read this blog post by DeepMind.

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Text-to-Speech performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2022-04-22 UTC.
