Introduction to audio encoding for Cloud Speech-to-Text

An audio encoding refers to the manner in which audio data is stored and transmitted. The page covers how such encodings work in relation to the Cloud Speech-to-Text API.

For guidelines on choosing the best encoding for your application, see our best practices guide.

Note: The information provided here is meant as a general overview. Digital audio encoding is a complex topic, and you don't need to know the details to use the Cloud Speech-to-Text API. Some of this background information may be useful for understanding how the API works, and how audio should be formulated and processed in your applications.

Audio formats versus encodings

An audio format is not equivalent to an audio encoding. For example, the WAV audio file format defines the header of an audio file, but is not itself an audio encoding. WAV files often (but not always) use a linear PCM encoding; don't assume a WAV file has any particular encoding until you inspect its header.

FLAC, however, is both a file format and an encoding, which sometimes leads to some confusion. A FLAC file must contain the sample rate in the header in order to be submitted to the Cloud Speech-to-Text API. FLAC is the only encoding that requires audio data to include a header; all other audio encodings specify headerless audio data. When we refer to FLAC within the Cloud Speech-to-Text API, we are always referring to the codec. When we refer to the FLAC file format, we use the phrase "a FLAC file."

Note: You are not required to specify the encoding and sample rate for WAV or FLAC files. If omitted, Cloud Speech-to-Text automatically determines the encoding and sample rate for WAV or FLAC files based on the file header. If you specify an encoding or sample rate value that does not match the value in the file header, then Cloud Speech-to-Text returns an error.
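For example, with the google-cloud-speech Python client library, a request for a FLAC file can omit the encoding and sample rate because Cloud Speech-to-Text reads them from the file header. The following is a minimal sketch; the Cloud Storage URI is a hypothetical placeholder.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical bucket and object; replace with your own FLAC file.
audio = speech.RecognitionAudio(uri="gs://my-bucket/recording.flac")

# For FLAC (and WAV), encoding and sample_rate_hertz can be omitted;
# Cloud Speech-to-Text determines them from the file header.
config = speech.RecognitionConfig(language_code="en-US")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```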

Supported audio encodings for Cloud Speech-to-Text

The Cloud Speech-to-Text API supports a number of different encodings. The following table lists supported audio codecs:

| Codec | Name | Lossless | Usage Notes |
| --- | --- | --- | --- |
| MP3 | MPEG Audio Layer III | No | MP3 encoding is a Beta feature and only available in v1p1beta1. See the RecognitionConfig reference documentation for details. |
| FLAC | Free Lossless Audio Codec | Yes | 16-bit or 24-bit required for streams |
| LINEAR16 | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding. The header must contain the sample rate. |
| MULAW | μ-law | No | 8-bit PCM encoding |
| AMR | Adaptive Multi-Rate Narrowband | No | Sample rate must be 8000 Hz |
| AMR_WB | Adaptive Multi-Rate Wideband | No | Sample rate must be 16000 Hz |
| OGG_OPUS | Opus encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz |
| SPEEX_WITH_HEADER_BYTE | Speex wideband | No | Sample rate must be 16000 Hz |
| WEBM_OPUS | WebM Opus | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz |
Note: FLAC is both an audio codec and an audio file format. To transcribe audio files using FLAC encoding, you must provide them in the FLAC file format, which includes a header containing metadata.

Cloud Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.

For more information on Cloud Speech-to-Text audio codecs, consult the AudioEncoding reference documentation.

If you have a choice when encoding the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition. For guidelines on selecting the appropriate codec for your task, see Best practices.
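Unlike FLAC or WAV, headerless audio such as raw LINEAR16 carries no metadata, so the encoding and sample rate must be stated explicitly in the request. The sketch below again assumes the google-cloud-speech Python client library; the local file name and its 16000 Hz mono format are hypothetical.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical raw PCM file: 16-bit, mono, 16000 Hz, no header.
with open("speech.raw", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

# Headerless audio requires the encoding and sample rate to be set explicitly.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```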

Why encode?

Audio is made up of waveforms, consisting of the superposition of waves of different frequencies and amplitudes. To represent these waveforms within digital media, the waveforms need to be sampled at rates that can (at least) represent sounds of the highest frequency which you want to replicate, and the samples need enough bit depth to represent the proper amplitude (loudness and softness) of the waveforms across the sound sample.

The ability of a sound processing device to recreate frequencies is known as its frequency response, and its ability to reproduce proper loudness and softness is known as its dynamic range. Together these terms are often referred to as a sound device's fidelity. An encoding, in its simplest form, is a means with which to reconstruct sound using these two basic principles, as well as a means to store and transport such data efficiently.

Sampling rates

Sound exists as an analog waveform. A segment of digital audio approximates this analog wave by sampling its amplitude at a fast enough rate to mimic the wave's intrinsic frequencies. A digital audio segment's sample rate specifies the number of samples taken from the source material per second; a higher sample rate increases the ability of digital audio to faithfully represent high frequencies.

As a consequence of the Nyquist-Shannon theorem, you generally need to sample more than twice the highest frequency of any sound wave you want to capture digitally. To represent audio within the range of human hearing (20-20000 Hz), for example, a digital audio format must sample at least 40000 times per second (which is part of the reason why CD audio uses a sample rate of 44100 Hz).
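As an illustrative sketch (not part of the API), the snippet below applies the Nyquist criterion and generates samples of a pure tone; the 440 Hz tone, 16000 Hz rate, and 0.1 s duration are arbitrary example values.

```python
import math

def min_sample_rate(max_frequency_hz: float) -> float:
    """Nyquist-Shannon: sample at more than twice the highest frequency."""
    return 2 * max_frequency_hz

def sample_sine(frequency_hz: float, sample_rate_hz: int, duration_s: float) -> list[float]:
    """Sample a pure tone at the given rate; amplitudes lie in [-1.0, 1.0]."""
    n_samples = int(sample_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# Human hearing tops out near 20000 Hz, so CD audio's 44100 Hz rate sits
# comfortably above the 40000 Hz minimum.
print(min_sample_rate(20000))           # 40000.0
samples = sample_sine(440, 16000, 0.1)  # an A4 tone at a speech-oriented rate
print(len(samples))                     # 1600 samples for 0.1 s of audio
```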

Bit depths

Bit depth affects the dynamic range of a given audio sample. A higher bit depth lets you represent more precise amplitudes. If you have lots of loud and soft sounds within the same audio sample, you will need more bit depth to represent those sounds correctly.

Higher bit depths also improve the signal-to-noise ratio within audio samples. CD musical audio is provided using 16 bits of bit depth. DVD Audio uses 24 bits of bit depth, while most telephony equipment uses 8 bits of bit depth. (Certain compression techniques can compensate for smaller bit depths, but they tend to be lossy.)
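As a rough back-of-the-envelope sketch, each additional bit of depth adds about 6 dB of ideal dynamic range (20·log10(2) ≈ 6.02 dB); this simplified figure ignores dithering and quantization-noise refinements.

```python
import math

def dynamic_range_db(bit_depth: int) -> float:
    """Approximate dynamic range of ideal linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

for bits in (8, 16, 24):
    print(f"{bits}-bit PCM: ~{dynamic_range_db(bits):.1f} dB")
# 8-bit PCM:  ~48.2 dB  (typical telephony)
# 16-bit PCM: ~96.3 dB  (CD audio)
# 24-bit PCM: ~144.5 dB (DVD Audio)
```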

Uncompressed audio

Most digital audio processing uses these two techniques (sampling rate and bit depth) to store audio data in a straightforward manner. One of the most popular digital audio techniques (popularized by the Compact Disc) is known as Pulse Code Modulation (PCM). Audio is sampled at set intervals, and the amplitude of the sampled wave at that point is stored as a digital value using the sample's bit depth.

Linear PCM (which indicates that the amplitude response is linearly uniform across the sample) is the standard used within CDs, and within the LINEAR16 encoding of the Cloud Speech-to-Text API. Both encodings produce an uncompressed stream of bytes corresponding directly to audio data, and both standards contain 16 bits of depth. Linear PCM uses a sample rate of 44100 Hz within CDs, which is appropriate for the recomposition of music; however, a sample rate of 16000 Hz is more appropriate for recomposing speech.

Linear PCM (LINEAR16) is an example of uncompressed audio in that the digital data is stored exactly as the preceding standards imply. Reading a one-channel stream of bytes encoded using Linear PCM, you could count off every 16 bits (2 bytes), for example, to get another amplitude value of the waveform. Almost all devices can manipulate such digital data natively (you can even crop Linear PCM audio files using a text editor), but uncompressed audio is not the most efficient way to transport or store digital audio. For that reason, most audio uses digital compression techniques.
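To make the "count off every 16 bits" point concrete, here is a small sketch that decodes a mono LINEAR16 byte stream into integer amplitudes. The file name and little-endian byte order are assumptions for illustration.

```python
import struct

def decode_linear16(pcm_bytes: bytes) -> list[int]:
    """Decode mono, little-endian, 16-bit linear PCM into amplitude values."""
    sample_count = len(pcm_bytes) // 2  # 2 bytes per sample
    return list(struct.unpack(f"<{sample_count}h", pcm_bytes[: sample_count * 2]))

# Hypothetical headerless file containing raw LINEAR16 samples.
with open("speech.raw", "rb") as f:
    amplitudes = decode_linear16(f.read())

print(amplitudes[:10])  # each value lies in the signed 16-bit range [-32768, 32767]
```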

Compressed audio

Like all data, audio data is often compressed to make it easier to store and to transport. Compression within audio encoding may be either lossless or lossy. Lossless compression can be unpacked to restore the digital data to its original form. Lossy compression necessarily removes some information during compression and decompression, and is parameterized to indicate how much tolerance to give the compression technique to remove data.

Lossless compression

Lossless compression compresses digital audio data using complex rearrangements of the stored data, but results in no degradation in quality of the original digital sample. With lossless compression, when unpacking the data into its original digital form, no information will be lost.

So why do lossless compression techniques sometimes have optimization parameters? These parameters often trade file size for compression and decompression time. For example, FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size). Higher-level FLAC compression won't lose any information in comparison to lower-level compression. Instead, the compression algorithm will just need to expend more computational energy when constructing or deconstructing the original digital audio.

The Cloud Speech-to-Text API supports two lossless encodings: FLAC and LINEAR16. Technically, LINEAR16 isn't "lossless compression" because no compression is involved in the first place. If file size or data transmission is important to you, use FLAC.

Lossy compression

Lossy compression, on the other hand, compresses audio data by eliminating or reducing certain types of information during the construction of the compressed data. The Cloud Speech-to-Text API supports several lossy formats, though you should avoid them if you have control over the audio, because data loss may affect recognition accuracy.

The popular MP3 codec is an example of a lossy encoding technique. All MP3 compression techniques remove audio from outside the normal human hearing range, and adjust the amount of compression by adjusting the MP3 codec's effective bit rate, or the number of bits per second used to store the audio data.

For example, a stereo CD using 16-bit Linear PCM at a sample rate of 44100 Hz has an effective bit rate of:

44100 * 2 channels * 16 bits = 1411200 bits per second (bps) = 1411 kbps

MP3 compression removes such digital data using bit rates such as 320 kbps, 128 kbps, or 96 kbps, for example, with resulting degradation in audio quality. MP3 also supports variable bit rates, which can compress the audio further. Both techniques lose information and can affect quality. Most people can tell the difference between 96 kbps and 128 kbps encoded MP3 music, for example.
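A quick sketch of the arithmetic behind these figures: the uncompressed CD bit rate computed above, and the approximate compression ratios implied by common MP3 bit rates.

```python
# Uncompressed stereo CD audio: 44100 samples/s * 2 channels * 16 bits.
cd_bps = 44_100 * 2 * 16
print(cd_bps)  # 1411200 bps, i.e. about 1411 kbps

# Approximate compression ratios for common MP3 bit rates.
for mp3_kbps in (320, 128, 96):
    ratio = cd_bps / (mp3_kbps * 1000)
    print(f"{mp3_kbps} kbps MP3 is ~{ratio:.1f}x smaller than CD audio")
# 320 kbps -> ~4.4x, 128 kbps -> ~11.0x, 96 kbps -> ~14.7x
```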

Other forms of compression will parameterize some other constraint.

MULAW (also called μ-law or uLaw) is an 8-bit PCM encoding, where the sample's amplitude is modulated logarithmically rather than linearly. As a result, uLaw reduces the effective dynamic range of the audio thus compressed. Though uLaw was introduced specifically to optimize encoding of speech in contrast to other types of audio, 16-bit LINEAR16 (uncompressed PCM) is still far superior to 8-bit uLaw compressed audio.
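For reference, the standard μ-law companding curve (μ = 255) maps a linear amplitude in [-1, 1] onto a logarithmic scale before 8-bit quantization; the sketch below shows only the continuous formula, not the bit-level details of the G.711 codec.

```python
import math

MU = 255  # μ parameter used by 8-bit μ-law

def mu_law_compress(x: float) -> float:
    """Map a linear amplitude in [-1.0, 1.0] to a companded value in [-1.0, 1.0]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Invert the companding curve (lossless only before quantization)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Quiet sounds get proportionally more of the 8-bit code space than loud ones.
for amplitude in (0.01, 0.1, 1.0):
    print(amplitude, round(mu_law_compress(amplitude), 3))
# 0.01 -> ~0.228, 0.1 -> ~0.591, 1.0 -> 1.0
```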

AMR and AMR_WB modulate the encoded audio sample by introducing a variable bit rate on the source audio sample.

Although the Cloud Speech-to-Text API supports several lossy formats, you should avoid them if you have control over the source audio. Although the removal of such data through lossy compression may not noticeably affect audio as heard by the human ear, loss of such data to a speech recognition engine may significantly degrade accuracy.
