Cloud Speech-to-Text overview

Cloud Speech-to-Text is an API that lets you integrate Google's speech recognition technologies into your developer applications. This document covers the basics of using Cloud Speech-to-Text, including the types of requests you can make to Cloud STT, how to construct those requests, and how to handle their responses. Before you dive into using the API, read this guide and one of the associated tutorials.

Cloud Speech-to-Text recognition requests

Cloud Speech-to-Text (STT) has three main methods to perform speech recognition:

  • Synchronous recognition (REST and gRPC) sends audio data to the Cloud Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests process audio data of 1 minute or less.

  • Asynchronous recognition (REST and gRPC) sends audio data to the Cloud Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

  • Streaming recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results as audio is captured. For example, results can appear while a user is still speaking.

Requests contain configuration parameters as well as audio data. Recognition requests can optionally contain a recognizer, a stored and reusable recognition configuration.
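As a rough sketch, a synchronous recognition request body might be assembled like the following. The field names here (`config`, `audio`, `languageCode`, and so on) follow the v1-style REST API and are illustrative; the exact shape depends on the API version and client you use:

```python
import json

# Hypothetical synchronous recognition request body (v1-style REST field names).
# "config" carries the recognition parameters; "audio" points at the data.
request_body = {
    "config": {
        "languageCode": "en-US",    # BCP-47 identifier (see the Languages section)
        "sampleRateHertz": 16000,   # match the source audio's actual rate
        "encoding": "LINEAR16",
    },
    "audio": {
        "uri": "gs://bucket-name/path/to/audio/file",
    },
}

payload = json.dumps(request_body)
```

The serialized `payload` is what a REST client would POST to the recognition endpoint.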

Audio metadata

For most audio files, Cloud Speech-to-Text API can automatically deduce the audio metadata. Cloud STT parses the header of the file and decodes it according to that information. See the encoding page for supported file types.

For headerless audio files, Cloud Speech-to-Text API lets you specify the audio metadata explicitly in the recognition config. See the encoding page for more details.

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this can impair speech recognition accuracy, and higher rates have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, don't resample it to 16000 Hz. For example, most legacy telephony audio uses sample rates of 8000 Hz, which can give less accurate results. If you must use such audio, provide it to the Cloud Speech-to-Text API at its original sample rate.
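For example, you can check a WAV file's actual sample rate before sending it, rather than assuming 16000 Hz. This sketch uses Python's standard wave module, with an in-memory 8000 Hz file standing in for legacy telephony audio:

```python
import io
import wave

# Build a 1-second, 8000 Hz, mono, 16-bit WAV in memory as a stand-in
# for a legacy telephony recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)

# Read the header back: send this value as the sample rate in your
# recognition config instead of resampling to 16000 Hz.
buf.seek(0)
with wave.open(buf, "rb") as w:
    native_rate = w.getframerate()

print(native_rate)  # 8000
```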

Languages

Cloud STT's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Supported languages page.

Recognition features

Cloud Speech-to-Text API has additional recognition features, such as automatic punctuation and word-level confidence. You enable these features in the recognition configuration in requests. See the sample code in the provided links and the languages page for feature availability.

Model selection

Cloud STT can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources. See the model selection documentation to learn about the available models and how to select one in your requests.

Embedded audio content

You include embedded audio in the speech recognition request by passing a content parameter within the request's audio_source field. For embedded audio that you provide as content within a gRPC request, the audio must be compatible with Proto3 serialization and provided as binary data. For embedded audio that you provide as content within a REST request, the audio must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.
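For instance, Base64-encoding audio bytes for a REST request can be done with Python's standard base64 module (the bytes below are a stand-in for real audio file contents):

```python
import base64

audio_bytes = b"RIFF...fake-audio-bytes..."  # stand-in for real file contents

# Base64-encode the raw bytes; the resulting ASCII string is what goes
# into the request's embedded-content field for REST.
encoded = base64.b64encode(audio_bytes).decode("ascii")

# Decoding recovers the original bytes exactly.
assert base64.b64decode(encoded) == audio_bytes
```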

When constructing a request using a Google Cloud client library, you generally write out this binary (or Base64-encoded) data directly within the content field.

Pass audio referenced by a URI

More typically, you pass a uri parameter within the Cloud Speech-to-Text API request's audio_source field, pointing to an audio file (in binary format, not Base64) located on Cloud Storage in the following form:

gs://bucket-name/path/to/audio/file

Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.

The service account email address is the following:

service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com

To transcribe Cloud Storage files in another project, you can give this service account the [Speech-to-Text Service Agent][speech-service-agent] role in the other project:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/speech.serviceAgent

More information about project IAM policy is available at [Manage access to projects, folders, and organizations][manage-access].

You can also give the service account more granular access by giving it permission to a specific Cloud Storage bucket:

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/storage.admin

More information about managing access to Cloud Storage is available at [Create and Manage access control lists][buckets-manage-acl] in the Cloud Storage documentation.

Cloud Speech-to-Text API responses

After Cloud Speech-to-Text API processes audio, it returns the transcription results in SpeechRecognitionResult messages for synchronous and batch requests, and in StreamingRecognitionResult messages for streaming requests. In synchronous and batch requests, the RPC response contains a list of results. The list of recognized audio appears in contiguous order. For streaming responses, all results marked as is_final appear in contiguous order.

Note: If Cloud Speech-to-Text API cannot recognize speech from the supplied audio, then the results field in the RPC response contains no items. Unrecognized speech is commonly the result of very poor-quality audio, or from language code, encoding, or sample rate values that don't match the supplied audio.

Select alternatives

Each result within a successful synchronous recognition response can contain one or more alternatives (if max_alternatives is greater than 1). If Cloud STT determines that an alternative has a sufficient confidence value, then Cloud STT includes that alternative in the response. The first alternative in the response is always the best (most likely) alternative.

Setting max_alternatives to a value higher than 1 does not imply or guarantee that multiple alternatives are returned. In general, multiple alternatives are most appropriate for providing real-time options to users who get results through a streaming recognition request.

Handling transcriptions

Each alternative in the response contains a transcript with the recognized text. When you receive sequential alternatives, concatenate these transcriptions.
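Stitching sequential results back together can be sketched as follows, again with mock results standing in for a real response:

```python
# Mock sequential results; each carries its best alternative's transcript.
# Note the leading space in the second transcript: transcripts are meant
# to be concatenated as-is, in order.
results = [
    {"alternatives": [{"transcript": "the quick brown fox"}]},
    {"alternatives": [{"transcript": " jumps over the lazy dog"}]},
]

# Take the top alternative from each result and concatenate the transcripts.
full_transcript = "".join(r["alternatives"][0]["transcript"] for r in results)
print(full_transcript)
```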

Confidence values

The confidence value is an estimate between 0.0 and 1.0. It's calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates a greater estimated likelihood that the individual words are recognized correctly. This field is typically provided only for the top hypothesis and only for results where is_final=true. For example, you can use the confidence value to decide whether to show alternative results or ask the user for confirmation.

Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, occasional cases exist where the top result doesn't have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result can have a lower confidence value than anticipated. This can occur, for example, when rare words are used. Even if the system recognizes a rarely used word correctly, it can be assigned a low "likelihood" value. If the model determines the rare word to be the most likely option based on context, it returns that result at the top even if the result's confidence value is lower than alternative options.

Important: Don't include confidence as a required field in your code. It may not be set in any of the results, and it may not be accurate.
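Because confidence may be unset, read it defensively rather than assuming it exists. A small sketch, using a mock alternative dict:

```python
# Mock alternative; in some results the confidence field is simply absent.
alternative = {"transcript": "example utterance"}

# Treat a missing confidence as "unknown" instead of crashing on a key error.
confidence = alternative.get("confidence")
if confidence is not None and confidence < 0.5:
    print("low confidence; consider asking the user to confirm")
else:
    print(alternative["transcript"])
```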

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-17 UTC.