gcloud ml speech recognize

NAME
gcloud ml speech recognize - get transcripts of short (less than 60 seconds) audio from an audio file
SYNOPSIS
gcloud ml speech recognize AUDIO --language-code=LANGUAGE_CODE [--enable-automatic-punctuation] [--encoding=ENCODING; default="encoding-unspecified"] [--filter-profanity] [--hints=[HINT,…]] [--include-word-time-offsets] [--max-alternatives=MAX_ALTERNATIVES; default=1] [--model=MODEL] [--sample-rate=SAMPLE_RATE] [--audio-channel-count=AUDIO_CHANNEL_COUNT --separate-channel-recognition] [GCLOUD_WIDE_FLAG]
DESCRIPTION
Get a transcript of an audio file that is less than 60 seconds long. You can use an audio file that is on your local drive or a Google Cloud Storage URL.

If the audio is longer than 60 seconds, you will get an error. Please use gcloud ml speech recognize-long-running instead.
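
The length rule above can be wrapped in a small helper. This is a minimal sketch, assuming the clip duration is already known in whole seconds (for example from an external tool, which gcloud does not provide). The function name pick_speech_command is hypothetical, and the final line only prints the command rather than running it.

```shell
# Hypothetical helper: choose the gcloud subcommand from a known duration.
# Assumes the duration is given as a non-negative whole number of seconds.
pick_speech_command() {
  duration_seconds=$1
  case $duration_seconds in
    ''|*[!0-9]*)
      echo "duration must be a non-negative integer" >&2
      return 1
      ;;
  esac
  # "recognize" only accepts audio shorter than 60 seconds.
  if [ "$duration_seconds" -lt 60 ]; then
    echo "recognize"
  else
    echo "recognize-long-running"
  fi
}

# Dry run: print the command that would be used for a 42-second clip.
echo "gcloud ml speech $(pick_speech_command 42) my-recording.wav --language-code=en-US"
```

A clip of exactly 60 seconds is routed to recognize-long-running here, since the description only promises support for audio under 60 seconds.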

EXAMPLES
To get a transcript of an audio file 'my-recording.wav':
gcloud ml speech recognize 'my-recording.wav' --language-code=en-US

To get a transcript of an audio file in bucket 'gs://bucket/myaudio' with a custom sampling rate and encoding that uses hints and filters profanity:

gcloud ml speech recognize 'gs://bucket/myaudio' --language-code=es-ES --sample-rate=2200 --hints=Bueno --encoding=OGG_OPUS --filter-profanity
POSITIONAL ARGUMENTS
AUDIO
The location of the audio file to transcribe. Must be a local path or a Google Cloud Storage URL (in the format gs://bucket/object).
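
A script that accepts AUDIO from a user may want to check its shape before calling gcloud. The following is a rough sketch only: the helper name audio_location_kind is hypothetical, and the check is purely textual (it does not confirm that the file or object exists).

```shell
# Hypothetical helper: classify an AUDIO argument by its text alone.
# Prints "gcs" for gs://bucket/object, "local" for anything else that is
# non-empty, and "invalid" for an empty value or an incomplete gs:// URL.
audio_location_kind() {
  case $1 in
    gs://?*/?*) echo "gcs" ;;      # bucket and object are both present
    gs://*)     echo "invalid" ;;  # gs:// prefix without bucket/object
    '')         echo "invalid" ;;
    *)          echo "local" ;;
  esac
}

for audio in my-recording.wav gs://bucket/myaudio gs://bucket; do
  echo "$audio -> $(audio_location_kind "$audio")"
done
```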
REQUIRED FLAGS
--language-code=LANGUAGE_CODE
The language of the supplied audio as a BCP-47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". See https://cloud.google.com/speech/docs/languages for a list of the currently supported language codes.
OPTIONAL FLAGS
--enable-automatic-punctuation
Adds punctuation to recognition result hypotheses.
--encoding=ENCODING; default="encoding-unspecified"
The type of encoding of the file. Required if the file format is not WAV or FLAC. ENCODING must be one of: alaw, amr, amr-wb, encoding-unspecified, flac, linear16, mp3, mulaw, ogg-opus, speex-with-header-byte, webm-opus.
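
Because --encoding is only required for files that are not WAV or FLAC, a wrapper script can add the flag conditionally. The sketch below guesses from the file extension, which is an assumption of this example and not something gcloud does; the helper name encoding_flag_for and the extension mapping are hypothetical (for instance, .ogg is assumed to hold Opus audio).

```shell
# Hypothetical helper: print the --encoding flag to pass for a file, or
# nothing when the extension suggests WAV or FLAC. The extension-to-encoding
# mapping is an illustrative assumption; only the encoding values themselves
# come from the list above.
encoding_flag_for() {
  case $1 in
    *.wav|*.WAV|*.flac|*.FLAC) ;;               # no flag needed
    *.mp3)        echo "--encoding=mp3" ;;
    *.ogg|*.opus) echo "--encoding=ogg-opus" ;;
    *.webm)       echo "--encoding=webm-opus" ;;
    *.amr)        echo "--encoding=amr" ;;
    *)
      echo "cannot guess encoding for: $1" >&2
      return 1
      ;;
  esac
}

# Dry run: print the command that would be used for an MP3 file.
echo "gcloud ml speech recognize clip.mp3 --language-code=en-US $(encoding_flag_for clip.mp3)"
```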
--filter-profanity
If True, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. f***.
--hints=[HINT,…]
A list of strings containing word and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See https://cloud.google.com/speech/limits#content.
--include-word-time-offsets
If True, the top result includes a list of words with the start and end time offsets (timestamps) for those words. If False, no word-level time offset information is returned.
--max-alternatives=MAX_ALTERNATIVES; default=1
Maximum number of recognition hypotheses to be returned. The server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 will return a maximum of one.
--model=MODEL
Select the model best suited to your domain to get best results. If you do not explicitly specify a model, Speech-to-Text will auto-select a model based on your other specified parameters. Some models are premium and cost more than standard models (although you can reduce the price by opting into https://cloud.google.com/speech-to-text/docs/data-logging). MODEL must be one of:
command_and_search
short queries such as voice commands or voice search.
default
audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16kHz or greater sampling rate.
latest_long
Use this model for any kind of long form content such as media or spontaneous speech and conversations. Consider using this model in place of the video model, especially if the video model is not available in your target language. You can also use this in place of the default model.
latest_short
Use this model for short utterances that are a few seconds in length. It is useful for trying to capture commands or other single shot directed speech use cases. Consider using this model instead of the command and search model.
medical_conversation
Best for audio that originated from a conversation between a medical provider and patient.
medical_dictation
Best for audio that originated from dictation notes by a medical provider.
phone_call
audio that originated from a phone call (typically recorded at an 8kHz sampling rate).
phone_call_enhanced
audio that originated from a phone call (typically recorded at an 8kHz sampling rate). This is a premium model and can produce better results but costs more than the standard rate.
telephony
Improved version of the "phone_call" model, best for audio that originated from a phone call, typically recorded at an 8kHz sampling rate.
telephony_short
Dedicated version of the modern "telephony" model for short or even single-word utterances for audio that originated from a phone call, typically recorded at an 8kHz sampling rate.
video_enhanced
audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16kHz or greater sampling rate. This is a premium model that costs more than the standard rate.
--sample-rate=SAMPLE_RATE
The sample rate in Hertz. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling).
Audio channel settings.
--audio-channel-count=AUDIO_CHANNEL_COUNT
The number of channels in the input audio data. Set this for separate-channel-recognition. Valid values are: 1) LINEAR16 and FLAC: 1-8; 2) OGG_OPUS: 1-254; 3) MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE: 1 only.

This flag argument must be specified if any of the other arguments in this group are specified.

--separate-channel-recognition
Recognition result will contain a channel_tag field to state which channel that result belongs to. If this is not true, only the first channel will be recognized.

This flag argument must be specified if any of the other arguments in this group are specified.
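
The per-encoding channel limits listed for --audio-channel-count can be checked before a request is sent. This is a minimal sketch under the ranges stated above. The helper name channel_count_ok is hypothetical, the encoding names are spelled as in that list, and the command is only printed, not executed. Note that both flags in this group are passed together.

```shell
# Hypothetical helper: succeed if COUNT is a valid channel count for ENCODING,
# using the ranges from the --audio-channel-count description.
channel_count_ok() {
  encoding=$1
  count=$2
  case $count in
    ''|*[!0-9]*) return 1 ;;   # not a non-negative integer
  esac
  case $encoding in
    LINEAR16|FLAC)                           max=8 ;;
    OGG_OPUS)                                max=254 ;;
    MULAW|AMR|AMR_WB|SPEEX_WITH_HEADER_BYTE) max=1 ;;
    *) return 1 ;;             # encoding not covered by the list
  esac
  [ "$count" -ge 1 ] && [ "$count" -le "$max" ]
}

# Dry run: print a two-channel FLAC request only if the count is allowed.
# The file name stereo.flac is a placeholder.
if channel_count_ok FLAC 2; then
  echo "gcloud ml speech recognize stereo.flac --language-code=en-US --audio-channel-count=2 --separate-channel-recognition"
fi
```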

GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

API REFERENCE
This command uses the speech/v1 API. The full documentation for this API can be found at: https://cloud.google.com/speech-to-text/docs/quickstart-protocol
NOTES
These variants are also available:
gcloud alpha ml speech recognize
gcloud beta ml speech recognize

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-01-21 UTC.