gcloud alpha ml video transcribe-speech

NAME
gcloud alpha ml video transcribe-speech - transcribe speech in a video
SYNOPSIS
gcloud alpha ml video transcribe-speech INPUT_PATH --language-code=LANGUAGE_CODE [--async] [--audio-tracks=[AUDIO_TRACKS,…]] [--diarization-speaker-count=DIARIZATION_SPEAKER_COUNT] [--enable-automatic-punctuation] [--enable-speaker-diarization] [--enable-word-confidence] [--filter-profanity] [--hints=[HINTS,…]] [--max-alternatives=MAX_ALTERNATIVES] [--output-uri=OUTPUT_URI] [--region=REGION] [--segments=[SEGMENTS,…]] [GCLOUD_WIDE_FLAG]
DESCRIPTION
(ALPHA) Transcribe speech within a video.
EXAMPLES
To transcribe speech in video file 'gs://my_bucket/input_file.mp4':
gcloud alpha ml video transcribe-speech gs://my_bucket/input_file.mp4
POSITIONAL ARGUMENTS
INPUT_PATH
Path to the video to be analyzed. Must be a local path or a Google Cloud Storage URI.
REQUIRED FLAGS
--language-code=LANGUAGE_CODE
The language of the supplied audio as a BCP-47 language tag. Example: "en-US". See Language Support for a list of the currently supported language codes.
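The command shown under EXAMPLES does not pass --language-code, although this section lists it as required. A minimal sketch of a call that supplies it might look like the following. The helper name transcribe_video and the "en-US" default are assumptions made for illustration and are not part of gcloud.

```shell
# Hypothetical helper (not part of gcloud): transcribe a video and always
# pass the required --language-code flag.
#   $1 = INPUT_PATH (local path or Cloud Storage URI)
#   $2 = BCP-47 language tag; defaults to en-US (an assumed default)
transcribe_video() {
  input_path=$1
  language_code=${2:-en-US}
  gcloud alpha ml video transcribe-speech "$input_path" \
    --language-code="$language_code"
}

# Example call, commented out so that sourcing this file has no side effects:
# transcribe_video gs://my_bucket/input_file.mp4 en-US
```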
OPTIONAL FLAGS
--async
Return immediately, without waiting for the operation in progress to complete.
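One way to use --async is to capture the operation identifier and check on it later. The sketch below rests on two assumptions that should be verified against your gcloud version: that the asynchronous response exposes the identifier in a field called name, and that gcloud alpha ml video operations describe accepts that identifier. The function names are hypothetical.

```shell
# Hypothetical helper: start transcription without waiting and print only
# the operation name (assumes the response has a "name" field).
#   $1 = INPUT_PATH, $2 = language code
start_transcription() {
  gcloud alpha ml video transcribe-speech "$1" \
    --language-code="$2" --async --format='value(name)'
}

# Hypothetical helper: show the current state of an operation (assumes the
# "operations describe" command is available in the alpha track).
#   $1 = operation name returned by start_transcription
check_transcription() {
  gcloud alpha ml video operations describe "$1"
}

# Example usage, commented out so that nothing runs on its own:
# op=$(start_transcription gs://my_bucket/input_file.mp4 en-US)
# check_transcription "$op"
```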
--audio-tracks=[AUDIO_TRACKS,…]
For file formats, such as MXF or MKV, supporting multiple audio tracks, specify up to two comma-separated tracks. Default track is 0.
--diarization-speaker-count=DIARIZATION_SPEAKER_COUNT
Optional estimated number of speakers in the conversation. If not specified, defaults to '2'. Ignored unless --enable-speaker-diarization is enabled.
--enable-automatic-punctuation
If enabled, adds punctuation to recognition result hypotheses. This feature is only available in select languages. Setting this for requests in other languages has no effect at all. The default disabled case does not add punctuation to result hypotheses.
--enable-speaker-diarization
If enabled, perform speaker detection for each recognized word in the top alternative of the recognition result using a speaker_tag provided in the WordInfo results section. When this is enabled, we send all the words from the beginning of the audio for the top alternative in each consecutive response. This is done in order to improve our speaker tags as our models learn to identify the speakers in the conversation over time.
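Because --diarization-speaker-count is ignored unless --enable-speaker-diarization is set, the two flags are naturally passed together. A minimal sketch, assuming a hypothetical helper named transcribe_with_speakers that falls back to the documented default of 2 speakers:

```shell
# Hypothetical helper: transcribe with speaker labels.
#   $1 = INPUT_PATH
#   $2 = language code
#   $3 = estimated number of speakers (falls back to the documented default, 2)
transcribe_with_speakers() {
  gcloud alpha ml video transcribe-speech "$1" \
    --language-code="$2" \
    --enable-speaker-diarization \
    --diarization-speaker-count="${3:-2}"
}

# Example call, commented out so that nothing runs on its own:
# transcribe_with_speakers gs://my_bucket/input_file.mp4 en-US 3
```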
--enable-word-confidence
If enabled, the top result includes a list of words and the confidence for those words. If disabled, no word-level confidence information is returned. The default is disabled.
--filter-profanity
If enabled, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. f***. If disabled or omitted, profanities will not be filtered out.
--hints=[HINTS,…]
List of strings containing word and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See https://cloud.google.com/speech/limits#content.
--max-alternatives=MAX_ALTERNATIVES
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechTranscription. The server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one.
--output-uri=OUTPUT_URI
Location to which the results should be written. Must be a Google Cloud Storage URI.
--region=REGION
Optional Cloud region where annotation should take place. If no region is specified, a region will be determined based on video file location. REGION must be one of: asia-east1, europe-west1, us-east1, us-west1.
--segments=[SEGMENTS,…]
Segments from the video which you want to analyze (by default, the entire video will be treated as one segment). Must be in the format START1:END1[,START2:END2,…] (inclusive). START and END of segments must be a properly formatted duration string of the form HhMmSs where:

* H is the number of hours from beginning of video
* M is the number of minutes from the beginning of video
* S is the number of seconds from the beginning of the video

H, M and S can be specified as ints or floats for fractional units (to microsecond resolution). Unit chars (e.g. h, m or s) are required. Microseconds can be specified using fractional seconds e.g. 0.000569s == 569 microseconds.

Examples:

0s:23.554048s,24s:29.528064s

0:1m40s,3m50s:5m10.232265s
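When segment boundaries are already known as offsets in seconds, the --segments value can be assembled mechanically. The sketch below uses a hypothetical helper, segments_from_seconds, that appends the s unit character to each offset and joins the START:END pairs with commas. It performs no validation of the numbers themselves.

```shell
# Hypothetical helper: build a --segments value from start/end offsets
# given in seconds.
# Usage: segments_from_seconds START1 END1 [START2 END2 ...]
segments_from_seconds() {
  if [ $# -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
    echo "expected an even, non-zero number of arguments" >&2
    return 1
  fi
  result=""
  while [ $# -gt 0 ]; do
    # Append the "s" unit character to each offset and join START:END
    # pairs with commas.
    result="${result:+$result,}${1}s:${2}s"
    shift 2
  done
  printf '%s\n' "$result"
}

segments_from_seconds 0 23.554048 24 29.528064
# prints 0s:23.554048s,24s:29.528064s
```

The printed string can then be passed as --segments="$(segments_from_seconds 0 23.554048 24 29.528064)".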

GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

API REFERENCE
This command uses the videointelligence/v1 API. The full documentation for this API can be found at: https://cloud.google.com/video-intelligence/docs/
NOTES
This command is currently in alpha and might change without notice. If this command fails with API permission errors despite specifying the correct project, you might be trying to access an API with an invitation-only early access allowlist. This variant is also available:
gcloud beta ml video transcribe-speech

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-07-22 UTC.