gcloud alpha ml video transcribe-speech
- NAME
- gcloud alpha ml video transcribe-speech - transcribe speech in a video
- SYNOPSIS
gcloud alpha ml video transcribe-speech INPUT_PATH --language-code=LANGUAGE_CODE [--async] [--audio-tracks=[AUDIO_TRACKS,…]] [--diarization-speaker-count=DIARIZATION_SPEAKER_COUNT] [--enable-automatic-punctuation] [--enable-speaker-diarization] [--enable-word-confidence] [--filter-profanity] [--hints=[HINTS,…]] [--max-alternatives=MAX_ALTERNATIVES] [--output-uri=OUTPUT_URI] [--region=REGION] [--segments=[SEGMENTS,…]] [GCLOUD_WIDE_FLAG …]
- DESCRIPTION
(ALPHA) Transcribe speech within a video.
- EXAMPLES
- To transcribe speech in video file 'gs://my_bucket/input_file.mp4':
gcloud alpha ml video transcribe-speech gs://my_bucket/input_file.mp4 --language-code=en-US
- POSITIONAL ARGUMENTS
INPUT_PATH - Path to the video to be analyzed. Must be a local path or a Google Cloud Storage URI.
- REQUIRED FLAGS
--language-code=LANGUAGE_CODE - The language of the supplied audio as a BCP-47 language tag. Example: "en-US". See Language Support for a list of the currently supported language codes.
- OPTIONAL FLAGS
--async - Return immediately, without waiting for the operation in progress to complete.
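For example, a long transcription job can be submitted without blocking the shell (the bucket and file names below are illustrative):
gcloud alpha ml video transcribe-speech gs://my_bucket/long_talk.mp4 --language-code=en-US --async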
--audio-tracks=[AUDIO_TRACKS,…] - For file formats that support multiple audio tracks, such as MXF or MKV, specify up to two comma-separated tracks. The default track is 0.
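For instance, to analyze the first two audio tracks of a hypothetical multi-track MKV file:
gcloud alpha ml video transcribe-speech gs://my_bucket/multitrack.mkv --language-code=en-US --audio-tracks=0,1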
--diarization-speaker-count=DIARIZATION_SPEAKER_COUNT - Optional estimated number of speakers in the conversation. If not specified, defaults to '2'. Ignored unless --enable-speaker-diarization is enabled.
--enable-automatic-punctuation - If enabled, adds punctuation to recognition result hypotheses. This feature is only available in select languages; setting it for requests in other languages has no effect. By default, punctuation is not added to result hypotheses.
--enable-speaker-diarization - If enabled, performs speaker detection for each recognized word in the top alternative of the recognition result, using a speaker_tag provided in the WordInfo results section. When this is enabled, all the words from the beginning of the audio are sent for the top alternative in every consecutive response, in order to improve the speaker tags as the models learn to identify the speakers in the conversation over time.
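As an illustration, speaker diarization with an estimated three speakers could be requested as follows (the input path is illustrative):
gcloud alpha ml video transcribe-speech gs://my_bucket/panel.mp4 --language-code=en-US --enable-speaker-diarization --diarization-speaker-count=3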
--enable-word-confidence - If enabled, the top result includes a list of words and the confidence for those words. If disabled, no word-level confidence information is returned. The default is disabled.
--filter-profanity - If enabled, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***". If disabled or omitted, profanities will not be filtered out.
--hints=[HINTS,…] - List of strings containing words and phrases ("hints") so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. It can also be used to add additional words to the vocabulary of the recognizer. See https://cloud.google.com/speech/limits#content.
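For example, to bias recognition toward domain-specific terms (the hint values here are purely illustrative; a quoted, comma-separated value yields multiple hints):
gcloud alpha ml video transcribe-speech gs://my_bucket/keynote.mp4 --language-code=en-US --hints="machine learning,BigQuery"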
--max-alternatives=MAX_ALTERNATIVES - Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechTranscription. The server may return fewer than max_alternatives. Valid values are 0-30. A value of 0 or 1 will return a maximum of one. If omitted, a maximum of one will be returned.
--output-uri=OUTPUT_URI - Location to which the results should be written. Must be a Google Cloud Storage URI.
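As a sketch, several alternatives can be requested and the results written to Cloud Storage (the output path here is an assumption for illustration):
gcloud alpha ml video transcribe-speech gs://my_bucket/interview.mp4 --language-code=en-US --max-alternatives=3 --output-uri=gs://my_bucket/results/interview.json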
--region=REGION - Optional Cloud region where annotation should take place. If no region is specified, a region will be determined based on video file location.
REGION must be one of: asia-east1, europe-west1, us-east1, us-west1.
--segments=[SEGMENTS,…] - Segments from the video which you want to analyze (by default, the entire video will be treated as one segment). Must be in the format START1:END1[,START2:END2,…] (inclusive). START and END of segments must be a properly formatted duration string of the form HhMmSs, where:
* H is the number of hours from the beginning of the video
* M is the number of minutes from the beginning of the video
* S is the number of seconds from the beginning of the video
H, M and S can be specified as ints or floats for fractional units (to microsecond resolution). Unit chars (e.g. h, m or s) are required. Microseconds can be specified using fractional seconds, e.g. 0.000569s == 569 microseconds. Examples:
0s:23.554048s,24s:29.528064s
0:1m40s,3m50s:5m10.232265s
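For instance, to transcribe only the first 100 seconds and a later slice of a hypothetical video:
gcloud alpha ml video transcribe-speech gs://my_bucket/lecture.mp4 --language-code=en-US --segments=0s:1m40s,3m50s:5m10s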
- GCLOUD WIDE FLAGS
- These flags are available to all commands:
--access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.
Run $ gcloud help for details.
- API REFERENCE
- This command uses the videointelligence/v1 API. The full documentation for this API can be found at: https://cloud.google.com/video-intelligence/docs/
- This command is currently in alpha and might change without notice. If this command fails with API permission errors despite specifying the correct project, you might be trying to access an API with an invitation-only early access allowlist. This variant is also available:
gcloud beta ml video transcribe-speech