Video tuning
This page provides prerequisites and detailed instructions for fine-tuning Gemini on video data using supervised learning.
Use cases
Fine-tuning lets you adapt base Gemini models for specialized tasks. Here are some video use cases:
- Automated video summarization: Tuning LLMs to generate concise and coherent summaries of long videos, capturing the main themes, events, and narratives. This is useful for content discovery, archiving, and quick reviews.
- Detailed event recognition and localization: Fine-tuning allows LLMs to identify and pinpoint specific actions, events, or objects within a video timeline with greater accuracy. For example, identifying all instances of a particular product in a marketing video or a specific action in sports footage.
- Content moderation: Specialized tuning can improve an LLM's ability to detect sensitive, inappropriate, or policy-violating content within videos, going beyond simple object detection to understand context and nuance.
- Video captioning and subtitling: While already a common application, tuning can improve the accuracy, fluency, and context-awareness of automatically generated captions and subtitles, including descriptions of nonverbal cues.
Limitations
- Maximum video file size: 100 MB. This may not be sufficient for large video files. Some recommended workarounds are as follows:
  - If there are only a few large files, exclude them from the JSONL files.
  - If your dataset contains many large files that can't be ignored, reduce the visual resolution of the files. This may hurt performance.
  - Chunk the videos so that each file stays under 100 MB and use the chunked videos for tuning. Make sure to shift any timestamp annotations from the original video's timeline to the new (chunked) video's timeline; see the sketch after this list.
- Maximum video length per example: 5 minutes with MEDIA_RESOLUTION_MEDIUM and 20 minutes with MEDIA_RESOLUTION_LOW.
- Dropped examples: If an example contains a video that is longer than the supported maximum length, that example is dropped from the dataset. Dropped examples are not billed or used for training. If more than 10% of the dataset is dropped, the job fails with an error message before training starts.
- Mixing different media resolutions isn't supported: The value of mediaResolution must be consistent for every example in the entire training dataset. All lines in the JSONL files used for training and validation must have the same value of mediaResolution.
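If you take the chunking workaround, the sketch below shows one way to split a video and shift annotations. This is a minimal sketch, not an official tool: it assumes ffmpeg and ffprobe are installed and on PATH, and the CHUNK_SECONDS value and the annotation format (dicts with start/end seconds) are hypothetical placeholders for whatever your pipeline uses.

```python
# Minimal sketch: split an oversized video into chunks and shift timestamp
# annotations onto each chunk's timeline. Assumes ffmpeg/ffprobe are
# installed; CHUNK_SECONDS and the annotation format are hypothetical.
import subprocess

CHUNK_SECONDS = 240  # choose a length that keeps each chunk under 100 MB

def video_duration(path: str) -> float:
    """Return the video duration in seconds, via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def chunk_video(path: str, chunk_seconds: int = CHUNK_SECONDS) -> list[str]:
    """Split `path` into consecutive chunks; returns the chunk file paths."""
    chunks = []
    for i, start in enumerate(range(0, int(video_duration(path)), chunk_seconds)):
        out_path = f"{path.rsplit('.', 1)[0]}_chunk{i}.mp4"
        # Stream copy is fast but cuts at keyframes, so chunk boundaries may
        # be slightly off; re-encode instead if you need exact cuts.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(chunk_seconds),
             "-i", path, "-c", "copy", out_path], check=True)
        chunks.append(out_path)
    return chunks

def shift_annotations(annotations, chunk_index, chunk_seconds=CHUNK_SECONDS):
    """Re-express original-video annotations on one chunk's timeline."""
    chunk_start = chunk_index * chunk_seconds
    chunk_end = chunk_start + chunk_seconds
    return [
        {**a, "start": a["start"] - chunk_start, "end": a["end"] - chunk_start}
        for a in annotations
        if chunk_start <= a["start"] and a["end"] <= chunk_end
    ]
```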
Dataset format
The fileUri field specifies the location of your dataset. It can be the URI for a file in a Cloud Storage bucket, or it can be a publicly available HTTP or HTTPS URL.
The mediaResolution field is used to specify the token count per frame for the input videos, as one of the following values:
- MEDIA_RESOLUTION_LOW: 64 tokens per frame
- MEDIA_RESOLUTION_MEDIUM: 256 tokens per frame
Model tuning with MEDIA_RESOLUTION_LOW is roughly four times faster than tuning with MEDIA_RESOLUTION_MEDIUM, while the higher resolution yields only a minimal performance improvement.
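As a rough way to compare the two settings, the snippet below estimates per-example video token counts from the per-frame values above. It assumes approximately one sampled frame per second; treat that rate as an assumption, not a documented guarantee.

```python
# Back-of-the-envelope video token estimate per example, assuming roughly
# one sampled frame per second (an assumption; actual sampling may differ).
TOKENS_PER_FRAME = {
    "MEDIA_RESOLUTION_LOW": 64,
    "MEDIA_RESOLUTION_MEDIUM": 256,
}

def estimate_video_tokens(duration_seconds: int, resolution: str) -> int:
    frames = duration_seconds  # ~1 frame per second
    return frames * TOKENS_PER_FRAME[resolution]

# 5 minutes at MEDIA_RESOLUTION_MEDIUM: 300 * 256 = 76,800 tokens.
print(estimate_video_tokens(5 * 60, "MEDIA_RESOLUTION_MEDIUM"))  # 76800
# 20 minutes at MEDIA_RESOLUTION_LOW: 1200 * 64 = 76,800 tokens.
print(estimate_video_tokens(20 * 60, "MEDIA_RESOLUTION_LOW"))    # 76800
```

Under this assumption, the two maximum lengths work out to the same per-example video token budget, which is consistent with the limits above.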
When a video segment is used for training and validation, the segment is specified in the videoMetadata field. During tuning, this data point is decoded to contain information from the segment extracted from the specified video file, starting from timestamp startOffset (the start offset, in seconds) until endOffset (the end offset, in seconds).
To see the generic format example, see Dataset example for Gemini.
The following sections present video dataset format examples.
JSON schema example for cases where the full video is used for training and validation
This schema is added as a single line in the JSONL file.
{"contents":[{"role":"user","parts":[{"fileData":{"fileUri":"gs://<path to the mp4 video file>","mimeType":"video/mp4"},},{"text":" You are a video analysis expert. Detect which animal appears in the video.The video can only have one of the following animals: dog, cat, rabbit.\n Output Format:\n Generate output in the following JSON format:\n [{\n \"animal_name\": \"<CATEGORY>\",\n }]\n"}]},{"role":"model","parts":[{"text":"```json\n[{\"animal_name\": \"dog\"}]\n```"}]},],"generationConfig":{"mediaResolution":"MEDIA_RESOLUTION_LOW"}}JSON schema example for cases where a video segment is used for training and validation
JSON schema example for cases where a video segment is used for training and validation
This schema is added as a single line in the JSONL file.
{"contents":[{"role":"user","parts":[{"fileData":{"fileUri":"gs://<path to the mp4 video file>","mimeType":"video/mp4"},"videoMetadata":{"startOffset":"5s","endOffset":"25s"}},{"text":" You are a video analysis expert. Detect which animal appears in the video.The video can only have one of the following animals: dog, cat, rabbit.\n Output Format:\n Generate output in the following JSON format:\n [{\n \"animal_name\": \"<CATEGORY>\",\n }]\n"}]},{"role":"model","parts":[{"text":"```json\n[{\"animal_name\": \"dog\"}]\n```"}]},],"generationConfig":{"mediaResolution":"MEDIA_RESOLUTION_LOW"}}What's next
What's next
- To learn more about video tuning, see How to fine-tune Gemini 2.5 using videos via Vertex AI.
- To learn more about the image understanding capability of Gemini, see our Image understanding documentation.
- To start tuning, see Tune Gemini models by using supervised fine-tuning.
- To learn how supervised fine-tuning can be used in a solution that builds a generative AI knowledge base, see Jump Start Solution: Generative AI knowledge base.