Context caching overview
To see an example of context caching, run the "Intro to context caching" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Context caching helps reduce the cost and latency of requests to Gemini that contain repeated content. Vertex AI offers two types of caching:
- Implicit caching: Automatic caching enabled by default that provides cost savings when cache hits occur.
- Explicit caching: Manual caching enabled using the Vertex AI API, where you explicitly declare the content you want to cache and whether or not your prompts should refer to the cached content.
For both implicit and explicit caching, the `cachedContentTokenCount` field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 2,048 tokens.
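As a minimal sketch of reading this field (assuming the Google Gen AI SDK for Python with Vertex AI enabled; the project, location, and model values are placeholders):

```python
from google import genai

# Placeholder project and location; adjust for your environment.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the shared reference material in three bullet points.",
)

# cachedContentTokenCount surfaces in the SDK as cached_content_token_count.
# It is unset when no part of the input was served from a cache.
usage = response.usage_metadata
print("Prompt tokens:", usage.prompt_token_count)
print("Cached tokens:", usage.cached_content_token_count)
```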
Caching storage costs
For both implicit and explicit caching, you're billed for the input tokens used to create the cache at the standard input token price. For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see Vertex AI pricing.
Implicit caching
All Google Cloud projects have implicit caching enabled by default. Implicit caching provides a 90% discount on cached tokens compared to standard input tokens.
When enabled, implicit cache hit cost savings are automatically passed on to you. To increase the chances of an implicit cache hit (see the sketch after this list):
- Place large and common contents at the beginning of your prompt.
- Send requests with a similar prefix in a short amount of time.
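The following sketch (Google Gen AI SDK for Python; the file and questions are hypothetical) shows both ideas: the large, shared context is placed at the start of each request, and the requests are sent back to back so their prefixes match.

```python
from google import genai

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Large, stable content reused across requests. It must exceed the
# 2,048-token minimum for any caching to apply.
with open("contract.txt") as f:  # hypothetical file
    shared_context = f.read()

questions = [
    "List the termination clauses.",
    "What is the governing law?",
]

for question in questions:
    # Shared prefix first, unique text last, sent in quick succession
    # to increase the chance of an implicit cache hit.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[shared_context, question],
    )
    print(response.usage_metadata.cached_content_token_count)
```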
Supported models
Implicit caching is supported when using the following models:
- Gemini 3.1 Pro (Preview)
- Gemini 3 Flash (Preview)
- Gemini 3 Pro (Preview)
- Gemini 2.5 Pro
- Gemini 2.5 Flash (Preview)
- Gemini 2.5 Flash-Lite (Preview)
- Gemini 2.5 Flash
- Gemini 2.5 Flash-Lite
Implicit caching also supports the latest aliases, including:
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
Implicit caching also supports Open Models. For more information, see Vertex AI open models for MaaS.
Explicit caching
Explicit caching offers more control and guarantees a discount on the input tokens that reference an existing context cache. On Gemini 2.5 models this discount is 90%; on Gemini 2.0 models it is 75%.
Using the Vertex AI API, you can:
- Create context caches and control them more effectively.
- Use a context cache by referencing its contents in a prompt request with its resource name.
- Update a context cache's expiration time (Time to Live, or TTL) past the default 60 minutes.
- Delete a context cache when no longer needed.
You can also use the Vertex AI API to retrieve information about a context cache.
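As a sketch of this lifecycle (assuming the Google Gen AI SDK for Python; the model, file, and display name are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Create a cache from large, reusable content (at least 2,048 tokens).
with open("policy-manual.txt") as f:  # hypothetical file
    manual_text = f.read()

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="policy-manual-cache",
        system_instruction="Answer questions using only the cached manual.",
        contents=[manual_text],
        ttl="3600s",  # default expiration is 60 minutes
    ),
)

# Use the cache by referencing its resource name in a prompt request.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Which sections cover data retention?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)

# Retrieve the cache's metadata, extend its TTL, then delete it when done.
print(client.caches.get(name=cache.name).expire_time)
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)
client.caches.delete(name=cache.name)
```

Note that a cache is tied to the model it was created with, so requests that reference it must use the same model.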
Explicit caches interact with implicit caching, potentially leading to additional caching beyond the specified contents when creating a cache. To prevent cache data retention, disable implicit caching and avoid creating explicit caches. For more information, see Enable and disable caching.
Supported models
Explicit caching is supported when using the following models:
- Gemini 3.1 Pro (Preview)
- Gemini 3 Flash (Preview)
- Gemini 3 Pro (Preview)
- Gemini 2.5 Pro
- Gemini 2.5 Flash (Preview)
- Gemini 2.5 Flash-Lite (Preview)
- Gemini 2.5 Flash
- Gemini 2.5 Flash-Lite
- Gemini 2.0 Flash
- Gemini 2.0 Flash-Lite
Explicit caching also supports the latest aliases, including:
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
When to use context caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.
Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request that composes a chat conversation might include the same context cache that references a video along with unique text that comprises each turn in the chat.
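A chat like the one just described might look like the following sketch (Google Gen AI SDK for Python; the cache resource name and turn text are placeholders), where each turn references the same cache and only the turn text changes:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Resource name of an existing context cache that references the video
# (placeholder value).
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Each turn sent through this chat reuses the cached video context; only
# the short per-turn text is unique.
chat = client.chats.create(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(cached_content=cache_name),
)

print(chat.send_message("Summarize the cached video in two sentences.").text)
print(chat.send_message("At what point is the product demo shown?").text)
```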
Consider using context caching for use cases such as:
- Chatbots with extensive system instructions
- Repetitive analysis of lengthy video files
- Recurring queries against large document sets
- Frequent code repository analysis or bug fixing
Implicit and explicit caching are supported with Provisioned Throughput in Preview. Refer to the Provisioned Throughput guide for more details.
Availability
Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.
Limits
The content that you explicitly cache must adhere to the limits shown in the following table:

| Context caching limits | |
|---|---|
| Minimum cache token count | 2,048 for all models |
| Maximum size of content you can cache using a blob or text | 10 MB |
| Minimum time before a cache expires after it's created | 1 minute |
| Maximum time before a cache expires after it's created | There isn't a maximum cache duration |
VPC Service Controls support
Context caching supports VPC Service Controls, meaning your cache cannot be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, include your bucket in your service perimeter as well to protect your cache content.
For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.
What's next
- Learn about the Gemini API.
- Learn how to use multimodal prompts.