Context caching overview
To see an example of context caching, run the "Intro to context caching" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Context caching helps reduce the cost and latency of requests to Gemini that contain repeated content. Vertex AI offers two types of caching:
- Implicit caching: Automatic caching enabled by default that provides cost savings when cache hits occur.
- Explicit caching: Manual caching enabled using the Vertex AI API, where you explicitly declare the content you want to cache and whether or not your prompts should refer to the cached content.
For both implicit and explicit caching, the `cachedContentTokenCount` field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 2,048 tokens.
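As a minimal sketch of reading this field (assuming the Google Gen AI SDK for Python with Vertex AI enabled; the project, location, and model values are placeholders):

```python
from google import genai

# Placeholder project and location; adjust for your environment.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the shared reference material in three bullet points.",
)

# cachedContentTokenCount surfaces in the SDK as cached_content_token_count.
# It is unset when no part of the input was served from a cache.
usage = response.usage_metadata
print("Prompt tokens:", usage.prompt_token_count)
print("Cached tokens:", usage.cached_content_token_count)
```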
Caching storage costs
For both implicit and explicit caching, you're billed for the input tokens used to create the cache at the standard input token price. For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see Vertex AI pricing.
Implicit caching
All Google Cloud projects have implicit caching enabled by default. Implicit caching provides a 90% discount on cached tokens compared to standard input tokens.
When enabled, implicit cache hit cost savings are automatically passed on to you. To increase the chances of an implicit cache hit (see the sketch after this list):
- Place large and common contents at the beginning of your prompt.
- Send requests with a similar prefix in a short amount of time.
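The following sketch (Google Gen AI SDK for Python; the file and questions are hypothetical) shows both ideas: the large, shared context is placed at the start of each request, and the requests are sent back to back so their prefixes match.

```python
from google import genai

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Large, stable content reused across requests. It must exceed the
# 2,048-token minimum for any caching to apply.
with open("contract.txt") as f:  # hypothetical file
    shared_context = f.read()

questions = [
    "List the termination clauses.",
    "What is the governing law?",
]

for question in questions:
    # Shared prefix first, unique text last, sent in quick succession
    # to increase the chance of an implicit cache hit.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[shared_context, question],
    )
    print(response.usage_metadata.cached_content_token_count)
```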
Supported models
Implicit caching is supported when using the following models:
- Gemini 3.1 Pro (Preview)
- Gemini 3 Flash (Preview)
- Gemini 3 Pro (Preview)
- Gemini 2.5 Pro
- Gemini 2.5 Flash (Preview)
- Gemini 2.5 Flash-Lite (Preview)
- Gemini 2.5 Flash
- Gemini 2.5 Flash-Lite
Implicit caching also supports the latest aliases, including:
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
Implicit caching also supports Open Models. For more information, see Vertex AI open models for MaaS.
Explicit caching
Explicit caching offers more control and guarantees a discount on the input tokens that reference an existing context cache. On Gemini 2.5 models this discount is 90%; on Gemini 2.0 models it is 75%.
Using the Vertex AI API, you can:
- Create context caches and control them more effectively.
- Use a context cache by referencing its contents in a prompt request with its resource name.
- Update a context cache's expiration time (Time to Live, or TTL) past the default 60 minutes.
- Delete a context cache when no longer needed.
You can also use the Vertex AI API to retrieve information about a context cache.
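As a sketch of this lifecycle (assuming the Google Gen AI SDK for Python; the model, file, and display name are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Create a cache from large, reusable content (at least 2,048 tokens).
with open("policy-manual.txt") as f:  # hypothetical file
    manual_text = f.read()

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="policy-manual-cache",
        system_instruction="Answer questions using only the cached manual.",
        contents=[manual_text],
        ttl="3600s",  # default expiration is 60 minutes
    ),
)

# Use the cache by referencing its resource name in a prompt request.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Which sections cover data retention?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)

# Retrieve the cache's metadata, extend its TTL, then delete it when done.
print(client.caches.get(name=cache.name).expire_time)
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)
client.caches.delete(name=cache.name)
```

Note that a cache is tied to the model it was created with, so requests that reference it must use the same model.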
Explicit caches interact with implicit caching, potentially leading to additional caching beyond the specified contents when creating a cache. To prevent cache data retention, disable implicit caching and avoid creating explicit caches. For more information, see Enable and disable caching.
Supported models
Explicit caching is supported when using the following models:
- Gemini 3.1 Pro (Preview)
- Gemini 3 Flash (Preview)
- Gemini 3 Pro (Preview)
- Gemini 2.5 Pro
- Gemini 2.5 Flash (Preview)
- Gemini 2.5 Flash-Lite (Preview)
- Gemini 2.5 Flash
- Gemini 2.5 Flash-Lite
- Gemini 2.0 Flash
- Gemini 2.0 Flash-Lite
Explicit caching also supports the latest aliases, including:
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
When to use context caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.
Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request that composes a chat conversation might include the same context cache that references a video along with unique text that comprises each turn in the chat.
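A chat like the one just described might look like the following sketch (Google Gen AI SDK for Python; the cache resource name and turn text are placeholders), where each turn references the same cache and only the turn text changes:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Resource name of an existing context cache that references the video
# (placeholder value).
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Each turn sent through this chat reuses the cached video context; only
# the short per-turn text is unique.
chat = client.chats.create(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(cached_content=cache_name),
)

print(chat.send_message("Summarize the cached video in two sentences.").text)
print(chat.send_message("At what point is the product demo shown?").text)
```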
Consider using context caching for use cases such as:
- Chatbots with extensive system instructions
- Repetitive analysis of lengthy video files
- Recurring queries against large document sets
- Frequent code repository analysis or bug fixing
Implicit and explicit caching are supported with Provisioned Throughput in Preview. Refer to the Provisioned Throughput guide for more details.
Availability
Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.
Limits
The content that you explicitly cache must adhere to the limits shown in the following table:

| Context caching limits | |
|---|---|
| Minimum cache token count | 2,048 for all models |
| Maximum size of content you can cache using a blob or text | 10 MB |
| Minimum time before a cache expires after it's created | 1 minute |
| Maximum time before a cache expires after it's created | There isn't a maximum cache duration |
VPC Service Controls support
Context caching supports VPC Service Controls, meaning your cache cannot be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, include your bucket in your service perimeter as well to protect your cache content.
For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.
What's next
- Learn about the Gemini API.
- Learn how to use multimodal prompts.