Context caching overview

Context caching helps reduce the cost and latency of requests to Gemini that contain repeated content. Vertex AI offers two types of caching:

  • Implicit caching: Automatic caching, enabled by default, that provides cost savings when cache hits occur.
  • Explicit caching: Manual caching enabled using the Vertex AI API, where you explicitly declare the content you want to cache and whether or not your prompts should refer to the cached content.

For both implicit and explicit caching, the cachedContentTokenCount field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 2,048 tokens.
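
For example, here's a minimal sketch of reading that field with the google-genai Python SDK; the project, location, model name, and prompt are placeholder assumptions:

```python
# A minimal sketch, assuming the google-genai Python SDK; the project,
# location, model name, and prompt are placeholders.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="...a prompt whose shared prefix exceeds 2,048 tokens...",
)

# Number of input tokens served from cache (implicit or explicit).
print(response.usage_metadata.cached_content_token_count)
```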

Caching storage costs

For both implicit and explicit caching, you're billed for the input tokens used to create the cache at the standard input token price. For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see Vertex AI pricing.

Implicit caching

All Google Cloud projects have implicit caching enabled by default. Implicit caching provides a 90% discount on cached tokens compared to standard input tokens.

When enabled, implicit cache hit cost savings are automatically passed on to you. To increase the chances of an implicit cache hit:

  • Place large and common contents at the beginning of your prompt.
  • Send requests with a similar prefix in a short amount of time, as in the sketch after this list.
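
The following sketch demonstrates both tips, assuming the google-genai Python SDK; the file, model name, and questions are illustrative, not prescribed values:

```python
# A sketch assuming the google-genai Python SDK; the file, model name,
# and questions are illustrative placeholders.
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# The large, shared content goes first so every request starts with
# an identical prefix.
large_context = open("contract.txt").read()

questions = [
    "Summarize the termination clause.",
    "List all payment deadlines.",
]

# Sending the requests back to back keeps the shared prefix warm,
# raising the chance of an implicit cache hit.
for question in questions:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=large_context + "\n\n" + question,
    )
    print(response.usage_metadata.cached_content_token_count)
```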

Supported models

Implicit caching is supported when using the following models:

Implicit caching also supports the latest aliases, including:

  • gemini-flash-latest
  • gemini-flash-lite-latest

Implicit caching also supports Open Models. For more information, see Vertex AI open models for MaaS.

Explicit caching

Explicit caching offers more control and ensures a discount when explicit caches are referenced, that is, a discount on the input tokens that reference an existing context cache. On Gemini 2.5 models this discount is 90%; on Gemini 2.0 models it is 75%.

Using the Vertex AI API, you can:

  • Create a context cache
  • Use a context cache
  • Update the expiration time of a context cache
  • Delete a context cache

You can also use the Vertex AI API to retrieve information about a context cache.
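
For illustration, here's a hedged sketch of creating, using, and inspecting an explicit cache with the google-genai Python SDK; the Cloud Storage URI, model name, and TTL are assumptions, not prescribed values:

```python
# A hedged sketch of explicit caching with the google-genai Python SDK;
# the file URI, model, system instruction, and TTL are illustrative.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Create a cache holding the large, reused content (>= 2,048 tokens).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a contract-review assistant.",
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(
                        file_uri="gs://my-bucket/contract.pdf",
                        mime_type="application/pdf",
                    )
                ],
            )
        ],
        ttl="3600s",  # the cache expires after one hour
    ),
)

# Reference the cache from a prompt; only the new text is billed at
# the full input token rate.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the indemnification clause.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)

# Retrieve information about the context cache.
print(client.caches.get(name=cache.name))
```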

Explicit caches interact with implicit caching, potentially leading to additional caching beyond the specified contents when creating a cache. To prevent cache data retention, disable implicit caching and avoid creating explicit caches. For more information, see Enable and disable caching.

Supported models

Explicit caching is supported when using the following models:

Explicit caching also supports the latest aliases, including:

  • gemini-flash-latest
  • gemini-flash-lite-latest

When to use context caching

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.

Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request that composes a chat conversation might include the same context cache that references a video along with unique text that comprises each turn in the chat.
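
As a sketch of that chat pattern (assuming the google-genai Python SDK; the cache resource name, model, and turns are placeholders), each request carries only the new chat turns while the shared video context comes from the cache:

```python
# A hedged sketch of a chat loop where every turn references the same
# explicit context cache; the cache resource name, model, and prompts
# are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Resource name of a previously created cache that holds the video.
CACHE_NAME = "projects/my-project/locations/us-central1/cachedContents/123"

history = []
for user_turn in ["What happens in the first scene?", "Who narrates the ending?"]:
    history.append(
        types.Content(role="user", parts=[types.Part.from_text(text=user_turn)])
    )
    # Each request sends only the unique chat turns; the large video
    # context is supplied by the shared cache.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=history,
        config=types.GenerateContentConfig(cached_content=CACHE_NAME),
    )
    history.append(response.candidates[0].content)
    print(response.text)
```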

Consider using context caching for use cases such as:

  • Chatbots with extensive system instructions
  • Repetitive analysis of lengthy video files
  • Recurring queries against large document sets
  • Frequent code repository analysis or bug fixing

Implicit and explicit caching are supported with Provisioned Throughput in Preview. Refer to the Provisioned Throughput guide for more details.

Availability

Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.

Limits

The content that you explicitly cache must adhere to the limits shown in the following table:

Context caching limits

  • Minimum cache token count: 2,048 for all models
  • Maximum size of content you can cache using a blob or text: 10 MB
  • Minimum time before a cache expires after it's created: 1 minute
  • Maximum time before a cache expires after it's created: none (there isn't a maximum cache duration)
Important: When caching objects that are stored in a Cloud Storage bucket, don't make changes to objects until the cached contents are expired or deleted. Updates to Cloud Storage objects can cause the associated cached contents to be unusable.
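
Because a cache has no maximum duration, you can extend its lifetime after creation. A minimal sketch, assuming the google-genai SDK's caches.update method; the cache resource name and TTL are placeholders:

```python
# A minimal sketch, assuming the google-genai Python SDK's caches.update
# method; the cache name and TTL values are placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Extend the cache's lifetime (no maximum; minimum lifetime is 1 minute).
client.caches.update(
    name="projects/my-project/locations/us-central1/cachedContents/123",
    config=types.UpdateCachedContentConfig(ttl="86400s"),  # keep for one day
)
```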

VPC Service Controls support

Context caching supports VPC Service Controls, meaning your cache cannot be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, include your bucket in your service perimeter as well to protect your cache content.

For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.
