Best practices: Cloud Run jobs with GPUs

This page provides best practices for optimizing performance when using a Cloud Run job with GPU for AI workloads such as training large language models (LLMs) using your preferred frameworks, fine-tuning, and performing batch or offline inference on LLMs. To create a Cloud Run job that can perform compute-intensive tasks or batch processing in real time, you should:
  • Use models that load fast and require minimal transformation into GPU-ready structures, and optimize how they are loaded.
  • Use configurations that allow for maximum, efficient, concurrent execution to reduce the number of GPUs needed to serve a target requests-per-second rate while keeping costs down.

Recommended ways to load large ML models on Cloud Run

Google recommends downloading ML models from Cloud Storage and accessing them through the Google Cloud CLI. You might alternatively store models inside container images, but this method is best suited for smaller models of less than 10 GB.
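For example, a container entrypoint script can pull the model into local storage in parallel before starting the workload. This is a minimal sketch; the bucket name, local path, and workload command are placeholders.

#!/bin/bash
# Download the model files in parallel at container startup, then launch the
# batch workload. gs://MY_MODELS_BUCKET/my-model and /models are placeholders.
set -e
mkdir -p /models
gcloud storage cp --recursive gs://MY_MODELS_BUCKET/my-model /models/
exec python3 run_batch_inference.py --model-dir=/models/my-model   # placeholder workload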

Storing and loading ML models trade-offs

Here is a comparison of the options, covering deploy time, development experience, container startup time, and storage cost for each model location:

  • Cloud Storage, downloaded concurrently using the Google Cloud CLI command gcloud storage cp or the Cloud Storage API, as shown in the transfer manager concurrent download code sample:
    • Deploy time: Fastest. The model is downloaded during container startup. Ensure the Cloud Run instance has sufficient RAM allocated to store the model files.
    • Development experience: Slightly more difficult to set up, because you need to either install the Google Cloud CLI on the image or update your code to use the Cloud Storage API. For more information on how to fetch credentials from the metadata server, see Introduction to service identity.
    • Container startup time: Fast when you use network optimizations. The Google Cloud CLI downloads the model files in parallel, making it faster than a FUSE mount.
    • Storage cost: One copy in Cloud Storage.
  • Cloud Storage, loaded using a Cloud Storage FUSE volume mount:
    • Deploy time: Faster. The model is downloaded during container startup.
    • Development experience: Not difficult to set up; does not require changes to the Docker image.
    • Container startup time: Fast when you use network optimizations. Does not parallelize the download.
    • Storage cost: One copy in Cloud Storage.
  • Container image:
    • Deploy time: Slow. An image containing a large model takes longer to import into Cloud Run.
    • Development experience: You need to build a new image every time you want to use a different model. Changes to the container image require redeployment, which can be slow for large images.
    • Container startup time: Depends on the size of the model. For very large models, use Cloud Storage for more predictable but slower performance.
    • Storage cost: Potentially multiple copies in Artifact Registry.
  • Internet:
    • Deploy time: Slow. The model is downloaded during container startup.
    • Development experience: Typically simpler; many frameworks download models from central repositories.
    • Container startup time: Typically poor and unpredictable:
      • Frameworks may apply model transformations during initialization (you should do this at build time instead).
      • The model host and the libraries used to download the model may not be efficient.
      • There is a reliability risk associated with downloading from the internet: your job could fail to start if the download target is down, and the underlying model could change, which decreases quality. We recommend hosting models in your own Cloud Storage bucket.
    • Storage cost: Depends on the model hosting provider.

Store models in Cloud Storage

To optimize ML model loading from Cloud Storage, whether you use Cloud Storage volume mounts or use the Cloud Storage API or command line directly, you must use Direct VPC with the egress setting value set to all-traffic, along with Private Google Access.
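The following is a minimal sketch of one way to apply these settings to an existing job, assuming a VPC network and subnet that already have Private Google Access enabled; the job, network, subnet, and region names are placeholders, and flag spellings can vary by gcloud release.

# Route the job's egress through Direct VPC so Cloud Storage traffic stays on
# Private Google Access. All names below are placeholders.
gcloud run jobs update my-gpu-job \
  --region=us-central1 \
  --network=my-vpc \
  --subnet=my-subnet \
  --vpc-egress=all-traffic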

For an additional cost, using Anywhere Cache can reduce model loading latency by efficiently caching data on SSDs for faster reads.

To reduce model read times, try the following mount options to enable Cloud Storage FUSE features:

  • cache-dir: Enable the file caching feature with an in-memory volume mount to use as the underlying directory to persist files. Set the cache-dir mount option value to the in-memory volume name in the format cr-volume:{volume name}. For example, if you have an in-memory volume named in-memory-1 that you want to use as the cache directory, specify cr-volume:in-memory-1. When this value is set, you can also set the other file-cache flags available to configure the cache.
  • enable-buffered-read: Set the enable-buffered-read field to true for asynchronous prefetching of parts of a Cloud Storage object into an in-memory buffer. This allows subsequent reads to be served from the buffer instead of requiring network calls. When you configure this field, you can also set the read-global-max-blocks field to configure the maximum number of blocks available for buffered reads across all file handles.

When both cache-dir and enable-buffered-read are used, cache-dir takes precedence. Note that enabling either of these features changes the resource accounting of the Cloud Storage FUSE process so that it counts against the container memory limit. Consider raising the container memory limit by following the instructions on how to configure memory limits.
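The following is a minimal sketch of how these mount options might be set on a job, assuming a bucket named MY_MODELS_BUCKET and an in-memory volume named in-memory-1; the volume and mount-option syntax shown here is illustrative and can vary by gcloud release.

# Add an in-memory volume to back the Cloud Storage FUSE file cache, then mount
# the model bucket with file caching and buffered reads enabled.
# Job, bucket, volume, and path names are placeholders.
gcloud run jobs update my-gpu-job \
  --region=us-central1 \
  --add-volume=name=in-memory-1,type=in-memory,size-limit=10Gi \
  --add-volume=name=model-volume,type=cloud-storage,bucket=MY_MODELS_BUCKET,mount-options="cache-dir=cr-volume:in-memory-1;enable-buffered-read=true" \
  --add-volume-mount=volume=model-volume,mount-path=/models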

Store models in container images

By storing the ML model in the container image, model loading benefits from Cloud Run's optimized container streaming infrastructure. However, building container images that include ML models is a resource-intensive process, especially when working with large models. In particular, the build process can become bottlenecked on network throughput. When using Cloud Build, we recommend using a more powerful build machine with increased compute and networking performance. To do this, build an image using a build config file that has the following steps:

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'
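If you save this config as cloudbuild.yaml (the file name is a placeholder), you can submit the build with:

# Submit the build using the config file above.
gcloud builds submit --config=cloudbuild.yaml .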

You end up with one copy of the model per image if the layer containing the model is distinct between images (that is, a different hash), which can incur additional Artifact Registry cost.

Load models from the internet

To optimize ML model loading from the internet, route all traffic through the VPC network with the egress setting value set to all-traffic, and set up Cloud NAT to reach the public internet at high bandwidth.
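As a rough sketch, Cloud NAT requires a Cloud Router on the same VPC network and region that the job uses for Direct VPC egress; the router, NAT gateway, network, and region names below are placeholders.

# Create a Cloud Router and a Cloud NAT gateway so Direct VPC egress can reach
# the public internet. All names and the region are placeholders.
gcloud compute routers create my-router \
  --network=my-vpc \
  --region=us-central1
gcloud compute routers nats create my-nat \
  --router=my-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges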

Build, deployment, runtime, and system design considerations

The following sections describe considerations for build time, deployment, run time, and system design.

At build time

The following list shows considerations you need to take into account when you are planning your build:

  • Choose a good base image. You should start with an image from the Deep Learning Containers or the NVIDIA container registry for the ML framework you're using. These images have the latest performance-related packages installed. We don't recommend creating a custom image.
  • Choose 4-bit quantized models to maximize concurrency unless you can prove they affect result quality. Quantization produces smaller and faster models, reducing the amount of GPU memory needed to serve the model, and can increase parallelism at run time. Ideally, the models should be trained at the target bit depth rather than quantized down to it.
  • Pick a model format with fast load times to minimize container startup time, such as GGUF. These formats more accurately reflect the target quantization type and require fewer transformations when loaded onto the GPU. For security reasons, don't use pickle-format checkpoints.
  • Create and warm LLM caches at build time. Start the LLM on the build machine while building the Docker image. Enable prompt caching and feed common or example prompts to help warm the cache for real-world use. Save the outputs it generates to be loaded at runtime (see the sketch after this list).
  • Save your own inference model that you generate during build time. This saves significant time compared to loading less efficiently stored models and applying transforms such as quantization at container startup.
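The exact warm-up commands depend on your inference framework, but a build-time step might look something like the following hypothetical script invocation, where warm_cache.py, the model file, the prompt list, and the cache file are all placeholders standing in for your framework's own prompt-caching mechanism.

# Hypothetical build step run while building the image: load the model once,
# replay common prompts, and persist the resulting prompt/prefix cache so it
# can be reused at container startup. All file and script names are placeholders.
python3 warm_cache.py \
  --model=/models/model-q4.gguf \
  --prompts=/build/common_prompts.txt \
  --save-prompt-cache=/models/prompt-cache.bin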

At deployment

The following list shows considerations you need to take into account when you are planning your deployment:

  • Set a task timeout of one hour or less for job executions.
  • If you are running parallel tasks in a job execution, determine and set parallelism to less than the GPU quota without zonal redundancy allocated for your project. To request a quota increase, see How to increase quota. GPU tasks start as quickly as possible, and go up to a maximum that varies depending on how much GPU quota you allocated for the project and the region selected. Deployments fail if you set parallelism to more than the GPU quota limit. See the example after this list.
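For example, the following sketch sets both values on an existing job; the job name, region, and numbers are placeholders, and the parallelism value must stay within your project's GPU quota.

# Limit each task to a one-hour timeout and cap the number of tasks that run
# in parallel so they stay below the available GPU quota. Values are placeholders.
gcloud run jobs update my-gpu-job \
  --region=us-central1 \
  --task-timeout=1h \
  --tasks=20 \
  --parallelism=8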

At run time

  • Actively manage your supported context length. The smaller the context window you support, the more queries you can run in parallel. The details of how to do this depend on the framework.
  • Use the LLM caches you generated at build time. Supply the same flags you used during build time when you generated the prompt and prefix cache.
  • Load from the saved model you wrote at build time. See Storing and loading ML models trade-offs for a comparison of the ways to load the model.
  • Consider using a quantized key-value cache if your framework supports it. This can reduce per-query memory requirements and allows you to configure more parallelism. However, it can also impact quality.
  • Tune the amount of GPU memory to reserve for model weights, activations, and key-value caches. Set it as high as you can without getting an out-of-memory error.
  • Check whether your framework has any options for improving container startup performance (for example, using model loading parallelization).

At the system design level

  • Add semantic caches where appropriate. In some cases, caching whole queries and responses can be a great way of limiting the cost of common queries.
  • Control variance in your preambles. Prompt caches are only useful when they contain the prompts in sequence. Caches are effectively prefix-cached. Insertions or edits in the sequence mean that they're either not cached or only partially present.
