Run LLM inference on GPUs with Gemma 3 and Ollama

Objectives

This guide shows how to run LLM inference on Cloud Run GPUs with Gemma 3 and Ollama, and has the following objectives:

  • Deploy Ollama with the Gemma 3 model on a GPU-enabled Cloud Run service.
  • Send prompts to the Ollama service on its private endpoint.

To learn about an alternative way to deploy Gemma 3 open models on Cloud Run using a prebuilt container, see Run Gemma 3 models on Cloud Run.

Costs

In this document, you use billable components of Google Cloud, including Cloud Run, Artifact Registry, Cloud Build, and Cloud Storage.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Artifact Registry, Cloud Build, Cloud Run, and Cloud Storage APIs. You can also enable them with the gcloud CLI, as sketched after this list.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  5. Install and initialize the gcloud CLI.
  6. To complete this tutorial, request Total Nvidia L4 GPU allocation, per project per region quota under the Cloud Run Admin API on the Quotas and system limits page.
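
For the command-line route mentioned in step 4, a minimal sketch using gcloud (the service names are the standard API identifiers for the four products):

    gcloud services enable \
      artifactregistry.googleapis.com \
      cloudbuild.googleapis.com \
      run.googleapis.com \
      storage.googleapis.com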

Required roles

To get the permissions that you need to complete the tutorial, ask your administrator to grant you the required IAM roles on your project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Note: IAM basic roles might also contain permissions to complete the tutorial. You shouldn't grant basic roles in a production environment, but you can grant them in a development or test environment.

Grant the roles

Console

  1. In the Google Cloud console, go to the IAM page.

    Go to IAM
  2. Select the project.
  3. Click Grant access.
  4. In the New principals field, enter your user identifier. This is typically the email address that is used to deploy the Cloud Run service.

  5. In the Select a role list, select a role.
  6. To grant additional roles, click Add another role and add each additional role.
  7. Click Save.

gcloud

To grant the required IAM roles to your account on your project:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=PRINCIPAL \
  --role=ROLE

Replace:

  • PROJECT_ID with your Google Cloud project ID.
  • PRINCIPAL with the account you are adding the binding for. This is typically the email address that is used to deploy the Cloud Run service.
  • ROLE with the role you are adding to the deployer account.
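
For example, a hypothetical invocation that grants the Cloud Run Admin role to a deployer account (the email address is a placeholder, and your project might require additional roles beyond this one):

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=user:deployer@example.com \
      --role=roles/run.admin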

Set up gcloud

To configure the Google Cloud CLI for your Cloud Run service:

  1. Set your default project:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with the ID of the project you created for this tutorial.

  2. Configure Google Cloud CLI to use the region europe-west1 for Cloud Run commands.

    gcloud config set run/region europe-west1
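
To confirm the setup, you can optionally print both values back before continuing; this is just a sanity check:

    gcloud config get-value project
    gcloud config get-value run/region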

Use Docker to create a container image with Ollama and Gemma

  1. Create a directory for the Ollama service and change your working directory to this new directory:

    mkdir ollama-backend
    cd ollama-backend
  2. Create a Dockerfile file with the following contents:

    FROM ollama/ollama:latest

    # Listen on all interfaces, port 8080
    ENV OLLAMA_HOST 0.0.0.0:8080

    # Store model weight files in /models
    ENV OLLAMA_MODELS /models

    # Reduce logging verbosity
    ENV OLLAMA_DEBUG false

    # Never unload model weights from the GPU
    ENV OLLAMA_KEEP_ALIVE -1

    # Store the model weights in the container image
    ENV MODEL gemma3:4b
    RUN ollama serve & sleep 5 && ollama pull $MODEL

    # Start Ollama
    ENTRYPOINT ["ollama", "serve"]
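
The deployment step below builds this image with Cloud Build, so a local build is not required. If you want to build and test the image locally first, the following sketch assumes Docker and the NVIDIA Container Toolkit are installed on a machine with an NVIDIA GPU:

    # Build the image locally; this pulls the Gemma 3 (4B) weights into the image (roughly 8 GB).
    docker build -t ollama-gemma .

    # Run the container locally, exposing Ollama on port 8080.
    docker run --rm --gpus all -p 8080:8080 ollama-gemma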

Store model weights in the container image for faster instance starts

Google recommends storing the model weights for Gemma 3 (4B) and similarly sized models directly in the container image.

Model weights are the numerical parameters that define the behavior of an LLM. Ollama must fully read these files and load the weights into GPU memory (VRAM) during container instance startup, before it can start serving inference requests.

On Cloud Run, a fast container instance startup is important for minimizing request latency. If your container instance has a slow startup time, the service takes longer to scale from zero to one instance, and it needs more time to scale out during a traffic spike.

To ensure a fast startup, store the model files in the container image itself. This is faster and more reliable than downloading the files from a remote location during startup. Cloud Run's internal container image storage is optimized for handling traffic spikes, allowing it to quickly set up the container's file system when an instance starts.

Note that the model weights for Gemma 3 (4B) take up 8 GB of storage. Larger models have larger model weight files, and these might be impractical to store in the container image. Refer to Best practices: AI inference on Cloud Run with GPUs for an overview of the trade-offs.

Build and deploy the Ollama service for LLM inference

Build and deploy the service to Cloud Run:

gcloud run deploy ollama-gemma \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600

Note the following important flags in this command:

  • --concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
  • --gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
  • --max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU (Total Nvidia L4 GPU allocation, per project per region) quota.
  • --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication. Refer to Managing access using IAM.
  • --no-cpu-throttling is required for enabling GPU.
  • --no-gpu-zonal-redundancy turns off GPU zonal redundancy. Set this option based on your zonal failover requirements and available quota. See GPU zonal redundancy options for details.
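
After the deployment completes, you can optionally inspect the service to confirm that these flags took effect; a quick check:

    gcloud run services describe ollama-gemma --region europe-west1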

Concurrency settings for optimal performance

This section provides context on the recommended concurrency settings. For optimal request latency, ensure the --concurrency setting is equal to Ollama's OLLAMA_NUM_PARALLEL environment variable.

  • OLLAMA_NUM_PARALLEL determines how many request slots are available per model to handle inference requests concurrently.
  • --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.

If --concurrency exceeds OLLAMA_NUM_PARALLEL, Cloud Run can send more requests to a model in Ollama than it has available request slots for. This leads to request queuing within Ollama, increasing request latency for the queued requests. It also leads to less responsive auto scaling, as the queued requests don't trigger Cloud Run to scale out and start new instances.

Ollama also supports serving multiple models from one GPU. To completely avoid request queuing on the Ollama instance, you should still set --concurrency to match OLLAMA_NUM_PARALLEL.

It's important to note that increasing OLLAMA_NUM_PARALLEL also makes parallel requests take longer.

Optimize GPU utilization

For optimal GPU utilization, increase --concurrency, keeping it within twice the value of OLLAMA_NUM_PARALLEL. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
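
As an illustration only, one way to apply this guidance to the service deployed earlier is to raise --concurrency on the existing service while leaving OLLAMA_NUM_PARALLEL at 4, which keeps concurrency within twice that value:

    gcloud run services update ollama-gemma \
      --region europe-west1 \
      --concurrency 8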

Test the deployed Ollama service with curl

Now that you have deployed the Ollama service, you can send requests to it. However, if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized. This is intentional, because an LLM inference API is intended for other services to call, such as a frontend application. For more information on service-to-service authentication on Cloud Run, refer to Authenticating service-to-service.

To send requests to the Ollama service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:

  1. Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:

    gcloud run services proxy ollama-gemma --port=9090
  2. Send a request to it in a separate terminal tab, leaving the proxy running. Note that the proxy runs on localhost:9090:

    curl http://localhost:9090/api/generate -d '{
      "model": "gemma3:4b",
      "prompt": "Why is the sky blue?"
    }'

    This command should provide streaming output similar to this:

    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.641492408Z","response":"That","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.687529153Z","response":"'","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.753284927Z","response":"s","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.812957381Z","response":" a","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.889102649Z","response":" fantastic","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.925748116Z","response":",","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.958391572Z","response":" decept","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.971035028Z","response":"ively","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.989678484Z","response":" tricky","done":false}{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.999321940Z","response":" question","done":false}...
    Success: You deployed a GPU-enabled Cloud Run service with Gemma 3 on Ollama and sent aninference request to it.
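
If you prefer to call the service without the developer proxy, you can instead attach an identity token to the request yourself. This sketch assumes your account has invoker access to the service and uses the service URL reported by gcloud:

    # Look up the service URL and obtain an identity token for the current account.
    SERVICE_URL=$(gcloud run services describe ollama-gemma --region europe-west1 --format 'value(status.url)')
    TOKEN=$(gcloud auth print-identity-token)

    curl "$SERVICE_URL/api/generate" \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "model": "gemma3:4b",
        "prompt": "Why is the sky blue?"
      }'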

Clean up

To avoid additional charges to your Google Cloud account, delete all the resources you deployed with this tutorial.

Delete the project

If you created a new project for this tutorial, delete the project. If you used an existing project and need to keep it without the changes you added in this tutorial, delete resources that you created for the tutorial.

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
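
If you prefer the command line, a project can also be deleted with gcloud; note that this schedules the project for deletion, and it becomes unrecoverable after the grace period:

    gcloud projects delete PROJECT_ID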

Delete tutorial resources

  1. Delete the Cloud Run service you deployed in this tutorial. Cloud Run services don't incur costs until they receive requests.

    To delete your Cloud Run service, run the following command:

    gcloud run services delete SERVICE-NAME

    Replace SERVICE-NAME with the name of your service.

    You can also delete Cloud Run services from the Google Cloud console.

  2. Remove the gcloud default region configuration you added during tutorial setup:

    gcloud config unset run/region
  3. Remove the project configuration:

     gcloud config unset project
