Run Gemma 3 on Cloud Run

This guide describes how to deploy Gemma 3 open models on Cloud Run using a prebuilt container, and provides guidance on using the deployed Cloud Run service with the Google Gen AI SDK.

Before you begin

If you used Google AI Studio to deploy to Cloud Run, skip to the Securely interact with the Google Gen AI SDK section.

If you didn't use Google AI Studio, follow these steps before using Cloud Run to create a new service.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Set up your Cloud Run development environment in your Google Cloud project.
  5. Install and initialize the gcloud CLI.
  6. Ensure you have the required IAM roles granted to your account. To grant the roles:

    Console

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address that is used to deploy the Cloud Run service.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

    gcloud

    To grant the required IAM roles to your account on your project, run:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member=PRINCIPAL \
        --role=ROLE

    Replace:

    • PROJECT_ID with your Google Cloud project ID.
    • PRINCIPAL with the account you are adding the binding for. This is typically the email address that is used to deploy the Cloud Run service.
    • ROLE with the role you are adding to the deployer account.
  7. Request Total Nvidia L4 GPU allocation, per project per region quota under Cloud Run Admin API on the Quotas and system limits page.
  8. Review the Cloud Run pricing page. To generate a cost estimate based on your projected usage, use the pricing calculator.

Deploy a Gemma model with a prebuilt container

Cloud Run provides a prebuilt container for serving Gemma open models on Cloud Run.

To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:

gcloud run deploy SERVICE_NAME \
    --image us-docker.pkg.dev/cloudrun/container/gemma/GEMMA_PARAMETER \
    --concurrency 4 \
    --cpu 8 \
    --set-env-vars OLLAMA_NUM_PARALLEL=4 \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --max-instances 1 \
    --memory 32Gi \
    --no-allow-unauthenticated \
    --no-cpu-throttling \
    --no-gpu-zonal-redundancy \
    --timeout=600 \
    --region REGION

Replace:

  • SERVICE_NAME with a unique name for the Cloud Run service.
  • GEMMA_PARAMETER with the Gemma model you want to use:

    • Gemma 3 1B (gemma-3-1b-it): gemma3-1b
    • Gemma 3 4B (gemma-3-4b-it): gemma3-4b
    • Gemma 3 12B (gemma-3-12b-it): gemma3-12b
    • Gemma 3 27B (gemma-3-27b-it): gemma3-27b

    Optionally, replace the entire image URL with a Docker image you've built from the Gemma-on-Cloudrun GitHub repository.

  • REGION with the Google Cloud region where your Cloud Run service will be deployed, such as europe-west1. If you need to modify the region, see GPU configuration to learn about supported regions for GPU-enabled deployments.
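The model-to-image mapping above can be expressed programmatically. The following sketch (plain Node.js; the helper name and lookup table are illustrative, not part of any SDK) builds the full prebuilt-container URL for a given Gemma variant:

```javascript
// Hypothetical helper: map a Gemma 3 variant to its prebuilt container image.
// The image tags mirror the GEMMA_PARAMETER values listed above.
const GEMMA_IMAGES = {
  "gemma-3-1b-it": "gemma3-1b",
  "gemma-3-4b-it": "gemma3-4b",
  "gemma-3-12b-it": "gemma3-12b",
  "gemma-3-27b-it": "gemma3-27b",
};

function imageUrl(model) {
  return `us-docker.pkg.dev/cloudrun/container/gemma/${GEMMA_IMAGES[model]}`;
}

console.log(imageUrl("gemma-3-4b-it"));
// → us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b
```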

The other settings are as follows:

  • --concurrency: The maximum number of requests that can be processed simultaneously by a given instance, such as 4. See Set concurrency for optimal performance for recommendations on optimal request latency.
  • --cpu: The amount of allocated CPU for your service, such as 8.
  • --set-env-vars: The environment variables set for your service, such as OLLAMA_NUM_PARALLEL=4. See Set concurrency for optimal performance for recommendations on optimal request latency.
  • --gpu: The GPU value for your service, such as 1.
  • --gpu-type: The type of GPU to use for your service, such as nvidia-l4.
  • --max-instances: The maximum number of container instances for your service, such as 1.
  • --memory: The amount of allocated memory for your service, such as 32Gi.
  • --no-allow-unauthenticated: Requires IAM authentication for requests to your service. See Securely interact with the Google Gen AI SDK for recommendations on how to better secure your app.
  • --no-cpu-throttling: Disables CPU throttling when the container is not actively serving requests.
  • --timeout: The time within which a response must be returned, such as 600 seconds.

If you need to modify the default settings or add more customized settings to your Cloud Run service, see Configure services.

When the deployment completes, a success message is displayed along with the Cloud Run endpoint URL, which ends with run.app.

Test the deployed Gemma service with curl

Now that you have deployed the Gemma service, you can send requests to it. However, if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized. This is intentional, because an LLM inference API is intended for other services to call, such as a front-end application. For more information on service-to-service authentication on Cloud Run, refer to Authenticating service-to-service.

To send requests to the Gemma service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:

  1. Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:

    gcloud run services proxy SERVICE_NAME --port=9090
  2. Run the following command to send a request in a separate terminal tab, leaving the proxy running. The proxy runs on localhost:9090. Specify the Gemma model you previously used:

    curl http://localhost:9090/api/generate -d '{
      "model": "gemma3:4b",
      "prompt": "Why is the sky blue?"
    }'

    This command should provide streaming output similar to this:

    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.641492408Z","response":"That","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.687529153Z","response":"'","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.753284927Z","response":"s","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.812957381Z","response":" a","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.889102649Z","response":" fantastic","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.925748116Z","response":",","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.958391572Z","response":" decept","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.971035028Z","response":"ively","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.989678484Z","response":" tricky","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.999321940Z","response":" question","done":false}
    ...
    Success: You deployed a GPU-enabled Cloud Run service with Gemma 3 and sent an inference request to it.
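Each line of that streamed output is a standalone JSON object carrying a fragment of the reply. A minimal sketch (plain Node.js, no Cloud Run dependencies; the helper name is hypothetical) for reassembling the fragments into the full response text:

```javascript
// Reassemble the text of an Ollama streaming response.
// Each newline-delimited JSON object carries a "response" fragment;
// the object with "done": true marks the end of the stream.
function assembleResponse(ndjson) {
  return ndjson
    .trim()
    .split("\n")
    .map((line) => JSON.parse(line))
    .filter((chunk) => !chunk.done)
    .map((chunk) => chunk.response)
    .join("");
}

// Abbreviated sample mirroring the output above:
const sample =
  '{"model":"gemma3:4b","response":"That","done":false}\n' +
  '{"model":"gemma3:4b","response":"\'","done":false}\n' +
  '{"model":"gemma3:4b","response":"s","done":false}\n' +
  '{"model":"gemma3:4b","response":"","done":true}';

console.log(assembleResponse(sample)); // → That's
```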

Securely interact with the Google Gen AI SDK

After you have deployed your Cloud Run service, you can use the Cloud Run endpoint with the Google Gen AI SDK.

Before you use the Google Gen AI SDK, ensure that incoming requests pass the appropriate identity token. To learn more about using IAM authentication and Cloud Run, see Authenticating service-to-service.

The following examples show how to use the Google Gen AI SDK with IAM authentication.

JavaScript or TypeScript

If you are using the Google Gen AI SDK for JavaScript and TypeScript, the code might look as follows:

import { GoogleGenAI, setDefaultBaseUrls } from "@google/genai";
import { GoogleAuth } from "google-auth-library";

const url = "https://CLOUD_RUN_SERVICE_URL";
const targetAudience = url;
const auth = new GoogleAuth();

async function main() {
  const client = await auth.getIdTokenClient(targetAudience);
  const headers = await client.getRequestHeaders(targetAudience);
  const idToken = headers["Authorization"];

  const ai = new GoogleGenAI({
    apiKey: "placeholder",
    httpOptions: {
      baseUrl: url,
      headers: {
        "Authorization": idToken,
      },
    },
  });

  const response = await ai.models.generateContent({
    model: "gemma-3-1b-it",
    contents: "I want a pony",
  });
  console.log(response.text);
}

main();

curl

If using curl, run the following commands to reach the Google Gen AI SDK endpoints:

  • For Generate Content, use /v1beta/{model=models/*}:generateContent: Generates a model response given an input GenerateContentRequest.

    curl "<cloud_run_url>/v1beta/models/<model>:generateContent" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
      -X POST \
      -d '{
        "contents": [{
          "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
        }]
      }'
  • For Stream Generate Content, use /v1beta/{model=models/*}:streamGenerateContent: Generates a streamed response from the model given an input GenerateContentRequest.

    curl "<cloud_run_url>/v1beta/models/<model>:streamGenerateContent" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
      -X POST \
      -d '{
        "contents": [{
          "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
        }]
      }'
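The JSON body returned by the generateContent endpoint nests the generated text under candidates. A small sketch (plain Node.js; the helper name and mock response are illustrative, following the Gemini API GenerateContentResponse shape) for pulling the text out:

```javascript
// Hypothetical helper: extract the generated text from a generateContent
// response body (candidates → content → parts → text).
function extractText(body) {
  return body.candidates[0].content.parts.map((p) => p.text).join("");
}

// Mock response body for illustration:
const mock = {
  candidates: [
    { content: { parts: [{ text: "Once upon a time..." }], role: "model" } },
  ],
};

console.log(extractText(mock)); // → Once upon a time...
```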

Set concurrency for optimal performance

This section provides context on the recommended concurrency settings. For optimal request latency, ensure the --concurrency setting is equal to Ollama's OLLAMA_NUM_PARALLEL environment variable.

  • OLLAMA_NUM_PARALLEL determines how many request slots are available per model to handle inference requests concurrently.
  • --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.

If --concurrency exceeds OLLAMA_NUM_PARALLEL, Cloud Run can send more requests to a model in Ollama than it has available request slots for. This leads to request queuing within Ollama, increasing request latency for the queued requests. It also leads to less responsive autoscaling, as the queued requests don't trigger Cloud Run to scale out and start new instances.

Ollama also supports serving multiple models from one GPU. To completely avoid request queuing on the Ollama instance, you should still set --concurrency to match OLLAMA_NUM_PARALLEL.

Note that increasing OLLAMA_NUM_PARALLEL also makes parallel requests take longer.

Optimize utilization

For optimal GPU utilization, increase --concurrency, keeping it within twice the value of OLLAMA_NUM_PARALLEL. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
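The arithmetic behind these two tuning rules can be sketched as follows (hypothetical helper names, for illustration only):

```javascript
// Requests beyond Ollama's request slots queue inside the instance.
function queuedRequests(concurrency, ollamaNumParallel) {
  return Math.max(0, concurrency - ollamaNumParallel);
}

// Tuning rule from this section: keep --concurrency between
// OLLAMA_NUM_PARALLEL (no queuing) and twice that value (better utilization).
function concurrencyRange(ollamaNumParallel) {
  return { min: ollamaNumParallel, max: 2 * ollamaNumParallel };
}

console.log(queuedRequests(4, 4)); // 0 — matched settings, no queuing
console.log(queuedRequests(8, 4)); // 4 — four requests wait in Ollama's queue
console.log(concurrencyRange(4)); // { min: 4, max: 8 }
```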

Clean up

To avoid incurring charges, delete the Cloud Run service you created. If you created a new project for this guide, delete the project to remove all associated resources.

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.