Serve Gemma open models using GPUs on GKE with vLLM
This tutorial shows you how to deploy and serve a Gemma 3 large language model (LLM) using GPUs on Google Kubernetes Engine (GKE) with the vLLM serving framework. This provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment. You deploy a pre-built container that runs vLLM to GKE. You also configure GKE to load the Gemma 1B, 4B, 12B, and 27B weights from Hugging Face.
Tip: For production deployments on GKE, we strongly recommend using Inference Quickstart to get tailored best practices and configurations for your model inference.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and for Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads on H200, H100, A100, and L4 GPU hardware. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
If you need a unified managed AI platform that's designed to rapidly build and serve ML models cost-effectively, we recommend that you try our Vertex AI deployment solution.
Before reading this page, ensure that you're familiar with the following:
Background
This section describes the key technologies used in this guide.
Gemma
Gemma is a set of openly available, lightweight, generative artificial intelligence (AI) multimodal models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. Gemma 3 introduces multimodality, and it supports vision-language input and text outputs. It handles context windows of up to 128,000 tokens and supports over 140 languages. Gemma 3 also offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.
You can use the Gemma models for text generation, or you can also tune these models for specialized tasks.
For more information, see the Gemma documentation.
GPUs
GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H200, H100, L4, and A100 GPUs.
vLLM
vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention
- Continuous batching to improve the overall serving throughput
- Tensor parallelism and distributed serving on multiple GPUs
For more information, refer to the vLLM documentation.
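The core idea behind PagedAttention is that a sequence's KV cache is stored in fixed-size blocks that need not be contiguous, so GPU memory is allocated on demand as tokens are generated rather than reserved up front. The following is a toy, standalone Python sketch of that block-table bookkeeping for intuition only; the block size, class names, and logic here are illustrative and are not vLLM's actual implementation:

```python
# Toy illustration of the block-table idea behind PagedAttention.
# Each sequence maps its logical token positions to physical KV-cache
# blocks; a new block is allocated only when the previous one is full.
BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)

class BlockTable:
    def __init__(self):
        self.blocks = []     # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self, free_blocks):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(free_blocks.pop())
        self.num_tokens += 1

free = list(range(100))   # pool of free physical blocks on the "GPU"
seq = BlockTable()
for _ in range(10):       # generate 10 tokens for one sequence
    seq.append_token(free)

# 10 tokens at 4 tokens per block -> only 3 blocks allocated
print(seq.num_tokens, len(seq.blocks))
```

Because blocks are allocated lazily and can be shared or freed per block, many more concurrent sequences fit in the same GPU memory than with contiguous preallocation, which is what enables continuous batching at high throughput.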
Objectives
- Prepare your environment with a GKE cluster in Autopilot or Standard mode.
- Deploy a vLLM container to your cluster.
- Use vLLM to serve the Gemma 3 model through curl and a web chat interface.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin.
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for L4 GPUs. For more information, see About GPUs and Allocation quotas.
Get access to the model
Note: You must sign the consent agreement to use Gemma in the Hugging Face repository. To access the model through Hugging Face, you need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HF_TOKEN=HF_TOKEN

Replace the following values:

- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPU.
- CLUSTER_NAME: the name of your cluster.
- HF_TOKEN: the Hugging Face token you generated earlier.
Create and configure Google Cloud resources
Follow these instructions to create the required resources.
Note: You might need to create a capacity reservation to use some accelerators. To learn how to reserve and consume reserved resources, see Consuming reserved zonal resources.
Create a GKE cluster and node pool
You can serve Gemma on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --release-channel=rapid

Replace the following values:

- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPU.
- CLUSTER_NAME: the name of your cluster.
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
Standard
In Cloud Shell, run the following command to create a Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --num-nodes=1

Replace the following values:

- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPU.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.
To create a node pool for your cluster with the appropriate disk size, run the following command:
Gemma 3 1B
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=g2-standard-8 \
    --num-nodes=1

GKE creates a single node pool containing an L4 GPU for each node.
Gemma 3 4B
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=g2-standard-8 \
    --num-nodes=1

GKE creates a single node pool containing an L4 GPU for each node.
Gemma 3 12B
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=4,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=g2-standard-48 \
    --num-nodes=1

GKE creates a single node pool containing four L4 GPUs for each node.
Gemma 3 27B
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-a100-80gb,count=1,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=a2-ultragpu-1g \
    --disk-type=pd-ssd \
    --num-nodes=1 \
    --disk-size=256

GKE creates a single node pool containing one A100 80GB GPU.
Create a Kubernetes secret for Hugging Face credentials
In Cloud Shell, do the following:
Configure kubectl so it can communicate with your cluster:

gcloud container clusters get-credentials CLUSTER_NAME \
    --location=REGION

Replace the following values:

- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPU.
- CLUSTER_NAME: the name of your cluster.
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

Replace HF_TOKEN with the Hugging Face token you generated earlier.
Deploy vLLM
In this section, you deploy the vLLM container to serve the Gemma model you want to use. To deploy the model, this tutorial uses Kubernetes Deployments. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Gemma 3 1B-it
Follow these instructions to deploy the Gemma 3 1Binstruction tuned model (text-only input).
Create the following vllm-3-1b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-3-1b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:v0.10.0
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8000
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-3-1b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

Apply the manifest:

kubectl apply -f vllm-3-1b-it.yaml
Gemma 3 4B-it
Follow these instructions to deploy the Gemma 3 4Binstruction tuned model.
Create the following vllm-3-4b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-3-4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:v0.10.0
        resources:
          requests:
            cpu: "2"
            memory: "20Gi"
            ephemeral-storage: "20Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "20Gi"
            ephemeral-storage: "20Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=32768
        - --max-num-seqs=4
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-3-4b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

Apply the manifest:

kubectl apply -f vllm-3-4b-it.yaml

In our example, we limit the context window to 32K using the vLLM option --max-model-len=32768. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.
Gemma 3 12B-it
Follow these instructions to deploy the Gemma 3 12Binstruction tuned model.
Create the following vllm-3-12b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-3-12b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:v0.10.0
        resources:
          requests:
            cpu: "4"
            memory: "32Gi"
            ephemeral-storage: "32Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "32Gi"
            ephemeral-storage: "32Gi"
            nvidia.com/gpu: "2"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=16384
        - --max-num-seqs=4
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-3-12b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

Apply the manifest:

kubectl apply -f vllm-3-12b-it.yaml

In our example, we limit the context window to 16K using the vLLM option --max-model-len=16384. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.
Gemma 3 27B-it
Follow these instructions to deploy the Gemma 3 27Binstruction tuned model.
Create the following vllm-3-27b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-3-27b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:v0.10.0
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8000
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --max-model-len=32768
        - --max-num-seqs=4
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-3-27b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

Apply the manifest:

kubectl apply -f vllm-3-27b-it.yaml

In our example, we limit the context window to 32K using the vLLM option --max-model-len=32768. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.
A Pod in the cluster downloads the model weights from Hugging Face and startsthe serving engine.
Wait for the Deployment to be available:
kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment

View the logs from the running Deployment:

kubectl logs -f -l app=gemma-server

The Deployment resource downloads the model data. This process can take a few minutes. The output is similar to the following:

INFO:     Automatically detected platform cuda.
...
INFO [launcher.py:34] Route: /v1/chat/completions, Methods: POST
...
INFO:     Started server process [13]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
Default STARTUP TCP probe succeeded after 1 attempt for container "vllm--google--gemma-3-4b-it-1" on port 8080.

Make sure the model is fully downloaded before proceeding to the next section.
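If you want to confirm readiness programmatically instead of watching logs, a generic poll-until-healthy loop works well. The following standalone Python sketch shows the pattern; the probe shown in the comments (hitting the server's /health endpoint through a port forward) is an assumption about your setup, so the runnable demo below uses a stub probe instead:

```python
import time

def wait_until_ready(probe, timeout_s=1800, interval_s=1.0):
    """Poll `probe` (a callable returning True when the server is up)
    until it succeeds or the timeout expires. Returns True on success."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# In a real check, probe would hit the serving endpoint, for example:
#   import urllib.request
#   probe = lambda: urllib.request.urlopen(
#       "http://127.0.0.1:8000/health").status == 200
# Here, a stub that succeeds on the third attempt demonstrates the loop:
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0.01)
print(ready)  # → True
```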
Serve the model
In this section, you interact with the model.
Set up port forwarding
Run the following command to set up port forwarding to the model:
kubectl port-forward service/llm-service 8000:8000

The output is similar to the following:

Forwarding from 127.0.0.1:8000 -> 8000

Interact with the model using curl
This section shows how you can perform a basic smoke test to verify your deployed Gemma 3 instruction-tuned models. For other models, replace gemma-3-4b-it with the name of the respective model.
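Because vLLM exposes an OpenAI-compatible API, any OpenAI-style client can talk to the same endpoint. As a minimal standard-library Python sketch (it assumes the port forward from the previous section is active on 127.0.0.1:8000; the actual network call is commented out so the snippet stands alone):

```python
import json
import urllib.request

# Build the same OpenAI-compatible chat-completions request as the
# curl example in this section. Assumes the port forward to
# llm-service is active on 127.0.0.1:8000.
ENDPOINT = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "model": "google/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}

req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request while the port forward is running:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])

print(req.get_method(), req.full_url)
```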
This example shows how to test the Gemma 3 4B instructiontuned model with text-only input.
In a new terminal session, use curl to chat with your model:
curl http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-3-4b-it",
      "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
      ]
    }'

The output looks similar to the following:

{
  "id": "chatcmpl-e4a2e624bea849d9b09f838a571c4d9e",
  "object": "chat.completion",
  "created": 1741763029,
  "model": "google/gemma-3-4b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Okay, let's break down why the sky appears blue! It's a fascinating phenomenon rooted in physics, specifically something called **Rayleigh scattering**. Here's the explanation: ...",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 668,
    "completion_tokens": 653,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

(Optional) Interact with the model through a Gradio chat interface
In this section, you build a web chat application that lets you interact with your instruction tuned model. For simplicity, this section describes only the testing approach using the 4B-it model.
Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.
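To give a sense of what a ChatInterface-based app does, here is a hedged, standalone Python sketch: it converts the (message, history) arguments that Gradio's classic ChatInterface passes — history as a list of (user, assistant) pairs — into OpenAI-style messages for the vLLM Service. The in-cluster URL, model ID, and Gradio launch call (shown in comments) are assumptions about your deployment, not the contents of the prebuilt image used below:

```python
import json
import urllib.request

# Assumed in-cluster Service address and model from this tutorial.
LLM_URL = "http://llm-service:8000/v1/chat/completions"
MODEL_ID = "google/gemma-3-4b-it"

def to_openai_messages(history, new_message):
    """Convert ChatInterface-style history (a list of (user, assistant)
    pairs) plus the new user message into OpenAI-style chat messages."""
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": new_message})
    return messages

def chat(message, history):
    # Forward the conversation to the vLLM OpenAI-compatible endpoint.
    payload = {"model": MODEL_ID, "messages": to_openai_messages(history, message)}
    req = urllib.request.Request(
        LLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With Gradio installed, the UI is a single call (not run here):
# import gradio as gr
# gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)

msgs = to_openai_messages([("Hi", "Hello!")], "Why is the sky blue?")
print(len(msgs))  # → 3
```

In this tutorial you don't need to write this code yourself; the prebuilt gradio-app container image deployed in the next step plays this role.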
Deploy the chat interface
In Cloud Shell, save the following manifest as gradio.yaml. Change google/gemma-2-9b-it to google/gemma-3-4b-it or to another Gemma 3 model name you used in your deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/v1/chat/completions"
        - name: HOST
          value: "http://llm-service:8000"
        - name: LLM_ENGINE
          value: "openai-chat"
        - name: MODEL_ID
          value: "google/gemma-2-9b-it"
        - name: DISABLE_SYSTEM_MESSAGE
          value: "true"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio
spec:
  selector:
    app: gradio
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 7860
  type: ClusterIP

Apply the manifest:

kubectl apply -f gradio.yaml

Wait for the deployment to be available:

kubectl wait --for=condition=Available --timeout=900s deployment/gradio
Use the chat interface
In Cloud Shell, run the following command:
kubectl port-forward service/gradio 8080:8080

This creates a port forward from Cloud Shell to the Gradio service.
Click the Web Preview button, which can be found on the top right of the Cloud Shell taskbar. Click Preview on Port 8080. A new tab opens in your browser.
Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.
Troubleshoot issues
- If you get the message Empty reply from server, it's possible the container has not finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that your port forwarding is active.
Observe model performance
To view the dashboards for observability metrics of a model, follow these steps:
In the Google Cloud console, go to the Deployed Models page.
To view details about the specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
In the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.
- The Infrastructure usage dashboard displays utilization metrics.
- The DCGM dashboard displays DCGM metrics.
- If you are using vLLM, then the Model performance dashboard is available and displays metrics for the vLLM model performance.
You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated for all vLLM deployments with no pre-set filters.
To use the dashboard in Cloud Monitoring, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from vLLM, in your GKE cluster. vLLM exposes metrics in Prometheus format by default; you do not need to install an additional exporter. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guidance in the Cloud Monitoring documentation.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

Replace the following values:

- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster.
- CLUSTER_NAME: the name of your cluster.
What's next
- Learn more about GPUs in GKE.
- Learn how to use Gemma with vLLM on other accelerators, including A100 and H100 GPUs, by viewing the sample code in GitHub.
- Learn how to deploy GPU workloads in Autopilot.
- Learn how to deploy GPU workloads in Standard.
- Explore the vLLM GitHub repository and documentation.
- Explore the Vertex AI Model Garden.
- Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-11-25 UTC.