Serve open LLMs on GKE with a pre-configured architecture
In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models:
- Gemma 3 27B-it
- Llama 4 Scout 17B-16E-Instruct
- Qwen3 32B
- gpt-oss 20B
Note: You must accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.
This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
For a detailed analysis of model serving performance and costs for these open models, you can also use the GKE Inference Quickstart tool. To learn more, see the GKE Inference Quickstart guide and the accompanying Colab notebook.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/artifactregistry.admin, roles/browser, roles/compute.networkAdmin, roles/container.clusterAdmin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/serviceusage.serviceUsageAdmin
Check for the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account.
- Ensure your project has sufficient GPU quota. For more information, see Allocation quotas.
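To confirm that quota is available before you start, you can optionally inspect the per-region quota metrics with gcloud. This is a sketch, not part of the official setup: replace REGION and PROJECT_ID with your own values, and note that the GPU metric names in the output (for example, metrics containing NVIDIA) depend on the accelerator type you plan to use.

# Optional: list GPU-related quota metrics for a region (REGION and PROJECT_ID are placeholders).
gcloud compute regions describe REGION \
    --project=PROJECT_ID \
    --format=json | jq '.quotas[] | select(.metric | test("NVIDIA"))'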
Get access to the model
Accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.
To access the model through Hugging Face, you need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
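Optionally, you can check that the token works before continuing. The following sketch calls the Hugging Face whoami endpoint with the token (HF_TOKEN is a placeholder for the value you just copied); a valid token returns your account details as JSON.

# Optional: verify the Hugging Face token (replace HF_TOKEN with your token value).
curl --silent --show-error \
    --header "Authorization: Bearer HF_TOKEN" \
    https://huggingface.co/api/whoami-v2 | jq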
Provision the GKE inference environment
In this section, you deploy the necessary infrastructure to serve your model.
Launch Cloud Shell
This guide uses Cloud Shell to execute commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.
In the Google Cloud console, start a Cloud Shell instance:
This action launches a session in the bottom pane of the Google Cloud console.
Deploy the base architecture
To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:
In Cloud Shell, clone the following repository:
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms --branch hf-model-vllm-gpu-tutorial && \
cd accelerated-platforms && \
export ACP_REPO_DIR="$(pwd)"

Set your environment variables:
export TF_VAR_platform_default_project_id=PROJECT_ID
export HF_TOKEN_READ=HF_TOKEN

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- HF_TOKEN: the Hugging Face token you generated earlier.
This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.

To update the Terraform version in Cloud Shell, you can run the following script. This script installs the tfswitch tool and installs Terraform v1.8.0 in your home directory. Follow the instructions from the script to set the necessary environment variable, or pass the --modify-rc-file flag to the script.

"${ACP_REPO_DIR}/tools/bin/install_terraform.sh" && \
export PATH=${HOME}/bin:${HOME}/.local/bin:${PATH}

Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.
You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.
Autopilot

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"

Standard

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"

After this script completes, you will have a GKE cluster ready for inference workloads.
Run the following command to set environment variables from the shared configuration:
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before downloading and deploying the model. In Cloud Shell, run this command to add the token to Secret Manager.
echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --data-file=- \
    --project=${huggingface_secret_manager_project_id}
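To confirm that the token was stored, you can optionally read the newest secret version back. This sketch reuses the environment variables set by the shared configuration script; it prints the token value, so clear your terminal afterwards if you are sharing your screen.

# Optional: confirm the latest secret version contains the token you just added.
gcloud secrets versions access latest \
    --secret=${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --project=${huggingface_secret_manager_project_id}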
Deploy an open model
You are now ready to download and deploy the model.
Select a model
Set the environment variables for the model you want to deploy:
Gemma 3 27B-it
export ACCELERATOR_TYPE="h100"
export HF_MODEL_ID="google/gemma-3-27b-it"

Llama 4 Scout 17B-16E-Instruct

export ACCELERATOR_TYPE="h100"
export HF_MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"

Qwen3 32B

export ACCELERATOR_TYPE="h100"
export HF_MODEL_ID="qwen/qwen3-32b"

gpt-oss 20B

export ACCELERATOR_TYPE="h100"
export HF_MODEL_ID="openai/gpt-oss-20b"

For additional configurations, including other model variants and GPU types, see the manifests available in the accelerated-platforms GitHub repository.
Download the model
Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.

source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

Run the following script to configure the Hugging Face model download resources that download the model to Cloud Storage:

"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"

Apply the Hugging Face model download resources:

kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"

Monitor the Hugging Face model download job until it is complete.
until kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} wait job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --for=condition=complete --timeout=10s >/dev/null; do
  clear
  kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
  echo -e "\nhf-model-to-gcs logs (last 10 lines):"
  kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --container=hf-model-to-gcs --tail 10
done

Verify the Hugging Face model download job is complete.
kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'

Delete the Hugging Face model download resources.
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
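If you want to confirm that the model weights landed in Cloud Storage, you can list the bucket contents. This is an optional sketch: MODEL_BUCKET_NAME is a placeholder for the bucket that the deployment created for model storage (you can find its name in the Google Cloud console), not a variable set by this guide.

# Optional: list the downloaded model files. MODEL_BUCKET_NAME is a placeholder;
# substitute the Cloud Storage bucket used for model storage.
gcloud storage ls --recursive gs://MODEL_BUCKET_NAME/${HF_MODEL_ID} | head -20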
Deploy the model
Source the environment variables from your deployment.
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

Verify the Hugging Face model name is set.

echo "HF_MODEL_NAME=${HF_MODEL_NAME}"

Configure the vLLM resources.

"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/configure_vllm.sh"

Deploy the inference workload to your GKE cluster.

kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
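The first rollout can take a while because GKE might need to provision GPU nodes and the pod has to load the model weights. While you wait, you can optionally watch the pods and recent events in the inference namespace; this sketch uses the namespace variable set by the shared configuration script.

# Optional: watch pod scheduling and recent events while the workload starts up.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get pods
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get events --sort-by=.lastTimestamp | tail -20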
Test your deployment
Monitor the inference workload deployment until it is available.
until kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} wait deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --for=condition=available --timeout=10s >/dev/null; do
  clear
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
  echo -e "\nfetch-safetensors logs (last 10 lines):"
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
  echo -e "\ninference-server logs (last 10 lines):"
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
done
Verify the inference workload deployment is available.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
echo -e "\nfetch-safetensors logs (last 10 lines):"
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
echo -e "\ninference-server logs (last 10 lines):"
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10

Run the following script to set up port forwarding and send a sample request to the model.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
while ! echo -e '\x1dclose\x0d' | telnet localhost 8000 >/dev/null 2>&1; do
  sleep 0.1
done
curl http://127.0.0.1:8000/v1/chat/completions \
  --data '{"model": "/gcs/'${HF_MODEL_ID}'","messages": [ { "role": "user", "content": "What is GKE?" } ]}' \
  --header "Content-Type: application/json" \
  --request POST \
  --show-error \
  --silent | jq
kill -9 ${PF_PID}

You should see a JSON response from the model answering the question.
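Because vLLM exposes an OpenAI-compatible API, you can also query the /v1/models endpoint to see how the served model is registered. The following sketch reuses the same port-forwarding pattern as the sample request above.

# Optional: list the models served by the vLLM endpoint through a temporary port-forward.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
sleep 2
curl --silent --show-error http://127.0.0.1:8000/v1/models | jq
kill -9 ${PF_PID}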
Clean up
To avoid incurring charges, delete all the resources you created.
Delete the inference workload:
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"

Clean up the resources:
Autopilot

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"

Standard

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"
What's next
- Learn more about AI/ML model inference on GKE.
- Analyze model inference performance and costs with the GKE Inference Quickstart tool.
- Explore the accelerated-platforms GitHub repository used to build this architecture.