Serve open LLMs on GKE with a pre-configured architecture

This page shows you how to quickly deploy and serve popular open large language models (LLMs) on GKE for inference by using a pre-configured, production-ready GKE inference reference architecture. This approach uses Infrastructure as Code (IaC), with Terraform wrapped in CLI scripts, to create a standardized, secure, and scalable GKE environment designed for AI inference workloads.

In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models: Gemma 3 27B-it, Llama 4 Scout 17B-16E-Instruct, Qwen3 32B, and gpt-oss 20B.

Note: You must accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.

This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

For a detailed analysis of model serving performance and costs for these open models, you can also use the GKE Inference Quickstart tool. To learn more, see the GKE Inference Quickstart guide and the accompanying Colab notebook.

Before you begin

Get access to the model

Accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.

To access the model through Hugging Face, you need a Hugging Face token.

Follow these steps to generate a new token if you don't have one already; a quick way to confirm that your token works is sketched after these steps:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard.
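
If you want to confirm that the token is valid before continuing, one option is to call the Hugging Face whoami endpoint with it. This is an optional sanity check and is not part of the reference architecture; replace HF_TOKEN with the token you just generated.

    curl --silent --show-error \
      --header "Authorization: Bearer HF_TOKEN" \
      https://huggingface.co/api/whoami-v2

A valid read token returns a JSON document that describes your Hugging Face account; an invalid token returns an authorization error.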

Provision the GKE inference environment

In this section, you deploy the necessary infrastructure to serve your model.

Launch Cloud Shell

This guide uses Cloud Shell to execute commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.
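
If you want to confirm which tool versions your Cloud Shell session provides, an optional check like the following works; it is not required for this guide:

    # Print the versions of the preinstalled tools used in this guide.
    gcloud --version
    kubectl version --client
    terraform --version
    git --version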

In the Google Cloud console, start a Cloud Shell instance:

Open Cloud Shell

This action launches a session in the bottom pane of the Google Cloud console.

Deploy the base architecture

To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:

  1. In Cloud Shell, clone the following repository:

    git clone https://github.com/GoogleCloudPlatform/accelerated-platforms --branch hf-model-vllm-gpu-tutorial && \
    cd accelerated-platforms && \
    export ACP_REPO_DIR="$(pwd)"
  2. Set your environment variables:

    export TF_VAR_platform_default_project_id=PROJECT_ID
    export HF_TOKEN_READ=HF_TOKEN

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • HF_TOKEN: the Hugging Face token you generated earlier.
  3. This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.

    To update the Terraform version in Cloud Shell, you can run the following script. This script installs the tfswitch tool and uses it to install Terraform v1.8.0 in your home directory. Follow the instructions from the script to set the necessary environment variable, or pass the --modify-rc-file flag to the script.

    "${ACP_REPO_DIR}/tools/bin/install_terraform.sh" &&\exportPATH=${HOME}/bin:${HOME}/.local/bin:${PATH}
  4. Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.

    You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"

    After this script completes, you will have a GKE cluster ready for inference workloads.

  5. Run the following command to set environment variables from the shared configuration:

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  6. The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before downloading the model. In Cloud Shell, run the following command to add the token to Secret Manager. A sketch for verifying the provisioned cluster and secret follows this list.

    echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
      --data-file=- \
      --project=${huggingface_secret_manager_project_id}
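
At this point, you can optionally confirm that the cluster exists and that the secret now has a version. This is a verification sketch rather than part of the scripted workflow; it reuses environment variables that are already set earlier in this guide:

    # List the GKE clusters in the project used by the deployment script.
    gcloud container clusters list --project=${TF_VAR_platform_default_project_id}

    # Confirm that at least one version of the Hugging Face token secret exists.
    gcloud secrets versions list ${huggingface_hub_access_token_read_secret_manager_secret_name} \
      --project=${huggingface_secret_manager_project_id}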

Deploy an open model

You are now ready to download and deploy the model.

Select a model

  1. Set the environment variables for the model you want to deploy:

    Gemma 3 27B-it

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="google/gemma-3-27b-it"

    Llama 4 Scout 17B-16E-Instruct

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"

    Qwen3 32B

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="qwen/qwen3-32b"

    gpt-oss 20B

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="openai/gpt-oss-20b"

    For additional configurations, including other model variants and GPU types, see the manifests available in the accelerated-platforms GitHub repository. You can also list the variants included in your local clone, as sketched after this step.
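
To see which accelerator and model combinations are included in your local clone, you can list the vLLM manifest directories. The path below matches the one used later in this guide to deploy the workload; the exact set of directories depends on the branch you cloned:

    ls "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/"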

Download the model

  1. Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  2. Run the following script to configure the Hugging Face model download resources that download the model to Cloud Storage:

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
  3. Apply the Hugging Face model download resources:

    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
  4. Monitor the Hugging Face model download job until it is complete.

    until kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} wait job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --for=condition=complete --timeout=10s >/dev/null; do
      clear
      kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
      echo -e "\nhf-model-to-gcs logs (last 10 lines):"
      kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --container=hf-model-to-gcs --tail 10
    done
  5. Verify that the Hugging Face model download job is complete. Optionally, you can also check the downloaded files in Cloud Storage, as sketched after this list.

    kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
  6. Delete the Hugging Face model download resources.

    kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
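
Optionally, you can confirm that the model files landed in Cloud Storage. The bucket is created by the Terraform deployment and its name is not listed in this guide, so first find the bucket, then list the model prefix. BUCKET_NAME below is a placeholder for the bucket you identify, and the model prefix is an assumption based on the /gcs/ path used later in this guide:

    # Find the Cloud Storage buckets created in your project.
    gcloud storage ls --project=${TF_VAR_platform_default_project_id}

    # List the downloaded model files. Replace BUCKET_NAME with the model bucket from the previous output.
    gcloud storage ls gs://BUCKET_NAME/${HF_MODEL_ID}/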

Deploy the model

  1. Source the environment variables from your deployment.

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  2. Verify the Hugging Face model name is set.

    echo"HF_MODEL_NAME=${HF_MODEL_NAME}"
  3. Configure the vLLM resources.

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/configure_vllm.sh"
  4. Deploy the inference workload to your GKE cluster. An optional check of the created pods and GPU nodes is sketched after this list.

    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
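
Optionally, while the workload starts up, you can check the pods and the GPU nodes backing the deployment. This is a sketch that uses standard kubectl commands and the GKE accelerator node label; it is not part of the scripted workflow:

    # List the pods created for the vLLM deployment.
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get pods

    # Show the nodes and which GPU accelerator, if any, each one exposes.
    kubectl get nodes --label-columns=cloud.google.com/gke-accelerator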

Test your deployment

  1. Monitor the inference workload deployment until it is available.

    until kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} wait deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --for=condition=available --timeout=10s >/dev/null; do
      clear
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1     1            1'
      echo -e "\nfetch-safetensors logs (last 10 lines):"
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
      echo -e "\ninference-server logs (last 10 lines):"
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
    done
  2. Verify that the inference workload deployment is available.

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1     1            1'
    echo -e "\nfetch-safetensors logs (last 10 lines):"
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
    echo -e "\ninference-server logs (last 10 lines):"
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
  3. Run the following script to set up port forwarding and send a sample request to the model.

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    while ! echo -e '\x1dclose\x0d' | telnet localhost 8000 >/dev/null 2>&1; do
      sleep 0.1
    done
    curl http://127.0.0.1:8000/v1/chat/completions \
      --data '{"model": "/gcs/'${HF_MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ]}' \
      --header "Content-Type: application/json" \
      --request POST \
      --show-error \
      --silent | jq
    kill -9 ${PF_PID}

    You should see a JSON response from the model answering the question. A variation that extracts only the generated text is sketched after this list.
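
The response follows the OpenAI-compatible chat completions format that vLLM exposes, so you can also extract only the generated text with a jq filter. The following is a sketch of the same request with a max_tokens limit added; the previous step closes its port-forward, so this sketch re-establishes it first:

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    sleep 2
    curl http://127.0.0.1:8000/v1/chat/completions \
      --data '{"model": "/gcs/'${HF_MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ], "max_tokens": 256}' \
      --header "Content-Type: application/json" \
      --request POST \
      --show-error \
      --silent | jq -r '.choices[0].message.content'
    kill -9 ${PF_PID}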

Clean up

To avoid incurring charges, delete all the resources you created.

  1. Delete the inference workload:

    kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
  2. Clean up the infrastructure resources. A quick check to confirm the teardown is sketched after this list:

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"

