Serve open LLMs on GKE with a pre-configured architecture

This page shows you how to quickly deploy and serve popular open large language models (LLMs) on GKE for inference by using a pre-configured, production-ready GKE inference reference architecture. This approach uses Infrastructure as Code (IaC), with Terraform wrapped in CLI scripts, to create a standardized, secure, and scalable GKE environment designed for AI inference workloads.

In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models: Gemma 3 27B-it, Llama 4 Scout 17B-16E-Instruct, Qwen3 32B, and gpt-oss 20B.

Note: You must accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.

This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

For a detailed analysis of model serving performance and costs for these open models, you can also use the GKE Inference Quickstart tool. To learn more, see the GKE Inference Quickstart guide and the accompanying Colab notebook.

Before you begin

Get access to the model

Accept the license terms for any gated models you want to use (such as Gemma or Llama) on their respective Hugging Face model page.

To access the model through Hugging Face, you need a Hugging Face token.

Follow these steps to generate a new token if you don't have one already; a quick way to confirm that your token works is sketched after these steps:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard.
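
If you want to confirm that the token is valid before continuing, one option is to call the Hugging Face whoami endpoint with it. This is an optional sanity check and is not part of the reference architecture; replace HF_TOKEN with the token you just generated.

    curl --silent --show-error \
      --header "Authorization: Bearer HF_TOKEN" \
      https://huggingface.co/api/whoami-v2

A valid read token returns a JSON document that describes your Hugging Face account; an invalid token returns an authorization error.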

Provision the GKE inference environment

In this section, you deploy the necessary infrastructure to serve your model.

Launch Cloud Shell

This guide uses Cloud Shell to execute commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.
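
If you want to confirm which tool versions your Cloud Shell session provides, an optional check like the following works; it is not required for this guide:

    # Print the versions of the preinstalled tools used in this guide.
    gcloud --version
    kubectl version --client
    terraform --version
    git --version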

In the Google Cloud console, start a Cloud Shell instance:

Open Cloud Shell

This action launches a session in the bottom pane of the Google Cloud console.

Deploy the base architecture

To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:

  1. In Cloud Shell, clone the following repository:

    git clone https://github.com/GoogleCloudPlatform/accelerated-platforms --branch hf-model-vllm-gpu-tutorial && \
    cd accelerated-platforms && \
    export ACP_REPO_DIR="$(pwd)"
  2. Set your environment variables:

    export TF_VAR_platform_default_project_id=PROJECT_ID
    export HF_TOKEN_READ=HF_TOKEN

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • HF_TOKEN: the Hugging Face token you generated earlier.
  3. This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.

    To update the Terraform version in Cloud Shell, you can run the following script. This script installs the tfswitch tool and uses it to install Terraform v1.8.0 in your home directory. Follow the instructions from the script to set the necessary environment variable, or pass the --modify-rc-file flag to the script.

    "${ACP_REPO_DIR}/tools/bin/install_terraform.sh" &&\exportPATH=${HOME}/bin:${HOME}/.local/bin:${PATH}
  4. Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.

    You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"

    After this script completes, you will have a GKE cluster ready for inference workloads.

  5. Run the following command to set environment variables from the shared configuration:

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  6. The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before downloading the model. In Cloud Shell, run the following command to add the token to Secret Manager. A sketch for verifying the provisioned cluster and secret follows this list.

    echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
      --data-file=- \
      --project=${huggingface_secret_manager_project_id}
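
At this point, you can optionally confirm that the cluster exists and that the secret now has a version. This is a verification sketch rather than part of the scripted workflow; it reuses environment variables that are already set earlier in this guide:

    # List the GKE clusters in the project used by the deployment script.
    gcloud container clusters list --project=${TF_VAR_platform_default_project_id}

    # Confirm that at least one version of the Hugging Face token secret exists.
    gcloud secrets versions list ${huggingface_hub_access_token_read_secret_manager_secret_name} \
      --project=${huggingface_secret_manager_project_id}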

Deploy an open model

You are now ready to download and deploy the model.

Select a model

  1. Set the environment variables for the model you want to deploy:

    Gemma 3 27B-it

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="google/gemma-3-27b-it"

    Llama 4 Scout 17B-16E-Instruct

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"

    Qwen3 32B

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="qwen/qwen3-32b"

    gpt-oss 20B

    export ACCELERATOR_TYPE="h100"
    export HF_MODEL_ID="openai/gpt-oss-20b"

    For additional configurations, including other model variants and GPU types, see the manifests available in the accelerated-platforms GitHub repository. You can also list the variants included in your local clone, as sketched after this step.
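
To see which accelerator and model combinations are included in your local clone, you can list the vLLM manifest directories. The path below matches the one used later in this guide to deploy the workload; the exact set of directories depends on the branch you cloned:

    ls "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/"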

Download the model

  1. Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  2. Run the following script to configure the Hugging Face model download resources that download the model to Cloud Storage:

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
  3. Apply the Hugging Face model download resources:

    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
  4. Monitor the Hugging Face model download job until it is complete.

    until kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} wait job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --for=condition=complete --timeout=10s >/dev/null; do
      clear
      kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
      echo -e "\nhf-model-to-gcs logs (last 10 lines):"
      kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --container=hf-model-to-gcs --tail 10
    done
  5. Verify that the Hugging Face model download job is complete. Optionally, you can also check the downloaded files in Cloud Storage, as sketched after this list.

    kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
  6. Delete the Hugging Face model download resources.

    kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
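
Optionally, you can confirm that the model files landed in Cloud Storage. The bucket is created by the Terraform deployment and its name is not listed in this guide, so first find the bucket, then list the model prefix. BUCKET_NAME below is a placeholder for the bucket you identify, and the model prefix is an assumption based on the /gcs/ path used later in this guide:

    # Find the Cloud Storage buckets created in your project.
    gcloud storage ls --project=${TF_VAR_platform_default_project_id}

    # List the downloaded model files. Replace BUCKET_NAME with the model bucket from the previous output.
    gcloud storage ls gs://BUCKET_NAME/${HF_MODEL_ID}/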

Deploy the model

  1. Source the environment variables from your deployment.

    source"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  2. Verify the Hugging Face model name is set.

    echo"HF_MODEL_NAME=${HF_MODEL_NAME}"
  3. Configure the vLLM resources.

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/configure_vllm.sh"
  4. Deploy the inference workload to your GKE cluster. An optional check of the created pods and GPU nodes is sketched after this list.

    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
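
Optionally, while the workload starts up, you can check the pods and the GPU nodes backing the deployment. This is a sketch that uses standard kubectl commands and the GKE accelerator node label; it is not part of the scripted workflow:

    # List the pods created for the vLLM deployment.
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get pods

    # Show the nodes and which GPU accelerator, if any, each one exposes.
    kubectl get nodes --label-columns=cloud.google.com/gke-accelerator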

Test your deployment

  1. Monitor the inference workload deployment until it is available.

    until kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} wait deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --for=condition=available --timeout=10s >/dev/null; do
      clear
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1     1            1'
      echo -e "\nfetch-safetensors logs (last 10 lines):"
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
      echo -e "\ninference-server logs (last 10 lines):"
      kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
    done
  2. Verify that the inference workload deployment is available.

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1     1            1'
    echo -e "\nfetch-safetensors logs (last 10 lines):"
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
    echo -e "\ninference-server logs (last 10 lines):"
    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
  3. Run the following script to set up port forwarding and send a sample request to the model.

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    while ! echo -e '\x1dclose\x0d' | telnet localhost 8000 >/dev/null 2>&1; do
      sleep 0.1
    done
    curl http://127.0.0.1:8000/v1/chat/completions \
      --data '{"model": "/gcs/'${HF_MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ]}' \
      --header "Content-Type: application/json" \
      --request POST \
      --show-error \
      --silent | jq
    kill -9 ${PF_PID}

    You should see a JSON response from the model answering the question. A variation that extracts only the generated text is sketched after this list.
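
The response follows the OpenAI-compatible chat completions format that vLLM exposes, so you can also extract only the generated text with a jq filter. The following is a sketch of the same request with a max_tokens limit added; the previous step closes its port-forward, so this sketch re-establishes it first:

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    sleep 2
    curl http://127.0.0.1:8000/v1/chat/completions \
      --data '{"model": "/gcs/'${HF_MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ], "max_tokens": 256}' \
      --header "Content-Type: application/json" \
      --request POST \
      --show-error \
      --silent | jq -r '.choices[0].message.content'
    kill -9 ${PF_PID}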

Clean up

To avoid incurring charges, delete all the resources you created.

  1. Delete the inference workload:

    kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
  2. Clean up the infrastructure resources. A quick check to confirm the teardown is sketched after this list:

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"

