Serve an LLM with multiple GPUs in GKE

This tutorial demonstrates how to deploy and serve a large language model (LLM) using multiple GPUs on GKE for efficient and scalable inference. You create a GKE cluster that uses multiple L4 GPUs and prepare infrastructure to serve any of the following models:

  • Llama 3 70b
  • Mixtral 8x7b
  • Falcon 40b

Depending on the data format of the model, the required number of GPUs varies. In this tutorial, each model uses two L4 GPUs. To learn more, see Calculate the amount of GPUs.

This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Objectives

In this tutorial, you:

  1. Create a cluster and node pools.
  2. Prepare your workload.
  3. Deploy your workload.
  4. Interact with the LLM interface.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document. Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
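For example, assuming that you keep us-central1 as the region used later in this tutorial, you can set a default location like this (a minimal sketch; choose the zone that fits your setup):

    # Set a default region so that gcloud commands don't prompt for a location.
    gcloud config set compute/region us-central1

    # If you primarily use zonal clusters, set a default zone instead, for example:
    gcloud config set compute/zone us-central1-a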

Prepare your environment

  1. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export CONTROL_PLANE_LOCATION=us-central1

    Replace PROJECT_ID with your Google Cloud project ID.

    Note: If your Cloud Shell instance disconnects during the tutorial, repeat the preceding step.

Create a GKE cluster and node pool

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

  1. In Cloud Shell, run the following command:

    gcloud container clusters create-auto l4-demo \
        --project=${PROJECT_ID} \
        --location=${CONTROL_PLANE_LOCATION} \
        --release-channel=rapid

    GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

  2. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
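    Optionally, confirm that kubectl now points at the new cluster. This is a quick check, not a required step:

    # The current context should reference the l4-demo cluster.
    kubectl config current-context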

Standard

  1. In Cloud Shell, run the following command to create a Standard cluster that uses Workload Identity Federation for GKE:

    gcloud container clusters create l4-demo \
        --location ${CONTROL_PLANE_LOCATION} \
        --workload-pool ${PROJECT_ID}.svc.id.goog \
        --enable-image-streaming \
        --node-locations=${CONTROL_PLANE_LOCATION}-a \
        --machine-type n2d-standard-4 \
        --num-nodes 1 --min-nodes 1 --max-nodes 5 \
        --release-channel=rapid
    Note: You might have to adjust the --node-locations flag based on the region that you choose. If you change the us-central1 region, check the zones where L4 GPUs are available.

    The cluster creation might take several minutes.

  2. Run the following command to create a node pool for your cluster:

    gcloud container node-pools create g2-standard-24 --cluster l4-demo \
        --location ${CONTROL_PLANE_LOCATION} \
        --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
        --machine-type g2-standard-24 \
        --enable-autoscaling --enable-image-streaming \
        --num-nodes=0 --min-nodes=0 --max-nodes=3 \
        --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
        --spot

    GKE creates the following resources for the LLM:

    • A public Standard cluster.
    • A node pool with the g2-standard-24 machine type, scaled down to 0 nodes. You aren't charged for any GPUs until you launch Pods that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. To use on-demand VMs instead, remove the --spot flag from this command and the cloud.google.com/gke-spot node selector from the text-generation-inference.yaml config.
  3. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
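    Optionally, confirm that kubectl points at the new cluster and list its nodes. This is a quick check, not a required step:

    # The GPU node pool stays at zero nodes until a Pod requests GPUs.
    kubectl config current-context
    kubectl get nodes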

Prepare your workload

This section shows how to set up your workload depending on the model you want to use. This tutorial uses Kubernetes Deployments to deploy the model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Llama 3 70b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN

    Replace HUGGING_FACE_TOKEN with your Hugging Face token.

  2. Create a Kubernetes Secret for the Hugging Face token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
  3. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: meta-llama/Meta-Llama-3-70B-Instruct
            - name: NUM_SHARD
              value: "2"
            - name: MAX_INPUT_TOKENS
              value: "2048"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            # mountPath is set to /tmp as it's the path where the HUGGINGFACE_HUB_CACHE environment
            # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image,
            # i.e. where the downloaded model from the Hub will be stored
            - mountPath: /tmp
              name: ephemeral-volume
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ephemeral-volume
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  labels:
                    type: ephemeral
                spec:
                  accessModes: ["ReadWriteOnce"]
                  storageClassName: "premium-rwo"
                  resources:
                    requests:
                      storage: 150Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32 bits. This lets GKE reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml

    The output is similar to the following:

    deployment.apps/llm created
  5. Verify the status of the model:

    kubectl get deploy

    The output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           20m
  6. View the logs from the running deployment:

    kubectl logs -l app=llm

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
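    Optionally, you can send a quick test request to the model server before you expose it with a Service. This is a sketch of a smoke test that assumes the standard Text Generation Inference /generate endpoint; the prompt is only an example:

    # Forward the TGI port from the Deployment to Cloud Shell.
    kubectl port-forward deployment/llm 8080:8080 &

    # Send a short test prompt to the TGI /generate endpoint.
    curl -s http://localhost:8080/generate \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 50}}'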

Mixtral 8x7b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN

    Replace HUGGING_FACE_TOKEN with your Hugging Face token.

  2. Create a Kubernetes Secret for the Hugging Face token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
  3. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            # mountPath is set to /tmp as it's the path where the HF_HOME environment
            # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image,
            # i.e. where the downloaded model from the Hub will be stored
            - mountPath: /tmp
              name: ephemeral-volume
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ephemeral-volume
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  labels:
                    type: ephemeral
                spec:
                  accessModes: ["ReadWriteOnce"]
                  storageClassName: "premium-rwo"
                  resources:
                    requests:
                      storage: 100Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32 bits. This lets GKE reduce the amount of GPU memory needed and improves the inference speed. However, this can reduce model accuracy. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml

    The output is similar to the following:

    deployment.apps/llm created
  5. Verify the status of the model:

    watch kubectl get deploy

    When the Deployment is ready, the output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m

    To exit the watch, type CTRL + C.

  6. View the logs from the running deployment:

    kubectl logs -l app=llm

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
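    Optionally, you can send a quick test request to the model server before you expose it with a Service. This is a sketch of a smoke test that assumes the standard Text Generation Inference /generate endpoint; the Mixtral-style [INST] prompt is only an example:

    # Forward the TGI port from the Deployment to Cloud Shell.
    kubectl port-forward deployment/llm 8080:8080 &

    # Send a short test prompt to the TGI /generate endpoint.
    curl -s http://localhost:8080/generate \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{"inputs": "[INST] What is Kubernetes? [/INST]", "parameters": {"max_new_tokens": 50}}'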

Falcon 40b

  1. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: tiiuae/falcon-40b-instruct
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
            # variable points to in the TGI container image, i.e. where the downloaded model from the Hub
            # will be stored
            - mountPath: /data
              name: ephemeral-volume
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ephemeral-volume
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  labels:
                    type: ephemeral
                spec:
                  accessModes: ["ReadWriteOnce"]
                  storageClassName: "premium-rwo"
                  resources:
                    requests:
                      storage: 175Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32 bits. This lets GKE reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
  2. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml

    The output is similar to the following:

    deployment.apps/llm created
  3. Verify the status of the model:

    watch kubectl get deploy

    When the deployment is ready, the output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m

    To exit the watch, type CTRL + C.

  4. View the logs from the running deployment:

    kubectl logs -l app=llm

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
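    Optionally, you can send a quick test request to the model server before you expose it with a Service. This is a sketch of a smoke test that assumes the standard Text Generation Inference /generate endpoint; the prompt is only an example:

    # Forward the TGI port from the Deployment to Cloud Shell.
    kubectl port-forward deployment/llm 8080:8080 &

    # Send a short test prompt to the TGI /generate endpoint.
    curl -s http://localhost:8080/generate \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{"inputs": "User: What is Kubernetes?", "parameters": {"max_new_tokens": 50}}'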

Create a Service of type ClusterIP

Expose your Pods internally within the cluster so that they can be discovered and accessed by other applications. After you apply the manifest, you can optionally test the Service, as shown in the example after these steps.

  1. Create the following llm-service.yaml manifest:

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm
      type: ClusterIP
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080
  2. Apply the manifest:

    kubectl apply -f llm-service.yaml
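    Because the Service uses type ClusterIP, it's reachable only from inside the cluster. As an optional check, you can call it from a temporary Pod; this sketch assumes the public curlimages/curl image and a placeholder prompt:

    # Run a temporary Pod with curl and call the Service on port 80.
    kubectl run tgi-test --rm -it --restart=Never --image=curlimages/curl -- \
        curl -s http://llm-service/generate \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 20}}'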

Deploy a chat interface

Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Llama 3 70b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "meta-llama/Meta-Llama-3-70B-Instruct"
            - name: USER_PROMPT
              value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>prompt<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
            - name: SYSTEM_PROMPT
              value: "prompt<|eot_id|>"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
  3. Find the external IP address of the Service:

    kubectl get svc

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP

Mixtral 8x7b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "mixtral-8x7b"
            - name: USER_PROMPT
              value: "[INST]prompt[/INST]"
            - name: SYSTEM_PROMPT
              value: "prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
  3. Find the external IP address of the Service:

    kubectl get svc

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP

Falcon 40b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "falcon-40b-instruct"
            - name: USER_PROMPT
              value: "User:prompt"
            - name: SYSTEM_PROMPT
              value: "Assistant:prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
  3. Find the external IP address of the Service:

    kubectl get svc

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
Success: At this point, you have deployed an LLM using L4 GPUs in GKE.

Calculate the amount of GPUs

The number of GPUs that you need depends on the value of the QUANTIZE flag. In this tutorial, QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision.

A 70-billion-parameter model requires a minimum of 40 GB of GPU memory: 70 billion parameters x 4 bits per parameter = 35 GB of weights, plus about 5 GB of overhead. In this case, a single L4 GPU doesn't have enough memory. Therefore, the examples in this tutorial use two L4 GPUs for a total of 48 GB of memory (2 x 24 GB). This configuration is sufficient for running Falcon 40b or Llama 3 70b on L4 GPUs.
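As a quick sanity check of that arithmetic, you can reproduce it in Cloud Shell. This is an illustrative sketch that assumes the bc calculator is available:

    # 70 billion parameters x 4 bits, converted to gigabytes of weights, plus ~5 GB of overhead.
    # Compare the result against two L4 GPUs: 2 x 24 GB = 48 GB.
    echo "70 * 10^9 * 4 / 8 / 10^9 + 5" | bc
    # Output: 40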

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the cluster

To avoid incurring charges to your Google Cloud account for the resourcesthat you created in this guide, delete the GKE cluster:

gcloud container clusters delete l4-demo --location ${CONTROL_PLANE_LOCATION}

What's next
