Serve an LLM using TPUs on GKE with KubeRay

This tutorial shows how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Ray Operator add-on and the vLLM serving framework.

In this tutorial, you serve LLMs such as Llama 3 8B, Mistral 7B, and Llama 3.1 70B on TPU v5e or TPU Trillium (v6e).

Note: We don't recommend using base models like Llama 3 8B for production chatbots due to safety concerns. For more information about best practices, see System instructions for safety.

This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, or platform administrators interested in using Kubernetes container orchestration capabilities to serve models using Ray, on TPUs with vLLM.

Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well-suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.

Ray operator

The Ray Operator add-on on GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPUs

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLM models on TPU v5e or TPU Trillium (v6e) nodes with TPU topologies configured based on each model's requirements for serving prompts with low latency.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as:

  • Optimized transformer implementation with PagedAttention
  • Continuous batching to improve the overall serving throughput
  • Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the vLLM documentation.

Objectives

This tutorial covers the following steps:

  1. Create a GKE cluster with a TPU node pool.
  2. Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
  3. Serve an LLM.
  4. Interact with the models.

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

  • Deploy a RayService custom resource.
  • Compose multiple models with model composition.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document. Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
  • Create a Hugging Face account, if you don't already have one.
  • Ensure that you have a Hugging Face token.
  • Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
  • Ensure that you have the following IAM roles:
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer

Prepare your environment

  1. Check that you have enough quota in your Google Cloud project for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.

  2. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  3. Clone the sample repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
  4. Navigate to the working directory:

    cd ai-ml/gke-ray/rayserve/llm
  5. Set the default environment variables for the GKE cluster creation:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc

    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
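    Before continuing, you can optionally sanity-check that every placeholder was replaced. The following helper is not part of the tutorial; it is a minimal sketch that only verifies each required variable is set and no longer holds its placeholder value (the variable names match the export blocks above):

    ```shell
    # Preflight check: fail if a required variable is empty or still a placeholder.
    check_vars() {
      status=0
      for var in PROJECT_ID CLUSTER_NAME COMPUTE_REGION COMPUTE_ZONE HF_TOKEN GSBUCKET KSA_NAME NAMESPACE MODEL_ID VLLM_IMAGE; do
        # POSIX-safe indirect expansion of the variable named by $var.
        value=$(eval "printf '%s' \"\${$var}\"")
        case "$value" in
          ""|REGION|ZONE|HUGGING_FACE_TOKEN)
            echo "ERROR: $var is unset or still a placeholder" >&2
            status=1
            ;;
        esac
      done
      return $status
    }
    ```

    Run `check_vars && echo "environment looks good"` after setting the variables; a nonzero exit status points at the variable that still needs a value.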
  6. Pull down the vLLM container image:

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

Best practices:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

  1. Create a GKE Autopilot cluster with the RayOperator add-on enabled:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}

Standard

  1. Create a Standard cluster with the Ray Operator add-on enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
  2. Create a single-host TPU slice node pool:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1

    GKE creates a TPU v6e node pool with a ct6e-standard-8t machine type.

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

Create a Cloud Storage bucket

To accelerate the vLLM deployment startup time and minimize required disk space per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and the compilation cache to the Ray nodes.

In Cloud Shell, run the following command:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

  1. Create the Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
  2. Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"

    GKE creates the following resources for the LLM:

    1. A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
    2. Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.
    Best practice:

    Use a file cache backed by tmpfs or Hyperdisk / Persistent Disk depending on the expected size of the model contents, for example, weight files. In this tutorial, you use the Cloud Storage FUSE file cache backed by RAM.
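The Workload Identity Federation principal passed to --member in the IAM binding follows a fixed pattern; only the project number, project ID, namespace, and ServiceAccount name vary. As a sketch (pure string assembly, with hypothetical example values):

```shell
# Assemble the Workload Identity principal used in the IAM binding.
# Arguments: project number, project ID, namespace, Kubernetes ServiceAccount name.
wi_principal() {
  project_number="$1"; project_id="$2"; namespace="$3"; ksa="$4"
  echo "principal://iam.googleapis.com/projects/${project_number}/locations/global/workloadIdentityPools/${project_id}.svc.id.goog/subject/ns/${namespace}/sa/${ksa}"
}

wi_principal 123456789 my-project default vllm-sa
# → principal://iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/my-project.svc.id.goog/subject/ns/default/sa/vllm-sa
```

If the binding appears to succeed but the Pod still gets permission errors, comparing this assembled string against the namespace and ServiceAccount the Pod actually runs under is a quick first check.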

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one system Pod and multiple worker Pods.

Llama-3-8B-Instruct

Create the RayCluster custom resource to deploy the Llama 3 8B instruction-tuned model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-head
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "2"
                  memory: 8G
                requests:
                  cpu: "2"
                  memory: 8G
              env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a TPU v5e single-host in a 2x4 topology.

Mistral-7B

Create the RayCluster custom resource to deploy the Mistral-7B model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-head
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "2"
                  memory: 8G
                requests:
                  cpu: "2"
                  memory: 8G
              env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a TPU v5e single-host in a 2x4 topology.

Llama 3.1 70B

Create the RayCluster custom resource to deploy the Llama 3.1 70B model by completing the following steps:

  1. Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-head
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "2"
                  memory: 8G
                requests:
                  cpu: "2"
                  memory: 8G
              env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a TPU v6e single-host in a 2x4 topology.

Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to the RayCluster resource and start serving the model.

  1. Verify that GKE created the RayCluster Service:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide

    The output is similar to the following:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP       HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###   ###.###.###.###

    Wait until the STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns have an IP address.

  2. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
  3. Verify that the Ray client can connect to the remote RayCluster custom resource:

    docker run --net=host -it ${VLLM_IMAGE} \
        ray list nodes --address http://localhost:8265

    The output is similar to the following:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2

    Table:
    ------------------------------
        NODE_ID     NODE_IP          IS_HEAD_NODE  STATE  STATE_MESSAGE  NODE_NAME        RESOURCES_TOTAL                 LABELS
    0   XXXXXXXXXX  ###.###.###.###  True          ALIVE                 ###.###.###.###  CPU: 2.0                        ray.io/node_id: XXXXXXXXXX
                                                                                          memory: #.### GiB
                                                                                          node:###.###.###.###: 1.0
                                                                                          node:__internal_head__: 1.0
                                                                                          object_store_memory: #.### GiB
    1   XXXXXXXXXX  ###.###.###.###  False         ALIVE                 ###.###.###.###  CPU: 100.0                      ray.io/node_id: XXXXXXXXXX
                                                                                          TPU: 8.0
                                                                                          TPU-v#e-8-head: 1.0
                                                                                          accelerator_type:TPU-V#E: 1.0
                                                                                          memory: ###.### GiB
                                                                                          node:###.###.###.###: 1.0
                                                                                          object_store_memory: ##.### GiB
                                                                                          tpu-group-0: 1.0

Deploy the model with vLLM

To deploy a specific model with vLLM, follow these instructions.

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

  1. Click the Web Preview button, which can be found on the top right of the Cloud Shell taskbar.
  2. Click Change port and set the port number to 8265.
  3. Click Change and Preview.
  4. On the Ray Dashboard, click the Serve tab.

After the Serve deployment has a HEALTHY status, the model is ready to begin processing inputs.

Serve the model

This guide highlights models that support text generation, a technique that allows text content creation from a prompt.

Llama-3-8B-Instruct

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Expand the following section to see an example of the output.

{"prompt": "What are the top 5 most popular programming languages? Be brief.", "text": " (Note: This answer may change over time.)\n\nAccording to the TIOBE Index, a widely followed measure of programming language popularity, the top 5 languages are:\n\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nThese rankings are based on a combination of search engine queries, web traffic, and online courses. Keep in mind that other sources may have slightly different rankings. (Source: TIOBE Index, August 2022)", "token_ids": [320, 9290, 25, 1115, 4320, 1253, 2349, 927, 892, 9456, 11439, 311, 279, 350, 3895, 11855, 8167, 11, 264, 13882, 8272, 6767, 315, 15840, 4221, 23354, 11, 279, 1948, 220, 20, 15823, 527, 1473, 16, 13, 13210, 198, 17, 13, 13325, 198, 18, 13, 8102, 198, 19, 13, 356, 23792, 20, 13, 356, 27585, 9673, 33407, 527, 3196, 389, 264, 10824, 315, 2778, 4817, 20126, 11, 3566, 9629, 11, 323, 2930, 14307, 13, 13969, 304, 4059, 430, 1023, 8336, 1253, 617, 10284, 2204, 33407, 13, 320, 3692, 25, 350, 3895, 11855, 8167, 11, 6287, 220, 2366, 17, 8, 128009]}
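If the endpoint rejects a request, a malformed body is a common cause. The request body can be validated locally before sending; this is a generic JSON check, not specific to vLLM or Ray Serve:

```shell
# Validate the request body as JSON before POSTing it to the Serve endpoint.
PAYLOAD='{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload is valid JSON"
```

Once the payload validates, pass it to curl with `-d "$PAYLOAD"` exactly as in the step above.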

Mistral-7B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Expand the following section to see an example of the output.

{"prompt": "What are the top 5 most popular programming languages? Be brief.", "text": "\n\n1. #", "token_ids": [781, 781, 29508, 29491, 27049, 29515, 1162, 1081, 1491, 2075, 1122, 5454, 4867, 29493, 7079, 1122, 4466, 29501, 2973, 7535, 1056, 1072, 4435, 11384, 5454, 3652, 3804, 29491, 781, 781, 29518, 29491, 22134, 29515, 1292, 4444, 1122, 1639, 26001, 1072, 1988, 3205, 29493, 1146, 29510, 29481, 13343, 2075, 1122, 5454, 4867, 29493, 6367, 5936, 29493, 1946, 6411, 29493, 1072, 11237, 22031, 29491, 781, 781, 29538, 29491, 12407, 29515, 1098, 3720, 29501, 15460, 4664, 17060, 4610, 2075, 1065, 1032, 6103, 3587, 1070, 9197, 29493, 3258, 13422, 1722, 4867, 29493, 5454, 4113, 29493, 1072, 19123, 29501, 5172, 9197, 29491, 781, 781, 29549, 29491, 1102, 29539, 29515, 9355, 1054, 1254, 8670, 29493, 1146, 29510, 29481, 3376, 2075, 1122, 9723, 25470, 14189, 29493, 2807, 4867, 1093, 2501, 1240, 1325, 1072, 5454, 4867, 1093, 2877, 29521, 29491, 12466, 1377, 781, 781, 29550, 29491, 6475, 7554, 29515, 1098, 26434, 1067, 1070, 27049, 1137, 14401, 12052, 1830, 25460, 1072, 1567, 4958, 1122, 3243, 29501, 6473, 29493, 9855, 1290, 27049, 9197, 29491, 2]}

Llama 3.1 70B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Expand the following section to see an example of the output.

{"prompt": "What are the top 5 most popular programming languages? Be brief.", "text": " This is a very subjective question, but there are some general guidelines to follow when selecting a language. For example, if you\u2019re looking for a language that\u2019s easy to learn, you might want to consider Python. It\u2019s one of the most popular languages in the world, and it\u2019s also relatively easy to learn. If you\u2019re looking for a language that\u2019s more powerful, you might want to consider Java. It\u2019s a more complex language, but it\u2019s also very popular. Whichever language you choose, make sure you do your research and pick one that\u2019s right for you.\nThe most popular programming languages are:\nWhy is C++ so popular?\nC++ is a powerful and versatile language that is used in many different types of software. It is also one of the most popular programming languages, with a large community of developers who are always creating new and innovative ways to use it. One of the reasons why C++ is so popular is because it is a very efficient language. It allows developers to write code that is both fast and reliable, which is essential for many types of software. Additionally, C++ is very flexible, meaning that it can be used for a wide range of different purposes. Finally, C++ is also very popular because it is easy to learn. There are many resources available online and in books that can help anyone get started with learning the language.\nJava is a versatile language that can be used for a variety of purposes. It is one of the most popular programming languages in the world and is used by millions of people around the globe. Java is used for everything from developing desktop applications to creating mobile apps and games. It is also a popular choice for web development. One of the reasons why Java is so popular is because it is a platform-independent language. This means that it can be used on any type of computer or device, regardless of the operating system. Java is also very versatile and can be used for a variety of different purposes.", "token_ids": [1115, 374, 264, 1633, 44122, 3488, 11, 719, 1070, 527, 1063, 4689, 17959, 311, 1833, 994, 27397, 264, 4221, 13, 1789, 3187, 11, 422, 499, 3207, 3411, 369, 264, 4221, 430, 753, 4228, 311, 4048, 11, 499, 2643, 1390, 311, 2980, 13325, 13, 1102, 753, 832, 315, 279, 1455, 5526, 15823, 304, 279, 1917, 11, 323, 433, 753, 1101, 12309, 4228, 311, 4048, 13, 1442, 499, 3207, 3411, 369, 264, 4221, 430, 753, 810, 8147, 11, 499, 2643, 1390, 311, 2980, 8102, 13, 1102, 753, 264, 810, 6485, 4221, 11, 719, 433, 753, 1101, 1633, 5526, 13, 1254, 46669, 4221, 499, 5268, 11, 1304, 2771, 499, 656, 701, 3495, 323, 3820, 832, 430, 753, 1314, 369, 499, 627, 791, 1455, 5526, 15840, 15823, 527, 512, 10445, 374, 356, 1044, 779, 5526, 5380, 34, 1044, 374, 264, 8147, 323, 33045, 4221, 430, 374, 1511, 304, 1690, 2204, 4595, 315, 3241, 13, 1102, 374, 1101, 832, 315, 279, 1455, 5526, 15840, 15823, 11, 449, 264, 3544, 4029, 315, 13707, 889, 527, 2744, 6968, 502, 323, 18699, 5627, 311, 1005, 433, 13, 3861, 315, 279, 8125, 3249, 356, 1044, 374, 779, 5526, 374, 1606, 433, 374, 264, 1633, 11297, 4221, 13, 1102, 6276, 13707, 311, 3350, 2082, 430, 374, 2225, 5043, 323, 15062, 11, 902, 374, 7718, 369, 1690, 4595, 315, 3241, 13, 23212, 11, 356, 1044, 374, 1633, 19303, 11, 7438, 430, 433, 649, 387, 1511, 369, 264, 7029, 2134, 315, 2204, 10096, 13, 17830, 11, 356, 1044, 374, 1101, 1633, 5526, 1606, 433, 374, 4228, 311, 4048, 13, 2684, 527, 1690, 5070, 2561, 2930, 323, 304, 6603, 430, 649, 1520, 5606, 636, 3940, 449, 6975, 279, 4221, 627, 15391, 374, 264, 33045, 4221, 430, 649, 387, 1511, 369, 264, 8205, 315, 10096, 13, 1102, 374, 832, 315, 279, 1455, 5526, 15840, 15823, 304, 279, 1917, 323, 374, 1511, 555, 11990, 315, 1274, 2212, 279, 24867, 13, 8102, 374, 1511, 369, 4395, 505, 11469, 17963, 8522, 311, 6968, 6505, 10721, 323, 3953, 13, 1102, 374, 1101, 264, 5526, 5873, 369, 3566, 4500, 13, 3861, 315, 279, 8125, 3249, 8102, 374, 779, 5526, 374, 1606, 433, 374, 264, 5452, 98885, 4221, 13, 1115, 3445, 430, 433, 649, 387, 1511, 389, 904, 955, 315, 6500, 477, 3756, 11, 15851, 315, 279, 10565, 1887, 13, 8102, 374, 1101, 1633, 33045, 323, 649, 387, 1511, 369, 264, 8205, 315, 2204, 10096, 13, 128001]}
Success: You have deployed an LLM with a RayCluster custom resource in a single-host TPU slice.

Additional configuration

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

  1. Delete the RayCluster custom resource that you created in this tutorial:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
  2. Create the RayService custom resource to deploy a model:

    Llama-3-8B-Instruct

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-worker
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  env:
                  - name: JAX_PLATFORMS
                    value: "tpu"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a worker group that contains a TPU v5e single-host in a 2x4 topology.

    Mistral-7B

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-worker
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  env:
                  - name: JAX_PLATFORMS
                    value: "tpu"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a worker group containing a TPU v5e single-host in a 2x4 topology.

    Llama 3.1 70B

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-worker
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  env:
                  - name: JAX_PLATFORMS
                    value: "tpu"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

      The envsubst command replaces the environment variables in the manifest.

    GKE creates a RayCluster custom resource where the Ray Serve application is deployed, and then creates the corresponding RayService custom resource.

  3. Verify the status of the RayService resource:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu

    Wait for the Service status to change to Running:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
  4. Retrieve the name of the RayCluster head service:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
      --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
    Note: If the RayCluster head service value is not retrieved, manually update the SERVICE_NAME value by running the kubectl get services --namespace ${NAMESPACE} command.
  5. Establish port-forwarding sessions to the Ray head to view the Ray dashboard:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
  6. View the Ray Dashboard.

  7. Serve the model.

  8. Clean up the RayService resource:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu

Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

  • The first model is the assistant model that answers questions asked in the prompt.
  • The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is the summarized version of the response from the assistant model.
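The assistant-to-summarizer handoff can be sketched in plain Python, with no Ray or TPUs involved. The functions below are illustrative stand-ins for the two model calls, not code from the sample repository; in the deployed application, Ray Serve performs this chaining between the two vLLM deployments.

```python
# Minimal sketch of the two-stage chain: the assistant model's output
# becomes the summarizer model's input. Both functions are hypothetical
# placeholders for the Llama 3 8B IT and Gemma 7B IT calls.
def assistant(prompt: str) -> str:
    return f"Detailed answer to: {prompt}"

def summarizer(text: str) -> str:
    return f"Summary of ({text})"

def compose(prompt: str) -> str:
    # The final result is the summarized version of the assistant's response.
    return summarizer(assistant(prompt))

print(compose("Why is Python popular?"))
```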
  1. Get access to the Gemma model by completing the following steps:

    1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
    2. Access the model consent page on Kaggle.com.
    3. Sign in to Kaggle, if you haven't done so already.
    4. ClickRequest Access.
    5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for granting consent.
    6. Accept the model Terms and Conditions.
  2. Set up your environment:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
  3. For Standard clusters, create an additional single-host TPU slice node pool:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1

    Replace MACHINE_TYPE with one of the following machine types:

    • ct5lp-hightpu-8t to provision TPU v5e.
    • ct6e-standard-8t to provision TPU v6e.

    Autopilot clusters automatically provision the required nodes.

  4. Deploy the RayService resource based on the TPU version that you want to use:

    TPU v5e

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              route_prefix: /
              import_path: ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
              deployments:
              - name: MultiModelDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                  SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                  TPU_CHIPS: "16"
                  TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

    TPU v6e

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              route_prefix: /
              import_path: ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
              deployments:
              - name: MultiModelDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                  SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                  TPU_CHIPS: "16"
                  TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
  5. Wait for the status of the RayService resource to change to Running:

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu

    The output is similar to the following:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2

    In this output, the Running status indicates that the RayService resource is ready.

  6. Confirm that GKE created the Service for the Ray Serve application:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc

    The output is similar to the following:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
  7. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
  8. Send a request to the model:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

    The output is similar to the following:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}

Build and deploy the TPU image

This tutorial uses hosted TPU images from vLLM. vLLM provides a Dockerfile.tpu that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.

Note: You need to be granted the roles/artifactregistry.admin role to create and manage Artifact Registry repositories.
  1. Create a Docker repository to store the container images for this guide:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
  2. Clone the vLLM repository:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
  3. Build the image:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu
  4. Tag the TPU image with your Artifact Registry name:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}

    Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.

  5. Push the image to Artifact Registry:

    docker push ${VLLM_IMAGE}

Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

  1. Delete the RayCluster custom resource:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
  2. Delete the Cloud Storage bucket:

    gcloud storage rm -r gs://${GSBUCKET}
  3. Delete the Artifact Registry repository:

    gcloud artifacts repositories delete vllm-tpu \
      --location=${COMPUTE_REGION}
  4. Delete the cluster:

    gcloud container clusters delete ${CLUSTER_NAME} \
      --location=LOCATION

    Replace LOCATION with one of the following environment variables:

    • For Autopilot clusters, use COMPUTE_REGION.
    • For Standard clusters, use COMPUTE_ZONE.

Delete the project

If you deployed the tutorial in a new Google Cloud project, and if you no longer need the project, then delete it by completing the following steps:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.