Serve Gemma open models using GPUs on GKE with Triton and TensorRT-LLM
This tutorial demonstrates how to deploy and serve a Gemma large language model (LLM) using GPUs on Google Kubernetes Engine (GKE) with the NVIDIA Triton and TensorRT-LLM serving stack. This provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment. You deploy a pre-built container with Triton and TensorRT-LLM to GKE. You also configure GKE to load the Gemma 2B and 7B weights.
Tip: For production deployments on GKE, we strongly recommend using Inference Quickstart to get tailored best practices and configurations for your model inference.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and for Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs on H100, A100, and L4 GPU hardware. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
If you need a unified managed AI platform to rapidly build and serve ML models cost-effectively, we recommend that you try our Vertex AI deployment solution.
Before reading this page, ensure that you're familiar with the following:
Background
This section describes the key technologies used in this guide.
Gemma
Gemma is a set of openly available, lightweight, generative artificial intelligence (AI) models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. You can use the Gemma models for text generation, and you can also tune these models for specialized tasks.
To learn more, see the Gemma documentation.
GPUs
GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.
TensorRT-LLM
NVIDIA TensorRT-LLM (TRT-LLM) is a toolkit with a Python API for assembling optimized solutions to define LLMs and build TensorRT engines that perform inference efficiently on NVIDIA GPUs. TensorRT-LLM includes features such as:
- Optimized transformer implementation with layer fusions, activation caching, memory buffer reuse, and PagedAttention
- In-flight or continuous batching to improve the overall serving throughput
- Tensor parallelism and pipeline parallelism for distributed serving on multiple GPUs
- Quantization (FP16, FP8, INT8)
To learn more, refer to the TensorRT-LLM documentation.
Triton
NVIDIA Triton Inference Server is an open source inference server for AI/ML applications. Triton supports high-performance inference on both NVIDIA GPUs and CPUs with optimized backends, including TensorRT and TensorRT-LLM. Triton includes features such as:
- Multi-GPU, multi-node inference
- Concurrent multiple model execution
- Model ensembling or chaining
- Static, dynamic, and continuous or in-flight batching of prediction requests
To learn more, refer to the Triton documentation.
Objectives
- Prepare your environment with a GKE cluster in Autopilot mode.
- Deploy a container with Triton and TensorRT-LLM to your cluster.
- Use Triton and TensorRT-LLM to serve the Gemma 2B or 7B model through curl.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the required API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
In the Google Cloud console, go to the IAM page.
Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- Click Select a role, then search for the role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Kaggle account, if you don't already have one.
- Ensure your project has sufficient quota for L4 GPUs. To learn more, see About GPUs and Allocation quotas.
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export CLUSTER_NAME=CLUSTER_NAME

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. This region must support the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
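Optionally, you can confirm that the project and environment variables are set before you continue. This is only a quick check and assumes that you ran the preceding commands in the same Cloud Shell session:

gcloud config get project
echo ${CONTROL_PLANE_LOCATION} ${CLUSTER_NAME}

The first command prints the active project ID, and the second prints the region and cluster name that you exported.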
Get access to the model
To get access to the Gemma models, you must sign in to the Kaggle platform and get a Kaggle API token.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page on Kaggle.com.
- Log in to Kaggle if you haven't done so already.
- Click Request Access.
- In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
- Accept the model Terms and Conditions.
Generate an access token
To access the model through Kaggle, you need a Kaggle API token. Follow these steps to generate a new token if you don't have one already:
- In your browser, go to Kaggle settings.
- Under the API section, click Create New Token.
A file named kaggle.json is downloaded.
Upload the access token to Cloud Shell
In Cloud Shell, upload the Kaggle API token to your Google Cloud project:
- In Cloud Shell, click More > Upload.
- Select File and click Choose Files.
- Open the kaggle.json file.
- Click Upload.
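Optionally, you can confirm that the token is present in your Cloud Shell home directory. This check assumes that the file kept its default kaggle.json name during the upload:

ls -l kaggle.json

If the file isn't listed, repeat the upload, because a later step packages this file into a Kubernetes Secret.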
Create and configure Google Cloud resources
Follow these instructions to create the required resources.
Note: You may need to create a capacity reservation for usage of some accelerators. To learn how to reserve and consume reserved resources, see Consuming reserved zonal resources.

Create a GKE cluster and node pool
You can serve Gemma on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --release-channel=rapid \
    --cluster-version=1.28

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. This region must support the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
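Cluster creation can take several minutes. As an optional check, you can query the cluster status before you continue:

gcloud container clusters describe CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --format="value(status)"

The command prints RUNNING when the cluster is ready.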
Standard
In Cloud Shell, run the following command to create a Standard cluster:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --machine-type=e2-standard-4 \
    --num-nodes=1

Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. This region must support the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.
Run the following command to create a node pool for your cluster:

gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --cluster=CLUSTER_NAME \
    --machine-type=g2-standard-12 \
    --num-nodes=1

GKE creates a single node pool containing one L4 GPU node.
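As an optional check, you can confirm that the GPU node pool exists before you continue:

gcloud container node-pools list \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

The output should include a node pool named gpupool that uses the g2-standard-12 machine type.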
Create Kubernetes Secret for Kaggle credentials
In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
In Cloud Shell, do the following:
Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

Replace the following values:
- CONTROL_PLANE_LOCATION: the Compute Engine region of the control plane of your cluster. This region must support the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
Create a Secret to store the Kaggle credentials:
kubectl create secret generic kaggle-secret \
    --from-file=kaggle.json \
    --dry-run=client -o yaml | kubectl apply -f -
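Optionally, you can confirm that the Secret exists before you run the download Job, because the Job mounts this Secret to authenticate with Kaggle:

kubectl get secret kaggle-secret

The output should list the kaggle-secret Secret with one data item.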
Create a PersistentVolume resource to store checkpoints
In this section, you create a PersistentVolumeClaim. GKE uses it to dynamically provision a PersistentVolume, backed by a persistent disk, that stores the model checkpoints.
Create the following trtllm_checkpoint_pv.yaml manifest:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100G

Apply the manifest:

kubectl apply -f trtllm_checkpoint_pv.yaml
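As an optional check, you can inspect the claim:

kubectl get pvc model-data

Depending on your cluster's default StorageClass, the claim might remain in the Pending state until the first Pod that uses it is scheduled; this is expected for storage classes that use the WaitForFirstConsumer volume binding mode.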
Download the TensorRT-LLM engine files for Gemma
In this section, you run a Kubernetes Job to complete the following tasks:
- Download the TensorRT-LLM engine files and store the files in the PersistentVolume you created earlier.
- Prepare configuration files for deploying the model on the Triton server.
A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.
The following process can take a few minutes.
Gemma 2B-it
The TensorRT-LLM engine is built from the Gemma 2B-it (instruction tuned) PyTorch checkpoint of Gemma using bfloat16 activation, an input sequence length of 2048, and an output sequence length of 1024, targeting L4 GPUs. You can deploy the model on a single L4 GPU.
Create the following job-download-gemma-2b.yaml manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fetch-model-scripts
data:
  fetch_model.sh: |-
    #!/usr/bin/bash -x
    pip install kaggle --break-system-packages && \
    MODEL_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $2}') && \
    VARIATION_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $4}') && \
    ACTIVATION_DTYPE=bfloat16 && \
    TOKENIZER_DIR=/data/trt_engine/${MODEL_NAME}/${VARIATION_NAME}/${ACTIVATION_DTYPE}/${WORLD_SIZE}-gpu/tokenizer.model && \
    ENGINE_PATH=/data/trt_engine/${MODEL_NAME}/${VARIATION_NAME}/${ACTIVATION_DTYPE}/${WORLD_SIZE}-gpu/ && \
    TRITON_MODEL_REPO=/data/triton/model_repository && \
    mkdir -p /data/${MODEL_NAME}_${VARIATION_NAME} && \
    mkdir -p ${ENGINE_PATH} && \
    mkdir -p ${TRITON_MODEL_REPO} && \
    kaggle models instances versions download ${MODEL_PATH} --untar -p /data/${MODEL_NAME}_${VARIATION_NAME} && \
    rm -f /data/${MODEL_NAME}_${VARIATION_NAME}/*.tar.gz && \
    find /data/${MODEL_NAME}_${VARIATION_NAME} -type f && \
    find /data/${MODEL_NAME}_${VARIATION_NAME} -type f | xargs -I '{}' mv '{}' ${ENGINE_PATH} && \
    # copying configuration files
    echo -e "\nCreating configuration files" && \
    cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* ${TRITON_MODEL_REPO} && \
    # updating configuration files
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:sp,triton_max_batch_size:64,preprocessing_instance_count:1 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:sp,triton_max_batch_size:64,postprocessing_instance_count:1 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:64 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600,batch_scheduler_policy:guaranteed_no_evict,enable_trt_overlap:False && \
    echo -e "\nCompleted extraction to ${ENGINE_PATH}"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: data-loader-gemma-2b
  labels:
    app: data-loader-gemma-2b
spec:
  ttlSecondsAfterFinished: 120
  template:
    metadata:
      labels:
        app: data-loader-gemma-2b
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gcloud
        image: us-docker.pkg.dev/google-samples/containers/gke/tritonserver:2.42.0
        command:
        - /scripts/fetch_model.sh
        env:
        - name: KAGGLE_CONFIG_DIR
          value: /kaggle
        - name: MODEL_PATH
          value: "google/gemma/tensorrtllm/2b-it/2"
        - name: WORLD_SIZE
          value: "1"
        volumeMounts:
        - mountPath: "/kaggle/"
          name: kaggle-credentials
          readOnly: true
        - mountPath: "/scripts/"
          name: scripts-volume
          readOnly: true
        - mountPath: "/data"
          name: data
      volumes:
      - name: kaggle-credentials
        secret:
          defaultMode: 0400
          secretName: kaggle-secret
      - name: scripts-volume
        configMap:
          defaultMode: 0700
          name: fetch-model-scripts
      - name: data
        persistentVolumeClaim:
          claimName: model-data
      tolerations:
      - key: "key"
        operator: "Exists"
        effect: "NoSchedule"

Apply the manifest:
kubectl apply -f job-download-gemma-2b.yaml

View the logs for the Job:

kubectl logs -f job/data-loader-gemma-2b

The output from the logs is similar to the following:

...
Creating configuration files
+ echo -e '\n02-16-2024 04:07:45 Completed building TensortRT-LLM engine at /data/trt_engine/gemma/2b/bfloat16/1-gpu/'
+ echo -e '\nCreating configuration files'
...

Wait for the Job to complete:

kubectl wait --for=condition=complete --timeout=900s job/data-loader-gemma-2b

The output is similar to the following:

job.batch/data-loader-gemma-2b condition met

Verify the Job completed successfully (this may take a few minutes):

kubectl get job/data-loader-gemma-2b

The output is similar to the following:

NAME                   COMPLETIONS   DURATION   AGE
data-loader-gemma-2b   1/1           ##s        #m##s
Gemma 7B-it
The TensorRT-LLM engine is built from the Gemma 7B-it (instruction tuned) PyTorch checkpoint of Gemma using bfloat16 activation, an input sequence length of 1024, and an output sequence length of 512, targeting L4 GPUs. You can deploy the model on a single L4 GPU.
Create the following job-download-gemma-7b.yaml manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fetch-model-scripts
data:
  fetch_model.sh: |-
    #!/usr/bin/bash -x
    pip install kaggle --break-system-packages && \
    MODEL_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $2}') && \
    VARIATION_NAME=$(echo ${MODEL_PATH} | awk -F'/' '{print $4}') && \
    ACTIVATION_DTYPE=bfloat16 && \
    TOKENIZER_DIR=/data/trt_engine/${MODEL_NAME}/${VARIATION_NAME}/${ACTIVATION_DTYPE}/${WORLD_SIZE}-gpu/tokenizer.model && \
    ENGINE_PATH=/data/trt_engine/${MODEL_NAME}/${VARIATION_NAME}/${ACTIVATION_DTYPE}/${WORLD_SIZE}-gpu/ && \
    TRITON_MODEL_REPO=/data/triton/model_repository && \
    mkdir -p ${ENGINE_PATH} && \
    mkdir -p ${TRITON_MODEL_REPO} && \
    kaggle models instances versions download ${MODEL_PATH} --untar -p /data/${MODEL_NAME}_${VARIATION_NAME} && \
    rm -f /data/${MODEL_NAME}_${VARIATION_NAME}/*.tar.gz && \
    find /data/${MODEL_NAME}_${VARIATION_NAME} -type f && \
    find /data/${MODEL_NAME}_${VARIATION_NAME} -type f | xargs -I '{}' mv '{}' ${ENGINE_PATH} && \
    # copying configuration files
    echo -e "\nCreating configuration files" && \
    cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* ${TRITON_MODEL_REPO} && \
    # updating configuration files
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:sp,triton_max_batch_size:64,preprocessing_instance_count:1 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:sp,triton_max_batch_size:64,postprocessing_instance_count:1 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:64 && \
    python3 /tensorrtllm_backend/tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600,batch_scheduler_policy:guaranteed_no_evict,enable_trt_overlap:False && \
    echo -e "\nCompleted extraction to ${ENGINE_PATH}"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: data-loader-gemma-7b
  labels:
    app: data-loader-gemma-7b
spec:
  ttlSecondsAfterFinished: 120
  template:
    metadata:
      labels:
        app: data-loader-gemma-7b
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gcloud
        image: us-docker.pkg.dev/google-samples/containers/gke/tritonserver:2.42.0
        command:
        - /scripts/fetch_model.sh
        env:
        - name: KAGGLE_CONFIG_DIR
          value: /kaggle
        - name: MODEL_PATH
          value: "google/gemma/tensorrtllm/7b-it/2"
        - name: WORLD_SIZE
          value: "1"
        volumeMounts:
        - mountPath: "/kaggle/"
          name: kaggle-credentials
          readOnly: true
        - mountPath: "/scripts/"
          name: scripts-volume
          readOnly: true
        - mountPath: "/data"
          name: data
      volumes:
      - name: kaggle-credentials
        secret:
          defaultMode: 0400
          secretName: kaggle-secret
      - name: scripts-volume
        configMap:
          defaultMode: 0700
          name: fetch-model-scripts
      - name: data
        persistentVolumeClaim:
          claimName: model-data
      tolerations:
      - key: "key"
        operator: "Exists"
        effect: "NoSchedule"

Apply the manifest:
kubectl apply -f job-download-gemma-7b.yaml

View the logs for the Job:

kubectl logs -f job/data-loader-gemma-7b

The output from the logs is similar to the following:

...
Creating configuration files
+ echo -e '\n02-16-2024 04:07:45 Completed building TensortRT-LLM engine at /data/trt_engine/gemma/7b/bfloat16/1-gpu/'
+ echo -e '\nCreating configuration files'
...

Wait for the Job to complete:

kubectl wait --for=condition=complete --timeout=900s job/data-loader-gemma-7b

The output is similar to the following:

job.batch/data-loader-gemma-7b condition met

Verify the Job completed successfully (this may take a few minutes):

kubectl get job/data-loader-gemma-7b

The output is similar to the following:

NAME                   COMPLETIONS   DURATION   AGE
data-loader-gemma-7b   1/1           ##s        #m##s
Make sure that the Job completed successfully before proceeding to the next section.
Deploy Triton
In this section, you deploy a container that uses Triton with the TensorRT-LLM backend to serve the Gemma model you want to use.
Create the following deploy-triton-server.yaml manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: launch-tritonserver
data:
  entrypoint.sh: |-
    #!/usr/bin/bash -x
    # Launch Triton Inference server
    WORLD_SIZE=1
    TRITON_MODEL_REPO=/data/triton/model_repository
    python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
      --world_size ${WORLD_SIZE} \
      --model_repo ${TRITON_MODEL_REPO}
    tail -f /dev/null
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-gemma-deployment
  labels:
    app: gemma-server
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
      version: v1
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma
        ai.gke.io/inference-server: triton
        examples.ai.gke.io/source: user-guide
        version: v1
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/google-samples/containers/gke/tritonserver:2.42.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            ephemeral-storage: "40Gi"
            memory: "40Gi"
            nvidia.com/gpu: 1
          limits:
            ephemeral-storage: "40Gi"
            memory: "40Gi"
            nvidia.com/gpu: 1
        command:
        - /scripts/entrypoint.sh
        volumeMounts:
        - mountPath: "/scripts/"
          name: scripts-volume
          readOnly: true
        - mountPath: "/data"
          name: data
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        - containerPort: 8002
          name: metrics
        livenessProbe:
          failureThreshold: 60
          initialDelaySeconds: 600
          periodSeconds: 5
          httpGet:
            path: /v2/health/live
            port: http
        readinessProbe:
          failureThreshold: 60
          initialDelaySeconds: 600
          periodSeconds: 5
          httpGet:
            path: /v2/health/ready
            port: http
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      volumes:
      - name: scripts-volume
        configMap:
          defaultMode: 0700
          name: launch-tritonserver
      - name: data
        persistentVolumeClaim:
          claimName: model-data
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      tolerations:
      - key: "key"
        operator: "Exists"
        effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  labels:
    app: gemma-server
spec:
  type: ClusterIP
  ports:
  - port: 8000
    targetPort: http
    name: http-inference-server
  - port: 8001
    targetPort: grpc
    name: grpc-inference-server
  - port: 8002
    targetPort: metrics
    name: http-metrics
  selector:
    app: gemma-server

Apply the manifest:
kubectl apply -f deploy-triton-server.yaml

Wait for the deployment to be available:

kubectl wait --for=condition=Available --timeout=900s deployment/triton-gemma-deployment

View the logs from the Deployment:

kubectl logs -f -l app=gemma-server

The Deployment resource launches the Triton server and loads the model data. This process can take a few minutes (up to 20 minutes or longer). The output is similar to the following:
I0216 03:24:57.387420 29 server.cc:676]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+
...
...
...
I0216 03:24:57.425104 29 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0216 03:24:57.425418 29 http_server.cc:4623] Started HTTPService at 0.0.0.0:8000
I0216 03:24:57.466646 29 http_server.cc:315] Started Metrics Service at 0.0.0.0:8002
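If the Deployment doesn't become available, you can optionally check the Pod status. The label selector in the following command matches the app: gemma-server label from the manifest:

kubectl get pods -l app=gemma-server

A Pod that stays in the Pending state usually means that GKE is still provisioning a GPU node for the workload.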
Serve the model
In this section, you interact with the model.
Set up port forwarding
Run the following command to set up port forwarding to the model:
kubectl port-forward service/triton-server 8000:8000

The output is similar to the following:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000

Interact with the model using curl
This section shows how you can perform a basic smoke test to verify your deployed instruction tuned model. For simplicity, this section describes the testing approach using only the 2B instruction tuned model.
In a new terminal session, use curl to chat with your model:

USER_PROMPT="I'm new to coding. If you could only recommend one programming language to start with, what would it be and why?"

curl -X POST localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
    "text_input": "<start_of_turn>user\n${USER_PROMPT}<end_of_turn>\n",
    "temperature": 0.9,
    "max_tokens": 128
}
EOF

The following output shows an example of the model response:
{ "context_logits": 0, "cum_log_probs": 0, "generation_logits": 0, "model_name": "ensemble", "model_version": "1", "output_log_probs": [0.0,0.0,...], "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output":"Python.\n\nPython is an excellent choice for beginners due to its simplicity, readability, and extensive documentation. Its syntax is close to natural language, making it easier for beginners to understand and write code. Python also has a vast collection of libraries and tools that make it versatile for various projects. Additionally, Python's dynamic nature allows for easier learning and experimentation, making it a perfect choice for newcomers to get started.Here are some specific reasons why Python is a good choice for beginners:\n\n- Simple and Easy to Read: Python's syntax is designed to be close to natural language, making it easier for"}Troubleshoot issues
- If you get the message Empty reply from server, it's possible the container has not finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that your port forwarding is active.
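As an additional check, you can query the Triton health endpoints through the same port-forwarding session. These are the same /v2/health/live and /v2/health/ready paths that the Deployment manifest uses for its probes:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

A 200 response code indicates that the server and the loaded models are ready to serve requests.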
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

What's next
- Learn more about GPUs in GKE.
- Learn how to deploy GPU workloads in Autopilot.
- Learn how to deploy GPU workloads in Standard.
- Explore the TensorRT-LLM GitHub repository and documentation.
- Explore the Vertex AI Model Garden.
- Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.