Serve scalable LLMs on GKE with TorchServe
This tutorial shows you how to deploy and serve a scalable machine learning (ML) model to a Google Kubernetes Engine (GKE) cluster using the TorchServe framework. You serve a pre-trained PyTorch model that generates predictions based on user requests. After you deploy the model, you get a prediction URL that your application uses to send prediction requests. This method lets you scale the model and web application independently. When you deploy the ML workload and application on Autopilot, GKE chooses the most efficient underlying machine type and size to run the workloads.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and for Data and AI specialists who are interested in using GKE Autopilot to reduce administrative overhead for node configuration, scaling, and upgrades. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with GKE Autopilot mode.
About the tutorial application
The application is a small Python web application created using the Fast Dash framework. You use the application to send prediction requests to the T5 model. This application captures user text inputs and language pairs and sends the information to the model. The model translates the text and returns the result to the application, which displays the result to the user. For more information about Fast Dash, see the Fast Dash documentation.
Objectives
- Prepare a pre-trained T5 model from the Hugging Face repository for serving by packaging it as a container image and pushing it to Artifact Registry
- Deploy the model to an Autopilot cluster
- Deploy the Fast Dash application that communicates with the model
- Autoscale the model based on Prometheus metrics
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
Install the Google Cloud CLI.
Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
    gcloud init
Create or select a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Create a Google Cloud project:
    gcloud projects create PROJECT_ID
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
Replace PROJECT_ID with your Google Cloud project name.
Verify that billing is enabled for your Google Cloud project.
Enable the Kubernetes Engine, Cloud Storage, Artifact Registry, and Cloud Build APIs:
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
    gcloud services enable container.googleapis.com \
        storage.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com
Prepare the environment
Clone the example repository and open the tutorial directory:
    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples/ai-ml/t5-model-serving
Create the cluster
Run the following command:
    gcloud container clusters create-auto ml-cluster \
        --release-channel=RELEASE_CHANNEL \
        --cluster-version=CLUSTER_VERSION \
        --location=us-central1
Replace the following:
- RELEASE_CHANNEL: the release channel for your cluster. Must be one of rapid, regular, or stable. Choose a channel that has GKE version 1.28.3-gke.1203000 or later to use L4 GPUs. To see the versions available in a specific channel, see View the default and available versions for release channels.
- CLUSTER_VERSION: the GKE version to use. Must be 1.28.3-gke.1203000 or later.
This operation takes several minutes to complete.
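The version requirement above can be checked before you create the cluster. The following is a minimal sketch, assuming that version strings always follow the MAJOR.MINOR.PATCH-gke.BUILD pattern seen in this tutorial; the function names are hypothetical and are not part of any Google Cloud tool.

```python
import re

# Minimum GKE version that this tutorial states for L4 GPU support.
MIN_VERSION = "1.28.3-gke.1203000"


def parse_gke_version(version):
    """Split 'MAJOR.MINOR.PATCH-gke.BUILD' into a tuple of four integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MIN_VERSION):
    """Return True if `version` is the same as or newer than `minimum`."""
    # Tuples compare element by element, which matches version ordering here.
    return parse_gke_version(version) >= parse_gke_version(minimum)
```

For example, `meets_minimum("1.28.3-gke.1202999")` is false because only the build number falls short of the minimum.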
Create an Artifact Registry repository
Create a new Artifact Registry standard repository with the Docker format in the same region as your cluster:
    gcloud artifacts repositories create models \
        --repository-format=docker \
        --location=us-central1 \
        --description="Repo for T5 serving image"
Verify the repository name:
    gcloud artifacts repositories describe models \
        --location=us-central1
The output is similar to the following:
    Encryption: Google-managed key
    Repository Size: 0.000MB
    createTime: '2023-06-14T15:48:35.267196Z'
    description: Repo for T5 serving image
    format: DOCKER
    mode: STANDARD_REPOSITORY
    name: projects/PROJECT_ID/locations/us-central1/repositories/models
    updateTime: '2023-06-14T15:48:35.267196Z'
Package the model
In this section, you package the model and the serving framework in a single container image using Cloud Build and push the resulting image to the Artifact Registry repository.
Review the Dockerfile for the container image:
    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    ARG BASE_IMAGE=pytorch/torchserve:0.12.0-cpu

    FROM alpine/git

    ARG MODEL_NAME=t5-small
    ARG MODEL_REPO=https://huggingface.co/${MODEL_NAME}
    ENV MODEL_NAME=${MODEL_NAME}
    ENV MODEL_VERSION=${MODEL_VERSION}

    RUN git clone "${MODEL_REPO}" /model

    FROM ${BASE_IMAGE}

    ARG MODEL_NAME=t5-small
    ARG MODEL_VERSION=1.0
    ENV MODEL_NAME=${MODEL_NAME}
    ENV MODEL_VERSION=${MODEL_VERSION}

    COPY --from=0 /model/. /home/model-server/
    COPY handler.py \
         model.py \
         requirements.txt \
         setup_config.json /home/model-server/

    RUN torch-model-archiver \
        --model-name="${MODEL_NAME}" \
        --version="${MODEL_VERSION}" \
        --model-file="model.py" \
        --serialized-file="pytorch_model.bin" \
        --handler="handler.py" \
        --extra-files="config.json,spiece.model,tokenizer.json,setup_config.json" \
        --runtime="python" \
        --export-path="model-store" \
        --requirements-file="requirements.txt"

    FROM ${BASE_IMAGE}

    ENV PATH /home/model-server/.local/bin:$PATH
    ENV TS_CONFIG_FILE /home/model-server/config.properties
    # CPU inference will throw a warning cuda warning (not error)
    # Could not load dynamic library 'libnvinfer_plugin.so.7'
    # This is expected behaviour. see: https://stackoverflow.com/a/61137388
    ENV TF_CPP_MIN_LOG_LEVEL 2

    COPY --from=1 /home/model-server/model-store/ /home/model-server/model-store
    COPY config.properties /home/model-server/
This Dockerfile defines the following multiple stage build process:
- Download the model artifacts from the Hugging Face repository.
- Package the model using the PyTorch Serving Archive tool. This creates a model archive (.mar) file that the inference server uses to load the model.
- Build the final image with PyTorch Serve.
Build and push the image using Cloud Build:
    gcloud builds submit model/ \
        --region=us-central1 \
        --config=model/cloudbuild.yaml \
        --substitutions=_LOCATION=us-central1,_MACHINE=gpu,_MODEL_NAME=t5-small,_MODEL_VERSION=1.0
The build process takes several minutes to complete. If you use a larger model size than t5-small, the build process might take significantly more time.
Check that the image is in the repository:
    gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/models
Replace PROJECT_ID with your Google Cloud project ID.
The output is similar to the following:
    IMAGE                                                  DIGEST         CREATE_TIME          UPDATE_TIME
    us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small  sha256:0cd...  2023-06-14T12:06:38  2023-06-14T12:06:38
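The image path used in this tutorial is assembled from the build location, project ID, repository, and model name. The following is a minimal sketch of that composition. The `VERSION-MACHINE` tag format is an assumption inferred from the `t5-small:1.0-gpu` image reference in the Deployment manifest; the contents of `cloudbuild.yaml` are not shown on this page, and the `image_uri` helper is hypothetical.

```python
def image_uri(project_id, location="us-central1", repository="models",
              model_name="t5-small", model_version="1.0", machine="gpu"):
    """Compose an Artifact Registry Docker image reference.

    Defaults mirror the values passed to Cloud Build in this tutorial.
    The tag layout (version-machine) is an assumption, not a documented rule.
    """
    host = f"{location}-docker.pkg.dev"
    return f"{host}/{project_id}/{repository}/{model_name}:{model_version}-{machine}"
```

Calling `image_uri("my-project")` yields a reference of the same shape as the `image:` field in the serving manifest, with `my-project` standing in for your project ID.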
Deploy the packaged model to GKE
To deploy the image, this tutorial uses Kubernetes Deployments. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Modify the Kubernetes manifest in the example repository to match your environment.
Review the manifest for the inference workload:
    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: t5-inference
      labels:
        model: t5
        version: v1.0
        machine: gpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          model: t5
          version: v1.0
          machine: gpu
      template:
        metadata:
          labels:
            model: t5
            version: v1.0
            machine: gpu
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4
          securityContext:
            fsGroup: 1000
            runAsUser: 1000
            runAsGroup: 1000
          containers:
            - name: inference
              image: us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small:1.0-gpu
              imagePullPolicy: IfNotPresent
              args: ["torchserve", "--start", "--foreground"]
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "3000m"
                  memory: 16Gi
                  ephemeral-storage: 10Gi
                requests:
                  nvidia.com/gpu: "1"
                  cpu: "3000m"
                  memory: 16Gi
                  ephemeral-storage: 10Gi
              ports:
                - containerPort: 8080
                  name: http
                - containerPort: 8081
                  name: management
                - containerPort: 8082
                  name: metrics
              readinessProbe:
                httpGet:
                  path: /ping
                  port: http
                initialDelaySeconds: 120
                failureThreshold: 10
              livenessProbe:
                httpGet:
                  path: /models/t5-small
                  port: management
                initialDelaySeconds: 150
                periodSeconds: 5
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: t5-inference
      labels:
        model: t5
        version: v1.0
        machine: gpu
    spec:
      type: ClusterIP
      selector:
        model: t5
        version: v1.0
        machine: gpu
      ports:
        - port: 8080
          name: http
          targetPort: http
        - port: 8081
          name: management
          targetPort: management
        - port: 8082
          name: metrics
          targetPort: metrics
Replace PROJECT_ID with your Google Cloud project ID:
    sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/serving-gpu.yaml"
This ensures that the container image path in the Deployment specification matches the path to your T5 model image in Artifact Registry.
Create the Kubernetes resources:
    kubectl create -f kubernetes/serving-gpu.yaml
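The PROJECT_ID substitution that the sed command performs on the manifest can also be expressed in Python, which makes it easy to guard against leaving the placeholder in place. This is a minimal sketch; the `render_manifest` helper and the sample project ID are hypothetical and are not part of the example repository.

```python
PLACEHOLDER = "PROJECT_ID"


def render_manifest(manifest_text, project_id):
    """Return the manifest text with every PROJECT_ID placeholder replaced."""
    if not project_id or project_id == PLACEHOLDER:
        # Replacing the placeholder with itself would leave the image path invalid.
        raise ValueError("pass your real project ID, not the placeholder")
    return manifest_text.replace(PLACEHOLDER, project_id)


# Example with a single line taken from the serving manifest.
line = "image: us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small:1.0-gpu"
rendered = render_manifest(line, "my-project")  # "my-project" is a made-up ID
```

To apply it to the real file, you would read `kubernetes/serving-gpu.yaml`, pass its contents through the function, and write the result back.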
To verify that the model deployed successfully, do the following:
Get the status of the Deployment and the Service:
    kubectl get -f kubernetes/serving-gpu.yaml
Wait until the output shows ready Pods, similar to the following. Depending on the size of the image, the first image pull might take several minutes.
    NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/t5-inference   1/1     1            0           66s

    NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
    service/t5-inference   ClusterIP   10.48.131.86   <none>        8080/TCP,8081/TCP,8082/TCP   66s
Open a local port for the t5-inference Service:
    kubectl port-forward svc/t5-inference 8080
Open a new terminal window and send a test request to the Service:
    curl -v -X POST -H 'Content-Type: application/json' -d '{"text": "this is a test sentence", "from": "en", "to": "fr"}' "http://localhost:8080/predictions/t5-small/1.0"
If the test request fails and the Pod connection closes, check the logs:
    kubectl logs deployments/t5-inference
If the output is similar to the following, TorchServe failed to install some model dependencies:
    org.pytorch.serve.archive.model.ModelException: Custom pip package installation failed for t5-small
To resolve this issue, restart the Deployment:
    kubectl rollout restart deployment t5-inference
The Deployment controller creates a new Pod. Repeat the previous steps to open a port on the new Pod.
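The curl test request can also be sent from Python with only the standard library. The following is a minimal sketch, assuming the port-forward to `localhost:8080` is still active; the `build_request` and `translate` helpers are hypothetical names, and the response is returned as raw text because the exact response format of the tutorial's handler is not shown on this page.

```python
import json
import urllib.request


def build_request(text, source, target, base_url="http://localhost:8080",
                  model="t5-small", version="1.0"):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"text": text, "from": source, "to": target})
    return urllib.request.Request(
        url=f"{base_url}/predictions/{model}/{version}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, timeout=30.0):
    """Send the request and return the response body as text."""
    request = build_request(text, source, target)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the port-forward running, `translate("this is a test sentence", "en", "fr")` sends the same payload as the curl example.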
Access the deployed model using the web application
To access the deployed model with the Fast Dash web application, complete the following steps:
Build and push the Fast Dash web application as a container image in Artifact Registry:
    gcloud builds submit client-app/ \
        --region=us-central1 \
        --config=client-app/cloudbuild.yaml
Open kubernetes/application.yaml in a text editor and replace PROJECT_ID in the image: field with your project ID. Alternatively, run the following command:
    sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/application.yaml"
Create the Kubernetes resources:
    kubectl create -f kubernetes/application.yaml
The Deployment and Service might take some time to fully provision.
To check the status, run the following command:
    kubectl get -f kubernetes/application.yaml
Wait until the output shows ready Pods, similar to the following:
    NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/fastdash   1/1     1            0           1m

    NAME               TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    service/fastdash   NodePort   203.0.113.12   <none>        8050/TCP   1m
The web application is now running, although it isn't exposed on an external IP address. To access the web application, open a local port:
    kubectl port-forward service/fastdash 8050
In a browser, open the web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8050.
- If you're using Cloud Shell, click Web preview, and then click Change port. Specify port 8050.
To send a request to the T5 model, specify values in the TEXT, FROM LANG, and TO LANG fields in the web interface and click Submit. For a list of available languages, see the T5 documentation.
Enable autoscaling for the model
This section shows you how to enable autoscaling for the model based on metrics from Google Cloud Managed Service for Prometheus by doing the following:
- Install Custom Metrics Stackdriver Adapter
- Apply PodMonitoring and HorizontalPodAutoscaling configurations
Google Cloud Managed Service for Prometheus is enabled by default in Autopilot clusters running version 1.25 and later.
Install Custom Metrics Stackdriver Adapter
This adapter lets your cluster use metrics from Prometheus to make Kubernetes autoscaling decisions.
Deploy the adapter:
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
Create an IAM service account for the adapter to use:
    gcloud iam service-accounts create monitoring-viewer
Grant the IAM service account the monitoring.viewer role on the project and the iam.workloadIdentityUser role:
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
        --role roles/monitoring.viewer
    gcloud iam service-accounts add-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
Replace PROJECT_ID with your Google Cloud project ID.
Annotate the Kubernetes ServiceAccount of the adapter to let it impersonate the IAM service account:
    kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
        --namespace custom-metrics \
        iam.gke.io/gcp-service-account=monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com
Restart the adapter to propagate the changes:
    kubectl rollout restart deployment custom-metrics-stackdriver-adapter \
        --namespace=custom-metrics
Apply PodMonitoring and HorizontalPodAutoscaling configurations
PodMonitoring is a Google Cloud Managed Service for Prometheus custom resource that enables metrics ingestion and target scraping in a specific namespace.
Deploy the PodMonitoring resource in the same namespace as the TorchServe Deployment:
    kubectl apply -f kubernetes/pod-monitoring.yaml
Review the HorizontalPodAutoscaler manifest:
    # Copyright 2023 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: t5-inference
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: t5-inference
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Pods
        pods:
          metric:
            name: prometheus.googleapis.com|ts_queue_latency_microseconds|counter
          target:
            type: AverageValue
            averageValue: "30000"
The HorizontalPodAutoscaler scales the T5 model Pod quantity based on the cumulative duration of the request queue. Autoscaling is based on the ts_queue_latency_microseconds metric, which shows cumulative queue duration in microseconds.
Create the HorizontalPodAutoscaler:
    kubectl apply -f kubernetes/hpa.yaml
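To build intuition for how this configuration reacts to load, the following sketch applies the scaling rule that the Kubernetes documentation gives for the HorizontalPodAutoscaler: desired replicas equal the current replica count multiplied by the ratio of the observed per-Pod average to the target, rounded up and clamped to the configured bounds. It is a simplification under stated assumptions: it uses the default 10% tolerance, ignores stabilization windows and Pod readiness, and takes the metric values as already reported per Pod. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, metric_total, target_average=30000.0,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Estimate the replica count an HPA with an AverageValue target would request.

    metric_total is the sum of the metric across all current Pods.
    Defaults mirror averageValue, minReplicas, and maxReplicas in the manifest.
    """
    current_average = metric_total / current_replicas
    ratio = current_average / target_average
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        return current_replicas
    # current_replicas * ratio simplifies to metric_total / target_average.
    desired = math.ceil(metric_total / target_average)
    return max(min_replicas, min(max_replicas, desired))
```

For example, one Pod reporting 90000 against the 30000 target asks for three replicas, while a very large value is capped at the `maxReplicas` value of 5.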
Verify autoscaling using a load generator
To test your autoscaling configuration, generate load for the serving application. This tutorial uses a Locust load generator to send requests to the prediction endpoint for the model.
Create the load generator:
    kubectl apply -f kubernetes/loadgenerator.yaml
Wait for the load generator Pods to become ready.
Expose the load generator web interface locally:
    kubectl port-forward svc/loadgenerator 8080
If you see an error message, try again when the Pod is running.
In a browser, open the load generator web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8080.
- If you're using Cloud Shell, click Web preview, and then click Change port. Enter port 8080.
Click the Charts tab to observe performance over time.
Open a new terminal window and watch the replica count of your horizontal Pod autoscalers:
    kubectl get hpa -w
The number of replicas increases as the load increases. The scale-up might take approximately ten minutes. As new replicas start, the number of successful requests in the Locust chart increases.
    NAME           REFERENCE                 TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
    t5-inference   Deployment/t5-inference   71352001470m/7M   1         5         1          2m11s
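The TARGETS column uses Kubernetes quantity notation, in which a trailing `m` means thousandths and a trailing `M` means millions. The following sketch converts the sample value from the output above into plain numbers so that the current value and the target can be compared. It is a simplified parser that handles only the few decimal suffixes listed in the code, not the full Kubernetes quantity grammar, and the function names are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes decimal suffixes; binary suffixes such as Mi are not handled.
SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10**3),
    "M": Decimal(10**6),
    "G": Decimal(10**9),
}


def parse_quantity(text):
    """Convert a quantity string such as '7M' or '1500m' to a Decimal."""
    if text and text[-1] in SUFFIXES:
        return Decimal(text[:-1]) * SUFFIXES[text[-1]]
    return Decimal(text)


def parse_targets(column):
    """Split a 'current/target' TARGETS value into two Decimals."""
    current, target = column.split("/")
    return parse_quantity(current), parse_quantity(target)


current, target = parse_targets("71352001470m/7M")
```

Here `current` is 71352001.47 and `target` is 7000000, so the observed value is roughly ten times the target shown in that sample output.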
Recommendations
- Build your model with the same version of the base Docker image that you'll use for serving.
- If your model has special package dependencies, or if the size of your dependencies is large, create a custom version of your base Docker image.
- Watch the version tree of your model's dependency packages. Ensure that your package dependencies support each other's versions. For example, pandas version 2.0.3 supports NumPy version 1.20.3 and later.
- Run GPU-intensive models on GPU nodes and CPU-intensive models on CPU nodes. This could improve the stability of model serving and helps ensure that you consume node resources efficiently.
Observe model performance
To observe the model performance, you can use the TorchServe dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.
To use the TorchServe dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from TorchServe, in your GKE cluster. TorchServe exposes metrics in Prometheus format by default; you do not need to install an additional exporter.
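Because the metrics are plain Prometheus text, you can inspect them directly, for example after port-forwarding the metrics port (8082 in the serving manifest) and fetching its `/metrics` path. The following sketch extracts the values of one metric from such text. The sample text is hypothetical and only illustrates the format; the label names and values are assumptions, and the parser assumes that lines carry no trailing timestamp.

```python
# Hypothetical sample in Prometheus text exposition format (not real output).
SAMPLE = """\
# HELP ts_queue_latency_microseconds Cumulative queue duration in microseconds
# TYPE ts_queue_latency_microseconds counter
ts_queue_latency_microseconds{model_name="t5-small",model_version="1.0"} 364.07
ts_inference_requests_total{model_name="t5-small",model_version="1.0"} 2.0
"""


def parse_metric_samples(exposition_text, metric_name):
    """Return the sample values recorded for metric_name."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


queue_latency = parse_metric_samples(SAMPLE, "ts_queue_latency_microseconds")
```

The same function works on text fetched from a live endpoint, as long as the lines follow the `name{labels} value` shape assumed here.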
You can then view the metrics by using the TorchServe dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the TorchServe observability guidance in the Cloud Monitoring documentation.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
Delete individual resources
Delete the Kubernetes resources:
    kubectl delete -f kubernetes/loadgenerator.yaml
    kubectl delete -f kubernetes/hpa.yaml
    kubectl delete -f kubernetes/pod-monitoring.yaml
    kubectl delete -f kubernetes/application.yaml
    kubectl delete -f kubernetes/serving-gpu.yaml
    kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
Delete the GKE cluster:
    gcloud container clusters delete "ml-cluster" \
        --location="us-central1" --quiet
Delete the IAM service account and IAM policy bindings:
    gcloud projects remove-iam-policy-binding PROJECT_ID \
        --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
        --role roles/monitoring.viewer
    gcloud iam service-accounts remove-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
    gcloud iam service-accounts delete monitoring-viewer
Delete the images in Artifact Registry. Optionally, delete the entire repository. For instructions, see the Artifact Registry documentation about Deleting images.
Component overview
This section describes the components used in this tutorial, such as the model, the web application, the framework, and the cluster.
About the T5 model
This tutorial uses a pre-trained multilingual T5 model. T5 is a text-to-text transformer that converts text from one language to another. In T5, inputs and outputs are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The T5 model can also be used for tasks such as summarization, Q&A, or text classification. The model is trained on a large quantity of text from Colossal Clean Crawled Corpus (C4) and Wiki-DPR.
For more information, see the T5 model documentation.
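Because T5 treats every task as text in and text out, a translation request is usually expressed as a task prefix in front of the input, such as "translate English to German:". The following sketch shows how the `from` and `to` language codes in a request could be turned into such a prompt. Whether the tutorial's `handler.py` builds its prompt exactly this way is an assumption, and the code-to-name mapping and function name are hypothetical.

```python
# Hypothetical mapping from request language codes to names used in T5 prefixes.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "ro": "Romanian",
}


def build_prompt(text, source, target):
    """Prefix the input text with a T5-style translation instruction."""
    try:
        source_name = LANGUAGE_NAMES[source]
        target_name = LANGUAGE_NAMES[target]
    except KeyError as err:
        raise ValueError(f"unsupported language code: {err.args[0]}") from None
    return f"translate {source_name} to {target_name}: {text}"
```

For the test request used earlier, `build_prompt("this is a test sentence", "en", "fr")` produces a single string that the model receives as its entire input.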
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu presented the T5 model in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, published in the Journal of Machine Learning Research.
The T5 model supports various model sizes, with different levels of complexity that suit specific use cases. This tutorial uses the default size, t5-small, but you can also choose a different size. The following T5 sizes are distributed under the Apache 2.0 license:
- t5-small: 60 million parameters
- t5-base: 220 million parameters
- t5-large: 770 million parameters. 3GB download.
- t5-3b: 3 billion parameters. 11GB download.
- t5-11b: 11 billion parameters. 45GB download.
For other available T5 models, see the Hugging Face repository.
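When you choose a size, a rough estimate of the weight footprint helps you judge build time and memory needs. The following sketch multiplies the parameter counts listed above by the bytes per parameter. It assumes 32-bit floating-point weights (4 bytes each) and counts weights only, so it ignores activations, tokenizer files, and runtime overhead; the results are approximations and might differ from the listed download sizes.

```python
# Parameter counts as listed in this section (rounded values).
PARAMETER_COUNTS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
    "t5-3b": 3_000_000_000,
    "t5-11b": 11_000_000_000,
}


def weights_size_gb(model, bytes_per_parameter=4):
    """Estimate the size of the model weights in decimal gigabytes."""
    return PARAMETER_COUNTS[model] * bytes_per_parameter / 10**9


estimates = {name: round(weights_size_gb(name), 2) for name in PARAMETER_COUNTS}
```

Under these assumptions, `t5-large` comes to about 3 GB, in line with the download size listed for it, and halving `bytes_per_parameter` models 16-bit weights.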
About TorchServe
TorchServe is a flexible tool for serving PyTorch models. It provides out of the box support for all major deep learning frameworks, including PyTorch, TensorFlow, and ONNX. TorchServe can be used to deploy models in production, or for rapid prototyping and experimentation.
What's next
- Serve an LLM with multiple GPUs.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.