Configure autoscaling for LLM workloads on GPUs with Google Kubernetes Engine (GKE)
This page shows how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma large language model (LLM) with the Text Generation Inference (TGI) serving framework from Hugging Face.
To learn more about selecting metrics for autoscaling, see Best practices for autoscaling LLM workloads with GPUs on GKE.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document. Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set. Example commands for this setup are shown after this list.
- Familiarize yourself with the workflow in Serve Gemma open models using GPUs on GKE with Hugging Face TGI.
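For reference, a minimal sketch of these setup steps with the gcloud CLI might look like the following. The project ID and region values are placeholders; replace them with your own.

```sh
# Enable the Google Kubernetes Engine API for your project.
gcloud services enable container.googleapis.com --project=PROJECT_ID

# Set a default compute region so that gcloud doesn't prompt for a location.
gcloud config set compute/region us-central1

# Make sure that the gcloud components are up to date.
gcloud components update
```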
Autoscale using server metrics
You can use the workload-specific performance metrics that are emitted by the TGI inference server to direct autoscaling for your Pods. To learn more about these metrics, see Server metrics.
To set up custom-metric autoscaling with server metrics, follow these steps:
Export the metrics from the TGI server to Cloud Monitoring. You use Google Cloud Managed Service for Prometheus, which simplifies deploying and configuring your Prometheus collector. Google Cloud Managed Service for Prometheus is enabled by default in your GKE cluster; you can also enable it manually.
The following example manifest shows how to set up your PodMonitoring resource definition to direct Google Cloud Managed Service for Prometheus to scrape metrics from your Pods at recurring 15-second intervals:
```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gemma-pod-monitoring
spec:
  selector:
    matchLabels:
      app: gemma-server
  endpoints:
  - port: 8000
    interval: 15s
```
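To apply this configuration, save the manifest to a file and apply it with kubectl; the file name here is only an example:

```sh
kubectl apply -f gemma-pod-monitoring.yaml

# Confirm that the PodMonitoring resource was created.
kubectl get podmonitoring gemma-pod-monitoring
```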
Install the Custom Metrics Stackdriver Adapter. This adapter makes the custom metric that you exported to Monitoring visible to the HPA controller. For more details, see Horizontal pod autoscaling in the Google Cloud Managed Service for Prometheus documentation.

The following example command shows how to install the adapter:
```sh
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
```
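Before you create the HPA resource, you can confirm that the adapter is running. The namespace and Deployment name below reflect what the adapter manifest typically creates; verify them in your cluster if the command returns nothing:

```sh
kubectl get deployment custom-metrics-stackdriver-adapter -n custom-metrics
```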
Set up the custom metric-based HPA resource. Deploy an HPA resource that is based on your preferred custom metric. For more details, see Horizontal pod autoscaling in the Google Cloud Managed Service for Prometheus documentation.

Select one of these tabs to see examples of how to configure the HorizontalPodAutoscaler resource in your manifest:
Queue size
This example uses the tgi_queue_size TGI server metric, which represents the number of requests in the queue. To determine the right queue size threshold for HPA, see Best practices for autoscaling LLM inference workloads with GPUs.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|tgi_queue_size|gauge
      target:
        type: AverageValue
        averageValue: $HPA_AVERAGEVALUE_TARGET
```

Batch size
This example uses the tgi_batch_current_size TGI server metric, which represents the number of requests in the current batch. To determine the right batch size threshold for HPA, see Best practices for autoscaling LLM inference workloads with GPUs.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|tgi_batch_current_size|gauge
      target:
        type: AverageValue
        averageValue: $HPA_AVERAGEVALUE_TARGET
```
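In both manifests, $HPA_AVERAGEVALUE_TARGET is a placeholder for the threshold that you choose. One way to substitute the value and apply the manifest is sketched below; the file name, target value, and use of envsubst are examples, not requirements:

```sh
# Substitute the target value and apply the HPA manifest
# (hpa-tgi-queue-size.yaml is an example file name).
export HPA_AVERAGEVALUE_TARGET=10
envsubst < hpa-tgi-queue-size.yaml | kubectl apply -f -

# Watch the observed metric value and replica count.
kubectl get hpa gemma-server --watch
```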
Autoscale using GPU metrics
You can use the usage and performance metrics emitted by the GPU to direct autoscaling for your Pods. To learn more about these metrics, see GPU metrics.
To set up custom-metric autoscaling with GPU metrics, follow these steps:
Export the GPU metrics to Cloud Monitoring. If your GKE cluster has system metrics enabled, it automatically sends the GPU utilization metric to Cloud Monitoring through the container/accelerator/duty_cycle system metric every 60 seconds.
- To learn how to enable GKE system metrics, see Configure metrics collection. A sketch of the gcloud command appears after this list.
- To set up managed collection, see Get started with managed collection in the Google Cloud Managed Service for Prometheus documentation.
- For additional techniques to monitor your GPU workload performance in GKE, see Run GPUs in GKE Standard node pools.
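If system metrics are not already enabled on your cluster, one way to turn them on is with the gcloud CLI, as in the following sketch. CLUSTER_NAME and LOCATION are placeholders, and the available --monitoring components can vary by GKE and gcloud version:

```sh
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --monitoring=SYSTEM
```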
The following example manifest shows how to set up your PodMonitoring resource definition to ingest metrics from the NVIDIA DCGM workload:
```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: nvidia-dcgm-exporter-for-hpa
  namespace: gke-managed-system
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    metricRelabeling:
    - action: keep
      sourceLabels: [__name__]
    - action: replace
      sourceLabels: [__name__]
      targetLabel: __name__
      regex: DCGM_FI_DEV_GPU_UTIL
      replacement: dcgm_fi_dev_gpu_util
```

In the manifest, make sure to change the DCGM metric name that you use in the HPA to lowercase, because there's a known issue where HPA doesn't work with uppercase external metric names. For clusters that don't use the managed DCGM exporter, make sure that the PodMonitoring resource's metadata.namespace and spec.selector.matchLabels fields exactly match your DCGM exporter's configuration. This alignment is required for the HPA to discover and query the custom metric.
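Before you apply the PodMonitoring resource, you can check that the DCGM exporter Pods exist and carry the label that the selector expects. The namespace and label below match the manifest above; adjust them if your exporter is deployed differently:

```sh
kubectl get pods -n gke-managed-system \
    -l app.kubernetes.io/name=gke-managed-dcgm-exporter --show-labels
```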
Install the Custom Metrics Stackdriver Adapter. This adapter makes the custom metric that you exported to Monitoring visible to the HPA controller. For more details, see Horizontal pod autoscaling in the Google Cloud Managed Service for Prometheus documentation.

The following example command shows how to install the adapter:
```sh
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
```

Set up the custom metric-based HPA resource. Deploy an HPA resource that is based on your preferred custom metric. For more details, see Horizontal pod autoscaling in the Google Cloud Managed Service for Prometheus documentation.
- Identify an average value target for HPA to trigger autoscaling. You can do this experimentally; for example, generate increasing load on your server and observe where your GPU utilization peaks. Be mindful of the HPA tolerance, which defaults to a 0.1 no-action range around the target value to dampen oscillation.
- We recommend using the locust-load-inference tool for testing. You can also create a Cloud Monitoring custom dashboard to visualize the metric behavior.
Select one of these tabs to see an example of how to configure the HorizontalPodAutoscaler resource in your manifest:
Duty cycle (GKE system)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: kubernetes.io|container|accelerator|duty_cycle
        selector:
          matchLabels:
            resource.labels.container_name: inference-server
            resource.labels.namespace_name: default
      target:
        type: AverageValue
        averageValue: $HPA_AVERAGEVALUE_TARGET
```

Duty cycle (DCGM)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: prometheus.googleapis.com|dcgm_fi_dev_gpu_util|unknown
        selector:
          matchLabels:
            metric.labels.exported_container: inference-server
            metric.labels.exported_namespace: default
      target:
        type: AverageValue
        averageValue: $HPA_AVERAGEVALUE_TARGET
```
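As with the server-metric examples, $HPA_AVERAGEVALUE_TARGET is a placeholder for the duty cycle target that you identified earlier. A minimal sketch for substituting the value, applying the manifest, and confirming that the HPA can read the metric follows; the file name and target value are examples only:

```sh
# Substitute the target and apply the HPA manifest
# (hpa-gpu-duty-cycle.yaml is an example file name).
export HPA_AVERAGEVALUE_TARGET=70
envsubst < hpa-gpu-duty-cycle.yaml | kubectl apply -f -

# The current metric value should appear instead of <unknown>
# once the adapter is serving the metric.
kubectl describe hpa gemma-hpa
```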
What's next
- Learn how to optimize Pod autoscaling based on metrics from Cloud Monitoring.
- Learn more about Horizontal Pod Autoscaling from the open source Kubernetes documentation.