Optimize GKE AI/ML workload prioritization
This document describes tools and best practices for maximizing resource utilization and minimizing downtime of heterogeneous AI/ML workloads in Google Kubernetes Engine (GKE), especially when there's no capacity in reservations or through on-demand resources.
Heterogeneous workloads are different types of AI/ML workloads that run simultaneously in the same GKE cluster. For example, you might run a latency-sensitive online inference service alongside a series of interruptible batch training jobs.
This guide provides recommendations for Platform admins and operators and Data and AI specialists.
Benefits of AI/ML workload prioritization
Heterogeneous workloads have different priorities and share limited capacity and resources. The best practices on this page describe how to configure GKE and open source tools to help you get the following benefits:
- Minimize downtime for high-priority workloads.
- Quickly execute high-priority workloads.
- Optimize resource consumption.
Background
GKE supports the following open source tools for optimizing resource utilization.
Kueue: a Kubernetes-native workload queueing system designed for batch, AI, and high performance computing workloads. Kueue can be extended to manage other workload types, such as those defined by Custom Resource Definitions like `leaderworkerset`. Kueue manages quotas and how workloads consume them in a Kubernetes cluster. Kueue makes decisions about when a workload waits, when a workload starts (for example, by creating the Pod), and when a Pod belonging to a workload gets preempted. For more information about Kueue, see the Kueue concepts documentation.
Hotswap: a technique that reduces mean time to recovery (MTTR). Hotswap enables preemption based on workload priority when cluster resources are fully utilized and no additional capacity is available, either from on-demand instances or existing reservations.
- When a node that hosts a workload becomes unhealthy, the workload is rescheduled on eligible spare nodes. If no spare nodes are available, Hotswap can preempt a lower-priority workload to make room for the workload being recovered.
- If you configure your Pods with `PriorityClass`, the workload configured with higher priority evicts a running low-priority workload to acquire its resources. This eviction process is known as preemption.
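The preemption rule described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not the actual Kubernetes scheduler or Hotswap logic: the `Workload` type, the `pick_victim` helper, and the workload names are hypothetical, and the model ignores node placement, Pod disruption budgets, and graceful termination.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    name: str
    priority: int  # higher value means more important, as with PriorityClass


def pick_victim(running: list[Workload], incoming: Workload) -> Optional[Workload]:
    """Return the lowest-priority running workload if it ranks below `incoming`."""
    if not running:
        return None
    candidate = min(running, key=lambda w: w.priority)
    # Only a strictly lower-priority workload is eligible for eviction.
    return candidate if candidate.priority < incoming.priority else None


# Hypothetical cluster state: no free capacity, two running workloads.
running = [
    Workload("batch-training", 1_000_000),
    Workload("online-inference", 2_000_000),
]
recovering = Workload("critical-training", 2_000_000)

victim = pick_victim(running, recovering)
print(victim.name if victim else "no eligible victim")
```

In this model, a workload is never evicted by another workload of equal priority, which is why the check uses a strict comparison.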
Use cases
Use the following table to understand the best practices for each use case:
| Use case | Best practice | Description |
|---|---|---|
| Multiple workloads with different priorities | Use Kueue to define queues and assign priorities to workloads based on their importance. Kueue can manage quota so that certain teams or projects have access to a set amount of resources. | To test the best practice configuration, see the Kueue example in this document. |
| You have to reduce the current MTTR. | Use Hotswap to reschedule workloads on healthy resources when an interruption occurs, and preempt low-priority workloads in favor of high-priority workloads. | To test the best practice configuration, see the Hotswap example in this document. |
| Multiple AI workloads competing for limited resources | Combine Kueue and Hotswap. This combination provides a robust system that prioritizes critical workloads both during initial scheduling and during runtime. | To test the best practice configuration, see the Kueue and Hotswap example in this document. |
Examples of best practice implementations
The following examples demonstrate how to implement Kueue and Hotswap, and how to combine them for the best practices described in the preceding section.
Kueue
The following example manifest shows a Kueue configuration:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  resourceGroups:
  - flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Never
    reclaimOutOfCohort:
      enable: true
      reclaimMoreThanNominalQuota: false
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq
```
This manifest does the following:
- Defines a `ResourceFlavor` named `tpu-v6e-slice` that specifies the node labels for TPU v6e slices.
- Defines a `ClusterQueue` named `tpu-training-cq` that manages the quota for TPU resources.
- Defines a `LocalQueue` named `default-queue` that allows workloads in the `default` namespace to use the `tpu-training-cq` cluster queue.
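To make the quota behavior more concrete, the following Python sketch models a single cluster queue with a nominal quota of 32 TPU chips and a best-effort FIFO pass over pending workloads. It is a toy model under stated assumptions, not Kueue's implementation: the `ClusterQueueModel` class, the job names, and the request sizes are hypothetical, and the model uses arrival order only, ignoring priorities, cohorts, borrowing, and preemption.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterQueueModel:
    nominal_quota: int  # for example, 32 google.com/tpu chips as in the manifest
    used: int = 0
    admitted: list[str] = field(default_factory=list)
    pending: list[tuple[str, int]] = field(default_factory=list)

    def submit(self, name: str, request: int) -> None:
        """Queue a workload together with the number of chips it requests."""
        self.pending.append((name, request))

    def admit_best_effort_fifo(self) -> None:
        """Walk the queue in arrival order; admit what fits, skip what does not."""
        still_pending = []
        for name, request in self.pending:
            if self.used + request <= self.nominal_quota:
                self.used += request
                self.admitted.append(name)
            else:
                still_pending.append((name, request))
        self.pending = still_pending


cq = ClusterQueueModel(nominal_quota=32)
cq.submit("job-a", 16)
cq.submit("job-b", 32)  # does not fit while job-a holds quota, so it waits
cq.submit("job-c", 16)  # fits, so it is admitted ahead of job-b
cq.admit_best_effort_fifo()
print(cq.admitted, [name for name, _ in cq.pending])
```

The point of the best-effort variant is that a workload that cannot fit does not block later workloads that can; a strict FIFO model would stop at `job-b` instead.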
Hotswap
The following example shows a Hotswap configuration that defines two Priority Classes, `low-priority-job` and `high-priority-job`. This Hotswap configuration creates a high-priority JobSet workload and uses MaxText.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Based on this configuration, Hotswap performs the following actions:
- If an infrastructure failure interrupts the high-priority workload, the JobSet restarts it. Hotswap preempts the low-priority workload to reschedule the high-priority workload before the infrastructure recovers. The low-priority workload remains in a failed status. This process significantly reduces workload idle time.
- When the infrastructure recovers, Hotswap reschedules the low-priority workload in the node pool that recovered.
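The recovery sequence above can be illustrated with a small Python model. This is a sketch under stated assumptions rather than how GKE or JobSet implement Hotswap: the `recover` function, the pool names, and the workload names are hypothetical, and each node pool is assumed to host at most one workload.

```python
from typing import Optional


def recover(placement: dict[str, Optional[str]],
            priority: dict[str, int],
            failed_pool: str) -> dict[str, Optional[str]]:
    """Return a new placement after `failed_pool` becomes unhealthy."""
    result = dict(placement)
    displaced = result.pop(failed_pool)  # the failed pool is no longer schedulable
    if displaced is None or not result:
        return result
    # Prefer an idle (spare) pool if one exists.
    for pool, workload in result.items():
        if workload is None:
            result[pool] = displaced
            return result
    # Otherwise preempt the lowest-priority workload, if it ranks strictly lower.
    pool, victim = min(result.items(), key=lambda item: priority[item[1]])
    if priority[victim] < priority[displaced]:
        result[pool] = displaced
    return result


priority = {"high-priority-workload": 2_000_000, "low-priority-workload": 1_000_000}

# No spare pool: the low-priority workload is preempted.
no_spare = {"pool-a": "high-priority-workload", "pool-b": "low-priority-workload"}
print(recover(no_spare, priority, "pool-a"))

# A spare pool exists: nothing needs to be preempted.
with_spare = {**no_spare, "pool-c": None}
print(recover(with_spare, priority, "pool-a"))
```

The model stops at the moment of recovery. Rescheduling the preempted low-priority workload once the failed pool is healthy again would be a second step that re-adds the pool and places the waiting workload on it.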
Kueue and Hotswap
Combine Kueue and Hotswap when you operate in a complex environment with limited resources. This combination provides a robust system that prioritizes critical workloads during initial scheduling and during runtime.
The following example shows a combined Kueue and Hotswap configuration. This example uses MaxText:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  resourceGroups:
  - flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Never
    reclaimOutOfCohort:
      enable: true
      reclaimMoreThanNominalQuota: false
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: low-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=low-priority-run
              - steps=30000
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: high-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=high-priority-run
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```
Based on this configuration, Kueue is combined with Hotswap and performs the following actions:
- Kueue manages the admission of both `low-jax-trillium` and `high-jax-trillium` JobSets into the cluster queue based on their defined priorities and available resources.
- If the `high-jax-trillium` JobSet is interrupted by an infrastructure failure, Hotswap preempts the `low-jax-trillium` JobSet to reschedule the high-priority JobSet.
- Hotswap ensures the high-priority JobSet restarts quickly, minimizing its idle time.
- When the infrastructure recovers, Hotswap reschedules the low-priority JobSet in the recovered node pool.
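When you size the cluster queue for a configuration like this, it can help to compare each JobSet's TPU demand with the nominal quota. The following Python sketch does that arithmetic using the values from the manifest (`replicas: 2`, `parallelism: 4`, and `google.com/tpu: 4` per Pod). The `tpu_chips` helper is hypothetical, and the calculation assumes that a JobSet's total request is the product of those three fields.

```python
def tpu_chips(replicas: int, parallelism: int, chips_per_pod: int) -> int:
    """Total google.com/tpu chips requested by one JobSet under the stated assumption."""
    return replicas * parallelism * chips_per_pod


NOMINAL_QUOTA = 32  # nominalQuota for google.com/tpu in tpu-training-cq

per_jobset = tpu_chips(replicas=2, parallelism=4, chips_per_pod=4)
both_jobsets = 2 * per_jobset

print(f"one JobSet requests {per_jobset} chips")
print(f"both JobSets request {both_jobsets} chips")
print(f"both fit within nominal quota: {both_jobsets <= NOMINAL_QUOTA}")
```

Under this reading, one JobSet consumes the full nominal quota of 32 chips, so running both JobSets at the same time would need a quota of at least 64 chips. Adjust `nominalQuota` to match the capacity that you actually want the queue to admit.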
What's next
- Learn how to deploy GPU workloads in GKE.
- Learn how to deploy TPU workloads in GKE.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-17 UTC.