Optimize GKE AI/ML workload prioritization

This document describes tools and best practices for maximizing resource utilization and minimizing downtime of heterogeneous AI/ML workloads in Google Kubernetes Engine (GKE), especially when there's no capacity in reservations or through on-demand resources.

Heterogeneous workloads refer to different types of AI/ML workloads that run simultaneously in the same GKE cluster. For example, you might run a latency-sensitive online inference service alongside a series of interruptible batch training jobs.

This guide provides recommendations for Platform admins and operators, and for Data and AI specialists.

Benefits of AI/ML workload prioritization

Heterogeneous workloads have different priorities and share limited capacity and resources. The best practices in this page describe how to configure GKE and open source tools to help you get the following benefits:

  • Minimize downtime for high-priority workloads.
  • Quickly execute high-priority workloads.
  • Optimize resource consumption.

Background

GKE supports the following open source tools for optimizing resource utilization:

  • Kueue: a Kubernetes-native workload queueing system designed for batch, AI, and high performance computing workloads. Kueue can be extended to manage other workload types, such as those defined by Custom Resource Definitions like leaderworkerset. Kueue manages quotas and how workloads consume them in a Kubernetes cluster. Kueue makes decisions about when a workload waits, when a workload starts (for example, by creating the Pod), and when a Pod belonging to a workload gets preempted.

    For more information about Kueue, see the Kueue concepts documentation.

  • Hotswap: a technique that reduces mean time to recovery (MTTR). Hotswap enables preemption based on workload priority when cluster resources are fully utilized and no additional capacity is available, either from on-demand instances or existing reservations.

    • When a node that hosts a workload becomes unhealthy, the workload is rescheduled on eligible spare nodes. If no spare nodes are available, Hotswap can preempt a lower-priority workload to make room for the workload being recovered.
    • If you configure your Pods with a PriorityClass, a higher-priority workload evicts a running lower-priority workload to acquire its resources. This eviction process is known as preemption.
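
To make the PriorityClass mechanism concrete, the following minimal sketch defines a priority class and a Pod that uses it. The names inference-critical and inference-server and the resource requests are hypothetical and chosen only for illustration. The preemptionPolicy field is set explicitly to its Kubernetes default, PreemptLowerPriority, so that the preemption behavior described above is visible in the manifest.

```yaml
# Sketch only: names and resource values are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical          # hypothetical name
value: 2000000                      # higher value means higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # Kubernetes default, shown explicitly
description: "For latency-sensitive Pods that may preempt lower-priority Pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-server            # hypothetical name
spec:
  priorityClassName: inference-critical  # binds the Pod to the class above
  containers:
  - name: server
    image: <IMAGE LOCATION>
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
```

If the cluster has no free capacity when this Pod is created, the scheduler can evict Pods whose priority value is lower to make room for it.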

Use cases

Use the following summary to find the best practice for each use case:

Use case: Multiple workloads with different priorities

  Best practice: Use Kueue to define queues and assign priorities to workloads based on their importance. Kueue can manage quota so that certain teams or projects have access to a set amount of resources.

  Description: Kueue lets you apply the following configurations:

    • Prioritize high-priority Jobs by assigning a higher Kueue WorkloadPriority to them.
    • Enable Kueue's fair-share queuing so that all workloads eventually receive resources, even low-priority ones.

  To test the best practice configuration, see the Kueue example in this document.

Use case: You have to reduce the current MTTR

  Best practice: Use Hotswap to reschedule workloads on healthy resources when an interruption occurs, and preempt low-priority workloads in favor of high-priority workloads.

  Description: Hotswap lets you apply the following configurations:

    • Configure PriorityClasses to define priority levels for your workloads.
    • Assign higher PriorityClasses to critical workloads.
    • Automatically reschedule workloads on healthy nodes when interruptions occur.

  To test the best practice configuration, see the Hotswap example in this document.

Use case: Multiple AI workloads competing for limited resources

  Best practice: Combine Kueue and Hotswap. This combination provides a robust system that prioritizes critical workloads both during initial scheduling and during runtime.

  Description: Kueue and Hotswap let you apply the following configurations:

    • Use Kueue to manage the initial scheduling and admission of workloads based on priority.
    • Use Hotswap to handle workload interruptions and enable rapid recovery. Hotswap helps to reduce the time to recovery of a high-priority workload when an interruption occurs.

  To test the best practice configuration, see the Kueue and Hotswap example in this document.
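
The Kueue use case mentions assigning a higher Kueue WorkloadPriority to important Jobs. As a minimal sketch of what that can look like, the following manifest uses the upstream Kueue WorkloadPriorityClass resource and the kueue.x-k8s.io/priority-class label. The names high-workload-priority, sample-training-job, and team-a-queue are hypothetical, and the sketch assumes that a LocalQueue named team-a-queue already exists in the namespace and is backed by a ClusterQueue that covers CPU.

```yaml
# Sketch only: all names below are illustrative assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-workload-priority      # hypothetical name
value: 10000                        # higher value is ordered ahead in the queue
description: "Queue priority for critical training Jobs."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-training-job         # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue               # assumed LocalQueue
    kueue.x-k8s.io/priority-class: high-workload-priority # class defined above
spec:
  suspend: true                     # Kueue unsuspends the Job when it admits it
  parallelism: 1
  completions: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: <IMAGE LOCATION>
        resources:
          requests:
            cpu: "1"
```

A WorkloadPriorityClass affects only how Kueue orders and preempts workloads in its queues. It is separate from the Kubernetes PriorityClass that the scheduler uses for Pod preemption, so the two can be set independently.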

Examples of best practice implementations

The following examples demonstrate how to implement Kueue and Hotswap, and how to combine them for the best practices described in the preceding section.

Kueue

The following example manifest shows a Kueue configuration:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  resourceGroups:
  - flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Never
    reclaimOutOfCohort:
      enable: true
      reclaimMoreThanNominalQuota: false
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq

This manifest does the following:

  • Defines a ResourceFlavor named tpu-v6e-slice that specifies the node labels for TPU v6e slices.
  • Defines a ClusterQueue named tpu-training-cq that manages the quota for TPU resources.
  • Defines a LocalQueue named default-queue that allows workloads in the default namespace to use the tpu-training-cq cluster queue.
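
Field names in a ClusterQueue can differ between Kueue releases, so it is worth checking the manifest against the Kueue version installed in your cluster. The following is a minimal sketch, not part of the original example, of a comparable ClusterQueue that restricts itself to fields documented in the upstream Kueue v1beta1 API reference: coveredResources, namespaceSelector, and the withinClusterQueue and reclaimWithinCohort preemption settings. The name tpu-training-cq-sketch is hypothetical.

```yaml
# Sketch only: a variant for comparison, not a replacement for the example.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq-sketch      # hypothetical name
spec:
  namespaceSelector: {}             # empty selector matches every namespace
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]   # resources this group accounts for
    flavors:
    - name: tpu-v6e-slice           # ResourceFlavor from the example
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  preemption:
    # A pending workload may evict lower-priority workloads in this queue.
    withinClusterQueue: LowerPriority
    # Do not reclaim quota that other queues in the cohort have borrowed.
    reclaimWithinCohort: Never
```

With withinClusterQueue set to LowerPriority, Kueue can preempt an admitted lower-priority workload when a higher-priority workload in the same ClusterQueue does not fit within the remaining quota.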

Hotswap

The following example shows a Hotswap configuration that defines two PriorityClasses, low-priority-job and high-priority-job. This Hotswap configuration creates a high-priority JobSet workload and uses MaxText.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4

Based on this configuration, Hotswap performs the following actions:

  • If an infrastructure failure interrupts the high-priority workload, the JobSet restarts it. Hotswap preempts the low-priority workload to reschedule the high-priority workload before the infrastructure recovers. The low-priority workload remains in a failed status. This process significantly reduces workload idle time.
  • When the infrastructure recovers, Hotswap reschedules the low-priority workload in the node pool that recovered.
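
The manifest in this section defines only the high-priority JobSet, while the actions above also involve a low-priority workload. As a sketch of what that workload might look like, the following JobSet is adapted from the low-jax-trillium JobSet in the combined Kueue and Hotswap example later in this document, with the Kueue-specific annotation and labels removed. That adaptation is an assumption made for illustration, not a configuration taken from the source.

```yaml
# Sketch only: adapted from the combined example; Kueue fields removed.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            # Lower value than high-priority-job, so these Pods can be preempted.
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=low-priority-run
              - steps=30000
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
```

Because both JobSets select the same TPU node labels, they compete for the same node pools, which is what allows the high-priority JobSet to take over capacity held by this one.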

Kueue and Hotswap

Combine Kueue and Hotswap when you operate in a complex environment with limited resources. This combination provides a robust system that prioritizes critical workloads during initial scheduling and during runtime.

The following example shows a combined Kueue and Hotswap configuration. This example uses MaxText:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  resourceGroups:
  - flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Never
    reclaimOutOfCohort:
      enable: true
      reclaimMoreThanNominalQuota: false
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: low-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=low-priority-run
              - steps=30000
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: high-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=high-priority-run
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4

Based on this configuration, Kueue and Hotswap together perform the following actions:

  • Kueue manages the admission of both the low-jax-trillium and high-jax-trillium JobSets into the cluster queue based on their defined priorities and available resources.
  • If the high-jax-trillium JobSet is interrupted by an infrastructure failure, Hotswap preempts the low-jax-trillium JobSet to reschedule the high-priority JobSet.
  • Hotswap ensures the high-priority JobSet restarts quickly, minimizing its idle time.
  • When the infrastructure recovers, Hotswap reschedules the low-priority JobSet in the recovered node pool.


Last updated 2025-12-17 UTC.