Run GPUs in GKE Standard node pools
This page shows you how to run and optimize your compute-intensive workloads, such as artificial intelligence (AI) and graphics processing, by attaching and using NVIDIA® graphics processing unit (GPU) hardware accelerators in your Google Kubernetes Engine (GKE) Standard clusters' nodes. If you are using Autopilot Pods instead, refer to Deploy GPU workloads in Autopilot.
If you want to deploy clusters with NVIDIA B200 or NVIDIA H200 141GB GPUs, see the following resources instead:
- To create GKE clusters, see Create an AI-optimized Google Kubernetes Engine cluster with default configuration.
- To create Slurm clusters, see Create an AI-optimized Slurm cluster.
Overview
With GKE, you can create node pools equipped with GPUs. GPUs provide compute power to drive deep-learning tasks such as image recognition and natural language processing, as well as other compute-intensive tasks such as video transcoding and image processing. In GKE Standard mode, you can attach GPU hardware to nodes in your clusters, and then allocate GPU resources to containerized workloads running on those nodes.
To learn more about use cases for GPUs, refer to Google Cloud's GPUs page. For more information about GPUs in GKE and the differences between Standard mode and Autopilot mode, refer to About GPUs in GKE.
You can also use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs reduces the price of running GPUs. To learn more, refer to Using Spot VMs with GPU node pools.
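For example, a minimal sketch of creating a GPU node pool that uses Spot VMs; the cluster, pool, and zone names are placeholders rather than values from this page:

gcloud container node-pools create gpu-spot-pool \
  --cluster my-cluster \
  --location us-central1 \
  --node-locations us-central1-a \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --spot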
As of version 1.29.2-gke.1108000, you can create GPU node pools on GKE Sandbox. For more information, see GKE Sandbox and GKE Sandbox Configuration.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
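For example, to set a default location as described in the preceding note (the region and zone values are placeholders):

gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a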
Requirements for GPUs on GKE
GPUs on GKE have the following requirements:
GPU quota: You must have Compute Engine GPU quota in your selected zone before you can create GPU nodes. To ensure that you have enough GPU quota in your project, refer to Quotas in the Google Cloud console.
If you require additional GPU quota, you must request GPU quota in the Google Cloud console. If you have an established billing account, your project automatically receives quota after you submit the quota request.
By default, Free Trial accounts don't receive GPU quota.
NVIDIA GPU drivers: When creating a cluster or a node pool, you can tell GKE to automatically install a driver version based on your GKE version. If you don't tell GKE to automatically install GPU drivers, you must manually install the drivers.
Machine series: The GPU type you can use depends on the machine series, as follows:
- A4X machine series: GB200 GPUs.
- A4 machine series: B200 GPUs.
- A3 machine series: H200 GPUs (A3 Ultra) and H100 GPUs (A3 Mega, High, Edge).
- A2 machine series: A100 GPUs.
- G4 machine series: RTX PRO 6000 GPUs (GKE version 1.34.0-gke.1662000 or later).
- G2 machine series: L4 GPUs.
- N1 machine series: NVIDIA T4 GPUs, NVIDIA V100 GPUs, NVIDIA P100 GPUs, or NVIDIA P4 GPUs.
You should ensure that you have enough quota in your project for the machine series that corresponds to your selected GPU type and quantity.
GPUs on Ubuntu nodes: If you use GPUs with Ubuntu nodes, the following requirements apply:
Driver compatibility:
L4 GPUs and H100 GPUs: NVIDIA driver version 535 or later
H200 GPUs: NVIDIA driver version 550 or later
B200 GPUs: NVIDIA driver version 570 or later
RTX PRO 6000 GPUs: NVIDIA driver version 580 or later
If a required driver version or a later version isn't the default version in your GKE version, you must manually install a supported driver on your nodes.
Version compatibility:
When you use the A4 machine series on Ubuntu node pools, you must use a GKE version that includes the ubuntu-gke-2404-1-32-amd64-v20250730 node image or a later version of the node image. The minimum GKE versions are the following:
- 1.32.7-gke.1067000 or later for GKE version 1.32
- 1.33.3-gke.1247000 or later for GKE version 1.33
Use Container-Optimized OS for GPU nodes. Container-Optimized OS includes the required drivers to support the specific GKE version for GPU nodes.
Limitations of using GPUs on GKE
Before you use GPUs on GKE, keep in mind the following limitations:
- You can't add GPUs to existing node pools.
- GPU nodes can't be live migrated during maintenance events.
GPUs are not supported in Windows Server node pools.
In GKE Standard clusters running versions earlier than 1.34.1-gke.1279000, node auto-provisioning can't create node pools with RTX PRO 6000 GPUs. However, the cluster autoscaler can still scale existing node pools in clusters running those earlier versions.
In GKE Standard clusters running version 1.28.2-gke.1098000 or earlier, node auto-provisioning can't create node pools with L4 GPUs. However, the cluster autoscaler can still scale existing node pools in clusters running those earlier versions.
Availability of GPUs by regions and zones
GPUs are available in specific regions and zones. When you request GPU quota, consider the regions in which you intend to run your clusters.
For a complete list of applicable regions and zones, refer to GPUs on Compute Engine.
You can also see GPUs available in your zone using the Google Cloud CLI. To see a list of all GPU accelerator types supported in each zone, run the following command:

gcloud compute accelerator-types list

Pricing
For GPU pricing information, refer to the pricing table on the Google Cloud GPU page.
Ensure sufficient GPU quota
Your GPU quota is the total number of GPUs that can run in your Google Cloud project. To create clusters with GPUs, your project must have sufficient GPU quota.
Your GPU quota should be at least equivalent to the total number of GPUs you intend to run in your cluster. If you enable cluster autoscaling, you should request GPU quota at least equivalent to the number of GPUs per node multiplied by your cluster's maximum number of nodes.
For example, if you create a cluster with three nodes that runs two GPUs per node, your project requires a GPU quota of at least six.
Requesting GPU quota
To request GPU quota, use the Google Cloud console. For more information about requesting quotas, refer to GPU quotas in the Compute Engine documentation.
To search for GPU quota and submit a quota request, use the Google Cloud console:
Go to the IAM & Admin Quotas page in the Google Cloud console.
In the Filter box, do the following:
- Select the Quota property, enter the name of the GPU model, and press Enter.
- (Optional) To apply more advanced filters to narrow the results, select the Dimensions (for example, locations) property, add the name of the region or zone you are using, and press Enter.
From the list of GPU quotas, select the quota you want to change.
Click Edit Quotas. A request form opens.
Fill the New quota limit field for each quota request.
Fill the Request description field with details about your request.
Click Next.
In the Override confirmation dialog, click Confirm.
In the Contact details screen, enter your name and a phone number that the approvers might use to complete your quota change request.
Click Submit request.
You receive a confirmation email to track the quota change.
Running GPUs in GKE Standard clusters
To run GPUs in GKE Standard clusters, create a node pool with attached GPUs.
To improve cost-efficiency, reliability, and availability of GPUs on GKE, perform the following actions:
- Create separate GPU node pools. For each node pool, limit the node location to the zones where the GPUs you want are available.
- Enable autoscaling in each node pool.
- Use regional clusters to improve availability by replicating the Kubernetes control plane across zones in the region.
- Configure GKE to automatically install either the default or latest GPU drivers on the node pools so that you don't need to manually install and manage your driver versions.
As described in the following sections, GKE uses node taints and tolerations to ensure that Pods are not scheduled onto inappropriate nodes.
Taint a GPU node pool to avoid scheduling it inappropriately
A node taint lets you mark a node so that the scheduler avoids or prevents using it for certain Pods. Based on the following scenarios, GKE automatically adds taints, or you can manually add them:
When you add a GPU node pool to an existing cluster that already runs a non-GPU node pool, GKE automatically taints the GPU nodes with the following node taint:
- Key: nvidia.com/gpu
- Effect: NoSchedule
GKE only adds this taint if there is at least one non-GPU node pool in the cluster.
When you add a GPU node pool to a cluster that has only GPU node pools, or if you create a new cluster where the default node pool has GPUs attached, you can manually set taints on the new node pool with the following values:
- Key: nvidia.com/gpu
- Effect: NoSchedule
If you add a non-GPU node pool to the cluster in the future, GKE does not retroactively apply this taint to the existing GPU nodes. If you want this behavior, you must manually add the taint to the existing GPU node pools.
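For example, a minimal sketch of manually applying this taint when you create a GPU node pool; the pool, cluster, and location names are placeholders:

gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --location us-central1 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --node-taints nvidia.com/gpu=present:NoSchedule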
Automatically restricting scheduling with a toleration
Tolerations let you designate Pods that can be used on "tainted" nodes. GKE automatically applies a toleration so only Pods requesting GPUs are scheduled on GPU nodes. This enables more efficient autoscaling as your GPU nodes can quickly scale down if there are not enough Pods requesting GPUs. To do this, GKE runs the ExtendedResourceToleration admission controller.
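The effect is equivalent to GPU-requesting Pods carrying a toleration like the following sketch; this is shown only for illustration, and you don't need to add it yourself:

tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"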
Create a GPU node pool
To create a separate GPU node pool in an existing cluster, you can use the Google Cloud console or the Google Cloud CLI. You can also use Terraform for provisioning your GKE clusters and GPU node pool.
GKE supports automatic installation of NVIDIA drivers in the following scenarios:
- For GKE clusters with control plane version 1.32.2-gke.1297000 and later, GKE automatically installs the default NVIDIA driver version for all GPU nodes, including those created with node auto-provisioning.
- For GKE clusters with control plane version 1.30.1-gke.1156000 to 1.32.2-gke.1297000, GKE automatically installs the default NVIDIA driver version for nodes not created with node auto-provisioning.
- You can optionally choose the latest available driver version or explicitly disable automatic driver installation. In versions earlier than 1.30.1-gke.1156000, GKE doesn't install a driver by default if you don't specify a driver version when you create or update the node pool.
gcloud
To create a node pool with GPUs in a cluster, run the following command:
gcloud container node-pools create POOL_NAME \
  --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
  --machine-type MACHINE_TYPE \
  --cluster CLUSTER_NAME \
  --location CONTROL_PLANE_LOCATION \
  --node-locations COMPUTE_ZONE1[,COMPUTE_ZONE2] \
  [--sandbox=type=gvisor] \
  [--enable-autoscaling \
  --min-nodes MIN_NODES \
  --max-nodes MAX_NODES] \
  [--scopes=SCOPES] \
  [--service-account=SERVICE_ACCOUNT] \
  [--reservation-affinity=specific --reservation=RESERVATION_NAME]

Replace the following:
POOL_NAME: the name you choose for the node pool.
GPU_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
AMOUNT: the number of GPUs to attach to nodes in the node pool.
DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
- default: install the default driver version for your node GKE version. In GKE version 1.30.1-gke.1156000 and later, if you omit the gpu-driver-version flag, this is the default option. In earlier versions, GKE doesn't install a driver if you omit this flag.
- latest: install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
- disabled: skip automatic driver installation. You must manually install a driver after you create the node pool. In GKE versions earlier than 1.30.1-gke.1156000, this is the default option.
The gpu-driver-version option is only available for GKE version 1.27.2-gke.1200 and later. In earlier versions, omit this flag and manually install a driver after you create the node pool. If you upgrade an existing cluster or node pool to this version or later, GKE automatically installs the default driver version that corresponds to the GKE version, unless you specify differently when you start the upgrade.
Note: To create a node pool with Ubuntu nodes and NVIDIA L4 GPUs or NVIDIA H100 GPUs and automatically install the default NVIDIA driver version, you must use a minimum GKE patch version or later. For earlier versions, you must specify gpu-driver-version=disabled and manually install the NVIDIA driver.
MACHINE_TYPE: the Compute Engine machine type for the nodes. Required for the following GPU types:
- NVIDIA B200 GPUs (corresponding to the nvidia-b200 accelerator type and the A4 machine series)
- NVIDIA H200 141 GB GPUs (corresponding to the nvidia-h200-141gb accelerator type and the A3 Ultra machine type), NVIDIA H100 80 GB GPUs (corresponding to the nvidia-h100-80gb accelerator type and the A3 High machine type), or NVIDIA H100 80GB Mega GPUs (corresponding to the nvidia-h100-mega-80gb accelerator type and the A3 Mega machine type). For more information, see the A3 machine series in the Compute Engine documentation.
- NVIDIA A100 40 GB GPUs (corresponding to the nvidia-tesla-a100 accelerator type and the A2 Standard machine type), or NVIDIA A100 80GB GPUs (corresponding to the nvidia-a100-80gb accelerator type and the A2 Ultra machine type). For more information, see the A2 machine series in the Compute Engine documentation.
- NVIDIA L4 GPUs (corresponding to the nvidia-l4 accelerator type and the G2 machine series).
- NVIDIA RTX PRO 6000 GPUs (corresponding to the nvidia-rtx-pro-6000 accelerator type and the G4 machine series).
For all other GPUs, this flag is optional.
CLUSTER_NAME: the name of the cluster in which to create the node pool.
CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
COMPUTE_ZONE1,COMPUTE_ZONE2,[...]: the specific zones where GKE creates the GPU nodes. The zones must be in the same region as the cluster, specified by the --location flag. The GPU types that you define must be available in each selected zone. If you use a reservation, you must specify the zones where the reservation has capacity. We recommend that you always use the --node-locations flag when creating the node pool to specify the zone or zones that contain the requested GPUs.
Optionally, you can create node pools to run sandboxed workloads with gVisor. To learn more, see GKE Sandbox for details.
MIN_NODES: the minimum number of nodes for each zone in the node pool at any time. This value is relevant only if the --enable-autoscaling flag is used.
MAX_NODES: the maximum number of nodes for each zone in the node pool at any time. This value is relevant only if the --enable-autoscaling flag is used.
Optionally, you can create the GPU node pool using a custom service account by appending the following flags. If omitted, the node pool uses the Compute Engine default service account:
SERVICE_ACCOUNT: the name of the IAM service account that your nodes use.
SCOPES: a comma-separated list of access scopes to grant. Ensure that one of the scopes is storage-ro or https://www.googleapis.com/auth/devstorage.read_only. To learn more about scopes, see Setting access scopes. If you omit the scope flag, the GPU node pool creation fails with an AccessDenied error: failed to download gpu_driver_versions.bin from GCS bucket.
Note: If you don't use custom IAM service accounts to create your GKE clusters or node pools, ensure that the default Compute Engine service account in your project has the required permissions for GKE. In organizations that enforce the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint, the default Compute Engine service account won't automatically get the required permissions for GKE. This constraint is enforced by default for organizations that were created on or after May 3, 2024. For details, see Default GKE node service account.
RESERVATION_NAME: the name of the GPU reservation to use. Specify the --reservation flag with --reservation-affinity=specific to use GPU capacity from a specific reservation. For more information, see Consuming a specific single-project reservation.
For example, the following command creates a highly available autoscaling node pool, p100, with two P100 GPUs for each node, in the regional cluster p100-cluster. GKE automatically installs the default drivers on those nodes.

gcloud container node-pools create p100 \
  --accelerator type=nvidia-tesla-p100,count=2,gpu-driver-version=default \
  --cluster p100-cluster \
  --location us-central1 \
  --node-locations us-central1-c \
  --min-nodes 0 --max-nodes 5 --enable-autoscaling

You can also update an existing node pool. For example, you might want to update the GPU driver to switch to the latest available driver:

gcloud container node-pools update p100 \
  --accelerator type=nvidia-tesla-p100,count=2,gpu-driver-version=latest \
  --cluster p100-cluster \
  --location us-central1

Console
To create a node pool with GPUs:
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click Add Node Pool.
Optionally, on the Node pool details page, select the Enable autoscaling checkbox.
Configure your node pool as you want.
From the navigation pane, selectNodes.
Under Machine configuration, click GPU.
Select a GPU type and Number of GPUs to run on each node.
Read the warning and select I understand the limitations.
In the GPU Driver installation section, select one of the following methods:
- Google-managed: GKE automatically installs a driver. If you select this option, choose one of the following from the Version drop-down:
- Default: Install the default driver version.
- Latest: Install the latest available driver version.
- Customer-managed: GKE doesn't install a driver. You must manually install a compatible driver using the instructions in Installing NVIDIA GPU device drivers.
Click Create.
Note: If you don't use custom IAM service accounts to create your GKE clusters or node pools, ensure that the default Compute Engine service account in your project has the required permissions for GKE. In organizations that enforce the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint, the default Compute Engine service account won't automatically get the required permissions for GKE. This constraint is enforced by default for organizations that were created on or after May 3, 2024. For details, see Default GKE node service account.
Terraform
You can create a regional cluster with GPUs by using a Terraform module.
Set the Terraform variables by including the following block in the variables.tf file:

variable "project_id" {
  default     = PROJECT_ID
  description = "the Google Cloud project where GKE creates the cluster"
}

variable "region" {
  default     = CLUSTER_REGION
  description = "the Google Cloud region where GKE creates the cluster"
}

variable "zone" {
  default     = "COMPUTE_ZONE"
  description = "the GPU nodes zone"
}

variable "cluster_name" {
  default     = "CLUSTER_NAME"
  description = "the name of the cluster"
}

variable "gpu_type" {
  default     = "GPU_TYPE"
  description = "the GPU accelerator type"
}

variable "gpu_driver_version" {
  default     = "DRIVER_VERSION"
  description = "the NVIDIA driver version to install"
}

variable "machine_type" {
  default     = "MACHINE_TYPE"
  description = "The Compute Engine machine type for the VM"
}

Replace the following:
PROJECT_ID: your project ID.
CLUSTER_NAME: the name of the GKE cluster.
CLUSTER_REGION: the compute region for the cluster.
COMPUTE_ZONE: the specific zone where GKE creates the GPU nodes. The zone must be in the same region specified by the region variable. These zones must have the GPU types you defined available. For more information, see Availability of GPUs by regions and zones.
GPU_TYPE: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
DRIVER_VERSION: the GPU driver version for GKE to automatically install. This field is optional. The following values are supported:
- INSTALLATION_DISABLED: disable automatic GPU driver installation. You must manually install drivers to run your GPUs. In GKE versions earlier than 1.30.1-gke.1156000, this is the default option if you omit this field.
- DEFAULT: automatically install the default driver version for your node operating system version. In GKE version 1.30.1-gke.1156000 and later, if you omit this field, this is the default option. In earlier versions, GKE doesn't install a driver if you omit this field.
- LATEST: automatically install the latest available driver version for your node OS version. Available only for nodes that use Container-Optimized OS.
If you omit this field, GKE doesn't automatically install a driver. This field isn't supported in node pools that use node auto-provisioning. To manually install a driver, see Manually install NVIDIA GPU drivers in this document.
MACHINE_TYPE: the Compute Engine machine type for the nodes. Required for the following GPU types:
- NVIDIA B200 GPUs (corresponding to the nvidia-b200 accelerator type and the A4 machine series)
- NVIDIA H200 141 GB GPUs (corresponding to the nvidia-h200-141gb accelerator type and the A3 Ultra machine type), NVIDIA H100 80 GB GPUs (corresponding to the nvidia-h100-80gb accelerator type and the A3 High machine type), or NVIDIA H100 80GB Mega GPUs (corresponding to the nvidia-h100-mega-80gb accelerator type and the A3 Mega machine type). For more information, see the A3 machine series in the Compute Engine documentation.
- NVIDIA A100 40 GB GPUs (corresponding to the nvidia-tesla-a100 accelerator type and the A2 Standard machine type), or NVIDIA A100 80GB GPUs (corresponding to the nvidia-a100-80gb accelerator type and the A2 Ultra machine type). For more information, see the A2 machine series in the Compute Engine documentation.
- NVIDIA L4 GPUs (corresponding to the nvidia-l4 accelerator type and the G2 machine series).
- NVIDIA RTX PRO 6000 GPUs (corresponding to the nvidia-rtx-pro-6000 accelerator type and the G4 machine series).
For all other GPUs, this flag is optional.
Add the following block to your Terraform configuration:
provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_container_cluster" "ml_cluster" {
  name               = var.cluster_name
  location           = var.region
  initial_node_count = 1
}

resource "google_container_node_pool" "gpu_pool" {
  name           = google_container_cluster.ml_cluster.name
  location       = var.region
  node_locations = [var.zone]
  cluster        = google_container_cluster.ml_cluster.name
  node_count     = 3

  autoscaling {
    total_min_node_count = "1"
    total_max_node_count = "5"
  }

  management {
    auto_repair  = "true"
    auto_upgrade = "true"
  }

  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
    ]

    labels = {
      env = var.project_id
    }

    guest_accelerator {
      type  = var.gpu_type
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = var.gpu_driver_version
      }
    }

    image_type   = "cos_containerd"
    machine_type = var.machine_type
    tags         = ["gke-node", "${var.project_id}-gke"]

    disk_size_gb = "30"
    disk_type    = "pd-standard"

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}
Terraform calls Google Cloud APIs to create a new cluster with a node pool that uses GPUs. The node pool initially has three nodes and autoscaling is enabled. To learn more about Terraform, see the google_container_node_pool resource spec on terraform.io.
To avoid incurring further costs, remove all the resources defined in the configuration file by using the terraform destroy command.
Best practice: You can also create a new cluster with GPUs and specify zones using the --node-locations flag. However, we recommend that you create a separate GPU node pool in an existing cluster, as shown in this section.
Manually install NVIDIA GPU drivers
You can manually install NVIDIA GPU drivers on your nodes by deploying an installation DaemonSet to those nodes. Use manual installation in the following situations:
- You chose to disable automatic device driver installation when you created a GPU node pool.
- You use a GKE version earlier than the minimum supported version for automatic installation.
- Your workload requires a specific NVIDIA driver version that isn't available as the default or the latest driver with automatic installation. For example, using GPUs with Confidential GKE Nodes.
Use automatic driver installation whenever possible. To do this, specify the gpu-driver-version option in the --accelerator flag when you create your Standard cluster. If you used the installation DaemonSet to manually install GPU drivers on or before January 25, 2023, you might need to re-apply the DaemonSet to get a version that ignores nodes that use automatic driver installation.
To run the installation DaemonSet, the GPU node pool requires the https://www.googleapis.com/auth/devstorage.read_only scope for communicating with Cloud Storage. Without this scope, downloading of the installation DaemonSet manifest fails. This scope is one of the default scopes, which is typically added when you create the cluster.
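To check which scopes a node pool already has, you can inspect its configuration, for example with the following sketch; the pool, cluster, and location names are placeholders, and the format expression assumes the scopes are exposed under config.oauthScopes:

gcloud container node-pools describe gpu-pool \
  --cluster my-cluster \
  --location us-central1 \
  --format="value(config.oauthScopes)"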
The following instructions show you how to install the drivers on Container-Optimized OS (COS) and Ubuntu nodes, and by using Terraform.
COS
To deploy the installation DaemonSet and install the default GPU driver version, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

To install a newer GPU driver version from the driver version table in this section, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

To install a GPU driver version that supports running GPU workloads on Confidential GKE Nodes, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/cos/daemonset-confidential.yaml
The installation takes several seconds to complete. After the installation completes, the NVIDIA GPU device plugin uses the Kubernetes API to make the NVIDIA GPU capacity available.
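To confirm that the capacity is exposed, you can list GPU nodes by the cloud.google.com/gke-accelerator label and check for the nvidia.com/gpu resource, for example:

kubectl get nodes -l cloud.google.com/gke-accelerator -o name
kubectl describe nodes | grep "nvidia.com/gpu"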
Each version of Container-Optimized OS has at least one supported NVIDIA GPU driver version. For more information about mapping the GPU driver version to the GKE version, you can do any of the following:
- Map the GKE version and Container-Optimized OS node image version to the GPU driver version.
- Use the following table, which lists the available GPU driver versions in each GKE version:
| GKE version | Available NVIDIA driver versions |
|---|---|
| 1.33 | R535 (default), R570, R575, or R580 |
| 1.32 | R535 (default), R570, R575, or R580 |
| 1.31 | R535 (default), R570, R575, or R580 |
| 1.30 | R535 (default) or R550 |
| 1.29 | R535 (default) or R550 |
| 1.28 | R535 (default) or R550 |
| 1.27 | R470 (default), R525, R535, or R550 |
| 1.26 | R470 (default), R525, R535, or R550 |
Ubuntu
The installation DaemonSet that you deploy depends on the GPU type and on the GKE node version as follows:
For all GPUs except NVIDIA H200 GPUs, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

For NVIDIA H200 GPUs, install the R550 driver:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R550.yaml
The installation takes several seconds to complete. Once installed, the NVIDIA GPU device plugin uses the Kubernetes API to make the NVIDIA GPU capacity available.
The following table lists the available driver versions in each GKE version:
| GKE version | Available NVIDIA driver versions (Ubuntu) |
|---|---|
| 1.33 | R535 (default) |
| 1.32 | R535 (default) |
| 1.31 | R535 (default) |
| 1.30 | R470 or R535 |
| 1.29 | R470 or R535 |
| 1.28 | R470 or R535 |
| 1.27 | R470 or R535 |
| 1.26 | R470 or R535 |
Terraform
You can use Terraform to install the default GPU driver version based on the type of nodes. In both cases, you must configure the kubectl_manifest Terraform resource type.
To install the DaemonSet on COS, add the following block in your Terraform configuration:

data "http" "nvidia_driver_installer_manifest" {
  url = "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml"
}

resource "kubectl_manifest" "nvidia_driver_installer" {
  yaml_body = data.http.nvidia_driver_installer_manifest.body
}

To install the DaemonSet on Ubuntu, add the following block in your Terraform configuration:

data "http" "nvidia_driver_installer_manifest" {
  url = "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml"
}

resource "kubectl_manifest" "nvidia_driver_installer" {
  yaml_body = data.http.nvidia_driver_installer_manifest.body
}
Map the GKE version and Container-Optimized OS node image version to the GPU driver version
To find the GPU driver versions that are mapped with GKE versions and Container-Optimized OS node image versions, do the following steps:
- Map Container-Optimized OS node image versions to GKE patch versions for the specific GKE version where you want to find the GPU driver version. For example, 1.33.0-gke.1552000 uses cos-121-18867-90-4.
- Choose the milestone of the Container-Optimized OS node image version in the Container-Optimized OS release notes. For example, choose Milestone 121 for cos-121-18867-90-4.
- In the release notes page for the specific milestone, find the release note corresponding with the specific Container-Optimized OS node image version. For example, in Container-Optimized OS Release Notes: Milestone 121, see cos-121-18867-90-4. In the table, in the GPU Drivers column, click See List to see the GPU driver version information.
Installing drivers using node auto-provisioning with GPUs
When you use node auto-provisioning with GPUs, by default the auto-provisioned node pools don't have sufficient scopes to install the drivers. To grant the required scopes, modify the default scopes for node auto-provisioning to add logging.write, monitoring, devstorage.read_only, and compute, such as in the following example.

gcloud container clusters update CLUSTER_NAME --enable-autoprovisioning \
  --min-cpu=1 --max-cpu=10 --min-memory=1 --max-memory=32 \
  --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/compute

For clusters running GKE version 1.32.2-gke.1297000 and later, GKE automatically installs the default NVIDIA driver version for all GPU nodes, including those created with node auto-provisioning. You can skip the following instructions for clusters running GKE version 1.32.2-gke.1297000 and later.
In GKE version 1.29.2-gke.1108000 and later, you can select a GPU driver version for GKE to automatically install in auto-provisioned GPU nodes. Add the following field to your manifest:

spec:
  nodeSelector:
    cloud.google.com/gke-gpu-driver-version: "DRIVER_VERSION"

Replace DRIVER_VERSION with one of the following values:
- default: the default, stable driver for your node GKE version.
- latest: the latest available driver version for your node GKE version.
- disabled: disable automatic GPU driver installation. With this value selected, you must manually install drivers to run your GPUs. In GKE versions earlier than 1.32.2-gke.1297000, this is the default option if you omit the node selector.
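For example, a minimal Pod sketch that combines this node selector with a GPU request; the Pod and container names, accelerator type, and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-driver-version: "latest"
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1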
To learn more about auto-provisioning, see Using node auto-provisioning.
Configuring Pods to consume GPUs
You use a resource limit to configure Pods to consume GPUs. You specify a resource limit in a Pod specification using the following key-value pair:
- Key: nvidia.com/gpu
- Value: the number of GPUs to consume
alpha.kubernetes.io/nvidia-gpu is not supported as a resource name in GKE. Use nvidia.com/gpu as the resource name instead.
The following manifest is an example of a Pod specification that consumes GPUs:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  # Optional: Use GKE Sandbox
  # runtimeClassName: gvisor
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 2

Consuming multiple GPU types
If you want to use multiple GPU accelerator types per cluster, you must create multiple node pools, each with its own accelerator type. GKE attaches a unique node selector to GPU nodes to help place GPU workloads on nodes with specific GPU types:
- Key: cloud.google.com/gke-accelerator
- Value: the type of GPU accelerator that you use. For example, nvidia-tesla-t4.
You can target particular GPU types by adding this node selector to your workload Pod specification. For example:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 2
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4

Upgrade node pools using accelerators (GPUs and TPUs)
GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels, maintenance windows and exclusions, and rollout sequencing.
You can also configure a node upgrade strategy for your node pool, such as surge upgrades, blue-green upgrades, or short-lived upgrades. By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools, instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE.
Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources (for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs), see Upgrade in a resource-constrained environment.
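For example, a minimal sketch of configuring surge upgrades on an existing GPU node pool so that GKE creates at most one extra node and keeps every node available during the upgrade; the pool, cluster, and location names are placeholders:

gcloud container node-pools update gpu-pool \
  --cluster my-cluster \
  --location us-central1 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0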
About the NVIDIA CUDA-X libraries
CUDA is NVIDIA's parallel computing platform and programming model for GPUs. To use CUDA applications, the image that you use must have the libraries. To add the NVIDIA CUDA-X libraries, you can build and use your own image by including the following values in the LD_LIBRARY_PATH environment variable in your container specification:
- /usr/local/nvidia/lib64: the location of the NVIDIA device drivers.
- /usr/local/cuda-CUDA_VERSION/lib64: the location of the NVIDIA CUDA-X libraries on the node.
Replace CUDA_VERSION with the CUDA-X image version that you used. Some versions also contain debug utilities in /usr/local/nvidia/bin. For details, see the NVIDIA CUDA image on DockerHub.
To check the minimum GPU driver version required for your version of CUDA, see CUDA Toolkit and Compatible Driver Versions.
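For example, a container spec sketch that sets the variable; the container name and image are placeholders, and CUDA 11.0 is assumed only to show where CUDA_VERSION goes:

containers:
- name: my-gpu-container
  image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
  env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64:/usr/local/cuda-11.0/lib64
  resources:
    limits:
      nvidia.com/gpu: 1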
Monitor your GPU node workload performance
If your GKE cluster has system metrics enabled, then the following metrics are available in Cloud Monitoring to monitor your GPU workload performance:
- Duty Cycle (container/accelerator/duty_cycle): percentage of time over the past sample period (10 seconds) during which the accelerator was actively processing. Between 1 and 100.
- Memory Usage (container/accelerator/memory_used): amount of accelerator memory allocated in bytes.
- Memory Capacity (container/accelerator/memory_total): total accelerator memory in bytes.
These metrics apply at the container level (container/accelerator) and are not collected for containers scheduled on a GPU that uses GPU time-sharing or NVIDIA MPS.
You can use predefined dashboards to monitor your clusters with GPU nodes. For more information, see View observability metrics. For general information about monitoring your clusters and their resources, refer to Observability for GKE.
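For example, a Monitoring Query Language (MQL) sketch for charting average duty cycle in Metrics Explorer; it assumes the metric's fully qualified name is kubernetes.io/container/accelerator/duty_cycle:

fetch k8s_container
| metric 'kubernetes.io/container/accelerator/duty_cycle'
| group_by 1m, [value_duty_cycle_mean: mean(value.duty_cycle)]
| every 1m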
View usage metrics for workloads
You view your workload GPU usage metrics from the Workloads dashboard in the Google Cloud console.
To view your workload GPU usage, perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select a workload.
The Workloads dashboard displays charts for GPU memory usage and capacity, and GPU duty cycle.
View NVIDIA Data Center GPU Manager (DCGM) metrics
You can collect and visualize NVIDIA DCGM metrics by using Google Cloud Managed Service for Prometheus. For Autopilot clusters, GKE installs the drivers. For Standard clusters, you must install the NVIDIA drivers.
For instructions on how to deploy the GKE-managed DCGM package, see Collect and view NVIDIA Data Center GPU Manager (DCGM) metrics.
JobSet and node health metrics for GPU workloads
In addition to DCGM metrics, you can use the following metrics to monitor the health and performance of your GPU workloads, especially when running them as JobSets.
JobSet metrics
The following metrics apply to both GPU and TPU JobSets that have a single replicated Job:
- kubernetes.io/jobset/times_between_interruptions
- kubernetes.io/jobset/times_to_recover
- kubernetes.io/jobset/uptime
For more information about these system metrics, see Kubernetes metrics.
You can also use the JobSet dashboard in the Google Cloud console to visualize and monitor your GPU workloads.
Node health metrics
The following node-level metrics apply to all nodes, including those with GPUs:
- kubernetes.io/node/status_condition: this metric requires GKE version 1.32.1-gke.1357001 or later.
Node interruption and node pool interruption metrics also apply to non-TPU nodes.
Kube-state-metrics for JobSets
The kube-state-metrics for JobSets can be used with GPUs. Collection of these metrics requires GKE version 1.32.1-gke.1357001 or later. For more information, see the JobSet metrics documentation.
Configure graceful termination of GPU nodes
In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, GPU nodes support SIGTERM signals that alert the node of an imminent shutdown. The imminent shutdown notification is configurable up to 60 minutes in GPU nodes.
To configure GKE to terminate your workloads gracefully within this notification timeframe, follow the steps in Manage GKE node disruption for GPUs and TPUs.
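For example, a Pod sketch that gives its container time to shut down cleanly after the SIGTERM arrives; the Pod and container names, image, and grace period value are placeholders chosen only to illustrate the pattern:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  terminationGracePeriodSeconds: 600
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1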
What's next
- Learn more about node pools.
- Learn how to use a minimum CPU platform for your nodes.
- Learn how to create and set up a local deep learning container with Docker.
- Learn how to use Confidential GKE Nodes in your GPU node pools (Preview).
- Learn about sandboxing GPU workloads with GKE Sandbox.