Maximize GPU network bandwidth in Standard mode clusters
This page is intended for machine learning (ML) engineers and platform administrators who facilitate ML workloads. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Artificial intelligence (AI), ML, and high-performance computing (HPC) applications require powerful acceleration to optimize performance by reducing job completion times. For example, ML models that focus on conversational AI and image generation require high scalability and compute power.
Before reading this page, ensure that you're familiar with networking technologies, such as network interface cards (NICs) and TCP, and with accelerator technologies like the NVIDIA Collective Communications Library (NCCL).
About Google Cloud GPU supercomputers
Google Cloud has accelerator-optimized supercomputers that are built for scalable, massive models. These machines have the following benefits:
- Eight NVIDIA B200, H200, or H100 GPUs per machine.
- Up to 200 Gbps bandwidth on the primary NIC.
- Secondary NICs (up to eight on A3 Mega machine types and up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer. On A3 High machine types, the expected bandwidth per NIC is approximately 150 Gbps.
Your GKE workload must use all available GPUs and all available secondary NICs on a single node, and use a significant portion of the available bandwidth. The solution described in this document is ideal for workloads that require high performance, high throughput, and low latency.
Required features and capabilities for maximized bandwidth
To maximize your network bandwidth in GPU supercomputer nodes, use all of the following features:
- GPUDirect networking stack: The A3 machine series supports three networking stacks for custom remote direct memory access (RDMA):
  - On A3 High machine types and NVIDIA H100 GPUs, use GPUDirect-TCPX to reduce the overhead required to transfer packet payloads to and from GPUs, which significantly improves throughput at scale compared to GPUs that don't use GPUDirect.
  - On A3 Mega machine types and NVIDIA H100 Mega GPUs, use GPUDirect-TCPXO, which further improves GPU-to-VM communication.
  - On A3 Ultra machine types and NVIDIA H200 GPUs, and A4 machine types and NVIDIA B200 GPUs, use GPUDirect RDMA to run distributed AI workloads with further throughput improvements. To get started, create a custom AI-optimized GKE cluster.
- gVNIC: Enables GPUDirect capabilities such as packet header splitting, flow steering, and buffer management. gVNIC is required to use GPUDirect-TCPX or GPUDirect-TCPXO. For details about gVNIC, see Increase network traffic speed for GPU nodes.
- Multi-networking: Add secondary NICs to the accelerator-optimized machine. Each NIC is associated with a separate subnet in its own VPC to avoid conflicts. For details about multi-network support, see Set up multi-network support for Pods.
- Placement policies: Use a resource placement policy to place all GPU nodes for a specific workload on physically close servers to minimize latency. For details, see Define compact placement for GKE nodes.
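For example, a compact placement resource policy can be created ahead of time and then referenced when you create the GPU node pool later in this guide. The following is a minimal sketch; the policy name a3-compact and the region us-central1 are example values, not requirements:

```bash
# Minimal sketch: create a compact placement resource policy that a GPU node
# pool can later reference with --placement-policy. The policy name and
# region are example values.
gcloud compute resource-policies create group-placement a3-compact \
    --collocation=collocated \
    --region=us-central1
```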
Procedure outline
To use all of these capabilities together, you'll do the following:
- Create Virtual Private Cloud (VPC) networks and subnets.
- Create the GKE environment.
- Install the GPUDirect binary and the NCCL plugin
- Deploy the NRI device injector plugin
- Deploy a test workload to verify GPUDirect setup
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.
- Ensure that you have capacity for A3 Mega or A3 High VMs. To obtain this capacity, first choose from the consumption options. To follow the instructions on this page, you can use on-demand capacity, on-demand reservations, future reservations, or future reservations for up to 90 days (in calendar mode). After you've chosen a consumption option, follow the respective instructions to obtain capacity with that option.
- Ensure that you have enough quota for H100 GPUs. To request more quota, see GPU quotas.
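If you want to check your current quota from the command line, one option is to print the quota metrics for your region and look for the H100 GPU entry; a minimal sketch, assuming the region us-central1:

```bash
# Minimal sketch: print the quota metrics for a region so you can check the
# H100 GPU quota before creating node pools. The region is an example value.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```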
Requirements
The following requirements apply to both GPUDirect-TCPX and GPUDirect-TCPXO unless otherwise indicated.
GPUDirect-TCPX is supported on GKE version 1.27 or later with specific patch versions, and requires:
- The a3-highgpu-8g machine type.
- For GKE version 1.27, use GKE patch version 1.27.7-gke.1121000 or later.
- For GKE version 1.28, use GKE patch version 1.28.10-gke.1141000 or later.
- For GKE version 1.29, use GKE patch version 1.29.5-gke.1121000 or later.
- For GKE versions 1.30 to 1.33, use any patch version.
- Don't use GKE version 1.34 or later. For more information, see the known issue GPUDirect-TCPX is unavailable with A3 High for GKE version 1.34 and later.
GPUDirect-TCPXO is supported on GKE version 1.28 or later and requires:
- The a3-megagpu-8g machine type.
- For GKE version 1.28, use GKE patch version 1.28.9-gke.1250000 or later.
- For GKE version 1.29, use GKE patch version 1.29.4-gke.1542000 or later.
- For GKE version 1.30, use GKE patch version 1.30.4-gke.1129000 or later.
- For GKE version 1.31, use GKE patch version 1.31.1-gke.2008000 or later.
- For GKE version 1.32, use GKE patch version 1.32.2-gke.1489001 or later.
In addition, both technologies require the following:
- The GKE node must use a Container-Optimized OS (COS) node image. Ubuntu and Windows node images are not supported.
- Your GPU nodes must use NVIDIA driver version 535 or later.
- You must use GKE Dataplane V2.
- For GPUDirect-TCPX or GPUDirect-TCPXO workloads that run across multiple node pools, all of the node pools must be in the same Compute Engine zones and must use the same network sets, such as VPCs and subnets.
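If you plan to reuse an existing cluster, you can check whether it already uses GKE Dataplane V2 by inspecting its datapath provider; a minimal sketch in which CLUSTER_NAME and LOCATION are placeholders, and where a value of ADVANCED_DATAPATH indicates Dataplane V2:

```bash
# Minimal sketch: a cluster that uses GKE Dataplane V2 reports the
# ADVANCED_DATAPATH datapath provider. CLUSTER_NAME and LOCATION are
# placeholders for your own cluster and its location.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(networkConfig.datapathProvider)"
```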
Limitations
The following limitations apply:
- GPUDirect-TCPX and GPUDirect-TCPXO are not supported with multi-instance GPUs, GPU time-sharing, or NVIDIA MPS.
- You can't use NCCL FastSocket with GPUDirect-TCPX or GPUDirect-TCPXO.
- Your GKE workload must use all available GPUs and all available secondary NICs on a single node. Multiple Pods cannot use GPUDirect-TCPX or GPUDirect-TCPXO on a single node.
- You can only use the a3-highgpu-8g and a3-megagpu-8g machine types. Other A3 machine types aren't supported.
Create VPCs and subnets
Create separate VPC networks in your project for each virtual NIC that you'll add to your nodes. Each VPC network must have a subnet and a firewall rule that allows internal network traffic.
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule. Choose the GPUDirect-TCPX tab for A3 High machine types, or choose the GPUDirect-TCPXO tab for A3 Mega machine types, then complete the following instructions:
GPUDirect-TCPXO
To maximize your bandwidth, we recommend that you create eight new networks.
for N in $(seq 1 8); do
  gcloud compute networks create PREFIX-net-$N \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create PREFIX-sub-$N \
      --network=PREFIX-net-$N \
      --region=REGION \
      --range=SUBNET_RANGE

  gcloud compute firewall-rules create PREFIX-internal-$N \
      --network=PREFIX-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
done

Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Compute Engine region for each subnet.
- SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates over eight subnets, so you should use a variable to change the IP address range for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, and so on.
- SOURCE_RANGE: the source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
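For example, with concrete values substituted in, the loop might look like the following sketch. The prefix a3mega, the region us-central1, and the IP ranges shown are illustrative assumptions; adjust them to your environment.

```bash
# Illustrative sketch of the loop above with example values filled in.
# PREFIX, REGION, and the IP ranges are assumptions, not requirements.
PREFIX=a3mega
REGION=us-central1
for N in $(seq 1 8); do
  gcloud compute networks create ${PREFIX}-net-${N} \
      --subnet-mode=custom \
      --mtu=8244
  gcloud compute networks subnets create ${PREFIX}-sub-${N} \
      --network=${PREFIX}-net-${N} \
      --region=${REGION} \
      --range=192.168.${N}.0/24
  gcloud compute firewall-rules create ${PREFIX}-internal-${N} \
      --network=${PREFIX}-net-${N} \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=192.168.0.0/16
done
```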
GPUDirect-TCPX
To maximize your bandwidth, we recommend that you create four new networks.
for N in $(seq 1 4); do
  gcloud compute networks create PREFIX-net-$N \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create PREFIX-sub-$N \
      --network=PREFIX-net-$N \
      --region=REGION \
      --range=SUBNET_RANGE

  gcloud compute firewall-rules create PREFIX-internal-$N \
      --network=PREFIX-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
done

Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Compute Engine region for each subnet.
- SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates over four subnets, so you should use a variable to change the IP address range for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, and so on.
- SOURCE_RANGE: the source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
Verify that the networks were created:
gcloud compute networks list
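If your project contains many networks, you can narrow the output to the networks you just created; a minimal sketch, assuming the example prefix a3mega:

```bash
# Minimal sketch: list only the GPUDirect networks, assuming they were
# created with the example prefix a3mega.
gcloud compute networks list --filter="name~'^a3mega-net'"
```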
Create the GKE environment
Create a new GKE cluster that uses multi-networking (Preview) and create a GPU node pool that has the following characteristics:
- gVNIC enabled
- Multi-networking subnets specified for each secondary NIC
- A3 machine series with H100 GPUs backing the nodes
- Latest NVIDIA drivers installed
You can't update an existing cluster to use multi-networking.
GPUDirect-TCPXO
Choose an available GKE version that supports GPUDirect-TCPXO. To list the versions, run this command:
gcloud container get-server-config \
    --format="yaml(validMasterVersions)" \
    --region=REGION \
    --project=PROJECT_ID

Replace the following:
- REGION: the compute region for the cluster control plane.
- PROJECT_ID: your Google Cloud project ID.
Create a cluster:
gcloud beta container clusters create CLUSTER_NAME \
    --enable-dataplane-v2 \
    --enable-ip-alias \
    --location=CONTROL_PLANE_LOCATION \
    --enable-multi-networking \
    --cluster-version=VERSION \
    --no-enable-autoupgrade \
    --project=PROJECT_ID

Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- VERSION: a GKE version that supports GPUDirect-TCPXO, as described in Requirements.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc2
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc2
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc3
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc3
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc4
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc4
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc5
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc5
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc6
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc6
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc7
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc7
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc8
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc8
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: PREFIX-net-1
  vpcSubnet: PREFIX-sub-1
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc2
spec:
  vpc: PREFIX-net-2
  vpcSubnet: PREFIX-sub-2
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc3
spec:
  vpc: PREFIX-net-3
  vpcSubnet: PREFIX-sub-3
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc4
spec:
  vpc: PREFIX-net-4
  vpcSubnet: PREFIX-sub-4
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc5
spec:
  vpc: PREFIX-net-5
  vpcSubnet: PREFIX-sub-5
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc6
spec:
  vpc: PREFIX-net-6
  vpcSubnet: PREFIX-sub-6
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc7
spec:
  vpc: PREFIX-net-7
  vpcSubnet: PREFIX-sub-7
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc8
spec:
  vpc: PREFIX-net-8
  vpcSubnet: PREFIX-sub-8
  deviceMode: NetDevice
EOF

These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.
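After applying the manifest, you can confirm that the resources exist; a minimal check that uses fully qualified resource names to avoid clashes with other resource types that are also called "network":

```bash
# Minimal check: list the Network and GKENetworkParamSet objects created above.
kubectl get networks.networking.gke.io
kubectl get gkenetworkparamsets.networking.gke.io
```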
GPUDirect-TCPX
Create a cluster:
gcloud beta container clusters create CLUSTER_NAME \
    --enable-dataplane-v2 \
    --enable-ip-alias \
    --location=CONTROL_PLANE_LOCATION \
    --enable-multi-networking \
    --cluster-version=VERSION \
    --no-enable-autoupgrade \
    --project=PROJECT_ID

Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
- VERSION: a GKE version that supports GPUDirect-TCPX, as described in Requirements.
Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc2
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc2
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc3
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc3
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc4
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc4
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: PREFIX-net-1
  vpcSubnet: PREFIX-sub-1
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc2
spec:
  vpc: PREFIX-net-2
  vpcSubnet: PREFIX-sub-2
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc3
spec:
  vpc: PREFIX-net-3
  vpcSubnet: PREFIX-sub-3
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc4
spec:
  vpc: PREFIX-net-4
  vpcSubnet: PREFIX-sub-4
  deviceMode: NetDevice
EOF

These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.
Create a GPU node pool
Best practice: After you create the cluster, create a separate node pool to run the GPUs.

GPUDirect-TCPXO
Create a node pool for the H100 GPUs:
gcloud beta container node-pools create NODE_POOL_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID \
    --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
    --machine-type=a3-megagpu-8g \
    --num-nodes=2 \
    --additional-node-network network=PREFIX-net-1,subnetwork=PREFIX-sub-1 \
    --additional-node-network network=PREFIX-net-2,subnetwork=PREFIX-sub-2 \
    --additional-node-network network=PREFIX-net-3,subnetwork=PREFIX-sub-3 \
    --additional-node-network network=PREFIX-net-4,subnetwork=PREFIX-sub-4 \
    --additional-node-network network=PREFIX-net-5,subnetwork=PREFIX-sub-5 \
    --additional-node-network network=PREFIX-net-6,subnetwork=PREFIX-sub-6 \
    --additional-node-network network=PREFIX-net-7,subnetwork=PREFIX-sub-7 \
    --additional-node-network network=PREFIX-net-8,subnetwork=PREFIX-sub-8 \
    --enable-gvnic \
    --no-enable-autoupgrade \
    --scopes "https://www.googleapis.com/auth/cloud-platform" [ \
    --placement-policy=POLICY_NAME \
    --reservation-affinity=specific \
    --reservation=projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME \
    --host-maintenance-interval=PERIODIC ]

Replace NODE_POOL_NAME with your node pool name.
In the example, the --scopes "https://www.googleapis.com/auth/cloud-platform" argument sets the node instance's scope to cloud-platform for testing convenience. For production, you may want to limit the scope to configure finer-grained credentials.
To use a reservation, use the --placement-policy, --reservation-affinity, and --reservation flags. Specify these flags to configure the policy name and reservation in the node pool. If the reservation doesn't require a resource policy, omit the --placement-policy flag.
The --reservation-affinity flag can take the values specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation. You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values for on-demand reservations, view a list of your reservations, or view your future reservation requests.
Replace the following to use a reservation:
- PROJECT_ID: optionally, your Google Cloud project ID. If the reservation is located in the current project (not a shared reservation), you can omit projects/PROJECT_ID/reservations/ from the reservation value.
- RESERVATION_NAME: the name of your reservation.
- BLOCK_NAME: optionally, the name of a specific block within the reservation. Omit /reservationBlocks/BLOCK_NAME if you don't want to use a specific block.
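If you don't remember the exact reservation name, you can list the reservations that are visible to your project and copy the name from the output; a minimal sketch, where PROJECT_ID is a placeholder:

```bash
# Minimal sketch: list reservations in the project so you can copy the
# reservation name into the --reservation flag.
gcloud compute reservations list --project=PROJECT_ID
```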
If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have quota and retry the command.
GPUDirect-TCPX
Create a node pool for the H100 GPUs:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
    --additional-node-network=network=PREFIX-net-1,subnetwork=PREFIX-sub-1 \
    --additional-node-network=network=PREFIX-net-2,subnetwork=PREFIX-sub-2 \
    --additional-node-network=network=PREFIX-net-3,subnetwork=PREFIX-sub-3 \
    --additional-node-network=network=PREFIX-net-4,subnetwork=PREFIX-sub-4 \
    --enable-gvnic \
    --no-enable-autoupgrade [ \
    --placement-policy=POLICY_NAME \
    --reservation-affinity=specific \
    --reservation=projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME ]

Replace NODE_POOL_NAME with the name of the node pool.
To use a reservation, use the --placement-policy, --reservation-affinity, and --reservation flags. Specify these flags to configure the policy name and reservation in the node pool. If the reservation doesn't require a resource policy, omit the --placement-policy flag.
The --reservation-affinity flag can take the values specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation. You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values for on-demand reservations, view a list of your reservations, or view your future reservation requests.
Replace the following to use a reservation:
- PROJECT_ID: optionally, your Google Cloud project ID. If the reservation is located in the current project (not a shared reservation), you can omit projects/PROJECT_ID/reservations/ from the reservation value.
- RESERVATION_NAME: the name of your reservation.
- BLOCK_NAME: optionally, the name of a specific block within the reservation. Omit /reservationBlocks/BLOCK_NAME if you don't want to use a specific block.
If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have quota and retry the command.
After you create the node pool, verify that each node has the attached GPUs:
Get a list of nodes in the cluster:
kubectl get nodes

Verify that each GPU node has eight GPUs:
kubectl describe node NODE_NAME

Replace NODE_NAME with the name of the node to describe. The output is similar to the following:
Capacity:
  ...
  nvidia.com/gpu: 8
Allocatable:
  ...
  nvidia.com/gpu: 8
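To check all nodes at once instead of describing them one at a time, you can project the allocatable GPU count into a table; a minimal sketch:

```bash
# Minimal sketch: show the allocatable GPU count for every node in one view.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```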
Install the GPUDirect binary and configure NCCL
This section shows you how to install the GPUDirect binary, based on your A3 machine type (GPUDirect-TCPX for A3 High, GPUDirect-TCPXO for A3 Mega), and a specific NCCL library version by using a DaemonSet.
GPUDirect-TCPXO
This DaemonSet does the following:
- Performs pre-installation steps to set up GPUDirect-TCPXO-related configurations.
- Installs the NCCL library and GPUDirect-TCPXO binary on the node.
- Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPXO.
To install the binary and configure NCCL, do the following steps:
Review the nccl-tcpxo-installer.yaml DaemonSet manifest in GitHub.

Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-tcpxo-installer.yaml

The NCCL plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=nccl-tcpxo-installer

The output is similar to the following:
nccl-tcpxo-installer-6c2pv    1/1    Running    0    2m11s
nccl-tcpxo-installer-qgg82    1/1    Running    0    2m11s
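If you script this step, you can block until the installer Pods are ready instead of polling manually; a minimal sketch that reuses the label selector from the command above:

```bash
# Minimal sketch: wait up to 5 minutes for the installer Pods to become Ready.
kubectl wait pod \
    -n kube-system \
    -l name=nccl-tcpxo-installer \
    --for=condition=Ready \
    --timeout=300s
```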
GPUDirect-TCPX
This DaemonSet does the following:
- Installs the NCCL library and GPUDirect-TCPX binary on the node.
- Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPX.
To install the binary and configure NCCL, do the following:
Review the nccl-tcpx-installer.yaml DaemonSet manifest in GitHub.

Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml

The NCCL plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=nccl-tcpx-installer

The output is similar to the following:
nccl-tcpx-installer-6c2pv    1/1    Running    0    2m11s
nccl-tcpx-installer-qgg82    1/1    Running    0    2m11s
Deploy NRI device injector plugin
This section shows you how to install the NRI device injector by using a DaemonSet. Both H100 GPU machine types install the same NRI device injector plugin. This plugin does the following:
- Enables Node Resource Interface (NRI) on the node that has H100 GPUs. NRI is enabled by default on GKE version 1.29 and later.
- Deploys an NRI device injector plugin container that injects GPU devices into containers specified by Pod annotations.
To install the plugin, do the following:
Review the nri-device-injector.yaml Deployment manifest in GitHub.

Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nri_device_injector/nri-device-injector.yaml

The plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=device-injector

The output is similar to the following:
device-injector-md6hb    1/1    Running    0    4h54m
device-injector-vh9bm    1/1    Running    0    4h54m
Deploy a test workload
In this section, you deploy a sample workload to verify that NCCL and GPUDirect-TCPX or GPUDirect-TCPXO work as expected. This sample workload does the following:
- Deploys two Pods, each of which runs in a node that has H100 GPUs.
- Deploys a sidecar container in each Pod to let those Pods use GPUDirect-TCPXO or GPUDirect-TCPX.
To deploy this sample workload, do the following:
GPUDirect-TCPXO
This workload includes a sidecar container named tcpxo-daemon, which runs a service that lets the Pod use GPUDirect-TCPXO. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPXO. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.
Review the nccl-test-latest.yaml manifest in GitHub.

Deploy two Pods with the test workload:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-test-latest.yaml

After the Pods deploy, trigger an all-gather test:
kubectl exec --stdin --tty --container=nccl-test nccl-test-host-1 -- /scripts/allgather.sh nccl-host-1 nccl-host-2

The output is similar to the following:
#                                     out-of-place                       in-place
#       size    count    type  redop  root    time  algbw  busbw  #wrong    time  algbw  busbw  #wrong
#        (B)  (elements)                       (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
           0           0  float  none   -1    0.24   0.00   0.00      0    0.18   0.00   0.00      0
           0           0  float  none   -1    0.19   0.00   0.00      0    0.17   0.00   0.00      0
           0           0  float  none   -1    0.17   0.00   0.00      0    0.17   0.00   0.00      0
           0           0  float  none   -1    0.17   0.00   0.00      0    0.17   0.00   0.00      0
           0           0  float  none   -1    0.17   0.00   0.00      0    0.17   0.00   0.00      0
         256           4  float  none   -1   235.2   0.00   0.00      0   235.1   0.00   0.00      0
         512           8  float  none   -1   241.0   0.00   0.00      0   236.1   0.00   0.00      0
        1024          16  float  none   -1   236.3   0.00   0.00      0   233.3   0.00   0.00      0
        2048          32  float  none   -1   234.1   0.01   0.01      0   233.4   0.01   0.01      0
        4096          64  float  none   -1   237.1   0.02   0.02      0   235.3   0.02   0.02      0
        8192         128  float  none   -1   236.2   0.03   0.03      0   235.2   0.03   0.03      0
       16384         256  float  none   -1   236.6   0.07   0.06      0   238.5   0.07   0.06      0
       32768         512  float  none   -1   237.9   0.14   0.13      0   238.8   0.14   0.13      0
       65536        1024  float  none   -1   242.3   0.27   0.25      0   239.4   0.27   0.26      0
      131072        2048  float  none   -1   263.0   0.50   0.47      0   275.1   0.48   0.45      0
      262144        4096  float  none   -1   279.2   0.94   0.88      0   269.9   0.97   0.91      0
      524288        8192  float  none   -1   273.5   1.92   1.80      0   273.5   1.92   1.80      0
     1048576       16384  float  none   -1   315.1   3.33   3.12      0   314.1   3.34   3.13      0
     2097152       32768  float  none   -1   319.2   6.57   6.16      0   311.5   6.73   6.31      0
     4194304       65536  float  none   -1   331.8  12.64  11.85      0   331.3  12.66  11.87      0
     8388608      131072  float  none   -1   356.3  23.54  22.07      0   353.8  23.71  22.23      0
    16777216      262144  float  none   -1   409.1  41.01  38.45      0   405.2  41.40  38.81      0
    33554432      524288  float  none   -1   451.4  74.34  69.69      0   447.7  74.94  70.26      0
    67108864     1048576  float  none   -1   713.4  94.07  88.19      0   713.8  94.01  88.13      0
   134217728     2097152  float  none   -1  1122.1 119.62 112.14      0  1116.3 120.23 112.72      0
   268435456     4194304  float  none   -1  1785.8 150.32 140.92      0  1769.2 151.72 142.24      0
   536870912     8388608  float  none   -1  2859.7 187.74 176.00      0  2852.6 188.20 176.44      0
  1073741824    16777216  float  none   -1  5494.1 195.44 183.22      0  5568.2 192.83 180.78      0
  2147483648    33554432  float  none   -1   10841 198.09 185.71      0   10798 198.88 186.45      0
  4294967296    67108864  float  none   -1   21453 200.21 187.70      0   21490 199.86 187.37      0
  8589934592   134217728  float  none   -1   42603 201.63 189.03      0   42670 201.31 188.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.7587
#

Success: At this point, you've successfully installed GPUDirect-TCPXO on your nodes and can use it to optimize the throughput of GPU-heavy workloads that run on those nodes. The required fields to use GPUDirect-TCPXO in your own Pods are described in Add GPUDirect to your manifests in this document.
GPUDirect-TCPX
This workload includes a sidecar container named tcpx-daemon, which runs a service that lets the Pod use GPUDirect-TCPX. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPX. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.
Review the nccl-config.yaml ConfigMap manifest in GitHub. This manifest deploys scripts that initialize an NCCL all-gather test and set NCCL-specific configuration settings.

Review the nccl-test-latest.yaml Deployment manifest in GitHub.

Deploy the ConfigMap and the test workload:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-test-latest.yaml

Run the following commands to trigger an NCCL all-gather test for the nodes:
kubectl exec \
    --stdin --tty --container=nccl-test nccl-test-host-1 \
    -- /configs/allgather.sh nccl-host-1 nccl-host-2

The output is similar to the following:
#                                     out-of-place                       in-place
#       size    count    type  redop  root    time  algbw  busbw  #wrong    time  algbw  busbw  #wrong
#        (B)  (elements)                       (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
     1048576       16384  float  none   -1   696.8   1.50   1.41      0   729.0   1.44   1.35      0
     2097152       32768  float  none   -1   776.4   2.70   2.53      0   726.7   2.89   2.71      0
     4194304       65536  float  none   -1   774.3   5.42   5.08      0   805.1   5.21   4.88      0
     8388608      131072  float  none   -1   812.1  10.33   9.68      0   817.6  10.26   9.62      0
    16777216      262144  float  none   -1  1035.2  16.21  15.19      0  1067.8  15.71  14.73      0
    33554432      524288  float  none   -1  1183.3  28.36  26.59      0  1211.8  27.69  25.96      0
    67108864     1048576  float  none   -1  1593.4  42.12  39.49      0  1510.5  44.43  41.65      0
   134217728     2097152  float  none   -1  2127.8  63.08  59.13      0  2312.7  58.03  54.41      0
   268435456     4194304  float  none   -1  3603.0  74.50  69.85      0  3586.2  74.85  70.17      0
   536870912     8388608  float  none   -1  7101.7  75.60  70.87      0  7060.9  76.03  71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293
Use required NCCL configuration settings to improve performance
The following key-value pairs are the required NCCL configuration settings for GPUDirect-TCPX and GPUDirect-TCPXO. When deploying your workloads that use NCCL, set them as environment variables to optimize performance.
GPUDirect-TCPXO

"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64\"",
"NCCL_FASTRAK_CTRL_DEV=eth0",
"NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8",
"NCCL_SOCKET_IFNAME=eth0",
"NCCL_CROSS_NIC=0",
"NCCL_ALGO=Ring,Tree",
"NCCL_PROTO=Simple,LL128",
"NCCL_MIN_NCHANNELS=4",
"NCCL_TUNER_PLUGIN=libnccl-tuner.so",
"NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config.textproto",
"NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config.textproto",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_FASTRAK_NUM_FLOWS=2",
"NCCL_FASTRAK_USE_SNAP=1",
"NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000",
"NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0",
"NCCL_BUFFSIZE=8388608",
"CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0",
"NCCL_FASTRAK_USE_LLCM=1",
"NCCL_NVLS_ENABLE=0"

Optionally, you can set all the configurations at once by following these steps:
In your workload container manifest, add the following key-value pair as an environment variable:

    NCCL_LIB_DIR="/usr/local/nvidia/lib64"

Ensure that the nccl-env-profile.sh script is executed when your workload container starts. For example, you can do this in your Pod specification by overriding the container's command to include the following:

    source ${NCCL_LIB_DIR}/nccl-env-profile.sh
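As a concrete illustration, the container's startup script could source the profile before launching the training process; a minimal sketch in which the entrypoint path /workspace/run_training.sh is a hypothetical placeholder for your own command:

```bash
# Minimal sketch of a container startup script: export NCCL_LIB_DIR, source
# the NCCL environment profile, then exec the workload entrypoint.
# /workspace/run_training.sh is a hypothetical placeholder.
export NCCL_LIB_DIR="/usr/local/nvidia/lib64"
source "${NCCL_LIB_DIR}/nccl-env-profile.sh"
exec /workspace/run_training.sh
```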
Note: Starting with the us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.9-1 NCCL plugin version, LL128 NCCL communication protocol support becomes the default tuning parameter in GPUDirect-TCPXO. To use or disable LL128, see the LL128 support section.

LL128 support
The NVIDIA LL128 (low-latency 128) NCCL communication protocol can significantly improve performance for small-to-medium sized collectives. GPUDirect-TCPXO supports the LL128 protocol.
To use LL128, ensure that the nccl-tcpxo-installer.yaml file in the Install the GPUDirect binary and configure NCCL section uses the following container image version or later:
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1

To set up LL128, do the following:
For the us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1 NCCL plugin version, do these steps:

- In your workload manifest, set the following environment variable:

      NCCL_LIB_DIR="/usr/local/nvidia/lib64"

- Configure your workload to execute the nccl-env-profile-ll128.sh script when the container starts. In your workload manifest, set the following command:

      source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh

The nccl-env-profile-ll128.sh script has the following environment variables:

- NCCL_PROTO=Simple,LL128
- NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config_ll128.textproto
- NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config_ll128.textproto
For the us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.9-1 NCCL plugin version and later, LL128 becomes a default parameter, so sourcing either the nccl-env-profile.sh script or the nccl-env-profile-ll128.sh script enables LL128. To disable LL128:

- In your workload manifest, set the following environment variable:

      NCCL_LIB_DIR="/usr/local/nvidia/lib64"

- Configure your workload to execute the nccl-env-profile-simple.sh script when the container starts. In your workload manifest, set the following command:

      source ${NCCL_LIB_DIR}/nccl-env-profile-simple.sh

The nccl-env-profile-simple.sh script has the following environment variables:

- NCCL_PROTO=Simple
- NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config_simple.textproto
- NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_tuner_config_simple.textproto
GPUDirect-TCPX

"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/tcpx/lib64\"",
"NCCL_SOCKET_IFNAME=\"eth0\"",
"NCCL_ALGO=Ring",
"NCCL_PROTO=Simple",
"NCCL_CROSS_NIC=0",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_P2P_PXN_LEVEL=0",
"NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
"NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_BUFFSIZE=4194304",
"NCCL_NSOCKS_PERTHREAD=4",
"NCCL_SOCKET_NTHREADS=1",
"NCCL_GPUDIRECTTCPX_TX_BINDINGS=\"eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177\"",
"NCCL_GPUDIRECTTCPX_RX_BINDINGS=\"eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191\"",
"NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000"

Note: eth0 is used for control traffic of the GPUDirect-TCPX workload. Avoid rate limiting or restricting the primary eth0 device. You can remove the NCCL_GPUDIRECTTCPX_CTRL_DEV setting, which specifies the network interface for GPUDirect-TCPX control traffic, and the control traffic will instead use its GPU-aligned network device. However, NCCL itself will continue to use eth0 for orchestration because it's set as the value for NCCL_SOCKET_IFNAME.

Collect NCCL debugging logs
To log NCCL errors, we recommend that you add the following NCCL config:
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH
NCCL_DEBUG_FILE=/DIRECTORY/FILE_NAME.%h.%p

- NCCL_DEBUG=INFO: prints debugging information.
  - For large-scale workloads (64 nodes or more), extensive logging can occur. To avoid this scenario, and unless you specified NCCL_DEBUG_FILE, we recommend setting NCCL_DEBUG=WARN to limit logs to errors only.
- NCCL_DEBUG_SUBSYS: filters the subsystems for which NCCL collects debugging information. We recommend that you collect logs for the following subsystems:
  - INIT: the initialization phase of NCCL.
  - NET: the NCCL network.
  - ENV: the environment variables that NCCL uses.
  - COLL: collective operations.
  - GRAPH: topology detection and graph search.
  If you want to collect logs for different subsystems, see NCCL_DEBUG_SUBSYS in the NCCL documentation for a list of accepted values.
- NCCL_DEBUG_FILE (optional): directs the NCCL debug logging output to a file that you specify. This variable writes NCCL logs to standard files, which prevents the log output from mixing with application output. This variable also writes logs from different NCCL ranks to different files, which prevents the logs from mixing. Use the following filename format:

      /DIRECTORY/FILE_NAME.%h.%p

  Replace the following:
  - DIRECTORY: the directory where you want to store the log files.
  - FILE_NAME: the name of the log files.

  The placeholder %h resolves to the hostname of the node, while %p resolves to the process ID (PID) of the process that's generating the log.
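For example, in a debugging session you might export these variables (or set them in the Pod's env section) before launching the workload; a minimal sketch in which /tmp/nccl-logs is an example directory:

```bash
# Minimal sketch: enable NCCL debug logging and write per-host, per-process
# log files. The directory /tmp/nccl-logs is an example value.
mkdir -p /tmp/nccl-logs
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH
export NCCL_DEBUG_FILE=/tmp/nccl-logs/nccl_log.%h.%p
```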
For more information about debugging NCCL logs, see Troubleshoot GPUs in GKE.
Add GPUDirect to your manifests
This section shows the required fields that you must add to your Kubernetes manifests for your Pods to use GPUDirect.
Depending on the type of GPUDirect, do the following:
GPUDirect-TCPXO
Add the following annotations to the Pod metadata. Without these annotations, hostNetwork: true will be required for the Pod, and privileged: true will be required for the tcpxo-daemon container.

metadata:
  annotations:
    devices.gke.io/container.tcpxo-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"},
        {"interfaceName":"eth5","network":"vpc5"},
        {"interfaceName":"eth6","network":"vpc6"},
        {"interfaceName":"eth7","network":"vpc7"},
        {"interfaceName":"eth8","network":"vpc8"}
      ]

Add the following fields to the Pod specification:
spec:
  volumes:
  - name: libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia/lib64
  - name: sys
    hostPath:
      path: /sys
  - name: proc-sys
    hostPath:
      path: /proc/sys
  - name: aperture-devices
    hostPath:
      path: /dev/aperture_devices

Add the following container to the manifest to run the tcpxo-daemon service. Replace TCPXO_DAEMON_IMAGE with the latest image, us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.17:

- name: tcpxo-daemon
  image: TCPXO_DAEMON_IMAGE
  imagePullPolicy: Always
  command: ["/bin/sh", "-c"]
  args:
    - |
      set -ex
      chmod 755 /fts/entrypoint_rxdm_container.sh
      /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
  securityContext:
    capabilities:
      add:
        - NET_ADMIN
        - NET_BIND_SERVICE
  volumeMounts:
    - name: libraries
      mountPath: /usr/local/nvidia/lib64
    - name: sys
      mountPath: /hostsysfs
    - name: proc-sys
      mountPath: /hostprocsysfs
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64

Add the following environment variable to every GPU container:
env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64
  - name: NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY
    value: /dev/aperture_devices

Add the following volumeMounts to every GPU container. Without the aperture_devices setup, privileged: true is required for GPU containers:

volumeMounts:
  - name: aperture-devices
    mountPath: /dev/aperture_devices

Add environment variables to configure NCCL options. For details, see Use recommended NCCL configuration settings to improve performance.
A completed Pod specification looks like the following:
apiVersion: v1
kind: Pod
metadata:
  name: a3plus-workloads
  annotations:
    devices.gke.io/container.tcpxo-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"},
        {"interfaceName":"eth5","network":"vpc5"},
        {"interfaceName":"eth6","network":"vpc6"},
        {"interfaceName":"eth7","network":"vpc7"},
        {"interfaceName":"eth8","network":"vpc8"}
      ]
  ...
  containers:
    - name: tcpxo-daemon
      image: TCPXO_DAEMON_IMAGE
      imagePullPolicy: Always
      command: ["/bin/sh", "-c"]
      args:
        - |
          set -ex
          chmod 755 /fts/entrypoint_rxdm_container.sh
          /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
            - NET_BIND_SERVICE
      volumeMounts:
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
        - name: sys
          mountPath: /hostsysfs
        - name: proc-sys
          mountPath: /hostprocsysfs
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: main-application-container
      ...
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
        - name: NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY
          value: /dev/aperture_devices
      securityContext:
      volumeMounts:
        - name: aperture-devices
          mountPath: /dev/aperture_devices
      resources:
        limits:
          nvidia.com/gpu: 8
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia
    - name: sys
      hostPath:
        path: /sys
    - name: proc-sys
      hostPath:
        path: /proc/sys
    - name: aperture-devices
      hostPath:
        path: /dev/aperture_devices

GPUDirect-TCPX
Add the following annotations to the Pod metadata. Without these annotations, hostNetwork: true will be required for the Pod, and privileged: true will be required for the tcpx-daemon container.

metadata:
  annotations:
    devices.gke.io/container.tcpx-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"}
      ]

Add the following fields to the Pod specification:
spec:
  volumes:
  - name: libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia/lib64
  - name: sys
    hostPath:
      path: /sys
  - name: proc-sys
    hostPath:
      path: /proc/sys

Add the following container to the manifest to run the tcpx-daemon service:

- name: tcpx-daemon
  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
  command:
    - /tcpgpudmarxd/build/app/tcpgpudmarxd
    - --gpu_nic_preset
    - a3vm
    - --gpu_shmem_type
    - fd
    - --uds_path
    - /run/tcpx
    - --setup_param
    - \"--verbose 128 2 0 \"
  securityContext:
    capabilities:
      add:
        - NET_ADMIN
  volumeMounts:
    - name: libraries
      mountPath: /usr/local/nvidia/lib64
    - name: tcpx-socket
      mountPath: /run/tcpx
    - name: sys
      mountPath: /hostsysfs
    - name: proc-sys
      mountPath: /hostprocsysfs
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64

Add the following volume mounts to any containers that request GPUs:
volumeMounts:
  - name: tcpx-socket
    mountPath: /tmp
  - name: libraries
    mountPath: /usr/local/nvidia/lib64

Note: The default tcpx-socket path is /tmp for containers that request GPUs. If you set the NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX environment variable to a value other than /tmp, GKE mounts the tcpx-socket volume to that mountPath.

Add environment variables to configure NCCL options. For details, see the Use recommended NCCL configuration settings to improve performance section in this document.
Add the following environment variable to every GPU container:
env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64
A completed Pod specification looks like the following:
apiVersion: v1
kind: Pod
metadata:
  name: a3-gpu-workloads-example
  labels:
    name: a3-gpu-workloads-example
  annotations:
    devices.gke.io/container.tcpx-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"}
      ]
spec:
  containers:
    - name: tcpx-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
      imagePullPolicy: Always
      command:
        - /tcpgpudmarxd/build/app/tcpgpudmarxd
        - --gpu_nic_preset
        - a3vm
        - --gpu_shmem_type
        - fd
        - --uds_path
        - /run/tcpx
        - --setup_param
        - \"--verbose 128 2 0 \"
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      volumeMounts:
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
          readOnly: true
        - name: tcpx-socket
          mountPath: /run/tcpx
        - name: sys
          mountPath: /hostsysfs
        - name: proc-sys
          mountPath: /hostprocsysfs
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: a3-gpu-workloads-example
      ...
      volumeMounts:
        - name: tcpx-socket
          mountPath: /tmp
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
          readOnly: true
      resources:
        limits:
          nvidia.com/gpu: 8
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
  ...
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib64
    - name: tcpx-socket
      emptyDir: {}
    - name: sys
      hostPath:
        path: /sys
    - name: proc-sys
      hostPath:
        path: /proc/sys

What's next
- Read the GPUDirect-TCPXO release notes.
- Learn more about the best practice to run workloads with GPUDirect-TCPX(O).
- Learn about best practices for GKE networking.
- Learn more about the NVIDIA GPUDirect family of technologies for data movement and access on NVIDIA GPUs.
- Learn about current GPU version availability and requesting GPUs in GKE.