Troubleshoot your Dataflow TPU job
If you run into problems running your Dataflow job with TPUs, use the following troubleshooting steps to resolve your issue.
Troubleshoot your container image
It can be helpful to debug your container and TPU software on a standalone VM. You can debug with a VM created by a GKE node pool, or you can debug on a running Dataflow worker VM.
Debug with a standalone VM
To debug your container on a standalone VM, you can create a GKE node pool that uses the same TPU VM for local experimentation. For example, creating a GKE node pool with one TPU V5 Lite device in us-west1-c would look like the following:
1. Create a GKE cluster.

   ```
   gcloud container clusters create TPU_CLUSTER_NAME \
       --project PROJECT_ID \
       --release-channel=stable \
       --scopes=cloud-platform \
       --enable-ip-alias \
       --location us-west1-c
   ```

2. Create a GKE node pool.

   ```
   gcloud container node-pools create TPU_NODE_POOL_NAME \
       --project PROJECT_ID \
       --location=us-west1-c \
       --cluster=TPU_CLUSTER_NAME \
       --node-locations=us-west1-c \
       --machine-type=ct5lp-hightpu-1t \
       --num-nodes=1 \
       [--reservation RESERVATION_NAME \
       --reservation-affinity=specific]
   ```

3. Find the VM name of the TPU node in the node pool in the GKE UI or with the following command.

   ```
   gcloud compute instances list --filter='metadata.kube-labels:"cloud.google.com/gke-nodepool=TPU_NODE_POOL_NAME"'
   ```

4. Connect to a VM created by the GKE node pool using SSH:

   ```
   gcloud compute ssh --zone "us-west1-c" "VM_NAME" --project PROJECT_ID
   ```

5. After connecting to a VM using SSH, configure Docker for the Artifact Registry you are using.

   ```
   docker-credential-gcr configure-docker --registries=us-west1-docker.pkg.dev
   ```

6. Start a container from the image that you use.

   ```
   docker run --privileged --network=host -it --rm --entrypoint=/bin/bash IMAGE_NAME
   ```

7. Inside the container, test that TPUs are accessible.
For example, if you have an image that uses PyTorch to utilize TPUs, open a Python interpreter:

```
python3
```

Then, perform a computation on a TPU device:
```python
import torch
import torch_xla.core.xla_model as xm

dev = xm.xla_device()
t1 = torch.randn(3, 3, device=dev)
t2 = torch.randn(3, 3, device=dev)
print(t1 + t2)
```

Sample output:
```
>>> tensor([[ 0.3355, -1.4628, -3.2610],
>>>         [-1.4656,  0.3196, -2.8766],
>>>         [ 0.8667, -1.5060,  0.7125]], device='xla:0')
```

If the computation fails, your image might not be properly configured.
For example, you might need to set the required environment variables in the image Dockerfile. To confirm, retry the computation after setting the environment variables manually as follows:
```
export TPU_SKIP_MDS_QUERY=1                # Don't query metadata
export TPU_HOST_BOUNDS=1,1,1               # There's only one host
export TPU_CHIPS_PER_HOST_BOUNDS=1,1,1     # 1 chip per host
export TPU_WORKER_HOSTNAMES=localhost
export TPU_WORKER_ID=0                     # Always 0 for single-host TPUs
export TPU_ACCELERATOR_TYPE=v5litepod-1    # Since we use a v5e 1x1 accelerator.
```

If PyTorch or LibTPU dependencies are missing, you could retry the computation after installing them using the following command:
```
# Install PyTorch with TPU support
pip install torch torch_xla[tpu] torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
```
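If setting these variables manually fixes the computation, you can bake them into your container image so that workers don't depend on manual setup. The following Dockerfile fragment is a minimal sketch, assuming the same single-host v5e 1x1 configuration as the exports above:

```
# Sketch: bake the single-host v5e TPU settings into the image.
# Adjust the values to match your accelerator type and topology.
ENV TPU_SKIP_MDS_QUERY=1 \
    TPU_HOST_BOUNDS=1,1,1 \
    TPU_CHIPS_PER_HOST_BOUNDS=1,1,1 \
    TPU_WORKER_HOSTNAMES=localhost \
    TPU_WORKER_ID=0 \
    TPU_ACCELERATOR_TYPE=v5litepod-1
```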
Debug by using a Dataflow VM
As an alternative, you can connect to the Dataflow worker VM instance using SSH while a job is running. Because Dataflow worker VMs shut down after pipeline completion, you might need to artificially increase the runtime with a computation that waits for a prolonged period of time.

Because a TPU device cannot be shared between multiple processes, you might need to run a pipeline that doesn't make any computations on a TPU.
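For example, the following is a minimal sketch of such a pipeline, assuming the Apache Beam Python SDK. The script name, the single-element input, and the 30-minute wait are arbitrary choices; the pipeline performs no TPU computation and only keeps a worker VM alive long enough to connect over SSH.

```python
# keep_worker_alive.py - hypothetical pipeline that holds a worker open
# for debugging without touching the TPU device.
import sys
import time

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def hold_open(element, minutes=30):
    # Sleep so the worker VM stays up long enough to connect over SSH.
    time.sleep(minutes * 60)
    return element


if __name__ == "__main__":
    # Pass your usual Dataflow TPU pipeline options on the command line.
    options = PipelineOptions(sys.argv[1:])
    with beam.Pipeline(options=options) as pipeline:
        pipeline | beam.Create([None]) | beam.Map(hold_open)
```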
Find a VM for the running TPU job by searching for the Dataflow job ID in the Google Cloud console search bar or by using the following gcloud command:

```
gcloud compute instances list --project PROJECT_ID --filter "STATUS='RUNNING' AND description ~ 'Created for Dataflow job: JOB_ID'"
```

After connecting to a VM with TPUs using SSH, start a container from the image that you use. For an example, see Debug with a standalone VM.
Inside the container, reconfigure the TPU settings and install necessary libraries to test your setup. For an example, see Debug with a standalone VM.
Workers don't start
Before troubleshooting, verify that the following pipeline options are set correctly, as shown in the example after this list:
- the --dataflow_service_option=worker_accelerator option
- the --worker_zone option
- the --machine_type option
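For reference, a launch command that sets all three options together might look like the following. This is a hypothetical sketch: the script name and project values are placeholders, the accelerator type and topology must match your TPU choice, and the service option flag is spelled --dataflow_service_options (plural) in the Apache Beam Python SDK.

```
python my_tpu_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-west1 \
    --worker_zone=us-west1-c \
    --machine_type=ct5lp-hightpu-1t \
    --dataflow_service_options="worker_accelerator=type:tpu-v5-lite-podslice;topology:1x1"
```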
Check if the console logs show that workers are starting, but the job fails with a message similar to the following:
```
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 25m.
```

The cause of these issues might be related to capacity or worker startup issues.
- Capacity: If you use on-demand TPU capacity, or a reservation that is exhausted, new pipelines might not start until capacity is available. If you use a reservation, check its remaining capacity on the Compute Reservations page in the Google Cloud console or with the following command:

  ```
  gcloud compute reservations describe RESERVATION_NAME --zone ZONE
  ```

  Check whether your job has started any worker VMs. When your job starts a worker, loggers such as worker, worker_startup, kubelet, and others generally provide output. Additionally, on the Job metrics page in the Google Cloud console, the number of current workers should be greater than zero.

- Worker startup: Check the job-message and launcher logs. If your pipeline starts workers but they can't boot, you might have errors in your custom container.

- Disk space: Verify that sufficient disk space is available for your job. To increase disk space, use the --disk_size_gb option.
Job fails with an error
Use the following troubleshooting advice when your job fails with an error.
Startup of worker pool failed
If you see the following error, verify that your pipeline specifies --worker_zone and that the zone matches the zone for your reservation.
```
JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to bring up any of the desired 1 workers. [...] INVALID_FIELD_VALUE: Instance 'INSTANCE_NAME' creation failed: Invalid value for field 'resource.reservationAffinity': '{ "consumeReservationType": "SPECIFIC_ALLOCATION", "key": "compute.googleapis.com/RESERVATION_NAME...'. Specified reservations [RESERVATION_NAME] do not exist.
```

Managed instance groups don't support Cloud TPUs
If you see the following error, contact your account team to verify whether your project has been enrolled to use TPUs, or file a bug using the Google Issue Tracker.
```
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: One or more operations had an error [...]: [INVALID_FIELD_VALUE] 'Invalid value for field 'resource.instanceTemplate': Managed Instance Groups do not support Cloud TPUs. '.
```
Invalid value for field
If you see the following error, verify that your pipeline invocation sets the worker_accelerator Dataflow service option.
```
JOB_MESSAGE_ERROR: Workflow failed. Causes: One or more operations had an error: 'operation-[...]': [INVALID_FIELD_VALUE] 'Invalid value for field 'resource.instanceTemplate': 'projects/[...]-harness'. Regional Managed Instance Groups do not support Cloud TPUs.'
```
Device or resource busy
If you see the following error, a Dataflow worker processing your pipeline is likely running more than one process that accesses the TPU at the same time. This is not supported. For more information, see TPUs and worker parallelism.
```
RuntimeError: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
```
If you see the preceding error while debugging your pipeline on a VM, you can inspect and terminate the process that is holding up the TPU by using the following commands:
```
apt update ; apt install lsof
lsof -w /dev/vfio/0
kill -9 PROCESS_ID  # to terminate the process
```

Instances with guest accelerators do not support live migration
If you see the following error, the pipeline was likely launched with an explicitly set machine type that has accelerators, but didn't specify the accelerator configuration correctly. Verify that your pipeline invocation sets the worker_accelerator Dataflow service option, and make sure the option name doesn't contain typos.
```
JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to bring up any of the desired 1 workers. [...] UNSUPPORTED_OPERATION: Instance INSTANCE_ID creation failed: Instances with guest accelerators do not support live migration.
```
The workflow was automatically rejected by the service
The following errors might also appear if some of the required pipeline options are missing or incorrect:
```
The workflow was automatically rejected by the service. The requested accelerator type tpu-v5-lite-podslice;topology:1x1 requires setting the worker machine type to ct5lp-hightpu-1t. Learn more at: https://cloud.google.com/dataflow/docs/guides/configure-worker-vm
```
Timed out waiting for an update from the worker
When you launch pipelines on TPU VMs with many vCPUs, the job might encounter errors like the following:
```
Workflow failed. Causes: WORK_ITEM failed. The job failed because a work item has failed 4 times. Root cause: Timed out waiting for an update from the worker.
```
If you see this error, try reducing the number of threads. For example, you could set: --number_of_worker_harness_threads=50.
No TPU usage
If your pipeline runs successfully but TPU devices aren't used or aren't accessible, verify that the frameworks you are using, such as JAX or PyTorch, can access the attached devices. To troubleshoot your container image on a single VM, see Debug with a standalone VM.
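For example, a quick check from inside the container, assuming a JAX installation with TPU support, is to list the devices the framework can see:

```python
# Sanity check: enumerate the TPU devices visible to JAX.
import jax

print(jax.devices())  # Expect TPU entries, for example [TpuDevice(id=0, ...)]
```

For PyTorch, the computation shown in Debug with a standalone VM serves the same purpose.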