Troubleshoot your Dataflow TPU job
If you run into problems running your Dataflow job with TPUs, use the following troubleshooting steps to resolve your issue.
Troubleshoot your container image
It can be helpful to debug your container and TPU software on a standalone VM. You can debug with a VM created by a GKE node pool, or you can debug on a running Dataflow worker VM.
Debug with a standalone VM
To debug your container on a standalone VM, you can create a GKE node pool that uses the same TPU VM for local experimentation. For example, creating a GKE node pool with one TPU V5 Lite device in us-west1-c would look like the following:
1. Create a GKE cluster.

   ```
   gcloud container clusters create TPU_CLUSTER_NAME \
       --project PROJECT_ID \
       --release-channel=stable \
       --scopes=cloud-platform \
       --enable-ip-alias \
       --location us-west1-c
   ```

2. Create a GKE node pool.

   ```
   gcloud container node-pools create TPU_NODE_POOL_NAME \
       --project PROJECT_ID \
       --location=us-west1-c \
       --cluster=TPU_CLUSTER_NAME \
       --node-locations=us-west1-c \
       --machine-type=ct5lp-hightpu-1t \
       --num-nodes=1 \
       [--reservation RESERVATION_NAME \
       --reservation-affinity=specific]
   ```

3. Find the VM name of the TPU node in the node pool in the GKE UI or with the following command.

   ```
   gcloud compute instances list --filter='metadata.kube-labels:"cloud.google.com/gke-nodepool=TPU_NODE_POOL_NAME"'
   ```

4. Connect to a VM created by the GKE node pool using SSH:

   ```
   gcloud compute ssh --zone "us-west1-c" "VM_NAME" --project PROJECT_ID
   ```

5. After connecting to a VM using SSH, configure Docker for the Artifact Registry you are using.

   ```
   docker-credential-gcr configure-docker --registries=us-west1-docker.pkg.dev
   ```

6. Start a container from the image that you use.

   ```
   docker run --privileged --network=host -it --rm --entrypoint=/bin/bash IMAGE_NAME
   ```

7. Inside the container, test that TPUs are accessible.
For example, if you have an image that uses PyTorch to utilize TPUs, open a Python interpreter:

```
python3
```

Then, perform a computation on a TPU device:
```python
import torch
import torch_xla.core.xla_model as xm

dev = xm.xla_device()
t1 = torch.randn(3, 3, device=dev)
t2 = torch.randn(3, 3, device=dev)
print(t1 + t2)
```

Sample output:
```
>>> tensor([[ 0.3355, -1.4628, -3.2610],
>>>         [-1.4656,  0.3196, -2.8766],
>>>         [ 0.8667, -1.5060,  0.7125]], device='xla:0')
```

If the computation fails, your image might not be properly configured.
For example, you might need to set the required environment variables in the image Dockerfile. To confirm, retry the computation after setting the environment variables manually as follows:
```
export TPU_SKIP_MDS_QUERY=1                # Don't query metadata
export TPU_HOST_BOUNDS=1,1,1               # There's only one host
export TPU_CHIPS_PER_HOST_BOUNDS=1,1,1     # 1 chip per host
export TPU_WORKER_HOSTNAMES=localhost
export TPU_WORKER_ID=0                     # Always 0 for single-host TPUs
export TPU_ACCELERATOR_TYPE=v5litepod-1    # Since we use a v5e 1x1 accelerator.
```

If PyTorch or LibTPU dependencies are missing, you could retry the computation after installing them using the following command:
```
# Install PyTorch with TPU support
pip install torch torch_xla[tpu] torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
```
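If setting these variables manually fixes the computation, you can bake them into your container image so that workers don't depend on manual setup. The following Dockerfile fragment is a minimal sketch, assuming the same single-host v5e 1x1 configuration as the exports above:

```
# Sketch: bake the single-host v5e TPU settings into the image.
# Adjust the values to match your accelerator type and topology.
ENV TPU_SKIP_MDS_QUERY=1 \
    TPU_HOST_BOUNDS=1,1,1 \
    TPU_CHIPS_PER_HOST_BOUNDS=1,1,1 \
    TPU_WORKER_HOSTNAMES=localhost \
    TPU_WORKER_ID=0 \
    TPU_ACCELERATOR_TYPE=v5litepod-1
```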
Debug by using a Dataflow VM
As an alternative, you can connect to the Dataflow worker VM instance using SSH while a job is running. Because Dataflow worker VMs shut down after pipeline completion, you might need to artificially increase the runtime with a computation that waits for a prolonged period of time.

Because a TPU device cannot be shared between multiple processes, you might need to run a pipeline that doesn't make any computations on a TPU.
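For example, the following is a minimal sketch of such a pipeline, assuming the Apache Beam Python SDK. The script name, the single-element input, and the 30-minute wait are arbitrary choices; the pipeline performs no TPU computation and only keeps a worker VM alive long enough to connect over SSH.

```python
# keep_worker_alive.py - hypothetical pipeline that holds a worker open
# for debugging without touching the TPU device.
import sys
import time

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def hold_open(element, minutes=30):
    # Sleep so the worker VM stays up long enough to connect over SSH.
    time.sleep(minutes * 60)
    return element


if __name__ == "__main__":
    # Pass your usual Dataflow TPU pipeline options on the command line.
    options = PipelineOptions(sys.argv[1:])
    with beam.Pipeline(options=options) as pipeline:
        pipeline | beam.Create([None]) | beam.Map(hold_open)
```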
Find a VM for the running TPU job by searching for the Dataflow job ID in the Google Cloud console search bar or by using the following gcloud command:

```
gcloud compute instances list --project PROJECT_ID --filter "STATUS='RUNNING' AND description ~ 'Created for Dataflow job: JOB_ID'"
```

After connecting to a VM with TPUs using SSH, start a container from the image that you use. For an example, see Debug with a standalone VM.
Inside the container, reconfigure the TPU settings and install necessary libraries to test your setup. For an example, see Debug with a standalone VM.
Workers don't start
Before troubleshooting, verify that the following pipeline options are set correctly, as shown in the example after this list:
- the --dataflow_service_option=worker_accelerator option
- the --worker_zone option
- the --machine_type option
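For reference, a launch command that sets all three options together might look like the following. This is a hypothetical sketch: the script name and project values are placeholders, the accelerator type and topology must match your TPU choice, and the service option flag is spelled --dataflow_service_options (plural) in the Apache Beam Python SDK.

```
python my_tpu_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-west1 \
    --worker_zone=us-west1-c \
    --machine_type=ct5lp-hightpu-1t \
    --dataflow_service_options="worker_accelerator=type:tpu-v5-lite-podslice;topology:1x1"
```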
Check if the console logs show that workers are starting, but the job fails with a message similar to the following:
```
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 25m.
```

The cause of these issues might be related to capacity or worker startup issues.
- Capacity: If you use on-demand TPU capacity, or a reservation that is exhausted, new pipelines might not start until capacity is available. If you use a reservation, check its remaining capacity on the Compute Reservations page in the Google Cloud console or with the following command:

  ```
  gcloud compute reservations describe RESERVATION_NAME --zone ZONE
  ```

  Check whether your job has started any worker VMs. When your job starts a worker, loggers such as worker, worker_startup, kubelet, and others generally provide output. Additionally, on the Job metrics page in the Google Cloud console, the number of current workers should be greater than zero.

- Worker startup: Check the job-message and launcher logs. If your pipeline starts workers but they can't boot, you might have errors in your custom container.

- Disk space: Verify that sufficient disk space is available for your job. To increase disk space, use the --disk_size_gb option.
Job fails with an error
Use the following troubleshooting advice when your job fails with an error.
Startup of worker pool failed
If you see the following error, verify that your pipeline specifies --worker_zone and that the zone matches the zone for your reservation.
```
JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to bring up any of the desired 1 workers. [...] INVALID_FIELD_VALUE: Instance 'INSTANCE_NAME' creation failed: Invalid value for field 'resource.reservationAffinity': '{ "consumeReservationType": "SPECIFIC_ALLOCATION", "key": "compute.googleapis.com/RESERVATION_NAME...'. Specified reservations [RESERVATION_NAME] do not exist.
```

Managed instance groups don't support Cloud TPUs
If you see the following error, contact your account team to verify whether your project has been enrolled to use TPUs, or file a bug using the Google Issue Tracker.
```
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: Workflow failed. Causes: One or more operations had an error [...]: [INVALID_FIELD_VALUE] 'Invalid value for field 'resource.instanceTemplate': Managed Instance Groups do not support Cloud TPUs. '.
```
Invalid value for field
If you see the following error, verify that your pipeline invocation sets the worker_accelerator Dataflow service option.
```
JOB_MESSAGE_ERROR: Workflow failed. Causes: One or more operations had an error: 'operation-[...]': [INVALID_FIELD_VALUE] 'Invalid value for field 'resource.instanceTemplate': 'projects/[...]-harness'. Regional Managed Instance Groups do not support Cloud TPUs.'
```
Device or resource busy
If you see the following error, a Dataflow worker processing your pipeline is likely running more than one process that accesses the TPU at the same time. This is not supported. For more information, see TPUs and worker parallelism.
```
RuntimeError: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
```
If you see the preceding error while debugging your pipeline on a VM, you can inspect and terminate the process that is holding up the TPU by using the following commands:
```
apt update ; apt install lsof
lsof -w /dev/vfio/0
kill -9 PROCESS_ID  # to terminate the process
```

Instances with guest accelerators do not support live migration
If you see the following error, the pipeline was likely launched with an explicitly set machine type that has accelerators, but didn't specify the accelerator configuration correctly. Verify that your pipeline invocation sets the worker_accelerator Dataflow service option, and make sure the option name doesn't contain typos.
```
JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to bring up any of the desired 1 workers. [...] UNSUPPORTED_OPERATION: Instance INSTANCE_ID creation failed: Instances with guest accelerators do not support live migration.
```
The workflow was automatically rejected by the service
The following errors might also appear if some of the required pipeline options are missing or incorrect:
```
The workflow was automatically rejected by the service. The requested accelerator type tpu-v5-lite-podslice;topology:1x1 requires setting the worker machine type to ct5lp-hightpu-1t. Learn more at: https://cloud.google.com/dataflow/docs/guides/configure-worker-vm
```
Timed out waiting for an update from the worker
When you launch pipelines on TPU VMs with many vCPUs, the job might encounter errors like the following:
```
Workflow failed. Causes: WORK_ITEM failed. The job failed because a work item has failed 4 times. Root cause: Timed out waiting for an update from the worker.
```
If you see this error, try reducing the number of threads. For example, you could set: --number_of_worker_harness_threads=50.
No TPU usage
If your pipeline runs successfully but TPU devices aren't used or aren't accessible, verify that the frameworks you are using, such as JAX or PyTorch, can access the attached devices. To troubleshoot your container image on a single VM, see Debug with a standalone VM.
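For example, a quick check from inside the container, assuming a JAX installation with TPU support, is to list the devices the framework can see:

```python
# Sanity check: enumerate the TPU devices visible to JAX.
import jax

print(jax.devices())  # Expect TPU entries, for example [TpuDevice(id=0, ...)]
```

For PyTorch, the computation shown in Debug with a standalone VM serves the same purpose.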