Dynamic work rebalancing
The Dynamic Work Rebalancing feature of the Dataflow service allows the service to dynamically repartition work based on runtime conditions. These conditions might include the following:
- Imbalances in work assignments
- Workers taking longer than expected to finish
- Workers finishing faster than expected
The Dataflow service automatically detects these conditions and can dynamically assign work to unused or underused workers to decrease the overall processing time of your job.
Limitations
Dynamic work rebalancing only happens when the Dataflow service is processing some input data in parallel: when reading data from an external input source, when working with a materialized intermediate PCollection, or when working with the result of an aggregation like GroupByKey. If a large number of steps in your job are fused, your job has fewer intermediate PCollections, and dynamic work rebalancing is limited to the number of elements in the source materialized PCollection. If you want to ensure that dynamic work rebalancing can be applied to a particular PCollection in your pipeline, you can prevent fusion in a few different ways to ensure dynamic parallelism, as sketched below.
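One common way to break fusion is to pass the data through a shuffle. Here is a minimal sketch in Java, assuming a hypothetical `records` PCollection and a hypothetical `ExpensiveDoFn` step; `Reshuffle.viaRandomKey()` materializes the collection, giving Dataflow an intermediate PCollection it can rebalance:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical stand-in for a slow per-element step.
class ExpensiveDoFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String record, OutputReceiver<String> out) {
    out.output(record.toUpperCase()); // placeholder for expensive work
  }
}

// Given some PCollection<String> named records:
// Reshuffle.viaRandomKey() breaks fusion between the upstream steps and the
// expensive ParDo, so the ParDo reads from a materialized intermediate
// PCollection that dynamic work rebalancing can repartition across workers.
PCollection<String> rebalanced =
    records.apply(Reshuffle.viaRandomKey())
           .apply(ParDo.of(new ExpensiveDoFn()));
```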
Dynamic work rebalancing cannot reparallelize data finer than a single record. If your data contains individual records that cause large delays in processing time, they might still delay your job. Dataflow can't subdivide and redistribute an individual "hot" record to multiple workers.
Java
If you set a fixed number of shards for the final output of your pipeline (for example, by writing data using TextIO.Write.withNumShards), Dataflow limits parallelization based on the number of shards that you choose.
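For example, the two write variants look like this inside your pipeline construction code (the output path is a placeholder):

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Given some PCollection<String> named results:

// Fixed sharding: the write is capped at 10 parallel shards, which also
// caps how much dynamic work rebalancing can help at this step.
results.apply(TextIO.write().to("gs://my-bucket/output").withNumShards(10));

// Runner-chosen sharding (the default): Dataflow is free to pick and
// rebalance the write parallelism.
results.apply(TextIO.write().to("gs://my-bucket/output"));
```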
Python
If you set a fixed number of shards for the final output of your pipeline (for example, by writing data using beam.io.WriteToText(..., num_shards=...)), Dataflow limits parallelization based on the number of shards that you choose.
Go
If you set a fixed number of shards for the final output of your pipeline, Dataflow limits parallelization based on the number of shards that you choose.
Working with Custom Data Sources
Java
If your pipeline uses a custom data source that you provide, you must implement the method splitAtFraction to allow your source to work with the dynamic work rebalancing feature.
If you choose to implement splitAtFraction, it's critical that you test your code extensively and with maximum code coverage. If you implement splitAtFraction incorrectly, records from your source might appear to get duplicated or dropped. See the API reference information on RangeTracker for help and tips on implementing splitAtFraction.
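A minimal sketch of the pattern, assuming a hypothetical source and reader over a fixed offset range; split decisions are delegated to the SDK's OffsetRangeTracker, and both classes are declared abstract so the sketch can focus on splitAtFraction (a real reader must also implement start, advance, getCurrent, and the other BoundedReader methods):

```java
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.io.range.OffsetRangeTracker;

// Hypothetical source over records at byte offsets [start, end).
abstract class MyOffsetSource extends BoundedSource<String> {
  abstract MyOffsetSource forSubrange(long startOffset, long endOffset);
  abstract long getEndOffset();
}

// Hypothetical reader for MyOffsetSource.
abstract class MyOffsetReader extends BoundedSource.BoundedReader<String> {
  private final MyOffsetSource source;
  private final OffsetRangeTracker tracker;

  MyOffsetReader(MyOffsetSource source, OffsetRangeTracker tracker) {
    this.source = source;
    this.tracker = tracker;
  }

  @Override
  public BoundedSource<String> splitAtFraction(double fraction) {
    // Translate the requested fraction into a concrete offset in the range.
    long splitOffset = tracker.getPositionForFractionConsumed(fraction);
    // Atomically shrink this reader's range. trySplitAtPosition returns
    // false if a record at or past splitOffset was already claimed, in
    // which case the split must be rejected.
    if (!tracker.trySplitAtPosition(splitOffset)) {
      return null; // Rejecting is always safe; the work stays with this reader.
    }
    // The residual range [splitOffset, end) becomes a new source that
    // Dataflow can hand to another worker. Every record must end up on
    // exactly one side of the split: none duplicated, none dropped.
    return source.forSubrange(splitOffset, source.getEndOffset());
  }
}
```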
Python
If your pipeline uses a custom data source that you provide, your RangeTracker must implement try_claim, try_split, position_at_fraction, and fraction_consumed to allow your source to work with the dynamic work rebalancing feature.
See the API reference information on RangeTracker for more information.
Go
If your pipeline uses a custom data source that you provide, you must implement a valid RTracker to allow your source to work with the dynamic work rebalancing feature.
For more information, see the RTracker API reference information.
Dynamic work rebalancing uses the return value of the getProgress() method of your custom source to activate. The default implementation for getProgress() returns null. To ensure autoscaling activates, make sure your custom source overrides getProgress() to return an appropriate value.
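As an illustration, continuing the Java reader sketch above: in the Java SDK, the progress hook on BoundedSource.BoundedReader is getFractionConsumed(), whose default implementation likewise returns null, and it can simply delegate to the tracker:

```java
// Continuing the MyOffsetReader sketch: report progress by delegating to
// the OffsetRangeTracker, which computes the consumed fraction as roughly
// (lastClaimedOffset - startOffset) / (endOffset - startOffset).
@Override
public Double getFractionConsumed() {
  return tracker.getFractionConsumed();
}
```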