Dataflow shuffle for batch jobs

Batch jobs use Dataflow shuffle by default. Dataflow shuffle moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

The information on this page applies to batch jobs. Streaming jobs use a different shuffle mechanism, called streaming shuffle.

About Dataflow shuffle

  • Dataflow shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine.
  • The Dataflow shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner.
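Conceptually, the partition-and-group step behind these transforms can be sketched in plain Python. This is an illustration only, not Beam's or the Dataflow service's actual implementation; the function name and partition count are invented for the example:

```python
from collections import defaultdict

def shuffle_by_key(records, num_partitions=3):
    """Illustrative sketch of a shuffle: partition records by key,
    then group the values for each key. Not Beam's implementation."""
    # Partition: hash each key so that every record with the same key
    # lands in the same partition (on a real service, the same worker).
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions][key].append(value)
    # Group: merge the per-partition groupings; keys are disjoint
    # across partitions, so a plain update is safe.
    grouped = {}
    for partition in partitions:
        grouped.update(partition)
    return grouped

print(shuffle_by_key([("a", 1), ("b", 2), ("a", 3)]))
# {'a': [1, 3], 'b': [2]}  (key order may vary)
```

The service-based shuffle performs this same partition-and-group work outside the worker VMs, which is what allows the benefits described below.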

Benefits of Dataflow shuffle

The service-based Dataflow shuffle has the following benefits:

  • Faster execution time of batch pipelines for the majority of pipeline job types.
  • A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
  • Better Horizontal Autoscaling, because VMs don't hold any shuffle data and can therefore be scaled down earlier.
  • Better fault tolerance, because an unhealthy VM holding Dataflow shuffle data doesn't cause the entire job to fail.

Support and limitations

  • This feature is available in all regions where Dataflow is supported. For a list of available locations, see Dataflow locations. There might be performance differences between regions.
  • Workers must be deployed in the same region as the Dataflow job.
  • Don't specify the zone pipeline option. Instead, specify the region, and set the value to one of the available regions. Dataflow automatically selects the zone in the region you specified.

    If you specify the zone pipeline option and set it to a zone outside of the available regions, the Dataflow job returns an error. If you set an incompatible combination of region and zone, your job can't use Dataflow shuffle.

  • For Python, Dataflow shuffle requires Apache Beam SDK for Python version 2.1.0 or later.
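For example, a batch pipeline can be launched with the region pipeline option set and no zone option. The project, bucket, and script names below are placeholders; `--runner`, `--project`, `--region`, and `--temp_location` are standard Apache Beam pipeline options:

```shell
# Launch a Python batch pipeline on Dataflow, pinning only the region.
# my-project, my-bucket, and wordcount.py are placeholder names.
python wordcount.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp
# Don't pass --zone; Dataflow picks a zone within us-central1.
```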

Disk size considerations

The default boot disk size for each batch job is 25 GB. For some batch jobs, you might need to increase the disk size. Consider the following:

  • A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity might fail when you use Dataflow shuffle.
  • Jobs that perform a lot of disk I/O might be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see Compute Engine Persistent Disk performance.

To specify a larger disk size for a Dataflow shuffle job, use the --disk_size_gb parameter.
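For example, assuming a placeholder pipeline script, the boot disk can be doubled from the 25 GB default at launch time:

```shell
# my_pipeline.py is a placeholder name; --disk_size_gb=50 raises the
# per-worker boot disk from the 25 GB default to 50 GB.
python my_pipeline.py \
  --runner=DataflowRunner \
  --region=us-central1 \
  --disk_size_gb=50
```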

Pricing

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge associated with the use of Dataflow shuffle. Execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.