Pipeline lifecycle

This page provides an overview of the pipeline lifecycle from pipeline code to a Dataflow job.

This page explains the following concepts:

  • What an execution graph is, and how an Apache Beam pipeline becomes a Dataflow job
  • How Dataflow handles errors
  • How Dataflow automatically parallelizes and distributes the processing logic in your pipeline to the workers performing your job
  • Job optimizations that Dataflow might make

Execution graph

When you run your Dataflow pipeline, Dataflow creates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions, such as DoFn objects. This is the pipeline execution graph, and the phase is called graph construction time.

During graph construction, Apache Beam locally executes the code from the main entry point of the pipeline code, stopping at the calls to a source, sink, or transform step, and turning these calls into nodes of the graph. Consequently, a piece of code in a pipeline's entry point (the Java and Go main method or the top level of a Python script) locally executes on the machine that runs the pipeline. The same code declared in a method of a DoFn object executes in the Dataflow workers.
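
The following sketch, a hypothetical pipeline rather than code from this page, marks where each piece of code runs. The class, step names, and input values are placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class LifecycleExample {
  public static void main(String[] args) {
    // Everything in main() runs locally, at graph construction time, on the
    // machine that launches the pipeline.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("CreateInput", Create.of("alpha", "beta", "gamma"))
        .apply("ToUpper", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String word, OutputReceiver<String> out) {
            // Code inside the DoFn runs later, on the Dataflow workers
            // (or locally if you use a local runner).
            out.output(word.toUpperCase());
          }
        }));

    // run() hands the constructed graph to the configured runner.
    pipeline.run();
  }
}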

For example, the WordCount sample included with the Apache Beam SDKs contains a series of transforms to read, extract, count, format, and write the individual words in a collection of text, along with an occurrence count for each word. The following diagram shows how the transforms in the WordCount pipeline are expanded into an execution graph:

The transforms in the WordCount example program expanded into an execution graph of steps to be executed by the Dataflow service.

Figure 1: WordCount example execution graph
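
For reference, the following condensed sketch paraphrases the shape of the MinimalWordCount example rather than reproducing the sample verbatim; it shows roughly how those transforms are expressed before Dataflow expands them into the graph in figure 1. The Cloud Storage paths are placeholders, and imports from the Beam Java SDK are omitted.

pipeline
    .apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"))
    .apply("ExtractWords", FlatMapElements
        .into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
    .apply("DropEmptyWords", Filter.by((String word) -> !word.isEmpty()))
    .apply("CountWords", Count.perElement())
    .apply("FormatResults", MapElements
        .into(TypeDescriptors.strings())
        .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
    .apply("WriteCounts", TextIO.write().to("gs://my-bucket/output/wordcounts"));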

The execution graph often differs from the order in which you specified your transforms when you constructed the pipeline. This difference exists because the Dataflow service performs various optimizations and fusions on the execution graph before it runs on managed cloud resources. The Dataflow service respects data dependencies when executing your pipeline. However, steps without data dependencies between them might run in any order.

To see the unoptimized execution graph that Dataflow has generated for your pipeline, select your job in the Dataflow monitoring interface. For more information about viewing jobs, see Use the Dataflow monitoring interface.

During graph construction, Apache Beam validates that any resources referenced by the pipeline, such as Cloud Storage buckets, BigQuery tables, and Pub/Sub topics or subscriptions, actually exist and are accessible. The validation is done through standard API calls to the respective services, so it's vital that the user account used to run a pipeline has proper connectivity to the necessary services and is authorized to call the APIs of the services. Before submitting the pipeline to the Dataflow service, Apache Beam also checks for other errors, and ensures that the pipeline graph doesn't contain any illegal operations.

The execution graph is then translated into JSON format, and the JSON execution graph is transmitted to the Dataflow service endpoint.

The Dataflow service then validates the JSON execution graph. When the graph is validated, it becomes a job on the Dataflow service. You can see your job, its execution graph, status, and log information by using the Dataflow monitoring interface.

Java

The Dataflow service sends a response to the machine where you run your Dataflow program. This response is encapsulated in the object DataflowPipelineJob, which contains the jobId of your Dataflow job. Use the jobId to monitor, track, and troubleshoot your job by using the Dataflow monitoring interface and the Dataflow command-line interface. For more information, see the API reference for DataflowPipelineJob.
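
A minimal sketch, assuming a Pipeline object named pipeline that is configured to use the Dataflow runner; DataflowPipelineJob comes from the org.apache.beam.runners.dataflow package, and imports are omitted.

PipelineResult result = pipeline.run();
if (result instanceof DataflowPipelineJob) {
  DataflowPipelineJob job = (DataflowPipelineJob) result;
  // Use the job ID with the monitoring interface and the Dataflow
  // command-line interface to track and troubleshoot the job.
  System.out.println("Submitted Dataflow job: " + job.getJobId());
}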

Python

The Dataflow service sends a response to the machine where you run your Dataflow program. This response is encapsulated in the object DataflowPipelineResult, which contains the job_id of your Dataflow job. Use the job_id to monitor, track, and troubleshoot your job by using the Dataflow monitoring interface and the Dataflow command-line interface.

Go

The Dataflow service sends a response to the machine where you run your Dataflow program. This response is encapsulated in the object dataflowPipelineResult, which contains the jobID of your Dataflow job. Use the jobID to monitor, track, and troubleshoot your job by using the Dataflow monitoring interface and the Dataflow command-line interface.

Graph construction also happens when you execute your pipeline locally, but the graph is not translated to JSON or transmitted to the service. Instead, the graph is run locally on the same machine where you launched your Dataflow program. For more information, see Configuring PipelineOptions for local execution.

Error and exception handling

Your pipeline might throw exceptions while processing data. Some of these errors are transient, such as temporary difficulty accessing an external service. Other errors are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles, and retries the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles that include a failing item are retried four times. The pipeline fails completely when a single bundle has failed four times. When running in streaming mode, a bundle that includes a failing item is retried indefinitely, which might cause your pipeline to permanently stall.

When processing in batch mode, you might see a large number of individual failures before a pipeline job fails completely, which happens when any given bundle fails after four retry attempts. For example, if your pipeline attempts to process 100 bundles, Dataflow could generate several hundred individual failures until a single bundle reaches the four-failure condition for exit.
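
One common way to keep permanent, data-dependent errors from repeatedly failing their bundle is to catch the exception inside the DoFn and route the offending element to a secondary "dead letter" output instead of rethrowing. The following is a sketch of that pattern, not part of this page's lifecycle description; parseRecord, the tag names, and the element types are hypothetical, and imports are omitted.

final TupleTag<String> parsedTag = new TupleTag<String>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = input.apply("Parse",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String line, MultiOutputReceiver out) {
        try {
          out.get(parsedTag).output(parseRecord(line));  // hypothetical parsing helper
        } catch (Exception e) {
          // A permanent error for this element: emit it to the dead-letter
          // output rather than letting the whole bundle fail and retry.
          out.get(deadLetterTag).output(line);
        }
      }
    }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));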

Startup worker errors, like failure to install packages on the workers, are transient. This scenario results in indefinite retries, and might cause your pipeline to permanently stall.

Parallelization and distribution

The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline to the workers you assign to perform your job. Dataflow uses the abstractions in the programming model to represent parallel processing functions. For example, the ParDo transforms in a pipeline cause Dataflow to automatically distribute processing code, represented by DoFn objects, to multiple workers to be run in parallel.

There are two types of job parallelism:

  • Horizontal parallelism occurs when pipeline data is split and processed on multiple workers at the same time. The Dataflow runtime environment is powered by a pool of distributed workers. A pipeline has higher potential parallelism when the pool contains more workers, but that configuration also has a higher cost. Theoretically, horizontal parallelism doesn't have an upper limit. However, Dataflow limits the worker pool to 4000 workers to optimize fleetwide resource usage.

  • Vertical parallelism occurs when pipeline data is split and processed by multiple CPU cores on the same worker. Each worker is powered by a Compute Engine VM. A VM can run multiple processes to saturate all of its CPU cores. A VM with more cores has higher potential vertical parallelism, but this configuration results in higher costs. A higher number of cores often results in an increase in memory usage, so the number of cores is usually scaled together with memory size. Given the physical limit of computer architectures, the upper limit of vertical parallelism is much lower than the upper limit of horizontal parallelism. A sketch of the pipeline options that bound both kinds of parallelism follows this list.
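
The following is a sketch of the pipeline options that bound both kinds of parallelism when a job runs on Dataflow. The values are placeholders, and imports from the Dataflow runner packages are omitted.

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setMaxNumWorkers(50);                  // caps horizontal parallelism (worker count)
options.setWorkerMachineType("n2-standard-8"); // more vCPUs per worker raises vertical parallelism
Pipeline pipeline = Pipeline.create(options);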

Managed parallelism

By default, Dataflow automatically manages job parallelism. Dataflow monitors the runtime statistics for the job, such as CPU and memory usage, to determine how to scale the job. Depending on your job settings, Dataflow can scale jobs horizontally, referred to as Horizontal Autoscaling, or vertically, referred to as Vertical scaling. Automatic scaling of parallelism optimizes job cost and job performance.

To improve job performance, Dataflow also optimizes pipelines internally. Typical optimizations are fusion optimization and combine optimization. By fusing pipeline steps, Dataflow eliminates unnecessary costs associated with coordinating steps in a distributed system and running each individual step separately.

Factors that affect parallelism

The following factors affect how well parallelism functions in Dataflow jobs.

Input source

When an input source doesn't allow parallelism, the input source ingestion step can become a bottleneck in a Dataflow job. For example, when you ingest data from a single compressed text file, Dataflow can't parallelize the input data. Because most compression formats can't be arbitrarily divided into shards during ingestion, Dataflow needs to read the data sequentially from the beginning of the file. The overall throughput of the pipeline is slowed down by the non-parallel portion of the pipeline. The solution to this problem is to use a more scalable input source.

In some instances, step fusion also reduces parallelism. When the input source doesn't allow parallelism, if Dataflow fuses the data ingestion step with subsequent steps and assigns this fused step to a single thread, the entire pipeline might run more slowly.

To avoid this scenario, insert a Redistribute step after the input source ingestion step. For more information, see the Prevent fusion section of this document.
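
A minimal sketch, assuming a Beam Java SDK release that includes the Redistribute transform; the Cloud Storage path and the ParseLogLineFn DoFn are placeholders, and imports are omitted.

PCollection<String> lines = pipeline.apply("ReadCompressed",
    TextIO.read().from("gs://my-bucket/logs/archive.csv.gz")); // read sequentially from one file

PCollection<String> redistributed =
    lines.apply("BreakFusion", Redistribute.arbitrarily());    // spread elements across workers

redistributed.apply("Process", ParDo.of(new ParseLogLineFn())); // downstream steps now parallelize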

Default fanout

The default fanout of a single transform step can become a bottleneck and limit parallelism. For example, a "high fan-out" ParDo transform can cause fusion to limit Dataflow's ability to optimize worker usage. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.

For potential solutions, see the Prevent fusion section of this document.

Data shape

The shape of the data, whether it's input data or intermediate data, can limit parallelism. For example, when a GroupByKey step on a natural key, such as a city, is followed by a map or Combine step, Dataflow fuses the two steps. When the key space is small, for example, five cities, and one key is very hot, for example, a large city, most items in the output of the GroupByKey step are distributed to one process. This process becomes a bottleneck and slows down the job.

In this example, you can redistribute the GroupByKey step results into a larger artificial key space instead of using the natural keys. Insert a Redistribute step between the GroupByKey step and the map or Combine step. In the Redistribute step, create the artificial key space, such as by using a hash function, to overcome the limited parallelism caused by the data shape.
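
The following sketch illustrates the artificial-key idea with a hand-rolled salted key and a two-stage combine, rather than with the Redistribute transform itself. It assumes an input PCollection<KV<String, String>> named events that is keyed by city; the bucket count of 10 and the step names are placeholders, and imports are omitted.

PCollection<KV<String, Long>> countsPerCity =
    events
        // Salt each natural key with a small hash bucket so that a single hot
        // key is spread across several worker threads.
        .apply("AddSalt", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, String> kv) -> KV.of(
                kv.getKey() + "#" + Math.floorMod(kv.getValue().hashCode(), 10), 1L)))
        .apply("PartialCount", Sum.longsPerKey())
        // Drop the salt and combine the partial counts into per-city totals.
        .apply("RemoveSalt", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via((KV<String, Long> kv) -> KV.of(
                kv.getKey().substring(0, kv.getKey().lastIndexOf('#')), kv.getValue())))
        .apply("TotalCount", Sum.longsPerKey());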

For more information, see the Prevent fusion section of this document.

Output sink

A sink is a transform that writes to an external data storage system, such as a file or a database. In practice, sinks are modeled and implemented as standard DoFn objects and are used to materialize a PCollection to external systems. In this case, the PCollection contains the final pipeline results. Threads that call sink APIs can run in parallel to write data to the external systems. By default, no coordination between the threads occurs. Without an intermediate layer to buffer the write requests and control flow, the external system can get overloaded and reduce write throughput. Scaling up resources by adding more parallelism might slow down the pipeline even further.

The solution to this problem is to reduce the parallelism in the write step. You can add a GroupByKey step right before the write step. The GroupByKey step groups output data into a smaller set of batches to reduce total RPC calls and connections to external systems. For example, use a GroupByKey to create a hash space of 50 out of 1 million data points.
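
A minimal sketch, assuming a PCollection<String> named records and a hypothetical WriteBatchFn DoFn that issues one batched request per group; the 50-key hash space mirrors the example above, and imports are omitted.

records
    // Assign each record to one of 50 buckets so that the write step runs
    // with bounded parallelism.
    .apply("KeyByBucket", WithKeys.of((String record) -> Math.floorMod(record.hashCode(), 50))
        .withKeyType(TypeDescriptors.integers()))
    .apply("BatchByBucket", GroupByKey.create())
    // Each DoFn call receives one bucket's records and writes them as a
    // single batched request to the external system.
    .apply("WriteBatches", ParDo.of(new WriteBatchFn()));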

The downside to this approach is that it introduces a hardcoded limit to parallelism. Another option is to implement exponential backoff in the sink when writing data. This option can provide bare-minimum client throttling.

Monitor parallelism

To monitor parallelism, you can use the Google Cloud console to view any detected stragglers. For more information, see Troubleshoot stragglers in batch jobs and Troubleshoot stragglers in streaming jobs.

Fusion optimization

After the JSON form of your pipeline execution graph has been validated, the Dataflow service might modify the graph to perform optimizations. Optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps. Fusing steps prevents the Dataflow service from needing to materialize every intermediate PCollection in your pipeline, which can be costly in terms of memory and processing overhead.

Although all the transforms you specify in your pipeline construction are executed on the service, to ensure the most efficient execution of your pipeline, the transforms might be executed in a different order or as part of a larger fused transform. The Dataflow service respects data dependencies between the steps in the execution graph, but otherwise steps might be executed in any order.

Fusion example

The following diagram shows how the execution graph from the WordCount example included with the Apache Beam SDK for Java might be optimized and fused by the Dataflow service for efficient execution:

The execution graph for the WordCount example program optimized and with steps fused by the Dataflow service.

Figure 2: WordCount example optimized execution graph

Prevent fusion

In some cases, Dataflow might incorrectly guess the optimal way to fuse operations in the pipeline, which can limit Dataflow's ability to use all available workers. In such cases, you can give Dataflow a hint to redistribute the data by using a Redistribute transform.

To add a Redistribute transform, call one of the following methods:

  • Redistribute.arbitrarily: Indicates the data is likely to be imbalanced. Dataflow chooses the best algorithm to redistribute the data.

  • Redistribute.byKey: Indicates that a PCollection of key-value pairs is likely to be imbalanced and should be redistributed based on the keys. Typically, Dataflow co-locates all elements of a single key on the same worker thread. However, co-location of keys is not guaranteed, and the elements are processed independently.

If your pipeline contains a Redistribute transform, Dataflow usually prevents fusion of the steps before and after the Redistribute transform, and shuffles the data so that the steps downstream of the Redistribute transform have more optimal parallelism.
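
For example, the following minimal sketches assume that pcoll is any PCollection and keyedPcoll is a PCollection of key-value pairs; imports are omitted.

// Let Dataflow choose the best algorithm to redistribute the data.
pcoll.apply("RedistributeArbitrarily", Redistribute.arbitrarily());

// Redistribute a PCollection of key-value pairs based on its keys.
keyedPcoll.apply("RedistributeByKey", Redistribute.byKey());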

Monitor fusion

You can access your optimized graph and fused stages in the Google Cloud console, by using the gcloud CLI, or by using the API.

Console

To view your graph's fused stages and steps in the console, in the Execution details tab for your Dataflow job, open the Stage workflow graph view.

To see the component steps that are fused for a stage, in the graph, click the fused stage. In the Stage info pane, the Component steps row displays the fused steps. Sometimes portions of a single composite transform are fused into multiple stages.

gcloud

To access your optimized graph and fused stages by using the gcloud CLI, run the following gcloud command:

gcloud dataflow jobs describe --full JOB_ID --format json

Replace JOB_ID with the ID of your Dataflow job.

To extract the relevant bits, pipe the output of the gcloud command to jq:

gcloud dataflow jobs describe --full JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'

To see the description of the fused stages in the output response file, within the ComponentTransform array, see the ExecutionStageSummary object.

API

To access your optimized graph and fused stages by using the API, call projects.locations.jobs.get.

To see the description of the fused stages in the output response file, within the ComponentTransform array, see the ExecutionStageSummary object.

Combine optimization

Aggregation operations are an important concept in large-scale data processing. Aggregation brings together data that's conceptually far apart, making it extremely useful for correlating. The Dataflow programming model represents aggregation operations as the GroupByKey, CoGroupByKey, and Combine transforms.

Dataflow's aggregation operations combine data across the entire dataset, including data that might be spread across multiple workers. During such aggregation operations, it's often most efficient to combine as much data locally as possible before combining data across instances. When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.
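
For example, when a pipeline uses a Combine transform such as the following, Dataflow can compute partial sums locally on each worker before shuffling only the partial results for the final per-key totals. This is a minimal sketch that assumes a PCollection<KV<String, Integer>> named scores; imports are omitted.

// Partial sums are combined locally on each worker before the final,
// cross-worker combination per key.
PCollection<KV<String, Integer>> totals = scores.apply("SumPerKey", Sum.integersPerKey());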

Note: Because the Dataflow service automatically performs partial local combining, it is strongly recommended that you do not attempt to make this optimization by hand in your pipeline code.

When performing partial or multi-level combining, the Dataflow service makes different decisions based on whether your pipeline is working with batch or streaming data. For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency, and might not perform partial combining, because it might increase latency.
