Design Dataflow pipeline workflows
Pipeline development involves different stages and tasks, such as code development, testing, and delivery into production. This document explains:
- Considerations for continuous integration and continuous delivery (CI/CD) to support automated build, testing, and pipeline deployment into different environments.
- Dataflow features to optimize performance and resource utilization in production.
- Approaches and watchpoints for updating streaming pipelines in production.
- Best practices for improving pipeline reliability in production.
Continuous integration
Continuous integration (CI) requires developers to merge code into a shared repository frequently, which is useful for applications that change often, such as websites that are updated frequently. Although data pipelines don't usually change as much as other types of applications, CI practices provide many benefits for pipeline development. For example, test automation provides rapid feedback when defects are encountered and reduces the likelihood that regressions enter the codebase.
Test automation is an important part of CI. When it's combined with appropriate test coverage, test automation runs your test suite on each code commit. Your CI server can work in conjunction with a build automation tool like Maven to run your test suite as one or more steps of the CI pipeline. You can package code that successfully passes unit tests and integration tests into deployment artifacts from which pipelines are launched. This build is referred to as a passing build.
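For example, a CI step can run transform-level unit tests with the Direct Runner. The following sketch uses the Apache Beam Java SDK's TestPipeline and PAssert; the transform under test (a simple uppercase mapping) is only a placeholder for your own PTransforms.

```java
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.junit.Rule;
import org.junit.Test;

public class FormatEventsTest {

  // TestPipeline runs the transform under test with the Direct Runner.
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void formatsEventsAsUppercase() {
    // Placeholder transform: a real test would apply the pipeline's own PTransform here.
    PCollection<String> output =
        pipeline
            .apply(Create.of("click", "view"))
            .apply(MapElements.into(TypeDescriptors.strings()).via(String::toUpperCase));

    // PAssert verifies the output when the test pipeline runs.
    PAssert.that(output).containsInAnyOrder("CLICK", "VIEW");
    pipeline.run().waitUntilFinish();
  }
}
```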
The number and types of deployment artifacts created from a passing build vary depending on how pipelines are launched. Using the Apache Beam Java SDK, you can package your pipeline code into a self-executing JAR file. You can then store the JAR file in a bucket that's hosted in the project for a deployment environment, such as the preproduction or production Google Cloud project. If you use Classic Templates (a type of templated execution), the deployment artifacts include a JSON template file, the JAR file for your pipeline, and an optional metadata template. You can then deploy the artifacts into different deployment environments using continuous delivery, as explained in the following section.
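As a sketch, a CI step for Classic Templates can create and stage the template by running the pipeline's main class with a template location set in the pipeline options. The project, bucket paths, and the buildPipeline placeholder are hypothetical, and setTemplateLocation is assumed to be available on DataflowPipelineOptions in your SDK version.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StageClassicTemplate {
  public static void main(String[] args) {
    // Typically passed on the command line by the CI job, for example:
    // --project=my-preprod-project --region=us-central1 \
    // --stagingLocation=gs://my-artifacts-bucket/staging \
    // --templateLocation=gs://my-artifacts-bucket/templates/my-pipeline
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // buildPipeline is a placeholder for the transforms that make up your pipeline.
    // buildPipeline(pipeline);

    // When templateLocation is set, run() stages the template instead of launching a job.
    pipeline.run();
  }
}
```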
Continuous delivery and deployment
Continuous delivery (CD) copies deployment artifacts to one or more deployment environments that are ready to be launched manually. Typically, the artifacts built by the CI server are deployed to one or more preproduction environments for running end-to-end tests. The production environment is updated if all end-to-end tests pass successfully.
For batch pipelines, continuous deployment can directly launch the pipeline as a new Dataflow job. Alternatively, other systems can use the artifacts to launch batch jobs when required. For example, you can use Cloud Composer to run batch jobs within a workflow, or Cloud Scheduler to schedule batch jobs.
Streaming pipelines can be more complex to deploy than batch pipelines, and therefore can be more difficult to automate using continuous deployment. For example, you might need to determine how to replace or update an existing streaming pipeline. If you can't update a pipeline, or if you choose not to update it, you can use other methods such as coordinating multiple Dataflow jobs to minimize or prevent business disruption.
Identities and roles required for CI/CD
Your CI/CD pipeline interacts with different systems to build, test, and deploy pipelines. For example, your pipeline needs access to your source code repository. To enable these interactions, ensure that your pipeline has the proper identities and roles. The following pipeline activities might also require your pipeline to have specific identities and roles.
Integration testing with external services, including Google Cloud
When you use the Direct Runner for running ad hoc tests or system integration tests, your pipeline uses Application Default Credentials (ADC) to obtain credentials. How you set up ADC depends on where your pipeline is running. For more information, see Set up Application Default Credentials.
Ensure that the service account that's used to obtain credentials for Google Cloud resources accessed by the pipeline has sufficient roles and permissions.
Deploy artifacts to different deployment environments
Where possible, use unique credentials for each environment (effectively for each project) and limit access to resources accordingly.
Use unique service accounts for each project to read and write deployment artifacts to storage buckets. Depending on whether your pipeline uses a template, your deployment process might need to stage multiple artifacts. For example, creating and staging a Dataflow template requires permissions to write deployment artifacts that are needed to launch your pipeline, such as the pipeline's template file, to a Cloud Storage bucket.
Deploy pipelines to different deployment environments
Where possible, use unique service accounts for each project to access and manage Google Cloud resources within the project, including accessing Dataflow itself.
The service account that you use to create and manage Dataflow jobs needs to have sufficient IAM permissions for job management. For details, see the Dataflow service account section in the Dataflow security and permissions page.
The worker service account that you use to run Dataflow jobs needs permission to manage Compute Engine resources while the job runs and to manage the interaction between the Apache Beam pipeline and the Dataflow service. For details, see the Worker service account section in the Dataflow security and permissions page.
To specify a user-managed worker service account for a job, use the --serviceAccount pipeline option. If you don't specify a worker service account when you create a job, Dataflow attempts to use the Compute Engine default service account. We recommend instead that you specify a user-managed worker service account for production environments, because the Compute Engine default service account usually has a broader set of permissions than the permissions that are required for your Dataflow jobs.
In production scenarios, we recommend that you use separate service accounts for Dataflow job management and for the worker service account, which provides improved security compared to using a single service account. For example, the service account that's used to create Dataflow jobs might not need to access data sources and sinks or to use other resources that are used by the pipeline. In this scenario, the worker service account that's used to run Dataflow jobs is granted permissions to use pipeline resources. A different service account for job creation is granted permissions to manage (including creating) Dataflow jobs.
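As a minimal sketch, the worker service account is set through the pipeline options when the job is created, while the job-creation identity comes from the credentials of the submitting process (for example, ADC on the CI/CD system). The project and service account names are placeholders, and setServiceAccount is assumed to correspond to the --serviceAccount pipeline option.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerServiceAccountExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-production-project"); // placeholder project ID

    // Worker VMs run as this user-managed service account (equivalent to passing
    // --serviceAccount on the command line). The account name is a placeholder.
    options.setServiceAccount(
        "dataflow-worker@my-production-project.iam.gserviceaccount.com");

    // The identity that creates and manages the job is the one that submits it,
    // for example the Application Default Credentials of the CI/CD system.
  }
}
```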
Example CI/CD pipeline
The following diagram provides a general and tool-agnostic view of CI/CD for data pipelines. Additionally, the diagram shows the relationship between development tasks, deployment environments, and the pipeline runners.
The diagram shows the following stages:
Code development: During code development, a developer runs pipeline code locally using the Direct Runner. In addition, developers use a sandbox environment for ad hoc pipeline execution using the Dataflow Runner.
In typical CI/CD pipelines, the continuous integration process is triggered when a developer makes a change to the source control system, such as pushing new code to a repository.
Build and test: The continuous integration process compiles the pipeline code and then runs unit tests and transform integration tests using the Direct Runner. Optional system integration tests, which include integration testing with external sources and sinks using small test datasets, can also run.
If the tests succeed, the CI process stores the deployment artifacts that are required to launch the pipeline, which might include JAR files, Docker images, and template metadata, in locations that are accessible to the continuous delivery process. Depending on the types of deployment artifacts generated, you might use Cloud Storage and Artifact Registry to store the different artifact types.
Deliver and deploy: The continuous delivery process copies the deployment artifacts to a preproduction environment or makes these artifacts available for use within that environment. Developers can manually run end-to-end tests using the Dataflow Runner, or they can use continuous deployment to initiate the tests automatically. Typically, a continuous deployment approach is easier to enable for batch pipelines than for streaming pipelines. Because batch pipelines don't run continuously, it's easier to replace them with a new release.
The process of updating streaming pipelines might be simple or complex, and you should test updates in the preproduction environment. Update procedures might not always be consistent between releases. For example, a pipeline might change in a way that makes in-place updates impossible. For this reason, it's sometimes difficult to automate pipeline updates using continuous deployment.
If all end-to-end tests pass, you can copy the deployment artifacts or make them available to the production environment as part of the continuous delivery process. If the new pipeline updates or replaces an existing streaming pipeline, use the procedures tested in the preproduction environment to deploy the new pipeline.
Non-templated versus templated job execution
You can create a Dataflow job by using the Apache Beam SDK directly from a development environment. This type of job is called a non-templated job. Although this approach is convenient for developers, you might prefer to separate the tasks of developing and running pipelines. To make this separation, you can use Dataflow templates, which allow you to stage and run your pipelines as independent tasks. After a template is staged, other users, including non-developers, can run the jobs from the template using the Google Cloud CLI, the Google Cloud console, or the Dataflow REST API.
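A minimal sketch of a non-templated launch with the Java SDK follows; the project, region, and Cloud Storage paths are placeholders.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchNonTemplatedJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-dev-project");            // placeholder
    options.setRegion("us-central1");                // placeholder
    options.setTempLocation("gs://my-bucket/temp");  // placeholder

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("Read", TextIO.read().from("gs://my-bucket/input/*.txt"))    // placeholder path
        .apply("Write", TextIO.write().to("gs://my-bucket/output/result")); // placeholder path

    // run() submits the job to the Dataflow service directly; no template is staged.
    pipeline.run();
  }
}
```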
Dataflow offers the following types of job templates:
- Classic Templates: Developers use the Apache Beam SDK to run the pipeline code and save the JSON-serialized execution graph as the template. The Apache Beam SDK stages the template file to a Cloud Storage location, along with any dependencies that are required by the pipeline code.
- Flex Templates: Developers use the Google Cloud CLI to package the pipeline as a Docker image, which is then stored in Artifact Registry. A Flex Template spec file is also automatically generated and stored to a user-specified Cloud Storage location. The Flex Template spec file contains metadata that describes how to run the template, such as pipeline parameters.
In addition to the Flex Template features explained in the linked documentation, Flex Templates offer advantages over Classic Templates for managing templates.
- With Classic Templates, multiple artifacts, such as JAR files, might be stored in a Cloud Storage staging location, but without any features to manage these multiple artifacts. In comparison, a Flex Template is encapsulated within a Docker image. The image packages all dependencies, aside from the Flex Template spec, that are needed for your pipeline into one deployment artifact that's managed by Artifact Registry.
- You can use Docker image management features for your Flex Templates. For example, you can securely share Flex Templates by granting pull (and optionally push) permissions to Artifact Registry, and use Docker image tags for different versions of your Flex Templates.
Developers can use Classic Templates and Flex Templates to launch jobs in a project that's different from the project that owns the registry and the storage bucket that hosts the template assets, or just the storage bucket if you use Classic Templates. This feature is useful if you need to isolate data processing for multiple customers into different projects and pipeline jobs. Using Flex Templates, you can further specify different versions of a Docker image to use when you launch a pipeline. This feature simplifies phased replacement of batch pipelines or streaming pipelines over multiple projects when you update templates later.
Dataflow features for optimizing resource usage
Dataflow provides the following runner-specific features to optimize resource usage, which can improve performance and lower costs (a sketch of setting these options follows the list):
- Streaming Engine: Streaming Engine moves the execution of streaming pipelines out of VM workers and into a dedicated service. The benefits include improved autoscaling responsiveness, reductions in consumed worker VM resources, and automatic service updates without redeployment. In some cases, you can also reduce resource usage by using at-least-once processing for use cases that can tolerate duplicates. Enabling Streaming Engine is recommended for streaming pipelines. The feature is enabled by default when you use the latest versions of the Apache Beam Java SDK or the Python SDK.
- Dataflow Shuffle: Dataflow Shuffle moves shuffle operations for batch pipelines out of VM workers and into a dedicated service. The benefits include faster execution for most batch pipelines, reduced resource consumption by worker VMs, improved autoscaling responsiveness, and improved fault tolerance. Enabling Dataflow Shuffle is recommended for batch pipelines. The feature is enabled by default when you use the Apache Beam Java SDK or the latest Python SDK.
- Flexible resource scheduling (FlexRS): FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible VM instances and regular VMs.
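The following sketch shows one way these features might be requested through pipeline options in the Java SDK. It assumes that ExperimentalOptions.addExperiment and setFlexRSGoal are available in your SDK version; with recent SDKs, Streaming Engine and Dataflow Shuffle are already on by default, so the experiment flags are usually unnecessary.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ResourceOptimizationOptions {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // On older SDKs, Streaming Engine can be requested explicitly as an experiment
    // (equivalent to --experiments=enable_streaming_engine); recent SDKs enable it
    // by default for streaming jobs.
    ExperimentalOptions.addExperiment(options, "enable_streaming_engine");

    // FlexRS for batch jobs (equivalent to --flexRSGoal=COST_OPTIMIZED); the setter
    // and enum are assumed to exist on DataflowPipelineOptions.
    options.setFlexRSGoal(
        DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
  }
}
```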
Update streaming pipelines in production
See Upgrade a streaming pipeline.
Life of a Dataflow job
A Dataflow job goes through a lifecycle that's represented by various job states. To run a Dataflow job, submit it to a region. The job is then routed to an available Dataflow backend in one of the zones within the region. Before Dataflow assigns a backend, it verifies that you have sufficient quota and permissions to run the job. When these preflight checks are complete and a backend has been assigned, the job moves to a JOB_STATE_PENDING state. For FlexRS jobs, the backend assignment might be delayed to a future time, and these jobs enter a JOB_STATE_QUEUED state.
The assigned backend picks up the job to run and attempts to start Dataflow workers in your Google Cloud project. The zone that's chosen for the worker VMs depends on a number of factors. For batch jobs that use Dataflow Shuffle, the service also tries to ensure that the Dataflow backend and worker VMs are located in the same zone to avoid cross-zone traffic.
After the Dataflow workers start, they request work from the Dataflow backend. The backend is responsible for splitting the work into parallelizable chunks, called bundles, that are distributed among the workers. If the workers can't handle the existing work, and if autoscaling is enabled, the backend starts more workers in order to handle the work. Similarly, if more workers are started than are needed, some of the workers are shut down.
After the Dataflow workers start, the Dataflow backend acts as the control plane to orchestrate the job's execution. During processing, the job's data plane performs shuffle operations such as GroupByKey, CoGroupByKey, and Combine. Jobs use one of the following data-plane implementations for shuffle operations:
- The data plane runs on the Dataflow workers, and shuffle data is stored on persistent disks.
- The data plane runs as a service, externalized from the worker VMs. This implementation has two variants, which you specify when you create the job: Dataflow Shuffle for batch pipelines and Streaming Engine for streaming pipelines. The service-based shuffle significantly improves the performance and scalability of data-shuffling operations compared to the worker-based shuffle.
Streaming jobs that enter a JOB_STATE_RUNNING state continue to run indefinitely until they're cancelled or drained, unless a job failure occurs. Batch jobs automatically stop when all processing is completed or if an unrecoverable error occurs. Depending on how the job is stopped, Dataflow sets the job's status to one of multiple terminal states, including JOB_STATE_CANCELLED, JOB_STATE_DRAINED, or JOB_STATE_DONE.
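From the submitting program, the Java SDK surfaces these states through the PipelineResult returned by run(); the following sketch blocks until a batch job reaches a terminal state.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

public class WaitForTerminalState {
  // Sketch: 'pipeline' is assumed to be an already-constructed Pipeline
  // configured for the Dataflow runner.
  static void runAndReport(Pipeline pipeline) {
    PipelineResult result = pipeline.run();

    // For batch jobs, waitUntilFinish() blocks until a terminal state such as
    // DONE, FAILED, or CANCELLED is reached. Streaming jobs run until they're
    // cancelled or drained, so production launchers usually don't block on them.
    PipelineResult.State finalState = result.waitUntilFinish();
    System.out.println("Job finished in state: " + finalState);
  }
}
```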
Pipeline reliability best practices
This section discusses failures that might occur when you work with Dataflow and best practices for Dataflow jobs.
Follow isolation principles
A general recommendation to improve overall pipeline reliability is to follow the isolation principles behind regions and zones. Ensure that your pipelines don't have critical cross-region dependencies. If a pipeline has a critical dependency on services from multiple regions, a failure in any one of those regions can impact your pipeline. To help avoid this issue, deploy to multiple regions for redundancy and backup.
Create Dataflow snapshots
Dataflow offers a snapshot feature that provides a backup of a pipeline's state. You can restore the pipeline snapshot into a new streaming Dataflow pipeline in another zone or region. You can then start the reprocessing of messages in the Pub/Sub or Kafka topics starting at the snapshot timestamp. If you set up regular snapshots of your pipelines, you can minimize your Recovery Time Objective (RTO).
For more information about Dataflow snapshots, see Use Dataflow snapshots.
Handle job submission failures
You submit non-template Dataflow jobs using the Apache Beam SDK. To submit the job, you run the pipeline using the Dataflow Runner, which is specified as part of the pipeline's options. The Apache Beam SDK stages files in Cloud Storage, creates a job request file, and submits the file to Dataflow.
Alternatively, jobs created from Dataflow templates use different submission methods, which commonly rely on the templates API.
You might see different errors returned by Dataflow that indicate job failure for template and non-template jobs. This section discusses different types of job submission failures and best practices for handling or mitigating them.
Retry job submissions after transient failures
If a job fails to start due to a Dataflow service issue, retry the job a few times. Retrying mitigates transient service issues.
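For example, a launcher can wrap the submission in a bounded retry loop; the retry budget, backoff values, and the pipelineBuilder supplier below are placeholders.

```java
import java.util.function.Supplier;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

public class RetryingLauncher {
  // Sketch: builds a fresh pipeline for each attempt and retries the submission
  // a few times before giving up.
  static PipelineResult submitWithRetries(Supplier<Pipeline> pipelineBuilder)
      throws InterruptedException {
    int maxAttempts = 3;           // placeholder retry budget
    long backoffMillis = 30_000L;  // placeholder initial backoff

    RuntimeException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return pipelineBuilder.get().run(); // throws if the job can't be submitted
      } catch (RuntimeException e) {
        lastFailure = e;
        System.err.println("Submission attempt " + attempt + " failed: " + e.getMessage());
        Thread.sleep(backoffMillis);
        backoffMillis *= 2; // back off before the next attempt
      }
    }
    throw lastFailure;
  }
}
```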
Mitigate zonal failures by specifying a worker region
Dataflow provides regional availability in multiple regions. When you submit a job to a region without explicitly specifying a zone, Dataflow routes the job to a zone in the specified region based on resource availability.
The recommended option for job placement is to specify a worker region by using the --region flag instead of the --zone flag whenever possible. This step allows Dataflow to provide an additional level of fault tolerance for your pipelines by automatically choosing the best possible zone for that job. Jobs that specify an explicit zone don't have this benefit, and they fail if problems occur within the zone. If a job submission fails due to a zonal issue, you can often resolve the problem by retrying the job without explicitly specifying a zone.
Mitigate regional failures by storing data in multiple regions
If an entire region is unavailable, try the job in a different region. It's important to think about the availability of your data when jobs fail across regions. To protect against single-region failures without manually copying data to multiple regions, use Google Cloud resources that automatically store data in multiple regions. For example, use Cloud Storage dual-region and multi-region buckets. If one region becomes unavailable, you can rerun the pipeline in another region where the data is available.
For an example of using multi-regional services with Dataflow, see High availability and geographic redundancy.
Handle failures in running pipelines
After a job is submitted and has been accepted, the only valid operations forthe job are the following:
- cancel for batch jobs
- update, drain, or cancel for streaming jobs
You can't change the location of running jobs after you submit the job. If you're not using FlexRS, jobs usually start processing data within a few minutes after submission. FlexRS jobs can take up to six hours for data processing to begin.
This section discusses failures for running jobs and best practices for handling them.
Monitor jobs to identify and resolve issues caused by transient errors
For batch jobs, bundles that include a failing item are retried four times. Dataflow terminates the job when a single bundle has failed four times. This process takes care of many transient issues. However, if a prolonged failure occurs, the maximum retry limit is usually reached quickly, which allows the job to fail quickly.
For monitoring and incident management, configure alerting rules to detect failed jobs. If a job fails, inspect the job logs to identify job failures caused by failed work items that exceeded the retry limit.
For streaming jobs, Dataflow retries failed work items indefinitely. The job is not terminated. However, the job might stall until the issue is resolved. Create monitoring policies to detect signs of a stalled pipeline, such as an increase in system latency and a decrease in data freshness. Implement error logging in your pipeline code to help identify pipeline stalls caused by work items that fail repeatedly.
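For example, a DoFn can log per-element failures and divert the failing elements to a dead-letter output instead of letting them be retried indefinitely; the parse logic and element types below are placeholders.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ParseWithDeadLetter extends DoFn<String, String> {
  private static final Logger LOG = LoggerFactory.getLogger(ParseWithDeadLetter.class);

  // Output tags: successfully parsed elements and failing raw elements.
  public static final TupleTag<String> PARSED = new TupleTag<String>() {};
  public static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  @ProcessElement
  public void processElement(@Element String raw, MultiOutputReceiver out) {
    try {
      out.get(PARSED).output(parse(raw)); // parse() is a placeholder
    } catch (Exception e) {
      // Log the failure so stalled-pipeline investigations can find it, and
      // divert the element so it isn't retried indefinitely.
      LOG.error("Failed to parse element, sending to dead-letter output", e);
      out.get(DEAD_LETTER).output(raw);
    }
  }

  private static String parse(String raw) {
    return raw.trim(); // placeholder parsing logic
  }
}

// Usage sketch: apply with ParDo.of(new ParseWithDeadLetter())
//     .withOutputTags(ParseWithDeadLetter.PARSED,
//                     TupleTagList.of(ParseWithDeadLetter.DEAD_LETTER));
// then write the DEAD_LETTER collection to a sink you can inspect.
```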
Restart jobs in a different zone if a zonal outage occurs
After a job starts, the Dataflow workers that run user code are constrained to a single zone. If a zonal outage occurs, Dataflow jobs are often impacted, depending on the extent of the outage.
For outages that impact only Dataflow backends, the backends are automatically migrated to a different zone by the managed service so that they can continue the job. If the job uses Dataflow Shuffle, the backend can't be moved across zones. If a Dataflow backend migration occurs, jobs might be temporarily stalled.
If a zonal outage occurs, running jobs are likely to fail or stall until zone availability is restored. If a zone becomes unavailable for a long period, stop jobs (cancel for batch jobs and drain for streaming jobs) and then restart them to let Dataflow choose a new, healthy zone.
Restart batch jobs in a different region if a regional outage occurs
If a regional outage occurs in a region where your Dataflow jobs are running, the jobs can fail or stall. For batch jobs, restart the job in a different region if possible. It's important to ensure that your data is available in different regions.
Mitigate regional outages by using high availability or failover
For streaming jobs, depending on the fault tolerance and budget for your application, you have different options for mitigating failures. For a regional outage, the simplest and most cost-effective option is to wait until the outage ends. However, if your application is latency-sensitive, or if data processing must not be disrupted or must resume with minimal delay, the following sections discuss options.
High availability: Latency-sensitive with no data loss
If your application cannot tolerate data loss, run duplicate pipelines in parallel in two different regions, and have the pipelines consume the same data. The same data sources need to be available in both regions. The downstream applications that depend on the output of these pipelines must be able to switch between the output from these two regions. Due to the duplication of resources, this option involves the highest cost compared to other options. For an example deployment, see the section High availability and geographic redundancy.
Failover: Latency-sensitive with some potential data loss
If your application can tolerate potential data loss, make the streaming data source available in multiple regions. For example, using Pub/Sub, maintain two independent subscriptions for the same topic, one for each region. If a regional outage occurs, start a replacement pipeline in another region, and have the pipeline consume data from the backup subscription.
Replay the backup subscription to an appropriate time to keep data loss to a minimum without sacrificing latency. Downstream applications must know how to switch to the running pipeline's output, similar to the high-availability option. This option uses fewer resources than running duplicate pipelines because only the data is duplicated.
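One way to keep a single deployment artifact usable in either region is to pass the Pub/Sub subscription as a pipeline option, so a replacement job can be launched against the backup subscription; the options interface and subscription names below are hypothetical.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.values.PCollection;

public class FailoverPipeline {

  // Hypothetical options interface: the subscription is injected at launch time.
  public interface Options extends DataflowPipelineOptions {
    @Description("Pub/Sub subscription to read from, for example "
        + "projects/my-project/subscriptions/events-us-central1")
    @Validation.Required
    String getInputSubscription();

    void setInputSubscription(String value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    // During normal operation, pass the primary region's subscription; for
    // failover, launch the same artifact with the backup subscription (after
    // seeking it back to an appropriate timestamp).
    PCollection<String> messages =
        pipeline.apply("ReadEvents",
            PubsubIO.readStrings().fromSubscription(options.getInputSubscription()));

    // ... downstream transforms and sinks (placeholders) ...

    pipeline.run();
  }
}
```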
High availability and geographic redundancy
You can run multiple streaming pipelines in parallel for high-availability data processing. For example, you can run two parallel streaming jobs in different regions, which provides geographical redundancy and fault tolerance for data processing.
By considering the geographic availability of data sources and sinks, you can operate end-to-end pipelines in a highly available, multi-region configuration. The following diagram shows an example deployment.
The diagram shows the following flow:
Pub/Sub runs in most regions around the world, which lets the service offer fast, global data access, while giving you control over where messages are stored. Pub/Sub can automatically store published messages in the Google Cloud region that's nearest to subscribers, or you can configure it to use a specific region or set of regions by using message storage policies.
Pub/Sub then delivers the messages to subscribers across the world, regardless of where the messages are stored. Pub/Sub clients don't need to be aware of the server locations they're connecting to, because global load-balancing mechanisms direct traffic to the nearest Google Cloud data center where messages are stored.
Dataflow runs in specific Google Cloud regions. By running parallel pipelines in separate Google Cloud regions, you can isolate your jobs from failures that affect a single region. The diagram shows two instances of the same pipeline, each one running in a separate Google Cloud region.
The parallel Dataflow pipelines write to two separate BigQuery regions, providing geographic redundancy. Within a region, BigQuery automatically stores copies of your data in two different Google Cloud zones. For more information, see Disaster planning in the BigQuery documentation.