Plan your Dataflow pipeline
This page explains important considerations for planning your data pipeline before you begin code development. Data pipelines move data from one system to another and are often critical components of business information systems. The performance and reliability of your data pipeline can impact these broader systems and how effectively your business requirements are met.
If you plan your data pipelines before you develop them, you can improve their performance and reliability. This page explains various planning considerations for Dataflow pipelines, including:
- Performance expectations for your pipelines, including standards for measurability
- Integration of your pipelines with data sources, sinks, and other connected systems
- Regionalization of pipelines, sources, and sinks
- Security, such as data encryption and private networking
Define and measure SLOs
An important measure of performance is how well your pipeline meets your business requirements. Service level objectives (SLOs) provide tangible definitions of performance that you can compare against acceptable thresholds. For example, you might define the following example SLOs for your system:
- Data freshness: generate 90% of product recommendations from user website activity that occurred no later than 3 minutes ago.
- Data correctness: within a calendar month, less than 0.5% of customer invoices contain errors.
- Data isolation/load balancing: within a business day, process all high-priority payments within 10 minutes of lodgement, and complete standard-priority payments by the next business day.
You can use service level indicators (SLIs) to measure SLO compliance. SLIs are quantifiable metrics that indicate how well your system is meeting a given SLO. For example, you can measure the example data-freshness SLO by using the age of the most recently processed user activity as an SLI. If your pipeline generates recommendations from user activity events, and if your SLI reports a 4-minute delay between the event time and the time the event is processed, the recommendations don't reflect a user's website activity from the most recent 4 minutes. If a pipeline that processes streaming data exceeds a system latency of 4 minutes, you know that the SLO is not met.
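As a sketch of how you might capture such an SLI in your own pipeline code, the following Java snippet records the gap between each element's event timestamp and its processing time as a custom Apache Beam Distribution metric, assuming your pipeline assigns event timestamps to elements. The transform and metric names are illustrative; Dataflow also exposes built-in data freshness metrics that you can use directly.

```java
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

/** Hypothetical pass-through transform that measures event-to-processing delay. */
public class RecordFreshnessFn<T> extends DoFn<T, T> {

  // Distribution metric that accumulates the delay, in milliseconds, between
  // each element's event timestamp and the time it is processed.
  private final Distribution freshnessMillis =
      Metrics.distribution(RecordFreshnessFn.class, "event_to_processing_delay_ms");

  @ProcessElement
  public void processElement(ProcessContext c) {
    long delayMillis = new Duration(c.timestamp(), Instant.now()).getMillis();
    freshnessMillis.update(delayMillis);
    c.output(c.element());
  }
}
```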
Because system components beyond your pipeline affect your SLO, it's important to capture a range of SLIs that describe the overall performance of the system beyond the performance of the pipeline itself, including metrics that describe the end-to-end health of your system. For example, your Dataflow pipeline might compute results with acceptable delay, but a performance issue might occur with a downstream system that impacts wider SLOs.
For more information about important SLOs to consider, see the Site Reliability Engineering book.
Data freshness
Data freshness refers to the usability of data in relation to its age. The following data freshness SLOs are mentioned in the Site Reliability Engineering book as the most common pipeline data freshness SLO formats:
- X% of data processed in Y [seconds, days, minutes]. This SLO refers to the percentage of data that's processed in a given period of time. It is commonly used for batch pipelines that process bounded data sources. The metrics for this type of SLO are the input and output data sizes at key processing steps relative to the elapsed pipeline runtime. You can choose a step that reads an input dataset and another step that processes each item of the input. An example SLO is "For the Shave the Yak game, 99% of user activities that impact players' scores are accounted for within 30 minutes of match completion."
- The oldest data is no older than X [seconds, days, minutes]. This SLO refers to the age of data produced by the pipeline. It is commonly used for streaming pipelines that process data from unbounded sources. For this type of SLO, use metrics that indicate how long your pipeline takes to process data. Two possible metrics are the age of the oldest unprocessed item, that is, how long an unprocessed item has been in the queue, or the age of the most recently processed item. An example SLO is "Product recommendations are generated from user activity that is no older than 5 minutes."
- The pipeline job completes successfully within X [seconds, days, minutes]. This SLO sets a deadline for successful completion and is commonly used for batch pipelines that process data from bounded data sources. This SLO requires the total pipeline-elapsed time and job-completion status, in addition to other signals that indicate the success of the job, such as the percentage of processed elements that result in errors. An example SLO is "Customer orders from the current business day are processed by 9 AM the next day."
For information about using Cloud Monitoring to measure data freshness, see Dataflow job metrics.
Data correctness
Data correctness refers to data being free of errors. You can determine data correctness through different means, including:
- Unit tests and integration tests, which you can automate by using continuous integration.
- End-to-end pipeline tests, which you can run in a preproduction environment after the pipeline successfully passes unit and integration tests. You can automate end-to-end pipeline tests by using continuous delivery.
- Pipelines running in production, when you use monitoring to observe metrics related to data correctness.
For running pipelines, defining a data correctness target usually involves measuring correctness over a period of time. For example:
- On a per-job basis, less than X% of input items contain data errors. You can use this SLO to measure data correctness for batch pipelines. An example SLO is "For each daily batch job to process electricity meter readings, less than 3% of readings contain data entry errors."
- Over an X-minute moving window, less than Y% of input items contain data errors. You can use this SLO to measure data correctness for streaming pipelines. An example SLO is "Less than 2% of electricity meter readings over the last hour contain negative values."
To measure these SLOs, use metrics over a suitable period of time to accumulate the number of errors by type. Examples of error types include data that's incorrect because of a malformed schema and data that's outside a valid range.
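As an illustrative sketch, the following Java DoFn accumulates error counts by type using Apache Beam Counter metrics, which you can then monitor over the SLO's measurement window. The element type, field names, and validation rules are hypothetical.

```java
import java.io.Serializable;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical element type used for illustration. */
class MeterReading implements Serializable {
  Double value;
}

/** Validation step that accumulates error counts by type as custom metrics. */
public class ValidateReadingFn extends DoFn<MeterReading, MeterReading> {

  private final Counter malformedSchema =
      Metrics.counter("data_correctness", "malformed_schema");
  private final Counter outOfRange =
      Metrics.counter("data_correctness", "out_of_range");

  @ProcessElement
  public void processElement(ProcessContext c) {
    MeterReading reading = c.element();
    if (reading == null || reading.value == null) {
      malformedSchema.inc(); // Missing or unparseable field.
      return;                // In a real pipeline, route to a dead-letter output.
    }
    if (reading.value < 0) {
      outOfRange.inc();      // Negative readings violate the valid range.
      return;
    }
    c.output(reading);
  }
}
```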
For information about using Cloud Monitoring to measure data correctness, see Dataflow job metrics.
Data isolation and load balancing
Data isolation involves segmenting data by attribute, which can make load balancing easier. For example, in an online payment-processing platform, you can segment data so that individual payments are either standard priority or high priority. Your pipeline can then use load balancing to ensure that high-priority payments are processed before standard-priority payments.
Imagine that you define the following SLOs for payment processing:
- High-priority payments are processed within 10 minutes.
- Standard-priority payments are processed by 9 AM the next business day.
If the payment platform complies with these SLOs, customers are able to see finalized high-priority payments on a reporting dashboard as they are completed. In contrast, standard payments might not appear on the dashboard until the next day.
In this example scenario, you have the following options:
- Run a single pipeline to process both standard-priority and high-priority payments.
- Isolate and load-balance the data based on priorities across multiple pipelines.
The following sections describe each option in detail.
Use a single pipeline to deliver against mixed SLOs
The following diagram illustrates a single pipeline that's used to process both high-priority and standard-priority payments. The pipeline receives notification of new payments from a streaming data source, such as a Pub/Sub topic or an Apache Kafka topic. It then immediately processes payments and writes events to BigQuery using streaming inserts.
The advantage of a single pipeline is that it simplifies your operational requirements, because you need to manage only a single data source and pipeline. Dataflow uses automatic tuning features to help run your job as quickly and efficiently as possible.
A disadvantage of a single pipeline is that the shared pipeline can't prioritize high-priority payments over standard-priority payments, and pipeline resources are shared across both payment types. In the business scenario described previously, your pipeline must maintain the more stringent of the two SLOs. That is, the pipeline must use the SLO for high-priority payments regardless of the actual priority of the processed payments. Another disadvantage is that in the event of a work backlog, the streaming pipeline is unable to prioritize backlog processing according to the urgency of work.
Use multiple pipelines tailored for specific SLOs
You can use two pipelines to isolate resources and deliver against specific SLOs. The following diagram illustrates this approach.
High-priority payments are isolated to a streaming pipeline for prioritized processing. Standard-priority payments are processed by a batch pipeline that runs daily and that uses BigQuery load jobs to write processed results.
Isolating data in different pipelines has advantages. To deliver high-priority payments against tighter SLOs, you can shorten processing times by assigning more resources to the pipeline dedicated to high-priority payments. Resource configurations include adding Dataflow workers, using larger machines, and enabling autoscaling. Isolating high-priority items to a separate processing queue can also mitigate processing delays if a sudden influx of standard-priority payments occurs.
When you use multiple pipelines to isolate and load-balance the data from batch and streaming sources, the Apache Beam programming model allows the high-priority (streaming) and standard-priority (batch) pipelines to share the same code. The only exception is the initial read transform, which reads from a bounded source for the batch pipeline. For more information, see Create libraries of reusable transforms.
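As a minimal sketch of this pattern, the following Java snippet reuses a single hypothetical ProcessPayments transform in both pipelines; only the initial read differs (an unbounded Pub/Sub read for the streaming pipeline and a bounded Cloud Storage read for the batch pipeline). The subscription and file paths are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

public class PaymentPipelines {

  /** Shared transform that holds the payment-processing logic for both pipelines. */
  static class ProcessPayments extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> payments) {
      // Parsing, validation, enrichment, and write steps would go here.
      return payments;
    }
  }

  /** Streaming pipeline for high-priority payments: unbounded Pub/Sub read. */
  static void buildHighPriorityPipeline(Pipeline p) {
    p.apply("ReadHighPriority",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/high-priority-payments"))
        .apply("Process", new ProcessPayments());
  }

  /** Batch pipeline for standard-priority payments: bounded Cloud Storage read. */
  static void buildStandardPriorityPipeline(Pipeline p) {
    p.apply("ReadStandardPriority",
            TextIO.read().from("gs://my-bucket/standard-priority/*.json"))
        .apply("Process", new ProcessPayments());
  }
}
```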
Plan for data sources and sinks
To process data, a data pipeline needs to be integrated with other systems. Those systems are referred to as sources and sinks. Data pipelines read data from sources and write data to sinks. In addition to sources and sinks, data pipelines might interact with external systems for data enrichment, filtering, or calling external business logic within a processing step.
For scalability, Dataflow runs the stages of your pipeline in parallel across multiple workers. Factors that are outside your pipeline code and the Dataflow service also impact the scalability of your pipeline. These factors might include the following:
- Scalability of external systems: external systems that your pipeline interacts with can constrain performance and can form the upper bound of scalability. For example, an Apache Kafka topic configured with an insufficient number of partitions for the read throughput that you need can affect your pipeline's performance. To help ensure that the pipeline and its components meet your performance targets, refer to the best practices documentation for the external systems that you're using. You can also simplify infrastructure capacity planning by using Google Cloud services that provide built-in scalability. For more information, see Use Google Cloud managed sources and sinks on this page.
- Choice of data formats: certain data formats might be faster to read than others. For example, using data formats that support parallelizable reads, such as Avro, is usually faster than using CSV files that have embedded newlines in fields, and is faster than using compressed files.
- Data location and network topology: the geographic proximity and networking characteristics of data sources and sinks in relation to the data pipeline might impact performance. For more information, see Regional considerations on this page.
External service calls
Calling external services from your pipeline incurs per-call overhead that can decrease the performance and efficiency of your pipeline. If your data pipeline calls external services, batch multiple data elements into single requests where possible to reduce this overhead. Many native Apache Beam I/O transforms automatically perform this task, including BigQueryIO and streaming insert operations. Aside from capacity limits, some external services also enforce quotas that limit the total number of calls over a period of time, such as a daily quota, or restrict the rate of calling, such as the number of requests per second.
Because Dataflow parallelizes work across multiple workers, too much traffic can overwhelm an external service or exhaust available quotas. When autoscaling is used, Dataflow might attempt to compensate by adding workers to run a slow step like an external call. Adding workers can add further pressure on external systems. Ensure that external systems can support your anticipated load requirements, or limit the traffic from your pipeline to sustainable levels. For more information, see Limit batch sizes and concurrent calls to external services.
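One way to batch elements before calling an external service is the Apache Beam GroupIntoBatches transform, shown in the following Java sketch. The key count, batch size, and the external client are illustrative assumptions; tune them to what the service can sustain.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BatchedExternalCalls {

  /** Sends one request per batch of elements instead of one request per element. */
  static class CallExternalServiceFn extends DoFn<KV<Integer, Iterable<String>>, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Iterable<String> batch = c.element().getValue();
      // externalClient.sendBatch(batch) would go here (hypothetical client).
      for (String element : batch) {
        c.output(element);
      }
    }
  }

  static PCollection<String> callService(PCollection<String> input) {
    return input
        // Assign a small number of keys so batches can form on parallel workers.
        .apply("KeyByShard",
            WithKeys.of((String element) -> Math.floorMod(element.hashCode(), 10))
                .withKeyType(TypeDescriptors.integers()))
        // Group up to 100 elements per request to reduce per-call overhead.
        .apply("Batch", GroupIntoBatches.<Integer, String>ofSize(100))
        .apply("CallService", ParDo.of(new CallExternalServiceFn()));
  }
}
```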
Use Google Cloud managed sources and sinks
Using Google Cloud managed services with your Dataflow pipeline removes the complexity of capacity management by providing built-in scalability, consistent performance, and quotas and limits that accommodate most requirements. You still need to be aware of different quotas and limits for pipeline operations. Dataflow itself imposes quotas and limits. You can increase some of these by contacting Google Cloud support.
Dataflow uses Compute Engine VM instances to run your jobs, so you need sufficient Compute Engine quota. Insufficient Compute Engine quota can hinder pipeline autoscaling or prevent jobs from starting.
The remaining parts of this section explore how different Google Cloud quotas and limits might influence how you design, develop, and monitor your pipeline. Pub/Sub and BigQuery are used as examples of pipeline sources and sinks.
Example 1: Pub/Sub
When you use Pub/Sub with Dataflow, Pub/Sub provides a scalable and durable event ingestion service for delivering messages to and from your streaming data pipelines. You can use the Google Cloud console to view Pub/Sub quota consumption and increase quota limits. We recommend that you request a quota increase if you have any single pipeline that exceeds the per-project quotas and limits.
Pub/Sub quotas and limits are designed around project-level usage. Specifically, publishers and subscribers in different projects are given independent data-throughput quotas. If multiple pipelines publish or subscribe to a single topic, you can get maximum allowable throughput on that topic by deploying each pipeline into its own project. In this configuration, each pipeline uses a different project-based service account to consume and publish messages.
In the following diagram, Pipeline 1 and Pipeline 2 share the same subscriber and publisher throughput quota that's available to Project A. In contrast, Pipeline 3 can use the entire subscriber and publisher throughput quota that's attached to Project B.
Multiple pipelines can read from a single Pub/Sub topic by using separate subscriptions to the topic, which allows Dataflow pipelines to pull and acknowledge messages independently of other subscribers, such as other pipelines. This feature makes it easy to clone pipelines by creating additional Pub/Sub subscriptions. Creating additional subscriptions is useful for creating replica pipelines for high availability (typically for streaming use cases), for running additional test pipelines against the same data, and for enabling pipeline updates.
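A common way to implement this is to make the subscription a pipeline option, so that each clone (replica, test pipeline, or updated version) is launched against its own subscription to the shared topic. The following Java interface is a hypothetical sketch of such an option:

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

/** Hypothetical options: each pipeline instance reads through its own subscription. */
public interface EventPipelineOptions extends PipelineOptions {

  @Description(
      "Pub/Sub subscription to read from, for example "
          + "projects/my-project/subscriptions/events-replica-1")
  @Validation.Required
  String getInputSubscription();

  void setInputSubscription(String value);
}
```

Each pipeline instance can then pass its own subscription to PubsubIO, for example with PubsubIO.readStrings().fromSubscription(options.getInputSubscription()).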
Example 2: BigQuery
Reading and writing BigQuery data is supported by the Apache Beam SDK for multiple languages, including Java, Python, and Go. When you use Java, the BigQueryIO class provides this functionality. BigQueryIO supports two methods for reading data: EXPORT (table export) and DIRECT_READ. The different read methods consume different BigQuery quotas.
Table export is the default read method. It works as shown in the following diagram:
The diagram shows the following flow:
- BigQueryIO invokes a BigQuery export request to export table data. The exported table data is written to a temporary Cloud Storage location.
- BigQueryIO reads the table data from the temporary Cloud Storage location.
BigQuery export requests are limited by export quotas. The export request must also complete before the pipeline can start processing data, which adds run time to the job.
In contrast, the direct read approach uses the BigQuery Storage API to read table data directly from BigQuery. The BigQuery Storage API provides high-throughput read performance for table row data using gRPC. Using the BigQuery Storage API makes the export step unnecessary, which avoids export quota restrictions and potentially decreases job run time.
The following diagram shows the flow if you use the BigQuery Storage API. In contrast to the flow that uses a BigQuery export request, this flow is simpler, because it has only a single direct-read step to get the data from the table to the pipeline.
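In Java, you select the read method on BigQueryIO, as in the following sketch (the table reference is a placeholder):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.values.PCollection;

public class DirectReadExample {

  static PCollection<TableRow> readOrders(Pipeline pipeline) {
    return pipeline.apply(
        "ReadOrders",
        BigQueryIO.readTableRows()
            .from("my-project:sales.orders") // Placeholder table reference.
            // DIRECT_READ uses the BigQuery Storage API and skips the
            // intermediate export to Cloud Storage.
            .withMethod(Method.DIRECT_READ));
  }
}
```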
Writing data to BigQuery tables also has its own quota implications. Batch pipelines that use BigQuery load jobs consume different BigQuery load job quotas that apply at the table and project level. Similarly, streaming pipelines that use BigQuery streaming inserts consume BigQuery streaming insert quotas.
To determine the most appropriate methods to read and write data, consider your use case. For example, avoid using BigQuery load jobs to append data thousands of times per day into a table. Use a streaming pipeline to write near real-time data to BigQuery. Your streaming pipeline should use either streaming inserts or the Storage Write API for this purpose.
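The following Java sketch writes an unbounded PCollection to BigQuery with the Storage Write API. The table reference, triggering frequency, and stream count are illustrative assumptions; with unbounded input, the Storage Write API method typically requires both of the latter settings.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class StreamingWriteExample {

  static void writeEvents(PCollection<TableRow> rows) {
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:analytics.events") // Placeholder table reference.
            // The Storage Write API avoids streaming insert quotas and
            // provides exactly-once, high-throughput writes.
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            .withTriggeringFrequency(Duration.standardSeconds(5))
            .withNumStorageWriteApiStreams(3)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```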
Regional considerations
Dataflow is offered as a managed service in multiple Google Cloud regions. When choosing a region in which to run your jobs, consider the following factors:
- The location of data sources and sinks
- Preferences or restrictions on data processing locations
- Dataflow features that are offered only in specific regions
- The region that's used to manage execution of a given job
- The zone that's used for the job's workers
For a given job, the region setting that you use for the job and for the workers can differ. For more information, including when to specify regions and zones, see the Dataflow regions documentation.
By specifying regions to run your Dataflow jobs, you can plan around regional considerations for high availability and disaster recovery. For more information, see High availability and geographic redundancy.
Regions
Dataflow regions store and handle metadata relating to your job, such as information about the Apache Beam graph itself, like transform names. They also control worker behaviors such as autoscaling. Specifying a region helps you meet your needs for security and compliance, data locality, and the regional placement of a job. To avoid performance impact from cross-region network calls, we recommend that you use the same region for the job and for the workers when possible.
Dataflow workers
Dataflow jobs use Compute Engine VM instances, called Dataflow workers, to run your pipeline. Dataflow jobs can use any Compute Engine zone for workers, including regions where there are no Dataflow locations. By specifying a worker region for your job, you can control the regional placement of your workers. To specify a worker region or zone, do the following:
- If you use the gcloud CLI to create a job from a Dataflow template, use the --worker-region flag to override the worker region, or use the --worker-zone flag to override the worker zone.
- If you use the Apache Beam Java SDK to create your job, set regions and zones for workers by using pipeline options. Use workerRegion to override the worker region or workerZone to override the worker zone, as shown in the sketch after this list.
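For example, the following Java snippet sets both the job region and a separate worker region through the Dataflow pipeline options; the region names are placeholders.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RegionOptionsExample {

  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

    // Region that manages the job (placeholder value).
    options.setRegion("europe-west1");
    // Place workers in a different region, close to the data sources and sinks.
    options.setWorkerRegion("europe-west4");
    // Alternatively, pin workers to a single zone instead of a region:
    // options.setWorkerZone("europe-west4-b");
  }
}
```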
To improve network latency and throughput, we recommend that you create workers in a region that's geographically close to your data sources and sinks. If you don't specify a region or zone for workers when you create a job, Dataflow automatically defaults to a zone that's in the same region as the job.
If you don't use the Dataflow Shuffle service or Streaming Engine, the data that's processed by the job (that is, data stored in any PCollection object) resides on the job's workers, assuming that no user code transmits data outside the workers. If either the Dataflow Shuffle service or Streaming Engine is enabled, the distributed dataset represented by a PCollection object can be transmitted between the workers and these services.
Data encryption considerations
As a fully managed service, Dataflow automatically encrypts data that moves through your data pipeline using Google-owned and Google-managed encryption keys for both in-flight data and at-rest data. Instead of using Google-owned and Google-managed encryption keys, you might prefer to manage your own encryption keys. In that case, Dataflow supports customer-managed encryption keys (CMEK) using the Cloud Key Management Service (KMS). You can also use Cloud HSM, a cloud-hosted hardware security module (HSM) service that allows you to host encryption keys and perform cryptographic operations in a cluster of FIPS 140-2 Level 3 certified HSMs.
When you use CMEK, Dataflow uses your Cloud KMS key to encrypt the data, except for data-key-based operations such as windowing, grouping, and joining. If data keys contain sensitive data, such as personally identifiable information (PII), you must hash or otherwise transform the keys before they enter the Dataflow pipeline.
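In Java, you supply the CMEK through the dataflowKmsKey pipeline option, as in the following sketch; the project, location, key ring, and key names are placeholders.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CmekOptionsExample {

  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

    // Customer-managed key used to protect the job's data at rest
    // (placeholder project, location, key ring, and key names).
    options.setDataflowKmsKey(
        "projects/my-project/locations/europe-west1/keyRings/my-key-ring/cryptoKeys/my-key");
  }
}
```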
Private networking considerations
Your networking and security requirements might mandate that VM-based workloads such as Dataflow jobs use only private IP addresses. Dataflow lets you specify that workers use private IP addresses for all network communication. If public IPs are disabled, you must enable Private Google Access on the subnetwork so that Dataflow workers can reach Google APIs and services.
We recommend that you disable public IPs for Dataflow workers, unless your Dataflow jobs require public IPs to access network resources outside of Google Cloud. Disabling public IPs prevents Dataflow workers from accessing resources that are outside the subnetwork or from accessing peer VPC networks. Similarly, network access to VM workers from outside the subnetwork or peer VPC networks is prevented.
For more information about using the --usePublicIps pipeline option to specify whether workers should have only private IPs, see Pipeline options.
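For example, with the Java SDK you can turn off public IPs and attach workers to a specific subnetwork through the Dataflow pipeline options; the subnetwork path below is a placeholder, and the subnetwork must have Private Google Access enabled.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PrivateIpOptionsExample {

  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

    // Workers receive only private IP addresses.
    options.setUsePublicIps(false);
    // Subnetwork the workers attach to (placeholder path); it needs
    // Private Google Access so workers can reach Google APIs and services.
    options.setSubnetwork(
        "https://www.googleapis.com/compute/v1/projects/my-project/regions/europe-west1/subnetworks/my-subnet");
  }
}
```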
What's next
- Develop and test your pipeline.
- Learn about designing complete pipeline workflows.
- Read more about Google's Site Reliability Engineering (SRE) practices for data processing pipelines.