Best practices for Dataflow cost optimization
This document explains best practices for optimizing your Dataflow jobs with the goal of minimizing costs. It explains the factors that impact costs and provides techniques for monitoring and managing those costs.
For more information about how costs are calculated for Dataflow jobs, see Dataflow pricing.
Several factors can have a large impact on job cost:
- Runtime settings
- Pipeline performance
- Pipeline throughput requirements
The following sections provide details about how to monitor your jobs, factors that impact job cost, and suggestions for how to improve pipeline efficiency.
Define SLOs
Before you start to optimize, define your pipeline's service level objectives (SLOs), especially for throughput and latency. These requirements help you to reason about tradeoffs between cost and other factors.
- If your pipeline requires low end-to-end ingest latency, pipeline costs might be higher.
- If you need to process late-arriving data, the overall pipeline cost might be higher.
- If your streaming pipeline has data spikes that need to be processed, the pipeline might need extra capacity, which can increase costs.
Monitor jobs
To determine how to optimize your job, you first need to understand its behavior. Use the Dataflow monitoring tools to observe your pipeline as it runs. Then use this information to improve performance and efficiency.
Cost monitoring
Use the following techniques to predict and monitor costs.
- Before running the pipeline in production, run one or more smaller jobs on a subset of your data. For many pipelines, this technique can provide a cost estimate.
- Use the Cost page in the Dataflow monitoring interface to monitor the estimated cost of your jobs. The estimated cost might not reflect the actual job cost for various reasons, such as contractual discounts, but it can provide a good baseline for cost optimization. For more information, see Cost monitoring.
- Export Cloud Billing data to BigQuery and perform a cost analysis on the billing export tables. Cloud Billing export lets you automatically export detailed Google Cloud billing data throughout the day to a BigQuery dataset. Billing data includes usage, cost estimates, and pricing data.
- To avoid unexpected costs, create monitoring alerts for when your Dataflow job exceeds a threshold that you define. For more information, see Use Cloud Monitoring for Dataflow pipelines.
Job monitoring
Monitor your jobs and identify areas where you might be able to improve pipeline efficiency.
- Use the Dataflow job monitoring interface to identify problems in your pipelines. The monitoring interface shows a job graph and execution details for each pipeline. Both of these tools can help you to understand your pipeline and identify slow stages, stuck stages, or steps with too much wall time.
- Use Metrics Explorer to see detailed Dataflow job metrics. You can use custom metrics to capture performance data. The Distribution metric is particularly useful for gathering performance data.
- For CPU-intensive pipelines, use Cloud Profiler to identify the parts of the pipeline code that consume the most resources.
- Use data sampling to identify problems with your data. Data sampling lets you observe the data at each step of a Dataflow pipeline. By showing the actual inputs and outputs in a running or completed job, this information can help you to debug problems with your pipeline.
- Customize the project monitoring dashboard to show potentially expensive jobs. For more information, see Customize the Dataflow monitoring dashboard.
It's not recommended to log per-element processing metrics in high-volume pipelines, because logging is subject to limits, and excessive logging can degrade job performance.
Optimize runtime settings
The following runtime settings can affect cost:
- Whether you run a streaming job or a batch job
- What service you use to run the job, such as Streaming Engine or FlexRS
- The machine type, disk size, and number of GPUs in the worker VMs
- The autoscaling mode
- The initial number of workers and the maximum number of workers
- The streaming mode (exactly-once mode or at-least-once mode)
This section describes potential changes that you can make to optimize your job. To determine whether these suggestions are appropriate for your workload, consider your pipeline design and requirements. Not all suggestions are appropriate or helpful for all pipelines.
Before making any large-scale changes, test changes on small pipelines that use a subset of your data. For more information, see Run small experiments for large jobs in "Best practices for large batch pipelines."
Job location
Most Dataflow jobs interact with other services such as data stores and messaging systems. Consider where these services are located.
- Run your job in the same region as the resources that your job uses.
- Create your Cloud Storage bucket for storing staging and temporary job files in the same region as your job. For more information, see the gcpTempLocation and temp_location pipeline options.
Adjust machine types
The following adjustments to worker VMs might improve cost efficiency.
- Run your job with the smallest machine type required. Adjust the machine type as needed based on the pipeline requirements. For example, streaming jobs with CPU-intensive pipelines sometimes benefit from changing the machine type from the default. For more information, see Machine type.
- For memory-intensive or compute-intensive workloads, use appropriate machine types. For more information, see CoreMark scores of VMs by family.
- Set the initial number of workers. When a job scales up, work has to be redistributed to the new VMs. If you know how many workers your job needs, you might avoid this cost by setting the initial number of workers. To set the initial number of workers, use the numWorkers or num_workers pipeline option.
- Set the maximum number of workers. By setting a value for this parameter, you can potentially limit the total cost of your job. When you first test the pipeline, start with a relatively low maximum. Then increase the value until it's high enough to run a production workload. Consider your pipeline SLOs before setting a maximum. For more information, see Horizontal Autoscaling.
- Use right fitting to customize the resource requirements for specific pipeline steps.
- Some pipelines benefit from using GPUs. For more information, see GPUs with Dataflow. By using right fitting, you can configure GPUs for specific steps of the pipeline.
- Make sure you have enough network bandwidth to access data from your worker VMs, particularly when you need to access on-premises data.
Optimize settings for batch jobs
This section provides suggestions for optimizing runtime settings for batch jobs. For batch jobs, the job stages execute sequentially, which can affect performance and cost.
Use Flexible Resource Scheduling
If your batch job is not time sensitive, consider using Flexible Resource Scheduling (FlexRS). FlexRS reduces batch processing costs by finding the best time to start the job, and then using a combination of preemptible VM instances and standard VMs. Preemptible VMs are available at a much lower price than standard VMs, which can lower the total cost. By using a combination of preemptible and standard VMs, FlexRS helps to ensure that your pipeline makes progress even if Compute Engine preempts the preemptible VMs.
Avoid running very small jobs
When feasible, avoid running jobs that process very small amounts of data. If possible, run fewer jobs on larger datasets. Starting and stopping worker VMs incurs a cost, so running fewer jobs on more data can improve efficiency.
Make sure that Dataflow Shuffle is enabled. Batch jobs use Dataflow Shuffle by default.
Adjust autoscaling settings
By default, batch jobs use autoscaling. For some jobs, such as short-running jobs, autoscaling isn't needed. If you think that your pipeline doesn't benefit from autoscaling, turn it off. For more information, see Horizontal Autoscaling.
You can also use dynamic thread scaling to let Dataflow tune the thread count based on CPU utilization. Alternatively, if you know the optimal number of threads for the job, explicitly set the number of threads per worker by using the numberOfWorkerHarnessThreads or number_of_worker_harness_threads pipeline option.
Stop long-running jobs
Set your jobs to automatically stop if they exceed a predetermined run time. If you know approximately how long your job takes to run, use the max_workflow_runtime_walltime_seconds service option to automatically stop the job if it runs longer than expected.
Optimize settings for streaming jobs
This section provides suggestions for optimizing runtime settings for streaming jobs.
Use Streaming Engine
Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend for greater efficiency. It's recommended to use Streaming Engine for your streaming jobs.
Consider at-least-once mode
Dataflow supports two modes for streaming jobs: exactly-once mode and at-least-once mode. If your workload can tolerate duplicate records, then at-least-once mode can significantly reduce the cost of your job. Before you enable at-least-once mode, evaluate whether your pipeline requires exactly-once processing of records. For more information, see Set the pipeline streaming mode.
Choose your pricing model
Committed use discounts (CUDs) for Dataflow streaming jobs provide discounted prices in exchange for your commitment to continuously use a certain amount of Dataflow compute resources for a year or longer. Dataflow CUDs are useful when your spending on Dataflow compute capacity for streaming jobs involves a predictable minimum that you can commit to for at least a year. By using CUDs, you can potentially reduce the cost of your Dataflow jobs.
Also consider using resource-based billing. With resource-based billing, the Streaming Engine resources consumed by your job are metered and measured in Streaming Engine Compute Units. You're billed for worker CPU, worker memory, and Streaming Engine Compute Units.
Adjust autoscaling settings
Use autoscaling hints to tune your autoscaling settings. For more information, see Tune Horizontal Autoscaling for streaming pipelines. For streaming jobs that use Streaming Engine, you can update the autotuning settings without stopping or replacing the job. For more information, see In-flight job option update.
If you think that your pipeline doesn't benefit from autoscaling, turn it off. For more information, see Horizontal Autoscaling.
If you know the optimal number of threads for the job, explicitly set the number of threads per worker by using the numberOfWorkerHarnessThreads or number_of_worker_harness_threads pipeline option.
Stop long-running jobs
For streaming jobs, Dataflow retries failed work items indefinitely. The job is not terminated. However, the job might stall until the issue is resolved. Create monitoring policies to detect signs of a stalled pipeline, such as an increase in system latency and a decrease in data freshness. Implement error logging in your pipeline code to help identify work items that fail repeatedly.
- To monitor pipeline errors, see Worker error log count.
- To troubleshoot errors, see Troubleshoot Dataflow errors.
Pipeline performance
Pipelines that run faster might cost less. The following factors can affect pipeline performance:
- The parallelism available to your job
- The efficiency of the transforms, I/O connectors, and coders used in the pipeline
- The data location
The first step to improving pipeline performance is to understand the processing model:
- Learn about the Apache Beam model and the Apache Beam execution model.
- Learn more about the pipeline lifecycle, including how Dataflow manages parallelization and the optimization strategies it uses. Dataflow jobs use multiple worker VMs, and each worker runs multiple threads. Element bundles from a PCollection are distributed to each worker thread.
Use these best practices when you write your pipeline code:
- When possible, use the latest supported Apache Beam SDK version. Follow the release notes to understand the changes in different versions.
- Follow best practices for writing pipeline code.
- Follow I/O connector best practices.
- For Python pipelines, consider using custom containers. Pre-packaging dependencies decreases worker start-up time.
Logging
Follow these best practices when logging:
- Excessive logging can hurt performance.
- To reduce the volume of logs, consider changing the pipeline log level. For more information, see Control log volume.
- Don't log individual elements. Enable data sampling instead.
- Use a dead-letter pattern for per-element errors, instead of logging each error.
Testing
Testing your pipeline has many benefits, including helping with SDK upgrades, pipeline refactoring, and code reviews. Many optimizations, such as reworking custom CPU-intensive transforms, can be tested locally without needing to run a job on Dataflow.
Test large-scale pipelines with realistic test data for your workload, including the total number of elements for batch pipelines, the number of elements per second for streaming pipelines, the element size, and the number of keys. Test your pipelines in two modes: in a steady state, and processing a large backlog to simulate a crash recovery.
For more information about creating unit tests, integration tests, and end-to-end tests, see Test your pipeline. For examples of tests, see the dataflow-ordered-processing GitHub repository.
Last updated 2026-02-19 UTC.