Dataflow overview
Dataflow is a Google Cloud service that provides unified stream and batch data processing at scale. Use Dataflow to create data pipelines that read from one or more sources, transform the data, and write the data to a destination.
Typical use cases for Dataflow include the following:
- Data movement: Data ingestion or replication across subsystems.
- ETL (extract-transform-load) workflows that ingest data into a data warehouse such as BigQuery.
- Backend support for business intelligence (BI) dashboards.
- Real-time machine learning (ML) analysis of streaming data.
- Sensor data processing or log data processing at scale.
Dataflow uses the same programming model for both batch and stream analytics. Streaming pipelines can achieve low latency. You can ingest, process, and analyze fluctuating volumes of real-time data. By default, Dataflow provides exactly-once processing of every record. For streaming pipelines that can tolerate duplicates, you can reduce cost and improve latency by enabling at-least-once mode.
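For example, with the Apache Beam Python SDK, at-least-once mode can be requested through a Dataflow service option. The following is a minimal sketch using the documented `streaming_mode_at_least_once` service option; the rest of the pipeline is elided:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Request at-least-once streaming mode through a Dataflow service option.
# The option value below is the documented service option name; project,
# region, and other required options are omitted from this sketch.
options = PipelineOptions(
    streaming=True,
    dataflow_service_options=["streaming_mode_at_least_once"],
)

with beam.Pipeline(options=options) as pipeline:
    ...  # build the streaming pipeline here
```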
Advantages for data processing with Dataflow
This section describes some of the advantages of using Dataflow.
Managed data processing
Dataflow is a fully managed service. That means Google manages all of the resources needed to run Dataflow. When you run a Dataflow job, the Dataflow service allocates a pool of worker VMs to execute the pipeline. You don't need to provision or manage these VMs. When the job completes or is cancelled, Dataflow automatically deletes the VMs. You're billed for the compute resources that your job uses. For more information about costs, see Dataflow pricing.
Scalable data pipelines
Dataflow is designed to support batch and streaming pipelines at large scale. Data is processed in parallel, so the work is distributed across multiple VMs.
Dataflow can autoscale by provisioning extra worker VMs, or by shutting down some worker VMs if fewer are needed. It also optimizes the work, based on the characteristics of the pipeline. For example, Dataflow can dynamically rebalance work among the VMs, so that parallel work completes more efficiently.
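If the defaults don't fit your workload, you can bound autoscaling through pipeline options. The following Python sketch uses the Dataflow worker options for the initial and maximum worker counts; the specific values are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Start with 2 workers and let Dataflow scale up to at most 50,
# based on pipeline throughput.
options = PipelineOptions(
    num_workers=2,                           # initial pool size
    max_num_workers=50,                      # upper bound for autoscaling
    autoscaling_algorithm="THROUGHPUT_BASED" # default scaling strategy
)
```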
Portable with Apache Beam
Dataflow is built on the open source Apache Beam project. Apache Beam lets you write pipelines using a language-specific SDK. Apache Beam supports Java, Python, and Go SDKs, as well as multi-language pipelines.
Dataflow executes Apache Beam pipelines. If you decide later to run your pipeline on a different platform, such as Apache Flink or Apache Spark, you can do so without rewriting the pipeline code.
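The following Python sketch illustrates this portability: the pipeline definition itself is runner-agnostic, and only the runner option changes. The `DirectRunner` shown here runs the pipeline locally for development:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def build(pipeline):
    # The pipeline definition says nothing about where it runs.
    (pipeline
     | beam.Create(["hello", "world"])
     | beam.Map(str.upper)
     | beam.Map(print))

# Run locally for development...
with beam.Pipeline(options=PipelineOptions(runner="DirectRunner")) as p:
    build(p)

# ...or retarget the same code to Dataflow, Flink, or Spark by swapping
# the runner option and supplying that runner's required settings;
# nothing inside build() changes.
```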
Flexible data pipeline development
You can use Dataflow for pipelines with straightforward use cases, such as just moving data. However, Dataflow is also suitable for more advanced applications, such as real-time streaming analytics. A solution built on Dataflow can grow with your needs as you move from batch to streaming or encounter more advanced use cases.
Dataflow supports several different ways to create and execute pipelines, depending on your needs:
- Write code using the Apache Beam SDKs.
- Deploy a Dataflow template. Templates let you run predefined pipelines. For example, a developer can create a template, and then a data scientist can deploy it on demand. Google also provides a library of templates for common scenarios. You can deploy these templates without knowing any Apache Beam programming concepts.
- Use JupyterLab notebooks to develop and run pipelines iteratively.
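For example, a Google-provided classic template can be launched from the command line with gcloud. The following usage sketch assumes the public Word_Count template; the job name and output bucket (my-wordcount-job, gs://my-bucket) are placeholders:

```sh
# Launch the Google-provided Word_Count classic template as a Dataflow job.
gcloud dataflow jobs run my-wordcount-job \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region us-central1 \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://my-bucket/results/output
```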
Observable data pipeline jobs
You can monitor the status of your Dataflow jobs through the Dataflow monitoring interface in the Google Cloud console. The monitoring interface includes a graphical representation of your pipeline, showing the progress and execution details of each pipeline stage. The monitoring interface makes it easier to spot problems such as bottlenecks or high latency. You can also profile your Dataflow jobs to monitor CPU usage and memory allocation.
How data pipelines work for stream and batch processing
Dataflow uses a data pipeline model, where data moves through aseries of stages. Stages can include reading data from a source, transformingand aggregating the data, and writing the results to a destination.
Pipelines can range from very basic to more complex processing. For example, a pipeline might do the following (a minimal sketch follows this list):
- Move data as-is to a destination.
- Transform data to be more usable by the target system.
- Aggregate, process, and enrich data for analysis.
- Join data with other data.
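The following Python sketch shows these stages in a minimal batch pipeline: read, transform, aggregate, and write. The Cloud Storage paths are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # Stage 1: read from a source (paths are placeholders).
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     # Stage 2: transform each record into individual words.
     | "Split" >> beam.FlatMap(str.split)
     # Stage 3: aggregate by counting occurrences of each word.
     | "Count" >> beam.combiners.Count.PerElement()
     # Stage 4: format and write the results to a destination.
     | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))
```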
A pipeline that is defined in Apache Beam does not specify how the pipeline is executed. Running the pipeline is the job of a runner. The purpose of a runner is to run an Apache Beam pipeline on a specific platform. Apache Beam supports multiple runners, including a Dataflow runner.
To use Dataflow with your Apache Beam pipelines, specify the Dataflow runner. The runner uploads your executable code and dependencies to a Cloud Storage bucket and creates a Dataflow job. Dataflow then allocates a pool of VMs to execute the pipeline.
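In the Python SDK, the runner and its required Google Cloud settings are supplied as pipeline options. The project, region, and bucket below are placeholders; substitute your own:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# All resource names here are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project-id",
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # staging area for code and temp files
)
```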
The following diagram shows a typical ETL and BI solution usingDataflow and other Google Cloud services:

This diagram shows the following stages:
- Pub/Sub ingests data from an external system.
- Dataflow reads the data from Pub/Sub and writes it to BigQuery. During this stage, Dataflow might transform or aggregate the data, as sketched after this list.
- BigQuery acts as a data warehouse, allowing data analysts to run ad hoc queries on the data.
- Looker provides real-time BI insights from the data stored in BigQuery.
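The second stage of this diagram might look like the following Python sketch. The topic, table, and schema are hypothetical placeholders, and the parsing step is deliberately minimal:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Read raw messages (bytes) from a Pub/Sub topic.
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/my-topic")
     # Transform each message into a BigQuery row dictionary.
     | "Parse" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
     # Write the rows to a BigQuery table.
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="payload:STRING"))
```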
For basic data movement scenarios, you might run a Google-provided template. Some templates support user-defined functions (UDFs) written in JavaScript. UDFs let you add custom processing logic to a template. For more complex pipelines, start with the Apache Beam SDK.
What's next
- For more information about Apache Beam, see Programming model for Apache Beam.
- Create your first pipeline by following the Job builder quickstart or Dataflow template quickstart.
- Learn how to use Apache Beam to build pipelines.