Programming model for Apache Beam

Dataflow is based on the open-source Apache Beam project. Thisdocument describes the Apache Beam programming model, providing an overviewof its architecture and serving as a guide to its core concepts.

Apache Beam data processing overview

This section provides an overview of the Apache Beam architecture, detailinghow its components work together for efficient data processing. Apache Beamis an open-source, unified model for defining both batch and streamingpipelines. The Apache Beam programming model simplifies the mechanics oflarge-scale data processing. Using one of the Apache Beam SDKs, you build aprogram that defines the pipeline. Then, you execute the pipeline on a specificplatform such as Dataflow. This model lets you concentrate on thelogical composition of your data processing job, rather than managing theorchestration of parallel processing.

Apache Beam insulates you from the low-level details of distributedprocessing, such as coordinating individual workers, sharding datasets, andother such tasks. Dataflow fully manages these low-level details.

Apipeline is a graph of transformations that are applied to collections ofdata. In Apache Beam, a collection is called aPCollection, and atransform is called aPTransform. APCollection can be bounded or unbounded.AboundedPCollection has a known, fixed size, and can be processed using abatch pipeline. UnboundedPCollections must use a streaming pipeline, becausethe data is processed as it arrives.

Apache Beam provides connectors to read from and write to different systems,including Google Cloud services and third-party technologies such asApache Kafka.

The following diagram shows an Apache Beam pipeline.

An Apache Beam pipeline.

You can writePTransforms that perform arbitrary logic. The Apache BeamSDKs also provide a library of usefulPTransforms out of the box, includingthe following:

Filter out all elements that don't satisfy a predicate.
Apply a 1-to-1 mapping function over each element.
Group elements by key.
Count the elements in a collection
Count the elements associated with each key in a key-value collection.

To run an Apache Beam pipeline using Dataflow, perform thefollowing steps:

Use the Apache Beam SDK to define and build the pipeline. Alternatively,you can deploy a prebuilt pipeline by using a Dataflowtemplate.
Use Dataflow to run the pipeline. Dataflowallocates a pool of VMs to run the job, deploys the code to the VMs, andorchestrates running the job.
Dataflow performs optimizations on the backend to make yourpipeline run efficiently and take advantage of parallelization.
While a job is running and after it completes, use Dataflowmanagement capabilities to monitor progress and troubleshoot.

Apache Beam programming concepts

This section contains summaries of fundamental concepts.

Basic batch and streaming processing concepts

Pipelines: A pipeline encapsulates the entire series of computations that are involved inreading input data, transforming that data, and writing output data. The inputsource and output sink can be the same type or of different types, letting youconvert data from one format to another. Apache Beam programs start byconstructing aPipeline object, and then using that object as the basis forcreating the pipeline's datasets. Each pipeline represents a single, repeatablejob.
PCollection: APCollection represents a potentially distributed, multi-element dataset thatacts as the pipeline's data. Apache Beam transforms usePCollection objects as inputs and outputs for each step in your pipeline. APCollection can hold a dataset of a fixed size or an unbounded dataset from acontinuously updating data source.
Transforms: A transform represents a processing operation that transforms data. Atransform takes one or morePCollections as input, performs an operation thatyou specify on each element in that collection, and produces one or morePCollections as output. A transform can perform nearly any kind of processingoperation, including performing mathematical computations on data, convertingdata from one format to another, grouping data together, reading and writingdata, filtering data to output only the elements you want, or combining dataelements into single values.
ParDo: ParDo is the core parallel processing operation in the Apache Beam SDKs,invoking a user-specified function on each of the elements of the inputPCollection.ParDo collects the zero or more output elements into an outputPCollection. TheParDo transform processes elements independently andpossibly in parallel. The user-defined function for aParDo is called aDoFn.
Pipeline I/O: Apache Beam I/O connectors let you read data into your pipeline andwrite output data from your pipeline. An I/O connector consists of a source anda sink. All Apache Beam sources and sinks are transforms that let yourpipeline work with data from several different data storage formats. You canalso write a custom I/O connector.
Aggregation: Aggregation is the process of computing some value from multiple inputelements. The primary computational pattern for aggregation in Apache Beamis to group all elements with a common key and window. Then, it combines eachgroup of elements using an associative and commutative operation.
User-defined functions (UDFs): Some operations within Apache Beam let you execute user-defined code as away of configuring the transform. ForParDo, user-defined code specifies theoperation to apply to every element, and forCombine, it specifies how valuesshould be combined. A pipeline might contain UDFs written in a differentlanguage than the language of your runner. A pipeline might also contain UDFswritten in multiple languages.
Runner: Runners are the software that accepts a pipeline and executes it. Most runnersare translators or adapters to massively parallel big-data processing systems.Other runners exist for local testing and debugging.
Source: A transform that reads from an external storage system. A pipeline typicallyreads input data from a source. The source has a type, which may be differentfrom the sink type, so you can change the format of data as it moves through thepipeline.
Sink: A transform that writes to an external data storage system, such as a file or adatabase.
TextIO: APTransform for reading and writing text files. TheTextIO source andsink support files compressed withgzip andbzip2. TheTextIO input sourcesupports JSON. However, for the Dataflow service to be able toparallelize input and output, your source data must be delimited with a linefeed. You can use a regular expression to target specific files with theTextIO source. Dataflow supports general wildcard patterns. Yourglob expression can appear anywhere in the path. However, Dataflowdoes not support recursive wildcards (**).

Advanced batch and streaming processing concepts

Event time: The time a data event occurs, determined by the timestamp on the dataelement itself. This contrasts with the time the actual data elementgets processed at any stage in the pipeline.
Windowing: Windowing lets you group operations over unbounded collections by dividingthe collection into windows of finite collections according to the timestamps ofthe individual elements. A windowing function tells the runner how to assignelements to an initial window, and how to merge windows of grouped elements.Apache Beam lets you define different kinds of windows or use thepredefined windowing functions.
Watermarks: Apache Beam tracks a watermark, which is the system's notion of when alldata in a certain window can be expected to have arrived in the pipeline.Apache Beam tracks a watermark because data is not guaranteed to arrivein a pipeline in time order or at predictable intervals. In addition, it's notguaranteed that data events will appear in the pipeline in the same orderthat they were generated.
Trigger: Triggers determine when to emit aggregated results as data arrives. Forbounded data, results are emitted after all of the input has been processed. Forunbounded data, results are emitted when the watermark passes the end of thewindow, indicating that the system believes all input data for that window hasbeen processed. Apache Beam provides several predefined triggers and letsyou combine them.

What's next

To learn more about the basic concepts of building pipelines using theApache Beam SDKs, see theApache Beam Programming Guide in the Apache Beam documentation.
For more details about the Apache Beam capabilities supported byDataflow, see theApache Beam capability matrix.

Apache Beam® is a registered trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.

Movatterモバイル変換