Dataflow components

The Dataflow components let you submit Apache Beam jobs to Dataflow for execution. In Dataflow, a Job resource represents a Dataflow job.

The Google Cloud Pipeline Components SDK includes the following operators for creating Job resources and monitoring their execution: DataflowFlexTemplateJobOp and DataflowPythonJobOp.

Additionally, the Google Cloud Pipeline Components SDK includes the WaitGcpResourcesOp component, which you can use to mitigate costs while running Dataflow jobs.

DataflowFlexTemplateJobOp

The DataflowFlexTemplateJobOp operator lets you create a Vertex AI Pipelines component to launch a Dataflow Flex Template.

In Dataflow, a LaunchFlexTemplateParameter resource represents a Flex Template to launch. This component creates a LaunchFlexTemplateParameter resource and then requests Dataflow to create a job by launching the template. If the template is launched successfully, Dataflow returns a Job resource.

The Dataflow Flex Template component terminates upon receiving a Job resource from Dataflow. The component outputs a job_id as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component to wait for the Dataflow job to complete.
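For illustration, a pipeline step that launches a Flex Template and then waits for the resulting job could look like the following sketch. The import paths follow the v1 namespace of the Google Cloud Pipeline Components SDK, and the template path, job name, and template parameters are placeholders; check the DataflowFlexTemplateJobOp API reference for the full set of supported arguments.

from google_cloud_pipeline_components.v1.dataflow import DataflowFlexTemplateJobOp
from google_cloud_pipeline_components.v1.wait_gcp_resources import WaitGcpResourcesOp

# Launch a Flex Template; the component returns as soon as Dataflow
# accepts the launch request and creates a Job resource.
dataflow_flex_op = DataflowFlexTemplateJobOp(
    project=project_id,
    location=location,
    container_spec_gcs_path="gs://YOUR_BUCKET/templates/YOUR_TEMPLATE.json",  # placeholder
    job_name="your-flex-template-job",                                        # placeholder
    parameters={"output": OUTPUT_FILE},                                       # template-specific parameters
)

# Block downstream steps until the Dataflow job itself finishes.
dataflow_wait_op = WaitGcpResourcesOp(
    gcp_resources=dataflow_flex_op.outputs["gcp_resources"]
)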

DataflowPythonJobOp

The DataflowPythonJobOp operator lets you create a Vertex AI Pipelines component that prepares data by submitting a Python-based Apache Beam job to Dataflow for execution.

The Python code of the Apache Beam job runs with the Dataflow Runner. When you run your pipeline with the Dataflow service, the runner uploads your executable code to the location specified by the python_module_path parameter, and dependencies to a Cloud Storage bucket (specified by temp_location), and then creates a Dataflow job that executes your Apache Beam pipeline on managed resources in Google Cloud.

To learn more about the Dataflow Runner, see Using the Dataflow Runner.

The Dataflow Python component accepts a list of arguments that are passed using the Beam Runner to your Apache Beam code. These arguments are specified by args. For example, you can use these arguments to set the apache_beam.options.pipeline_options that specify a network, a subnetwork, a customer-managed encryption key (CMEK), and other options when you run Dataflow jobs. A sketch of such an args list follows.
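The following sketch adapts the example shown later on this page: the option names (--network, --subnetwork, --dataflow_kms_key) come from Apache Beam's Dataflow pipeline options, while the import path and all option values are illustrative placeholders that you should replace with your own resources.

from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp

dataflow_python_op = DataflowPythonJobOp(
    project=project_id,
    location=location,
    python_module_path=python_file_path,
    temp_location=staging_dir,
    requirements_file_path=requirements_file_path,
    args=[
        '--output', OUTPUT_FILE,
        # Apache Beam pipeline options forwarded to the Dataflow job:
        '--network', 'YOUR_NETWORK_NAME',
        '--subnetwork', 'regions/YOUR_REGION/subnetworks/YOUR_SUBNETWORK_NAME',
        '--dataflow_kms_key',
        'projects/YOUR_PROJECT/locations/YOUR_REGION/keyRings/YOUR_KEYRING/cryptoKeys/YOUR_KEY',
    ],
)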

WaitGcpResourcesOp

Dataflow jobs can often take a long time to complete. The cost of a busy-wait container (the container that launches the Dataflow job and waits for the result) can become expensive.

After submitting the Dataflow job using the Beam runner, the DataflowPythonJobOp component terminates immediately and returns a job_id output parameter as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component to wait for the Dataflow job to complete.

dataflow_python_op = DataflowPythonJobOp(
    project=project_id,
    location=location,
    python_module_path=python_file_path,
    temp_location=staging_dir,
    requirements_file_path=requirements_file_path,
    args=['--output', OUTPUT_FILE],
)

dataflow_wait_op = WaitGcpResourcesOp(
    gcp_resources=dataflow_python_op.outputs["gcp_resources"]
)

Vertex AI Pipelines optimizes WaitGcpResourcesOp to execute it in a serverless fashion, so it incurs zero cost.

If DataflowPythonJobOp and DataflowFlexTemplateJobOp don't meet your requirements, you can also create your own component that outputs the gcp_resources parameter and pass it to the WaitGcpResourcesOp component.

For more information about how to create the gcp_resources output parameter, see Write a component to show a Google Cloud console link.
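As a minimal sketch, a custom component of this kind might look like the following, assuming you launch the Dataflow job with your own logic (omitted here) and only need to emit a gcp_resources output that WaitGcpResourcesOp can consume. The component name, the placeholder job ID, and the exact JSON field names are illustrative; confirm the serialized gcp_resources format against the guide linked above.

from kfp import dsl

@dsl.component
def launch_my_dataflow_job(
    project: str,
    location: str,
    gcp_resources: dsl.OutputPath(str),
):
    import json

    # Launch the Dataflow job with your own logic and capture its job ID.
    job_id = "YOUR_LAUNCHED_JOB_ID"  # placeholder

    # Serialize the job as a gcp_resources proto (JSON) so that
    # WaitGcpResourcesOp can poll the Dataflow job until it completes.
    job_uri = (
        f"https://dataflow.googleapis.com/v1b3/projects/{project}"
        f"/locations/{location}/jobs/{job_id}"
    )
    with open(gcp_resources, "w") as f:
        json.dump(
            {"resources": [{"resourceType": "DataflowJob", "resourceUri": job_uri}]},
            f,
        )

You can then pass the component's gcp_resources output to WaitGcpResourcesOp in the same way as in the example above.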

API reference

Tutorials

Version history and release notes

To learn more about the version history and changes to the Google Cloud Pipeline Components SDK, see the Google Cloud Pipeline Components SDK Release Notes.

Technical support contacts

If you have any questions, reach out to kubeflow-pipelines-components@google.com.
