Dataflow ML in ML workflows
To orchestrate complex machine learning workflows, you can create frameworks that include data pre- and post-processing steps. You might need to pre-process data before you can use it to train your model, or to post-process data to transform the output of your model.
ML workflows often contain many steps that together form a pipeline. To build your machine learning pipeline, you can use one of the following methods.
- Use an orchestration framework that has a built-in integration with Apache Beam and the Dataflow runner, such as TensorFlow Extended (TFX) or Kubeflow Pipelines (KFP). This option is the least complex.
- Build a custom component in a Dataflow template and then call the template from your ML pipeline. The call contains your Apache Beam code.
- Build a custom component to use in your ML pipeline and put the Python code directly in the component. You define a custom Apache Beam pipeline and use the Dataflow runner within the custom component, as shown in the sketch after this list. This option is the most complex and requires you to manage pipeline dependencies.
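The third option amounts to embedding an Apache Beam pipeline in your component and pointing it at the Dataflow runner. The following is a minimal sketch of that pattern; the project ID, bucket paths, and transform logic are placeholder assumptions, not values from this page.

```python
# Minimal sketch: a custom Apache Beam pipeline run on the Dataflow runner.
# my-project, my-bucket, and the Map logic are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        runner="DataflowRunner",            # execute on Dataflow, not locally
        project="my-project",               # assumed project ID
        region="us-central1",               # assumed region
        temp_location="gs://my-bucket/tmp", # assumed staging bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | "Preprocess" >> beam.Map(lambda line: line.strip().lower())
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part")
        )


if __name__ == "__main__":
    run()
```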
After you create your machine learning pipeline, you can use an orchestrator to chain together the components to create an end-to-end machine learning workflow. To orchestrate the components, you can use a managed service, such as Vertex AI Pipelines.
Use ML accelerators
For machine learning workflows that involve computationally intensive data processing, such as inference with large models, you can use accelerators with Dataflow workers. Dataflow supports using both GPUs and TPUs.
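As an illustration of the kind of workload these accelerators speed up, the following sketch runs model inference inside a pipeline with Apache Beam's RunInference API. The model class and state dict path are placeholder assumptions.

```python
# A sketch of an inference step using Apache Beam's RunInference API with a
# PyTorch model handler. MyModel and the model path are assumptions.
import apache_beam as beam
import torch
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor


class MyModel(torch.nn.Module):
    # Placeholder architecture; replace with your real model.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)


model_handler = PytorchModelHandlerTensor(
    state_dict_path="gs://my-bucket/models/model.pt",  # assumed location
    model_class=MyModel,
    model_params={},
)

with beam.Pipeline() as p:
    (
        p
        | "CreateInputs" >> beam.Create([torch.rand(10) for _ in range(4)])
        | "Infer" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )
```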
GPUs
You can use NVIDIA GPUs with Dataflow jobs to accelerate processing. Dataflow supports various NVIDIA GPU types, including the T4, L4, A100, H100, and V100. To use GPUs, you need to configure your pipeline with a custom container image that has the necessary GPU drivers and libraries installed.
For detailed information on using GPUs with Dataflow, see Dataflow support for GPUs.
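A GPU is requested through the `worker_accelerator` service option, paired with the custom container image described above. In the following sketch, the project, bucket, image name, and GPU choice are assumptions for illustration.

```python
# Sketch: attach one NVIDIA T4 per Dataflow worker via a service option.
# Project, bucket, and image values are assumed placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # assumed
    region="us-central1",                # assumed
    temp_location="gs://my-bucket/tmp",  # assumed
    # Request one T4 per worker and let Dataflow install the NVIDIA driver.
    dataflow_service_options=[
        "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
    ],
    # Custom container with the CUDA libraries your pipeline needs.
    sdk_container_image="us-docker.pkg.dev/my-project/my-repo/beam-gpu:latest",
)
```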
TPUs
Dataflow also supports Cloud TPUs, which are Google's custom-designed AI accelerators optimized for large AI models. TPUs can be a good choice for accelerating inference workloads on frameworks like PyTorch, JAX, and TensorFlow. Dataflow supports single-host TPU configurations, where each worker manages one or more TPU devices.
For more information, see Dataflow support for TPUs.
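TPUs are requested through the same `worker_accelerator` service option as GPUs. In the sketch below, the accelerator type, topology, and image values are assumptions; check the Dataflow TPU documentation for the supported combinations.

```python
# Sketch: request a single-host TPU for Dataflow workers. The exact
# accelerator type and topology strings are assumed values.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # assumed
    region="us-central1",                # assumed
    temp_location="gs://my-bucket/tmp",  # assumed
    dataflow_service_options=[
        # Assumed values: a v5e TPU slice with a 1x1 topology.
        "worker_accelerator=type:tpu-v5-lite-podslice;topology:1x1"
    ],
    # Custom container with the TPU runtime libraries for your framework.
    sdk_container_image="us-docker.pkg.dev/my-project/my-repo/beam-tpu:latest",
)
```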
Workflow orchestration
Workflow orchestration use cases are described in the following sections.
- I want to use Dataflow with Vertex AI Pipelines
- I want to use Dataflow with KFP
- I want to use Dataflow with TFX
Both TFX and Kubeflow Pipelines (KFP) use Apache Beam components.
I want to use Dataflow with Vertex AI Pipelines
Vertex AI Pipelines helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflows in a serverless manner. You can use Vertex AI Pipelines to orchestrate workflow directed acyclic graphs (DAGs) defined by either TFX or KFP and to automatically track your ML artifacts using Vertex ML Metadata. To learn how to incorporate Dataflow with TFX and KFP, use the information in the following sections.
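For orientation, the following sketch compiles a KFP pipeline definition and submits it to Vertex AI Pipelines. The pipeline body, project, region, and bucket are placeholder assumptions.

```python
# Sketch: compile a KFP pipeline and run it on Vertex AI Pipelines.
# The pipeline contents and GCP values are assumed placeholders.
from google.cloud import aiplatform
from kfp import compiler, dsl


@dsl.pipeline(name="example-ml-pipeline")
def my_pipeline():
    # Add your components here, such as a Dataflow step from the KFP
    # section below.
    pass


# Compile the pipeline DAG to a spec file.
compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path="pipeline.json",
)

# Submit the compiled pipeline to Vertex AI Pipelines.
aiplatform.init(project="my-project", location="us-central1")  # assumed
job = aiplatform.PipelineJob(
    display_name="example-ml-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",  # assumed
)
job.run()
```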
I want to use Dataflow with Kubeflow Pipelines
Kubeflow is an ML toolkit dedicated to making deployments of ML workflows on Kubernetes easier to use, portable, and scalable. Kubeflow Pipelines are reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK.
The Kubeflow Pipelines service aims to provide end-to-end orchestration and to facilitate experimentation and reuse. With KFP, you can experiment with orchestration techniques and manage your tests, and you can reuse components and pipelines to create multiple end-to-end solutions without starting over each time.
When using Dataflow with KFP, you can use the DataflowPythonJobOp operator or the DataflowFlexTemplateJobOp operator. You can also build a fully custom component. We recommend using the DataflowPythonJobOp operator.
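The following sketch shows a KFP pipeline step built with the recommended DataflowPythonJobOp operator from the Google Cloud Pipeline Components library. The module path, bucket, and project values are assumptions for illustration.

```python
# Sketch: launch a Dataflow job from a KFP pipeline with DataflowPythonJobOp,
# then block downstream steps until the job finishes. Paths are assumed.
from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp
from google_cloud_pipeline_components.v1.wait_gcp_resources import (
    WaitGcpResourcesOp,
)
from kfp import dsl


@dsl.pipeline(name="dataflow-step-example")
def dataflow_pipeline():
    dataflow_task = DataflowPythonJobOp(
        project="my-project",                                     # assumed
        location="us-central1",                                   # assumed
        python_module_path="gs://my-bucket/src/my_beam_job.py",   # assumed
        temp_location="gs://my-bucket/tmp",                       # assumed
        requirements_file_path="gs://my-bucket/src/requirements.txt",
        args=["--output", "gs://my-bucket/output"],
    )
    # Wait for the launched Dataflow job to complete before continuing.
    WaitGcpResourcesOp(gcp_resources=dataflow_task.outputs["gcp_resources"])
```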
If you want to build a fully custom component, see the Dataflow components page in the Vertex AI documentation.
I want to use Dataflow with TFX
TFX pipeline components are built on TFX libraries, and the data processing libraries use Apache Beam directly. For example, TensorFlow Transform translates the user's calls to Apache Beam. Therefore, you can use Apache Beam and Dataflow with TFX pipelines without needing to do extra configuration work. To use TFX with Dataflow, when you build your TFX pipeline, use the Dataflow runner. For more information, see the TFX documentation.
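In practice, you select the Dataflow runner by passing Beam pipeline arguments when you construct the TFX pipeline, as in the following sketch. The project, bucket, and component list are placeholder assumptions.

```python
# Sketch: point a TFX pipeline's Beam-based steps at the Dataflow runner
# through beam_pipeline_args. GCP values are assumed placeholders.
from tfx.orchestration import pipeline

my_pipeline = pipeline.Pipeline(
    pipeline_name="example-tfx-pipeline",
    pipeline_root="gs://my-bucket/tfx-root",  # assumed
    components=[],  # your ExampleGen, Transform, Trainer, and so on
    beam_pipeline_args=[
        "--runner=DataflowRunner",             # run Beam steps on Dataflow
        "--project=my-project",                # assumed
        "--region=us-central1",                # assumed
        "--temp_location=gs://my-bucket/tmp",  # assumed
    ],
)
```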