Use data lineage in Dataflow
Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Each pipeline that you run by using Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to eventual destination.
When you enable data lineage for your Dataflow jobs, Dataflow captures lineage events and publishes them to the Dataplex Universal Catalog Data Lineage API.
To access lineage information through Dataplex Universal Catalog, see Use data lineage with Google Cloud Platform systems.
Before you begin
Set up your project:
- Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Dataplex, BigQuery, and Data lineage APIs. To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
In Dataflow, you also need to enable lineage at the job level. See Enable data lineage in Dataflow in this document.
Required roles
To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:
- Dataplex Catalog viewer (roles/dataplex.catalogViewer) on the Dataplex Universal Catalog resource project
- Data Lineage Viewer (roles/datalineage.viewer) on the project where you use Dataflow
- Dataflow viewer (roles/dataflow.viewer) on the project where you use Dataflow
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
For more information about data lineage roles, see Predefined roles for data lineage.
Support and limitations
Data lineage in Dataflow has the following limitations:
- Data lineage is supported in Apache Beam SDK versions 2.63.0 and later.
- You must enable data lineage on a per-job basis.
- Data capture isn't instantaneous. It can take a few minutes for Dataflow job lineage data to appear in Dataplex Universal Catalog.
The following sources and sinks are supported:
- Apache Kafka
- BigQuery (Streaming jobs in Python use the legacy STREAMING_INSERT method, which doesn't support data lineage. To use data lineage, switch to the recommended STORAGE_WRITE_API method. For more information, see Write from Dataflow to BigQuery.)
- Bigtable
- Cloud Storage
- JDBC (Java Database Connectivity)
- Pub/Sub
- Spanner (Change Stream is not supported)
Dataflow templates that use these sources and sinks also automatically capture and publish lineage events.
Enable data lineage in Dataflow
You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option as follows:
Java
--dataflowServiceOptions=enable_lineage=true
Python
--dataflow_service_options=enable_lineage=true
Go
--dataflow_service_options=enable_lineage=true
gcloud
Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.
--additional-experiments=enable_lineage=true
Optionally, you can specify one or both of the following parameters with the service option:
- process_id: A unique identifier that Dataplex Universal Catalog uses to group job runs. If not specified, the job name is used.
- process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.
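The way these parameters combine into a single service-option value can be sketched in Python. The helper name and example values here are illustrative, not part of any SDK; only the enable_lineage syntax itself comes from this document:

```python
def lineage_service_option(process_id=None, process_name=None):
    """Build the value for the enable_lineage Dataflow service option.

    Hypothetical helper: with no parameters, the option is simply
    'enable_lineage=true'. When parameters are given, they follow
    'enable_lineage=' separated by semicolons, matching the flag
    examples in this section.
    """
    params = []
    if process_id:
        params.append(f"process_id={process_id}")
    if process_name:
        params.append(f"process_name={process_name}")
    if params:
        return "enable_lineage=" + ";".join(params)
    return "enable_lineage=true"

# For the Python SDK, the value is passed through --dataflow_service_options:
flag = f"--dataflow_service_options={lineage_service_option('daily-ingest', 'Daily ingest')}"
print(flag)
# --dataflow_service_options=enable_lineage=process_id=daily-ingest;process_name=Daily ingest
```

For Java, substitute the --dataflowServiceOptions flag name; for gcloud, pass the same value through --additional-experiments.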
Specify these options as follows:
Java
--dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Python
--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Go
--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
gcloud
--additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
View lineage in Dataplex Universal Catalog
Data lineage provides information about the relations between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console in the form of a graph or a single table. You can also retrieve data lineage information from the Data Lineage API in the form of JSON data.
For more information, see Use data lineage with Google Cloud Platform systems.
Disable data lineage in Dataflow
If data lineage is enabled for a specific job and you want to disable it, cancel the existing job and run a new version of the job without the enable_lineage service option.
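Because there is no separate disable flag, resubmitting without the option is the whole mechanism. A minimal sketch, assuming the option was passed as its own --dataflow_service_options flag as in the earlier examples, of stripping it from a stored argument list before resubmitting:

```python
def drop_lineage_flag(args):
    """Return the pipeline arguments without the enable_lineage service option.

    Hypothetical helper: assumes enable_lineage was supplied as its own
    --dataflow_service_options flag, as shown earlier in this document.
    All other flags pass through unchanged.
    """
    return [
        a for a in args
        if not a.startswith("--dataflow_service_options=enable_lineage")
    ]

args = [
    "--runner=DataflowRunner",
    "--region=us-central1",
    "--dataflow_service_options=enable_lineage=true",
]
print(drop_lineage_flag(args))
# ['--runner=DataflowRunner', '--region=us-central1']
```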
Billing
Using data lineage in Dataflow doesn't impact your Dataflow bill, but it might incur additional charges on your Dataplex Universal Catalog bill. For more information, see Data lineage considerations and Dataplex Universal Catalog pricing.
What's next
- Learn more about data lineage.
- Learn how to use data lineage.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.