Use data lineage in Dataflow

Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Each pipeline that you run by using Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to eventual destination.

When you enable data lineage for your Dataflow jobs, Dataflow captures lineage events and publishes them to the Dataplex Universal Catalog Data Lineage API.

To access lineage information through Dataplex Universal Catalog, see Use data lineage with Google Cloud Platform systems.

Before you begin

Set up your project:

  1. Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Verify that billing is enabled for your Google Cloud project.

  3. Enable the Dataplex, BigQuery, and Data lineage APIs.

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Caution: Data lineage is enabled on a per-project basis, not a per-service basis. After you enable the Data Lineage API, lineage information is automatically reported for multiple Google Cloud Platform services in the project, depending on their product-level lineage control. For more details, see Data lineage considerations.

In Dataflow, you also need to enable lineage at the job level. See Enable data lineage in Dataflow in this document.

Required roles

To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the required IAM roles for data lineage on your project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about data lineage roles, see Predefined roles for data lineage.

Support and limitations

Data lineage in Dataflow has the following limitations:

  • Data lineage is supported in the Apache Beam SDK versions 2.63.0 and later.
  • You must enable data lineage on a per-job basis.
  • Data capture isn't instantaneous. It can take a few minutes for Dataflow job lineage data to appear in Dataplex Universal Catalog.
  • The following sources and sinks are supported:

    • Apache Kafka
    • BigQuery (Streaming jobs in Python use the legacy STREAMING_INSERTS method, which doesn't support data lineage. To use data lineage, switch to the recommended STORAGE_WRITE_API method; a sketch follows this list. For more information, see Write from Dataflow to BigQuery.)
    • Bigtable
    • Cloud Storage
    • JDBC (Java Database Connectivity)
    • Pub/Sub
    • Spanner (change streams aren't supported)

    Dataflow templates that use these sources and sinks also automatically capture and publish lineage events.
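
To illustrate the BigQuery note in the list above, the following is a minimal sketch of a Python write that uses the Storage Write API method instead of the legacy streaming inserts. The table name and schema are placeholders, not values from this page; other WriteToBigQuery parameters are omitted, so check the Apache Beam documentation for any streaming-specific requirements in your SDK version.

import apache_beam as beam

# Placeholder table and schema; replace with your own resources.
TABLE = "my-project:my_dataset.my_table"

def write_events(events):
    # Writes a PCollection of dict rows to BigQuery.
    return events | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        TABLE,
        schema="event_id:STRING,payload:STRING",
        # STORAGE_WRITE_API replaces the legacy STREAMING_INSERTS method and
        # supports data lineage for streaming Python jobs.
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    )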

Enable data lineage in Dataflow

You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option as follows:

Java

--dataflowServiceOptions=enable_lineage=true

Python

--dataflow_service_options=enable_lineage=true

Go

--dataflow_service_options=enable_lineage=true

gcloud

Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.

--additional-experiments=enable_lineage=true
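
If you build pipeline options in code instead of passing flags on the command line, you can set the same service option programmatically. The following Python sketch uses placeholder project, region, and bucket values; only the dataflow_service_options entry reflects the option described on this page.

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace with your own project, region, and bucket.
options = PipelineOptions(
    flags=[],
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    # Enables data lineage capture for this Dataflow job.
    dataflow_service_options=["enable_lineage=true"],
)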

Optionally, you can specify one or both of the following parameters with the service option:

  • process_id: A unique identifier that Dataplex Universal Catalog uses to group job runs. If not specified, the job name is used.
  • process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.

Specify these options as follows:

Java

--dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Python

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Go

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

gcloud

--additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
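
Set programmatically in Python, the sub-options are appended to the same semicolon-separated value, as in the following sketch; the identifiers are placeholders. If you pass the value on a command line instead, quote it so that the shell doesn't treat the semicolon as a command separator.

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder process identifier and display name.
options = PipelineOptions(
    flags=[],
    dataflow_service_options=[
        "enable_lineage=process_id=my-process-id;process_name=My lineage process"
    ],
)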

View lineage in Dataplex Universal Catalog

Data lineage provides information about the relations between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console in the form of a graph or a single table. You can also retrieve data lineage information from the Data Lineage API in the form of JSON data.

For more information, see Use data lineage with Google Cloud Platform systems.
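
As an illustration of retrieving lineage as JSON, the following Python sketch calls the Data Lineage API's searchLinks method for a BigQuery table that a Dataflow job wrote to. The project, location, and table values are placeholders, the snippet assumes Application Default Credentials are available, and the request and response field names follow the Data Lineage API reference; consult that reference for the full set of options.

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder values; replace with your own project, location, and asset.
PROJECT = "my-project"
LOCATION = "us-central1"
TARGET_FQN = "bigquery:my-project.my_dataset.my_table"

# Authenticate with Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# searchLinks returns lineage links that reference the given asset.
url = (
    "https://datalineage.googleapis.com/v1/"
    f"projects/{PROJECT}/locations/{LOCATION}:searchLinks"
)
response = session.post(url, json={"target": {"fullyQualifiedName": TARGET_FQN}})
response.raise_for_status()

# Each link in the JSON response has source and target entity references.
for link in response.json().get("links", []):
    print(link["source"]["fullyQualifiedName"], "->", link["target"]["fullyQualifiedName"])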

Disable data lineage in Dataflow

If data lineage is enabled for a specific job and you want to disable it, cancel the existing job and run a new version of the job without the enable_lineage service option.

Billing

Using data lineage in Dataflow doesn't impact your Dataflow bill, but it might incur additional charges on your Dataplex Universal Catalog bill. For more information, see Data lineage considerations and Dataplex Universal Catalog pricing.
