Use data lineage with Serverless for Apache Spark
This document describes how to enable data lineage on Google Cloud Serverless for Apache Spark batch workloads and interactive sessions at the project, batch workload, or interactive session level.
Overview
Data lineage is a Dataplex Universal Catalog feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Google Cloud Serverless for Apache Spark workloads and sessions capture lineage events and publish them to the Dataplex Universal Catalog Data Lineage API. Serverless for Apache Spark integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.
You can access lineage information through Dataplex Universal Catalog, using lineage graphs and the Data Lineage API. For more information, see View lineage graphs in Dataplex Universal Catalog.
Availability
Data lineage, which supports BigQuery and Cloud Storage data sources, is available for workloads and sessions that run with supported Serverless for Apache Spark runtime versions, with the following exceptions and limitations:
- Data lineage is not available for SparkR or Spark streaming workloads or sessions.
Before you begin
On the project selector page in the Google Cloud console, select the project to use for your Serverless for Apache Spark workloads or sessions.
Enable the Data Lineage API.
Upcoming Spark data lineage changes: See the Serverless for Apache Spark release notes for the announcement of a change that will automatically make Spark data lineage available to your projects, batch workloads, and interactive sessions when you enable the Data Lineage API (see Control lineage ingestion for a service), without requiring additional project, batch workload, or interactive session settings.
Required roles
If your batch workload uses the default Serverless for Apache Spark service account, it has the Dataproc Worker role, which contains the permissions required by data lineage.
However, if your batch workload uses a custom service account, you must grant that service account one of the roles listed in the following paragraph. Each of these roles contains the permissions required by data lineage.
To get the permissions that you need to use data lineage with Dataproc, ask your administrator to grant you the following IAM roles on your batch workload custom service account:
- Grant one of the following roles:
  - Dataproc Worker (roles/dataproc.worker)
  - Data Lineage Editor (roles/datalineage.editor)
  - Data Lineage Producer (roles/datalineage.producer)
  - Data Lineage Administrator (roles/datalineage.admin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
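To check whether a custom service account already holds one of the listed roles, you can inspect the project's IAM policy. The following is a minimal sketch, assuming the policy has been exported as JSON (for example, with `gcloud projects get-iam-policy PROJECT_ID --format=json`). The function name `has_lineage_role`, the sample policy, and the service account addresses are hypothetical. The sketch looks only at direct bindings in the given policy; it does not evaluate IAM conditions, group membership, inherited policies, or custom roles.

```python
import json

# Predefined roles that this document lists as sufficient for data lineage.
LINEAGE_ROLES = frozenset({
    "roles/dataproc.worker",
    "roles/datalineage.editor",
    "roles/datalineage.producer",
    "roles/datalineage.admin",
})


def has_lineage_role(policy: dict, service_account_email: str) -> bool:
    """Return True if the policy directly binds the account to a listed role."""
    member = f"serviceAccount:{service_account_email}"
    return any(
        binding.get("role") in LINEAGE_ROLES
        and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


# Hypothetical policy, shaped like an exported IAM policy in JSON form.
policy = json.loads("""
{
  "bindings": [
    {"role": "roles/datalineage.producer",
     "members": ["serviceAccount:spark-sa@example-project.iam.gserviceaccount.com"]},
    {"role": "roles/viewer",
     "members": ["serviceAccount:other-sa@example-project.iam.gserviceaccount.com"]}
  ]
}
""")

print(has_lineage_role(policy, "spark-sa@example-project.iam.gserviceaccount.com"))
print(has_lineage_role(policy, "other-sa@example-project.iam.gserviceaccount.com"))
```

A `False` result from this check is only a hint: the account might still have the permissions through a path the sketch does not inspect.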
Enable Spark data lineage
You can enable Spark data lineage for your project, batch workload, or interactive session.
Enable data lineage at the project level
After you enable Spark data lineage at the project level, subsequent Spark jobs that run in a batch workload or interactive session will have Spark data lineage enabled.
Note: Setting data lineage on a project does not enable data lineage for batch workloads and interactive sessions that use the Serverless for Apache Spark 3.0 runtime. You must separately enable data lineage on the 3.0 batch workload, interactive session, or session template.
To enable Spark data lineage on your project, set the following custom project metadata:
| Key | Value |
|---|---|
| DATAPROC_LINEAGE_ENABLED | true |
You can disable Spark data lineage for a project by setting the DATAPROC_LINEAGE_ENABLED metadata to false.
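One way to set this key from a script is to call the gcloud CLI. The following Python sketch builds the command and optionally runs it. It assumes that "custom project metadata" refers to Compute Engine project-wide metadata, which `gcloud compute project-info add-metadata` sets; this document does not state that mapping, so verify it before relying on the sketch. The helper names are hypothetical.

```python
import subprocess
from typing import List


def lineage_metadata_command(project_id: str, enabled: bool) -> List[str]:
    """Build the gcloud argv that sets DATAPROC_LINEAGE_ENABLED on a project."""
    value = "true" if enabled else "false"
    return [
        "gcloud", "compute", "project-info", "add-metadata",
        f"--project={project_id}",
        # Assumption: the key lives in Compute Engine project-wide metadata.
        f"--metadata=DATAPROC_LINEAGE_ENABLED={value}",
    ]


def set_project_lineage(project_id: str, enabled: bool) -> None:
    """Run the command; requires an installed and authenticated gcloud CLI."""
    subprocess.run(lineage_metadata_command(project_id, enabled), check=True)


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(lineage_metadata_command("PROJECT_ID", True)))
```

Passing `enabled=False` produces the command that sets the key to false, which matches the disable step described above.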
Enable data lineage on a Spark batch workload
To enable data lineage on a batch workload, set the spark.dataproc.lineage.enabled property to true when you submit the workload. This setting overrides any Spark data lineage setting at the project level: if Spark data lineage is disabled at the project level but enabled for the batch workload, the batch workload setting takes precedence.
You can disable Spark data lineage on a Spark batch workload by setting the spark.dataproc.lineage.enabled property to false when you submit the workload.
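The precedence between the project-level metadata and the workload-level property can be summarized as a small decision function. This is an illustrative model of the rules stated in this document, not the service's implementation. The function names are hypothetical, the runtime version strings are example values, and treating "nothing set" as disabled is an assumption.

```python
from typing import Mapping, Optional


def _as_bool(value: Optional[str]) -> Optional[bool]:
    """Map 'true'/'false' strings to booleans; None means the key is not set."""
    if value is None:
        return None
    return value.strip().lower() == "true"


def effective_lineage(
    project_metadata: Mapping[str, str],
    workload_properties: Mapping[str, str],
    runtime_version: str,
) -> bool:
    """Model whether lineage is on for one batch workload or session."""
    # 1. An explicit workload or session property overrides the project setting.
    workload = _as_bool(workload_properties.get("spark.dataproc.lineage.enabled"))
    if workload is not None:
        return workload
    # 2. The project-level setting does not apply to the 3.0 runtime.
    if runtime_version == "3.0" or runtime_version.startswith("3.0."):
        return False
    # 3. Otherwise use the project metadata (assumption: unset means disabled).
    return bool(_as_bool(project_metadata.get("DATAPROC_LINEAGE_ENABLED")))


project_off = {"DATAPROC_LINEAGE_ENABLED": "false"}
project_on = {"DATAPROC_LINEAGE_ENABLED": "true"}
prop_on = {"spark.dataproc.lineage.enabled": "true"}
prop_off = {"spark.dataproc.lineage.enabled": "false"}

print(effective_lineage(project_off, prop_on, "2.2"))   # workload setting wins
print(effective_lineage(project_on, {}, "2.2"))         # project setting applies
print(effective_lineage(project_on, {}, "3.0"))         # 3.0 ignores the project setting
print(effective_lineage(project_on, prop_off, "2.2"))   # workload setting wins
```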
This example uses the gcloud CLI to submit a batch lineage-example.py workload with Spark lineage enabled.

    gcloud dataproc batches submit pyspark lineage-example.py \
        --region=REGION \
        --deps-bucket=gs://BUCKET \
        --properties=spark.dataproc.lineage.enabled=true

The following lineage-example.py code reads data from a public BigQuery table, and then writes the output to a new table in an existing BigQuery dataset. It uses a Cloud Storage bucket for temporary storage.

    #!/usr/bin/env python

    from pyspark.sql import SparkSession
    import sys

    spark = SparkSession \
        .builder \
        .appName('LINEAGE_BQ_TO_BQ') \
        .getOrCreate()

    source = 'bigquery-public-data:samples.shakespeare'
    words = spark.read.format('bigquery') \
        .option('table', source) \
        .load()
    words.createOrReplaceTempView('words')

    word_count = spark.sql(
        'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

    destination_table = 'PROJECT_ID:DATASET.TABLE'
    word_count.write.format('bigquery') \
        .option('table', destination_table) \
        .option('writeMethod', 'direct') \
        .save()

Replace the following:
- REGION: The region in which to run the workload
- BUCKET: The name of an existing Cloud Storage bucket to store dependencies
- PROJECT_ID, DATASET, and TABLE: The project ID, the name of an existing BigQuery dataset, and the name of a new table to create in the dataset (the table must not exist)
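The same property can also be set when a batch is submitted from Python instead of the gcloud CLI. The following is a minimal sketch, assuming the `google-cloud-dataproc` client library is installed, Application Default Credentials are configured, and the script has already been uploaded to Cloud Storage (the API takes a `gs://` URI rather than a local file). The helper names `build_batch` and `submit_batch` are hypothetical.

```python
def build_batch(main_python_file_uri: str, lineage_enabled: bool = True) -> dict:
    """Build a Batch payload (dict form) with the lineage property set."""
    return {
        "pyspark_batch": {"main_python_file_uri": main_python_file_uri},
        "runtime_config": {
            "properties": {
                "spark.dataproc.lineage.enabled":
                    "true" if lineage_enabled else "false",
            }
        },
    }


def submit_batch(project_id: str, region: str, batch: dict):
    """Submit the batch and wait for it to finish."""
    # Imported lazily so build_batch() works without the client library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.create_batch(
        request={
            "parent": f"projects/{project_id}/locations/{region}",
            "batch": batch,
        }
    )
    return operation.result()


# Inspect the payload without contacting the service.
print(build_batch("gs://BUCKET/lineage-example.py"))
```

Calling `submit_batch("PROJECT_ID", "REGION", build_batch("gs://BUCKET/lineage-example.py"))` would start the workload; it is left out of the sketch so that running the block has no side effects.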
You can view the lineage graph in the Dataplex Universal Catalog UI.

Enable data lineage on a Spark interactive session or session template
To enable data lineage on a Spark interactive session or session template, set the spark.dataproc.lineage.enabled property to true when you create the session or session template. This setting overrides any Spark data lineage setting at the project level: if Spark data lineage is disabled at the project level but enabled for the interactive session, the interactive session setting takes precedence.
You can disable Spark data lineage on a Spark interactive session or session template by setting the spark.dataproc.lineage.enabled property to false when you create the interactive session or session template.
The following PySpark notebook code configures a Serverless for Apache Spark interactive session with Spark data lineage enabled. It then creates a Spark Connect session that runs a word count query on a public BigQuery Shakespeare dataset, and then writes the output to a new table in an existing BigQuery dataset (see Create a Spark session in a BigQuery Studio notebook).

    # Configure the Dataproc Serverless interactive session
    # to enable Spark data lineage.
    from google.cloud.dataproc_v1 import Session

    session = Session()
    session.runtime_config.properties["spark.dataproc.lineage.enabled"] = "true"

    # Create the Spark Connect session.
    from google.cloud.dataproc_spark_connect import DataprocSparkSession

    spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()

    # Run a wordcount query on the public BigQuery Shakespeare dataset.
    source = "bigquery-public-data:samples.shakespeare"
    words = spark.read.format("bigquery").option("table", source).load()
    words.createOrReplaceTempView('words')
    word_count = spark.sql(
        'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

    # Output the results to a BigQuery destination table.
    destination_table = 'PROJECT_ID:DATASET.TABLE'
    word_count.write.format('bigquery') \
        .option('table', destination_table) \
        .save()

Replace the following:
- PROJECT_ID, DATASET, and TABLE: The project ID, the name of an existing BigQuery dataset, and the name of a new table to create in the dataset (the table must not exist)
You can view the data lineage graph by clicking the destination table name listed in the navigation pane on the BigQuery Explorer page, and then selecting the lineage tab on the table details pane.

View lineage in Dataplex Universal Catalog
A lineage graph displays relationships between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console or retrieve the information from the Data Lineage API as JSON data.
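As a rough illustration of retrieving lineage programmatically, the following Python sketch searches for links whose target is the destination table written by the examples in this document. It assumes the `google-cloud-datacatalog-lineage` client library is installed, Application Default Credentials are configured, and BigQuery tables are identified by fully qualified names of the form `bigquery:PROJECT_ID.DATASET.TABLE`. The helper names are hypothetical, and the location to search is left as a parameter because it depends on where the lineage was recorded.

```python
def to_fully_qualified_name(table: str) -> str:
    """Convert 'PROJECT_ID:DATASET.TABLE' to 'bigquery:PROJECT_ID.DATASET.TABLE'."""
    project, sep, rest = table.partition(":")
    if not sep or not project or rest.count(".") != 1:
        raise ValueError(f"expected PROJECT_ID:DATASET.TABLE, got {table!r}")
    return f"bigquery:{project}.{rest}"


def upstream_links(table: str, project_id: str, location: str) -> list:
    """Return source/target pairs for lineage links that end at the table."""
    # Imported lazily so the name helper works without the client library.
    from google.cloud import datacatalog_lineage_v1

    client = datacatalog_lineage_v1.LineageClient()
    request = datacatalog_lineage_v1.SearchLinksRequest(
        parent=f"projects/{project_id}/locations/{location}",
        target=datacatalog_lineage_v1.EntityReference(
            fully_qualified_name=to_fully_qualified_name(table)
        ),
    )
    return [
        {
            "source": link.source.fully_qualified_name,
            "target": link.target.fully_qualified_name,
        }
        for link in client.search_links(request=request)
    ]


# The name conversion can be checked without contacting the API.
print(to_fully_qualified_name("PROJECT_ID:DATASET.TABLE"))
```

For the word count examples, the returned links would be expected to include the public Shakespeare table as a source, provided that lineage was enabled when the workload ran.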
What's next
- Learn more about data lineage.
- Try data lineage in an interactive lab: Capture and Explore Data Updates With Data Lineage and OpenLineage in Dataplex.
Last updated 2026-02-18 UTC.