Use data lineage with Serverless for Apache Spark

This document describes how to enable data lineage on Google Cloud Serverless for Apache Spark batch workloads and interactive sessions at the project, batch workload, or interactive session level.

Overview

Data lineage is a Dataplex Universal Catalog feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Google Cloud Serverless for Apache Spark workloads and sessions capture lineage events and publish them to the Dataplex Universal Catalog Data Lineage API. Serverless for Apache Spark integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.

You can access lineage information through Dataplex Universal Catalog, using lineage graphs and the Data Lineage API. For more information, see View lineage graphs in Dataplex Universal Catalog.

Availability

Data lineage, which supports BigQuery and Cloud Storage data sources, is available for workloads and sessions that run with supported Serverless for Apache Spark runtime versions, with the following exceptions and limitations:

  • Data lineage is not available for SparkR or Spark streaming workloads or sessions.

Before you begin

  1. On the project selector page in the Google Cloud console, select the project to use for your Serverless for Apache Spark workloads or sessions.

    Go to project selector

  2. Enable the Data Lineage API.

    Enable the APIs

    Upcoming Spark data lineage changes: See the Serverless for Apache Spark release notes for the announcement of a change that will automatically make Spark data lineage available to your projects, batch workloads, and interactive sessions when you enable the Data Lineage API (see Control lineage ingestion for a service), without requiring additional project, batch workload, or interactive session settings.

Required roles

If your batch workload uses the default Serverless for Apache Spark service account, it has the Dataproc Worker role, which contains the permissions required by data lineage.

However, if your batch workload uses a custom service account, then to enable data lineage you must grant that account one of the roles listed in the following paragraph, which contain the permissions required by data lineage.

To get the permissions that you need to use data lineage with Dataproc, ask your administrator to grant you the following IAM roles on your batch workload custom service account:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Enable Spark data lineage

You can enable Spark data lineage for your project, batch workload, or interactive session.

Enable data lineage at the project level

After you enable Spark data lineage at the project level, subsequent Spark jobs that run in a batch workload or interactive session will have Spark data lineage enabled.

Note: Setting data lineage on a project does not enable data lineage for batch workloads and interactive sessions that use the Serverless for Apache Spark 3.0 runtime. You must separately enable data lineage on the 3.0 batch workload, interactive session, or session template.

To enable Spark data lineage on your project, set the following custom project metadata:

Key                         Value
DATAPROC_LINEAGE_ENABLED    true

You can disable Spark data lineage for a project by setting the DATAPROC_LINEAGE_ENABLED metadata to false.

Note: You can also disable Spark data lineage for a project by disabling lineage ingestion for the Dataproc service. This setting overrides all other data lineage settings.

Enable data lineage on a Spark batch workload

To enable data lineage on a batch workload, set the spark.dataproc.lineage.enabled property to true when you submit the workload. This setting overrides any Spark data lineage setting at the project level: if Spark data lineage is disabled at the project level but enabled for the batch workload, the batch workload setting takes precedence.

You can disable Spark data lineage on a Spark batch workload by setting the spark.dataproc.lineage.enabled property to false when you submit the workload.

Note: You can also disable Spark data lineage by disabling lineage ingestion for the Dataproc service. This setting overrides all other data lineage settings.
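The precedence rules described above can be summarized in a small sketch. This is only an illustrative model of the documented behavior, not code from the service: the function name, its parameters, and the placeholder default runtime version are hypothetical, and the sketch assumes that lineage is off when neither the project metadata nor the workload property is set.

```python
from typing import Optional


def lineage_enabled(
    service_ingestion_enabled: bool,
    project_metadata: Optional[str],
    workload_property: Optional[str],
    runtime_version: str = "2.2",  # placeholder; any non-3.0 version string
) -> bool:
    """Model whether Spark data lineage is captured for one workload or session.

    project_metadata is the DATAPROC_LINEAGE_ENABLED value (None if unset).
    workload_property is the spark.dataproc.lineage.enabled value (None if unset).
    """
    # Disabling lineage ingestion for the Dataproc service overrides
    # all other data lineage settings.
    if not service_ingestion_enabled:
        return False
    # A workload-level or session-level property takes precedence over
    # the project-level setting.
    if workload_property is not None:
        return workload_property.strip().lower() == "true"
    # The project-level setting does not enable lineage on the 3.0 runtime.
    if runtime_version.startswith("3.0"):
        return False
    # Fall back to the project metadata (assumed off when unset).
    return (project_metadata or "").strip().lower() == "true"
```

For example, `lineage_enabled(True, "false", "true")` models a workload that enables lineage in a project where it is disabled, and `lineage_enabled(False, "true", "true")` models the service-level override.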

This example uses the gcloud CLI to submit a batch lineage-example.py workload with Spark lineage enabled.

gcloud dataproc batches submit pyspark lineage-example.py \
    --region=REGION \
    --deps-bucket=gs://BUCKET \
    --properties=spark.dataproc.lineage.enabled=true

The following lineage-example.py code reads data from a public BigQuery table, and then writes the output to a new table in an existing BigQuery dataset. It uses a Cloud Storage bucket for temporary storage.

#!/usr/bin/env python

from pyspark.sql import SparkSession
import sys

spark = SparkSession \
    .builder \
    .appName('LINEAGE_BQ_TO_BQ') \
    .getOrCreate()

source = 'bigquery-public-data:samples.shakespeare'
words = spark.read.format('bigquery') \
    .option('table', source) \
    .load()
words.createOrReplaceTempView('words')

word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

destination_table = 'PROJECT_ID:DATASET.TABLE'
word_count.write.format('bigquery') \
    .option('table', destination_table) \
    .option('writeMethod', 'direct') \
    .save()

Replace the following:

  • REGION: The region in which to run the workload
  • BUCKET: The name of an existing Cloud Storage bucket to store dependencies
  • PROJECT_ID, DATASET, and TABLE: The project ID, the name of an existing BigQuery dataset, and the name of a new table to create in the dataset (the table must not exist)
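If you submit workloads from a script, the gcloud command shown earlier in this section can be assembled programmatically. The following is a minimal sketch that only builds and prints the argument list. The helper name build_batch_submit_command is hypothetical; the flags are the ones used in the gcloud example in this section.

```python
import shlex
from typing import List


def build_batch_submit_command(
    script: str, region: str, bucket: str, lineage: bool = True
) -> List[str]:
    """Build the gcloud argument list for a PySpark batch submission.

    The lineage flag sets spark.dataproc.lineage.enabled to true or false.
    """
    lineage_value = "true" if lineage else "false"
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--region={region}",
        f"--deps-bucket=gs://{bucket}",
        f"--properties=spark.dataproc.lineage.enabled={lineage_value}",
    ]


if __name__ == "__main__":
    # Replace REGION and BUCKET as described in the list above.
    cmd = build_batch_submit_command("lineage-example.py", "REGION", "BUCKET")
    # Print the command for review; nothing is submitted here.
    print(shlex.join(cmd))
```

To actually submit the workload, you could pass the returned list to subprocess.run with check=True, assuming the gcloud CLI is installed and authenticated.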

You can view the lineage graph in the Dataplex Universal Catalog UI.

Spark lineage graph

Enable data lineage on a Spark interactive session or session template

To enable data lineage on a Spark interactive session or session template, set the spark.dataproc.lineage.enabled property to true when you create the session or session template. This setting overrides any Spark data lineage setting at the project level: if Spark data lineage is disabled at the project level but enabled for the interactive session, the interactive session setting takes precedence.

You can disable Spark data lineage on a Spark interactive session or session template by setting the spark.dataproc.lineage.enabled property to false when you create the interactive session or session template.

Note: You can also disable Spark data lineage by disabling lineage ingestion for the Dataproc service. This setting overrides all other data lineage settings.

The following PySpark notebook code configures a Serverless for Apache Spark interactive session with Spark data lineage enabled. It then creates a Spark Connect session that runs a word count query on a public BigQuery Shakespeare dataset, and then writes the output to a new table in an existing BigQuery dataset (see Create a Spark session in a BigQuery Studio notebook).

# Configure the Dataproc Serverless interactive session
# to enable Spark data lineage.
from google.cloud.dataproc_v1 import Session

session = Session()
session.runtime_config.properties["spark.dataproc.lineage.enabled"] = "true"

# Create the Spark Connect session.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()

# Run a wordcount query on the public BigQuery Shakespeare dataset.
source = "bigquery-public-data:samples.shakespeare"
words = spark.read.format("bigquery").option("table", source).load()
words.createOrReplaceTempView('words')
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

# Output the results to a BigQuery destination table.
destination_table = 'PROJECT_ID:DATASET.TABLE'
word_count.write.format('bigquery') \
    .option('table', destination_table) \
    .save()

Replace the following:

  • PROJECT_ID, DATASET, and TABLE: The project ID, the name of an existing BigQuery dataset, and the name of a new table to create in the dataset (the table must not exist)

You can view the data lineage graph by clicking the destination table name listed in the navigation pane on the BigQuery Explorer page, then selecting the lineage tab on the table details pane.

Spark lineage graph

View lineage in Dataplex Universal Catalog

A lineage graph displays relationships between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console or retrieve the information from the Data Lineage API as JSON data.
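As a rough illustration of retrieving lineage as JSON, the following sketch builds (but does not send) a request that searches for lineage links that target the destination table. Several details here are assumptions that are not stated on this page and should be checked against the Data Lineage API reference: the searchLinks REST endpoint path, the bigquery:PROJECT_ID.DATASET.TABLE fully qualified name format, and the location value. The helper names are hypothetical.

```python
import json
import urllib.request


def to_lineage_fqn(connector_table: str) -> str:
    """Convert 'PROJECT_ID:DATASET.TABLE' to 'bigquery:PROJECT_ID.DATASET.TABLE'.

    Assumption: the Data Lineage API identifies BigQuery tables by this
    fully qualified name format.
    """
    project, _, rest = connector_table.partition(":")
    if not project or not rest:
        raise ValueError(f"expected PROJECT_ID:DATASET.TABLE, got {connector_table!r}")
    return f"bigquery:{project}.{rest}"


def build_search_links_request(
    project: str, location: str, table: str, access_token: str
) -> urllib.request.Request:
    """Build a POST request for links whose target is the given table."""
    # Assumption: REST path of the searchLinks method.
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project}/locations/{location}:searchLinks"
    )
    body = json.dumps({"target": {"fullyQualifiedName": to_lineage_fqn(table)}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned request to urllib.request.urlopen with a valid OAuth access token would send it; the response body is JSON that you can parse with json.load.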


Last updated 2026-02-18 UTC.