Use the Spark Spanner connector

This page shows you how to create a Dataproc cluster that uses the Spark Spanner connector to read data from and write data to Spanner using Apache Spark.

The Spanner connector works with Spark to read data from and write data to the Spanner database using the Spanner Java library. The Spanner connector supports reading Spanner tables and graphs into Spark DataFrames and GraphFrames, and writing DataFrame data into Spanner tables.

Costs

In this document, you use the following billable components of Google Cloud:

  • Dataproc
  • Spanner
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Spanner, Dataproc, and Cloud Storage APIs. You can enable them in the Google Cloud console or with the gcloud command shown after this list.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  5. Grant required roles.
  6. Set up a Dataproc cluster.
  7. Set up a Spanner instance with a Singers database table.
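
You can also enable the required APIs with the gcloud CLI. The following command is a minimal sketch using the standard service endpoints for the Spanner, Dataproc, and Cloud Storage APIs:

gcloud services enable spanner.googleapis.com \
    dataproc.googleapis.com \
    storage.googleapis.com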

Grant required roles

Certain IAM roles are required to run the examples on this page. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?

For more information about granting roles, see Manage access to projects, folders, and organizations.

To ensure that the Compute Engine default service account has the necessary permissions to create a Dataproc cluster, ask your administrator to grant the Compute Engine default service account the following IAM roles on the project:

Important: You must grant these roles to the Compute Engine default service account, not to your user account. Failure to grant the roles to the correct principal might result in permission errors.
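
For example, assuming the Dataproc Worker role (roles/dataproc.worker) is among the roles your administrator identifies, it can be granted to the Compute Engine default service account with the gcloud CLI. The role shown here is an illustrative assumption, not the complete list:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"

Replace PROJECT_ID with your project ID and PROJECT_NUMBER with your project number.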

Set up a Dataproc cluster

Create a Dataproc cluster or use an existing Dataproc cluster that was created with a 2.1 or later Dataproc image. If the cluster was created with a 2.0 or earlier image, it must have been created with the scopes property set to the cloud-platform scope.
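
For example, the following gcloud command is a minimal sketch of creating a new cluster with the cloud-platform scope; the cluster name, region, and image version are placeholders to adjust for your environment:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=2.2-debian12 \
    --scopes=cloud-platform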

Set up a Spanner instance with a Singers database table

Create a Spanner instance with a database that contains a Singers table. Note the Spanner instance ID and database ID.
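
If you need to create the table, the following GoogleSQL DDL is a minimal sketch; the column names and types are inferred from the example output shown later on this page:

CREATE TABLE Singers (
  SingerId    INT64 NOT NULL,
  FirstName   STRING(1024),
  LastName    STRING(1024),
  BirthDate   DATE,
  LastUpdated TIMESTAMP,
) PRIMARY KEY (SingerId);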

Use the Spanner connector with Spark

The Spanner connector is available for Spark versions 3.1+. You specify the connector version as part of the connector JAR file specification in Cloud Storage when you submit a job to a Dataproc cluster.

Example: gcloud CLI Spark job submission with the Spanner connector.

gcloud dataproc jobs submit spark \
    --jars=gs://spark-lib/spanner/spark-3.1-spanner-CONNECTOR_VERSION.jar \
    ... [other job submission flags]

Replace the following:

CONNECTOR_VERSION: Spanner connector version. Choose the Spanner connector version from the version list in the GitHub GoogleCloudDataproc/spark-spanner-connector repository.

Note: If the connector isn't available at runtime, a ClassNotFoundException is thrown.

Read Spanner tables

You can use Python or Scala to read Spanner table data into a Spark DataFrame using the Spark data source API.

PySpark

You can run the example PySpark code in this section on your cluster by submitting the job to the Dataproc service or by running the job with spark-submit from the cluster master node.

Dataproc job

  1. Create a singers.py file using a local text editor or in Cloud Shell using the pre-installed vi, vim, or nano text editor.
    1. After populating the placeholder variables, paste the following code into the singers.py file. Note that the Spanner Data Boost feature is enabled, which has near-zero impact on the main Spanner instance.
      #!/usr/bin/env python
      """Spanner PySpark read example."""

      from pyspark.sql import SparkSession

      spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('spark-spanner-demo') \
        .getOrCreate()

      # Load data from Spanner.
      singers = spark.read.format('cloud-spanner') \
        .option("projectId", "PROJECT_ID") \
        .option("instanceId", "INSTANCE_ID") \
        .option("databaseId", "DATABASE_ID") \
        .option("table", "TABLE_NAME") \
        .option("enableDataBoost", "true") \
        .load()
      singers.createOrReplaceTempView('Singers')

      # Read from Singers
      result = spark.sql('SELECT * FROM Singers')
      result.show()
      result.printSchema()

      Replace the following:

      1. PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
      2. INSTANCE_ID, DATABASE_ID, and TABLE_NAME: See Set up a Spanner instance with a Singers database table.
    2. Save the singers.py file.
  2. Submit the job to the Dataproc service using the Google Cloud console, gcloud CLI, or Dataproc API.

    Example: gcloud CLI job submission with the Spanner connector.

    gcloud dataproc jobs submit pyspark singers.py \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --jars=gs://spark-lib/spanner/spark-3.1-spanner-CONNECTOR_VERSION.jar

    Replace the following:

    1. CLUSTER_NAME: The name of your Dataproc cluster.
    2. REGION: An available Compute Engine region to run the workload.
    3. CONNECTOR_VERSION: Spanner connector version. Choose the Spanner connector version from the version list in the GitHub GoogleCloudDataproc/spark-spanner-connector repository.

spark-submit job

  1. Connect to the Dataproc cluster master node using SSH.
    1. Go to the Dataproc Clusters page in the Google Cloud console, then click the name of your cluster.
    2. On the Cluster details page, select the VM Instances tab. Then click SSH to the right of the name of the cluster master node.
      Screenshot of the Dataproc Cluster details page in the Google Cloud console, showing the SSH button used to connect to the cluster master node.

      A browser window opens at your home directory on the master node.

          Connected, host fingerprint: ssh-rsa 2048 ...
          ...
          user@clusterName-m:~$
  2. Create a singers.py file on the master node using the pre-installed vi, vim, or nano text editor.
    1. After populating the placeholder variables, paste the following code into the singers.py file. Note that the Spanner Data Boost feature is enabled, which has near-zero impact on the main Spanner instance.
      #!/usr/bin/env python
      """Spanner PySpark read example."""

      from pyspark.sql import SparkSession

      spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('spark-spanner-demo') \
        .getOrCreate()

      # Load data from Spanner.
      singers = spark.read.format('cloud-spanner') \
        .option("projectId", "PROJECT_ID") \
        .option("instanceId", "INSTANCE_ID") \
        .option("databaseId", "DATABASE_ID") \
        .option("table", "TABLE_NAME") \
        .option("enableDataBoost", "true") \
        .load()
      singers.createOrReplaceTempView('Singers')

      # Read from Singers
      result = spark.sql('SELECT * FROM Singers')
      result.show()
      result.printSchema()

      Replace the following:

      1. PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
      2. INSTANCE_ID, DATABASE_ID, and TABLE_NAME: See Set up a Spanner instance with a Singers database table.
    2. Save the singers.py file.
  3. Run singers.py with spark-submit to read the Spanner Singers table.
    spark-submit --jars gs://spark-lib/spanner/spark-3.1-spanner-CONNECTOR_VERSION.jar singers.py

    Replace the following:

    1. CONNECTOR_VERSION: Spanner connector version. Choose the Spanner connector version from the version list in the GitHub GoogleCloudDataproc/spark-spanner-connector repository.

    The output is:

    ...
    +--------+---------+--------+---------+-----------+
    |SingerId|FirstName|LastName|BirthDate|LastUpdated|
    +--------+---------+--------+---------+-----------+
    |       1|     Marc|Richards|     null|       null|
    |       2| Catalina|   Smith|     null|       null|
    |       3|    Alice| Trentor|     null|       null|
    +--------+---------+--------+---------+-----------+

    root
     |-- SingerId: long (nullable = false)
     |-- FirstName: string (nullable = true)
     |-- LastName: string (nullable = true)
     |-- BirthDate: date (nullable = true)
     |-- LastUpdated: timestamp (nullable = true)

    only showing top 20 rows

Scala

To run the example Scala code on your cluster, complete the following steps:

  1. Connect to the Dataproc cluster master node using SSH.
    1. Go to the Dataproc Clusters page in the Google Cloud console, then click the name of your cluster.
    2. On the Cluster details page, select the VM Instances tab. Then click SSH to the right of the name of the cluster master node.

      A browser window opens at your home directory on the master node.

          Connected, host fingerprint: ssh-rsa 2048 ...
          ...
          user@clusterName-m:~$
  2. Create a singers.scala file on the master node using the pre-installed vi, vim, or nano text editor.
    1. Paste the following code into the singers.scala file. Note that the Spanner Data Boost feature is enabled, which has near-zero impact on the main Spanner instance.
      object singers {
        def main(): Unit = {
          /*
           * Uncomment (use the following code) if you are not running in spark-shell.
           *
          import org.apache.spark.sql.SparkSession
          val spark = SparkSession.builder()
            .appName("spark-spanner-demo")
            .getOrCreate()
          */

          // Load data in from Spanner. See
          // https://github.com/GoogleCloudDataproc/spark-spanner-connector/blob/main/README.md#properties
          // for option information.
          val singersDF =
            (spark.read.format("cloud-spanner")
              .option("projectId", "PROJECT_ID")
              .option("instanceId", "INSTANCE_ID")
              .option("databaseId", "DATABASE_ID")
              .option("table", "TABLE_NAME")
              .option("enableDataBoost", true)
              .load()
              .cache())
          singersDF.createOrReplaceTempView("Singers")

          // Load the Singers table.
          val result = spark.sql("SELECT * FROM Singers")
          result.show()
          result.printSchema()
        }
      }

      Replace the following:

      1. PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
      2. INSTANCE_ID, DATABASE_ID, and TABLE_NAME: See Set up a Spanner instance with a Singers database table.
    2. Save the singers.scala file.
  3. Launch the spark-shell REPL.
    $ spark-shell --jars=gs://spark-lib/spanner/spark-3.1-spanner-CONNECTOR_VERSION.jar

    Replace the following:

    CONNECTOR_VERSION: Spanner connector version. Choose the Spanner connector version from the version list in the GitHub GoogleCloudDataproc/spark-spanner-connector repository.

  4. Run singers.scala with the :load singers.scala command to read the Spanner Singers table. The output displays example rows and the schema from the Singers table.
    > :load singers.scala
    Loading singers.scala...
    defined object singers
    > singers.main()
    ...
    +--------+---------+--------+---------+-----------+
    |SingerId|FirstName|LastName|BirthDate|LastUpdated|
    +--------+---------+--------+---------+-----------+
    |       1|     Marc|Richards|     null|       null|
    |       2| Catalina|   Smith|     null|       null|
    |       3|    Alice| Trentor|     null|       null|
    +--------+---------+--------+---------+-----------+

    root
     |-- SingerId: long (nullable = false)
     |-- FirstName: string (nullable = true)
     |-- LastName: string (nullable = true)
     |-- BirthDate: date (nullable = true)
     |-- LastUpdated: timestamp (nullable = true)

Read Spanner graphs

The Spanner connector supports exporting the graph into separate node and edge DataFrames, as well as exporting into GraphFrames directly.

The following example exports a Spanner graph into a GraphFrame. It uses the Python SpannerGraphConnector class, included in the Spanner connector JAR, to read the Spanner Graph.

Note: Populate the placeholder variables before running the example.
from pyspark.sql import SparkSession

connector_jar = "gs://spark-lib/spanner/spark-3.1-spanner-CONNECTOR_VERSION.jar"

spark = (SparkSession.builder
         .appName("spanner-graphframe-graphx-example")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.4-spark3.5-s_2.12")
         .config("spark.jars", connector_jar)
         .getOrCreate())
spark.sparkContext.addPyFile(connector_jar)

from spannergraph import SpannerGraphConnector

connector = (SpannerGraphConnector()
             .spark(spark)
             .project("PROJECT_ID")
             .instance("INSTANCE_ID")
             .database("DATABASE_ID")
             .graph("GRAPH_ID"))

g = connector.load_graph()
g.vertices.show()
g.edges.show()

Replace the following:

  • PROJECT_ID, INSTANCE_ID, DATABASE_ID, and GRAPH_ID: Your Google Cloud project ID and the Spanner instance, database, and graph IDs.
  • CONNECTOR_VERSION: Spanner connector version. Choose the Spanner connector version from the version list in the GitHub GoogleCloudDataproc/spark-spanner-connector repository.

To export node and edge DataFrames instead of GraphFrames, use load_dfs instead:

df_vertices, df_edges, df_id_map = connector.load_dfs()

See Exporting Spanner Graphs in GitHub for more information.

Write Spanner tables

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

The Spanner connector supports writing a Spark DataFrame to a Spanner table using the Spark data source API. During preview, the Spark Spanner connector supports the append save mode only.

Write DataFrame to Spanner table example

Populate the variables before saving and running the code.

"""Spanner PySpark write example."""frompyspark.sqlimportSparkSessionspark=SparkSession.builder.appName('Spanner Write App').getOrCreate()columns=['id','name','email']data=[(1,'John Doe','john.doe@example.com'),(2,'Jane Doe','jane.doe@example.com')]df=spark.createDataFrame(data,columns)df.write.format('cloud-spanner') \.option("projectId","PROJECT_ID").option("instanceId","INSTANCE_ID").option("databaseId","DATABASE_ID").option("table","TABLE_NAME").mode("append") \.save()

Replace the following:

  • PROJECT_ID: The Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
  • INSTANCE_ID, DATABASE_ID, and TABLE_NAME: Insert the instance, database, and table IDs.
Note: For more information, see Writing to Spanner tables in GitHub.

Clean up

To avoid incurring ongoing charges to your Google Cloud account, you can stop or delete your Dataproc cluster and delete your Spanner instance.
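
For example, assuming the placeholder names used earlier on this page, the following gcloud commands delete the cluster and the Spanner instance:

gcloud dataproc clusters delete CLUSTER_NAME --region=REGION
gcloud spanner instances delete INSTANCE_ID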

What's next
