Use the Serverless Spark Connect client

The Dataproc Spark Connect client is a wrapper around the Apache Spark Connect client. It lets applications communicate with a remote Serverless for Apache Spark session using the Spark Connect protocol. This document shows you how to install, configure, and use the client.

Before you begin

  1. Ensure you have the Identity and Access Management roles that contain the permissions needed to manage interactive sessions and session templates.

  2. If you run the client outside of Google Cloud, provide authentication credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file, as shown below.
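
If you prefer to set the variable from Python rather than from your shell, one option is to export it in the process environment before you create a session. This is a minimal sketch; the key-file path is a placeholder, not a value from this guide.

import os

# Placeholder path to a service account key file; replace with your own.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"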

Install or uninstall the client

You can install or uninstall the dataproc-spark-connect package using pip.

Install

To install the latest version of the client, run the following command:

pip install -U dataproc-spark-connect
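
To confirm that the package installed correctly, you can try importing the client class from Python. This is an optional check, not part of the official steps.

# Optional sanity check: the import fails if the package is not installed.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

print(DataprocSparkSession.__name__)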

Uninstall

To uninstall the client, run the following command:

pip uninstall dataproc-spark-connect

Configure the client

Specify the project and region for your session. You can set these values using environment variables or by using the builder API in your code.

Environment variables

Set the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables. The following example .env file shows these required variables along with optional client settings:

# Google Cloud Configuration for Dataproc Spark Connect Integration Tests
# Copy this file to .env and fill in your actual values

# ============================================================================
# REQUIRED CONFIGURATION
# ============================================================================

# Your Google Cloud Project ID
GOOGLE_CLOUD_PROJECT="your-project-id"

# Google Cloud Region where Dataproc sessions will be created
GOOGLE_CLOUD_REGION="us-central1"

# Path to service account key file (if using SERVICE_ACCOUNT auth)
GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"

# ============================================================================
# AUTHENTICATION CONFIGURATION
# ============================================================================

# Authentication type (SERVICE_ACCOUNT or END_USER_CREDENTIALS). If not set, API default is used.
# DATAPROC_SPARK_CONNECT_AUTH_TYPE="SERVICE_ACCOUNT"
# DATAPROC_SPARK_CONNECT_AUTH_TYPE="END_USER_CREDENTIALS"

# Service account email for workload authentication (optional)
# DATAPROC_SPARK_CONNECT_SERVICE_ACCOUNT="your-service-account@your-project.iam.gserviceaccount.com"

# ============================================================================
# SESSION CONFIGURATION
# ============================================================================

# Session timeout in seconds (how long session stays active)
# DATAPROC_SPARK_CONNECT_TTL_SECONDS="3600"

# Session idle timeout in seconds (how long session stays active when idle)
# DATAPROC_SPARK_CONNECT_IDLE_TTL_SECONDS="900"

# Automatically terminate session when Python process exits (true/false)
# DATAPROC_SPARK_CONNECT_SESSION_TERMINATE_AT_EXIT="false"

# Custom file path for storing active session information
# DATAPROC_SPARK_CONNECT_ACTIVE_SESSION_FILE_PATH="/tmp/dataproc_spark_connect_session"

# ============================================================================
# DATA SOURCE CONFIGURATION
# ============================================================================

# Default data source for Spark SQL (currently only supports "bigquery")
# Only available for Dataproc runtime version 2.3
# DATAPROC_SPARK_CONNECT_DEFAULT_DATASOURCE="bigquery"

# ============================================================================
# ADVANCED CONFIGURATION
# ============================================================================

# Custom Dataproc API endpoint (uncomment if needed)
# GOOGLE_CLOUD_DATAPROC_API_ENDPOINT="your-region-dataproc.googleapis.com"

# Subnet URI for Dataproc Spark Connect (full resource name format)
# Example: projects/your-project-id/regions/us-central1/subnetworks/your-subnet-name
# DATAPROC_SPARK_CONNECT_SUBNET="projects/your-project-id/regions/us-central1/subnetworks/your-subnet-name"

Builder API

Use the .projectId() and .location() builder methods.

from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = (
    DataprocSparkSession.builder
    .projectId("my-project")
    .location("us-central1")
    .getOrCreate()
)

Start a Spark session

To start a Spark session, add the required imports to your PySpark application or notebook, then call the DataprocSparkSession.builder.getOrCreate() API.

  1. Import the DataprocSparkSession class.

  2. Call the getOrCreate() method to start the session.

    from google.cloud.dataproc_spark_connect import DataprocSparkSession

    spark = DataprocSparkSession.builder.getOrCreate()
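
Once getOrCreate() returns, you can run a quick query to confirm that the remote session is responding. This is an optional check, not part of the steps above.

# Optional: a trivial query to verify the remote session works end to end.
spark.sql("SELECT 1 AS ok").show()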

Configure Spark properties

To configure Spark properties, chain one or more .config() methods to the builder.

from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = (
    DataprocSparkSession.builder
    .config('spark.executor.memory', '48g')
    .config('spark.executor.cores', '8')
    .getOrCreate()
)
Note: Serverless for Apache Spark 3.0+ runtimes support Spark single-node execution. To enable it, set spark.master=local in your session config.
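
For example, single-node execution could be requested through the builder. This is a minimal sketch that assumes .config() values are applied as session Spark properties, as in the preceding example, and that the session uses a 3.0+ runtime.

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Request single-node execution (supported on 3.0+ runtimes).
spark = (
    DataprocSparkSession.builder
    .config('spark.master', 'local')
    .getOrCreate()
)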

Use advanced configuration

For advanced configuration, use the Session class to customize settings such as the subnetwork or runtime version.

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = 'SUBNET'
session_config.runtime_config.version = '3.0'

spark = (
    DataprocSparkSession.builder
    .projectId('my-project')
    .location('us-central1')
    .dataprocSessionConfig(session_config)
    .getOrCreate()
)
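
The same Session object can carry other settings, such as a custom service account or session-level Spark properties. The following is a sketch that assumes the execution_config.service_account and runtime_config.properties fields of the Dataproc v1 Session resource; verify the field names against the google.cloud.dataproc_v1 reference before relying on them.

from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

session_config = Session()
# Assumed Session fields; check the dataproc_v1 API reference.
session_config.environment_config.execution_config.service_account = (
    'my-service-account@my-project.iam.gserviceaccount.com')
session_config.runtime_config.properties['spark.executor.memory'] = '16g'

spark = (
    DataprocSparkSession.builder
    .dataprocSessionConfig(session_config)
    .getOrCreate()
)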

Reuse a named session

Named sessions let you share a single Spark session across multiple notebooks while avoiding repeated session startup delays.

  1. In your first notebook, create a session with a custom ID.

    from google.cloud.dataproc_spark_connect import DataprocSparkSession

    session_id = 'my-ml-pipeline-session'
    spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()

    df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
    df.show()
  2. In another notebook, reuse the session by specifying the same session ID.

    from google.cloud.dataproc_spark_connect import DataprocSparkSession

    session_id = 'my-ml-pipeline-session'
    spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()

    df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
    df.show()

Session IDs must be 4-63 characters long, start with a lowercase letter, and contain only lowercase letters, numbers, and hyphens. The ID cannot end with a hyphen. A session with an ID that is in a TERMINATED state cannot be reused.
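
The following sketch shows one way to check a candidate ID against these rules before creating a session. The helper is hypothetical and is not part of the client library.

import re

# Encodes the stated rules: 4-63 characters, starts with a lowercase letter,
# contains only lowercase letters, digits, and hyphens, and does not end
# with a hyphen.
_SESSION_ID_PATTERN = re.compile(r'^[a-z][a-z0-9-]{2,61}[a-z0-9]$')

def is_valid_session_id(session_id: str) -> bool:
    return bool(_SESSION_ID_PATTERN.match(session_id))

print(is_valid_session_id('my-ml-pipeline-session'))  # True
print(is_valid_session_id('My-Session'))              # False: uppercase letters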

Use Spark SQL magic commands

The package supports the sparksql-magic library to execute Spark SQL queries in Jupyter notebooks. Magic commands are an optional feature.

  1. Install the required dependencies.

    pip install IPython sparksql-magic
  2. Load the magic extension.

    %load_ext sparksql_magic
  3. Optional: configure default settings.

    %config SparkSql.limit=20
  4. Execute SQL queries.

    %%sparksql
    SELECT * FROM your_table

To use advanced options, add flags to the %%sparksql command. For example, to cache the results, create a temporary view, and store the result in a variable named df, run the following command:

%%sparksql --cache --view result_view df
SELECT * FROM your_table WHERE condition = true

The following options are available:

  • --cache or -c: caches the DataFrame.
  • --eager or -e: caches with eager loading.
  • --view VIEW or -v VIEW: creates a temporary view.
  • --limit N or -l N: overrides the default row display limit.
  • variable_name: stores the result in a variable (see the sketch after this list).
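
After a cell like the advanced example above runs, the stored variable and the temporary view are available to regular PySpark code in the same notebook. The sketch below assumes the df variable and the result_view view created in that example.

# df was populated by the %%sparksql cell above.
df.printSchema()

# result_view was registered as a temporary view by the --view flag.
spark.sql("SELECT COUNT(*) AS row_count FROM result_view").show()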

