Use the JupyterLab extension to develop serverless Spark workloads

This document describes how to install and use the JupyterLab extension on a machine or self-managed VM that has access to Google services. It also describes how to develop and deploy serverless Spark notebook code.

Install the extension within minutes to take advantage of the following features:

  • Launch serverless Spark & BigQuery notebooks to develop code quickly
  • Browse and preview BigQuery datasets in JupyterLab
  • Edit Cloud Storage files in JupyterLab
  • Schedule a notebook on Cloud Composer
Note: The JupyterLab extension is pre-installed on Vertex AI Workbench instances. For more information, see Create a Dataproc-enabled Vertex AI Workbench instance.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  3. Enable the Dataproc API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

  4. Install the Google Cloud CLI.

    Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.
  5. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  6. To initialize the gcloud CLI, run the following command:

    gcloud init
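After you initialize the gcloud CLI, you can confirm or change its default project, which the extension uses unless you override it in the extension settings. The following commands are an optional sketch; replace PROJECT_ID with your own value:

    gcloud config list
    gcloud config set project PROJECT_ID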

Install the JupyterLab extension

Note: The JupyterLab extension is provided without charge. However, you're charged for using Google Cloud services. You can use the pricing calculator to generate a cost estimate based on your projected usage of Google Cloud resources.

You can install and use the JupyterLab extension on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance.

To install the extension, follow these steps:

  1. Download and install Python version 3.11 or higher from python.org/downloads.

    • Verify the Python 3.11+ installation.
      python3 --version
  2. Set up a Python virtual environment.

    pip3 install pipenv

    • Create an installation folder.
      mkdir jupyter
    • Change to the installation folder.
      cd jupyter
    • Create a virtual environment.
      pipenv shell
  3. Install JupyterLab in the virtual environment.

    pipenv install jupyterlab

  4. Install the JupyterLab extension.

    pipenv install bigquery-jupyter-plugin

  5. Start JupyterLab.

    jupyter lab

    1. The JupyterLab Launcher page opens in your browser. It contains a Dataproc Jobs and Sessions section. It can also contain Serverless for Apache Spark Notebooks and Dataproc Cluster Notebooks sections if you have access to Dataproc serverless notebooks or Dataproc clusters with the Jupyter optional component running in your project.

      JupyterLab launcher browser page
      On macOS, if you receive an SSL: CERTIFICATE_VERIFY_FAILED error in your terminal when you launch JupyterLab, update your Python SSL certificates by running Install Certificates.command from the Python installation path. This file is located in the Python home directory.
    2. By default, your Serverless for Apache Spark interactive session runs in the project and region you set when you ran gcloud init in Before you begin. You can change the project and region settings for your sessions from JupyterLab Settings > Google Cloud Settings > Google Cloud Project Settings.

      You must restart the extension for the changes to take effect.

Create a Serverless for Apache Spark runtime template

Serverless for Apache Spark runtime templates (also called session templates) contain configuration settings for executing Spark code in a session. You can create and manage runtime templates using JupyterLab or the gcloud CLI.

JupyterLab

  1. Click the New runtime template card in the Serverless for Apache Spark Notebooks section on the JupyterLab Launcher page.

  2. Fill in the Runtime template form.

    • Template Info:

      • Display name, Runtime ID, and Description: Accept or fill in a template display name, template runtime ID, and template description.
    • Execution Configuration: Select User Account to execute notebooks with the user identity instead of the Dataproc service account identity.

      • Service Account: If you do not specify a service account, the Compute Engine default service account is used.
      • Runtime version: Confirm or select the runtime version.
      • Custom container image: Optionally specify the URI of a custom container image.
      • Staging Bucket: You can optionally specify the name of a Cloud Storage staging bucket for use by Serverless for Apache Spark.
      • Python packages repository: By default, Python packages are downloaded and installed from the PyPI pull-through cache when users execute pip install commands in their notebooks. You can specify your organization's private artifacts repository for Python packages to use as the default Python packages repository.
    • Encryption: Accept the default Google-owned and Google-managed encryption key or select Customer-managed encryption key (CMEK). If you select CMEK, select or provide the key information.

    • Network Configuration: Select a subnetwork in the project or shared from a host project (you can change the project from JupyterLab Settings > Google Cloud Settings > Google Cloud Project Settings). You can specify network tags to apply to the specified network. Note that Serverless for Apache Spark enables Private Google Access (PGA) on the specified subnet. For network connectivity requirements, see Google Cloud Serverless for Apache Spark network configuration.

    • Session Configuration: You can optionally fill in these fields to limit the duration of sessions created with the template.

      • Max idle time: The maximum idle time before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
      • Max session time: The maximum lifetime of a session before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
    • Metastore: To use a Dataproc Metastore service with your sessions, select the metastore project ID and service.

    • Persistent History Server: You can select an available Persistent Spark History Server to access session logs during and after sessions. The PHS must be set up in the location (region) where your sessions run. By default, Serverless for Apache Spark sessions run in the project and region set with the gcloud init command. You can change the project and region settings from JupyterLab Settings > Google Cloud Settings > Google Cloud Project Settings.

    • Spark properties: You can select and then add Spark Resource Allocation, Autoscaling, or GPU properties. Click Add Property to add other Spark properties. For more information, see Spark properties.

    • Labels: Click Add Label for each label to set on sessions created with the template.

  3. Click Save to create the template.

  4. To view or delete a runtime template, follow these steps:

    1. Click Settings > Google Cloud Settings.
    2. The Dataproc Settings > Serverless Runtime Templates section displays the list of runtime templates.

      List of runtime templates

      • Click a template name to view template details.
      • You can delete a template from the Action menu for the template.
  5. Reload the JupyterLab Launcher page to view the saved notebook template card.

gcloud

  1. Create a YAML file with your runtime template configuration.

    Simple YAML

    environmentConfig:
      executionConfig:
        networkUri: default
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    description: Team A Development Environment

    Complex YAML

    description: Example session template
    environmentConfig:
      executionConfig:
        serviceAccount: sa1
        # Choose either networkUri or subnetworkUri
        networkUri:
        subnetworkUri: default
        networkTags:
        - tag1
        kmsKey: key1
        idleTtl: 3600s
        ttl: 14400s
        stagingBucket: staging-bucket
      peripheralsConfig:
        metastoreService: projects/my-project-id/locations/us-central1/services/my-metastore-id
        sparkHistoryServerConfig:
          dataprocCluster: projects/my-project-id/regions/us-central1/clusters/my-cluster-id
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    runtimeConfig:
      version: "2.3"
      containerImage: gcr.io/my-project-id/my-image:1.0.1
      properties:
        "p1": "v1"

  2. Create a session (runtime) template from your YAML file by running the following gcloud beta dataproc session-templates import command locally or in Cloud Shell:

    gcloud beta dataproc session-templates import TEMPLATE_ID \
        --source=YAML_FILE \
        --project=PROJECT_ID \
        --location=REGION
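
    To check the result, you can list your session templates or describe the one you imported. The following commands are a sketch that reuses the same placeholder values as the import command:

    gcloud beta dataproc session-templates list \
        --project=PROJECT_ID \
        --location=REGION

    gcloud beta dataproc session-templates describe TEMPLATE_ID \
        --project=PROJECT_ID \
        --location=REGION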

Launch and manage notebooks

After installing the Dataproc JupyterLab extension, you can click template cards on the JupyterLab Launcher page to launch notebooks, as described in the following sections.

Launch a Jupyter notebook on Serverless for Apache Spark

The Serverless for Apache Spark Notebooks section on the JupyterLab Launcher page displays notebook template cards that map to Serverless for Apache Spark runtime templates (see Create a Serverless for Apache Spark runtime template).

  1. Click a card to create a Serverless for Apache Spark session and launch a notebook. When session creation is complete and the notebook kernel is ready to use, the kernel status changes from Starting to Idle (Ready).

  2. Write and test notebook code.

    1. Copy and paste the following PySpark Pi estimation code into the PySpark notebook cell, then press Shift+Return to run the code.

      import random

      def inside(p):
          x, y = random.random(), random.random()
          return x*x + y*y < 1

      count = sc.parallelize(range(0, 10000)).filter(inside).count()
      print("Pi is roughly %f" % (4.0 * count / 10000))

      Notebook result:

  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

    • To reuse the session, create a new notebook by choosing Notebook from the File > New menu. After the new notebook is created, choose the existing session from the kernel selection dialog. The new notebook reuses the session and retains the session context from the previous notebook.
  4. If you don't terminate the session, Dataproc terminates the session when the session idle timer expires. You can configure the session idle time in the runtime template configuration. The default session idle time is one hour.

Launch a notebook on a Dataproc on Compute Engine cluster

If you created a Dataproc on Compute Engine Jupyter cluster, the JupyterLab Launcher page contains a Dataproc Cluster Notebook section with pre-installed kernel cards.

To launch a Jupyter notebook on your Dataproc on Compute Engine cluster:

  1. Click a card in the Dataproc Cluster Notebook section.

  2. When the kernel status changes from Starting to Idle (Ready), you can start writing and executing notebook code.

  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

Manage input and output files in Cloud Storage

Exploratory data analysis and ML model building often involve file-based inputs and outputs. Serverless for Apache Spark accesses these files on Cloud Storage.

  • To access the Cloud Storage browser, click the Cloud Storage browser icon in the JupyterLab Launcher page sidebar, then double-click a folder to view its contents.

  • You can click Jupyter-supported file types to open and edit them. When you save changes to the files, they are written to Cloud Storage.

  • To create a new Cloud Storage folder, click the new folder icon, then enter the name of the folder.

  • To upload files into a Cloud Storage bucket or a folder, click the upload icon, then select the files to upload.
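
In notebook code, you can also read and write Cloud Storage objects directly through gs:// paths. The following PySpark lines are a minimal sketch; the bucket and object names are placeholders:

  # Read a CSV file from Cloud Storage into a Spark DataFrame (placeholder bucket and path).
  df = spark.read.csv("gs://your-bucket/data/input.csv", header=True, inferSchema=True)

  # Write the results back to Cloud Storage in Parquet format.
  df.write.mode("overwrite").parquet("gs://your-bucket/data/output/")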

Develop Spark notebook code

After installing the Dataproc JupyterLab extension, you can launch Jupyter notebooks from the JupyterLab Launcher page to develop application code.

PySpark and Python code development

Serverless for Apache Spark and Dataproc on Compute Engine clusters support PySpark kernels. Dataproc on Compute Engine also supports Python kernels.

SQL code development

To open a PySpark notebook to write and execute SQL code, on the JupyterLab Launcher page, in the Serverless for Apache Spark Notebooks or Dataproc Cluster Notebook section, click the PySpark kernel card.

Spark SQL magic: The PySpark kernel that launches Serverless for Apache Spark notebooks is preloaded with Spark SQL magic. Instead of wrapping your SQL statement with spark.sql('SQL STATEMENT').show(), you can type the %%sparksql magic at the top of a cell, then type your SQL statement in the cell.
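
For example, a notebook cell that uses the magic might look like the following sketch (the query is a placeholder):

  %%sparksql
  SELECT 'hello' AS greeting, 1 AS id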

BigQuery SQL: The BigQuery Spark connector allows your notebook code to load data from BigQuery tables, perform analysis in Spark, and then write the results to a BigQuery table.

The Serverless for Apache Spark 2.2 and later runtimes include the BigQuery Spark connector. If you use an earlier runtime to launch Serverless for Apache Spark notebooks, you can install the Spark BigQuery connector by adding the following Spark property to your Serverless for Apache Spark runtime template:

spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
Note: To use the BigQuery Spark connector with Dataproc Cluster Notebooks, install it on your Dataproc cluster.
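
The following PySpark cell is a minimal sketch of the load-analyze-write pattern described above; the public table is an example, and the output dataset and staging bucket are placeholders that you must replace:

  # Load a public BigQuery table into a Spark DataFrame.
  df = spark.read.format("bigquery") \
      .option("table", "bigquery-public-data.samples.shakespeare") \
      .load()

  # Aggregate in Spark.
  word_counts = df.groupBy("corpus").sum("word_count")

  # Write the results to a BigQuery table (placeholder dataset; indirect writes need a staging bucket).
  word_counts.write.format("bigquery") \
      .option("table", "your_dataset.shakespeare_word_counts") \
      .option("temporaryGcsBucket", "your-staging-bucket") \
      .mode("overwrite") \
      .save()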

Scala code development

Dataproc on Compute Engine clusters created with image versions 2.0 and later include Apache Toree, a Scala kernel for the Jupyter Notebook platform that provides interactive access to Spark.

  • Click the Apache Toree card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a notebook for Scala code development.

    Figure 1. Apache Toree kernel card in the JupyterLab Launcher page.

Develop code with the Visual Studio Code extension

The Google Cloud Visual Studio Code (VS Code) extension lets you do the following:

  • Develop and run Spark code in Serverless for Apache Spark notebooks.
  • Create and manage Serverless for Apache Spark runtime (session) templates, interactive sessions, and batch workloads.

The Visual Studio Code extension is free, but you are charged for any Google Cloud services, including Dataproc, Serverless for Apache Spark, and Cloud Storage resources that you use.

Use VS Code with BigQuery: You can also use VS Code with BigQuery to do the following:

  • Develop and execute BigQuery notebooks.
  • Browse, inspect, and preview BigQuery datasets.

Before you begin

  1. Download and install VS Code.
  2. Open VS Code, and then in the activity bar, click Extensions.
  3. Using the search bar, find the Jupyter extension, and then click Install. The Jupyter extension by Microsoft is a required dependency.

    A list of Jupyter extensions in the VS Code console.

Install the Google Cloud extension

  1. Open VS Code, and then in the activity bar, click Extensions.
  2. Using the search bar, find the Google Cloud Code extension, and then click Install.

    The Google Cloud Code extension in the VS Code console.

  3. If prompted, restart VS Code.

The Google Cloud Code icon is now visible in the VS Code activity bar.

Configure the extension

  1. Open VS Code, and then in the activity bar, click Google Cloud Code.
  2. Open the Dataproc section.
  3. Click Login to Google Cloud. You are redirected to sign in with your credentials.
  4. Use the top-level application taskbar to navigate to Code > Settings > Settings > Extensions.
  5. Find Google Cloud Code, and click the Manage icon to open the menu.
  6. Select Settings.
  7. In the Project and Dataproc Region fields, enter the name of the Google Cloud project and the region to use to develop notebooks and manage Serverless for Apache Spark resources.

Develop Serverless for Apache Spark notebooks

  1. Open VS Code, and then in the activity bar, click Google Cloud Code.
  2. Open the Notebooks section, then click New Serverless Spark Notebook.
  3. Select or create a new runtime (session) template to use for the notebook session.
  4. A new .ipynb file containing sample code is created and opened in the editor.

    New Serverless Spark notebook in the VS Code console.

    You can now write and execute code in your Serverless for Apache Spark notebook.
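
    For example, a first cell might create a small DataFrame to confirm that the Spark session works. This is a sketch, not the generated sample code; the spark object is provided by the notebook kernel:

    # Create and display a small DataFrame to verify the Serverless for Apache Spark session.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()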

Create and manage Serverless for Apache Spark resources

  1. Open VS Code, and then in the activity bar, click Google Cloud Code.
  2. Open the Dataproc section, then click the following resource names:

    • Clusters: Create and manage clusters and jobs.
    • Serverless: Create and manage batch workloads and interactive sessions.
    • Spark Runtime Templates: Create and manage session templates.

    Dataproc resources listed in the VS Code console.

Dataset explorer

Use the JupyterLab Dataset explorer to view BigLake metastore datasets.

To open the JupyterLab Dataset Explorer, click its icon in the sidebar.

You can search for a database, table, or column in the Dataset explorer. Click a database, table, or column name to view the associated metadata.

Deploy your code

After installing the Dataproc JupyterLab extension, you can use JupyterLab to:

  • Execute your notebook code on the Google Cloud Serverless for Apache Spark infrastructure

  • Schedule notebook execution on Cloud Composer

  • Submit batch jobs to the Google Cloud Serverless for Apache Spark infrastructure or to your Dataproc on Compute Engine cluster.

Schedule notebook execution on Cloud Composer

Complete the following steps to schedule your notebook code on Cloud Composer to run as a batch job on Serverless for Apache Spark or on a Dataproc on Compute Engine cluster.

  1. Create a Cloud Composer environment.

  2. Click the Job Scheduler button at the top right of the notebook.

  3. Fill in the Create A Scheduled Job form to provide the following information:

    • A unique name for the notebook execution job
    • The Cloud Composer environment to use to deploy the notebook
    • Input parameters if the notebook is parameterized
    • The Dataproc cluster or serverless runtime template to use to run the notebook
      • If a cluster is selected, whether to stop the cluster after the notebook finishes executing on the cluster
    • Retry count and retry delay in minutes if notebook execution fails on the first try
    • Execution notifications to send and the recipient list. Notifications are sent using an Airflow SMTP configuration.
    • The notebook execution schedule
  4. Click Create.

  5. After the notebook is successfully scheduled, the job name appears in the list of scheduled jobs in the Cloud Composer environment.

Submit a batch job to Google Cloud Serverless for Apache Spark

  • Click the Serverless card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Batch tab, then click Create Batch and fill in the Batch Info fields.

  • Click Submit to submit the job.
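
Alternatively, you can submit an equivalent batch workload from the command line with the gcloud CLI. The following command is a sketch; the Cloud Storage path and region are placeholders:

  gcloud dataproc batches submit pyspark gs://your-bucket/your_script.py \
      --region=REGION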

Submit a batch job to a Dataproc on Compute Engine cluster

  • Click the Clusters card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Jobs tab, then click Submit Job.

  • Select a Cluster, then fill in the Job fields.

  • Click Submit to submit the job.

View and manage resources

After installing the Dataproc JupyterLab extension, you can view and manage Google Cloud Serverless for Apache Spark and Dataproc on Compute Engine resources from the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

Click the Dataproc Jobs and Sessions section to show the Clusters and Serverless cards.

To view and manage Google Cloud Serverless for Apache Spark sessions:

  1. Click the Serverless card.
  2. Click the Sessions tab, then click a session ID to open the Session details page, where you can view session properties, view Google Cloud logs in the Logs Explorer, and terminate the session. Note: A unique Google Cloud Serverless for Apache Spark session is created to launch each Google Cloud Serverless for Apache Spark notebook.
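
If the beta sessions commands are available in your gcloud CLI version, you can also inspect interactive sessions from the command line; for example:

  gcloud beta dataproc sessions list --location=REGION
  gcloud beta dataproc sessions describe SESSION_ID --location=REGION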

To view and manage Google Cloud Serverless for Apache Spark batches:

  1. Click the Batches tab to view the list of Google Cloud Serverless for Apache Spark batches in the current project and region. Click a batch ID to view batch details.
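
The same batch information is available from the gcloud CLI, for example:

  gcloud dataproc batches list --region=REGION
  gcloud dataproc batches describe BATCH_ID --region=REGION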

To view and manage Dataproc on Compute Engine clusters:

  1. Click the Clusters card. The Clusters tab is selected to list active Dataproc on Compute Engine clusters in the current project and region. You can click the icons in the Actions column to start, stop, or restart a cluster. Click a cluster name to view cluster details.

To view and manage Dataproc on Compute Engine jobs:

  1. Click the Jobs card to view the list of jobs in the current project. Click a job ID to view job details. You can click the icons in the Actions column to clone, stop, or delete a job.
