Run a genomics analysis in a JupyterLab notebook on Dataproc

This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, which you can configure on Dataproc. You can configure Dataproc to run Dask either with its standalone scheduler or with YARN for resource management.
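
The choice of runtime mainly changes how a notebook connects to Dask. The following is a minimal sketch, not part of this tutorial's steps; the scheduler address and the availability of the dask-yarn package depend on how the Dask initialization action is configured on your cluster:

   from dask.distributed import Client

   # Standalone runtime: connect to the scheduler that the Dask
   # initialization action starts on the master node (8786 is the
   # default Dask scheduler port; adjust if your cluster differs).
   client = Client("localhost:8786")

   # YARN runtime: let YARN allocate and manage the Dask workers.
   # Requires the dask-yarn package on the cluster.
   # from dask_yarn import YarnCluster
   # client = Client(YarnCluster())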

This tutorial configures Dataproc with a hosted JupyterLab instance to run a notebook featuring a single-cell genomics analysis. Using a Jupyter Notebook on Dataproc lets you combine the interactive capabilities of Jupyter with the workload scaling that Dataproc enables. With Dataproc, you can scale out your workloads from one to many machines, which you can configure with as many GPUs as you need.
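
To make the scale-out concrete: a Dask collection is split into chunks that the scheduler distributes across however many workers the cluster provides, so the same notebook code runs on one machine or many. A minimal sketch, assuming a client connected as in the previous example:

   import dask.array as da
   from dask.distributed import Client

   client = Client("localhost:8786")  # assumed scheduler address, as above

   # A 100,000 x 10,000 random array, split into chunks that Dask
   # schedules across the cluster's workers in parallel.
   x = da.random.random((100_000, 10_000), chunks=(10_000, 10_000))
   print(x.mean().compute())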

This tutorial is intended for data scientists and researchers. It assumes that you are experienced with Python and have basic knowledge of the tools used in this tutorial, such as Dask, NVIDIA RAPIDS, and JupyterLab.

Objectives

  • Create a Dataproc cluster that is configured with GPUs, JupyterLab, and open source components.
  • Run a notebook on Dataproc.

Costs

In this document, you use the following billable components of Google Cloud:

  • Dataproc
  • Cloud Storage
  • GPUs
To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

   Roles required to select or create a project

   • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
   • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

   Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

   Go to project selector

2. Verify that billing is enabled for your Google Cloud project.

3. Enable the Dataproc API.

   Roles required to enable APIs

   To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

   Enable the API

Prepare your environment

1. Select a location for your resources.

   REGION=REGION

2. Create a Cloud Storage bucket.

   gcloud storage buckets create gs://BUCKET --location=REGION

3. Copy the following initialization actions to your bucket.

   SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
   gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh gs://BUCKET/gpu/install_gpu_driver.sh
   gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh gs://BUCKET/dask/dask.sh
   gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh gs://BUCKET/rapids/rapids.sh
   gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh gs://BUCKET/python/pip-install.sh

Create a Dataproc cluster with JupyterLab and open source components

1. Create a Dataproc cluster.

   gcloud dataproc clusters create CLUSTER_NAME \
       --region REGION \
       --image-version 2.0-ubuntu18 \
       --master-machine-type n1-standard-32 \
       --master-accelerator type=nvidia-tesla-t4,count=4 \
       --initialization-actions gs://BUCKET/gpu/install_gpu_driver.sh,gs://BUCKET/dask/dask.sh,gs://BUCKET/rapids/rapids.sh,gs://BUCKET/python/pip-install.sh \
       --initialization-action-timeout=60m \
       --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1,wget" \
       --optional-components JUPYTER \
       --enable-component-gateway \
       --single-node

The cluster has the following properties:

• --region: the region where your cluster is located.
• --image-version: 2.0-ubuntu18, the cluster image version.
• --master-machine-type: n1-standard-32, the main machine type.
• --master-accelerator: the type and count of GPUs on the main node, four nvidia-tesla-t4 GPUs.
• --initialization-actions: the Cloud Storage paths to the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
• --initialization-action-timeout: the timeout for the initialization actions.
• --metadata: passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, Dask running on YARN, and RAPIDS version 21.06.
• --optional-components: configures the cluster with the Jupyter optional component.
• --enable-component-gateway: allows access to web UIs on the cluster.
• --single-node: configures the cluster as a single node (no workers).
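
After the cluster is running, you can sanity-check the packages that the PIP_PACKAGES metadata value installed by running a short cell in a notebook on the cluster. A minimal check, assuming the initialization actions completed successfully:

   # Verify the packages installed through the PIP_PACKAGES metadata value.
   import scanpy as sc
   import wget  # installed alongside scanpy by pip-install.sh

   print(sc.__version__)  # expect 1.8.1, matching PIP_PACKAGES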

Access the Jupyter Notebook

1. Open the Dataproc Clusters page in the Google Cloud console.

   Open Clusters page

2. Click your cluster and click the Web Interfaces tab.
3. Click JupyterLab.
4. Open a new terminal in JupyterLab.
5. Clone the clara-parabricks/rapids-single-cell-examples repository and check out the dataproc/multi-gpu branch.

   git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
   cd rapids-single-cell-examples
   git checkout dataproc/multi-gpu

6. In JupyterLab, navigate to the rapids-single-cell-examples/notebooks directory and open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter Notebook.

7. To clear all the outputs in the notebook, select Edit > Clear All Outputs.

8. Read the instructions in the cells of the notebook. The notebook uses Dask and RAPIDS on Dataproc to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see Accelerating Single Cell Genomic Analysis using RAPIDS.
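
For orientation, the core stages of such a workflow look like the following CPU-based scanpy sketch. This is not the notebook's implementation: the notebook offloads these stages to GPUs with RAPIDS and scales them with Dask, and the small public pbmc3k dataset here is only a stand-in for the 1-million-cell brain data.

   import scanpy as sc

   # Small public dataset as a stand-in for the 1M-cell brain data.
   adata = sc.datasets.pbmc3k()

   # Filter, normalize, and log-transform the counts.
   sc.pp.filter_cells(adata, min_genes=200)
   sc.pp.normalize_total(adata, target_sum=1e4)
   sc.pp.log1p(adata)

   # Select highly variable genes, reduce dimensionality, and embed.
   sc.pp.highly_variable_genes(adata, n_top_genes=2000)
   sc.pp.pca(adata, n_comps=50)
   sc.pp.neighbors(adata)
   sc.tl.umap(adata)

   # Cluster and visualize (sc.tl.leiden requires the leidenalg package).
   sc.tl.leiden(adata)
   sc.pl.umap(adata, color="leiden")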

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Caution: Deleting a project has the following effects:

• Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
• Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

1. In the Google Cloud console, go to the Manage resources page.

   Go to Manage resources

2. In the project list, select the project that you want to delete, and then click Delete.
3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete individual resources

1. Delete your Dataproc cluster.

   gcloud dataproc clusters delete CLUSTER_NAME \
       --region=REGION

2. Delete the bucket.

   gcloud storage buckets delete gs://BUCKET

   Important: Your bucket must be empty before you can delete it.

