Use custom containers with Google Cloud Serverless for Apache Spark
Google Cloud Serverless for Apache Spark runs workloads within Docker containers. The container provides the runtime environment for the workload's driver and executor processes. By default, Google Cloud Serverless for Apache Spark uses a container image that includes the default Spark, Java, Python, and R packages associated with a runtime release version. The Google Cloud Serverless for Apache Spark batches API lets you use a custom container image instead of the default image. Typically, a custom container image adds Spark workload Java or Python dependencies not provided by the default container image. Important: Do not include Spark in your custom container image; Google Cloud Serverless for Apache Spark mounts Spark into the container at runtime.
Submit a Spark batch workload using a custom container image
gcloud
Use the gcloud dataproc batches submit spark command with the --container-image flag to specify your custom container image when you submit a Spark batch workload.
gcloud dataproc batches submit spark \
    --container-image=custom-image \
    --region=region \
    --jars=path to user workload jar located in Cloud Storage or included in the custom container \
    --class=the fully qualified name of a class in the jar file, such as org.apache.spark.examples.SparkPi \
    -- add any workload arguments here
Notes:
- custom-image: Specify the custom container image using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}, for example, "gcr.io/my-project-id/my-image:1.0.1". Note: You must host your custom container image on Container Registry or Artifact Registry. (Google Cloud Serverless for Apache Spark cannot fetch containers from other registries.)
- --jars: Specify a path to a user workload included in your custom container image or located in Cloud Storage, for example, file:///opt/spark/jars/spark-examples.jar or gs://my-bucket/spark/jars/spark-examples.jar.
- Other batches command options: You can add other optional batches command flags, for example, to use a Persistent History Server (PHS). Note: The PHS must be located in the region where you run batch workloads. See gcloud dataproc batches submit for supported command flags.
- Workload arguments: You can add any workload arguments by adding a "--" to the end of the command, followed by the workload arguments.
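For example, a complete submission might look like the following sketch. The project ID, region, Cloud Storage bucket, and the trailing 1000 argument (SparkPi's partition count) are illustrative values:

gcloud dataproc batches submit spark \
    --container-image="gcr.io/my-project-id/my-image:1.0.1" \
    --region=us-central1 \
    --jars=gs://my-bucket/spark/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000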
REST
The custom container image is provided through the RuntimeConfig.containerImage field as part of a batches.create API request.
The following example shows how to use a custom container to submit a batch workload using the Google Cloud Serverless for Apache Spark batches.create API.
Before using any of the request data, make the following replacements:
- project-id: Google Cloud project ID
- region: The region where the batch workload runs
- custom-container-image: Specify the custom container image using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}, for example, "gcr.io/my-project-id/my-image:1.0.1". Note: You must host your custom container on Container Registry or Artifact Registry. (Google Cloud Serverless for Apache Spark cannot fetch containers from other registries.)
- jar-uri: Specify a path to a workload jar included in your custom container image or located in Cloud Storage, for example, "/opt/spark/jars/spark-examples.jar" or "gs://my-bucket/spark/jars/spark-examples.jar".
- class: The fully qualified name of a class in the jar file, such as "org.apache.spark.examples.SparkPi".
- Other options: You can use other batch workload resource fields, for example, use the sparkBatch.args field to pass arguments to your workload (see the Batch resource documentation for more information). To use a Persistent History Server (PHS), see Setting up a Persistent History Server. Note: The PHS must be located in the region where you run batch workloads.
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches
Request JSON body:
{ "runtimeConfig":{ "containerImage":"custom-container-image }, "sparkBatch":{ "jarFileUris":[ "jar-uri" ], "mainClass":"class" }}To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches"
PowerShell (Windows)
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{"name":"projects/project-id/locations/region/batches/batch-id", "uuid":",uuid", "createTime":"2021-07-22T17:03:46.393957Z", "runtimeConfig":{ "containerImage":"gcr.io/my-project/my-image:1.0.1" }, "sparkBatch":{ "mainClass":"org.apache.spark.examples.SparkPi", "jarFileUris":[ "/opt/spark/jars/spark-examples.jar" ] }, "runtimeInfo":{ "outputUri":"gs://dataproc-.../driveroutput" }, "state":"SUCCEEDED", "stateTime":"2021-07-22T17:06:30.301789Z", "creator":"account-email-address", "runtimeConfig":{ "properties":{ "spark:spark.executor.instances":"2", "spark:spark.driver.cores":"2", "spark:spark.executor.cores":"2", "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id" } }, "environmentConfig":{ "peripheralsConfig":{ "sparkHistoryServerConfig":{ } } }, "operation":"projects/project-id/regions/region/operation-id"}Build a custom container image
Google Cloud Serverless for Apache Spark custom container images are Docker images. You can use the tools for building Docker images to build custom container images, but there are conditions the images must meet to be compatible with Google Cloud Serverless for Apache Spark. The following sections explain these conditions.
Operating system
You can choose any operating system base image for your custom container image.
Recommendation: Use the default Debian 12 images, for example, debian:12-slim, since they have been tested to avoid compatibility issues.
Utilities
You must include the following utility packages, which are required to run Spark, in your custom container image:
- procps
- tini

To run XGBoost from Spark (Java or Scala), you must include libgomp1.
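For example, the sample Dockerfiles later on this page install these packages with apt; a minimal sketch, assuming a Debian base image, is:

# Install utilities required by Spark, plus libgomp1 for XGBoost workloads.
RUN apt update && apt install -y procps tini libgomp1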
Container user
Google Cloud Serverless for Apache Spark runs containers as the spark Linux user with a 1099 UID and a 1099 GID. USER directives set in custom container image Dockerfiles are ignored at runtime. Use the UID and GID for file system permissions. For example, if you add a jar file at /opt/spark/jars/my-lib.jar in the image as a workload dependency, you must give the spark user read permission to the file.
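For example, a Dockerfile fragment along the following lines creates the spark user and makes an added jar readable by it (my-lib.jar is a placeholder for your own dependency):

# Create the 'spark' group/user. The GID and UID must be 1099.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark

# Copy a workload dependency and give the 'spark' user read permission.
COPY my-lib.jar /opt/spark/jars/my-lib.jar
RUN chown spark:spark /opt/spark/jars/my-lib.jar && chmod 644 /opt/spark/jars/my-lib.jar

USER spark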
Image streaming
Serverless for Apache Spark normally starts a workload that requires a custom container image by downloading the entire image to disk. This can delay initialization, especially for large images.
You can instead use image streaming, which pulls image data on an as-needed basis. This lets the workload start up without waiting for the entire image to download, which can improve initialization time. To enable image streaming, you must enable the Container File System API. You must also store your container images in Artifact Registry, and the Artifact Registry repository must be in the same region as your Dataproc workload or in a multi-region that corresponds with the region where your workload is running. If Dataproc does not support the image or the image streaming service is not available, our streaming implementation downloads the entire image.
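For example, assuming the Container File System API uses the service name containerfilesystem.googleapis.com, you can enable it with the gcloud CLI (replace my-project-id with your project):

# Enable the Container File System API for the project (service name assumed).
gcloud services enable containerfilesystem.googleapis.com --project=my-project-id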
Note that we don't support the following for image streaming:
- Images with empty layers or duplicate layers
- Images that use the V2 Image Manifest, schema version 1
In these cases, Dataproc pulls the entire image before startingthe workload.
Spark
Don't include Spark in your custom container image. At runtime, Google Cloud Serverless for Apache Spark mounts Spark binaries and configs from the host into the container: binaries are mounted to the /usr/lib/spark directory and configs are mounted to the /etc/spark/conf directory. Existing files in these directories are overridden by Google Cloud Serverless for Apache Spark at runtime.
Java Runtime Environment
Don't include your own Java Runtime Environment (JRE) in your custom container image. At runtime, Google Cloud Serverless for Apache Spark mounts OpenJDK from the host into the container. If you include a JRE in your custom container image, it will be ignored.
Java packages
You can include jar files as Spark workload dependencies in your custom container image, and you can set the SPARK_EXTRA_CLASSPATH env variable to include the jars. Google Cloud Serverless for Apache Spark adds the env variable value to the classpath of Spark JVM processes. Recommendation: put jars under the /opt/spark/jars directory and set SPARK_EXTRA_CLASSPATH to /opt/spark/jars/*.
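A minimal Dockerfile sketch of this recommendation, where my-dependency.jar stands in for your own jar:

# Put dependency jars under /opt/spark/jars and add them to the Spark classpath.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY my-dependency.jar "${SPARK_EXTRA_JARS_DIR}"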
You can include the workload jar in your custom container image, then reference it with a local path when submitting the workload, for example file:///opt/spark/jars/my-spark-job.jar (see Submit a Spark batch workload using a custom container image for an example).
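For example, a submission that runs a jar baked into the image might look like the following sketch (the image name, region, and main class are placeholders):

gcloud dataproc batches submit spark \
    --container-image="gcr.io/my-project-id/my-image:1.0.1" \
    --region=us-central1 \
    --jars=file:///opt/spark/jars/my-spark-job.jar \
    --class=com.example.MySparkJob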
Python packages
By default, Google Cloud Serverless for Apache Spark mounts a Conda environment, built using an OSS Conda-Forge repo, from the host to the /opt/dataproc/conda directory in the container at runtime. PYSPARK_PYTHON is set to /opt/dataproc/conda/bin/python. Its base directory, /opt/dataproc/conda/bin, is included in PATH.
You can include your Python environment with packages in a different directory in your custom container image, for example in /opt/conda, and set the PYSPARK_PYTHON environment variable to /opt/conda/bin/python.
pyspark package: Google Cloud Serverless for Apache Spark mounts pyspark into your container at runtime.

Your custom container image can include other Python modules that are not part of the Python environment, for example, Python scripts with utility functions. Set the PYTHONPATH environment variable to include the directories where the modules are located.
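A Dockerfile sketch that combines both points, assuming a custom Python environment at /opt/conda and utility scripts copied from a local utils/ directory (both paths are illustrative):

# Use a custom Python environment installed at /opt/conda.
ENV PYSPARK_PYTHON=/opt/conda/bin/python
ENV PATH=/opt/conda/bin:${PATH}

# Make extra Python modules (outside the environment) importable.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY utils/ "${PYTHONPATH}/utils/"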
R environment
You can customize the R environment in your custom container image using one ofthe following options:
- Use Conda to manage and install R packages from the conda-forge channel.
- Add an R repository for your container image Linux OS, and install R packages using the Linux OS package manager (see the R Software package index).
When you use either option, you must set the R_HOME environment variable to point to your custom R environment. Exception: If you are using Conda to both manage your R environment and customize your Python environment, you don't need to set the R_HOME environment variable; it is automatically set based on the PYSPARK_PYTHON environment variable.
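A minimal sketch of the OS-package-manager option on a Debian base image (the r-cran-* package names are illustrative examples):

# Install R and a couple of R packages from the Debian repositories.
RUN apt update && apt install -y r-base r-cran-dplyr r-cran-ggplot2

# Point Serverless for Apache Spark at the R installation.
ENV R_HOME=/usr/lib/R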
Example custom container image build
This section includes custom container image build examples, which include sample Dockerfiles, followed by a build command. One sample includes the minimum configuration required to build an image. The other sample includes examples of extra configuration, including Python and R libraries.
Minimum configuration
# Recommendation: Use Debian 12.
FROM debian:12-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini libjemalloc2

# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark

USER spark
Extra configuration
# Recommendation: Use Debian 12.
FROM debian:12-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini libjemalloc2

# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Install utilities required by XGBoost for Spark.
RUN apt install -y procps libgomp1

# Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniforge3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
ADD https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh .
RUN bash Miniforge3-Linux-x86_64.sh -b -p /opt/miniforge3 \
    && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
    && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
    && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Packages ipython and ipykernel are required if using custom conda and want to
# use this container for running notebooks.
RUN ${CONDA_HOME}/bin/mamba install ipython ipykernel

# Install Google Cloud SDK.
RUN ${CONDA_HOME}/bin/mamba install -n base google-cloud-sdk

# Install Conda packages.
#
# The following packages are installed in the default image.
# Recommendation: include all packages.
#
# Use mamba to quickly install packages.
RUN ${CONDA_HOME}/bin/mamba install -n base \
    accelerate \
    bigframes \
    cython \
    deepspeed \
    evaluate \
    fastavro \
    fastparquet \
    gcsfs \
    google-cloud-aiplatform \
    google-cloud-bigquery-storage \
    google-cloud-bigquery[pandas] \
    google-cloud-bigtable \
    google-cloud-container \
    google-cloud-datacatalog \
    google-cloud-dataproc \
    google-cloud-datastore \
    google-cloud-language \
    google-cloud-logging \
    google-cloud-monitoring \
    google-cloud-pubsub \
    google-cloud-redis \
    google-cloud-spanner \
    google-cloud-speech \
    google-cloud-storage \
    google-cloud-texttospeech \
    google-cloud-translate \
    google-cloud-vision \
    langchain \
    lightgbm \
    koalas \
    matplotlib \
    mlflow \
    nltk \
    numba \
    numpy \
    openblas \
    orc \
    pandas \
    pyarrow \
    pynvml \
    pysal \
    pytables \
    python \
    pytorch-cpu \
    regex \
    requests \
    rtree \
    scikit-image \
    scikit-learn \
    scipy \
    seaborn \
    sentence-transformers \
    sqlalchemy \
    sympy \
    tokenizers \
    transformers \
    virtualenv \
    xgboost

# Install pip packages.
RUN ${PYSPARK_PYTHON} -m pip install \
    spark-tensorflow-distributor \
    torcheval

# Install R and R libraries.
RUN ${CONDA_HOME}/bin/mamba install -n base \
    r-askpass \
    r-assertthat \
    r-backports \
    r-bit \
    r-bit64 \
    r-blob \
    r-boot \
    r-brew \
    r-broom \
    r-callr \
    r-caret \
    r-cellranger \
    r-chron \
    r-class \
    r-cli \
    r-clipr \
    r-cluster \
    r-codetools \
    r-colorspace \
    r-commonmark \
    r-cpp11 \
    r-crayon \
    r-curl \
    r-data.table \
    r-dbi \
    r-dbplyr \
    r-desc \
    r-devtools \
    r-digest \
    r-dplyr \
    r-ellipsis \
    r-evaluate \
    r-fansi \
    r-fastmap \
    r-forcats \
    r-foreach \
    r-foreign \
    r-fs \
    r-future \
    r-generics \
    r-ggplot2 \
    r-gh \
    r-glmnet \
    r-globals \
    r-glue \
    r-gower \
    r-gtable \
    r-haven \
    r-highr \
    r-hms \
    r-htmltools \
    r-htmlwidgets \
    r-httpuv \
    r-httr \
    r-hwriter \
    r-ini \
    r-ipred \
    r-isoband \
    r-iterators \
    r-jsonlite \
    r-kernsmooth \
    r-knitr \
    r-labeling \
    r-later \
    r-lattice \
    r-lava \
    r-lifecycle \
    r-listenv \
    r-lubridate \
    r-magrittr \
    r-markdown \
    r-mass \
    r-matrix \
    r-memoise \
    r-mgcv \
    r-mime \
    r-modelmetrics \
    r-modelr \
    r-munsell \
    r-nlme \
    r-nnet \
    r-numderiv \
    r-openssl \
    r-pillar \
    r-pkgbuild \
    r-pkgconfig \
    r-pkgload \
    r-plogr \
    r-plyr \
    r-praise \
    r-prettyunits \
    r-processx \
    r-prodlim \
    r-progress \
    r-promises \
    r-proto \
    r-ps \
    r-purrr \
    r-r6 \
    r-randomforest \
    r-rappdirs \
    r-rcmdcheck \
    r-rcolorbrewer \
    r-rcpp \
    r-rcurl \
    r-readr \
    r-readxl \
    r-recipes \
    r-recommended \
    r-rematch \
    r-remotes \
    r-reprex \
    r-reshape2 \
    r-rlang \
    r-rmarkdown \
    r-rodbc \
    r-roxygen2 \
    r-rpart \
    r-rprojroot \
    r-rserve \
    r-rsqlite \
    r-rstudioapi \
    r-rvest \
    r-scales \
    r-selectr \
    r-sessioninfo \
    r-shape \
    r-shiny \
    r-sourcetools \
    r-spatial \
    r-squarem \
    r-stringi \
    r-stringr \
    r-survival \
    r-sys \
    r-teachingdemos \
    r-testthat \
    r-tibble \
    r-tidyr \
    r-tidyselect \
    r-tidyverse \
    r-timedate \
    r-tinytex \
    r-usethis \
    r-utf8 \
    r-uuid \
    r-vctrs \
    r-whisker \
    r-withr \
    r-xfun \
    r-xml2 \
    r-xopen \
    r-xtable \
    r-yaml \
    r-zip

ENV R_HOME=/usr/lib/R

# Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"

# Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
# Uncomment below and replace EXTRA_JAR_NAME with the jar file name.
#COPY "EXTRA_JAR_NAME" "${SPARK_EXTRA_JARS_DIR}"

# Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark

USER spark

Build command
Run the following command in the Dockerfile directory to build and push the custom image to Artifact Registry.
# Build and push the image.
gcloud builds submit --region=REGION \
    --tag REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE_NAME:IMAGE_VERSION
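If you prefer to build locally instead of using Cloud Build, a roughly equivalent flow (assuming Docker is installed and the Artifact Registry repository already exists) is:

# Authenticate Docker to Artifact Registry, then build and push the image locally.
gcloud auth configure-docker REGION-docker.pkg.dev
docker build -t REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE_NAME:IMAGE_VERSION .
docker push REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE_NAME:IMAGE_VERSION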