Customize your Spark job runtime environment with Docker on YARN
The Dataproc Docker on YARN feature allows you to create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.
Limitations
The Docker on YARN feature is not available or supported with:
- Dataproc image versions prior to 2.0.49 (not available in 1.5 images)
- MapReduce jobs (only Spark jobs are supported)
- Spark client mode (only supported with Spark cluster mode)
- Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
- Customizations of JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
Create a Docker image
The first step in customizing your Spark environment is to build a Docker image.
Dockerfile
You can use the following Dockerfile as an example, making changes and additions to meet your needs.
FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
  && ${CONDA_HOME}/bin/mamba install \
    conda \
    cython \
    fastavro \
    fastparquet \
    gcsfs \
    google-cloud-bigquery-storage \
    google-cloud-bigquery[pandas] \
    google-cloud-bigtable \
    google-cloud-container \
    google-cloud-datacatalog \
    google-cloud-dataproc \
    google-cloud-datastore \
    google-cloud-language \
    google-cloud-logging \
    google-cloud-monitoring \
    google-cloud-pubsub \
    google-cloud-redis \
    google-cloud-spanner \
    google-cloud-speech \
    google-cloud-storage \
    google-cloud-texttospeech \
    google-cloud-translate \
    google-cloud-vision \
    koalas \
    matplotlib \
    nltk \
    numba \
    numpy \
    openblas \
    orc \
    pandas \
    pyarrow \
    pysal \
    pytables \
    python \
    regex \
    requests \
    rtree \
    scikit-image \
    scikit-learn \
    scipy \
    seaborn \
    sqlalchemy \
    sympy \
    virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user

USER yarn_docker_user

Build and push the image
The following commands build and push the example Docker image; you can modify them to match your customizations.
# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
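Optionally, verify that the image and tag are visible in the registry before you create the cluster. The check below is a sketch that assumes the image was pushed to Container Registry under the example name used above; for an Artifact Registry repository, use the corresponding gcloud artifacts commands instead.

# Optional check (sketch): list the tags of the pushed example image.
gcloud container images list-tags gcr.io/my-project/my-image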
Create a Dataproc cluster

After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.
gcloud
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags
Replace the following:
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
- DP_IMAGE: The Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 will use a qualified minor version later than 2.0.49).
- --optional-components=DOCKER: Enables the Docker component on the cluster.
- --properties flag:
  - dataproc:yarn.docker.enable=true: Required property to enable the Dataproc Docker on YARN feature.
  - dataproc:yarn.docker.image: Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}.
    Example:
dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1
Requirement: You must host your Docker image on Container Registry or Artifact Registry. (Dataproc cannot fetch containers from other registries.)
Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
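For reference, a complete cluster creation command that uses the example image built earlier on this page might look like the following; the cluster name and region are illustrative.

gcloud dataproc clusters create my-docker-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=gcr.io/my-project/my-image:1.0.1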
When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.
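If you want to confirm these settings, you can inspect the Spark defaults on the cluster's master node. The following is a sketch that assumes the standard Dataproc master node name (CLUSTER_NAME-m) and the default /etc/spark/conf/spark-defaults.conf location.

# Sketch: view Docker on YARN related Spark defaults on the master node.
gcloud compute ssh CLUSTER_NAME-m \
    --zone=ZONE \
    --command="grep -E 'deployMode|DOCKER' /etc/spark/conf/spark-defaults.conf"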
Submit a Spark job to the cluster
After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.
Set job properties:
# Set the Docker image URI.
IMAGE=(e.g., gcr.io/my-project/my-image:1.0.1)

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"

Notes:
- See Launching Applications Using Docker Containers for information on related properties.
gcloud
Submit the job to the cluster.
gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=${JOB_PROPERTIES}

Replace the following:
- PYFILE: The file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename). See the example job file after this list.
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
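For reference, the following is a minimal sketch of a PySpark job file that exercises the example image: it imports the test_util module that the example Dockerfile copies into /opt/python/packages and runs a small Spark action. The file name is illustrative; pass it as PYFILE after setting the job properties above.

# Sketch: create a minimal PySpark job file that uses the custom image.
cat >test_spark_job.py <<EOF
from pyspark.sql import SparkSession

# Available through the spark.*.PYTHONPATH job properties set above.
import test_util

spark = SparkSession.builder.appName("docker-on-yarn-test").getOrCreate()

# Call the custom Python module from the image.
test_util.hello("world")

# Run a simple action so executors start inside the Docker container.
print(spark.range(10).count())

spark.stop()
EOF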