Customize your Spark job runtime environment with Docker on YARN

The Dataproc Docker on YARN feature allows you to create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.

Limitations

This feature is not available or supported with:

  • Dataproc image versions prior to 2.0.49 (not available in 1.5 images)
  • MapReduce jobs (only supported for Spark jobs)
  • Spark client mode (only supported with Spark cluster mode)
  • Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
  • Customizations of the JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.

Create a Docker image

The first step in customizing your Spark environment is building a Docker image.

Dockerfile

You can use the following Dockerfile as an example, making changes and additions to meet your needs.

FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      fastavro \
      fastparquet \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-bigtable \
      google-cloud-container \
      google-cloud-datacatalog \
      google-cloud-dataproc \
      google-cloud-datastore \
      google-cloud-language \
      google-cloud-logging \
      google-cloud-monitoring \
      google-cloud-pubsub \
      google-cloud-redis \
      google-cloud-spanner \
      google-cloud-speech \
      google-cloud-storage \
      google-cloud-texttospeech \
      google-cloud-translate \
      google-cloud-vision \
      koalas \
      matplotlib \
      nltk \
      numba \
      numpy \
      openblas \
      orc \
      pandas \
      pyarrow \
      pysal \
      pytables \
      python \
      regex \
      requests \
      rtree \
      scikit-image \
      scikit-learn \
      scipy \
      seaborn \
      sqlalchemy \
      sympy \
      virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user

USER yarn_docker_user
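
Note that the COPY instructions above assume the build context next to the Dockerfile contains the extra jars, the Miniconda3 installer, and test_util.py; the commands in the next section download or create these files. For the example image, the context looks roughly like this:

# Files expected in the Docker build context for the example Dockerfile
# (downloaded or created by the commands in the next section).
ls
# Dockerfile
# Miniconda3-py39_4.10.3-Linux-x86_64.sh
# spark-bigquery-with-dependencies_2.12-0.22.2.jar
# test_util.py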

Build and push the image

The following commands build and push the example Docker image; you can make changes according to your customizations.

# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
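
Optionally, you can run the built image locally as a quick sanity check that the customizations landed. These commands are not part of the documented workflow and assume Docker is available on your workstation:

# Optional sanity check: confirm the Conda Python and the extra jars are
# present in the image.
docker run --rm "${IMAGE}" /opt/miniconda3/bin/python --version
docker run --rm "${IMAGE}" ls /opt/spark/jars/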

Create a Dataproc cluster

After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.

gcloud

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags

Replace the following (a complete example command follows this list):

  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.
  • DP_IMAGE: The Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 will use a qualified minor version later than 2.0.49).
  • --optional-components=DOCKER: Enables the Docker component on the cluster.
  • --properties flag:
    • dataproc:yarn.docker.enable=true: Required property to enable the Dataproc Docker on YARN feature.
    • dataproc:yarn.docker.image: Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}.

      Example:

      dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1

      Requirement: You must host your Docker image on Container Registry or Artifact Registry. (Dataproc cannot fetch containers from other registries.)

      Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
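
The following is an illustrative version of the create command with example values filled in; the project, cluster name, region, and image tag are placeholders based on the earlier examples.

gcloud dataproc clusters create my-docker-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=gcr.io/my-project/my-image:1.0.1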

When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.

Submit a Spark job to the cluster

After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.

Set job properties:

# Set the Docker image URI.
IMAGE=(e.g., gcr.io/my-project/my-image:1.0.1)

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"
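
If you want to double-check the result, you can print the assembled string. It is a single value in gcloud's alternate-delimiter syntax, where the leading ^#^ switches the list separator from a comma to #. The output shown below assumes IMAGE=gcr.io/my-project/my-image:1.0.1, the example value from above.

# Optional: inspect the assembled properties value.
echo "${JOB_PROPERTIES}"
# ^#^#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=gcr.io/my-project/my-image:1.0.1#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=gcr.io/my-project/my-image:1.0.1#...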


gcloud

Submit the job to the cluster.

gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=${JOB_PROPERTIES}

Replace the following (a complete example follows this list):

  • PYFILE: The file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename).
  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.
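
Putting the pieces together, the following is an illustrative end-to-end example: it writes a small PySpark job that exercises the image customizations (the test_util module placed on PYTHONPATH and the Conda-installed pandas package), then submits it with the properties assembled above. The file name, cluster name, and region are placeholders matching the earlier examples.

# Illustrative PySpark job that uses the custom Python module and a Conda
# package from the example image.
cat >test_docker_job.py <<EOF
import pandas as pd
import test_util
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("docker-on-yarn-test").getOrCreate()
test_util.hello("Docker on YARN")
df = spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3]}))
df.show()
spark.stop()
EOF

# Submit the job (cluster and region values are examples).
gcloud dataproc jobs submit pyspark test_docker_job.py \
    --cluster=my-docker-cluster \
    --region=us-central1 \
    --properties=${JOB_PROPERTIES}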
