Build custom container images for Dataflow

This document describes how to create a custom container image for Dataflow jobs.

Requirements

A custom container image for Dataflow must meet the following requirements:

  • The Apache Beam SDK and necessary dependencies are installed. We recommend starting with a default Apache Beam SDK image. For more information, see Select a base image in this document.
  • The /opt/apache/beam/boot script must run as the last step during container startup. This script initializes the worker environment and starts the SDK worker process. This script is the default ENTRYPOINT in the Apache Beam SDK images. However, if you use a different base image, or if you override the default ENTRYPOINT, then you must run the script explicitly. For more information, see Modify the container entrypoint in this document.
  • Your container image must support the architecture of the worker VMs for your Dataflow job. If you plan to use the custom container on ARM VMs, we recommend building a multi-architecture image. For more information, see Build a multi-architecture container image. For one possible approach, see the sketch after this list.
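The following is a minimal sketch of building and pushing a multi-architecture image with Docker Buildx, assuming Buildx and a builder with multi-platform support are available in your Docker installation. IMAGE_URI is a placeholder for your Artifact Registry image path:

# Build for both x86 and ARM worker VMs and push the result to the registry.
docker buildx build --platform linux/amd64,linux/arm64 \
    --tag IMAGE_URI --push .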

Before you begin

  1. Verify that the version of the Apache Beam SDK installed supports Runner v2 and your language version. For more information, see Install the Apache Beam SDK.

  2. To test your container image locally, you must have Docker installed. For more information, see Get Docker. For one way to inspect an image locally after you build it, see the sketch after this list.

  3. Create an Artifact Registry repository. Specify the Docker image format. You must have at least Artifact Registry Writer access to the repository.

    To create a new repository, run the gcloud artifacts repositories create command:

    gcloud artifacts repositories create REPOSITORY \
        --repository-format=docker \
        --location=REGION \
        --async

    Replace the following:

    • REPOSITORY: a name for your repository. Repository names must be unique for each location in a project.
    • REGION: the region to deploy your Dataflow job in. Select a Dataflow region close to where you run the commands. The value must be a valid region name. For more information about regions and locations, see Dataflow locations.

    This example uses the --async flag. The command returns immediately, without waiting for the operation to complete.

  4. To configure Docker to authenticate requests for Artifact Registry, run the gcloud auth configure-docker command:

    gcloud auth configure-docker REGION-docker.pkg.dev

    The command updates your Docker configuration. You can now connect with Artifact Registry in your Google Cloud project to push images.
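After you build an image (see Build and push the image in this document), one quick local check is to open a shell inside it and confirm that your files and dependencies are present. This is only a sketch; the image path is a placeholder, and overriding the entrypoint here is solely for inspection:

# Open an interactive shell in the image for inspection only.
docker run --rm -it --entrypoint /bin/bash \
    REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/IMAGE:TAG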

Select a base image

We recommend starting with an Apache Beam SDK image as the base container image. These images are released as part of Apache Beam releases to Docker Hub.

Use an Apache Beam base image

To use an Apache Beam SDK image as the base image, specify the container image in the FROM instruction and then add your own customizations.

Java

This example uses Java 8 with the Apache Beam SDK version 2.69.0.

FROM apache/beam_java8_sdk:2.69.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./

The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Java 11 environment, the FROM line must specify a Java 11 environment: apache/beam_java11_sdk:....

Python

This example uses Python 3.10 with the Apache Beam SDK version 2.69.0.

FROM apache/beam_python3.10_sdk:2.69.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./

The runtime version of the custom container must match the runtime that you will use to start the pipeline. For example, if you will start the pipeline from a local Python 3.10 environment, the FROM line must specify a Python 3.10 environment: apache/beam_python3.10_sdk:....

Go

This example uses Go with the Apache Beam SDK version 2.69.0.

FROM apache/beam_go_sdk:2.69.0

# Make your customizations here, for example:
ENV FOO=/bar
COPY path/to/myfile ./

Use a custom base image

If you want to use a different base image, or need to modify some aspect of the default Apache Beam images (such as OS version or patches), use a multistage build process. Copy the necessary artifacts from a default Apache Beam base image.

Note: Apache Beam and Dataflow are routinely tested using Debian-based images. Alpine-based images are not supported at this time.

Set the ENTRYPOINT to run the /opt/apache/beam/boot script, which initializes the worker environment and starts the SDK worker process. If you don't set this entrypoint, the Dataflow workers don't start properly.

The following example shows a Dockerfile that copies files from the Apache Beam SDK:

Java

FROM openjdk:8

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_java8_sdk:2.69.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Python

FROM python:3.10-slim

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.69.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.10_sdk:2.69.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

This example assumes necessary dependencies (in this case, Python 3.10 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image ensures that the image has the necessary SDK dependencies and reduces the worker startup time.

Important: The SDK version specified in the RUN and COPY instructions must match the version used to launch the pipeline.

Go

FROM golang:latest

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_go_sdk:2.69.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

Modify the container entrypoint

If your container runs a custom script during container startup, the script must end with running /opt/apache/beam/boot. Arguments passed by Dataflow during container startup must be passed to the default boot script. The following example shows a custom startup script that calls the default boot script:

#!/bin/bash

echo "This is my custom script"
# ...

# Pass command arguments to the default boot script.
/opt/apache/beam/boot "$@"

In your Dockerfile, set the ENTRYPOINT to call your script:

Java

FROM apache/beam_java8_sdk:2.69.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Python

FROM apache/beam_python3.10_sdk:2.69.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Go

FROM apache/beam_go_sdk:2.69.0

COPY script.sh path/to/my/script.sh
ENTRYPOINT [ "path/to/my/script.sh" ]

Build and push the image

You can use Cloud Build or Docker to build your container image and push it to an Artifact Registry repository.

Cloud Build

To build the image and push it to your Artifact Registry repository, run the gcloud builds submit command:

gcloud builds submit --tag REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/IMAGE:TAG .

Docker

docker build . --tag REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/IMAGE:TAG
docker push REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/IMAGE:TAG

Replace the following:

  • REGION: the region to deploy your Dataflow job in. The value of the REGION variable must be a valid region name.
  • PROJECT_ID: the Google Cloud project ID.
  • REPOSITORY: the image repository name.
  • IMAGE: the image's name.
  • TAG: the image tag. Always specify a versioned container SHA or tag. Don't use the :latest tag or a mutable tag.
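For example, with hypothetical values (a us-central1 repository named my-repo in project my-project, and an image tagged with a date), the Docker commands might look like the following; all of the names here are placeholders:

docker build . --tag us-central1-docker.pkg.dev/my-project/my-repo/dataflow/beam-custom:2024-06-01
docker push us-central1-docker.pkg.dev/my-project/my-repo/dataflow/beam-custom:2024-06-01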

Pre-install Python dependencies

This section applies to Python pipelines.

When you launch a Python Dataflow job, you can specify additional dependencies by using the --requirements_file or the --extra_packages option at runtime. For more information, see Managing Python Pipeline Dependencies. Additional dependencies are installed in each Dataflow worker container. When the job first starts and during autoscaling, the dependency installation often leads to high CPU usage and a long warm-up period on all newly started Dataflow workers.

To avoid repetitive dependency installations, you can pre-build a custom Python SDK container image with the dependencies pre-installed. You can perform this step at build time by using a Dockerfile, or at run time when you submit the job.

Workers create a new virtual Python environment when they start the container. For this reason, install dependencies into the default (global) Python environment instead of creating a virtual environment. If you activate a virtual environment in your container image, this environment might not be activated when the job starts. For more information, see Common issues.

Pre-install using a Dockerfile

To add extra dependencies directly to your Python custom container, use the following commands:

FROM apache/beam_python3.10_sdk:2.69.0

COPY requirements.txt .

# Pre-install Python dependencies. For reproducible builds,
# supply all of the dependencies and their versions in a requirements.txt file.
RUN pip install -r requirements.txt

# You can also install individual dependencies.
RUN pip install lxml

# Pre-install other dependencies.
RUN apt-get update \
  && apt-get dist-upgrade -y \
  && apt-get install -y --no-install-recommends ffmpeg

Submit your job with the --sdk_container_image and the --sdk_location pipeline options. The --sdk_location option prevents the SDK from downloading when your job launches. The SDK is retrieved directly from the container image.

The following example runs the wordcount example pipeline:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --sdk_container_image=IMAGE_URI \
  --sdk_location=container

Replace the following:

  • INPUT_FILE: an input file for the pipeline
  • OUTPUT_FILE: a path to write output to
  • PROJECT_ID: the Google Cloud Platform project ID
  • REGION: the region to deploy your Dataflow job in
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files
  • IMAGE_URI: the custom container image URI

Pre-build a container image when submitting the job

Pre-building a container image lets you pre-install the pipeline dependencies before job startup. You don't need to build a custom container image.

To pre-build a container with additional Python dependencies when you submit a job, use the following pipeline options:

  • --prebuild_sdk_container_engine=[cloud_build | local_docker]. When this flag is set, Apache Beam generates a custom container and installs all of the dependencies specified by the --requirements_file and the --extra_packages options. This flag supports the following values:

    • cloud_build. Use Cloud Build to build the container. The Cloud Build API must be enabled in your project.
    • local_docker. Use your local Docker installation to build the container.
  • --docker_registry_push_url=IMAGE_PATH. Replace IMAGE_PATH with an Artifact Registry folder.

  • --sdk_location=container. This option prevents the workers from downloading the SDK when your job launches. Instead, the SDK is retrieved directly from the container image.

The following example uses Cloud Build to pre-build the image:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --disk_size_gb=DISK_SIZE_GB \
  --experiments=use_runner_v2 \
  --requirements_file=./requirements.txt \
  --prebuild_sdk_container_engine=cloud_build \
  --docker_registry_push_url=IMAGE_PATH \
  --sdk_location=container

The pre-build feature requires the Apache Beam SDK for Python version 2.25.0 or later.

The SDK container image pre-building workflow uses the image passed using the --sdk_container_image pipeline option as the base image. If the option is not set, by default an Apache Beam image is used as the base image.

Note: With Apache Beam SDK versions 2.38.0 and earlier, to specify the base image, use --prebuild_sdk_container_base_image.

You can reuse a prebuilt Python SDK container image in another job with the same dependencies and SDK version. To reuse the image, pass the prebuilt container image URL to the other job by using the --sdk_container_image pipeline option. Remove the dependency options --requirements_file, --extra_packages, and --setup_file.
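As an illustration, a follow-up job that reuses the prebuilt image might be launched as follows. PREBUILT_IMAGE_URI and the other values are placeholders, and no dependency options are passed because the dependencies are already in the image:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --sdk_container_image=PREBUILT_IMAGE_URI \
  --sdk_location=container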

If you don't plan to reuse the image, delete it after the job completes. You can delete the image with the gcloud CLI or in the Artifact Registry pages in the Google Cloud console.

If the image is stored in Artifact Registry, use the artifacts docker images delete command:

gcloud artifacts docker images delete IMAGE --delete-tags
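For example, with a hypothetical image path, the command might look like the following:

gcloud artifacts docker images delete \
    us-central1-docker.pkg.dev/my-project/my-repo/dataflow/my-image:my-tag \
    --delete-tags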

Common issues

  • If your job has extra Python dependencies from a private PyPI mirror and they can't be pulled by a remote Cloud Build job, try using the local_docker option or try building your container using a Dockerfile.

  • If the Cloud Build job fails with docker exit code 137, the build job ran out of memory, potentially due to the size of the dependencies being installed. Use a larger Cloud Build worker machine type by passing --cloud_build_machine_type=machine_type, where machine_type is one of the following options:

    • n1-highcpu-8
    • n1-highcpu-32
    • e2-highcpu-8
    • e2-highcpu-32

    By default, Cloud Build uses the machine type e2-medium. For an example of passing this flag, see the sketch after this list.

  • In Apache Beam 2.44.0 and later, workers create a virtual environment when starting a custom container. If the container creates its own virtual environment to install dependencies, those dependencies are discarded. This behavior can cause errors such as the following:

    ModuleNotFoundError: No module named '<dependency name>'

    To avoid this issue, install dependencies into the default (global) Python environment. As a workaround, disable this behavior in Beam 2.48.0 and later by setting the following environment variable in your container image:

    ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
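As a sketch of passing --cloud_build_machine_type, the pre-build submission from the earlier example might look like the following; all of the uppercase values are placeholders:

python -m apache_beam.examples.wordcount \
  --input=INPUT_FILE \
  --output=OUTPUT_FILE \
  --project=PROJECT_ID \
  --region=REGION \
  --temp_location=TEMP_LOCATION \
  --runner=DataflowRunner \
  --experiments=use_runner_v2 \
  --requirements_file=./requirements.txt \
  --prebuild_sdk_container_engine=cloud_build \
  --cloud_build_machine_type=e2-highcpu-8 \
  --docker_registry_push_url=IMAGE_PATH \
  --sdk_location=container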
