Pipeline options

This page documents Dataflow pipeline options. For information about how to use these options, see Setting pipeline options.

Basic options

This table describes basic pipeline options that are used by many jobs.

Java

dataflowServiceOptions (String)

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.

enableStreamingEngine (boolean)

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. When set to the default value, the steps of your streaming pipeline are run entirely on worker VMs.

Supported in Flex Templates.

experiments (String)

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.

gcpTempLocation (String)

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the path, the at sign (@) can't be followed by a number or by an asterisk (*).

If not set, the value of tempLocation is used. If neither gcpTempLocation nor tempLocation is set, then Dataflow creates a new Cloud Storage bucket.

Supported in Flex Templates.

jobName (String)

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Also used when updating an existing pipeline.

If not set, Dataflow generates a unique name automatically.

labels (String)

User-defined labels, also known as additional-user-labels. User-specified labels are available in billing exports, which you can use for cost attribution. Specify a JSON string of "key": "value" pairs. Example: --labels='{ "name": "wrench", "mass": "1_3kg", "count": "3" }'.

Supported in Flex Templates.

project (String)

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, defaults to the project that is configured in the gcloud CLI.

region (String)

Specifies a region for deploying your Dataflow jobs.

If not set, defaults to us-central1.

runner (Class (NameOfRunner))

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner.

The default value is DirectRunner (local mode).

stagingLocation (String)

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, defaults to the value of gcpTempLocation.

tempLocation (String)

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the path, the at sign (@) can't be followed by a number or by an asterisk (*).

If you set both tempLocation and gcpTempLocation, then Dataflow uses the value of gcpTempLocation.

Supported in Flex Templates.

Python/YAML

dataflow_service_options (str)

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.
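
For example, a sketch of passing service options programmatically with the Apache Beam Python SDK; enable_google_cloud_profiler is used here only to illustrate a documented service option:

  from apache_beam.options.pipeline_options import PipelineOptions

  # Comma-separated on the command line, for example:
  #   --dataflow_service_options=enable_google_cloud_profiler
  # The same flag can be passed programmatically through the flags argument.
  options = PipelineOptions(
      flags=['--dataflow_service_options=enable_google_cloud_profiler'],
  )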

experiments (str)

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.
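
As a sketch, assuming hypothetical experiment names, the same flag can be set programmatically:

  from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

  # Equivalent to --experiments=experiment_a,experiment_b on the command line.
  # experiment_a and experiment_b are placeholders, not real experiment names.
  options = PipelineOptions(flags=['--experiments=experiment_a,experiment_b'])

  # Experiments can also be appended after the options object is created.
  options.view_as(DebugOptions).add_experiment('experiment_c')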

enable_streaming_engine (bool)

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value depends on your pipeline configuration. For more information, see Use Streaming Engine. When set to false, the steps of your streaming pipeline are run entirely on worker VMs.

Supported in Flex Templates.

job_name (str)

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details.

If not set, Dataflow generates a unique name automatically.

labels (str)

User-defined labels, also known as additional-user-labels. User-specified labels are available in billing exports, which you can use for cost attribution.

For each label, specify a "key=value" pair.

Keys must conform to the regular expression: [\p{Ll}\p{Lo}][\p{Ll}\p{Lo}\p{N}_-]{0,62}.

Values must conform to the regular expression: [\p{Ll}\p{Lo}\p{N}_-]{0,63}.

For example, to define two user labels: --labels "name=wrench" --labels "mass=1_3kg".

Supported in Flex Templates.

no_wait_until_finish (bool)

By default, the "with" statement waits for the job to complete. Set this flag to bypass this behavior and continue execution immediately.

pickle_library (str)

The pickle library to use for data serialization. Supported values are dill, cloudpickle, and default. To use the cloudpickle option, set the option both at the start of the code and as a pipeline option. You must set the option in both places because pickling starts when PTransforms are constructed, which happens before pipeline construction. To include at the start of the code, add lines similar to the following:

from apache_beam.internal import pickler
pickler.set_library(pickler.USE_CLOUDPICKLE)

If not set, defaults to dill.
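
Putting the two requirements together, a minimal sketch of a pipeline that uses cloudpickle (the pipeline itself is a placeholder):

  import apache_beam as beam
  from apache_beam.internal import pickler
  from apache_beam.options.pipeline_options import PipelineOptions

  # Set the library before any PTransforms are constructed.
  pickler.set_library(pickler.USE_CLOUDPICKLE)

  # Also set the corresponding pipeline option.
  options = PipelineOptions(flags=['--pickle_library=cloudpickle'])

  with beam.Pipeline(options=options) as p:
      _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)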

project (str)

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, throws an error.

region (str)

Specifies a region for deploying your Dataflow jobs.

If not set, defaults to us-central1.

runner (str)

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner.

The default value is DirectRunner (local mode).

sdk_location (str)

Path to the Apache Beam SDK. Must be a valid URL, Cloud Storage path, or local path to an Apache Beam SDK tarball or tar archive file. To install the Apache Beam SDK from within a container, use the value container.

If not set, defaults to the current version of the Apache Beam SDK.

Supported in Flex Templates.

staging_location (str)

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, defaults to a staging directory within temp_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud.

temp_location (str)

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the temp_location filename, the at sign (@) can't be followed by a number or by an asterisk (*).

You must specify either temp_location or staging_location (or both). If temp_location is not set, temp_location defaults to the value for staging_location.

Supported in Flex Templates.
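
As a sketch of how these basic options fit together in the Python SDK, with a hypothetical project ID and bucket:

  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions(
      runner='DataflowRunner',
      project='my-project-id',                    # hypothetical project ID
      region='us-central1',
      temp_location='gs://my-bucket/temp',        # hypothetical bucket
      staging_location='gs://my-bucket/staging',
      job_name='example-job',
  )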

Go

dataflow_service_options (str)

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.40.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.

experiments (str)

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.

job_name (str)

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details.

If not set, Dataflow generates a unique name automatically.

project (str)

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, returns an error.

region (str)

Specifies a region for deploying your Dataflow jobs.

If not set, returns an error.

runner (str)

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use dataflow. To run your pipeline locally, use direct.

The default value is direct (local mode).

staging_location (str)

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, returns an error.

temp_location (str)

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the temp_location filename, the at sign (@) can't be followed by a number or by an asterisk (*).

If temp_location is not set, temp_location defaults to the value for staging_location.

Dataflow jobs use Cloud Storage to store temporary files during pipeline execution. To avoid being billed for unnecessary storage costs, turn off the soft delete feature on buckets that your Dataflow jobs use for temporary storage. For more information, see Disable soft delete.

Resource utilization

This table describes pipeline options that you can set to manage resource utilization.

Java

autoscalingAlgorithm (String)

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that don't use Streaming Engine.

flexRSGoal (String)

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the numWorkers, autoscalingAlgorithm, zone, region, and workerMachineType parameters. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

maxNumWorkers (int)

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

numberOfWorkerHarnessThreads (int)

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

For more information about using this option to reduce memory usage, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

Supported in Flex Templates.

numWorkers (int)

The initial number of Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

Python/YAML

autoscaling_algorithm (str)

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that don't use Streaming Engine.

flexrs_goal (str)

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the num_workers, autoscaling_algorithm, zone, region, and machine_type parameters. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

max_num_workers (int)

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

number_of_worker_harness_threads (int)

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

When using this option to reduce memory usage, using the --experiments=no_use_multiple_sdk_containers option might also be necessary, particularly for batch pipelines. For more information, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

Supported in Flex Templates.

experiments=no_use_multiple_sdk_containers

Configures Dataflow worker VMs to start only one containerized Apache Beam Python SDK process. Does not decrease the total number of threads, therefore all threads run in a single Apache Beam SDK process. Due to Python's global interpreter lock (GIL), CPU utilization might be limited and performance reduced. When using this option with a worker machine type that has many vCPU cores, to prevent stuck workers, consider reducing the number of worker harness threads.

If not specified, Dataflow starts one Apache Beam SDK process per VM core. This experiment only affects Python pipelines that use Dataflow Runner V2.

Supported in Flex Templates. Can be set by the template or by using the --additional_experiments option.
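
For example, a memory-constrained batch pipeline might combine this experiment with a lower thread count. A sketch, with values chosen only for illustration:

  from apache_beam.options.pipeline_options import PipelineOptions

  # Run a single Apache Beam SDK process per worker VM and cap its thread count.
  options = PipelineOptions(flags=[
      '--experiments=no_use_multiple_sdk_containers',
      '--number_of_worker_harness_threads=4',
  ])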

num_workers (int)

The number of Compute Engine instances to use when executing your pipeline.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

Go

autoscaling_algorithm (str)

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs.

flexrs_goal (str)

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the num_workers, autoscaling_algorithm, zone, region, and worker_machine_type parameters. Requires Apache Beam SDK 2.40.0 or later. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

max_num_workers (int)

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

number_of_worker_harness_threads (int)

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

For more information about using this option to reduce memory usage, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

num_workers (int)

The number of Compute Engine instances to use when executing your pipeline.

If unspecified, the Dataflow service determines an appropriate number of workers.

Debugging

This table describes pipeline options that you can use to debug your job.

Java

hotKeyLoggingEnabled (boolean)

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

If not set, only the presence of a hot key is logged.

Note: Hot key detection and logging is disabled for streaming pipelines as of March 2022.

Python/YAML

enable_hot_key_logging (bool)

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

Requires Dataflow Runner V2 and Apache Beam SDK 2.29.0 or later. Must be set as a service option, using the format dataflow_service_options=enable_hot_key_logging.

If not set, only the presence of a hot key is logged.

Note: Hot key detection and logging is disabled for streaming pipelines as of March 2022.
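
A sketch of passing the option in the required service-option form with the Python SDK:

  from apache_beam.options.pipeline_options import PipelineOptions

  # enable_hot_key_logging must be passed as a Dataflow service option.
  options = PipelineOptions(
      flags=['--dataflow_service_options=enable_hot_key_logging'],
  )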

Go

No debugging pipeline options are available.

Security and networking

This table describes pipeline options for controlling your account and networking.

Java

dataflowKmsKey (String)

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify tempLocation to use this feature.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

Supported in Flex Templates.

gcpOauthScopes (List)

Specifies the OAuth scopes that will be requested when creating the default Google Cloud credentials. Might have no effect if you manually specify the Google Cloud credential or credential factory.

If not set, the following scopes are used:

"https://www.googleapis.com/auth/bigquery",
"https://www.googleapis.com/auth/bigquery.insertdata",
"https://www.googleapis.com/auth/cloud-platform",
"https://www.googleapis.com/auth/datastore",
"https://www.googleapis.com/auth/devstorage.full_control",
"https://www.googleapis.com/auth/pubsub",
"https://www.googleapis.com/auth/userinfo.email"

impersonateServiceAccount (String)

If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs.

If not set, Application Default Credentials are used to submit Dataflow jobs.

serviceAccount (String)

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

Supported in Flex Templates.

network (String)

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

Supported in Flex Templates.

subnetwork (String)

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

Supported in Flex Templates.

usePublicIps (boolean)

Specifies whether Dataflow workers use external IP addresses. If the value is set to false, Dataflow workers use internal IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. External IP addresses have an associated cost.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If not set, the default value is true and Dataflow workers use external IP addresses.

Python/YAML

dataflow_kms_key (str)

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

Supported in Flex Templates.
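
A sketch of passing a CMEK in the Python SDK, assuming a hypothetical Cloud KMS key and bucket; Cloud KMS key names follow the projects/PROJECT/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY format:

  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions(
      dataflow_kms_key=(
          'projects/my-project-id/locations/us-central1/'
          'keyRings/my-key-ring/cryptoKeys/my-key'    # hypothetical key
      ),
      temp_location='gs://my-bucket/temp',            # required with a CMEK
  )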

gcp_oauth_scopes (list[str])

Specifies the OAuth scopes that will be requested when creating Google Cloud credentials. If set programmatically, must be set as a list of strings.

If not set, the following scopes are used:

"https://www.googleapis.com/auth/bigquery",
"https://www.googleapis.com/auth/cloud-platform",
"https://www.googleapis.com/auth/datastore",
"https://www.googleapis.com/auth/devstorage.full_control",
"https://www.googleapis.com/auth/spanner.admin",
"https://www.googleapis.com/auth/spanner.data",
"https://www.googleapis.com/auth/userinfo.email"

impersonate_service_account (str)

If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs.

If not set, Application Default Credentials are used to submit Dataflow jobs.

service_account_email (str)

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

Supported in Flex Templates.

network (str)

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

Supported in Flex Templates.

subnetwork (str)

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

Supported in Flex Templates.

use_public_ips (Optional[bool])

Specifies whether Dataflow workers must use external IP addresses. External IP addresses have an associated cost.

To enable external IP addresses for Dataflow workers, specify the command-line flag --use_public_ips, or set the option using the programmatic API, for example options = PipelineOptions(use_public_ips=True).

To make Dataflow workers use internal IP addresses for all communication, specify the command-line flag --no_use_public_ips, or set the option using the programmatic API, for example options = PipelineOptions(use_public_ips=False). In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If the option is not explicitly enabled or disabled, the Dataflow workers use external IP addresses.

Supported in Flex Templates.

no_use_public_ips

Command-line flag that sets use_public_ips to False. See use_public_ips.

Supported in Flex Templates.

Go

dataflow_kms_key (str)

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature. Requires Apache Beam SDK 2.40.0 or later.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

network (str)

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

service_account_email (str)

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

subnetwork (str)

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

no_use_public_ips (bool)

Specifies that Dataflow workers must not use external IP addresses. If the value is set to true, Dataflow workers use internal IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. External IP addresses have an associated cost.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If not set, Dataflow workers use external IP addresses.

Streaming pipeline management

This table describes pipeline options that let you manage the state of your Dataflow pipelines across job instances.

Java

createFromSnapshot (String)

Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots.

If not set, no snapshot is used to create a job.

enableStreamingEngine (boolean)

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs.

Supported in Flex Templates.

update (boolean)

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline.

The default value is false.

Python/YAML

create_from_snapshot (str)

Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots.

If not set, no snapshot is used to create a job.

enable_streaming_engine (bool)

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs.

Supported in Flex Templates.

update (bool)

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline.

The default value is false.
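
For example, to replace a running streaming job in place, submit the updated pipeline with the same job_name and the update option. A sketch with hypothetical names:

  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions(
      runner='DataflowRunner',
      job_name='my-streaming-job',          # must match the running job's name
      update=True,
      project='my-project-id',              # hypothetical project ID
      region='us-central1',
      temp_location='gs://my-bucket/temp',  # hypothetical bucket
  )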

Go

update (bool)

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. Requires Apache Beam SDK 2.40.0 or later.

The default value is false.

Worker-level options

This table describes pipeline options that apply to the Dataflow worker level.

Java

diskSizeGb (int)

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud Platform project.

filesToStage (List<String>)

A non-empty list of local files, directories of files, or archives (such as JAR or zip files) to make available to each worker. If you set this option, then only the files you specify are uploaded (the Java classpath is ignored). You must specify all of your resources in the correct classpath order. Resources are not limited to code; they can also include configuration files and other resources to make available to all workers. Your code can access the listed resources using the standard Java resource lookup methods. Cautions: Specifying a directory path is suboptimal because Dataflow zips the files before uploading, which involves a higher startup time cost. Also, don't use this option to transfer data to workers that is meant to be processed by the pipeline, because doing so is significantly slower than using built-in Cloud Storage or BigQuery APIs combined with the appropriate Dataflow data source.

If filesToStage is omitted, Dataflow infers the files to stage based on the Java classpath. The same considerations and cautions apply (types of files to list and how to access them from your code).

workerDiskType (String)

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

workerMachineType (String)

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

Supported in Flex Templates.

workerRegion (String)

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned.

Note: This option cannot be combined with workerZone or zone.

If not set, defaults to the value set for region.

Supported in Flex Templates.

workerZone (String)

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with workerRegion or zone.

If you specify either region or workerRegion, workerZone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

zone (String)

(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline.

If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

workerCacheMb (int)

Specifies the size of the cache for side inputs and user state. By default, Dataflow allocates 100 MB of memory for caching side inputs and user state. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

Defaults to 100 MB.

maxCacheMemoryUsageMb (int)

For jobs that use Dataflow Runner v2, specifies the cache size for side inputs and user state in the format maxCacheMemoryUsageMb=N, where N is the cache size in MB. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory. Alternatively, to set the cache size as a percentage of total VM space, specify maxCacheMemoryUsagePercent.

Defaults to 100 MB.

maxCacheMemoryUsagePercent (int)

For jobs that use Dataflow Runner v2, specifies the cache size as a percentage of total VM space in the format maxCacheMemoryUsagePercent=N, where N is the cache size as a percentage of total VM space. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

Defaults to 20%.

elementProcessingTimeoutMinutes (int)

For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element in the format elementProcessingTimeoutMinutes=N, where N is the number of minutes. The minimum supported timeout is 1 minute. If the timeout is exceeded, Runner v2 restarts the SDK harness. When failing to process a single element, Runner v2 will restart the SDK harness a maximum of 4 times for batch jobs, but there isn't a cap for restarting the SDK harness for streaming jobs. This feature is available in Apache Beam SDK versions 2.68.0 and later.

Defaults to 0 (no timeout).

Python/YAML

disk_size_gb (int)

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud Platform project.

files_to_stage (list[str])

Specifies a list of local files to upload to the worker's staging location.

This feature is available in Apache Beam SDK versions 2.63.0 and later. For Dataflow, the staging location is /tmp/staged/.

worker_disk_type (str)

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

machine_type (str)

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

Supported in Flex Templates.

worker_region (str)

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned.

Note: This option cannot be combined with worker_zone or zone.

If not set, defaults to the value set for region.

Supported in Flex Templates.

worker_zone (str)

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with worker_region or zone.

If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

zone (str)

(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline.

If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

max_cache_memory_usage_mb (int)

Starting in Apache Beam Python SDK version 2.52.0, you can use this option to control the cache size for side inputs and for user state. Applies for each SDK process. Increasing the amount of memory allocated to workers might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

To increase the side input cache value, use one of the following pipeline options.

  • For SDK versions 2.52.0 and later, use --max_cache_memory_usage_mb=N.
  • For SDK versions 2.42.0 to 2.51.0, use --experiments=state_cache_size=N.

    Replace N with the cache size, in MB.

  • For SDK versions 2.52.0-2.54.0, defaults to 100 MB.

  • For other SDK versions, defaults to 0 MB.
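
A sketch of raising the side-input cache to 1000 MB, using the flag that matches the SDK version (the value is for illustration only):

  from apache_beam.options.pipeline_options import PipelineOptions

  # Apache Beam SDK 2.52.0 or later:
  options = PipelineOptions(flags=['--max_cache_memory_usage_mb=1000'])

  # Apache Beam SDK 2.42.0 through 2.51.0:
  options = PipelineOptions(flags=['--experiments=state_cache_size=1000'])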

element_processing_timeout_minutes (int)

For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element in the format element_processing_timeout_minutes=N, where N is the number of minutes. The minimum supported timeout is 1 minute. If the timeout is exceeded, Runner v2 restarts the SDK harness. When failing to process a single element, Runner v2 will restart the SDK harness a maximum of 4 times for batch jobs, but there isn't a cap for restarting the SDK harness for streaming jobs. This feature is available in Apache Beam SDK versions 2.68.0 and later.

Defaults to 0 (no timeout).

Go

disk_size_gb (int)

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud Platform project.

disk_type (str)

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

worker_machine_type (str)

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

worker_region (str)

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned.

Note: This option cannot be combined with worker_zone or zone.

If not set, defaults to the value set for region.

worker_zone (str)

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. Requires Apache Beam SDK 2.40.0 or later.

Note: This option cannot be combined with worker_region or zone.

If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

element_processing_timeout (duration)

For jobs that use Dataflow Runner v2, this flag specifies the timeout for any PTransform to finish processing a single element, in the format element_processing_timeout=MmSs, where MmSs is a duration in minutes and seconds. For example, 5m or 3m30s. The minimum supported timeout is 1 minute. If the timeout is exceeded, Runner v2 restarts the SDK harness. When failing to process a single element, Runner v2 will restart the SDK harness a maximum of 4 times for batch jobs, but there isn't a cap for restarting the SDK harness for streaming jobs. This feature is available in Apache Beam SDK versions 2.68.0 and later.

Defaults to 0 (no timeout).

Setting other local pipeline options

When executing your pipeline locally, the default values for the properties in PipelineOptions are usually sufficient.

Java

You can find the default values for PipelineOptions in the Apache Beam SDK for Java API reference; see the PipelineOptions class listing for complete details.

If your pipeline uses Google Cloud products such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use GcpOptions.setProject to set your Google Cloud Platform Project ID. You may also need to set credentials explicitly. See the GcpOptions class for complete details.

Python/YAML

You can find the default values for PipelineOptions in the Apache Beam SDK for Python API reference; see the PipelineOptions module listing for complete details.

If your pipeline uses Google Cloud services such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use options.view_as(GoogleCloudOptions).project to set your Google Cloud Project ID. You may also need to set credentials explicitly. See the GoogleCloudOptions class for complete details.
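
A minimal sketch, assuming a hypothetical project ID:

  from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

  options = PipelineOptions()
  google_cloud_options = options.view_as(GoogleCloudOptions)
  google_cloud_options.project = 'my-project-id'   # hypothetical project ID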

Go

You can find the default values for PipelineOptions in the Apache Beam SDK for Go API reference; see jobopts for more details.
