Dataflow security and permissions

You can run Dataflow pipelines locally or on managed Google Cloud Platform resources by using the Dataflow runner. Whether running pipelines locally or in the cloud, your pipeline and its workers use a permissions system to maintain secure access to pipeline files and resources. Dataflow permissions are assigned according to the role that's used to access pipeline resources. This document explains the following concepts:

  • Upgrading Dataflow VMs
  • Roles and permissions required for running local and Google Cloud Platform pipelines
  • Roles and permissions required for accessing pipeline resources
  • Types of data that the Dataflow service uses and how that data is secured

Before you begin

Read about Google Cloud Platform project identifiers in the Google Cloud Platform overview. These identifiers include the project name, project ID, and project number.

Upgrade and patch Dataflow VMs

Dataflow uses Container-Optimized OS. The security processes of Container-Optimized OS also apply to Dataflow.

Batch pipelines are time-bound and don't require maintenance. When a new batch pipeline starts, the latest Dataflow image is used.

For streaming pipelines, if a security patch is immediately required, Google Cloud notifies you by using security bulletins. We recommend that you use the --update option to restart your streaming job with the latest Dataflow image.
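For example, a minimal sketch of restarting a streaming job with the --update option, assuming the Python SDK; the script name and job name are placeholders:

python my_streaming_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --streaming \
    --update \
    --job_name=JOB_NAME
    # JOB_NAME must match the name of the running streaming job.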

Dataflow container images are available in the Google Cloud console.

Container image security

Dataflow uses container images for the runtime environment of pipeline user code. By default, a prebuilt Apache Beam image is used. You can also provide a custom container.

Google hardens and patches the operating system of base images used by the Dataflow-owned images. Google promptly makes any patches to these images available. All Dataflow production resources, including the Beam images provided by default, are automatically and regularly scanned for vulnerabilities. Any identified issues in Google-owned containers are addressed under a defined service-level objective (SLO). Issues in Beam containers are addressed by the Beam community. As part of Dataflow's shared responsibility model, to manage security issues responsively, we recommend that you use custom container images.

While the Container-Optimized OS images that Dataflow uses are CIS Level 1 compliant, achieving overall compliance is a shared responsibility. The VM instances on which these containers run reside within the customer's project. Customers are responsible for scanning their own project resources. You can use Security Command Center to scan your resources for compliance and vulnerabilities.

Runtime environment

For the runtime environment of pipeline user code, Dataflow uses a prebuilt Apache Beam image, or a custom container if one was provided.

The user for container execution is selected by the Dataflow service. Pipeline resources are allocated on a per-job basis; there is no cross-pipeline sharing of VMs and other resources.

The runtime environment lifecycle is tied to the pipeline lifecycle: it starts when the pipeline starts and stops when the pipeline terminates. The runtime environment might be restarted one or more times during pipeline execution.

Security and permissions for local pipelines

When you run locally, your Apache Beam pipeline runs as the Google Cloud account that you configured with the Google Cloud CLI executable. Locally run Apache Beam SDK operations and your Google Cloud account have access to the same files and resources.

To list the Google Cloud account that you selected as your default, run the gcloud config list command.

Local pipelines can output data to local destinations, such as local files, or to cloud destinations, such as Cloud Storage or BigQuery. If your locally run pipeline writes files to cloud-based resources such as Cloud Storage, it uses your Google Cloud account credentials and the Google Cloud project that you configured as the Google Cloud CLI default. For instructions about how to authenticate with your Google Cloud account credentials, see the tutorial for the language you're using: Java, Python, or Go.

Security and permissions for pipelines on Google Cloud

When you run your pipeline, Dataflow uses two service accounts to manage security and permissions:

  • The Dataflow service account. The Dataflow service uses the Dataflow service account as part of the job creation request, such as to check project quota and to create worker instances on your behalf. Dataflow also uses the Dataflow service account during job execution to manage the job. This account is also known as the Dataflow service agent.

  • The worker service account. Worker instances use the worker service account to access input and output resources after you submit your job. By default, workers use the Compute Engine default service account associated with your project as the worker service account. As a best practice, we recommend that you specify a user-managed service account instead of using the default worker service account.

To impersonate the service account when you run a pipeline, the account that launches the pipeline needs the iam.serviceAccounts.actAs permission. This permission is included in the Service Account User role (roles/iam.serviceAccountUser). Impersonating a service account lets you temporarily adopt its identity and permissions to perform tasks requiring different access levels, while still avoiding the risks associated with persistent keys. For more information, see Use service account impersonation.

Depending on other project permissions, your user account might also need the roles/dataflow.developer role. If you are a project owner or editor, you already have the permissions contained in the roles/dataflow.developer role.
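For example, a minimal sketch of granting the Dataflow Developer role to a user account with the gcloud CLI; the project ID and email address are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:EMAIL_ADDRESS" \
    --role=roles/dataflow.developer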

Best practices

  • When possible, for the worker service account, specify a user-managed service account instead of using the default worker service account.
  • When giving permissions on resources, grant the role that contains the minimum required permissions for the task. You can create a custom role that includes only the required permissions.
  • When granting roles to access resources, use the lowest possible resource level. For example, instead of granting the roles/bigquery.dataEditor role on a project or folder, grant the role on the BigQuery table.
  • Create a bucket owned by your project to use as the staging bucket for Dataflow (see the sketch after this list). The default bucket permissions allow Dataflow to use the bucket to stage the executable files of the pipeline.
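A minimal sketch of creating a project-owned staging bucket with the gcloud CLI; the bucket name and location are placeholders:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --location=us-central1 \
    --uniform-bucket-level-access

You can then pass the bucket to your pipeline through its staging and temporary location options.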

Dataflow service account

All projects that have used the Dataflow Job resource have a Dataflow service account, also known as the Dataflow service agent, which has the following email:

service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com

This service account is created and managed by Google and assigned to your project automatically upon first usage of the Dataflow Job resource.

As part of running Dataflow pipelines, Dataflow manipulates resources on your behalf. For example, it creates additional VMs. When you run your pipeline with Dataflow, this service account is used.

This account is assigned the Dataflow Service Agent role on the project. It has the necessary permissions to run a Dataflow job in the project, including starting Compute Engine workers. This account is used exclusively by Dataflow and is specific to your project.

You can review the roles assigned to the Dataflow service account in the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Roles page.

    Go to Roles

  2. If applicable, select your project.

  3. In the list, click the title Cloud Dataflow Service Agent. A page opens that lists the permissions assigned to the Dataflow service account.

gcloud CLI

View the permissions of the Dataflow service account:

gcloud iam roles describe roles/dataflow.serviceAgent

Because Google Cloud services expect to have read and write access to the project and its resources, it's recommended that you don't change the default permissions automatically established for your project. If a Dataflow service account loses permissions to a project, Dataflow cannot launch VMs or perform other management tasks.

If you remove the permissions for the service account from the Identity and Access Management (IAM) policy, the account remains present, because it's owned by the Dataflow service.

Worker service account

Compute Engine instances execute Apache Beam SDK operations in the cloud. These workers use the worker service account of your project to access the files and other resources associated with the pipeline. The worker service account is used as the identity for all worker VMs, and all requests that originate from the VM use the worker service account. This service account is also used to interact with resources such as Cloud Storage buckets and Pub/Sub topics.

Note: Service account permissions are project-specific. If you need to run a similar job in a different project, you must configure appropriate service accounts and permissions within that new project. Dataflow jobs cannot be migrated between projects. To move a job, recreate it in the new project. For more information, see Migrate pipeline jobs to another Google Cloud Platform project.
  • For the worker service account to be able to run a job, it must have the roles/dataflow.worker role.
  • For the worker service account to be able to create or examine a job, it must have the roles/dataflow.admin role.

In addition, when your Apache Beam pipelines access Google Cloud resources, you need to grant the required roles to your Dataflow project's worker service account. The worker service account needs to be able to access the resources while running the Dataflow job. For example, if your job writes to BigQuery, your service account must also have at least the roles/bigquery.dataEditor role on the BigQuery table.
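For example, a minimal sketch of granting table-level edit access to the default worker service account with the bq command-line tool; the project, dataset, and table names are placeholders:

bq add-iam-policy-binding \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role=roles/bigquery.dataEditor \
    PROJECT_ID:DATASET.TABLE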

Default worker service account

By default, workers use the Compute Engine default service account of your project as the worker service account. This service account has the following email:

PROJECT_NUMBER-compute@developer.gserviceaccount.com

This service account is automatically created when you enable the Compute Engine API for your project from the API Library in the Google Cloud console.

Although you can use the Compute Engine default service account as the Dataflow worker service account, we recommend that you create a dedicated Dataflow worker service account that has only the roles and permissions that you need.

Depending on your organization policy configuration, the default service account might automatically be granted the Editor role on your project. We strongly recommend that you disable the automatic role grant by enforcing the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint. If you created your organization after May 3, 2024, this constraint is enforced by default.
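A minimal sketch of enforcing this constraint at the organization level by using the legacy org-policies command; the organization ID is a placeholder, and you need permission to manage organization policies:

gcloud resource-manager org-policies enable-enforce \
    iam.automaticIamGrantsForDefaultServiceAccounts \
    --organization=ORGANIZATION_ID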

If you disable the automatic role grant, you must decide which roles to grant to the default service accounts, and then grant these roles yourself.

If the default service account already has the Editor role, we recommend that you replace the Editor role with less permissive roles. To safely modify the service account's roles, use Policy Simulator to see the impact of the change, and then grant and revoke the appropriate roles.

Note: In the past, Dataflow users were able to deploy applications that authenticated as the Compute Engine default service account, even if they didn't have permission to impersonate the Compute Engine default service account. This legacy behavior still affects some organizations. For more information, see Requiring impersonation permissions when attaching service accounts to resources.

Specify a user-managed worker service account

If you want to create and use resources with fine-grained access control, you can create a user-managed service account. Use this account as the worker service account.

  1. If you don't have a user-managed service account, create a service account. (A command-line sketch covering this step and the next one follows these steps.)

  2. Set the required IAM roles for your service account.

    • For the worker service account to be able to run a job, it must have the roles/dataflow.worker role.
    • For the worker service account to be able to create or examine a job, it must have the roles/dataflow.admin role.
    • Alternatively, create a custom IAM role with the required permissions. For a list of the required permissions, see Roles.
  3. Your service account might need additional roles to use Google Cloud Platform resources as required by your job, such as BigQuery, Pub/Sub, or Cloud Storage. For example, if your job reads from BigQuery, your service account must also have at least the roles/bigquery.dataViewer role on the BigQuery table.

  4. Ensure that your user-managed service account has read and write access to the staging and temporary locations specified in the Dataflow job.

  5. To launch the pipeline, your user account must have the iam.serviceAccounts.actAs permission to impersonate the worker service account.

  6. In the project that contains the user-managed worker service account, the Dataflow Service Account (service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com) and the Compute Engine Service Agent (service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com) must have the following roles. PROJECT_NUMBER is the project number of the project that your Dataflow job runs in. Both of these accounts are service agents.

    For example, if the Dataflow job runs in project A and the worker service account is hosted in project B, make sure that the service agents from project A have the iam.serviceAccountTokenCreator and iam.serviceAccountUser roles in project B. In the project that your Dataflow job runs in, the accounts have these roles by default. To grant these roles, follow the steps in the Grant a single role section in the Manage access to service accounts page.

  7. When the user-managed worker service account and the job are in different projects, ensure that the iam.disableCrossProjectServiceAccountUsage boolean constraint is not enforced for the project that owns the user-managed service account. For more information, see Enable service accounts to be attached across projects.

  8. When you run your pipeline job, specify your service account.

    Java

    Use the --serviceAccount option and specify your service account when you run your pipeline job from the command line: --serviceAccount=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com

    Use the --service-account-email option and specify your service account when you run your pipeline job as a Flex template: --service-account-email=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com

    Python

    Use the --service_account_email option and specify your service account when you run your pipeline job: --service_account_email=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com

    Go

    Use the --service_account_email option and specify your service account when you run your pipeline job: --service_account_email=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com
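The following is a minimal sketch, using the gcloud CLI, of steps 1 and 2: creating a user-managed worker service account and granting it the Dataflow Worker role. The account name and project ID are placeholders; add any resource-specific roles that your job needs.

gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
    --project=PROJECT_ID \
    --display-name="Dataflow worker service account"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" \
    --role=roles/dataflow.worker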

The user-managed service account can be in the same project as your job, or in a different project. If the service account and the job are in different projects, you must configure the service account before you run the job.

Add roles

To add roles in your project, follow these steps.

Console

  1. In the Google Cloud console, go to the IAM page.

    Go to IAM

  2. Select your project.

  3. In the row containing your user account, click Edit principal, and then click Add another role.

  4. In the drop-down list, select the role Service Account User.

  5. In the row containing your worker service account, click Edit principal, and then click Add another role.

  6. In the drop-down list, select the role Dataflow Worker.

  7. If your worker service account needs the Dataflow Admin role, repeat these steps for the Dataflow Admin role.

  8. Repeat for any roles required by resources used in your job, and then click Save.

    For more information about granting roles, see Grant an IAM role by using the console.

gcloud CLI

  1. Grant the roles/iam.serviceAccountUser role to your user account. Run the following command:

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:EMAIL_ADDRESS" --role=roles/iam.serviceAccountUser
    • Replace PROJECT_ID with your project ID.
    • Replace EMAIL_ADDRESS with the email address for the user account.
  2. Grant roles to your worker service account. Run the following command for the roles/dataflow.worker IAM role and for any roles required by resources used in your job. If your worker service account needs the Dataflow Admin role, repeat for the roles/dataflow.admin IAM role. This example uses the Compute Engine default service account, but we recommend using a user-managed service account.

    gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
    • Replace PROJECT_ID with your project ID.
    • Replace PROJECT_NUMBER with your project number. To find your project number, see Identify projects or use the gcloud projects describe command.
    • Replace SERVICE_ACCOUNT_ROLE with each individual role.

Access Google Cloud resources

Your Apache Beam pipelines can access Google Cloud resources, either in the same Google Cloud project or in other projects. The following sections describe how to grant access to commonly used resources.

To ensure that your Apache Beam pipeline can access these resources, you need to use the resources' respective access control mechanisms to explicitly grant access to your Dataflow project's worker service account.

If you use Assured Workloads features with Dataflow, such as EU Regions and Support with Sovereignty Controls, all Cloud Storage, BigQuery, Pub/Sub, I/O connectors, and other resources that your pipeline accesses must be located in your organization's Assured Workloads project or folder.

If you're using a user-managed worker service account or accessing resources in other projects, then additional action might be needed. The following examples assume that the Compute Engine default service account is used, but you can also use a user-managed worker service account.

Access Artifact Registry repositories

When you use custom containers with Dataflow, you might upload artifacts to an Artifact Registry repository.

To use Artifact Registry with Dataflow, you must grant at least Artifact Registry Writer access (roles/artifactregistry.writer) to the worker service account that runs the Dataflow job.
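A minimal sketch of granting this role on a repository with the gcloud CLI; the repository name, location, and service account are placeholders:

gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --project=PROJECT_ID \
    --location=REGION \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role=roles/artifactregistry.writer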

All repository content is encrypted using either Google-owned and Google-managed encryption keys or customer-managed encryption keys. Artifact Registry uses Google-owned and Google-managed encryption keys by default, and no configuration is required for this option.

Access Cloud Storage buckets

To grant your Dataflow project access to a Cloud Storage bucket, make the bucket accessible to your Dataflow project's worker service account. At a minimum, your service account needs read and write permissions to both the bucket and its contents. You can use IAM permissions for Cloud Storage to grant the required access.

To give your worker service account the necessary permissions to read from and write to a bucket, use the gcloud storage buckets add-iam-policy-binding command. This command adds your Dataflow project service account to a bucket-level policy.

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE

Replace the following:

  • BUCKET_NAME: the name of your Cloud Storage bucket
  • PROJECT_NUMBER: your Dataflow project number. To find your project number, see Identify projects or use the gcloud projects describe command.
  • SERVICE_ACCOUNT_ROLE: the IAM role, for example storage.objectViewer

To retrieve a list of the Cloud Storage buckets in a Google Cloud project, use the gcloud storage buckets list command:

gcloud storage buckets list --project=PROJECT_ID

Replace PROJECT_ID with the ID of the project.

Unless you're restricted by organizational policies that limit resource sharing, you can access a bucket that resides in a different project than your Dataflow pipeline. For more information about domain restrictions, see Restricting identities by domain.

If you don't have a bucket, create a new bucket. Then, give your worker service account the necessary permissions to read from and write to the bucket.

You can also set bucket permissions from the Google Cloud console. For more information, see Setting bucket permissions.

Cloud Storage offers two systems for granting users access to your buckets and objects: IAM and Access Control Lists (ACLs). In most cases, IAM is the recommended method for controlling access to your resources.

  • IAM controls permissioning throughout Google Cloud and lets you grant permissions at the bucket and project levels. For a list of IAM roles that are associated with Cloud Storage and the permissions that are contained in each role, see IAM roles for Cloud Storage. If you need more control over permissions, create a custom role.

  • If you use ACLs to control access, ensure that your worker service account permissions are consistent with your IAM settings. Due to the inconsistency between IAM and ACL policies, the Cloud Storage bucket might become inaccessible to your Dataflow jobs when the Cloud Storage bucket is migrated from fine-grained access to uniform bucket-level access. For more information, see Common error guidance.

Access BigQuery datasets

You can use the BigQueryIO API to access BigQuery datasets, either in the same project where you're using Dataflow or in a different project. For the BigQuery source and sink to operate properly, the following two accounts must have access to any BigQuery datasets that your Dataflow job reads from or writes to:

  • The Google Cloud account that you use to run the Dataflow job
  • The worker service account that runs the Dataflow job

You might need to configure BigQuery to explicitly grant access to these accounts. See BigQuery Access Control for more information on granting access to BigQuery datasets using either the BigQuery page or the BigQuery API.

Among the required BigQuery permissions, the bigquery.datasets.get IAM permission is required by the pipeline to access a BigQuery dataset. Typically, most BigQuery IAM roles include the bigquery.datasets.get permission, but the roles/bigquery.jobUser role is an exception.

Access Pub/Sub topics and subscriptions

To access a Pub/Sub topic or subscription, use the Identity and Access Management features of Pub/Sub to set up permissions for the worker service account.

Permissions from the following Pub/Sub roles are relevant (a command-line sketch for granting one of these roles follows the list):

  • roles/pubsub.subscriber is required to consume data.
  • roles/pubsub.editor is required to create a Pub/Sub subscription.
  • roles/pubsub.viewer is recommended so that Dataflow can query the configurations of topics and subscriptions. This configuration has the following benefits:
    • Dataflow can check for unsupported settings on subscriptions that might not work as expected.
    • If the subscription does not use the default ack deadline of 10 seconds, performance improves. Dataflow repeatedly extends the ack deadline for a message while it's being processed by the pipeline. Without pubsub.viewer permissions, Dataflow is unable to query the ack deadline and therefore must assume a default deadline. This configuration causes Dataflow to issue more modifyAckDeadline requests than necessary.
    • If VPC Service Controls is enabled on the project that owns the subscription or topic, IP address-based ingress rules don't allow Dataflow to query the configurations. In this case, an ingress rule based on the worker service account is required.
For more information and some code examples that demonstrate how to use the Identity and Access Management features of Pub/Sub, see Sample use case: cross-project communication.

Access Firestore

To access a Firestore database (in Native mode or Datastore mode), add your Dataflow worker service account (for example, PROJECT_NUMBER-compute@developer.gserviceaccount.com) as an editor of the project that owns the database, or use a more restrictive Datastore role like roles/datastore.viewer. Also, enable the Firestore API in both projects from the API Library in the Google Cloud console.
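For example, a minimal sketch of granting the viewer role and enabling the API with the gcloud CLI; DATABASE_PROJECT_ID is a placeholder for the project that owns the database, and the viewer role supports read access only:

gcloud projects add-iam-policy-binding DATABASE_PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role=roles/datastore.viewer

gcloud services enable firestore.googleapis.com --project=DATABASE_PROJECT_ID
# Repeat the enable command for the project that runs the Dataflow job.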

Access images for projects with a trusted image policy

If you have a trusted image policy set up for your project and your boot image is located in another project, ensure that the trusted image policy is configured to have access to the image. For example, if you're running a templated Dataflow job, ensure that the policy file includes access to the dataflow-service-producer-prod project. This Google Cloud project contains the images for template jobs.
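A minimal sketch, using the legacy org-policies command, of allowing that project in the compute.trustedImageProjects constraint; the organization ID is a placeholder, and the exact configuration depends on how your organization manages the policy:

gcloud resource-manager org-policies allow \
    compute.trustedImageProjects "projects/dataflow-service-producer-prod" \
    --organization=ORGANIZATION_ID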

Data access and security

The Dataflow service works with two kinds of data:

  • End-user data. This data is processed by a Dataflow pipeline. A typical pipeline reads data from one or more sources, implements transformations of the data, and writes the results to one or more sinks. All the sources and sinks are storage services that are not directly managed by Dataflow.

  • Operational data. This data includes all the metadata that is required for managing a Dataflow pipeline. This data includes both user-provided metadata, such as a job name or pipeline options, and system-generated metadata, such as a job ID.

The Dataflow service uses several security mechanisms to keep your data secure and private. These mechanisms apply to the following scenarios:

  • Submitting a pipeline to the service
  • Evaluating a pipeline
  • Requesting access to telemetry and metrics during and after a pipeline execution
  • Using a Dataflow service such as Shuffle or Streaming Engine

Data locality

Note: We recommend that you always specify a region when you run a pipeline.

All of the core data processing for Dataflow happens in the region that is specified in the pipeline code. If a region is not specified, the default region us-central1 is used. If you specify that option in the pipeline code, the pipeline job can optionally read and write from sources and sinks in other regions. However, the actual data processing occurs only in the region that is specified to run the Dataflow VMs.
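For example, a minimal sketch of launching a pipeline with an explicit region, assuming the Python SDK; the script name, bucket, and project are placeholders:

python my_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --temp_location=gs://BUCKET_NAME/temp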

Pipeline logic is evaluated on individual worker VM instances. You can specify the zone where these instances and the private network that they communicate over are located. Ancillary computations for the platform depend on metadata such as Cloud Storage locations or file sizes.

Dataflow is a regional service. For more information about data locality and regions, see Dataflow regions.

Data in a pipeline submission

The IAM permissions for your Google Cloud project control access to the Dataflow service. Any principals who are given editor or owner rights to your project can submit pipelines to the service. To submit pipelines, you must authenticate by using the Google Cloud CLI. After you authenticate, your pipelines are submitted using the HTTPS protocol. For instructions about how to authenticate with your Google Cloud account credentials, see the quickstart for the language that you're using.

Data in a pipeline evaluation

As part of evaluating a pipeline, temporary data might be generated and stored locally in the worker VM instances or in Cloud Storage. Temporary data is encrypted at rest and does not persist after a pipeline evaluation concludes. Such data can also be stored in the Shuffle service or Streaming Engine service (if you have opted for the service) in the same region specified in the Dataflow pipeline.

Java

By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. Consequently, the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --stagingLocation or --tempLocation. If you're writing output to a Cloud Storage file, temporary files might be created in the output location before the write operation is finalized.

Python

By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. Consequently, the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --staging_location or --temp_location. If you're writing output to a Cloud Storage file, temporary files might be created in the output location before the write operation is finalized.

Go

By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. Consequently, the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --staging_location or --temp_location. If you're writing output to a Cloud Storage file, temporary files might be created in the output location before the write operation is finalized.

Data in pipeline logs and telemetry

Information stored in Cloud Logging is primarily generated by the code in your Dataflow program. The Dataflow service might also generate warning and error data in Cloud Logging, but this data is the only intermediate data that the service adds to logs. Cloud Logging is a global service.

Telemetry data and associated metrics are encrypted at rest, and access to thisdata is controlled by your Google Cloud project's read permissions.

Data in Dataflow services

If you use Dataflow Shuffle or Dataflow Streaming for your pipeline, don't specify the zone pipeline option. Instead, specify the region and set the value to one of the regions where Shuffle or Streaming is available. Dataflow auto-selects the zone in the region that you specify. The end-user data in transit stays within the worker VMs and in the same zone. These Dataflow jobs can still read and write to sources and sinks that are outside the VM zone. The data in transit can also be sent to the Dataflow Shuffle or Dataflow Streaming services; however, the data always remains in the region specified in the pipeline code.

Recommended practice

We recommend that you use the security mechanisms available in the underlying cloud resources of your pipeline. These mechanisms include the data security capabilities of data sources and sinks such as BigQuery and Cloud Storage. It's also best not to mix different trust levels in a single project.
