Process ML data using Dataflow and Cloud Storage FUSE

This page describes how to use Cloud Storage FUSE with Dataflow to process datasets for machine learning (ML) tasks.

When working with ML tasks, Dataflow can be used for processing large datasets. However, some common software libraries used for ML, like OpenCV, have input file requirements. They frequently require files to be accessed as if they are stored on a local computer's hard drive, rather than from cloud-based storage. This requirement creates difficulties and delays. As a solution, pipelines can either use special I/O connectors for input or download files onto the Dataflow virtual machines (VMs) before processing. These solutions are frequently inefficient.

Cloud Storage FUSE provides a way to avoid these inefficient solutions. Cloud Storage FUSE lets you mount your Cloud Storage buckets onto the Dataflow VMs. This makes the files in Cloud Storage appear as if they are local files. As a result, the ML software can access them directly without needing to download them beforehand.
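For example, once a bucket is mounted, library code that expects local paths, such as OpenCV, can read the files directly through the mount point described later on this page. The following is a minimal sketch, assuming a hypothetical bucket named my-ml-images and the opencv-python package installed on the worker:

# Minimal sketch: read an image from a Cloud Storage FUSE mount with OpenCV.
# Assumes a hypothetical bucket "my-ml-images" mounted by Dataflow at
# /var/opt/google/gcs/my-ml-images, and opencv-python installed on the worker.
import cv2

image_path = "/var/opt/google/gcs/my-ml-images/cats/cat_0001.jpg"

# OpenCV reads the mounted object like a local file; no download step is needed.
image = cv2.imread(image_path)
if image is None:
    raise FileNotFoundError(f"Could not read {image_path}")
print(image.shape)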

Benefits

Using Cloud Storage FUSE for ML tasks offers the following benefits:

  • Input files hosted on Cloud Storage can be accessed in the Dataflow VM using local file system semantics.
  • Because the data is accessed on demand, the input files don't have to be downloaded beforehand.

Support and limitations

  • To use Cloud Storage FUSE with Dataflow, you must configure worker VMs with external IP addresses so that they meet the internet access requirements.
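Dataflow worker VMs use external IP addresses unless you turn them off, so this requirement usually means not disabling public IPs when you launch the pipeline. As a sketch, in the Beam Python SDK the relevant worker option is use_public_ips (the --no_use_public_ips flag disables it):

# Sketch: keep external (public) IP addresses enabled on the worker VMs so that
# Cloud Storage FUSE can reach Cloud Storage over the internet.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    use_public_ips=True,  # Explicitly keep public IPs; don't pass --no_use_public_ips.
)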

Specify buckets to use with Cloud Storage FUSE

To specify a Cloud Storage bucket to mount to a VM, use the --experiments flag. To specify multiple buckets, use a semicolon delimiter (;) between bucket names.

The format is as follows:

--experiments="gcsfuse_buckets=CONFIG"

Replace the following:

  • CONFIG: a semicolon-delimited list of Cloud Storage entries, where each entry is one of the following:

    1. BUCKET_NAME: A Cloud Storage bucket name. For example, dataflow-samples. If you omit the bucket mode, the bucket is treated as read-only.

    2. BUCKET_NAME:MODE: A Cloud Storage bucket name and its associated mode, where MODE is either ro (read-only) or rw (read-write).

      For example:

      --experiments="gcsfuse_buckets=read-bucket1;read-bucket2:ro;write-bucket1:rw"

      In this example, specifying the mode ensures the following:

      • gs://read-bucket1 is mounted in read-only mode.
      • gs://read-bucket2 is mounted in read-only mode.
      • gs://write-bucket1 is mounted in read-write mode.

    Beam pipeline code can access these buckets at /var/opt/google/gcs/BUCKET_NAME.
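As an end-to-end illustration, the following sketch shows how a Beam Python pipeline might pass the experiment flag and then read the mounted files inside a DoFn. The bucket names come from the example above; the project, region, temp location, object paths, and the ReadMountedFile class are placeholders:

# Sketch: Beam Python pipeline that mounts buckets with Cloud Storage FUSE and
# reads the files through the local mount point inside a DoFn.
# Project, region, temp_location, and object paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

MOUNT_ROOT = "/var/opt/google/gcs"  # Mount point Dataflow uses for FUSE buckets.


class ReadMountedFile(beam.DoFn):
    def process(self, relative_path):
        # Each element is an object path relative to the bucket, for example
        # "images/img_0001.jpg".
        local_path = f"{MOUNT_ROOT}/read-bucket1/{relative_path}"
        with open(local_path, "rb") as f:
            yield relative_path, len(f.read())


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    experiments=["gcsfuse_buckets=read-bucket1;read-bucket2:ro;write-bucket1:rw"],
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ObjectPaths" >> beam.Create(["images/img_0001.jpg", "images/img_0002.jpg"])
        | "ReadFromMount" >> beam.ParDo(ReadMountedFile())
        | "LogSizes" >> beam.Map(print)
    )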
