Read from Cloud Storage to Dataflow
To read data from Cloud Storage to Dataflow, use the Apache Beam TextIO or AvroIO I/O connector.
Include the Google Cloud Platform library dependency
To use the TextIO or AvroIO connector with Cloud Storage, include the following dependency. This library provides a schema handler for "gs://" filenames.
Java

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>${beam.version}</version>
</dependency>
```

Python

```
apache-beam[gcp]==VERSION
```

Go

```go
import _ "github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem/gcs"
```

For more information, see Install the Apache Beam SDK.
Enable gRPC on Apache Beam I/O connector on Dataflow
You can connect to Cloud Storage using gRPC through the Apache Beam I/O connector on Dataflow. gRPC is a high-performance, open-source remote procedure call (RPC) framework developed by Google that you can use to interact with Cloud Storage.
To speed up your Dataflow job's read requests to Cloud Storage, you can enable the Apache Beam I/O connector on Dataflow to use gRPC.
Command line
- Ensure that you use the Apache Beam SDK version 2.55.0 or later.
- To run a Dataflow job, use the --additional-experiments=use_grpc_for_gcs pipeline option. For information about the different pipeline options, see Optional flags.
Apache Beam SDK
- Ensure that you use the Apache Beam SDK version 2.55.0 or later.
- To run a Dataflow job, use the --experiments=use_grpc_for_gcs pipeline option. For information about the different pipeline options, see Basic options. A sketch of setting this option in code follows this list.
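For example, the following is a minimal Java sketch of setting the experiment programmatically instead of on the command line. It uses the core SDK's ExperimentalOptions interface; the class name is hypothetical:

```java
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EnableGrpcForGcs {
  public static void main(String[] args) {
    // Parse any flags passed on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Equivalent to passing --experiments=use_grpc_for_gcs.
    // Requires the Apache Beam SDK version 2.55.0 or later.
    ExperimentalOptions.addExperiment(
        options.as(ExperimentalOptions.class), "use_grpc_for_gcs");

    // Create and run the pipeline with these options as usual.
  }
}
```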
You can configure the Apache Beam I/O connector on Dataflow to generate gRPC-related metrics in Cloud Monitoring. The gRPC-related metrics can help you do the following:
- Monitor and optimize the performance of gRPC requests to Cloud Storage.
- Troubleshoot and debug issues.
- Gain insights into your application's usage and behavior.
For information about how to configure the Apache Beam I/O connector on Dataflow to generate gRPC-related metrics, see Use client-side metrics. If gathering metrics isn't necessary for your use case, you can opt out of metrics collection. For instructions, see Opt out of client-side metrics.
Parallelism
The TextIO and AvroIO connectors support two levels of parallelism:
- Individual files are keyed separately, so that multiple workers can read them.
- If the files are uncompressed, the connector can read sub-ranges of each file separately, leading to a very high level of parallelism. This splitting is only possible if each line in the file is a meaningful record. For example, it's not available by default for JSON files. A sketch follows this list.
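For example, with TextIO the compression setting determines whether this second level of splitting is available. A minimal sketch, using hypothetical bucket and file patterns:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ParallelismExample {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Uncompressed text files: the connector can read sub-ranges of each
    // file, so parallelism can exceed the number of files.
    pipeline.apply("ReadUncompressed",
        TextIO.read().from("gs://my-bucket/logs/*.txt"));

    // Gzip-compressed files can't be split, so each file is read by a
    // single worker and parallelism is capped at the number of files.
    pipeline.apply("ReadCompressed",
        TextIO.read()
            .from("gs://my-bucket/archive/*.gz")
            .withCompression(Compression.GZIP));

    pipeline.run();
  }
}
```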
Performance
The following table shows performance metrics for reading from Cloud Storage. The workloads were run on one e2-standard2 worker, using the Apache Beam SDK 2.49.0 for Java. They did not use Runner v2.
| Workload: 100 M records, 1 kB, 1 column | Throughput (bytes) | Throughput (elements) |
|---|---|---|
| Read | 320 MBps | 320,000 elements per second |
These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.
Best practices
- Avoid using watchForNewFiles with Cloud Storage. This approach scales poorly for large production pipelines, because the connector must keep a list of seen files in memory. The list can't be flushed from memory, which reduces the working memory of workers over time. Consider using Pub/Sub notifications for Cloud Storage instead. For more information, see File processing patterns.
- If both the filename and the file contents are useful data, use the FileIO class to read filenames. For example, a filename might contain metadata that is useful when processing the data in the file. For more information, see Accessing filenames. The FileIO documentation also shows an example of this pattern, and a sketch follows this list.
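The following is a minimal sketch of that pattern, using a hypothetical gs://my-bucket/data/*.csv file pattern: it matches files, then emits each filename paired with the file's contents.

```java
import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class ReadWithFilenames {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Expand the file pattern into a set of matched files.
        .apply(FileIO.match().filepattern("gs://my-bucket/data/*.csv"))
        // Convert each match into a readable file handle.
        .apply(FileIO.readMatches())
        // Pair each filename with the file's contents.
        .apply(ParDo.of(
            new DoFn<FileIO.ReadableFile, KV<String, String>>() {
              @ProcessElement
              public void process(
                  @Element FileIO.ReadableFile file,
                  OutputReceiver<KV<String, String>> out) throws IOException {
                String filename = file.getMetadata().resourceId().toString();
                // Reads the whole file into memory; suitable for small files.
                out.output(KV.of(filename, file.readFullyAsUTF8String()));
              }
            }));

    pipeline.run();
  }
}
```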
Example
The following example shows how to read from Cloud Storage.
Java
To authenticate to Dataflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadFromStorage {
  public static Pipeline createPipeline(Options options) {
    var pipeline = Pipeline.create(options);
    pipeline
        // Read from a text file.
        .apply(TextIO.read().from("gs://" + options.getBucket() + "/*.txt"))
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via((x -> {
                  System.out.println(x);
                  return x;
                })));
    return pipeline;
  }
}
```
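The sample depends on a custom Options interface that isn't shown in this excerpt. A minimal sketch of what that interface might look like, inferred from the getBucket() call above (hypothetical; the full sample's interface may differ):

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical Options interface inferred from the sample's getBucket() call.
public interface Options extends PipelineOptions {
  @Description("The Cloud Storage bucket to read from")
  String getBucket();

  void setBucket(String value);
}
```

Registering the interface with PipelineOptionsFactory.register(Options.class) before calling fromArgs lets a --bucket flag be parsed from the command line.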
What's next

- Read the TextIO API documentation.
- See the list of Google-provided templates.