Read from Bigtable to Dataflow
To read data from Bigtable to Dataflow, use the Apache Beam Bigtable I/O connector.
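For example, the following minimal batch pipeline reads every row from a table with the Java SDK and extracts the row keys. This is a sketch, not a complete sample from the connector documentation: the project, instance, and table IDs are placeholders, the key-extraction step is only illustrative, and it assumes the beam-sdks-java-io-google-cloud-platform dependency is on the classpath.

```java
import com.google.bigtable.v2.Row;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BigtableReadExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read all rows from the table. The project, instance, and table IDs
        // below are placeholders; replace them with your own values.
        .apply("ReadFromBigtable",
            BigtableIO.read()
                .withProjectId("my-project-id")
                .withInstanceId("my-instance-id")
                .withTableId("my-table-id"))
        // Illustrative step: extract each row key as a UTF-8 string for
        // downstream processing.
        .apply("ExtractRowKeys",
            MapElements.into(TypeDescriptors.strings())
                .via((Row row) -> row.getKey().toStringUtf8()));

    pipeline.run().waitUntilFinish();
  }
}
```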
Note: Depending on your scenario, consider using one of the Google-provided Dataflow templates. Several of these read from Bigtable.

Parallelism
Parallelism is controlled by the number of nodes in the Bigtable cluster. Each node manages one or more key ranges, although key ranges can move between nodes as part of load balancing. For more information, see Reads and performance in the Bigtable documentation.
You are charged for the number of nodes in your instance's clusters. See Bigtable pricing.
Performance
The following table shows performance metrics for Bigtable read operations. The workloads were run on one e2-standard2 worker, using the Apache Beam SDK 2.48.0 for Java. They did not use Runner v2.
| 100 M records, 1 kB, 1 column | Throughput (bytes) | Throughput (elements) |
|---|---|---|
| Read | 180 MBps | 170,000 elements per second |
These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.
Best practices
- For new pipelines, use the BigtableIO connector, not CloudBigtableIO.
- Create separate app profiles for each type of pipeline. App profiles enable better metrics for differentiating traffic between pipelines, both for support and for tracking usage.
- Monitor the Bigtable nodes. If you experience performance bottlenecks, check whether resources such as CPU utilization are constrained within Bigtable. For more information, see Monitoring.
- In general, the default timeouts are well tuned for most pipelines. If a streaming pipeline appears to get stuck reading from Bigtable, try calling withAttemptTimeout to adjust the attempt timeout, as shown in the sketch after this list.
- Consider enabling Bigtable autoscaling, or resize the Bigtable cluster to scale with the size of your Dataflow jobs.
- Consider setting maxNumWorkers on the Dataflow job to limit load on the Bigtable cluster.
- If significant processing is done on a Bigtable element before a shuffle, calls to Bigtable might time out. In that case, you can call withMaxBufferElementCount to buffer elements. This method converts the read operation from streaming to paginated, which avoids the issue.
- If you use a single Bigtable cluster for both streaming and batch pipelines, and the performance degrades on the Bigtable side, consider setting up replication on the cluster. Then separate the batch and streaming pipelines, so that they read from different replicas. For more information, see Replication overview.
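The timeout and buffering options mentioned above are set directly on the connector. The following sketch shows one way to apply them; the IDs and tuning values are placeholders rather than recommendations, and it assumes the joda-time Duration type used elsewhere in the Beam Java SDK.

```java
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.joda.time.Duration;

// Placeholder tuning values: start from the defaults and adjust only if you
// observe timeouts or excessive load on the Bigtable cluster.
BigtableIO.Read tunedRead =
    BigtableIO.read()
        .withProjectId("my-project-id")
        .withInstanceId("my-instance-id")
        .withTableId("my-table-id")
        // Raise the per-attempt timeout if a streaming pipeline appears to
        // get stuck reading from Bigtable.
        .withAttemptTimeout(Duration.standardSeconds(30))
        // Read rows in pages instead of streaming them, which helps when
        // heavy processing before a shuffle causes reads to time out.
        .withMaxBufferElementCount(10000);

// To limit load on the Bigtable cluster, cap the Dataflow worker count with a
// pipeline option, for example: --maxNumWorkers=20
```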
What's next
- Read the Bigtable I/O connector documentation.
- See the list of Google-provided templates.