Read from Pub/Sub to Dataflow

This page describes best practices for reading from Pub/Sub in Dataflow.

Apache Beam provides a reference implementation of the Pub/Sub I/O connector for use by non-Dataflow runners. However, the Dataflow runner uses its own custom implementation of the connector. This implementation takes advantage of Google Cloud-internal APIs and services to offer low-latency watermarks, high watermark accuracy, and efficient deduplication for exactly-once message processing. The connector is available for Java, Python, and Go.

Exactly-once processing

Pub/Sub decouples event publishers from event consumers. The application publishes messages to a topic, and Pub/Sub asynchronously delivers the messages to subscribers.

Pub/Sub assigns a unique message ID to each message that is successfully published to a topic. By default, Pub/Sub performs at-least-once message delivery. To achieve at-least-once semantics, if Pub/Sub doesn't receive an acknowledgment from the subscriber within the acknowledgment deadline, it retries the message delivery. Retries might also occur before the acknowledgment deadline, or after a message has been acknowledged.

Dataflow acknowledges messages after they are successfully processed by the first fused stage and the side effects of that processing have been written to persistent storage. To reduce the number of duplicate messages, Dataflow continually extends the acknowledgment deadline while a batch of messages is being processed in this stage.

Because Pub/Sub might re-deliver a message, it is possible for duplicate messages to arrive at the pipeline. If your Dataflow pipeline uses exactly-once streaming mode, then Dataflow deduplicates these messages to achieve exactly-once semantics.

If your pipeline can tolerate some duplicate records, then consider using at-least-once streaming mode instead. This mode can significantly lower the latency and total cost of your pipeline. The tradeoff is that some messages might be processed more than once. For more information, see Choose which streaming mode to use.
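
As an illustration, at-least-once mode is selected with a Dataflow service option when you launch the job. Assuming the streaming_mode_at_least_once option name (check the Dataflow service options documentation for the current name), a Java pipeline launch would include:

--dataflowServiceOptions=streaming_mode_at_least_once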

Note: Pub/Sub also supports exactly-once delivery. However, it's not recommended to use exactly-once delivery with Dataflow. For more information, see Pub/Sub exactly-once delivery.

Deduplicate by message attribute

By default, Dataflow deduplicates based on the message ID. However, an application might send the same record twice as two distinct Pub/Sub messages. For example, the original source data might contain duplicate records, or the application might incorrectly publish the same message twice. The latter can happen because of publish retries, for example if an acknowledgment was dropped due to network issues or other interruptions. In these situations, the duplicate messages have different message IDs.

Depending on your scenario, your data might contain a unique field that can be used to deduplicate. For example, records might contain a unique transaction ID. You can configure the Pub/Sub I/O connector to deduplicate messages based on the value of a message attribute, instead of using the Pub/Sub message ID. As long as the publisher sets this attribute consistently during retries, then Dataflow can detect the duplicates. Messages must be published to Pub/Sub within 10 minutes of each other for deduplication.
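
As a sketch of this configuration with the Java SDK, the following example reads from a subscription and deduplicates on a transaction_id attribute; the subscription path and attribute name are placeholders for your own values:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DeduplicateByAttribute {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> messages = pipeline.apply("ReadFromPubSub",
        PubsubIO.readStrings()
            .fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID")
            // Deduplicate on this attribute instead of the Pub/Sub message ID.
            // The publisher must set it consistently across publish retries.
            .withIdAttribute("transaction_id"));

    // ... apply downstream transforms to 'messages' ...
    pipeline.run();
  }
}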

For more information about using ID attributes, see the SDK reference documentation for the Pub/Sub I/O connector.

Subscriptions

When you configure your pipeline, you specify either a Pub/Sub topic or a Pub/Sub subscription to read from. If you specify a subscription, don't use the same Pub/Sub subscription for multiple pipelines. If two pipelines read from a single subscription, each pipeline receives part of the data in a nondeterministic manner, which might cause duplicate messages, watermark lag, and inefficient autoscaling. Instead, create a separate subscription for each pipeline.

If you specify a topic, the connector creates a new temporary subscription. This subscription is unique per pipeline.
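
For illustration, the two forms look like the following in the Java SDK, where the project, subscription, and topic IDs are placeholders. With the first form, create a dedicated subscription for each pipeline; with the second form, the connector manages its own temporary subscription:

PubsubIO.readStrings().fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID")
PubsubIO.readStrings().fromTopic("projects/PROJECT_ID/topics/TOPIC_ID")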

Timestamps and watermarks

All Pub/Sub messages have a timestamp, which represents the time when Pub/Sub receives the message. Your data might also have an event timestamp, which is the time when the record was generated by the source.

You can configure the connector to read the event timestamp from an attribute on the Pub/Sub message. In that case, the connector uses the event timestamp for watermarking. Otherwise, by default it uses the Pub/Sub message timestamp.
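
As a sketch with the Java SDK, the following read uses an event_time attribute (a placeholder name) as the event timestamp; the attribute value must be in a timestamp representation that the connector can parse, such as milliseconds since the Unix epoch:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadWithEventTimestamps {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<PubsubMessage> messages = pipeline.apply("ReadFromPubSub",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID")
            // Use this attribute's value as the element timestamp and for
            // watermarking, instead of the Pub/Sub publish time.
            .withTimestampAttribute("event_time"));

    // ... apply windowing and downstream transforms to 'messages' ...
    pipeline.run();
  }
}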

For more information about using event timestamps, see the SDK reference documentation for the Pub/Sub I/O connector.

The Pub/Sub connector has access to Pub/Sub's private API that provides the age of the oldest unacknowledged message in a subscription. This API provides lower latency than is available in Cloud Monitoring. It enables Dataflow to advance pipeline watermarks and emit windowed computation results with low latencies.

If you configure the connector to use event timestamps, then Dataflow creates a second Pub/Sub subscription, called the tracking subscription. Dataflow uses the tracking subscription to inspect the event times of messages that are still in the backlog. This approach allows Dataflow to estimate the event-time backlog accurately. The worker service account must have at least the following permissions on the project that contains the tracking subscription:

  • pubsub.subscriptions.create
  • pubsub.subscriptions.consume
  • pubsub.subscriptions.delete

In addition, it needs the pubsub.topics.attachSubscription permission on the Pub/Sub topic. It's recommended to create a custom Identity and Access Management role that contains just these permissions.
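
As an illustration, one way to bundle these permissions is to create a custom role with the gcloud CLI; the role ID and PROJECT_ID below are placeholders:

gcloud iam roles create dataflowPubsubTrackingSubscription \
    --project=PROJECT_ID \
    --title="Dataflow Pub/Sub tracking subscription" \
    --permissions=pubsub.subscriptions.create,pubsub.subscriptions.consume,pubsub.subscriptions.delete,pubsub.topics.attachSubscription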

For more information about watermarks, see the Stack Overflow page that covers how Dataflow computes Pub/Sub watermarks.

If a pipeline has multiple Pub/Sub sources and one of them has very low volume or is idle, that source can hold back the watermark for the entire pipeline, increasing overall latency. Any timers or window aggregations in the pipeline that depend on the watermark are also delayed.

Pub/Sub Seek

Pub/Sub Seek lets users replay previously acknowledged messages. You can use Pub/Sub Seek with Dataflow to reprocess messages in a pipeline.

However, it's not recommended to use Pub/Sub Seek in a running pipeline. Seeking backwards in a running pipeline can lead to duplicate messages or messages being dropped. It also invalidates Dataflow's watermark logic and conflicts with the state of a pipeline that incorporates processed data.

To reprocess messages by using Pub/Sub Seek, the following workflow is recommended:

  1. Create a snapshot of the subscription.
  2. Create a new subscription for the Pub/Sub topic. The new subscription inherits the snapshot (a gcloud sketch of these first two steps follows this list).
  3. Drain or cancel the current Dataflow job.
  4. Resubmit the pipeline using the new subscription.
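
As a sketch, the first two steps might look like the following with the gcloud CLI, where the snapshot, subscription, and topic IDs are placeholders; seeking the new subscription to the snapshot replays the messages that the snapshot captured:

gcloud pubsub snapshots create SNAPSHOT_ID --subscription=OLD_SUBSCRIPTION_ID
gcloud pubsub subscriptions create NEW_SUBSCRIPTION_ID --topic=TOPIC_ID
gcloud pubsub subscriptions seek NEW_SUBSCRIPTION_ID --snapshot=SNAPSHOT_ID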

For more information, see Message reprocessing with Pub/Sub Snapshot and Seek.

Pub/Sub source parallelism

The Pub/Sub source assigns each message a deterministic key for processing and uses those keys to shuffle the messages. For Streaming Engine jobs, 1024 keys are used for shuffling. For non-Streaming Engine jobs, the number of keys is the lowest power of 2 greater than (4 * maximum workers). For example, with a maximum of 100 workers, 4 * 100 = 400, so 512 keys are used.

To override the default number of shuffle keys, set the num_pubsub_keys service option:

Java

--dataflowServiceOptions=num_pubsub_keys=NUMBER_OF_KEYS

Python

--dataflow_service_options=num_pubsub_keys=NUMBER_OF_KEYS

Go

--dataflow_service_options=num_pubsub_keys=NUMBER_OF_KEYS

Replace NUMBER_OF_KEYS with the number of keys. The next power of 2 greater than or equal to the specified value is used. For example, if you specify 1000, then 1024 keys are used.

Depending on your workload, you might need to override the default number of keys.

If you set this option, consider the tradeoffs described in Parallelization and distribution.

You can't change the number of keys as part of a pipeline update. To change the number of keys for an existing pipeline job, you must start a new job.

Unsupported Pub/Sub features

The following Pub/Sub features aren't supported in the Dataflow runner's implementation of the Pub/Sub I/O connector.

Exponential backoff

When you create a Pub/Sub subscription, you can configure it to use an exponential backoff retry policy. However, exponential backoff doesn't work with Dataflow. Instead, create the subscription with the Retry immediately retry policy.
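
For example, with the gcloud CLI, a subscription created without the --min-retry-delay and --max-retry-delay flags uses the Retry immediately policy; the IDs below are placeholders:

gcloud pubsub subscriptions create SUBSCRIPTION_ID --topic=TOPIC_ID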

Exponential backoff is triggered by a negative acknowledgment or when the acknowledgment deadline expires. However, Dataflow doesn't send negative acknowledgments when pipeline code fails. Instead, it retries message processing indefinitely, while continually extending the acknowledgment deadline for the message.

Dead-letter topics

Don't use Pub/Sub dead-letter topics with Dataflow, for the following reasons:

  • Dataflow sends negative acknowledgments for various internal reasons (for example, if a worker is shutting down). As a result, messages might be delivered to the dead-letter topic even when no failures occur in the pipeline code.

  • Dataflow acknowledges messages after a bundle of messages is successfully processed by the first fused stage. If the pipeline has multiple fused stages and failures occur at any point after the first stage, the messages are already acknowledged and don't go to the dead-letter topic.

Instead, implement the dead-letter pattern explicitly in the pipeline, by routing failed messages to a destination for later processing. Some I/O sinks have built-in support for dead-letter queues.
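
As an illustrative sketch (not taken from this page), the following Java fragment routes messages that fail a validation step to a separate output, assuming that messages is a PCollection<String> read from Pub/Sub; the validation logic and the final destination for failed messages are placeholders:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Output tags for elements that pass validation and for failed elements.
final TupleTag<String> validTag = new TupleTag<String>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = messages.apply("Validate",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String message, MultiOutputReceiver out) {
        try {
          // Placeholder validation; replace with your own parsing logic.
          if (message.isEmpty()) {
            throw new IllegalArgumentException("empty message");
          }
          out.get(validTag).output(message);
        } catch (Exception e) {
          // Route the failing message to the dead-letter output instead of
          // throwing, so that the bundle isn't retried indefinitely.
          out.get(deadLetterTag).output(message);
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(deadLetterTag)));

PCollection<String> valid = results.get(validTag);
PCollection<String> deadLetters = results.get(deadLetterTag);
// Write 'deadLetters' to a destination such as Cloud Storage or BigQuery for
// later inspection and reprocessing.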

Pub/Sub exactly-once delivery

Because Dataflow has its own mechanisms for exactly-once processing, it's not recommended to use Pub/Sub exactly-once delivery with Dataflow. Enabling Pub/Sub exactly-once delivery reduces pipeline performance, because it limits the number of messages that are available for parallel processing.

Pub/Sub message ordering

Message ordering is a feature in Pub/Sub that lets a subscriber receive messages in the order in which they were published.

It's not recommended to use message ordering with Dataflow, for the following reasons:

  • The Pub/Sub I/O connector might not preserve message ordering.
  • Apache Beam doesn't define strict guidelines regarding the order in which elements are processed. Therefore, ordering might not be preserved in downstream transforms.
  • Using Pub/Sub message ordering with Dataflow can increase latency and decrease performance.

Pub/Sub single message transforms

Single message transforms (SMTs) let you manipulate, validate, and filter messages based on their attributes or data as they stream through the system. Subscriptions that feed into Dataflow shouldn't use SMTs that filter out messages, because such filtering can interfere with autoscaling. Subscription-level SMT filtering can make the backlog appear larger than the volume of messages that is actually delivered to Dataflow, until the filtered-out messages are processed by the SMT. Topic-level SMTs that filter messages don't cause issues with autoscaling.

