Stream data from PostgreSQL databases

This section contains information about:

  • How Datastream handles data that's pulled from a source PostgreSQL database
  • The versions of PostgreSQL database that Datastream supports
  • An overview of how to set up a source PostgreSQL database so that data can be streamed from it to a destination
  • Known limitations for using PostgreSQL database as a source

Behavior

Datastream relies on the logical decoding feature of the source PostgreSQL database. Logical decoding exposes all changes committed to the database and allows these changes to be consumed and processed in a user-friendly format using an output plugin. Datastream uses the pgoutput plugin, which is the standard PostgreSQL logical decoding plugin for PostgreSQL 10 and later.
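To illustrate the mechanism, a logical replication slot that uses the pgoutput plugin can be created manually. The slot name datastream_slot below is hypothetical; in practice, you configure the slot name in your Datastream connection settings.

```sql
-- Hypothetical slot name for illustration. Run as a superuser or a role
-- with the REPLICATION attribute.
SELECT pg_create_logical_replication_slot('datastream_slot', 'pgoutput');

-- Verify that the slot exists and which output plugin it uses.
SELECT slot_name, plugin, slot_type FROM pg_replication_slots;
```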

  • You can select all schemas or specific schemas from a given PostgreSQL source, as well as all tables or specific tables from those schemas.
  • All historical data is replicated.
  • All data manipulation language (DML) changes, such as inserts, updates, and deletes from the specified databases and tables, are replicated.
  • Only committed changes are replicated.
  • If you define a REPLICA IDENTITY on a table, Datastream treats the specified columns as primary keys.
  • Datastream periodically sends heartbeat messages to the source database. As a result, logical decoding message events (op: "m") are inserted directly into the WAL file. These messages are required by Datastream to ensure source availability and to calculate freshness. We recommend taking that into consideration if other replication setups read from the same source database.

Versions

Datastream supports PostgreSQL version 10 and later.

Datastream supports the following types of PostgreSQL database:

  • Self-hosted PostgreSQL
  • Cloud SQL for PostgreSQL
  • AlloyDB for PostgreSQL
  • AlloyDB Omni
  • Amazon RDS for PostgreSQL
  • Amazon Aurora PostgreSQL

Best practices

This section describes recommended best practices for configuring your PostgreSQL source for use with Datastream.

Use multiple streams to prevent head-of-line blocking

For PostgreSQL sources, Datastream uses a single logical replication slot for an entire stream. A large transaction or multiple updates on one high-volume table can delay data replication for all other tables in the same stream.

To prevent head-of-line blocking, create separate streams for different sets of tables. For example, you can create one stream for high-volume tables, and another stream for low-volume tables. This isolates high-churn tables and prevents them from delaying replication for other tables.

Recommendation: identify tables with exceptionally high write (INSERT/UPDATE/DELETE) rates and place them in their own dedicated Datastream stream with a separate replication slot.
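One way to find candidate high-volume tables is to query PostgreSQL's cumulative statistics views. This is a sketch; the counts are cumulative since the last statistics reset, so sample them over a representative window.

```sql
-- List the ten tables with the most cumulative write activity.
SELECT schemaname, relname,
       n_tup_ins + n_tup_upd + n_tup_del AS write_ops
FROM pg_stat_user_tables
ORDER BY write_ops DESC
LIMIT 10;
```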

Avoid long-running transactions

Long-running transactions can lead to Write-Ahead Log (WAL) buildup. Because WAL is sequential, PostgreSQL can't flush the WAL until the long transaction completes, even as other transactions are being consumed. This can increase the replication slot size and slow down logical decoding, because changes from long-running transactions that overlap with the current transaction must be decoded repeatedly.

Recommendation: on the source database, configure the statement_timeout and idle_in_transaction_session_timeout parameters to avoid long-running transactions. For more information, see the PostgreSQL documentation.
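A minimal sketch of setting both timeouts server-wide follows. The values shown are assumptions; choose limits that fit your workload, and note that both parameters can also be scoped to a specific role or database instead of the whole server.

```sql
-- Example values only; tune for your workload.
ALTER SYSTEM SET statement_timeout = '60s';
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';

-- Apply the changed settings without a server restart.
SELECT pg_reload_conf();
```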

Use table filtering when creating publications

If you're replicating changes from only a few tables, ensure that you create a PUBLICATION that includes only those tables. When a publication is scoped to specific tables, PostgreSQL efficiently persists changes only for those tables in the replication slot. This helps reduce the size of the replication slot and improves logical decoding performance.
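For example, a publication scoped to two tables might look like the following; the publication, schema, and table names are hypothetical.

```sql
-- Include only the tables that the stream replicates.
CREATE PUBLICATION datastream_pub
  FOR TABLE inventory.orders, inventory.customers;

-- Later, adjust the table list as the stream's scope changes.
ALTER PUBLICATION datastream_pub ADD TABLE inventory.shipments;
```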

Proactively manage replication slots

Datastream uses a logical replication slot on your PostgreSQL primary instance, which ensures that WAL files are retained until Datastream confirms that they've been processed. If a stream fails, or is paused or deleted without dropping the replication slot, PostgreSQL continues to retain WAL files indefinitely. This can fill up your database server disk and lead to a production outage.

Recommendation: set up efficient alerting and monitor WAL disk usage on your source PostgreSQL server.
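As a starting point for such monitoring, the WAL retained by each slot can be inspected from the source itself. This query is a sketch using standard catalog views available in PostgreSQL 10 and later.

```sql
-- Show how much WAL each replication slot is holding back.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```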

Configure replica identity correctly

The REPLICA IDENTITY setting tells PostgreSQL what data to write to the WAL for UPDATE and DELETE events, allowing Datastream to identify which rows were changed.

If you use BigQuery as a destination, avoid setting REPLICA IDENTITY to FULL. Datastream uses the logged columns as a logical key for BigQuery MERGE operations. If REPLICA IDENTITY is set to FULL and a table has more than 16 columns, this exceeds the BigQuery 16-column limit for primary keys in MERGE operations and breaks the stream.

Recommendations (in order of preference):

  1. Best: use a primary key. The default setting of REPLICA IDENTITY DEFAULT automatically and efficiently uses the existing primary key.
  2. Good: if no primary key exists, create a UNIQUE NOT NULL index and set REPLICA IDENTITY USING INDEX INDEX_NAME.
  3. Least recommended: only use the REPLICA IDENTITY FULL setting on tables with no unique identifier. Be aware of the performance impact, the 16-column limit, and the restriction on supported data types for primary keys if replicating to BigQuery.
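The second option above can be sketched as follows. The table, column, and index names are hypothetical; note that REPLICA IDENTITY USING INDEX requires a unique, non-partial, non-deferrable index on columns that are NOT NULL.

```sql
-- Hypothetical table without a primary key.
ALTER TABLE public.orders ALTER COLUMN order_ref SET NOT NULL;
CREATE UNIQUE INDEX orders_replica_idx ON public.orders (order_ref);

-- Use the unique index as the table's replica identity.
ALTER TABLE public.orders REPLICA IDENTITY USING INDEX orders_replica_idx;
```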

Known limitations

Note: This section describes limitations for using Datastream with a PostgreSQL database as a source. In addition to these limitations, you need to familiarize yourself with PostgreSQL logical replication limitations, because they also affect how Datastream streams data from a PostgreSQL source. To learn about limitations of PostgreSQL logical replication, see logical replication restrictions.

Known limitations for using Datastream with a PostgreSQL database as a source include:

  • Streams are limited to 10,000 tables.
  • A table that has more than 500 million rows can't be backfilled unless the following conditions are met:
    1. The table has a unique B-tree index.
    2. The index doesn't include columns of the following types: DOUBLE, FLOAT, MONEY, REAL, JSON, JSONB, BYTEA, TXID, XML, composite data types, or geometric data types.
    3. None of the columns of the index are nullable.
    4. All columns of the index are in ascending order, or all columns of the index are in descending order.
    5. All columns of the index are included in the stream.
  • Tables without primary keys must have a REPLICA IDENTITY. Otherwise, only INSERT events are replicated to the destination.
  • Tables with primary keys can't have the REPLICA IDENTITY set to FULL or NOTHING. It has to be set to DEFAULT.
  • Datastream can't replicate from a read replica instance, because PostgreSQL doesn't support logical decoding in read replicas.
  • Not all changes to the source schema can be detected automatically, in which case data corruption may occur. The following schema changes may cause data corruption or failure to process the events downstream:
    • Dropping columns.
    • Adding columns to the middle of a table.
    • Changing the data type of a column.
    • Reordering columns.
    • Dropping tables (relevant if the same table is then recreated with new data added).
  • Datastream doesn't support columns of the geometric data types.
  • Datastream doesn't support columns of the range data types.
  • Datastream doesn't support arrays of unsupported data types, arrays of user-defined data types (including ENUM), or arrays of DATE, TIMESTAMP, or TIMESTAMP WITH TIME ZONE data types. Such columns are ignored.
  • Datastream doesn't support replicating UPDATE events for rows that include TOAST values in columns that are part of the table's replica identity. Such events are discarded.
  • Datastream doesn't support replicating rows that include JSON or JSONB values with more than 2950 nested objects. Events containing such JSON or JSONB values aren't replicated to the destination database.
  • Datastream doesn't support replicating rows that include NaN values in NUMERIC (precision, scale) columns. The values in such columns are replaced with NULL values.
  • Datastream doesn't support replicating columns of the hstore data type. The values in such columns are replaced with NULL values.
  • Datastream doesn't support replicating non-ASCII records from a SQL_ASCII encoded source database. Such records are discarded.
  • Datastream doesn't support replicating tables with Row-Level Security (RLS) policies defined. For information about how to bypass this limitation, see PostgreSQL source behavior and limitations.
  • Datastream doesn't capture changes made to generated columns.
  • Datastream might stop working or stop capturing new events when a PostgreSQL major version upgrade is performed on the database. We recommend that you drop the replication slots before the upgrade, upgrade the database, and then recreate the replication slots. If the streams fail, recover each stream by specifying the new replication slot name, and perform a backfill if data consistency is required.
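The drop-and-recreate sequence around a major version upgrade can be sketched as follows. The slot name is hypothetical, and a slot must be inactive (the stream stopped) before it can be dropped.

```sql
-- Before the major version upgrade: drop the slot.
SELECT pg_drop_replication_slot('datastream_slot');

-- After the upgrade: recreate the slot with the pgoutput plugin.
SELECT pg_create_logical_replication_slot('datastream_slot', 'pgoutput');
```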



Last updated 2025-12-15 UTC.