Replication lag


This page describes how to troubleshoot and fix replication lag for Cloud SQL read replicas.

Overview

Cloud SQL read replicas use PostgreSQL streaming replication. Changes are written to the write-ahead log (WAL) on the primary instance. The WAL sender process sends the WAL records to the WAL receiver process on the replica, where they are applied.

Replication lag can happen in a few scenarios, such as:

  • The primary instance can't send the changes fast enough to the replica.
  • The replica can't receive the changes quickly enough.
  • The replica can't apply the changes quickly enough.

The first two causes can be monitored with the network_lag metric. The third is observed through the replica_lag metric. A high replica_lag value means that the replica can't apply replication changes fast enough. The total lag can be observed with the replica_byte_lag metric, whose labels provide further detail. These metrics are described in the Monitor replication lag section below.

Ensure replica is adequately provisioned

A replica instance that is smaller than the primary instance (for example, with fewer vCPUs and less memory) can experience replication lag. A smaller replica might also have different default configuration flags than a larger primary instance. We recommend making the replica instance at least as large as the primary instance so that it has enough resources to handle the replication load.

High CPU utilization on the replica can also cause replication lag. If the replica's CPU utilization is high (for example, over 90%), consider increasing the replica's CPU capacity.

You can use the SHOW ALL command to see the replica and primary instance configurations and compare them for differences.
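
For example, you can run the same query against the pg_settings view on each instance and compare the output. The flag names below are standard PostgreSQL parameters chosen for illustration; adjust the list to the flags you care about:

  SELECT name, setting
  FROM pg_settings
  WHERE name IN ('max_standby_streaming_delay',
                 'max_standby_archive_delay',
                 'hot_standby_feedback',
                 'work_mem',
                 'shared_buffers')
  ORDER BY name;

Running this on both the primary and the replica and diffing the results highlights configuration drift between the two instances.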

Optimize queries and schema

This section suggests some common query and schema optimizations you can make to improve replication performance.

Long-running queries in the read replica

Long-running queries on the replica might block replication for Cloud SQL. This can happen when replication is trying to apply changes (for example, from a VACUUM operation) to rows that are being read by a query on the replica.

You might want to have separate replicas for online transaction processing (OLTP) and online analytical processing (OLAP) purposes, and send long-running queries only to the OLAP replica.

To help address replication delays or blockages caused by long-running transactions, we recommend the following:

  • Adjust standby delay flags. The max_standby_archive_delay and max_standby_streaming_delay flags control how long a replica waits before canceling standby queries that conflict with replication. Reasonable values are often around 30 to 60 seconds. You can check the pg_stat_database_conflicts view for insights about query conflicts.
  • Enable the hot_standby_feedback flag. Setting the hot_standby_feedback flag to on in the replica can help by delaying vacuum operations on the primary. However, this can cause table bloat on the primary, so it's a trade-off.
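
To check whether WAL replay is canceling queries on the replica, you can query the pg_stat_database_conflicts view directly. This is a sketch; run it on the replica, where non-zero counters indicate canceled queries:

  SELECT datname,
         confl_snapshot,   -- conflicts with old snapshots (for example, VACUUM cleanup)
         confl_lock,       -- conflicts with exclusive locks
         confl_bufferpin,  -- conflicts with pinned buffers
         confl_deadlock    -- conflicts that caused deadlocks
  FROM pg_stat_database_conflicts
  WHERE datname = current_database();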

Review the PostgreSQL documentation for more information.

High network lag

High network lag indicates that WAL records are not being sent by the primary or received by the replica fast enough. This can be caused by the following:

  • Cross-region replication. Replicating between different regions can introduce higher network latency.
  • High primary CPU utilization. If the primary's CPU is over 90%, the WAL sender process might not get enough CPU time. Consider reducing load on the primary or increasing its CPU.
  • High replica CPU utilization. If the replica's CPU is over 90%, the WAL receiver process might not get enough CPU time. Consider reducing load on the replica or increasing its CPU.
  • Network bandwidth issues or disk I/O bottlenecks. A closer region or a higher-throughput disk configuration might help. Consider modifying the wal_compression flag value on the primary instance to help reduce cross-region traffic.

You can monitor network lag with the cloudsql.googleapis.com/database/replication/network_lag metric. This metric is capped at 25 seconds, even if the actual lag is higher.

The network_lag metric is similar to the cloudsql.googleapis.com/database/postgresql/replication/replica_byte_lag metric, which measures the sent_location lag in bytes, as indicated by its replica_lag_type label.
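
You can also observe the per-stage WAL positions directly on the primary with the standard pg_stat_replication view (the pg_wal_lsn_diff function and the *_lsn columns assume PostgreSQL 10 or later):

  SELECT application_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS sent_lag_bytes,
         pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)  AS write_lag_bytes,
         pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;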

Exclusive locks due to DDL

Data definition language (DDL) commands, such as ALTER TABLE and CREATE INDEX, can cause replication lag on the replica due to exclusive locks. To avoid lock contention, consider scheduling DDL execution during times when the query load is lower on the replicas.
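
Where possible, prefer DDL variants that avoid long-held locks. For example, CREATE INDEX CONCURRENTLY allows writes to the table to continue while the index is built (the table and index names below are hypothetical):

  -- Builds the index without blocking concurrent writes to the table.
  -- It takes longer than a plain CREATE INDEX, but holds weaker locks.
  CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);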

Review the PostgreSQL documentation for more information.

Overloaded replica

If a read replica receives too many queries, replication might be blocked. Consider splitting reads among multiple replicas to reduce the load on each one.

To avoid query spikes, consider throttling replica read queries in your application logic, or in a proxy layer if you use one.

If there are spikes of activity on the primary instance, consider spreading out updates.

Monolithic primary database

Consider sharding the primary database vertically (or horizontally) to prevent one or more lagging tables from holding back all the other tables.

Monitor replication lag

You can use the replica_lag and network_lag metrics to monitor replication lag and identify whether the cause of the lag is in the primary database, the network, or the replica.

Replication lag
(cloudsql.googleapis.com/database/replication/replica_lag)

The number of seconds that the replica's state lags behind the state of the primary instance. This is the difference between the current time and the original timestamp at which the primary database committed the transaction that is currently being applied on the replica. In particular, writes can be counted as lagging even if they have been received by the replica, if the replica hasn't yet applied them to the database.

This metric is calculated using now() - pg_last_xact_replay_timestamp() on the replica and is an approximation. If replication is broken, the replica doesn't know how far ahead the primary database is, and this metric doesn't indicate total lag.

Lag bytes
(cloudsql.googleapis.com/database/postgresql/replication/replica_byte_lag)

The number of bytes by which the replica's state lags behind the state of the primary database. replica_byte_lag exports four time series, and the replica_lag_type label can indicate any of the following:

  • sent_location: Indicates how many bytes of WAL have been generated but not yet sent to the replica.
  • write_location: Write minus sent lag shows WAL bytes that have been sent over the network but not yet written on the replica.
  • flush_location: Flush minus write lag shows WAL bytes written on the replica but not yet flushed there.
  • replay_location: Shows the total lag in bytes. Replay minus flush lag indicates the replay delay.

Network lag
(cloudsql.googleapis.com/database/replication/network_lag)

The amount of time, in seconds, that it takes from a commit in the primary database to reach the WAL receiver on the replica.

If network_lag is zero or negligible, but replica_lag is high, it indicates that the WAL receiver can't apply replication changes fast enough.
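
You can reproduce the replica_lag calculation manually by running the same expression on the replica:

  SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;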

Verify replication

To verify that replication is working, run the following statement against the replica:

  select status, last_msg_receipt_time from pg_stat_wal_receiver;

If replication is happening, you see the status streaming and a recent last_msg_receipt_time:

  postgres=> select status, last_msg_receipt_time from pg_stat_wal_receiver;
    status   |     last_msg_receipt_time
  -----------+--------------------------------
   streaming | 2020-01-21 20:19:51.461535+00
  (1 row)

If replication is not happening, an empty result is returned:

  postgres=> select status, last_msg_receipt_time from pg_stat_wal_receiver;
   status | last_msg_receipt_time
  --------+-----------------------
  (0 rows)


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.