Replication lag

MySQL  |  PostgreSQL  |  SQL Server

This page describes how to troubleshoot and fix replication lag for Cloud SQLread replicas.

Overview

Cloud SQL read replicas use MySQL row-based replication using global transactionidentifiers (GTIDs). Changes are written to the binary log of the primary instanceand sent to the replica, where they are received and then applied to the database.

Replication lag can happen in a few scenarios, such as:

  • The primary instance can't send the changes fast enough to the replica.
  • The replica can't receive the changes quickly enough.
  • The replica can't apply the changes quickly enough.
Use thenetwork_lag metric to monitor the first two scenarios when theprimary instance can't send changes fast enough or the replica can't receive changesquickly enough.

The total lag is observed with thereplica_lag metric.The difference betweenreplica_lag and thenetwork_lag can indicatethe third reason when the replica can't apply replication changes fast enough.For more information about these metrics, seeMonitor replication lag.

Faster replica configuration

We have two ways to make a MySQL replica apply changes faster. Users can configuretheir replicas with the following options:

  • Parallel replication
  • High performance flushing

Parallel replication

Parallel replication might help replication lag by configuring the replica to usemultiple threads acting in parallel to apply changes on the replica.For information about using parallel replication, seeConfiguring parallel replication.

When you enable parallel replication by setting thereplica_parallel_workers(orslave_parallel_workers) flag, consider the following:

  • We recommend setting thereplica_parallel_workers flag value to a number that matches the vCPU count of the replica instance. Setting it to a very high value can cause lock waits, lock wait timeouts, and deadlocks. If you observe lock wait spikes aligned with replication lag, consider reducing parallelism.
  • If your MySQL version supports thebinlog_transaction_dependency_tracking flag, consider setting it toWRITESET for the primary instance. This is the default behavior for 8.4 and later.

High performance flushing

By default, Cloud SQL for MySQL flushes the redo logs to disk after each transactionfor durability.High performance flushing reduces the frequency with which the redo logs are flushedto the disk to once per second. Doing so can help improve write performance on thereplica by reducing disk I/O.

Set theinnodb_flush_log_at_trx_commitflag on the read replica to 2. If binary logging is enabled for the replica, to help make theinnodb_flush_log_at_trx_commit flag effective, we recommend setting thesync_binlogflag to a high value, for example, 10,000.

SeeTips for working with flags for more information about this flag.

When theinnodb_flush_log_at_trx_commitflag is set on the read replica and Cloud SQL detects that a crash might haveoccurred, Cloud SQL automatically recreates the replica.

Ensure replica is adequately provisioned

A replica instance that is smaller than the primary instance (for example, withless vCPUs and memory) can experience replication lag. A smaller replica mightalso have different default configuration flags compared to a larger primaryinstance. We recommend that the replica instance is at least as large as theprimary instance to have enough resources to handle the replication load.

High CPU utilization on the replica can also cause replication lag. If thereplica's CPU utilization is high (for example, over 90%), consider increasingthe replica's CPU capacity.

You can use theSHOW VARIABLES command to see replica and primary instanceconfiguration and compare them for differences. For example, a smallerreplica can't configure theinnodb_buffer_pool_size to be the same value asthe primary and it might impact the replica's performance.

Optimize queries and schema

This section suggests some common query and schema optimizations you can make toimprove replication performance.

Query isolation level in the read replica

TheREPEATABLE READ andSERIALIZABLE transaction isolationlevels acquire locks that might block the replication changes. Consider reducingthe isolation level for your queries in the replica. TheREAD COMMITTEDtransaction isolation level might perform better.

Note: The MySQL default transaction isolation level isREPEATABLE READ for historical reasons.

Long-running transactions in the primary database

Long-running transactions on the primary instance can cause replication lag.The binary log is not sent to the replica until the transaction is committed.

If a large number of rows are updated in a single transaction,it can cause a sudden spike in the number of changes that need to be applied to the primaryinstance and then sent to the replica. This applies to single-statement updates or deletesthat affect many rows at once. Changes are sent to the replica after they are committed.Applying a sudden spike of changes in the replica can increase the possibilityof lock contention in the replica if the query load on the replica is alsohigh, leading to replication lag.

Consider breaking large transactions into multiple smaller transactions. Youcan monitor for long-running transactions by checking thecloudsql.googleapis.com/database/mysql/innodb/active_trx_longest_time metricon the primary.

Missing primary keys

Cloud SQL read replicas use row-based replication, which performs poorly if theMySQL tables that are replicated don't have primary keys. We recommend that allof the replicated tables have primary keys.

For MySQL 8 or later, we recommend that you set the flagsql_require_primary_key toON to require tables in your database to have primary keys.

Long-running transactions in the read replica

Long-running transactions on the replica, such asSELECT statements, can blockor slow down replication. Table scan is a common problem. Investigate anylong-running queries and consider optimizing them. These queries can lead toInnoDB history list size growth.

ExcessiveInnoDB history length

A very largeInnoDB history list can cause performance problems and slow downreplication. You can monitor the history list length using thecloudsql.googleapis.com/database/mysql/innodb/history_list_length metric.This metric can be high in the primary as well and might be causing performanceissues already. If, after initial startup, your replica shows signs of high replication lag,this might be the cause.

A large history list can be caused by the following:

  • Long-running transactions. Long-running or idle transactions prevent purging of old undo log entries.
  • Slow disk performance. Purging is an I/O-intensive operation.
  • REPEATABLE READ isolation level. This can contribute to history list growth.
  • Insufficient purge configuration.Theinnodb_purge_threads parameter, which controls the number of threads dedicated to purging, might be set too low for the workload.

To help address this, try the following:

  • Breakdown large transaction to smaller ones. Allow quicker purging of old logs.
  • Use a larger instance. Larger instances have more CPU and memory.
  • Tune purge settings. Increaseinnodb_purge_threads,innodb_io_capacity, andinnodb_io_capacity_max.
  • UseREAD COMMITTED isolation level.
  • Ensure tables have primary keys. Tables without primary keys can cause table scans which can slow down replication and contribute to history list growth.

High amount of lock waits

A high number of lock waits on the replica can slow down replication, especially with parallel replication enabled. You can monitor for lock waits and deadlocks using the followingmetrics:

  • cloudsql.googleapis.com/database/mysql/innodb/row_lock_waits_count
  • cloudsql.googleapis.com/database/mysql/innodb/row_lock_time
  • cloudsql.googleapis.com/database/mysql/innodb/lock_timeout_count
  • cloudsql.googleapis.com/database/mysql/innodb/deadlocks_count

If these lock metrics are too high and seem to be correlated with replication lag,consider reducing the value of thereplica_parallel_workers flag.Isolation level might also impact locks.

Exclusive locks due to DDL

Data definition language (DDL) commands, such asALTER TABLE andCREATE INDEX, can cause replication lag in the replica due toexclusive locks. To avoid lock contention, consider scheduling DDL executionduring times when the query load is lower on the replicas.

Consider utilizing the MySQL 8.0 Online DDL feature.

Overloaded replica

If a read replica is receiving too many queries, replication could be blocked.Consider splitting the reads among multiple replicas to reduce the load on eachone.

To avoid query spikes, consider throttling replica read queries in yourapplication logic or in a proxy layer if you use one.

If there are spikes of activity on the primary instance, consider spreading outupdates.

Monolithic primary database

Consider sharding the primary database vertically (or horizontally) to preventone or more lagging tables from holding back all the other tables.

Monitor replication lag

You can use thereplica_lag andnetwork_lag metrics to monitor replicationlag and identify whether the cause of the lag is in the primary database,the network, or the replica.

MetricDescription
Replication lag
(cloudsql.googleapis.com/database/replication/replica_lag)

The number of seconds that the replica's state is lagging behind the state of the primary instance. This is the difference between the current time and the original timestamp at which the primary database committed the transaction that is currently being applied on the replica. In particular, writes might be counted as lagging even if they have been received by the replica, if the replica hasn't yet applied the write to the database.

This metric reports the value ofSeconds_Behind_Master whenSHOW SLAVE STATUS is run on the replica. For more information, seeChecking Replication Status in the MySQL Reference Manual.

Last I/O thread error number
(cloudsql.googleapis.com/database/mysql/replication/last_io_errno)

Indicates the last error that caused the I/O thread to fail. If this is non-zero, replication is broken. This is rare, but it might happen. Check MySQL documentation to understand what the error code indicates. For example, binlog files in the primary instance might have been deleted before the replica received them. Cloud SQL usually automatically recreates the replica if replication is broken. Thislast_io_errno metric might tell you the reason why.

Last SQL thread error number
(cloudsql.googleapis.com/database/mysql/replication/last_sql_errno)

Indicates the last error that caused the SQL thread to fail. If this is non-zero, replication is broken. This is rare, but it might happen. Check MySQL documentation to understand what the error code indicates. Cloud SQL will usually automatically recreate the replica if replication is broken. Thislast_sql_errno metric can tell you the reason why.

Network lag
(cloudsql.googleapis.com/database/replication/network_lag)

The length of time, in seconds, that it takes from writing the binlog in the primary database to reaching the IO thread in the replica.

If thenetwork_lag is zero, or negligible, but thereplica_lag is high, it indicates that the SQL thread is not able to apply replication changes fast enough.

Verify replication

To verify that replication is working, run the following statement against thereplica:

mysql> SHOW SLAVE STATUS\G;*************************** 1. row ***************************               Slave_IO_State: Queueing master event to the relay log                  Master_Host: xx.xxx.xxx.xxx                  Master_User: cloudsqlreplica                  Master_Port: 3306                Connect_Retry: 60              Master_Log_File: mysql-bin.199927          Read_Master_Log_Pos: 83711956               Relay_Log_File: relay-log.000025                Relay_Log_Pos: 24214376        Relay_Master_Log_File: mysql-bin.199898             Slave_IO_Running: Yes            Slave_SQL_Running: Yes              Replicate_Do_DB:          Replicate_Ignore_DB:           Replicate_Do_Table:       Replicate_Ignore_Table:      Replicate_Wild_Do_Table:  Replicate_Wild_Ignore_Table:                   Last_Errno: 0                   Last_Error:                 Skip_Counter: 0          Exec_Master_Log_Pos: 24214163              Relay_Log_Space: 3128686571              Until_Condition: None               Until_Log_File:                Until_Log_Pos: 0           Master_SSL_Allowed: Yes           Master_SSL_CA_File: master_server_ca.pem           Master_SSL_CA_Path: /mysql/datadir              Master_SSL_Cert: replica_cert.pem            Master_SSL_Cipher:               Master_SSL_Key: replica_pkey.pem        Seconds_Behind_Master: 2627Master_SSL_Verify_Server_Cert: No                Last_IO_Errno: 0                Last_IO_Error:               Last_SQL_Errno: 0               Last_SQL_Error:  Replicate_Ignore_Server_Ids:             Master_Server_Id: 321071839                  Master_UUID: 437d04e9-8456-11e8-b13d-42010a80027b             Master_Info_File: mysql.slave_master_info                    SQL_Delay: 0          SQL_Remaining_Delay: NULL      Slave_SQL_Running_State: System lock           Master_Retry_Count: 86400                  Master_Bind:      Last_IO_Error_Timestamp:     Last_SQL_Error_Timestamp:               Master_SSL_Crl:           Master_SSL_Crlpath:           Retrieved_Gtid_Set: 437d04e9-8456-11e8-b13d-42010a80027b:52111095710-52120776390            Executed_Gtid_Set: 437d04e9-8456-11e8-b13d-42010a80027b:1-52113039508                Auto_Position: 1         Replicate_Rewrite_DB:                 Channel_Name:           Master_TLS_Version:1 row in set (0.00 sec)

If replication is happening, the first column,Slave_IO_State, showsWaitingfor master to send event or a similar message. Also, theLast_IO_Error fieldis empty.

If replication is not happening, theSlave_IO_State column shows the statusConnecting to master and theLast_IO_Error column shows the statuserror connecting to master cloudsqlreplica@x.x.x.x:3306.

According to theMySQL documentation,a few other interesting fields that relate to replication lag include thefollowing:

FieldDescription
Master_Log_File
The name of the source binary log file that the I/O thread is currently reading from.
Read_Master_Log_Pos
The position in the current source binary log file the I/O thread has read up to.
Relay_Log_File
The name of the relay log file the SQL thread is currently reading and executing from.
Relay_Log_Pos
The position in the current relay log file the SQL thread has read and executed up to.
Relay_Master_Log_File
The name of the source binary log file containing the most recent event executed by the SQL thread.

In the previous exampleRelay_Master_Log_File has the valuemysql-bin.199898.Master_Log_File has the valuemysql-bin.199927. The numeric suffix 199898 isless than 199927. This means that even though the replica has received a newermysql-bin.199927 log file, it's still applying the oldermysql-bin.199898.

In this case, the SQL thread is lagging in the replica.

You can also connect to the primary database and execute:

  SHOW MASTER STATUS;

This command shows you which binlog file is being written in the primary database.

If the primary database binary log file is newer than theMaster_Log_File in the replica,it means that the I/O thread is lagging. The replica is still reading an olderbinary log file from the primary database.

When the I/O thread is lagging, thenetwork_lag metric is also high. When the SQL threadis lagging, but the I/O thread is not, then thenetwork_lag metric isn't as high, butthereplica_lag is high.

The previous commands let you observe lag details while the lag is happening,but the metricsnetwork_lag andreplica_lag provide you a way tolook into the past occurrences of the lag.

Recreate lagging replica

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of theService Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see thelaunch stage descriptions.

Recreate a lagging replica when replication falls behind an acceptable length oftime.

With Cloud SQL, you can configure your read replica to recreate itselfif replication lags (or is delayed) beyond an acceptable length of time,and that delay persists for at least five minutes.

If you define an acceptable replication delay as being less than 360 seconds(six minutes), and a replication delay of at least 361 seconds persists for morethan five minutes, then after five minutes, the primary instance creates a newsnapshot of itself and the read replica is recreated using this snapshot.

Recreating a lagging read replica provides the following benefits:

  • You control what's considered an acceptable range for replication delay.
  • You can reduce the time spent troubleshooting replication delay by hours oreven days.

Additional feature details apply:

  • Compatible with the following versions:
    • MySQL 5.7
    • MySQL 8.0
    • MySQL 8.4
  • An acceptable range for replication lag or delay must be defined in seconds.
  • The minimum acceptable value is 300 seconds or five minutes.
  • The maximum acceptable value is 31,536,000 seconds or one year.
    • If you enable recreate lagging replica for an instance but don't set themaximum acceptable replication delay, then Cloud SQL uses thedefault value of one year.
  • Types of instances that are supported:
    • Read replica
    • Cross-region read replica
    • Cascading replica
  • The value set for thereplicationLagMaxSeconds field is specific to eachreplica instance. If a primary instance has multiple replica instances, thenyou can set each replica with a different value.
  • When a replica is recreated, users can expect some downtime while the followingoperations are completed:
    • Replication is stopped.
    • The replica is deleted.
    • A snapshot of the primary instance is created.
    • The replica is recreated from this latest snapshot. The new replica uses thesame name and IP address as the previous replica. As a result, MySQL muststop and restart.
    • The new replica starts replicating data.
  • ThereplicationLagMaxSeconds is an instance level field. Each instance hasits own value.
  • If you have multiple read replicas for the same primary instance, you can seta unique value for thereplicationLagMaxSeconds field for each replica.

    Defining different time thresholds for different replicas can help you avoida scenario where all replicas go down at the same time.

Enable recreate lagging replica

The recreate lagging replica feature is disabled by default. To enable it whenyou create an instance, use one of the following methods:

gcloud

Use thegcloud sql instances createcommand to create a new read replica instance with the
--replication-lag-max-seconds-for-recreate flag:

gcloudbetasqlinstancescreateREPLICA_INSTANCE_NAME\--master-instance-name=PRIMARY_INSTANCE_NAME\--database-version=DATABASE_VERSION\--tier=TIER\--edition=EDITION\--region=REGION\--root-password=PASSWORD\--replication-lag-max-seconds-for-recreate=REPLICATION_LAG_MAX_SECONDS

Where:

  • REPLICA_INSTANCE_NAME is the name of the replicainstance.
  • PRIMARY_INSTANCE_NAME is the name of the primaryinstance.
  • DATABASE_VERSION is the database version of theinstance. For example,MYSQL_8_0_31.
  • TIER is the machine type you want to use for thereplica instance. For example,db-perf-optimized-N-4. For moreinformation, seeCustom instance configurations.
  • EDITION is the edition you want to use for thereplica instance. For example,ENTERPRISE_PLUS. For more information,seeCreate an instance.
  • REGION is the region you want to use for thereplica instance. For example,us-central1.
  • PASSWORD is the root password for the instance.
  • REPLICATION_LAG_MAX_SECONDS is the maximumacceptable replication lag or delay in seconds. For example,600. Theminimum acceptable value is 300 seconds or five minutes. The maximumacceptable value is 31,536,000 seconds or one year.

REST API

ThereplicationLagMaxSeconds field is located in theDatabaseInstanceresource. Add this field to the request body:

{"settings":{"replicationLagMaxSeconds":REPLICATION_LAG_MAX_SECONDS,}...}

Where:

  • REPLICATION_LAG_MAX_SECONDS is the maximum acceptablereplication lag or delay in seconds. For example,600.

Update the recreate timeframe for replication lag

To view an instance's settings, use any of the methods described inView instance summary information.

With this information, you can choose whether or not to update the replicationlag timeframe you specified as being acceptable before the replica is recreated.

gcloud

Use thegcloud sql instances patchcommand to update the timeframe for recreating the instance based onreplication lag:

gcloudbetasqlinstancespatchINSTANCE_NAME\--replication-lag-max-seconds-for-recreate=REPLICATION_LAG_MAX_SECONDS

Where:

  • INSTANCE_NAME is the name of the instance.
  • REPLICATION_LAG_MAX_SECONDS is the maximumacceptable replication lag or delay in seconds. For example,700. Ifyou want to revert to the default value of one year, enter31536000.The minimum acceptable value is 300 seconds or five minutes. The maximumacceptable value is 31,536,000 seconds or one year.

REST API

The policy can be updated using theinstances.patchandinstance.insert.

To see an example of how to update the setting using the REST API, seeEdit an instance.

Limitations

The following limitations apply to recreating lagging replicas:

  • Values forreplicationLagMaxSeconds can only be set in seconds.
  • Indexes created onthe read replica prior to a recreate operation won't persist. If an indexexists, then create a secondary index after the replica is recreated.
  • To avoid frequent downtimes on read replicas, recreations are limited toone per day per instance.
  • Replicas of external servers are notsupported with this feature.
  • If you enable recreation of lagging replicas on a cascading replica, thenCloud SQL recreates the leaf replicas first in order to maintainreplication consistency.
  • Recreating a cross-region replica incurs additionalcost.
  • You can't enable recreation of lagging replicas in the Google Cloud console.

What's next:

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.