About disaster recovery (DR) in Cloud SQL

This page introduces disaster recovery in Cloud SQL.

Overview

In Google Cloud, database disaster recovery (DR) is about providing continuity of processing, specifically when a region fails or becomes unavailable. Cloud SQL is a regional service (when Cloud SQL is configured for high availability (HA)). Therefore, if the Google Cloud region that hosts a Cloud SQL database becomes unavailable, then the Cloud SQL database also becomes unavailable.

To continue processing, you must make the database available in a secondary region as soon as possible. The DR plan requires you to configure a cross-region read replica in Cloud SQL. A failover based on export/import or backup/restore is also possible, but that approach takes longer, especially for large databases.

The following business scenarios are examples that warrant a cross-region failover configuration:

  • The service level agreement of the business application is greater than the regional Cloud SQL Service Level Agreement (99.99% availability depending on your Cloud SQL edition). By failing over to another region, you can mitigate an outage.
  • All tiers of the business application are already multi-regional and can continue processing when a region outage occurs. The cross-region failover configuration helps support the continued availability of a database.
  • The required recovery time objective (RTO) and recovery point objective (RPO) are in minutes rather than in hours. Failing over to another region is faster than recreating a database.
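
As a rough illustration of the first scenario, the following Python sketch converts an availability target into a downtime budget. The function name and the 30-day measurement window are assumptions made for this example; the Cloud SQL SLA itself defines how availability is actually measured.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime that an availability target permits.

    window_days is an assumed measurement window, not an SLA definition.
    """
    if not 0 < availability_percent <= 100:
        raise ValueError("availability_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets over the assumed 30-day window.
    for target in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime")
```

If the application's own target allows less downtime than the regional service does, a regional outage alone can exhaust the budget, which is the case that a cross-region failover is meant to cover.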

In general, there are two variants for the DR process:

  • A database fails over to a secondary region. After the database is ready and used by an application, it becomes the new primary database and remains the primary database.
  • A database fails over to a secondary region but falls back to the primary region after the primary region is recovered from its failure.

This overview describes the second variant, in which a failed database is recovered and falls back to the primary region. This DR process variant is especially relevant for databases that must run in the primary region because of network latency, or because some resources are available only in the primary region. With this variant, the database runs in the secondary region only for the duration of the outage in the primary region.

Disaster recovery architecture

The following diagram shows the minimal architecture that supports database DR for an HA Cloud SQL instance:

The primary and standby instances are located in one region, and the read replica is in a second region.

The architecture works as follows:

  • Two instances of Cloud SQL (a primary instance and a standby instance) are located in two separate zones within a single region (the primary region). The instances are synchronized by using regional persistent disks.
  • One instance of Cloud SQL (the cross-region read replica) is located in a second region (the secondary region). For DR, the cross-region read replica synchronizes with the primary instance by using asynchronous replication.

The primary and standby instances share the same regional disk, so their states are identical.

Because this setup uses asynchronous replication, it's possible that the cross-region read replica lags behind the primary instance. As a result, when a failover occurs, the cross-region read replica RPO is likely non-zero.
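
To make the non-zero RPO concrete, the following sketch models a primary's commit log and the last transaction that the replica has applied. The `Commit` type, the monotonically increasing transaction IDs, and the timestamps are hypothetical and exist only for this example; Cloud SQL doesn't expose replication state in this form.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Commit:
    txn_id: int          # assumed to increase monotonically on the primary
    committed_at: float  # seconds since an arbitrary reference point


def unreplicated_commits(primary_log: List[Commit], replica_applied_txn_id: int) -> List[Commit]:
    """Commits acknowledged by the primary that the replica has not applied."""
    return [c for c in primary_log if c.txn_id > replica_applied_txn_id]


def data_loss_window_seconds(primary_log: List[Commit],
                             replica_applied_txn_id: int,
                             outage_at: float) -> float:
    """Time span between the oldest unreplicated commit and the outage."""
    lost = unreplicated_commits(primary_log, replica_applied_txn_id)
    if not lost:
        return 0.0
    return outage_at - min(c.committed_at for c in lost)


if __name__ == "__main__":
    log = [Commit(1, 0.0), Commit(2, 5.0), Commit(3, 9.0)]
    # The replica applied transaction 1 before the region failed at t=10.
    print(len(unreplicated_commits(log, 1)), "commits missing on the replica")
    print(data_loss_window_seconds(log, 1, outage_at=10.0), "seconds of writes at risk")
```

In this model, promoting the replica discards every commit that the function returns, which is why the observed replication lag at the time of failure bounds the achievable RPO.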

Disaster recovery (DR) process

The disaster recovery (DR) process starts when the primary region becomes unavailable. To resume processing in a secondary region, you trigger a failover of the primary instance by promoting a cross-region read replica. The DR process prescribes the operational steps that must be performed, either manually or automatically, to mitigate the region failure and establish a running primary instance in a secondary region.

The following diagram shows the DR process:

When region 1 becomes unavailable, the original read replica is promoted to be the primary.

The DR process consists of the following steps:

  1. The primary region (R1), which is running the primary instance, becomes unavailable.
  2. The operations team recognizes and formally acknowledges the disaster and decides whether a failover is required.
  3. If a failover is required, you can promote the cross-region read replica in the secondary region (R2) to be the new primary instance.
  4. Client connections are reconfigured to resume processing on the new primary instance in R2.

This initial process establishes a working primary database again. However, it doesn't establish a complete DR architecture, where the new primary instance itself has a standby instance and a cross-region read replica.

A complete DR process ensures that the single instance, the new primary, is enabled for HA and has a cross-region read replica. A complete DR process also provides a fallback to the original deployment in the original primary region.

Failing over to a secondary region

A complete DR process extends the basic DR process by adding steps to establish a complete DR architecture after failover. The following diagram shows a complete database DR architecture after the failover:

Clients begin accessing the new primary instance and a read replica is set up in a third region.

The complete database DR process consists of the following steps:

  1. The primary region (R1), which is running the primary database, becomes unavailable.
  2. The operations team recognizes and formally acknowledges the disaster and decides whether a failover is required.
  3. If a failover is required, you can then promote the cross-region read replica in the secondary region (R2) to be the new primary instance.
  4. Client connections are reconfigured to resume processing on the new primary instance in R2.
  5. A new standby instance is created and started in R2 and added to the primary instance. The standby instance is in a different zone from the primary instance. The primary instance is now highly available because a standby instance was created for it.
  6. In a third region (R3), a new cross-region read replica is created and attached to the primary instance. At this point, a complete disaster recovery architecture is recreated and operational.
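
Steps 3, 5, and 6 can be sketched as a short command plan. The following Python example only builds and prints `gcloud` command lines; it doesn't run them. The instance names and the region are hypothetical, and the commands and flags shown (`promote-replica`, `patch --availability-type`, `create --master-instance-name`) are written from memory of the `gcloud sql instances` command group, so verify them against the current gcloud reference for your database engine before use. Step 4, client reconfiguration, is application specific and isn't covered.

```python
import shlex
from typing import List


def complete_failover_plan(replica: str, new_replica: str, new_replica_region: str) -> List[List[str]]:
    """Build (but don't run) the commands for steps 3, 5, and 6."""
    return [
        # Step 3: promote the cross-region read replica in R2.
        ["gcloud", "sql", "instances", "promote-replica", replica],
        # Step 5: enable HA so that a standby is created in another zone.
        ["gcloud", "sql", "instances", "patch", replica,
         "--availability-type=REGIONAL"],
        # Step 6: create a new cross-region read replica in R3 (or in R1).
        ["gcloud", "sql", "instances", "create", new_replica,
         f"--master-instance-name={replica}",
         f"--region={new_replica_region}"],
    ]


if __name__ == "__main__":
    # Hypothetical names; replace them with your own instances and region.
    for argv in complete_failover_plan("orders-replica-r2", "orders-replica-r3", "us-west1"):
        print(shlex.join(argv))
```

Keeping the plan as data makes it easy to review in a runbook or to hand to whatever automation the operations team already uses.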

If the original primary region (R1) becomes available before step 6 is implemented, the cross-region read replica can be placed in region R1, rather than region R3, right away. In this case, the fallback to the original primary region (R1) is less complex and requires fewer steps.

Note: You can reduce the RTO for the complete DR process by eliminating steps 5 and 6. To do so, create the cross-region replica in the secondary region (R2) as an HA replica and the cross-region replica in the third region (R3) as a cascading replica with its source as the replica in R2.

Avoiding a split-brain state

A failure of the primary region (R1) doesn't mean that the original primary instance and its standby instance are automatically shut down, removed, or otherwise made inaccessible when R1 becomes available again. If R1 becomes available, clients might read and write data (even by accident) on the original primary instance. In this case, a split-brain situation can develop, where some clients access stale data in the old primary database, and other clients access fresh data in the new primary database, leading to problems in your business.

To avoid a split-brain situation, you must ensure that clients can no longer access the original primary instance after R1 becomes available. Ideally, you should make the original primary inaccessible before clients start using the new primary instance, then delete the original primary right after you make it inaccessible.
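
The ordering described above (fence the old primary first, then direct clients to the new one) can be illustrated with a small in-memory model. The `PrimaryRegistry` class is hypothetical and isn't a Cloud SQL feature; in practice the fencing might be done by stopping or deleting the old instance, or by removing network access to it.

```python
class PrimaryRegistry:
    """Tracks which instance may accept writes and which instances are fenced."""

    def __init__(self, primary: str) -> None:
        self._primary = primary
        self._fenced = set()

    def fail_over(self, new_primary: str) -> None:
        # Fence the old primary before the new primary is announced, so there
        # is no moment at which both instances are considered writable.
        self._fenced.add(self._primary)
        self._primary = new_primary

    def check_write(self, instance: str) -> None:
        """Raise PermissionError unless the instance is the current primary."""
        if instance in self._fenced:
            raise PermissionError(f"{instance} is fenced and must not accept writes")
        if instance != self._primary:
            raise PermissionError(f"{instance} is not the current primary")


if __name__ == "__main__":
    registry = PrimaryRegistry("primary-r1")
    registry.fail_over("primary-r2")
    registry.check_write("primary-r2")  # allowed
    try:
        registry.check_write("primary-r1")  # the old primary came back online
    except PermissionError as err:
        print("rejected:", err)
```

The point of the model is the order of the two assignments in `fail_over`: a client that still holds the old address is rejected rather than silently writing to stale data.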

Establishing an initial backup after failover

When you promote the cross-region read replica to be the new primary in a failover, the new primary might not contain every transaction that was committed on the original primary, because replication is asynchronous. Any transactions that hadn't been replicated at the time of the failure are unavailable in the new instance.

As a best practice, we recommend that you immediately back up the new primary instance at the start of the failover and before clients access the database. This backup represents a consistent, known state at the point of the failover. Such backups can be important for regulatory purposes or for recovering to a known state if clients encounter issues when accessing the new primary.
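
A minimal sketch of this practice, assuming that on-demand backups are taken with `gcloud sql backups create`: the following Python function builds the command line with a timestamped description so that the baseline is easy to find later. The instance name and the description format are hypothetical, and the flags should be checked against the gcloud reference.

```python
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def post_failover_backup_command(instance: str, now: Optional[datetime] = None) -> List[str]:
    """Build (but don't run) an on-demand backup command for the new primary."""
    now = now or datetime.now(timezone.utc)
    # The description format is an arbitrary choice for this example.
    description = f"post-failover baseline {now.strftime('%Y-%m-%dT%H:%M:%SZ')}"
    return [
        "gcloud", "sql", "backups", "create",
        f"--instance={instance}",
        f"--description={description}",
    ]


if __name__ == "__main__":
    # Hypothetical instance name.
    print(shlex.join(post_failover_backup_command("orders-replica-r2")))
```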

Falling back to the original primary region

As outlined earlier, this document provides the steps to fall back to the original region (R1). There are two versions of the fallback process.

  • If you created the new cross-region read replica in a tertiary region (R3), you must create a second cross-region read replica in the primary region (R1).
  • If you created the new cross-region read replica in the primary region (R1), you don't need to create an additional cross-region read replica in R1.

After the cross-region read replica in R1 exists, the Cloud SQL instance can fall back to R1. Because this fallback is manually triggered and not based on an outage, you can choose an appropriate day and time for this maintenance activity.

Thus, to achieve a complete DR that has a primary, standby, and cross-region read replica, you need two failovers. The first failover is triggered by the outage (a true failover), and the second failover re-establishes the starting deployment (a fallback).

Fallback to the original primary region (R1) consists of the following steps:

  1. Promote the newly created cross-region replica in the original primary region (R1).
  2. If the promoted instance wasn't originally created as an HA replica, then enable HA on the instance for protection from zonal failures.
  3. Reconfigure your applications to connect to the new primary instance.
  4. Create a cross-region replica for the new primary instance in the DR region (R2).
  5. (Optional) To avoid running multiple independent primary instances, clean up the primary instance in the DR region (R2).
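
The two fallback variants and the conditional HA step can be captured in a small planning helper. This is a sketch under stated assumptions: the function name, its parameters, and the step wording are invented for this example, and it produces a checklist rather than calling any Cloud SQL API.

```python
from typing import List


def fallback_steps(current_replica_region: str,
                   original_region: str,
                   dr_region: str,
                   replica_is_ha: bool) -> List[str]:
    """Return the ordered fallback checklist for the given starting topology."""
    steps = []
    if current_replica_region != original_region:
        # Variant 1: the existing replica lives in R3, so R1 needs its own replica first.
        steps.append(f"Create a cross-region read replica in {original_region}")
    steps.append(f"Promote the cross-region replica in {original_region}")
    if not replica_is_ha:
        steps.append("Enable HA on the promoted instance")
    steps.append("Reconfigure applications to connect to the new primary instance")
    steps.append(f"Create a cross-region replica in {dr_region}")
    steps.append(f"(Optional) Clean up the old primary instance in {dr_region}")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(fallback_steps("R3", "R1", "R2", replica_is_ha=False), start=1):
        print(f"{number}. {step}")
```

Calling the helper with `current_replica_region="R1"` and `replica_is_ha=True` yields the shortest path, which matches the earlier observation that placing the replica in R1 right away makes the fallback less complex.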

Advanced disaster recovery (DR)

If you are using Cloud SQL Enterprise Plus edition, then you can take advantage of advanced DR. Advanced DR simplifies recovery and fallback after a cross-regional failover. As described in Disaster recovery (DR) process, when you perform regular DR, you remove the connection between the failed region of the old primary instance and the operational region of the new primary instance. With regular DR, to restore connections to the original deployment region and regain your old primary instance, you must perform a series of manual fallback steps.

With advanced DR, when a region failure occurs, you can invoke a replica failover. With replica failover, you promote a cross-region read replica similar to performing regular DR, except that you promote the designated disaster recovery (DR) replica. The promotion of the DR replica is immediate.

Instead of removing the old primary instance, the instance remains a part of Cloud SQL's asynchronous replication topology. The old primary instance (instance A) eventually becomes a replica of its DR replica (instance B) after the DR replica has been promoted to the new primary instance.

After the old primary instance (A) has been turned into a replica, you can perform the final step of advanced DR. You can return your Cloud SQL deployment to its original state and restore the old primary instance (A) to its former role as the primary instance with zero data loss. To perform this zero data loss restoration of the old primary instance (A), you can use the switchover operation. When you perform a switchover, there is no data loss because the primary instance (B) remains in read-only mode until its designated DR replica (A) catches up with the primary instance (B). After the DR replica (A) has received all of its replication updates, then the DR replica (A) assumes the role of the primary instance while the previous primary instance (B) is automatically reconfigured as the DR replica of the current primary instance (A). The instances are returned to their original roles, thus returning the topology to its original state before DR and replica failover.

Throughout advanced DR, all instances involved in both replica failover and switchover operations retain their IP addresses.

Note: While switchover results in zero data loss, replica failover can result in data loss if the DR replica experiences replication lag when you start the replica failover operation.
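
The difference between the two operations can be shown with a toy model. The `Topology` class and its fields are hypothetical, and the lag is counted in transactions purely for illustration; the model only captures the ordering of events described above, not how Cloud SQL implements them.

```python
from dataclasses import dataclass


@dataclass
class Topology:
    primary: str
    dr_replica: str
    replica_lag_txns: int = 0      # transactions not yet applied on the DR replica
    primary_read_only: bool = False

    def switchover(self) -> int:
        """Planned role swap; returns the number of transactions lost (always 0)."""
        self.primary_read_only = True   # the primary stops accepting writes
        self.replica_lag_txns = 0       # the DR replica catches up while writes are paused
        self.primary, self.dr_replica = self.dr_replica, self.primary
        self.primary_read_only = False  # the new primary accepts writes
        return 0

    def replica_failover(self) -> int:
        """Immediate promotion; returns the number of transactions lost."""
        lost = self.replica_lag_txns    # whatever had not replicated is gone
        self.replica_lag_txns = 0
        self.primary, self.dr_replica = self.dr_replica, self.primary
        return lost


if __name__ == "__main__":
    topology = Topology(primary="A", dr_replica="B", replica_lag_txns=3)
    print("failover lost", topology.replica_failover(), "transactions")
    print("primary is now", topology.primary)
    print("switchover lost", topology.switchover(), "transactions")
    print("primary is now", topology.primary)
```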

You can also use the switchover operation of advanced DR to perform routine DR drills to test and prepare your Cloud SQL topology for cross-regional failover before a disaster occurs. If an actual disaster occurs, then you can perform the cross-regional replica failover that you've already tested.

Disaster recovery (DR) replica

As a required component of advanced DR, the DR replica has the following characteristics:

  • A DR replica is a directly connected cross-region read replica.
  • You can change the DR replica designation multiple times.
  • You can change the DR replica designation at any time, except during a switchover or replica failover operation.

In addition, to reduce RTO after using advanced DR, we recommend that you do the following:

Replica failover

To summarize, a replica failover consists of the following events:

  1. You create and assign a DR replica.
  2. The primary region becomes unavailable.
  3. You perform the replica failover to the DR replica.
  4. The write endpoint is updated and starts pointing to the new primary instance.
  5. When the original primary instance comes back online, it becomes a read replica of the new primary instance.
  6. You can use the switchover operation to restore your deployment to its original topology.
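
Events 1, 3, and 6 are the ones that you initiate. The following Python sketch builds the corresponding `gcloud` command lines without running them. The instance names are hypothetical, and the commands and flags (`patch --failover-dr-replica-name`, `promote-replica --failover`, `switchover`) are written from memory of the Cloud SQL advanced DR documentation; confirm them for your database engine and gcloud version before relying on them.

```python
import shlex
from typing import List


def designate_dr_replica(primary: str, dr_replica: str) -> List[str]:
    """Event 1: assign an existing cross-region replica as the DR replica."""
    return ["gcloud", "sql", "instances", "patch", primary,
            f"--failover-dr-replica-name={dr_replica}"]


def replica_failover(dr_replica: str) -> List[str]:
    """Event 3: promote the DR replica during a regional outage."""
    return ["gcloud", "sql", "instances", "promote-replica", dr_replica, "--failover"]


def switchover(dr_replica: str) -> List[str]:
    """Event 6: swap roles with the current DR replica once both are healthy."""
    return ["gcloud", "sql", "instances", "switchover", dr_replica]


if __name__ == "__main__":
    # Hypothetical names: instance A in region R1 and instance B in region R2.
    print(shlex.join(designate_dr_replica("instance-a", "instance-b")))
    print(shlex.join(replica_failover("instance-b")))
    # After the failover, instance A is the DR replica, so the fallback targets it.
    print(shlex.join(switchover("instance-a")))
```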

The following sections describe each stage of a replica failover operation in detail.

Assign DR replica

Before performing a replica failover, you've assigned a DR replica to the primary instance and possibly have tested the process by performing a switchover.

Cloud SQL instance architecture in its original configuration of two different regions where both regions are healthy.
Figure 1: All regions are healthy

Outage occurs

The primary region, which is running the primary database, becomes unavailable.

Cloud SQL instance architecture in a configuration where one region is experiencing an outage.
Figure 2: Region R1 experiences an outage

Replica failover

After determining that disaster recovery is required, you perform a replica failover to your cross-region designated DR replica.

The cross-region designated DR replica becomes the primary instance immediately and starts accepting incoming reads and writes. The write endpoint is updated and starts pointing to the new primary instance.

Cloud SQL instance architecture where the DNS write endpoint is updated to point to the new primary instance in the healthy region.
Figure 3: Perform replica failover to end the outage
Note: If you're not using a DNS write endpoint, then you must configure applications to point to the new primary instance.

Original primary becomes replica

After the replica has been promoted, Cloud SQL periodically checks if the original primary instance is back online. If the original primary instance is online, then Cloud SQL recreates the old primary as a replica of the promoted instance. The old primary instance retains its IP address.

Cloud SQL instance architecture where the original primary instance becomes a replica of the DR replica.
Figure 4: Original primary instance becomes DR replica

Failback to original

After you have performed a replica failover, you can restore the primary instance in your original region by performing the switchover operation, reversing the same DR replica and primary instance pair.

Cloud SQL instance architecture where the original primary instance becomes a replica of the DR replica.
Figure 5: Failback using switchover to the original deployment

Switchover

To summarize, a switchover operation consists of the following events:

  1. You create and assign a DR replica.
  2. You initiate a switchover.
  3. When replication lag goes down to zero, the new primary instance starts accepting incoming connections.
  4. The old primary instance becomes a read replica.
  5. If a DNS write endpoint is being used, then the DNS write endpoint is updated to point to the new primary instance.

The following sections describe each stage of a switchover operation in detail.

Assign DR replica

Before starting the switchover operation, you must assign a DR replica to the primary instance.

Verify that the primary instance is healthy. You can only perform a switchover when both the primary instance and the DR replica are online.

Cloud SQL instance architecture in its original configuration of two different regions where both regions are healthy.
Figure 1: Original deployment

Initiate switchover

You initiate the switchover. When you initiate a switchover, the primary instance stops accepting writes and becomes read-only. Cloud SQL waits for the transaction logs to be copied to Cloud Storage. The designated DR replica catches up to the primary instance.

When the replication lag goes down to zero, the DR replica is promoted as the new primary instance. The new primary instance starts accepting incoming connections, including application reads and writes.

Cloud SQL instance architecture where a switchover is performed.
Figure 2: Initiate switchover and promote DR replica to primary instance when replication lag = 0

Endpoint updated

After the DR replica is promoted to the new primary instance, the DNS write endpoint is updated and starts pointing to the new primary instance. If you're not using a DNS write endpoint, then you must configure your applications to point to the IP address of the new primary instance.

The old primary instance is reconfigured as a read replica.

PITR is enabled automatically for the new primary instance. PITR is only possible after the first automated backup.

Cloud SQL instance architecture where a switchover is performed and write endpoint is updated.
Figure 3: Switchover completion

Write endpoint

A write endpoint is a global domain name service (DNS) name that resolves to the IP address of the current primary instance automatically. This endpoint redirects incoming connections to the new primary instance automatically in case of a replica failover or switchover operation. You can use the write endpoint in a SQL connection string instead of an IP address. By using a write endpoint, you can avoid having to make application connection changes when a region outage occurs.
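
The benefit of connecting by name rather than by address can be illustrated with a small in-memory model. The `WriteEndpoint` class, the DNS name, and the IP addresses below are hypothetical; the model stands in for DNS resolution and doesn't contact any real service.

```python
class WriteEndpoint:
    """Toy stand-in for a DNS name that always resolves to the current primary."""

    def __init__(self, name: str, primary_ip: str) -> None:
        self.name = name
        self._ip = primary_ip

    def resolve(self) -> str:
        """Return the IP address of the current primary instance."""
        return self._ip

    def repoint(self, new_primary_ip: str) -> None:
        """Model the update that follows a replica failover or switchover."""
        self._ip = new_primary_ip


if __name__ == "__main__":
    endpoint = WriteEndpoint("orders-db.example.internal", "10.0.0.5")
    pinned_ip = endpoint.resolve()  # a client that stores the IP address once

    endpoint.repoint("10.1.0.7")    # the DR replica becomes the primary

    print("client using the name connects to", endpoint.resolve())
    print("client using the stored IP still targets", pinned_ip)
```

A client that resolves the name on each new connection follows the primary automatically, whereas a client with a stored IP address needs a configuration change.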

A write endpoint requires that the Cloud DNS API is enabled on the project where you create or have your existing Cloud SQL Enterprise Plus edition primary instance. When you create a Cloud SQL Enterprise Plus edition instance with a private IP address and authorized networks, then Cloud SQL generates a write endpoint for the instance automatically. If you already have a Cloud SQL Enterprise Plus edition primary instance, then Cloud SQL generates the write endpoint when you create the DR replica (a cross-region replica that you designate for the primary instance). If the primary instance changes due to a switchover or replica failover operation, then Cloud SQL assigns the write endpoint to the DR replica when the DR replica becomes the new primary instance.

For more information about using a write endpoint to connect to an instance, see Connect to an instance using a write endpoint.

Note: If you use the Cloud SQL Auth Proxy, then you can't replace the IP address with the write endpoint. You must use the IP address to connect to the instance.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-07-14 UTC.