Disaster recovery scenarios for data
This document is the third part of a series that discusses disaster recovery (DR) in Google Cloud. This part discusses scenarios for backing up and recovering data.
The series consists of these parts:
- Disaster recovery planning guide
- Disaster recovery building blocks
- Disaster recovery scenarios for data (this document)
- Disaster recovery scenarios for applications
- Architecting disaster recovery for locality-restricted workloads
- Disaster recovery use cases: locality-restricted data analytic applications
- Architecting disaster recovery for cloud infrastructure outages
Introduction
Your disaster recovery plans must specify how you can avoid losing data during a disaster. The term data here covers two scenarios. Backing up and then recovering database, log data, and other data types fits into one of the following scenarios:
- Data backups. Backing up data alone involves copying a discrete amount of data from one place to another. Backups are made as part of a recovery plan either to recover from a corruption of data so that you can restore to a known good state directly in the production environment, or so that you can restore data in your DR environment if your production environment is down. Typically, data backups have a small to medium RTO and a small RPO.
- Database backups. Database backups are slightly more complex, because they typically involve recovering to a point in time. Therefore, in addition to considering how to back up and restore the database backups and ensuring that the recovery database system mirrors the production configuration (same version, mirrored disk configuration), you also need to consider how to back up transaction logs. During recovery, after you restore database functionality, you have to apply the latest database backup and then the recovered transaction logs that were backed up after the last backup. Due to the complicating factors inherent to database systems (for example, having to match versions between production and recovery systems), adopting a high-availability-first approach to minimize the time to recover from a situation that could cause unavailability of the database server lets you achieve smaller RTO and RPO values.
When you run production workloads on Google Cloud, you might use a globally distributed system so that if something goes wrong in one region, the application continues to provide service, even if it's less widely available. In essence, that application invokes its DR plan.
The rest of this document discusses examples of how to design some scenarios for data and databases that can help you meet your RTO and RPO goals.
Production environment is on-premises
In this scenario, your production environment is on-premises, and your disaster recovery plan involves using Google Cloud as the recovery site.
Data backup and recovery
You can use a number of strategies to implement a process to regularly back up data from on-premises to Google Cloud. This section looks at three of the most common solutions.
Solution 1: Back up to Cloud Storage using a scheduled task
This pattern uses the following DR building blocks:
- Cloud Storage
One option for backing up data is to create a scheduled task that runs a script or application to transfer the data to Cloud Storage. You can automate a backup process to Cloud Storage using the gcloud storage Google Cloud CLI command or by using one of the Cloud Storage client libraries. For example, the following gcloud storage command copies all files from a source directory to a specified bucket.
gcloud storage cp -r SOURCE_DIRECTORY gs://BUCKET_NAME
Replace SOURCE_DIRECTORY with the path to your source directory and BUCKET_NAME with a name of your choice for the bucket. The name must meet the bucket name requirements.
The following steps outline how to implement a backup and recovery process using the gcloud storage command.
- Install the gcloud CLI on the on-premises machine that you use to upload your data files.
- Create a bucket as the target for your data backup.
- Create a service account.
- Create an IAM policy to restrict who can access the bucket and its objects. Include the service account created specifically for this purpose. For details about permissions for access to Cloud Storage, see IAM permissions for gcloud storage.
- Use service account impersonation to give your local Google Cloud user (or service account) permission to impersonate the service account that you just created. Alternatively, you can create a new user specifically for this purpose.
- Test that you can upload and download files in the target bucket.
- Set up a schedule for the script that you use to upload your backups, using tools such as Linux crontab and Windows Task Scheduler (see the example crontab entry after this list).
- Configure a recovery process that uses the gcloud storage command to recover your data to your DR environment on Google Cloud.
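For example, a minimal scheduling sketch on a Linux machine might look like the following crontab entry. The gcloud path and log file location are placeholder assumptions; adjust them to your installation, and replace SOURCE_DIRECTORY and BUCKET_NAME as described earlier.
# Run the backup every day at 02:00 and append the output to a log file.
0 2 * * * /usr/bin/gcloud storage cp -r SOURCE_DIRECTORY gs://BUCKET_NAME >> /var/log/dr-backup.log 2>&1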
You can also use the gcloud storage rsync command to perform real-time incremental syncs between your data and a Cloud Storage bucket.
For example, the following gcloud storage rsync command makes the contents in a Cloud Storage bucket the same as the contents in the source directory by copying any missing files or objects or those whose data has changed. If the volume of data that has changed between successive backup sessions is small relative to the entire volume of the source data, then using gcloud storage rsync can be more efficient than using the gcloud storage cp command. By using gcloud storage rsync, you can implement a more frequent backup schedule and achieve a lower RPO.
gcloud storage rsync -r SOURCE_DIRECTORY gs://BUCKET_NAME
For more information, see gcloud storage command for smaller transfers of on-premises data.
Solution 2: Back up to Cloud Storage using Transfer service for on-premises data
This pattern uses the following DR building blocks:
- Cloud Storage
- Transfer service for on-premises data
Transferring large amounts of data across a network often requires careful planning and robust execution strategies. It is a non-trivial task to develop custom scripts that are scalable, reliable, and maintainable. Custom scripts can make it harder to meet your RPO values and can even increase the risk of data loss.
For guidance on moving large volumes of data from on-premises locations to Cloud Storage, see Move or back up data from on-premises storage.
Solution 3: Back up to Cloud Storage using a partner gateway solution
This pattern uses the following DR building blocks:
- Cloud Interconnect
- Cloud Storage tiered storage
On-premises applications are often integrated with third-party solutions that can be used as part of your data backup and recovery strategy. The solutions often use a tiered storage pattern where you have the most recent backups on faster storage, and slowly migrate your older backups to cheaper (slower) storage. When you use Google Cloud as the target, you have several storage class options available to use as the equivalent of the slower tier.
One way to implement this pattern is to use a partner gateway between your on-premises storage and Google Cloud to facilitate this transfer of data to Cloud Storage. The following diagram illustrates this arrangement, with a partner solution that manages the transfer from the on-premises NAS appliance or SAN.
In the event of a failure, the data being backed up must be recovered to your DR environment. The DR environment is used to serve production traffic until you are able to revert to your production environment. How you achieve this depends on your application, and on the partner solution and its architecture. (Some end-to-end scenarios are discussed in the DR application document.)
You can also use managed Google Cloud databases as your DR destinations. For example, Cloud SQL for SQL Server supports transaction log imports. You can export transaction logs from your on-premises SQL Server instance, upload them to Cloud Storage, and import them into Cloud SQL for SQL Server.
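A minimal sketch of that flow with the gcloud CLI might look like the following. The instance, database, bucket, and file names are placeholders, and the --no-recovery flag is used here on the assumption that you want to keep the database in a restoring state so that later logs can still be applied.
# Upload a transaction log backup taken from the on-premises SQL Server instance.
gcloud storage cp TRANSACTION_LOG_FILE gs://BUCKET_NAME/
# Import the transaction log backup into Cloud SQL for SQL Server.
gcloud sql import bak INSTANCE_NAME gs://BUCKET_NAME/TRANSACTION_LOG_FILE \
    --database=DATABASE_NAME \
    --bak-type=TLOG \
    --no-recovery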
For further guidance on ways to transfer data from on-premises to Google Cloud, see Transferring big data sets to Google Cloud.
For more information about partner solutions, see Find the right Google Cloud partner.
Database backup and recovery
You can use a number of strategies to implement a process to recover a database system from on-premises to Google Cloud. This section looks at two of the most common solutions.
A detailed discussion of the various built-in backup and recovery mechanisms included with third-party databases is out of scope for this document. This section provides general guidance, which is implemented in the solutions discussed here.
Solution 1: Backup and recovery using a recovery server on Google Cloud
- Create a database backup using the built-in backup mechanisms of your database management system.
- Connect your on-premises network and your Google Cloud network.
- Create a Cloud Storage bucket as the target for your data backup.
- Copy the backup files to Cloud Storage using the gcloud storage command in the gcloud CLI or a partner gateway solution (see the steps discussed earlier in the data backup and recovery section). For details, see Migrate to Google Cloud: Transfer your large datasets.
- Copy the transaction logs to your recovery site on Google Cloud. Having a backup of the transaction logs helps keep your RPO values small.
After configuring this backup topology, you must ensure that you can recover to the system that's on Google Cloud. This step typically involves not only restoring the backup file to the target database but also replaying the transaction logs so that you recover as much data as possible and keep your RPO small. A typical recovery sequence looks like this:
- Create a custom image of your database server on Google Cloud. The database server should have the same configuration on the image as your on-premises database server.
- Implement a process to copy your on-premises backup files and transaction log files to Cloud Storage. See solution 1 for an example implementation.
- Start a minimally sized instance from the custom image and attach any persistent disks that are needed.
- Set the auto delete flag to false for the persistent disks. (A gcloud sketch of these two steps follows this list.)
- Apply the latest backup file that was previously copied to Cloud Storage, following the instructions from your database system for recovering backup files.
- Apply the latest set of transaction log files that have been copied to Cloud Storage.
- Replace the minimal instance with a larger instance that is capable of accepting production traffic.
- Switch clients to point at the recovered database in Google Cloud.
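The following is a minimal gcloud sketch of starting the recovery instance and disabling disk auto-delete. The instance, image, disk, project, and zone names are placeholder assumptions.
# Start a minimally sized instance from the custom database server image.
gcloud compute instances create INSTANCE_NAME \
    --zone=ZONE \
    --machine-type=e2-small \
    --image=CUSTOM_IMAGE_NAME \
    --image-project=PROJECT_ID
# Attach the persistent disk that holds the restored data files.
gcloud compute instances attach-disk INSTANCE_NAME \
    --disk=DISK_NAME \
    --zone=ZONE
# Set the auto delete flag to false so the data disk survives instance deletion.
gcloud compute instances set-disk-auto-delete INSTANCE_NAME \
    --disk=DISK_NAME \
    --no-auto-delete \
    --zone=ZONE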
When you have your production environment running and able to support production workloads, you have to reverse the steps that you followed to fail over to the Google Cloud recovery environment. A typical sequence to return to the production environment looks like this:
- Take a backup of the database running on Google Cloud.
- Copy the backup file to your production environment.
- Apply the backup file to your production database system.
- Prevent clients from connecting to the database system in Google Cloud; for example, by stopping the database system service. From this point, your application will be unavailable until you finish restoring the production environment.
- Copy any transaction log files over to the production environment and apply them.
- Redirect client connections to the production environment.
Solution 2: Replication to a standby server on Google Cloud
One way to achieve very small RTO and RPO values is to replicate (not just back up) data, and in some cases database state, in real time to a replica of your database server.
- Connect your on-premises network and your Google Cloud network.
- Create a custom image of your database server on Google Cloud. The database server should have the same configuration on the image as the configuration of your on-premises database server.
- Start an instance from the custom image and attach any persistent disks that are needed.
- Set the auto delete flag to false for the persistent disks.
- Configure replication between your on-premises database server and the target database server in Google Cloud, following the instructions specific to the database software.
- In normal operation, configure clients to point to the on-premises database server.
After configuring this replication topology, if your production environment becomes unavailable, you switch clients to point to the standby server running in your Google Cloud network.
When you have your production environment back up and able to support production workloads, you have to resynchronize the production database server with the Google Cloud database server and then switch clients to point back to the production environment.
Production environment is Google Cloud
In this scenario, both your production environment and your disaster recoveryenvironment run on Google Cloud.
Data backup and recovery
A common pattern for data backups is to use a tiered storage pattern. When your production workload is on Google Cloud, the tiered storage system looks like the following diagram. You migrate data to a tier that has lower storage costs, because you are less likely to need to access the older backed-up data.
This pattern uses the following DR building blocks:
- Cloud Storage
Because the Nearline, Coldline, and Archive storage classes are intended for storing infrequently accessed data, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for.
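One way to automate this tiering is to attach an Object Lifecycle Management configuration to the backup bucket. The following is a minimal sketch; the bucket name and age thresholds are placeholder assumptions that you would adjust to your own retention requirements.
# Move backups to colder (cheaper) storage classes as they age.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
     "condition": {"age": 365}}
  ]
}
EOF
# Apply the lifecycle configuration to the backup bucket.
gcloud storage buckets update gs://BUCKET_NAME --lifecycle-file=lifecycle.json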
Database backup and recovery
When you use a self-managed database (for example, you've installed MySQL, PostgreSQL, or SQL Server on an instance of Compute Engine), the same operational concerns apply as when you manage production databases on premises, but you no longer need to manage the underlying infrastructure.
Backup and DR Service is a centralized, cloud-native solution for backing up and recovering cloud and hybrid workloads. It offers swift data recovery and facilitates the quick resumption of essential business operations.
For more information about using Backup and DR for self-managed database scenarios on Google Cloud, see the following:
Note: The scenario for using a managed Google Cloud database is discussed in the next section.
Alternatively, you can set up HA configurations by using the appropriate DR building block features to keep RTO small. You can design your database configuration to make it feasible to recover to a state as close as possible to the pre-disaster state; this helps keep your RPO values small. Google Cloud provides a wide variety of options for this scenario.
Two common approaches to designing your database recovery architecture forself-managed databases on Google Cloud are discussed in this section.
Recovering a database server without synchronizing state
A common pattern is to enable recovery of a database server that does not require system state to be synchronized with an up-to-date standby replica.
This pattern uses the following DR building blocks:
- Compute Engine
- Managed instance groups
- Cloud Load Balancing (internal load balancing)
The following diagram illustrates an example architecture that addresses the scenario. By implementing this architecture, you have a DR plan that reacts automatically to a failure without requiring manual recovery.
The following steps outline how to configure this scenario:
- Create a VPC network.
- Create a custom image that is configured with the database server by doing the following:
  - Configure the server so the database files and log files are written to an attached standard persistent disk.
  - Create a snapshot from the attached persistent disk.
  - Configure a startup script to create a persistent disk from the snapshot and to mount the disk.
  - Create a custom image of the boot disk.
- Create an instance template that uses the image.
- Using the instance template, configure a managed instance group with a target size of 1.
- Configure health checking using Cloud Monitoring metrics.
- Configure internal load balancing using the managed instance group.
- Configure a scheduled task to create regular snapshots of the persistent disk. (A gcloud sketch of the template, instance group, and snapshot schedule follows this list.)
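The following is a minimal gcloud sketch of the instance template, managed instance group, and snapshot schedule steps. All names, the machine type, region, zone, and schedule values are placeholder assumptions; health checking and internal load balancing are omitted for brevity.
# Create an instance template from the custom database server image.
gcloud compute instance-templates create TEMPLATE_NAME \
    --machine-type=MACHINE_TYPE \
    --image=CUSTOM_IMAGE_NAME \
    --image-project=PROJECT_ID
# Create a managed instance group with a target size of 1.
gcloud compute instance-groups managed create MIG_NAME \
    --zone=ZONE \
    --template=TEMPLATE_NAME \
    --size=1
# Create a daily snapshot schedule for the data disk.
gcloud compute resource-policies create snapshot-schedule SCHEDULE_NAME \
    --region=REGION \
    --daily-schedule \
    --start-time=04:00 \
    --max-retention-days=14
# Attach the snapshot schedule to the persistent disk that holds the data.
gcloud compute disks add-resource-policies DISK_NAME \
    --resource-policies=SCHEDULE_NAME \
    --zone=ZONE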
In the event a replacement database instance is needed, this configuration automatically does the following:
- Brings up another database server of the correct version in the same zone.
- Attaches a persistent disk that has the latest backup and transaction log files to the newly created database server instance.
- Minimizes the need to reconfigure clients that communicate with your database server in response to an event.
- Ensures that the Google Cloud security controls (IAM policies, firewall settings) that apply to the production database server apply to the recovered database server.
Because the replacement instance is created from an instance template, the controls that applied to the original apply to the replacement instance.
This scenario takes advantage of some of the HA features available in Google Cloud; you don't have to initiate any failover steps, because they occur automatically in the event of a disaster. The internal load balancer ensures that even when a replacement instance is needed, the same IP address is used for the database server. The instance template and custom image ensure that the replacement instance is configured identically to the instance it is replacing. By taking regular snapshots of the persistent disks, you ensure that when disks are re-created from the snapshots and attached to the replacement instance, the replacement instance is using data recovered according to an RPO value dictated by the frequency of the snapshots. In this architecture, the latest transaction log files that were written to the persistent disk are also automatically restored.
The managed instance group provides HA in depth. It provides mechanisms to react to failures at the application or instance level, and you don't have to manually intervene if any of those scenarios occur. Setting a target size of one ensures you only ever have one active instance that runs in the managed instance group and serves traffic.
Standard persistent disks are zonal, so if there's a zonal failure, snapshots are required to re-create disks. Snapshots are also available across regions, allowing you to restore a disk not only within the same region but also to a different region.
A variation on this configuration is to use regional persistent disks in place of standard persistent disks. In this case, you don't need to restore the snapshot as part of the recovery step.
The variation you choose is dictated by your budget and by your RTO and RPO values.
Recovering from partial corruption in very large databases
Persistent Disk Asynchronous Replication offers block storage replication with low RPO and low RTO for cross-region active-passive DR. This storage option lets you manage replication for Compute Engine workloads at the infrastructure level, rather than at the workload level.
If you're using a database that's capable of storing petabytes of data, you might experience an outage that affects some of the data, but not all of it. In that case, you want to minimize the amount of data that you need to restore; you don't need to (or want to) recover the entire database just to restore some of the data.
There are a number of mitigating strategies you can adopt:
- Store your data in different tables for specific time periods. This method ensures that you need to restore only a subset of data to a new table, rather than a whole dataset.
- Store the original data on Cloud Storage. This approach lets you create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.
Note: This method provides good availability, with only a small interruption as you point your applications to the new store. However, unless you have implemented application-level controls to prevent access to the corrupted data, this method can result in inaccurate results during later analysis.
Additionally, if your RTO permits, you can prevent access to the table that has the corrupted data by leaving your applications offline until the uncorrupted data has been restored to a new table.
Managed database services on Google Cloud
This section discusses some methods you can use to implement appropriate backup and recovery mechanisms for the managed database services on Google Cloud.
Managed databases are designed for scale, so the backup and restore mechanisms that you see with traditional RDBMSs are usually not available. As in the case of self-managed databases, if you are using a database that is capable of storing petabytes of data, you want to minimize the amount of data that you need to restore in a DR scenario. There are a number of strategies for each managed database to help you achieve this goal.
Bigtable provides Bigtable replication. A replicated Bigtable database can provide higher availability than a single cluster, additional read throughput, and higher durability and resilience in the face of zonal or regional failures.
Bigtable backups is a fully managed service that lets you save a copy of a table's schema and data, then restore from the backup to a new table at a later time.
You can also export tables from Bigtable as a series of Hadoop sequence files. You can then store these files in Cloud Storage or use them to import the data back into another instance of Bigtable. You can replicate your Bigtable dataset asynchronously across zones within a Google Cloud region.
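For example, a minimal sketch of creating a managed Bigtable backup with the gcloud CLI might look like the following. The instance, cluster, table, and backup IDs, and the retention period, are placeholder assumptions.
# Create a backup of a table, retained for 30 days.
gcloud bigtable backups create BACKUP_ID \
    --instance=INSTANCE_ID \
    --cluster=CLUSTER_ID \
    --table=TABLE_ID \
    --retention-period=30d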
BigQuery. If you want to archive data, you can take advantage of BigQuery's long term storage. If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. There is no degradation of performance, durability, availability, or any other functionality when a table is considered long term storage. If the table is edited, though, it reverts back to the regular storage pricing and the 90 day countdown starts again.
BigQuery is replicated to two zones in a single region, but this won't help with corruption in your tables. Therefore, you need to have a plan to be able to recover from that scenario. For example, you can do the following:
- If the corruption is caught within 7 days, query the table to a point in time in the past to recover the table prior to the corruption using snapshot decorators. (See the example after this list.)
- Export the data from BigQuery, and create a new table that contains the exported data but excludes the corrupted data.
- Store your data in different tables for specific time periods. This method ensures that you will need to restore only a subset of data to a new table, rather than a whole dataset.
- Make copies of your dataset at specific time periods. You can use these copies if a data-corruption event occurred beyond what a point-in-time query can capture (for example, more than 7 days ago). You can also copy a dataset from one region to another to ensure data availability in the event of region failures.
- Store the original data on Cloud Storage, which lets you create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.
Firestore. The managed export and import service lets you import and export Firestore entities using a Cloud Storage bucket. You can then implement a process that can be used to recover from accidental deletion of data.
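A minimal sketch of that process with the gcloud CLI might look like the following. The bucket name and export prefix are placeholders.
# Export all Firestore entities to Cloud Storage.
gcloud firestore export gs://BUCKET_NAME/EXPORT_PREFIX
# Later, import the export to recover from accidental deletion of data.
gcloud firestore import gs://BUCKET_NAME/EXPORT_PREFIX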
Cloud SQL. If you use Cloud SQL, the fully managed Google Cloud MySQL database, you should enable automated backups and binary logging for your Cloud SQL instances. This approach lets you perform a point-in-time recovery, which restores your database from a backup and recovers it to a fresh Cloud SQL instance. For more information, see About Cloud SQL backups and About disaster recovery (DR) in Cloud SQL.
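For example, a minimal gcloud sketch of this approach follows. The instance names and the timestamp are placeholders.
# Enable automated backups and binary logging on the primary instance.
gcloud sql instances patch INSTANCE_NAME \
    --backup-start-time=23:00 \
    --enable-bin-log
# Recover to a specific point in time by cloning into a fresh instance.
# TIMESTAMP is in RFC 3339 format, for example 2024-08-01T10:00:00Z.
gcloud sql instances clone INSTANCE_NAME NEW_INSTANCE_NAME \
    --point-in-time='TIMESTAMP'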
You can also configure Cloud SQL in an HA configuration and cross-region replicas to maximize uptime in the event of zonal or regional failure.
If you enabled near-zero downtime planned maintenance for Cloud SQL, you can evaluate the impact of maintenance events on your instances by simulating near-zero downtime planned maintenance events on Cloud SQL for MySQL and on Cloud SQL for PostgreSQL.
For Cloud SQL Enterprise Plus edition, you can use advanced disaster recovery (DR) to simplify recovery and fallback processes with zero data loss after you perform a cross-regional failover.
Spanner. You can use Dataflow templates for making a full export of your database to a set of Avro files in a Cloud Storage bucket, and use another template for re-importing the exported files into a new Spanner database.
For more controlled backups, the Dataflow connector lets you write code to read and write data to Spanner in a Dataflow pipeline. For example, you can use the connector to copy data out of Spanner and into Cloud Storage as the backup target. The speed at which data can be read from Spanner (or written back to it) depends on the number of configured nodes. This has a direct impact on your RTO values.
The Spanner commit timestamp feature can be useful for incremental backups, by allowing you to select only the rows that have been added or modified since the last full backup.
For managed backups, Spanner Backup and Restore lets you create consistent backups that can be retained for up to 1 year. The RTO value is lower compared to export because the restore operation directly mounts the backup without copying the data.
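A minimal gcloud sketch of the backup and restore flow might look like the following. The instance, database, and backup names, and the retention period, are placeholder assumptions.
# Create a managed backup with a 30-day retention period.
gcloud spanner backups create BACKUP_ID \
    --instance=INSTANCE_ID \
    --database=DATABASE_ID \
    --retention-period=30d
# Restore the backup into a new database in the same instance.
gcloud spanner databases restore \
    --source-instance=INSTANCE_ID \
    --source-backup=BACKUP_ID \
    --destination-instance=INSTANCE_ID \
    --destination-database=NEW_DATABASE_ID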
For small RTO values, you could set up a warm standby Spanner instance configured with the minimum number of nodes required to meet your storage and read and write throughput requirements.
Spanner point-in-time recovery (PITR) lets you recover data from a specific point in time in the past. For example, if an operator inadvertently writes data or an application rollout corrupts the database, with PITR you can recover the data from a point in time in the past, up to a maximum of 7 days.
Cloud Composer. You can use Cloud Composer (a managed version of Apache Airflow) to schedule regular backups of multiple Google Cloud databases. You can create a directed acyclic graph (DAG) to run on a schedule (for example, daily) to either copy the data to another project, dataset, or table (depending on the solution used), or to export the data to Cloud Storage.
Exporting or copying data can be done using the various Cloud Platform operators.
For example, you can create a DAG to do any of the following:
- Export a BigQuery table to Cloud Storage using the BigQueryToCloudStorageOperator.
- Export Firestore in Datastore mode (Datastore) to Cloud Storage using the DatastoreExportOperator.
- Export MySQL tables to Cloud Storage using the MySqlToGoogleCloudStorageOperator.
- Export Postgres tables to Cloud Storage using the PostgresToGoogleCloudStorageOperator.
Production environment is another cloud
In this scenario, your production environment uses another cloud provider, and your disaster recovery plan involves using Google Cloud as the recovery site.
Data backup and recovery
Transferring data between object stores is a common use case for DR scenarios. Storage Transfer Service is compatible with Amazon S3 and is the recommended way to transfer objects from Amazon S3 to Cloud Storage.
You can configure a transfer job to schedule periodic synchronization from a data source to a data sink, with advanced filters based on file creation dates, filename filters, and the times of day you prefer to transfer data. (A sample transfer job command follows the list of factors.) To achieve the RPO that you want, you must consider the following factors:
- Rate of change. The amount of data that's being generated or updated for a given amount of time. The higher the rate of change, the more resources are needed to transfer the changes to the destination at each incremental transfer period.
- Transfer performance. The time it takes to transfer files. For large file transfers, this is typically determined by the available bandwidth between source and destination. However, if a transfer job consists of a large number of small files, QPS can become a limiting factor. If that's the case, you can schedule multiple concurrent jobs to scale the performance as long as sufficient bandwidth is available. We recommend that you measure the transfer performance using a representative subset of your real data.
- Frequency. The interval between backup jobs. The freshness of data at the destination is as recent as the last time a transfer job was scheduled. Therefore, it's important that the intervals between successive transfer jobs are not longer than your RPO objective. For example, if the RPO objective is 1 day, the transfer job must be scheduled at least once a day.
- Monitoring and alerts. Storage Transfer Service provides Pub/Sub notifications on a variety of events. We recommend that you subscribe to these notifications to handle unexpected failures or changes in job completion times.
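For example, a minimal sketch of a scheduled daily transfer job with the gcloud CLI might look like the following. The bucket names and the AWS credentials file are placeholder assumptions.
# Create a transfer job that syncs an Amazon S3 bucket to Cloud Storage once a day.
gcloud transfer jobs create s3://S3_BUCKET_NAME gs://BUCKET_NAME \
    --source-creds-file=AWS_CREDENTIALS_FILE \
    --schedule-repeats-every=1d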
Database backup and recovery
A detailed discussion of the various built-in backup and recovery mechanisms included with third-party databases, or of the backup and recovery techniques used on other cloud providers, is out of scope for this document. If you are operating non-managed databases on the compute services, you can take advantage of the HA facilities that your production cloud provider has available. You can extend those to incorporate an HA deployment to Google Cloud, or use Cloud Storage as the ultimate destination for the cold storage of your database backup files.
What's next?
- Read about Google Cloud geography and regions.
Read other documents in this DR series:
- Disaster recovery planning guide
- Disaster recovery building blocks
- Disaster recovery scenarios for applications
- Architecting disaster recovery for locality-restricted workloads
- Disaster recovery use cases: locality-restricted data analytic applications
- Architecting disaster recovery for cloud infrastructure outages
- Architectures for high availability of MySQL clusters on Compute Engine
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.