Disaster recovery scenarios for applications

Last reviewed 2024-08-05 UTC

This document is part of a series that discusses disaster recovery (DR) in Google Cloud. This part explores common disaster recovery scenarios for applications.

Introduction

This document frames DR scenarios for applications in terms of DR patterns that indicate how readily the application can recover from a disaster event. It uses the concepts discussed in the DR building blocks document to describe how you can implement an end-to-end DR plan appropriate for your recovery goals.

To begin, consider some typical workloads that illustrate how thinking about your recovery goals and architecture directly influences your DR plan.

Batch processing workloads

Batch processing workloads tend not to be mission critical, so you typically don't need to incur the cost of designing a high availability (HA) architecture to maximize uptime; in general, batch processing workloads can tolerate interruptions. This type of workload can take advantage of cost-effective products such as Spot VMs and preemptible VM instances, which you can create and run at a much lower price than normal instances. However, Compute Engine might preemptively stop or delete these instances if it requires access to those resources for other tasks.

By implementing regular checkpoints as part of the processing task, the processing job can resume from the point of failure when new VMs are launched. If you're using Dataproc, the process of launching preemptible worker nodes is managed by a managed instance group. This can be considered a warm pattern, where there's a short pause while replacement VMs are launched to continue processing.
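As a sketch of this setup, the following snippet creates a Dataproc cluster with preemptible secondary workers. The cluster and region names are hypothetical, and the command is wrapped in a function so that it can be reused from an automation script.

```shell
set -eu

# Hypothetical cluster and region names; substitute your own.
CLUSTER="batch-dr-cluster"
REGION="us-central1"

create_cluster_with_preemptible_workers() {
  # Two primary workers plus two preemptible secondary workers that
  # Compute Engine may reclaim; Dataproc's managed instance group
  # replaces reclaimed nodes automatically.
  gcloud dataproc clusters create "$CLUSTER" \
    --region="$REGION" \
    --num-workers=2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible
}
```

Combined with job-level checkpointing, the reclaimed nodes become a short pause rather than a failure.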

Ecommerce sites

In ecommerce sites, some parts of the application can have larger RTO values. For example, the actual purchasing pipeline needs to have high availability, but the email process that sends order notifications to customers can tolerate a few hours' delay. Customers know about their purchase, so although they expect a confirmation email, the notification is not a crucial part of the process. This is a mix of hot (purchasing) and warm or cold (notification) patterns.

The transactional part of the application needs high uptime with a minimal RTO value. Therefore, you use HA, which maximizes the availability of this part of the application. This approach can be considered a hot pattern.

The ecommerce scenario illustrates how you can have varying RTO values within the same application.

Video streaming

A video streaming solution has many components that need to be highly available, from the search experience to the actual process of streaming content to the user. In addition, the system requires low latency to create a satisfactory user experience. If any aspect of the solution fails to provide a great experience, it's bad for the supplier as well as the customer. Moreover, customers today can easily turn to a competing product.

In this scenario, an HA architecture is a must-have, and small RTO values are needed. This scenario requires a hot pattern throughout the application architecture to help ensure minimal impact in case of a disaster.

DR and HA architectures for production on-premises

This section examines how to implement three patterns (cold, warm, and hot) when your application runs on-premises and your DR solution is on Google Cloud.

Cold pattern: Recovery to Google Cloud

In a cold pattern, you have minimal resources in the DR Google Cloud project: just enough to enable a recovery scenario. When a problem prevents the production environment from running production workloads, the failover strategy requires a mirror of the production environment to be started in Google Cloud. Clients then start using the services from the DR environment.

In this section we examine an example of this pattern. In the example, Cloud Interconnect is configured with a self-managed (non-Google Cloud) VPN solution to provide connectivity to Google Cloud. Data is copied to Cloud Storage as part of the production environment.

This pattern uses the following DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Self-managed VPN solution
  • Cloud Storage
  • Compute Engine
  • Cloud Load Balancing
  • Deployment Manager

The following diagram illustrates this example architecture:

Architecture for cold pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the Google Cloud network.
  3. Create a Cloud Storage bucket as the target for your data backup.
  4. Create a service account.
  5. Create an IAM policy to restrict who can access the bucket and its objects. Include the service account created specifically for this purpose, and add the user account or group for your operator or system administrator, granting all of these identities the relevant permissions. For details about permissions for access to Cloud Storage, see IAM permissions for Cloud Storage.
  6. Use service account impersonation to provide access for your local Google Cloud user (or service account) to impersonate the service account you created earlier. Alternatively, you can create a new user specifically for this purpose.
  7. Test that you can upload and download files in the target bucket.
  8. Create a data-transfer script.
  9. Create a scheduled task to run the script. You can use tools such as Linux crontab and Windows Task Scheduler.
  10. Create custom images that are configured for each server in the production environment. Each image should have the same configuration as its on-premises equivalent.

    As part of the custom image configuration for the database server, create a startup script that automatically copies the latest backup from a Cloud Storage bucket to the instance and then invokes the restore process.

  11. Configure Cloud DNS to point to your internet-facing web services.

  12. Create a Deployment Manager template that creates application servers in your Google Cloud network using the previously configured custom images. This template should also set up the required firewall rules.
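To make steps 8 and 9 concrete, here is a minimal sketch of a data-transfer script and its schedule. The bucket name, dump directory, and script path are hypothetical; the only assumption is that database backups land in a local directory before being mirrored to the DR bucket.

```shell
set -eu

# Hypothetical names; substitute your own bucket and dump directory.
BACKUP_BUCKET="gs://example-dr-backups"
DB_DUMP_DIR="/var/backups/db"

sync_backups_to_gcs() {
  # Mirror the local backup directory into the DR bucket. rsync copies
  # only files that changed since the last run, keeping transfers small.
  gcloud storage rsync --recursive "$DB_DUMP_DIR" "$BACKUP_BUCKET/db"
}

# Example crontab entry (step 9): run the sync every hour at minute 15.
# 15 * * * * /usr/local/bin/sync_backups_to_gcs.sh
```

On Windows, the equivalent schedule would be a Task Scheduler job that invokes the same command.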

You need to implement processes to ensure that the custom images have the same version of the application as on-premises. Incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Deployment Manager template uses the latest custom image.

Failover process and post-restart tasks

If a disaster occurs, you can recover to the system that's running on Google Cloud. To do this, you launch your recovery process to create the recovery environment using the Deployment Manager template you created. When the instances in the recovery environment are ready to accept production traffic, you adjust the DNS to point to the web server in Google Cloud.

A typical recovery sequence is this:

  1. Use the Deployment Manager template to create a deployment in Google Cloud.
  2. Apply the most recent database backup in Cloud Storage to the database server running in Google Cloud by following the instructions for your database system for recovering backup files.
  3. Apply the most recent transaction logs in Cloud Storage.
  4. Test that the application works as expected by simulating user scenarios on the recovered environment.
  5. When tests succeed, configure Cloud DNS to point to the web server on Google Cloud. (For example, you can use an anycast IP address behind a Google Cloud load balancer, with multiple web servers behind the load balancer.)
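The first and last steps of this sequence can be sketched as a small failover script. The deployment name, template file, DNS zone, record name, and IP address below are all hypothetical (the address is an RFC 5737 documentation address).

```shell
set -eu

failover_to_gcp() {
  # Step 1: create the recovery environment from the template.
  gcloud deployment-manager deployments create dr-recovery \
    --config=dr-environment.yaml

  # Step 5: after tests pass, point the public record at the
  # load balancer's IP address in Google Cloud.
  gcloud dns record-sets update www.example.com. \
    --zone=example-public-zone \
    --type=A \
    --ttl=300 \
    --rrdatas=203.0.113.10
}
```

The database restore steps in between are database-specific and are deliberately left out of the sketch.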

The following diagram shows the recovered environment:

Configuration of cold pattern for recovery when production is on-premises

When the production environment is running on-premises again and the environment can support production workloads, you reverse the steps that you followed to fail over to the Google Cloud recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on Google Cloud.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in Google Cloud. For example, prevent connections to the global load balancer. From this point, your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Configure Cloud DNS to point to your on-premises web service.
  7. Ensure that the process you had in place to copy data to Cloud Storage is operating as expected.
  8. Delete your deployment.

Warm standby: Recovery to Google Cloud

A warm pattern is typically implemented to keep RTO and RPO values as small as possible without the effort and expense of a fully HA configuration. The smaller the RTO and RPO values, the higher the costs, as you approach a fully redundant environment that can serve traffic from two environments. Therefore, implementing a warm pattern for your DR scenario is a good trade-off between budget and availability.

An example of this approach is to use Cloud Interconnect configured with a self-managed VPN solution to provide connectivity to Google Cloud. A multitiered application runs on-premises while a minimal recovery suite runs on Google Cloud. The recovery suite consists of an operational database server instance on Google Cloud. This instance must run at all times so that it can receive replicated transactions through asynchronous or semisynchronous replication techniques. To reduce costs, you can run the database on the smallest machine type that's capable of running the database service. Because this is a long-running instance, sustained use discounts apply.

This pattern uses the following DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Self-managed VPN solution
  • Compute Engine
  • Deployment Manager

Compute Engine snapshots provide a way to take backups that you can roll back to a previous state. Snapshots are used in this example because updated web pages and application binaries are written frequently to the production web and application servers. These updates are regularly replicated to the reference web server and application server instances on Google Cloud. (The reference servers don't accept production traffic; they are used to create the snapshots.)

The following diagram illustrates an architecture that implements this approach. The replication targets are not shown in the diagram.

Architecture for a warm pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the Google Cloud network.
  3. Replicate your on-premises servers to Google Cloud VM instances. One option is to use a partner solution; the method you employ depends on your circumstances.
  4. Create a custom image of your database server on Google Cloud that has the same configuration as your on-premises database server.
  5. Create snapshots of the web server and application server instances.
  6. Start a database instance in Google Cloud using the custom image you created earlier. Use the smallest machine type that is capable of accepting replicated data from the on-premises production database.
  7. Attach persistent disks to the Google Cloud database instance for the databases and transaction logs.
  8. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.
  9. Set the auto-delete flag on the persistent disks attached to the database instance to no-auto-delete.
  10. Configure a scheduled task to create regular snapshots of the persistent disks of the database instance on Google Cloud.
  11. Create reservations to assure capacity for your web server and application servers as needed.
  12. Test the process of creating instances from snapshots and of taking snapshots of the persistent disks.
  13. Create instances of the web server and the application server using the snapshots created earlier.
  14. Create a script that copies updates to the web application and the application server whenever the corresponding on-premises servers are updated. Write the script to create a snapshot of the updated servers.
  15. Configure Cloud DNS to point to your internet-facing web service on-premises.
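Steps 9 and 10 above can be sketched with gcloud. The instance, disk, policy, and zone names are hypothetical; the sketch uses a snapshot schedule resource policy in place of a hand-rolled cron job, which is one way (not the only way) to satisfy step 10.

```shell
set -eu

configure_db_disk_protection() {
  # Step 9: detach the data disk's fate from the instance's, so deleting
  # or re-creating the VM never deletes the disk.
  gcloud compute instances set-disk-auto-delete db-replica \
    --zone=us-central1-b \
    --disk=db-data-disk \
    --no-auto-delete

  # Step 10: snapshot the disk daily at 02:00, keeping 14 days of history.
  gcloud compute resource-policies create snapshot-schedule db-daily-snaps \
    --region=us-central1 \
    --daily-schedule \
    --start-time=02:00 \
    --max-retention-days=14

  gcloud compute disks add-resource-policies db-data-disk \
    --zone=us-central1-b \
    --resource-policies=db-daily-snaps
}
```

The retention window bounds how far back you can roll the database disk; size it to your RPO.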

Failover process and post-restart tasks

To manage a failover, you typically use your monitoring and alerting system to invoke an automated failover process. When the on-premises application needs to fail over, you configure the database system on Google Cloud so that it can accept production traffic. You also start instances of the web server and application server.

The following diagram shows the configuration after failover to Google Cloud, enabling production workloads to be served from Google Cloud:

Configuration of warm pattern for recovery when production is on-premises

A typical recovery sequence is this:

  1. Resize the database server instance so that it can handle production loads.
  2. Use the web server and application server snapshots on Google Cloud to create new web server and application server instances.
  3. Test that the application works as expected by simulating user scenarios on the recovered environment.
  4. When tests succeed, configure Cloud DNS to point to your web service on Google Cloud.
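Steps 1 and 2 of this sequence can be sketched as follows. The instance, disk, snapshot, zone, and machine type names are hypothetical; note that a Compute Engine instance must be stopped before its machine type can be changed.

```shell
set -eu

promote_dr_environment() {
  # Step 1: stop the minimal database instance, give it a
  # production-sized machine type, and start it again.
  gcloud compute instances stop db-replica --zone=us-central1-b
  gcloud compute instances set-machine-type db-replica \
    --zone=us-central1-b \
    --machine-type=n2-standard-8
  gcloud compute instances start db-replica --zone=us-central1-b

  # Step 2: build a boot disk from the latest web server snapshot and
  # start a web server instance from it.
  gcloud compute disks create web-1-boot \
    --zone=us-central1-b \
    --source-snapshot=web-server-latest
  gcloud compute instances create web-1 \
    --zone=us-central1-b \
    --disk=name=web-1-boot,boot=yes
}
```

The application server instances are created the same way from their own snapshots.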

When the production environment is running on-premises again and can support production workloads, you reverse the steps that you followed to fail over to the Google Cloud recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on Google Cloud.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in Google Cloud. One way to do this is to prevent connections to the web server by modifying the firewall rules. From this point, your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Test that the application works as expected by simulating user scenarios on the production environment.
  7. Configure Cloud DNS to point to your on-premises web service.
  8. Delete the web server and application server instances that are running in Google Cloud. Leave the reference servers running.
  9. Resize the database server on Google Cloud back to the minimum instance size that can accept replicated data from the on-premises production database.
  10. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.

Hot HA across on-premises and Google Cloud

If you have small RTO and RPO values, you can achieve them only by running HA across your production environment and Google Cloud concurrently. This approach gives you a hot pattern, because both the on-premises and Google Cloud environments are serving production traffic.

The key difference from the warm pattern is that the resources in both environments are running in production mode and serving production traffic.

This pattern uses the following DR building blocks:

  • Cloud Interconnect
  • Cloud VPN
  • Compute Engine
  • Managed instance groups
  • Cloud Monitoring
  • Cloud Load Balancing

The following diagram illustrates this example architecture. By implementing this architecture, you have a DR plan that requires minimal intervention in the event of a disaster.

Architecture for a hot pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and your Google Cloud network.
  3. Create custom images in Google Cloud that are configured for each server in the on-premises production environment. Each Google Cloud image should have the same configuration as its on-premises equivalent.
  4. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.

    Many database systems permit only a single writable database instance when you configure replication. Therefore, you might need to ensure that one of the database replicas acts as a read-only server.

  5. Create individual instance templates that use the images for the application servers and the web servers.

  6. Configure regional managed instance groups for the application and web servers.

  7. Configure health checks using Cloud Monitoring.

  8. Configure load balancing using the regional managed instance groups that were configured earlier.

  9. Configure a scheduled task to create regular snapshots of the persistent disks.

  10. Configure a DNS service to distribute traffic between your on-premises environment and the Google Cloud environment.

With this hybrid approach, you need to use a DNS service that supports weighted routing to the two production environments so that you can serve the same application from both.
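If you use Cloud DNS as that service, its weighted round robin (WRR) routing policy is one way to split traffic. The following is a sketch only: the zone name and the two IP addresses (RFC 5737 documentation addresses standing in for the on-premises service and the Google Cloud load balancer) are hypothetical, and the `--routing-policy-data` weight syntax should be checked against the current Cloud DNS documentation.

```shell
set -eu

create_weighted_record() {
  # A single A record whose answers are split 50/50 between the
  # on-premises front end and the Google Cloud load balancer.
  gcloud dns record-sets create www.example.com. \
    --zone=example-public-zone \
    --type=A \
    --ttl=60 \
    --routing-policy-type=WRR \
    --routing-policy-data="0.5=198.51.100.10;0.5=203.0.113.10"
}
```

Weights are relative, so setting one side to 0 is a simple way to drain an environment during a partial failure.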

You need to design the system for failures that might occur in only part of an environment (partial failures). In that case, traffic should be rerouted to the equivalent service in the other environment. For example, if the on-premises web servers become unavailable, you can disable DNS routing to that environment. If your DNS service supports health checks, this rerouting occurs automatically when a health check determines that the web servers in one of the environments can't be reached.

If you're using a database system that allows only a single writable instance, in many cases the database system automatically promotes the read-only replica to be the writable primary when the heartbeat between the original writable database and the read replica loses contact. Be sure that you understand this aspect of your database replication in case you need to intervene after a disaster.

You must implement processes to ensure that the custom VM images in Google Cloud have the same version of the application as the versions on-premises. Incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Deployment Manager template uses the latest custom image.

Failover process and post-restart tasks

In the configuration described here for a hot scenario, a disaster means that one of the two environments isn't available. There is no failover process in the same way that there is with the warm or cold scenarios, where you need to move data or processing to the second environment. However, you might need to handle the following configuration changes:

  • If your DNS service doesn't automatically reroute traffic based on a health check failure, you need to manually configure DNS routing to send traffic to the system that's still up.
  • If your database system doesn't automatically promote a read-only replica to be the writable primary on failure, you need to intervene to ensure that the replica is promoted.

When the second environment is running again and can handle production traffic, you need to resynchronize the databases. Because both environments support production workloads, you don't have to take any further action to change which database is the primary. After the databases are synchronized, you can allow production traffic to be distributed across both environments again by adjusting the DNS settings.

DR and HA architectures for production on Google Cloud

When you design your application architecture for production workloads on Google Cloud, the HA features of the platform have a direct influence on your DR architecture.

Backup and DR Service is a centralized, cloud-native solution for backing up and recovering cloud and hybrid workloads. It offers swift data recovery and facilitates the quick resumption of essential business operations.


Cold: recoverable application server

In a cold failover scenario where you need a single active server instance, only one instance should write to disk. In an on-premises environment, you often use an active/passive cluster. When you run a production environment on Google Cloud, you can instead create a VM in a managed instance group that runs only one instance.

This pattern uses the following DR building blocks:

  • Compute Engine
  • Managed instance groups

This cold failover scenario is shown in the following example architecture diagram:

Configuration of cold pattern for recovery when production is on Google Cloud

The following steps outline how to configure this cold failover scenario:

  1. Create a VPC network.
  2. Create a custom VM image that's configured with your application web service.
    1. Configure the VM so that the data processed by the application service is written to an attached persistent disk.
  3. Create a snapshot from the attached persistent disk.
  4. Create an instance template that references the custom VM image for the web server.
    1. Configure a startup script to create a persistent disk from the latest snapshot and to mount the disk. This script must be able to get the latest snapshot of the disk.
  5. Create a managed instance group with a target size of one that references the instance template, and configure health checks.
  6. Create a scheduled task to create regular snapshots of the persistent disk.
  7. Configure an external Application Load Balancer.
  8. Configure alerts using Cloud Monitoring to send an alert when the service fails.
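The startup script from step 4 can be sketched as follows. The snapshot prefix, disk, device, mount point, and zone names are hypothetical, and a production version would also need error handling and a guard against re-running on an ordinary reboot.

```shell
set -eu

latest_snapshot() {
  # Return the name of the newest snapshot whose name starts with
  # the (hypothetical) prefix app-data.
  gcloud compute snapshots list \
    --filter="name~^app-data" \
    --sort-by=~creationTimestamp \
    --limit=1 \
    --format="value(name)"
}

restore_data_disk() {
  snap="$(latest_snapshot)"
  # Re-create the data disk from the newest snapshot, attach it to this
  # VM under a predictable device name, then mount it.
  gcloud compute disks create app-data-restored \
    --zone=us-central1-b \
    --source-snapshot="$snap"
  gcloud compute instances attach-disk "$(hostname)" \
    --zone=us-central1-b \
    --disk=app-data-restored \
    --device-name=app-data
  mount /dev/disk/by-id/google-app-data /mnt/app-data
}
```

You would attach this logic to the instance template through the `startup-script` metadata key so that every replacement VM restores its data automatically.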

This cold failover scenario takes advantage of some of the HA features available in Google Cloud. If a VM fails, the managed instance group tries to re-create the VM automatically. You don't have to initiate this failover step. The external Application Load Balancer makes sure that even when a replacement VM is needed, the same IP address is used in front of the application server. The instance template and custom image make sure that the replacement VM is configured identically to the instance it replaces.

Your RPO is determined by the last snapshot taken. The more often you take snapshots, the smaller the RPO value.

The managed instance group provides HA in depth: it reacts to failures at the application or VM level, and you don't have to manually intervene if either scenario occurs. A target size of one makes sure that you only ever have one active instance running in the managed instance group and serving traffic.

Persistent disks are zonal, so you must take snapshots to re-create disks if there's a zonal failure. Snapshots are also available across regions, which lets you restore a disk to a different region in the same way that you can restore it to the same region.

In the unlikely event of a zonal failure, you must manually intervene to recover, as outlined in the next section.

Failover process

If a VM fails, the managed instance group automatically tries to re-create a VM in the same zone. The startup script in the instance template creates a persistent disk from the latest snapshot and attaches it to the new VM.

However, a managed instance group with a size of one doesn't recover if there's a zone failure. If a zone fails, you must react to the alert from Cloud Monitoring (or from another monitoring platform) when the service fails, and manually create an instance group in another zone.

A variation on this configuration is to use regional persistent disks instead of zonal persistent disks. With this approach, you don't need to use snapshots to restore the persistent disk as part of the recovery step. However, this variation consumes twice as much storage, and you need to budget for that.

The approach you choose is dictated by your budget and RTO and RPO values.

Warm: static site failover

If Compute Engine instances fail, you can mitigate service interruption by having a Cloud Storage-based static site on standby. This pattern is appropriate when your web application is mostly static.

In this scenario, the primary application runs on Compute Engine instances. These instances are grouped into managed instance groups, and the instance groups serve as backend services for an HTTPS load balancer. The load balancer directs incoming traffic to the instances according to the load balancer configuration, the configuration of each instance group, and the health of each instance.

This pattern uses the following DR building blocks:

  • Compute Engine
  • Cloud Storage
  • Cloud Load Balancing
  • Cloud DNS

The following diagram illustrates this example architecture:

Architecture for a warm failover to a static site when production is on Google Cloud

The following steps outline how to configure this scenario:

  1. Create a VPC network.
  2. Create a custom image that's configured with the application web service.
  3. Create an instance template that uses the image for the web servers.
  4. Configure a managed instance group for the web servers.
  5. Configure health checks using Monitoring.
  6. Configure load balancing using the managed instance groups that you configured earlier.
  7. Create a Cloud Storage-based static site.

In the production configuration, Cloud DNS is configured to point at this primary application, and the standby static site sits dormant. If the Compute Engine application goes down, you configure Cloud DNS to point to the static site.
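The DNS switch can be sketched as follows. This assumes a Cloud Storage bucket named `www.example.com` serving the static site, which is the bucket naming Cloud Storage requires for CNAME-based static hosting; the zone and record names are hypothetical.

```shell
set -eu

failover_to_static_site() {
  # Remove the A record that points at the load balancer...
  gcloud dns record-sets delete www.example.com. \
    --zone=example-public-zone \
    --type=A

  # ...and replace it with a CNAME for the Cloud Storage static site.
  gcloud dns record-sets create www.example.com. \
    --zone=example-public-zone \
    --type=CNAME \
    --ttl=60 \
    --rrdatas=c.storage.googleapis.com.
}
```

A low TTL on the record keeps the cutover (and the later cut back) fast.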

Failover process

If the application server or servers go down, your recovery sequence is to configure Cloud DNS to point to your static website. The following diagram shows the architecture in its recovery mode:

Configuration after failover to a static site when production is on Google Cloud.

When the application Compute Engine instances are running again and can support production workloads, you reverse the recovery step: you configure Cloud DNS to point to the load balancer that fronts the instances.

Alternatively, you can use Persistent Disk Asynchronous Replication, which offers block storage replication with low RPO and RTO for cross-region active-passive DR. This storage option lets you manage replication for Compute Engine workloads at the infrastructure level, rather than at the workload level.

Hot: HA web application

A hot pattern when your production environment is running on Google Cloud is to establish a well-architected HA deployment.

This pattern uses the following DR building blocks:

  • Compute Engine
  • Cloud Load Balancing
  • Cloud SQL

The following diagram illustrates this example architecture:

Architecture of a hot pattern when production is on Google Cloud

This scenario takes advantage of HA features in Google Cloud; you don't have to initiate any failover steps, because they occur automatically in the event of a disaster.

As shown in the diagram, the architecture uses a regional managed instance group together with global load balancing and Cloud SQL. The example here uses a regional managed instance group, so the instances are distributed across three zones.

With this approach, you get HA in depth. Regional managed instance groups provide mechanisms to react to failures at the application, instance, or zone level, and you don't have to manually intervene if any of those scenarios occurs.

To address application-level recovery, as part of setting up the managed instance group, you configure HTTP health checks that verify that the services are running properly on the instances in that group. If a health check determines that a service has failed on an instance, the group automatically re-creates that instance.
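Wiring a health check to a regional managed instance group for autohealing can be sketched as follows. The group, check, region, and `/health` endpoint names are hypothetical; the sketch assumes the group already exists.

```shell
set -eu

enable_autohealing() {
  # An HTTP health check that probes each instance's /health endpoint.
  gcloud compute health-checks create http web-app-health \
    --port=80 \
    --request-path=/health \
    --check-interval=10s \
    --unhealthy-threshold=3

  # Attach the check to the group; instances that fail it are re-created.
  # --initial-delay gives a new instance time to boot before it is probed.
  gcloud compute instance-groups managed update web-app-mig \
    --region=us-central1 \
    --health-check=web-app-health \
    --initial-delay=300
}
```

Tune the interval, threshold, and initial delay to your application's startup time, or healthy instances may be re-created during boot.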

For more information about building scalable and resilient applications on Google Cloud, see Patterns for scalable and resilient apps.


Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-08-05 UTC.