Disaster recovery planning guide

Last reviewed 2024-07-05 UTC

This document is the first part of a series that discusses disaster recovery (DR) in Google Cloud. This part provides an overview of the DR planning process: what you need to know in order to design and implement a DR plan. Subsequent parts discuss specific DR use cases with example implementations on Google Cloud.

The series consists of the following parts:

Service-interrupting events can happen at any time. Your network could have an outage, your latest application push might introduce a critical bug, or you might have to contend with a natural disaster. When things go awry, it's important to have a robust, targeted, and well-tested DR plan.

With a well-designed, well-tested DR plan in place, you can make sure that if catastrophe hits, the impact on your business's bottom line will be minimal. No matter what your DR needs look like, Google Cloud has a robust, flexible, and cost-effective selection of products and features that you can use to build or augment the solution that is right for you.

Basics of DR planning

DR is a subset of business continuity planning. DR planning begins with a business impact analysis that defines two key metrics:

A recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).
A recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost from your application due to a major incident. This metric varies based on the ways that the data is used. For example, user data that's frequently modified could have an RPO of just a few minutes. In contrast, less critical, infrequently modified data could have an RPO of several hours. (This metric describes only the length of time; it doesn't address the amount or quality of the data that's lost.)
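
As an illustration of how an RPO can be checked in practice, the following minimal sketch computes the worst-case data-loss window from a list of backup timestamps and compares it with an RPO. The function names and the idea of feeding in backup completion times are assumptions made for this example; they are not part of any Google Cloud API.

```python
from datetime import datetime, timedelta, timezone


def worst_case_data_loss(backup_times: list[datetime], now: datetime) -> timedelta:
    """Return the longest window during which written data had no backup.

    The window is the largest gap between consecutive backups, or the time
    elapsed since the most recent backup, whichever is longer.
    """
    if not backup_times:
        raise ValueError("at least one backup timestamp is required")
    ordered = sorted(backup_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    gaps.append(now - ordered[-1])
    return max(gaps)


def meets_rpo(backup_times: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """Return True if the worst-case data-loss window is within the RPO."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Hypothetical backup schedule used only for illustration.
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
backups = [start, start + timedelta(minutes=10), start + timedelta(minutes=25)]
current_time = start + timedelta(minutes=30)

window = worst_case_data_loss(backups, current_time)
ok_15 = meets_rpo(backups, current_time, timedelta(minutes=15))
ok_10 = meets_rpo(backups, current_time, timedelta(minutes=10))
```

With this schedule the longest gap is the 15 minutes between the second and third backups, so a 15-minute RPO is met and a 10-minute RPO is not.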

Typically, the smaller your RTO and RPO values are (that is, the faster your application must recover from an interruption), the more your application will cost to run. The following graph shows the ratio of cost to RTO/RPO.

Graph showing that small RTO/RPO maps to high cost.

Because smaller RTO and RPO values often mean greater complexity, the associated administrative overhead follows a similar curve. A high-availability application might require you to manage distribution between two physically separated data centers, manage replication, and more.

RTO and RPO values typically roll up into another metric: the service level objective (SLO), which is a key measurable element of an SLA. SLAs and SLOs are often conflated. An SLA is the entire agreement that specifies what service is to be provided, how it is supported, times, locations, costs, performance, penalties, and responsibilities of the parties involved. SLOs are specific, measurable characteristics of the SLA, such as availability, throughput, frequency, response time, or quality. An SLA can contain many SLOs. RTOs and RPOs are measurable and should be considered SLOs.

You can read more about SLOs and SLAs in the Google Site Reliability Engineering book.
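
To make the relationship between an availability SLO and recovery time more concrete, the following sketch converts an availability percentage into the downtime that it allows over a period. The function name and the 30-day period are assumptions chosen for illustration; your SLA might define the measurement period differently.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the total downtime that an availability target allows in a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * (1 - availability_percent / 100)


# Illustrative values: a 99.9% availability SLO measured over 30 days.
budget = downtime_budget(99.9, timedelta(days=30))
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime in total. An RTO that is longer than this budget means that a single incident could exhaust it, which is one way to check whether the two objectives are consistent with each other.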

You might also be planning an architecture for high availability (HA). HA doesn't entirely overlap with DR, but it's often necessary to take HA into account when you're thinking about RTO and RPO values. HA helps to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. When you run production workloads on Google Cloud, you might use a globally distributed system so that if something goes wrong in one region, the application continues to provide service even if it's less widely available. In essence, that application invokes its DR plan.

Why Google Cloud?

Google Cloud can greatly reduce the costs that are associated with both RTO and RPO when compared to fulfilling RTO and RPO requirements on premises. For example, DR planning requires you to account for a number of requirements, including the following:

Capacity: securing enough resources to scale as needed.
Security: providing physical security to protect assets.
Network infrastructure: including software components such as firewalls and load balancers.
Support: making available skilled technicians to perform maintenance and to address issues.
Bandwidth: planning suitable bandwidth for peak load.
Facilities: ensuring physical infrastructure, including equipment and power.

By providing a highly managed solution on a world-class production platform, Google Cloud helps you bypass most or all of these complicating factors, removing many business costs in the process. In addition, Google Cloud's focus on administrative simplicity means that the costs of managing a complex application are reduced as well.

Google Cloud offers several features that are relevant to DR planning, including the following:

A global network. Google has one of the largest and most advanced computer networks in the world. The Google backbone network uses advanced software-defined networking and edge-caching services to deliver fast, consistent, and scalable performance.
Redundancy. Multiple points of presence (PoPs) across the globe mean strong redundancy. Your data is mirrored automatically across storage devices in multiple locations.
Scalability. Google Cloud is designed to scale like other Google products (for example, search and Gmail), even when you experience a huge traffic spike. Managed services such as Cloud Run, Compute Engine, and Firestore give you automatic scaling that enables your application to grow and shrink as needed.
Security. The Google security model is built on decades of experience with helping to keep customers safe on Google applications like Gmail and Google Workspace. In addition, the site reliability engineering teams at Google help ensure high availability and help prevent abuse of platform resources.
Compliance. Google undergoes regular independent third-party audits to verify that Google Cloud is in alignment with security, privacy, and compliance regulations and best practices. Google Cloud complies with certifications such as ISO 27001, SOC 2/3, and PCI DSS 3.0.

DR patterns

DR patterns are considered to be cold, warm, or hot. These patterns indicate how readily the system can recover when something goes wrong. An analogy might be what you would do if you were driving and punctured a car tire.

How you deal with a flat tire depends on how prepared you are:

Cold: You have no spare tire, so you must call someone to come to you with a new tire and replace it. Your trip stops until help arrives to make the repair.
Warm: You have a spare tire and a replacement kit, so you can get back on the road using what you have in your car. However, you must stop your journey to repair the problem.
Hot: You have run-flat tires. You might need to slow down a little, but there is no immediate impact on your journey. Your tires run well enough that you can continue (although you must eventually address the issue).

Creating a detailed DR plan

This section provides recommendations for how to create your DR plan.

Design according to your recovery goals

When you design your DR plan, you need to combine your application and data recovery techniques and look at the bigger picture. The typical way to do this is to look at your RTO and RPO values and which DR pattern you can adopt to meet those values. For example, in the case of historical compliance-oriented data, you probably don't need speedy access to the data, so a large RTO value and cold DR pattern is appropriate. However, if your online service experiences an interruption, you'll want to be able to recover both the data and the user-facing part of the application as quickly as possible. In that case, a hot pattern would be more appropriate. Your email notification system, which typically isn't business critical, is probably a candidate for a warm pattern.

For guidance on using Google Cloud to address common DR scenarios, review the application recovery scenarios. These scenarios provide targeted DR strategies for a variety of use cases and offer example implementations on Google Cloud for each.

Design for end-to-end recovery

It isn't enough just to have a plan for backing up or archiving your data. Make sure your DR plan addresses the full recovery process, from backup to restore to cleanup. We discuss this in the related documents about DR data and recovery.

Make your tasks specific

When it's time to run your DR plan, you don't want to be stuck guessing what each step means. Make each task in your DR plan consist of one or more concrete, unambiguous commands or actions. For example, "Run the restore script" is too general. In contrast, "Open a shell and run /home/example/restore.sh" is precise and concrete.

Implementing control measures

Add controls to prevent disasters from occurring and to detect issues before they occur. For example, add a monitor that sends an alert when a data-destructive flow, such as a deletion pipeline, exhibits unexpected spikes or other unusual activity. This monitor could also terminate the pipeline processes if a certain deletion threshold is reached, preventing a catastrophic situation.
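
The following minimal sketch shows one way that such a monitor could decide between alerting and stopping a pipeline. It is not tied to any particular monitoring product: the DeletionGuard class, its thresholds, and the sliding-window approach are all assumptions made for illustration, and a real pipeline would need to act on the returned decision.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DeletionGuard:
    """Tracks deletions in a sliding time window and suggests an action."""

    alert_threshold: int          # deletions per window that trigger an alert
    stop_threshold: int           # deletions per window that should stop the pipeline
    window_seconds: float = 60.0  # length of the sliding window
    _events: deque = field(default_factory=deque, init=False, repr=False)

    def record(self, timestamp: float, deleted_count: int) -> str:
        """Record a batch of deletions and return "ok", "alert", or "stop"."""
        self._events.append((timestamp, deleted_count))
        # Drop batches that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        total = sum(count for _, count in self._events)
        if total >= self.stop_threshold:
            return "stop"
        if total >= self.alert_threshold:
            return "alert"
        return "ok"


# Hypothetical thresholds and timestamps (in seconds) for illustration.
guard = DeletionGuard(alert_threshold=100, stop_threshold=500)
decisions = [
    guard.record(0, 50),     # 50 deletions in the window
    guard.record(10, 60),    # 110 deletions in the window
    guard.record(20, 400),   # 510 deletions in the window
    guard.record(200, 10),   # earlier batches have expired
]
```

In this run the decisions are "ok", "alert", "stop", and then "ok" again once the earlier batches fall outside the 60-second window. Keeping the stop threshold well above the alert threshold gives operators a chance to react before the pipeline is terminated automatically.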

Preparing your software

Part of your DR planning is to make sure that the software you rely on is ready for a recovery event.

Verify that you can install your software

Make sure that your application software can be installed from source or from a preconfigured image. Make sure that you are appropriately licensed for any software that you will be deploying on Google Cloud; check with the supplier of the software for guidance.

Make sure that needed Compute Engine resources are available in the recovery environment. This might require preallocating instances or reserving them.

Design continuous deployment for recovery

Your continuous deployment (CD) toolset is an integral component when you are deploying your applications. As part of your recovery plan, you must consider where in your recovered environment you will deploy artifacts. Plan where you want to host your CD environment and artifacts, because they need to be available and operational in the event of a disaster.

Implementing security and compliance controls

When you design a DR plan, security is important. The same controls that you have in your production environment must apply to your recovered environment. Compliance regulations will also apply to your recovered environment.

Configure security the same for the DR and production environments

Make sure that your network controls provide the same separation and blocking that the source production environment uses. Learn how to configure Shared VPC and firewalls to let you establish centralized networking and security control of your deployment, to configure subnets, and to control inbound and outbound traffic. Understand how to use service accounts to implement least privilege for applications that access Google Cloud APIs. Make sure to use service accounts as part of the firewall rules.

Make sure that you grant users the same access to the DR environment that they have in the source production environment. The following list outlines ways to synchronize permissions between environments:

If your production environment is Google Cloud, replicating IAM policies in the DR environment is straightforward. You can use infrastructure as code (IaC) tools like Terraform to deploy your IAM policies to production. You then use the same tools to bind the policies to corresponding resources in the DR environment as part of the process of standing up your DR environment.
If your production environment is on-premises, you map the functional roles, such as your network administrator and auditor roles, to IAM policies that have the appropriate IAM roles. The IAM documentation has some example functional role configurations; for example, see the documentation for creating networking and audit logging functional roles.
You have to configure IAM policies to grant appropriate permissions to products. For example, you might want to restrict access to specific Cloud Storage buckets.
If your production environment is another cloud provider, map the permissions in the other provider's IAM policies to Google Cloud IAM policies.

Verify your DR security

After you've configured permissions for the DR environment, make sure that you test everything. Create a test environment. Verify that the permissions that you grant to users match those that the users have on-premises.
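
One way to automate part of this verification is to compare the role-to-member mappings of the two environments. The following sketch assumes that you have already exported each environment's permissions into a plain dictionary that maps a role name to its members; the diff_bindings function, the role names, and the member names are hypothetical and chosen only for illustration.

```python
def diff_bindings(
    production: dict[str, set[str]], dr: dict[str, set[str]]
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Compare role-to-members mappings for two environments.

    Returns (missing, extra): members that production grants but DR does not,
    and members that DR grants but production does not, keyed by role.
    """
    missing: dict[str, set[str]] = {}
    extra: dict[str, set[str]] = {}
    for role in sorted(set(production) | set(dr)):
        prod_members = set(production.get(role, ()))
        dr_members = set(dr.get(role, ()))
        if prod_members - dr_members:
            missing[role] = prod_members - dr_members
        if dr_members - prod_members:
            extra[role] = dr_members - prod_members
    return missing, extra


# Hypothetical exported mappings for illustration.
prod_bindings = {
    "network-admin": {"alice@example.com", "bob@example.com"},
    "auditor": {"carol@example.com"},
}
dr_bindings = {
    "network-admin": {"alice@example.com"},
    "auditor": {"carol@example.com", "dave@example.com"},
}

missing, extra = diff_bindings(prod_bindings, dr_bindings)
```

Here bob@example.com is missing from the network-admin role in the DR environment, and dave@example.com has auditor access in DR that production does not grant. Both kinds of difference are worth reviewing: the first blocks recovery work, and the second widens access beyond production.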

Make sure users can access the DR environment

Don't wait for a disaster to occur before checking that your users can access the DR environment. Make sure that you have granted appropriate access rights to users, developers, operators, data scientists, security administrators, network administrators, and any other roles in your organization. If you are using an alternative identity system, make sure that accounts have been synced with your Cloud Identity account. Because the DR environment will be your production environment for a while, get your users who will need access to the DR environment to sign in, and resolve any authentication issues. Incorporate users who are logging in to the DR environment as part of the regular DR tests that you implement.

To centrally manage who has administrative access to virtual machines (VMs) that are launched, enable the OS Login feature on the Google Cloud projects that constitute your DR environment.

Train users

Users need to understand how to undertake the actions in Google Cloud that they're used to accomplishing in the production environment, such as logging in and accessing VMs. Using the test environment, train your users how to perform these tasks in ways that safeguard your system's security.

Make sure that the DR environment meets compliance requirements

Verify that access to your DR environment is restricted to only those who need access. Make sure that PII data is redacted and encrypted. If you perform regular penetration tests on your production environment, you should include your DR environment as part of that scope and carry out regular tests by standing up a DR environment.

Make sure that while your DR environment is in service, any logs that you collect are backfilled into the log archive of your production environment. Similarly, make sure that as part of your DR environment, you can export audit logs that are collected through Cloud Logging to your main log sink archive. Use the export sink facilities. For application logs, create a mirror of your on-premises logging and monitoring environment. If your production environment is another cloud provider, map that provider's logging and monitoring to the equivalent Google Cloud services. Have a process in place to format input into your production environment.

Treat recovered data like production data

Make sure that the security controls that you apply to your production data also apply to your recovered data: the same permissions, encryption, and audit requirements should all apply.

Know where your backups are located and who is authorized to restore data. Make sure your recovery process is auditable: after a disaster recovery, make sure you can show who had access to the backup data and who performed the recovery.

Making sure your DR plan works

Make sure that if a disaster does occur, your DR plan works as intended.

Maintain more than one data recovery path

In the event of a disaster, your connection method to Google Cloud might become unavailable. Implement an alternative means of access to Google Cloud to help ensure that you can transfer data to Google Cloud. Regularly test that the backup path is operational.

Test your plan regularly

After you create a DR plan, test it regularly, noting any issues that come up and adjusting your plan accordingly. Using Google Cloud, you can test recovery scenarios at minimal cost. We recommend that you implement the following to help with your testing:

Automate infrastructure provisioning. You can use IaC tools like Terraform to automate the provisioning of your Google Cloud infrastructure. If you're running your production environment on premises, make sure that you have a monitoring process that can start the DR process when it detects a failure and can trigger the appropriate recovery actions.
Monitor your environments with Google Cloud Observability. Google Cloud has excellent logging and monitoring tools that you can access through API calls, allowing you to automate the deployment of recovery scenarios by reacting to metrics. When you're designing tests, make sure that you have appropriate monitoring and alerting in place that can trigger appropriate recovery actions.
Perform the testing noted earlier:
- Test that permissions and user access work in the DR environment like they do in the production environment.
- Perform penetration testing on your DR environment.
- Perform a test in which your usual access path to Google Cloud doesn't work.
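
To help note issues from each test, you can record how long every recovery step took and compare the total with your RTO. The following sketch assumes that the steps run one after another and that you time them yourself during the drill; the DrillStep class, the evaluate_drill function, and the step names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillStep:
    """One timed step of a DR drill."""

    name: str
    duration: timedelta


def evaluate_drill(steps: list[DrillStep], rto: timedelta) -> dict:
    """Summarize a drill: total recovery time, RTO check, and slowest step."""
    if not steps:
        raise ValueError("a drill needs at least one step")
    total = sum((step.duration for step in steps), timedelta())
    slowest = max(steps, key=lambda step: step.duration)
    return {
        "total": total,
        "within_rto": total <= rto,
        "slowest_step": slowest.name,
    }


# Hypothetical drill timings for illustration.
drill = [
    DrillStep("provision infrastructure", timedelta(minutes=12)),
    DrillStep("restore database", timedelta(minutes=35)),
    DrillStep("verify user access", timedelta(minutes=8)),
]
report = evaluate_drill(drill, rto=timedelta(hours=1))
```

In this example the drill takes 55 minutes in total, which is within a one-hour RTO, and the database restore is the slowest step. Tracking the slowest step across drills shows where adjustments to the plan are likely to have the most effect.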

What's next?

Read about Google Cloud geography and regions.
Read other documents in this DR series:
For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.

Contributors

Authors:

Grace Mollison | Solutions Lead
Marco Ferrari | Cloud Solutions Architect

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-07-05 UTC.
