Disaster recovery building blocks

Last reviewed 2024-07-01 UTC

This document is the second part of a series that discusses disaster recovery (DR) in Google Cloud. This part discusses services and products that you can use as building blocks for your DR plan—both Google Cloud products and products that work across platforms.

Google Cloud has a wide range of products that you can use as part of your disaster recovery architecture. This section discusses DR-related features of the products that are most commonly used as Google Cloud DR building blocks.

Many of these services have high availability (HA) features. HA doesn't entirely overlap with DR, but many of the goals of HA also apply to designing a DR plan. For example, by taking advantage of HA features, you can design architectures that optimize uptime and that can mitigate the effects of small-scale failures, such as a single VM failing. For more about the relationship of DR and HA, see the Disaster recovery planning guide.

The following sections describe these Google Cloud DR building blocks and how they help you implement your DR goals.

Compute and storage

The following table provides a summary of the features in Google Cloud compute and storage services that serve as building blocks for DR:

Compute Engine
  • Scalable compute resources
  • Predefined and custom machine types
  • Fast boot times
  • Snapshots
  • Instance templates
  • Managed instance groups
  • Reservations
  • Persistent disks
  • Transparent maintenance
  • Live migration
Cloud Storage
  • Highly durable object store
  • Redundancy across regions
  • Storage classes
  • Object lifecycle management
  • Data transfer from other sources
  • Encryption at rest by default
  • Soft deletion
Google Kubernetes Engine (GKE)
  • Managed environment for deploying and scaling containerized applications
  • Node auto-repair
  • Liveness and readiness probes
  • Persistent volumes
  • Multi-zone and regional clusters
  • Multi-cluster networking

For more information about how the features and the design of these and other Google Cloud products might influence your DR strategy, see Architecting disaster recovery for cloud infrastructure outages: product reference.

Compute Engine

Compute Engine provides virtual machine (VM) instances; it's the workhorse of Google Cloud. In addition to configuring, launching, and monitoring Compute Engine instances, you typically use a variety of related features to implement a DR plan.

For DR scenarios, you can prevent accidental deletion of VMs by setting the deletion protection flag. This is particularly useful when you host stateful services such as databases.
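
You can set this flag from the console, from gcloud, or programmatically. The following is a minimal sketch that uses the Python google-cloud-compute library; the project, zone, and instance name are placeholder assumptions, and you should verify the request fields against the library version you use.

    from google.cloud import compute_v1

    def enable_deletion_protection(project_id: str, zone: str, instance_name: str) -> None:
        """Turn on deletion protection for an existing VM instance."""
        client = compute_v1.InstancesClient()
        request = compute_v1.SetDeletionProtectionInstanceRequest(
            project=project_id,
            zone=zone,
            resource=instance_name,
            deletion_protection=True,
        )
        # Compute Engine mutations return a long-running operation; wait for completion.
        operation = client.set_deletion_protection(request=request)
        operation.result()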

For information about how to meet low RTO and RPO values, see Designing resilient systems.

Instance templates

You can use Compute Engine instance templates to save the configuration details of a VM and then create Compute Engine instances from existing instance templates. You can use a template to launch as many instances as you need, configured exactly the way you want, when you need to stand up your DR target environment. Instance templates are global resources, so you can recreate the instance anywhere in Google Cloud with the same configuration.
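
For example, a recovery script can stand up instances in the DR target zone from a saved template. The following is a minimal sketch that uses the Python google-cloud-compute library; the project, zone, instance name, and template URL are placeholder assumptions.

    from google.cloud import compute_v1

    def create_instance_from_template(project_id: str, zone: str,
                                      instance_name: str, template_url: str) -> None:
        """Create a VM in the DR target zone from an existing instance template."""
        client = compute_v1.InstancesClient()
        request = compute_v1.InsertInstanceRequest(
            project=project_id,
            zone=zone,
            # Full or partial URL of the template, for example
            # "projects/PROJECT/global/instanceTemplates/TEMPLATE".
            source_instance_template=template_url,
            instance_resource=compute_v1.Instance(name=instance_name),
        )
        operation = client.insert(request=request)
        operation.result()  # Wait for the instance to be created.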

For details about using Compute Engine images, see the Balancing image configuration and deployment speed section later in this document.

Managed instance groups

Managed instance groups work with Cloud Load Balancing (discussed later in this document) to distribute traffic to groups of identically configured instances that are copied across zones. Managed instance groups support features such as autoscaling and autohealing, where the managed instance group can delete and recreate instances automatically.

Reservations

Compute Engine allows for the reservation of VM instances in a specific zone, using custom or predefined machine types, with or without additional GPUs or local SSDs. To help assure capacity for your mission-critical workloads for DR, create reservations in your DR target zones. Without reservations, you might not get the on-demand capacity that you need to meet your recovery time objective. Reservations can be useful in cold, warm, or hot DR scenarios. They let you keep recovery resources available for failover to meet lower RTO needs, without having to fully configure and deploy them in advance.
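
You can create reservations in the console, with gcloud, or through the API. The following is a minimal sketch that uses the Python google-cloud-compute library to reserve capacity in a DR target zone; the reservation name, machine type, and VM count are placeholder assumptions, and the message types should be verified against the current library.

    from google.cloud import compute_v1

    def reserve_dr_capacity(project_id: str, zone: str, reservation_name: str) -> None:
        """Reserve VM capacity in the DR target zone for failover."""
        reservation = compute_v1.Reservation(
            name=reservation_name,
            specific_reservation=compute_v1.AllocationSpecificSKUReservation(
                count=10,  # Number of VMs to keep reserved for failover.
                instance_properties=compute_v1.AllocationSpecificSKUAllocationReservedInstanceProperties(
                    machine_type="n2-standard-4",
                ),
            ),
        )
        client = compute_v1.ReservationsClient()
        operation = client.insert(
            project=project_id, zone=zone, reservation_resource=reservation
        )
        operation.result()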

Persistent disks and snapshots

Persistent disks are durable network storage devices that your instances can access. They are independent of your instances, so you can detach and move persistent disks to keep your data even after you delete your instances.

You can take incremental backups or snapshots of Compute Engine VMs that you can copy across regions and use to recreate persistent disks in the event of a disaster. Additionally, you can create snapshots of persistent disks to protect against data loss due to user error. Snapshots are incremental and take only minutes to create, even if you snapshot disks that are attached to running instances.
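
Snapshot schedules can automate this without custom code, but you can also script snapshot creation, for example as part of a pre-failover checkpoint. The following is a minimal sketch that uses the Python google-cloud-compute library; the project, zone, disk, and snapshot names are placeholder assumptions.

    from google.cloud import compute_v1

    def snapshot_disk(project_id: str, zone: str, disk_name: str, snapshot_name: str) -> None:
        """Create an incremental snapshot of a zonal persistent disk."""
        disks_client = compute_v1.DisksClient()
        snapshot = compute_v1.Snapshot(name=snapshot_name)
        operation = disks_client.create_snapshot(
            project=project_id,
            zone=zone,
            disk=disk_name,
            snapshot_resource=snapshot,
        )
        operation.result()  # Wait until the snapshot is ready to use.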

Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Persistent disks are either zonal or regional. Regional persistent disks replicate writes across two zones in a region. In the event of a zonal outage, a backup VM instance can force-attach a regional persistent disk in the secondary zone. To learn more, see High availability options using regional persistent disks.

Transparent maintenance

Google regularly maintains its infrastructure by patching systems with the latest software, performing routine tests and preventative maintenance, and working to ensure that Google infrastructure is as fast and efficient as possible.

By default, all Compute Engine instances are configured so that these maintenance events are transparent to your applications and workloads. For more information, see Transparent maintenance.

When a maintenance event occurs, Compute Engine uses live migration to automatically migrate your running instances to another host in the same zone. Live migration lets Google perform maintenance that's integral to keeping infrastructure protected and reliable without interrupting any of your VMs.

Virtual disk import tool

The virtual disk import tool lets you import file formats including VMDK, VHD, and RAW to create new Compute Engine virtual machines. Using this tool, you can create Compute Engine virtual machines that have the same configuration as your on-premises virtual machines. This approach is useful when you can't configure Compute Engine images from the source binaries of software that's already installed on your images.

Automated backups

You can automate backups of your Compute Engine instances using tags. For example, you can create a backup plan template using Backup and DR Service, and automatically apply the template to your Compute Engine instances.

For more information, see Automate protection of new Compute Engine instances.

Cloud Storage

Cloud Storage is an object store that's ideal for storing backup files. It provides different storage classes that are suited for specific use cases, as outlined in the following diagram.

Diagram showing Standard storage for high-frequency access, Nearline and Coldline for low-frequency access, and Archive for lowest-frequency access

In DR scenarios, Nearline, Coldline, and Archive storage are of particular interest. These storage classes reduce your storage cost compared to Standard storage. However, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for. Nearline is designed for backup scenarios where access is at most once a month, which is ideal for regular DR stress tests while keeping costs low.

Nearline, Coldline, and Archive are optimized for infrequent access, and the pricing model is designed with this in mind. Therefore, you are charged for minimum storage durations, and there are additional costs for retrieving data or metadata in these classes earlier than the minimum storage duration for the class.

To protect your data in a Cloud Storage bucket against accidental or malicious deletion, you can use the Soft Delete feature to preserve deleted and overwritten objects for a specified period, and the Object holds feature to prevent deletion or updates to objects.
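
For example, a backup pipeline could place a hold on newly written backup objects until they are verified. The following is a minimal sketch that uses the Python google-cloud-storage library; the bucket and object names are placeholder assumptions.

    from google.cloud import storage

    def hold_backup_object(bucket_name: str, object_name: str) -> None:
        """Place a temporary hold on a backup object so it can't be deleted or replaced."""
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(object_name)
        blob.temporary_hold = True
        blob.patch()  # Persist the hold on the object's metadata.

    def release_hold(bucket_name: str, object_name: str) -> None:
        """Release the hold, for example after the retention requirement has passed."""
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(object_name)
        blob.temporary_hold = False
        blob.patch()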

Storage Transfer Service lets you import data from Amazon S3, Azure Blob Storage, or on-premises data sources into Cloud Storage. In DR scenarios, you can use Storage Transfer Service to do the following:

  • Back up data from other storage providers to a Cloud Storage bucket.
  • Move data from a bucket in a dual-region or multi-region to a bucket in a region to lower your costs for storing backups.
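
The following is a minimal sketch of the second scenario that uses the Python google-cloud-storage-transfer library; the project and bucket names are placeholder assumptions, and the job fields should be checked against the current API.

    from google.cloud import storage_transfer

    def copy_backups_between_buckets(project_id: str, source_bucket: str, sink_bucket: str) -> None:
        """Create and run a one-time transfer job between two Cloud Storage buckets."""
        client = storage_transfer.StorageTransferServiceClient()
        job = client.create_transfer_job(
            {
                "transfer_job": {
                    "project_id": project_id,
                    "description": "Copy backups to a lower-cost regional bucket",
                    "status": storage_transfer.TransferJob.Status.ENABLED,
                    "transfer_spec": {
                        "gcs_data_source": {"bucket_name": source_bucket},
                        "gcs_data_sink": {"bucket_name": sink_bucket},
                    },
                }
            }
        )
        # Trigger the job once; add a schedule to the job for recurring copies.
        client.run_transfer_job({"job_name": job.name, "project_id": project_id})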

Filestore

Filestore instances are fully managed NFS file servers for use with applications running on Compute Engine instances or GKE clusters.

Filestore Basic and Zonal tiers are zonal resources and don't support replication across zones, while Filestore Enterprise tier instances are regional resources. To help you increase the resiliency of your Filestore environment, we recommend that you use Enterprise tier instances.

Google Kubernetes Engine

GKE is a managed, production-ready environment for deploying containerized applications. GKE lets you orchestrate HA systems, and includes the following features:

  • Node auto repair. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), GKE initiates a repair process for that node.
  • Liveness and readiness probes. You can specify a liveness probe, which periodically tells GKE that the pod is running. If the pod fails the probe, it can be restarted.
  • Multi-zone and regional clusters. You can distribute Kubernetes resources across multiple zones within a region.
  • Multi-cluster Gateway lets you configure shared load balancing resources across multiple GKE clusters in different regions.
  • Backup for GKE lets you back up and restore workloads in GKE clusters.

Networking and data transfer

The following table provides a summary of the features in Google Cloud networking and data transfer services that serve as building blocks for DR:

Cloud Load Balancing
  • Health checks
  • Global load balancing
  • Regional load balancing
  • Multi-region failover
  • Multi-protocol load balancing
  • External and internal load balancing
Cloud Service Mesh
  • Google-managed service mesh control plane
  • Advanced request routing and rich traffic-control policies
Cloud DNS
  • Programmatic DNS management
  • Access control
  • Anycast to serve zones
  • DNS policies
Cloud Interconnect
  • Cloud VPN (IPsec VPN)
  • Direct peering

Cloud Load Balancing

Cloud Load Balancing provides HA for Google Cloud computing products by distributing user traffic across multiple instances of your applications. You can configure Cloud Load Balancing with health checks that determine whether instances are available to do work, so that traffic is not routed to failing instances.

Cloud Load Balancing provides a single anycast IP address to front your applications. Your applications can have instances running in different regions (for example, in Europe and in the US), and your end users are directed to the closest set of instances. In addition to providing load balancing for services that are exposed to the internet, you can configure internal load balancing for your services behind a private load-balancing IP address. This IP address is accessible only to VM instances that are internal to your Virtual Private Cloud (VPC).

For more information, see Cloud Load Balancing overview.

Cloud Service Mesh

Cloud Service Mesh is a Google-managed service mesh that's available on Google Cloud. Cloud Service Mesh provides in-depth telemetry to help you gather detailed insights about your applications. It supports services that run on a range of computing infrastructures.

Cloud Service Mesh also supports advanced traffic management and routing features, such as circuit breaking and fault injection. With circuit breaking, you can enforce limits on requests to a particular service. When circuit breaking limits are reached, requests are prevented from reaching the service, which prevents the service from degrading further. With fault injection, Cloud Service Mesh can introduce delays or abort a fraction of requests to a service. Fault injection lets you test your service's ability to survive request delays or aborted requests.

For more information, see Cloud Service Mesh overview.

Cloud DNS

Cloud DNS provides a programmatic way to manage your DNS entries as part of an automated recovery process. Cloud DNS uses Google's global network of anycast name servers to serve your DNS zones from redundant locations around the world, providing high availability and lower latency for your users.
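
For example, a failover script can repoint a DNS record at the DR environment. The following is a minimal sketch that uses the Python google-cloud-dns library; the zone name, record name, TTL, and IP address are placeholder assumptions.

    from google.cloud import dns

    def point_record_at_dr_site(project_id: str, zone_name: str,
                                record_name: str, dr_ip: str) -> None:
        """Update an A record so that traffic is directed to the DR environment."""
        client = dns.Client(project=project_id)
        zone = client.zone(zone_name)

        # Find the existing A record(s) for the name, for example "app.example.com.".
        old_records = [r for r in zone.list_resource_record_sets()
                       if r.name == record_name and r.record_type == "A"]
        new_record = zone.resource_record_set(record_name, "A", 300, [dr_ip])

        changes = zone.changes()
        for record in old_records:
            changes.delete_record_set(record)  # Remove the record for the primary site.
        changes.add_record_set(new_record)     # Add the record for the DR site.
        changes.create()                       # Apply the change set.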

If you choose to manage DNS entries on-premises, you can enable VMs in Google Cloud to resolve these addresses through Cloud DNS forwarding.

Cloud DNS supports policies that configure how it responds to DNS requests. For example, you can configure DNS routing policies to steer traffic based on specific criteria, such as enabling failover to a backup configuration to provide high availability, or routing DNS requests based on their geographic location.

Cloud Interconnect

Cloud Interconnect provides ways to move information from other sources to Google Cloud. We discuss this product later under Transferring data to and from Google Cloud.

Management and monitoring

The following table provides a summary of the features in Google Cloud management and monitoring services that serve as building blocks for DR:

Cloud Status Dashboard
  • Status of Google Cloud services
Google Cloud Observability
  • Uptime monitoring
  • Alerts
  • Logging
  • Error reporting
Google Cloud Managed Service for Prometheus
  • Google-managed Prometheus solution

Cloud Status Dashboard

The Cloud Status Dashboard shows you the current availability of Google Cloud services. You can view the status on the page, and you can subscribe to an RSS feed that is updated whenever there is news about a service.

Cloud Monitoring

Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, application instrumentation, and a variety of other application components. You can configure alerting to send notifications to third-party tools such as Slack or PagerDuty to provide timely updates to administrators.

Cloud Monitoring lets you create uptime checks for publicly available endpoints and for endpoints within your VPCs. For example, you can monitor URLs, Compute Engine instances, Cloud Run revisions, and third-party resources, such as Amazon Elastic Compute Cloud (EC2) instances.
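
The following is a minimal sketch that creates an HTTPS uptime check for a public hostname by using the Python google-cloud-monitoring library; the project, hostname, path, and check interval are placeholder assumptions.

    from google.cloud import monitoring_v3

    def create_uptime_check(project_id: str, host: str) -> None:
        """Create an HTTPS uptime check for a public endpoint."""
        client = monitoring_v3.UptimeCheckServiceClient()

        config = monitoring_v3.UptimeCheckConfig()
        config.display_name = "dr-primary-endpoint"
        config.monitored_resource = {"type": "uptime_url", "labels": {"host": host}}
        config.http_check = {"path": "/healthz", "port": 443, "use_ssl": True}
        config.period = {"seconds": 60}   # Probe the endpoint every minute.
        config.timeout = {"seconds": 10}

        client.create_uptime_check_config(
            request={"parent": f"projects/{project_id}", "uptime_check_config": config}
        )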

Google Cloud Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is a Google-managed, multi-cloud, cross-project solution for Prometheus metrics. It lets you globally monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.

For more information, see Google Cloud Managed Service for Prometheus.

Cross-platform DR building blocks

When you run workloads across more than one platform, one way to reduce the operational overhead is to select tooling that works with all of the platforms you're using. This section discusses some tools and services that are platform-independent and therefore support cross-platform DR scenarios.

Infrastructure as code

By defining your infrastructure as code instead of through graphical interfaces or scripts, you can adopt declarative templating tools and automate the provisioning and configuration of infrastructure across platforms. For example, you can use Terraform and Infrastructure Manager to actuate your declarative infrastructure configuration.

Configuration management tools

For large or complex DR infrastructure, we recommend platform-agnostic software management tools such as Chef and Ansible. These tools ensure that reproducible configurations can be applied no matter where your compute workload is.

Orchestrator tools

Containers can also be considered a DR building block. Containers are a way to package services and introduce consistency across platforms.

If you work with containers, you typically use an orchestrator. Kubernetes not only manages containers within Google Cloud (using GKE), but also provides a way to orchestrate container-based workloads across multiple platforms. Google Cloud, AWS, and Microsoft Azure all provide managed versions of Kubernetes.

To distribute traffic to Kubernetes clusters running in different cloud platforms, you can use a DNS service that supports weighted records and incorporates health checking.

You also need to ensure that you can pull the image to the target environment. This means that you need to be able to access your image registry in the event of a disaster. A good option that's also platform-independent is Artifact Registry.

Data transfer

Data transfer is a critical component of cross-platform DR scenarios. Make sure that you design, implement, and test your cross-platform DR scenarios using realistic mockups of what the DR data transfer scenario calls for. We discuss data transfer scenarios in the next section.

Backup and DR Service

Backup and DR Service is a backup and DR solution for cloud workloads. It helps you recover data and resume critical business operations, and it supports several Google Cloud products and third-party databases and data storage systems.

For more information, see Backup and DR Service overview.

Patterns for DR

This section discusses some of the most common patterns for DR architectures based on the building blocks discussed earlier.

Transferring data to and from Google Cloud

An important aspect of your DR plan is how quickly data can be transferred to and from Google Cloud. This is critical if your DR plan is based on moving data from on-premises to Google Cloud or from another cloud provider to Google Cloud. This section discusses networking and Google Cloud services that can ensure good throughput.

When you are using Google Cloud as the recovery site for workloads that are on-premises or in another cloud environment, consider the following key items:

  • How do you connect to Google Cloud?
  • How much bandwidth is there between you and the interconnect provider?
  • What is the bandwidth provided by the provider directly to Google Cloud?
  • What other data will be transferred using that link?
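
As a rough worked example, copying 10 TB of backup data over a dedicated 1 Gbps link takes at least (10 × 8 × 10^12 bits) ÷ (10^9 bits per second) = 80,000 seconds, or roughly 22 hours, before protocol overhead and link contention. Numbers like this can dominate your achievable RTO and RPO, so size the link against the data volumes that you expect to move.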

For more information about transferring data to Google Cloud, see Migrate to Google Cloud: Transfer your large datasets.

Balancing image configuration and deployment speed

When you configure a machine image for deploying new instances, consider the effect that your configuration will have on the speed of deployment. There is a tradeoff between the amount of image preconfiguration, the costs of maintaining the image, and the speed of deployment. For example, if a machine image is minimally configured, the instances that use it require more time to launch, because they need to download and install dependencies. On the other hand, if your machine image is highly configured, the instances that use it launch more quickly, but you must update the image more frequently. The time taken to launch a fully operational instance has a direct correlation to your RTO.

Diagram showing 3 levels of bundling (unbundled to fully bundled) mapped against image boot time (the most bundled is the fastest to boot)

Maintaining machine image consistency across hybrid environments

If you implement a hybrid solution (on-premises-to-cloud or cloud-to-cloud), you need to find a way to maintain image consistency across production environments.

If a fully configured image is required, consider something like Packer, which can create identical machine images for multiple platforms. You can use the same scripts with platform-specific configuration files. In the case of Packer, you can put the configuration file in version control to keep track of which version is deployed in production.

As another option, you can use configuration management tools such as Chef, Puppet, Ansible, or Saltstack to configure instances with finer granularity, creating base images, minimally configured images, or fully configured images as needed.

You can also manually convert and import existing images, such as Amazon AMIs, VirtualBox images, and RAW disk images, to Compute Engine.

Implementing tiered storage

The tiered storage pattern is typically used for backups where the most recent backup is on faster storage, and you slowly migrate your older backups to lower-cost (but slower) storage. By applying this pattern, you migrate backups between buckets of different storage classes, typically from Standard to lower-cost storage classes, such as Nearline and Coldline.

To implement this pattern, you can use Object Lifecycle Management. For example, you can automatically change the storage class of objects older than a certain age to Coldline.
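
The following is a minimal sketch that applies such lifecycle rules to a backup bucket by using the Python google-cloud-storage library; the bucket name and age thresholds are placeholder assumptions.

    from google.cloud import storage

    def configure_backup_lifecycle(bucket_name: str) -> None:
        """Move aging backups to colder storage classes, then delete them."""
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)

        # Objects older than 30 days move to Nearline, older than 90 days to Coldline,
        # and objects older than 365 days are deleted.
        bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
        bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
        bucket.add_lifecycle_delete_rule(age=365)
        bucket.patch()  # Apply the updated lifecycle configuration to the bucket.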

