Architecting disaster recovery for cloud infrastructure outages
This article is part of a series that discusses disaster recovery (DR) in Google Cloud. This part discusses the process for architecting workloads using Google Cloud and building blocks that are resilient to cloud infrastructure outages.
The series consists of these parts:
- Disaster recovery planning guide
- Disaster recovery building blocks
- Disaster recovery scenarios for data
- Disaster recovery scenarios for applications
- Architecting disaster recovery for locality-restricted workloads
- Disaster recovery use cases: locality-restricted data analytic applications
- Architecting disaster recovery for cloud infrastructure outages (this document)
Introduction
As enterprises move workloads to the public cloud, they need to translate their understanding of building resilient on-premises systems to the hyperscale infrastructure of cloud providers like Google Cloud. This article maps industry-standard disaster recovery concepts, such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), to the Google Cloud infrastructure.
The guidance in this document follows one of Google's key principles for achieving extremely high service availability: plan for failure. While Google Cloud provides extremely reliable service, disasters will strike: natural disasters, fiber cuts, and complex, unpredictable infrastructure failures all cause outages. Planning for outages enables Google Cloud customers to build applications that perform predictably through these inevitable events, by making use of Google Cloud products with "built-in" DR mechanisms.
Disaster recovery is a broad topic that covers much more than infrastructure failures (for example, software bugs or data corruption), and you should have a comprehensive end-to-end plan. However, this article focuses on one part of an overall DR plan: how to design applications that are resilient to cloud infrastructure outages. Specifically, this article walks through:
- The Google Cloud infrastructure, how disaster events manifest as Google Cloud outages, and how Google Cloud is architected to minimize the frequency and scope of outages.
- An architecture planning guide that provides a framework for categorizing and designing applications based on the desired reliability outcomes.
- A detailed list of select Google Cloud products that offer built-in DR capabilities which you may want to use in your application.
For further details on general DR planning and on using Google Cloud as a component in your on-premises DR strategy, see the disaster recovery planning guide. Also, while high availability is closely related to disaster recovery, it is not covered in this article. For further details on architecting for high availability, see the Well-Architected Framework.

A note on terminology: this article refers to availability when discussing the ability for a product to be meaningfully accessed and used over time, while reliability refers to a set of attributes that includes availability as well as durability and correctness.
How Google Cloud is designed for resilience
Google data centers
Traditional data centers rely on maximizing the availability of individual components. In the cloud, scale allows operators like Google to spread services across many components using virtualization technologies and thus exceed traditional component reliability. This means you can shift your reliability architecture mindset away from the myriad details you once worried about on-premises. Rather than worry about the various failure modes of components, such as cooling and power delivery, you can plan around Google Cloud products and their stated reliability metrics. These metrics reflect the aggregate outage risk of the entire underlying infrastructure. This frees you to focus much more on application design, deployment, and operations rather than infrastructure management.
Google designs its infrastructure to meet aggressive availability targets based on our extensive experience building and running modern data centers. Google is a world leader in data center design. From power to cooling to networks, each data center technology has its own redundancies and mitigations, including FMEA plans. Google's data centers are built in a way that balances these many different risks and presents to customers a consistent expected level of availability for Google Cloud products. Google uses its experience to model the availability of the overall physical and logical system architecture to ensure that the data center design meets expectations. Google's engineers go to great lengths operationally to help ensure those expectations are met. Actual measured availability normally exceeds our design targets by a comfortable margin.

By distilling all of these data center risks and mitigations into user-facing products, Google Cloud relieves you of those design and operational responsibilities. Instead, you can focus on the reliability designed into Google Cloud regions and zones.
Regions and zones
Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources. For more information about region-specific considerations, see Geography and regions.
Google Cloud products are divided into zonal resources, regional resources, ormulti-regional resources.
Zonal resources are hosted within a single zone. A service interruption in that zone can affect all of the resources in that zone. For example, a Compute Engine instance runs in a single, specified zone; if a hardware failure interrupts service in that zone, that Compute Engine instance is unavailable for the duration of the interruption.

Regional resources are redundantly deployed across multiple zones within a region. This gives them higher reliability than zonal resources.

Multi-regional resources are distributed within and across regions. In general, multi-regional resources have higher reliability than regional resources. However, at this level products must optimize availability, performance, and resource efficiency. As a result, it is important to understand the tradeoffs made by each multi-regional product you decide to use. These tradeoffs are documented on a product-specific basis later in this document.

How to leverage zones and regions to achieve reliability
Google SREs manage and scale highly reliable, global user products like Gmail and Search through a variety of techniques and technologies that seamlessly leverage computing infrastructure around the world. This includes redirecting traffic away from unavailable locations using global load balancing, running multiple replicas in many locations around the planet, and replicating data across locations. These same capabilities are available to Google Cloud customers through products like Cloud Load Balancing, Google Kubernetes Engine (GKE), and Spanner.

Google Cloud generally designs products to deliver the following levels of availability for zones and regions:
| Resource | Examples | Availability design goal | Implied downtime |
|---|---|---|---|
| Zonal | Compute Engine, Persistent Disk | 99.9% | 8.75 hours / year |
| Regional | Regional Cloud Storage, Replicated Persistent Disk, Regional GKE | 99.99% | 52 minutes / year |
Compare the Google Cloud availability design goals against your acceptable level of downtime to identify the appropriate Google Cloud resources. While traditional designs focus on improving component-level availability to improve the resulting application availability, cloud models focus instead on composition of components to achieve this goal. Many products within Google Cloud use this technique. For example, Spanner offers a multi-region database that composes multiple regions in order to deliver 99.999% availability.

Composition is important because without it, your application availability cannot exceed that of the Google Cloud products you use; in fact, unless your application never fails, it will have lower availability than the underlying Google Cloud products. The remainder of this section shows generally how you can use a composition of zonal and regional products to achieve higher application availability than a single zone or region would provide. The next section gives a practical guide for applying these principles to your applications.
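The implied-downtime figures and the effect of composition can be checked with simple arithmetic. The sketch below assumes independent failures, which real zones are engineered to approximate but can never perfectly achieve:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def implied_downtime_hours(availability: float) -> float:
    """Annual downtime implied by an availability design goal."""
    return (1 - availability) * HOURS_PER_YEAR

def composed_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a composition of independent replicas:
    the service is down only if every replica is down at once."""
    return 1 - (1 - replica_availability) ** replicas

# A 99.9% zonal design goal implies roughly 8.76 hours of downtime per year.
print(round(implied_downtime_hours(0.999), 2))  # 8.76

# Composing two independent 99.9% zones yields about 99.9999% availability,
# assuming no correlated failures across the zones.
print(composed_availability(0.999, 2))
```

This is why a regional product built from several zones can offer a design goal (99.99%) well above that of a single zone (99.9%), even before accounting for Google's internal mitigations.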
Planning for zone outage scopes
Infrastructure failures usually cause service outages in a single zone. Within a region, zones are designed to minimize the risk of correlated failures with other zones, and a service interruption in one zone usually does not affect service from another zone in the same region. An outage scoped to a zone doesn't necessarily mean that the entire zone is unavailable; it just defines the boundary of the incident. It is possible for a zone outage to have no tangible effect on your particular resources in that zone.

Although it is rarer, it is also critical to note that multiple zones within a single region can still experience a correlated outage at some point. When two or more zones experience an outage, the regional outage scope strategy below applies.
Regional resources are designed to be resistant to zone outages by delivering service from a composition of multiple zones. If one of the zones backing a regional resource is interrupted, the resource automatically makes itself available from another zone. Carefully check the product capability description in the appendix for further details.

Google Cloud offers only a few zonal resources, namely Compute Engine virtual machines (VMs) and Persistent Disk. If you plan to use zonal resources, you'll need to perform your own resource composition by designing, building, and testing failover and recovery between zonal resources located in multiple zones. Some strategies include:

- Route your traffic quickly to virtual machines in another zone using Cloud Load Balancing when a health check determines that a zone is experiencing issues.
- Use Compute Engine instance templates and managed instance groups to run and scale identical VM instances in multiple zones.
- Use a regional Persistent Disk to synchronously replicate data to another zone in a region. See High availability options using regional PDs for more details.
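The first strategy, routing traffic away from an unhealthy zone, reduces to health-check-driven backend selection. The sketch below is illustrative only (Cloud Load Balancing performs this automatically based on its own health checks), and the zone names are hypothetical:

```python
def pick_backend_zone(zone_health: dict[str, bool], preferred: list[str]) -> str:
    """Return the first healthy zone in preference order.

    Illustrative only: in practice Cloud Load Balancing performs
    health checking and failover for you.
    """
    for zone in preferred:
        if zone_health.get(zone, False):
            return zone
    raise RuntimeError("no healthy zone available")

# Hypothetical zones: traffic fails over when the preferred zone is unhealthy.
health = {"us-central1-a": False, "us-central1-b": True}
print(pick_backend_zone(health, ["us-central1-a", "us-central1-b"]))  # us-central1-b
```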
Planning for regional outage scopes
A regional outage is a service interruption affecting more than one zone in a single region. These are larger-scale, less frequent outages and can be caused by natural disasters or large-scale infrastructure failures.

For a regional product that is designed to provide 99.99% availability, an outage can still translate to nearly an hour of downtime for a particular product every year. Therefore, your critical applications may need a multi-region DR plan in place if this outage duration is unacceptable.

Multi-regional resources are designed to be resistant to region outages by delivering service from multiple regions. As described above, multi-region products trade off between latency, consistency, and cost. The most common tradeoff is between synchronous and asynchronous data replication. Asynchronous replication offers lower latency at the cost of a risk of data loss during an outage. It is therefore important to check the product capability description in the appendix for further details.
Note: In BigQuery, a multi-region location does not provide cross-region replication or regional redundancy. Data is stored in a single region within the geographic location.

If you want to use regional resources and remain resilient to regional outages, then you must perform your own resource composition by designing, building, and testing failover and recovery between regional resources located in multiple regions. In addition to the zonal strategies above, which you can apply across regions as well, consider:
- Replicating data from regional resources to a secondary region, to a multi-regional storage option such as Cloud Storage, or to a hybrid cloud option such as GKE and Google Distributed Cloud.
- After you have a regional outage mitigation in place, testing it regularly. There are few things worse than thinking you're resistant to a single-region outage, only to find that this isn't the case when it happens for real.
Google Cloud resilience and availability approach
Google Cloud regularly exceeds its availability design targets, but you should not assume that this strong past performance is the minimum availability you can design for. Instead, select Google Cloud dependencies whose designed-for targets exceed your application's intended reliability, such that your application downtime plus the Google Cloud downtime delivers the outcome you are seeking.

A well-designed system can answer the question: "What happens when a zone or region has a 1, 5, 10, or 30 minute outage?" This should be considered at many layers, including:
- What will my customers experience during an outage?
- How will I detect that an outage is happening?
- What happens to my application during an outage?
- What happens to my data during an outage?
- What happens to my other applications during an outage, given cross-dependencies?
- What do I need to do to recover after an outage is resolved? Who does it?
- Who do I need to notify about an outage, within what time period?
Step-by-step guide to designing disaster recovery for applications in Google Cloud
The previous sections covered how Google builds cloud infrastructure and some approaches for dealing with zonal and regional outages.

This section helps you develop a framework for applying the principle of composition to your applications based on your desired reliability outcomes.
Customer applications in Google Cloud that target disaster recovery objectives such as RTO and RPO must be architected so that business-critical operations, subject to RTO/RPO, depend only on data plane components that are responsible for continuous processing of operations for the service. In other words, such customer business-critical operations must not depend on management plane operations, which manage configuration state and push configuration to the control plane and the data plane.

For example, Google Cloud customers who intend to achieve RTO for business-critical operations should not depend on a VM-creation API or on the update of an IAM permission.
Step 1: Gather existing requirements
The first step is to define the availability requirements for your applications. Most companies already have some level of design guidance in this space, which may be internally developed or derived from regulations or other legal requirements. This design guidance is normally codified in two key metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). In business terms, RTO translates as "How long after a disaster before I'm up and running?" RPO translates as "How much data can I afford to lose in the event of a disaster?"

Historically, enterprises have defined RTO and RPO requirements for a wide range of disaster events, from component failures to earthquakes. This made sense in the on-premises world, where planners had to map the RTO/RPO requirements through the entire software and hardware stack. In the cloud, you no longer need to define your requirements with such detail because the provider takes care of that. Instead, you can define your RTO and RPO requirements in terms of the scope of loss (entire zones or regions) without being specific about the underlying reasons. For Google Cloud this simplifies your requirement gathering to three scenarios: a zonal outage, a regional outage, or the extremely unlikely outage of multiple regions.

Recognizing that not every application has equal criticality, most customers categorize their applications into criticality tiers against which a specific RTO/RPO requirement can be applied. Taken together, RTO/RPO and application criticality streamline the process of architecting a given application by answering:
- Does the application need to run in multiple zones in the same region, or in multiple zones in multiple regions?
- On which Google Cloud products can the application depend?
This is an example of the output of the requirements gathering exercise:
RTO and RPO by Application Criticality for Example Organization Co:
| Application criticality | % of Apps | Example apps | Zone outage | Region outage |
|---|---|---|---|---|
| Tier 1 (most important) | 5% | Typically global or external customer-facing applications, such as real-time payments and eCommerce storefronts. | RTO Zero RPO Zero | RTO Zero RPO Zero |
| Tier 2 | 35% | Typically regional applications or important internal applications, such as CRM or ERP. | RTO 15mins RPO 15mins | RTO 1hr RPO 1hr |
| Tier 3 (least important) | 60% | Typically team or departmental applications, such as back office, leave booking, internal travel, accounting, and HR. | RTO 1hr RPO 1hr | RTO 12hrs RPO 12hrs |
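A requirements table like this one can also be captured as data that internal tooling or architecture reviews consult. The sketch below encodes the hypothetical Example Organization Co tiers above, with RTO/RPO expressed in minutes:

```python
# RTO/RPO in minutes per criticality tier and outage scope, taken from
# the hypothetical Example Organization Co table (not a Google standard).
TIER_REQUIREMENTS = {
    "tier1": {"zone": {"rto": 0, "rpo": 0}, "region": {"rto": 0, "rpo": 0}},
    "tier2": {"zone": {"rto": 15, "rpo": 15}, "region": {"rto": 60, "rpo": 60}},
    "tier3": {"zone": {"rto": 60, "rpo": 60}, "region": {"rto": 720, "rpo": 720}},
}

def requirement(tier: str, scope: str) -> dict:
    """Look up the RTO/RPO requirement for a tier and outage scope."""
    return TIER_REQUIREMENTS[tier][scope]

print(requirement("tier2", "region"))  # {'rto': 60, 'rpo': 60}
```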
Step 2: Capability mapping to available products
The second step is to understand the resilience capabilities of the Google Cloud products that your applications will be using. Most companies review the relevant product information and then add guidance on how to modify their architectures to accommodate any gaps between the product capabilities and their resilience requirements. This section covers some common areas and recommendations around data and application limitations in this space.

As mentioned previously, Google's DR-enabled products broadly cater for two types of outage scopes: regional and zonal. Partial outages should be planned for in the same way as full outages when it comes to DR. This gives an initial high-level matrix of which products are suitable for each scenario by default:
Google Cloud Product General Capabilities
(see Appendix for specific product capabilities)

| Outage scope | All Google Cloud products | Regional Google Cloud products with automatic replication across zones | Multi-regional or global Google Cloud products with automatic replication across regions |
|---|---|---|---|
| Failure of a component within a zone | Covered* | Covered | Covered |
| Zone outage | Not covered | Covered | Covered |
| Region outage | Not covered | Not covered | Covered |
* All Google Cloud products are resilient to component failure, except in specific cases noted in product documentation. These are typically scenarios where the product offers direct access or static mapping to a piece of specialty hardware such as memory or solid-state disks (SSDs).
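The matrix reduces to a simple rule: a product covers an outage scope when its deployment scope is at least one level broader than the scope of the outage. A sketch of that rule:

```python
# Deployment scopes, ordered by the outage scope they survive.
SCOPES = ["zonal", "regional", "multi-regional"]

def covers(product_scope: str, outage_scope: str) -> bool:
    """A product covers component failures always (speciality-hardware
    exceptions aside), zone outages if it is at least regional, and
    region outages only if it is multi-regional or global."""
    if outage_scope == "component":
        return True
    required = {"zone": "regional", "region": "multi-regional"}[outage_scope]
    return SCOPES.index(product_scope) >= SCOPES.index(required)

assert covers("zonal", "component")
assert not covers("zonal", "zone")
assert covers("regional", "zone") and not covers("regional", "region")
assert covers("multi-regional", "region")
```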
How RPO limits product choices
In most cloud deployments, data integrity is the most architecturally significant aspect to consider for a service. At least some applications have an RPO requirement of zero, meaning there should be no data loss in the event of an outage. This typically requires data to be synchronously replicated to another zone or region. Synchronous replication has cost and latency tradeoffs, so while many Google Cloud products provide synchronous replication across zones, only a few provide it across regions. This cost and complexity tradeoff means that it's not unusual for different types of data within an application to have different RPO values.

For data with an RPO greater than zero, applications can take advantage of asynchronous replication. Asynchronous replication is acceptable when lost data can either be recreated easily or recovered from a golden source of data if needed. It can also be a reasonable choice when a small amount of data loss is an acceptable tradeoff in the context of expected zonal and regional outage durations. It is also relevant that during a transient outage, data written to the affected location but not yet replicated to another location generally becomes available after the outage is resolved. This means that the risk of permanent data loss is lower than the risk of losing data access during an outage.

Key actions: Establish whether you definitely need RPO zero, and if so, whether you can do this for a subset of your data; this dramatically increases the range of DR-enabled services available to you. In Google Cloud, achieving RPO zero means using predominantly regional products for your application, which by default are resilient to zone-scale, but not region-scale, outages.
How RTO limits product choices
One of the primary benefits of cloud computing is the ability to deploy infrastructure on demand; however, this isn't the same as instantaneous deployment. The RTO value for your application needs to accommodate the combined RTO of the Google Cloud products your application uses and any actions your engineers or SREs must take to restart your VMs or application components. An RTO measured in minutes means designing an application that recovers automatically from a disaster without human intervention, or with minimal steps such as pushing a button to fail over. The cost and complexity of this kind of system has historically been very high, but Google Cloud products like load balancers and instance groups make this design both much more affordable and simpler. Therefore, you should consider automated failover and recovery for most applications. Be aware that designing a system for this kind of hot failover across regions is both complicated and expensive; only a very small fraction of critical services warrant this capability.

Most applications will have an RTO of between an hour and a day, which allows for a warm failover in a disaster scenario: some components of the application, such as databases, run all the time in standby mode, while others, such as web servers, are scaled out in the event of an actual disaster. For these applications, you should strongly consider automation for the scale-out events. Services with an RTO over a day are the lowest criticality and can often be recovered from a backup or recreated from scratch.
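The relationship between RTO and failover design described above can be summarized as a rule of thumb. The thresholds below are illustrative, not product limits:

```python
def recovery_strategy(rto_hours: float) -> str:
    """Map an RTO to the class of failover design it typically implies.

    Thresholds are illustrative rules of thumb drawn from the guidance
    above, not hard boundaries.
    """
    if rto_hours < 1:
        return "hot: automated failover with no human intervention"
    if rto_hours <= 24:
        return "warm: standby components plus automated scale-out"
    return "cold: restore from backup or recreate from scratch"

print(recovery_strategy(0.25))  # "hot: automated failover with no human intervention"
print(recovery_strategy(4))    # "warm: standby components plus automated scale-out"
print(recovery_strategy(72))   # "cold: restore from backup or recreate from scratch"
```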
Key actions: Establish whether you definitely need an RTO of (near) zero for regional failover, and if so, whether you can do this for a subset of your services. This changes the cost of running and maintaining your service.
Step 3: Develop your own reference architectures and guides
The final recommended step is building your own company-specific architecture patterns to help your teams standardize their approach to disaster recovery. Most Google Cloud customers produce a guide for their development teams that matches their individual business resilience expectations to the two major categories of outage scenarios on Google Cloud. This allows teams to easily categorize which DR-enabled products are suitable for each criticality level.
Create product guidelines
Looking again at the example RTO/RPO table from above, the following hypothetical guide lists which products would be allowed by default for each criticality tier. Note that where certain products have been identified as not suitable by default, you can always add your own replication and failover mechanisms to enable cross-zone or cross-region synchronization, but this exercise is beyond the scope of this article. The tables also link to more information about each product to help you understand their capabilities with respect to managing zone or region outages.
Sample Architecture Patterns for Example Organization Co -- Zone Outage Resilience

Does the product meet zone outage requirements for Example Organization (with appropriate product configuration)?

| Google Cloud Product | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Compute Engine | No | No | No |
| Dataflow | No | No | No |
| BigQuery | No | No | Yes |
| GKE | Yes | Yes | Yes |
| Cloud Storage | Yes | Yes | Yes |
| Cloud SQL | No | Yes | Yes |
| Spanner | Yes | Yes | Yes |
| Cloud Load Balancing | Yes | Yes | Yes |
This table is an example only based on hypothetical tiers shown above.
Sample Architecture Patterns for Example Organization Co -- Region Outage Resilience

Does the product meet region outage requirements for Example Organization (with appropriate product configuration)?

| Google Cloud Product | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Compute Engine | Yes | Yes | Yes |
| Dataflow | No | No | No |
| BigQuery | No | No | Yes |
| GKE | Yes | Yes | Yes |
| Cloud Storage | No | No | No |
| Cloud SQL | No | Yes | Yes |
| Spanner | Yes | Yes | Yes |
| Cloud Load Balancing | Yes | Yes | Yes |
This table is an example only based on hypothetical tiers shown above.
To show how these products would be used, the following sections walk through some reference architectures for each of the hypothetical application criticality levels. These are deliberately high-level descriptions to illustrate the key architectural decisions, and aren't representative of a complete solution design.
Example tier 3 architecture
| Application criticality | Zone outage | Region outage |
|---|---|---|
| Tier 3 (least important) | RTO 12 hours RPO 24 hours | RTO 28 days RPO 24 hours |

(Greyed-out icons indicate infrastructure to be enabled for recovery)
This architecture describes a traditional client/server application: internal users connect to an application running on a compute instance that is backed by a database for persistent storage.

It's important to note that this architecture supports better RTO and RPO values than required. However, you should also consider eliminating additional manual steps when they could prove costly or unreliable. For example, recovering a database from a nightly backup could support the RPO of 24 hours, but this would usually need a skilled individual such as a database administrator, who might be unavailable, especially if multiple services were impacted at the same time. With Google Cloud's on-demand infrastructure you can build this capability without making a major cost tradeoff, and so this architecture uses Cloud SQL HA rather than a manual backup/restore for zonal outages.
Key architectural decisions for zone outage - RTO of 12 hours and RPO of 24 hours:

- An internal load balancer is used to provide a scalable access point for users, which allows for automatic failover to another zone. Even though the RTO is 12 hours, manual changes to IP addresses or even DNS updates can take longer than expected.
- A regional managed instance group is configured with multiple zones but minimal resources. This optimizes for cost but still allows for virtual machines to be quickly scaled out in the backup zone.
- A high availability Cloud SQL configuration provides for automatic failover to another zone. Databases are significantly harder to recreate and restore compared to the Compute Engine virtual machines.
Key architectural decisions for region outage - RTO of 28 days and RPO of 24 hours:

- A load balancer would be constructed in region 2 only in the event of a regional outage. Cloud DNS is used to provide an orchestrated but manual regional failover capability, since the infrastructure in region 2 would only be made available in the event of a region outage.
- A new managed instance group would be constructed only in the event of a region outage. This optimizes for cost and is unlikely to be invoked given the short length of most regional outages. Note that for simplicity the diagram doesn't show the associated tooling needed to redeploy, or the copying of the Compute Engine images needed.
- A new Cloud SQL instance would be recreated and the data restored from a backup. Again, the risk of an extended outage to a region is extremely low, so this is another cost optimization trade-off.
- Multi-regional Cloud Storage is used to store these backups. This provides automatic zone and regional resilience within the RTO and RPO.
Example tier 2 architecture
| Application criticality | Zone outage | Region outage |
|---|---|---|
| Tier 2 | RTO 4 hours RPO zero | RTO 24 hours RPO 4 hours |

This architecture describes a data warehouse with internal users connecting to a compute instance visualization layer, and a data ingest and transformation layer that populates the backend data warehouse.

Some individual components of this architecture do not directly support the RPO required for their tier. However, because of how they are used together, the overall service does meet the RPO. In this case, because Dataflow is a zonal product, follow the recommendations for high availability design to help prevent data loss during an outage. However, the Cloud Storage layer is the golden source of this data and supports an RPO of zero. As a result, you can re-ingest any lost data into BigQuery by using zone b in the event of an outage in zone a.
Key architectural decisions for zone outage - RTO of 4 hours and RPO of zero:

- A load balancer is used to provide a scalable access point for users, which allows for automatic failover to another zone. Even though the RTO is 4 hours, manual changes to IP addresses or even DNS updates can take longer than expected.
- A regional managed instance group for the data visualization compute layer is configured with multiple zones but minimal resources. This optimizes for cost but still allows for virtual machines to be quickly scaled out.
- Regional Cloud Storage is used as a staging layer for the initial ingest of data, providing automatic zone resilience.
- Dataflow is used to extract data from Cloud Storage and transform it before loading it into BigQuery. In the event of a zone outage, this is a stateless process that can be restarted in another zone.
- BigQuery provides the data warehouse backend for the data visualization front end. In the event of a zone outage, any data lost would be re-ingested from Cloud Storage.
Key architectural decisions for region outage - RTO of 24 hours and RPO of 4 hours:

- A load balancer in each region is used to provide a scalable access point for users. Cloud DNS is used to provide an orchestrated but manual regional failover capability, since the infrastructure in region 2 would only be made available in the event of a region outage.
- A regional managed instance group for the data visualization compute layer is configured with multiple zones but minimal resources. This isn't accessible until the load balancer is reconfigured, but doesn't require manual intervention otherwise.
- Regional Cloud Storage is used as a staging layer for the initial ingest of data. This is being loaded at the same time into both regions to meet the RPO requirements.
- Dataflow is used to extract data from Cloud Storage and transform it before loading it into BigQuery. In the event of a region outage, this would populate BigQuery with the latest data from Cloud Storage.
- BigQuery provides the data warehouse backend. Under normal operations this would be intermittently refreshed. In the event of a region outage, the latest data would be re-ingested via Dataflow from Cloud Storage.
Example tier 1 architecture
| Application criticality | Zone outage | Region outage |
|---|---|---|
| Tier 1 (most important) | RTO zero RPO zero | RTO 4 hours RPO 1 hour |

This architecture describes a mobile app backend infrastructure with external users connecting to a set of microservices running in GKE. Spanner provides the backend data storage layer for real-time data, and historical data is streamed to a BigQuery data lake in each region.

Again, some individual components of this architecture do not directly support the RPO required for their tier, but because of how they are used together, the overall service does. In this case, BigQuery is being used for analytic queries. Each region is fed simultaneously from Spanner.
Key architectural decisions for zone outage - RTO of zero and RPO of zero:
- A load balancer is used to provide a scalable access point for users, which allows for automatic failover to another zone.
- A regional GKE cluster, configured with multiple zones, is used for the application layer. This accomplishes the RTO of zero within each region.
- Multi-region Spanner is used as a data persistence layer, providing automatic zone data resilience and transaction consistency.
- BigQuery provides the analytics capability for the application. Each region is independently fed data from Spanner, and independently accessed by the application.
Key architectural decisions for region outage - RTO of 4 hours and RPO of 1 hour:
- A load balancer is used to provide a scalable access point for users, which allows for automatic failover to another region.
- A regional GKE cluster, configured with multiple zones, is used for the application layer. In the event of a region outage, the cluster in the alternate region automatically scales to take on the additional processing load.
- Multi-region Spanner is used as the data persistence layer, providing automatic regional data resilience and transaction consistency. This is the key component in achieving the cross-region RPO of 1 hour.
- BigQuery provides the analytics capability for the application. Each region is independently fed data from Spanner, and independently accessed by the application. This architecture compensates for the BigQuery component, allowing it to match the overall application requirements.
Appendix: Product reference
This section describes the architecture and DR capabilities of Google Cloud products that are most commonly used in customer applications and that can be easily leveraged to achieve your DR requirements.
Common themes
Many Google Cloud products offer regional or multi-regional configurations. Regional products are resilient to zone outages, and multi-region and global products are resilient to region outages. In general, this means that during an outage, your application experiences minimal disruption. Google achieves these outcomes through a few common architectural approaches, which mirror the architectural guidance above.
Redundant deployment: The application backends and data storage are deployed across multiple zones within a region, and multiple regions within a multi-region location. For more information about region-specific considerations, see Geography and regions.
Data replication: Products use either synchronous or asynchronous replication across the redundant locations.
Synchronous replication means that when your application makes an API call to create or modify data stored by the product, it receives a successful response only once the product has written the data to multiple locations. Synchronous replication ensures that you do not lose access to any of your data during a Google Cloud infrastructure outage, because all of your data is available in one of the available backend locations.
Although this technique provides maximum data protection, it can have tradeoffs in terms of latency and performance. Multi-region products using synchronous replication experience this tradeoff most significantly, typically on the order of tens or hundreds of milliseconds of added latency.
Asynchronous replication means that when your application makes an API call to create or modify data stored by the product, it receives a successful response once the product has written the data to a single location. After your write request, the product replicates your data to additional locations.
This technique provides lower latency and higher throughput at the API than synchronous replication, but at the expense of data protection. If the location in which you have written data suffers an outage before replication is complete, you lose access to that data until the location outage is resolved.
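The practical difference between the two modes can be illustrated with a toy model (pure Python, for illustration only; `Replica`, `write_sync`, `write_async`, and `replicate` are invented names for this sketch, not Google Cloud APIs):

```python
class Replica:
    """A toy storage location holding committed key-value pairs."""
    def __init__(self):
        self.data = {}

def write_sync(replicas, key, value):
    """Synchronous replication: acknowledge only after every location
    has stored the value, so no acknowledged write can be lost to a
    single-location outage."""
    for replica in replicas:
        replica.data[key] = value
    return "ack"

def write_async(primary, pending, key, value):
    """Asynchronous replication: acknowledge after the primary write;
    replication to other locations happens later, so the queued writes
    define the RPO window."""
    primary.data[key] = value
    pending.append((key, value))  # not yet on the secondaries
    return "ack"

def replicate(pending, secondaries):
    """Apply queued writes to the secondaries. Writes still queued when
    the primary's location fails are unreadable until the outage is
    resolved."""
    while pending:
        key, value = pending.pop(0)
        for replica in secondaries:
            replica.data[key] = value
```

With synchronous writes, both locations hold the value as soon as the acknowledgement arrives; with asynchronous writes, there is a window where only the primary holds it.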
Handling outages with load balancing: Google Cloud uses software load balancing to route requests to the appropriate application backends. Compared to other approaches like DNS load balancing, this approach reduces the system response time to an outage. When a Google Cloud location outage occurs, the load balancer quickly detects that the backend deployed in that location has become "unhealthy" and directs all requests to a backend in an alternate location. This enables the product to continue serving your application's requests during a location outage. When the location outage is resolved, the load balancer detects the availability of the product backends in that location, and resumes sending traffic there.
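A minimal sketch of this failover behavior (illustrative only; the region names and the `route` helper are invented for the example, and real health checking is far more involved):

```python
def route(backends, health):
    """Return the first healthy backend from a nearest-first list.

    `health` maps backend name -> bool, standing in for the load
    balancer's continuous health checks. When a location fails, its
    backend turns unhealthy and traffic shifts to the next location;
    when it recovers, traffic shifts back automatically.
    """
    for backend in backends:
        if health.get(backend, False):
            return backend
    raise RuntimeError("no healthy backend available")

backends = ["us-central1", "us-east1"]            # nearest first
health = {"us-central1": True, "us-east1": True}
assert route(backends, health) == "us-central1"   # normal operation

health["us-central1"] = False                     # location outage detected
assert route(backends, health) == "us-east1"      # automatic failover

health["us-central1"] = True                      # outage resolved
assert route(backends, health) == "us-central1"   # traffic shifts back
```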
Access Context Manager
Access Context Manager lets enterprises configure access levels that map to a policy defined on request attributes. Policies are mirrored regionally.
In the case of a zonal outage, requests to unavailable zones are automatically and transparently served from other available zones in the region.
In the case of a regional outage, policy calculations from the affected region are unavailable until the region becomes available again.
Access Transparency
Access Transparency lets Google Cloud organization administrators define fine-grained, attribute-based access control for projects and resources in Google Cloud. Occasionally, Google must access customer data for administrative purposes. When we access customer data, Access Transparency provides access logs to affected Google Cloud customers. These Access Transparency logs help ensure Google's commitment to data security and transparency in data handling.
Access Transparency is resilient against zonal and regional outages. If a zonal or regional outage happens, Access Transparency continues to process administrative access logs in another zone or region.
AlloyDB for PostgreSQL
AlloyDB for PostgreSQL is a fully managed, PostgreSQL-compatible database service. AlloyDB for PostgreSQL offers high availability in a region through its primary instance's redundant nodes, which are located in two different zones of the region. The primary instance maintains regional availability by triggering an automatic failover to the standby zone if the active zone encounters an issue. Regional storage guarantees data durability in the event of a single-zone loss.
As a further method of disaster recovery, AlloyDB for PostgreSQL uses cross-region replication to provide disaster recovery capabilities by asynchronously replicating your primary cluster's data into secondary clusters that are located in separate Google Cloud regions.
Zonal outage: During normal operation, only one of the two nodes of a high-availability primary instance is active, and it serves all data writes. This active node stores the data in the cluster's separate, regional storage layer.
AlloyDB for PostgreSQL automatically detects zone-level failures and triggers a failover to restore database availability. During failover, AlloyDB for PostgreSQL starts the database on the standby node, which is already provisioned in a different zone. New database connections are automatically routed to this zone.
From the perspective of a client application, a zonal outage resembles a temporary interruption of network connectivity. After the failover completes, a client can reconnect to the instance at the same address, using the same credentials, with no loss of data.
Regional outage: Cross-region replication uses asynchronous replication, which allows the primary instance to commit transactions before they are committed on replicas. The time difference between when a transaction is committed on the primary instance and when it is committed on the replica is known as replication lag. The time difference between when the primary generates the write-ahead log (WAL) and when the WAL reaches the replica is known as flush lag. Replication lag and flush lag depend on database instance configuration and on the user-generated workload.
In the event of a regional outage, you can promote secondary clusters in a different region to a writeable, standalone primary cluster. This promoted cluster no longer replicates the data from the original primary cluster that it was formerly associated with. Due to flush lag, some data loss might occur because there could be transactions on the original primary that were not propagated to the secondary cluster.
Cross-region replication RPO is affected by both the CPU utilization of the primary cluster and the physical distance between the primary cluster's region and the secondary cluster's region. To optimize RPO, we recommend testing your workload with a configuration that includes a replica to establish a safe transactions per second (TPS) limit, which is the highest sustained TPS that doesn't accumulate flush lag. If your workload exceeds the safe TPS limit, flush lag accumulates, which can affect RPO. To limit network lag, pick region pairs within the same continent.
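As a rough mental model of how exceeding the safe TPS limit affects RPO (a simplified sketch; the `flush_lag_tx` function and the numbers are invented for illustration, and real flush lag also depends on CPU utilization and inter-region distance):

```python
def flush_lag_tx(duration_s, workload_tps, safe_tps):
    """Toy model of flush-lag accumulation: while the workload runs
    above the replica's safe TPS limit, unreplicated transactions
    accumulate at the difference between the two rates. At or below
    the limit, no lag accumulates."""
    excess_tps = max(workload_tps - safe_tps, 0)
    return excess_tps * duration_s

# At 1,200 TPS against a measured safe limit of 1,000 TPS, one
# minute of sustained load leaves 12,000 transactions unflushed.
assert flush_lag_tx(60, 1200, 1000) == 12000
assert flush_lag_tx(60, 800, 1000) == 0   # within the safe limit
```

Any transactions still unflushed when the primary region fails are the ones at risk, which is why sustained load above the safe TPS limit directly worsens the achievable RPO.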
For more information about monitoring network lag and other AlloyDB for PostgreSQL metrics, see Monitor instances.
Anti Money Laundering AI
Anti Money Laundering AI (AML AI) provides an API to help global financial institutions more effectively and efficiently detect money laundering. AML AI is a regional offering, meaning customers can choose the region, but not the zones that make up a region. Data and traffic are automatically load balanced across zones within a region. The operations (for example, to create a pipeline or run a prediction) are automatically scaled in the background and are load balanced across zones as necessary.
Zonal outage: AML AI stores data for its resources regionally, replicated in a synchronous manner. When a long-running operation finishes successfully, the resources can be relied on regardless of zonal failures. Processing is also replicated across zones, but this replication aims at load balancing rather than high availability, so a zonal failure during an operation can result in an operation failure. If that happens, retrying the operation can address the issue. During a zonal outage, processing times might be affected.
Regional outage: Customers choose the Google Cloud region they want to create their AML AI resources in. Data is never replicated across regions. Customer traffic is never routed to a different region by AML AI. In the case of a regional failure, AML AI becomes available again as soon as the outage is resolved.
API keys
API keys provides scalable API key resource management for a project. API keys is a global service, meaning that keys are visible and accessible from any Google Cloud location. Its data and metadata are stored redundantly across multiple zones and regions.
API keys is resilient to both zonal and regional outages. In the case of a zonal or regional outage, API keys continues to serve requests from another zone in the same or a different region.
For more information about API keys, see the API keys API overview.
Apigee
Apigee provides a secure, scalable, and reliable platform for developing and managing APIs. Apigee offers both single-region and multi-region deployments.
Zonal outage: Customer runtime data is replicated across multiple availability zones. Therefore, a single-zone outage does not impact Apigee.
Regional outage: For single-region Apigee instances, if a region goes down, Apigee instances are unavailable in that region and can't be restored to different regions. For multi-region Apigee instances, the data is replicated across all of the regions asynchronously. Therefore, failure of one region doesn't stop traffic entirely. However, you might not be able to access uncommitted data in the failed region. You can divert the traffic away from unhealthy regions. To achieve automatic traffic failover, you can configure network routing using managed instance groups (MIGs).
AutoML Translation
AutoML Translation is a machine translation service that lets you import your own data (sentence pairs) to train custom models for your domain-specific needs.
Zonal outage: AutoML Translation has active compute servers in multiple zones and regions. It also supports synchronous data replication across zones within regions. These features help AutoML Translation achieve instantaneous failover without any data loss for zonal failures, and without requiring any customer input or adjustments.
Regional outage: In the case of a regional failure, AutoML Translation is not available.
AutoML Vision
AutoML Vision is part of Vertex AI. It offers a unified framework to create datasets, import data, train models, and serve models for online prediction and batch prediction.
AutoML Vision is a regional offering. Customers can choose which region they want to launch a job from, but they can't choose the specific zones within that region. The service automatically load-balances workloads across different zones within the region.
Zonal outage: AutoML Vision stores metadata for the jobs regionally, and writes synchronously across zones within the region. The jobs are launched in a specific zone, as selected by Cloud Load Balancing.
AutoML Vision training jobs: A zonal outage causes any running jobs to fail, and the job status updates to failed. If a job fails, retry it immediately. The new job is routed to an available zone.
AutoML Vision batch prediction jobs: Batch prediction is built on top of Vertex AI batch prediction. When a zonal outage occurs, the service automatically retries the job by routing it to available zones. If multiple retries fail, the job status updates to failed. Subsequent user requests to run the job are routed to an available zone.
Regional outage: Customers choose the Google Cloud region they want to run their jobs in. Data is never replicated across regions. In a regional failure, the AutoML Vision service is unavailable in that region. It becomes available again when the outage resolves. We recommend that customers use multiple regions to run their jobs. If a regional outage occurs, direct jobs to a different available region.
Batch
Batch is a fully managed service to queue, schedule, and execute batch jobs on Google Cloud. Batch settings are defined at the region level. Customers must choose a region to submit their batch jobs, not a zone in a region. When a job is submitted, Batch synchronously writes customer data to multiple zones. However, customers can specify the zones where Batch VMs run jobs.
Zonal failure: When a single zone fails, the tasks running in that zone also fail. If tasks have retry settings, Batch automatically fails over those tasks to other active zones in the same region. The automatic failover is subject to availability of resources in active zones in the same region. Jobs that require zonal resources (like VMs, GPUs, or zonal persistent disks) that are only available in the failed zone are queued until the failed zone recovers or until the queueing timeouts of the jobs are reached. When possible, we recommend that customers let Batch choose zonal resources to run their jobs. Doing so helps ensure that the jobs are resilient to a zonal outage.
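The zonal failover behavior above can be sketched as a toy model (illustrative only; `schedule_task` and the zone names are invented for this example, not part of the Batch API):

```python
def schedule_task(zones, failed_zone, retries_enabled):
    """Toy model of Batch's zonal failover for a task whose zone
    fails: without retry settings the task simply fails; with
    retries it moves to another active zone if one can host it,
    otherwise it queues until the failed zone recovers."""
    if not retries_enabled:
        return "failed"
    active = [zone for zone in zones if zone != failed_zone]
    if active:
        return f"rescheduled in {active[0]}"
    return "queued until zone recovers"

# Task with retries fails over to the remaining active zone.
assert schedule_task(["us-central1-a", "us-central1-b"],
                     "us-central1-a", True) == "rescheduled in us-central1-b"
# Task pinned to a single zone queues until that zone recovers.
assert schedule_task(["us-central1-a"],
                     "us-central1-a", True) == "queued until zone recovers"
# Task without retry settings just fails.
assert schedule_task(["us-central1-a", "us-central1-b"],
                     "us-central1-a", False) == "failed"
```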
Regional failure: In case of a regional failure, the service control plane is unavailable in the region. The service doesn't replicate data or redirect requests across regions. We recommend that customers use multiple regions to run their jobs and redirect jobs to a different region if a region fails.
Chrome Enterprise Premium threat and data protection
Chrome Enterprise Premium threat and data protection is part of the Chrome Enterprise Premium solution. It extends Chrome with a variety of security features, including malware and phishing protection, Data Loss Prevention (DLP), URL filtering rules, and security reporting.
Chrome Enterprise Premium admins can opt in to storing customer core contents that violate DLP or malware policies into Google Workspace rule log events and/or into Cloud Storage for future investigation. Google Workspace rule log events are powered by a multi-regional Spanner database. Chrome Enterprise Premium can take up to several hours to detect policy violations. During this time, any unprocessed data is subject to data loss from a zonal or regional outage. Once a violation is detected, the contents that violate your policies are written to Google Workspace rule log events and/or to Cloud Storage.
Zonal and regional outage: Because Chrome Enterprise Premium threat and data protection is multi-zonal and multi-regional, it can survive a complete, unplanned loss of a zone or a region without a loss in availability. It provides this level of reliability by redirecting traffic to its service in other active zones or regions. However, because it can take Chrome Enterprise Premium threat and data protection several hours to detect DLP and malware violations, any unprocessed data in a specific zone or region is subject to loss from a zonal or regional outage.
BigQuery
BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse designed for business agility. BigQuery supports the following location types for user datasets:
- A region: a specific geographical location, such as Iowa (us-central1) or Montréal (northamerica-northeast1).
- A multi-region: a large geographic area that contains two or more geographic places, such as the United States (US) or Europe (EU).
In either case, data is stored redundantly in two zones within a single region within the selected location. Data written to BigQuery is synchronously written to both the primary and secondary zones. This protects against unavailability of a single zone within the region, but not against a regional outage.
Binary Authorization
Binary Authorization is a software supply chain security product for GKE and Cloud Run.
All Binary Authorization policies are replicated across multiple zones within every region. Replication helps Binary Authorization policy read operations recover from failures of other regions. Replication also makes read operations tolerant of zonal failures within each region.
Binary Authorization enforcement operations are resilient against zonal outages, but they are not resilient against regional outages. Enforcement operations run in the same region as the GKE cluster or Cloud Run job that's making the request. Therefore, in the event of a regional outage, there is nothing running to make Binary Authorization enforcement requests.
Certificate Manager
Certificate Manager lets you acquire and manage Transport Layer Security (TLS) certificates for use with different types of Cloud Load Balancing.
In the case of a zonal outage, regional and global Certificate Manager are resilient to zonal failures because jobs and databases are redundant across multiple zones within a region. In the case of a regional outage, global Certificate Manager is resilient to regional failures because jobs and databases are redundant across multiple regions. Regional Certificate Manager is a regional product, so it cannot withstand a regional failure.
Cloud Intrusion Detection System
Cloud Intrusion Detection System (Cloud IDS) is a zonal service that provides zonally scoped IDS endpoints, which process the traffic of VMs in one specific zone, and thus isn't tolerant of zonal or regional outages.
Zonal outage: Cloud IDS is tied to VM instances. If a customer plans to mitigate zonal outages by deploying VMs in multiple zones (manually or via regional managed instance groups), they need to deploy Cloud IDS endpoints in those zones as well.
Regional outage: Cloud IDS is a regional product. It doesn't provide any cross-regional functionality. A regional failure takes down all Cloud IDS functionality in all zones in that region.
Google Security Operations SIEM
Google Security Operations SIEM (which is part of Google Security Operations) is a fully managed service that helps security teams detect, investigate, and respond to threats.
Google Security Operations SIEM has regional and multi-regional offerings.
In regional offerings, data and traffic are automatically load-balanced across zones within the chosen region, and data is stored redundantly across availability zones within the region.
Multi-regions are geo-redundant. That redundancy provides a broader set of protections than regional storage. It also helps to ensure that the service continues to function even if a full region is lost.
The majority of data ingestion paths replicate customer data synchronously across multiple locations. When data is replicated asynchronously, there is a time window (a recovery point objective, or RPO) during which the data isn't yet replicated across several locations. This is the case when ingesting with feeds in multi-regional deployments. After the RPO, the data is available in multiple locations.
Zonal outage:
Regional deployments: Requests are served from any zone within the region. Data is synchronously replicated across multiple zones. In case of a full-zone outage, the remaining zones continue to serve traffic and continue to process the data. Redundant provisioning and automated scaling for Google Security Operations SIEM help to ensure that the service remains operational in the remaining zones during these load shifts.
Multi-regional deployments: Zonal outages are equivalent to regional outages.
Regional outage:
Regional deployments: Google Security Operations SIEM stores all customer data within a single region, and traffic is never routed across regions. In the event of a regional outage, Google Security Operations SIEM is unavailable in the region until the outage is resolved.
Multi-regional deployments (without feeds): Requests are served from any region of the multi-regional deployment. Data is synchronously replicated across multiple regions. In case of a full-region outage, the remaining regions continue to serve traffic and continue to process the data. Redundant provisioning and automated scaling for Google Security Operations SIEM help ensure that the service remains operational in the remaining regions during these load shifts.
Multi-regional deployments (with feeds): Requests are served from any region of the multi-regional deployment. Data is replicated asynchronously across multiple regions with the provided RPO. In case of a full-region outage, only data stored after the RPO is available in the remaining regions. Data within the RPO window might not be replicated.
Cloud Asset Inventory
Cloud Asset Inventory is a high-performance, resilient, global service that maintains a repository of Google Cloud resource and policy metadata. Cloud Asset Inventory provides search and analysis tools that help you track deployed assets across organizations, folders, and projects.
In the case of a zone outage, Cloud Asset Inventory continues to serve requests from another zone in the same or a different region.
In the case of a regional outage, Cloud Asset Inventory continues to serve requests from other regions.
Bigtable
Bigtable is a fully managed, high-performance NoSQL database service for large analytical and operational workloads.
Bigtable replication overview
Bigtable offers a flexible and fully configurable replication feature, which you can use to increase the availability and durability of your data by copying it to clusters in multiple regions, or in multiple zones within the same region. Bigtable can also provide automatic failover for your requests when you use replication.
When you use multi-zonal or multi-regional configurations with multi-cluster routing, in the case of a zonal or regional outage, Bigtable automatically reroutes traffic and serves requests from the nearest available cluster. Because Bigtable replication is asynchronous and eventually consistent, very recent changes to data in the location of the outage might be unavailable if they have not yet been replicated to other locations.
Performance considerations
Important: Ensure that CPU usage and data sizes remain within the recommended maximum values, and keep your clusters adequately provisioned at all times, so that replication performs predictably. When CPU resource demands exceed available node capacity, Bigtable always prioritizes serving incoming requests ahead of replication traffic.
For more information about how to use Bigtable replication with your workload, see the Cloud Bigtable replication overview and examples of replication settings.
Bigtable nodes are used both for serving incoming requests and for performing replication of data from other clusters. In addition to maintaining sufficient node counts per cluster, you must also ensure that your applications use proper schema design to avoid hotspots, which can cause excessive or imbalanced CPU usage and increased replication latency.
For more information about designing your application schema to maximize Bigtable performance and efficiency, see Schema design best practices.
Monitoring
Bigtable provides several ways to visually monitor the replication latency of your instances and clusters using the charts for replication available in the Google Cloud console.
You can also programmatically monitor Bigtable replication metrics using the Cloud Monitoring API.
Certificate Authority Service
Certificate Authority Service (CA Service) lets customers simplify, automate, and customize the deployment, management, and security of private certificate authorities (CAs), and resiliently issue certificates at scale.
Zonal outage: CA Service is resilient to zonal failures because its control plane is redundant across multiple zones within a region. If there is a zonal outage, CA Service continues to serve requests from another zone in the same region without interruption. Because data is replicated synchronously, there is no data loss or corruption.
Regional outage: CA Service is a regional product, so it cannot withstand a regional failure. If you require resilience to regional failures, create issuing CAs in two different regions. Create the primary issuing CA in the region where you need certificates. Create a fallback CA in a different region. Use the fallback when the primary subordinate CA's region has an outage. If needed, both CAs can chain up to the same root CA.
Cloud Billing
The Cloud Billing API lets developers manage billing for their Google Cloud projects programmatically. The Cloud Billing API is designed as a global system, with updates synchronously written to multiple zones and regions.
Zonal or regional failure: The Cloud Billing API automatically fails over to another zone or region. Individual requests might fail, but a retry policy should allow subsequent attempts to succeed.
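A retry policy of the kind described here can be sketched in a few lines (illustrative pure Python; `with_retries` and `flaky_billing_call` are invented names for this example, and the official Google Cloud client libraries ship their own built-in retry support):

```python
import random
import time

def with_retries(call, attempts=4, base_delay=0.5):
    """Call `call()`, retrying transient failures with exponential
    backoff and jitter. During a zonal or regional failover an
    individual request may fail; a later attempt reaches a healthy
    zone or region."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulate a request that fails twice during a failover, then succeeds.
state = {"calls": 0}
def flaky_billing_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("backend unavailable")
    return "billing info"

assert with_retries(flaky_billing_call, base_delay=0) == "billing info"
assert state["calls"] == 3
```

Exponential backoff with jitter keeps retries from hammering a recovering backend while still converging quickly once a healthy location answers.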
Cloud Build
Cloud Build is a service that executes your builds on Google Cloud.
Cloud Build is composed of regionally isolated instances that synchronously replicate data across zones within the region. We recommend that you use specific Google Cloud regions instead of the global region, and ensure that the resources your build uses (including log buckets, Artifact Registry repositories, and so on) are aligned with the region that your build runs in.
In the case of a zonal outage, control plane operations are unaffected. However, builds currently executing within the failing zone are delayed or permanently lost. Newly triggered builds are automatically distributed to the remaining functioning zones.
In the case of a regional failure, the control plane is offline, and currently executing builds are delayed or permanently lost. Triggers, worker pools, and build data are never replicated across regions. We recommend that you prepare triggers and worker pools in multiple regions to make mitigation of an outage easier.
Cloud CDN
Cloud CDN distributes and caches content across many locations on Google's network to reduce serving latency for clients. Cached content is served on a best-effort basis: when a request cannot be served by the Cloud CDN cache, the request is forwarded to origin servers, such as backend VMs or Cloud Storage buckets, where the original content is stored.
When a zone or a region fails, caches in the affected locations are unavailable. Inbound requests are routed to available Google edge locations and caches. If these alternate caches cannot serve the request, they forward the request to an available origin server. Provided that the server can serve the request with up-to-date data, there is no loss of content. An increased rate of cache misses causes the origin servers to experience higher than normal traffic volumes as the caches are filled. Subsequent requests are served from the caches unaffected by the zone or region outage.
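The cache-refill behavior can be sketched with a toy cache (illustrative only; the `serve` helper and the dictionaries stand in for Cloud CDN and your origin server):

```python
def serve(path, cache, origin):
    """Toy CDN location: serve hits from cache; on a miss (for
    example, after an outage empties a location), fetch from the
    origin, fill the cache, and pass the extra load to the origin."""
    if path in cache:
        return cache[path], "HIT"
    content = origin[path]   # origin fetch while the cache refills
    cache[path] = content
    return content, "MISS"

origin = {"/logo.png": b"image-bytes"}
cache = {}                   # empty after the outage
assert serve("/logo.png", cache, origin) == (b"image-bytes", "MISS")
assert serve("/logo.png", cache, origin) == (b"image-bytes", "HIT")
```

The first request after an outage is the expensive one; once the cache refills, later requests stop reaching the origin.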
For more information about Cloud CDN and cache behavior, see the Cloud CDN documentation.
Cloud Composer
Cloud Composer is a managed workflow orchestration service that lets you create, schedule, monitor, and manage workflows that span clouds and on-premises data centers. Cloud Composer environments are built on the Apache Airflow open source project.
Cloud Composer API availability isn't affected by zonal unavailability. During a zonal outage, you retain access to the Cloud Composer API, including the ability to create new Cloud Composer environments.
In Cloud Composer 2 and Cloud Composer 3, you can address zonal outages by setting up disaster recovery with environment snapshots in advance. During a zone outage, you can switch to another environment by transferring the state of your workflows with an environment snapshot.
Cloud Composer 2 and Cloud Composer 3 environments can be created in High Resilience mode, which makes them resilient to zonal failures. In a highly resilient environment, the control plane and the environment's Airflow database are distributed across multiple zones in a region. If a single zone becomes unavailable, the environment continues to operate.
During a zonal outage, Airflow tasks that are executed in the affected zone are interrupted and marked as failed by Airflow. This happens in both regular and highly resilient environments. The difference between a regular and a highly resilient environment is in how the environment recovers from a zonal outage.
For example, a zonal outage interrupts Airflow tasks that run in a specific zone. Afterwards, a highly resilient environment recovers, restarts its affected components in a different zone, and switches its database to a secondary zone. Thus, the failed Airflow tasks can be rescheduled and restarted by Airflow, while preserving the history of DAG runs and other settings.
Cloud Data Fusion
Cloud Data Fusion is a fully managed enterprise data integration service for quickly building and managing data pipelines. It provides three editions.
Zonal outages impact Developer edition instances.
Regional outages impact Basic and Enterprise edition instances.
To control access to resources, you might design and run pipelines in separate environments. This separation lets you design a pipeline once, and then run it in multiple environments. You can recover pipelines in both environments. For more information, see Back up and restore instance data.
The following advice applies to both regional and zonal outages.
Outages in the pipeline design environment
In the design environment, save pipeline drafts in case of an outage. Depending on specific RTO and RPO requirements, you can use the saved drafts to restore the pipeline in a different Cloud Data Fusion instance during an outage.
Outages in the pipeline execution environment
In the execution environment, you start the pipeline internally with Cloud Data Fusion triggers or schedules, or externally with orchestration tools, such as Cloud Composer. To be able to recover runtime configurations of pipelines, back up the pipelines and configurations, such as plugins and schedules. In an outage, you can use the backup to replicate an instance in an unaffected region or zone.
Another way to prepare for outages is to have multiple instances across regions with the same configuration and pipeline set. If you use external orchestration, running pipelines can be load balanced automatically between instances. Take special care to ensure that there are no resources (such as data sources or orchestration tools) tied to a single region and used by all instances, because they could become a central point of failure in an outage. For example, you can have multiple instances in different regions and use Cloud Load Balancing and Cloud DNS to direct pipeline run requests to an instance that isn't affected by an outage (see the example tier 1 and tier 3 architectures).
Outages for other Google Cloud data services in the pipeline
Your instance might use other Google Cloud services as data sources or pipeline execution environments, such as Dataproc, Cloud Storage, or BigQuery. Those services can be in different regions. When cross-regional execution is required, a failure in either region leads to an outage. In this scenario, follow the standard disaster recovery steps, keeping in mind that a cross-regional setup with critical services in different regions is less resilient.
Cloud Deploy
Cloud Deploy provides continuous delivery of workloads into runtime services such as GKE and Cloud Run. The service is composed of regional instances that synchronously replicate data across zones within the region.
Zonal outage: Control plane operations are unaffected. However, Cloud Build builds (for example, render or deploy operations) that are running when a zone fails are delayed or permanently lost. During an outage, the Cloud Deploy resource that triggered the build (a release or rollout) displays a failure status that indicates the underlying operation failed. You can re-create the resource to start a new build in the remaining functioning zones. For example, create a new rollout by redeploying the release to a target.
Regional outage: Control plane operations are unavailable, as is data from Cloud Deploy, until the region is restored. To help make it easier to restore service in the event of a regional outage, we recommend that you store your delivery pipeline and target definitions in source control. You can use these configuration files to re-create your Cloud Deploy pipelines in a functioning region. During an outage, data about existing releases is lost. Create a new release to continue deploying software to your targets.
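As a sketch of the recommendation above, the delivery pipeline and target definitions kept in source control might look like the following. The pipeline, target, project, and cluster names are hypothetical placeholders:

```yaml
# Hypothetical Cloud Deploy configuration kept in source control so the
# pipeline can be re-created in a functioning region after an outage.
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: my-app-pipeline
description: main delivery pipeline for my-app
serialPipeline:
  stages:
    - targetId: prod-target
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod-target
description: production GKE cluster
gke:
  cluster: projects/my-project/locations/us-central1/clusters/prod-cluster
```

With a file like this, re-creating the pipeline in another region is a matter of re-applying the configuration there, for example with `gcloud deploy apply --file=clouddeploy.yaml --region=REGION`.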
Cloud DNS
Cloud DNS is a high-performance, resilient, global Domain Name System (DNS) service that publishes your domain names to the global DNS in a cost-effective way.
In the case of a zonal outage, Cloud DNS continues to serve requests from another zone in the same or a different region without interruption. Updates to Cloud DNS records are synchronously replicated across zones within the region where they are received. Therefore, there is no data loss.
In the case of a regional outage, Cloud DNS continues to serve requests from other regions. It is possible that very recent updates to Cloud DNS records will be unavailable, because updates are first processed in a single region before being asynchronously replicated to other regions.
Cloud Healthcare API
Cloud Healthcare API, a service for storing and managing healthcare data, is built to provide high availability and offers protection against zonal and regional failures, depending on the chosen configuration.
Regional configuration: In its default configuration, Cloud Healthcare API offers protection against zonal failure. The service is deployed in three zones across one region, with data also triplicated across different zones within the region. In case of a zonal failure affecting either the service layer or the data layer, the remaining zones take over without interruption. With the regional configuration, if the whole region where the service is located experiences an outage, the service is unavailable until the region comes back online. In the unforeseen event of the physical destruction of a whole region, data stored in that region is lost.
Multi-regional configuration: In its multi-regional configuration, Cloud Healthcare API is deployed in three zones belonging to three different regions. Data is also replicated across three regions. This guards against loss of service in case of a whole-region outage, since the remaining regions would automatically take over. Structured data, such as FHIR, is synchronously replicated across multiple regions, so it's protected against data loss in case of a whole-region outage. Data that is stored in Cloud Storage buckets, such as DICOM and Dictation or large HL7v2/FHIR objects, is asynchronously replicated across multiple regions.
Cloud Identity
Cloud Identity services are distributed across multiple regions and use dynamic load balancing. Cloud Identity does not allow users to select a resource scope. If a particular zone or region experiences an outage, traffic is automatically distributed to other zones or regions.
Persistent data is mirrored in multiple regions with synchronous replication in most cases. For performance reasons, a few systems, such as caches or changes affecting large numbers of entities, are asynchronously replicated across regions. If the primary region in which the most current data is stored experiences an outage, Cloud Identity serves stale data from another location until the primary region becomes available.
Cloud Interconnect
Cloud Interconnect offers customers RFC 1918 access to Google Cloud networks from their on-premises data centers, over physical cables connected to a Google peering edge.
Cloud Interconnect provides customers with a 99.9% SLA if they provision connections to two EADs (Edge Availability Domains) in a metropolitan area. A 99.99% SLA is available if the customer provisions connections in two EADs in two metropolitan areas to two regions with Global Routing. See Topology for non-critical applications overview and Topology for production-level applications overview for more information.
Cloud Interconnect is compute-zone independent and provides high availability in the form of EADs. In the event of an EAD failure, the BGP session to that EAD breaks and traffic fails over to the other EAD.
In the event of a regional failure, BGP sessions to that region break and traffic fails over to the resources in the working region. This applies when Global Routing is enabled.
Cloud Key Management Service
Cloud Key Management Service (Cloud KMS) provides scalable and highly durable cryptographic key resource management. Cloud KMS stores all of its data and metadata in Spanner databases, which provide high data durability and availability with synchronous replication.
Cloud KMS resources can be created in a single region, multiple regions, or globally.
In the case of a zonal outage, Cloud KMS continues to serve requests from another zone in the same or a different region without interruption. Because data is replicated synchronously, there is no data loss or corruption. When the zone outage is resolved, full redundancy is restored.
In the case of a regional outage, regional resources in that region are offline until the region becomes available again. Note that even within a region, at least three replicas are maintained in separate zones. When higher availability is required, resources should be stored in a multi-region or global configuration. Multi-region and global configurations are designed to stay available through a regional outage by geo-redundantly storing and serving data in more than one region.
Cloud External Key Manager (Cloud EKM)
Cloud External Key Manager is integrated with Cloud Key Management Service to let you control and access external keys through supported third-party partners. You can use these external keys to encrypt data at rest to use for other Google Cloud services that support customer-managed encryption keys (CMEK) integration.
Zonal outage: Cloud External Key Manager is resilient to zonal outages because of the redundancy that's provided by multiple zones in a region. If a zonal outage occurs, traffic is rerouted to other zones within the region. While traffic is rerouting, you might see an increase in errors, but the service is still available.
Regional outage: Cloud External Key Manager isn't available during a regional outage in the affected region. There is no failover mechanism that redirects requests across regions. We recommend that customers use multiple regions to run their jobs.
Cloud External Key Manager doesn't store any customer data persistently. Thus, there's no data loss during a regional outage within the Cloud External Key Manager system. However, Cloud External Key Manager depends on the availability of other services, like Cloud Key Management Service and external third party vendors. If those systems fail during a regional outage, you could lose data. The RPO/RTO of these systems are outside the scope of Cloud External Key Manager commitments.
Cloud Load Balancing
Cloud Load Balancing is a fully distributed, software-defined managed service. With Cloud Load Balancing, a single anycast IP address can serve as the frontend for backends in regions around the world. It isn't hardware-based, so you don't need to manage a physical load-balancing infrastructure. Load balancers are a critical component of most highly available applications.
Cloud Load Balancing offers both regional and global load balancers. It also provides cross-region load balancing, including automatic multi-region failover, which moves traffic to failover backends if your primary backends become unhealthy.
The global load balancers are resilient to both zonal and regional outages. The regional load balancers are resilient to zonal outages but are affected by outages in their region. However, in either case, it is important to understand that the resilience of your overall application depends not just on which type of load balancer you deploy, but also on the redundancy of your backends.
For more information about Cloud Load Balancing and its features, see Cloud Load Balancing overview.
Cloud Logging
Cloud Logging consists of two main parts: the Logs Router and Cloud Logging storage.
The Logs Router handles streaming log events and directs the logs to Cloud Storage, Pub/Sub, BigQuery, or Cloud Logging storage.
Cloud Logging storage is a service for storing, querying, and managing compliance for logs. It supports many users and workflows, including development, compliance, troubleshooting, and proactive alerting.
Logs Router and incoming logs: During a zonal outage, the Cloud Logging API routes logs to other zones in the region. Normally, logs being routed by the Logs Router to Cloud Logging, BigQuery, or Pub/Sub are written to their end destination as soon as possible, while logs sent to Cloud Storage are buffered and written in batches hourly.
Log entries: In the event of a zonal or regional outage, log entries that have been buffered in the affected zone or region and not written to the export destination become inaccessible. Logs-based metrics are also calculated in the Logs Router and subject to the same constraints. Once delivered to the selected log export location, logs are replicated according to the destination service. Logs that are exported to Cloud Logging storage are synchronously replicated across two zones in a region. For the replication behavior of other destination types, see the relevant section in this article. Note that logs exported to Cloud Storage are batched and written every hour. Therefore, we recommend using Cloud Logging storage, BigQuery, or Pub/Sub to minimize the amount of data impacted by an outage.
Log metadata: Metadata such as sink and exclusion configuration is stored globally but cached regionally, so in the event of an outage, the regional Logs Router instances continue to operate. Single-region outages have no impact outside of the region.
Cloud Monitoring
Cloud Monitoring consists of a variety of interconnected features, such as dashboards (both built-in and user-defined), alerting, and uptime monitoring.
All Cloud Monitoring configuration, including dashboards, uptime checks, and alert policies, is globally defined. All changes to it are replicated synchronously to multiple regions. Therefore, during both zonal and regional outages, successful configuration changes are durable. In addition, although transient read and write failures can occur when a zone or region initially fails, Cloud Monitoring reroutes requests towards available zones and regions. In this situation, you can retry configuration changes with exponential backoff.
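The retry advice above can be sketched as a small helper. This is a generic pattern, not a Cloud Monitoring API; `apply_change` stands in for whatever call performs your configuration change and is assumed to raise on a transient error.

```python
import random
import time

# Sketch: retry a configuration change with exponential backoff and jitter,
# as suggested for transient failures during a zonal or regional outage.
# `apply_change` is a hypothetical callable that raises on transient errors.

def retry_with_backoff(apply_change, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return apply_change()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # The delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Jitter matters here: when many clients retry after the same failure, randomized delays keep them from hammering the recovering service in lockstep.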
When writing metrics for a specific resource, Cloud Monitoring first identifies the region in which the resource resides. It then writes three independent replicas of the metric data within the region. The overall regional metric write is returned as successful as soon as one of the three writes succeeds. The three replicas are not guaranteed to be in different zones within the region.
Zonal: During a zonal outage, metric writes and reads are completely unavailable for resources in the affected zone. Effectively, Cloud Monitoring acts like the affected zone doesn't exist.
Regional: During a regional outage, metric writes and reads are completely unavailable for resources in the affected region. Effectively, Cloud Monitoring acts like the affected region doesn't exist.
Cloud NAT
Cloud NAT (network address translation) is a distributed, software-defined managed service that lets certain resources without external IP addresses create outbound connections to the internet. It's not based on proxy VMs or appliances. Instead, Cloud NAT configures the Andromeda software that powers your Virtual Private Cloud network so that it provides source network address translation (source NAT or SNAT) for VMs without external IP addresses. Cloud NAT also provides destination network address translation (destination NAT or DNAT) for established inbound response packets.
For more information on the functionality of Cloud NAT, see the documentation.
Zonal outage: Cloud NAT is resilient to zonal failures because the control plane and network data plane are redundant across multiple zones within a region.
Regional outage: Cloud NAT is a regional product, so it cannot withstand a regional failure.
Cloud Router
Cloud Router is a fully distributed and managed Google Cloud service that uses the Border Gateway Protocol (BGP) to advertise IP address ranges. It programs dynamic routes based on the BGP advertisements that it receives from a peer. Instead of a physical device or appliance, each Cloud Router consists of software tasks that act as BGP speakers and responders.
In the case of a zonal outage, Cloud Router with a high availability (HA) configuration is resilient to zonal failures. In that case, one interface might lose connectivity, but traffic is redirected to the other interface through dynamic routing using BGP.
In the case of a regional outage, Cloud Router is a regional product, so it cannot withstand a regional failure. If customers have enabled global routing mode, routing between the failed region and other regions might be affected.
Cloud Run
Cloud Run is a stateless computing environment where customers can run their containerized code on Google's infrastructure. Cloud Run is a regional offering, meaning customers can choose the region but not the zones that make up a region. Data and traffic are automatically load balanced across zones within a region. Container instances are automatically scaled to meet incoming traffic and are load balanced across zones as necessary. Each zone maintains a scheduler that provides this autoscaling per zone. The scheduler is also aware of the load other zones are receiving and will provision extra capacity in-zone to allow for any zonal failures.
If you use Cloud Run GPUs, you have the option to turn off zonal redundancy for the service, and instead use best-effort reliability in case of a zonal outage, at a lower cost. For details, see GPU zonal redundancy options.
Zonal outage: Cloud Run stores metadata as well as the deployed container. This data is stored regionally and written in a synchronous manner. The Cloud Run Admin API returns an API call as successful only after the data has been committed to a quorum within a region. Because data is regionally stored, data plane operations are not affected by zonal failures either. Traffic is routed to other zones in the event of a zonal failure.
Regional outage: Customers choose the Google Cloud region they want to create their Cloud Run service in. Data is never replicated across regions. Customer traffic is never routed to a different region by Cloud Run. In the case of a regional failure, Cloud Run becomes available again as soon as the outage is resolved. Customers are encouraged to deploy to multiple regions and use Cloud Load Balancing to achieve higher availability if desired.
Cloud Shell
Cloud Shell provides Google Cloud users access to single-user Compute Engine instances that are preconfigured for onboarding, education, development, and operator tasks.
Cloud Shell isn't suitable for running application workloads and is instead intended for interactive development and educational use cases. It has per-user runtime quota limits, it is automatically shut down after a short period of inactivity, and the instance is only accessible to the assigned user.
The Compute Engine instances backing the service are zonal resources, so in the event of a zone outage, a user's Cloud Shell is unavailable.
Cloud Source Repositories
Cloud Source Repositories lets users create and manage private source code repositories. This product is designed with a global model, so you don't need to configure it for regional or zonal resiliency.
Instead, git push operations against Cloud Source Repositories synchronously replicate the source repository update to multiple zones across multiple regions. This means that the service is resilient to outages in any one region.
If a particular zone or region experiences an outage, traffic is automatically distributed to other zones or regions.
The feature to automatically mirror repositories from GitHub or Bitbucket can be affected by problems in those products. For example, mirroring is affected if GitHub or Bitbucket can't alert Cloud Source Repositories of new commits, or if Cloud Source Repositories can't retrieve content from the updated repository.
Spanner
Spanner is a scalable, highly available, multi-version, synchronously replicated, and strongly consistent database.
Replication and quorums: Spanner instances in a regional instance configuration synchronously replicate data across three zones in a single region. Spanner synchronously sends a write to all three replicas in the regional instance and acknowledges the write to the client after at least two replicas (a majority quorum of two out of three) commit the write. This makes Spanner resilient to a zone failure, providing access to all data, because it persists the latest writes and achieves a majority quorum for writes with two replicas.
Instances in a multi-regional instance configuration have a write quorum that synchronously replicates data across five zones located in three regions (two read-write replicas each in the default-leader region and another region, and one replica in the witness region). Spanner acknowledges a write to a multi-regional instance after at least three replicas (a majority quorum of three out of five) commit the write. If a zone or region fails, Spanner accesses all data (including the latest writes) and serves read-write requests, because the data is persisted in at least three zones across two regions when it acknowledges the write to the client.
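The quorum arithmetic above can be sketched directly: with n voting replicas, a write needs a majority of acknowledgments, and the configuration tolerates the remaining replicas failing.

```python
# Sketch of the majority-quorum arithmetic described above for Spanner's
# regional (3 replicas) and multi-regional (5 voting replicas) configurations.

def majority_quorum(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1

def tolerated_failures(n_replicas: int) -> int:
    """Replicas that can fail while writes still reach a quorum."""
    return n_replicas - majority_quorum(n_replicas)

# Regional configuration: 3 replicas -> quorum of 2, tolerates 1 zone failure.
# Multi-regional: 5 voting replicas -> quorum of 3, tolerates 2 failures,
# which is why a whole-region outage still leaves a working write quorum.
```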
Backup and restore: To protect from logical data corruption or regional disasters, Spanner provides backup and restore capabilities. You can create full backups on demand or use a backup schedule to create full or incremental backups. All backups are highly available and encrypted, and you can retain them for up to one year from creation.
For cross-region and cross-project protection, you can copy backups from one instance to another instance in a different region or project. Copying backups to different geographic regions helps protect data from regional failures and meet compliance requirements.
Export and import: Spanner databases can be exported to Avro data files for long-term backup and archiving.
Point-in-time recovery (PITR) and deletion protection: Spanner PITR protects your databases from logical data corruption, accidental database deletion, or erroneous writes. By default, your database retains all versions of its data and schema for one hour. You can increase this time limit to as long as seven days through the version_retention_period option. Additionally, you can enable database deletion protection to prevent accidental deletions by users or service accounts.
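As a sketch, extending the retention window to the seven-day maximum is a single DDL statement (the database name here is a placeholder):

```sql
-- Hypothetical database; sets the PITR retention window to seven days.
ALTER DATABASE my_database
  SET OPTIONS (version_retention_period = '7d');
```

A longer window increases storage for old data versions, so weigh the recovery benefit against that cost.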
For more information about Spanner, see the following:
- Spanner instance configurations
- Spanner replication
- Spanner disaster recovery overview
- Spanner point-in-time recovery
- Spanner export and import
Cloud SQL
Cloud SQL is a fully managed relational database service for MySQL, PostgreSQL, and SQL Server. Cloud SQL uses managed Compute Engine virtual machines to run the database software. It offers a high availability configuration for regional redundancy, protecting the database from a zone outage. Cross-region replicas can be provisioned to protect the database from a region outage. Because the product also offers a zonal option, which is not resilient to a zone or region outage, you should be careful to select the high availability configuration, cross-region replicas, or both.
Zonal outage: The high availability option creates a primary and a standby VM instance in two separate zones within one region. During normal operation, the primary VM instance serves all requests, writing database files to a Regional Persistent Disk, which is synchronously replicated to the primary and standby zones. If a zone outage affects the primary instance, Cloud SQL initiates a failover during which the Persistent Disk is attached to the standby VM and traffic is rerouted.
During this process, the database must be initialized, which includes processing any transactions written to the transaction log but not applied to the database. The number and type of unprocessed transactions can increase the RTO. High recent write activity can lead to a backlog of unprocessed transactions. The RTO is most heavily impacted by (a) high recent write activity and (b) recent changes to database schemas.
Finally, when the zonal outage has been resolved, you can manually trigger a failback operation to resume serving in the primary zone.
For more details on the high availability option, see the Cloud SQL high availability documentation.
Regional outage: The cross-region replica option protects your database from regional outages by creating read replicas of your primary instance in other regions. The cross-region replication uses asynchronous replication, which allows the primary instance to commit transactions before they are committed on replicas. The time difference between when a transaction is committed on the primary instance and when it is committed on the replica is known as "replication lag" (which can be monitored). This metric reflects both transactions that have not been sent from the primary to replicas and transactions that have been received but have not been processed by the replica. Transactions not sent to the replica would become unavailable during a regional outage. Transactions received but not processed by the replica impact the recovery time, as described below.
Cloud SQL recommends testing your workload with a configuration that includes a replica to establish a "safe transactions per second (TPS)" limit, which is the highest sustained TPS that doesn't accumulate replication lag. If your workload exceeds the safe TPS limit, replication lag accumulates, negatively affecting RPO and RTO values. As general guidance, avoid using small instance configurations (<2 vCPU cores, <100 GB disks, or PD-HDD), which are susceptible to replication lag.
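The relationship between sustained write rate, the safe TPS limit, and accumulated lag can be sketched as a back-of-the-envelope model. The numbers are illustrative, not Cloud SQL metrics:

```python
# Sketch: how replication lag accumulates when sustained write TPS exceeds
# the replica's "safe TPS" (the highest rate it can apply without falling
# behind). Purely illustrative arithmetic, not a Cloud SQL API.

def lag_after(seconds: int, write_tps: float, safe_tps: float) -> float:
    """Unapplied transactions after `seconds` of sustained writes."""
    backlog_rate = max(0.0, write_tps - safe_tps)  # txns/sec left behind
    return backlog_rate * seconds
```

For example, writing 100 TPS over the safe limit for a minute leaves a 6,000-transaction backlog, which both widens the RPO exposure (unsent transactions lost in an outage) and lengthens promotion time (received but unapplied transactions must be replayed).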
In the event of a regional outage, you must decide whether to manually promote a read replica. This is a manual operation because promotion can cause a split-brain scenario in which the promoted replica accepts new transactions despite having lagged the primary instance at the time of the promotion. This can cause problems when the regional outage is resolved and you must reconcile the transactions that were never propagated from the primary to the replica instances. If this is problematic for your needs, you may consider a cross-region synchronous replication database product like Spanner.
Once triggered by the user, the promotion process follows steps similar to the activation of a standby instance in the high availability configuration. In that process, the read replica must process the transaction log, which drives the total recovery time. Because there is no built-in load balancer involved in the replica promotion, you must manually redirect applications to the promoted primary.
For more details on the cross-region replica option, see the Cloud SQL cross-region replica documentation.
For more information about Cloud SQL DR, see the following:
- Cloud SQL for MySQL database disaster recovery
- Cloud SQL for PostgreSQL database disaster recovery
- Cloud SQL for SQL Server database disaster recovery
Cloud Storage
Cloud Storage provides globally unified, scalable, and highly durable object storage. Cloud Storage buckets can be created in one of three different location types: in a single region, in a dual-region, or in a multi-region within a continent. With regional buckets, objects are stored redundantly across availability zones in a single region. Dual-region and multi-region buckets, on the other hand, are geo-redundant. This means that after newly written data is replicated to at least one remote region, objects are stored redundantly across regions. This approach gives data in dual-region and multi-region buckets a broader set of protections than can be achieved with regional storage.
Regional buckets are designed to be resilient in case of an outage in a single availability zone. If a zone experiences an outage, objects in the unavailable zone are automatically and transparently served from elsewhere in the region. Data and metadata are stored redundantly across zones, starting with the initial write. No writes are lost if a zone becomes unavailable. In the case of a regional outage, regional buckets in that region are offline until the region becomes available again.
If you need higher availability, you can store data in a dual-region or multi-region configuration. Dual-region and multi-region buckets are single buckets (no separate primary and secondary locations), but they store data and metadata redundantly across regions. In the case of a regional outage, service is not interrupted. You can think of dual-region and multi-region buckets as being active-active in that you can read and write workloads in more than one region simultaneously while the bucket remains strongly consistent. This can be especially attractive for customers who want to split their workload across the two regions as part of a disaster recovery architecture.
Dual-regions and multi-regions are strongly consistent because metadata is always written synchronously across regions. This approach allows the service to always determine what the latest version of an object is and where it can be served from, including from remote regions.
Data is replicated asynchronously. This means that there is an RPO time window during which newly written objects start out protected as regional objects, with redundancy across availability zones within a single region. The service then replicates the objects within that RPO window to one or more remote regions to make them geo-redundant. After that replication is complete, data can be served automatically and transparently from another region in the case of a regional outage. Turbo replication is a premium feature available on a dual-region bucket to obtain a smaller RPO window, which targets 100% of newly written objects being replicated and made geo-redundant within 15 minutes.
RPO is an important consideration, because during a regional outage, data recently written to the affected region within the RPO window might not yet have been replicated to other regions. As a result, that data might not be accessible during the outage, and could be lost in the case of physical destruction of the data in the affected region.
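The RPO-window reasoning above can be sketched as follows. The object names, timestamps, and the 15-minute window (the turbo replication target) are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Sketch: which recently written objects fall inside the RPO window at the
# moment a regional outage begins, and are therefore at risk of being
# unavailable. Names and timestamps are hypothetical.

RPO_WINDOW = timedelta(minutes=15)  # turbo replication target

def at_risk_objects(writes, outage_start):
    """Objects written inside the RPO window before the outage began."""
    return sorted(
        name for name, written in writes.items()
        if outage_start - written < RPO_WINDOW
    )

writes = {
    "logs/old.avro": datetime(2024, 1, 1, 11, 0),   # replicated long ago
    "logs/new.avro": datetime(2024, 1, 1, 11, 55),  # written 5 min before outage
}
outage = datetime(2024, 1, 1, 12, 0)
```

Only writes newer than the RPO window are exposed; everything older has already been made geo-redundant and stays readable from the surviving region.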
Cloud Translation
Cloud Translation has active compute servers in multiple zones and regions. It also supports synchronous data replication across zones within regions. These features help Translation achieve instantaneous failover without any data loss for zonal failures, and without requiring any customer input or adjustments. In the case of a regional failure, Cloud Translation is not available.
Compute Engine
Compute Engine is one of Google Cloud's infrastructure-as-a-service options. It uses Google's worldwide infrastructure to offer virtual machines (and related services) to customers.
Compute Engine instances are zonal resources, so in the event of a zone outage, instances are unavailable by default. Compute Engine does offer managed instance groups (MIGs), which can automatically scale up additional VMs from preconfigured instance templates, both within a single zone and across multiple zones within a region. MIGs are ideal for applications that require resilience to zone loss and are stateless, but they require configuration and resource planning. Multiple regional MIGs can be used to achieve region outage resilience for stateless applications.
Applications that have stateful workloads can still use stateful MIGs, but extra care needs to be taken in capacity planning, since they don't scale horizontally. In either scenario, it's important to correctly configure and test Compute Engine instance templates and MIGs ahead of time to ensure working failover capabilities to other zones. See the Develop your own reference architectures and guides section above for more information.
Sole-tenancy
Sole-tenancy lets you have exclusive access to a sole-tenant node, which is a physical Compute Engine server that is dedicated to hosting only your project's VMs.
Sole-tenant nodes, like Compute Engine instances, are zonal resources. In the unlikely event of a zonal outage, they are unavailable. To mitigate zonal failures, you can create a sole-tenant node in another zone. Given that certain workloads might benefit from sole-tenant nodes for licensing or CAPEX accounting purposes, you should plan a failover strategy in advance.
Recreating these resources in a different location might incur additional licensing costs or violate CAPEX accounting requirements. For general guidance, see Develop your own reference architectures and guides.
Sole-tenant nodes are zonal resources and cannot withstand regional failures. To scale across zones, use regional MIGs.
Networking for Compute Engine
For information about high-availability setups for Interconnect connections, seethe following documents:
You can provision external IP addresses in global or regional mode, which affects their availability in the case of a regional failure.
Cloud Load Balancing resilience
Load balancers are a critical component of most highly available applications. It is important to understand that the resilience of your overall application depends not just on the scope of the load balancer you choose (global or regional), but also on the redundancy of your backend services.
The following table summarizes load balancer resilience based on the load balancer's distribution or scope.
| Load balancer scope | Architecture | Resilient to zonal outage | Resilient to regional outage |
|---|---|---|---|
| Global | Each load balancer is distributed across all regions | Yes | Yes |
| Cross-region | Each load balancer is distributed across multiple regions | Yes | Yes |
| Regional | Each load balancer is distributed across multiple zones in the region | Yes | No. An outage in a given region affects the regional load balancers in that region |
For more information about choosing a load balancer, see the Cloud Load Balancing documentation.
Connectivity Tests
Connectivity Tests is a diagnostics tool that lets you check the connectivity between network endpoints. It analyzes your configuration and, in some cases, performs a live data plane analysis between the endpoints. An endpoint is a source or destination of network traffic, such as a VM, Google Kubernetes Engine (GKE) cluster, load balancer forwarding rule, or an IP address. Connectivity Tests is a diagnostic tool with no data plane components. It does not process or generate user traffic.
Zonal outage: Connectivity Tests resources are global. You can manage and view them in the event of a zonal outage. Connectivity Tests resources are the results of your configuration tests. These results might include the configuration data of zonal resources (for example, VM instances) in an affected zone. If there's an outage, the analysis results aren't accurate, because the analysis is based on stale data from before the outage. Don't rely on them.
Regional outage: In a regional outage, you can still manage and view Connectivity Tests resources. Connectivity Tests resources might include configuration data of regional resources, like subnetworks, in an affected region. If there's an outage, the analysis results aren't accurate, because the analysis is based on stale data from before the outage. Don't rely on them.
Container Registry
Container Registry provides a scalable hosted Docker Registry implementation that securely and privately stores Docker container images. Container Registry implements the HTTP Docker Registry API.
Container Registry is a global service that synchronously stores image metadata redundantly across multiple zones and regions by default. Container images are stored in Cloud Storage multi-regional buckets. With this storage strategy, Container Registry provides zonal outage resilience in all cases, and regional outage resilience for any data that has been asynchronously replicated to multiple regions by Cloud Storage.
Database Migration Service
Database Migration Service is a fully managed Google Cloud service to migrate databases from other cloud providers or from on-premises data centers to Google Cloud.
Database Migration Service is architected as a regional control plane. The control plane doesn't depend on an individual zone in a given region. During a zonal outage, you retain access to the Database Migration Service APIs, including the ability to create and manage migration jobs. During a regional outage, you lose access to Database Migration Service resources that belong to that region until the outage is resolved.
Database Migration Service depends on the availability of the source and destination databases that are used for the migration process. If a Database Migration Service source or destination database is unavailable, migrations stop making progress, but no customer core data or job data is lost. Migration jobs resume when the source and destination databases become available again.
For example, you can configure a destination Cloud SQL database with high availability (HA) enabled to get a destination database that is resilient to zonal outages.
Database Migration Service migrations go through two phases:
- Full dump: Performs a full data copy from the source to the destination according to the migration job specification.
- Change data capture (CDC): Replicates incremental changes from the source to the destination.
Zonal outage: If a zonal failure occurs during either of these phases, you are still able to access and manage resources in Database Migration Service. Data migration is affected as follows:
- Full dump: Data migration fails; you need to restart the migration job once the destination database completes the failover operation.
- Change data capture (CDC): Data migration is paused. The migration job resumes automatically once the destination database completes the failover operation.
Regional outage: Database Migration Service doesn't support cross-regional resources, and therefore it's not resilient against regional failures.
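The per-phase recovery behavior described above can be summarized in a small decision helper. This is an illustrative sketch only; the function name and phase labels are invented for this example and are not part of the Database Migration Service API.

```python
# Hypothetical sketch of the zonal-outage recovery actions described above:
# a full-dump migration must be restarted manually after the destination
# fails over, while a CDC migration resumes automatically.

def recovery_action(phase: str) -> str:
    """Return the operator action for a migration interrupted in `phase`."""
    if phase == "full_dump":
        return "restart the migration job after destination failover completes"
    if phase == "cdc":
        return "no action needed: job resumes automatically after failover"
    raise ValueError(f"unknown migration phase: {phase}")

print(recovery_action("full_dump"))
print(recovery_action("cdc"))
```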
Dataflow
Dataflow is Google Cloud's fully managed and serverless data processing service for streaming and batch pipelines. By default, a regional endpoint configures the Dataflow worker pool to use all available zones within the region. Zone selection is calculated for each worker at the time that the worker is created, optimizing for resource acquisition and use of unused reservations. In the default configuration for Dataflow jobs, intermediate data is stored by the Dataflow service, and the state of the job is stored in the backend. If a zone fails, Dataflow jobs can continue to run, because workers are re-created in other zones.
The following limitations apply:
- Regional placement is supported only for jobs using Streaming Engine or Dataflow Shuffle. Jobs that have opted out of Streaming Engine or Dataflow Shuffle can't use regional placement.
- Regional placement applies to VMs only. It doesn't apply to Streaming Engine and Dataflow Shuffle-related resources.
- VMs aren't replicated across multiple zones. If a VM becomes unavailable, its work items are considered lost and are reprocessed by another VM.
- If a region-wide stockout occurs, the Dataflow service can't create any more VMs.
Architecting Dataflow pipelines for high availability
You can run multiple streaming pipelines in parallel for high-availability data processing. For example, you can run two parallel streaming jobs in different regions. Running parallel pipelines provides geographical redundancy and fault tolerance for data processing. By considering the geographic availability of data sources and sinks, you can operate end-to-end pipelines in a highly available, multi-region configuration. For more information, see High availability and geographic redundancy in "Design Dataflow pipeline workflows."
In case of a zone or region outage, you can avoid data loss by reusing the same subscription to the Pub/Sub topic. To guarantee that records aren't lost during shuffle, Dataflow uses upstream backup, which means that the worker sending the records retries RPCs until it receives positive acknowledgement that the record has been received and that the side-effects of processing the record are committed to persistent storage downstream. Dataflow also continues to retry RPCs if the worker sending the records becomes unavailable. Retrying RPCs ensures that every record is delivered exactly once. For more information about the Dataflow exactly-once guarantee, see Exactly-once in Dataflow.
If the pipeline uses grouping or time-windowing, you can use the Seek functionality of Pub/Sub or the Replay functionality of Kafka after a zonal or regional outage to reprocess data elements and arrive at the same calculation results. If the business logic used by the pipeline does not rely on data from before the outage, the data loss of pipeline outputs can be minimized down to zero elements. If the pipeline business logic does rely on data that was processed before the outage (for example, if long sliding windows are used, or if a global time window is storing ever-increasing counters), use Dataflow snapshots to save the state of the streaming pipeline and start a new version of your job without losing state.
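The upstream-backup idea described above can be illustrated with a toy model. This is not Dataflow's implementation; it is a minimal sketch showing how sender retries plus an idempotent, deduplicating receiver yield exactly-once output even when acknowledgements are lost.

```python
# Illustrative sketch (not Dataflow internals) of "upstream backup": the
# sender retries until it gets a positive ack, and the receiver commits
# idempotently by record ID, so duplicate retries don't produce duplicates.

class Receiver:
    def __init__(self, fail_first_n: int = 0):
        self.committed = {}            # record_id -> value (stand-in for persistent storage)
        self._failures_left = fail_first_n

    def deliver(self, record_id: str, value: str) -> bool:
        if self._failures_left > 0:    # simulate a lost ack or zone blip
            self._failures_left -= 1
            return False
        self.committed.setdefault(record_id, value)  # idempotent commit
        return True                     # positive acknowledgement

def send_with_retries(receiver: Receiver, record_id: str, value: str,
                      max_tries: int = 10) -> None:
    for _ in range(max_tries):
        if receiver.deliver(record_id, value):
            return
    raise RuntimeError("record not acknowledged")

r = Receiver(fail_first_n=2)
send_with_retries(r, "rec-1", "payload")   # succeeds on the third attempt
send_with_retries(r, "rec-1", "payload")   # duplicate retry, deduplicated
print(r.committed)  # {'rec-1': 'payload'} — committed exactly once
```

The key property is that retries are safe because the commit is keyed by record ID, mirroring how at-least-once delivery plus deduplication produces an exactly-once result.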
Dataproc
Dataproc provides streaming and batch data processing capabilities. Dataproc is architected as a regional control plane that enables users to manage Dataproc clusters. The control plane does not depend on an individual zone in a given region. Therefore, during a zonal outage, you retain access to the Dataproc APIs, including the ability to create new clusters.
You can create Dataproc clusters on Compute Engine or on GKE.
Dataproc clusters on Compute Engine
Because a Dataproc cluster on Compute Engine is a zonal resource, a zonal outage makes the cluster unavailable, or destroys the cluster. Dataproc does not automatically snapshot cluster status, so a zone outage could cause loss of data being processed. Dataproc does not persist user data within the service. Users can configure their pipelines to write results to many data stores; you should consider the architecture of the data store and choose a product that offers the required disaster resilience.
If a zone suffers an outage, you can create a new instance of the cluster in another zone, either by selecting a different zone or by using the Auto Placement feature in Dataproc to automatically select an available zone. Once the cluster is available, data processing can resume. You can also run a cluster with High Availability mode enabled, reducing the likelihood that a partial zone outage will impact a master node and, therefore, the whole cluster.
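The manual recovery step above (re-creating the cluster in a healthy zone of the same region) can be sketched as a simple zone-selection helper. This is a hypothetical illustration; the function and zone names are invented and not part of the Dataproc API.

```python
# Hypothetical sketch of the recovery step described above: choose a
# healthy zone in the same region in which to re-create a Dataproc cluster.

def pick_recovery_zone(region_zones: list, failed_zone: str) -> str:
    """Return the first zone in the region that isn't the failed one."""
    candidates = [z for z in region_zones if z != failed_zone]
    if not candidates:
        raise RuntimeError("no healthy zone available in this region")
    return candidates[0]

zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
print(pick_recovery_zone(zones, "us-central1-a"))  # us-central1-b
```

In practice you would let Dataproc's Auto Placement make this choice, but the logic is the same: any surviving zone in the region can host the new cluster.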
Dataproc clusters on GKE
Dataproc clusters on GKE can be zonal or regional.
For more information about the architecture and the DR capabilities of zonal and regional GKE clusters, see the Google Kubernetes Engine section later in this document.
Datastream
Datastream is a serverless change data capture (CDC) and replication service that lets you synchronize data reliably, and with minimal latency. Datastream provides replication of data from operational databases into BigQuery and Cloud Storage. In addition, it offers streamlined integration with Dataflow templates to build custom workflows for loading data into a wide range of destinations, such as Cloud SQL and Spanner.
Zonal outage: Datastream is a multi-zonal service. It can withstand a complete, unplanned zonal outage without any loss of data or availability. If a zonal failure occurs, you can still access and manage your resources in Datastream.
Regional outage: In the case of a regional outage, Datastream becomes available again as soon as the outage is resolved.
Document AI
Document AI is a document understanding platform that takes unstructured data from documents and transforms it into structured data, making it easier to understand, analyze, and consume. Document AI is a regional offering. Customers can choose the region but not the zones within that region. Data and traffic are automatically load balanced across zones within a region. Servers are automatically scaled to meet incoming traffic and are load balanced across zones as necessary. Each zone maintains a scheduler that provides this autoscaling per zone. The scheduler is also aware of the load other zones are receiving and provisions extra capacity in-zone to allow for any zonal failures.
Zonal outage: Document AI stores user documents and processor version data. This data is stored regionally and written synchronously. Because data is regionally stored, data plane operations aren't affected by zonal failures. Traffic automatically routes to other zones in the event of a zonal failure, with a delay based on how long it takes dependent services, like Vertex AI, to recover.
Regional outage: Data is never replicated across regions. During a regional outage, Document AI doesn't fail over. Customers choose the Google Cloud region in which they want to use Document AI, but customer traffic is never routed to another region.
Endpoint Verification
Endpoint Verification lets administrators and security operations professionals build an inventory of devices that access an organization's data. Endpoint Verification also provides critical device trust and security-based access control as a part of the Chrome Enterprise Premium solution.
Use Endpoint Verification when you want an overview of the security posture of your organization's laptop and desktop devices. When Endpoint Verification is paired with Chrome Enterprise Premium offerings, Endpoint Verification helps enforce fine-grained access control on your Google Cloud resources.
Endpoint Verification is available for Google Cloud, Cloud Identity,Google Workspace Business, and Google Workspace Enterprise.
Eventarc
Eventarc provides asynchronously delivered events from Google providers (first-party), user apps (second-party), and software as a service (third-party) using loosely coupled services that react to state changes. It lets customers configure their destinations (for example, a Cloud Run instance or a 2nd gen Cloud Run function) to be triggered when an event occurs in an event provider service or the customer's code.
Zonal outage: Eventarc stores metadata related to triggers. This data is stored regionally and written synchronously. The Eventarc API that creates and manages triggers and channels only returns the API call when the data has been committed to a quorum within a region. Because data is regionally stored, data plane operations aren't affected by zonal failures. In the event of a zonal failure, traffic is automatically routed to other zones. Eventarc services for receiving and delivering second-party and third-party events are replicated across zones. These services are regionally distributed. Requests to unavailable zones are automatically served from available zones in the region.
Regional outage: Customers choose the Google Cloud region in which they want to create their Eventarc triggers. Data is never replicated across regions. Customer traffic is never routed by Eventarc to a different region. In the case of a regional failure, Eventarc becomes available again as soon as the outage is resolved. To achieve higher availability, customers are encouraged to deploy triggers in multiple regions.
Note the following:
- Eventarc services for receiving and delivering first-party events are provided on a best-effort basis and are not covered by RTO/RPO.
- Eventarc event delivery for Google Kubernetes Engine services is provided on a best-effort basis and is not covered by RTO/RPO.
Filestore
The Basic and Zonal tiers are zonal resources. They are not tolerant of failure in the deployed zone or region.
Regional tier Filestore instances are regional resources. Filestore adopts the strict consistency policy required by NFS. When a client writes data, Filestore doesn't return an acknowledgment until the change is persisted and replicated in two zones, so that subsequent reads return the correct data.
In the event of a zone failure, a Regional tier instance continues to serve data from other zones, and in the meantime accepts new writes. Both read and write operations might have degraded performance; write operations might not be replicated. Encryption is not compromised because keys are served from other zones.
We recommend that clients create external backups in case of further outages in other zones in the same region. The backup can be used to restore the instance to other regions.
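The synchronous two-zone write path described above can be illustrated with a toy model. This is not Filestore's implementation; it is a minimal sketch showing why a write acknowledged only after replication to two zones survives the loss of either zone.

```python
# Illustrative model (not Filestore internals) of the regional write path:
# a write is acknowledged only after it is persisted in two zones, so a
# subsequent read from any surviving zone returns the new data.

class RegionalVolume:
    def __init__(self, zones=("zone-a", "zone-b")):
        self.replicas = {z: {} for z in zones}

    def write(self, key: str, value: str) -> str:
        for replica in self.replicas.values():  # persist in both zones first
            replica[key] = value
        return "ack"                             # ack only after both persist

    def read(self, key: str, from_zone: str) -> str:
        return self.replicas[from_zone][key]

vol = RegionalVolume()
assert vol.write("file.txt", "v1") == "ack"
print(vol.read("file.txt", "zone-b"))  # v1 — correct even if zone-a fails
```

This is the essence of the NFS strict-consistency guarantee the section describes: the acknowledgment implies durability in more than one zone.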
Firestore
Firestore is a flexible, scalable database for mobile, web, and server development from Firebase and Google Cloud. Firestore offers automatic multi-region data replication, strong consistency guarantees, atomic batch operations, and ACID transactions.
Firestore offers both single-region and multi-region locations to customers. Traffic is automatically load balanced across zones in a region.
Regional Firestore instances synchronously replicate data across at least three zones. In the case of a zonal failure, writes can still be committed by the remaining two (or more) replicas, and committed data is persisted. Traffic automatically routes to other zones. A regional location offers lower costs, lower write latency, and co-location with other Google Cloud resources.
Firestore multi-region instances synchronously replicate data across five zones in three regions (two serving regions and one witness region), and they are robust against zonal and regional failure. In the case of a zonal or regional failure, committed data is persisted. Traffic automatically routes to serving zones and regions, and commits are still served by at least three zones across the two remaining regions. Multi-regions maximize the availability and durability of databases.
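The replica counts above follow from majority-quorum arithmetic. The sketch below is a simplification, not Firestore's actual replication protocol: it only shows why a 3-zone regional instance tolerates one zone loss and a 5-zone multi-region instance tolerates losing a serving region (two zones).

```python
# Simplified quorum sketch (not Firestore's actual protocol): a write
# commits when a majority of replicas acknowledge it.

def commit_succeeds(total_replicas: int, failed_replicas: int) -> bool:
    """True if the surviving replicas still form a majority quorum."""
    acks = total_replicas - failed_replicas
    return acks > total_replicas // 2

print(commit_succeeds(3, 1))  # True: regional instance survives one zone loss
print(commit_succeeds(5, 2))  # True: multi-region survives a serving region (2 zones)
print(commit_succeeds(3, 2))  # False: only one replica left, no quorum
```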
Firewall Insights
Firewall Insights helps you understand and optimize your firewall rules. It provides insights, recommendations, and metrics about how your firewall rules are being used. Firewall Insights also uses machine learning to predict future firewall rule usage. Firewall Insights lets you make better decisions during firewall rule optimization. For example, Firewall Insights identifies rules that it classifies as overly permissive. You can use this information to make your firewall configuration stricter.
Zonal outage: Because Firewall Insights data is replicated across zones, it isn't affected by a zonal outage, and customer traffic is automatically routed to other zones.
Regional outage: Because Firewall Insights data is replicated across regions, it isn't affected by a regional outage, and customer traffic is automatically routed to other regions.
Fleet
Fleets let customers manage multiple Kubernetes clusters as a group, and allow platform administrators to use multi-cluster services. For example, fleets let administrators apply uniform policies across all clusters or set up Multi Cluster Ingress.
When you register a GKE cluster to a fleet, by default, the cluster has a regional membership in the same region. When you register a non-Google Cloud cluster to a fleet, you can pick any region or the global location. The best practice is to choose a region that's close to the cluster's physical location. This provides optimal latency when using Connect gateway to access the cluster.
In the case of a zonal outage, fleet functionalities are not affected unless the underlying cluster is zonal and becomes unavailable.
In the case of a regional outage, fleet functionalities fail statically for the in-region membership clusters. Mitigation of a regional outage requires deployment across multiple regions, as suggested by Architecting disaster recovery for cloud infrastructure outages.
Google Cloud Armor
Cloud Armor helps you protect your deployments and applications from multiple types of threats, including volumetric DDoS attacks and application attacks like cross-site scripting and SQL injection. Cloud Armor filters unwanted traffic at Google Cloud load balancers and prevents such traffic from entering your VPC and consuming resources. Some of these protections are automatic. Some require you to configure security policies and attach them to backend services or regions. Globally scoped Cloud Armor security policies are applied at global load balancers. Regionally scoped security policies are applied at regional load balancers.
Zonal outage: In case of a zonal outage, Google Cloud load balancers redirect your traffic to other zones where healthy backend instances are available. Cloud Armor protection is available immediately after the traffic failover because your Cloud Armor security policies are synchronously replicated to all zones in a region.
Regional outage: In case of regional outages, global Google Cloud load balancers redirect your traffic to other regions where healthy backend instances are available. Cloud Armor protection is available immediately after the traffic failover because your global Cloud Armor security policies are synchronously replicated to all regions. To be resilient against regional failures, you must configure Cloud Armor regional security policies for all your regions.
Google Kubernetes Engine
Google Kubernetes Engine (GKE) offers a managed Kubernetes service by streamlining the deployment of containerized applications on Google Cloud. You can choose between regional or zonal cluster topologies.
- When creating a zonal cluster, GKE provisions one control plane machine in the chosen zone, as well as worker machines (nodes) within the same zone.
- For regional clusters, GKE provisions three control plane machines in three different zones within the chosen region. By default, nodes are also spanned across three zones, though you can choose to create a regional cluster with nodes provisioned only in one zone.
- Multi-zonal clusters are similar to zonal clusters in that they include one control plane machine, but additionally offer the ability to span nodes across multiple zones.
Zonal outage: To avoid zonal outages, use regional clusters. The control plane and the nodes are distributed across three different zones within a region. A zone outage does not impact control plane and worker nodes deployed in the other two zones.
Regional outage: Mitigation of a regional outage requires deployment across multiple regions. Although currently not offered as a built-in product capability, multi-region topology is an approach taken by several GKE customers today, and can be manually implemented. You can create multiple regional clusters to replicate your workloads across multiple regions, and control the traffic to these clusters using multi-cluster ingress.
HA VPN
HA VPN (high availability) is a resilient Cloud VPN offering that securely encrypts your traffic from your on-premises private cloud, other virtual private cloud, or other cloud service provider network to your Google Cloud Virtual Private Cloud (VPC).
HA VPN gateways have two interfaces, each with an IP address from separate IP address pools, split both logically and physically across different PoPs and clusters, to ensure optimal redundancy.
Zonal outage: In the case of a zonal outage, one interface may lose connectivity, but traffic is redirected to the other interface via dynamic routing using Border Gateway Protocol (BGP).
Regional outage: In the case of a regional outage, both interfaces may lose connectivity for a brief period.
Identity and Access Management
Identity and Access Management (IAM) is responsible for all authorization decisions for actions on cloud resources. IAM confirms that a policy grants permission for each action (in the data plane), and it processes updates to those policies through a SetPolicy call (in the control plane).
All IAM policies are replicated across multiple zones within every region, helping IAM data plane operations recover from failures in other regions and tolerate zone failures within each region. The resilience of the IAM data plane against zone failures and region failures enables multi-region and multi-zone architectures for high availability.
IAM control plane operations can depend on cross-region replication. When SetPolicy calls succeed, the data has been written to multiple regions, but propagation to other regions is eventually consistent. The IAM control plane is resilient to single-region failure.
Identity-Aware Proxy
Identity-Aware Proxy provides access to applications hosted on Google Cloud, on other clouds, and on-premises. IAP is regionally distributed, and requests to unavailable zones are automatically served from other available zones in the region.
Regional outages in IAP affect access to the applications hosted in the impacted region. We recommend that you deploy to multiple regions and use Cloud Load Balancing to achieve higher availability and resilience against regional outages.
Identity Platform
Identity Platform lets customers add customizable Google-grade identity and access management to their apps. Identity Platform is a global offering. Customers cannot choose the regions or zones in which their data is stored.
Zonal outage: During a zonal outage, Identity Platform fails over requests to the next closest cell. All data is saved on a global scale, so there's no data loss.
Regional outage: During a regional outage, Identity Platform requests to the unavailable region temporarily fail while Identity Platform removes traffic from the affected region. Once there's no more traffic to the affected region, a global server load-balancing service routes requests to the nearest available healthy region. All data is saved globally, so there's no data loss.
Knative serving
Knative serving is a global service that enables customers to run serverless workloads on customer clusters. Its purpose is to ensure that Knative serving workloads are properly deployed on customer clusters and that the installation status of Knative serving is reflected in the GKE Fleet API Feature resource. This service takes part only when installing or upgrading Knative serving resources on customer clusters. It isn't involved in executing cluster workloads. Customer clusters belonging to projects that have Knative serving enabled are distributed between replicas in multiple regions and zones; each cluster is monitored by one replica.
Zonal and regional outage: Clusters that are monitored by replicas hosted in a location undergoing an outage are automatically redistributed between healthy replicas in other zones and regions. While this reassignment is in progress, there might be a short time when some clusters are not monitored by Knative serving. If during that time the user decides to enable Knative serving features on the cluster, the installation of Knative serving resources on the cluster starts after the cluster reconnects with a healthy Knative serving service replica.
Looker (Google Cloud core)
Looker (Google Cloud core) is a business intelligence platform that provides simplified and streamlined provisioning, configuration, and management of a Looker instance from the Google Cloud console. Looker (Google Cloud core) lets users explore data, create dashboards, set up alerts, and share reports. In addition, Looker (Google Cloud core) offers an IDE for data modelers and rich embedding and API features for developers.
Looker (Google Cloud core) is composed of regionally isolated instances that synchronously replicate data across zones within the region. Ensure that the resources your instance uses, such as the data sources that Looker (Google Cloud core) connects to, are in the same region that your instance runs in.
Zonal outage: Looker (Google Cloud core) instances store metadata and their own deployed containers. The data is written synchronously across replicated instances. In a zonal outage, Looker (Google Cloud core) instances continue to serve from other available zones in the same region. Any transactions or API calls return after the data has been committed to a quorum within a region. If the replication fails, then the transaction is not committed and the user is notified about the failure. If more than one zone fails, the transactions also fail and the user is notified. Looker (Google Cloud core) stops any schedules or queries that are currently running. You have to reschedule or queue them again after resolving the failure.
Regional outage: Looker (Google Cloud core) instances within the affected region aren't available. Looker (Google Cloud core) stops any schedules or queries that are currently running. You have to reschedule or queue the queries again after resolving the failure. You can manually create new instances in a different region. You can also recover your instances using the process defined in Import or export data from a Looker (Google Cloud core) instance. We recommend that you set up a periodic data export process to copy the assets in advance, in the unlikely event of a regional outage.
Looker Studio
Looker Studio is a data visualization and business intelligence product. It enables customers to connect to their data stored in other systems, create reports and dashboards using that data, and share the reports and dashboards throughout their organization. Looker Studio is a global service and does not allow users to select a resource scope.
In the case of a zonal outage, Looker Studio continues to serve requests from another zone in the same region or in a different region without interruption. User assets are synchronously replicated across regions. Therefore, there is no data loss.
In the case of a regional outage, Looker Studio continues to serve requests from another region without interruption. User assets are synchronously replicated across regions. Therefore, there is no data loss.
Memorystore for Memcached
Memorystore for Memcached is Google Cloud's managed Memcached offering. Memorystore for Memcached lets customers create Memcached clusters that can be used as high-throughput, key-value databases for applications.
Memcached clusters are regional, with nodes distributed across all customer-specified zones. However, Memcached doesn't replicate any data across nodes. Therefore, a zonal failure can result in loss of data, also described as a partial cache flush. Memcached instances will continue to operate, but they will have fewer nodes; the service won't start any new nodes during a zonal failure. Memcached nodes in unaffected zones will continue to serve traffic, although the zonal failure will result in a lower cache hit rate until the zone is recovered.
In the event of a regional failure, Memcached nodes don't serve traffic. In that case, data is lost, which results in a full cache flush. To mitigate a regional outage, you can implement an architecture that deploys the application and Memorystore for Memcached across multiple regions.
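The "partial cache flush" effect above can be estimated with back-of-the-envelope arithmetic. This is a rough model under stated assumptions (keys spread evenly across zones, no replication), not a Memorystore formula.

```python
# Rough sketch: with cached keys spread evenly across nodes in N zones and
# no replication, losing one zone flushes about 1/N of the cache, lowering
# the hit rate proportionally until the zone recovers. Assumes uniform key
# distribution; real hit rates also depend on access patterns.

def post_outage_hit_rate(steady_hit_rate: float, num_zones: int,
                         zones_lost: int = 1) -> float:
    surviving_fraction = (num_zones - zones_lost) / num_zones
    return steady_hit_rate * surviving_fraction

print(post_outage_hit_rate(0.90, 3))  # roughly 0.60: one of three zones lost
```

Misses on the flushed keys then fall through to the backing store, so capacity planning for the database should account for this temporary extra load.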
Memorystore for Redis
Memorystore for Redis is a fully managed Redis service for Google Cloud that can reduce the burden of managing complex Redis deployments. It currently offers two tiers: Basic Tier and Standard Tier. For Basic Tier, a zonal or regional outage causes loss of data, also known as a full cache flush. For Standard Tier, a regional outage causes loss of data. A zonal outage might cause partial data loss in a Standard Tier instance due to its asynchronous replication.
Important: For replication to perform predictably, ensure that the CPU usage and the system memory usage ratio remain within the recommended values.
Zonal outage: Standard Tier instances asynchronously replicate dataset operations from the dataset in the primary node to the replica node. When the outage occurs within the zone of the primary node, the replica node is promoted to become the primary node. During the promotion, a failover occurs and the Redis client has to reconnect to the instance. After reconnecting, operations resume. For more information about high availability of Memorystore for Redis instances in the Standard Tier, see Memorystore for Redis high availability.
If you enable read replicas in your Standard Tier instance and you only have one replica, the read endpoint isn't available for the duration of a zonal outage. For more information about disaster recovery of read replicas, see Failure modes for read replicas.
Regional outage: Memorystore for Redis is a regional product, so a single instance cannot withstand a regional failure. You can schedule periodic tasks to export a Redis instance to a Cloud Storage bucket in a different region. When a regional outage occurs, you can restore the Redis instance in a different region from the dataset you have exported.
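The export-based recovery above implies a simple RPO relationship. The sketch below is illustrative arithmetic, not a Memorystore API: with periodic exports to a bucket in another region, the worst-case data loss after a regional outage is the export interval, because the restore replays the last completed export.

```python
# Illustrative sketch: RPO arithmetic for periodic Redis exports. Writes
# made after the last completed export are lost if the region fails.
from datetime import datetime, timedelta, timezone

def worst_case_rpo(export_interval: timedelta) -> timedelta:
    """Worst-case data loss equals the export interval."""
    return export_interval

def data_loss_window(outage_at: datetime, last_export_at: datetime) -> timedelta:
    """Actual data lost: everything written since the last export."""
    return outage_at - last_export_at

last_export = datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)
outage = datetime(2024, 5, 1, 3, 45, tzinfo=timezone.utc)
print(data_loss_window(outage, last_export))  # 0:45:00 of writes lost
print(worst_case_rpo(timedelta(hours=1)))     # 1:00:00 worst case with hourly exports
```

Shortening the export interval tightens the RPO at the cost of more export load on the instance.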
Multi-Cluster Service Discovery and Multi Cluster Ingress
GKE multi-cluster Services (MCS) consists of multiple components. The components include the Google Kubernetes Engine hub (which orchestrates multiple Google Kubernetes Engine clusters by using memberships), the clusters themselves, and GKE hub controllers (Multi Cluster Ingress, Multi-Cluster Service Discovery). The hub controllers orchestrate Compute Engine load balancer configuration by using backends on multiple clusters.
In the case of a zonal outage, Multi-Cluster Service Discovery continues to serve requests from another zone or region. In the case of a regional outage, Multi-Cluster Service Discovery does not fail over.
In the case of a zonal outage for Multi Cluster Ingress, if the config cluster is zonal and in scope of the failure, the user needs to manually fail over. The data plane is fail-static and continues serving traffic until the user has failed over. To avoid the need for manual failover, use a regional cluster for the configuration cluster.
In the case of a regional outage, Multi Cluster Ingress does not fail over. Users must have a DR plan in place for manually failing over the configuration cluster. For more information, see Setting up Multi Cluster Ingress and Configuring multi-cluster Services.
For more information about GKE, see the "Google Kubernetes Engine" section in Architecting disaster recovery for cloud infrastructure outages.
Network Analyzer
Network Analyzer automatically monitors your VPC network configurations and detects misconfigurations and suboptimal configurations. It provides insights on network topology, firewall rules, routes, configuration dependencies, and connectivity to services and applications. It identifies network failures, provides root cause information, and suggests possible resolutions.
Network Analyzer runs continuously and triggers relevant analyses based on near real-time configuration updates in your network. If Network Analyzer detects a network failure, it tries to correlate the failure with recent configuration changes to identify root causes. Wherever possible, it provides recommendations that suggest how to fix the issues.
Network Analyzer is a diagnostic tool with no data plane components. It does not process or generate user traffic.
Zonal outage: The Network Analyzer service is replicated globally, and its availability isn't affected by a zonal outage.
If insights from Network Analyzer contain configurations from a zone suffering an outage, data quality is affected. The network insights that refer to configurations in that zone become stale. Don't rely on any insights provided by Network Analyzer during outages.
Regional outage: The Network Analyzer service is replicated globally, and its availability isn't affected by a regional outage.
If insights from Network Analyzer contain configurations from a region suffering an outage, data quality is affected. The network insights that refer to configurations in that region become stale. Don't rely on any insights provided by Network Analyzer during outages.
Network Topology
Network Topology is a visualization tool that shows the topology of your network infrastructure. The Infrastructure view shows Virtual Private Cloud (VPC) networks, hybrid connectivity to and from your on-premises networks, connectivity to Google-managed services, and the associated metrics.
Zonal outage: In case of a zonal outage, data for that zone won't appear in Network Topology. Data for other zones aren't affected.
Regional outage: In case of a regional outage, data for that region won't appear in Network Topology. Data for other regions aren't affected.
Performance Dashboard
Performance Dashboard gives you visibility into the performance of the entire Google Cloud network, as well as to the performance of your project's resources.
With these performance-monitoring capabilities, you can distinguish between a problem in your application and a problem in the underlying Google Cloud network. You can also investigate historical network performance problems. Performance Dashboard also exports data to Cloud Monitoring. You can use Monitoring to query the data and get access to additional information.
Zonal outage:
In case of a zonal outage, latency and packet loss data for traffic from the affected zone won't appear in Performance Dashboard. Latency and packet loss data for traffic from other zones isn't affected. When the outage ends, latency and packet loss data resumes.
Regional outage:
In case of a regional outage, latency and packet loss data for traffic from the affected region won't appear in Performance Dashboard. Latency and packet loss data for traffic from other regions isn't affected. When the outage ends, latency and packet loss data resumes.
Network Connectivity Center
Network Connectivity Center is a network connectivity management product that employs a hub-and-spoke architecture. With this architecture, a central management resource serves as a hub and each connectivity resource serves as a spoke. Hybrid spokes currently support HA VPN, Dedicated and Partner Interconnect, and SD-WAN router appliances from major third-party vendors. With Network Connectivity Center hybrid spokes, enterprises can connect Google Cloud workloads and services to on-premises data centers, other clouds, and their branch offices through the global reach of the Google Cloud network.
Zonal outage: A Network Connectivity Center hybrid spoke with an HA configuration is resilient to zonal failures because the control plane and network data plane are redundant across multiple zones within a region.
Regional outage: A Network Connectivity Center hybrid spoke is a regional resource, so it can't withstand a regional failure.
Network Service Tiers
Network Service Tiers lets you optimize connectivity between systems on the internet and your Google Cloud instances. It offers two distinct service tiers, the Premium Tier and the Standard Tier. With the Premium Tier, a globally announced anycast Premium Tier IP address can serve as the frontend for either regional or global backends. With the Standard Tier, a regionally announced Standard Tier IP address can serve as the frontend for regional backends. The overall resilience of an application is influenced by both the network service tier and the redundancy of the backends it's associated with.
Zonal outage: Both the Premium Tier and the Standard Tier offer resilience against zonal outages when associated with regionally redundant backends. When a zonal outage occurs, the failover behavior for cases using regionally redundant backends is determined by the associated backends themselves. When associated with zonal backends, the service becomes available again as soon as the outage is resolved.
Regional outage: The Premium Tier offers resilience against regional outages when it is associated with globally redundant backends. In the Standard Tier, all traffic to the affected region fails; traffic to all other regions is unaffected. When a regional outage occurs, the failover behavior for cases using the Premium Tier with globally redundant backends is determined by the associated backends themselves. When using the Premium Tier with regional backends or the Standard Tier, the service becomes available again as soon as the outage is resolved.
Organization Policy Service
Organization Policy Service provides centralized and programmatic control over your organization's Google Cloud resources. As the Organization Policy administrator, you can configure constraints across your entire resource hierarchy.
Zonal outage: All organization policies created by Organization Policy Service are replicated asynchronously across multiple zones within every region. Organization Policy data and control plane operations are tolerant of zone failures within each region.
Regional outage: All organization policies created by Organization Policy Service are replicated asynchronously across multiple regions. Organization Policy control plane operations are written to multiple regions, and propagation to other regions is consistent within minutes, so the Organization Policy control plane is resilient to a single region failure. Organization Policy data plane operations can recover from failures in other regions. This resilience of the Organization Policy data plane against zone failures and region failures enables multi-region and multi-zone architectures for high availability.
Packet Mirroring
Packet Mirroring clones the traffic of specified instances in your Virtual Private Cloud (VPC) network and forwards the cloned data to instances behind a regional internal load balancer for examination. Packet Mirroring captures all traffic and packet data, including payloads and headers.
For more information about the functionality of Packet Mirroring, see the Packet Mirroring overview page.
Zonal outage: Configure the internal load balancer so there are instances in multiple zones. If a zonal outage occurs, Packet Mirroring diverts cloned packets to a healthy zone.
Regional outage: Packet Mirroring is a regional product. If there's a regional outage, packets in the affected region aren't cloned.
Persistent Disk
Persistent Disks are available in zonal and regional configurations.
Zonal Persistent Disks are hosted in a single zone. If the disk's zone is unavailable, the Persistent Disk is unavailable until the zone outage is resolved.
Regional Persistent Disks provide synchronous replication of data between two zones in a region. In the event of an outage in your virtual machine's zone, you can force-attach a regional Persistent Disk to a VM instance in the disk's secondary zone. To perform this task, you must either start another VM instance in that zone or maintain a hot standby VM instance in that zone.
To asynchronously replicate data in a Persistent Disk across regions, you can use Persistent Disk Asynchronous Replication (PD Async Replication), which provides low-RTO and low-RPO block storage replication for cross-region active-passive DR. In the unlikely event of a regional outage, PD Async Replication enables you to fail over your data to a secondary region and restart your workload in that region.
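Because PD Async Replication is asynchronous, a failover can lose whatever was written after the last completed replication cycle. The sketch below is illustrative only (the helper names are ours, not part of any Google Cloud API): it expresses that worst-case data-loss window as code, so you can check a failover decision against an RPO target.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def rpo_exposure(last_replication: datetime,
                 now: Optional[datetime] = None) -> timedelta:
    # Worst case, everything written since the last completed replication
    # cycle is lost if the primary region fails right now.
    now = now or datetime.now(timezone.utc)
    return now - last_replication

def within_rpo(last_replication: datetime, target: timedelta,
               now: Optional[datetime] = None) -> bool:
    # True if failing over now would stay within the stated RPO target.
    return rpo_exposure(last_replication, now) <= target
```

For example, if the last replication completed 10 minutes ago and your RPO target is 15 minutes, failing over now still meets the target.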
Personalized Service Health
Personalized Service Health communicates service disruptions relevant to your Google Cloud projects. It provides multiple channels and processes to view or integrate disruptive events (incidents, planned maintenance) into your incident response process—including the following:
- A dashboard in Google Cloud console
- A service API
- Configurable alerts
- Logs generated and sent to Cloud Logging
Zonal outage: Data is served from a global database with no dependency on specific locations. If a zonal outage occurs, Service Health is able to serve requests and automatically reroute traffic to zones in the same region that still function. Service Health can return API calls successfully if it is able to retrieve event data from the Service Health database.
Regional outage: Data is served from a global database with no dependency on specific locations. If there is a regional outage, Service Health is still able to serve requests but may perform with reduced capacity. Regional failures in Logging locations might affect Service Health users consuming logs or cloud alerting notifications.
Private Service Connect
Private Service Connect is a capability of Google Cloud networking that lets consumers access managed services privately from inside their VPC network. Similarly, it lets managed service producers host these services in their own separate VPC networks and offer a private connection to their consumers.
Private Service Connect endpoints for published services
A Private Service Connect endpoint connects to services in a service producer's VPC network by using a Private Service Connect forwarding rule. The service producer provides a service through private connectivity to a service consumer by exposing a single service attachment. The service consumer can then assign a virtual IP address from their own VPC network to that service.
Zonal outage: Private Service Connect traffic generated by client VMs in the consumer VPC network can still access exposed managed services in the service producer's internal VPC network. This access is possible because Private Service Connect traffic fails over to healthy service backends in a different zone.
Regional outage: Private Service Connect is a regional product. It isn't resilient to regional outages. Multi-regional managed services can achieve high availability during a regional outage by configuring Private Service Connect endpoints across multiple regions.
Private Service Connect endpoints for Google APIs
A Private Service Connect endpoint connects to Google APIs by using a Private Service Connect forwarding rule. This forwarding rule lets customers use customized endpoint names with their internal IP addresses.
Zonal outage: Private Service Connect traffic from client VMs in the consumer VPC network can still access Google APIs, because connectivity between the VM and the endpoint automatically fails over to another functional zone in the same region. Requests that are already in flight when an outage begins depend on the client's TCP timeout and retry behavior for recovery.
See Compute Engine recovery for more details.
Regional outage: Private Service Connect is a regional product. It isn't resilient to regional outages. Multi-regional managed services can achieve high availability during a regional outage by configuring Private Service Connect endpoints across multiple regions.
For more information about Private Service Connect, see the "Endpoints" section in Private Service Connect types.
Pub/Sub
Pub/Sub is a messaging service for application integration and stream analytics. Pub/Sub topics are global, meaning that they are visible and accessible from any Google Cloud location. However, any given message is stored in a single Google Cloud region, closest to the publisher and allowed by the resource location policy. Thus, a topic may have messages stored in different regions throughout Google Cloud. The Pub/Sub message storage policy can restrict the regions in which messages are stored.
Zonal outage: When a Pub/Sub message is published, it is synchronously written to storage in at least two zones within the region. Therefore, if a single zone becomes unavailable, there is no customer-visible impact.
Regional outage: During a regional outage, messages stored within the affected region are inaccessible. Publishers and subscribers that would connect to the affected region, either via a regional endpoint or the global endpoint, aren't able to connect. Publishers and subscribers that connect to other regions can still connect, and messages available in other regions are delivered to network-nearest subscribers that have capacity.
If your application relies on message ordering, review the detailed recommendations from the Pub/Sub team. Message ordering guarantees are provided on a per-region basis and can become disrupted if you use a global endpoint.
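One way to keep ordered delivery intact is to pin publishers and subscribers to a locational endpoint instead of the global endpoint. Pub/Sub locational endpoints follow the `{region}-pubsub.googleapis.com` pattern; the helper below is a minimal sketch (the function name is ours) that builds such an endpoint string for use as a client's `api_endpoint` option:

```python
def regional_pubsub_endpoint(region: str) -> str:
    # Locational endpoints pin traffic to one region; ordering guarantees
    # are per-region, so avoiding the global endpoint keeps them intact.
    if not region:
        raise ValueError("region is required")
    return f"{region}-pubsub.googleapis.com"

# With the google-cloud-pubsub client, you would pass the result as, e.g.:
# PublisherClient(client_options={"api_endpoint": regional_pubsub_endpoint("us-east1")})
```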
reCAPTCHA
reCAPTCHA is a global service that detects fraudulent activity, spam, and abuse. It does not require or allow configuration for regional or zonal resiliency. Updates to configuration metadata are asynchronously replicated to each region where reCAPTCHA runs.
In the case of a zonal outage, reCAPTCHA continues to serve requests from another zone in the same or a different region without interruption.
In the case of a regional outage, reCAPTCHA continues to serve requests from another region without interruption.
Secret Manager
Secret Manager is a secrets and credential management product for Google Cloud. With Secret Manager, you can easily audit and restrict access to secrets, encrypt secrets at rest, and ensure that sensitive information is secured in Google Cloud.
Secret Manager resources are normally created with the automatic replication policy (recommended), which causes them to be replicated globally. If your organization has policies that do not allow global replication of secret data, Secret Manager resources can be created with user-managed replication policies, in which one or more regions are chosen for a secret to be replicated to.
Zonal outage: In the case of a zonal outage, Secret Manager continues to serve requests from another zone in the same or a different region without interruption. Within each region, Secret Manager always maintains at least two replicas in separate zones (in most regions, three replicas). When the zone outage is resolved, full redundancy is restored.
Regional outage: In the case of a regional outage, Secret Manager continues to serve requests from another region without interruption, assuming the data has been replicated to more than one region (either through automatic replication or through user-managed replication to more than one region). When the region outage is resolved, full redundancy is restored.
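If your organization restricts where secret data may live, a user-managed replication policy names the regions explicitly. The helper below is a sketch that builds a replication block in the shape of the Secret Manager API's user-managed replication message (verify the exact field names against the current API reference before relying on them):

```python
from typing import Dict, List

def user_managed_replication(regions: List[str]) -> Dict:
    # Restrict a secret's payload to an explicit set of regions instead of
    # the (recommended) automatic global replication.
    if not regions:
        raise ValueError("at least one region is required")
    return {"user_managed": {"replicas": [{"location": r} for r in regions]}}
```

Replicating to at least two regions keeps the secret readable through a single-region outage.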
Security Command Center
Security Command Center is the global, real-time risk management platform for Google Cloud. It consists of two main components: detectors and findings.
Detectors are affected by both regional and zonal outages, in different ways. During a regional outage, detectors can't generate new findings for regional resources, because the resources they're supposed to be scanning aren't available.
During a zonal outage, detectors can take anywhere from several minutes to hours to resume normal operation. Security Command Center won't lose finding data. It also won't generate new finding data for unavailable resources. In the worst-case scenario, Container Threat Detection agents may run out of buffer space while connecting to a healthy cell, which could lead to lost detections.
Findings are resilient to both regional and zonal outages because they're synchronously replicated across regions.
Sensitive Data Protection (including the DLP API)
Sensitive Data Protection provides sensitive data classification, profiling, de-identification, tokenization, and privacy risk analysis services. It works synchronously on the data that's sent in request bodies, or asynchronously on data that's already present in cloud storage systems. Sensitive Data Protection can be invoked through the global or region-specific endpoints.
Global endpoint: The service is designed to be resilient to both regional and zonal failures. If the service is overloaded while a failure happens, data sent to the hybridInspect method of the service might be lost.
To create a failure-resistant architecture, include logic to examine the most recent pre-failure finding that was produced by the hybridInspect method. In case of an outage, the data that was sent to the method might be lost, but no more than the last 10 minutes' worth before the failure event. If there are findings fresher than 10 minutes before the outage started, the data that resulted in those findings wasn't lost. In that case, there's no need to replay the data that came before the finding timestamp, even if it's within the 10-minute interval.
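That replay rule can be made concrete with a small helper. This is an illustrative sketch (the function is ours, not part of the DLP API): given the outage start time and the timestamp of the most recent pre-failure finding, it returns the earliest point from which hybridInspect data should be re-sent.

```python
from datetime import datetime, timedelta
from typing import Optional

LOSS_WINDOW = timedelta(minutes=10)  # at most the last 10 minutes can be lost

def replay_start(outage_start: datetime,
                 last_finding: Optional[datetime]) -> datetime:
    # Data older than the last pre-failure finding was demonstrably
    # processed, so re-send only from the later of that finding and the
    # start of the 10-minute loss window.
    floor = outage_start - LOSS_WINDOW
    if last_finding is None or last_finding < floor:
        return floor
    return last_finding
```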
Regional endpoint: Regional endpoints are not resilient to regional failures. If resiliency against a regional failure is required, consider failing over to other regions. The zonal failure characteristics are the same as above.
Service Usage
The Service Usage API is an infrastructure service of Google Cloud that lets you list and manage APIs and services in your Google Cloud projects. You can list and manage APIs and services provided by Google, Google Cloud, and third-party producers. The Service Usage API is a global service, resilient to both zonal and regional outages. In the case of a zonal or regional outage, the Service Usage API continues to serve requests from other zones in different regions.
For more information about Service Usage, see the Service Usage documentation.
Speech-to-Text
Speech-to-Text lets you convert speech audio to text by using machine learning techniques like neural network models. Audio is sent in real time from an application’s microphone, or it is processed as a batch of audio files.
Zonal outage:
Speech-to-Text API v1: During a zonal outage, Speech-to-Text API version 1 continues to serve requests from another zone in the same region without interruption. However, any jobs that are currently executing within the failing zone are lost. Users must retry the failed jobs, which will be routed to an available zone automatically.
Speech-to-Text API v2: During a zonal outage, Speech-to-Text API version 2 continues to serve requests from another zone in the same region. However, any jobs that are currently executing within the failing zone are lost. Users must retry the failed jobs, which will be routed to an available zone automatically. The Speech-to-Text API returns from an API call only once the data has been committed to a quorum within a region. In some regions, AI accelerators (TPUs) are available only in one zone. In that case, an outage in that zone causes speech recognition to fail, but there is no data loss.
Regional outage:
Speech-to-Text API v1: Speech-to-Text API version 1 is unaffected by regional failure because it is a global multi-region service. The service continues to serve requests from another region without interruption. However, jobs that are currently executing within the failing region are lost. Users must retry those failed jobs, which will be routed to an available region automatically.
Speech-to-Text API v2:
For multi-region Speech-to-Text API version 2, the service continues to serve requests from another region without interruption.
For single-region Speech-to-Text API version 2, the service scopes the job execution to the requested region. Speech-to-Text API version 2 doesn't route traffic to a different region, and data is not replicated to a different region. During a regional failure, Speech-to-Text API version 2 is unavailable in that region. However, it becomes available again when the outage is resolved.
Storage Transfer Service
Storage Transfer Service manages data transfers from various cloud sources to Cloud Storage, as well as to, from, and between file systems.
The Storage Transfer Service API is a global resource.
Storage Transfer Service depends on the availability of the source and destination of a transfer. If a transfer source or destination is unavailable, transfers stop making progress. However, no customer core data or job data is lost. Transfers resume when the source and destination become available again.
You can use Storage Transfer Service with or without an agent, as follows:
Agentless transfers use regional workers to orchestrate transfer jobs.
Agent-based transfers use software agents that are installed on your infrastructure. Agent-based transfers rely on the availability of the transfer agents and on the ability of the agents to connect to the file system. When you're deciding where to install transfer agents, consider the availability of the file system. For example, if you're running transfer agents on multiple Compute Engine VMs to transfer data to an Enterprise-tier Filestore instance (a regional resource), you should consider locating the VMs in different zones within the Filestore instance's region.
If agents become unavailable, or if their connection to the file system is interrupted, transfers stop making progress, but no data is lost. If all agent processes are terminated, the transfer job is paused until new agents are added to the transfer's agent pool.
During an outage, the behavior of Storage Transfer Service is as follows:
Zonal outage: During a zonal outage, the Storage Transfer Service APIs remain available, and you can continue to create transfer jobs. Data continues to transfer.
Regional outage: During a regional outage, the Storage Transfer Service APIs remain available, and you can continue to create transfer jobs. If your transfer's workers are located in the affected region, data transfer stops until the region becomes available again, and then the transfer automatically resumes.
Vertex ML Metadata
Vertex ML Metadata lets you record the metadata and artifacts produced by your ML system and query that metadata to help analyze, debug, and audit the performance of your ML system or the artifacts that it produces.
Zonal outage: In the default configuration, Vertex ML Metadata offers protection against zonal failures. The service is deployed in multiple zones in each region, with data synchronously replicated across the zones within each region. In case of a zonal failure, the remaining zones take over with minimal interruption.
Regional outage: Vertex ML Metadata is a regionalized service. In the case of a regional outage, Vertex ML Metadata won't fail over to another region.
Vertex AI Batch prediction
Batch prediction lets users run batch predictions against AI/ML models on Google's infrastructure. Batch prediction is a regional offering. Customers can choose the region in which they run jobs, but not the specific zones within that region. The batch prediction service automatically load-balances the job across different zones within the chosen region.
Zonal outage: Batch prediction stores metadata for batch prediction jobs within a region. The data is written synchronously across multiple zones within that region. In a zonal outage, batch prediction partially loses the workers performing jobs, but automatically adds them back in other available zones. If multiple batch prediction retries fail, the job status is listed as failed, both in the UI and in API responses. Subsequent user requests to run the job are routed to available zones.
Regional outage: Customers choose the Google Cloud region in which they want to run their batch prediction jobs. Data is never replicated across regions. Batch prediction scopes the job execution to the requested region and never routes prediction requests to a different region. When a regional failure occurs, batch prediction is unavailable in that region. It becomes available again when the outage resolves. We recommend that customers use multiple regions to run their jobs. In case of a regional outage, direct jobs to a different available region.
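A multi-region submission strategy can be as simple as an ordered preference list. The sketch below is a hypothetical helper (not part of any Vertex AI API) that picks the first region not currently affected by an outage:

```python
from typing import Iterable, Sequence

def pick_region(preferred: Sequence[str], unavailable: Iterable[str]) -> str:
    # Walk an ordered preference list and return the first region that
    # isn't currently affected by an outage.
    down = set(unavailable)
    for region in preferred:
        if region not in down:
            return region
    raise RuntimeError("no healthy region available for job submission")
```

The same pattern applies to any regional service in this document that recommends directing jobs or traffic to a backup region.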
Vertex AI Model Registry
Vertex AI Model Registry lets users streamline model management, governance, and the deployment of ML models in a central repository. Vertex AI Model Registry is a regional offering with high availability and offers protection against zonal outages.
Zonal outage: Vertex AI Model Registry offers protection against zonal outages. The service is deployed in three zones in each region, with data synchronously replicated across the zones within the region. If a zone fails, the remaining zones take over with no data loss and minimal service interruption.
Regional outage: Vertex AI Model Registry is a regionalized service. If a region fails, Model Registry won't fail over.
Vertex AI Online prediction
Online prediction lets users deploy AI/ML models on Google Cloud. Online prediction is a regional offering. Customers can choose the region where they deploy their models, but not the specific zones within that region. The prediction service will automatically load-balance the workload across different zones within the selected region.
Zonal outage: Online prediction doesn't store any customer content. A zonal outage leads to failure of the current prediction request execution. Online prediction may or may not automatically retry the prediction request, depending on the endpoint type: a public endpoint retries automatically, while a private endpoint does not. To help handle failures and improve resilience, incorporate retry logic with exponential backoff in your code.
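A minimal version of that retry logic might look like the following. This is a generic sketch, not tied to a specific Vertex AI client; in real code, catch only the transient error types your client raises rather than a bare Exception:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 30.0):
    """Retry a callable with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all attempts exhausted; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

Wrapping a private-endpoint prediction call this way matters most, since those requests aren't retried automatically.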
Regional outage: Customers choose the Google Cloud region in which they want to run their AI/ML models and online prediction services. Data is never replicated across regions. Online prediction scopes the AI/ML model execution to the requested region and never routes prediction requests to a different region. When a regional failure occurs, online prediction service is unavailable in that region. It becomes available again when the outage is resolved. We recommend that customers use multiple regions to run their AI/ML models. In case of a regional outage, direct traffic to a different, available region.
Vertex AI Pipelines
Vertex AI Pipelines is a Vertex AI service that lets you automate, monitor, and govern your machine learning (ML) workflows in a serverless manner. Vertex AI Pipelines is built to provide high availability and offers protection against zonal failures.
Zonal outage: In the default configuration, Vertex AI Pipelines offers protection against zonal failures. The service is deployed in multiple zones in each region, with data synchronously replicated across the zones within the region. In case of a zonal failure, the remaining zones take over with minimal interruption.
Regional outage: Vertex AI Pipelines is a regionalized service. In the case of a regional outage, Vertex AI Pipelines won't fail over to another region. If a regional outage occurs, we recommend that you run your pipeline jobs in a backup region.
Vertex AI Search
Vertex AI Search is a customizable search solution with generative AI features and native enterprise compliance. Vertex AI Search is automatically deployed and replicated across multiple regions within Google Cloud. You can configure where data is stored by choosing a supported multi-region, such as global, US, or EU.
Zonal and regional outage: UserEvents uploaded to Vertex AI Search might not be recoverable due to asynchronous replication delay. Other data and services provided by Vertex AI Search remain available due to automatic failover and synchronous data replication.
Vertex AI Training
Vertex AI Training provides users the ability to run custom training jobs on Google's infrastructure. Vertex AI Training is a regional offering, meaning that customers can choose the region to run their training jobs in. However, customers can't choose the specific zones within that region. The training service might automatically load-balance the job execution across different zones within the region.
Zonal outage: Vertex AI Training stores metadata for the custom training job. This data is stored regionally and written synchronously. The Vertex AI Training API call returns only once this metadata has been committed to a quorum within a region. The training job might run in a specific zone. A zonal outage leads to failure of the current job execution. If so, the service automatically retries the job by routing it to another zone. If multiple retries fail, the job status is updated to failed. Subsequent user requests to run the job are routed to an available zone.
Regional outage: Customers choose the Google Cloud region they want to run their training jobs in. Data is never replicated across regions. Vertex AI Training scopes the job execution to the requested region and never routes training jobs to a different region. In the case of a regional failure, the Vertex AI Training service is unavailable in that region and becomes available again when the outage is resolved. We recommend that customers use multiple regions to run their jobs and, in case of a regional outage, direct jobs to a different region that is available.
Virtual Private Cloud (VPC)
VPC is a global service that provides network connectivity to resources (for example, VMs). Failures, however, are zonal. In the event of a zonal failure, resources in that zone are unavailable. Similarly, if a region fails, only traffic to and from the failed region is affected. The connectivity of healthy regions is unaffected.
Zonal outage: If a VPC network covers multiple zones and a zone fails, the VPC network is still healthy for the healthy zones. Network traffic between resources in healthy zones continues to work normally during the failure. A zonal failure only affects network traffic to and from resources in the failing zone. To mitigate the impact of zonal failures, we recommend that you don't create all resources in a single zone. Instead, when you create resources, spread them across zones.
Regional outage: If a VPC network covers multiple regions and a region fails, the VPC network is still healthy for the healthy regions. Network traffic between resources in healthy regions continues to work normally during the failure. A regional failure only affects network traffic to and from resources in the failing region. To mitigate the impact of regional failures, we recommend that you spread resources across multiple regions.
VPC Service Controls
VPC Service Controls is a regional service. Using VPC Service Controls, enterprise security teams can define fine-grained perimeter controls and enforce that security posture across numerous Google Cloud services and projects. Customer policies are mirrored regionally.
Zonal outage: VPC Service Controls continues to serve requests from another zone in the same region without interruption.
Regional outage: APIs configured for VPC Service Controls policy enforcement in the affected region are unavailable until the region becomes available again. Customers are encouraged to deploy VPC Service Controls-enforced services to multiple regions if higher availability is desired.
Workflows
Workflows is an orchestration product that lets Google Cloud customers:
- deploy and run workflows that connect other existing services using HTTP,
- automate processes, including waiting on HTTP responses with automatic retries for up to a year, and
- implement real-time processes with low-latency, event-driven executions.
A Workflows customer can deploy workflows that describe the business logic they want to perform, then run the workflows either directly with the API or with event-driven triggers (currently limited to Pub/Sub or Eventarc). The workflow being run can manipulate variables, make HTTP calls and store the results, or define callbacks and wait to be resumed by another service.
Zonal outage: Workflows source code is not affected by zonal outages. Workflows stores the source code of workflows, along with the variable values and HTTP responses received by workflows that are running. Source code is stored regionally and written synchronously: the control plane API returns only once this metadata has been committed to a quorum within a region. Variables and HTTP results are also stored regionally and written synchronously, at least every five seconds.
If a zone fails, workflows are automatically resumed based on the last stored data. However, any HTTP requests that haven't already received responses aren't automatically retried. Use retry policies for requests that can be safely retried, as described in our documentation.
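The usual criterion for a safe retry is that the call is idempotent and the failure transient. The predicate below is an illustrative sketch of that rule in Python; in Workflows itself you would express this with a retry policy in the workflow definition rather than in application code:

```python
# Idempotent methods can be repeated without changing the outcome;
# these status codes usually indicate transient server-side trouble.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}
TRANSIENT_STATUS = {429, 502, 503, 504}

def should_retry(method: str, status: int) -> bool:
    # Retry only calls that are both safe to repeat and likely to succeed
    # on a second attempt.
    return method.upper() in IDEMPOTENT_METHODS and status in TRANSIENT_STATUS
```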
Regional outage: Workflows is a regionalized service; in the case of a regional outage, Workflows won't fail over. Customers are encouraged to deploy Workflows to multiple regions if higher availability is desired.
Cloud Service Mesh
Cloud Service Mesh lets you configure a managed service mesh spanning multiple GKE clusters. This documentation concerns only the managed Cloud Service Mesh; the in-cluster variant is self-hosted, and regular platform guidelines should be followed for it.
Zonal outage: Mesh configuration, as it's stored in the GKE cluster, is resilient to zonal outages as long as the cluster is regional. Data that the product uses for internal bookkeeping is stored either regionally or globally, and isn't affected if a single zone is out of service. The control plane runs in the same region as the GKE cluster that it supports (for zonal clusters, the containing region), and isn't affected by outages within a single zone.
Regional outage: Cloud Service Mesh provides services to GKE clusters, which are either regional or zonal. In the case of a regional outage, Cloud Service Mesh won't fail over, and neither will GKE. Customers are encouraged to deploy meshes consisting of GKE clusters that cover different regions.
Service Directory
Service Directory is a platform for discovering, publishing, and connecting services. It provides real-time information, in a single place, about all your services. Service Directory lets you perform service inventory management at scale, whether you have a few service endpoints or thousands.
Service Directory resources are created regionally, matching the location parameter specified by the user.
Zonal outage: During a zonal outage, Service Directory continues to serve requests from another zone in the same or a different region without interruption. Within each region, Service Directory always maintains multiple replicas. Once the zonal outage is resolved, full redundancy is restored.
Regional outage: Service Directory isn't resilient to regional outages.
Last updated 2024-05-10 UTC.