Building blocks of reliability in Google Cloud
Google Cloud infrastructure services run in locations around the globe. The locations are divided into failure domains called regions and zones, which are the foundational building blocks for designing reliable infrastructure for your cloud workloads.
A failure domain is a resource or a group of resources that can fail independently of other resources. A standalone Compute Engine VM is an example of a resource that's a failure domain. A Google Cloud region or zone is an example of a failure domain that consists of a group of resources. When an application is distributed redundantly across failure domains, it can achieve a higher aggregated level of availability than that provided by each failure domain.
This part of the Google Cloud infrastructure reliability guide describes the building blocks of reliability in Google Cloud and how they affect the availability of your cloud resources.
Regions and zones
Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources. For more information about region-specific considerations, see Geography and regions.
Platform availability
Google Cloud infrastructure is designed to tolerate and recover from failures. Google continually invests in innovative approaches to maintain and improve the reliability of Google Cloud. The following capabilities of Google Cloud infrastructure help to provide a reliable platform for your cloud workloads:
- Geographically separated regions to mitigate the effects of natural disasters and region outages on global services.
- Hardware redundancy and replication to avoid single points of failure.
- Live migration of resources during maintenance events. For example, during planned infrastructure maintenance, Compute Engine VMs can be moved to another host in the same zone by using live migration.
- A secure-by-design infrastructure foundation for the physical infrastructure and software on which Google Cloud runs, and operational security controls to protect your data and workloads. For more information, see Google infrastructure security design overview.
- A high-performance backbone network that uses an advanced software-defined networking (SDN) approach to network management, with edge-caching services to deliver consistent performance that scales well.
- Continuous monitoring and reporting. You can view the status of Google Cloud services in every location by using the Google Cloud Service Health Dashboard.
- Annual, company-wide Disaster Recovery Testing (DiRT) events to ensure that Google Cloud services and internal business operations continue to run during a disaster.
- A change management approach that emphasizes reliability across all the phases of the software development lifecycle for any changes to the Google Cloud platform and services.
Google Cloud infrastructure is designed to support the following target levels of availability for most customer workloads:
Deployment location | Availability (uptime) % | Estimated maximum downtime |
---|---|---|
Single zone | 3 nines: 99.9% | 43.2 minutes in a 30-day month |
Multiple zones in a region | 4 nines: 99.99% | 4.3 minutes in a 30-day month |
Multiple regions | 5 nines: 99.999% | 26 seconds in a 30-day month |
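The downtime estimates in the table above follow directly from the availability percentages. The following minimal Python sketch shows the arithmetic; the function name `max_downtime_minutes` is illustrative and not part of any Google Cloud API.

```python
def max_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the maximum downtime, in minutes, that an availability target permits."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


# Availability targets taken from the table above.
targets = [
    ("Single zone", 99.9),
    ("Multiple zones in a region", 99.99),
    ("Multiple regions", 99.999),
]

for location, percent in targets:
    minutes = max_downtime_minutes(percent)
    print(f"{location}: {minutes:.2f} minutes ({minutes * 60:.0f} seconds)")
```

The table rounds 4.32 minutes to 4.3 minutes and 25.92 seconds to 26 seconds.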
The availability percentages in the preceding table are targets. The uptime Service Level Agreements (SLAs) for specific Google Cloud services might be different from these availability targets. For example, the uptime SLA for a Bigtable instance depends on the number of clusters, their distribution across locations, and the routing policy that you configure.
The minimum uptime SLA for a Bigtable instance with clusters in three or more regions is 99.999% if the multi-cluster routing policy is configured. However, if the single-cluster routing policy is configured, then the minimum uptime SLA is 99.9% regardless of the number of clusters and their distribution.
The diagrams in this section show Bigtable instances with varying cluster sizes and the consequent differences in their uptime SLAs.
Single cluster
The following diagram shows a single-cluster Bigtable instance, with a minimum uptime SLA of 99.9%:
Multiple clusters
The following diagram shows a multi-cluster Bigtable instance in multiple zones within a single region, with multi-cluster routing (minimum uptime SLA: 99.99%):
Multiple clusters
The following diagram shows a multi-cluster Bigtable instance in three regions, with multi-cluster routing (minimum uptime SLA: 99.999%):
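The three Bigtable configurations described above can be summarized as a small lookup. This is a sketch only: the function `bigtable_min_uptime_sla` and its parameters are hypothetical, and the treatment of a multi-cluster instance that spans exactly two regions (99.99%) is an assumption that this page doesn't state. Consult the Bigtable SLA for authoritative values.

```python
def bigtable_min_uptime_sla(num_clusters: int, num_regions: int, routing: str) -> float:
    """Return the minimum uptime SLA (percent) for the cases described on this page."""
    if routing not in ("single-cluster", "multi-cluster"):
        raise ValueError(f"unknown routing policy: {routing!r}")
    # Single-cluster routing yields 99.9% regardless of cluster count or distribution.
    if routing == "single-cluster" or num_clusters < 2:
        return 99.9
    # Multi-cluster routing with clusters in three or more regions.
    if num_regions >= 3:
        return 99.999
    # Multi-cluster routing within one region (two regions is an assumption).
    return 99.99


print(bigtable_min_uptime_sla(1, 1, "single-cluster"))
print(bigtable_min_uptime_sla(2, 1, "multi-cluster"))
print(bigtable_min_uptime_sla(3, 3, "multi-cluster"))
```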
Aggregate infrastructure availability
To run your applications in Google Cloud, you use infrastructure resources like VMs and databases. These infrastructure resources, together, constitute your application's infrastructure stack.
The following diagram shows an example of an infrastructure stack in Google Cloud and the availability SLA for each resource in the stack:
This example infrastructure stack includes the following Google Cloud resources:
- A regional external Application Load Balancer receives and responds to user requests.
- A regional managed instance group (MIG) is the backend for the regional external Application Load Balancer. The MIG contains two Compute Engine VMs in different zones. Each VM hosts an instance of a web server.
- An internal load balancer handles communication between the web server and the application server instances.
- A second regional MIG is the backend for the internal load balancer. This MIG has two Compute Engine VMs in different zones. Each VM hosts an instance of an application server.
- A Cloud SQL instance that's configured for HA is the database for the application. The primary database instance is replicated synchronously to a standby database instance.
The aggregate availability that you can expect from an infrastructure stack like the preceding example depends on the following factors:
Google Cloud SLAs
The uptime SLAs of the Google Cloud services that you use in your infrastructure stack influence the minimum aggregate availability that you can expect from the stack.
The following tables present a comparison of the uptime SLAs for some services:
Compute services | Monthly Uptime SLA | Estimated maximum downtime in a 30-day month |
---|---|---|
Compute Engine VM | 99.9% | 43.2 minutes |
GKE Autopilot pods in multiple zones | 99.9% | 43.2 minutes |
Cloud Run service | 99.95% | 21.6 minutes |
Database services | Monthly Uptime SLA | Estimated maximum downtime in a 30-day month |
---|---|---|
Cloud SQL for PostgreSQL instance (Enterprise edition) | 99.95% | 21.6 minutes |
AlloyDB for PostgreSQL instance | 99.99% | 4.3 minutes |
Spanner multi-region instance | 99.999% | 26 seconds |
For the SLAs of other Google Cloud services, see Google Cloud Service Level Agreements.
As the preceding tables show, the Google Cloud services that you choose for each tier of your infrastructure stack directly affect the overall uptime that you can expect from the infrastructure stack. To increase the expected availability of a workload that's deployed on a Google Cloud resource, you can provision redundant instances of the resource, as described in the next section.
Resource redundancy
Resource redundancy means provisioning two or more identical instances of a resource and deploying the same workload on all the resources in the group. For example, to host the web tier of an application, you might provision a MIG containing multiple, identical Compute Engine VMs.
If you distribute a group of resources redundantly across multiple failure domains (for example, two Google Cloud zones), the resource availability that you can expect from that group is higher than the uptime SLA of each resource in the group. This higher availability is because the probability that every resource in the group fails at the same time is lower than the probability that resources in a single failure domain have a coordinated failure.
For example, if the availability SLA for a resource is 99.9%, the probability that the resource fails is 0.001 (1 minus the SLA). If you distribute a workload across two instances of this resource that are provisioned in separate failure domains, then the probability that both the resources fail at the same time is 0.000001 (that is, 0.001 x 0.001). This failure probability translates to a theoretical availability of 99.9999% for the group of two resources. However, the actual availability that you can expect is limited to the target availability of the deployment location: 99.9% if the resources are in a single Google Cloud zone, 99.99% for a multi-zone deployment, and 99.999% if the redundant resources are distributed across multiple regions.
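The redundancy arithmetic above can be sketched in Python as follows. The names `LOCATION_TARGETS`, `theoretical_group_availability`, and `expected_group_availability` are illustrative, and the calculation assumes that the redundant instances fail independently of each other.

```python
# Target availability (percent) of each deployment location, from this page.
LOCATION_TARGETS = {
    "single-zone": 99.9,
    "multi-zone": 99.99,
    "multi-region": 99.999,
}


def theoretical_group_availability(resource_sla_percent: float, instances: int) -> float:
    """Availability (percent) if the group is down only when every instance is down."""
    failure_probability = 1 - resource_sla_percent / 100
    return (1 - failure_probability ** instances) * 100


def expected_group_availability(
    resource_sla_percent: float, instances: int, deployment: str
) -> float:
    """Cap the theoretical availability at the target of the deployment location."""
    theoretical = theoretical_group_availability(resource_sla_percent, instances)
    return min(theoretical, LOCATION_TARGETS[deployment])


print(f"{theoretical_group_availability(99.9, 2):.4f}%")  # theoretical value
for deployment in LOCATION_TARGETS:
    capped = expected_group_availability(99.9, 2, deployment)
    print(f"{deployment}: {capped}%")
```

In this sketch, adding more instances in the same deployment location doesn't raise the result beyond the cap. Only a move to a broader location (multiple zones or multiple regions) does.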
Note: For more information about region-specific considerations, see Geography and regions.
Stack depth
The depth of an infrastructure stack is the number of distinct tiers (or layers) in the stack. Each tier in an infrastructure stack contains resources that provide a distinct function for the application. For example, the middle tier in a three-tier stack might use Compute Engine VMs or a GKE cluster to host application servers. Each tier in an infrastructure stack typically has a tight interdependence with its adjacent tiers. That means if any tier of the stack is unavailable, the entire stack becomes unavailable.
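You can calculate the expected aggregate availability of an N-tier infrastructure stack by using the following formula:
$$A_{\text{stack}} = A_1 \times A_2 \times \cdots \times A_N$$
In this formula, $A_i$ is the availability of tier $i$, expressed as a fraction (for example, 0.999 for 99.9%).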
For example, if every tier in a three-tier stack is designed to provide 99.9% availability, then the aggregate availability of the stack is approximately 99.7% (0.999 x 0.999 x 0.999). That means the aggregate availability of a multi-tier stack is lower than the availability of the tier that provides the least availability.
As the number of interdependent tiers in a stack increases, the aggregate availability of the stack decreases, as shown in the following table. Each example stack in the table has a different number of tiers and every tier is assumed to provide 99.9% availability.
Tier | Stack A | Stack B | Stack C |
---|---|---|---|
Frontend | 99.9% | 99.9% | 99.9% |
Application tier | 99.9% | 99.9% | 99.9% |
Middle tier | – | 99.9% | 99.9% |
Data tier | – | – | 99.9% |
Aggregate availability of the stack | 99.8% | 99.7% | 99.6% |
Estimated maximum downtime of the stack in a 30-day month | 86 minutes | 130 minutes | 173 minutes |
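The values in the preceding table can be reproduced with a short Python sketch. The function names are illustrative. The sketch derives the downtime from the aggregate availability after rounding it to one decimal place, because that appears to be how the table's downtime row was computed; this is an assumption, and using the unrounded product gives slightly lower figures.

```python
from math import prod


def aggregate_availability(tier_availabilities_percent: list[float]) -> float:
    """Multiply the per-tier availabilities of a stack of interdependent tiers."""
    return prod(a / 100 for a in tier_availabilities_percent) * 100


def downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, permitted by an availability percentage."""
    return days * 24 * 60 * (100 - availability_percent) / 100


# Each stack has a different depth; every tier is assumed to provide 99.9%.
stacks = {
    "Stack A": [99.9] * 2,
    "Stack B": [99.9] * 3,
    "Stack C": [99.9] * 4,
}

for name, tiers in stacks.items():
    rounded = round(aggregate_availability(tiers), 1)  # assumption: table rounds first
    print(f"{name}: {rounded}% availability, "
          f"about {downtime_minutes(rounded):.0f} minutes of downtime")
```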
Summary of design considerations
When you design your applications, consider the aggregate availability of the Google Cloud infrastructure stack.
- The availability of each Google Cloud resource in your infrastructure stack influences the aggregate availability of the stack. When you choose Google Cloud services to build your infrastructure stack, consider the availability SLA of the services.
- To improve the availability of the function (for example, compute or database) that's provided by a resource, you can provision redundant instances of the resource. When you design an architecture with redundant resources, besides the availability benefits, you must also consider the potential effects on operational complexity, latency, and cost.
- The number of tiers in an infrastructure stack (that is, the depth of the stack) has an inverse relationship with the aggregate availability of the stack. Consider this relationship when you design or modify your stack.
For example calculations of aggregate availability, see the following sections:
- Example calculation: Single-zone deployment
- Example calculation: Multi-zone deployment
- Example calculation: Multi-region deployment with regional load balancing
- Example calculation: Multi-region deployment with global load balancing
Location scopes
The location scope of a Google Cloud resource determines the extent to which an infrastructure failure can affect the resource. Most resources that you provision in Google Cloud have one of the following location scopes: zonal, regional, multi-region, or global.
The location scope of some resource types is fixed; that is, you can't choose or change the location scope. For example, Virtual Private Cloud (VPC) networks are global resources, and Compute Engine virtual machines (VMs) are zonal resources. For certain resources, you can choose the location scope while provisioning the resource. For example, when you create a Google Kubernetes Engine (GKE) cluster, you can choose to create a zonal or regional GKE cluster.
The following sections describe location scopes in more detail.
Zonal resources
Zonal resources are deployed within a single zone in a Google Cloud region. The following are examples of zonal resources. This list is not exhaustive.
- Compute Engine VMs
- Zonal managed instance groups (MIGs)
- Zonal persistent disks
- Single-zone GKE clusters
- Filestore Basic and Zonal instances
- Dataflow jobs
- Cloud SQL instances
- Dataproc clusters on Compute Engine
A failure in a zone might affect the zonal resources that are provisioned within that zone. Zones are designed to minimize the risk of correlated failures with other zones in the region. A failure in one zone usually does not affect the resources in the other zones in the region. Also, a failure in a zone doesn't necessarily cause all the infrastructure in that zone to be unavailable. The zone merely defines the expected boundary for the effect of a failure.
To protect applications that use zonal resources against zonal incidents, you can distribute or replicate the resources across multiple zones or regions. For more information, see Design reliable infrastructure for your workloads in Google Cloud.
Regional resources
Regional resources are deployed redundantly across multiple zones within a region. The following are examples of regional resources. This list is not exhaustive.
- Regional MIGs
- Regional Cloud Storage buckets
- Regional persistent disks
- Regional GKE clusters with the default (multi-zone) configuration
- VPC subnets
- Regional external Application Load Balancers
- Regional Spanner instances
- Filestore Enterprise instances
- Cloud Run services
Regional resources are resilient to incidents in a specific zone. A region outage can affect some or all the regional resources provisioned within that region. Such outages can be caused by natural disasters or by large-scale infrastructure failures.
Note: For more information about region-specific considerations, see Geography and regions.
Multi-region resources
Multi-region resources are distributed across specific regions. The following are examples of multi-region resources. This list is not exhaustive.
- Dual-region and multi-region Cloud Storage buckets
- Multi-region Spanner instances
- Multi-cluster (multi-region) Bigtable instances
- Multi-region key rings in Cloud Key Management Service
For a complete list of the Google Services that are available in multi-region configurations, see Products available by location.
Multi-region resources are resilient to incidents in specific regions and zones. An infrastructure outage that occurs in multiple regions can affect the availability of some or all the multi-region resources that are provisioned in the affected regions.
Global resources
Global resources are available across all Google Cloud locations. The following are examples of global resources. This list is not exhaustive.
- Projects. For guidance and best practices about organizing your Google Cloud resources into folders and projects, see Decide a resource hierarchy for your Google Cloud landing zone.
- VPC networks, including associated routes and firewall rules
- Cloud DNS zones
- Global external Application Load Balancers
- Global key rings in Cloud Key Management Service
- Pub/Sub topics
- Secrets in Secret Manager
For a complete list of the Google Services that are available globally, see Global products.
Global resources are resilient to zonal and regional incidents. These resources don't rely on infrastructure in any specific region. Google Cloud has systems and processes that help to minimize the risk of global infrastructure outages. Google also continually monitors the infrastructure, and quickly resolves any global outages.
The following table summarizes the relative resilience of zonal, regional, multi-region, and global resources to infrastructure issues. It also provides recommendations to mitigate the effects of outages.
Resource scope | Resilience | Recommendations to mitigate the effects of infrastructure outages |
---|---|---|
Zonal | Low | Deploy the resources redundantly in multiple zones or regions. |
Regional | Medium | Deploy the resources redundantly in multiple regions. |
Multi-region or global | High | Manage changes carefully, and use defense-in-depth fallbacks where possible. For more information, see Recommendations to manage the risk of outages of global resources. |
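The relationship between location scope and outage resilience can be modeled as a small data structure. The following Python sketch is illustrative only: the `Scope` and `Outage` enums, the `is_resilient` function, and the sample inventory are hypothetical names, and the model considers only an outage of a single zone or a single region.

```python
from enum import Enum


class Scope(Enum):
    ZONAL = "zonal"
    REGIONAL = "regional"
    MULTI_REGION = "multi-region"
    GLOBAL = "global"


class Outage(Enum):
    ZONE = "zone"
    REGION = "region"


# Outage types (single zone or single region) that each scope is designed to withstand.
RESILIENT_TO = {
    Scope.ZONAL: set(),
    Scope.REGIONAL: {Outage.ZONE},
    Scope.MULTI_REGION: {Outage.ZONE, Outage.REGION},
    Scope.GLOBAL: {Outage.ZONE, Outage.REGION},
}


def is_resilient(scope: Scope, outage: Outage) -> bool:
    """Return True if a resource with this scope is resilient to the outage type."""
    return outage in RESILIENT_TO[scope]


def exposed_resources(inventory: dict[str, Scope], outage: Outage) -> list[str]:
    """List the resources in an inventory that the outage type might affect."""
    return sorted(name for name, scope in inventory.items()
                  if not is_resilient(scope, outage))


# Sample inventory that uses resource examples from the lists on this page.
inventory = {
    "Compute Engine VM": Scope.ZONAL,
    "Regional MIG": Scope.REGIONAL,
    "Multi-region Spanner instance": Scope.MULTI_REGION,
    "VPC network": Scope.GLOBAL,
}

print(exposed_resources(inventory, Outage.ZONE))
print(exposed_resources(inventory, Outage.REGION))
```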
Recommendations to manage the risk of outages of global resources
To take advantage of the resilience of global resources to zone and region outages, you might consider using certain global resources in your architecture. Google recommends the following approaches to manage the risk of outages of global resources:
Careful management of changes to global resources
Global resources are resilient to physical failures. The configuration for such resources is globally scoped. Thus, setting up and configuring a single global resource is easier than operating multiple regional resources. However, a critical error in the configuration of a global resource might make it a single point of failure (SPOF). For example, you might use a global load balancer as the frontend for a geographically-distributed application. A global load balancer is often a good choice for such an application. However, an error in the configuration of the load balancer can cause it to become unavailable across all geographies. To avoid this risk, you must manage configuration changes to global resources carefully. For more information, see Control changes to global resources.
Use of regional resources as defense-in-depth fallbacks
For applications that have exceptionally high availability requirements, regional defense-in-depth fallbacks can help minimize the effect of outages of global resources. Consider the example of a geographically-distributed application that has a global load balancer as the frontend. To ensure that the application remains accessible even if the global load balancer is affected by a global outage, you can deploy regional load balancers. You can configure the clients to prefer the global load balancer, but fail over to the nearest regional load balancer if the global load balancer is not available.
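The client-side preference described above (global load balancer first, nearest regional load balancer as the fallback) might look like the following Python sketch. Everything here is hypothetical: the `pick_endpoint` function, the example URLs, and the `is_healthy` callback, which stands in for whatever health probe your clients use.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    global_endpoint: str,
    regional_endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Prefer the global endpoint; otherwise use the first healthy regional one.

    regional_endpoints is assumed to be ordered from nearest to farthest.
    Returns None if no endpoint is reachable.
    """
    if is_healthy(global_endpoint):
        return global_endpoint
    for endpoint in regional_endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints; a real client would probe them over the network.
GLOBAL_LB = "https://app.example.com"
REGIONAL_LBS = ["https://us-east1.app.example.com", "https://europe-west1.app.example.com"]

unavailable = {GLOBAL_LB}  # simulate an outage of the global load balancer
chosen = pick_endpoint(GLOBAL_LB, REGIONAL_LBS, lambda url: url not in unavailable)
print(chosen)
```

Passing the health check in as a callback keeps the selection logic independent of any particular probing mechanism, which also makes the fallback order easy to test.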
Example architecture with zonal, regional, and global resources
Your cloud topology can include a combination of zonal, regional, and global resources. The following diagram shows an example architecture for a multi-tier application that's deployed in Google Cloud.
As shown in the preceding diagram, a global external HTTP/S load balancer receives client requests. The load balancer distributes the requests to the backend, which is a regional MIG that has two Compute Engine VMs. The application running on the VMs writes data to and reads from a Cloud SQL database. The database is configured for HA. The primary and standby instances of the database are provisioned in separate zones, and the primary database is replicated synchronously to the standby database. In addition, the database is backed up automatically to a multi-region bucket in Cloud Storage.
The following table summarizes the Google Cloud resources in the preceding architecture and the resilience of each resource to zone and region outages:
Resource | Resilience to outages |
---|---|
VPC network | VPC networks, including associated routes and firewall rules, are global resources. They are resilient to zone and region outages. |
Subnets | VPC subnets are regional resources. They are resilient to zone outages. |
Global external HTTP/S load balancer | Global external HTTP/S load balancers are resilient to zone and region outages. |
Regional MIG | Regional MIGs are resilient to zone outages. |
Compute Engine VMs | Compute Engine VMs are zonal resources. If a zone outage occurs, the individual Compute Engine VMs might be affected. However, the application can continue to serve requests because the backend for the load balancer is a regional MIG, and not standalone VMs. |
Cloud SQL instances | The Cloud SQL deployment in this architecture is configured for HA; that is, the deployment includes a primary-standby pair of database instances. The primary database is replicated synchronously to the standby database by using regional persistent disks. |
Multi-region Cloud Storage bucket | Data that's stored in multi-region Cloud Storage buckets is resilient to single-region outages. |
Persistent disks | Persistent disks can be zonal or regional. Regional persistent disks are resilient to zone outages. To prepare for recovery from region outages, you can schedule snapshots of persistent disks and store the snapshots in a multi-region Cloud Storage bucket. |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-20 UTC.