Building blocks of reliability in Google Cloud

Last reviewed 2024-11-20 UTC

Google Cloud infrastructure services run in locations around the globe. The locations are divided into failure domains called regions and zones, which are the foundational building blocks for designing reliable infrastructure for your cloud workloads.

A failure domain is a resource or a group of resources that can fail independently of other resources. A standalone Compute Engine VM is an example of a resource that's a failure domain. A Google Cloud region or zone is an example of a failure domain that consists of a group of resources. When an application is distributed redundantly across failure domains, it can achieve a higher aggregated level of availability than that provided by each failure domain.

This part of the Google Cloud infrastructure reliability guide describes the building blocks of reliability in Google Cloud and how they affect the availability of your cloud resources.

Regions and zones

Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources. For more information about region-specific considerations, see Geography and regions.

Platform availability

Google Cloud infrastructure is designed to tolerate and recover from failures. Google continually invests in innovative approaches to maintain and improve the reliability of Google Cloud. The following capabilities of Google Cloud infrastructure help to provide a reliable platform for your cloud workloads:

  • Geographically separated regions to mitigate the effects of natural disasters and region outages on global services.
  • Hardware redundancy and replication to avoid single points of failure.
  • Live migration of resources during maintenance events. For example, during planned infrastructure maintenance, Compute Engine VMs can be moved to another host in the same zone by using live migration.
  • A secure-by-design infrastructure foundation for the physical infrastructure and software on which Google Cloud runs, and operational security controls to protect your data and workloads. For more information, see Google infrastructure security design overview.
  • A high-performance backbone network that uses an advanced software-defined networking (SDN) approach to network management, with edge-caching services to deliver consistent performance that scales well.
  • Continuous monitoring and reporting. You can view the status of Google Cloud services in every location by using the Google Cloud Service Health Dashboard.
  • Annual, company-wide Disaster Recovery Testing (DiRT) events to ensure that Google Cloud services and internal business operations continue to run during a disaster.
  • A change management approach that emphasizes reliability across all the phases of the software development lifecycle for any changes to the Google Cloud platform and services.

Google Cloud infrastructure is designed to support the following target levels of availability for most customer workloads:

| Deployment location | Availability (uptime) % | Estimated maximum downtime |
|---|---|---|
| Single zone | 3 nines: 99.9% | 43.2 minutes in a 30-day month |
| Multiple zones in a region | 4 nines: 99.99% | 4.3 minutes in a 30-day month |
| Multiple regions | 5 nines: 99.999% | 26 seconds in a 30-day month |

Note: For more information about region-specific considerations, see Geography and regions.
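The downtime figures in the preceding table follow directly from the availability percentages. The following Python snippet is a minimal sketch of that arithmetic; the function name `max_downtime_seconds` is illustrative and is not part of any Google Cloud API.

```python
def max_downtime_seconds(availability_percent: float, days: int = 30) -> float:
    """Return the maximum downtime, in seconds, that an availability
    target allows over a period of the given number of days."""
    period_seconds = days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


# Availability targets taken from the preceding table.
targets = [
    ("Single zone", 99.9),
    ("Multiple zones in a region", 99.99),
    ("Multiple regions", 99.999),
]

for location, percent in targets:
    seconds = max_downtime_seconds(percent)
    print(f"{location}: about {seconds / 60:.1f} minutes ({seconds:.0f} seconds)")
```

A 30-day month has 2,592,000 seconds, so each additional nine reduces the allowed downtime by a factor of ten.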

The availability percentages in the preceding table are targets. The uptime Service Level Agreements (SLAs) for specific Google Cloud services might be different from these availability targets. For example, the uptime SLA for a Bigtable instance depends on the number of clusters, their distribution across locations, and the routing policy that you configure.

The minimum uptime SLA for a Bigtable instance with clusters in three or more regions is 99.999% if the multi-cluster routing policy is configured. But, if the single-cluster routing policy is configured, then the minimum uptime SLA is 99.9% regardless of the number of clusters and their distribution.

The diagrams in this section show Bigtable instances with varying cluster sizes and the consequent differences in their uptime SLAs.

Single cluster

The following diagram shows a single-cluster Bigtable instance, with a minimum uptime SLA of 99.9%:

Single-cluster Bigtable instance (minimum uptime SLA: 99.9%).

Multiple clusters in a single region

The following diagram shows a multi-cluster Bigtable instance in multiple zones within a single region, with multi-cluster routing (minimum uptime SLA: 99.99%):

Multi-cluster Bigtable instance in multiple zones within a single region, with multi-cluster routing (minimum uptime SLA: 99.99%).

Multiple clusters in multiple regions

The following diagram shows a multi-cluster Bigtable instance in three regions, with multi-cluster routing (minimum uptime SLA: 99.999%):

Multi-cluster Bigtable instance in three regions, with multi-cluster routing (minimum uptime SLA: 99.999%).
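The Bigtable cases described in this section can be summarized as a small decision function. The following Python snippet is only a sketch: the function name, its parameters, and the routing labels are hypothetical, and it isn't a Bigtable API. The text above doesn't state the SLA for a multi-cluster instance that spans exactly two regions; the sketch assumes that this case falls in the 99.99% tier, so check the Bigtable SLA before relying on that value.

```python
def bigtable_min_uptime_sla(num_clusters: int, num_regions: int, routing: str) -> float:
    """Return the minimum uptime SLA (in percent) for the Bigtable
    configurations described in this section.

    routing is "single-cluster" or "multi-cluster" (labels chosen for
    this sketch).
    """
    if routing not in ("single-cluster", "multi-cluster"):
        raise ValueError(f"unknown routing policy: {routing!r}")
    if num_clusters < 1 or not 1 <= num_regions <= num_clusters:
        raise ValueError("need at least one cluster, and no more regions than clusters")

    # Single-cluster routing caps the SLA regardless of cluster count.
    if routing == "single-cluster" or num_clusters == 1:
        return 99.9
    # Multi-cluster routing with clusters in three or more regions.
    if num_regions >= 3:
        return 99.999
    # Multi-cluster routing in fewer than three regions.
    # Assumption: the two-region case is treated like the single-region case.
    return 99.99


print(bigtable_min_uptime_sla(1, 1, "multi-cluster"))
print(bigtable_min_uptime_sla(2, 1, "multi-cluster"))
print(bigtable_min_uptime_sla(3, 3, "multi-cluster"))
print(bigtable_min_uptime_sla(3, 3, "single-cluster"))
```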

Aggregate infrastructure availability

To run your applications in Google Cloud, you use infrastructure resources like VMs and databases. These infrastructure resources, together, constitute your application's infrastructure stack.

The following diagram shows an example of an infrastructure stack in Google Cloud and the availability SLA for each resource in the stack:

Dual-zone deployment.

This example infrastructure stack includes the following Google Cloud resources:

  • A regional external Application Load Balancer receives and responds to user requests.
  • A regional managed instance group (MIG) is the backend for the regional external Application Load Balancer. The MIG contains two Compute Engine VMs in different zones. Each VM hosts an instance of a web server.
  • An internal load balancer handles communication between the web server and the application server instances.
  • A second regional MIG is the backend for the internal load balancer. This MIG has two Compute Engine VMs in different zones. Each VM hosts an instance of an application server.
  • A Cloud SQL instance that's configured for HA is the database for the application. The primary database instance is replicated synchronously to a standby database instance.

The aggregate availability that you can expect from an infrastructure stack like the preceding example depends on the following factors:

Google Cloud SLAs

The uptime SLAs of the Google Cloud services that you use in your infrastructure stack influence the minimum aggregate availability that you can expect from the stack.

The following tables present a comparison of the uptime SLAs for some services:

| Compute services | Monthly Uptime SLA | Estimated maximum downtime in a 30-day month |
|---|---|---|
| Compute Engine VM | 99.9% | 43.2 minutes |
| GKE Autopilot pods in multiple zones | 99.9% | 43.2 minutes |
| Cloud Run service | 99.95% | 21.6 minutes |

| Database services | Monthly Uptime SLA | Estimated maximum downtime in a 30-day month |
|---|---|---|
| Cloud SQL for PostgreSQL instance (Enterprise edition) | 99.95% | 21.6 minutes |
| AlloyDB for PostgreSQL instance | 99.99% | 4.3 minutes |
| Spanner multi-region instance | 99.999% | 26 seconds |

For the SLAs of other Google Cloud services, see Google Cloud Service Level Agreements.

As the preceding tables show, the Google Cloud services that you choose for each tier of your infrastructure stack directly affect the overall uptime that you can expect from the infrastructure stack. To increase the expected availability of a workload that's deployed on a Google Cloud resource, you can provision redundant instances of the resource, as described in the next section.

Resource redundancy

Resource redundancy means provisioning two or more identical instances of a resource and deploying the same workload on all the resources in the group. For example, to host the web tier of an application, you might provision a MIG containing multiple, identical Compute Engine VMs.

If you distribute a group of resources redundantly across multiple failure domains (for example, two Google Cloud zones), the resource availability that you can expect from that group is higher than the uptime SLA of each resource in the group. This higher availability is because the probability that every resource in the group fails at the same time is lower than the probability that resources in a single failure domain have a coordinated failure.

For example, if the availability SLA for a resource is 99.9%, the probability that the resource fails is 0.001 (1 minus the SLA). If you distribute a workload across two instances of this resource that are provisioned in separate failure domains, then the probability that both the resources fail at the same time is 0.000001 (that is, 0.001 x 0.001). This failure probability translates to a theoretical availability of 99.9999% for the group of two resources. However, the actual availability that you can expect is limited to the target availability of the deployment location: 99.9% if the resources are in a single Google Cloud zone, 99.99% for a multi-zone deployment, and 99.999% if the redundant resources are distributed across multiple regions.
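The preceding calculation can be sketched in Python as follows. The function names and the `LOCATION_TARGETS` dictionary are illustrative, and the sketch assumes that the redundant resources fail independently of each other, which is the same simplification that the example above makes.

```python
# Target availability (in percent) of each deployment location,
# taken from the targets listed earlier on this page.
LOCATION_TARGETS = {
    "single-zone": 99.9,
    "multi-zone": 99.99,
    "multi-region": 99.999,
}


def theoretical_availability(resource_sla_percent: float, copies: int) -> float:
    """Availability (in percent) of a group of redundant resources,
    assuming that the resources fail independently."""
    failure_probability = 1 - resource_sla_percent / 100
    return (1 - failure_probability ** copies) * 100


def expected_availability(resource_sla_percent: float, copies: int, location: str) -> float:
    """Theoretical availability, capped at the target availability
    of the deployment location."""
    theoretical = theoretical_availability(resource_sla_percent, copies)
    return min(theoretical, LOCATION_TARGETS[location])


print(f"{theoretical_availability(99.9, 2):.4f}")
print(expected_availability(99.9, 2, "single-zone"))
print(expected_availability(99.9, 2, "multi-zone"))
print(expected_availability(99.9, 2, "multi-region"))
```

The cap reflects the point made above: adding copies raises the theoretical figure quickly, but the deployment location sets the upper limit on what you can expect in practice.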

Note: For more information about region-specific considerations, see Geography and regions.

Stack depth

The depth of an infrastructure stack is the number of distinct tiers (or layers) in the stack. Each tier in an infrastructure stack contains resources that provide a distinct function for the application. For example, the middle tier in a three-tier stack might use Compute Engine VMs or a GKE cluster to host application servers. Each tier in an infrastructure stack typically has a tight interdependence with its adjacent tiers. That means if any tier of the stack is unavailable, the entire stack becomes unavailable.

You can calculate the expected aggregate availability of an N-tier infrastructure stack by using the following formula:

$$\text{tier1\_availability} \times \text{tier2\_availability} \times \dots \times \text{tierN\_availability}$$

For example, if every tier in a three-tier stack is designed to provide 99.9% availability, then the aggregate availability of the stack is approximately 99.7% (0.999 x 0.999 x 0.999). That is, the aggregate availability of a multi-tier stack is lower than the availability of the tier that provides the least availability.

As the number of interdependent tiers in a stack increases, the aggregate availability of the stack decreases, as shown in the following table. Each example stack in the table has a different number of tiers and every tier is assumed to provide 99.9% availability.

| Tier | Stack A | Stack B | Stack C |
|---|---|---|---|
| Frontend | 99.9% | 99.9% | 99.9% |
| Application tier | 99.9% | 99.9% | 99.9% |
| Middle tier | | 99.9% | 99.9% |
| Data tier | | | 99.9% |
| Aggregate availability of the stack | 99.8% | 99.7% | 99.6% |
| Estimated maximum downtime of the stack in a 30-day month | 86 minutes | 130 minutes | 173 minutes |
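The aggregate availability values for stacks of different depths can be reproduced with a few lines of Python. This is a minimal sketch of the product formula; the function name `aggregate_availability` is illustrative.

```python
from math import prod


def aggregate_availability(tier_availabilities_percent: list[float]) -> float:
    """Return the aggregate availability (in percent) of a stack whose
    tiers all depend on each other: the product of the tier availabilities."""
    if not tier_availabilities_percent:
        raise ValueError("a stack needs at least one tier")
    return prod(a / 100 for a in tier_availabilities_percent) * 100


# Stacks with two, three, and four tiers, each tier at 99.9%.
for depth in (2, 3, 4):
    tiers = [99.9] * depth
    print(f"{depth} tiers: {aggregate_availability(tiers):.1f}%")
```

Because every factor is less than 1, each additional interdependent tier can only lower the product.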

Summary of design considerations

When you design your applications, consider the aggregate availability of the Google Cloud infrastructure stack.

  • The availability of each Google Cloud resource in your infrastructure stack influences the aggregate availability of the stack. When you choose Google Cloud services to build your infrastructure stack, consider the availability SLA of the services.
  • To improve the availability of the function (for example, compute or database) that's provided by a resource, you can provision redundant instances of the resource. When you design an architecture with redundant resources, besides the availability benefits, you must also consider the potential effects on operational complexity, latency, and cost.
  • The number of tiers in an infrastructure stack (that is, the depth of the stack) has an inverse relationship with the aggregate availability of the stack. Consider this relationship when you design or modify your stack.

For example calculations of aggregate availability, see the following sections:

Location scopes

The location scope of a Google Cloud resource determines the extent to which an infrastructure failure can affect the resource. Most resources that you provision in Google Cloud have one of the following location scopes: zonal, regional, multi-region, or global.

The location scope of some resource types is fixed; that is, you can't choose or change the location scope. For example, Virtual Private Cloud (VPC) networks are global resources, and Compute Engine virtual machines (VMs) are zonal resources. For certain resources, you can choose the location scope while provisioning the resource. For example, when you create a Google Kubernetes Engine (GKE) cluster, you can choose to create a zonal or regional GKE cluster.

The following sections describe location scopes in more detail.

Zonal resources

Zonal resources are deployed within a single zone in a Google Cloud region. The following are examples of zonal resources. This list is not exhaustive.

  • Compute Engine VMs
  • Zonal managed instance groups (MIGs)
  • Zonal persistent disks
  • Single-zone GKE clusters
  • Filestore Basic and Zonal instances
  • Dataflow jobs
  • Cloud SQL instances
  • Dataproc clusters on Compute Engine

A failure in a zone might affect the zonal resources that are provisioned within that zone. Zones are designed to minimize the risk of correlated failures with other zones in the region. A failure in one zone usually does not affect the resources in the other zones in the region. Also, a failure in a zone doesn't necessarily cause all the infrastructure in that zone to be unavailable. The zone merely defines the expected boundary for the effect of a failure.

To protect applications that use zonal resources against zonal incidents, you can distribute or replicate the resources across multiple zones or regions. For more information, see Design reliable infrastructure for your workloads in Google Cloud.

Regional resources

Regional resources are deployed redundantly across multiple zones within a region. The following are examples of regional resources. This list is not exhaustive.

  • Regional MIGs
  • Regional Cloud Storage buckets
  • Regional persistent disks
  • Regional GKE clusters with the default (multi-zone) configuration
  • VPC subnets
  • Regional external Application Load Balancers
  • Regional Spanner instances
  • Filestore Enterprise instances
  • Cloud Run services

Regional resources are resilient to incidents in a specific zone. A region outage can affect some or all the regional resources provisioned within that region. Such outages can be caused by natural disasters or by large-scale infrastructure failures.

Note: For more information about region-specific considerations, see Geography and regions.

Multi-region resources

Multi-region resources are distributed across specific regions. The following are examples of multi-region resources. This list is not exhaustive.

  • Dual-region and multi-region Cloud Storage buckets
  • Multi-region Spanner instances
  • Multi-cluster (multi-region) Bigtable instances
  • Multi-region key rings in Cloud Key Management Service

For a complete list of the Google Services that are available in multi-region configurations, see Products available by location.

Multi-region resources are resilient to incidents in specific regions and zones. An infrastructure outage that occurs in multiple regions can affect the availability of some or all the multi-region resources that are provisioned in the affected regions.

Global resources

Global resources are available across all Google Cloud locations. The following are examples of global resources. This list is not exhaustive.

  • Projects. For guidance and best practices about organizing your Google Cloud resources into folders and projects, see Decide a resource hierarchy for your Google Cloud landing zone.
  • VPC networks, including associated routes and firewall rules
  • Cloud DNS zones
  • Global external Application Load Balancers
  • Global key rings in Cloud Key Management Service
  • Pub/Sub topics
  • Secrets in Secret Manager

For a complete list of the Google Services that are available globally, see Global products.

Global resources are resilient to zonal and regional incidents. These resources don't rely on infrastructure in any specific region. Google Cloud has systems and processes that help to minimize the risk of global infrastructure outages. Google also continually monitors the infrastructure, and quickly resolves any global outages.

The following table summarizes the relative resilience of zonal, regional, multi-region, and global resources to infrastructure outages. It also lists recommendations to mitigate the effects of such outages.

| Resource scope | Resilience | Recommendations to mitigate the effects of infrastructure outages |
|---|---|---|
| Zonal | Low | Deploy the resources redundantly in multiple zones or regions. |
| Regional | Medium | Deploy the resources redundantly in multiple regions. |
| Multi-region or global | High | Manage changes carefully, and use defense-in-depth fallbacks where possible. For more information, see Recommendations to manage the risk of outages of global resources. |

Note: For more information about region-specific considerations, see Geography and regions.
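The relationship between location scope and exposure to outages, as described in the preceding sections, can be modeled with a small lookup. The following Python snippet is a sketch with hypothetical names (`Scope`, `Outage`, `may_be_affected`), not a Google Cloud API. It assumes that an outage of a whole region also affects the zonal resources in that region, and it doesn't model outages that span multiple regions or global outages.

```python
from enum import Enum


class Scope(Enum):
    ZONAL = "zonal"
    REGIONAL = "regional"
    MULTI_REGION = "multi-region"
    GLOBAL = "global"


class Outage(Enum):
    ZONE = "zone"
    REGION = "region"


# Which resource scopes an outage of a single zone or a single region can
# affect. Assumption: a region outage includes all the zones in the region.
_EXPOSED_SCOPES = {
    Outage.ZONE: {Scope.ZONAL},
    Outage.REGION: {Scope.ZONAL, Scope.REGIONAL},
}


def may_be_affected(scope: Scope, outage: Outage) -> bool:
    """Return True if a resource with the given scope can be affected
    by an outage of a single zone or a single region."""
    return scope in _EXPOSED_SCOPES[outage]


for scope in Scope:
    zone = may_be_affected(scope, Outage.ZONE)
    region = may_be_affected(scope, Outage.REGION)
    print(f"{scope.value}: zone outage={zone}, region outage={region}")
```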

Recommendations to manage the risk of outages of global resources

To take advantage of the resilience of global resources to zone and region outages, you might consider using certain global resources in your architecture. Google recommends the following approaches to manage the risk of outages of global resources:

Careful management of changes to global resources

Global resources are resilient to physical failures. The configuration for such resources is globally scoped. Thus, setting up and configuring a single global resource is easier than operating multiple regional resources. However, a critical error in the configuration of a global resource might make it a single point of failure (SPOF). For example, you might use a global load balancer as the frontend for a geographically-distributed application. A global load balancer is often a good choice for such an application. However, an error in the configuration of the load balancer can cause it to become unavailable across all geographies. To avoid this risk, you must manage configuration changes to global resources carefully. For more information, see Control changes to global resources.

Use of regional resources as defense-in-depth fallbacks

For applications that have exceptionally high availability requirements, regional defense-in-depth fallbacks can help minimize the effect of outages of global resources. Consider the example of a geographically-distributed application that has a global load balancer as the frontend. To ensure that the application remains accessible even if the global load balancer is affected by a global outage, you can deploy regional load balancers. You can configure the clients to prefer the global load balancer, but fail over to the nearest regional load balancer if the global load balancer is not available.
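The client-side failover described above can be sketched as follows. This Python snippet is an illustration under stated assumptions, not a recommended implementation: the function names and the example URLs are hypothetical, the client is assumed to already know its endpoints in order of preference (global first, then the nearest regional load balancer), and any network error or timeout is treated as "not available".

```python
from typing import Callable, Sequence
from urllib.request import urlopen


def http_get(url: str) -> bytes:
    """Fetch a URL with a short timeout. urlopen raises URLError, a
    subclass of OSError, when the endpoint can't be reached."""
    with urlopen(url, timeout=5) as response:
        return response.read()


def fetch_with_fallback(
    endpoints: Sequence[str],
    fetch: Callable[[str], bytes] = http_get,
) -> bytes:
    """Try each endpoint in order of preference and return the first
    successful response."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for url in endpoints:
        try:
            return fetch(url)
        except OSError as error:  # Connection failures and timeouts.
            last_error = error
    raise ConnectionError("all endpoints failed") from last_error


# Hypothetical endpoints: the global load balancer first, then a regional one.
ENDPOINTS = [
    "https://global.example.com/healthz",
    "https://us-east1.example.com/healthz",
]
# body = fetch_with_fallback(ENDPOINTS)
```

Passing the `fetch` function as a parameter keeps the failover logic separate from the HTTP client, which also makes the logic easy to test without network access.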

Example architecture with zonal, regional, and global resources

Your cloud topology can include a combination of zonal, regional, and global resources. The following diagram shows an example architecture for a multi-tier application that's deployed in Google Cloud.

Location scopes of Google Cloud resources.

As shown in the preceding diagram, a global external HTTP/S load balancer receives client requests. The load balancer distributes the requests to the backend, which is a regional MIG that has two Compute Engine VMs. The application running on the VMs writes data to and reads from a Cloud SQL database. The database is configured for HA. The primary and standby instances of the database are provisioned in separate zones, and the primary database is replicated synchronously to the standby database. In addition, the database is backed up automatically to a multi-region bucket in Cloud Storage.

The following table summarizes the Google Cloud resources in the preceding architecture and the resilience of each resource to zone and region outages:

| Resource | Resilience to outages |
|---|---|
| VPC network | VPC networks, including associated routes and firewall rules, are global resources. They are resilient to zone and region outages. |
| Subnets | VPC subnets are regional resources. They are resilient to zone outages. |
| Global external HTTP/S load balancer | Global external HTTP/S load balancers are resilient to zone and region outages. |
| Regional MIG | Regional MIGs are resilient to zone outages. |
| Compute Engine VMs | Compute Engine VMs are zonal resources. If a zone outage occurs, the individual Compute Engine VMs might be affected. However, the application can continue to serve requests because the backend for the load balancer is a regional MIG, and not standalone VMs. |
| Cloud SQL instances | The Cloud SQL deployment in this architecture is configured for HA; that is, the deployment includes a primary-standby pair of database instances. The primary database is replicated synchronously to the standby database by using regional persistent disks. If an outage occurs in the zone that hosts the primary database, the Cloud SQL service fails over to the standby database automatically. If a region outage occurs, you can restore the database in a different region by using the database backups. |
| Multi-region Cloud Storage bucket | Data that's stored in multi-region Cloud Storage buckets is resilient to single-region outages. |
| Persistent disks | Persistent disks can be zonal or regional. Regional persistent disks are resilient to zone outages. To prepare for recovery from region outages, you can schedule snapshots of persistent disks and store the snapshots in a multi-region Cloud Storage bucket. |

Note: For more information about region-specific considerations, see Geography and regions.

Last updated 2024-11-20 UTC.