Build highly available systems through resource redundancy

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.

This principle is relevant to the scoping focus area of reliability.

Principle overview

After you decide the level of reliability that you need, you must design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in only one zone or region. In those examples, if the sole zone or region has an outage, the system has a global outage.

Recommendations

To build redundant systems, consider the recommendations in the following subsections.

Identify failure domains and replicate services

Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.

To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.

For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.
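The mapping exercise described above can be prototyped before any infrastructure changes. The following is a minimal sketch, not an official tool: the inventory, the component names, and the helper functions are hypothetical, and it assumes that a zone name is its region name plus a one-letter suffix (for example, us-central1-a belongs to us-central1).

```python
# Hypothetical inventory: component name -> zones where its replicas run.
PLACEMENTS = {
    "frontend": ["us-central1-a", "us-central1-b", "europe-west1-b"],
    "database": ["us-central1-a", "us-central1-c"],
    "metadata-server": ["us-central1-a"],
}


def region_of(zone: str) -> str:
    """Derive the region by dropping the final '-<letter>' suffix of the zone."""
    return zone.rsplit("-", 1)[0]


def single_points_of_failure(placements):
    """Map each at-risk component to the narrowest failure domain
    ('zone' or 'region') whose outage would take it down entirely."""
    findings = {}
    for component, zones in placements.items():
        distinct_zones = set(zones)
        distinct_regions = {region_of(z) for z in distinct_zones}
        if len(distinct_zones) <= 1:
            # All replicas (if any) share one zone.
            findings[component] = "zone"
        elif len(distinct_regions) == 1:
            # Spread across zones, but all of them are in one region.
            findings[component] = "region"
    return findings


if __name__ == "__main__":
    for name, domain in single_points_of_failure(PLACEMENTS).items():
        print(f"{name}: a single {domain} outage takes this component down")
```

In this sample inventory, the metadata server depends on one zone and the database depends on one region, which mirrors the examples in the principle overview. The frontend is not reported because its replicas span two regions.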

Detect and address issues promptly

Continuously track the status of your failure domains to detect and address issues promptly.

You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health. You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.
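To illustrate how health-based routing behaves, the following toy model marks a backend unhealthy after a number of consecutive failed probes and routes only to healthy backends. It is a sketch for reasoning about the behavior, not the Google Cloud load balancing API: the Backend and ToyLoadBalancer classes, the backend names, and the threshold values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    healthy: bool = True
    streak: int = 0  # consecutive probes that contradict the current state


class ToyLoadBalancer:
    """Illustrative model only; thresholds and names are hypothetical."""

    def __init__(self, backends, healthy_threshold=2, unhealthy_threshold=2):
        self.backends = list(backends)
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self._next = 0

    def record_probe(self, backend, success):
        """Flip a backend's state only after enough consecutive
        probe results that disagree with its current state."""
        if success == backend.healthy:
            backend.streak = 0
            return
        backend.streak += 1
        needed = self.healthy_threshold if success else self.unhealthy_threshold
        if backend.streak >= needed:
            backend.healthy = success
            backend.streak = 0

    def route(self):
        """Pick the next healthy backend in round-robin order."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends available")
        backend = healthy[self._next % len(healthy)]
        self._next += 1
        return backend


if __name__ == "__main__":
    a = Backend("vm-a", "us-central1-a")
    b = Backend("vm-b", "us-central1-b")
    lb = ToyLoadBalancer([a, b])
    lb.record_probe(a, False)  # first failure: vm-a is still considered healthy
    lb.record_probe(a, False)  # second consecutive failure: vm-a is marked unhealthy
    print([lb.route().name for _ in range(3)])
```

Requiring several consecutive results before changing state is a design choice that avoids removing a backend because of one dropped probe, at the cost of slightly slower detection.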

Test failover scenarios

Like a fire drill, regularly simulate failures to validate the effectiveness of your replication and failover strategies.

For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.
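Before running a live drill, a paper exercise can show which services are expected to fail. The sketch below removes each zone in turn from a hypothetical deployment description and reports the services that would drop below their minimum serving capacity. The service names, replica counts, and the min_serving field are assumptions for illustration. The sketch checks a model of the deployment only and does not replace an actual failure simulation.

```python
# Hypothetical deployment: replicas per zone, plus the minimum number of
# replicas each service needs in order to keep serving traffic.
DEPLOYMENT = {
    "frontend": {
        "replicas": {"us-central1-a": 2, "us-central1-b": 2, "us-central1-c": 2},
        "min_serving": 3,
    },
    "batch-worker": {
        "replicas": {"us-central1-a": 3, "us-central1-b": 1},
        "min_serving": 2,
    },
}


def run_zone_outage_drill(deployment):
    """Simulate the loss of each zone in turn and return a list of
    (service, lost_zone, surviving_replicas) for every shortfall."""
    failures = []
    all_zones = sorted({z for spec in deployment.values() for z in spec["replicas"]})
    for zone in all_zones:
        for service, spec in sorted(deployment.items()):
            surviving = sum(n for z, n in spec["replicas"].items() if z != zone)
            if surviving < spec["min_serving"]:
                failures.append((service, zone, surviving))
    return failures


if __name__ == "__main__":
    for service, zone, surviving in run_zone_outage_drill(DEPLOYMENT):
        print(f"{service}: losing {zone} leaves {surviving} replica(s), below the minimum")
```

In this sample, the frontend tolerates the loss of any one zone, while the batch worker does not survive the loss of us-central1-a because most of its replicas are concentrated there.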


Last updated 2024-12-30 UTC.