Google Cloud infrastructure reliability guide
Reliable infrastructure is a critical requirement for workloads in the cloud. As a cloud architect, to design reliable infrastructure for your workloads, you need a good understanding of the reliability capabilities of your cloud provider of choice. This document describes the building blocks of reliability in Google Cloud (zones, regions, and location-scoped resources) and the availability levels that they provide. This document also provides guidelines for assessing the reliability requirements of your workloads, and presents architectural recommendations for building and managing reliable infrastructure in Google Cloud.
This document is divided into the following parts:
- Overview of reliability (this part)
- Building blocks of reliability in Google Cloud
- Assess the reliability requirements for your cloud workloads
- Design reliable infrastructure for your workloads in Google Cloud
- Manage traffic and load for your workloads in Google Cloud
- Manage and monitor your Google Cloud infrastructure
If you've read this guide previously and want to see what's changed, see the Release notes.
Overview of reliability
An application or workload is reliable when it meets your current objectives for availability and resilience to failures.
Availability (or uptime) is the percentage of time that an application is usable. For example, for an application that has an availability target of 99.99%, the total downtime must not exceed 8.64 seconds during a 24-hour period. Sometimes, availability is measured as the proportion of requests that the application serves successfully during a given period. For example, for an application that has an availability target of 99.99%, for every 100,000 requests received, not more than ten requests can fail. Availability is often expressed as the number of nines in the percentage. For example, 99.99% availability is expressed as "4 nines".
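The arithmetic behind these two examples can be sketched in a few lines of Python. This is a minimal illustration, not part of any Google Cloud API: the function names are hypothetical, and the availability target is passed as a string so that `Decimal` can represent it exactly.

```python
from decimal import Decimal


def error_budget_fraction(availability_target_pct: str) -> Decimal:
    """Fraction of time (or of requests) that is allowed to fail.

    For example, a target of "99.99" leaves a fraction of 0.0001.
    """
    return (Decimal(100) - Decimal(availability_target_pct)) / Decimal(100)


def max_downtime_seconds(availability_target_pct: str, period_seconds: int) -> Decimal:
    """Time-based view: maximum downtime allowed within the period."""
    return error_budget_fraction(availability_target_pct) * Decimal(period_seconds)


def max_failed_requests(availability_target_pct: str, total_requests: int) -> int:
    """Request-based view: maximum number of requests that can fail."""
    # int() truncates toward zero, because a fraction of a request cannot fail.
    return int(error_budget_fraction(availability_target_pct) * Decimal(total_requests))


if __name__ == "__main__":
    seconds_per_day = 24 * 60 * 60
    print(max_downtime_seconds("99.99", seconds_per_day))
    print(max_failed_requests("99.99", 100_000))
```

For a 99.99% target, the sketch reproduces the figures above: 8.64 seconds of downtime in a 24-hour period, and ten failed requests out of 100,000.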
Depending on the purpose of the application, you might have different sets of indicators for how reliable the application is. The following are examples of such reliability indicators:
- For applications that serve content, availability, latency, and throughput are important reliability indicators. They indicate whether the application can respond to requests, how long the application takes to respond to requests, and how many requests the application can process successfully in a given period.
- For databases and storage systems, latency, throughput, availability, and durability (how well data is protected against loss or corruption) are indicators of reliability. They indicate how long the system takes to read or write data, and whether data can be accessed on demand.
- For big data and analytics workloads such as data processing pipelines, consistent pipeline performance (throughput and latency) is essential to ensure freshness of the data products, and is an important reliability indicator. It indicates how much data can be processed, and how long it takes for the pipeline to progress from data ingestion to data processing.
- Most applications have data correctness as an essential reliability indicator.
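As a rough illustration of the first set of indicators, the following Python sketch derives availability, latency, and throughput from a list of request records. The `RequestRecord` type, its fields, and the function names are assumptions made for this example. The nearest-rank percentile is one of several common definitions and is not prescribed by this guide.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical record of one request, for illustration only."""
    latency_ms: float
    succeeded: bool


def availability(records: list[RequestRecord]) -> float:
    """Proportion of requests that were served successfully."""
    if not records:
        raise ValueError("at least one request record is required")
    return sum(r.succeeded for r in records) / len(records)


def latency_percentile(records: list[RequestRecord], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not records:
        raise ValueError("at least one request record is required")
    ordered = sorted(r.latency_ms for r in records)
    rank = math.ceil(percentile / 100 * len(ordered))
    # Clamp the rank so that very small percentiles map to the first sample.
    return ordered[max(rank, 1) - 1]


def throughput_per_second(records: list[RequestRecord], window_seconds: float) -> float:
    """Successfully processed requests per second over the observation window."""
    return sum(r.succeeded for r in records) / window_seconds
```

In practice, you would typically read these values from your monitoring system rather than compute them from raw records. The sketch only shows how each indicator relates to the underlying request data.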
For further guidelines to define the reliability objectives for your applications, see Assess the reliability requirements for your cloud workloads.
Note: Planning for disaster recovery (DR) is related to reliability, and DR is essential for business continuity. For detailed guidance about DR planning, see the Disaster recovery planning guide.
Factors that affect application reliability
The reliability of an application that's deployed in Google Cloud depends on the following factors:
- The internal design of the application.
- The secondary applications or components that the application depends on.
- Google Cloud infrastructure resources such as compute, networking, storage, databases, and security that the application runs on, and how the application uses the infrastructure.
- Infrastructure capacity that you provision, and how the capacity scales.
- The DevOps processes and tools that you use to build, deploy, and maintain the application, its dependencies, and the Google Cloud infrastructure.
These factors are summarized in the following diagram:
As shown in the preceding diagram, the reliability of an application that's deployed in Google Cloud depends on multiple factors. The focus of this guide is the reliability of the Google Cloud infrastructure.
What's next
- Building blocks of reliability in Google Cloud
- Assess the reliability requirements for your cloud workloads
- Design reliable infrastructure for your workloads in Google Cloud
- Manage traffic and load for your workloads in Google Cloud
- Manage and monitor your Google Cloud infrastructure
Contributors
Authors:
- Nir Tarcic | Cloud Lifecycle SRE UTL
- Kumar Dhanagopal | Cross-Product Solution Developer
Other contributors:
- Alok Kumar | Distinguished Engineer
- Andrew Fikes | Engineering Fellow, Reliability
- Chris Heiser | SRE TL
- David Ferguson | Director, Site Reliability Engineering
- Joe Tan | Senior Product Counsel
- Krzysztof Duleba | Principal Engineer
- Narayan Desai | Principal SRE
- Sailesh Krishnamurthy | VP, Engineering
- Steve McGhee | Reliability Advocate
- Sudhanshu Jain | Product Manager
- Yaniv Aknin | Software Engineer
Last updated 2024-11-20 UTC.