Google Cloud infrastructure reliability guide

Last reviewed 2024-11-20 UTC

Reliable infrastructure is a critical requirement for workloads in the cloud. As a cloud architect, to design reliable infrastructure for your workloads, you need a good understanding of the reliability capabilities of your cloud provider of choice. This document describes the building blocks of reliability in Google Cloud (zones, regions, and location-scoped resources) and the availability levels that they provide. This document also provides guidelines for assessing the reliability requirements of your workloads, and presents architectural recommendations for building and managing reliable infrastructure in Google Cloud.

This document is divided into the following parts:

If you've read this guide previously and want to see what's changed, see the Release notes.

Overview of reliability

An application or workload is reliable when it meets your current objectives for availability and resilience to failures.

Availability (or uptime) is the percentage of time that an application is usable. For example, for an application that has an availability target of 99.99%, the total downtime must not exceed 8.64 seconds during a 24-hour period. Sometimes, availability is measured as the proportion of requests that the application serves successfully during a given period. For example, for an application that has an availability target of 99.99%, for every 100,000 requests received, not more than ten requests can fail. Availability is often expressed as the number of nines in the percentage. For example, 99.99% availability is expressed as "4 nines".
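The arithmetic behind the two examples above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Google Cloud API: the function names are hypothetical, and the availability target is passed as a string so that it can be converted to an exact fraction without floating-point rounding.

```python
import math
from fractions import Fraction


def error_budget_fraction(availability_percent: str) -> Fraction:
    """Return the fraction of time (or requests) that is allowed to fail.

    The target is given as a string such as "99.99" so that it is parsed
    exactly (hypothetical helper for illustration only).
    """
    target = Fraction(availability_percent)
    if not 0 <= target <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return 1 - target / 100


def max_downtime_seconds(availability_percent: str, period_seconds: int) -> float:
    """Time-based view: maximum downtime allowed within the period."""
    return float(error_budget_fraction(availability_percent) * period_seconds)


def max_failed_requests(availability_percent: str, total_requests: int) -> int:
    """Request-based view: maximum number of requests that may fail."""
    # Round down, because a partial request cannot fail.
    return math.floor(error_budget_fraction(availability_percent) * total_requests)


if __name__ == "__main__":
    seconds_per_day = 24 * 60 * 60
    print(max_downtime_seconds("99.99", seconds_per_day))  # 8.64
    print(max_failed_requests("99.99", 100_000))           # 10
```

For a 99.99% target, the sketch reproduces the figures in the text: 8.64 seconds of downtime in a 24-hour period, and ten failed requests out of 100,000.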

Depending on the purpose of the application, you might have different sets of indicators for how reliable the application is. The following are examples of such reliability indicators:

For applications that serve content, availability, latency, and throughput are important reliability indicators. They indicate whether the application can respond to requests, how long the application takes to respond to requests, and how many requests the application can process successfully in a given period.
For databases and storage systems, latency, throughput, availability, and durability (how well data is protected against loss or corruption) are indicators of reliability. They indicate how long the system takes to read or write data, and whether data can be accessed on demand.
For big data and analytics workloads such as data processing pipelines, consistent pipeline performance (throughput and latency) is essential to ensure freshness of the data products, and is an important reliability indicator. It indicates how much data can be processed, and how long it takes for the pipeline to progress from data ingestion to data processing.
Most applications have data correctness as an essential reliability indicator.
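As a rough illustration of the indicators listed above for content-serving applications, the following Python sketch derives availability, latency, and throughput from a batch of request records. The `RequestRecord` type, its fields, and the `summarize_indicators` function are assumptions made for this example; they don't correspond to any Google Cloud API or log format, and the median is only one of several ways to summarize latency.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One served request (hypothetical record shape for illustration)."""
    succeeded: bool
    latency_ms: float


def summarize_indicators(
    records: Sequence[RequestRecord], window_seconds: float
) -> dict[str, float]:
    """Compute simple reliability indicators for one observation window."""
    if not records:
        raise ValueError("at least one request record is required")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    successful = sum(1 for r in records if r.succeeded)
    return {
        # Proportion of requests that were served successfully.
        "availability_percent": 100 * successful / len(records),
        # How long the application takes to respond (median over all requests).
        "median_latency_ms": median(r.latency_ms for r in records),
        # Successfully processed requests per second in the window.
        "throughput_rps": successful / window_seconds,
    }


if __name__ == "__main__":
    sample = [
        RequestRecord(True, 100.0),
        RequestRecord(True, 200.0),
        RequestRecord(False, 300.0),
        RequestRecord(True, 400.0),
    ]
    print(summarize_indicators(sample, window_seconds=2.0))
```

In practice, you would typically track latency as a high percentile rather than a median, and you would add separate indicators for durability and data correctness, which this sketch doesn't cover.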

For further guidelines to define the reliability objectives for your applications, see Assess the reliability requirements for your cloud workloads.

Note: Planning for disaster recovery (DR) is related to reliability, and DR is essential for business continuity. For detailed guidance about DR planning, see the Disaster recovery planning guide.

Factors that affect application reliability

The reliability of an application that's deployed in Google Cloud depends on the following factors:

The internal design of the application.
The secondary applications or components that the application depends on.
Google Cloud infrastructure resources such as compute, networking, storage, databases, and security that the application runs on, and how the application uses the infrastructure.
Infrastructure capacity that you provision, and how the capacity scales.
The DevOps processes and tools that you use to build, deploy, and maintain the application, its dependencies, and the Google Cloud infrastructure.

These factors are summarized in the following diagram:

Application reliability dependencies.

As shown in the preceding diagram, the reliability of an application that's deployed in Google Cloud depends on multiple factors. The focus of this guide is the reliability of the Google Cloud infrastructure.

What's next

Contributors

Authors:

Nir Tarcic | Cloud Lifecycle SRE UTL
Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors:

Alok Kumar | Distinguished Engineer
Andrew Fikes | Engineering Fellow, Reliability
Chris Heiser | SRE TL
David Ferguson | Director, Site Reliability Engineering
Joe Tan | Senior Product Counsel
Krzysztof Duleba | Principal Engineer
Narayan Desai | Principal SRE
Sailesh Krishnamurthy | VP, Engineering
Steve McGhee | Reliability Advocate
Sudhanshu Jain | Product Manager
Yaniv Aknin | Software Engineer

Building blocks of reliability

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-11-20 UTC.
