Set realistic targets for reliability

Last reviewed 2024-12-30 UTC

This principle in the reliability pillar of the Google Cloud Well-Architected Framework helps you define reliability goals that are technically feasible for your workloads in Google Cloud.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Design your systems to be just reliable enough for user happiness. It might seem counterintuitive, but a goal of 100% reliability is often not the most effective strategy. Higher reliability might result in a significantly higher cost, both in terms of financial investment and potential limitations on innovation. If users are already happy with the current level of service, then efforts to further increase happiness might yield a low return on investment. Instead, you can better spend resources elsewhere.

You need to determine the level of reliability at which your users are happy, and determine the point where the cost of incremental improvements begins to outweigh the benefits. When you determine this level of sufficient reliability, you can allocate resources strategically and focus on features and improvements that deliver greater value to your users.

Recommendations

To set realistic reliability targets, consider the recommendations in the following subsections.

Accept some failure and prioritize components

Aim for high availability, such as 99.99% uptime, but don't set a target of 100% uptime. Acknowledge that some failures are inevitable.

The gap between 100% uptime and a 99.99% target is the allowance for failure. This gap is often called the error budget. The error budget can help you take risks and innovate, which any business must do to stay competitive.
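
To make the error budget concrete, the following minimal Python sketch converts an availability target into the downtime it allows over a measurement window. The function names `error_budget` and `budget_remaining`, and the 30-day window, are assumptions chosen for illustration and are not part of the framework.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime that an availability target allows over a window.

    Hypothetical helper: slo_percent is the availability target, for
    example 99.99, and window is the period the target is measured over.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    allowed_fraction = (100 - slo_percent) / 100
    return window * allowed_fraction


def budget_remaining(slo_percent: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Return the unspent error budget; a negative value means overspent."""
    return error_budget(slo_percent, window) - downtime_so_far


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for target in (99.9, 99.99, 99.999):
        minutes = error_budget(target, window).total_seconds() / 60
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Under these assumptions, a 99.99% target over 30 days leaves roughly 4.3 minutes of allowable downtime, while a 100% target leaves none, which is why it leaves no room for risk or change.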

Prioritize the reliability of the most critical components in the system. Accept that less critical components can have a higher tolerance for failure.

Balance reliability and cost

To determine the optimal reliability level for your system, conduct thorough cost-benefit analyses.

Consider factors like system requirements, the consequences of failures, and your organization's risk tolerance for the specific application. Remember to consider your disaster recovery metrics, such as the recovery time objective (RTO) and recovery point objective (RPO). Decide what level of reliability is acceptable within the budget and other constraints.
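
As a rough illustration of such a cost-benefit comparison, the following Python sketch adds the yearly cost of running at a given availability level to the expected cost of the downtime that level permits, and then picks the cheapest option that fits a budget. Everything here is an assumption made for the example: the `Option` class, the function names, the cost figures, and the simplification that downtime cost scales linearly with minutes of outage. A real analysis would also weigh factors such as RTO, RPO, and risk tolerance.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass(frozen=True)
class Option:
    """A hypothetical reliability level and what it costs to run per year."""
    name: str
    availability_percent: float
    annual_cost: float


def expected_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year that the availability level permits."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


def total_annual_cost(option: Option, downtime_cost_per_minute: float) -> float:
    """Running cost plus the expected cost of permitted downtime."""
    downtime = expected_downtime_minutes(option.availability_percent)
    return option.annual_cost + downtime * downtime_cost_per_minute


def cheapest_within_budget(options: Sequence[Option],
                           downtime_cost_per_minute: float,
                           budget: float) -> Optional[Option]:
    """Pick the affordable option with the lowest total cost, if any."""
    affordable = [o for o in options if o.annual_cost <= budget]
    if not affordable:
        return None
    return min(affordable,
               key=lambda o: total_annual_cost(o, downtime_cost_per_minute))


if __name__ == "__main__":
    # All figures below are invented for illustration only.
    options = [
        Option("three nines", 99.9, 50_000),
        Option("four nines", 99.99, 120_000),
        Option("five nines", 99.999, 400_000),
    ]
    best = cheapest_within_budget(options, downtime_cost_per_minute=200,
                                  budget=500_000)
    print(best)
```

With these invented figures, the middle option has the lowest total cost: the extra spend to reach the highest level exceeds the downtime cost that it avoids.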

Look for ways to improve efficiency and reduce costs without compromising essential reliability features.


Last updated 2024-12-30 UTC.