Design for graceful degradation

Last reviewed 2024-12-30 UTC

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you design your Google Cloud workloads to fail gracefully.

This principle is relevant to the response focus area of reliability.

Principle overview

Graceful degradation is a design approach where a system that experiences a high load continues to function, possibly with reduced performance or accuracy. Graceful degradation ensures continued availability of the system and prevents complete failure, even if the system's work isn't optimal. When the load returns to a manageable level, the system resumes full functionality.

For example, during periods of high load, Google Search prioritizes results from higher-ranked web pages, potentially sacrificing some accuracy. When the load decreases, Google Search recomputes the search results.

Recommendations

To design your systems for graceful degradation, consider the recommendations in the following subsections.

Implement throttling

Ensure that your replicas can independently handle overloads and can throttle incoming requests during high-traffic scenarios. This approach helps you to prevent cascading failures that are caused by shifts in excess traffic between zones.

Use tools like Apigee to control the rate of API requests during high-traffic times. You can configure policy rules to reflect how you want to scale back requests.
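To illustrate the idea of throttling inside a single replica, the following is a minimal token-bucket sketch. It is not Apigee's API or policy syntax; the class name `TokenBucket` and its `rate`, `capacity`, and `clock` parameters are hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill tokens for the time elapsed since the previous call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request handler could call `allow()` before doing any work and return an HTTP 429 response when it returns False. This version is not thread-safe; a multithreaded server would need a lock around `allow()`.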

Drop excess requests early

Configure your systems to drop excess requests at the frontend layer to protect backend components. Dropping some requests prevents global failures and enables the system to recover more gracefully. With this approach, some users might experience errors. However, you can minimize the impact of outages, in contrast to an approach like circuit-breaking, where all traffic is dropped during an overload.
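One way to picture early request dropping is a cap on in-flight requests at the frontend. The sketch below is an assumption-laden illustration, not a Google Cloud API: `LoadShedder`, `handle`, and the `max_in_flight` limit are hypothetical names, and the backend is any callable.

```python
import threading


class LoadShedder:
    """Reject new work once `max_in_flight` requests are already being served."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, backend):
    """Return (status, body); shed the request with 503 before touching the backend."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, backend()
    finally:
        shedder.release()
```

Because the rejection happens before the backend is called, shed requests cost almost nothing, and the requests that are admitted still complete normally.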

Handle partial errors and retries

Build your applications to handle partial errors and retries seamlessly. This design helps to ensure that as much traffic as possible is served during high-load scenarios.
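As a rough sketch of this recommendation, the following code retries a failing call with exponential backoff and jitter, and lets a fan-out return whatever succeeded instead of failing as a whole. The function names `call_with_retries` and `fetch_all` and their parameters are hypothetical. Catching every `Exception` is a simplification; real code would typically retry only errors known to be transient and only operations that are safe to repeat.

```python
import random
import time


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait a random time in [0, base_delay * 2**attempt] before retrying.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


def fetch_all(fetchers, **retry_kwargs):
    """Run each named fetcher; return (results, errors) so callers can serve partial data."""
    results, errors = {}, {}
    for name, fn in fetchers.items():
        try:
            results[name] = call_with_retries(fn, **retry_kwargs)
        except Exception as exc:
            errors[name] = exc
    return results, errors
```

A caller can render the entries in `results` and show a placeholder for each name in `errors`, so one unavailable dependency does not fail the whole response.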

Test overload scenarios

To validate that the throttle and request-drop mechanisms work effectively, regularly simulate overload conditions in your system. Testing helps ensure that your system is prepared for real-world traffic surges.
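A toy version of such a test might look like the sketch below. It offers more requests than a stand-in handler can serve and counts the responses by status. Both `simulate_overload` and `make_capped_handler` are hypothetical helpers; in a real test, the handler would be replaced by calls to the system under test, usually driven by a load-testing tool.

```python
def simulate_overload(handler, offered_requests):
    """Send `offered_requests` requests to handler and count responses by status."""
    counts = {}
    for request_id in range(offered_requests):
        status = handler(request_id)
        counts[status] = counts.get(status, 0) + 1
    return counts


def make_capped_handler(capacity):
    """Hypothetical system under test: serves `capacity` requests, then sheds with 429."""
    served = {"n": 0}

    def handler(_request_id):
        if served["n"] >= capacity:
            return 429
        served["n"] += 1
        return 200

    return handler


counts = simulate_overload(make_capped_handler(capacity=100), offered_requests=250)
# The excess should be rejected cleanly, with no unexpected statuses such as 500.
assert set(counts) <= {200, 429}
assert counts[200] <= 100
```

The useful assertions in an overload test are usually of this shape: served traffic stays at or below capacity, and everything else is rejected with the intended status instead of timing out or crashing.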

Monitor traffic spikes

Use analytics and monitoring tools to predict and respond to traffic surges before they escalate into overloads. Early detection and response can help maintain service availability during high-demand periods.
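As a simple illustration of surge detection, the sketch below compares each traffic sample against an exponentially weighted moving average of earlier samples. The class name `SpikeDetector` and the `alpha` and `threshold` values are assumptions for this example; a production setup would more likely use alerting rules in a monitoring service than hand-written code.

```python
class SpikeDetector:
    """Flag a sample that exceeds `threshold` times the smoothed baseline."""

    def __init__(self, alpha=0.2, threshold=2.0):
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # multiple of the baseline that counts as a spike
        self.baseline = None

    def observe(self, requests_per_second):
        """Record one sample and return True if it looks like a traffic spike."""
        if self.baseline is None:
            self.baseline = float(requests_per_second)
            return False
        is_spike = requests_per_second > self.threshold * self.baseline
        # Update after comparing, so a spike is judged against prior traffic only.
        self.baseline = (
            self.alpha * requests_per_second + (1 - self.alpha) * self.baseline
        )
        return is_spike
```

Because the baseline keeps adapting, a sustained higher level of traffic eventually stops being reported as a spike, which may or may not be what you want for alerting.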

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-12-30 UTC.
