Detect potential failures by using observability

Last reviewed 2024-12-30 UTC

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you proactively identify areas where errors and failures might occur.

This principle is relevant to the observation focus area of reliability.

Principle overview

To maintain and improve the reliability of your workloads in Google Cloud, you need to implement effective observability by using metrics, logs, and traces.

Metrics are numerical measurements of activities that you want to track for your application at specific time intervals. For example, you might want to track technical metrics like request rate and error rate, which can be used as service-level indicators (SLIs). You might also need to track application-specific business metrics like orders placed and payments received.
Logs are time-stamped records of discrete events that occur within an application or system. The event could be a failure, an error, or a change in state. Logs might include metrics, and you can also use logs for SLIs.
A trace represents the journey of a single user or transaction through a number of separate applications or the components of an application. For example, these components could be microservices. Traces help you to track which components were used in the journeys, where bottlenecks exist, and how long the journeys took.

Metrics, logs, and traces help you monitor your system continuously. Comprehensive monitoring helps you find out where and why errors occurred. You can also detect potential failures before errors occur.
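As a rough illustration of how logs can feed SLIs, the following Python sketch derives an availability SLI and a latency SLI from a batch of request log records. The RequestLog schema, the function names, and the 5xx and latency cutoffs are assumptions made for this example. They are not part of any Google Cloud API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RequestLog:
    """One time-stamped log record for a handled request (hypothetical schema)."""
    timestamp: float   # seconds since epoch
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(logs: Sequence[RequestLog]) -> Optional[float]:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not logs:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for entry in logs if entry.status < 500)
    return good / len(logs)


def latency_sli(logs: Sequence[RequestLog], threshold_ms: float) -> Optional[float]:
    """Fraction of requests served within threshold_ms."""
    if not logs:
        return None
    fast = sum(1 for entry in logs if entry.latency_ms <= threshold_ms)
    return fast / len(logs)


sample_logs = [
    RequestLog(timestamp=0.0, status=200, latency_ms=120.0),
    RequestLog(timestamp=1.0, status=500, latency_ms=900.0),
    RequestLog(timestamp=2.0, status=200, latency_ms=80.0),
    RequestLog(timestamp=3.0, status=404, latency_ms=50.0),
]
print(availability_sli(sample_logs))    # share of non-5xx requests
print(latency_sli(sample_logs, 300.0))  # share of requests under 300 ms
```

In this sketch a 404 counts as a good request because the service answered correctly. Whether client errors should count against availability is a choice that depends on the workload.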

Recommendations

To detect potential failures efficiently, consider the recommendations in the following subsections.

Gain comprehensive insights

To track key metrics like response times and error rates, use Cloud Monitoring and Cloud Logging. These tools also help you to ensure that the metrics consistently meet the needs of your workload.

To make data-driven decisions, analyze default service metrics to understand component dependencies and their impact on overall workload performance.

To customize your monitoring strategy, create and publish your own metrics by using the Google Cloud SDK.
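As a hedged sketch of what publishing a custom metric involves, the following Python code builds a request body shaped like the one that the Cloud Monitoring API accepts when you create time series data. It only constructs the payload and does not send it. The helper name, the metric name orders_placed, the project ID, and the choice of the global resource type are assumptions for illustration. Check the current API reference before relying on the exact field names.

```python
from datetime import datetime, timezone
from typing import Optional


def build_time_series_body(
    metric_name: str,
    value: float,
    project_id: str,
    end_time: Optional[datetime] = None,
) -> dict:
    """Build a payload with a single data point for a custom metric.

    Hypothetical helper: it only assembles a dictionary and makes no API call.
    """
    if end_time is None:
        end_time = datetime.now(timezone.utc)
    return {
        "timeSeries": [
            {
                # Custom metric types live under the custom.googleapis.com prefix.
                "metric": {"type": f"custom.googleapis.com/{metric_name}"},
                # Assumption: the point is attached to the generic "global" resource.
                "resource": {"type": "global", "labels": {"project_id": project_id}},
                "points": [
                    {
                        "interval": {"endTime": end_time.isoformat()},
                        "value": {"doubleValue": float(value)},
                    }
                ],
            }
        ]
    }


body = build_time_series_body(
    "orders_placed",
    42,
    "example-project",
    end_time=datetime(2024, 12, 30, tzinfo=timezone.utc),
)
print(body["timeSeries"][0]["metric"]["type"])
```

A business metric such as orders placed is used here because default service metrics do not cover application-specific activity.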

Perform proactive troubleshooting

Implement robust error handling and enable logging across all of the components of your workloads in Google Cloud. Activate logs like Cloud Storage access logs and VPC Flow Logs.

When you configure logging, consider the associated costs. To control logging costs, you can configure exclusion filters on the log sinks to exclude certain logs from being stored.
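Before you configure an exclusion filter, it can help to estimate how much log volume the filter would drop. The following Python sketch models a severity-based exclusion locally, over a sample of log entries. It does not call Cloud Logging. The entry fields (severity, size_bytes) and the function names are assumptions for illustration, and the severity names follow the standard Cloud Logging levels.

```python
from typing import Iterable, Mapping

# Cloud Logging severity levels, from least to most severe.
SEVERITY_ORDER = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def is_excluded(entry: Mapping, min_severity: str = "WARNING") -> bool:
    """Return True if the entry is below min_severity and would be dropped."""
    return SEVERITY_ORDER.index(entry["severity"]) < SEVERITY_ORDER.index(min_severity)


def excluded_fraction(entries: Iterable[Mapping], min_severity: str = "WARNING") -> float:
    """Share of total log bytes that the exclusion would keep out of storage."""
    entries = list(entries)
    total = sum(e["size_bytes"] for e in entries)
    if total == 0:
        return 0.0
    dropped = sum(e["size_bytes"] for e in entries if is_excluded(e, min_severity))
    return dropped / total


sample_entries = [
    {"severity": "DEBUG", "size_bytes": 600},
    {"severity": "INFO", "size_bytes": 200},
    {"severity": "ERROR", "size_bytes": 200},
]
print(excluded_fraction(sample_entries))  # share of bytes below WARNING
```

Excluding low-severity logs reduces cost, but it also removes context that might be needed during troubleshooting, so the cutoff is a trade-off rather than a fixed rule.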

Optimize resource utilization

Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect under-provisioned and over-provisioned resources in services like GKE, Compute Engine, and Dataproc. For a complete list of supported services, see Cloud Monitoring overview.
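One simple way to act on utilization data is to compare average CPU utilization against a lower and an upper bound. The following Python sketch shows that idea. The function name and the 20% and 80% thresholds are assumptions for illustration, not recommended values. A real assessment would also consider network I/O, disk I/O, and peak load.

```python
from statistics import mean
from typing import Sequence


def classify_provisioning(
    cpu_utilization: Sequence[float],
    low: float = 0.20,   # assumed lower bound for healthy average utilization
    high: float = 0.80,  # assumed upper bound for healthy average utilization
) -> str:
    """Classify a resource from CPU utilization samples in the range 0.0 to 1.0."""
    if not cpu_utilization:
        return "unknown"  # no samples, so no conclusion
    average = mean(cpu_utilization)
    if average < low:
        return "over-provisioned"   # capacity is mostly idle
    if average > high:
        return "under-provisioned"  # capacity is close to saturation
    return "ok"


print(classify_provisioning([0.05, 0.10, 0.15]))
print(classify_provisioning([0.90, 0.95, 0.85]))
print(classify_provisioning([0.50, 0.40]))
```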

Prioritize alerts

For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see Alerting overview.
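One common way to reduce alert fatigue is to fire an alert only when a metric stays above its threshold for several consecutive samples, so that brief spikes are ignored. The following Python sketch illustrates that rule. The function name, the 5% error-rate threshold, and the window of three samples are assumptions for this example.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if at least min_consecutive samples in a row exceed threshold."""
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the breach was not sustained, so start counting again
    return False


# Error-rate samples, one per evaluation interval.
brief_spikes = [0.01, 0.06, 0.02, 0.07]
sustained_breach = [0.01, 0.06, 0.07, 0.08]

print(should_alert(brief_spikes, threshold=0.05, min_consecutive=3))
print(should_alert(sustained_breach, threshold=0.05, min_consecutive=3))
```

A longer window suppresses more noise but delays detection, so the window length should reflect how quickly the issue needs a response.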

