Conduct thorough postmortems

Last reviewed 2024-12-30 UTC

This principle in the reliability pillar of the Google Cloud Well-Architected Framework provides recommendations to help you conduct effective postmortems after failures and incidents.

This principle is relevant to the learning focus area of reliability.

Principle overview

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve the incident, the root causes, and the follow-up actions to prevent the incident from recurring. The goal of a postmortem is to learn from mistakes, not to assign blame.
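As a minimal sketch, the parts of a postmortem named above could be captured in a small record type. The class name `Postmortem`, its field names, and the `missing_sections` helper are hypothetical and chosen for illustration only; they are not part of the framework.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record that mirrors the parts of a postmortem."""
    incident: str                                   # what happened
    impact: str = ""                                # who or what was affected
    mitigation_actions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "incident": self.incident,
            "impact": self.impact,
            "mitigation_actions": self.mitigation_actions,
            "root_causes": self.root_causes,
            "follow_up_actions": self.follow_up_actions,
        }
        return [name for name, value in sections.items() if not value]


# Example data is invented for illustration.
draft = Postmortem(
    incident="Checkout API returned errors for 40 minutes",
    impact="Some customers could not complete purchases",
    mitigation_actions=["Rolled back the latest release"],
)
print(draft.missing_sections())  # ['root_causes', 'follow_up_actions']
```

A check like `missing_sections` can help a reviewer see at a glance which parts of the record still need to be written before the postmortem is considered complete.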

The following diagram shows the workflow of a postmortem:

[Diagram: the workflow of a postmortem.]

The workflow of a postmortem includes the following steps:

  • Create postmortem
  • Capture the facts
  • Identify and analyze the root causes
  • Plan for the future
  • Execute the plan

Conduct postmortem analyses after major events and non-major events like the following:

  • User-visible downtimes or degradations beyond a certain threshold.
  • Data losses of any kind.
  • Interventions from on-call engineers, such as a release rollback or rerouting of traffic.
  • Resolution times above a defined threshold.
  • Monitoring failures, which usually imply manual incident discovery.

Recommendations

Define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.
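One way to make such criteria explicit is to write them down as data and check each incident against them. The following is a sketch only: the type names, field names, and threshold values are assumptions made for illustration, and the five checks simply mirror the example triggers listed in the principle overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostmortemCriteria:
    """Thresholds agreed on in advance (the values are placeholders)."""
    max_user_visible_minutes: float = 15.0
    max_resolution_minutes: float = 120.0


@dataclass(frozen=True)
class Incident:
    """Hypothetical summary of an incident."""
    user_visible_minutes: float = 0.0
    resolution_minutes: float = 0.0
    data_lost: bool = False
    on_call_intervened: bool = False      # e.g. rollback or traffic rerouting
    detected_by_monitoring: bool = True   # False implies manual discovery


def postmortem_reasons(incident: Incident, criteria: PostmortemCriteria) -> list[str]:
    """Return every criterion the incident meets; an empty list means none."""
    reasons = []
    if incident.user_visible_minutes > criteria.max_user_visible_minutes:
        reasons.append("user-visible downtime or degradation above threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > criteria.max_resolution_minutes:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure")
    return reasons


criteria = PostmortemCriteria()
rollback = Incident(user_visible_minutes=5, resolution_minutes=30,
                    on_call_intervened=True)
print(postmortem_reasons(rollback, criteria))  # ['on-call intervention']
```

Returning the list of matched criteria, rather than a single yes or no, records why a postmortem was triggered, which can be useful context for the report itself.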

To conduct effective postmortems, consider the recommendations in the following subsections.

Conduct blameless postmortems

Effective postmortems focus on processes, tools, and technologies, and don't place blame on individuals or teams. The purpose of a postmortem analysis is to improve your technology and future, not to find who is guilty. Everyone makes mistakes. The goal should be to analyze the mistakes and learn from them.

The following examples show the difference between feedback that assigns blame and blameless feedback:

  • Feedback that assigns blame: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things piecemeal. Seriously, if I get paged one more time I'll rewrite it myself…"
  • Blameless feedback: "An action item to rewrite the entire backend system might actually prevent these pages from continuing to happen. The maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-call engineers will thank us!"

Make the postmortem report readable by all the intended audiences

For each piece of information that you plan to include in the report, assess whether that information is important and necessary to help the audience understand what happened. You can move supplementary data and explanations to an appendix of the report. Reviewers who need more information can request it.

Avoid complex or over-engineered solutions

Before you start to explore solutions for a problem, evaluate the importance of the problem and the likelihood of a recurrence. Adding complexity to the system to solve problems that are unlikely to occur again can lead to increased instability.
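The trade-off described above can be made concrete with a rough ranking heuristic. The sketch below is an illustration, not a method prescribed by the framework: the `ProposedFix` type, its rating scales, and the scoring formula (importance times likelihood of recurrence, divided by added complexity) are all assumptions, and the sample fixes are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedFix:
    """Hypothetical follow-up action with team-estimated ratings."""
    name: str
    importance: int               # 1 (minor) to 5 (severe)
    recurrence_likelihood: float  # estimated probability, 0.0 to 1.0
    added_complexity: int         # 1 (trivial change) to 5 (major new component)

    def score(self) -> float:
        # Assumed heuristic: expected benefit per unit of added complexity.
        return self.importance * self.recurrence_likelihood / self.added_complexity


def rank_fixes(fixes: list[ProposedFix]) -> list[ProposedFix]:
    """Order fixes from most to least worthwhile under the heuristic."""
    return sorted(fixes, key=lambda fix: fix.score(), reverse=True)


fixes = [
    ProposedFix("Build a custom multi-region failover layer",
                importance=5, recurrence_likelihood=0.1, added_complexity=5),
    ProposedFix("Add an alert on queue depth",
                importance=4, recurrence_likelihood=0.8, added_complexity=1),
]
for fix in rank_fixes(fixes):
    print(fix.name)
```

Under these assumed ratings, the simple alert ranks above the large failover project, because the severe problem is unlikely to recur and the proposed fix would add substantial complexity.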

Share the postmortem as widely as possible

To ensure that issues don't remain unresolved, publish the outcome of the postmortem to a wide audience and get support from management. The value of a postmortem is proportional to the learning that occurs after the postmortem. When more people learn from incidents, the likelihood of similar failures recurring is reduced.
