How to write a good incident postmortem
Sometimes things don’t go smoothly when you introduce changes to your application. When that happens, you ship a hotfix as soon as possible, usually followed by a coldfix. Such situations are great to learn from.
Purpose
The postmortem serves the purpose of finding the root cause of an incident and giving the team insights that make the system more resilient in the future.
It ain’t cheap
It costs time, but you should consider it an investment. Sometimes it can be hard to find the origin of a problem that occurred in your system. However, fixing the effects of an incident without a deep understanding of its origin is putting patches on patches.
Losing control
Every incident in the system makes management think that you don’t have control. This can have several outcomes which you may want to avoid:
- adding more checks like mandatory pre–deployment review
- adding new policies, e.g. no commits to master branch
- adding yet another supervisor to decide when you can introduce changes
Regaining control
We’re already responsible developers. A postmortem is a great way to address those doubts and propose reasonable solutions to prevent further issues.
How to postmortem
Here’s a not very opinionated list of elements a postmortem should consist of. Remember its most important outcome: to make a change and improve both your system and your organization.
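If you want to keep postmortems consistent, the elements described in the following sections could be treated as fields of a simple record. The snippet is only a sketch: the Postmortem struct and its missing_fields helper are hypothetical names made up for illustration, not part of any library.

```ruby
# Hypothetical record with one field per postmortem element.
Postmortem = Struct.new(
  :title, :status, :severity, :commander, :first_occurrence,
  :description, :communication_channel, :reason, :fix, :summary, :prevention,
  keyword_init: true
) do
  # Returns the names of fields that are still nil or empty,
  # so an unfinished draft is easy to spot.
  def missing_fields
    members.select do |field|
      value = self[field]
      value.nil? || (value.respond_to?(:empty?) && value.empty?)
    end
  end
end

draft = Postmortem.new(
  title: "Cat gifs library RuntimeError",
  status: "resolved"
)
puts draft.missing_fields.inspect
```

A check like this could run before the document is shared, but a plain template in your wiki works just as well.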
Title
Brief description of what happened, e.g. Cat gifs library RuntimeError.
Status
States whether the incident is resolved or not.
Severity
State how severe this issue was for your platform. If your organization has this formalized, follow it accordingly, e.g. HIGH AF.
Commander
Who is responsible for the investigation, e.g. Andy Dwyer.
First occurrence
When the issue occurred, e.g. 2023-02-28 15:03:45 UTC, maybe followed by a link to your favorite bugtracker.
Description
A bit broader take on what really happened: Broken cat images generation, 1410 of our customers were disappointed at not getting cute cat images while visiting our website.
Communication channel
Where you performed the investigation. It can be a link to a Slack thread, an issue in the one–who–must–not–be–named Jira, whatever works in your organization.
Reason
Describe what exactly happened, in as much detail as possible:
- the package cutecatgifs should live under /usr/bin since it’s installed as a system package,
- the gem cutecatgifs-binary has been removed from Gemfile since it was duplicating the feature already living in the system under /usr/bin,
- unfortunately, due to the gem itself being present in the Docker image, but no longer in the Gemfile, the library called CuteCatGifsComposer tried to use the cutecatgifs-binary bin wrapper instead of the system–wide package. This happened since cutecatgifs-binary was present earlier in the $PATH: /usr/local/yourfavouritrrubyversionmanager/gems/ruby-2.7.7/bin:/usr/local/bin:/usr/bin,
- it was expected that the binstub won’t be present in a new deployment.
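The mechanism behind this is plain $PATH lookup: directories are searched left to right and the first executable with a matching name wins. Here is a minimal sketch of that rule, assuming a Unix-like system. The helpers resolve_executable and demo_path_shadowing are hypothetical, and the temporary directories only stand in for the gems bin directory and /usr/bin from the story.

```ruby
require "tmpdir"
require "fileutils"

# Walks the PATH-style string left to right and returns the first
# executable file called `name`, or nil when none is found.
def resolve_executable(name, path)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Builds two throwaway directories that both contain `cutecatgifs`
# and reports which one wins when the gems directory comes first.
def demo_path_shadowing
  Dir.mktmpdir do |root|
    gem_bin = File.join(root, "gems", "bin")   # stand-in for the stale binstub dir
    system_bin = File.join(root, "usr", "bin") # stand-in for /usr/bin
    [gem_bin, system_bin].each do |dir|
      FileUtils.mkdir_p(dir)
      file = File.join(dir, "cutecatgifs")
      File.write(file, "#!/bin/sh\n")
      FileUtils.chmod(0o755, file)
    end

    path = [gem_bin, system_bin].join(File::PATH_SEPARATOR)
    resolved = resolve_executable("cutecatgifs", path)
    File.dirname(resolved) == gem_bin ? :gem_bin : :system_bin
  end
end

puts demo_path_shadowing
```

The system package is still installed and working in this scenario. It is simply never reached, because an earlier $PATH entry shadows it.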
Fix
Describe how you’ve resolved the issue: reverting the changes in Gemfile and Gemfile.lock resolved the issue.
Summary
TL;DR for the lazy people, with the key takeaways:
- An incorrect binary, not existing in the bundle, was called, causing RuntimeError,
- The binary path was resolved incorrectly because bundle exec which cutecatgifs returned its path based on $PATH, which was prepended with the binstubs directory.
Prevention
Describe in points how similar issues can be avoided in the future. This serves the purpose of improving your development process and the system itself:
- Avoid shared state coming from the Docker image, which contributed to the issue
- Add an automated post–deployment check of whether a cute cat gif appears on the website after deployment
- Reduce deployment time from 40 to 4 minutes, so that thanks to a quick revert only a few people wouldn’t see the picture of a cat, rather than 1410
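The automated post–deployment check could be as small as the sketch below. It assumes the cat image is served as a GIF from a URL you know. The example URL and the helpers gif_response?, http_fetch and smoke_check are hypothetical, and the only thing verified is the HTTP status plus the GIF signature at the start of the body.

```ruby
require "net/http"
require "uri"

# A GIF file always starts with "GIF87a" or "GIF89a".
def gif_response?(status, body)
  status == 200 && body.to_s.b.start_with?("GIF87a", "GIF89a")
end

# Fetches the URL and returns [status_code, body].
def http_fetch(url)
  response = Net::HTTP.get_response(URI(url))
  [response.code.to_i, response.body]
end

# Returns true when the deployed app serves something that looks like a GIF.
# `fetch` is injectable so the check can be exercised without a network.
def smoke_check(url, fetch: method(:http_fetch))
  status, body = fetch.call(url)
  gif_response?(status, body)
end

# Example wiring for a deploy script (hypothetical URL):
#   exit(1) unless smoke_check("https://example.com/cute-cat.gif")
```

A failing check is most useful when it triggers the revert automatically, which ties it to the deployment time point above.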
Plot twist
This is based on a true story. What’s even funnier is that the development process already included all the points mentioned in the Losing control paragraph. It lacked the most important one: the ability to act quickly when an issue occurs. Mistakes will happen, especially if taking the risk is cheaper than preventing all the edge cases.
But that’s a topic for a different story.