Outages are awful for users and teams alike. What can we learn from them when they happen?
On May 12, 2020, Slack had its first significant outage in a long time. We published a summary of the incident shortly after, but this story is an interesting one, and we’d like to go into more detail on the technical issues around it.
The user-visible outage began at 4:45pm Pacific time, but the story really begins around 8:30am that morning. Our Database Reliability Engineering team was alerted about a significant load increase in part of our database infrastructure at the same time as our Traffic team received alerts that we were failing some API requests.