Turn incidents into valuable learning opportunities by harnessing the power of incident analysis.
A few years ago, there was an engineer acting as the infrastructure lead for an analytics startup. We’ll call them LeadEng McInfraPants. (This definitely wasn’t me! Of course not!)
LeadEng successfully led many projects that greatly improved reliability and scalability. Despite these successes and their deep experience, one day LeadEng set in motion what became a huge, multi-hour incident that took the company’s services completely offline. Most of us would look at this series of events and say ‘LeadEng made a mistake’.
To grow as an organization, however, we need to understand not just the how, but also the why.