How To Establish a High Severity Incident Management Program

Over the past few years I’ve been involved in many production incidents. I’ve fixed them, I’ve caused them, I’ve watched them from afar, and I’ve been an Incident Manager.

Today I’m excited to share a paper I wrote that explains how you can establish a high severity incident management program with your team. I hope it helps bring you more reliable systems 💻 and more hours of sleep 😴.


I wrote this paper based on my own experience, with input and feedback from my team at Gremlin. Between us, we have worked at a variety of companies, including Amazon, Netflix, Salesforce, Dropbox, DigitalOcean, National Australia Bank, and Akamai.

I’d love to hear if you implement these incident management practices with your team. My DMs are open on Twitter: @tammybutow.

Want to chat about Incident Management, Chaos Engineering, or SRE?
Join our Slack community:


Principal Site Reliability Engineer @GremlinInc | Chaos Engineering ☁️ 💻 ⚡️💀 Previously @DigitalOcean @Dropbox @NAB @QUT
