How to set up a very basic Incident Management process?

A guide to IcM process by a seasoned monitoring engineer

Kristjan Hiis
3 min read · Jul 22, 2020
Photo by CDC on Unsplash

This time around I would like to tone down the enterprise-ish way of thinking and drill deeper into the needs of smaller businesses and their incident management process.
First of all, you will need a monitoring system of sorts to know where the issues are: are the disks of your servers filling up, has the network started flapping, or is there perhaps a wider outage at your service provider altogether?

In any case, we can assume that you have some sort of solution, either on-premises or SaaS, set up to monitor your whole stack. Once an alert is triggered and things seem bleak, your developers and admins will need a process to guide them through the incident.
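To make this concrete, here is a minimal sketch of what a single monitoring check could look like, using only the Python standard library. The 90% threshold and the notify_oncall stub are placeholders for whatever your monitoring or paging tool actually provides.

import shutil

DISK_USAGE_THRESHOLD = 0.90  # alert when a filesystem is more than 90% full

def notify_oncall(message):
    # Placeholder: in a real setup this would call your paging or alerting tool.
    print(f"ALERT: {message}")

def check_disk(path="/"):
    # Measure how full the filesystem at `path` is and alert above the threshold.
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= DISK_USAGE_THRESHOLD:
        notify_oncall(f"Disk usage on {path} is at {used_fraction:.0%}")

if __name__ == "__main__":
    check_disk("/")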

Why?

This is simple: to make sure that this exact thing never happens again, and to make sure that everything runs as smoothly as possible in times of trouble.

First of all, an incident consists of multiple parts, such as:

1. Monitoring and alerts
2. Incident responders (mostly developers and admins), also known as the on-call rotation
3. IM Process
4. Post Mortem
5. Final fixes

As we have already established the presence of the first two points, let's move on to the third one: the notorious IM process. It actually includes many subsections as well, but the biggest part is delivering the right alert to the right person at the right time. As you might have figured out by now, the third point is tightly coupled with the first two, and it only works when someone picks up the call that your systems are making, or sees the notification once something goes sideways.
Right then and there your in-house IM process begins: by receiving the alert and acknowledging that something is broken.
Usually, it is beneficial to start off with basic troubleshooting of the issue at hand. Once the responsible person (the one who received the alert notification) has narrowed down the underlying issue, they can start gathering the people responsible for it.
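The "right person at the right time" part does not need heavy tooling; it can be as simple as a lookup against an on-call schedule. A minimal sketch, with a hard-coded weekly rotation and example addresses purely for illustration:

from datetime import datetime, timezone

# Hypothetical weekly rotation: the ISO week number picks the responder.
ON_CALL_ROTATION = ["alice@example.com", "bob@example.com", "carol@example.com"]

def current_on_call(now=None):
    # Determine who is on call this week, based on the ISO calendar week number.
    now = now or datetime.now(timezone.utc)
    week = now.isocalendar()[1]
    return ON_CALL_ROTATION[week % len(ON_CALL_ROTATION)]

def route_alert(message):
    responder = current_on_call()
    # Placeholder: replace with your actual notification channel (SMS, Slack, pager).
    print(f"Paging {responder}: {message}")

route_alert("Disk usage on / is at 92%")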

To play it out for a clearer picture:
An alert is received by the on-call admin, who figures out that the incident is caused by faulty code that has been deployed to the live environment. The admin can then contact the developer responsible for the code for a quick mitigation, or issue a code roll-back so the live environment serves the previously working code.
It is highly beneficial to have some sort of communication method between the peers in this phase, be it Slack for text discussion or an ad-hoc Discord call for quicker responses. Either way, the actions taken need to be written down, along with the time at which each of them was taken. All of this will be used in the fourth step of our IcM process: the Post Mortem.
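Writing down what was done and when does not need special tooling either; even a tiny helper that timestamps each note in UTC is enough to reconstruct the timeline later. A minimal sketch (the log file name is arbitrary):

from datetime import datetime, timezone

def log_action(note, logfile="incident.log"):
    # Append a UTC-timestamped line to the incident log.
    stamp = datetime.now(timezone.utc).strftime("%H:%M UTC")
    with open(logfile, "a") as f:
        f.write(f"{stamp} - {note}\n")

# Example usage during an incident:
log_action("Alerting system triggered the incident")
log_action("Called in developer X to investigate")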

To keep it simple, the on-call responder can use a very simple template like:

11:12 UTC — Our alerting system triggered the incident
11:35 UTC — Called in developer X to investigate the issue alongside me
11:45 UTC — Developer X found out that it was a code bug
11:48 UTC — We deployed the previous code to live, thus mitigating the incident

Now we can actually move on to the post mortem itself. The expression comes from Latin and means "after death", but in a computer science context Wikipedia defines it as "a process, usually performed at the conclusion of a project, to determine and analyze elements of the project that were successful or unsuccessful", and it is usually delivered as a written document. I think most of the process can be templated, and thus I will give you the power of knowledge: the almighty postmortem template.
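As a rough illustration, a minimal sketch of the sections such a template could cover (the headings here are just examples, adapt them to your team):

Incident summary: what happened, in one or two sentences
Impact: which services and customers were affected, and for how long
Timeline: the timestamped actions gathered during the incident
Root cause: the underlying issue, not just the symptom
Mitigation: what was done to stop the bleeding
Action items: the final fixes, each with an owner and a due date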

Now there you have it — a full-fledged IcM workflow for you to implement at your will.
I sincerely hope that you will never have to use it, but I have no doubt that once you implement an IcM process, you will have to use it at some point.

May your incidents resolve fast and your servers run forever.

