
Incident management, the value of shared experience


The main objective of incident management is to restore normal service operations as soon as possible while minimizing the negative impact on business operations.

For this reason, teams need management tools that support the assignment of resources and time estimates, and that use alerts and escalation so that incidents are answered and resolved within a defined maximum time.

Sound incident management also requires that each incident be logged individually, with enough descriptive information to help resolve it in the first instance. That information should not only help resolve the incident at hand; it should also feed knowledge back so that similar incidents are easier to resolve later.

A few weeks ago, I attended one of the conferences held by ITSMF (Vision 2017). One of the talks offered a very vivid example of how an incident gets handled, and I cannot resist retelling it.

Juan (not his real name) starts work at a financial organization, resolving incidents. On his first day, of course, he receives a manual with the action protocols for the different types of incidents. Two weeks later he joins the on-call rotation and is assigned to the night shift. Just as his shift begins, he receives an alert: there has been a failure in the ATM network and three ATMs are out of service.

Concerned, Juan turns to his manual and finds the relevant protocol: first, evaluate the percentage of ATMs that are out of service; next, if this percentage exceeds 3 %...

Juan sees that only three ATMs are down, which does not exceed the threshold, so the protocol says to restart them, and he launches the restart process. But 30 minutes later a new alert arrives: 25 ATMs are now down and the number is increasing. Juan already knows the protocol: with this many ATMs down, the incident has to be escalated and the corresponding person informed.
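The decision Juan makes by hand can be written down as a small rule. This is only a sketch: the story gives the 3 % threshold and the two outcomes (restart or escalate), but not the size of the ATM network, so the `total_atms` value below, like the `Action` and `next_action` names, is a hypothetical chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    RESTART = "restart"
    ESCALATE = "escalate"


def next_action(out_of_service: int, total_atms: int, threshold_pct: float = 3.0) -> Action:
    """Pick the protocol step from the share of ATMs that are down."""
    if total_atms <= 0:
        raise ValueError("total_atms must be positive")
    pct_down = 100.0 * out_of_service / total_atms
    # Strictly above the threshold means escalation; otherwise restart.
    return Action.ESCALATE if pct_down > threshold_pct else Action.RESTART


# Hypothetical network of 500 ATMs (the story does not give the real size).
print(next_action(3, 500))   # 0.6 % of the network is down
print(next_action(25, 500))  # 5.0 % of the network is down
```

A rule like this only covers the first step of the protocol. As the rest of the story shows, the hard part begins once the restart does not work.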

He picks up the phone and details what has happened, the protocols he has followed, their results and the current situation. Meanwhile, the number of ATMs out of service continues to grow.

On the phone, the two of them try to solve the incident. The exchange goes as follows, where A is Juan, B is the next person in charge and C is another person named in the protocol.

  • A: Restart the ATMs.
  • B: Check that in system X, everything is working
  • A: I don't have access
  • B: Enter the Z server and check from there
  • A: I don't have access
  • B: Use my credentials
  • A: I can't check the status of system X either
  • B: Wait, I'll call C...(one more step)
  • B: Yes, good night (explains the situation)
  • C: Tell him to log in from the machine in ... to the system ... to use my credentials and reboot ...
  • B: Juan, he says to use his credentials and reboot by entering...
  • A: It doesn't work
  • B: Wait a minute, we'll set up a call

Fortunately, the crisis team manages to restore all the systems. The network is operational at 7 am, after 6 hours of intense work, and all that remains is for Juan to write the report of what happened and file it with the closure of the incident. Two important conclusions can be drawn from this recreation:

  1. The management system never recorded what had happened.
  2. Regardless of what Juan remembers of those 6 hours, the knowledge of what happened and of how to act stays with the people who were involved in the resolution.

This story is, of course, invented and exaggerated, but we had the opportunity to see the same thing happen for real in a project we tackled a few months ago.

Our main objective was to take the logs of the management tool and, through process mining techniques, discover the real map of how events are actually reported to it.

Beyond conclusions about how the system is used, we observed a pattern: the same type of incident was assigned to the same team for resolution, and the team rejected it every time. We then checked whether this behaviour was more widespread and concluded that it is indeed a common pattern.

We considered all kinds of hypotheses about what could be causing this behavior and kept the ones that seemed most solid, summarized below:
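A rough idea of how such a pattern can be surfaced from a tool's log is sketched below. It is not the process mining tooling used in the project: the log layout (incident id, incident type, team, event), the sample rows and the two thresholds are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical event log rows: (incident_id, incident_type, team, event).
events = [
    ("INC-1", "atm_offline", "network", "assigned"),
    ("INC-1", "atm_offline", "network", "rejected"),
    ("INC-1", "atm_offline", "atm_ops", "assigned"),
    ("INC-1", "atm_offline", "atm_ops", "resolved"),
    ("INC-2", "atm_offline", "network", "assigned"),
    ("INC-2", "atm_offline", "network", "rejected"),
    ("INC-2", "atm_offline", "atm_ops", "assigned"),
    ("INC-2", "atm_offline", "atm_ops", "resolved"),
    ("INC-3", "card_blocked", "cards", "assigned"),
    ("INC-3", "card_blocked", "cards", "resolved"),
]


def assignment_stats(log):
    """Count assignments and rejections per (incident_type, team) pair."""
    assigned, rejected = Counter(), Counter()
    for _incident_id, incident_type, team, event in log:
        if event == "assigned":
            assigned[(incident_type, team)] += 1
        elif event == "rejected":
            rejected[(incident_type, team)] += 1
    return {pair: (n, rejected[pair]) for pair, n in assigned.items()}


def recurring_misroutes(log, min_assignments=2, min_rejection_rate=0.8):
    """Return the pairs that are assigned repeatedly and almost always rejected."""
    return sorted(
        pair
        for pair, (n_assigned, n_rejected) in assignment_stats(log).items()
        if n_assigned >= min_assignments
        and n_rejected / n_assigned >= min_rejection_rate
    )


print(recurring_misroutes(events))
```

On the sample rows, the only pair flagged is the `atm_offline` type being sent to the `network` team, which is the kind of repeated assign-and-reject loop described above.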

  • First-level support teams have high turnover, so experience is lost.
  • There is no feedback: the incident is not rejected but redirected, so the person who first assigned it never gains that knowledge.
  • The system does not acquire knowledge either: it records the team that closed the incident, but the description of the cause and the resolution process are rarely found.

As the ITSMF talk showed very well, it is important to be able to store and, obviously, consult the history of what has happened, in order to learn and to establish protocols for anticipating and solving incidents.

However, the question is: can we go further and let the machine use the information that already exists in the system? To find out, we carried out a basic proof of concept aimed at helping first-level support in two areas:

  1. Give first-level support information on how incidents with similar input parameters were previously resolved, by proposing the optimal resolution route.
  2. Contribute in this way to the training of first-level teams.
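The first goal can be pictured as a lookup over closed incidents. The sketch below is a deliberately naive stand-in, not the proof of concept itself: it treats "similar" as an exact match on two parameters and "optimal" as the most frequent historical route, and the field names, team names and sample history are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical closed incidents: input parameters plus the route that resolved them.
history = [
    ({"category": "atm", "symptom": "offline"}, ("first_level", "atm_ops")),
    ({"category": "atm", "symptom": "offline"}, ("first_level", "network", "atm_ops")),
    ({"category": "atm", "symptom": "offline"}, ("first_level", "atm_ops")),
    ({"category": "cards", "symptom": "blocked"}, ("first_level", "cards")),
]


def build_route_index(records, keys=("category", "symptom")):
    """Group historical resolution routes by the chosen input parameters."""
    index = defaultdict(Counter)
    for params, route in records:
        index[tuple(params.get(k) for k in keys)][route] += 1
    return index


def suggest_route(index, params, keys=("category", "symptom")):
    """Return the most frequent past route for matching parameters, or None."""
    routes = index.get(tuple(params.get(k) for k in keys))
    if not routes:
        return None
    return routes.most_common(1)[0][0]


index = build_route_index(history)
print(suggest_route(index, {"category": "atm", "symptom": "offline"}))
print(suggest_route(index, {"category": "loans", "symptom": "timeout"}))
```

Returning `None` for unseen parameter combinations keeps the human in the loop for cases the history cannot speak to.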

We took a sample of 100,000 records, using 70,000 for training and 30,000 for testing. We then analyzed which input parameters most influence the resolution of an incident and ran the learning process with Decision Tree and Random Forest algorithms.
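For readers who want to see the shape of such an experiment, here is a minimal sketch using scikit-learn. The library choice is an assumption (the article does not say which tooling was used), and the data is synthetic: 1,000 made-up records with two integer-coded parameters stand in for the 100,000 real ones, keeping the same 70/30 split.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = random.Random(0)

# Synthetic stand-in for incident records: two integer-coded input parameters
# (say, category and channel) and the id of the team that resolved the incident.
X = [[rng.randrange(4), rng.randrange(3)] for _ in range(1000)]
y = [(category + channel) % 3 for category, channel in X]

# Same 70/30 proportion as in the article, on a much smaller sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(scores[name], 3))

# Relative influence of each input parameter, as estimated by the forest.
print(models["random_forest"].feature_importances_)
```

The `feature_importances_` attribute gives one way to look at which input parameters weigh most in the prediction. With real incident data, categorical fields would first need to be encoded as numbers.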

The proof of concept showed that, provided the information is actually being stored, the machines contribute an additional 15% improvement to this process. The lesson we draw is that, in this changing environment, machines and machine learning provide fundamental support for task management and decision making.

We cannot, and should not, dispense with the human component, since disambiguation remains a human task. Finally, the system itself can serve as a knowledge base and, in time, as an engine for training people.

