The levels of AIOps maturity - three KPIs your IT operations teams should live and breathe

By John Appleby, October 7, 2021
Summary:
The inherent nature of IT issue resolution is reactive - there has to be a better way. John Appleby of Avantra shares three KPIs that can take control of AIOps maturity levels - and why it matters.

AIOps KPI maturity concept (© Elf-Moondance - Pixabay)

In the past, IT operations teams worked on the principle of issue resolution. A business user would raise an issue in an IT Service Management system (ticketing system). It would be assigned to a team to triage and prioritize, and then passed on to a resolver group.

But there is an inherent problem with this approach – it relies upon interruptions. Often, an interruption might seem harmless; for example, a salesperson wants to update CRM with the latest forecast, but there is a problem, so they raise an incident.

In this situation, the interruption costs 30 minutes of lost time: switching between tasks, raising an incident, and checking for resolution. But even this benign example can have frustrating consequences, such as a delayed or inaccurate board report. Real-world examples can be far more severe. One multinational retailer failed to see a sequence of failures and its entire ERP system failed, leaving it unable to do merchandising or invoicing for several days.

In this blog, I will introduce you to three KPIs that can allow you to take control of IT Operations.

Percentage of issues created by a machine

The first issue that needs attention is detection. A machine can detect a large percentage of issues for several reasons:

First, machines can perform a very high frequency of checks. A human might check a condition once a day, compared to a bot that checks every five or 15 minutes. Most situations do not go from zero to critical in a few minutes, so increasing the frequency by 288x (daily vs. every five minutes) makes a considerable difference in pre-detection.
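The 288x figure is simple arithmetic, and it is worth seeing how it changes with the check interval. The snippet below is only an illustrative sketch; the function name is my own and not part of any monitoring product.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def checks_per_day(interval_minutes: int) -> int:
    """Number of checks a bot completes in a day at a fixed interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY // interval_minutes


# A human checking once a day is one check per 1,440 minutes.
human_checks = checks_per_day(MINUTES_PER_DAY)

for interval in (5, 15):
    bot_checks = checks_per_day(interval)
    print(f"every {interval} min: {bot_checks} checks/day, "
          f"{bot_checks // human_checks}x a daily check")
```

Even at the slower 15-minute interval, the bot still checks 96 times for every one daily human check.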

Second, a bot can apply algorithms against historic data. These can be simple rules (e.g., is a security setting correct?), forecast algorithms (e.g., in how many hours will the storage subsystem become full?), or machine learning or AI algorithms that compare past data against current data and predict failure.
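To make the forecast idea concrete, here is a minimal sketch of the "hours until storage is full" question. It assumes usage grows roughly linearly and fits a straight line through recent samples; real tools may use more sophisticated models, and the function name, parameters, and sample values are hypothetical.

```python
from typing import Optional, Sequence


def hours_until_full(
    sample_hours: Sequence[float],
    used_gb: Sequence[float],
    capacity_gb: float,
) -> Optional[float]:
    """Extrapolate a least-squares line through (hour, used_gb) samples.

    Returns the estimated hours from the last sample until capacity is
    reached, or None if usage is flat or shrinking.
    """
    n = len(sample_hours)
    if n < 2 or n != len(used_gb):
        raise ValueError("need at least two matching samples")

    mean_t = sum(sample_hours) / n
    mean_u = sum(used_gb) / n
    var_t = sum((t - mean_t) ** 2 for t in sample_hours)
    if var_t == 0:
        raise ValueError("samples must be taken at different times")

    # Slope is growth in GB per hour; intercept is the fitted usage at hour 0.
    slope = sum(
        (t - mean_t) * (u - mean_u) for t, u in zip(sample_hours, used_gb)
    ) / var_t
    if slope <= 0:
        return None  # not growing, so no predicted fill time

    intercept = mean_u - slope * mean_t
    hour_when_full = (capacity_gb - intercept) / slope
    return max(0.0, hour_when_full - sample_hours[-1])


# Hypothetical samples: hourly readings growing by 10 GB per hour.
remaining = hours_until_full([0, 1, 2, 3], [100, 110, 120, 130], capacity_gb=200)
print(f"Estimated hours until full: {remaining:.1f}")
```

A bot running this on every check can raise an issue while there are still hours of headroom, instead of waiting for the disk to fill.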

Lastly, most issues that cause an incident are compound issues. One situation occurs (a batch job overruns), compounded by a second (a backup starts) and a third (it is 9am and business users start to use the system). It is the combination of those three issues that causes a service interruption and the ensuing IT service incidents.
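The compound example above can be sketched as a set of individually harmless conditions that only raise an alert when they coincide. This is an illustration rather than how any particular product works; the snapshot fields, the business-hours window, and the threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]  # hypothetical point-in-time view of the system


@dataclass
class Condition:
    name: str
    is_active: Callable[[Snapshot], bool]


# Each condition is tolerable on its own; the field names are illustrative.
CONDITIONS: List[Condition] = [
    Condition(
        "batch_overrun",
        lambda s: s["batch_running"] and s["batch_minutes"] > s["batch_expected_minutes"],
    ),
    Condition("backup_running", lambda s: s["backup_running"]),
    Condition("business_hours", lambda s: 9 <= s["hour"] < 17),
]


def active_conditions(snapshot: Snapshot) -> List[str]:
    """Names of all conditions that currently hold."""
    return [c.name for c in CONDITIONS if c.is_active(snapshot)]


def compound_alert(snapshot: Snapshot, threshold: int = 3) -> bool:
    """Alert only when at least `threshold` conditions coincide."""
    return len(active_conditions(snapshot)) >= threshold


nine_am = {
    "batch_running": True,
    "batch_minutes": 200,
    "batch_expected_minutes": 120,
    "backup_running": True,
    "hour": 9,
}
print(active_conditions(nine_am), compound_alert(nine_am))
```

The same snapshot taken at 3am, with the batch overrunning and the backup active, would not trigger the alert, because no users are on the system yet.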

I frequently hear from IT operations professionals that they don't have "unplanned downtime." That term is typically reserved for a complete system outage, but many service issues are much more insidious than this and cause IT incidents that are never classified as "unplanned downtime." (For more on this in an ERP environment, see The hidden costs of SAP downtime).

A high-tech company experienced this in the area of benefits management: an issue with HCM middleware broke the synchronization of birth date updates between the employee portal and the insurance provider, which led to health insurance benefits being declined.

Percentage of issues resolved before users impacted

Some IT systems (manufacturing execution systems, for example) need to run 24/7, and an outage has an immediate revenue impact. Many others are used for only part of the working week, and an issue that causes an outage is not an issue if it is resolved before a user notices. If a computer system falls in the woods and no one hears it, did it fall?

A simple example is common in finance systems – a bank interface fails, which causes a finance batch process to fail, and this means the daily balances are incorrect at 7am. If this is understood, then the error can be corrected, the jobs can be rerun, and the daily balances corrected – all before the finance users sit down at their desks at 9am.

If, however, the problem goes unnoticed until the finance users run their first reports, the sequence is much slower: they notice discrepancies, do some analysis, and raise a service request, and only then does IT start to look into the problem. Hours of the working day can be lost.

Percentage of incidents resolved proactively

Now it's getting interesting. One of the significant issues pertaining to enterprise systems is the maintenance required to keep them in tip-top condition.

I recently asked a Fortune 100 business how often they proactively apply updates to enterprise systems. The answer? Twice yearly, unless the security team notifies them of a CVE (Common Vulnerabilities and Exposures).

In so many cases, no one knows about a patch or its impact, the bandwidth doesn't exist to do the work, or the policy is to do the bare minimum (often contracted to the lowest-bidder Managed Services Provider).

We need to hold ourselves to a higher standard in 2021 and think of these things as incidents. When we take this lens, we can then move to automate this work. We can also, of course, automate the resolution of issues identified with the detection mechanisms described above.
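Measuring the three KPIs does not need to be complicated once each incident record carries the right flags. The sketch below assumes a hypothetical incident record with three boolean fields; in practice these would have to be derived from your own ITSM data, and how you define each flag is a judgment call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Incident:
    # Hypothetical flags, not fields from any specific ticketing system.
    created_by_machine: bool           # raised by monitoring, not a user
    resolved_before_user_impact: bool  # fixed before any user noticed
    resolved_proactively: bool         # handled as planned or automated work


def percentage(incidents: List[Incident], flag: Callable[[Incident], bool]) -> float:
    """Share of incidents (0-100) for which `flag` is true."""
    if not incidents:
        return 0.0
    return 100.0 * sum(1 for i in incidents if flag(i)) / len(incidents)


def aiops_kpis(incidents: List[Incident]) -> Dict[str, float]:
    """Compute the three KPIs discussed in this article."""
    return {
        "created_by_machine_pct": percentage(incidents, lambda i: i.created_by_machine),
        "resolved_before_impact_pct": percentage(
            incidents, lambda i: i.resolved_before_user_impact
        ),
        "resolved_proactively_pct": percentage(incidents, lambda i: i.resolved_proactively),
    }


# Made-up sample data, purely to show the calculation.
sample = [
    Incident(True, True, True),
    Incident(True, True, False),
    Incident(True, False, False),
    Incident(False, False, False),
]
for name, value in aiops_kpis(sample).items():
    print(f"{name}: {value:.0f}%")
```

Tracking these three numbers month over month gives a simple view of whether operations are moving from reactive to proactive.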

Final words

If you are an IT Operations leader, you owe it to your organization to measure these KPIs. Why?

First, every hour spent on issue resolution and manual operations is an hour lost in project work and innovation. There are never enough hours in the day – before we even start to discuss ever-decreasing budgets.

Second, IT Operations leaders increasingly understand the importance of investing in the well-being of their teams. It is imperative to invest in technology so valued employees are not woken unnecessarily at 2am.

Lastly, our research shows that organizations that don't measure these KPIs score very poorly, and this is leading to attrition problems as employees look for somewhere they can do more meaningful work.