Once in a while one gets ‘double-teamed’ by a company willing to talk about its operations and the impact a piece of technology has had on its thinking and the way its business has developed. And so it was that I found myself sitting before Damien Perrem and Garvan Power of Allied Irish Banks (AIB).
Perrem is responsible for AIB’s payments technology platforms, while Power is the senior engineer for middleware web services and the bank’s main Splunk platform administrator. AIB has had Splunk as part of its IT operations since 2011, starting out as a log aggregation tool on network perimeter devices. These were defined by the pair as “things with gigabytes and gigabytes of logs that nobody could actually look at”.
That need to look was part of the security regime: checking for intrusion attempts and checking the firewall. There was also a need to meet compliance and regulatory requirements, which is commonplace in the banking industry. Perrem explains:
That’s a lot of the driver behind what we use Splunk for: compliance and regulation. It is a highly regulated environment that puts a lot of downward pressure on us to provide a reliable service. When SEPA (The Single Euro Payments Area) first came in back in 2010, and then the first round of PSD in 2014, it was an enormous change for us. And Ireland, like a lot of countries, had an internal domestic system, so all of these legacy systems had to be forced into new systems, with a regulatory deadline. So that was a big change for us.
AIB ended up doing about three years of change in about 18 months, and realised that the data it had wasn’t capable of keeping it informed about what was happening. For example, AIB could not be sure if every single payment had been processed unless someone specifically checked it. In practice, that became a big, manually intensive job.
The need to get better at this was the trigger to look at Splunk. The first step was to introduce business activity monitoring, giving the ability to track every payment from beginning to end and be sure every payment got to a valid end point. This led to a business process based on three principles: the integrity of the process for every payment; the performance of that process; and tracking trends in the volume run rate. This meant it could be spotted if, for example, one of the banks had not sent any payments in the last hour.
From there the applications for Splunk have multiplied. For example, it is used to look at the business data and map the servers, track a payment and correlate it across systems as it travels, and match that with information about the technology. This gave AIB a picture of the health of both the technology and the service provided to the customer, and a way to bring those things together.
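To make the integrity and run-rate ideas concrete, here is a minimal sketch in Python. It is not AIB's implementation and not a Splunk search: the event fields (`payment_id`, `bank`, `state`, `time`), the list of end states and the one-hour window are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical terminal states; the article does not list AIB's actual ones.
VALID_END_STATES = {"settled", "rejected", "returned"}


def unfinished_payments(events):
    """Return IDs of payments whose latest event is not a valid end state."""
    latest = {}
    for event in sorted(events, key=lambda e: e["time"]):
        latest[event["payment_id"]] = event["state"]
    return sorted(pid for pid, state in latest.items()
                  if state not in VALID_END_STATES)


def silent_banks(events, expected_banks, now, window=timedelta(hours=1)):
    """Return expected banks that sent no payment events within the window."""
    recent = Counter(e["bank"] for e in events
                     if now - window <= e["time"] <= now)
    return sorted(bank for bank in expected_banks if recent[bank] == 0)


# Illustrative data only.
now = datetime(2024, 1, 1, 12, 0)
events = [
    {"payment_id": "P1", "bank": "BankA", "state": "received",
     "time": now - timedelta(minutes=50)},
    {"payment_id": "P1", "bank": "BankA", "state": "settled",
     "time": now - timedelta(minutes=45)},
    {"payment_id": "P2", "bank": "BankA", "state": "received",
     "time": now - timedelta(minutes=30)},
    {"payment_id": "P3", "bank": "BankB", "state": "settled",
     "time": now - timedelta(hours=3)},
]

stuck = unfinished_payments(events)                    # P2 never reached an end state
quiet = silent_banks(events, ["BankA", "BankB"], now)  # BankB sent nothing this hour
```

The first function covers the integrity principle (every payment reaches a valid end point); the second covers the run-rate principle by flagging a bank that has gone quiet.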
From ‘something’s gone wrong’ to ‘what?’ and ‘how bad?’
AIB realised that it was possible to map a business process – such as logging into the mobile app and making a payment – as well as map the steps of that process. Staff could then look at the event data and all the technology used for that process, ask about the health of those pieces of technology, and present it on a classic ‘single pane of glass’. This meant it could see how one affected the other and, most importantly, if one of the pieces of technology went bad, it was possible to see whether it impacted the customer or not.
Managing this process is a serious problem for all banking businesses, as service delivery is now a key part of compliance regulations for the industry. There is now a requirement for banks to report changes in service delivery levels quickly, which puts a lot of pressure on having solid information about what’s going wrong. This now applies to reportable incidents with an impact on payments of just €5 million, which in banking terms is small change. Perrem notes:
So, if you’re going to stop yourself from bouncing off the limiters that they’ve set, which are incredibly prescriptive and very tight, then your ability to see problems and fix them fast is absolutely key.
AIB’s direction of travel sees this moving towards being a predictive maintenance environment that brings two specific benefits. One is the ability to turn a search into a KPI, which can be weighted into a score out of 100 and put into a Red/Amber/Green (RAG) display system. From there it is then possible to roll multiple RAGs up into a yet higher-level score. This allowed them to build an information system for the health of the service as a whole.
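The weighting and roll-up described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Splunk computes its scores: the KPI names, the weights and the Amber/Red thresholds (80 and 50) are all invented for the example.

```python
def health_score(kpis):
    """Weighted average of KPI scores.

    `kpis` maps a name to a (score, weight) pair, with each score on a
    0-100 scale. Weights do not need to sum to 1.
    """
    total_weight = sum(weight for _, weight in kpis.values())
    if total_weight <= 0:
        raise ValueError("at least one KPI needs a positive weight")
    return sum(score * weight for score, weight in kpis.values()) / total_weight


def rag(score, amber_below=80, red_below=50):
    """Map a 0-100 score to Red/Amber/Green. Thresholds are assumptions."""
    if score < red_below:
        return "Red"
    if score < amber_below:
        return "Amber"
    return "Green"


# Hypothetical KPIs for one service: (score out of 100, weight).
mobile_payments = {
    "integrity": (100, 3),
    "latency": (60, 2),
    "volume": (90, 1),
}
mobile_score = health_score(mobile_payments)   # (300 + 120 + 90) / 6 = 85.0

# Roll several service scores up into one higher-level score.
payments_overall = health_score({
    "mobile_payments": (mobile_score, 1),
    "batch_payments": (40, 1),                 # hypothetical second service
})                                             # (85 + 40) / 2 = 62.5
```

Here `rag(mobile_score)` gives "Green" while `rag(payments_overall)` gives "Amber", showing how a healthy service can sit under a higher-level score that still needs attention.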
The other advantage is the fact that this can then work with the Machine Learning tool Splunk now provides. This is allowing services to be built that can learn the data patterns of the AIB payment services and train systems to do active anomaly detection. Power explains:
We can tell that this data doesn’t look right. It’s not so bad that the customer is getting creamed, but it’s different enough, statistically, that it will flag it up. You might not run into a problem until you get to two seconds delay, but you want to catch it before you do; say when it has slipped from 20 ms to 200 ms and the trajectory is upward. It has allowed us to go from real-time reactive to proactive and predictive. We can predict that there is going to be an outage and you can see that there’s a problem or a statistical change in a metric that could be an indicator of an impending wider issue.
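The idea of flagging a statistical change well before a hard limit is reached can be illustrated with a simple z-score check in Python. This is a deliberately basic stand-in, not the algorithm Splunk's machine learning tooling uses; the baseline values, the threshold of three standard deviations and the two-second hard limit are assumptions taken loosely from the figures in the quote.

```python
from statistics import mean, stdev

HARD_LIMIT_MS = 2000  # assumed point at which customers notice a problem


def is_anomalous(baseline, value, z_threshold=3.0):
    """Flag `value` if it lies more than `z_threshold` standard
    deviations away from the mean of the `baseline` samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Illustrative latency samples in milliseconds, centred on about 20 ms.
baseline_ms = [18, 20, 22, 19, 21, 20, 20, 21, 19, 20]

normal_reading = is_anomalous(baseline_ms, 21)     # within normal variation
drifted_reading = is_anomalous(baseline_ms, 200)   # far outside the baseline
breached_limit = 200 >= HARD_LIMIT_MS              # still well under 2 seconds
```

A reading of 200 ms is flagged as anomalous even though it is an order of magnitude below the assumed hard limit, which is the gap between reactive and predictive alerting that Power describes.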
Then comes orchestration
The next stage the pair are looking towards is developing and expanding the bank’s capabilities towards orchestrating remedial actions when customer experience looks to be threatened. This will mean having to work with far more data, while at the same time developing Splunk as the basis of an analytics engine. The goal would be to build the KPIs necessary to view the performance and health of the whole system and, bottom line, that means being able to work with ever more data. Power notes:
It’s difficult to get into orchestration without the insight that tells you that this is what you need to do. Obviously, you can orchestrate more if you have more clarity on what’s gone wrong, so there’s a depth of data there that will help us on that journey. Having the real time view, the automated alerts and predictive alerts is great but ultimately it’s that orchestration where, when this alert happens, we know exactly what to do.
One area of orchestration they are looking at is using regular simple searches and monitoring to suggest a problem is developing. At this point the orchestration could trigger one or more deeper, more invasive search and monitoring processes. At the very least this should provide rich diagnostic data for remediation teams, even if it cannot go so far as to trigger automated remediation.
This is, of course, meat and drink to Splunk: as the company’s CEO Doug Merritt often remarks, Splunk is about dealing with the dirty raw data and humans shouldn’t really be allowed near the stuff. Its usefulness to the banks is that it helps them answer the key questions: what is the right data to collect and how do they interpret it? Some of it will be very standardised data, with metrics that indicate service health. But they are often monitoring 500+ different metrics, and the trick is to determine which are the 20 that really matter.
This is a stage that many companies inevitably end up facing, and the pair acknowledge that it is likely to go on forever, rather than be a definable six-month or one-year project. Perrem concludes:
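That escalation pattern, a cheap check that triggers heavier diagnostics only when needed, might look something like the following Python sketch. The function and check names are hypothetical, and the checks are stubbed out with fixed values; in practice each would be a real search or monitoring call.

```python
def orchestrate(light_check, deep_checks):
    """Run a cheap check; if it suggests a problem, run the deeper
    checks and return their results keyed by name.

    Returns None when the light check sees nothing wrong, so the
    expensive checks are never executed in the normal case.
    """
    if not light_check():
        return None
    return {name: check() for name, check in deep_checks.items()}


# Hypothetical stand-ins for real searches; values are illustrative only.
def latency_looks_wrong():
    return True  # pretend the regular lightweight search flagged a drift


deep_checks = {
    "queue_depth": lambda: 1250,
    "db_slow_queries": lambda: ["payments_lookup"],
}

diagnostics = orchestrate(latency_looks_wrong, deep_checks)
quiet_result = orchestrate(lambda: False, deep_checks)
```

The returned dictionary is the kind of diagnostic bundle that could be handed to a remediation team, or later fed into an automated response.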
We’d like more platforms, more data, to get more insight. To be honest, we actually don’t think it’ll ever stop. Because it is in the nature of technology now, in terms of things changing so fast.