They can, as Paul Adams, the Splunk Data Science Lead at John Lewis, also discovered, present some unusual problems to sort out, not least because they do not involve things going wrong. Things that go wrong can be relatively easy to find. More of a problem is the combination of collaborating services where each is working as it should but, occasionally, all those 'rights' conspire to make a collective 'wrong'.
This is what Adams found himself facing, not long after getting involved with John Lewis. Splunk got its entry into the John Lewis organisation as part of the latter’s EPIC programme, when the company re-platformed its ecommerce operation onto ATG back in 2012. Adams was part of that team, as an outside consultant.
At the time Splunk was brought in on a relatively small use case, as a tool to help validate what was actually happening on the website, which even then was generating some £1bn in turnover. The programme also followed on from BA’s Terminal 5 'issues', so was seen as the next big risk in major IT systems implementation. There was therefore a certain amount of fear about the outcome, and it was recognised that whatever operational tuning they could get should help to ensure success.
Discovering Splunk did Biz/Ops
The fundamental idea behind all of this effort, of course, was Splunk’s stock in trade as an analyser of machine logs, namely identifying where and when something is going wrong, so that the machine and the application causing the problem can be quickly addressed with remedial action. That is also the stock in trade for Splunk’s two main market sectors: operations management and security services.
But when it comes to business management, the Biz/Ops side of the coin, there can now be other things to identify and rectify, such as when things are working perfectly... but not perfectly enough. How do you find that, or even know that there is a 'failing' that needs to be found and corrected?
For example, in trying to extend the value of Splunk to the company, the whole customer purchasing chain of the John Lewis website, in other words the stages of customer commitment, was examined. Users rarely have much commitment when they arrive on the website, and they probably won’t convert to being a customer. Then they put something in the basket, which means they are more committed, and then they enter the checkout, where they enter their address and delivery details. Only then comes the payment stage. Adams notes:
You lose people from the funnel, all the way down, but at the payment stage you really are at a very high conversion rate. Therefore, if you think about the cost of losing that customer, it is much higher. And in 2013 we had just such an issue; we were losing customers at that very crucial payment stage. It was only one percent of the total so we thought 'why bother', but if you look at the maths, it amounted to hundreds of thousands of dollars of lost business.
The payment step, which was going wrong, was actually hosted by a third party, and the issue was being played out in the browser, with nothing reported to the server. So trying to identify it was going to be a real test. It was a big data problem, which is the classic use case, but there was no data and there were no errors. Adams says:
We’re being told that a director's wife is trying to buy a handbag, and she’d got this issue. The transaction didn’t progress, and that is when the light bulb came on. Maybe we shouldn’t be looking for the presence of errors, but instead the absence of success.
So they set about slicing and dicing the data of everyone who had been through checkout, looking for those factors that increased the propensity to drop out. It turned out there were five factors.
- A certain release of the website, after which a new format came into place.
- A certain type of device.
- A certain type of OS.
- A certain browser version.
- A certain type of credit card.
It seems like it’s a bit fictional, but when we analysed it we found a very, very small population of people that have a 0% success rate.
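The 'absence of success' idea described above can be sketched in a few lines of Python. This is only an illustration, not the Splunk search the team actually ran: the field names (`release`, `device`, `os`, `browser`, `card_type`, `paid`), the `make_session` helper, the sample values and the `min_attempts` cut-off are all assumptions made up for the example. It groups checkout sessions by the combination of the five factors and reports any combination that has enough payment attempts but no successes.

```python
from collections import defaultdict

# Hypothetical factor names; the article does not name the actual values involved.
FACTORS = ("release", "device", "os", "browser", "card_type")


def success_by_segment(sessions, factors=FACTORS):
    """Return {segment: (attempts, successes)}, keyed by the tuple of factor values."""
    counts = defaultdict(lambda: [0, 0])
    for session in sessions:
        key = tuple(session[f] for f in factors)
        counts[key][0] += 1                       # every session is a payment attempt
        counts[key][1] += 1 if session["paid"] else 0
    return {key: (attempts, ok) for key, (attempts, ok) in counts.items()}


def dead_segments(sessions, min_attempts=3, factors=FACTORS):
    """Segments with at least min_attempts attempts and a 0% success rate."""
    stats = success_by_segment(sessions, factors)
    return sorted(
        key for key, (attempts, ok) in stats.items()
        if attempts >= min_attempts and ok == 0
    )


def make_session(release, device, os_name, browser, card_type, paid):
    """Build one made-up checkout record for the demonstration below."""
    return {"release": release, "device": device, "os": os_name,
            "browser": browser, "card_type": card_type, "paid": paid}


# Entirely invented sample data: one segment never succeeds.
sessions = (
    [make_session("r1", "phone", "osA", "b1", "cardX", True)] * 5
    + [make_session("r2", "tablet", "osB", "b7", "cardY", False)] * 4
    + [make_session("r2", "tablet", "osB", "b7", "cardX", True)] * 6
    + [make_session("r2", "desktop", "osC", "b3", "cardY", True)] * 5
    + [make_session("r2", "desktop", "osC", "b3", "cardY", False)]
)

print(dead_segments(sessions))
```

Note that no error log is consulted anywhere: the only inputs are who reached payment and who completed it, which is why the approach works when the failure happens silently in the browser. The `min_attempts` threshold is there so that a single unlucky visitor is not mistaken for a broken combination.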
And once they had a reproducible issue they could take it offline, set it up in a lab environment and then set about debugging it. This became something of an article of faith for Adams and his team because they were unearthing an important business issue. Here was a business process that was working correctly yet, through a random collusion of occasionally occurring circumstances, was preventing something from working right.
And for the team, Splunk provided a way of doing it, even with no data. It helped them establish that what everyone thought was not possible was actually happening, at least in a lab setting.
So it was decided that the team should work away at the page until they got it to work. That way they would be handing over an explanation of exactly why it didn’t work, what went wrong, and a solution. That is what Adams and his team needed at that stage. Adams recalls:
Although there was that initial article of faith in Splunk, we needed to show real value and put money on it, and that’s what started off the checkout flow. It’s a fallacy to be checking for errors, because there might not be any, or there might be 10 errors for nothing: just unable to look up your postcode, quick fill in the address, something stupid. Once you realise the relationship between errors and strife is a false friend, if you like, then you can say, 'let's look up outcomes in an unbiased way', and that’s where the checkout flow came in.
The new problem for many users
According to Adams, if a team can trend every single one of those pathways, nothing can hide. Whether it is an operational issue on-prem or in a SaaS service, or even the business being attacked, the activity will show up as what he refers to as a bulge in the sparkline. This also occurred at the very time the company was looking to change the checkout process, so there was a lot of interest the day the story broke internally, because here was a security and ops management tool now being used to provide business intelligence and management capabilities in ways that other business tools could not match.
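As a rough illustration of what trending every pathway and watching for a 'bulge' could look like, the sketch below compares the latest count on each pathway with that pathway's own recent history. It is an assumption-laden example rather than anything John Lewis or Splunk is known to use: the pathway names, the counts, the `flag_bulges` function and the three-standard-deviation threshold are all invented for the purpose.

```python
from statistics import mean, stdev


def flag_bulges(series_by_path, threshold=3.0, min_history=5):
    """Flag pathways whose latest count deviates sharply from their own history.

    series_by_path maps a pathway name to a list of counts per time bucket,
    oldest first. The last value is compared with the mean and sample
    standard deviation of the earlier values.
    """
    flagged = {}
    for path, counts in series_by_path.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < min_history:
            continue  # not enough history to judge this pathway
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is a deviation.
            if latest != mu:
                flagged[path] = float("inf") if latest > mu else float("-inf")
            continue
        z = (latest - mu) / sigma
        if abs(z) >= threshold:
            flagged[path] = z
    return flagged


# Invented counts per time bucket for three hypothetical pathways.
trends = {
    "basket -> checkout": [120, 118, 125, 122, 119, 121, 123],
    "checkout -> payment": [80, 82, 79, 81, 80, 83, 81],
    "payment -> abandoned": [4, 5, 3, 4, 5, 4, 19],
}

for path, score in flag_bulges(trends).items():
    print(f"{path}: {score:.1f} standard deviations from its usual level")
```

Because each pathway is judged against its own baseline, a small but unusual rise on a low-volume path stands out just as clearly as a change on a busy one, and a sudden drop is flagged in the same way as a rise.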
The question then became whether it was paying its way as a piece of business intelligence and, as a piece of operations intelligence, what it was costing. In Adams’ view, the whole thing just rolled up. What is more, he acknowledges they could still go further. For example, more refining of the analysis could lead to identification of not just the brand and type of credit or debit card, but the suspect number sequence range as well. Adams says:
You never know how many factors there are, and the more factors there are the more ingredients there are to reproduce, the more likely someone is going to say, 'It’s not reproducible'. But if I could Splunk the data, behind the people who failed and who’ve had the same error as me, I would go to that data and say: 'right, here’s the outcome, I found a log message for that and now I’m going to look for people with certain traits', because I know there’s enough users to do every permutation possible. And that’s why the analytical experimentation is more powerful than real world experimentation, you try to source all the patterns.
And working analytically, he suggests, it becomes possible to give a very firm answer, which means the issue can be fixed and the company can have real confidence that it is fixed.
Further out, Adams can see this approach even playing a role in ending the never-ending cycles of 'the blame game', where companies collaborating on the delivery of a service project readily point fingers at all the other collaborators. The evidence of what is happening, what is going wrong or inadvertently not going right, will be clearly available. And using this problem as an example, it should eventually be possible for a retail company to demonstrate to a credit card service provider that a card type or a numbering sequence is causing a problem which results in loss of business all round. Adams affirms:
We can do some of that now. The best example is with the browser. When users report that the website has gone down we are now able to detect the source of the browser – that the customer is, say, on BT internet or a Virgin Media connection. It is recorded through the log files and we can see that we’re getting no customers from BT or Virgin Media, but those users think the website is down…
So instead of just laying the blame somewhere else, it becomes possible for the collaborators to collaborate on resolving problems quickly. In an increasingly complex world, where digitalisation touches a growing share of all business and industrial processes, there is now every chance that a problem will be caused by the interaction of two or more systems doing their 'thing' right, and as planned. Adams concludes:
This gives us the power to go back to third party services with the evidence that their services are not up to scratch and we need them to improve. We now do that regularly with the data that we have. It does mean the end to unsubstantiated blame, leaving everyone working together on common purpose. It has changed totally.
This is an interesting example of the impact of digitalisation, where at one level the collaboration it makes possible can unravel around unforeseen problems of different applications working well in their own right but not working well as a holistic service entity. But then again, putting some effort into unearthing innovative solutions can find new ways of making digitalisation, and the collaborators using it, real team players.