An inside look at how Monzo handles critical incidents and stays online
Chris Evans, Platform Lead at Monzo, explains how the digital bank uses PagerDuty to schedule its engineers in response to an incident. He also explains the bank’s ethos on tech deployment.
The success of Monzo bank in the UK has been undeniable. Starting out as a pre-paid card in 2015 with a beautifully designed mobile app to monitor your spending, Monzo soon grew in popularity. The company has raised £211m in funding, most recently adding £20m at the end of 2018, and is also soon set for a US launch. And hey, we all love those luminous cards, don’t we?
However, whilst user-centred design has enabled Monzo to take on some of the biggest financial institutions in the UK, its digital-first approach (i.e. no high-street presence) means that availability and uptime are core to its proposition. That means responding to critical incidents effectively is not a nice-to-have; it’s strategically critical. There have been numerous examples of traditional banks suffering from outages, leading to reputational and financial losses.
Being the challenger, Monzo can’t afford to align itself with such a dreadful customer experience.
I got the chance to sit down with Chris Evans, Platform Lead at Monzo, to discuss how the company handles its response to critical incidents - most notably using PagerDuty, a hotly tipped software company that handles scheduling for first line response teams.
Evans explained that Monzo has been using PagerDuty ever since the bank had customers. There was a sense within Monzo, since the early days, that it needed to do this better than anyone else. Evans said:
More so than traditional banks, we have to be online. There are no branches, there’s nowhere that customers can go and access their money if our systems are down. That is where PagerDuty comes in, it’s one of those key features of us being online. If there is an issue, it’s the thing that gets hold of the right people to make sure that we can fix those issues.
Schedule for response
Evans described PagerDuty as a “very powerful scheduling tool”, one that manages which of the 120 skilled Monzo engineers will be all hands on deck when something goes wrong. Interestingly, Monzo is also using the PagerDuty platform as an engine to write all of its incident management and incident response tooling. Evans said:
So, if there is an issue in customer operations, where a customer is shouting about bank transfers failing, for example, from within Slack they can trigger an incident. That will page an engineer, using the API to PagerDuty, and the team will assemble from there.
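As a rough illustration of the kind of API call Evans describes, the sketch below builds a "trigger" event for PagerDuty's public Events API v2. Monzo's actual Slack integration isn't detailed here, so the function name, the routing key and the summary text are all hypothetical placeholders, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Public endpoint of the PagerDuty Events API v2.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_request(routing_key, summary, source, severity="critical"):
    """Build (but do not send) an HTTP request that would trigger an incident.

    The function name and arguments are illustrative, not Monzo's tooling.
    """
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",   # open a new incident
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    routing_key="HYPOTHETICAL_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Customers reporting failed bank transfers",
    source="slack",
)
# Actually paging someone would mean sending it: urllib.request.urlopen(request)
```

In a setup like this, a Slack command handler would call the function with details typed in by the customer operations team, and PagerDuty would take over from there.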
Monzo operates a system whereby, for any given incident raised in Slack, PagerDuty automatically alerts two skilled engineers to solve the problem. There are always two assigned to any given problem, which has meant that 95% of cases are resolved by this first-line response. Evans said:
For us, those people on first line of call are some of the best engineers at Monzo. They are senior engineers, very highly skilled people.
However, if that pair isn’t able to fix whatever problem has occurred, an alert then goes out to a specialist team. Evans added:
If they can’t fix the incident, and they need to escalate out to other teams, something that we’ve introduced is this concept of specialists. From within Slack, we can, with one command, page anyone from the payments team, or the security team, or FinCrime, or specialist incident managers. It’s a small pool of people that will typically end up in an incident.
The logic within PagerDuty will find the person that’s on call, for that specific area, it will page them, and it will repeatedly page them. If they don’t answer within 15 minutes, it will escalate to the next person. It gives us as close to a guarantee as possible.
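The escalation rule Evans outlines can be sketched in a few lines. This is a simplified model rather than PagerDuty's implementation: the 15-minute timeout comes from his description, while the function name, the example chain and the choice to keep paging the last person once the chain runs out are assumptions made for illustration.

```python
ESCALATION_TIMEOUT_MINUTES = 15  # the figure quoted by Evans


def current_responder(escalation_chain, minutes_unacknowledged):
    """Return who should be paged after this many minutes with no answer."""
    if not escalation_chain:
        raise ValueError("escalation chain must not be empty")
    # Each full timeout that passes without an answer moves one step down the chain.
    level = int(minutes_unacknowledged // ESCALATION_TIMEOUT_MINUTES)
    # Assumption: once the chain is exhausted, keep paging the last person.
    return escalation_chain[min(level, len(escalation_chain) - 1)]


# Hypothetical on-call chain for one specialist area.
payments_chain = ["primary on-call", "secondary on-call", "team lead"]

print(current_responder(payments_chain, 5))
print(current_responder(payments_chain, 15))
print(current_responder(payments_chain, 40))
```

The three calls page the primary, the secondary and the team lead in turn, which is the behaviour behind the "as close to a guarantee as possible" remark: an unanswered page never simply stops.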
I asked Evans about Monzo’s response times using the PagerDuty platform. Interestingly, he said that the company doesn’t track resolution times because he doesn’t believe they are meaningful statistics, as “it’s rare that two incidents are ever the same”. However, Evans did recently check the average, or median, time taken to respond to an incident in PagerDuty, which was under one minute for every single caller. Not too bad at all…
Tech ethos at Monzo
I was interested that Monzo had chosen to build its own incident management and response tooling on top of the PagerDuty platform, instead of using a tool out-of-the-box, which software vendors so often promote as the gold standard. More vanilla, less custom code.
However, Monzo is building for scale, said Evans. And this means building technology for its needs. He explained:
Part of Monzo’s culture is around optimising for our specific use case. So, there are very, very few third party pieces of software that we use that we haven’t customised, or chosen to build ourselves. Notable exceptions to that are things like Jira, for example. But when it comes to our customer operations, tooling, and the way they interact with customers, it is all done with an internally built tool that we’ve super-optimised to be amazing for us.
I think it comes down to the Jeff Bezos quote of, “undifferentiated, heavy lifting”. And I think the difference for Monzo is ‘differentiated heavy lifting’. For example, the in-app chat for when you used to want to get in touch with customer services, that used to be powered by Intercom. They’re out of the box, a few lines of code, super reliable.
But we hit the limits of what we could do with that. So we basically built an entire in-app chat platform for our company. We have aspirations to go so big that we need to be able to optimise for that. There’s aspirations for every single customer operations person to support 10,000 customers. We will only get there by being super optimised for us.
However, that hasn’t made Evans complacent. As we know, part of the reason that the larger financial institutions are struggling to keep up with the threat of the challenger banks is that they’re often restrained by their legacy (well, that’s often the excuse). Evans hopes that Monzo can successfully navigate that in the future. He said:
Personally, as someone responsible for Monzo’s core platform, one of my biggest fears is waking up one day and being like ‘Kubernetes is this old archaic thing now’. I used to laugh at Barclays and NatWest, and the other banks, because they operate on mainframes. I think it helps we are mindful of that. There’s very little we are scared of technically, we have some very, very smart people.
I think if there’s something new that came out we would be comfortable pivoting over to it.