Gousto finds recipe for system maintenance success with PagerDuty
Summary:
Recipe delivery service Gousto describes itself as a technology and data company first. PagerDuty is helping it make sure dinner is on the table for its customers.
Whilst Gousto is well known for its bright red boxes and its sustainable food delivery service - where all the ingredients are measured out precisely, to reduce waste - it describes itself first and foremost as a technology and data company. Its broad range of systems, from fulfilment in its picking facilities, right through to its customer facing e-commerce platform, underpin every aspect of making sure food lands on the table at dinner time for households across the UK.
Gousto has also scaled significantly in the past 12 months, as households have turned to alternative online food delivery services during the ongoing COVID-19 pandemic. As people avoided shopping in physical supermarkets in the months that followed the first UK lockdown in March 2020, Gousto saw a 10x surge in traffic to its website. The company has gone from delivering 2.5 million meals a month to 6 million meals a month over the course of 2020.
Supporting this kind of growth requires a laser focus on ensuring systems - right from supply chain through to e-commerce - are running smoothly. Gousto also runs a highly optimised DevOps environment, where it is pushing up to 50 releases into production a day, and its developers need to understand clearly what is working and what isn't.
With this in mind, Gousto turned to SaaS incident response tool PagerDuty in the early months of 2020 to boost transparency of its operations, improve accountability amongst its developers, and ensure that downtime is minimised across its operations.
Prior to the introduction of PagerDuty, Gousto's development team already had a strong culture of ownership for system changes and incidents, but it relied too heavily on manual processes that were proving burdensome. PagerDuty, which automates both the identification of issues and the response to alerts, has allowed Gousto to focus on value-add service delivery.
We got the chance to speak with Gousto Chief Technology Officer, Shaun Pearce, about the significance of PagerDuty for the company's operations and, more broadly, about how Gousto is thinking about technology and data for food delivery. Pearce said:
Technology has to be up, it's a vital part of our business, both in terms of fulfilling orders and interacting with customers. But also a lot of the technology we build is about fulfilling those boxes and fulfilling the promise we make to customers to deliver that box. So uptime is hugely important.
It's also worth talking about the cost of failure. We don't sell shoes, we sell dinner times. And if you order a pair of shoes off the internet and they arrive a day later, it's a bit annoying, but you can live with it.
But if you've ordered dinner time for your family off the internet and the box doesn't arrive, it's a hugely painful experience for customers. So our need to get it right every time is really important. So that's why we are so focused on the uptime and the operational excellence of our technology.
Scaling DevOps
In recent years Gousto has scaled its technology and data team significantly, and it now has over 200 people. When you start to scale in this way, you begin looking at your systems and how to improve them. Pearce said that Gousto has always had a mantra of 'you design it, you build it, you run it' - meaning that there was already a strong culture of ownership amongst the development team. However, at 200 people, the manual, organic processes that had been used previously quickly became ineffective.
And when you're running an end-to-end e-commerce and supply chain platform, there are a lot of places where things can go wrong. Historically, Gousto relied on its teams to manage alarms from its infrastructure and made use of on-call rotas, where the engineer responsible had to make sure that they escalated the incident correctly, triggering the right communication by email, Slack and other channels.
PagerDuty has automated much of this and has given Gousto a platform to remove some of that overhead for its developers, allowing them to focus on the technology itself, rather than worrying about managing a process. Pearce explained:
The teams would do a fantastic job on the whole, given what they were working with at the time. But what PagerDuty has done has given us that transparency of what we're managing, who's responsible, clear lines of responsibility. These are all vital as you scale.
PagerDuty gives us this incredibly transparent and structured system to understand the health of our system, understand where we are seeing anomalies, and then having a very clear path of ownership.
So, who is owning any kind of anomaly? Who is doing the investigation? Where are we? What have we found? For me it has created this incredible amount of transparency. We are able to pick up on issues quickly, communicate what we do with those issues to the team, escalate them as necessary, and essentially close down problems - or potential problems - as they come up.
Benefits and learnings
Pearce said that the implementation and rollout of PagerDuty was relatively simple, as it is a SaaS product that could integrate with the company's single sign-on solution. The company has also benefited from running its entire infrastructure on AWS, Pearce added, where it uses CloudWatch as its primary monitoring software. This meant that Gousto already had very standardised ways of deploying its services into production and very standardised approaches to setting alarms, configuring trigger points and understanding anomalies.
What PagerDuty has allowed Gousto to do is bring the data out of those systems and make it easy to manage for developers. Pearce said:
All of that underlying intelligence was in those systems, it was just very hard to manage. And what PagerDuty allowed us to do is essentially plug that layer on top. And then it was simply a case of taking the rotas and some of the things that we used to manage manually and just representing those in the right escalation paths, the right teams, the right service mapping in PagerDuty. So there is an element of configuration there, but we were up and running within a few days.
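The mapping Pearce describes, from manual rotas to services, teams and escalation paths, can be pictured with a small sketch. This is only an illustration under assumed names: `EscalationLevel`, `Service` and `who_to_notify` are hypothetical and are neither PagerDuty's API nor Gousto's configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationLevel:
    responders: List[str]   # people or rotas notified at this level
    timeout_minutes: int    # wait this long for an acknowledgement before escalating


@dataclass
class Service:
    name: str                      # hypothetical service name
    team: str                      # owning team
    levels: List[EscalationLevel]  # ordered escalation path


def who_to_notify(service: Service, minutes_unacknowledged: int) -> List[str]:
    """Return the responders for the level reached after the given wait."""
    if not service.levels:
        raise ValueError("service has no escalation levels")
    elapsed = 0
    for level in service.levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responders
    # Past every timeout: stay on the final level.
    return service.levels[-1].responders


# Illustrative data only, not taken from the article.
checkout = Service(
    name="checkout",
    team="e-commerce",
    levels=[
        EscalationLevel(["on-call engineer"], 15),
        EscalationLevel(["team lead"], 30),
    ],
)
```

With this data, an alert unacknowledged for 5 minutes stays with the on-call engineer, while one unacknowledged for 20 minutes moves to the team lead.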
Gousto is measuring the success of its systems operations and maintenance in a number of ways. The teams regularly assess metrics such as mean time to resolve and mean time to identify, which PagerDuty has made easier to track. Pearce added:
It's allowed us to have a much more granular view of what the stability of the systems is across our teams, which of our teams needs support, etc. And then all of that comes up to service uptime. We measure core journeys on our website and on our apps - we're constantly measuring those in production. And if we ever had any issues then we are alerted to those essentially through outside-in monitoring. So, the overall service of the customer.
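For readers unfamiliar with the two metrics mentioned above, the following is a minimal sketch of how mean time to identify and mean time to resolve might be calculated from incident timestamps. The `Incident` record and its field names are assumptions made for illustration, and definitions vary between organisations (some measure resolution time from detection rather than from the start of the incident); this is not how PagerDuty or Gousto necessarily computes them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime     # when the problem began (assumed to be known)
    identified: datetime  # when the problem was detected
    resolved: datetime    # when the problem was fixed


def _mean(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_identify(incidents: List[Incident]) -> timedelta:
    return _mean([i.identified - i.started for i in incidents])


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    # Measured here from the start of the incident; an assumption.
    return _mean([i.resolved - i.started for i in incidents])
```

Grouping incidents by owning team before averaging would give the kind of per-team view of stability that Pearce describes.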
However, the main learning for Gousto, as it thinks about its technology performance and system uptime, is that the project with PagerDuty has been successful because a lot of foundational elements were already in place. The company had thought about monitoring and processes before looking for a tool to automate them. Pearce believes that this is key. He said:
You've got to get the foundations right in your team first. We were quite lucky to do that quite early on, so we had the right architecture which allows us to separate our concerns and have teams that are fully accountable for components in production. And we standardised a lot of our monitoring infrastructure and our ways of working there. We'd also standardised some of our processes as well.
What that meant is that the implementation of PagerDuty was relatively simple and we could adopt it fairly quickly. I've worked in teams where some of those foundational elements are only being looked at once you're at 100 people, 150 people, and that's where it's difficult because you've got a huge amount of legacy, a huge amount of tech debt, a huge amount of process debt, a huge amount of culture debt, which you've almost got to fix to be able to leverage tools like PagerDuty well. You can mature very quickly if you've got some of those foundational things in place.