Universal Credit (UC) was designed as a means-tested benefits system to replace six existing benefits schemes for those of working age. The idea behind the Department for Work and Pensions (DWP) policy - which has been controversial - was to simplify support for citizens and encourage those on benefits to get back to work.
Between 2010 and 2012 Universal Credit was consistently in the news for its poor approach to technology development and implementation. The programme was initially developed by a handful of suppliers - including Accenture, IBM, HP and BT - but was later ‘reset’ after it was found that the technology was not in tune with the policy requirements.
DWP then started developing a digital version in-house, known as the ‘full service’. This was led by a new ‘DWP Digital’ team that aimed to focus on user-centred design, agile methodologies and multi-disciplinary development.
DWP Digital has since written 50 million lines of code, works on 10,000 changes to IT systems per year, maintains 1,000 applications and exchanges 10 million data records per day. The team has been leading the rollout of Universal Credit nationally for the past few years, and once at full scale will be serving 7 million people and paying out £67 billion annually.
We got the chance to speak with Tom Padgham, Deputy Director of Engineering at DWP, and Patrick Downey, Head of Platforms at DWP, this week as part of MongoDB's .Live Northern European event. DWP Digital is using MongoDB as its core database for Universal Credit.
The comments from Padgham and Downey in this story are taken both from their presentation at the event and from our conversation together.
As noted above, DWP Digital started on the Universal Credit project back in 2013 and the team was determined to do things differently. It adopted agile ways of working in collaboration with the Government Digital Service, took a user-centred approach to design, focused on building minimum viable products, and sought to iterate quickly.
By 2015 there was a controlled rollout of the benefits system to job centres and in 2017, during the larger national rollout, the team migrated from a government-approved datacenter to AWS. By 2018 the national rollout was complete, albeit only for new applicants and for people on the previous benefits whose circumstances had changed, who were then migrated to Universal Credit. There are still 6 million people on the previous benefits to be migrated.
Universal Credit is based on a variety of "mature" technologies, according to Padgham, including Java microservices, MongoDB, Kafka, Jenkins, Git, Bazel, Serenity, Gatling, AWS, Terraform, Vault, Ansible and Puppet.
However, by mid to late March of this year, DWP was facing an unexpected and unprecedented test of its systems and infrastructure, when the realities of the COVID-19 pandemic hit the UK. Padgham said:
Were DWP expecting a pandemic? Well, no, like most other people we weren't. We do have good predictions about future traffic from our business under normal circumstances. This helps us plan ahead and make sure our service is ready for that traffic. A few years ago we were asked to test how Universal Credit would handle a massive spike due to a news story. We didn't take this too seriously because of the type of service Universal Credit is.
We do however do proactive performance testing based on those business predictions and we test six months ahead of ourselves. This gives us time to resolve any performance issues that might arise. This type of testing and the way we have built our platform has stood us in good stead for the events of 2020.
Downey added that the infrastructure itself was designed in a way to help cope with demand, particularly with the use of AWS, microservices and MongoDB. He said:
The very fact that we have got the number of microservices that we have, gives us a lot of levers about where we can increase resources as necessary. We don't need to make everything bigger, we can fairly gradually target which pieces of the architecture need a bit more help.
Downey said that Universal Credit currently uses 8 MongoDB clusters, most of which consist of 5 nodes, spread across 3 AWS availability zones. The busiest of these clusters currently handles around 15,000 requests per second. DWP Digital stores 8.5 billion unique data objects in each of these clusters, with 110 TB of uncompressed MongoDB data - which gives you an idea of the scale of the operation.
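To illustrate why spreading a 5-node cluster across 3 availability zones matters, here is a minimal sketch of the quorum arithmetic. A MongoDB replica set needs a majority of its voting members to elect a primary, so the question is whether a majority survives the loss of any one zone. The 2-2-1 split and the zone names are assumptions for illustration only; the article does not say how DWP Digital distributes its nodes.

```python
def survives_zone_loss(nodes_per_zone):
    """For each zone, report whether a voting majority remains if that zone is lost."""
    total = sum(nodes_per_zone.values())
    majority = total // 2 + 1  # votes needed to elect a primary
    return {
        zone: (total - count) >= majority
        for zone, count in nodes_per_zone.items()
    }


# Assumed layout: 5 nodes split 2-2-1 across 3 zones (hypothetical zone names).
three_zone_layout = {"az-a": 2, "az-b": 2, "az-c": 1}
three_zone_result = survives_zone_loss(three_zone_layout)

# Contrast: the same 5 nodes squeezed into only 2 zones.
two_zone_result = survives_zone_loss({"az-a": 3, "az-b": 2})

print(three_zone_result)  # every zone can be lost without losing the majority
print(two_zone_result)    # losing the 3-node zone leaves only 2 of 5 votes
```

Under these assumptions, the three-zone layout keeps a majority whichever zone fails, while the two-zone layout cannot survive the loss of its larger zone.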
When COVID-19 hit, particularly after the Prime Minister's lockdown announcement and the subsequent financial support packages that were unveiled, the Universal Credit system and the DWP team faced a nail-biting test of whether they could hold up against the huge spikes in demand from claimants. Padgham said:
We had rolled out really carefully in the early days to make sure that we landed the service well. And then from late 2018 through to 2020 we were on a fairly standard rollout up to around 2 million people. This was all looking pretty normal in early March 2020 and our six month performance testing was looking pretty good. At that time we were experiencing around 4% average month on month growth and we would typically have around 100,000 successful claims to UC every fortnight.
COVID-19 really took hold in the UK in the middle of March 2020. That kind of massive increase in traffic and claims was totally unexpected and within a few months it took us to over 5 million active claims. We were experiencing a 40% increase in people on Universal Credit up to April 2020. A tenfold increase. And in one particular fortnight we had around 950,000 claims.
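The figures Padgham quotes can be put side by side with a little arithmetic. The sketch below assumes that the 4% monthly growth compounds and that testing "six months ahead" means testing against that projected load; the article does not describe the actual test model, so treat this as an illustration rather than DWP's method.

```python
def compound_growth(monthly_rate, months):
    """Growth factor after compounding a monthly rate for a number of months."""
    return (1 + monthly_rate) ** months


# Load that a six-months-ahead test would target at 4% month-on-month growth.
six_month_factor = compound_growth(0.04, 6)

# Peak fortnight (about 950,000 claims) against a typical one (about 100,000).
peak_fortnight_ratio = 950_000 / 100_000

# The 40% increase set against the usual 4% monthly growth: the "tenfold" figure.
growth_rate_ratio = 40 / 4

print(f"Six-month projected load: {six_month_factor:.2f}x")
print(f"Peak fortnight versus typical: {peak_fortnight_ratio:.1f}x")
print(f"Growth rate versus normal: {growth_rate_ratio:.0f}x")
```

On these assumptions, planning six months ahead covers roughly a 1.27x increase in load, whereas the peak fortnight brought about 9.5 times the usual number of claims.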
Downey explained that at certain peaks there were 2.5 times the normal number of claims per second, and that level of traffic didn't die down until the beginning of May. He added:
We were sitting on the edge of our seat about how the platform would perform and whether it would stand up to demand. We were actually in a pretty good state because of the six month ahead performance testing that we do. So we knew that our site could withstand much more traffic than it currently receives. But we weren't so sure that it would handle this much extra traffic.
How did we respond from an operations perspective? We increased our application capacity and added more web servers and more application servers. We increased our database capacity - in the first and second week we increased our compute and memory for two of our Mongo clusters. And over the coming months we would move one of our large Mongo databases into a cluster of its own in order to provide us with a bit more capacity. We also increased the oplog size for the busiest cluster so that we would consistently have a day or two of oplog, so that responding to a recovery event wouldn't be so bad.
We benefited from using AWS and being able to expand our capacity on demand. We also benefited from a lot of the government announcements happening late in the day when our usual daytime traffic had fallen away. This actually meant that more capacity was available to deal with access requests. Fortunately we never had to taper or control the amount of traffic coming in, so that was good for us.
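The oplog change Downey describes can be sketched as a sizing calculation. The oplog is a capped collection, so the window of history it holds depends on its size and on how fast writes fill it. The example below estimates the size needed to retain a target window and builds the document for MongoDB's `replSetResizeOplog` admin command, which takes a size in megabytes. The write rate and safety factor are hypothetical values chosen for illustration; the article gives no such figures.

```python
import math

# Smallest oplog size, in megabytes, that replSetResizeOplog accepts.
MIN_OPLOG_MB = 990


def required_oplog_mb(oplog_mb_per_hour, retention_hours, safety_factor=1.5):
    """Estimate the oplog size (MB) needed to hold `retention_hours` of writes."""
    needed = oplog_mb_per_hour * retention_hours * safety_factor
    return float(max(MIN_OPLOG_MB, math.ceil(needed)))


def resize_command(size_mb):
    """Build the admin command document; it would be run on each replica set member."""
    return {"replSetResizeOplog": 1, "size": size_mb}


# Hypothetical: 2,000 MB of oplog written per hour, aiming for two days of history.
hypothetical_rate_mb_per_hour = 2_000
size = required_oplog_mb(hypothetical_rate_mb_per_hour, retention_hours=48)
command = resize_command(size)

print(command)
```

The safety factor is there because write rates during a spike are, by definition, higher than the average used to size the window.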
For an idea of how dramatic the spikes in demand for Universal Credit were during this period, take a look at the below graph:
Typically DWP Digital does major releases every fortnight and minor releases in the weeks in between - with the ability to release urgent changes on demand with zero downtime. To give a further idea of the impact of COVID-19, in January and February the team released 6 urgent changes. In March and April it released 76 urgent changes. Downey said that this is testament to the truly agile nature of Universal Credit, in that it could adapt and evolve as the situation changed.
Padgham and Downey state that the experience of COVID-19, and the fact that the Universal Credit systems held up under pressure, is evidence that agile delivery and the approach the team has taken are effective. They are now looking to make further changes and are considering the use of a Database-as-a-Service - such as MongoDB Atlas. Downey explained:
The value of that is that it cuts across a number of different areas. We don't need the ownership cost of running a cluster and making sure it's up to date and backing it up and testing restorations. From an operational perspective it's a bit nicer and will reduce some support effort. It makes running that part of the service someone else's problem.
We're here to provide benefits to the nation, we're not here to run database clusters. The other important aspect is giving more control to the delivery teams and allowing them to scale a bit. At the moment there's a bunch of shared resources and so different aspects of the system will compete for resource. With the addition of something like DBaaS that means creating single clusters or database instances should become a whole bunch easier, which means that we can better decouple our system.
The significance of the work DWP Digital is doing with Universal Credit is not lost on Padgham and Downey either, as they look to the months ahead and what pressures may be placed on the system down the line. Padgham said:
Technology is one thing, but more importantly, Universal Credit is a system whose failure affects people with real life consequences. If people don't get paid their families go hungry, they could miss their rent, or not afford their heating. And Britain has entered the deepest recession since records began, so this will continue to have an impact for months, if not years to come. Our focus remains on making sure people get paid and have access to the support they need, when they need it.