OK, your mission, should you choose to accept it, is to take a legacy outsourced system used by north of 60,000 users, transfer it to a new in-sourced cloud platform, do that within a ten-week window with a hard stop, and get it live over one weekend to handle 10,000 specific tasks on Monday morning.
Actually, you don't really have any choice but to accept the mission. The service in question is part of the MOT Modernisation program at the UK Driver and Vehicle Standards Agency (DVSA). Given that brief, it’s hardly surprising that DVSA Director of Digital Services and Technology James Munson quips:
A fixed deadline always focuses your mind and we had a very fixed deadline. If we didn’t hit that deadline, it would impact on the service.
DVSA’s MOT systems were being used by 60,000 users and 23,000 garages across the UK, supporting over 30 million car tests per year and up to 200,000 tests on a busy day. The service had been delivered for ten years through a Private Finance Initiative outsourcing contract, which was due to come to an end in September 2015.
In keeping with a general desire to move away from big outsourcing deals in UK government, DVSA wanted to replace the outsourced service with a system controlled and maintained directly by the agency and its partner organizations. There was also an objective of continuous and iterative delivery of production code to improve the service on an ongoing basis.
To complicate matters further, DVSA chose to swap cloud providers en route. Munson explains:
We had the project at build-out, but because it was about two years on, the cloud world had moved on during that period. In the latter stages of the project, we did an independent cloud review to decide where we should host the production build environments.
We did the independent review across several cloud providers. Amazon Web Services came out the lead on that. Because of the timing of completing that study and because we had a very hard stop, with the old system being de-commissioned in September 2015, we had to complete the roll out to garages and then switch cloud provider. We had ten weeks to make a migration from one provider to another for the cloud build-out. We looked for someone who had experience of that kind of build-out.
In the event, DVSA tapped into Northern Irish cloud services firm Kainos, which had already been engaged to develop the web application side of the service.
Before the technical challenges of the migration could be addressed, there was a need to get buy-in for what sounded like an ambitious (frightening?) build-out from a number of stakeholders, including the Department for Transport (where sign-off went right up to the Permanent Secretary), the board of DVSA itself, the Driver and Vehicle Licensing Agency (DVLA) and the Government Digital Service (GDS). Munson notes:
We had a complex group of different stakeholders that we had to bring on that journey with us.
We worked closely with the Department of Transport. As part of the briefing, it was fairly new to be looking at national infrastructure on public cloud, so keeping them involved every step of the way was important and enabled us to meet their needs.
The relationship with GDS is a good exemplar of the kind of role that the Service wants to play moving forward: that of trusted advisor to departments and agencies. That’s certainly how Munson regards GDS:
Their architect team comes in and visits us pretty regularly to see what we’re doing and how we’re doing it. We still need them to do spend approvals as we go through each stage. So my approach has been to talk to them a lot and really open up our plans. We put together a new digital and technology strategy that got approved by the board and we shared that with GDS, taking them through it and helping them to understand where we’re going over the next 2-3 years.
With all the buy-in in place, the next step was to transition across to the AWS platform in the ten weeks available. The process included introducing Agile practices, automation and infrastructure as code, so that environments could be built and software deployed to those environments quickly, repeatedly and consistently.
In the first two weeks, performance tests were created based on the primary user journey, specifically the MOT test, and a basic small-scale production environment was created with a full-sized snapshot of production data. Monitoring of platform performance and health was established early on, as best practice, to provide data about the platform as it was iterated. The production environment was scaled up, scaled out and tuned according to performance test results, which were published at least once a week.
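As a rough illustration of how weekly performance test results might feed a scaling decision, the Python sketch below derives a target load from the 200,000-tests-per-day peak quoted earlier and checks a set of measured response times against a latency budget. It is not DVSA's or Kainos's actual tooling: the working-hours figure, the peak factor, the 95th-percentile budget and all function names are assumptions made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class LoadTarget:
    tests_per_day: int     # peak daily volume; 200,000 is the figure quoted in the article
    working_hours: float   # ASSUMPTION: hours per day over which garages record tests
    peak_factor: float     # ASSUMPTION: busiest-hour multiplier over the daily average

    def tests_per_second(self) -> float:
        """Target request rate for the performance test."""
        average = self.tests_per_day / (self.working_hours * 3600)
        return average * self.peak_factor


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def needs_scaling(latencies_ms, p95_budget_ms):
    """True if the 95th-percentile latency exceeds the (hypothetical) budget."""
    return percentile(latencies_ms, 95) > p95_budget_ms


if __name__ == "__main__":
    target = LoadTarget(tests_per_day=200_000, working_hours=10, peak_factor=2.0)
    print(f"target load: {target.tests_per_second():.1f} tests/s")

    # Made-up response times (ms) from one hypothetical test run.
    run = [120, 150, 180, 200, 900]
    print("scale up or tune:", needs_scaling(run, p95_budget_ms=500))
```

Under these assumed inputs, 200,000 tests spread over ten working hours with a peak factor of two comes to roughly 11 tests per second. The point of the sketch is the loop the article describes: run the test, compare the result with a budget, then scale or tune before the next run.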
All this was happening with the prospect of a weekend switch-over looming and not much of a safety net. Munson explains:
Over one weekend, we brought the service down, then brought it back up. We didn’t run two systems in parallel, although there is a paper-based system in case the system doesn’t work, so you can still get your car MOT-d and get a piece of paper to take to the Post Office. But there isn’t a back-up system.
There were roll-back plans, so we could continue to use the infrastructure that we had, but if the data didn’t migrate at that point, we would be live and it would be difficult to migrate later. There were lots of things that we wanted to do once the initial system was live. We really had that one window to get that core infrastructure in that we wanted, then look at the user needs and build the services people wanted going forwards.
As things transpired, a 1.5TB MySQL database was successfully transitioned to AWS, while the live service moved over within a 36-hour window, resulting in a production environment on day one of the new live service that could handle 100,000 MOT tests. Munson says:
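To give a sense of scale, the short Python sketch below works out the minimum sustained transfer rate needed to move a database of this size inside such a window. It assumes, purely for illustration, that all 1.5TB had to be copied within the 36 hours. The article does not say how the data was actually moved, and a real migration might instead rely on replication or staged copies made ahead of the cutover. The function name is hypothetical.

```python
def min_throughput_mb_per_s(size_tb: float, window_hours: float) -> float:
    """Minimum sustained rate, in decimal MB/s, to copy size_tb within window_hours."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    size_bytes = size_tb * 1e12            # decimal terabytes to bytes
    window_seconds = window_hours * 3600
    return size_bytes / window_seconds / 1e6


if __name__ == "__main__":
    # ASSUMPTION: the whole database is copied inside the cutover window.
    rate = min_throughput_mb_per_s(size_tb=1.5, window_hours=36)
    print(f"{rate:.1f} MB/s, about {rate * 8:.0f} Mbit/s")
```

That works out to roughly 11.6 MB/s sustained for the full window, before allowing any time for verification, index rebuilds or a possible roll-back, which is one reason a single weekend leaves so little margin.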
We did have one or two bumps the first couple of weeks, which we had to stabilise. We had teams working 24-hour shifts, people working day and night for the first couple of weeks. Once that had stabilised, we went into a period of releasing on an Agile basis. We now have three Agile teams that release code every two weeks. We do Design Sprints and then we do Production Build-Out Sprints. We’re building Amazon out across three dependency zones to give us resilience built-in.
Future plans now include doing discovery for a digital transition for other DVSA services, such as driver testing and vehicle enforcement, which are still hosted on legacy infrastructure in legacy applications. But there’s already been a cultural shift within the agency, says Munson:
We’ve built-out the GDS Service Delivery Model. We’ve got a Service Manager and product owners who don’t sit in IT, they sit with the business teams. We have a monthly steering meeting that the directors sit in on, where we talk about what we’re delivering and what’s coming next in different Sprints. And we’ve mapped those Sprints right through to August.