Trainline’s migration journey from Exadata to AWS

SUMMARY:

Online rail ticket retailer Trainline refactored its Oracle-based architecture with microservices and continuous delivery when it moved from Exadata to AWS

Mark Holt, Trainline

Founded in 1999, Trainline has grown to become the UK’s sixth largest online retailer, selling £1.9 billion ($2.7bn) worth of rail tickets every year to 6.5 million monthly users. With its business wholly dependent on the robustness and adaptability of its IT, the company took the momentous decision to migrate its entire infrastructure last year to the Amazon Web Services (AWS) public cloud. I had the chance to sit down with Trainline CTO Mark Holt last week to learn the story of what he says has been “a massive and incredible year of transformation.”

In the course of the migration, which is now all but completed, Trainline has undertaken a root-and-branch application modernization, has begun to componentize its core applications into a microservices architecture, and has moved its software development process to continuous delivery.

As a result of the move from its current co-location home, the company calculates it will save around £1.2 million ($1.7m) annually on capital expenditure, without paying any extra in operational costs. As part of the move to AWS, Trainline is leaving behind its high-performance Exadata server, Oracle’s flagship integrated compute and storage system. Holt told me that running on AWS will be “100% better” for Trainline than Exadata.

Exadata is an amazing piece of kit, but it’s an amazing piece of kit for the ’90s, not an amazing piece of kit for 2016.

Rearchitecting for AWS

Adapting Trainline’s software to run on AWS has meant rearchitecting much of the software, he added.

You need to buy in to the cloud philosophy. The difference with cloud is, you need to recognize that latency is going to be there, and you have to start building your applications to be more latency tolerant.

You have to factor in that things will fail. Stuff will break, and you need to start building everything with that in mind.

A lot of the journey of the last year has been making sure that we don’t have single points of failure. Making sure that things can run on multiple boxes. Making sure that if they do take 15 milliseconds to get across the network hop, versus the 1ms that we were having before, that it’s not going to cause a problem.
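In practice, building for latency tolerance and failure means bounding how long any call is allowed to take and assuming it will sometimes not come back. The C# sketch below illustrates that idea under stated assumptions: the fares service URL, the 500ms timeout and the single retry are invented for illustration and are not Trainline’s actual code.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only: a downstream call that assumes the network hop
// may be slow or may fail, so it bounds the wait and retries once.
// The URL, timeout and retry policy are hypothetical, not Trainline's code.
class LatencyTolerantClient
{
    private static readonly HttpClient Http = new HttpClient
    {
        Timeout = TimeSpan.FromMilliseconds(500)   // fail fast rather than hang on a slow hop
    };

    public static async Task<string> GetFaresAsync(string journeyId)
    {
        for (var attempt = 1; attempt <= 2; attempt++)   // one retry, then give up
        {
            try
            {
                var response = await Http.GetAsync("https://fares.internal/journeys/" + journeyId);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException) when (attempt < 2)
            {
                // Transient failure: "stuff will break", so try once more.
            }
            catch (TaskCanceledException) when (attempt < 2)
            {
                // Timeout: the hop took too long; retry rather than block the caller.
            }
        }
        throw new InvalidOperationException("Fares service unavailable after retries.");
    }
}
```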

A move to a new data center would have been on the cards anyway, but moving to AWS has forced a complete modernization of the application stack, which included components such as Microsoft’s BizTalk integration server and end-of-life versions of Windows Server, says Holt.

Now we have a lot less of those dark, scary corners of the code that people are afraid to go in, because, ‘Oh, it’s written in .NET 2.0, running on Windows Server 2003.’ The developers have been able to get rid of a lot of that stuff.

About 25-30% of the IT organization’s resources have been devoted to the move to AWS, leaving the rest firmly focused on driving top-line growth, he says.

We created a bunch of cloud readiness goals. It must run on .NET [Framework] 4.5.2. It must run on Windows Server 2012 R2. It must communicate via HTTPS — because in the cloud environment, you have to work on the principle that everything is inherently insecure and that any data in transit is insecure.

Then each of the teams were tasked with, okay, how do you upgrade your architecture estate to meet those cloud readiness goals?
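As a rough illustration of the “everything over HTTPS” readiness goal on .NET Framework 4.5.2, the snippet below forces TLS 1.2 on outbound calls and rejects plain-HTTP endpoints. It is a sketch of the principle, not Trainline’s configuration; the class and method names are invented.

```csharp
using System;
using System.Net;

// Sketch of the "assume the network is insecure" readiness goal on .NET 4.5.2.
// The enforcement details below are assumptions, not Trainline's actual setup.
static class TransportSecurity
{
    public static void Configure()
    {
        // Prefer TLS 1.2 for all outbound ServicePoint-based calls.
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    }

    public static Uri RequireHttps(Uri endpoint)
    {
        if (!string.Equals(endpoint.Scheme, Uri.UriSchemeHttps, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("Plain HTTP is not allowed; data in transit must be encrypted.");
        return endpoint;
    }
}
```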

The move to a wholly cloud environment required a very detailed look at how applications were wired together. Network latency — the time it takes for data to travel from one server to another — can be several times greater in the AWS environment. This means developers had to think carefully about how the applications transferred data back and forth to the database or from one process to the next.

Clearly, in a world where everything is physically and logically close to the database, when you’re 4ms away at all times, you can keep going backwards and forwards for lots and lots of different queries, at a very micro level. We had components that would do fifteen, or twenty or thirty, or fifty queries against the Oracle database, in order to return a piece of data, and in 4ms that’s okay. At 20ms, all of a sudden bad things start to happen.
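The arithmetic behind that quote is simply round trips multiplied by latency: fifty sequential queries cost roughly 200ms when each hop is 4ms, but around a full second at 20ms. The C# sketch below shows the usual fix, collapsing a chatty per-item loop into one batched query; the table, columns and method names are invented for illustration and are not Trainline’s schema.

```csharp
using System.Collections.Generic;
using System.Data;

// Illustrative sketch of trading per-item round trips for one batched query.
// Table and column names are invented; this is not Trainline's schema.
static class FareLookup
{
    // Chatty version: one round trip per journey leg.
    // 50 legs * 4ms latency  ~= 200ms, tolerable next to the database;
    // 50 legs * 20ms latency ~= 1,000ms, not tolerable in the cloud.
    public static List<decimal> LoadFaresChatty(IDbConnection db, IEnumerable<int> legIds)
    {
        var fares = new List<decimal>();
        foreach (var legId in legIds)
        {
            using (var cmd = db.CreateCommand())
            {
                cmd.CommandText = "SELECT price FROM leg_fares WHERE leg_id = :legId";
                var p = cmd.CreateParameter();
                p.ParameterName = "legId";
                p.Value = legId;
                cmd.Parameters.Add(p);
                fares.Add((decimal)cmd.ExecuteScalar());
            }
        }
        return fares;
    }

    // Batched version: one round trip for the whole journey.
    public static List<decimal> LoadFaresBatched(IDbConnection db, IEnumerable<int> legIds)
    {
        var fares = new List<decimal>();
        using (var cmd = db.CreateCommand())
        {
            // In-list built from trusted integer ids; real code would bind parameters.
            cmd.CommandText = "SELECT price FROM leg_fares WHERE leg_id IN (" + string.Join(",", legIds) + ")";
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    fares.Add(reader.GetDecimal(0));
            }
        }
        return fares;
    }
}
```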

Monitoring performance

The team also took the opportunity to componentize the application code so that they could move it to AWS one piece at a time, testing it as they went, rather than moving everything all at once. Trainline is a big user of New Relic application performance management software, which played a crucial role in monitoring the behavior of each piece as it was moved to the AWS environment.

At the most micro level, being able to see which applications are doing how many transactions, what the external dependencies of the components are, is a really effective way of being able to look at that.

As we were going from all components running in Rotherham — that’s our old data center — to some components in Rotherham and some components in AWS, I think we were up to 40ms latency across that link. That’s massive.

Even just calling services across that is a problem. We were able to keep an eye on, what’s the impact on overall response time of having moved this component out. [We could say,] ‘That’s not good, let’s move back again, and figure what else we can do to move things around.’

Another important thing to watch was ensuring that performance didn’t suffer when a component moved to AWS.

We created performance budgets for each of the teams — it’s just a dashboard that says, here’s how your application is currently performing from a response time perspective; that’s what you need to sustain when we move to Amazon. As you go through your application modernization work, keep an eye on the performance. Look, it’s going to the database seventy-two times when it could go once, I guess we’d better fix that.

These things just happen. Over fifteen years, people write bits of code that then get used in random ways. Being able to just look deeply inside the servers and the applications is really huge; it’s a real value add from New Relic.

Then, of course, when we actually go live, what is the impact on end user response time? We draw that very clear line from response time to revenue. It would have been disastrous for us to introduce half a second of latency from an end user perspective; that would have cost us £8 million ($11.5m) a year [in lost revenues].
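A performance budget of the kind Holt describes can be as simple as a recorded baseline per component plus an allowed tolerance, checked against what the monitoring shows after each move. The sketch below shows that shape; the component name, thresholds and classes are invented, and the measured figure would in practice come from an APM tool such as New Relic rather than a hard-coded value.

```csharp
using System;

// Illustrative sketch of a per-component performance budget: the baseline
// captured before the move is the budget the component must sustain on AWS.
// The numbers and component name are invented, not Trainline's dashboard.
class PerformanceBudget
{
    public string Component { get; set; }
    public double BaselineP95Ms { get; set; }   // response time before migration
    public double ToleranceMs { get; set; }     // allowed regression

    public bool IsWithinBudget(double measuredP95Ms)
    {
        return measuredP95Ms <= BaselineP95Ms + ToleranceMs;
    }
}

class Program
{
    static void Main()
    {
        var budget = new PerformanceBudget { Component = "journey-search", BaselineP95Ms = 180, ToleranceMs = 20 };

        double measuredOnAws = 210;   // in practice, pulled from the monitoring tool after the move
        if (!budget.IsWithinBudget(measuredOnAws))
            Console.WriteLine(budget.Component + " over budget: " + measuredOnAws + "ms vs " + budget.BaselineP95Ms + "ms baseline");
    }
}
```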

Microservices mean more choice

Accompanying the componentization process in preparation for the move to AWS, the development cycle changed from three-weekly releases to continuous delivery. The development teams autonomously made this move ahead of schedule, says Holt.

We originally set a target of October 2015, for everything to be in continuous delivery. In about June, all of the teams went, ‘Oh, by the way, we’ve pulled our components apart enough, that we’re all in continuous delivery now.’

They understood how to get there and they just got on with it. They knew that the world would be so much better, when they were in continuous delivery. It’s been a big piece of being able to get to Amazon as well.

Breaking down the architecture into smaller components meant that Trainline was able to avoid a single ‘big bang’ move to AWS.

If you’re trying to deploy this much stuff in a big bang, because it’s all tightly coupled, that’s horrific. Because they teased it all apart sufficiently, we got to about 60% of the components running live in Amazon, 40% of them running live in our old data center.

There was then a smaller move when the core Oracle database transferred across to AWS together with ten or so other components that needed to move at the same time because of latency issues. That leaves just a few items to follow over the next few weeks.

All the ugly stuff is there; there’s just a few bits of clean-up to do. It’s about 10% still left to go.

Having introduced more of a microservices architecture as part of the move to AWS, Trainline will have more platform flexibility in the future, says Holt.

We’ve had this world where the Oracle database was the hammer that we hit every problem with. We’re now moving to a place where we’ll start to split the schemas apart.

Microservices is clearly the future. We’ve got, not fine-grained services, but not coarse-grained either, we’re somewhere in the middle. We’re starting to create more and more fine-grained services, more microservices. As we do that, that will enable us to tease our Oracle database apart still further.

To start with, we’ll almost certainly use Amazon RDS Oracle — Oracle managed by Amazon. Then over time, we’ll probably look to move onto the Amazon components, maybe something like Aurora or maybe even just SQL Server managed by Amazon.

Image credit - Feature photo: Rail track heading into blue sky with white clouds © Jürgen Fälchle - Fotolia.com; Mark Holt speaking at event courtesy of New Relic

Disclosure - At time of writing, Oracle is a diginomica premier partner. New Relic arranged for me to interview Mark Holt at the company's London conference.