Reading University provides free rein to its meteorologists via hyper-convergence
- Summary:
- Free rein in the cloud for Reading University’s meteorologists was fine - until the bills came in!
The academics at Reading University are no different, though much of the research workload is a bit special. Meteorology is their collective forte, with much of their work being conducted in conjunction with organisations such as the Met Office and the European Centre for Medium-Range Weather Forecasts.
These are usually grant-funded projects where the academics approach the Research Council and bid for research money. At that point they are ‘under contract’ to submit the research results that the funding has paid for.
Free rein in the cloud… no
They have specific IT resources to service their needs, with students having their own separate services. The idea is that they get a resource base where they have what Ryan Kennedy, Academic Computing Team Manager at the University, refers to as free rein, but in a controlled manner.
But giving them that free rein was getting harder, because scaling applications on the legacy systems was becoming increasingly difficult, and often impossible, he explains:
We originally looked at doing this in the public cloud and we did some testing. It worked very well until we got the bill…
That was Azure. It worked perfectly, but unfortunately when you say to a researcher, ‘here is unlimited cloud resources at your fingertips’ they go ‘ooh, buy, buy, buy’. We couldn’t control that. The only time we found out really that they’d spent all the money was when the bill came.
We didn’t put that control plane in place and we were naïve enough to think that it would be ok, it’s very cheap in the cloud and we get discounts because we’re higher education.
There are currently 100 paid-for projects running actively on the system, with the department managing them as quasi-P&L accounts where no profit is made. In practice, it sells the academics CPU, memory and storage and they use those resources as they see fit, with whatever configuration meets the needs of their projects, though they do work within the constraints of a set of best practice guidelines.
This was another factor against the use of Azure, and of many other cloud service providers, because many of these configurations do not map well onto the resource packages on offer, says Kennedy:
What we sometimes get is that an academic will have a very memory intensive application, but they don’t need any CPU. But in Azure, they’re confined to the specification VMs that Azure state. An A1 VM is one CPU plus 2 GB of RAM, an A2 VM is two CPUs and 4 GB of RAM. You can’t say, ‘I would like some of this and some of that please,’ and off you go. Which then I guess is impossible to plan what they need as a service provider so you can’t deviate from that. But our academics are likely to require 128 GB of RAM but just one CPU, which in the real world you would never need. But for their applications it’s very much a thing they do need.
This was obviously wasteful and expensive as resources needed to be signed up and paid for in the knowledge that they would never be used.
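To put a rough number on that waste, here is a minimal sketch of the sizing problem Kennedy describes. It assumes a hypothetical ladder of fixed VM shapes that keeps the one-CPU-to-2-GB ratio from his example and simply doubles at each step; it is not the real Azure catalogue, and the names `VmSize`, `FIXED_SIZES` and `smallest_fit` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    cpus: int
    ram_gb: int


# Hypothetical size ladder: only the first two shapes (1 CPU / 2 GB and
# 2 CPUs / 4 GB) come from the quote; the rest keep doubling at that ratio.
FIXED_SIZES = [VmSize(f"size-{2**i}", 2**i, 2 * 2**i) for i in range(8)]


def smallest_fit(cpus_needed, ram_needed_gb, sizes=FIXED_SIZES):
    """Return the smallest fixed shape covering both needs, or None."""
    for size in sorted(sizes, key=lambda s: (s.cpus, s.ram_gb)):
        if size.cpus >= cpus_needed and size.ram_gb >= ram_needed_gb:
            return size
    return None


def idle_resources(cpus_needed, ram_needed_gb, size):
    """Return (idle CPUs, idle GB of RAM) paid for but not required."""
    return size.cpus - cpus_needed, size.ram_gb - ram_needed_gb


# The workload from the quote: one CPU, 128 GB of RAM.
fit = smallest_fit(1, 128)
idle_cpus, idle_ram = idle_resources(1, 128, fit)
print(f"{fit.name}: {fit.cpus} CPUs, {fit.ram_gb} GB; "
      f"idle: {idle_cpus} CPUs, {idle_ram} GB")
```

Under those assumed shapes, the one-CPU, 128 GB job lands on a 64-CPU machine, leaving 63 CPUs paid for and idle. Real provider catalogues include memory-heavy families that narrow the gap, but the fixed-ratio problem is the same in kind.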
Hyper-alternatives
The need for an alternative was apparent, and an on-premises solution seemed the better option in terms of cost management. One of the early decisions in that process was to use a hyper-converged architecture, partly because the members of Kennedy’s seven-person team are all generalists, so the team did not have the in-house skills for tasks such as scaling existing three-tier systems to the sizes needed.
The team looked at two hyper-converged environments, Dell EMC’s VMware-based VxRail and Nutanix. The decision went to Nutanix, partly because of the cost and scalability of the VMware-based option. The deployment started with five nodes and has grown to 33 in the year since the changeover started. Kennedy expects it to grow further as new research projects are undertaken, depending on their size and duration. He says:
We sort of ramped up quite aggressively with moving our storage workloads over to Nutanix as well so we were sort of moving to a one-stop-shop where all our workloads were running on Nutanix, rather than having multiple systems. So far the cost has been about a half million. But that is very ‘finger-in-the-air’.
On average, projects are about three years in length. Some take only six months, some a year, but most do not require compute resources at the start. The first year is often taken up with detailed planning and development of the application. Even then there is a fair degree of guesswork about what resources they will need, notes Kennedy:
We have a bit of flexibility, in that if they buy ‘x’ and then in six months’ time that actually they needed ‘y’, we can tweak things so that we shift the balance from what they have to what they require. We can swap them around quite easily which is better than if we were on the public cloud which makes it hard to change what you actually have. Especially if we do prepaid to get the cheapest option.
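The kind of tweak Kennedy describes can be pictured as a cost-neutral swap inside a project’s allocation. The sketch below is only an illustration: the unit prices are made up (the article gives no figures), and `UNIT_PRICE`, `cost` and `rebalance` are hypothetical names, not anything the university or Nutanix is known to use.

```python
# Hypothetical yearly unit prices; the article gives no real figures.
UNIT_PRICE = {"cpu": 40.0, "ram_gb": 5.0, "storage_tb": 100.0}


def cost(allocation):
    """Total yearly cost of an allocation at the assumed unit prices."""
    return sum(UNIT_PRICE[kind] * units for kind, units in allocation.items())


def rebalance(allocation, give_up, amount, gain):
    """Trade `amount` units of one resource for as many whole units of
    another as the freed money buys. Returns a new allocation."""
    if amount > allocation[give_up]:
        raise ValueError(f"project only holds {allocation[give_up]} {give_up}")
    freed = amount * UNIT_PRICE[give_up]
    gained_units = int(freed // UNIT_PRICE[gain])
    updated = dict(allocation)
    updated[give_up] -= amount
    updated[gain] += gained_units
    return updated


# A project bought 'x' and six months later finds it needed 'y'.
project = {"cpu": 16, "ram_gb": 32, "storage_tb": 2}
tweaked = rebalance(project, give_up="cpu", amount=12, gain="ram_gb")
print(tweaked, cost(project), cost(tweaked))
```

With these assumed prices, giving up 12 CPUs frees enough budget for 96 GB of extra RAM, so the project moves from 16 CPUs and 32 GB to 4 CPUs and 128 GB at the same total cost. A prepaid public cloud reservation typically offers much less room for that kind of swap, which is the contrast Kennedy draws.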
Another fly in the ointment can be those applications themselves. If a new project is the continuation or extension of an existing one then the old code has to be used:
So I would say, for the typical corporate workloads, what cloud service providers offer is perfect. It’s when it comes into the research space of an application that was written 30 or 40 years ago and for scientific integrity they can’t change it, it just gets to this point where the typical workloads don’t work. I’ve known it happen in the past, where they re-write their application and then they find it hadn’t been working correctly for the last 20 years. But if everything is consistently wrong, it’s ok apparently.
Support has also been a big consideration for Kennedy, who readily recalls some very close calls, such as the time the third-party support provider went into administration and overnight the department was left with no support for a petabyte of storage. He describes that as ‘a scary few days’. With Nutanix, the department has yet to have an outage in the first year of operation:
We have had problems, but the wonderful thing is when we have had problems, Nutanix support have been purely incredible. At one point we lost all system management functionality and I went around the sun twice with support, but it never felt like I changed person even though it changed about six times as we worked with different engineering teams.
We no longer had to debug what was going on. We just opened a remote support channel and let Nutanix fix the problem, and at no point did a single VM fail however. With the old systems I would say we had a minor problem once a week and a major problem once a month, where it would literally fall over, with potential loss of data.
The move to Nutanix is also expected to improve the ability of the academics to bid for jobs that they would not have been able to bid for before, because they will be able to prove they have the infrastructure to do the research. The department can now prove its uptime statistics and demonstrate that the academics are unlikely to be asking for an extension because they didn’t have the compute power they thought they had.
If, perchance, additional resources are required, adding them is a straightforward process, concludes Kennedy:
With Nutanix, adding extra resources in a node is a fifteen-minute operation. The Nutanix advice was ‘buy less than you think you need and scale it if you need to’. There aren’t many vendors that would say that.
My take
This is an interesting exercise, because it is an unusual applications requirement where just about any systems configuration that businesses would call ‘normal’ is probably irrelevant. Yet there will be many businesses, especially in their R&D efforts, that probably get quite close to it. And as more machine learning and AI gets applied across business, it is quite possible that it will apply to more common business management practices in the future. There is, perhaps, even some learning here for the cloud service providers, in that just servicing the bog-standard 80% of the market may eventually cut them off from the bleeding edge of future developments. Maybe they should learn to experiment a bit more.