Golland recalled his experience of this real-life disaster recovery emergency at a public sector cloud conference at the end of last year. And whilst he was kind enough not to mention the company in question, a quick Google shows that Waltham Forest had struck a deal with the since-collapsed 2e2 in 2010 to provide it with 'one infrastructure'.
When 2e2 went into administration in 2013, it came as a shock to many, and there was a great deal of concern across the industry as to how companies were going to get hold of their data before it shut its doors. Few case studies emerged (for obvious reasons), but Golland's story serves as a stern reminder to those heading to the cloud not to do so without stringent back-ups in place.
Golland explained that just before the London Olympics, Waltham Forest had decided to go live with a new CRM platform (Microsoft Dynamics) that would provide much of the Council's interaction with the public, but would be hosted externally in a true multi-tenant environment. As part of this project, Golland decided to externalise a lot of Waltham Forest's data centre services and systems. He said:
Our procurement team had had a lot of involvement in this, it was all very much driven by outcomes-based specifications and it was, in the purest sense, a service on demand. We wouldn't maintain it, we wouldn't look after it.
However, soon after the Olympic Games finished in the capital, staff at Waltham Forest were being made aware of financial challenges at the cloud provider. Golland said that he had heard rumours in the marketplace that the supplier was having financial difficulties and that there were irregularities, but at the time he had made a number of checks and received assurances that pointed to everything being fine. He said:
We ran financial checks galore - you name them, we did it - because we were worried that there were going to be implications. However, not one of these checks brought up any problems.
But these proved irrelevant. In early 2013 the worst happened and Golland received a call that would put the fear of God (or whoever is higher up the chain) in most CIOs.
Then the day arrived. I got a phone call telling me that the provider had gone under. And I can tell you, having to go and see members who can barely understand how their email system works and tell them that suddenly the data centre provider had gone - that all of our systems, some 160 servers, this Dynamics environment that we had been developing, major interfaces, had all gone under - wasn't easy.
We were put on notice that in 24 to 48 hours all of this was being turned off. You can imagine the reaction across the organisation.
Interestingly, up until this point, Golland was fairly sure that Waltham Forest had good continuity and disaster recovery plans in place. However, despite it being a multi-site, IL3-accredited data centre with a completely virtualised environment, high availability, every layer robustly tested and continuity plans for each service - everything could still easily have fallen apart. He said:
You sit there with the executives and tell them that they're not going to have any IT, potentially for weeks, and ask how many of them have got continuity plans to cover that off. I can tell you there was not one person that stepped up and said that they had a manual process that they could fall back on to cover their service in that time. That was a real eye opener and we have learned a lot from that.
We were facing some 8-12 weeks to rebuild the complete environment from scratch. You are talking, in effect, three months of services having to manually cover off all customer interaction and all of the interfaces with the third party suppliers that we had - all integrated with the suppliers via the CRM application. So, obviously, that left a massive gap in terms of services.
There were a number of things that we learnt out of this. When you put things in the cloud, look very carefully at services, because what we found was that often our continuity plan was to turn to IT - not for IT to fundamentally fail in terms of a provider. And we had all our eggs in one basket.
With the prospect of needing to get everything out and set up in under 48 hours, Golland decided that Waltham Forest only had three realistic options. The first was to bring everything back in-house, which was quickly written off because he didn't think the Council had the capabilities to get everything set up so quickly and to continue to maintain it. The second was to move to a new supplier completely - but, again, Golland didn't think this was feasible. He said:
For everyone that's been involved in projects like this - standing up communications, the lead times - it just doesn't stack up with the notice period we were faced with.
The last option was to try and find one of our existing suppliers who could ramp everything up, and that's where we ended up.
Luckily, Golland and Waltham Forest were able to speak to one of their existing partners and decided to move everything into their environment over one weekend.
We did come through it, we did move. We pulled out of there and we managed to take copies of the data and the images in the 48 hours prior to that, and we managed to get all of our servers and the whole of the VM environment and recreate it all in the other vendor's data centre. We were very lucky, there is no two ways about it.
However, having gone through this ordeal, Golland learnt a number of lessons about putting a lot of systems into the cloud. He explained that organisations doing something similar should consider the following:
- Asset management – Golland explained that when the cloud provider went under, the administrator involved was doing what it could to protect the company's creditors by trying to maximise the recovery of any assets that were being held. He explained:
One of the things that you have to do is prove that the equipment that is in there is yours. You've got to have very strong asset management around your own assets. But secondly, and more importantly, when you are consuming a cloud service, who owns the IP of that build? Fine, the data is yours, you can collect your data, but that doesn't leave you anything to run it on and pick up normal day-to-day business.
You may own the software licenses, depending on the licensing and the model you have gone with. But if it is in a combined environment in true cloud style, who owns those VM images on those machines?
There was a very, very interesting debate with the administrators about this - to the point where we had to take legal action to try and secure them. One of the things you need to be very mindful of is being in a position where you can exit in a way that will allow you to stand up in an emergency.
- Customisations – If you haven't gone with an out-the-box cloud build and have implemented a number of customisations, make sure you understand who owns those customisations. Golland said:
It's amazing the number of people that get into cloud customisations and have not got to grips with who owns that code, or at least made sure that that code has been put into escrow so that they have got an entitlement to access it if there is a failure of the organisation.
When we go into cloud now, when we have the images and any development of them, we insist that those are put into escrow so we can access them.
- Manual processes – One of the main things that Golland emphasised was that companies should not rely solely on the cloud, or on IT. Companies, departments and service leaders need to make sure that they have manual processes in place to support ongoing operations in the medium-term if things go to pot. He said:
We insist that services have manual processes they can fall back on in the event of a failure of their services, and they have got to be able to maintain at least five working days of those processes.
- Getting access – The data may be yours, but in the event of your cloud provider going under, getting access to it and getting it out may not be the easiest task. Golland said:
Don't ever underestimate the challenge of physically getting access to that data, especially if it's an IL3 (high security) data centre. That's challenging. Secondly, dependent on the size of the data, don't underestimate the size of the challenge of moving that. Getting a mobile SAN and going to suck that information out is a challenge.
Golland isn't saying don't go to the cloud; he's saying prepare for the worst, even if you're being told that there is no way the worst will happen.
It's a case study that serves as a healthy reminder to get stringent contingency plans in place.