When using the cloud for DR, know your goals
Summary: Public cloud is a cost-effective option for disaster recovery and business continuity, but not all DRaaS solutions deliver the goods.
A recent survey of healthcare providers found that the area in which they were most likely to use cloud services is DR. Another survey found that one of the top reasons for adopting hybrid cloud IT infrastructure is to improve application availability and performance. But while public cloud services are inherently reliable, that doesn't mean DR systems built on top of them necessarily are. Understanding the distinction can save organizations some nasty surprises when disaster strikes.
Using cloud services as business application insurance is a perfect fit since they obviate the need for expensive, redundant IT infrastructure while providing an intrinsically resilient, geographically distributed platform made of virtual resources that can be rapidly instantiated in response to an equipment outage or security incident. Furthermore, usage-based pricing means that organizations don't pay for equipment they're not using in an active-passive DR design.
The problem with using the cloud for DR is that when an incident happens, you need the entire application chain recovered ASAP. Finding out after the fact that your DR setup protected only parts of a business process is too late. Indeed, the risk of using the cloud for redundant infrastructure without DR processes, service automation and SLAs was summed up by the often paradoxical Yankee great Yogi Berra in those Aflac commercials: "it's the one you really need to have, 'cause if you don't have it, that's why you need it."
DR metrics mean more than just uptime
The point of any insurance is to provide rapid relief in a temporary crisis, which in the case of cloud DR services means the ability to limit the business damage of an outage by setting limits on its duration. Such boundaries are typically quantified in SLAs; however, as I explored in an earlier column on buying cloud services, service agreements are often incomplete or ambiguous. As that column emphasized:
While the legal documents will address many questions, they won’t cover everything important, so you should press cloud providers to fill in details in the following areas.
Given the implications of a significant outage for financial performance and business reputation, due diligence on DR should go beyond Q&A and include contractually binding agreements and independent audits.
Unfortunately, most cloud SLAs cover infrastructure availability, not the linked processes needed to deploy and activate an alternative application environment. This is a crucial difference between generic infrastructure services like AWS and Google Cloud and so-called DR-as-a-service (DRaaS) providers. According to Jeff Ton, EVP of Product & Service Development at Bluelock, a DRaaS specialist, there are four service levels organizations should understand and specify when designing a cloud-based DR process:
- Infrastructure availability: typically what IaaS provider SLAs stipulate
- Replication service: the process of keeping data synchronized between production and backup locations. The terms of this service should include a Recovery Point Objective (RPO) that specifies the maximum amount of data at risk of loss, typically defined as the amount of time that the primary and backup sites can be out of sync.
- Recovery team response: speed of execution by specialists responsible for bringing a backup site online during an incident
- Recovery Time Objective (RTO): the elapsed time between the start of a DR event and when the backup site is fully operational
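As a minimal sketch of how the RPO and RTO definitions above translate into checks, the following Python compares timestamps from an incident against the two objectives. The function names, timestamps and targets are hypothetical and chosen only for illustration; they are not taken from any provider's SLA or tooling.

```python
from datetime import datetime, timedelta


def data_at_risk(last_replicated: datetime, incident_start: datetime) -> timedelta:
    """Window of changes not yet copied to the backup site when the incident began."""
    return max(incident_start - last_replicated, timedelta(0))


def meets_rpo(last_replicated: datetime, incident_start: datetime, rpo: timedelta) -> bool:
    """RPO is met if the out-of-sync window does not exceed the objective."""
    return data_at_risk(last_replicated, incident_start) <= rpo


def meets_rto(incident_start: datetime, fully_operational: datetime, rto: timedelta) -> bool:
    """RTO is met if the backup site is fully operational within the objective."""
    return fully_operational - incident_start <= rto


# Hypothetical incident timeline (assumed values, for illustration only).
last_sync = datetime(2024, 1, 1, 11, 50)   # last successful replication
incident = datetime(2024, 1, 1, 12, 0)     # start of the DR event
restored = datetime(2024, 1, 1, 15, 30)    # backup site fully operational

print(data_at_risk(last_sync, incident))                      # 0:10:00
print(meets_rpo(last_sync, incident, timedelta(minutes=15)))  # True
print(meets_rto(incident, restored, timedelta(hours=4)))      # True
print(meets_rto(incident, restored, timedelta(hours=2)))      # False
```

Note that the RTO clock here stops only when the site is fully operational, which is the definition used in the list above, rather than when infrastructure first becomes available.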
Ton emphasizes the importance of quantifying the recovery time objective while noting that few providers take this seriously:
Few offer an RTO SLA that covers infrastructure availability, booting virtual machines (VMs), booting operating systems, starting the applications and quality assurance that equates to a customer’s applications returning to normal usage.
IaaS SLAs cover only availability of the compute instances or storage services. For example, Azure and Google Cloud specify monthly uptime objectives for compute instances of 99.95%, or a maximum of about 22 minutes of downtime in a 30-day month, before customers start receiving service credits. However, as the Bluelock list makes clear, such infrastructure availability is only one small piece of putting applications back together at a remote location. By contrast, Bluelock’s DRaaS SLAs use what it labels True RTO™, which Ton says:
Focuses on the full recovery process (not just the technology aspects) in returning applications to end users again. This means our RTOs cover a target time for virtual machines booting, operating systems booting, and the applications being started.
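The 22-minute figure quoted above is simple arithmetic, and it is worth setting against an end-to-end recovery. The sketch below computes the downtime allowed by an uptime percentage, then sums a set of recovery stages like those Ton lists. The stage durations are invented for illustration and are not Bluelock's or any other provider's numbers.

```python
def downtime_budget_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted by a monthly uptime objective."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# A 99.95% objective over a 30-day month allows about 21.6 minutes,
# which rounds to the 22 minutes cited in the text.
infra_budget = downtime_budget_minutes(99.95)
print(round(infra_budget, 1))

# Hypothetical stage durations (minutes) for a full application recovery.
# These values are assumptions made up for this example.
stage_minutes = {
    "infrastructure available": 5,
    "virtual machines booted": 10,
    "operating systems booted": 10,
    "applications started": 30,
    "quality assurance": 60,
}
end_to_end = sum(stage_minutes.values())
print(end_to_end)  # total minutes across all stages
```

Even with these optimistic made-up timings, the end-to-end total is several times the infrastructure downtime budget, which is why an SLA covering only the first stage says little about when users get their applications back.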
Besides its core infrastructure services, Microsoft offers a DRaaS product called Azure Site Recovery, which I have detailed elsewhere. Much like Bluelock's True RTO, the Site Recovery SLA illustrates the distinction between infrastructure and application availability. While Microsoft guarantees three nines for the recovery service itself, it also specifies a four-hour RTO for an on-premises-to-Azure failover. Thus, although you may be able to spin up new machine instances within minutes, bringing a full application and its data back online can take several hours.
Even that won't be fast enough for systems used in digital business, particularly Web front ends that directly interact with customers and business partners. As Gartner's report on DRaaS providers notes:
Because digital business moments will typically be realized in very compressed time frames, the primary service-level metrics of traditional DR — RTOs and RPOs — no longer apply, as supporting web services must be continuously available. As a result, IT leaders will be increasingly challenged to enable a broader level of IT service continuity.
Of course, recovery time can be substantially reduced if you're willing to go the DIY route. Google has several useful tips in its DR cookbook. However, those unwilling to operate hot standby systems and write some failover automation scripts must understand the service details and conditions of their chosen DRaaS provider and, if required, find one that supports near-instantaneous failover to active, or at least 'warm', infrastructure.
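To give a sense of what a DIY failover automation script involves, here is a minimal Python sketch that activates a standby after a number of consecutive failed health checks. It is an illustration under stated assumptions, not a method from Google's cookbook or any provider: `monitor`, `check` and `activate_standby` are hypothetical names, and a real script would probe an actual endpoint, wait between checks and call the cloud provider's own APIs.

```python
from typing import Callable


def monitor(
    check: Callable[[], bool],
    activate_standby: Callable[[], None],
    threshold: int = 3,
    max_checks: int = 10,
) -> bool:
    """Run up to max_checks health checks; fail over after `threshold`
    consecutive failures. Returns True if failover was triggered."""
    consecutive_failures = 0
    for _ in range(max_checks):
        if check():
            consecutive_failures = 0  # a healthy result resets the count
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                activate_standby()
                return True
    return False


# Usage with canned health results standing in for a real probe (assumed data).
canned = iter([True, False, False, False])
events = []
triggered = monitor(
    check=lambda: next(canned),
    activate_standby=lambda: events.append("standby activated"),
    threshold=3,
    max_checks=4,
)
print(triggered, events)
```

Requiring several consecutive failures before acting is a common guard against failing over on a single transient error; the right threshold depends on how the cost of a false alarm compares with the cost of a slower response.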
Don't let the recovery process swamp your IT team
Although many aspects of DR can be automated, there's still a significant human element, along with limited network bandwidth that can be swamped in a large incident if too many systems fail over at once. For the same reason power companies don't simultaneously restore every neighborhood after an outage, IT organizations can avoid a crushing spike in workload by smoothing out the DR process. According to Ton:
We also understand that for companies with hundreds or thousands of applications, this large amount coming back online simultaneously can be overwhelming for the IT personnel tasked with the quality assurance step of the recovery process. For this reason, we offer Recovery Waves™ — organizing applications from wider tiers of technology solutions into subsequent smaller ones, the order of which are noted in the recovery playbook. This grouping ensures that the most important applications and data receive the first attention, so that critical IT systems can return to end users in the most efficient fashion.
Although many DRaaS providers offer different service tiers for high- and low-priority applications, the ability to automatically stagger the recovery of a large application portfolio is uncommon but potentially quite valuable.
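The idea of staggering recovery can be sketched in a few lines of Python: sort applications by priority and split them into fixed-size waves. This is a simplified illustration, not Bluelock's Recovery Waves implementation; the `App` class, the priority numbers and the wave size are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class App:
    name: str
    priority: int  # 1 = most critical (assumed convention)


def plan_waves(apps: List[App], wave_size: int) -> List[List[App]]:
    """Order apps by priority (then name) and split into waves of wave_size."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    ordered = sorted(apps, key=lambda a: (a.priority, a.name))
    return [ordered[i:i + wave_size] for i in range(0, len(ordered), wave_size)]


# Hypothetical application portfolio.
portfolio = [
    App("wiki", 3),
    App("payments", 1),
    App("crm", 2),
    App("email", 2),
    App("intranet", 3),
]

for number, wave in enumerate(plan_waves(portfolio, wave_size=2), start=1):
    print(number, [app.name for app in wave])
```

A real playbook would also have to respect dependencies between applications, for example restoring a database before the services that rely on it, which a simple priority sort does not capture.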
My take
Disaster recovery has never been an exciting task for IT, but with business processes and transactions now entirely online, it's never been more important. Fortunately, the concurrent rise of cloud infrastructure and tailored data replication and application recovery services means that comprehensive DR has also never been more accessible and affordable. The natural fit of cloud infrastructure to DR is a key reason Gartner predicts that:
From 2016 through 2020, the use of either DRaaS or IaaS to support the failover of production applications will grow by more than 200%.
DRaaS buyers must understand both internal requirements and external capabilities before choosing a vendor. The former entails knowing the infrastructure and data dependencies of an organization's critical applications and how they can best be reconstituted on remote cloud infrastructure. The latter requires due diligence on, and negotiation with, service providers to get SLAs covering recovery metrics that are relevant to your business, as well as an understanding of any unique circumstances that may require customizing the DR process and automation tools.