A few weeks ago I wrote about regulatory compliance and its impact on the CIO. I’ve continued to think about what’s keeping CIOs up at night, and another topic is disaster recovery planning and business continuity. The CIO and their team need to ensure they have a plan and that the plan is tested, since bad things can happen to good people. The plan needs to have alternatives – a belt and suspenders approach, even in an age of cloud computing.
I remember a day, back in the 80s, when we were moving our computers from one building to another. This was when hard drives were expensive and it took forever to back them up to tape. We didn’t have the money to buy a whole set of redundant hardware, so we were just going to back up the information and then physically move the machines over a weekend. That was in the age before five nines of availability. We let the hardware vendor know the move was happening so they would be on call, just in case. We moved the CPUs and most of the disks, and had only one stack of drives left to go. As it was being loaded onto the truck, a series of events took place (a strap came loose…) and it looked like the end of a very competitive game of Jenga: a slow-motion fall, with parts of the stack hitting the pavement and scattering. Fortunately, we had the drives backed up and repair folks on call. We even had a few spare drives, so we were back in operation on Monday. We had an accident, but our recovery was no accident. It was planned.
Today, IT systems are at the center of the successful operation of many business processes. As we move into higher levels of automation, data collection through the IoT and more use of compute power than we can yet even imagine, our systems become critical to revenue generation and cost control. This is true across every business.
Business continuity plans need to address a wider range of planning, simulation and testing than ever before. A business continuity plan should embrace the following goals:
The management of business continuity falls largely within the sphere of risk management, with some crossover into related fields such as governance, information security and compliance, all of which are at the core of an enterprise architecture.
If the business already has a business continuity plan as part of its approach to risk management, that plan needs to be reassessed in a world of high levels of automation, contracting for services and reduced latency. The very definitions of foundational terms such as “work location,” “service” and “support” are changing. It is easy to overlook these shifts and assume that what we’ve always done is still sufficient. It is not. A diverse perspective is needed, one that looks for new threats, issues and implications.
One of the first elements of planning is a business impact analysis: the process of determining the relative importance or criticality of the various business elements and the threats they face. This analysis drives priorities, planning, preparations and other business continuity management activities. You cannot address everything at the same level, so the business impact analysis helps you concentrate your efforts on the areas of greatest concern first.
In today’s environment, business impact analysis is becoming more technical, as the interconnections across the ecosystem become more complex (embracing partners, suppliers and even customers). For example, we have seen situations recently involving program trading where an entire financial institution was placed at risk when its automated trading system responded in an unforeseen fashion or its governance processes broke down. Addressing these concerns will become ever more common across industries as automation increases.
Mission-critical IT systems require mission-critical analysis and protection, no matter the platform or the supplier who may be operating the underlying hardware. It is not just the systems that matter, but also the cloud services, the network connections and the design of the integrated applications.
People easily forget that availability at every level multiplies. The figures don’t add up or average out, so every level needs to be assessed. A simple example: if your storage is 99.99% available, your systems 99.999%, your software is designed for 99.999% and your network is 99.95%, the overall availability of the end-to-end system is only about 99.938%. That’s roughly 5.4 hours of downtime a year. It doesn’t sound like much, but it could easily represent a long payroll run or a month-end billing cycle.
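The arithmetic behind that example can be sketched in a few lines of Python. This is a minimal illustration, assuming the four layers are in series (every one must be up for a transaction to complete) and that they fail independently; the layer names and the `composite_availability` and `annual_downtime_hours` helpers are hypothetical, introduced only for this sketch.

```python
from math import prod

# Availability of each layer as a fraction, using the figures from the example.
# Assumption: layers are in series and fail independently of one another.
LAYERS = {
    "storage": 0.9999,
    "systems": 0.99999,
    "software": 0.99999,
    "network": 0.9995,
}

HOURS_PER_YEAR = 24 * 365  # ignores leap years


def composite_availability(layers):
    """Every layer must be up at once, so the individual availabilities multiply."""
    return prod(layers.values())


def annual_downtime_hours(availability):
    """Expected hours per year during which the end-to-end system is down."""
    return (1 - availability) * HOURS_PER_YEAR


overall = composite_availability(LAYERS)
print(f"Overall availability: {overall:.3%}")
print(f"Expected downtime: {annual_downtime_hours(overall):.1f} hours/year")
```

Note that the weakest layer dominates: the network's 99.95% alone accounts for more than four of the roughly 5.4 hours, which is why every level needs its own assessment rather than an average.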
No one cares if the lights are flashing and the disks are spinning if the end-to-end transactions cannot take place. Modern networking and collaboration techniques will be a critical component of the plan as well, since they allow for greater resiliency and flexibility, as well as more effective and timely communications. Most older plans don’t use these techniques as effectively as they could.
As Alan Lakein said, “Failing to plan is planning to fail.” All CIOs need to take a step back and assess how things have changed and what the impact on the business may be if all these new systems stop.
Featured image credit: Leadership or Business Failure © ptnphotof – Fotolia.com