Digital pandemonium - are the downsides of technological complexity and interdependence outweighing the benefits for enterprises?

By Kurt Marko, June 21, 2019
Summary:
Downtime is a fact of life. Prepare, don't panic!

A different pandemonium at Target

The IT system meltdown: It’s becoming a regular feature in the technology press, a form of reality programming appealing to our basest instincts. These stories of IT ineptitude have all the appeal of aerial TV coverage of a car chase: you know there’s going to be a crash, you’re just waiting to see how it transpires.

In IT, the events involve some sort of IT system failure that causes a massive, often cascading disruption of a company’s operations, resulting in chaos (and ire) among customers, ulcers for IT and mea culpas by the senior executives. In almost every case, the cause, if it’s ever fully revealed, stems from one or more seemingly simple mistakes or predictable events that reverberate in unpredictable ways.

After watching the same plot line unfold so many times, one wonders whether occasional outbursts of pandemonium and confusion are the cost of living in the digital age, and whether enterprise executives must learn to accept the downsides of a pact with the digital transformation devil. Despite our technological sophistication, and perhaps because of it, recent events indicate that yes, sporadic, chaotic outbreaks, like tornadoes in the Midwest, are something digital natives must expect and plan to mitigate.

The latest victim: Target

Target, one of the world’s largest retailers, was probably already in the IT failure hall of fame for the massive security breach of its point-of-sale systems back in 2013 that cost the CEO his job. The company added to its ignominious reputation on Father’s Day weekend with back-to-back incidents that resulted in upwards of $50 million in lost sales and thousands of angry customers. Target’s CEO dutifully apologized to customers, blamed the outage on routine maintenance on its IT systems, said full operations were restored in two hours and claimed that the lost sales would have no material impact on its quarterly earnings.

A second, shorter and less widespread outage disrupted Target’s point-of-sale terminals the next day. Target blamed this one on a data center outage at NCR, which handles Target’s payment processing. Combined, the two incidents not only damaged Target’s reputation, which likely still hasn’t fully recovered from the 2013 identity theft affair, but also cost millions in sales on a busy, pre-holiday weekend.

More broadly, the Target outages are the latest example of the interdependence and fragility of today’s IT-powered business processes. It’s an inherently precarious situation that allows seemingly small problems to snowball out of control and enables outages and security incidents at one company to virally spread to its customers, blindsiding them with a disaster they couldn’t see coming and have no way of preventing.

Other examples of system instability

It’s unfair to single out Target for ridicule since it is merely the latest, highest-profile example of enterprise IT failures that reverberate in unexpected ways. For example, technology outsourcing specialist Wipro suffered a security breach that exposed private data from its clients, according to security researcher Brian Krebs, who writes (emphasis added):

One source familiar with the forensic investigation at a Wipro customer said it appears at least 11 other companies were attacked, as evidenced from file folders found on the intruders’ back-end infrastructure that were named after various Wipro clients. That source declined to name the other clients.

The other source said Wipro is now in the process of building out a new private email network because the intruders were thought to have compromised Wipro’s corporate email system for some time. The source also said Wipro is now telling concerned clients about specific ‘indicators of compromise,’ telltale clues about tactics, tools and procedures used by the bad guys that might signify an attempted or successful intrusion.

While Wipro tried to publicly minimize the breach’s significance, Krebs later reported that (emphasis added):

I heard from a major US company that is partnering with Wipro (at least for now). The source said his employer opted to sever all online access to Wipro employees within days of discovering that these Wipro accounts were being used to target his company’s operations.

Another case in point comes from a massive, storm-triggered power outage in San Antonio that took out the Microsoft Azure South Central US region for 21 hours last September. This wouldn’t have been a problem for customers that replicated their workloads to other Azure regions, but the added overhead and expense mean such redundancy is often not in place, as Microsoft itself demonstrated. The outage took out regional availability for several Microsoft services hosted in the San Antonio facility, including its Visual Studio Team Services (VSTS, aka Azure DevOps) service, along with some global services run out of the location. According to the incident postmortem (emphasis added):

In addition to VSTS organizations hosted in the South Central US region, some global VSTS services, such as Marketplace, hosted there were also affected. That led to global impact such as inability to acquire extensions (including for VS and VS Code), general slowdowns, errors in the Dashboard functionality, and inability to access user profiles stored in South Central US.

In addition, users with VSTS organizations hosted in the US were unable to use Release Management and Package Management services. Build and release pipelines using the Hosted macOS queue failed. Additionally, the VSTS status page was out of date because it used data in South Central US, and the internal tools we use to post updates for customers are also hosted in South Central US.

Why didn’t Microsoft replicate these services to another region? Because synchronizing and reliably failing over to storage replicas is hard. Here’s how the postmortem describes it (emphasis added):

Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulted in data loss due to the latency of the backups.
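The trade-off the postmortem describes can be illustrated with a toy model. The sketch below is not the Azure Storage API; the class, method and key names are all hypothetical, and replication is reduced to an explicit copy step. It only shows why a read-only secondary keeps reads alive during a regional outage while blocking every write and hiding any data that had not yet been replicated.

```python
class PrimaryUnavailable(Exception):
    """Raised when a write is attempted while the primary region is down."""


class GeoReplicatedStore:
    """Toy model (hypothetical, not a real SDK): writable primary, read-only secondary."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_up = True

    def replicate(self):
        # Asynchronous replication modeled as an explicit step: anything
        # written after the last call is missing from the secondary.
        self.secondary = dict(self.primary)

    def write(self, key, value):
        if not self.primary_up:
            raise PrimaryUnavailable(f"cannot write {key!r}: secondary is read-only")
        self.primary[key] = value

    def read(self, key):
        # Reads fall back to the (possibly stale) secondary during an outage.
        source = self.primary if self.primary_up else self.secondary
        return source.get(key)


store = GeoReplicatedStore()
store.write("commit-1", "initial import")
store.replicate()
store.write("commit-2", "hotfix")  # written after the last replication
store.primary_up = False           # simulate the regional outage

print(store.read("commit-1"))      # still readable from the secondary
print(store.read("commit-2"))      # None: never reached the secondary
try:
    store.write("commit-3", "new work")
except PrimaryUnavailable as err:
    print("write rejected:", err)
```

In this model, a service such as source control stays browsable but cannot accept check-ins, which matches the degradation the postmortem says made the read-only option unusable.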

And there's more. As I detailed in a previous column, a recent outage at multiple U.S. Google Cloud regions affected not only the infrastructure services hosted there, but also several high-traffic Google applications like YouTube, Gmail and G Suite. It also knocked out several third-party consumer services, like Snapchat, Nest smart appliances and some iCloud services, that are powered by Google Cloud infrastructure.

Meanwhile, three of the largest U.S. air carriers, American, Delta and Southwest, were temporarily grounded this spring due to an outage at a third-party service that provides critical weight and balance information for planes that is used in flight planning. A recently released report by the U.S. GAO on IT outages in the airline industry ‘identified 34 IT outages from 2015 through 2017, affecting 11 of 12 selected airlines, of which, 85 percent resulted in flight delays or cancellations.’ While the total is insignificant when compared to delays due to weather or mechanical problems, the report illustrates how passengers might face a seemingly inexplicable delay even though the weather is fine and their plane is sitting on the tarmac.


Aside from system glitches and data center outages, organizations increasingly face significant disruption to their operations resulting from security incidents, notably ransomware. For example, routine services provided by the City of Baltimore ground to a halt as the city grappled with a ransomware attack that left it unable to issue water bills, collect traffic tickets or property taxes and issue building permits or title clearances, freezing the Baltimore real estate market in the process. City operations are slowly returning to normal after more than a month, but at the cost of at least $18 million in lost revenue and IT cleanup expenses.

A similar attack crippled the city of Atlanta last year, leaving it with about $10 million in recovery expenses. While many ransomware victims don’t yield to the extortion since there’s no guarantee that the perpetrators will honor their word and provide the encryption keys to unlock files, some do. A Palm Beach, Florida, suburb recently paid nearly $600,000 after ransomware disrupted city financial systems, its 911 dispatch center and water distribution service.

My take

The downstream ramifications of an IT system failure or security breach are often more significant than the direct costs from the event itself. As such, these cascading secondary and tertiary effects are a form of negative economic externality, since the outsized costs to the victims are often the result of ignorance or short-sighted cost-saving decisions by a third party. For an extreme example, look to the original Target PoS breach and the ramifications of lax security processes (whether from inexperience or cheapness is irrelevant to the point) at a small Pennsylvania HVAC contractor doing work at a few Target locations.

Similarly, a mistake at a virtually unknown data services company can delay hundreds of flights with untold costs to thousands of airline customers.

Sadly, the domino effect of IT problems that upset operations and activities at other companies is an artifact of our digitally transformed, über-connected world. As Joanne Joliet, a senior research director at Gartner, notes in a Wall Street Journal article on the latest Target outage:

Nobody is immune to an outage like this...Routine technology maintenance has become more complicated because systems are not isolated from each other. They’re all integrated…which introduces greater complexity.

The lesson for corporate executives and IT leaders is to expand their contingency planning horizon beyond the confines of their organization to include business disruptions and security breaches caused by secondary and tertiary suppliers.
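One way to make that wider planning horizon concrete is to map suppliers as a dependency graph and walk it outward from a critical business process. The sketch below is purely illustrative: the supplier names and the `transitive_suppliers` helper are hypothetical, and a real inventory would have to come from contracts and vendor disclosures. It uses a breadth-first search so that each supplier is labeled with the shortest number of hops between it and the process.

```python
from collections import deque


def transitive_suppliers(graph, service):
    """Return {supplier: tier}; tier 1 is a direct supplier, tier 2 its supplier, etc."""
    tiers = {}
    queue = deque([(service, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            # Breadth-first order guarantees the first visit is the shortest path.
            if dep not in tiers and dep != service:
                tiers[dep] = depth + 1
                queue.append((dep, depth + 1))
    return tiers


# Hypothetical supplier map for a retailer's checkout process.
SUPPLIERS = {
    "store-checkout": ["payment-processor", "inventory-service"],
    "payment-processor": ["processor-data-center"],
    "inventory-service": ["cloud-region-a"],
    "processor-data-center": ["regional-power-utility"],
}

exposure = transitive_suppliers(SUPPLIERS, "store-checkout")
for name, tier in sorted(exposure.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"tier {tier}: {name}")
```

Anything at tier 2 or beyond is a party the business has no contract with, yet whose failure can still stop the process at the top of the graph.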

Preventing or mitigating externally caused business disruptions will cost money for increased system and application redundancy, more sophisticated, multi-layered security, greater testing and the development of manual, non-digital (or disconnected) backup processes.

These costs and the added planning overhead can only be considered added friction that counteracts some, but not all, of the benefits of today’s disaggregated digital economy, and they represent another unintended consequence of outsourcing.