By now everyone is familiar with the most recent AWS S3 outage. It reverberated across dozens of high-profile online services dependent upon AWS for some or all of their infrastructure. It even took down diginomica’s Xero service for a while, although thankfully it occurred at a time of day when normal people in Europe aren’t crunching accounting records.
This incident likely has a longer shelf life than most, as competitors hype it for marketing purposes and cloud skeptics enjoy some schadenfreude. Our goals here are didactic, not dastardly.
An AWS post mortem of the incident identifies the cause as an operator mistake when issuing a routine command to take a few S3 servers offline.
Sadly, a command-line typo inadvertently affected far more than intended, including the index subsystem for the entire S3 complex at the massive US-East, Ashburn, Virginia site, severely crippling all applications in the region using the storage service.
Although the outage created a significant blot on Amazon’s reputation, it also served to highlight borderline negligence on the part of supposedly sophisticated cloud consumers like Medium, Trello, Soundcloud, Slack, Xero and others that base their businesses on AWS.
As Warren Buffett famously quipped, “you only find out who is swimming naked when the tide goes out.” The AWS beachfront was full of skinny dippers on February 28th.
First, some perspective since AWS is sure to be mercilessly flogged by every cloud competitor and legacy IT equipment vendor claiming either: (a) they can do better or (b) that’s what you get for trusting your data to the cloud.
The entire outage lasted 4 hours and 17 minutes, with most S3 functionality restored in 3 hours and 41 minutes. No S3 region had experienced any downtime in the month before this incident, and should US-East return to that standard of excellence, it will still achieve 99.95% uptime for the year.
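For readers who want to check that figure, here is a minimal sketch of the arithmetic. It assumes a 365-day year and that the 4 hour 17 minute outage is the only downtime for the year; the function name is illustrative, not part of any AWS tooling.

```python
from datetime import timedelta


def annual_uptime_percent(downtime: timedelta, days_in_year: int = 365) -> float:
    """Return uptime as a percentage of a full year, given total downtime."""
    year = timedelta(days=days_in_year)
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - downtime / year)


# Outage duration quoted in the article.
outage = timedelta(hours=4, minutes=17)
uptime = annual_uptime_percent(outage)
print(f"{uptime:.2f}%")  # prints "99.95%"
```

Put another way, 257 minutes out of the 525,600 in a year is a little under 0.05% of the total.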
According to the CloudHarmony monitor of cloud service availability, one Azure storage region recorded about 40% more downtime than US-East over the last 30 days. Somehow, we didn’t hear about that one.
Furthermore, IT organizations pointing the finger at AWS for poor performance had better look in the mirror, since few, if any, can do better.
A couple of years ago, a survey found that colocation centers, a reasonable proxy for large cloud providers, reported significantly fewer outages than enterprises. Statistical data on enterprise IT availability is understandably lacking, since no one likes to expose their dirty laundry, but another survey, now dated but likely still representative, found that only 26% of enterprises could repair an infrastructure failure within an hour, while 17% said it would take days and another 22% had no idea how long it would take.
The point is, enterprise IT and application developers must plan for failure and design accordingly, not moan and point fingers when the inevitable outage occurs.
How hard can it be?
Although IT analysts and consultants get tired of harping on the need for disaster recovery and business continuity (DR/BC) planning, we do it because, as with the need for greater IT security, the message is lost on too many organizations. Fortunately, cloud services make building backup infrastructure easier and cheaper than ever. As I wrote last month:
Disaster recovery has never been an exciting task for IT, but with business processes and transactions now entirely online, it’s never been more important. Fortunately, the concurrent rise of cloud infrastructure and tailored data replication and application recovery services mean that comprehensive DR has also never been more accessible and affordable.
You needn’t be a large, sophisticated multinational to have a serviceable cloud backup strategy. As this account of the S3 debacle illustrates, a bamboo farmer in Alabama was more prepared for an outage than many tech-savvy corporations. Quoting the farm’s technology lead:
“As our business is in bamboo plants, pictures are a very important part of selling our product online. We use Amazon S3 to store and distribute our website images. When Amazon’s servers went down, so did the majority of our website,” said the company’s chief technology officer Daniel Mullaly. “Thankfully we also store the images locally and I was able to serve the images directly from our server instead.”
As this example shows, on-site storage is one backup approach; however, there are better options. It’s baffling that more organizations don’t use native replication services to clone data and application images in multiple cloud locations. As I detail here and here, the Azure Site Recovery service makes it incredibly easy to replicate entire cloud infrastructure within the Microsoft ecosystem by automating most of the process.
Within AWS, most of the database services, including Aurora, RDS and DynamoDB, and S3 itself, support automatic, cross-region replication. Of course, it adds cost for the redundant storage and complexity since applications must be modified to automatically access a secondary AWS region should the primary be unavailable.
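To give a sense of the application change involved, here is a minimal sketch of a read path that falls back to a secondary region when the primary is unreachable. It is an illustration under stated assumptions, not AWS's API: the reader callables, region names and object key are all hypothetical stand-ins, and in a real application each reader would wrap a storage client bound to one region.

```python
from typing import Callable, List, Sequence, Tuple

# A "reader" is any callable that fetches an object by key from one location.
# (Hypothetical type for this sketch; a real one would wrap a storage client.)
Reader = Callable[[str], bytes]


class AllRegionsFailed(Exception):
    """Raised when no configured location could serve the object."""


def read_with_failover(key: str, readers: Sequence[Tuple[str, Reader]]) -> Tuple[str, bytes]:
    """Try each (region, reader) pair in order; return the first success."""
    errors: List[str] = []
    for region, read in readers:
        try:
            return region, read(key)
        except OSError as exc:  # covers ConnectionError and TimeoutError
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Stand-in readers simulating an outage in the primary region.
def primary_reader(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def secondary_reader(key: str) -> bytes:
    return b"replicated copy of " + key.encode()


region, body = read_with_failover(
    "images/bamboo.jpg",
    [("us-east-1", primary_reader), ("us-west-2", secondary_reader)],
)
print(region, body)
```

The same pattern covers the bamboo farm's approach: a reader that serves files from local disk can simply be the last entry in the list.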
But if you’re one of the aforementioned online properties whose business depends on AWS, it’s an easy expense to justify. Indeed, the 54 of the top 100 retailers that were affected by the incident, and the S&P 500 companies that ended up losing an estimated $150 million, can certainly rationalize ample amounts of DR engineering and expense.
AWS competitors, whether other cloud services seeing an opening or IT equipment vendors defending traditional IT-owned and operated infrastructure, are guaranteed to use the latest S3 outage as sales and marketing fodder.
Hopefully, IT and business leaders will see through the self-serving FUD and press the vendors making such claims for details on how and why their products would do any better when used in a similar situation within a single data center.
Don’t be surprised if you hear crickets. Yes, some alternatives, like Google Cloud or Azure Storage, make it easier to automatically configure geo-redundant storage, but the onus is on the customer to use it rather than cheaper options.
A silver lining of the AWS S3 outage is the lessons it provided to both AWS and IT professionals. For AWS, it exposed flaws in administrative and automation tools and processes that allowed a seemingly benign typo to cause a site-wide chain reaction.
I’m sure AWS is re-engineering the automation scripts and process controls to provide governors and fail-safes that prevent such runaway behavior in the future. For IT organizations, the outage underscored the criticality of designing redundancy into any infrastructure supporting applications and business services that require uninterrupted, 100% uptime.
Image credit - via Werner Vogels