In what is becoming a holiday tradition for IT operations and security teams (SolarWinds, anyone?), two recent events have added stress to an already hectic season by demonstrating the fragility and interdependence of the cloud services and open-source software at the foundation of enterprise applications and processes.
Last week's AWS outage (or "Service Event," as AWS prefers to call it) demonstrated that even the largest and most sophisticated hyperscale operators aren't immune from mistakes that cascade into a multitude of other problems. While the AWS outage was spotty, leaving many customers only mildly inconvenienced, the vulnerability in the Log4j software library is broad (affecting millions of software and service users), deep (lodged within countless commercial and open source applications) and likely to be lengthy (with ramifications lasting well into next year).
Together these events illustrate the fragility of modern IT systems and the 'digital transformation' strategies utterly dependent upon them.
AWS's demonstration of control plane dependencies
In one sense, AWS is justified in deliberately avoiding the word "outage" — it doesn't appear once in AWS's almost 2,000-word account of the incident — since many customers, i.e. those whose workloads were relatively static and didn't require any resource or configuration changes, operated normally throughout the event. However, anything that used the AWS control plane to create and manage services experienced significant delays. As AWS's post mortem puts it:
Customers of AWS services like Amazon RDS, EMR, Workspaces would not have been able to create new resources because of the inability to launch new EC2 instances during the event. Similarly, existing Elastic Load Balancers remained healthy during the event, but the elevated API error rates and latencies for the ELB APIs resulted in increased provisioning times for new load balancers and delayed instance registration times for adding new instances to existing load balancers.
Similarly, the Route 53 DNS service, which is critical to most multi-region customer applications (that is, those deliberately designed for maximum redundancy and scalability), continued to respond to DNS queries, but customers couldn't make changes, for example, to redirect a newly slow application in AWS US-East-1 to another region or cloud provider. Likewise, the network congestion between AWS's internal and external networks had odd effects on seemingly unrelated services like VPC Endpoints and API Gateways. For example (emphasis added):
AWS Lambda APIs and invocation of Lambda functions operated normally throughout the event. However, API Gateway, which is often used to invoke Lambda functions as well as an API management service for customer applications, experienced increased error rates. API Gateway servers were impacted by their inability to communicate with the internal network during the early part of this event. As a result of these errors, many API Gateway servers eventually got into a state where they needed to be replaced in order to serve requests successfully.
Such increases in latency and access errors also affected other services with AWS-managed control planes, like EventBridge, the Fargate, ECS and EKS container services, and the AWS management console in Northern Virginia.
In all, the initial event lasted more than eight hours, with AWS citing three factors that slowed the recovery and illustrate its entwined service dependencies (emphasis added):
First, the impact on internal monitoring limited our ability to understand the problem.
Second, our internal deployment systems, which run in our internal network, were impacted, which further slowed our remediation efforts.
Finally, because many AWS services on the main AWS network and AWS customer applications were still operating normally, we wanted to be extremely deliberate while making changes to avoid impacting functioning workloads.
While AWS resolved the primary incident by 8 pm EST on December 7th, ThousandEyes detected what it termed an "aftershock" that "caused significant disruption to multiple services" and lasted about 85 minutes on the 10th. Then on Wednesday the 15th, AWS experienced another disruption to services in the US-West-1 and -2 regions that affected DoorDash, PlayStation Network, QuickBooks and Salesforce, among others. According to ThousandEyes' data, at its worst, US-West-1 was unreachable by 14 out of 18 monitoring agents worldwide, with the problems occurring "within their main network, where traffic from sources both inside and outside AWS was getting dropped."
Since AWS has not publicly commented on these subsequent events, it's unknown whether they are related. It's noteworthy that AWS's event post mortem wasn't published until the night of the 10th, meaning the 12/10 incident could have been an aftershock. The 12/15 event could be unrelated since it affected another region, but we won't know unless AWS issues another post mortem.
In sum, the AWS event resulted from a nexus of inadequate design redundancy, a flawed software update, unforeseen dependencies between AWS services and inadequate traffic controls between its management and services networks.
Log4j vulnerabilities illustrate software dependencies
The initial AWS event was barely over before a security vulnerability was reported in the log4j software module "broadly used in a variety of consumer and enterprise services, websites, and applications—as well as in operational technology products—to log security and performance information" according to a CISA advisory. A separate statement by the CISA director, Jen Easterly, underscores the threat's significance (emphasis added):
To ensure the broadest possible dissemination of key information, we are also convening a national call with critical infrastructure stakeholders on Monday [12/13] afternoon where CISA’s experts provide further insight and address questions.
We continue to urge all organizations to review the latest CISA current activity alert and upgrade to log4j version 2.15.0, or apply their appropriate vendor recommended mitigations immediately.
To be clear, this vulnerability poses a severe risk. We will only minimize potential impacts through collaborative efforts between government and the private sector. We urge all organizations to join us in this essential effort and take action.
Indeed, within three days of the vulnerability first being reported, Check Point detected more than 800,000 attacks using the exploit and 60 variants of the original attack. The number of attacks reached almost 1.3 million by the next day (12/14), when Check Point released a detailed analysis of the attack methods. A separate analysis by Risk Based Security identified "over 200 products affected and we’re still processing a long list of related advisories," illustrating the vulnerability's enormous blast radius.
Widespread use of the log4j library, its ease of exploitation and its embedded nature as part of other software make it almost impossible for users of applications and services to check their exposure. These qualities also mean that, like the SolarWinds incident, IT administrators will be dealing with the aftermath for months. As Check Point concludes:
Attempts to exploit the Apache log4j vulnerability will most likely keep evolving in the future. The ease of the exploitation combined with the popularity of the log4j library created a vast pool of targets for attackers.
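For teams that do have file-level access to their own servers, one rough way to start gauging exposure is to look for archives that bundle Log4j 2's JNDI lookup class, including copies nested inside WAR or "fat" JAR files. The following is a minimal sketch under stated assumptions, not a vetted scanner: the function names are illustrative, it inspects only .jar, .war and .ear files, and it will miss copies whose packages have been renamed by shading tools. A hit means only that the archive contains the lookup class; it does not say whether that copy has been patched.

```python
import io
import zipfile
from pathlib import Path

# Class implementing JNDI lookups in log4j-core 2.x. Its presence marks an
# archive as a candidate for review, not as confirmed vulnerable.
JNDI_LOOKUP = "org/apache/logging/log4j/core/lookup/JndiLookup.class"
ARCHIVE_SUFFIXES = (".jar", ".war", ".ear")


def scan_archive(data, label, max_depth=5):
    """Return labels of archives (including nested ones) that contain JndiLookup."""
    hits = []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if name.endswith(JNDI_LOOKUP):
                    if label not in hits:
                        hits.append(label)
                elif name.lower().endswith(ARCHIVE_SUFFIXES) and max_depth > 0:
                    # Recurse into archives packaged inside this one.
                    nested = zf.read(name)
                    hits.extend(scan_archive(nested, f"{label}!{name}", max_depth - 1))
    except zipfile.BadZipFile:
        pass  # Not a readable archive; skip it.
    return hits


def scan_tree(root):
    """Walk a directory tree and scan every archive file found under it."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            hits.extend(scan_archive(path.read_bytes(), str(path)))
    return hits
```

Calling `scan_tree()` on an application directory (whatever path applies in your environment) returns one entry per matching archive, with `!` separating an outer archive from the nested file inside it. Each entry is a prompt to check that component's Log4j version against vendor and CISA guidance.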
This is a wake-up call to examine dependencies and linkages.
These incidents highlight significant weaknesses in today's enterprise IT environments via their dependence on centralized cloud services and software with embedded open source modules that, although widely used, might not be thoroughly hardened, reviewed and tested for security. Unfortunately, there aren't easy answers to these problems since businesses and developers have flocked to cloud services and open source code for good reasons — convenience, cost model, steady flow of features and updates — and have been willing to live with the occasional outage or urgent security patch.
The bigger problem is that enterprises have gradually, often unknowingly, deepened their dependence on such software and services to the point where an unexpected major incident can significantly impair revenue, damage customer relations and increase support costs. However, like the proverbial frog stuck in a pot of water that's slowly been brought to the boiling point, it's too late to jump out. Instead, the rational response is to:
- Use such incidents as motivation to undertake the tedious task of identifying software and service dependencies.
- Find ways to mitigate the risks of single points of failure, whether it is an AWS environment or a critical software package.
- Be careful not to introduce new failure points when fixing identified problems. For example, a knee-jerk reaction to one problem seen during the initial AWS event, the inability to make DNS changes, might be to decouple your DNS and cloud environments so that a cloud outage wouldn't prevent you from changing DNS records to redirect customers to another cloud environment (assuming you've built redundant systems). Unfortunately, this just introduces another failure point in the system.
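To make the first two steps concrete, a dependency inventory can be treated as a graph and queried for the components whose failure would reach the most applications. The sketch below is only an illustration under stated assumptions: the component names in `DEPENDS_ON` are hypothetical, and a real inventory would come from build manifests, software bills of materials and cloud resource listings rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical inventory: each component maps to what it depends on directly.
DEPENDS_ON = {
    "checkout-api": ["payments-lib", "aws-us-east-1"],
    "payments-lib": ["log4j-core"],
    "reporting": ["log4j-core", "aws-us-east-1"],
    "status-page": ["second-cloud"],
}


def transitive_dependencies(component, depends_on):
    """Everything `component` relies on, directly or indirectly (breadth-first)."""
    seen = set()
    queue = deque(depends_on.get(component, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(depends_on.get(dep, []))
    return seen


def blast_radius(depends_on):
    """Map each dependency to the components affected if that dependency fails."""
    affected = {}
    for component in depends_on:
        for dep in transitive_dependencies(component, depends_on):
            affected.setdefault(dep, set()).add(component)
    return affected


def rank_by_impact(depends_on):
    """Dependencies sorted by how many components they can take down."""
    radius = blast_radius(depends_on)
    return sorted(radius.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

With the sample data, `rank_by_impact(DEPENDS_ON)` lists `log4j-core` first because three components reach it directly or through `payments-lib`, followed by `aws-us-east-1` with two. Entries at the top of such a ranking are the candidates for mitigation, and rerunning the ranking after a proposed fix shows whether the fix has quietly added a new shared dependency.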
Although their advice is self-serving, I agree with Check Point, ThousandEyes and other security and monitoring vendors that the best approach is to redouble application, infrastructure and security monitoring and to have mitigation plans for various scenarios ready when that monitoring detects problems. Indeed, such monitoring can often prevent overreacting to an incident by documenting the extent of an outage or risk. In the case of AWS, for example, the incident degraded a subset of services in one region.
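As a small illustration of documenting the extent of an outage, probe results can be rolled up per region before anyone triggers a failover. This is a sketch under assumptions: the `(region, service, ok)` tuple format and the function name are invented for the example, and the probe data would come from whatever monitoring tool is already in place.

```python
from collections import defaultdict


def summarize_scope(probes):
    """Roll up (region, service, ok) probe results into a per-region status.

    A service counts as failing if any of its probes failed.
    """
    total = defaultdict(set)
    failing = defaultdict(set)
    for region, service, ok in probes:
        total[region].add(service)
        if not ok:
            failing[region].add(service)

    report = {}
    for region, services in total.items():
        bad = failing.get(region, set())
        if not bad:
            status = "healthy"
        elif bad == services:
            status = "down"
        else:
            status = "degraded"
        report[region] = {"status": status, "failing": sorted(bad)}
    return report
```

A report showing one region as "degraded" with a short list of failing services, and every other region as "healthy", supports a targeted response rather than a wholesale evacuation.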
In sum, don't trust, always verify and have multiple contingency plans.