Google Cloud outage caused much Twitter angst, but provides a teachable moment for enterprises

By Kurt Marko June 5, 2019
Summary:
Google's recent cloud downtime can provide some valuable learnings.

Google Cloud services suffered a significant outage on Sunday, June 2nd, and it didn’t take long for Twitter to explode with posts of panic and outrage, illustrating the significant, unexpected ramifications the outage had on a variety of online services. Indeed, the many ways Google Cloud infrastructure has crept into a myriad of services quickly became apparent.

Had the outage not been on a Sunday, the consequences of unresponsive business applications, disrupted communications and angry customers would have been substantial and costly for the numerous enterprises reliant on Google Cloud Platform (GCP) and G Suite. Initial details on the cause or extent of the outage were sparse, coming primarily from Google’s status dashboard, which confirmed that the outage:

  • First affected Google Compute Engine instances in the eastern U.S., but had ramifications on multiple Google services
  • Was caused by “high levels of network congestion”
  • Lasted about 4 hours

The Downdetector website, an independent service monitor, confirmed the length and geographic extent of the outage.

Google comes (partially) clean

Google provided a fuller post mortem on Tuesday, summarizing its origins this way (emphasis added),

The root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.

Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts.
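The triage Google describes can be illustrated with a toy model. This is only a sketch of the general idea (favor small, latency-sensitive flows when capacity shrinks), not Google's actual mechanism; the flow names, demands and capacities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str                # hypothetical flow label
    demand_gbps: float       # bandwidth the flow wants
    latency_sensitive: bool


def triage(flows, capacity_gbps):
    """Admit latency-sensitive flows first, then smaller flows, until capacity runs out."""
    # Sort key: latency-sensitive first (False sorts before True), then smallest demand.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand_gbps))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.demand_gbps <= capacity_gbps:
            admitted.append(flow)
            used += flow.demand_gbps
        else:
            dropped.append(flow)
    return admitted, dropped


flows = [
    Flow("search-queries", 10, True),
    Flow("mail-sync", 20, True),
    Flow("storage-replication", 50, False),
    Flow("video-batch-upload", 40, False),
]

# At a (hypothetical) normal capacity of 120 Gbps everything fits.
normal_admitted, normal_dropped = triage(flows, 120)

# After losing more than half of that capacity, only the small,
# latency-sensitive flows still fit; the bulk transfers are shed.
degraded_admitted, degraded_dropped = triage(flows, 50)
print("dropped:", [f.name for f in degraded_dropped])
```

In this toy run the bulk transfers are the ones shed, which is consistent with Google reporting a far larger drop for Cloud Storage traffic than for YouTube views.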

Google didn’t itemize all of the affected services, but partially quantified the extent in reporting that,

  • YouTube views dropped by 2.5% for one hour.
  • About 1% of Gmail users, which still amounts to millions of accounts, couldn’t send or receive email.
  • Traffic to Google Cloud Storage dropped by 30%

Responsiveness to Google search queries temporarily slowed down in the US, but quickly returned to normal as the company redirected searches to servers in other regions. Google didn’t mention the number of unreachable Compute Engine instances in east coast zones, but given the problems reported with other online services like Snapchat, it was likely substantial. Google also carefully crafts its message to minimize the perceived extent of the outage by presumably citing (we can’t be sure, since it doesn’t specify) drops in total, worldwide service usage and not a more meaningful statistic like the percentage drop in users from the eastern U.S.
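A back-of-the-envelope calculation shows why the denominator matters. The 2.5% figure is Google's; the regional traffic shares are purely hypothetical, since Google didn't publish any, and the sketch assumes the entire global drop occurred inside the affected region.

```python
def regional_drop(global_drop, regional_share):
    """Drop within the affected region, assuming the whole global drop occurred there."""
    if not 0 < regional_share <= 1:
        raise ValueError("regional_share must be in (0, 1]")
    return min(global_drop / regional_share, 1.0)


# 0.025 is the reported worldwide YouTube drop; the shares are made-up examples.
for share in (0.10, 0.25, 0.50):
    drop = regional_drop(0.025, share)
    print(f"region carries {share:.0%} of traffic -> regional drop {drop:.1%}")
```

Under these assumptions, the smaller the affected region's share of worldwide traffic, the larger the local impact hidden behind the same global percentage.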

Don’t over-complicate the solution

Sadly, this isn’t the first time we’ve seen such an outage — remember AWS two years ago? — and it won’t be the last, demonstrating that even the most sophisticated infrastructure operators in the world make mistakes and have unanticipated problems. The domino effect on other public services highlights the need for system and application architects to put much more effort into designing resilience and redundancy into cloud deployments and not assume that cloud services are an utterly reliable resource.

It didn’t take long before a chorus of commenters and a stream of marketing PR pitches began singing the praises of hybrid- and multi-cloud infrastructure. Although I’m a big believer in the long-term efficacy of multi-cloud designs, I agree with Andreessen Horowitz Board Partner Steven Sinofsky that reflexively chanting “multi-cloud” in this case is unwarranted. Not only is the track record of Google, AWS and Azure much better than the availability of most enterprise infrastructure, but the length of enterprise IT’s typical scheduled downtime, which isn’t part of Google’s vocabulary, often dwarfs that of unplanned events.
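Some simple availability arithmetic puts the comparison in perspective. The four-hour figure is the approximate length of Sunday's outage; the monthly maintenance window is a hypothetical enterprise schedule, and the sketch assumes no other downtime during the year in either case.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


# One 4-hour unplanned outage in a year (roughly Sunday's incident).
cloud = availability(4)

# Hypothetical enterprise: a 4-hour scheduled maintenance window every month.
enterprise = availability(4 * 12)

print(f"single 4-hour outage:       {cloud:.3%}")
print(f"monthly 4-hour maintenance: {enterprise:.3%}")
```

With these assumed numbers, the scheduled windows add up to twelve times the downtime of the single unplanned event.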

Google provides many HA options that don’t require multi-cloud

Like its largest competitors, Google offers many redundancy and data replication options for its various services and provides documentation on how to build “scalable and resilient applications” using GCP. As the design guide explains (emphasis added),

A highly-available, or resilient, application is one that continues to function despite expected or unexpected failures of components in the system. If a single instance fails or an entire zone experiences a problem, a resilient application remains fault tolerant—continuing to function and repairing itself automatically if necessary. Because stateful information isn’t stored on any single instance, the loss of an instance—or even an entire zone—should not impact the application’s performance. A truly resilient application requires planning from both a software development level and an application architecture level.

There are three primary techniques for hardening applications against failures such as Sunday’s:

  • Employing load balancers to monitor servers and reroute traffic from offline or degraded servers to those in either another data center (zone in cloud parlance) or region that can best handle it.
  • Deploying VMs and other so-called zonal resources in multiple regions (see this explanation of the geographic extent and redundancy of various GCP services for details).
  • Using a robust, distributed storage service such as Google Cloud SQL which replicates data across multiple zones in a region, and setting up replicas in other regions to protect against systemic network failures such as Sunday’s. Alternatively, use a multi-region database like Cloud Spanner. Also consider using Google’s dual-region option for its Cloud Storage object store, which is in beta.
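The first two techniques can be sketched as a toy routing function. This is a simplified illustration, not the behavior of any Google load balancer: the `Backend` class and `pick_backend` function are hypothetical, the region and zone names are only examples, and a real load balancer would set the health flags from periodic probes.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    region: str
    zone: str
    healthy: bool = True  # in practice, updated by periodic health checks


def pick_backend(backends, client_region):
    """Prefer a healthy backend in the client's region, else fail over to any healthy one."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends in any region")
    local = [b for b in healthy if b.region == client_region]
    return (local or healthy)[0]


# VMs deployed in two zones of one region plus a second region.
backends = [
    Backend("us-east1", "us-east1-b"),
    Backend("us-east1", "us-east1-c"),
    Backend("europe-west1", "europe-west1-b"),
]

# A single zone failure is absorbed inside the region.
backends[0].healthy = False
zone_failover = pick_backend(backends, "us-east1")

# A region-wide failure pushes traffic to the other region.
backends[1].healthy = False
region_failover = pick_backend(backends, "us-east1")
print(zone_failover.zone, "->", region_failover.zone)
```

The second failover only works because capacity already exists in another region, which is the point of the second bullet.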

The cause of Sunday’s outage, namely a networking problem that spread across several regions, is a complicating factor and likely the reason that its ramifications were so widespread. After all, it’s not like the developers and cloud architects at Apple, Google and Snapchat don’t know how to design redundant systems that span several availability zones.

However, before jumping to a multi-cloud solution, it’s still easier to design multi-region redundancy into a single cloud deployment. For example, the GCP documentation illustrates a multi-region Web app design which uses a combination of global load balancing to front end servers in different regions and a backend database on Cloud Spanner, Google’s innovative distributed relational database service that includes an option for three-continent, multi-regional automatic replication.

Source: GCP documentation; Best practices for Compute Engine regions selection
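A rough probability sketch shows both the appeal and the limit of multi-region redundancy. Every number below is hypothetical; the model simply assumes each region fails independently with the same probability, plus an optional common-cause failure (such as a bad configuration push) that takes out all regions at once.

```python
def all_regions_down(p_region, n_regions, p_common=0.0):
    """Probability that every region is unavailable at the same time.

    p_region: chance a single region is down from an independent cause.
    p_common: chance of a shared failure that takes out all regions together.
    """
    independent = p_region ** n_regions
    # Either the common-cause event happens, or it doesn't and
    # all regions happen to fail independently.
    return p_common + (1 - p_common) * independent


for n in (1, 2, 3):
    no_shared = all_regions_down(0.001, n)
    with_shared = all_regions_down(0.001, n, p_common=0.0001)
    print(f"{n} region(s): independent only {no_shared:.2e}, with common cause {with_shared:.2e}")
```

With independent failures, each added region cuts the risk by orders of magnitude. Once a common cause is in play, as it was on Sunday, that shared risk sets a floor that extra regions within the same provider cannot remove, although in this toy model it remains far smaller than the single-region risk.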

My take

Cloud infrastructure services can be seen as computing and data utilities, albeit ones that are more complex, fragile and prone to disruption than traditional municipal utilities like electricity and water. Unlike these physical resources, cloud infrastructure services from the mega-providers like Google, AWS and Microsoft Azure are supremely malleable and globally distributed, features that allow services to be configured in complex, redundant topologies that considerably reduce the probability of application failure.

When deploying systems to the cloud, enterprises must carefully assess the trade-off between reliability on one hand and cost and complexity on the other. Just as the systems on a manned spacecraft require far more redundancy than a passenger car, some applications are so critical to the business that avoiding even short periods of downtime can justify a much more elaborate cloud implementation. Indeed, this old Cloudify blog post has a nice illustration of the hierarchy of cloud infrastructure implementation and management complexity.

The recent Google Cloud outage is the latest teachable moment for enterprise IT architects and application developers. For many of them, Sunday’s incident went unnoticed, but for organizations adversely affected, leaping to the most complicated, costly, gold-plated multi-cloud design isn’t the proper response when there are less drastic options for building resilient cloud infrastructure.