New data demonstrates generally superb cloud connectivity, but quirks remain

By Kurt Marko, November 13, 2018
Summary:
Quantifying cloud performance becomes vitally important once infrastructure at AWS, Azure and Google Cloud displaces company-owned and operated systems.

Using cloud services, whether for infrastructure or applications, can seem so magical that it’s easy to forget that the cloud isn’t some ethereal, omnipresent force, but a manifestation of physical and logical devices.

Indeed, the largest cloud services are vast, complex systems composed of millions of servers in dozens of data centers and connected by hundreds of network segments over miles and miles of fiber.

These physical realities become important once organizations start using cloud services for business applications, particularly those designed for external customers scattered around the globe. However, ascertaining the effect of cloud vendor and location decisions on application performance for users in various regions is such a daunting task that few organizations attempt it, at least on a comprehensive scale.

The global, distributed nature of cloud services makes them well suited for today's mobile users and composite workloads. However, the complexity and idiosyncrasies of different cloud implementations can yield surprising, sometimes unpredictable results for users in certain locales. Unfortunately, comparing cloud network performance and consistency is a painstaking, time-consuming activity, which means that most organizations have dispensed with a systematic approach when choosing cloud providers and regional locations in favor of ad hoc rules and common sense.

New data from a methodical study of cloud network performance partially fills this void, providing enterprise cloud architects with some quantitative comparative data on which to base sourcing and siting decisions. ThousandEyes, a developer of network monitoring and analysis software, has done the most exhaustive study of cloud network performance to date, showing the three dominant cloud providers to be comparable, but with some differences that could be significant for particular applications and user groups.

Even utilities can vary

Cloud infrastructure services are legitimately described as a form of IT utility. However, as we know from decades of experience with industrial-era electric and wireline utilities, performance and reliability can vary among providers and between geographic regions. The same holds for cloud services, as the ThousandEyes study of Amazon AWS, Microsoft Azure and Google Cloud (GCP) demonstrates.

According to a post announcing the results, ThousandEyes used its Network Intelligence platform to monitor locations around the Internet and instances deployed at the Big 3 cloud providers. These were used to make bi-directional measurements of network latency, packet loss (quality) and jitter (variability), with the tests done both end-to-end, from a remote client to a cloud data center, and over the individual network hops making up the overall path. In total, ThousandEyes tested 27 global user locations and 55 cloud regions: 15 each at AWS and GCP, and 25 at Azure.
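To make the three metrics concrete, here is a minimal sketch of how latency, packet loss and jitter could be derived from one batch of probe results. It is not ThousandEyes' method, which the post does not spell out. The function name `summarize_probes`, the input layout (a list of round-trip times in milliseconds, with `None` for a lost probe) and the jitter definition (mean absolute difference between consecutive received round-trip times) are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one batch of probes (hypothetical helper).

    rtts_ms holds one round-trip time per probe in milliseconds;
    None marks a probe that never came back (a lost packet).
    """
    if not rtts_ms:
        raise ValueError("need at least one probe")

    received = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that got no reply, as a percentage.
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)

    # Latency: average round-trip time of the probes that did return.
    latency_ms = mean(received) if received else None

    # Jitter (assumed definition): mean absolute difference between
    # consecutive received round-trip times.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else None

    return {"latency_ms": latency_ms, "loss_pct": loss_pct, "jitter_ms": jitter_ms}


if __name__ == "__main__":
    # Four probes, one lost: invented sample values, not data from the report.
    print(summarize_probes([10.0, 12.0, None, 11.0]))
```

Running the same summary per user location and per cloud region would yield the kind of comparison grid the study describes, though a real measurement system would also need to handle clock precision, probe scheduling and per-hop attribution.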


Realizing that enterprises often replicate resources across cloud data centers, availability zones, regions and even between services, ThousandEyes also wanted to test performance within and between cloud providers. It explains the methodology as follows (emphasis added):

We also knew that deploying applications in the cloud means redundancy and load balancing, so second, we tested Inter-AZ performance. Many organizations need to regionalize their cloud deployments to serve geographically diverse user bases and often use tiered architectures where storage or database components are centralized while compute is distributed. So to look at this angle, we (third) tested Inter-Region performance within each cloud provider’s network — 15 in AWS, 25 in Azure and 15 in GCP. Finally, it has become clear to us that multi-cloud is fast becoming a reality, so we tested performance between all regions of all three providers.


Macroscopically similar, but some regional differences

The ThousandEyes report focuses on network latency and variability (jitter), probably since throughput is so dependent on a customer's provisioned capacity and terminating circuits. Given their domestic origins and the fact that the U.S. and Europe are their largest markets, it's not surprising how competitively similar AWS, Azure and GCP are for workloads hosted in eastern-US and UK locations, with latencies within a few percent of each other for destinations within the same geography.

Things diverge when looking at performance in less-developed regions like Asia, Oceania and South America, where overall network latency is markedly higher. Likewise, latency variability, i.e. jitter, is generally higher in Asia, particularly for AWS, which is much worse than its competitors there. The reason stems from each provider's network design: Microsoft and Google attempt to route traffic as far as possible over their internal networks, whereas AWS relies on the public Internet for delivery to the destination region. As the report describes it, AWS' reliance on a best-effort delivery network, i.e. the Internet, has deleterious ramifications (emphasis added):

Deployments with an increased reliance on and exposure to the Internet, a best effort network, are subject to greater operational challenges and risks. Analysis of network path data reveals contrasts in cloud connectivity architectures between AWS, Azure and GCP, primarily around the level of Internet exposure in the end-to-end network path.

AWS network design forces traffic from the end user through the public Internet, only to enter the AWS backbone closest to the target region. This behavior is in stark contrast to how Azure and GCP design their respective networks. In the latter, traffic from the end-user, irrespective of geographical location, is absorbed into their internal backbone network closest to the user, relying less on the Internet to move traffic between the two locations.

ThousandEyes speculates that AWS designed its network to offload traffic to the Internet as soon as possible because it shares the backbone with Amazon.com and thus wanted a system that wouldn’t be overloaded during peak shopping periods.

As expected, each provider has stellar performance between the availability zones in a region, which are geographically close but distinct data centers, with latencies in the millisecond range. Between regions, ThousandEyes found that latencies tracked physical distance, since the laws of physics make no exceptions, even for the most technologically advanced organizations. However, each provider’s global network is sufficiently reliable that there is negligible packet loss and jitter between regions. Indeed, peering relationships between the three mean that even inter-cloud performance is incredibly reliable, with less than 0.01 percent packet loss and 0.50 ms of jitter.
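The link between distance and latency can be made concrete with a back-of-the-envelope calculation. The sketch below estimates a lower bound on round-trip time from the great-circle distance between two points, assuming light travels through optical fiber at the vacuum speed of light divided by a refractive index of about 1.47, a typical figure for silica fiber that does not come from the report. The function names and the example coordinates (approximate values for London and New York) are illustrative assumptions. Real paths are longer than the great circle and add queuing and processing delay, so measured latencies will be higher.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
EARTH_RADIUS_KM = 6_371.0          # mean Earth radius


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km: float, refractive_index: float = 1.47) -> float:
    """Lower bound on round-trip time over fiber for a given one-way distance.

    refractive_index=1.47 is an assumed typical value for silica fiber.
    """
    speed_in_fiber_km_s = SPEED_OF_LIGHT_KM_S / refractive_index
    one_way_s = distance_km / speed_in_fiber_km_s
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    # Approximate city coordinates, chosen only for illustration.
    london = (51.5074, -0.1278)
    new_york = (40.7128, -74.0060)
    d = great_circle_km(*london, *new_york)
    print(f"distance: {d:.0f} km, minimum RTT: {min_rtt_ms(d):.1f} ms")
```

As a rule of thumb under these assumptions, every 1,000 km of fiber adds roughly 10 ms of round-trip time that no amount of provider investment can remove, which is why siting decisions matter for latency-sensitive applications.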

My take

I suspect that seeing such minutiae gives IT executives the thousand-yard stare, but before dismissing it as in-the-weeds details unworthy of their time, IT leaders should learn a few lessons.

  • When moving enterprise applications, particularly those running critical business processes, to infrastructure and systems you don’t control, it’s imperative that cloud users understand the implications for performance and reliability of every usage scenario they are likely to encounter.
  • The growing scale and geographic diversity of cloud infrastructure promises greatly improved application performance for a global user base, but only for those organizations making the effort to understand and quantify the implications of resource siting and replication decisions. The days of gut feel and ad hoc rules of thumb are over.
  • The network link between users and cloud services is invariably the weak one, responsible for the majority of performance degradation and variability. Given its reliance on the Internet for the network hops between users and the data center hosting a service, AWS is inherently less consistent and reliable than Azure or GCP, which carry traffic over their private backbones from the POP nearest the user. The situation is most acute for users in less-developed locales and further aggravated when hosting services in only one region.
  • Cloud infrastructure is continuously improving, with each provider furiously spending to expand and improve. Thus, today’s shortcomings might not exist tomorrow. For example, the report found that AWS cut the average inter-AZ latency (data center-to-data center) in one European region by a factor of 5, to 1 ms, between 2017 and this year.
  • Despite being intense competitors, the major cloud providers are tightly connected physically, if not metaphorically, meaning that multi-cloud deployments won’t be limited by networking, but rather by the complexity of adapting, migrating and managing workloads across disparate services.

Quantifying cloud performance becomes vitally important once infrastructure at AWS, Azure and Google Cloud displaces company-owned and operated systems. Detailed networking data such as that in the ThousandEyes report is a valuable addition to the collection of cloud metrics, but must be supplemented with system- and application-level benchmarks to provide organizations with quantitative support for cloud buying and design decisions.