Cloud infrastructure isn’t created equal - many factors affect network performance to and from different providers
- Summary: Pivoting off a detailed report on cloud network performance to draw some recommendations for cloud users.
Nothing strips the cloud’s ethereal aura like networking, laying bare its physical limitations and vagaries. When it comes to moving data to and from cloud infrastructure, all the technology Google, Amazon and Microsoft can muster can’t overcome the physics of photons traversing a set distance.
Given today’s vast global Internet with hundreds of carriers, ingress POPs, cloud regions, continental and undersea cables, and vendor peering/cross-connection arrangements, characterizing cloud network performance is a complex, multivariate problem that defies quick solutions and simple summarization. The best we can do is test as many of the most common scenarios as possible and try to wrestle through the messy details. Fortunately, the network measurement experts at ThousandEyes have done the legwork of setting up an intricate fleet of measurement sites and test scenarios, sharing their work in what has become an annual cloud performance benchmark report.
There’s a lot to digest in the report’s 60 pages of data and test descriptions. However, as I discussed after last year’s inaugural release, the performance of the three largest cloud providers, AWS, Azure and Google Cloud (GCP), is quite similar in general but very different in the specifics. In technology as in life, we might think in generalities, but life is lived in the details. Thus, the cloud performance an organization experiences depends heavily on the location of its facilities and customers, paired with its cloud deployment choices. Furthermore, as the report illustrates, cloud networking isn’t static, since vendors regularly make changes to their software and infrastructure.
Filling in the picture of cloud network performance - what’s changed
As a reminder of ThousandEyes’ methodology and the providers’ general similarities, here’s what I wrote last year (emphasis added):
Macroscopically similar, but some regional differences. The ThousandEyes report focuses on network latency and variability (jitter), probably since throughput is so dependent on a customer's provisioned capacity and terminating circuits. Given their domestic origins and the fact that the U.S. and Europe are their largest markets, it's not surprising how competitively similar AWS, Azure and GCP are for workloads hosted in eastern-US and UK locations, with latencies within a few percent of each other for destinations within the same geography.
Things diverge when looking at performance in less-developed regions like Asia, Oceania and South America, where overall network latency is markedly higher. Likewise, latency variability, i.e. jitter, is generally higher in Asia, particularly for AWS which is much worse than its competitors.
The 2019 performance report covers the same big-three cloud vendors and networking scenarios as last year’s, but includes some useful and needed additions, notably:
- Measurements for Alibaba Cloud and IBM Cloud, often ranked as the 4th- and 5th-largest cloud providers.
- Data for connections to and from China, including both cloud sites and user POPs. Specifically, it includes measurements from:
- 7 Alibaba regions and 9 user locations, including POPs at China Telecom and China Mobile
- Measurements from six domestic ISPs, namely AT&T, Verizon, Comcast, CenturyLink, Cox and Charter, taken from six cities (Ashburn, Chicago, Dallas, Los Angeles, San Jose and Seattle). Given the domestic nature of the ISPs, measurements were confined to cloud regions in North America.
- Tests of AWS Global Accelerator, a service introduced at re:Invent 2018 that works much like GCP’s Premium Network Tier by optimizing traffic routes to primarily use AWS’s private network rather than the public Internet.
As it did last year, ThousandEyes found that overall network performance at all five providers was “strong”, i.e. generally good to excellent. However, it noted that the numerous tested scenarios (such as between region pairs, from a wide variety of client sites and the largest U.S. ISPs, and between different providers) revealed significant differences, both between vendors and within a single service. With last year’s data to go on, the report also found that cloud networks aren’t static, with providers continually making architectural changes, sometimes significant ones.
Year-to-year performance comparison
Since each cloud provider operates a massive fiber backbone with many terabits per second of throughput, testing bandwidth isn’t meaningful. Instead, ThousandEyes measured latency and jitter (latency variability). A good way to compare the providers’ network backbones is thus to measure latency and its variability between their cloud regions. Of the three providers tested across both years, the 2019 report found that Google Cloud’s inter-region latency improved the most, by 36 percent on average, with Azure better by 29 percent. In contrast, AWS showed no meaningful difference from year to year.
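To make those two metrics concrete, here’s a minimal sketch of the kind of probe that produces them: time repeated TCP handshakes to an endpoint in the target region, then report the mean round-trip latency and its standard deviation as a rough jitter figure. This illustrates the metrics, not ThousandEyes’ actual methodology, and the endpoint is just an example of a reachable host in a given region:

```python
import socket
import statistics
import time

def tcp_rtt_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Time a single TCP handshake as an approximation of round-trip latency."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # handshake done; we only care how long it took
    return (time.perf_counter() - start) * 1000  # milliseconds

def probe(host: str, samples: int = 20) -> None:
    """Report mean latency and jitter (stdev of RTTs) over repeated probes."""
    rtts = [tcp_rtt_ms(host) for _ in range(samples)]
    print(f"{host}: mean {statistics.mean(rtts):.1f} ms, "
          f"jitter {statistics.stdev(rtts):.1f} ms")

# Example endpoint: any reachable host in the region you want to characterize.
probe("ec2.eu-west-1.amazonaws.com")
```

Run from each of your sites against each candidate region, even a crude probe like this exposes the kind of regional differences the report documents.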
Last year, I noted that AWS lagged its two competitors in performance to Asia, where its latency and jitter were markedly higher. AWS has closed that performance gap, but this year’s measurements reveal a new set of Asian anomalies in China and India. This time GCP is the laggard, particularly in connections to and from Europe. The report found that:
Google Cloud exhibits 2.5x the network latency in comparison to AWS, Azure, Alibaba Cloud and 1.75x higher than IBM Cloud from Europe to regions in Mumbai, India and Chennai, India. Similarly, GCP users from Africa generally experience higher latency when connecting to its data center in India.
By tracing the network path of connections originating in Spain, England and South Africa, ThousandEyes found a strange cause: GCP backhauls traffic from Europe and Africa to the U.S. before routing it on to India. In contrast, Azure does the logical thing and routes traffic through its European hubs directly to India.
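Path tracing of this sort relies on the classic traceroute mechanism: send probes with increasing IP time-to-live (TTL) values and record which router reports each expiry, so the sequence of hop addresses reveals where traffic physically travels. Here’s a minimal sketch that wraps the system traceroute utility (assumed to be installed; the target hostname is illustrative):

```python
import subprocess

def trace(host: str, max_hops: int = 30) -> list[str]:
    """Run the system traceroute and return one line per hop.

    Each hop is a router that answered when the probe's TTL expired, so the
    address sequence shows where traffic physically goes -- e.g., U.S. backbone
    addresses appearing on a Europe-to-India path would reveal backhaul.
    """
    result = subprocess.run(
        ["traceroute", "-n", "-m", str(max_hops), host],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[1:]  # drop the header line

# Illustrative target: a host in a GCP India region under test.
for hop in trace("asia-south1-run.googleapis.com"):
    print(hop)
```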
Other cloud network oddities
The report is full of juicy data that’s a mix of unsurprising confirmations and curious oddities. Some highlights include:
- Traffic to China is significantly degraded by the Great Firewall with packet loss into China several times higher than that for other regions, regardless of cloud provider. In contrast, traffic within China, i.e. from one of the Alibaba regions to various monitoring locations, does not suffer this performance penalty. As expected, Alibaba has the lowest latency within China, about 30 to 40% better than Azure and AWS.
- AWS Global Accelerator sometimes works in reverse. While connections from Korea were dramatically improved, those from San Francisco were only marginally so. Furthermore, AWS users hosting in India will want to avoid it like the plague, since Global Accelerator increased latency and jitter by 50 to 60 percent. ThousandEyes speculates the variability is due to AWS’s differing peering relationships with ISPs around the world and whether or not they support Global Accelerator. (A sketch of how to test the service from your own vantage points follows this list.)
- Google Cloud has made changes to its packet handling that obscure path visibility: it modifies the packet TTL parameter in a way that prevents collecting metrics about traffic hops. Furthermore, the report notes that, “This behavior was not observed to be consistent across all GCP hosting regions.”
- Traffic performance from major U.S. ISPs to the cloud providers is generally excellent, but latency-inducing oddities still exist. For example, traffic from Verizon sites on the West Coast to Google Cloud’s us-west2 region is routed to the Google backbone in New Jersey before making its way back across the country. The result is 3 to 10 times the latency of traffic from other ISPs.
- Multi-cloud connectivity varies across providers and geographies. While the three largest vendors (AWS, Azure and GCP) peer directly with one another, IBM Cloud and Alibaba Cloud don’t peer with all of their competitors and must route over ISPs for many cloud-to-cloud connections.
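Given the Global Accelerator variability noted above, the practical move is to measure before you buy. Here’s a minimal sketch of such a test, comparing round-trip latency to a regional endpoint against the accelerator’s static anycast address; the hostname and IP below are placeholders for your own deployment, and TCP handshake timing stands in as a rough proxy for network round-trip time:

```python
import socket
import statistics
import time

def mean_rtt_ms(host: str, port: int = 443, samples: int = 10) -> float:
    """Approximate round-trip latency via repeated TCP handshakes."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2.0):
            pass  # handshake complete; close immediately
        rtts.append((time.perf_counter() - start) * 1000)
    return statistics.mean(rtts)

# Placeholders: a regional load-balancer hostname and the static anycast
# IP assigned to a hypothetical Global Accelerator deployment.
direct = mean_rtt_ms("my-app.ap-south-1.elb.amazonaws.com")
accelerated = mean_rtt_ms("75.2.0.1")
print(f"direct: {direct:.1f} ms, via Global Accelerator: {accelerated:.1f} ms")
```

Repeating the comparison from each client location that matters is exactly how you would catch a case like India, where the accelerated path was the slower one.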
My take
The five points I made about last year’s cloud networking report remain valid. However, I will restate how critical it is for cloud users to quantify performance to their chosen providers, since this year’s data reiterates how much network behavior can vary by ISP, by the location of both users and cloud regions, by a cloud vendor’s particular network topology and routing decisions, and by the local availability of network services such as AWS Global Accelerator. Indeed, as the data show, users don’t always benefit from purchasing an optional service like Global Accelerator or Direct Connect (or equivalent), given the peculiarities of their infrastructure and ISP choices.
As the report demonstrates, tools such as ThousandEyes and its competitors can overwhelm users with data and are overkill for those with simple needs. However, conventional network monitoring software, used with the right test plan, can provide enough data to make intelligent choices about a cloud vendor, region selection and cloud network configuration.
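As an example of what “the right test plan” might look like without a heavyweight platform, the sketch below probes a handful of candidate region endpoints and appends timestamped results to a CSV; scheduled via cron from each of your sites, a few days of such data is enough to compare regions meaningfully. The endpoint hostnames are illustrative (the Azure one is hypothetical); substitute reachable hosts for the regions you’re evaluating:

```python
import csv
import socket
import time
from datetime import datetime, timezone

# Illustrative candidates -- substitute one reachable host per region under test.
CANDIDATES = {
    "aws-us-east-1": "ec2.us-east-1.amazonaws.com",
    "aws-us-west-2": "ec2.us-west-2.amazonaws.com",
    "azure-westus": "example-app.westus.cloudapp.azure.com",  # hypothetical
}

def handshake_ms(host: str) -> float:
    """Time one TCP handshake to port 443, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, 443), timeout=2.0):
        pass
    return (time.perf_counter() - start) * 1000

with open("region_latency.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for region, host in CANDIDATES.items():
        try:
            rtt = f"{handshake_ms(host):.1f}"
        except OSError:
            rtt = ""  # connection failed; a blank cell records the loss
        writer.writerow([datetime.now(timezone.utc).isoformat(), region, rtt])
```

Comparing the resulting latency distributions, rather than single runs, is what turns a pile of pings into a defensible region choice.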
Furthermore, cloud vendors see the need for better network management tools and several have introduced services to simplify the task. For example, Azure Network Watcher is a serviceable alternative for simple needs, while Google Cloud has just announced Network Intelligence Center with more advanced features such as network topology visualization, connectivity testing and a performance dashboard.
Cloud services seem like a utility, but the non-uniformity of today’s Internet and cloud implementations means that connecting to them is far more nuanced than plugging an appliance into an electrical outlet. Getting the most out of your cloud dollar requires understanding those networking nuances.