Arpanet, the progenitor of the Internet, was planned during the Cold War to keep operating under the worst of scenarios: a nuclear attack. Thankfully, the Internet’s resilient design hasn’t been tested under such extreme conditions, but it’s withstood many stress tests over the decades.
However, none has been more severe, nor more critical to our collective economic and psychological well-being, than the coronavirus pandemic and the subsequent worldwide lockdowns. Life was already increasingly lived online, but within days the Internet became the lifeline linking people’s isolated physical existence to work, school, shopping and entertainment.
Governments, carriers, ISPs and cloud service providers have invested hundreds of billions of dollars in both the public Internet backbone and internal fiber networks over the past two decades as the dot-com bust turned into an online gold rush. The resulting infrastructure enabled pervasive access to wired broadband and wireless service and transformed high-speed Internet connectivity from a luxury into a necessity. 53% of U.S. adults say Internet access has been essential during the COVID lockdowns. Indeed, it’s shocking how many people would cut off a finger before cutting off their Internet connection.
Stress testing the Internet
The COVID crisis has provided the best opportunity yet to assess the stability, scalability and adaptability of all facets of the Internet, including the backbone network, edge connections and cloud services. Anecdotal evidence has been positive. Most people, myself included, have been able to conduct online research, attend video conferences, interact on social networks and binge-watch The Sopranos without disruption. Occasional outages on sites with extreme spikes in usage have typically been brief with service rapidly restored. However, systematic reviews of Internet performance are now available, showing the system’s remarkable resilience under extreme duress.
As with its earlier study of cloud provider networks, ThousandEyes provides the most thorough review of network performance in a new Internet Performance Report covering ISPs, public cloud operators, CDNs and DNS providers. Recently acquired by Cisco, the company developed an innovative set of monitoring software and agents that collect performance measures such as packet loss, latency, jitter and multi-hop path metrics. Thousands of collection points around the globe give the company broad visibility into network performance across the Internet.
Unlike earlier work on cloud network performance, here ThousandEyes focuses on outages, which it defines as “100% packet loss in the same ASN [IP prefixes belonging to a particular organization] during a given period of time.” It uses multiple probing mechanisms and analysis algorithms to prevent false positives and to isolate outages to a particular location and IP range. The report looked at Internet performance and outages across three dimensions:
- Inter-region measurements between ISPs
- Broadband metrics from two providers across six U.S. cities
- Inter-region cloud service provider (CSP) measurements covering AWS, Azure and GCP in North America, Europe and APAC
In sum, it’s thorough.
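To make the metrics above concrete, here is a minimal sketch of how a probe-based monitor might derive packet loss, jitter and the report’s outage criterion from raw measurements. The function names, the jitter formula (mean absolute difference of consecutive RTTs, in the spirit of RFC 3550) and the data shapes are my illustrative assumptions, not ThousandEyes’ actual implementation.

```python
from statistics import mean

def packet_loss(sent, received):
    """Fraction of probes lost in a measurement window."""
    return 1.0 - received / sent

def jitter(rtts_ms):
    """Mean absolute difference between consecutive RTT samples --
    a common jitter approximation (cf. RFC 3550), assumed here."""
    return mean(abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:]))

def asn_outage(loss_by_prefix):
    """Apply the report's outage definition: 100% packet loss across
    every monitored prefix in the same ASN during the window."""
    return all(loss >= 1.0 for loss in loss_by_prefix.values())

# Illustrative probe data (prefixes are documentation addresses).
window = {"203.0.113.0/24": 1.0, "198.51.100.0/24": 1.0}
print(asn_outage(window))                      # True: total loss in both prefixes
print(asn_outage({"203.0.113.0/24": 0.4}))     # False: partial loss is degradation, not outage
```

The all-prefixes condition is what separates a genuine ASN-wide outage from localized congestion, which is also how the report avoids counting ordinary packet loss as downtime.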
ThousandEyes’ report validates anecdotal experience that although the Internet might have bent under that added load, the coronacrisis couldn’t break it. As the company’s marketing director, Angelique Medina, puts it, “Doomsday Averted; The Internet Is Fine.” As she details in a blog (emphasis added):
Despite early fear and speculation immediately following pandemic-related lockdowns, the state of the Internet was and remains healthy, with our network measurements (taken via active probing) showing little evidence of systemic network duress, even when traffic shifts and volumes were at their peak. With few exceptions, Internet-related infrastructures over the last six months have held up well.
An early 63% spike in outages appears to have been caused by carriers and cloud providers reconfiguring networks to better accommodate changing traffic patterns, aka traffic engineering. The number of outages has declined since March but remains elevated. The theory that reconfiguration caused the ISP outages rests on two factors:
- A higher than expected number of ISP disruptions occurred outside business hours.
- Most outages were quite brief, with 95% lasting less than 20 minutes.
Cloud providers fared better than ISPs, at least until cloud outages doubled in June. When cloud outages do occur, however, the report found they are more likely to affect users: they happen more often during weekday business hours and last longer than ISP disruptions.
There are also notable differences in outage frequency between regions. ISPs in North America and APAC showed significant increases in March and April, after which reliability steadily improved. In contrast, outages in Europe steadily increased throughout the spring before improving at the end of June. Notably, since March, half of the cloud service disruptions in North America have happened during business hours (9am-6pm ET).
Overall, ThousandEyes gives all the key providers of Internet connectivity and services a passing grade for adapting to unprecedented circumstances. As the report concludes (emphasis added):
Despite an increase in network disruptions post-pandemic, the state of the Internet is healthy. The networks of services critical for modern application delivery, such as CDN providers, continued to be highly available, mitigating the load on Internet backbone infrastructure. ISPs implemented network changes to meet service needs, while in many instances minimizing the disruptive impact of these changes on businesses. Overall, Internet-related infrastructures have held up well, suggesting overall healthy capacity, scalability, and operator agility needed to adjust to unforeseen demands.
Other data, from Ookla, confirms ThousandEyes’ conclusion that the Internet rapidly adapted to the early chaotic shifts in demand and traffic patterns. The company, known for its SpeedTest apps and website, tracked users’ aggregate performance for the six months between mid-December and mid-June. Between the end of February and late March, the SpeedTest data showed about a 7% drop in the average speed of wired connections in the U.S. Since the early spring, however, speeds have steadily increased and are now 14% higher than at the beginning of January. Speed numbers for Italy, perhaps the hardest-hit country in Europe, recorded an even steeper 15% drop in March, only to mount a similar late-spring recovery.
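The dip-and-recovery arithmetic behind those percentages is easy to reproduce. The speed figures below are illustrative placeholders chosen to match the reported percentages, not Ookla’s published numbers.

```python
def pct_change(old, new):
    """Percent change from a baseline value to a new value."""
    return (new - old) / old * 100.0

# Hypothetical average wired download speeds in Mbps -- placeholders
# picked to reproduce the article's percentages, not real Ookla data.
early_jan = 120.0
late_march = 111.6   # about 7% below the January baseline
mid_june = 136.8     # about 14% above the January baseline

dip = pct_change(early_jan, late_march)       # roughly -7.0
recovery = pct_change(early_jan, mid_june)    # roughly +14.0
```

Note that both figures are measured against the same January baseline, so the 14% gain is not simply the 7% dip reversed.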
SpeedTest also tracks significant service disruptions to online applications and telecom services (Q1 2020 here, Q2 2020 here). Although most of these can be tied to the lockdowns and the concomitant explosion in remote work, two outages at Zoom stand out. The closure of offices and the elimination of business travel instantly transformed Zoom and other video conferencing services into critical infrastructure. Unlike Microsoft Teams and Google Meet, which had ready access to an army of infrastructure engineers and a reservoir of cloud resources, Zoom appeared much less prepared for the explosion in traffic.
SpeedTest recorded two notable, but brief, Zoom outages that were likely caused, at least in part, by exponential increases in usage. I hedge because Zoom was also working through a series of self-inflicted security snafus whose mitigation likely included significant, hastily planned changes to its system configuration.
The first Zoom outage, on March 20, lasted about an hour in the early afternoon Pacific time and affected only U.S. users. The second, more widespread event, in May, lasted more than four hours in the middle of the day Eastern time and disrupted users in both the U.S. and Europe. Scaling infrastructure designed for a few million users to accommodate one hundred times that number within a few months is a Herculean task. It’s understandable that Zoom would have some hiccups along the way.
When the world was suddenly thrust into a physically isolated, online existence, all facets of online technology, including the Internet backbone, broadband and wireless infrastructure, cloud resources and online services, underwent a stress test none predicted and few if any fully prepared for. With several months of hindsight, we now see that the Internet, in the collective sense of the term, passed with flying colors. Postmortem analysis shows that the degree of performance and reliability degradation was far less than we feared in the early days. Such success is a testament to a resilient design with ample capacity and the talent of thousands of IT professionals.
Although the early days of the crisis were undoubtedly spent fighting fires, now that widespread remote work looks to be an ongoing reality until next year, let’s hope that infrastructure operators have transitioned to proactively addressing systemic problems. More stress tests await as primary and secondary education stays virtual, business meetings remain in video rooms and a (mostly) online holiday shopping season awaits.