The engines powering the modern data center have long been made by Intel. However, the increasing diversity of workloads, coupled with this decade’s secular deceleration of x86 performance improvements, has prompted processor designers, application developers and infrastructure operators to look for alternatives.
While Intel won’t lose its dominant position in data centers anytime soon, several trends eroding its position have been on full display recently, including:
- Traditional general-purpose CPUs are increasingly being supplemented or displaced by processors designed to accelerate particular workloads.
- The combination of API-accessible cloud services and processor-independent software libraries has interposed abstraction layers between application developers and the system architecture, thus facilitating experimentation with various hardware architectures.
- A resurgent AMD has emerged from irrelevance with competitive server CPU and GPU products and a credible roadmap offering better x86 price-performance than its much larger rival.
These trends have been percolating for several years, mostly below the attention of enterprise technology executives. However, rapidly maturing technology and a relentless pursuit of higher performance by hyperscale cloud builders and high-performance computing (HPC) users have incubated an environment favorable to non-Intel alternatives. Recent events illustrate the rapid architectural changes within the AI and HPC communities that have longer-term implications for the average enterprise.
Strong-Arming its way into HPC
The SC supercomputing conferences were once niche events tailored to and dominated by researchers in government labs, academia and HPC vendors seeking to score some benchmarking victories with their latest products. While the target workloads haven’t changed, namely those using numerical simulations for fundamental scientific research, they have been supplemented by practical applications of HPC computational techniques and distributed systems to problems in numerous industries such as resource extraction, social networks, online marketing, cyber security and manufacturing.
Expanding the applicability of HPC to new industries and problems has created an environment that fosters tremendous innovation in areas such as processor architecture, workload-specific hardware acceleration, distributed software management, and application development frameworks and libraries. Thus, a conference that was once dominated by Cray and, later, by custom distributed systems commissioned by government labs is being disrupted by the likes of NVIDIA, Arm and the cloud vendors.
As is often the case, NVIDIA and its charismatic founder and CEO Jensen Huang — aka the World’s Top CEO — are leading the innovative changes, making several significant announcements at SC19. In sum, they show a company that is one of the catalysts for this decade’s AI renaissance by fostering greater hardware diversity with workload-optimized system designs that substitute Arm processors for traditional x86 CPUs. Specifically, NVIDIA announced:
- A reference design for Arm-based servers using NVIDIA GPUs to accelerate HPC applications in research and industry that builds upon earlier work porting its CUDA-X software libraries and development tools to Arm. The reference hardware, nicknamed EBAC (everything but a CPU), closely resembles NVIDIA’s HGX rack systems for cloud operators by connecting eight top-end V100 Volta GPUs via an internal NVLink fabric in a chassis containing four Ethernet I/O cards, SSD drive connectors and one or more Arm CPU cards (the photo NVIDIA released shows what looks like four dual-socket boards). In announcing the design, Huang said:
“There is a renaissance in high performance computing. Breakthroughs in machine learning and AI are redefining scientific methods and enabling exciting opportunities for new architectures. Bringing NVIDIA GPUs to Arm opens the floodgates for innovators to create systems for growing new applications from hyperscale-cloud to exascale supercomputing and beyond.”
- An accelerated I/O architecture and supporting software called Magnum IO that can improve performance by up to 20 times on data-intensive applications run on distributed clusters of GPU-equipped servers. The core technology, called GPUDirect, is a new communication protocol between GPU nodes that doesn’t require CPU resources and can work over many high-performance RDMA I/O channels, notably NVIDIA NVLink, to offload both network and storage I/O.
- Working with NVIDIA, Microsoft demonstrated that no workload is too large for the cloud by introducing the most powerful GPU-accelerated compute instance to date. Each NDv2 instance includes eight NVIDIA V100 Tensor Core GPUs connected via NVLink, each with 32 GB of memory, along with 40 Xeon Platinum 8168 processor cores and 672 GiB of system memory. NDv2 instances can be lashed together to create a virtual supercomputer of monumental proportions. For example, Microsoft and NVIDIA used a cluster of 64 NDv2 instances to train BERT, a popular conversational language processing AI model, in roughly three hours, a job that can take days using older GPUs or 4 to 16 Google TPUs (depending on the model size).
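The scale of the 64-instance BERT cluster follows directly from the per-instance figures quoted above. As a rough sketch that uses only those numbers (the names `InstanceSpec`, `NDV2` and `cluster_totals` are made up for illustration and are not part of any Azure or NVIDIA API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Per-instance resources, taken from the NDv2 figures quoted in the article."""
    gpus: int
    gpu_memory_gb: int      # memory per GPU
    cpu_cores: int
    system_memory_gib: int


# Hypothetical constant holding the article's NDv2 numbers.
NDV2 = InstanceSpec(gpus=8, gpu_memory_gb=32, cpu_cores=40, system_memory_gib=672)


def cluster_totals(spec: InstanceSpec, instances: int) -> dict:
    """Multiply per-instance resources by the number of instances in the cluster."""
    total_gpus = spec.gpus * instances
    return {
        "gpus": total_gpus,
        "gpu_memory_gb": total_gpus * spec.gpu_memory_gb,
        "cpu_cores": spec.cpu_cores * instances,
        "system_memory_gib": spec.system_memory_gib * instances,
    }


# The BERT training run described above used 64 instances.
totals = cluster_totals(NDV2, 64)
for name, value in totals.items():
    print(f"{name}: {value:,}")
```

Under these figures the BERT run had 512 GPUs and 2,560 Xeon cores at its disposal, which gives a sense of why a job measured in days elsewhere finished in hours.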
Building Arm momentum
While the SC19 announcements are focused on the HPC market and related applications, NVIDIA’s latest moves are indicative of broader changes reshaping data center computing and application development that will eventually benefit mainstream enterprises. They also signal growing acceptance of Arm as a data center platform and come amid other evidence of significant improvements in Arm server technology. Some examples include:
- The general availability of AWS A1 instances using a home-grown Graviton processor that is up to 45% cheaper than conventional x86 nodes for scale-out workloads such as web servers, containerized microservices, content caching, distributed databases and storage and Arm-native applications. The instances are supported by many Linux distributions including Red Hat, SUSE, Ubuntu, and Amazon Linux 2, and several container runtimes.
- Ampere, the startup that resurrected the X-Gene 64-bit Arm server design from Applied Micro, is close to introducing a second-generation product that exploits a 7nm process node to enable up to 80 cores. With the product, Ampere isn’t merely targeting entry-level applications, but also demanding cloud workloads such as databases and analytics. In a podcast interview with Data Center Knowledge, the company’s VP of products notes that cloud operators “have power efficiency needs that aren’t being met today,” adding that the performance per watt of x86 chips hasn’t improved enough to satisfy the scaling needs of the largest cloud deployments.
- Marvell recently updated the roadmap for its ThunderX Arm server SoC to show a third-generation, 7nm product “coming soon” (presumably the first half of 2020) and a fourth-generation product two years later. The company’s goal is to more than double performance with each generation through a combination of process scaling and architectural and instruction set improvements such as larger cache sizes, better branch prediction and more sophisticated power management.
The newly energized market for data center Arm SoCs and systems would only be of passing interest to enterprise IT leaders if not for the existence of cloud services that interpose an API-centric abstraction layer between the developer/user and hardware implementation. Few organizations have the stomach for an architectural shift as fundamental to their enterprise software as changing processor platforms, even if it means saving money and hopping on a steeper performance growth curve.
Admittedly, most initial cloud offerings are IaaS compute instances that still expose the user to the processor’s architectural differences. Even here, however, growing support for the Arm platform by Linux distros and by software libraries and tools like NVIDIA CUDA-X eliminates major roadblocks for developers and IT operators. That said, given how secretive cloud operators like AWS and Google are about their internal workings, we have no idea how many services are already delivered from non-Intel hardware, or whether new service features and performance gains result from their willingness to deploy hardware customized to the task. Chances are that both are often the case, particularly given comments from some of the Arm vendors like Ampere.
The real significance of this news for enterprises comes over the long term. Vigorous hardware competition is fueling the proliferation of processors designed for particular workloads, such as GPUs, TPUs and FPGAs, and cloud services make that hardware available on demand, so the average business can access features and supercomputer performance levels that were once limited to the HPC priesthood in massive research labs. Indeed, the scale of what Microsoft can provide as a service is mind-boggling. Our friends at The Next Platform estimate that a maxed-out NDv2 cluster delivers 5.36 petaflops of floating point performance, which would rank number 40 on the Top 500 list of the world’s supercomputers. All for a mere $2,661 per hour, versus the millions it would take to buy and operate such a beast. While an extreme example, it illustrates the tremendous democratizing force of combining cloud services with hardware competition.
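To put the rental figure in perspective, the two numbers quoted above (5.36 petaflops and $2,661 per hour) can be turned into a unit price and a few sample bills. The sketch below is back-of-the-envelope arithmetic only: the function names and the sample durations are hypothetical, and it assumes the hourly rate stays flat regardless of how long the cluster is rented.

```python
# Figures quoted in the article (The Next Platform's estimate for a maxed-out NDv2 cluster).
CLUSTER_PETAFLOPS = 5.36
CLUSTER_USD_PER_HOUR = 2661.0

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def usd_per_petaflop_hour(usd_per_hour: float, petaflops: float) -> float:
    """Hourly price divided by delivered performance."""
    return usd_per_hour / petaflops


def rental_cost(hours: float, usd_per_hour: float = CLUSTER_USD_PER_HOUR) -> float:
    """Total bill for renting the whole cluster, assuming a flat hourly rate."""
    return hours * usd_per_hour


unit_price = usd_per_petaflop_hour(CLUSTER_USD_PER_HOUR, CLUSTER_PETAFLOPS)
print(f"Price per petaflop-hour: ${unit_price:,.2f}")

# Hypothetical rental durations, chosen only to show how the bill scales.
for label, hours in [("3-hour job", 3), ("one day", 24), ("one year", HOURS_PER_YEAR)]:
    print(f"{label:>10}: ${rental_cost(hours):,.0f}")
```

At roughly $496 per petaflop-hour, a short burst costs a few thousand dollars, while running the cluster around the clock for a year would come to about $23.3 million. That pattern is why the rental model suits occasional, bursty use of supercomputer-class hardware.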
Most organizations can’t use a Top 500 supercomputer, but they do have computationally intensive problems whose solutions can provide significant new business insights and that are only practical on a new generation of AI- and HPC-optimized hardware: systems and services that can now be rented as needed. The combination of Arm servers, GPUs and other accelerators, and cloud services allows enterprise leaders to unleash their creativity to solve previously intractable business problems.