Arm has been the dominant processor architecture for mobile devices since the iPod, but has otherwise seen limited use outside of embedded systems and IoT. Despite some success with low-end Chromebooks, Arm never penetrated the PC market until Apple began migrating its entire Mac lineup to the custom-designed M1.
Data center operators similarly shunned Arm processors as underpowered until AWS built its original Graviton SoC around Arm's 64-bit v8 architecture. However, with sub-10nm process nodes enabling SoCs with 64 or more cores, and with cloud operators prioritizing power efficiency over raw per-core performance, Arm is now viable for many workloads. With the Armv9 architecture and Neoverse CPU platforms, Arm is a compelling alternative to x86 processors for many applications.
Opening the kimono
Arm teased the new products and roadmap last fall, and as I wrote at the time:
Arm has the advantage with an extensive library of standard modules and cores, partner ecosystem and more flexible IP licensing model (which, in the hands of someone like Apple can be used for further product differentiation). While nature abhors a vacuum, the technology world abhors a monopoly and after decades of dominance, Intel's monopoly faces a two-front attack, from AMD and Arm-NVIDIA.
Key enhancements in the Armv9 architecture include:

- Better performance on HPC and deep learning workloads from SVE2, the second generation of Arm's Scalable Vector Extension. Armv9 also adds support for BFloat16, a low-precision floating-point format useful in neural network calculations.
- Support for confidential computing via its Realm Management Extension (RME). (For background, see my column on confidential computing and processor hardware enclaves).
- Support for hardware virtualization, including nested VMs.
- Hardware isolation to create trusted execution environments (TEE).
- A Transactional Memory Extension (TME) that supports coarse-grained thread-level parallelism by reducing lock contention, improving scalability across multiple cores.
- Improved code reliability and debugging via a Memory Tagging Extension (MTE), Branch Target Identification (BTI) and Branch Record Buffer Extensions (BRBE).
This week, Arm has provided more details about the Neoverse platforms, including performance estimates, targeted markets and licensees using Neoverse in new products and services.
Four markets from data centers to edge appliances
By expanding into cloud data centers and HPC installations, Arm follows a path set by Intel when it expanded an instruction set and chip architecture that was initially designed for PCs into processors optimized for servers (Xeon) and embedded systems (Atom). With Neoverse, Arm targets four growing markets:
- Hyperscale cloud operators and online service providers.
- HPC clusters.
- 5G virtual infrastructure.
- Edge systems and IoT.
The v9 architecture provides the foundation, while the Neoverse platforms and Arm's Coherent Mesh Network provide the building blocks and glue for creating SoCs tailored to such a divergent set of workloads.
With the Neoverse N2 and CMN-700, Arm has made evolutionary enhancements to existing technologies, while the Neoverse V1 is a new platform designed for maximum performance and workloads that currently require Intel Xeon, AMD Epyc or IBM POWER processors. However, Arm's strategy, and critical parts of its intellectual property, extend beyond processor architecture to the surrounding design, development and support ecosystem. To facilitate new products, Arm provides licensees with reference designs and IP blocks, EDA and compiler tools and optimizations, foundry partnerships, IDEs and a growing community of open source and commercial developers.
Neoverse N2 - the mainstream Armv9 option
Neoverse N2 rolls the features introduced in Armv8.4, 8.5, 8.6 and 9 into an update to a platform that AWS has already demonstrated in its Graviton2 to be very effective and efficient for cloud workloads. Significant updates in Neoverse N2 include:
- Microarchitecture improvements that provide a 40 percent increase in IPC (instructions per clock cycle) over N1.
- SVE2, which adds instructions useful for image and video processing, genomics, in-memory databases and LTE/5G baseband processing.
- Improved scalability, including support for 128-core SoCs.
- Memory system resource partitioning and monitoring (MPAM) to control access to shared system resources, cache and memory bandwidth.
- Improved power efficiency and management, including the ability to dynamically adjust CPU prediction parameters to maximize power efficiency for a given workload.
- The Armv9 security and debugging improvements detailed above.
Neoverse V1 - boldly going where no Arm has gone before
Whereas the N2 is Arm's answer for multi-threaded, general-purpose, efficiency-optimized infrastructure, V1 is squarely focused on maximizing performance per core. If N2 is the Mercedes C-Class family sedan, V1 is the AMG edition. Like the N2, V1 builds on the first-generation Neoverse, but the design choices favor maximizing performance over power efficiency with features like:
- An 8-wide front-end and 15-wide issue, 11-stage superscalar microarchitecture that improves IPC by 48 percent over N1, with up to triple the performance on vectorized applications using SVE.
- Nested virtualization.
- BFloat16 and Int8 support for ML and DL applications.
- Faster I/O through write gathering.
- Deep persistence to improve the performance and management of non-volatile memory.
V1 also uses the improved CMN-700 interconnect (see below) to connect both on-die and chiplet-based CPUs, accelerators, memory and I/O controllers. The updated mesh interconnect enables designs exceeding 128 CPUs and 128 I/O lanes supporting PCIe Gen5, CCIX and CXL.
CMN-700 - the connective tissue for custom SoCs, chiplets and MCPs
Arm CMN provides the connectivity between CPU core clusters, accelerators like GPUs and DSPs, and memory, making it analogous to Intel's UltraPath Interconnect (UPI) or AMD's Infinity Fabric. CMN-700 increases the scalability of Neoverse systems by augmenting CMN-600 in several areas, including:
- Four times the number of cores per die (128) and per system (512).
- Four times the maximum system cache (512MB).
- 2.25 times more cross-point interconnects per die.
- Up to 40 memory device ports per die, a 2.5x increase over the 16 in CMN-600.
It also supports the CCIX and CXL standards, enabling multi-chip packages (MCPs) that use chiplets to improve die yield and increase chip capacity, as well as composite designs that combine specialized accelerators and silicon photonics chiplets (for example, the TeraPHY from Ayar Labs).
In the middle of the last decade, several companies like Applied Micro, Calxeda and Cavium failed to convince data center operators that Arm was a viable server platform despite having some compelling designs. AWS silenced most of the skeptics when it released instances based on its home-grown Graviton processors in 2018. However, despite Microsoft reportedly toying with the platform, Arm remains a rarity in both cloud and enterprise data centers.
A new architecture, a pair of chip platforms and an enhanced intra- and off-chip I/O fabric make Arm a legitimate alternative for a wide range of enterprise, cloud, HPC and telecommunications applications. To hammer that point home, Arm unleashed a cavalcade of partners to rhapsodize about the benefits of the Arm architecture, Neoverse performance and the customizability afforded by the Arm platform and associated IP. Indeed, the group ran the gamut of industries Arm is targeting, namely:
- Cloud infrastructure with AWS, Alibaba Cloud, Tencent and Oracle Cloud all deploying (or planning to) Arm-based services.
- Telecommunications where Marvell discussed updates to its OCTEON products for 5G O-RAN implementations.
- HPC with the Indian Ministry of Electronics and Information Technology announcing an agreement with SiPearl and ETRI to build an exascale system.
As NVIDIA has long contended (see my coverage of GTC 2021 here for more details), modern applications require more than just general-purpose compute cycles. Maximizing performance and security requires hardware subsystems, whether GPUs, tensor processors, crypto engines or secure enclaves. With a competitive CPU architecture, robust ecosystem of design tools and hardware IP, a wide array of hardware licensees and multiple licensing options, Arm is ideally positioned to power the next generation of data center and edge infrastructure. It will be fascinating to see the products that come from the minds of experienced designers at deep-pocketed companies like AWS, Microsoft, Alibaba and Marvell, but if Apple's example in the consumer business is any indication, the improvements to price, performance and efficiency should be impressive.