For more than half a century, innovations in information technology have followed a similar path from the research lab to academics and specialists before evolving into mainstream consumer and enterprise products.
The Internet and its adjacent technologies are the most famous (and profitable) recent example, but we are on the cusp of an equally significant force in AI technology in the form of machine learning and deep learning.
Although AI has crept into many consumer products such as chatty home assistants, auto-tagging photo apps and language translation apps, these have been developed by multi-billion-dollar tech titans employing armies of specialists in arcane topics like transformers, reinforcement learning and recommender systems.
Sadly, the hurdles to transferring AI technology to homegrown enterprise applications remain significant. Whether it is the racks of expensive hardware, selection and management of development frameworks or the complicated workflow from data acquisition to cluster deployment, building world-class AI software requires a team of experts not unlike those required to operate a supercomputer center. Cloud infrastructure like GPU instances and packaged software for common tasks like image recognition and natural language processing has lowered the bar for some types of applications, but it remains too difficult for most enterprises to apply AI to business problems.
Indeed, as I mentioned last week, going to a cloud service like Google for AI can seem like going to the hardware store for a piece of furniture. Although NVIDIA is an 'arms supplier' to all the major cloud operators, it has been increasingly active with traditional enterprise suppliers to create a hardware-software environment that can do for AI what VMware did for virtualization.
Democratizing AI for the enterprise
Amidst the new GPUs and excitement over its first CPU, it was easy to miss the enterprise systems NVIDIA announced at GTC in April (see my coverage here). NVIDIA used this year's Computex event (yes, still virtual) to showcase its enterprise products, notably the Base Command Platform and an associated hosted service that combines a DGX SuperPOD with NetApp ONTAP AI storage.
As the name suggests, Base Command is a management system for collaborative AI development and cloud deployment that combines several features.
- Convenient access to GPU-optimized containers, development frameworks, models and scripts in the NGC catalog.
- APIs and pre-built integrations to MLOps development tools and execution environments like Jupyter notebooks.
- A TensorBoard visualization toolkit for experimentation and model profiling.
- A telemetry dashboard with real-time display of utilization for GPUs, Tensor Cores and other system resources.
- Tools for managing datasets.
- A flexible job scheduler with resource quotas and other controls.
Complementing the Base Command software are NVIDIA-managed virtual hardware resources that combine SuperPOD compute clusters and NetApp AFF A800 (all-flash) storage systems optimized for high-throughput clusters. The SuperPOD, which NVIDIA also introduced at GTC, is a six-rack AI compute cluster (what NVIDIA calls an SU or scalable unit) built from DGX A100 systems connected to an InfiniBand fabric that includes the following:
- 20 DGX A100 servers, each with 8 A100 GPUs
- Independent compute and storage networks with 10 HDR InfiniBand (200 Gbps) connections per system (8 compute, 2 storage).
- NVMe storage. NVIDIA didn’t release specifications for the Base Command Platform; however, its SuperPOD reference architecture shares one storage pod across seven SUs (140 nodes).
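The figures in the list above can be turned into a quick back-of-the-envelope tally. The sketch below is only arithmetic on the numbers quoted in this article; the function name `su_summary` and the dictionary keys are my own, and the bandwidth values are nominal link rates, not measured throughput.

```python
# Figures taken from the SuperPOD description above.
DGX_PER_SU = 20             # DGX A100 servers per scalable unit (SU)
GPUS_PER_DGX = 8            # A100 GPUs per server
COMPUTE_LINKS_PER_DGX = 8   # HDR InfiniBand links on the compute fabric
STORAGE_LINKS_PER_DGX = 2   # HDR InfiniBand links on the storage fabric
HDR_LINK_GBPS = 200         # nominal HDR line rate per link
SUS_PER_STORAGE_POD = 7     # SUs sharing one storage pod in the reference design


def su_summary(num_sus: int = 1) -> dict:
    """Tally nodes, GPUs and nominal per-node bandwidth for a number of SUs."""
    nodes = DGX_PER_SU * num_sus
    return {
        "nodes": nodes,
        "gpus": nodes * GPUS_PER_DGX,
        "compute_gbps_per_node": COMPUTE_LINKS_PER_DGX * HDR_LINK_GBPS,
        "storage_gbps_per_node": STORAGE_LINKS_PER_DGX * HDR_LINK_GBPS,
        "total_links": nodes * (COMPUTE_LINKS_PER_DGX + STORAGE_LINKS_PER_DGX),
    }


if __name__ == "__main__":
    print(su_summary(1))                    # a single SU
    print(su_summary(SUS_PER_STORAGE_POD))  # seven SUs sharing a storage pod
```

A single SU works out to 160 GPUs, and the seven-SU configuration reproduces the 140 nodes cited in the reference architecture.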
The NVIDIA-NetApp Base Command Platform is available to select customers starting at $90K per month. Although the company doesn’t detail the pricing model or resource capabilities, that price guarantees it will be, as they say in car advertising, “well equipped” and intended for organizations with sizable AI development projects. Google Cloud also committed to offering Base Command on its marketplace to go with its existing A2 instances with up to 16 A100 GPUs.
Bringing AI in-house
The resource requirements for serious AI development have caused most organizations to use cloud services for the required hardware. Although nothing save the 123 kg bulk and 6.5 kW power draw prevents someone from installing a DGX A100 in their data center, it’s overkill in many situations. A better alternative for traditional data centers is an NVIDIA-certified enterprise system running the company’s AI Enterprise software.
Also introduced at GTC 2021, AI Enterprise marries NVIDIA’s DL/ML development platform with VMware’s enterprise compute stack in a way that gives AI developers the tools they expect and IT operations teams an infrastructure and virtualization platform they understand. When used with NVIDIA-certified systems, the combination provides:
- Enterprise x86 servers running vSphere.
- NVIDIA infrastructure management software including vGPU (virtualization), Magnum I/O (throughput acceleration for multi-GPU systems), CUDA-X AI (GPU accelerated libraries), DOCA (DPU-smartNIC SDK).
- AI application frameworks like TensorFlow, PyTorch, RAPIDS (NVIDIA development library built with CUDA-X), TensorRT (inference SDK) and the Triton inference engine.
Although NVIDIA announced the AI Enterprise stack and hardware certification program at GTC, Computex is the stage for system announcements, with ASUS, Dell Technologies, GIGABYTE, HPE, Lenovo, QCT and Supermicro all announcing compatible hardware. NVIDIA documents separate guidelines for systems designed for AI inference and training, but in general, certified products include:
- One to eight of the following GPUs: A100, A40, A30, A10, RTX A6000 or T4.
- One or two Intel Xeon Scalable (Skylake, Cascade Lake, Ice Lake) or AMD Epyc (Rome or Milan) CPUs.
- ConnectX-6, ConnectX-6 Dx or BlueField-2 DPU 100-200 Gbps NIC.
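As a rough illustration, the general component list above can be encoded as a simple checklist. This is a minimal sketch built only from the three bullets in this article; `ServerConfig` and `check_config` are hypothetical names of my own, not part of any NVIDIA tool, and NVIDIA's actual certification involves separate inference and training guidelines that are not modeled here.

```python
from dataclasses import dataclass
from typing import List

# Component lists copied from the article's summary; not an official NVIDIA schema.
CERTIFIED_GPUS = {"A100", "A40", "A30", "A10", "RTX A6000", "T4"}
CERTIFIED_CPUS = {
    "Intel Xeon Scalable": {"Skylake", "Cascade Lake", "Ice Lake"},
    "AMD Epyc": {"Rome", "Milan"},
}
CERTIFIED_NICS = {"ConnectX-6", "ConnectX-6 Dx", "BlueField-2"}


@dataclass
class ServerConfig:
    """Hypothetical description of a server, for illustration only."""
    gpu_model: str
    gpu_count: int
    cpu_family: str
    cpu_generation: str
    cpu_count: int
    nic: str


def check_config(cfg: ServerConfig) -> List[str]:
    """Return the ways a config departs from the article's list (empty if none)."""
    problems = []
    if cfg.gpu_model not in CERTIFIED_GPUS:
        problems.append(f"GPU model {cfg.gpu_model!r} is not on the list")
    if not 1 <= cfg.gpu_count <= 8:
        problems.append(f"GPU count {cfg.gpu_count} is outside one to eight")
    generations = CERTIFIED_CPUS.get(cfg.cpu_family)
    if generations is None:
        problems.append(f"CPU family {cfg.cpu_family!r} is not on the list")
    elif cfg.cpu_generation not in generations:
        problems.append(f"CPU generation {cfg.cpu_generation!r} is not on the list")
    if cfg.cpu_count not in (1, 2):
        problems.append(f"CPU count {cfg.cpu_count} is not one or two")
    if cfg.nic not in CERTIFIED_NICS:
        problems.append(f"NIC {cfg.nic!r} is not on the list")
    return problems


if __name__ == "__main__":
    server = ServerConfig("A100", 4, "AMD Epyc", "Milan", 2, "BlueField-2")
    print(check_config(server) or "matches the general component list")
```

Returning a list of problems instead of a single boolean makes it easier to see which component rules a candidate system out.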
Several high-end training systems from Dell, HPE and Supermicro are based on NVIDIA’s HGX reference platform, which includes NVLink interconnects and InfiniBand interfaces. NVIDIA is also working with GIGABYTE and Wiwynn to certify servers using CPUs based on Arm Neoverse sometime next year. Computex also marks the debut of general-purpose traditional and HCI hardware using NVIDIA’s second-generation DPU, BlueField-2.
Enterprises are rapidly integrating AI into many applications; however, as happens with every new technology, it’s easy to get hyperbolic about the prospects. While I don’t think ‘every application needs AI,’ nor does anyone need deep-learning-enhanced bathroom fixtures, investment managers like ARK can make a convincing case that “deep learning could create more economic value than the Internet did.” While I’m skeptical about ARK’s overall numbers since they appear to be recategorizing growth in IT writ large into the AI bucket, they support their thesis by noting the exponentially expanding complexity of AI models as the technology progresses from computer vision to NLP and reinforcement learning.
The salient point for NVIDIA and other suppliers to the AI ecosystem is the continuing, insatiable need for AI hardware. As ARK highlights in its Big Ideas 2021 presentation (emphasis added):
While advances in hardware and software have been driving down AI training costs by 37% per year, the size of AI models is growing much faster, 10x per year. As a result, total AI training costs continue to climb. We believe that state-of-the-art AI training model costs are likely to increase 100-fold, from roughly $1 million today to more than $100 million by 2025.
These are increases that a hobbled Moore’s Law cannot erase through advancing technology, meaning that the cost of developing leading-edge AI models will look more like that of drug discovery than app development. In such a world, it’s better to be the one supplying the picks and shovels than searching for gold. While NVIDIA is the largest supplier in this market, a rising tide (or, more likely, a tsunami) lifts all boats, leaving plenty of AI TAM to go around.
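The ARK figures quoted above lend themselves to a quick sanity check. The sketch below is my own arithmetic, not ARK's model: it assumes training cost scales linearly with model size, and it assumes a four-year horizon (2021 to 2025) based on the title of the presentation. Both function names are hypothetical.

```python
COST_DECLINE_PER_YEAR = 0.37   # unit training cost falls 37% per year (ARK)
MODEL_GROWTH_PER_YEAR = 10.0   # model size grows 10x per year (ARK)


def net_cost_multiplier(years: int = 1) -> float:
    """Compound growth in training cost, assuming cost is proportional to
    model size (a simplifying assumption, not a claim from ARK)."""
    per_year = MODEL_GROWTH_PER_YEAR * (1.0 - COST_DECLINE_PER_YEAR)
    return per_year ** years


def implied_annual_growth(total_multiple: float, years: int) -> float:
    """Constant yearly multiplier that compounds to total_multiple over years."""
    return total_multiple ** (1.0 / years)


if __name__ == "__main__":
    print(f"Naive yearly cost multiplier: {net_cost_multiplier(1):.2f}x")
    print(f"Naive four-year multiplier:   {net_cost_multiplier(4):.0f}x")
    print(f"Yearly growth implied by 100x in four years: "
          f"{implied_annual_growth(100, 4):.2f}x")
```

Under the linear-scaling assumption, costs would grow about 6.3x per year, which compounds well past 100x in four years, whereas the 100-fold forecast corresponds to roughly 3.2x per year. The forecast therefore presumably builds in something the naive calculation ignores, such as slower model growth or costs that rise less than proportionally with model size. Either way, the direction of the argument holds: costs climb even as unit prices fall.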