In my earlier story, I outlined the chip market for machine learning systems and Deloitte's observations about where it is heading.
To recap, graphics rendering and machine learning algorithms share a key characteristic: both push thousands or millions of data points through simple floating-point calculations. The workloads are similar enough that GPUs have become the preferred engine for AI research and implementation.
The result has been phenomenal financial success for NVIDIA, the pioneer of GPU computing, but its success has stoked competition.
As befits a dynamic, disruptive market like AI, the competition facing NVIDIA comes not only from traditional processor rivals, namely Intel and AMD, but also from innovative startups and incumbent cloud service providers looking to build a better AI mousetrap.
This story assesses the runners and riders in this fast-moving and intensely competitive market.
Google, the chief contender
NVIDIA developed and has cornered the market for AI-optimized GPUs; however, a recently rejuvenated AMD has joined the battle with its Radeon Instinct series. While NVIDIA remains the de facto standard platform for most deep learning researchers and cloud platforms, AMD scored a significant design win with Baidu, which announced that it will optimize its AI applications for the Instinct platform.
More interesting are alternative products designed to perform types of calculations common to deep learning, such as the matrix multiplication and addition done by TensorCores, directly in hardware. We’re still early in the development of GPU alternatives, however there is already plenty of activity.
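The TensorCore primitive mentioned above is a fused matrix multiply-accumulate, D = A × B + C, computed on small matrix tiles with reduced-precision inputs and higher-precision accumulation. A minimal NumPy sketch of that operation, emulated on the CPU (the 4×4 tile size and mixed float16/float32 precision mirror NVIDIA's published description of Tensor Cores):

```python
import numpy as np

# Inputs in reduced precision (float16), accumulator in float32,
# mirroring the mixed-precision scheme used by Tensor Cores.
a = np.random.rand(4, 4).astype(np.float16)
b = np.random.rand(4, 4).astype(np.float16)
c = np.zeros((4, 4), dtype=np.float32)

# The fused operation: D = A x B + C, accumulated in float32.
d = a.astype(np.float32) @ b.astype(np.float32) + c
```

Hardware that performs this whole fused step in one clocked unit, rather than as separate multiply and add instructions, is what gives these accelerators their throughput advantage on deep learning workloads.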
Google's TPU (Tensor Processing Unit) is the most notable GPU alternative, not only because of who developed it (Google, the organization behind the TensorFlow framework) but because it is the first to be incorporated into a cloud service, namely Cloud TPU. As I mentioned last year, Google's goal with the TPU was deep learning performance as good as or better than a GPU's while using much less power. Google claims that its latest version, TPU 3.0, recently teased at Google I/O, delivers an 8x performance improvement over version 2; however, some of that improvement appears to come from packing twice as many chips into a TPU pod, Google's name for an AI system that combines multiple TPU chips with conventional CPUs, memory and networking. Indeed, that conflation of TPU chip and pod is a sore point with NVIDIA; a spokesman pointed out to me that TPU performance benchmarks typically compare a pod of four TPU chips to a single V100 GPU.
The topic of TPU competition also arose on NVIDIA's recent earnings call with analysts in which CEO Jensen Huang said the following (emphasis added),
Google announced TPU 3.0 and it's still behind our Tensor Core GPU. Our Volta is our first generation of a newly reinvented approach of doing GPUs. It's called Tensor Core GPUs. And we're far ahead of the competition and – but more than that, it's programmable. It's not one function. It's programmable. Not only is it faster, it's also more flexible. And as a result of the flexibility, developers could use it in all kinds of applications, whether it's medical imaging or weather simulations or deep learning or computer graphics.
Huang went on to tout the broad adoption of GPUs and their availability on every major cloud service,
As a result, our GPUs are available in every cloud and every datacenter, everywhere on the planet and which developers need so that – accessibility, so that they could develop their software. And so I think that on the one hand, it's too simplistic to compare a TPU to just one of the many features that's in our Tensor Core GPU. But even if you did, we're faster. We support more frameworks. We support all neural networks.
And as a result, if you look at GitHub, there are some 60,000 different neural network research papers that are posted that runs on NVIDIA GPUs. And it's just a handful for the second alternative.
Other options emerging
While Google has the early lead among GPU alternatives, at least in mindshare and cloud service availability, other important contenders abound including:
- Microsoft Brainwave, which eschews the custom ASIC approach of the TPU for an FPGA implementation that is "designed for real-time AI, which means the system can ingest a request as soon as it is received over the network at high throughput and at ultra-low latency without batching." Microsoft recently announced that its Brainwave accelerator will be used to power the Azure Machine Learning Service to provide "real-time AI calculations at competitive cost and with the industry’s lowest latency, or lag time."
- Intel Deep Learning Inference Accelerator is another FPGA-based product designed into an accelerator card that can be used with existing servers to yield "throughput gains several times better than CPU alone."
- Intel Movidius Compute Stick embeds a low-power AI accelerator into a USB compute stick that can be used on a developer's laptop or in smart devices like drones, robots, cameras and IoT gateways.
- Graphcore IPU is a processor designed for machine learning workloads that uses a graph-based architecture; the company claims it accelerates both model training and inference by one to two orders of magnitude over other AI accelerators, based on company-run benchmarks (caveat emptor).
- Tachyum is a startup that recently teased preliminary information about a processor it claims achieves dramatically better performance and power efficiency on AI workloads than the conventional combination of CPUs and GPUs. Indeed, co-founder and CEO Radoslav Danilak made the bold claim that a hyperscale data center built using its processor will be able to provide the same performance as conventional servers while using 1 percent of the physical space and one-tenth the energy. He said the company will reveal technical details about the chip this summer at the Hot Chips conference.
- Wave Computing builds what it calls a data flow accelerator using a coarse-grained reconfigurable array (CGRA) that is well suited to the type of computations done in TensorFlow or other tensor-based algorithms.
NVIDIA’s massive data center business notwithstanding, we are still in the early days of AI accelerators where multiple architectural techniques and product designs are vying for developer mindshare and a slice of IT spending. Indeed, it’s somewhat reminiscent of the early days of microcomputers where various processor architectures like the x86, Zilog Z80 and Motorola 6800-series battled to become the standard for a new generation of systems.
NVIDIA undoubtedly has enormous leads in the number of deployments, developers and software packages using its GPUs and CUDA platform. However, the emergence of new types of machine learning algorithms, high-level software frameworks and cloud services that hide architectural details behind an API abstraction layer could rapidly change the competitive landscape. Indeed, frameworks like TensorFlow, Caffe2, MXNet and others yet to be invented could drain the competitive moat that NVIDIA believes it has built via its CUDA platform and APIs.
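The abstraction-layer point can be sketched with a toy dispatch table: user code calls one high-level operation, and the framework routes it to whichever hardware backend is configured. The function names and backend registry below are purely illustrative, not any real framework's API:

```python
import numpy as np

def cpu_matmul(a, b):
    # Reference implementation on the host CPU.
    return np.matmul(a, b)

def accelerator_matmul(a, b):
    # A real backend would hand this off to a GPU, TPU, or FPGA;
    # here we simply simulate it on the CPU.
    return np.matmul(a, b)

# Hypothetical backend registry; a framework would populate this
# with drivers for whatever hardware is installed.
BACKENDS = {"cpu": cpu_matmul, "accelerator": accelerator_matmul}

def matmul(a, b, device="cpu"):
    # User code calls this one function; the hardware choice is a
    # configuration detail, not a code change.
    return BACKENDS[device](a, b)

a = np.ones((2, 3))
b = np.ones((3, 2))
result_cpu = matmul(a, b, device="cpu")
result_acc = matmul(a, b, device="accelerator")
```

Because application code targets only the high-level `matmul`, swapping the backend underneath, say, from a GPU to a new accelerator, requires no changes to the model code. That is precisely why such layers weaken any moat built on a particular hardware API.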
The rampant search for faster, more efficient hardware to run AI software portends some vulnerability for the venerable GPU and lends credence to comments Intel’s chief AI architect recently made in an interview in which he said (emphasis added),
GPUs are not the best structure to do deep learning. They're pretty good, but they are not optimized. Think about the underlying concept of temporal versus spatial architectures. In a temporal architecture, you have a flow of instructions, taking data from a known place to a known place. With neural networks, there is a wave front that flows through a very wide set of operations, and you go through all those multiple nodes. Architectures that are more spatial have an advantage in that they don't have to rely on a flow, and they don’t end up waiting on other instructions. The GPU is not the optimal architecture for that kind of compute.
He added that AI applications in areas like machine translation or mapping, e.g. genome sequencing, use combinations of algorithms that require a mix of GPU-style parallelism and CPU-style procedural logic, implying that designs that can tightly combine the two could have a performance advantage.
If you are a hardware designer or AI developer, the dynamism around competing approaches makes for an exciting time. For business users of AI, it means favoring approaches that exploit abstraction layers such as development frameworks and higher level cloud services that insulate you from the implementation details and make it easier to rapidly exploit technological advances.