Although NVIDIA has adapted GPUs for AI algorithms like neural networks by adding features such as the Tensor Core to its latest Volta processor, these 300-watt behemoths aren't suited for mobile devices, nor truly optimized for any particular subclass of AI applications. As the competition to produce more intelligent, predictive and quasi-sentient products and services heats up, developers and systems designers are eschewing general purpose processors and embedding more and more algorithmic intelligence in customized hardware.
From FPGAs to custom silicon
FPGAs, essentially giant blocks of generic logic circuits that can be programmed after manufacture to implement application-specific circuitry, have been around for decades. Process technology has shrunk to the point that these hardware blank slates can now encode large, complicated algorithms in circuitry. While not as easy to program as a high-level application language, FPGAs are being deployed by sophisticated cloud operators like Microsoft to accelerate Bing searches and Azure services. According to one of the Microsoft engineers on the project,
[A] key advantage is that you can quickly adapt to whatever the next technological breakthrough is, without having to worry too much about whether you anticipated it or not. That’s because you can easily reprogram the FPGAs directly, instead of using less efficient software or waiting as long as a few years to get new hardware.
Microsoft recently announced the next generation of its customized silicon, Project Brainwave, targeting deep learning AI models. The systems use a combination of FPGA and conventional ASIC digital signal processing blocks, along with a logic compiler and software stack that supports both Microsoft Cognitive Services and Google's (now open source) TensorFlow.
Microsoft hasn't released any application-level benchmarks, but says "Project Brainwave thus achieves unprecedented levels of demonstrated real-time AI performance on extremely challenging models." Based upon the level of integration (858,000 adaptive logic modules) and raw data released so far, the claim is believable.
Initially, Microsoft is using FPGAs to accelerate higher-level services like search, image tagging and speech recognition. However, at the recent Build conference, Azure's CTO revealed plans to make FPGAs available to developers as a programmable service.
AWS has beaten them to the punch, introducing FPGA instances last year at re:Invent. Now in production, EC2 F1 instances work with a development kit that provides up to a 30x acceleration of parallelizable algorithms in genomics, seismic analysis, financial risk analysis, big data search, and encryption.
Google has taken a different tack in designing its Tensor Processing Unit (TPU), an ASIC packaged on an external accelerator card that connects to existing servers. The TPU implements high-level instructions, such as matrix multiplication and addition, that are commonly used in neural network models. The result is a device that gives Google predictable, low-latency performance and executes deep learning models 15 to 30 times as fast as a CPU or GPU while using only 2-4% of the power. ML-optimized silicon has allowed Google to cut the training time for language translation models from a full day on 32 top-end GPUs to an afternoon on eight TPUs.
From the cloud to mobile
Seeing the threat to its traditional x86 business, Intel is betting on both approaches, both in the data center and for devices. The company followed up the acquisition of FPGA titan Altera, which is supplying the devices used by Microsoft described above, by snapping up Movidius, a startup specializing in vision recognition accelerators for mobile devices, drones, AR headsets and video surveillance cameras. The first fruit of the Movidius deal is a significant partnership with Google, in which the pair co-designed a custom Pixel Visual Core chip to handle processing of HDR photos and other AI functions for Google's latest Pixel 2 phones.
Google also used a Movidius processor in its new GoPro competitor, the Clips wireless camera. The Movidius VPU allows moving advanced image processing from the cloud to the device, improving performance and battery life. According to Google's director of Machine Intelligence, "Our partnership with Movidius has meant that we can port some of the most ambitious algorithms to a production device years before we could have with alternative silicon offering."
Google and Intel are arguably playing catch-up in the use of custom acceleration hardware on mobile devices. Apple, which has long designed the SoCs that power iPhones and iPads, added what it calls a Neural Engine to the A11 Bionic chip used in the iPhone 8 and X. Apple was typically tight-lipped about details, only saying that,
The new A11 Bionic neural engine is a dual-core design and performs up to 600 billion operations per second for real-time processing. A11 Bionic neural engine is designed for specific machine learning algorithms and enables Face ID, Animoji and other features.
Apple's Face ID system for biometric authentication must build a depth map and run a neural network facial recognition model over a 3D pattern of 30,000 points. Delivering the instantaneous response needed to replace Touch ID without using the network or cloud resources (Apple says all information is stored in the iPhone's encrypted Secure Enclave) is impossible without significant local horsepower like the Neural Engine. Likewise, Microsoft plans to use a custom Holographic Processing Unit in the next generation of its HoloLens AR headset.
Phones aren't the only mobile devices doing AI processing. As I wrote last week, autonomous vehicles are voracious users of data-intensive AI algorithms. Existing AV and intelligent autopilot systems, including Tesla's, which has the most sophisticated product commercially available, use GPUs like the NVIDIA DRIVE PX that I discussed last week.
However, Tesla is reportedly working with AMD on a chip to accelerate the AI models necessary to deliver fully autonomous (Level 5) cars. Once the hardware development costs have been recouped, a custom accelerator would be substantially cheaper and more power efficient than a GPU-based AI engine.
With device intelligence provided by various machine learning algorithms now a differentiating feature on all manner of products, whether small home appliances, gaming consoles, phones or vehicles, the need for real-time local processing has become acute. That means customized hardware.
Although these devices are network connected most of the time, a link to cloud services can't be guaranteed, nor can remote processing provide the near instantaneous response required of applications like autonomous driving or AR. Furthermore, when used for biometric security like Face ID, sending unique customer authentication data back to a central repository poses serious security and privacy risks. For these reasons and more, we've entered a new age of customized hardware.
As design automation software and the commoditization of semiconductor foundries have driven down the cost of designing and fabricating fully custom chips, the commercial barriers to developing custom silicon have never been lower.
For situations where a fully custom design isn't needed, FPGAs are a cheaper, more accessible option, particularly with the maturation of high-level compilers for frameworks like OpenCL and Intel's HLS, which open FPGA design to a broader developer community, not just skilled hardware designers.
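For a sense of what those high-level toolchains consume, here is a minimal OpenCL C kernel of the sort an FPGA compiler can synthesize into pipelined logic (the kernel and its argument names are invented for illustration; a real design would also need host-side setup code):

```c
/* Minimal OpenCL C kernel: out = a*x + y, one element per
 * work-item. An FPGA HLS toolchain unrolls and pipelines this
 * kind of embarrassingly parallel loop into dedicated circuitry,
 * with no instruction fetch or cache overhead. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *out) {
    size_t i = get_global_id(0);   /* this work-item's index */
    out[i] = a * x[i] + y[i];
}
```

The appeal is that a software developer writes C-like code while the compiler handles placement, routing and timing, the tasks that traditionally required a hardware engineer.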
Organizations building or relying upon intelligent distributed systems, whether an IoT device on a manufacturing floor or an autonomous vehicle or robot moving people or material, must carefully evaluate each product's degree of local intelligence and how custom hardware might improve performance and lower cost.