Embedded deep learning: out of the cloud and onto devices


Apple’s Face ID marks the beginning of the second stage of embedded AI, in which more of the intelligence happens on the device, independent of the cloud. But Apple is not the only game in town.

[Image: “Hey Siri” flow]

Loquacious intelligent assistants have become a standard fixture of consumer devices, such as cell phones and smartwatches. These are harbingers of the accelerating osmosis of AI into everyday life.

While charming, current implementations are pale imitations of what’s coming. With most of the intelligence happening on cloud server farms, today’s products are more like a ventriloquist’s dummy, parroting responses from the real brains behind the curtain; smart, but limited.

The emergence of Face ID, Apple’s wondrous new biometric authentication system that uses facial recognition backed by an array of sensors and a new AI-accelerated iPhone SoC, marks the beginning of the second stage of embedded AI in which more of the intelligence happens on the device, independent of the cloud.

Making this a reality requires a mix of deep learning software and embedded hardware designed to execute repetitive, highly parallel algorithms efficiently and with minimal latency.

As I’ve previously detailed, thanks to extraordinary increases in GPU hardware performance, deep learning algorithms are being applied to a growing array of business applications including cybersecurity, logistics, data protection and quality control.

The problem with such algorithms is that they are voracious consumers of data with a bias towards complexity: the larger the data set and the more computationally intensive the approach, the more accurate and useful the results. For Nvidia, this is a good problem to have, since the company specializes in hardware built to tackle parallelizable complexity. Consequently, Nvidia’s GPU Technology Conference (GTC) has become relevant to more disciplines and businesses every year.

Until recently, such complexity consigned deep learning calculations to the realm of power-hungry servers with an array of GPU accelerators. However, advances in both semiconductor process integration and algorithm development now allow even mobile devices to perform non-trivial tasks like image tagging, biometric authentication, and robotic control.

An iPhone designed for AI

Apple has consistently set the standard for mobile device performance since it began designing custom SoCs, starting with the A4, first used in the original iPad and iPhone 4 in 2010. This year’s edition, the A11 Bionic, keeps up the trend of annual double-digit performance increases, but it is also the first designed to accelerate the neural network algorithms used in deep learning. Although Apple is congenitally (and understandably) secretive about design details, it did reveal the following:

“The new A11 Bionic neural engine is a dual-core design and performs up to 600 billion operations per second for real-time processing. The A11 Bionic neural engine is designed for specific machine learning algorithms and enables Face ID, Animoji and other features.”

Two of these “other features” are photo tagging and Siri, Apple’s voice assistant, whose implementations the company recently detailed in a pair of research papers.

In a paper titled “An On-device Deep Neural Network for Face Detection,” Apple researchers noted that the company first started using deep learning for face detection last year in iOS 10, where it had to work around the limitations of even high-end phones in running deep learning algorithms.

Like others, Apple had been using cloud-based systems for image recognition. However, to protect user privacy, Apple encrypts photos on the device before they are sent to iCloud, and only devices registered to the user’s iCloud account can decrypt them. Apple’s servers therefore cannot process the images, which required the image recognition algorithms to run on-device.

The paper describes how Apple worked around limited memory and CPU resources without disrupting other OS tasks or using significant additional power. It goes into technical detail about how Apple tuned deep learning models for a SoC-sized GPU and concludes:

“Combined, all these strategies ensure that our users can enjoy local, low-latency, private deep learning inference without being aware that their phone is running neural networks at several hundreds of gigaflops per second.”

Apple’s research was in preparation for the computational challenges posed by its new Face ID authentication technology, a biometric security system that uses a front-facing dot projector and IR camera: the projector casts more than 30,000 infrared dots onto the user’s face, and the camera reads them to create an infrared image and 3D depth map. Apple’s description of the implementation is sparse, but indicates the use of new features in the A11 SoC:

“A portion of the A11 Bionic chip’s neural engine — protected within the Secure Enclave — transforms the depth map and infrared image into a mathematical representation and compares that representation to the enrolled facial data.” [Note: the Secure Enclave is an embedded coprocessor on the A-series chips used for cryptographic operations and storage that’s available only to the OS.]

As detailed in another recently released paper, the iPhone also uses deep neural networks (DNN) to recognize and parse the voice commands for its “Hey Siri” feature.

As I detailed in a previous column, the processing required to listen for and detect so-called wake words continuously is minimal. Thus the iPhone uses a low-power auxiliary Always On Processor (AOP) to trigger Siri activation. But as the paper describes, once triggered, the AOP wakes the main processor to analyze the sound with a larger DNN.
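The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not Apple’s implementation: the threshold values, the function name `detect_wake_word` and the stand-in scoring functions are all hypothetical, and a real system would score audio windows with trained acoustic models.

```python
from typing import Callable, Sequence

# Hypothetical thresholds; real values would be tuned on recorded audio.
FIRST_PASS_THRESHOLD = 0.5   # permissive check on the always-on processor
SECOND_PASS_THRESHOLD = 0.9  # strict check on the main processor

# A scorer maps a window of audio samples to a confidence between 0 and 1.
Scorer = Callable[[Sequence[float]], float]


def detect_wake_word(window: Sequence[float],
                     small_model: Scorer,
                     large_model: Scorer) -> bool:
    """Return True only if both stages agree the wake word was spoken."""
    # Stage 1: cheap scorer, evaluated on every audio window.
    if small_model(window) < FIRST_PASS_THRESHOLD:
        return False  # the main processor is never woken
    # Stage 2: expensive scorer, evaluated only after stage 1 fires.
    return large_model(window) >= SECOND_PASS_THRESHOLD


# Stand-in scorers (assumptions) so the sketch runs end to end.
large_model_calls = 0

def stub_large_model(window: Sequence[float]) -> float:
    global large_model_calls
    large_model_calls += 1  # count how often the big model is needed
    return 0.95

quiet = detect_wake_word([0.0, 0.1], lambda w: 0.2, stub_large_model)
spoken = detect_wake_word([0.6, 0.8], lambda w: 0.7, stub_large_model)
print(quiet, spoken, large_model_calls)
```

The point of the design is that the expensive model runs only on the rare windows the cheap model flags, so the main processor can stay asleep most of the time.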

App developers can also use the neural network acceleration features of iPhone hardware via Core ML, Apple’s machine learning framework of APIs and development tools. As this tutorial details, apps can use Core ML with the iOS imaging SDK for tasks like shape recognition and object identification.

ARM, Google, Microsoft, others also bringing AI to devices

ARM, the developer of the processor platform licensed and customized by Apple and used in nearly every other mobile device, is bringing AI to its generic SoC design, a move that will significantly expand the number of AI-accelerated devices.

Known as DynamIQ, the design adds processor instructions intended to accelerate machine/deep learning algorithms, which ARM expects to yield a 50x increase in AI performance over the next 3-5 years relative to current ARM systems.

Some companies are already using the lower-power ARM Cortex-M processor for embedded machine learning applications. For example, the Amiko Respiro is an inhaler for asthma patients that uses data from several sensors and onboard ML software to compute a medication’s effectiveness and develop therapies customized to each patient.

Not to be outdone, Google is paving the way for deep learning algorithms on mobile and embedded devices with TensorFlow Lite, a platform designed for fast startup of TensorFlow models that fit in the small memory footprint of mobile devices. The framework also has interfaces that automatically use on-device acceleration hardware, such as embedded GPUs, when available.
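One common way to make a model fit a small memory footprint is to store its weights as 8-bit integers instead of 32-bit floats. The sketch below illustrates that general idea in plain Python. It is not TensorFlow Lite’s API, and the helper names `quantize` and `dequantize` and the sample weights are made up for the example.

```python
from array import array


def quantize(weights):
    """Map float weights onto signed 8-bit integers with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale for q in quantized]


# Made-up weights, stored as 32-bit floats ("f") for the comparison.
weights = array("f", [0.82, -1.27, 0.05, 0.4, -0.63])
q, scale = quantize(weights)
restored = dequantize(q, scale)

print(len(weights) * weights.itemsize, "bytes as 32-bit floats")
print(len(q) * q.itemsize, "bytes as 8-bit integers")
print("largest rounding error:",
      max(abs(a - b) for a, b in zip(weights, restored)))
```

The quantized weights take a quarter of the storage, at the cost of a rounding error of at most half the scale step per weight. Whether that loss is acceptable depends on the model and has to be measured.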

Microsoft is also developing embedded machine learning software that can fit on mobile and IoT devices, even something as modest as the Raspberry Pi. The research is currently focused on narrow, niche applications for particular scenarios like embedded medical devices or smart industrial sensors.

Another company, Reality AI, offers machine learning software libraries designed for embedded sensors and devices. While hardware suited to such physically small devices and harsh operating environments doesn’t yet support ML accelerators, over time the technology will shrink and allow such devices to support more complicated and accurate AI models.

My take

The opening chapters of the deep learning story were about the exponential improvement in hardware performance, primarily via GPUs, but also new custom silicon like the Google TPU, for running larger and larger models. These fueled the development of shared cloud services that democratized access to sophisticated, expensive hardware and allowed any organization to build business-specific AI software.

The next phase of AI development is bringing deep learning algorithms out of the cloud and into the physical world, whether a mobile phone, industrial sensor or medical device.

While initial efforts have naturally focused on shrinking existing ML/deep learning models into a mobile device’s limited processor and memory footprint, future implementations will use a mobile SoC’s growing transistor budget for AI accelerators.

The combination of hardware and software innovation will dramatically improve the power of embedded AI software and lead to innovative applications across industries. If your business uses or builds mobile or IoT software, it’s time to start planning for an age of intelligent, self-learning and self-correcting applications.

Image credit - via Apple documents and
