Voice UIs come of age – enterprise IT, listen up!

By Kurt Marko, January 12, 2017
Summary:
Amazon Alexa and other voice UIs may look like a purely consumer phenomenon, but their rapid acceptance should be a wake-up call for enterprise IT


CES is usually a big distraction for enterprise IT, with more noise than signal and where hype and promises count as reality. This year was no different. It's unwise, however, to ignore CES entirely: as we've learned since the dawn of smartphones, the most compelling consumer technologies end up changing the enterprise, whether through new business processes, applications, products or services.

While recent CES events were dominated by themes like AR, 3D TV, smartwatches, e-readers and HD disc formats that had little relevance to any business outside consumer electronics, the consensus winner this year, Amazon Alexa, is the rare hit that can bridge the consumer-enterprise divide. Although the consumer market is the clear target for Alexa and the myriad devices that embody the conversational android (as in humanoid robot, not phone OS), it represents an emergent software category that will prove useful in a variety of business scenarios and that is enabled by a new set of cloud platform services.

Conversational systems, which Gartner identifies as one of the top strategic technology trends for 2017, encompass a range of user interfaces, including automated chatbots on messaging platforms like Facebook Messenger, voice-first devices like the Alexa-powered Amazon Echo and Google Home, voice assistants like Apple's Siri, and voice-text hybrids like Google Assistant. Voice seems the most disruptive among them. Two consumer studies of voice UIs by Creative Strategies concluded that

The voice-user interface has gone mainstream. What’s more, mainstream consumers seem to recognize its value and convenience. … It is encouraging, from a sentiment perspective, that voice looks to be a natural extension of our keyboard/mouse/touch-based input and output methods. Consumers seem to recognize its value, and want it to work in more ways.

Indeed, the survey found that most consumers wish voice interfaces could do more, with 85% agreeing that "when their voice assistant works, it's great, and when it doesn’t, they get irritated."

Mainstreaming voice AI

In contrast to its rivals from Apple and Google, Alexa's CES ubiquity was catalyzed by Amazon seizing on the power of an ecosystem, encouraging developers to add so-called Skills that augment the tasks Alexa-powered devices can perform; the enticements include an open API, ample documentation and sample code. For example, while the stock Echo can stream music and order products from Amazon, sometimes with unintended consequences, add-on Skills allow it to control a connected thermostat, turn on lights or start your car. Amazon's Skills Kit is the ostensible enabler of such versatility; however, it masks the indispensable infrastructure behind these applets: the Alexa AI engine at AWS.
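The division of labor behind a Skill is straightforward: the Alexa service in AWS handles the speech recognition and intent matching, then delivers a JSON request to the developer's endpoint, typically an AWS Lambda function, which returns the text Alexa should speak. A minimal handler might look like the following sketch; the `TurnOnLightsIntent` name is a hypothetical example, not part of the Skills Kit itself.

```python
# Sketch of an AWS Lambda handler for a hypothetical Alexa custom skill.
# Alexa sends a JSON event describing the matched intent; the handler
# returns JSON telling Alexa what to say in response.

def lambda_handler(event, context):
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "TurnOnLightsIntent"):
        # Real skills would call out to the device's cloud API here.
        speech = "Okay, turning on the lights."
    else:
        speech = "Sorry, I didn't understand that."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```

The developer never touches audio or language models; all the hard AI work happens before the event reaches this function.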

Heretofore, the only way developers could access Amazon's conversational algorithms was through Alexa devices. At re:Invent 2016, however, AWS launched several new AI services that will provide the computational foundation for voice, text and image recognition applications. The most interesting is Amazon Lex, which, as Amazon CTO Werner Vogels puts it,

... is a new service for building conversational interfaces using voice and text. The same conversational engine that powers Alexa is now available to any developer, making it easy to bring sophisticated, natural language 'chatbots' to new and existing applications. The power of Alexa in the hands of every developer, without having to know deep learning technologies like speech recognition, has the potential of sparking innovation in entirely new categories of products and services. Developers can now build powerful conversational interfaces quickly and easily, that operate at any scale, on any device.

Like most recent AI applications, Lex uses machine learning (ML) algorithms, such as convolutional neural nets (CNNs), that are trained and iteratively improved on large amounts of data, such as image, voice or phrase libraries. According to Vogels,

Developers can simply specify a few sample phrases and the information required to complete a user's task, and Lex builds the deep learning based intent model, guides the conversation, and executes the business logic using AWS Lambda.
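In practice, the business-logic half of that arrangement is a small Lambda function: Lex resolves the user's utterance into an intent and a set of filled-in slots, passes them to the function, and relays the returned message back to the user. The sketch below follows that request/response pattern; the `BookHotel` intent and `City` slot are hypothetical examples, not built-in Lex names.

```python
# Sketch of an AWS Lambda fulfillment function for an Amazon Lex bot.
# Lex supplies the matched intent and its slot values; the function
# executes the business logic and returns a closing message.

def lambda_handler(event, context):
    intent = event["currentIntent"]["name"]
    slots = event["currentIntent"]["slots"]
    # Real business logic (e.g. calling a reservation system) goes here.
    city = slots.get("City") or "your city"
    message = "Booked a room in {}.".format(city)
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }
```

The conversational heavy lifting, recognizing speech, classifying the intent and prompting for missing slots, is all done by the Lex service before this code runs.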

By creating the leading consumer voice appliance, Amazon may have the surprise hit of CES, but it's arguably late to the conversational machine-learning software party. Google has been collecting voice samples for years via its VoIP phone and messaging services, using them to improve voice search. The company has also poured R&D into the problem, developing one of the leading open source ML programming frameworks, TensorFlow, and a custom ML processor chip. Like AWS, Google offers a suite of ML software services via Google Cloud that cover applications like image classification, speech recognition, natural language processing and translation, and time series financial data analysis.

Microsoft has also entered the market for conversational software with consumer services like Cortana, infamous experiments like Tay and cloud services like Azure ML, Cognitive Services and the Bot Service.

Enterprise applications

As is typical during the early stages of any technology, there is plenty of experimentation and a lot of both promising and goofy ideas. As Jon Reed observed in reviewing enterprise UX futures:

In AI, chatbots have small brains for small use cases. Go for bigger things and risk a Microsoft Tay. UX is design-for-purpose. In a dangerous, hands-free setting like a utility pole, voice-activated UX might be now.

Two Las Vegas resorts are early adopters, with Wynn equipping rooms with Echos to control room functions and the Cosmopolitan unveiling a chatbot with an attitude to arrange for room amenities, provide restaurant suggestions, give guided tours and even play games. Rockwell Automation used various Azure services to build Shelby, a production line monitoring chatbot, which allows managers to get status updates and diagnostic information using a conversational interface.

HubSpot is an early user of Amazon Lex, which it used to create a chatbot that connects to multiple systems, including HubSpot and Google Analytics, and provides conversational access to marketing information. VoiceBase used the Google Cloud Speech API to build a SaaS speech analytics platform that can transcribe millions of recordings from call centers, audio conferencing services, telecom providers or any business wanting to cull useful information from hours of unstructured voice data.
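What a platform like VoiceBase's sends to such an API is disarmingly simple: base64-encoded audio plus a recognition configuration, posted as JSON to the cloud service, which returns the transcript. The sketch below only assembles a payload in the shape the Cloud Speech REST endpoint expects; authentication and the actual HTTP call are omitted, and the parameter values are illustrative assumptions.

```python
import base64
import json

def build_speech_request(audio_bytes, language="en-US", sample_rate=16000):
    """Assemble a JSON body for a cloud speech-recognition request.

    Payload shape only -- sending it requires credentials and a real
    audio recording, both omitted from this sketch.
    """
    return json.dumps({
        "config": {
            "encoding": "LINEAR16",          # uncompressed 16-bit PCM
            "sampleRateHertz": sample_rate,
            "languageCode": language,
        },
        # Audio must be transmitted as base64 text inside JSON.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    })
```

The value for the enterprise is that the ML model behind the endpoint, trained on those years of accumulated voice samples, is rented per request rather than built in-house.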

My take

What all the ideas above have in common is their reliance on sophisticated cloud infrastructure and application services to enable conversational voice and text interfaces. Given the complexity, expense and specialized expertise required to build such ML infrastructure, it's unlikely that any of these businesses would have tackled, much less completed, such projects without the availability of pre-built ML engines and services like those from AWS, Google or Microsoft.

The competition to become the preferred ecosystem for a new generation of software is bound to be fierce with unpredictable technical and business developments in the coming year. While the big three cloud services are focused on providing the ML and AI infrastructure, others like IBM with its Watson technology are straddling the PaaS-SaaS divide by providing both developer services and packaged applications.

Business execs and IT leaders should not dismiss conversational software as mere consumer product hype. Instead, they should systematically study the technology, consider how it might improve existing business processes or customer products, evaluate the various cloud backend services and choose some promising applications that facilitate the iterative testing of intelligent, conversational voice and text UIs.