Buried among the stats about a saturated smartphone market, accelerating e-commerce growth and shifting TV viewing habits are cogent observations about the rapid emergence of voice as a preferred UI for many types of interactions.
We first discussed the coming age of enterprise voice UIs last winter, noting that conversational hardware and software systems like Amazon Echo and Google Home, when paired with powerful AI services on AWS and Google Cloud respectively, are poised to move from consumer gadgetry to enterprise utility. We concluded,
Business execs and IT leaders should not ignore conversational software as just some consumer product hype, but should systematically study the technology, consider how it might improve existing business processes or customer products, evaluate the various cloud backend services and choose some promising applications that facilitate the iterative testing of intelligent, conversational voice and text UIs.
Meeker noted that voice is ripe to replace text as a conversational UI, citing Google Assistant, where 70% of requests are made in natural language (conversation), yet only 20% of mobile queries (that is, queries performed on a device designed for talking) are made by voice.
The days of voice being an error-riddled substitute for typing are largely over: as she also highlights, the error rate of Google's deep learning transcription service is now below 5%, roughly the threshold of human accuracy. While her point is that the combination of voice, images, data and algorithms can yield competitive advantages for retailers and "crafty big brands," we see their influence going far beyond sales and marketing.
Creating custom voice-activated devices
Our previous column focused on the cloud-based AI services used to power conversational interfaces via voice transcription, sentiment analysis and chatbot logic flow. A missing piece was the means of conversational interaction.
The implicit assumption has been that new software would piggyback on either a smartphone or a consumer device like the Echo, where Amazon has embraced developers with an open API for so-called Alexa skills, leading to a thriving software ecosystem.
Meeker highlighted the explosive growth of both Echo sales (increasing about 30% per quarter) and the number of published Alexa skills (doubled just this year).
While devices like phones and home assistants are the perfect platform for consumer services seeking the broadest reach, they aren't optimal for every business or industrial application.
What if you could create an inexpensive system targeted to a particular task or business scenario, say a hands-free device that workers could use to execute some commands, update a Slack chat and trigger a Twilio SMS message?
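To make the scenario concrete, here is a minimal sketch of how such a device might route transcribed commands to actions. The command names, handlers and return strings are illustrative assumptions, not any vendor's API; a real device would replace the handler bodies with a POST to a Slack incoming webhook and a call to Twilio's SMS API.

```python
# Hypothetical command dispatcher for a hands-free worker device.
# Command phrases and handlers are illustrative assumptions.

COMMANDS = {}

def command(name):
    """Register a handler for a spoken command phrase."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("update slack")
def update_slack(text):
    # A real device would POST this to a Slack incoming-webhook URL.
    return f"slack: {text}"

@command("send sms")
def send_sms(text):
    # A real device would call Twilio's REST API here.
    return f"sms: {text}"

def dispatch(transcript):
    """Match the start of a transcribed utterance to a registered command."""
    for name, handler in COMMANDS.items():
        if transcript.lower().startswith(name):
            return handler(transcript[len(name):].strip())
    return None
```

Keeping dispatch logic separate from the network calls like this also makes the device's command vocabulary easy to test without any connectivity.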
In today's world of inexpensive, customizable hardware like the Raspberry Pi, conversational software such as that from startup KITT.AI provides exactly that platform.
Founded by four PhDs in AI, three of whom previously worked at Google, KITT.AI is tackling two challenges to voice UI development: enabling customized wake words for voice-activated devices and parsing and logically mapping a two-way conversation.
KITT's founders demonstrated its first product, a hotword detection engine called Snowboy, at the recent Nvidia developer conference (see our coverage here). Snowboy addresses a limitation Echo users know all too well: the device can only be activated by prefacing commands with one of three wake words, Alexa, Amazon or Echo.
Those building a custom conversational device will undoubtedly want to select a custom wake or hotword that's relevant to the product. That's where Snowboy comes in.
The Snowboy AI engine can build a deep learning voice activation model that responds to any word, whether a wake word or a command. For example, in a voice-controlled garage door opener, the wake word might be the compound phrase "garage door." Similarly, voice-activated window coverings could be activated with the word "blinds." Once listening, the device could respond to several different commands such as "open," "close," "stop," "start" or "reverse."
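The garage-door example above amounts to a small state machine behind the detected commands. The sketch below (pure Python, a subset of the commands listed, with state names chosen for illustration) shows how little logic the device itself needs once hotword detection hands it a command:

```python
# Illustrative state machine for the voice-controlled garage door example.
# States and the command subset handled here are assumptions for illustration.

class GarageDoor:
    def __init__(self):
        self.state = "closed"

    def handle(self, command):
        """Apply a recognized spoken command and return the new state."""
        if command == "open" and self.state in ("closed", "stopped"):
            self.state = "opening"
        elif command == "close" and self.state in ("open", "stopped"):
            self.state = "closing"
        elif command == "stop" and self.state in ("opening", "closing"):
            self.state = "stopped"
        elif command == "reverse" and self.state == "closing":
            self.state = "opening"
        elif command == "reverse" and self.state == "opening":
            self.state = "closing"
        return self.state
```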
Unlike Echo or Google Home, industrial or commercial voice-activated devices might not be connected to the Internet. To meet this requirement, Snowboy builds a compact AI model that can be embedded and run on a variety of hardware, including all versions of Raspberry Pi, any ARMv7 (32-bit) or v8 (64-bit) device running Android, and the Intel Edison (Atom-based), while consuming less than 10% of system resources even on the smallest cores.
Managing the conversation
KITT's other product is what it calls a Natural Language Understanding (NLU) engine that works with the company's ChatFlow backend service to provide conversation routing and session management. The goal of ChatFlow is to automate rich, multi-part dialogues between user and machine, eliminating the need for long compound commands. As KITT co-founder Xuchen Yao puts it,
How many times have you seen demos like "book me a four-star hotel in Seattle under $300 with free WiFi less than 2 miles from the city center for tomorrow night"? Conversational AI wouldn't be as big a deal if we all spoke like that. … In the real world, people book hotels with agents by engaging in a dialogue, not a long one-way monologue. It takes multiple turns, with validations and confirmations along the way.
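The multi-turn pattern Yao describes is often implemented as slot filling: the bot tracks which details it still needs and asks for them one turn at a time. Here is a minimal sketch of the idea, not KITT.AI's actual API; the slot names and prompts are illustrative assumptions drawn from the hotel example:

```python
# Minimal slot-filling sketch of a multi-turn hotel-booking dialogue.
# Slot names and prompts are assumptions for illustration, not ChatFlow's API.

REQUIRED_SLOTS = ["city", "date", "max_price"]
PROMPTS = {
    "city": "Which city?",
    "date": "For which night?",
    "max_price": "What's your budget?",
}

def next_turn(slots):
    """Return the next prompt, or a confirmation once every slot is filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in slots:
            return PROMPTS[slot]
    return (f"Booking a hotel in {slots['city']} on {slots['date']} "
            f"under {slots['max_price']}. Confirm?")
```

Each user reply fills one slot, and the loop naturally produces the validations and confirmations Yao mentions.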
NLU and ChatFlow replace the process of mapping conversation dialogs into code with a graphical flow editor that translates the chatbot logic into a model, trains it with sample inputs and deploys the model to the ChatFlow SaaS engine. According to Yao, "ChatFlow provides the full stack for programming modern dialogue systems" and synchronizes the graphical conversation design with code implementation.
The system appears similar to Microsoft's Azure Bot framework paired with Logic Apps that we discussed here, albeit more sophisticated and flexible. For example, ChatFlow can take input and deliver results to the following chatbot/conversational platforms: Amazon Alexa, Facebook Messenger, Kik, Layer (work in progress), Skype, Slack, Telegram and Twilio.
Currently, Snowboy and ChatFlow are discrete products: the former designed for embedded systems, the latter for cloud-connected devices or mobile apps connected to other bot frameworks.
While KITT.AI has said nothing about integrating the two, it's not hard to see how an IoT device might use Snowboy for a limited set of offline conversational commands and connect to ChatFlow for more sophisticated interactions.
After a slow start, marked by Siri's balky initial incarnation and the gimmicky early Echo, conversational devices and interfaces have matured, and with concomitant improvements in accuracy they are rapidly becoming mainstream for consumer applications.
Expect the same to occur in business. Earlier this year, Tintri embraced the idea of "ChatOps" by building both an Alexa skill and Slack add-on that connect to its management API and that can respond to spoken or typed commands such as reporting on the number of zombie VMs.
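The ChatOps pattern here is simple to sketch: a typed or transcribed chat command is translated into a call against a management API. The function below is a hypothetical illustration; the real endpoints and response format of Tintri's API are not shown in this column, so a stand-in object is used:

```python
# Hypothetical ChatOps handler: map a chat command to a management-API call.
# The zombie-VM command and FakeManagementAPI are illustrative assumptions.

def handle_chat_command(text, api):
    """Translate a typed or transcribed chat command into an API call."""
    if "zombie" in text.lower():
        count = api.count_zombie_vms()
        return f"Found {count} zombie VMs."
    return "Sorry, I don't know that command."

class FakeManagementAPI:
    """Stand-in for a storage-management API, used here for illustration."""
    def count_zombie_vms(self):
        return 3
```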
The Bank of New York is using Tintri's AI integrations to build a voice-controlled system that will allow IT employees to complete many tasks without memorizing and entering arcane commands.
Enabling AI-based IT operations or products will require both an extensible system design using published APIs and AI platforms to handle voice commands and manage conversations.
Alexa skills and various cloud bot frameworks have paved the way, but platforms like ChatFlow and Snowboy will be required to make conversational interfaces more sophisticated and widely available on any device. KITT.AI may be a pioneer, but I don't expect it to be alone as other products and software frameworks emerge to enable chatty AI-powered software.