Virtual assistants such as Alexa, Siri and Cortana have almost fallen victim to their own hype, leaving many of their users with deflated expectations. There have been signs of usage declining after an initial flurry, as people come to realize that the assistants' powers of conversation are very limited, extending little beyond the ability to satisfy a list of basic requests. In some respects, they seem little better than rule-based expert systems.
This is at odds with the great progress made recently, after years of slower advances, in natural language processing, the ability to process written text and spoken words, which is partly why virtual assistants were over-hyped. Many service providers, as well as consumers, assumed that this progress would be followed by a breakthrough in conversational abilities: after all, if a system could understand one sentence, why could it not go on to process more and sustain a conversation?
The reason they cannot lies in the complexity of understanding context within a longer exchange. Humans are good at that, and language has evolved to allow complex ideas to be conveyed in conversation very economically, in relatively few words. That is why people take years to learn a foreign language, and often get stuck at a stage where they can form sentences, say to order a meal, and perhaps understand the initial reply, but then get lost if the respondent carries on.
This is the stage that conversational AI, as it is called, has reached today. It is why Microsoft has just bought a start-up called Semantic Machines, despite having worked in the field for almost two decades through internal development, recruitment and other acquisitions. The move underscores the intense competition among the players in voice technology, such as Amazon with Alexa, Apple with Siri, Google with Assistant and Microsoft itself with Cortana.
The Semantic Machines takeover may surprise those who noted Microsoft's boast that it was first to add full-duplex voice sense to its existing conversational AI systems: Cortana and its chat-bot Xiao Ice, which is targeted heavily at Chinese and Japanese speakers. Full-duplex voice allows both sides to talk at once, as people do in normal conversation, rather than waiting for the other party to finish before speaking again. This requires the system to predict what someone will say next, so that it does not fall behind when interrupted, but it remains a work in progress and far from matching human capabilities.
So, in a sense, Microsoft's slight lead in conversational AI with full-duplex voice explains why it has purchased Semantic Machines. Google had just launched its own full-duplex capability, aptly named Duplex, and so Microsoft was anxious to extend its lead by bringing in people from one of the most promising start-ups in the field, one that has demonstrated decent conversational capabilities.
After all, support for full duplex voice does not immediately mean that the conversational problem has been solved, and in practice the system could lead to further disappointment if oversold at this stage. Microsoft is acutely aware of that, and its VP and CTO of AI & Research, David Ku, has as good as admitted that there is a long way to go.
Semantic Machines' own co-founder and CEO, Daniel Roth, has gone further, suggesting that Cortana is still stuck at the single-command stage, while his own company's technology has moved towards sustaining a longer dialogue in which the system can provide additional information or services in response to requests that emerge during the conversation.
Semantic Machines, founded in 2014 with $21 million in backing, has focused exclusively on conversational AI and has some strong pedigree in its ranks, which helped attract Microsoft's attention. Roth had earlier launched Voice Signal Technologies, subsequently acquired by Nuance Communications for $300 million in 2007, and ultimately by Spectrum Brands. Larry Gillick, another co-founder and CTO of Semantic Machines, was VP of Research at Dragon Systems, then VP of Core Technology at Voice Signal Technologies, then VP for mobile devices at Nuance, and finally Chief Speech Scientist for Siri at Apple, so he has been around the field at high levels. Yet another co-founder, Chief Scientist and VP of Research Dan Klein, is currently professor of computer science at UC Berkeley, having previously been Chief Scientist at Adap.tv.
But the real proof is in the pudding, rather than the chefs that baked it, and so the main prize for Microsoft was Semantic Machines’ Conversation Engine – its core product. The aim with the Conversation Engine from the start was to address the much harder open-ended problem of not just extracting semantic meaning from words, whether in voice or text, but then expanding on that into a framework that incorporates nuances and content as the conversation develops. It attempts this by generating a self-updating learning framework for managing dialog context, state, salience and the goals of the end users.
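The Conversation Engine's internals have not been published, but the idea of a framework that manages dialogue context, state, salience and user goals can be pictured with a toy sketch. Everything below, including the class name, slots and sample utterances, is invented for illustration and is not Semantic Machines' actual design:

```python
# Purely illustrative: a toy dialogue state that accumulates context ("slots"),
# tracks the user's current goal, and keeps the most salient (most recently
# mentioned) entity available for resolving follow-up references.

class DialogueState:
    def __init__(self):
        self.goal = None     # what the user is trying to accomplish
        self.slots = {}      # facts gathered so far in the conversation
        self.salient = None  # likely referent of "it" / "there" / "then"

    def update(self, intent=None, **entities):
        """Fold one turn's parsed intent and entities into the state."""
        if intent:
            self.goal = intent
        self.slots.update(entities)
        if entities:
            # the last entity mentioned becomes the most salient one
            self.salient = list(entities.values())[-1]

state = DialogueState()
state.update(intent="book_table", restaurant="Luigi's")  # "Book a table at Luigi's"
state.update(time="8pm")                                 # "Make it 8pm"
state.update(party_size=4)                               # "For four of us"

print(state.goal, state.slots, state.salient)
```

The point of the sketch is that later turns do not stand alone: "Make it 8pm" only makes sense against the state accumulated from earlier turns, which is exactly the open-ended problem the Conversation Engine is said to tackle.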
Machine learning is at the heart of this engine, which must also maintain the dialogue intelligently with users as it goes along, and eventually allow other modes of interaction at the same time, for example on a screen or mobile phone.
This requires two capabilities from machine learning: identifying patterns as they develop, quickly and in real time, and coping with uncertainty and incomplete information. The system must therefore be able to make intelligent predictions, which is where Bayesian inference comes in. This offers a way of working back through a conversation to glean, from subsequent utterances, what the speaker most likely meant earlier. It allows the system to continually gather more information and update its estimate of what was said or intended earlier on the balance of probabilities.
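As a rough illustration of that updating process (not Semantic Machines' actual implementation), Bayes' rule can revise an earlier intent estimate as each later utterance supplies new evidence. The intents and likelihood numbers below are made up for the example:

```python
# Hedged sketch: Bayesian updating of a belief over what the user meant
# earlier, revised as subsequent utterances arrive. All numbers invented.

def bayes_update(prior, likelihoods):
    """Return posterior P(intent | evidence) from a prior over intents
    and P(evidence | intent) for each intent (Bayes' rule)."""
    unnormalized = {i: prior[i] * likelihoods[i] for i in prior}
    total = sum(unnormalized.values())
    return {i: p / total for i, p in unnormalized.items()}

# Ambiguous opening: "Book something for Friday" - dinner or flight?
posterior = {"book_restaurant": 0.5, "book_flight": 0.5}

# Each follow-up utterance is scored as evidence and folded in.
evidence = [
    {"book_restaurant": 0.7, "book_flight": 0.2},  # "...a table by the window"
    {"book_restaurant": 0.9, "book_flight": 0.1},  # "...for two, around 8pm"
]
for likelihoods in evidence:
    posterior = bayes_update(posterior, likelihoods)

print(posterior)  # belief in book_restaurant strengthens with each utterance
```

With each utterance, the posterior shifts towards the restaurant interpretation, which is the "balance of probabilities" judgment about what the opening request meant.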
Of course, Semantic Machines has no monopoly over Bayesian inference or any other technology. However, it has applied its expertise to tune the system and in effect to reach greater levels of certainty over interpretation of meaning and also to drill down further into the analysis than some of its competitors.
That said, we have yet to see anything approaching a meaningful comparison between alternative conversational AI systems, so judgment must be reserved, although Microsoft has presumably done its homework. There have been some studies of chatbots, conducted for example by the Chatbots Journal, but these tend to be qualitative rather than quantitative and merely allude to Semantic Machines' ability to go beyond understanding commands. There is still a long way to go, and it is likely that Microsoft will introduce features derived from the incorporation of Semantic Machines' engine quietly, without fanning the flames of expectation too quickly at first.
What is certain is that Microsoft's move will stoke the speech technology arms race further, and that more acquisitions can be expected from all the big players. That reflects how many of the best people in the field join start-ups in the expectation of cashing in through an IPO or an acquisition, making much more money than if they had joined the big players in the first place.