An intense roundtable discussion in London this week, hosted by speech recognition software developer Speechmatics, brought to the fore a host of insights into how 2019 could become the year of voice – and ultimately ended up convincing Faultline Online Reporter that it can’t.
A mix of attendees will always bring a variety of vested interests, yet the frank conclusion is one of a market which wants to achieve so much but hasn’t yet been equipped with the tools to achieve anything remarkable. “We want to do some really clever stuff, but just can’t do it yet,” was the most prudent statement of the bunch, voiced by a young software engineer, while others with extensive experience in neural networks were perhaps guilty at times of overinflating the AI bubble.
But before we too get carried away with the frustrations in AI, specifically for voice applications, Speechmatics timed its event to coincide with some news of its own, unveiling a new languages update for improved accuracy as part of its Global English initiative. This is an interesting new direction for the Cambridge-based company, considering that just two weeks ago we wrote excitedly about its impressive and rapidly growing roster of 74 languages. Speechmatics says the update includes some all-new machine learning algorithms and a handful of tweaks to its existing technology, along with the extension and enhancement of its data resources to better drive customer success stories.
The team described the Global English movement as an important step towards removing the confusion and inefficiency in voice technologies, which so often deters adoption. Collapsing dialects into a single English language model means all major accents and dialects are supported at once, thereby reducing overheads for customers and bringing an increase in accuracy of up to 16%. Presumably there could be issues with confusing slang terms between American English and British English, for example, but as the event made quite clear, there is a conversational race happening in the voice market which is years away from anything near perfection.
Voice should be seen as part of the solution to certain problems, rather than the be-all and end-all for how we interact with technology in the future. One attendee, who we shan’t name, argued that companies continue to create press feeding frenzies with what is fundamentally recycled technology. A heated exchange ensued about how the press fuel the fire and therefore hinder the development of AI across the board, yet those in the business of AI should not shoot the press for reporting failure, but instead treat negative coverage as a learning opportunity.
The question, “What does the next generation of consumer want?” is nigh impossible to answer, but clues are already emerging from which we can begin to gauge the winners and losers of tomorrow. One such sign comes from young children who, upon interacting with technology, are bamboozled to find that a screen isn’t controlled by touch, or that a speaker isn’t equipped with voice functionality. Some see these changing habits as the start of our inevitable dystopian future, but in a decade or maybe less, interaction with technology in this manner will be less alien and therefore more accepted.
More acceptance will mean more investment, and that is precisely why voice is the unsung hero of technology – because today’s voice technologies fundamentally don’t even work, yet they are already having an unprecedented impact on the next generation of consumers.
There are Android smartphones on the market right now offering the pick of voice assistants, letting users switch between Alexa on the morning commute and Google Assistant at home, for example, but to our knowledge other devices like smart speakers don’t currently offer this choice. However, we recall a smart speaker coming out in China with three different assistants, each with a different name and slightly different personality. Whether or not the device has been a success is unclear, but this sort of personalization is what the market needs. Simply being able to change the name of your voice assistant to make interactions more personal is not something Alexa, Google Assistant or Cortana can offer – nor would it be in their best interests.
Speaking of China, there was no doubt during the discussion this week of a huge gap emerging. “It’s scary. China is investing something like $1 billion in AI for every city, while companies in the West think a few million dollars is substantial,” highlighted one participant, while another said Chinese voice companies are mostly interested in other languages, primarily English, because voice technologies long ago perfected Mandarin.
And on the personalization front, prickly components of language like sarcasm, humor and irony are data-hungry monsters with poor generalization, we were told, relying on superficial algorithms which the industry is desperate to move beyond. Even incorporating emotions is apparently not on the immediate roadmaps of most in the voice space, although Deloitte was mentioned as one major firm to have done so, albeit on a basic level, incorporating happy and sad into its BEAT (Behavioral and Emotional Analytics Tool) platform, but reportedly this hasn’t taken off. “We live in the age of cheap prediction,” was one damning statement.
That said, a move to running things locally on the device is critical for progress, although Speechmatics says it was doing this two years ago and claims Nuance is the only other vendor offering local processing. Someone else suggested UK-based Intelligent Voice might also be on this list.
Outside of the consumer space, there is a case for voice in verticals like call centers and other automation processes. But while cost savings for someone like a tier 1 operator are potentially huge, apparently the number one request when contacting a call center is to speak to a human – a fact unlikely to change anytime soon. Therefore, the argument from suppliers like Speechmatics is not to replace employees, but to place software in between the caller and agent, where it can process the conversation and assist if necessary. For example, if the system detects a caller becoming irate, it can ping suggested terms to the agent to defuse the situation.
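The agent-assist flow described above can be sketched in a few lines. This is purely illustrative – the keyword list, threshold and suggested phrases are hypothetical stand-ins for whatever sentiment model a real deployment would use:

```python
# Toy agent-assist logic: flag an irate-sounding caller and surface
# de-escalation prompts to the human agent. A real system would use a
# trained sentiment model rather than a hand-picked keyword list.

IRATE_TERMS = {"ridiculous", "unacceptable", "furious", "cancel"}  # hypothetical
SUGGESTIONS = ["I understand your frustration", "Let me escalate this for you"]

def assist(utterance, threshold=1):
    """Return de-escalation prompts when the caller's words look irate."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    score = len(words & IRATE_TERMS)  # crude irritation score
    return SUGGESTIONS if score >= threshold else []

print(assist("This is ridiculous, I want to cancel!"))
```

The point is the architecture, not the scoring: the software sits between caller and agent, observing the transcript and nudging the human rather than replacing them.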
“Some of us were mortified when we first heard one of our customers had achieved a 33% increase in efficiency, assuming this meant a major workforce reduction, but that wasn’t the case at all,” confessed the marketing team.
On the R&D side, we touched briefly on the ongoing improvements to natural language processing tasks underpinning the subtle complexities in language which continue to trip up even the smartest algorithms. Coreference resolution, for example, is the task of finding all expressions referring to the same entity in a text and linking these mentions to real-world entities. A quick example: “My brother has a friend called Susan, he thinks she is so clever.” Simple enough for a human, but an AI can get itself in a twist. Coreference resolution is a particular pain point for chatbots, which are essentially the laughing stock of the industry. Despite coreference resolution coming on in leaps and bounds, it apparently continues to lack real-world knowledge – a feat voice AI may never overcome.
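To see why the Susan example is deceptively hard, consider a naive resolver that links each pronoun to the most recent gender-compatible mention. The tiny lexicon below is a hypothetical stand-in for real-world knowledge; the heuristic happens to work here, but it is exactly the kind of superficial rule that breaks down on richer text:

```python
# A toy rule-based coreference resolver: link each pronoun to the most
# recent mention with a compatible gender. Illustrative only -- the
# lexicons are hypothetical and real systems need far more knowledge.

PRONOUN_GENDER = {"he": "male", "him": "male", "she": "female", "her": "female"}
MENTION_GENDER = {"brother": "male", "Susan": "female"}  # hypothetical lexicon

def resolve(tokens):
    """Map each pronoun's index to the most recent gender-compatible mention."""
    links = {}
    seen = []  # (index, word, gender) of candidate mentions so far
    for i, word in enumerate(tokens):
        if word in MENTION_GENDER:
            seen.append((i, word, MENTION_GENDER[word]))
        elif word.lower() in PRONOUN_GENDER:
            gender = PRONOUN_GENDER[word.lower()]
            for _, mention, g in reversed(seen):
                if g == gender:
                    links[i] = mention
                    break
    return links

tokens = "My brother has a friend called Susan , he thinks she is so clever".split()
print(resolve(tokens))  # links "he" -> brother, "she" -> Susan
```

Swap Susan for a male friend and the gender heuristic collapses, which is the article’s point: resolving the ambiguity then requires real-world knowledge, not pattern matching.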
Conclusively, despite significant progress in recent years, voice and speech technologies are untrusted, untested and face hurdles which industry insiders describe as impossible to scale.
For balance, we should really sign off on a positive note, so here’s a quote from David Pye, Head of Speech at Speechmatics: “We may be the first speech technology company to do away with English dialects completely. Our expertise has proven that Global English is the right way to drive a shift change in the market, and our Next Generation languages update supports this.”