While CES did nothing more than muddy the waters regarding all things voice, we sought some clarity this week from our friends at machine learning specialist Speechmatics – covering the key challenges facing voice in 2019 and what the technology vendors are attempting to do about it. Our increasing interest in this area stems from the bottom line: every technology titan targeting voice deployments is aiming for the same thing – the most natural, human-like experience possible. We know that’s a long, long way away, yet there are signs significant progress is being made.
“We need to make this work for real, not provide something you can write a paper on,” proclaimed VP of Products Ian Firth, appropriately summarizing the state of the market.
One of the key challenges is what Speechmatics describes as the “unprepared speech model” – quite simply the unstructured manner in which humans talk. A run-of-the-mill conversation will be littered with fillers, made-up words (like company and product names), and repetitions, for example, making life tricky for even the smartest AI.
During a webinar called “Automatic Speech Recognition for real world applications,” we were treated to our first Speechmatics demos, despite having spoken to the Cambridge, UK-based vendor on a couple of occasions now. The first involved subtitling a speech from UK Prime Minister Theresa May, comparing the BBC’s capabilities with Speechmatics’ own system. Unsurprisingly, Speechmatics showed a superior experience, with much faster speech-to-text and far better accuracy. But this is merely the start.
The second was more interesting, pitting Speechmatics against a rival company (and also a partner) called TranscribeMe, with each system transcribing the same academic speech. The Speechmatics software automatically removed redundant words from the verbatim transcript, as well as repeated words, and even converted a made-up word back to its original dictionary form. For a company attempting to streamline voice technology adoption through its Global English project, ironing out the creases in speech is essential. There is an argument that this strips away the human element, so the technique would not be ideal in the consumer space, but it is perfect for business use cases.
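To give a flavor of this kind of transcript clean-up – purely our own illustration, not Speechmatics’ actual pipeline – a minimal sketch of filler and repetition removal might look like this, with the filler list and function name being our own assumptions:

```python
import re

# Common disfluencies to strip (illustrative list, not exhaustive)
FILLERS = {"um", "uh", "er", "erm", "hmm"}

def clean_transcript(text: str) -> str:
    """Drop filler words and immediate word repetitions from a verbatim transcript."""
    cleaned = []
    for word in text.split():
        bare = re.sub(r"[,.?!]", "", word).lower()
        if bare in FILLERS:
            continue  # skip fillers like "um", "uh"
        if cleaned and bare == re.sub(r"[,.?!]", "", cleaned[-1]).lower():
            continue  # skip immediate repetitions like "the the"
        cleaned.append(word)
    return " ".join(cleaned)

print(clean_transcript("So um the the model, uh, works"))
```

A real system would of course do this statistically rather than with a word list, and would also handle the trickier case of restoring made-up words, but the sketch shows the basic shape of the problem.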
Subtitling might sound trivial in the grand scheme of things, in an age where engaging in conversation with robots (smart speakers and smartphones) is the norm, but it’s worth noting that the majority of content out there is non-regulated, meaning subtitles are not required. They can be costly to produce, so the emergence of machine learning that lets broadcasters and media companies add subtitles at a fraction of the cost, particularly for major live events, and thereby reach a wider audience, is something you cannot put a price on in our view.
It’s all well and good eliminating elements of unprepared speech to create a smoother experience in areas like subtitling, but would the consumer experience of using a smart voice assistant be enhanced if it were to respond to queries in a similar unprepared manner? The more human-like, the better, we imagine. But for a bit of context on the specifics of demand in the market at the moment, below is a graph based on results from a Speechmatics survey, comparing what people view as the most important elements of a speech system. “Words are correct” comes out on top by some distance, while “Formatting is correct” – e.g. 123 instead of one two three – is a less desired feature.
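The “Formatting is correct” feature mentioned above is often called inverse text normalization. As a hedged illustration of the simplest case – collapsing spoken digit words into a written number, with the function name and approach being our own, not any vendor’s – a sketch might look like:

```python
# Map spoken digit words to characters (single digits only, for illustration)
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def format_digits(text: str) -> str:
    """Collapse runs of spoken digit words into a single number string."""
    out, run = [], []
    for word in text.split():
        if word.lower() in DIGITS:
            run.append(DIGITS[word.lower()])  # extend the current digit run
        else:
            if run:
                out.append("".join(run))  # flush the run as one number
                run = []
            out.append(word)
    if run:
        out.append("".join(run))
    return " ".join(out)

print(format_digits("press one two three to continue"))
```

Production systems handle far more than this – currencies, dates, compound numbers like “twenty-three” – which is part of why survey respondents still rank raw word accuracy well ahead of formatting.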
As for legitimate voice use cases, a frustrating one emerged last week which Speechmatics helped shine some light on for us. It involved US radio firm Pandora claiming it was only the second music streaming company ever – after Amazon – to bring voice functionality in-app. Pandora’s Voice Mode is built on Houndify, a voice and conversational AI platform from song recognition firm SoundHound, along with what Pandora calls Speech-to-Meaning and Deep Meaning Understanding voice recognition technologies.
Unfortunately, Pandora itself never got back to us, but Firth told Faultline Online Reporter, “They are starting to be picked up as buzzwords but there is some truth there. We can create representations that encode aspects of what we would understand as meaning: basic sentiment, speaker-attributes and other fairly basic and well-defined things that are studied in the field of natural language processing. As for ‘meaning’ as we would know it, it’s a million miles away.”
It was fitting that the Q&A session during Speechmatics’ webinar was a relentless bombardment – with attendees probing all sorts of topics, from noise disturbances to accuracy to the difference between transcription and dictation. Ultimately, this is a testament to the voice market and to the growth of vendors like Speechmatics. It’s going to be an exciting year.