Voice may be one of the hottest markets in digital entertainment right now, but there remains an abundance of challenges in perfecting the underlying technologies which deliver such divinely convenient experiences. Dialects, accents, over-talk, syntax and slang are all common problems. And while modern advances generally make life easier, building products around something as complex and fast-evolving as human language means these hurdles are only becoming more difficult to navigate.
The battle between Google, Amazon, Apple and Microsoft in voice has made for fascinating observation over the past couple of years, since the market really kicked into life as a means of controlling content or a smart assistant. But will the winners and losers ultimately be decided by which service supports the most languages and therefore wins the most users? This is the one area where Google Assistant trumps Alexa, supporting ten languages compared to three, although the two also offer specific support for different English dialects and accents.
In comparison, a specialist like Speechmatics, a developer of neural networks for voice based in the UK city of Cambridge, supports 74 languages. Given the pace at which each individual language is changing, and the minute intricacies at the heart of that evolution, the concept of managing this with just 55 people sounds like witchcraft. “We were shocked at how fast language is changing, so we have invested in internal tools to help companies quickly build support for new languages,” said Ian Firth, VP of Products at Speechmatics.
Founded in 2006, Speechmatics has spent the majority of its life as a research institute and only recently transitioned to the commercial space, offering a cloud transcription service as its main product, handling tasks like automatic note taking. But increasingly this is being used in media broadcast for captioning live events, a task that is all too often littered with mistakes. Firth reckons the technology’s potential is much greater still, with Speechmatics hoping it can address areas like providing hints and tips to agents in call centers, possibly leading to huge truck roll savings.
“Up to around 2013 we were just playing around and waiting for compute to become available,” Firth told Faultline Online Reporter, adding that it was almost by luck that the company arrived where it is today – claiming accuracy beyond anything the big boys currently offer.
Speechmatics has built a framework called Automatic Linguist (AL), which is designed to build support for new languages at a rapid rate, beyond its current roster of 74. The technology was tasked with building a language a day over six weeks, more to prove a point than for any other reason, Firth confessed. AL is powered by a set of open source algorithms, but Firth claims that Speechmatics, unlike others, consumes, understands and modifies these algorithms rather than taking them straight off the shelf and shipping a product. Its container-based Kubernetes Automatic Speech Recognition (ASR) product is available for private or public clouds or on-premises, in both batch and real-time modes. Public cloud transcription starts at £0.06 (about $0.08) for a minute of audio.
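As a rough illustration of what that per-minute price means in practice, the sketch below estimates the cost of transcribing a given amount of audio. Only the £0.06 starting rate comes from Speechmatics; the function name, the flat-rate assumption (no volume discounts or minimum charges) and the 90-minute example are hypothetical.

```python
from decimal import Decimal

# Starting rate quoted for public cloud transcription: £0.06 per minute.
RATE_GBP_PER_MINUTE = Decimal("0.06")


def transcription_cost_gbp(minutes, rate=RATE_GBP_PER_MINUTE):
    """Estimate the cost of transcribing `minutes` whole minutes of audio.

    Assumes a flat per-minute rate, which is a simplification and not a
    statement of how Speechmatics actually bills.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Round to whole pence for display.
    return (Decimal(minutes) * rate).quantize(Decimal("0.01"))


# Hypothetical example: a 90-minute live event.
print(transcription_cost_gbp(90))  # 5.40
```

Decimal arithmetic is used rather than floats so that pence values are not distorted by binary rounding.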
Growing at a run rate of about two employees a month, Speechmatics counts Red Bee Media as its most notable customer. Red Bee added Speechmatics technology to its own Subito Live subtitling system two months ago, to bring real-time subtitles to social media video content. This is particularly interesting as earlier this year Ericsson decided not to sell off its broadcast-focused Red Bee business, despite selling 51% of its Media Solutions business to a private equity group. At the time, Ericsson said Red Bee Media delivers 2.7 million hours of programming in 60 languages for 600 TV channels, including 20,000 hours of live and 15,000 hours of catch-up content each week – showing that Red Bee has plenty more use for Speechmatics, which could perhaps soon lead to a sizable contract upgrade.
Addressing market demands, Firth noted that people today want mostly English, but with a smattering of other languages. To that end, Speechmatics spends time balancing data from different industries, and covering language elements of different age groups and genders.
This reminded us of our recent coverage concerning a partnership between Chinese OTT video service iQiyi and local TV network Beijing Gehua CATV Network (BGCTV) – launching a joint set top equipped with voice functionality. An interesting new feature of Baidu’s updated DuerOS 3.0 is Child Safe Mode, which uses voice recognition technology to determine the age of a user based on the sound of their voice. If the technology recognizes the voice of a child, only suitable content will be displayed, for example replacing horror movies with educational cartoons. However, we raised the question of whether this is based on voice pitch, sentence construction, or a combination of the two.
The answer remains unclear, but if it transpires the decision of the AI is based on voice pitch, then there are certain to be situations of mistaken identity for adults with high-pitched voices. If instead the AI bases its decision on more complex analysis of syntax based on natural language processing, whereby simplistic sentences are deemed child-like, then surely this gives rise to a gray area around disabilities? Nevertheless, it is an intelligent feature for parents who may forget to manually switch on Child Safe Mode.
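To make the pitch concern concrete, here is a deliberately naive sketch of a pitch-only age check. It is not Baidu's method, which has not been disclosed; the 250 Hz cutoff, the function name and the three speakers are all made up for illustration.

```python
# Hypothetical cutoff in Hz. How DuerOS actually decides is not public.
CHILD_PITCH_THRESHOLD_HZ = 250.0


def looks_like_child(mean_pitch_hz, threshold_hz=CHILD_PITCH_THRESHOLD_HZ):
    """Naive rule: treat any voice at or above the threshold as a child's."""
    return mean_pitch_hz >= threshold_hz


# Illustrative, invented speakers with a mean voice pitch in Hz.
speakers = {
    "adult, low voice": 110.0,
    "adult, high voice": 260.0,
    "child": 300.0,
}

for label, pitch in speakers.items():
    mode = "child safe" if looks_like_child(pitch) else "standard"
    print(f"{label}: {mode}")
```

Under these assumed numbers the adult with the high-pitched voice is placed in the child safe catalog alongside the actual child, which is exactly the mistaken-identity case described above. Any single cutoff will misfile some speakers wherever adult and child pitch ranges overlap.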
Returning to Speechmatics, Firth cited noise as the biggest enemy of voice recognition, also referencing accents as a particular pain point, with some situations going “horribly wrong.” That said, hardware is a different realm altogether, with Amazon claiming to get around the noise problem in its Echo devices by triangulating sounds to isolate a particular voice. “Our partners do the hardware stuff, harmonics make it tricky,” said Firth.
As for competitors away from the usual suspects, there are certainly similarities between Speechmatics and Zoo Digital, a machine learning firm also based in the UK, which Faultline Online Reporter spoke to back in March. The company is focused heavily on applying machine learning to content dubbing and subtitling for major studios, winning some major deals at Netflix and Disney.
Speechmatics and Zoo Digital might be close rivals, but they agreed wholeheartedly on one thing – that the likes of Google, Microsoft and IBM offer generic speech technologies with little attention to detail, which fundamentally fail to address real-world challenges.