Zoo Digital talks what’s hot in voice AI; demoing new security at NAB

UK machine learning outfit Zoo Digital will be applying its algorithms to the anti-piracy field at NAB next month, demonstrating security features for automatically obscuring video streams – essentially making the content of virtually no value to content pirates without inhibiting the dubbing and subtitling process.

Should it prove a success, the new feature will become part of Zoo Digital’s full service for applying machine learning to content dubbing for major studios. The company’s CEO and former CTO Stuart Green explained to Faultline Online Reporter this week that this service is really a very recent venture, despite Zoo Digital coming across as a bit of an AI expert.

With the sheer number of companies claiming to be AI pioneers, how is it that Zoo Digital has differentiated itself in such a competitive market and bagged deals at the likes of Disney and Netflix?

“Our competitors are basically human capital businesses,” said Green, adding that powerhouses such as IBM Watson and Google provide useful tools for the job of transcribing content, using some form of automatic speech recognition (ASR), but do not offer a full suite of localization and digital packaging technologies. They are more focused on generic packaging and speech-to-text, providing general-purpose technologies that a smaller company like Zoo Digital might use.

Having earned its stripes providing DVD authoring tools during the physical disc boom, when authoring was a laborious and expensive process, Zoo Digital has emerged into the OTT era unscathed by building on what Green described as very powerful, low-level programming languages.

It is common practice for AI companies to avoid revealing the technical details of what makes them tick, but Green elaborated on the inner workings of Zoo Digital’s collaboration with the University of Leeds, as well as its undisclosed grant from Innovate UK in partnership with the University of Sheffield last month. Green attributed a large share of the new face of Zoo Digital to working with academics, explaining that one aspect of the research has focused on machine translation, in which millions of texts are fed into machine learning algorithms to create an internal model, for use in a market like content dubbing.

Scripts from the EU Parliament are the most popular resource for machine translation, according to Green, because they are readily available in many languages and cover a wide range of linguistic material. Companies like Microsoft have adopted neural machine translation for translating user manuals into multiple languages, although this still requires manual tweaks from a post-editor.

“We have the world’s biggest resource of training materials, generating tens of thousands of dialogues,” said Green.

Zoo is working to take this to the next step. Spoken dialogue includes complexities such as colloquialisms, idioms and even physical gestures, which are all integral to dialogue but cannot be inferred by an algorithm from a written page. Green was clear that machine translation is not capable of capturing these intricacies when translating dubbed content today, but said it is an exciting area being investigated by Zoo.

While a breakthrough in machine translation sits on the back burner, Zoo is enhancing its core service for analyzing original and target language voices to improve lip-sync dubbing, for which it relies on AWS cloud infrastructure.

To give a basic example of the demand driving this market: when dubbing a feature film from English into French, the word “Bonjour” creates very different lip movements from “Hello”, so a machine learning algorithm can replace “Bonjour” with “Salut”, which more closely matches the lip movements of “Hello”. This algorithmic process adapts scripts for voice actors, reducing much of the trial and error involved in studio production while improving the dubbing process for actors, studios and end viewers.
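The “Bonjour” versus “Salut” choice can be pictured with a toy sketch. This is not Zoo Digital’s method, which Green did not detail; it only illustrates the general idea of scoring candidate translations by how closely their mouth shapes (visemes) follow the original word. The viseme table, the simplified pronunciations and the function names below are all assumptions made up for illustration, and the similarity score simply reuses Python’s standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical, heavily simplified phoneme-to-viseme table (an assumption,
# not a real phonetic standard). Unlisted phonemes count as "neutral".
VISEME_OF = {
    "p": "closed", "b": "closed", "m": "closed",   # lips pressed together
    "f": "teeth", "v": "teeth",                    # lower lip on teeth
    "o": "round", "u": "round", "w": "round",      # rounded lips
    "a": "open",                                   # open jaw
    "e": "spread", "i": "spread",                  # spread lips
}

# Hand-written toy pronunciations (assumption); a real system would take
# these from a pronunciation lexicon or a speech model.
PHONEMES = {
    "hello": ["h", "e", "l", "o"],
    "bonjour": ["b", "o", "n", "zh", "u", "r"],
    "salut": ["s", "a", "l", "u"],
}


def visemes(word):
    """Map a word to its sequence of viseme classes."""
    return [VISEME_OF.get(p, "neutral") for p in PHONEMES[word]]


def lip_similarity(source_word, candidate):
    """Score in [0, 1]: how closely two words' viseme sequences align."""
    return SequenceMatcher(None, visemes(source_word), visemes(candidate)).ratio()


def pick_lip_sync_match(source_word, candidates):
    """Return the candidate whose mouth shapes best follow the source word."""
    return max(candidates, key=lambda c: lip_similarity(source_word, c))


if __name__ == "__main__":
    for candidate in ("bonjour", "salut"):
        print(candidate, round(lip_similarity("hello", candidate), 2))
    print("best:", pick_lip_sync_match("hello", ["bonjour", "salut"]))
```

Under these toy tables “salut” scores higher than “bonjour” against “hello”, largely because “bonjour” opens with a closed-lip “b” that has no counterpart in the English word. A production system would also need to weigh meaning, register and timing, which this sketch ignores.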

Feeding a corpus of good quality training materials into an AI system is all well and good, but one problem is distinguishing between speech and background noise. In the case of smart assistants, devices such as the Amazon Echo are able to filter out background noise in very specific controlled environments, triangulating sounds to isolate a voice. The same concept is not as simple for video content, according to Green, who added that Zoo Digital itself is not in the business of voice enhancement, although its academic partners are researching this field.

Subtitling and captioning might seem like a much bigger market than dubbing, given the comparative simplicity and lower costs, which machine learning developments can enhance further. However, Green said Zoo Digital has nailed this area and subtitling does not really warrant AI.

As for the next big thing for AI in speech, voice synthesis is a very hot area of research right now, said Green, albeit not a convincing one in its fledgling state. A foray into artificially producing human speech by Zoo Digital is by no means on the company’s roadmap, as it would entirely contradict its core values of improving and diversifying the dubbing process, not replacing the jobs of voice artists. That said, Green highlighted how Zoo Digital’s network of “thousands of freelancers” is more important to the business than the software itself. It will be interesting to see how the company’s payroll has changed in a few years’ time.