Watson to soak up video metadata – if it is cheap enough?

A paper out this week from IBM shows just one of the many things that it wants to do with “Watson” in video, and its plan seems to be to replace modern video metadata. It will have its work cut out.

We managed to miss IBM at IBC, where it had promised to tell us about how Watson can improve video, but back in April IBM was telling the press about its Content Enrichment Service which would use Watson’s cognitive abilities to provide a deep analysis of video and to extract metadata such as keywords, concepts, visual imagery, tone and emotional context.

At the time it talked about its tone analyser, which used visual recognition of faces (on the screen, not the viewers) to convey the mood of each video scene. It also hinted that this would be a surefire way of ensuring that the video your adverts were played alongside would be “brand safe.”

But it turns out that this is a full frontal attack on the likes of TiVo and Gracenote, which control about $300 million of video metadata delivery between them. That’s not a huge market for IBM, and there are other global providers, but if it is one of 30 or 40 markets which Watson is capable of opening up for IBM, it may well be a relatively large money spinner. IBM is also saying that this metadata is significantly more useful, and describes it as 1,000 times more detailed than existing metadata.

It looks to us as if both Gracenote and TiVo are starting points, and that once you have identified which version of the specific video you have, and can allocate its copyright to the correct organization, you may want to do a lot more with it than merely categorize it or use its description to personalize a recommendation system.

It strikes us that both Gracenote and TiVo have two options available to them: either using Watson to augment their services, or building a relationship with another machine learning service. There are a number of smaller ones around to choose from, as well as the majors in Google, Amazon, Facebook and Microsoft.

TiVo already has what it calls a Knowledge Graph in the products it acquired from Veveo in 2014, as well as Digitalsmiths, which the old TiVo acquired in the same year and which already has separate scene-by-scene metadata for many films. So both of them may be well in advance of this IBM offering. When we met Graham McKenna, VP of marketing at Gracenote, at IBC, he confirmed that Gracenote, under its new owner Nielsen, was already dabbling with AI, initially in music metadata, and that it was soon to take on its use in video.

IBM said Watson Video Enrichment (VE) can create automated metadata sets which are thousands of times more detailed and searchable than is currently possible for large libraries of video.

Okay, that may be true, but who is the customer here? There have been startups in the segment for as long as we can remember, not all using AI, but many of them using natural language understanding to decipher the spoken word throughout the video’s dialog and using this for classification. These systems are often used to augment existing metadata, but more often to search the web pages that videos are found on, to establish what is supposed to be going on within the entire web page.

The IBM vision seems somewhat deeper, beginning at content ingest and then effectively watching and listening to the video, to automatically detect and break down the number of scenes, keywords, objects, and dominant emotions. The emotions it detects by recognizing facial expressions.

Watson then provides a 5-level keyword taxonomy, and identifies entities including people, cities, and organizations, all associated with a confidence score. Watson also captures the high-level concepts and themes related to a video. It categorizes everyday objects, celebrity faces, and food types but also detects sentiment and emotions on the screen.

To do this it goes through a speech-to-text transcription and then uses Natural Language Understanding to infer fundamental concepts out of the dialog. It uses visual recognition of objects and people. It then outputs a JSON (JavaScript Object Notation) metadata file, a lightweight text format that is easy to interrogate and search.
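To make the idea of interrogating such a JSON file concrete, here is a minimal sketch in Python. The actual schema is not described here, so every field name below (scenes, keywords, entities, confidence, emotion) and the sample data are hypothetical stand-ins rather than IBM's format. The sketch only illustrates how per-scene tags with confidence scores could be searched.

```python
import json

# Hypothetical sample file contents. The field names are illustrative
# assumptions, not the schema that Watson actually emits.
SAMPLE = """
{
  "video_id": "demo-001",
  "scenes": [
    {"start": 0.0, "end": 42.5,
     "keywords": [{"text": "harbor", "confidence": 0.91},
                  {"text": "boat", "confidence": 0.48}],
     "entities": [{"type": "City", "text": "Lisbon", "confidence": 0.83}],
     "emotion": "joy"},
    {"start": 42.5, "end": 97.0,
     "keywords": [{"text": "kitchen", "confidence": 0.77}],
     "entities": [{"type": "Person", "text": "Jane Doe", "confidence": 0.35}],
     "emotion": "sadness"}
  ]
}
"""


def find_scenes(metadata, term, min_confidence=0.5):
    """Return (start, end) pairs for scenes tagged with `term`.

    A scene matches when one of its keywords or entities equals `term`
    (case-insensitive) with a confidence at or above `min_confidence`.
    """
    term = term.lower()
    hits = []
    for scene in metadata["scenes"]:
        tags = scene.get("keywords", []) + scene.get("entities", [])
        if any(t["text"].lower() == term and t["confidence"] >= min_confidence
               for t in tags):
            hits.append((scene["start"], scene["end"]))
    return hits


metadata = json.loads(SAMPLE)
print(find_scenes(metadata, "Lisbon"))     # entity match above the threshold
print(find_scenes(metadata, "boat"))       # tag exists but confidence is too low
print(find_scenes(metadata, "boat", 0.4))  # a lower threshold lets it through
```

The confidence threshold is the interesting design choice: a search or recommendation service would have to decide how much low-confidence tagging it is willing to surface.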

Somehow, for each individual distributor, this might be a sledgehammer to crack a nut, and perhaps only the major Hollywood studios look like the type of businesses which might buy into this, to create stronger metadata to sell on to pay TV operators. Of course the other two great customers might be TiVo and Gracenote, which may use this to shut out competing services.

Gracenote has recently tried to expand into sports with its Infostrada Sports and SportsDirect acquisitions, and it has extended its data and tagging catalog beyond music, TV and movies. All of its development has been on Hadoop in a Microsoft Azure environment. Not AI, more big data.

What we are going to see emerge is a nuanced set of video selection criteria which interfaces to voice search and the new voice platforms such as Alexa. Such systems should be able to pick up on a conversational selection process whereby two people discuss what kind of thing they fancy watching, and a voice assistant makes suggestions, extracts detailed metadata, and then uses that to create a vocal description of each video asset.

That’s the end point, and although that seems to be the direction in which all of this video metadata is taking us, it will only be worth it if it makes video services worth more. There needs to be a payback in order to afford to store all the data that IBM will be generating with Watson, and a payback for the processing services, even if all of that is upfront.

If Watson can generate subtitles in multiple languages using automatic translation, that may also be a way of paying for this type of service, getting it to fresh markets sooner, with more confidence. But until it comes out with express “ROI” calculations for its AI, we remain skeptical.