AI and automation make a perfect marriage for deep metadata

Video search and recommendation have been held back by a lack of detailed metadata, which often offers little more than title, genre, length, time of creation and some credits such as the director, actors and other participants. Fortunately, times are changing as metadata starts to dig deeper, allowing recommendation engines to pick up more nuanced information down to scene level and even some of the bit-part players. This promises not only to match content as a whole more accurately to viewer preferences and profiles, but also to allow micro-level extraction down to individual scenes, with potential even for creating and serving personalized extracts or trailers.

On the one hand, new content is already being created with richer metadata automatically, for example from cameras time-stamping individual frames, with scope for building higher-level metadata about scenes on top. But this does nothing for the mountains of archived content, nor does it offer the ultimate in content-based video search on the basis of audio or visual objects, or even spoken words, coming closer to the text-based search Google and others have made commonplace. This is where AI and Machine Learning (ML) techniques come in, enabling metadata to be identified quickly, and sometimes automatically, through audio and video analysis. The ability to identify words, and the people speaking them, from the audio can be combined with facial recognition from the video.

There is also a growing ability to identify aspects of scenes, such as whether the terrain is mountainous, wooded or a seascape. Scenes can also be categorized by where they were shot, which could be a single room or a moving car, or by the actions taking place and the emotions registering on faces. The latter has potential for helping content editing as well as search and recommendation.

One technique making waves in metadata creation is the convolutional neural network, inspired loosely by the way mammalian brains process images in the visual cortex, although it works equally well for many forms of pattern recognition, including identifying words in audio. The inspiration springs from the way the cortex roughly allocates groups of neurons to individual geometric features, such as lines or curves, and then aggregates these to facilitate identification of more complex objects.

This is emulated in convolutional neural networks by creating templates for shapes, or indeed any pattern, in the form of two-dimensional grids of numbers, or matrices in mathematical speak. If, say, a template represents an edge, then in a simple example the pixels from that part of the image would be represented by higher numbers. During the recognition phase this template, or filter, is slid across the image data, an operation called convolution, hence the name, seeking a match. This is proving highly effective at homing in on scenes or faces, as Amsterdam-based metadata specialist Media Distillery showed at the recent OTT World Summit in London. It demonstrated how such fine-grained “intra-content” metadata could be applied to extract clips of particular interest from whole content, which is becoming more desirable with the increased watching of replays.
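The filter-matching idea above can be sketched in a few lines of Python. A small hand-built edge template is slid across a toy image, and the response peaks exactly where the brightness changes. The image, kernel and `convolve2d` helper are illustrative inventions for this sketch, not anything from Media Distillery's actual system, and real networks learn their filter values rather than having them hand-coded.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, recording the match score at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # elementwise multiply, then sum
    return out

# Toy image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Vertical-edge template: responds strongly where brightness jumps left to right.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

response = convolve2d(image, kernel)
print(response)  # strongest response in the middle column, where the edge sits
```

In a full convolutional network, many such filters are applied in parallel, and their outputs are fed into further layers that combine simple features into detectors for faces, objects or whole scene types.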

Such fine-grained metadata can improve the experience in other ways, such as locating the precise start time of a recorded program so viewers avoid wading through ads or preambles. There are doubtless many other use cases for granular metadata yet to be conceived.

Needless to say, these are early days, and much work remains on how to incorporate such metadata into search and recommendation, as well as on standardization so that the engines can work across multiple content types, sources and distribution infrastructures.