MPEG looks to neural networks for next-gen video

MPEG chair and co-founder Leonardo Chiariglione has courted the limelight lately by posting two dramatic, despairing blog posts warning that the success of the Alliance for Open Media's (AOMedia) royalty-free codec model would be a disaster not just for his organization but for the video industry as a whole.

It would destroy incentives for technology companies to develop and contribute intellectual property (IP) for the common good of the industry and its users, he contended. He accused "Non-Practicing Entities" (NPEs), patent license holders intent on extracting money from their IP but with no intention of building products from it, of dragging down the whole MPEG movement. They had become increasingly aggressive in monetizing their IP and had thwarted his efforts to heal the fractured royalty scene that has slowed take-up of the latest ISO MPEG codec, HEVC/H.265. As a result, he concluded, AOMedia had been able to step into the void with its AV1 codec, now on the verge of release.

These outpourings were born of frustration, and Chiariglione is much more comfortable talking about the innovations still taking place within MPEG. These cover a lot of ground and show the organization is still at the cutting edge of emerging developments such as object-based video search and new codecs for immersive video, including VR (Virtual Reality). Yet his blog posts resonate here too, because again we can sense the big Internet technology players in the background as elephants in the room. Google, Facebook, Apple, Netflix and others have been unwilling to wait for MPEG standards if doing so holds them up, and this touches on the whole argument around IP. For these players IP is not a commodity to be exploited directly for value but a means to a larger end.

This was evident in the case of encoding ambisonic audio for VR video, sometimes called “full sphere surround sound” because it embraces sources from all directions in 3D space and not just in the 2D plane of the listener. It represents sound as a field, independent of where the user’s speakers are located, and therefore requires encoding in such a way that the audio can be decoded to match a particular speaker configuration and deliver a more realistic sound experience. MPEG did publish an ambisonic sound standard called MPEG-H 3D Audio in 2015 but Google and Facebook decided it was not advanced enough for their VR services and agreed on a format which has become the de facto standard.
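The field idea behind ambisonics can be illustrated with first-order encoding, where a mono source is projected onto four spherical-harmonic components (W, X, Y, Z) according to its direction, independently of any speaker layout. This is a minimal sketch only; the 1/sqrt(2) gain on W follows the traditional FuMa convention, and real formats differ in channel ordering and normalization.

```python
import math

def encode_first_order(sample, azimuth, elevation):
    """Project a mono sample onto the four first-order B-format
    components (W, X, Y, Z); angles are in radians.  The W gain
    follows the traditional FuMa convention -- illustrative only."""
    w = sample / math.sqrt(2.0)                           # omnidirectional
    x = sample * math.cos(azimuth) * math.cos(elevation)  # front/back axis
    y = sample * math.sin(azimuth) * math.cos(elevation)  # left/right axis
    z = sample * math.sin(elevation)                      # up/down axis
    return w, x, y, z

# A source directly ahead of the listener lands entirely on W and X;
# a decoder can later mix these components for any speaker layout.
w, x, y, z = encode_first_order(1.0, azimuth=0.0, elevation=0.0)
```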

There is a danger the same could happen for VR video formats. MPEG is doing good work on this front, having started a project early in 2017 called "coding of immersive media", which will culminate in a standard called MPEG-I. But under its current roadmap, work on the first version of MPEG-I will not be completed until 2022, which may well be too late: by then the associated technology may already have been widely deployed in the field.

Apart from ambisonic sound, a key objective on the video front is to allow users wearing 3D headsets to move as if they were navigating around a scene while maintaining a realistic perspective. This is complex to achieve and is coming in phases, starting with allowance for limited movement by the user. “We already have one standard for three degrees of freedom,” said Chiariglione. These three degrees are rotational: pitching forwards or backwards, rolling from side to side and yawing, or twisting the head. “You can move your head left and right, which is more than you could do before with the old standard because there was a parallax effect,” said Chiariglione. Parallax is the phenomenon whereby the relative position of objects changes as the viewer moves. It has been exploited by astronomers, for example, to work out the distance of a star by analyzing its very slight change in position relative to more distant stars when viewed from opposite sides of the Earth’s orbit around the Sun.
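The astronomical use of parallax reduces to a simple reciprocal relationship: a star whose apparent position shifts by one arcsecond across the orbital baseline lies one parsec away. A minimal sketch:

```python
def distance_parsecs(parallax_arcsec):
    """Distance in parsecs is the reciprocal of the parallax angle
    measured in arcseconds (a shift of 1 arcsecond = 1 parsec)."""
    return 1.0 / parallax_arcsec

# Proxima Centauri shifts by roughly 0.77 arcseconds against the
# background stars, putting it about 1.3 parsecs away.
d = distance_parsecs(0.77)
```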

“In parallel with that for the longer term, we are putting in six degrees of freedom, allowing the person to walk and have an experience indistinguishable from the natural one,” said Chiariglione. The additional three degrees of freedom making up the six are the translational ones of left/right, up/down and forwards/backwards, in other words the three lateral dimensions of movement.

Chiariglione highlighted the important role being played by neural networks in much of MPEG’s work on advanced encoding and associated areas such as object-based video search. Loosely modelled on biological networks in brains, neural networks comprise nodes and the connections between them, conveying the outputs of one node as inputs to another node downstream. The output of each node is determined by its inputs combined with a weighting function, and each weight can be adjusted independently of the others during training so the network converges towards the desired result.
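The node behaviour described above can be sketched in a few lines. The sigmoid used here is one common choice of activation; the article does not specify which functions MPEG's work uses.

```python
import math

def node_output(inputs, weights, bias):
    """One node: combine inputs with adjustable weights, then apply
    a sigmoid activation (one common squashing function)."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # squashes output into (0, 1)

# Training nudges each weight independently so that this output
# converges towards a target value.
y = node_output([0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
```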

These networks can be applied to any data set that represents objects, structures or forms of any kind, provided suitable training sets are available. They can be nested to allow deeper levels of learning for more complex tasks. For image and video processing or recognition, an enhancement called the convolutional neural network (CNN) works well. It will sound familiar to developers of traditional video codecs because it breaks images down into tiles, making it easier for the system to target objects within the image. The machine is first trained to identify individual tiles and then to recognize the whole picture by aggregating the tiles together. So when recognizing, say, a cat, the system would first learn to identify characteristic partial shapes such as ears and whiskers, some of which might be shared by other animals, but not all of them.
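The tiling idea behind a CNN can be illustrated with a plain convolution: a small kernel slides across the image, and each output value scores how strongly the local tile matches the pattern the kernel encodes. A minimal sketch with a hand-built edge detector:

```python
def convolve2d(image, kernel):
    """Slide a small kernel over the image, producing a feature map.
    Each output cell scores how well the local tile matches the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge kernel responds where intensity jumps from left to right,
# the kind of partial shape (an ear outline, a whisker) a CNN learns to spot.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1],
               [-1, 1]]
feature_map = convolve2d(image, edge_kernel)
```

In a real CNN the kernel values are not hand-built but learned during training, and many kernels run in parallel to build up richer feature maps.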

Image search, recognition and compression can be enhanced further by applying “neuro fuzzy segmentation”, which equates whole objects to geometrical relationships between their parts. This can, for example, represent images of the human body as component parts such as the face, torso and limbs, with further subdivision into smaller parts like hands or even fingers, along with their place within the whole. So when a system encounters a face it can immediately predict, depending on the orientation, that a body lies below it.
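That part hierarchy can be sketched as a simple data structure. The part names and offsets below are hypothetical, chosen only to show how locating one part lets the system predict where its neighbours should be.

```python
# Hypothetical part model: each part stores the offset of its centre
# relative to the face, in units of face height, plus its sub-parts.
BODY_MODEL = {
    "face":      {"offset": (0.0, 0.0), "children": ["torso"]},
    "torso":     {"offset": (0.0, 1.5), "children": ["left_arm", "right_arm"]},
    "left_arm":  {"offset": (-0.6, 1.5), "children": []},
    "right_arm": {"offset": (0.6, 1.5), "children": []},
}

def predict_part(anchor_xy, anchor_part, target_part, scale=1.0):
    """Predict where target_part should lie, given where anchor_part
    was found and the apparent size (scale) of the figure."""
    ax, ay = anchor_xy
    ox, oy = BODY_MODEL[target_part]["offset"]
    aox, aoy = BODY_MODEL[anchor_part]["offset"]
    return (ax + (ox - aox) * scale, ay + (oy - aoy) * scale)

# Having located a face at (10, 20), predict the torso directly below it.
torso_xy = predict_part((10.0, 20.0), "face", "torso", scale=40.0)
```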

Applying neuro fuzzy segmentation involves three steps: clustering, detection and refinement. First, clustering collects similar pixels, grouping them into, say, 4 x 4 blocks and then assembling those into larger areas. Next, objects are detected by analyzing these clusters, identifying faces by their characteristic chrominance (color) values and limited range of luminance (light intensity). Finally, refinement helps identify objects in more ambiguous areas of a frame, perhaps around the edges where only part of an object such as a face is visible.
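The clustering and detection steps can be sketched as follows. The 4 x 4 block averaging follows the description above, but the chrominance thresholds are an assumption for illustration, a commonly used skin-tone range in YCbCr space rather than anything the MPEG work specifies.

```python
def block_means(channel, block=4):
    """Average one image channel over non-overlapping block x block tiles
    (the clustering step: similar pixels grouped into blocks)."""
    h, w = len(channel), len(channel[0])
    means = []
    for r in range(0, h - block + 1, block):
        row = []
        for c in range(0, w - block + 1, block):
            vals = [channel[r + i][c + j] for i in range(block) for j in range(block)]
            row.append(sum(vals) / len(vals))
        means.append(row)
    return means

def detect_candidates(cb_means, cr_means, cb_range=(77, 127), cr_range=(133, 173)):
    """Flag blocks whose mean chrominance falls in an assumed skin-tone
    range (the detection step); refinement would revisit border cases."""
    return [
        [cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]
         for cb, cr in zip(cb_row, cr_row)]
        for cb_row, cr_row in zip(cb_means, cr_means)
    ]

# Toy 4 x 8 chrominance planes: the left block looks skin-like, the right does not.
cb = [[100] * 4 + [30] * 4 for _ in range(4)]
cr = [[150] * 4 + [30] * 4 for _ in range(4)]
candidates = detect_candidates(block_means(cb), block_means(cr))
```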

This same principle can then be extended to whole video sequences, again with fuzzy techniques playing a role identifying relationships between objects within the content, even when separated by a few frames.

It can also be applied to video search, which MPEG is doing for a standard that Chiariglione says should be released early in 2019. But here again the MPEG community is up against the Internet brigade: Google introduced its Video Intelligence API in March 2017 for developers to build applications that can automatically extract entities from a video, whether dogs or even specific faces, where previous Google APIs could only do this for still images. Meanwhile Netflix has been applying convolutional neural networks to content-based metadata extraction, characterizing content down to scene level on the basis of objects that will be searchable by subscribers.

It may well be that MPEG’s video search will be superior to these schemes, but the question once more is whether that will be enough to gain traction among broadcasters and content owners who lack access to the approaches already adopted by Google, Netflix and others. This comes back to the royalty model, and raises the question not so much of whether it is broken but whether it is redundant. Chiariglione is surely right to fear that an era which defined pay TV may be drawing to a close.

In a way, by adopting their standards we cede to the powerful AOMedia members the ability to be the first to add AI features to codecs and compression, opening the possibility that one of them will make a compelling first move into a future technology that gives it monopolistic control of the video market.