Video standards group MPEG is evaluating technologies to enable efficient feature coding for machine vision tasks. These technologies could cut global Internet bandwidth consumption by compressing, far more efficiently, the video exchanged in machine-to-machine (M2M) communications – in fields including surveillance, autonomous vehicles, industrial equipment and military systems.
An estimated 50% of all global video traffic will be viewed solely by machines by 2024, according to Nokia. Given that video already accounts for the bulk of total global Internet traffic (often pegged at 80%), the case for more efficient delivery of machine vision video arguably outweighs that for consumer video – at least from a sustainability perspective.
While home viewers enjoy stunning image quality on TV thanks to advanced video compression techniques, those same techniques are poorly matched to machine use cases, because they are tuned to human perception. This is why perceptual metrics such as the structural similarity index measure (SSIM) are popular: they quantify image degradation objectively as perceived by the human eye – which says little about how well machine vision processing will perform on the decoded frames.
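To make the metric concrete, here is a minimal sketch of SSIM computed over a whole image in one pass. The standard metric uses a sliding Gaussian window and averages local scores; this single-window simplification (a common teaching shortcut, not the official algorithm) still shows the luminance, contrast, and structure terms at work.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified single-window SSIM over two grayscale images.
    The standard metric applies this formula in a sliding Gaussian
    window and averages; this global version is a sketch only."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    # Stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

An identical pair scores exactly 1.0, and added noise pulls the score down – but a machine vision model may tolerate (or be hurt by) distortions in ways this human-oriented score does not capture, which is the gap VCM-style coding targets.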
At its 140th meeting, MPEG announced that six responses to its call for proposals (comprising 17 proposed technologies), alongside eight responses to its earlier call for evidence, have been considered. These technologies include learning-based video codecs, block-based video codecs, and hybrids that combine the two with novel video coding architectures.
These responses reported bitrate reductions of up to 57% for object tracking, up to 45% for instance segmentation, and up to 39% for object detection, at equivalent task performance. The announcement does not detail the test conditions behind these figures.
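A quick back-of-the-envelope illustration of what those percentages mean in practice. The per-stream bitrate below is invented for the example; only the reduction figures come from the announcement.

```python
# Reported best-case bitrate reductions at equivalent task performance.
reductions = {
    "object tracking": 0.57,
    "instance segmentation": 0.45,
    "object detection": 0.39,
}

def reduced_mbps(baseline_mbps: float, reduction: float) -> float:
    """Bitrate remaining after applying a fractional reduction."""
    return baseline_mbps * (1.0 - reduction)

# Hypothetical 4 Mbit/s surveillance stream per task.
for task, r in reductions.items():
    print(f"{task}: 4.00 Mbit/s -> {reduced_mbps(4.0, r):.2f} Mbit/s")
```

Across a fleet of thousands of always-on M2M cameras, cutting each stream roughly in half compounds into the bandwidth savings the sustainability argument rests on.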
MPEG has described the latest meeting as a success, and its Working Group 4 will begin a new standardization project using a test model based on the evaluated technologies and the results of the first round of experiments. This dovetails with work in the Joint Video Experts Team with ITU-T SG 16 (WG 5), which will study encoder optimization methods for machine vision tasks on top of existing MPEG video compression standards.
Emerging metaverse applications are one area where machine vision technologies will play a tremendous role. Leaders in the silicon world like ARM, Qualcomm, Xilinx and MediaTek envisage machine vision driving a significant evolution in mobile applications and in the devices themselves. Those devices need to stay lightweight while emitting minimal heat – easier said than done when running real-time machine vision experiences.
Against this immersive backdrop, MPEG has also announced completion of a new standard called Video Decoding Interface for Immersive Media (VDI) – with the first version supporting HEVC, VVC, and EVC.
VDI addresses encoding inefficiencies in immersive content where only a tiny portion of content is actually presented to users, unlike 2D media. In something like Meta’s Horizon Worlds, for instance, you are unlikely to be viewing Mark Zuckerberg’s legless avatar from the front and back simultaneously. Only the front and sides of a point cloud object therefore need to be delivered, decoded, and presented.
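The "only the visible portion" idea can be sketched with simple back-face culling on a point cloud: keep the points whose surface normals face the viewer. Real immersive pipelines deliver tiled or partitioned streams rather than testing individual points, so treat this purely as an illustration of why back-facing geometry need not be sent at all.

```python
import numpy as np

def visible_points(points, normals, view_dir):
    """Keep points whose normals face the viewer.

    A normal pointing against the viewing direction (negative dot
    product) means the surface faces the camera. Per-point culling is
    a sketch; real systems cull at the tile/partition level."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    facing = normals @ view_dir < 0.0
    return points[facing]

# Demo: points on a unit sphere, where each normal equals the point
# itself. A viewer looking along +z sees roughly half the sphere.
rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
front = visible_points(pts, pts, np.array([0.0, 0.0, 1.0]))
print(f"{len(front)} of {len(pts)} points face the viewer")
```

Roughly half the points survive – the rest never need to be delivered, decoded, or presented, which is precisely the inefficiency VDI targets.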
MPEG explains that the VDI standard allows dynamic adaptation of video bitstreams so that the number of actual video decoders can be smaller than the number of elementary video streams to be decoded. In other cases, virtual instances of video decoders can be associated with only those portions of the elementary streams that need to be decoded.
With this standard, the resource requirements of a platform running multiple virtual video decoder instances can be further optimized by considering the specific decoded video regions to be presented to the users rather than considering only the number of video elementary streams in use. It also includes support for API standards that are widely used in practice, such as Vulkan by Khronos.
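The resource model this enables can be sketched as follows. Every identifier and capacity figure below is invented for illustration – the point is simply that a decoder pool sized by the visible pixel rate can be smaller than a pool sized by stream count.

```python
from dataclasses import dataclass

@dataclass
class StreamPortion:
    """The part of one elementary stream actually presented to a user."""
    stream_id: str
    visible_pixels_per_s: int

def physical_decoders_needed(portions, decoder_capacity_pps):
    """Size the decoder pool by total visible pixel rate,
    not by the number of elementary streams."""
    total = sum(p.visible_pixels_per_s for p in portions)
    return -(-total // decoder_capacity_pps)  # ceiling division

# Hypothetical scene: three viewpoint streams, but only the visible
# regions of each are decoded (side views cover half the width).
portions = [
    StreamPortion("front", 1920 * 1080 * 30),
    StreamPortion("left",   960 * 1080 * 30),
    StreamPortion("right",  960 * 1080 * 30),
]
# One decoder rated for 1080p60 covers all three partial streams.
print(physical_decoders_needed(portions, decoder_capacity_pps=1920 * 1080 * 60))
```

Three elementary streams, one physical decoder: the virtual decoder instances are multiplexed onto shared hardware because only the presented regions count against its capacity.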
Similar rules apply to background scenes, which is why MPEG has also been working on technologies for interoperable and distributable scene description – hailing this as a key element to fostering the emergence of immersive media services.