Content-adaptive compression, harnessing neural networks and other techniques, is set to infiltrate the next generation of codecs and yield substantial improvements in encoding efficiency.
The potential has now been well proven by Netflix, which has just finished re-encoding its entire catalog with the help of content-adaptation methods developed in partnership with the University of Southern California, the University of Nantes in France and the University of Texas at Austin. Netflix has combined these techniques with established codecs such as HEVC and H.264 to compress further, achieving on average just over 20% lower bit rates per title.
The gain is content dependent, so the degree of efficiency improvement varies from title to title, but given that Netflix accounts for around 40% of all Internet traffic in the USA, this could liberate something close to 10% of extra capacity for other web traffic, which is quite substantial.
In fact the economics are such that Netflix plans to repeat the exercise, drilling down further to the scene level and re-encoding the whole catalog again, aiming perhaps for a further 20% average bit-rate reduction. Meanwhile, improvements in hardware cost/performance have made content-adaptive encoding feasible for live content, where the ability to reduce bit rate can relieve congested networks while also improving quality by avoiding the need to step down to lower resolutions in adaptive bit rate streaming.
To be effective, content-adaptive coding must maintain constant quality as perceived by the user. The gold standard here is the Video Quality Breakdown Bitrate (VQBB), the bit rate at which artefacts just become noticeable when a video is watched under ideal subjective viewing conditions, including relatively low lighting and an optimum ratio between viewing distance and screen size.
There is, however, no universally agreed way of measuring VQBB automatically other than asking a panel of people for their perception. For this reason various alternative formulae have been applied, starting with the original generic PSNR (Peak Signal-to-Noise Ratio) used in the first codecs. PSNR relates the peak power of the signal to the distortion introduced by transmission factors such as interference in electrical circuits and imperfections of the medium. For video compression it estimates the pixel-level difference between the compressed version and the original, but it is a crude measure that takes no direct account of the content itself.
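As a reference point, PSNR is straightforward to derive from the mean squared error between two frames. A minimal NumPy sketch, assuming 8-bit luma frames (the toy gradient and noise level are illustrative only):

```python
import numpy as np

def psnr(original, compressed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two frames."""
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames: no distortion at all
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy 8-bit "frames": a horizontal gradient and a noisy copy of it
rng = np.random.default_rng(0)
frame = np.tile(np.arange(256, dtype=np.uint8), (64, 1))
noisy = np.clip(frame.astype(int) + rng.integers(-5, 6, frame.shape),
                0, 255).astype(np.uint8)
score = psnr(frame, noisy)
```

Note that the score depends only on pixel-wise error, which is exactly the crudeness described above: noise hidden in busy texture and noise smeared across a face score identically.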
SSIM (Structural SIMilarity index) was developed at the University of Texas at Austin as an improvement, taking account of perceived changes in structural information as well as perceptual phenomena such as luminance and contrast masking. It relies on the usually correct assumption that pixel parameters such as color are highly inter-dependent or correlated, and therefore contain redundant information that can be compressed. It gives a closer approximation to human visual perception than PSNR, but still with considerable discrepancies.
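The published SSIM index averages its comparison over sliding windows; the simplified single-window sketch below illustrates the luminance, contrast and structure terms at its core, using the commonly cited K1=0.01, K2=0.03 constants (the window simplification is an assumption for brevity, not the full algorithm):

```python
import numpy as np

def ssim_global(x, y, peak=255.0, k1=0.01, k2=0.03):
    """Simplified single-window SSIM computed over the whole image.

    The real index averages this over sliding windows; here the
    luminance, contrast and structure terms are computed globally.
    """
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx = ((x - mx) ** 2).mean()          # variance of x
    vy = ((y - my) ** 2).mean()          # variance of y
    cov = ((x - mx) * (y - my)).mean()   # covariance of x and y
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(1)
frame = np.tile(np.arange(256, dtype=np.uint8), (64, 1))
noisy = np.clip(frame.astype(int) + rng.integers(-5, 6, frame.shape),
                0, 255).astype(np.uint8)
```

An identical pair scores 1.0, and any distortion pulls the score below it, with the covariance term rewarding preserved structure rather than merely small pixel errors.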
For this reason more advanced content-aware measures have been developed, including the one Netflix has been using. There have also been a few notable proprietary techniques, such as the Perceptual Quality Measure (PQM) from Beamr, an Israeli/Californian video codec firm founded in 2009, which has focused on content aware encoding.
The firm’s premise was that PSNR and SSIM could be dismissed as simply not accurate enough. Other promising methods were emerging, but at the time they were too computationally intensive for live environments, either making the encoding too heavy to be applied in time, or decoding too demanding for power-constrained mobile devices, or both.
It therefore set about developing a method to measure perceptual quality on the fly to enable CABR (Content Adaptive Bit Rate). The idea was to vary bit rate accurately on a frame-by-frame basis, achieving the lowest bandwidth possible at any given time while preserving constant quality. It could be described as a perceptually driven enhancement of VBR (Variable Bit Rate) for video.
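Beamr does not publish CABR's internals, but the closed-loop idea can be sketched: for each frame, search the candidate bitrates and keep the lowest one whose predicted perceptual quality still meets the target. The `quality_of` callback and `toy_quality` model below are purely illustrative assumptions, not Beamr's actual metric:

```python
def cabr_frame_bitrate(complexity, target_quality, candidates, quality_of):
    """Return the lowest candidate bitrate whose predicted quality
    still meets the target, falling back to the highest rate."""
    for rate in sorted(candidates):
        if quality_of(complexity, rate) >= target_quality:
            return rate
    return max(candidates)

def toy_quality(complexity, bitrate_kbps):
    # Hypothetical stand-in for a perceptual quality model: quality
    # saturates as bitrate rises and falls as scene complexity rises.
    return 5.0 * bitrate_kbps / (bitrate_kbps + 1000.0 * complexity)

ladder = [500, 1000, 2000, 3000, 5000]          # candidate rates, kbps
easy_scene = cabr_frame_bitrate(0.5, 4.0, ladder, toy_quality)
hard_scene = cabr_frame_bitrate(2.0, 4.0, ladder, toy_quality)
```

Run per frame, this keeps perceived quality constant while spending bits only where the content demands them, which is exactly the contrast with plain VBR's rate-driven behaviour.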
The foundation for all such efforts is the ability to accurately determine human MOS (Mean Opinion Score) values in the first place as a basis for calibrating automated tools and metrics.
This has been achieved with ITU-R BT.500, which has stood the test of time as the most accurate way of turning subjective evaluations of video quality into the most objective scale possible. Beamr noticed that a number of tools knocking around academia attempted, in different ways, to automate BT.500 testing. It incorporated some of these and enhanced them with patented developments of its own to yield PQM, whose details it remains rather coy about, since they are the source of its competitive advantage.
Netflix is more open about its equivalent method, called Video Multimethod Assessment Fusion (VMAF), given that in its case the metric is a means to an end for a user rather than a differentiating technology. As the name indicates, VMAF fuses a variety of elementary techniques, applying machine learning to optimize the weight given to each so as to predict MOS scores as accurately as possible for a given bit rate.
The underlying premise of VMAF was that since Netflix video streams are delivered using the Transmission Control Protocol (TCP), with dropped packets retransmitted, packet losses and bit errors are never sources of visual impairment. That leaves just two sources of visual impairment in the encoding process: compression artefacts themselves, due to the reduced number of bits, and scaling artefacts resulting from video down-sampling before compression and subsequent up-sampling on the viewer’s device. By tailoring a quality metric to cover only compression and scaling artefacts, Netflix trades generality for precision, and overall accuracy improves.
The elementary metrics are then fused into a final super-metric using a machine-learning algorithm, in this case a Support Vector Machine (SVM) regressor, which assigns a weight to each of them. Some metrics count for more than others, to degrees that vary with the content, and this is reflected in the final weighting.
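Netflix has documented the use of an SVM regressor for this fusion step; the sketch below, using scikit-learn's `SVR` on synthetic data, shows the general shape of such a model. The feature layout and the linear formula generating the synthetic MOS labels are assumptions for illustration, not Netflix's training data:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic training set: each row holds elementary metric scores
# (standing in for e.g. VIF, DLM and a temporal feature), each
# label a MOS value on the usual 1-5 scale.
rng = np.random.default_rng(42)
features = rng.uniform(0.0, 1.0, size=(200, 3))
mos = 1.0 + 4.0 * (0.5 * features[:, 0]
                   + 0.3 * features[:, 1]
                   + 0.2 * (1.0 - features[:, 2]))

# The SVM regressor learns how much each elementary metric should
# contribute when fusing them into a single predicted MOS.
model = SVR(kernel="rbf").fit(features, mos)
predicted = model.predict(features[:5])
```

In production the labels would come from BT.500-style panel scores, so the regressor effectively learns the mapping from objective metrics to human opinion.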
The elementary metrics are of two types: one set for spatial compression of the still image, and one for temporal compression between frames. For the spatial component Netflix used Visual Information Fidelity (VIF), a widely adopted image quality metric based on the premise that perceived quality depends on the level of distortion as measured at several different scales, allowing for the detail within the image rather than treating every pixel the same. This is complemented by the Detail Loss Metric (DLM), which measures both loss of details that affect the content’s visibility and, separately, impairments that distract the viewer’s attention. On the temporal front, Netflix simply measures the average absolute difference in luminance between corresponding pixels in successive frames.
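That temporal feature is simple enough to state directly. A NumPy sketch, assuming 8-bit luma planes (the `temporal_diff` name and the toy frames are illustrative):

```python
import numpy as np

def temporal_diff(prev_luma, next_luma):
    """Mean absolute luminance difference between corresponding
    pixels of two successive frames."""
    return np.abs(next_luma.astype(np.float64)
                  - prev_luma.astype(np.float64)).mean()

# Toy frames: a static scene yields 0, a uniform brightness shift
# of 3 levels yields 3.0
still = np.zeros((4, 4), dtype=np.uint8)
moved = np.full((4, 4), 3, dtype=np.uint8)
```

High values flag fast motion or scene cuts, where temporal compression has less redundancy to exploit and quality is harder to hold.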
Netflix has different objectives from distributors of live content, and its method is computationally intensive on the encoding side, with the VIF metric in particular probably too demanding for linear services. Beamr, meanwhile, has evaluated its method in live operation, where it works in conjunction with established codecs such as H.264 and HEVC, and claims it can achieve 20% to 40% savings in bandwidth with HEVC over and above what the codec achieves by itself. Typically, when streaming 1080p "full HD" video, HEVC at VBR averages around 3Mbps, but Beamr’s leading HEVC encoder operating with CABR has got this down to 2Mbps, a saving of a third, without any quality degradation, according to the company.
As well as saving costs or improving quality, this gives broadcasters and operators the option of delaying upgrades of their core codec, perhaps keeping H.264 for longer. That might appeal at a time when the codec market remains in a state of flux, with HEVC quite well established among some broadcasters, while the Alliance for Open Media’s AV1 codec looks ready to gain traction among OTT services and device makers. Furthermore, successors to both HEVC and AV1 are in the wings, and they will themselves incorporate some of these content-adaptive techniques anyway. That would seem the obvious direction rather than running CABR as an adjunct to codecs, with MPEG chair and co-founder Leonardo Chiariglione recently indicating that the group’s next codec will incorporate neural network-based content adaptation.