A surprising amount of heat can be generated in video quality measurement, not just between competing vendors but also between champions of different algorithms. The subject has risen up the agenda, as was evident at this year’s NAB 2019, because of the growing use of streaming over networks of varying capacity to devices of different types and resolutions. This makes it harder for broadcasters, operators and content owners to deliver a consistent viewing experience, and has driven demand for better measurements of perceptual video quality against a constantly moving target.
The field has attracted a number of start-ups as well as standards initiatives, some from established bodies and others led by operators, all aiming to emulate and automate human video perception. One problem has always been that human perception is subjective, which led to the idea of Mean Opinion Scores (MOS) as the foundation benchmark. But MOS panels do not scale, and nor does “expert” viewing by individuals, so the goal has been to mimic MOS scores as accurately and efficiently as possible. This has led to the gold standard for automated video quality measurement: full-reference algorithms, which compare processed or encoded streams with the original source. These yield the most accurate results because the equipment performing the test is supplied with two copies of the content: a source (or reference) version of the video, and a version that has been processed in some way during transmission. However, the full-reference technique is computationally expensive, since it needs to measure the whole sequence in real time so that normal video delivery can continue uninterrupted while tests are being performed.
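The full-reference idea can be illustrated with the simplest such metric, PSNR, which scores a processed frame against its source from their pixel-wise mean squared error. The sketch below is a minimal illustration using synthetic frames, not a production measurement tool:

```python
import numpy as np

def psnr(reference: np.ndarray, processed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a source frame and a processed copy."""
    mse = np.mean((reference.astype(np.float64) - processed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: no distortion at all
    return 10.0 * np.log10((max_value ** 2) / mse)

# Toy example: an 8-bit greyscale "frame" and a lightly degraded copy of it.
rng = np.random.default_rng(0)
source = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
noise = rng.integers(-5, 6, size=(64, 64))
degraded = np.clip(source.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR of degraded copy: {psnr(source, degraded):.1f} dB")
```

The cost driver in the full-reference model is visible even here: every pixel of both copies must be available to the tester, for every frame of the sequence.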
As a result, various stripped-down methods have been developed. One is bitstream evaluation of the IP header, without looking at the video payload at all, which takes surrogacy to its ultimate level on the assumption that deterioration in quality can be detected by sampling just a small subset of the data being transmitted.
Another effort is a lightweight QoE evaluation protocol synthesized from several existing methodologies by the EBU (European Broadcasting Union), which we covered last month. This development involved analysis to derive mathematical relationships between combinations of seven parameters and MOS scores. The first three are packet loss, jitter (measured as the average difference between the mean latency and the latency of each sample) and start-up delay. Fourth comes the underflow time ratio, defined as the cumulative duration of stalls, or the latency added as a result of them. Fifth is the number of stalls, sixth their duration, and seventh resolution switches, that is the number of transitions between two different segment qualities. This represents quite a broad spectrum of quality measures, so it is hard to see how it equates consistently to viewer perceptions, especially as the impact of some artefacts is so content dependent, or even frame dependent, as well as varying between diverse connected devices. Nevertheless it has produced some promising early results.
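To make the shape of such a model concrete, the sketch below maps those seven parameters onto a 1-to-5 MOS scale with a simple linear penalty per impairment. The penalty weights here are invented for illustration; the EBU work derives the actual relationships empirically against panel MOS data:

```python
from dataclasses import dataclass

@dataclass
class StreamStats:
    packet_loss: float        # fraction of packets lost, 0..1
    jitter_ms: float          # avg |sample latency - mean latency|
    startup_delay_s: float    # time before playback begins
    underflow_ratio: float    # cumulative stall time / playback time
    stall_count: int          # number of rebuffering events
    stall_duration_s: float   # total time spent stalled
    resolution_switches: int  # transitions between segment qualities

def estimate_mos(s: StreamStats) -> float:
    """Illustrative only: start from a perfect score of 5 and subtract
    weighted impairments, clamping to the 1..5 MOS scale. All weights
    below are hypothetical stand-ins, not the EBU's fitted values."""
    score = 5.0
    score -= 40.0 * s.packet_loss
    score -= 0.01 * s.jitter_ms
    score -= 0.20 * s.startup_delay_s
    score -= 5.00 * s.underflow_ratio
    score -= 0.30 * s.stall_count
    score -= 0.10 * s.stall_duration_s
    score -= 0.05 * s.resolution_switches
    return max(1.0, min(5.0, score))

smooth = StreamStats(0.0, 5.0, 1.0, 0.0, 0, 0.0, 1)
glitchy = StreamStats(0.02, 40.0, 4.0, 0.05, 3, 6.0, 8)
print(estimate_mos(smooth), estimate_mos(glitchy))
```

Even this toy version shows why the approach is attractive: all seven inputs can be collected from delivery telemetry, with no access to the decoded pictures required.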
Of the start-ups, the one that has made the biggest splash is Canada’s SSIMWAVE, whose SSIM (Structural Similarity) algorithm has become the most widely cited video QoE metric and has been adopted by several tier 1 operators, including Telefonica. It has certainly been a great marketing success, but the original algorithm has been surpassed in some tests. Netflix itself evaluated SSIM at scale, along with other existing video quality metrics including the fundamental PSNR (Peak Signal-to-Noise Ratio), but concluded that none of them captured human visual perception accurately enough.
This led Netflix to collaborate with the University of Southern California, the University of Texas at Austin and the University of Nantes with the goal of improving its own VMAF (Video Multi-method Assessment Fusion), which is now available open source on GitHub. It has quite widely been found to yield the closest match to MOS scores obtained by human panels, particularly for 4K content, although there are concerns over efficiency given that it is based on the full-reference model. However, it has also matched DMOS (Differential MOS) scores, which take account of the range of scores among the panel as well as just the mean.
Then one of the latest papers, from Cornell University and published in January 2019, found that despite having been developed for traditional 2D content, VMAF accurately predicted user perception for 360VR (Virtual Reality) footage.
The essential idea of VMAF is that individual quality metrics, such as visual information fidelity, detail loss and motion (the temporal difference between adjacent frames), only give partial measures of perceptual quality, but when fused together can yield a more accurate “master metric”. Netflix has applied conventional machine learning to assign weights to the different metrics and tune the model until it matches human perception as closely as possible.
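The fusion step can be sketched in miniature: given per-clip elementary metrics and panel MOS scores, learn the weights that best combine the metrics into one predictor. VMAF itself trains a support vector regressor on real subjective data; the version below substitutes ordinary least squares and synthetic data purely to show the principle:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each row holds three elementary metrics for one clip, standing in for
# e.g. a visual-fidelity score, a detail-loss score and a motion score.
features = rng.uniform(0.0, 1.0, size=(200, 3))

# Synthetic "panel MOS": a hidden weighted mix of the metrics plus noise.
true_weights = np.array([0.5, 0.3, 0.2])
mos = 1.0 + 4.0 * features @ true_weights + rng.normal(0.0, 0.05, size=200)

# Fit fusion weights (with an intercept column) by least squares.
X = np.column_stack([np.ones(len(features)), features])
weights, *_ = np.linalg.lstsq(X, mos, rcond=None)

# The fused "master metric" for every clip.
fused = X @ weights
print("learned weights:", np.round(weights, 2))
```

The point of the exercise is that no single column predicts the panel score well on its own, but the learned combination tracks it closely, which is exactly the bet VMAF makes with its richer metrics and regressor.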
The latest version of VMAF was featured at NAB 2019 in a product called ClearView from Video Clarity, which has found it very consistent across content types and good at assessing artefacts associated with streaming.
However, SSIMWAVE has not been standing still and at NAB showed the latest version of its technology in SSIMPLUS, with a focus on comparing individual components such as encoders, transcoders and packagers. It also wants to apply its technology to Service Level Agreements covering content handoffs between different parties in the distribution chain.
It will be interesting to see how SSIMPLUS shapes up against VMAF, although it definitely scores for efficiency, at least in earlier tests. A study from the University of Waterloo in Ontario, published in July 2017, found that SSIMPLUS ran 10 times faster than the version of Netflix’s VMAF available at the time.