The quest for automated, real-time measurement of perceptual video quality has been elusive ever since the first statistical algorithms emerged in the 1990s. The goal has been to emulate as closely as possible the Mean Opinion Scores (MOS) determined by human panels, in automated tools that can be scaled up to major streaming or broadcast services delivering to multiple devices in different formats and resolutions. A further aim has been to achieve this in real time, leading towards the next goal of taking action when content slips below set quality thresholds at various stages of distribution. Such actions might include shutting down a sub-standard stream or, preferably, fixing it in real time to bring it up to standard.
Among the leaders on these fronts is Canadian pioneer in the field SSIMWave, whose name is associated with the Structural Similarity Index Measure (SSIM), which emerged in the early noughties as a successor to basic statistical surrogates of MOS, such as Mean Squared Error (MSE). The firm's co-founder and Chief Science Officer Zhou Wang co-developed the predecessor of SSIM, called the Universal Quality Index (UQI) or Wang–Bovik Index, with Alan Bovik. They then collaborated with two others, Hamid Sheikh and Eero Simoncelli, to develop the current version of SSIM, which was published in April 2004 in the IEEE Transactions on Image Processing. The method has remained largely unchanged since and has been quite widely implemented, with general agreement that it outperforms alternatives such as MSE and Peak Signal-to-Noise Ratio (PSNR).
The objective of all such methods is to provide analogues of average human perception that can be computed rapidly, in principle at any point in the video distribution chain, and apply equally at all resolutions, device sizes, formats and viewing distances. A surrogate MOS is derived by comparing the asset at a given point against the near-perfect version output from production, prior to any compression or video processing.
PSNR is one commonly used measure of quality after lossy compression, because it provides a basic comparison between the original signal and the "noise", or error, that compression introduces. MSE achieves a similar result by averaging the squared differences between the original and compressed video. Both these methods have been refined for video but still fail to converge closely with MOS scores provided by human panels. They have been quite widely implemented, though, because they give some indication of video quality while being relatively easy to deploy with acceptable computational burden.
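That low computational burden is easy to see: both metrics reduce to a few array operations per frame. A minimal sketch with NumPy, using toy 4x4 "frames" rather than real video:

```python
import numpy as np

def mse(reference: np.ndarray, distorted: np.ndarray) -> float:
    """Mean squared error: average of squared per-pixel differences."""
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference: np.ndarray, distorted: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    err = mse(reference, distorted)
    if err == 0:
        return float("inf")  # identical frames: no noise at all
    return 10.0 * np.log10(max_value ** 2 / err)

# Toy 8-bit frames: the "compressed" copy is uniformly off by 2 levels.
ref = np.full((4, 4), 100, dtype=np.uint8)
dist = np.full((4, 4), 102, dtype=np.uint8)
print(mse(ref, dist))               # 4.0
print(round(psnr(ref, dist), 2))    # 42.11 dB
```

Note that neither function knows anything about where in the frame the errors fall, which is exactly the blindness to structure that SSIM set out to address.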
SSIM is still a statistical method but represents an improvement because it allows the different elements making up perceptual quality, that is luminance, contrast and structure, to be tuned to human-derived MOS, with greater flexibility for adjustment to specific scenarios. The first two are computed from local pixel statistics, while structure involves correlation between pixels to capture the integrity of objects and especially their boundaries, such as the edge of a face. The mathematical formula scores each of these three components individually, as well as correlations between them, with scope for dividing frames into smaller blocks. The output is then a weighted combination of the measures, which can be tuned to best match MOS scores.
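The published formula can be sketched compactly. The version below applies the 2004 paper's expression over one global window for brevity; production implementations instead slide a small (typically 11x11, Gaussian-weighted) window across the frame and average the local scores, and the test images here are invented:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_value: float = 255.0) -> float:
    """SSIM of two frames computed over a single global window.
    c1 and c2 are the paper's default stabilizing constants."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * max_value) ** 2  # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure

ref = np.linspace(0, 255, 64).reshape(8, 8)
noisy = ref + np.random.default_rng(0).normal(0, 20, ref.shape)
print(ssim_global(ref, ref))    # identical frames score essentially 1.0
print(ssim_global(ref, noisy))  # distortion pulls the score below 1.0
```

The covariance term is what gives SSIM its sensitivity to structure: scrambling pixel positions leaves MSE unchanged but collapses the correlation between the two frames.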
Although an improvement over its predecessors, SSIM suffered from two major impediments: inability to adapt to different spatial resolutions and inability to adapt to different temporal resolutions. These are serious handicaps, as a few examples indicate. A given video stream shown on a 60-inch TV and on a 6-inch smartphone would yield the same SSIM score yet exhibit very different perceptual QoE as measured by humans. A jagged edge of an object on the big screen might appear smooth on the small screen, for example, with similarly greater tolerances for color and luminance.
Then if a high frame rate (HFR) video at, say, 60 fps is downsampled along the temporal direction to a lower frame rate such as 30 fps, perceptual quality might again be fine on a small screen while exhibiting annoying motion artifacts on a big one. Yet the SSIM scores would be the same, because the algorithm cannot take account of temporal variations.
There is also a question over assessment of the impact of HDR (High Dynamic Range): although luminance is taken into account, it has not been established whether the resulting scores in that case track human perception closely. Humans tend to respond very positively to HDR.
Such considerations led SSIMWave to develop its enhanced SSIMPlus model, which factors display size, frame rate and other parameters into the scoring calculations. The output is a score from 1 to 100, making it more finely grained than most, and the company claims a correlation with human-derived scores of over 90%.
Wang again had a hand in development of the underlying algorithms, which were presented in 2015 in the paper "Display Device-Adapted Video Quality-of-Experience Assessment" at the Human Vision and Electronic Imaging conference. In essence, the weightings are adjusted by a further factor taking account of device characteristics and viewing distance. The system computes what is called the Contrast Sensitivity Function (CSF), which can incorporate up to 13 parameters that determine the human visual experience. The main ones are: average or range of user viewing distance, sizes of viewing window and screen, screen resolution, video scaling, temporal resolution and viewing angle. Of course, some of these cannot be known in a live environment, including viewing distance and angle, but assumptions can be made, and it is useful to be able to exploit these parameters when making decisions over quality thresholds during distribution.
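The exact SSIMPlus weighting is proprietary, but the role of these parameters can be illustrated. The sketch below computes pixels per degree of visual angle, a quantity against which a CSF-based model could sample distortion visibility; the `ViewingSetup` class, the function and the example numbers are our own invention for illustration, not part of any SSIMWave API:

```python
import math
from dataclasses import dataclass

@dataclass
class ViewingSetup:
    """Hypothetical subset of the viewing parameters the article lists."""
    screen_diagonal_in: float   # physical screen size
    screen_width_px: int        # horizontal resolution
    viewing_distance_in: float  # assumed or measured viewer distance

def pixels_per_degree(setup: ViewingSetup, aspect: float = 16 / 9) -> float:
    """Angular resolution at the eye: how many pixels fall within one
    degree of visual angle for this screen and viewing distance."""
    # Horizontal screen width from the diagonal and aspect ratio.
    width_in = setup.screen_diagonal_in * aspect / math.sqrt(1 + aspect**2)
    px_per_inch = setup.screen_width_px / width_in
    # Physical span of one degree of visual angle at this distance.
    inches_per_degree = 2 * setup.viewing_distance_in * math.tan(math.radians(0.5))
    return px_per_inch * inches_per_degree

tv = ViewingSetup(60, 1920, 108)   # 60-inch 1080p TV viewed from ~9 feet
phone = ViewingSetup(6, 1920, 12)  # 6-inch 1080p phone at arm's length
print(pixels_per_degree(tv), pixels_per_degree(phone))
```

With these numbers the handheld actually resolves more pixels per degree than the big screen, which is why the same 1080p stream can look clean on a phone while showing visible artifacts on the TV: a device-adapted metric can weight distortions accordingly, where plain SSIM cannot.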
Indeed, the benefits of accurate perceptual quality calculation include not just improvement in the viewing experience but also efficiency, for example through reduced bandwidth, with less need for overprovisioning to insulate against short-term fluctuations in bit rate. Resolution and bit rate can be cut if the perceptual quality is known accurately and can be adjusted in real time, with greater scope also for taking the receiving display's characteristics into account.
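This bandwidth saving is a simple selection problem once per-rung perceptual scores exist. A sketch, with an invented encoding ladder and invented scores standing in for the output of a metric such as SSIMPlus:

```python
# Hypothetical encoding ladder: (bitrate_kbps, predicted perceptual score 1-100).
# Both the rungs and the scores are invented for illustration.
ladder = [(800, 62), (1800, 78), (3200, 88), (5000, 93), (8000, 95)]

def cheapest_acceptable(ladder, target_score):
    """Return the lowest-bitrate rung that still meets the quality target,
    or None if no rung is good enough (keep the top rung, or re-encode)."""
    for bitrate, score in sorted(ladder):
        if score >= target_score:
            return bitrate
    return None

print(cheapest_acceptable(ladder, 85))  # 3200: saves bandwidth vs. the top rungs
```

Without a trustworthy score, an operator must overprovision (ship 5000 or 8000 kbps to be safe); with one, the 3200 kbps rung can be chosen with confidence, and the target itself could be adapted per device using the display parameters discussed above.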
Of course, SSIMWave is not the only game in town, and there are other specialists in perceptual quality measurement. One is Video Clarity, whose ClearView range of video quality measurement and analysis systems has some interesting features. It incorporates both temporal and spatial metrics, so that it can assess the impact of frame rate changes as SSIMWave now can, while offering a picture zoom of up to 16 times to inspect compression artifacts in close-up. This again could help resolve the varying impact of compression at different display sizes.
Another early mover in perceptual quality measurement was Germany's Opticom, with its Perceptual Evaluation of Video Quality (PEVQ). Like SSIMWave's methods, PEVQ compares each frame of the video signal under test, employing pixel-based analysis, against the corresponding original content, which could be an uncompressed studio master or an HD video at the head-end.
The algorithm was assessed independently in its earlier incarnation, having been a winner of the Video Quality Experts Group (VQEG) Multimedia project, and as a result was enshrined in ITU-T Recommendation J.247 in 2008. PEVQ-S, an advanced hybrid version of PEVQ for IP-based video streaming, was then named a winner of the latest VQEG hybrid benchmark in 2014.
But that was before the enhanced SSIMPlus came into being, and since then there has been a need for independent assessment of the various options and the claims made by their respective vendors, as well as counterclaims. One such counterclaim is that SSIMPlus is not, after all, radically different from preceding methods such as MSE, with one paper arguing that they are related both statistically and methodologically; in other words, related both in their approach and in the results they obtain. This is disingenuous, because the same could be said of many technological domains where subsequent methods build on previous ones. It could be said of the HEVC and H.264 codecs, for example, or even MPEG-2 before that, which are related in their relative performance in given scenarios as well as in their underlying principles, but that does not mean they are equally efficient.
SSIMWave has recently developed variants of its algorithms for VoD content where there is more scope for taking actions during delivery because it is less time critical. A key recent release is a combination of two products, the SSIMPlus VoD Monitor Production and SSIMPlus VoD Inspector tool. As SSIMWave’s CEO and Co-Founder Abdul Rehman explained to us, these two products use exactly the same algorithms and differ primarily in their use cases, scalability, deployment environment and UI.
“The main use case for VoD Production is qualifying content based on viewer experience,” said Rehman. “On the other hand, VoD Inspector can be deployed on any CentOS system and can be managed using a GUI and a REST-API. The main use-case for VoD Inspector is the comparison of source content, encoder, codec, and configurations through deep analysis of content quality.”
The VoD Inspector can also be used to set quality thresholds at multiple points in the delivery system. Those assets that meet the threshold could be let through while those that fail can be analyzed to identify how the problems can be fixed.
“A frame-comparison tool is also available in the GUI, where comparisons can be done between outputs and corresponding source,” said Rehman. “While there are manifold reasons a video file may suffer from quality issues, VoD Inspector identifies exactly which problem exists in your file so you can solve it easily.”
One slight deficit we can identify, as hinted earlier, is the shortage of independent assessment and comparison. To give SSIMWave credit, this was acknowledged in the 2015 paper on the SSIMPlus enhancements, with the admission that databases capable of feeding tests of perceptual quality at different spatial and temporal resolutions did not exist, so internal data sets had to be used. Given the higher resolutions and frame rates now possible, as well as the emergence of HDR in several flavors and the rapid proliferation of live streaming services, demand for perceptual quality measurement is growing fast, with a pressing need for independent assessment of the various tools and methods available.