A couple of weeks ago, a very interesting article was posted on the Netflix Tech Blog, giving us a view of a practical video quality metric as perceived by the world's leading content provider, Netflix.
The metric proposed by Netflix is called Video Multimethod Assessment Fusion (VMAF) and seeks to reflect the viewer’s perception of Netflix streaming quality. The plan is to release it as an open-source tool so that the research community can get involved in its evolution.
The main objective of Netflix is to deliver high-quality content and give subscribers a great viewing experience: smooth video playback, free of annoying picture artifacts, within the constraints of the network bandwidth and the viewing device.
Currently Netflix uses contemporary codecs such as H.264/AVC, HEVC and VP9 in order to stream at reasonable bitrates, at the cost of quality degradation and the appearance of codec-specific artifacts. Netflix encodes the video streams in a distributed cloud-based media pipeline (the original post links to more information), which allows it to scale to meet the needs of the service. To minimize the impact of bad source deliveries, software bugs and the unpredictability of cloud instances (transient errors), automated quality monitoring is performed at various points of the pipeline.
For its video quality research, Netflix starts by building an appropriate data set for experimentation, one that matches the specifics of its service in terms of content variety and sources of artifacts. Because Netflix streams over TCP, packet losses and bit errors do not cause visual impairments, so the quality degradation observed at Netflix is caused by two types of artifacts:
- compression artifacts (due to lossy compression) and
- scaling artifacts (for lower bitrates, video is downsampled before compression, and later upsampled on the viewer’s device)
So the objective of the Netflix research is to build a special-purpose metric, focused on the two aforementioned artifacts, that outperforms general-purpose video quality metrics.
For the Netflix dataset, a sample of 34 source clips (also called reference videos), each 6 seconds long, was selected from popular TV shows and movies in the Netflix catalog and combined with a selection of publicly available clips. The source clips covered a wide range of high-level features (animation, indoor/outdoor, camera motion, face close-up, people, water, obvious salience, number of objects) and low-level characteristics (film grain noise, brightness, contrast, texture, motion, color variance, color richness, sharpness). Using the source clips, Netflix researchers encoded H.264/AVC video streams at resolutions ranging from 384×288 to 1920×1080 and bitrates from 375 kbps to 20,000 kbps, resulting in about 300 distorted videos. This sweeps a broad range of video bitrates and resolutions to reflect the widely varying network conditions of Netflix members.
Using this data set, subjective tests were performed with the Double Stimulus Impairment Scale (DSIS) method, as specified in Recommendation ITU-R BT.500-13. The results of this process were mapped to the respective Differential Mean Opinion Score (DMOS) values. The scatter plots in the original article show the observers’ DMOS on the x-axis and the predicted score from different quality metrics on the y-axis, namely PSNR, SSIM, Multiscale FastSSIM, and PSNR-HVS.
It can be seen from those graphs that these metrics fail to provide scores that consistently predict the DMOS ratings from observers. Above each plot, the article reports the Spearman’s rank correlation coefficient (SRCC), the Pearson product-moment correlation coefficient (PCC) and the root-mean-squared-error (RMSE) figures for each of the metrics, calculated after a non-linear logistic fitting, as outlined in Annex 3.1 of ITU-R BT.500-13. SRCC and PCC values closer to 1.0 and RMSE values closer to zero are desirable. Among the four metrics, PSNR-HVS demonstrates the best SRCC, PCC and RMSE values, but is still lacking in prediction accuracy. To address this issue, Netflix adopts a machine-learning based model to design a metric that seeks to reflect human perception of video quality.
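To make the three figures of merit concrete, here is a minimal sketch of how SRCC, PCC and RMSE could be computed for one metric. It is an illustration only: the function names and the sample numbers are made up, and the non-linear logistic fitting that the article applies before computing the statistics is deliberately left out, so the values are computed on raw scores.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient (PCC)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(x, y):
    """Root-mean-squared error between two equally long score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical subjective scores and metric predictions (not from the article).
subjective = [20.0, 35.0, 50.0, 65.0, 80.0]
predicted = [25.0, 30.0, 55.0, 60.0, 90.0]

srcc = spearman(subjective, predicted)
pcc = pearson(subjective, predicted)
error = rmse(subjective, predicted)
```

In this toy data the predictions preserve the ordering of the subjective scores, so SRCC is 1.0 even though PCC is below 1.0 and the RMSE is non-zero. This is why the three figures are usually reported together.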
Netflix researchers, in collaboration with Prof. C.-C. J. Kuo and his group at the University of Southern California, developed Video Multimethod Assessment Fusion, or VMAF, which predicts subjective quality by combining multiple elementary quality metrics. The elementary metrics are ‘fused’ into a final metric by a machine-learning algorithm, in the Netflix case a Support Vector Machine (SVM) regressor, which assigns a weight to each elementary metric. The final metric can thus preserve the strengths of the individual metrics and deliver a more accurate final score. The machine-learning model is trained and tested using the opinion scores obtained through the aforementioned subjective experiment on the Netflix dataset.
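The idea of learning weights that fuse several elementary scores into one prediction can be sketched in a few lines. Note the substitution: the article uses an SVM regressor, whereas this sketch uses ordinary least squares, purely to stay dependency-free. All function names, feature values and scores below are hypothetical and synthetic.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def fit_fusion_weights(features, scores):
    """Least-squares weights, with the bias as last entry, mapping features to scores."""
    rows = [list(f) + [1.0] for f in features]  # constant 1.0 models the bias
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def fuse(weights, feature_row):
    """Predict a fused quality score for one row of elementary metric values."""
    return sum(w * f for w, f in zip(weights, feature_row)) + weights[-1]

# Synthetic training data: each row is one clip's [vif, dlm, motion] values.
# The scores were generated as 40*vif + 50*dlm + 0.5*motion + 5 for the demo.
train_features = [
    [0.90, 0.95, 2.0],
    [0.70, 0.80, 8.0],
    [0.50, 0.65, 4.0],
    [0.30, 0.35, 12.0],
    [0.60, 0.50, 6.0],
    [0.80, 0.70, 1.0],
]
train_scores = [89.5, 77.0, 59.5, 40.5, 57.0, 72.5]

weights = fit_fusion_weights(train_features, train_scores)
fused_score = fuse(weights, [0.75, 0.85, 3.0])
```

A linear model keeps the role of the weights easy to see. A kernel SVM regressor can also capture non-linear interactions between the elementary metrics, which a plain weighted sum cannot.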
The current version of the VMAF algorithm uses the following elementary metrics fused by Support Vector Machine (SVM) regression:
- Visual Information Fidelity (VIF). VIF is a well-adopted image quality metric based on the premise that quality is complementary to the measure of information fidelity loss. In VMAF, a modified version of VIF is adopted, where the loss of fidelity in each scale is included as an elementary metric.
- Detail Loss Metric (DLM). DLM is an image quality metric based on the rationale of separately measuring the loss of details, which affects content visibility, and the redundant impairment, which distracts viewer attention. The original metric combines both DLM and the additive impairment measure (AIM) to yield a final score. In VMAF, only the DLM is adopted as an elementary metric.
VIF and DLM are both image quality metrics. The following simple feature is further introduced to account for the temporal characteristics of video:
- Motion. This is a simple measure of the temporal difference between adjacent frames. This is accomplished by calculating the average absolute pixel difference for the luminance component.
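Read literally, the Motion feature described above amounts to a mean absolute difference between co-located luminance samples of consecutive frames. The sketch below follows that description only. The frame representation (nested lists of luma values), the function names and the choice of 0.0 for the first frame are assumptions. Any filtering or other processing in the actual VMAF implementation is not covered.

```python
def mean_absolute_difference(prev_luma, curr_luma):
    """Average absolute per-pixel difference between two luminance frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev_luma, curr_luma):
        for prev_pixel, curr_pixel in zip(prev_row, curr_row):
            total += abs(curr_pixel - prev_pixel)
            count += 1
    return total / count

def motion_scores(luma_frames):
    """One motion value per frame, measured against the preceding frame.

    The first frame has no predecessor; assigning it 0.0 is an assumption.
    """
    if not luma_frames:
        return []
    scores = [0.0]
    for prev, curr in zip(luma_frames, luma_frames[1:]):
        scores.append(mean_absolute_difference(prev, curr))
    return scores

# Three tiny hypothetical 2x2 luminance frames.
frames = [
    [[10, 10], [10, 10]],
    [[12, 10], [10, 14]],
    [[12, 10], [10, 14]],
]
motion = motion_scores(frames)  # [0.0, 1.5, 0.0]
```

A scene cut would show up here as a single very large value, because almost every pixel changes between the two frames.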
These elementary metrics and features were chosen from among other candidates through iterations of testing and validation. The posted article does not make sufficiently clear how the Motion feature is handled at shot boundaries, where it produces high values that are most probably discarded.
Then, the Netflix researchers compare the accuracy of VMAF to that of PSNR-HVS, the best-performing metric from the earlier section, and show that VMAF performs appreciably better.
The article also reports on a comparison of VMAF with the Video Quality Model with Variable Frame Delay (VQM-VFD), considered by many as state of the art in the field. VQM-VFD is an algorithm that uses a neural network model to fuse low-level features into a final metric. It is similar to VMAF in spirit, except that it extracts features at lower levels such as spatial and temporal gradients.
It is clear that VQM-VFD performs close to VMAF on the NFLX-TEST dataset. Since the VMAF approach allows for incorporation of new elementary metrics into its framework, VQM-VFD could serve as an elementary metric for VMAF as well.
Summarizing, the article provides the SRCC, PCC and RMSE of the different metrics discussed earlier, on the NetFlix dataset and three popular public datasets: the VQEG HD (vqeghd3 collection only), the LIVE Video Database and the LIVE Mobile Video Database. The results show that VMAF outperforms other metrics in all but the LIVE dataset, where it still offers competitive performance compared to the best-performing VQM-VFD.
*For compression-only impairments (H.264/AVC and MPEG-2 Video)
Finally, the article concludes with the current open research issues:
- Viewing conditions. Netflix supports thousands of active devices covering smart TVs, game consoles, set-top boxes, computers, tablets and smartphones, resulting in widely varying viewing conditions for its members. With more subjective data, Netflix researchers plan to generalize the algorithm such that viewing conditions (display size, distance from screen, etc.) can be inputs to the regressor.
- Temporal pooling. The current VMAF implementation calculates quality scores on a per-frame basis. In many use cases, it is desirable to temporally pool these scores to return a single value as a summary over a longer period of time. For example, a score over a scene, a score over regular time segments, or a score for an entire movie is desirable. A perceptually accurate temporal pooling mechanism for VMAF and other quality metrics remains an open and challenging problem.
- A consistent metric. Since VMAF incorporates full-reference elementary metrics, VMAF is highly dependent on the quality of the reference. Unfortunately, the quality of video sources may not be consistent across all titles in the Netflix catalog. Sources come into the Netflix system at resolutions ranging from SD to 4K. Because of this, it can be inaccurate to compare (or summarize) VMAF scores across different titles. For quality monitoring, it is highly desirable to calculate absolute quality scores that are consistent across sources. So, future work includes developing an automated way to predict the opinion viewers form about the quality of the video delivered to them, taking into account all factors that contribute to the final video presented on their screen.
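To illustrate the temporal pooling issue from the list above, the following sketch applies a few simple pooling rules to per-frame scores. None of these rules is claimed to be perceptually accurate, which the article describes as an open problem. The function names and the scores are hypothetical.

```python
import statistics

def pool_scores(frame_scores):
    """Summarize per-frame quality scores with three simple pooling rules."""
    return {
        "mean": statistics.mean(frame_scores),
        # The harmonic mean is pulled down more strongly by low scores.
        "harmonic_mean": statistics.harmonic_mean(frame_scores),
        "min": min(frame_scores),
    }

def pool_by_segment(frame_scores, frames_per_segment, pool_fn=statistics.mean):
    """Pool scores over consecutive fixed-length segments (the last may be shorter)."""
    return [
        pool_fn(frame_scores[start:start + frames_per_segment])
        for start in range(0, len(frame_scores), frames_per_segment)
    ]

# Hypothetical per-frame scores with one badly degraded frame.
frame_scores = [90.0, 92.0, 40.0, 88.0, 91.0, 89.0]

pooled = pool_scores(frame_scores)
segment_means = pool_by_segment(frame_scores, frames_per_segment=3)
```

The single bad frame barely moves the arithmetic mean but dominates the minimum. Which of these behaviors best matches what a viewer remembers is exactly what a perceptual pooling model would need to determine.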
Original Post: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html
 H. Sheikh and A. Bovik, “Image Information and Visual Quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
 S. Li, F. Zhang, L. Ma, and K. Ngan, “Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.
 S. Wolf and M. H. Pinson, “Video Quality Model for Variable Frame Delay (VQM_VFD),” U.S. Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11-482, Sep. 2011.