VMAF: A Netflix Video Quality Metric

A couple of weeks ago, a very interesting article was posted on the Netflix Tech Blog, offering a view into a practical video quality metric as perceived by the worldwide leading content provider, Netflix.

The proposed metric is called Video Multimethod Assessment Fusion (VMAF) and seeks to reflect the viewer’s perception of Netflix streaming quality. The plan is to release the metric as an open-source tool, so that the research community can get involved in its evolution.

The main objective of Netflix is to deliver high-quality content and provide subscribers a great viewing experience: smooth video playback, free of annoying picture artifacts, within the constraints of the network bandwidth and viewing device.

Netflix currently utilizes the most contemporary codecs, such as H.264/AVC, HEVC and VP9, in order to stream at reasonable bitrates, at the cost of quality degradation and the appearance of codec-specific artifacts. Netflix encodes its video streams in a distributed cloud-based media pipeline (more info is available here), which allows it to scale to meet the needs of the service. To minimize the impact of bad source deliveries, software bugs and the unpredictability of cloud instances (transient errors), automated quality monitoring is performed at various points of the pipeline.

For its video quality research, Netflix first built an appropriate dataset for experimentation, one that meets the standards of its service in terms of content variety and sources of artifacts. Because streaming is TCP-based, the quality degradation observed at Netflix is caused by two types of artifacts:

  1. compression artifacts (due to lossy compression) and
  2. scaling artifacts (for lower bitrates, video is downsampled before compression and later upsampled on the viewer’s device).

So the objective of the Netflix research is to build a special-purpose metric, focused on the two aforementioned artifact types, that outperforms general-purpose video quality metrics.

For the Netflix dataset, a sample of 34 source clips (also called reference videos) was selected, each 6 seconds long, drawn from popular TV shows and movies in the Netflix catalog and combined with a selection of publicly available clips. The source clips covered a wide range of high-level features (animation, indoor/outdoor, camera motion, face close-up, people, water, obvious salience, number of objects) and low-level characteristics (film grain noise, brightness, contrast, texture, motion, color variance, color richness, sharpness). Using the source clips, Netflix researchers encoded H.264/AVC video streams at resolutions ranging from 384×288 to 1920×1080 and bitrates from 375 kbps to 20,000 kbps, resulting in about 300 distorted videos. This sweeps a broad range of bitrates and resolutions to reflect the widely varying network conditions of Netflix members.
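A sweep like this can be scripted in a straightforward way. The sketch below only builds hypothetical ffmpeg command lines for a few representative (resolution, bitrate) points from the ranges quoted above; the file names and encoder flags are illustrative, not Netflix's actual pipeline:

```python
import itertools

# (width, height) pairs and target bitrates spanning the ranges described
# in the article; only a few representative points are listed here.
resolutions = [(384, 288), (1280, 720), (1920, 1080)]
bitrates_kbps = [375, 1750, 5800, 20000]

def encode_commands(source="reference.mp4"):
    # Build one hypothetical ffmpeg command per (resolution, bitrate) pair.
    cmds = []
    for (w, h), kbps in itertools.product(resolutions, bitrates_kbps):
        out = f"dist_{w}x{h}_{kbps}k.mp4"
        cmds.append(
            f"ffmpeg -i {source} -vf scale={w}:{h} "
            f"-c:v libx264 -b:v {kbps}k {out}"
        )
    return cmds

# Running these commands (e.g. via subprocess) would produce the distorted
# encodes; here we only construct the command lines.
```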

Using this dataset, subjective DSIS assessment tests were performed as specified in Recommendation ITU-R BT.500-13. The results of this process were mapped to corresponding DMOS values. The scatter plots below show the observers’ DMOS on the x-axis and the predicted score from different quality metrics on the y-axis, namely: PSNR, SSIM, Multiscale FastSSIM, and PSNR-HVS.


It can be seen from the graphs that these metrics fail to provide scores that consistently predict the DMOS ratings from observers. Above each plot, the Spearman rank correlation coefficient (SRCC), the Pearson product-moment correlation coefficient (PCC) and the root-mean-squared error (RMSE) are reported for each metric, calculated after a non-linear logistic fitting, as outlined in Annex 3.1 of ITU-R BT.500-13. SRCC and PCC values closer to 1.0 and RMSE values closer to zero are desirable. Among the four metrics, PSNR-HVS demonstrates the best SRCC, PCC and RMSE values, but is still lacking in prediction accuracy. To address this issue, Netflix adopts a machine-learning-based model to design a metric that seeks to reflect human perception of video quality.
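The fit-then-score procedure described above can be sketched as follows. The three-parameter logistic form and the SciPy-based implementation are illustrative choices, not the exact code behind the article's figures:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import spearmanr, pearsonr

def logistic(x, b1, b2, b3):
    # Monotonic logistic mapping from raw metric scores to the subjective
    # scale, in the spirit of Annex 3.1 of ITU-R BT.500-13 (the exact
    # functional form used in the article may differ).
    return b1 / (1.0 + np.exp(-b2 * (x - b3)))

def evaluate_metric(raw_scores, dmos):
    raw = np.asarray(raw_scores, dtype=float)
    dmos = np.asarray(dmos, dtype=float)
    # Fit the logistic curve to the (metric, DMOS) pairs, then score the
    # fitted predictions against the subjective data.
    params, _ = curve_fit(logistic, raw, dmos,
                          p0=[dmos.max(), 0.1, raw.mean()], maxfev=10000)
    fitted = logistic(raw, *params)
    srcc = spearmanr(raw, dmos).correlation  # rank order is unaffected by a monotonic fit
    pcc = pearsonr(fitted, dmos)[0]
    rmse = float(np.sqrt(np.mean((fitted - dmos) ** 2)))
    return srcc, pcc, rmse
```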

Netflix researchers, in collaboration with Prof. C.-C. J. Kuo and his group at the University of Southern California, developed Video Multimethod Assessment Fusion, or VMAF, which predicts subjective quality by combining multiple elementary quality metrics. By ‘fusing’ elementary metrics into a final metric with a machine-learning algorithm, in Netflix’s case a Support Vector Machine (SVM) regressor that assigns a weight to each elementary metric, the final metric can preserve the strengths of the individual metrics and deliver a more accurate final score. The machine-learning model is trained and tested using the opinion scores obtained through the aforementioned subjective experiment on the Netflix dataset.
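The fusion step can be sketched with a standard SVM regressor, for instance scikit-learn's SVR. The feature matrix, synthetic ground truth and hyperparameters below are purely illustrative, not Netflix's actual training setup:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical per-clip feature matrix: one row per distorted clip, one
# column per elementary metric (e.g. VIF, DLM, motion); `dmos` stands in
# for the subjective scores from the experiment.
rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(300, 3))               # elementary metric scores
dmos = 100.0 * features.mean(axis=1) + rng.normal(0, 2, 300)  # toy ground truth

# Fuse the elementary metrics with an SVM regressor; scaling the inputs
# first is standard practice for kernel SVMs.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
model.fit(features, dmos)

# Predicted fused quality score for each clip.
predicted = model.predict(features)
```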

The current version of the VMAF algorithm uses the following elementary metrics fused by Support Vector Machine (SVM) regression:

  1. Visual Information Fidelity (VIF) [1]. VIF is a well-adopted image quality metric based on the premise that quality is complementary to the measure of information fidelity loss. In VMAF, a modified version of VIF is adopted, where the loss of fidelity is included as an elementary metric.
  2. Detail Loss Metric (DLM) [2]. DLM is an image quality metric based on the rationale of separately measuring the loss of details which affects the content visibility, and the redundant impairment which distracts viewer attention. The original metric combines both DLM and additive impairment measure (AIM) to yield a final score. In VMAF, only the DLM is adopted as an elementary metric.

VIF and DLM are both image quality metrics. The researchers further introduce the following simple feature to account for the temporal characteristics of video:

  1. Motion. This is a simple measure of the temporal difference between adjacent frames. This is accomplished by calculating the average absolute pixel difference for the luminance component.
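A minimal sketch of this motion feature, assuming the luminance planes are available as 2-D NumPy arrays (this is the idea described above, not the exact VMAF implementation):

```python
import numpy as np

def motion_scores(luma_frames):
    # luma_frames: sequence of 2-D arrays (the Y plane of each frame).
    # Returns the average absolute pixel difference between each frame and
    # its predecessor, i.e. one value per frame transition.
    frames = [np.asarray(f, dtype=np.float64) for f in luma_frames]
    return [float(np.mean(np.abs(curr - prev)))
            for prev, curr in zip(frames, frames[1:])]
```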

These elementary metrics and features were chosen from among other candidates through iterations of testing and validation. It is not sufficiently clear from the posted article how the Motion feature is handled at shot boundaries, where it yields abnormally high values that most probably are discarded.

The Netflix researchers then compare the accuracy of VMAF to PSNR-HVS, the best-performing metric from the earlier comparison, and it is clear that VMAF performs appreciably better.


The article also reports on a comparison of VMAF to the Video Quality Model with Variable Frame Delay (VQM-VFD) [3], considered by many to be state of the art in the field. VQM-VFD is an algorithm that uses a neural network model to fuse low-level features into a final metric. It is similar in spirit to VMAF, except that it extracts features at lower levels, such as spatial and temporal gradients.


It is clear that VQM-VFD performs close to VMAF on the NFLX-TEST dataset. Since the VMAF approach allows for incorporation of new elementary metrics into its framework, VQM-VFD could serve as an elementary metric for VMAF as well.

Summarizing, the article provides the SRCC, PCC and RMSE of the different metrics discussed earlier, on the Netflix dataset and three popular public datasets: the VQEG HD (vqeghd3 collection only), the LIVE Video Database and the LIVE Mobile Video Database. The results show that VMAF outperforms the other metrics on all but the LIVE dataset, where it still offers competitive performance compared to the best-performing VQM-VFD.

LIVE dataset*

Metric       SRCC   PCC    RMSE
PSNR         0.416  0.394  16.934
SSIM         0.658  0.618  12.340
FastSSIM     0.566  0.561  13.691
PSNR-HVS     0.589  0.595  13.213
VQM-VFD      0.763  0.767   9.897
VMAF 0.3.1   0.690  0.655  12.180

*For compression-only impairments (H.264/AVC and MPEG-2 Video)

Finally, the article concludes on the current open research issues:

  1. Viewing conditions. Netflix supports thousands of active devices, covering smart TVs, game consoles, set-top boxes, computers, tablets and smartphones, resulting in widely varying viewing conditions for its members. With more subjective data, the Netflix researchers plan to generalize the algorithm so that viewing conditions (display size, distance from screen, etc.) can be inputs to the regressor.
  2. Temporal pooling. The current VMAF implementation calculates quality scores on a per-frame basis. In many use cases, it is desirable to temporally pool these scores to return a single value as a summary over a longer period of time: for example, a score over a scene, a score over regular time segments, or a score for an entire movie. A perceptually accurate temporal pooling mechanism for VMAF and other quality metrics remains an open and challenging problem.
  3. A consistent metric. Since VMAF incorporates full-reference elementary metrics, it is highly dependent on the quality of the reference. Unfortunately, the quality of video sources may not be consistent across all titles in the Netflix catalog, which come into the system at resolutions ranging from SD to 4K. Because of this, it can be inaccurate to compare (or summarize) VMAF scores across different titles. For quality monitoring, it is highly desirable to calculate absolute quality scores that are consistent across sources. Future work therefore includes developing an automated way to predict the opinion viewers form about the quality of the video delivered to them, taking into account all factors that contributed to the video finally presented on their screen.
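The temporal pooling question in item 2 can be illustrated with two simple baselines. The harmonic-mean variant (with a +1 offset so that zero scores do not divide by zero) weights low-quality frames more heavily than the arithmetic mean, which can better match how viewers penalize brief quality drops; neither is a perceptually validated pooling method:

```python
import numpy as np

def pool_scores(per_frame_scores, method="mean"):
    # Collapse per-frame quality scores into a single summary value.
    s = np.asarray(per_frame_scores, dtype=np.float64)
    if method == "mean":
        # Arithmetic mean: the simplest possible pooling.
        return float(s.mean())
    if method == "harmonic":
        # Shifted harmonic mean: dominated by the lowest scores,
        # so short bad stretches pull the summary down noticeably.
        return float(len(s) / np.sum(1.0 / (s + 1.0)) - 1.0)
    raise ValueError(f"unknown pooling method: {method}")
```

For a constant score sequence both methods agree; once quality fluctuates, the harmonic mean falls below the arithmetic mean.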

Original Post: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html


[1] H. Sheikh and A. Bovik, “Image Information and Visual Quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.

[2] S. Li, F. Zhang, L. Ma, and K. Ngan, “Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.

[3] S. Wolf and M. H. Pinson, “Video Quality Model for Variable Frame Delay (VQM_VFD),” U.S. Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11-482, Sep. 2011.

Key Network Delivery Metrics

The Streaming Video Alliance (specifically its QoE working group) released a few days ago a document describing key network delivery metrics for streaming Internet video. Although many more metrics could have been documented, these particular metrics represent the most commonly used.

QoE metrics

The report is available online at the Streaming Video Alliance web site (here).

Cisco Thor: a Royalty Free Video Codec

Jonathan Rosenberg recently posted on the Cisco Blog about the release of the Thor codec project to the community some weeks ago (link to the Thor project). The effort is staffed by some of the world’s foremost codec experts, including the legendary Gisle Bjøntegaard and Arild Fuldseth, both of whom have been heavy contributors to prior video codecs. Cisco decided to open-source the code, which is available at http://thor-codec.org. Moreover, Thor was contributed as an input to the Internet Engineering Task Force (IETF) (contribution available here, presentation slides available here), which has begun a standards activity to develop a next-generation royalty-free video codec in its NetVC working group.

More documents on IETF NetVC are also available here.

As Jonathan describes, the patent licensing situation for H.265 depends on two distinct patent licensing pools that have formed so far, and unfortunately many license holders are not represented in either. By contrast, H.264 has only a single license pool, which makes H.264 much cheaper than H.265: the total cost to license H.265 from these two pools is up to sixteen times higher than for H.264, per unit. Moreover, H.264 had an upper bound on yearly licensing costs, whereas H.265 has no such limit, and at the same time the licensing terms preclude the use of H.265 in any kind of open-source or freely distributed software application, such as web browsers.

Jonathan invites others to work on Thor by contributing to the codec development or contributing their own Intellectual Property Rights on a royalty-free basis (you may contact netvc-inquiry@cisco.com).

Although the GitHub activity graphs are not really encouraging with regard to community involvement (see the figure below), and many basic features are not yet implemented, the project may still make ambitious progress within 2015, and it remains promising.


The main Thor-exclusive feature is the 64×64 super block, which provides significantly better performance for specific video content types, according to the performance comparison data provided (available here).

Summarizing, we may say that Thor is currently a parallel draft to Daala. Daala is the code name for a new video compression technology promoted by a collaboration between the Mozilla Foundation, the Xiph.Org Foundation and other contributors. The goal of Daala is to provide a digital media format and reference implementation that are free to implement, use and distribute, with technical performance superior to H.265. The IETF may standardize only one codec within the NetVC activity, so the competition to become that one codec will be very tough. An initial comparison to Daala is available (here), and it will be interesting to closely monitor the performance competition between the two codecs in the coming months.

Tizen and HTML5

The Tizen SDK is a comprehensive set of tools for developing Tizen web and native applications. Tizen is based on the Linux kernel and the GNU C Library implementing the Linux API. It targets a very wide range of devices, including smartphones, tablets, in-vehicle infotainment (IVI) devices, smart TVs, PCs, smart cameras, wearables (such as smartwatches), Blu-ray players, printers and smart home appliances. Its purpose is to offer a consistent user experience across devices, allowing developers to use HTML5 and related Web technologies to write applications that run on supported devices.

In October 2014, the World Wide Web Consortium elevated the HTML5 specification to ‘Recommendation’ status, giving it the group’s highest level of endorsement, which effectively makes it a standard. The W3C also introduced Application Foundations along with the announcement of the HTML5 Recommendation. The Open Web Platform (OWP) is a set of technologies for developing distributed applications with the greatest interoperability among different terminal devices, and HTML5 plays a key role in it.

In this framework, the Tizen Association works closely with the Linux Foundation, which runs the Tizen open source project (Tizen.org), with a focus on platform development and delivery.

Samsung, a member of the Tizen Association, has launched Tizen-based Samsung TVs and has also released the corresponding Tizen SDK and Caph SDK, in order to support the development of web applications and to promote device-independent apps in the market. Both SDKs are available for download at http://www.samsungdforum.com/