SVT-AV1 is an open-source AV1 codec implementation hosted on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ under a BSD + patent license. As mentioned in our earlier blog post, Intel and Netflix have been collaborating on the SVT-AV1 encoder and decoder framework since August 2018. The teams have been working closely on SVT-AV1 development, discussing architectural decisions, implementing new tools, and improving compression efficiency. Since open-sourcing the project, other partner companies and the open-source community have contributed to SVT-AV1. In this tech blog, we will report the current status of the SVT-AV1 project, as well as the characteristics and performance of the encoder and decoder.
SVT-AV1 codebase status
The SVT-AV1 repository includes both an AV1 encoder and decoder, which share a significant amount of the code. The SVT-AV1 decoder is fully functional and compliant with the AV1 specification for all three profiles (Main, High, and Professional).
The SVT-AV1 encoder supports all AV1 tools which contribute to compression efficiency. Compared to the most recent master version of libaom (AV1 reference software), SVT-AV1 is similar in compression efficiency and at the same time achieves significantly lower encoding latency on multi-core platforms when using its inherent parallelization capabilities.
SVT-AV1 is written in C and can be compiled on major platforms, such as Windows, Linux, and macOS. In addition to the pure C function implementations, which allow for more flexible experimentation, the codec features extensive assembly and intrinsic optimizations for the x86 platform. See the next section for an outline of the main SVT-AV1 features that allow high performance at competitive compression efficiency. SVT-AV1 also includes extensive documentation on the encoder design, intended to ease onboarding for new developers.
One of Intel’s goals for SVT-AV1 development was to create an AV1 encoder that could offer performance and scalability. SVT-AV1 uses parallelization at several stages of the encoding process, which allows it to adapt to the number of available cores, including the newest servers with significant core count. This makes it possible for SVT-AV1 to decrease encoding time while still maintaining compression efficiency.
The SVT-AV1 encoder uses multi-dimensional (process-, picture/tile-, and segment-based) parallelism, multi-stage partitioning decisions, block-based multi-stage and multi-class mode decisions, and RD-optimized classification to achieve attractive trade-offs between compression and performance. Another feature of the SVT architecture is open-loop hierarchical motion estimation, which makes it possible to decouple the first stage of motion estimation from the rest of the encoding process.
Compression efficiency and performance
At the slowest speed settings, SVT-AV1 reaches compression efficiency similar to that of libaom. During the codec development, we have been tracking the compression and encoding results at the https://videocodectracker.dev/ site. The plot below shows the improvements in the compression efficiency of SVT-AV1 compared to the libaom encoder over time. Note that the libaom compression has also been improving over time, so the plot represents SVT-AV1 catching up with a moving target. In the plot, the Y-axis shows the additional bitrate, in percent, needed to achieve quality similar to that of the libaom encoder according to three metrics. The plot shows the results of the 2-pass encoding mode in both codecs. SVT-AV1 uses 4-thread mode, whereas libaom operates in single-thread mode. The SVT-AV1 results for the 1-pass fixed-QP encoding mode, commonly used in research, are even more competitive, as detailed below.
The results of comparing SVT-AV1 against libaom on the objective-1-fast test set are presented in the table below. For estimating encoding times, we used a machine with an Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz, 52 physical cores, and 96 GB of RAM, with 60 jobs running in parallel. Both codecs use a bi-directional hierarchical prediction structure of 16 pictures. The results are presented for 1-pass mode with fixed frame-level QP offsets. A single-threaded compression mode is used. Below, we compute the BD-rates for the various quality metrics: PSNR on all three color planes, VMAF, and MS-SSIM. A negative BD-rate indicates that the SVT-AV1 encodes produce the same quality with the indicated relative reduction in bitrate. As seen below, SVT-AV1 demonstrates a 16.5% decrease in encoding time compared to libaom while being slightly more efficient in compression. Note that the ratio of encoding times may vary depending on the instruction sets supported by the platform. The results have been obtained on the SVT-AV1 cs2 branch (a development branch that is currently being merged into master, git hash 3a19f29) against the libaom master branch (git hash fe72512). The QP values used to calculate the BD-rates are: 20, 32, 43, 55, 63.
*The overall encoding CPU time difference is calculated as the change in total CPU time for all sequences and QPs of the test compared to that of the anchor. It is not equal to the average of the per-sequence values. For each sequence, the encoding CPU time difference is calculated as the change in total CPU time for all QPs for this sequence.
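The distinction drawn in the footnote can be illustrated with a minimal sketch. The sequence names and CPU times below are made up purely to show the arithmetic; they are not measurements from the test.

```python
# Hypothetical CPU times in seconds, keyed by sequence name, one entry per QP.
# These values are illustrative only and do not come from the actual test.
anchor_times = {
    "seq_a": [100.0, 80.0, 60.0, 40.0, 20.0],
    "seq_b": [1000.0, 800.0, 600.0, 400.0, 200.0],
}
test_times = {
    "seq_a": [50.0, 40.0, 30.0, 20.0, 10.0],
    "seq_b": [900.0, 720.0, 540.0, 360.0, 180.0],
}


def percent_change(test_total, anchor_total):
    """Relative change of the test total against the anchor total, in percent."""
    return 100.0 * (test_total - anchor_total) / anchor_total


# Per-sequence difference: totals are summed over all QPs of that sequence.
per_sequence = {
    name: percent_change(sum(test_times[name]), sum(anchor_times[name]))
    for name in anchor_times
}

# Overall difference: totals are summed over all sequences and all QPs.
overall = percent_change(
    sum(sum(times) for times in test_times.values()),
    sum(sum(times) for times in anchor_times.values()),
)

# Plain average of the per-sequence values, shown only for contrast.
mean_of_per_sequence = sum(per_sequence.values()) / len(per_sequence)

print(per_sequence, overall, mean_of_per_sequence)
```

In this toy case the per-sequence changes are -50% and -10%, so their average is -30%, while the overall change is about -13.6% because the longer-running sequence dominates the total CPU time.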
Since all sequences in the objective-1-fast test set have 60 frames, both codecs use one key frame. The following command line parameters have been used to compare the codecs.
The results above demonstrate the excellent objective performance of SVT-AV1. In addition, SVT-AV1 includes implementations of some subjective quality tools, which can be used if the codec is configured for the subjective quality.
On the objective-1-fast test set, the SVT-AV1 decoder is slightly faster than the libaom decoder in the 1-thread mode, with larger improvements in the 4-thread mode. We observe even larger speed gains over the libaom decoder when decoding bitstreams with multiple tiles using the 4-thread mode. The testing has been performed on Windows, Linux, and macOS platforms. We believe the performance is satisfactory for a research decoder, where the trade-offs favor easier experimentation over further optimizations necessary for a production decoder.
To help ensure codec conformance, especially for new code contributions, the code has been comprehensively covered with unit tests and end-to-end tests. The unit tests are built on the Google Test framework. The unit and end-to-end tests are triggered automatically for each pull request to the repository, which is supported by GitHub Actions. The tests support sharding, and they run in parallel to speed up the turnaround time on pull requests.
Over the last several months, SVT-AV1 has matured to become a complete encoder/decoder package providing competitive compression efficiency and performance trade-offs. The project is bolstered with extensive unit test coverage and documentation.
Our hope is that the SVT-AV1 codebase helps further adoption of AV1 and encourages more research and development on top of the current AV1 tools. We believe that the demonstrated advantages of SVT-AV1 make it a good platform for experimentation and research. We invite colleagues from industry and academia to check out the project on GitHub, reach out to the codebase maintainers with questions and comments, or join one of the SVT-AV1 Open Dev meetings. We welcome more contributors to the project.
By Aditya Mavlankar, Jan De Cock¹, Cyril Concolato, Kyle Swanson, Anush Moorthy and Anne Aaron
We need an alternative to JPEG that a) is widely supported, b) has better compression efficiency and c) has a wider feature set. We believe AV1 Image File Format (AVIF) has the potential. Using the framework we have open sourced, AVIF compression efficiency can be seen at work and compared against a whole range of image codecs that came before it.
Image compression at Netflix
Netflix is enjoyed by its members on a variety of devices — smart TVs, phones, tablets, personal computers and streaming devices connected to TV screens. The user interface (UI), intended for browsing the catalog and serving up recommendations, is rich in images and graphics across all device categories. Shown below are screenshots of the Netflix app on iOS as an example.
Image assets might be based on still frames from the title, special on-set photography or a combination thereof. Assets could also stem from art generated during the production of the feature.
As seen above, image assets typically have gradients, text and graphics, for example the Netflix symbol or other title-specific symbols such as “The Witcher” insignia, composited on the image. Such special treatments lead to a variety of peculiarities which do not necessarily arise in natural images. Hard edges, including those with chroma differences on either side of the edge, are common and require good detail preservation, since they typically occur at salient locations and convey important information. Further, there is typically a character or a face in salient locations with a smooth, uncluttered background. Again, preservation of detail on the character’s face is of primary importance. In some cases, the background is textured and complex, exhibiting a wide range of frequencies.
After an image asset is ingested, the compression pipeline kicks in and prepares compressed image assets meant for delivering to devices. The goal is to have the compressed image look as close to the original as possible while reducing the number of bytes required. Given the image-heavy nature of the UI, compressing these images well is of primary importance. This involves picking, among other things, the right combination of color subsampling, codec, encoder parameters and encoding resolution.
Let us take color subsampling as an example. Choosing 420 subsampling, over the original 444 format, halves the number of samples (counting across all 3 color planes) that need to be encoded while relying on the fact that the human visual system is more sensitive to luma than chroma. However, 420 subsampling can introduce color bleeding and jaggies in locations with color transitions. Below we toggle between the original source in 444 and the source converted to 420 subsampling. The toggling shows loss introduced just by the color subsampling, even before the codec enters the picture.
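The sample-count arithmetic behind this claim can be checked with a minimal sketch. The helper name `sample_count` is ours, and the 768×512 dimensions are borrowed from the Kodak images discussed later only as a convenient example.

```python
def sample_count(width, height, subsampling):
    """Total samples across the luma plane and the two chroma planes."""
    # Chroma plane size relative to luma: (horizontal divisor, vertical divisor).
    divisors = {"444": (1, 1), "422": (2, 1), "420": (2, 2)}
    div_x, div_y = divisors[subsampling]
    luma = width * height
    # Round up so odd dimensions still get a chroma sample for the last column/row.
    chroma = ((width + div_x - 1) // div_x) * ((height + div_y - 1) // div_y)
    return luma + 2 * chroma


full = sample_count(768, 512, "444")
reduced = sample_count(768, 512, "420")
print(full, reduced, reduced / full)
```

For even dimensions, each chroma plane in 420 holds a quarter of the luma samples, so the total drops from 3 to 1.5 times the luma sample count, which is exactly half.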
Nevertheless, there are source images where the loss due to 420 subsampling is not obvious to human perception and in such cases it can be advantageous to use 420 subsampling. Ideally, a codec should be able to support both subsampling formats. However, there are a few codecs that only support 420 subsampling — webp, discussed below, is one such popular codec.
Brief overview of image coding formats
The JPEG format was introduced in 1992 and is widely popular. It supports various color subsamplings including 420, 422 and 444. JPEG can ingest RGB data and transform it to a luma-chroma representation before performing lossy compression. The discrete cosine transform (DCT) is employed as the decorrelating transform on 8×8 blocks of samples. This is followed by quantization and entropy coding. However, JPEG is restricted to 8-bit imagery and lacks support for alpha channel. The more recent JPEG-XT standard extends JPEG to higher bit-depths, support for alpha channel, lossless compression and more in a backwards compatible way.
The JPEG 2000 format, based on the discrete wavelet transform (DWT), was introduced as a successor to JPEG in the year 2000. It brought a whole range of additional features such as spatial scalability, region of interest coding, range of supported bit-depths, flexible number of color planes, lossless coding, etc. With the motion extension, it was accepted as the video coding standard for digital cinema in 2004.
The webp format was introduced by Google around 2010. Google added decoding support on Android devices and Chrome browser and also released libraries that developers could add to their apps on other platforms, for example iOS. Webp is based on intra-frame coding from the VP8 video coding format. Webp does not have all the flexibilities of JPEG 2000. It does, however, support lossless coding and also a lossless alpha channel, making it a more efficient and faster alternative to PNG in certain situations.
High-Efficiency Video Coding (HEVC) is the successor of H.264, a.k.a. Advanced Video Coding (AVC) format. HEVC intra-frame coding can be encapsulated in the High-Efficiency Image File Format (HEIF). This format is most notably used by Apple devices to store recorded imagery.
Similarly, AV1 Image File Format (AVIF) allows encapsulating AV1 intra-frame coded content, thus taking advantage of excellent compression gains achieved by AV1 over predecessors. We touch upon some appealing technical features of AVIF in the next section.
The JPEG committee is pursuing a coding format called JPEG XL which includes features aimed at helping the transition from legacy JPEG format. Existing JPEG files can be losslessly transcoded to JPEG XL while achieving file size reduction. Also included is a lightweight conversion process back to JPEG format in order to serve clients that only support legacy JPEG.
AVIF technical features
Although modern video codecs were developed primarily with video in mind, the intra-frame coding tools in a video codec are not significantly different from image compression tooling. Given the huge compression gains of modern video codecs, they are compelling as image coding formats. There is a potential benefit in reusing the hardware in place for video compression/decompression. That said, image decoding in hardware may not be a primary motivator, given the peculiarities of OS-dependent UI composition and the architectural implications of moving uncompressed image pixels around.
In the area of image coding formats, the Moving Picture Experts Group (MPEG) has standardized a codec-agnostic and generic image container format: the ISO/IEC 23008-12 standard (a.k.a. HEIF). HEIF has been used to store most notably HEVC-encoded images (in its HEIC variant) but is also capable of storing AVC-encoded images or even JPEG-encoded images. The Alliance for Open Media (AOM) has recently extended this format to specify the storage of AV1-encoded images in its AVIF format. The base HEIF format offers typical features expected from an image format such as: support for any image codec, ability to use a lossy or a lossless mode for compression, support for varied subsampling and bit-depths, etc. Furthermore, the format also allows the storage of a series of animated frames (offering an efficient and long-awaited alternative to animated GIFs), and the ability to specify an alpha channel (which sees tremendous use in UIs). Further, since the HEIF format borrows learnings from next-generation video compression, the format allows for preserving metadata such as color gamut and high dynamic range (HDR) information.
Image compression comparison framework
We have open sourced a Docker based framework for comparing various image codecs. Salient features include:
Encode orchestration (with parallelization) and insights generation using Python 3
Easy reproducibility of results and
Easy control of target quality range(s).
The framework allows one to specify a target quality (using a certain metric) for the target codec(s) and stores the results in a local database. This makes it easy to use the Bjontegaard-Delta (BD) rate to compare across codecs, since the target points can be restricted to a useful or meaningful quality range. The alternative, blindly sweeping an encoder parameter (such as a quality factor) over fixed values, lands on arbitrary quality points.
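One way to land on a target quality within a tolerance is to bisect over the encoder's quality parameter. The sketch below is an illustration of that idea only, not the framework's actual implementation; `find_quality_parameter` and `toy_ssim` are hypothetical names, and `toy_ssim` stands in for "encode at this quality factor, decode, and compute SSIM". It assumes the metric does not decrease as the parameter increases.

```python
def find_quality_parameter(measure, target, tolerance, lo=1, hi=100):
    """Bisect an integer encoder parameter until the metric is within tolerance of target.

    Assumes measure(param) is non-decreasing in param.
    Returns (param, score), or None if no parameter in [lo, hi] lands in the window.
    """
    while lo <= hi:
        mid = (lo + hi) // 2
        score = measure(mid)
        if abs(score - target) <= tolerance:
            return mid, score
        if score < target:
            lo = mid + 1  # quality too low: search higher parameter values
        else:
            hi = mid - 1  # quality too high: search lower parameter values
    return None


def toy_ssim(quality):
    """Hypothetical stand-in for a real encode-decode-measure round trip."""
    return 1.0 - 0.5 / (1.0 + quality / 10.0)


result = find_quality_parameter(toy_ssim, target=0.90, tolerance=0.005)
unreachable = find_quality_parameter(toy_ssim, target=0.99, tolerance=0.001)
print(result, unreachable)
```

Each probe costs one encode, so bisection needs on the order of log2 of the parameter range encodes per target point. When the target cannot be reached within the tolerance, the sketch returns None rather than the nearest miss; a real pipeline might prefer to keep the closest result.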
As an example, below are the calls that would produce compressed images for the choice of codecs at the specified SSIM and VMAF values, with the desired tolerance in target quality:
For the various codecs and configurations involved in the ensuing comparison, the reader can view the actual command lines in the shared repository. We have attempted to get the best compression efficiency out of every codec / configuration compared here. The reader is free to experiment with changes to encoding commands within the framework. Furthermore, newer versions of the respective software implementations might have been released since the results below were gathered. For example, a newer version of the Kakadu demo apps is available than the one in the framework snapshot on GitHub used to gather the results below.
This is the section where we get to admire the work of the compression community over the last 3 decades by looking at visual examples comparing JPEG and the state-of-the-art.
The encoded images shown below are illustrative and meant to compare visual quality at various target bitrates. Please note that the quality of the illustrative encodes is not representative of the high quality bar that Netflix employs for streaming image assets on the actual service, and is meant to be purely educative in nature.
Shown below is one original source image from the Kodak dataset and the corresponding result with JPEG 444 @ 20,429 bytes and with AVIF 444 @ 19,788 bytes. The JPEG encode shows very obvious blocking artifacts in the sky, in the pond as well as on the roof. The AVIF encode is much better, with fewer blocking artifacts, although there is some blurriness and loss of texture on the roof. It is still a remarkable result, given the compression factor of around 59x (original image has dimensions 768×512, thus requiring 768×512×3 bytes compared to the 20k bytes of the compressed image).
For the same source, shown below is the comparison of JPEG 444 @ 40,276 bytes and AVIF 444 @ 39,819 bytes. The JPEG encode still has visible blocking artifacts in the sky, along with ringing around the roof edges and chroma bleeding in several locations. The AVIF image, however, is now comparable to the original, with a compression factor of 29x.
Shown below is another original source image from the Kodak dataset and the corresponding result with JPEG 444 @ 13,939 bytes and with AVIF 444 @ 4,176 bytes. The JPEG encode shows blocking artifacts around most edges, particularly around the slanting edge as well as color distortions. The AVIF encode looks “cleaner” even though it is one-third the size of the JPEG encode. It is not a perfect rendition of the original, but with a compression factor of 282x, this is commendable.
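The compression factors quoted for the Kodak examples follow from simple arithmetic, sketched below. The helper name `compression_factor` is ours; it assumes the uncompressed image is stored as 8-bit samples with 3 bytes per pixel, as stated in the text.

```python
def compression_factor(width, height, compressed_bytes, bytes_per_pixel=3):
    """Ratio of the raw 8-bit, 3-channel image size to the compressed file size."""
    return width * height * bytes_per_pixel / compressed_bytes


# AVIF 444 file sizes quoted in the text for the 768x512 Kodak images.
first_image_low = compression_factor(768, 512, 19_788)
first_image_high = compression_factor(768, 512, 39_819)
second_image_low = compression_factor(768, 512, 4_176)

print(int(first_image_low), int(first_image_high), int(second_image_low))
```

Truncating to whole numbers gives the 59x, 29x and 282x figures quoted for these encodes.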
Shown below are results for the same image with slightly higher bit-budget; JPEG 444 @ 19,787 bytes versus AVIF 444 @ 20,120 bytes. The JPEG encode still shows blocking artifacts around the slanting edge whereas the AVIF encode looks nearly identical to the source.
Shown below is an original image from the Netflix (internal) 1142×1600 resolution “boxshots-1” dataset, followed by JPEG 444 @ 69,445 bytes and AVIF 444 @ 40,811 bytes. Severe banding and blocking artifacts along with color distortions are visible in the JPEG encode. They are less visible in the AVIF encode, which is actually 29kB smaller.
Shown below are results for the same image with slightly increased bit-budget: JPEG 444 @ 80,101 bytes versus AVIF 444 @ 85,162 bytes. The banding and blocking are still visible in the JPEG encode whereas the AVIF encode looks very close to the original.
Shown below is another source image from the same boxshots-1 dataset along with JPEG 444 @ 81,745 bytes versus AVIF 444 @ 76,087 bytes. Blocking artifacts overall and mosquito artifacts around text can be seen in the JPEG encode.
Shown below is another source image from the boxshots-1 dataset along with JPEG 444 @ 80,562 bytes versus AVIF 444 @ 80,432 bytes. There are visible banding, blocking and mosquito artifacts in the JPEG encode whereas the AVIF encode looks very close to the original source.
Shown below are results over public datasets as well as Netflix-internal datasets. The reference codec used is JPEG from the JPEG-XT reference software, using the standard quantization matrix defined in Annex K of the JPEG standard. Following are the codecs and/or configurations tested and reported against the baseline in the form of BD rate.
The encoding resolution in these experiments is the same as the source resolution. For 420 subsampling encodes, the quality metrics were computed in 420 subsampling domain. Likewise, for 444 subsampling encodes, the quality metrics were computed in 444 subsampling domain. Along with BD rates associated with various quality metrics, such as SSIM, MS-SSIM, VIF and PSNR, we also show rate-quality plots using SSIM as the metric.
Kodak dataset; 24 images; 768×512 resolution
We have uploaded the source images in PNG format here for easy reference. We give the necessary attribution to Kodak as the source of this dataset.
Given a quality metric, for each image, we consider two separate rate-quality curves: one associated with the baseline (JPEG) and one associated with the target codec. We compare the two and compute the BD-rate, which can be interpreted as the average percentage rate reduction for the same quality over the quality region being considered. A negative value implies rate reduction and hence is better compared to the baseline. As a last step, we report the arithmetic mean of BD rates over all images in the dataset. We also highlight the best performer in the tables below.
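The procedure above can be sketched as follows. This is a simplified illustration, not the framework's code: it interpolates log-rate piecewise-linearly over quality, whereas the classic Bjontegaard calculation fits a cubic polynomial. The function names and the (bytes, SSIM) points are hypothetical and chosen only so the result is easy to verify by hand.

```python
import math


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    x = min(max(x, xs[0]), xs[-1])  # clamp to guard against rounding at the ends
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return ys[-1]


def mean_log_rate(points, q_lo, q_hi, steps=1000):
    """Average of ln(rate) over [q_lo, q_hi], using the trapezoidal rule."""
    points = sorted(points, key=lambda p: p[1])
    qualities = [q for _, q in points]
    log_rates = [math.log(r) for r, _ in points]
    total = 0.0
    for k in range(steps + 1):
        q = q_lo + (q_hi - q_lo) * k / steps
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * interpolate(qualities, log_rates, q)
    return total / steps


def bd_rate(anchor, test):
    """BD-rate of test versus anchor in percent; negative means fewer bits for test."""
    q_lo = max(min(q for _, q in anchor), min(q for _, q in test))
    q_hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if q_hi <= q_lo:
        raise ValueError("quality ranges do not overlap")
    diff = mean_log_rate(test, q_lo, q_hi) - mean_log_rate(anchor, q_lo, q_hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical (bytes, SSIM) points for two images; not measured data.
baseline = [(10000, 0.90), (20000, 0.94), (40000, 0.97), (80000, 0.99)]
dataset = [
    (baseline, [(r * 0.50, q) for r, q in baseline]),  # target needs half the bytes
    (baseline, [(r * 0.75, q) for r, q in baseline]),  # target needs 75% of the bytes
]

per_image = [bd_rate(anchor, test) for anchor, test in dataset]
mean_bd = sum(per_image) / len(per_image)  # arithmetic mean over the dataset
print(per_image, mean_bd)
```

Because the toy target curves are the baseline scaled by a constant factor, the per-image BD-rates come out at -50% and -25%, and the dataset-level figure is their arithmetic mean, -37.5%. Restricting the integration to the overlapping quality interval is what makes a narrow, meaningful target quality range useful for this comparison.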
Billboard dataset (Netflix-internal); 223 images; 2048×1152 resolution
Billboard images generally occupy a larger canvas than the thumbnail-like boxshot images and are generally horizontal. There is room to overlay text or graphics on one of the sides, either left or right, with salient characters/scenery/art being located on the other side. An example can be seen below. The billboard source images are internal to Netflix and hence do not constitute a public dataset.
Unlike billboard images, boxshot images are vertical and typically boxshot images representing different titles are displayed side-by-side in the UI. Examples from this dataset are showcased in the section above on visual examples. The boxshots-1 source images are internal to Netflix and hence do not constitute a public dataset.
The boxshots-2 dataset also has vertical box art but of lower resolution. The boxshots-2 source images are internal to Netflix and hence do not constitute a public dataset.
At this point, it might be prudent to discuss the omission of VMAF as a quality metric here. In previous work we have shown that for JPEG-like distortions and datasets similar to “boxshots” and “billboards”, VMAF has high correlation with perceived quality. However, VMAF, as of today, is a metric trained and developed to judge encoded videos rather than static images. The range of distortions associated with the range of image codecs in our tests is broader than what was considered in the VMAF development process and to that end, it may not be an accurate measure of image quality for those codecs. Further, today’s VMAF model is not designed to capture chroma artifacts and hence would be unable to distinguish between 420 and 444 subsampling, for instance, apart from other chroma artifacts (this is also true of some other measures we’ve used, but given the lack of alternatives, we’ve leaned on the side of using the most well tested and documented image quality metrics). This is not to say that VMAF is grossly inaccurate for image quality, but to say that we would not use it in our evaluation of image compression algorithms with such a wide diversity of codecs at this time. We have some exciting upcoming work to improve the accuracy of VMAF for images, across a variety of codecs, and resolutions, including chroma channels in the score. Having said that, the code in the repository computes VMAF and the reader is encouraged to try it out and see that AVIF also shines judging by VMAF as is today.
PSNR does not have as high correlation with perceptual quality over a wide quality range. However, if encodes are made with a high PSNR target then one overspends bits but can rest assured that a high PSNR score implies closeness to the original. With perceptually driven metrics, we sometimes see failure manifest in rare cases where the score is undeservingly high but visual quality is lacking.
Interesting observation regarding subsampling
In addition to above quality calculations, we have the following observation which reveals an encouraging trend among modern codecs. After performing an encode with 420 subsampling, let’s assume we decode the image, up-convert it to 444 subsampling and then compute various metrics by comparing against the original source in 444 format. We call this configuration “444u” to distinguish from above cases where “encode-subsampling” and “quality-computation-subsampling” match. Among the chosen metrics, PSNR_AVG is one which takes all 3 channels (1 luma and 2 chroma) into account. With an older codec like JPEG, the bit-budget is spread thin over more samples while encoding 444 subsampling compared to encoding 420 subsampling. This shows as poorer PSNR_AVG for encoding JPEG with 444 subsampling compared to 420 subsampling, as shown below. However, given a rate target, with modern codecs like HEVC and AVIF, it is simply better to encode 444 subsampling over a wide range of bitrates.
We see that with modern codecs we yield a higher PSNR_AVG when encoding 444 subsampling than 420 subsampling over the entire region of “practical” rates, even for the other, more practical, datasets such as boxshots-1. Interestingly, with JPEG, we see a crossover; i.e., after crossing a certain rate, it starts being more efficient to encode 444 subsampling. Such crossovers are analogous to rate-quality curves crossing over when encoding over multiple spatial resolutions. Shown below are rate-quality curves for two different source images from the boxshots-1 dataset, comparing JPEG and AVIF in both 444u and 444 configurations.
AVIF support and next steps
Although AVIF provides superior compression efficiency, it is still at an early deployment stage. Various tools exist to produce and consume AVIF images. The Alliance for Open Media is notably developing an open-source library, called libavif, that can encode and decode AVIF images. The goal of this library is to ease the integration in software from the image community. Such integration has already started, for example, in various browsers, such as Google Chrome, and we expect to see broad support for AVIF images in the near future. Major efforts are also ongoing, in particular from the dav1d team, to make AVIF image decoding as fast as possible, including for 10-bit images. It is conceivable that we will soon test AVIF images on Android following on the heels of our recently announced AV1 video adoption efforts on Android.
The datasets used above have standard dynamic range (SDR) 8-bit imagery. At Netflix, we are also working on HDR images for the UI and are planning to use AVIF for encoding these HDR image assets. This is a continuation of our previous efforts where we experimented with JPEG 2000 as the compression format for HDR images and we are looking forward to the superior compression gains afforded by AVIF.
We would like to thank Marjan Parsa, Pierre Lemieux, Zhi Li, Christos Bampis, Andrey Norkin, Hunter Ford, Igor Okulist, Joe Drago, Benbuck Nason, Yuji Mano, Adam Rofer and Jeff Watts for all their contributions and collaborations.
¹as part of his work while he was affiliated with Netflix
Today we are excited to announce that Netflix has started streaming AV1 to our Android mobile app. AV1 is a high performance, royalty-free video codec that provides 20% improved compression efficiency over our VP9† encodes. AV1 is made possible by the wide-ranging industry commitment of expertise and intellectual property within the Alliance for Open Media (AOMedia), of which Netflix is a founding member.
Our support for AV1 represents Netflix’s continued investment in delivering the most efficient and highest quality video streams. For our mobile environment, AV1 follows on our work with VP9, which we released as part of our mobile encodes in 2016 and further optimized with shot-based encodes in 2018.
While our goal is to roll out AV1 on all of our platforms, we see a good fit for AV1’s compression efficiency in the mobile space where cellular networks can be unreliable, and our members have limited data plans. Selected titles are now available to stream in AV1 for customers who wish to reduce their cellular data usage by enabling the “Save Data” feature.
Our AV1 support on Android leverages the open-source dav1d decoder built by the VideoLAN, VLC, and FFmpeg communities and sponsored by the Alliance for Open Media. Here we have optimized dav1d so that it can play Netflix content, which is 10-bit color. In the spirit of making AV1 widely available, we are sponsoring an open-source effort to optimize 10-bit performance further and make these gains available to all.
As codec performance improves over time, we plan to expand our AV1 usage to more use cases and are now also working with device and chipset partners to extend this into hardware.
† AV1-libaom compression efficiency as measured against VP9-libvpx.
by Andrey Norkin, Joel Sole, Kyle Swanson, Mariana Afonso, Anush Moorthy, Anne Aaron
Netflix headquarters circa 2014. It’s a nice building with good architecture! This was the primary home of Netflix for a number of years during the company’s growth, but at some point Netflix had outgrown its home and needed more space. One approach to solve this problem would have been to extend the building by attaching new rooms, hallways, and rebuilding the older ones. However, a more scalable approach would be to begin with a new foundation and begin a new building. Below you can see the new Netflix headquarters in Los Gatos, California. The facilities are modern, spacious and scalable. The new campus started with two buildings, connected together, and was further extended with more buildings when more space was needed. What does this example have to do with software development and video encoding? When you are building an encoder, sometimes you need to start with a clean slate too.
What is SVT-AV1?
Intel and Netflix announced their collaboration on a software video encoder implementation called SVT-AV1 on April 8, 2019. Scalable Video Technology (SVT) is Intel’s open source framework that provides high-performance software video encoding libraries for developers of visual cloud technologies. In this tech blog, we describe the relevance of this partnership to the industry and cover some of our own experiences so far. We also describe how you can become a part of this development.
A brief look into the history of video standards
Historically, video compression standards have been developed by two international standardization organizations, ITU-T and MPEG (ISO). The first successful digital video standard was MPEG-2, which truly enabled digital transmission of video. The success was repeated by H.264/AVC, currently the most ubiquitous video compression standard, supported by modern devices, often in hardware. On the other hand, there are examples of video codecs developed by companies, such as Microsoft’s VC-1 and Google’s VPx codecs. The advantage of adopting a video compression standard is interoperability. The standard specification describes in minute detail how a video bitstream should be processed in order to produce displayable video frames. This allows device manufacturers to independently work on their decoder implementations. When content providers encode their video according to the standard, this guarantees that all compliant devices are able to decode and display the video.
Recently, the adoption of the newest video codec standardized by ITU-T and ISO has been slow in light of widespread licensing uncertainty. A group of companies formed the Alliance for Open Media (AOM) with the goal of creating a modern, royalty-free video codec that would be widely adopted and supported by a plethora of devices. The AOM board currently includes Amazon, Apple, ARM, Cisco, Facebook, Google, IBM, Intel, Microsoft, Mozilla, Netflix, Nvidia, and Samsung, and many companies have joined as promoter members. In 2018, AOM published a specification for the AV1 video codec.
Decoder specification is frozen, encoder being improved for years
As mentioned earlier, a standard specifies how the compressed bitstream is to be interpreted to produce displayable video, which means that encoders can vary in their characteristics, such as computational performance and achievable quality for a given bitrate. An encoder can typically be improved for years after the standard has been frozen, including by offering different speed and quality trade-offs. An example of such development is the x264 encoder, which has continued to improve for years after the H.264 standard was finalized.
To develop a conformant decoder, the standard specification should be sufficient. However, to guide codec implementers, the standardization committee also issues reference software, which includes a compliant decoder and encoder. Reference software serves as the basis for standard development: it is the framework in which the performance of video coding tools is evaluated. The reference software typically evolves along with the development of the standard. In addition, when standardization is completed, the reference software can help to kickstart implementations of compliant decoders and encoders.
AOM has produced the reference software for AV1, which is called libaom and is available online. libaom was built upon the codebase from VP9, VP8, and previous generations of VPx video codecs. During the AV1 development, the software was further developed by the AOM video codec group.
Netflix interest in SVT-AV1
Reference software typically focuses on the best possible compression at the expense of encoding speed. It is well known that encoding time of reference software for modern video codecs can be rather long.
One of Intel’s goals with SVT-AV1 development was to create a production-grade AV1 encoder that offers performance and scalability. SVT-AV1 uses parallelization at several stages of the encoding process, which allows it to adapt to the number of available cores, including the newest servers with significant core count. This makes it possible for SVT-AV1 to decrease encoding time while still maintaining compression efficiency.
In August 2018, Netflix’s Video Algorithms team and Intel’s Visual Cloud team decided to join forces on SVT-AV1 development. Since that time, Intel’s and Netflix’s teams have collaborated closely on SVT-AV1 development, discussing architectural decisions, implementing new tools, and improving the compression efficiency. Netflix’s main interest in SVT-AV1 was somewhat different from, and complementary to, Intel’s intention of building a production-grade, highly scalable encoder.
At Netflix, we believe that the AV1 ecosystem would benefit from an alternative clean and efficient open-source encoder implementation. There exists at least one other alternative open-source AV1 encoder, rav1e. However, rav1e is written in the Rust programming language, whereas an encoder written in C has a much broader base of potential developers. The open-source encoder should also enable easy experimentation and serve as a platform for testing new coding tools. Consequently, our requirements for the AV1 software are as follows:
Easy to understand code with a low entry barrier and a test framework
Competitive compression efficiency on par with the reference implementation
Complete toolset and a decoder implementation sharing common code with the encoder, which simplifies experiments on new coding tools
Decreased encoder runtime that enables quicker turn-around when testing new ideas
We believe that if SVT-AV1 is aligned with these characteristics, it can be used as a platform for future video coding standards development, such as the research and development efforts towards the AV2 video codec, and improved AV1 encoding.
Thus, Netflix and Intel approach SVT-AV1 with complementary goals. Encoder speed helps innovation, as it is faster to run experiments. Cleanliness of the code helps adoption in the open-source community, which is crucial for the success of an open-source project. It can be argued that extensive parallelization may have compression efficiency trade-offs, but it also allows testing more encoding options. Moreover, we expect multi-core platforms to be prevalent in video encoding in the future, which makes it important to test new tools in an architecture supporting many threads.
Our progress so far
We have accomplished the following milestones to achieve the goals of making SVT-AV1 an excellent experimentation platform and AV1 reference:
Added a continuous integration (CI) framework for Linux, Windows, and macOS.
Added a unit test framework based on Google Test. An external contractor is adding unit tests to achieve sufficient coverage for the code already developed. Furthermore, unit tests will cover new code.
Added other types of testing in the CI framework, such as automatic encoding and Valgrind tests.
Started a decoder project that shares common parts of AV1 algorithms with the encoder.
Introduced style guidelines and formatted the existing code accordingly.
SVT-AV1 is currently a work in progress: it is still missing the implementation of some coding tools and therefore has an average gap of about 14% in PSNR BD-rate relative to the libaom encoder in a 1-pass mode. The following features are planned and are expected to decrease the BD-rate gap:
Eighth-pel motion compensation (1/8-pel)
Global motion compensation
Adaptive transform block sizes
Trellis Quantized Coefficient Optimization
Rate control (ABR, CBR, VBR)
2-pass encoding mode
There is still much work ahead, and we are committed to making the SVT-AV1 project satisfy the goal of being an excellent experimentation platform, as well as viable for production applications. You can track the SVT-AV1 performance progress on the beta of the AWCY (AreWeCompressedYet) website. AWCY was the framework used to evaluate AV1 tools during its development. In the figure below, you can see a comparison of two versions of the SVT-AV1 codec, the blue plot representing the SVT-AV1 version from March 15, 2019, and the green one the version from March 19, 2019.
SVT-AV1 already stands out in its speed. SVT-AV1 does not reach the compression efficiency of libaom at the slowest speed settings, but it performs encoding significantly faster than the fastest libaom mode. Currently, SVT-AV1 in the slowest mode uses about 13.5% more bits compared to the libaom encoder in a 1-pass mode with cpu_used=1 (the second slowest mode of libaom), while being about 4 times faster*. The BD-rate gap with 2-pass libaom encoding is wider, and we are planning to address this by implementing 2-pass encoding in SVT-AV1. Note also that the faster encoding settings of SVT-AV1 decrease the encoding times even more dramatically, providing a significant encoder speed-up.
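For readers unfamiliar with the BD-rate figures quoted above, the following is a minimal sketch of how such a number can be computed from two rate-distortion curves. It is not the AWCY or libaom tooling. The function names (bd_rate, _integrate_log_rate) and the sample data are hypothetical, and it interpolates log-bitrate linearly between measured points, whereas common BD-rate implementations fit a cubic polynomial or a piecewise cubic curve. It also assumes that each curve has distinct PSNR values.

```python
import math


def _integrate_log_rate(points, lo, hi):
    """Integrate piecewise-linear log(bitrate) over the PSNR range [lo, hi].

    points: list of (bitrate, psnr) pairs with distinct PSNR values.
    """
    pts = sorted((psnr, math.log(rate)) for rate, psnr in points)
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        a, b = max(x0, lo), min(x1, hi)
        if a >= b:
            continue  # this segment lies outside the integration range
        slope = (y1 - y0) / (x1 - x0)
        ya = y0 + slope * (a - x0)
        yb = y0 + slope * (b - x0)
        total += 0.5 * (ya + yb) * (b - a)  # trapezoid area
    return total


def bd_rate(anchor, test):
    """Average bitrate difference (%) of `test` vs `anchor` at equal PSNR.

    A positive result means `test` needs more bits for the same quality.
    """
    lo = max(min(p for _, p in anchor), min(p for _, p in test))
    hi = min(max(p for _, p in anchor), max(p for _, p in test))
    if hi <= lo:
        raise ValueError("RD curves do not overlap in quality")
    avg_log_diff = (
        _integrate_log_rate(test, lo, hi) - _integrate_log_rate(anchor, lo, hi)
    ) / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0


# Hypothetical (bitrate in kbps, PSNR in dB) points, made up for illustration.
anchor_curve = [(1000, 34.0), (2000, 37.0), (4000, 40.0), (8000, 43.0)]
# A test encoder that spends 13.5% more bits at every quality level.
test_curve = [(rate * 1.135, psnr) for rate, psnr in anchor_curve]

print(round(bd_rate(anchor_curve, test_curve), 2))
```

Because the made-up test curve spends a constant 13.5% more bits at every quality level, the sketch reports a BD-rate of about 13.5%. With real encoder data the gap usually varies along the curve, and the BD-rate summarizes it as a single average.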