RFC 8761 | Video Codec Requirements and Evaluation | April 2020
Filippov, et al. | Informational
This document provides requirements for a video codec designed mainly for use over the Internet. In addition, this document describes an evaluation methodology for measuring the compression efficiency to determine whether or not the stated requirements have been fulfilled.¶
This document is not an Internet Standards Track specification; it is published for informational purposes.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are candidates for any level of Internet Standard; see Section 2 of RFC 7841.¶
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8761.¶
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.¶
This document presents the requirements for a video codec designed mainly for use over the Internet. The requirements encompass a wide range of applications that use data transmission over the Internet, including Internet video streaming, IPTV, peer-to-peer video conferencing, video sharing, screencasting, game streaming, and video monitoring and surveillance. For each application, typical resolutions, frame rates, and picture-access modes are presented. Specific requirements related to data transmission over packet-loss networks are considered as well. In this document, when we discuss data-protection techniques, we only refer to methods designed and implemented to protect data inside the video codec, since there are many existing techniques that protect generic data transmitted over networks with packet losses. From the theoretical point of view, both packet-loss and bit-error robustness can be beneficial for video codecs. In practice, packet losses are a more significant problem than bit corruption in IP networks. It is worth noting that there is an evident interdependence between the possible amount of delay and the necessity of error-robust video streams:¶
Thus, error resilience can be useful for delay-critical applications to provide low delay in a packet-loss environment.¶
In this section, an overview of video codec applications that are currently available on the Internet market is presented. It is worth noting that there are different use cases for each application that define a target platform; hence, there are different types of communication channels involved (e.g., wired or wireless channels) that are characterized by different quality of service (QoS) as well as bandwidth; for instance, wired channels are considerably less error-prone than wireless channels and therefore require different QoS approaches. The target platform, the channel bandwidth, and the channel quality determine the resolutions, frame rates, and either quality levels or bitrates for the video streams to be encoded or decoded. By default, the YCbCr 4:2:0 color format is assumed for the application scenarios listed below.¶
Typical content for this application is movies, TV series and shows, and animation. Internet video streaming uses a variety of client devices and has to operate under changing network conditions. For this reason, an adaptive streaming model has been widely adopted. Video material is encoded at different quality levels and different resolutions, which are then chosen by a client depending on its capabilities and current network bandwidth. An example combination of resolutions and bitrates is shown in Table 1.¶
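As an informal, non-normative illustration of the adaptive streaming model described above, the sketch below shows how a client might pick an encoding rung from a bitrate ladder given its measured bandwidth. The ladder entries, safety margin, and helper name are assumptions and are not taken from this document.¶

```python
# Hypothetical sketch of adaptive rung selection for the streaming model
# described above; ladder values and the safety margin are illustrative.
LADDER = [  # (width, height, bitrate in kbit/s), highest rung first
    (3840, 2160, 16000),
    (1920, 1080, 6000),
    (1280, 720, 3000),
    (720, 576, 1500),
    (320, 240, 400),
]

def pick_rung(measured_bandwidth_kbps, margin=0.8):
    """Return the highest-bitrate rung that fits within a safety margin
    of the currently measured network bandwidth."""
    budget = measured_bandwidth_kbps * margin
    for width, height, bitrate in LADDER:  # ordered from highest to lowest
        if bitrate <= budget:
            return (width, height, bitrate)
    return LADDER[-1]  # fall back to the lowest rung

print(pick_rung(5000))  # e.g., selects the 1280x720 @ 3000 kbit/s rung
```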
A video encoding pipeline in on-demand Internet video streaming typically operates as follows:¶
Resolution* | PAM | Frame Rate, FPS**
---|---|---
4K, 3840x2160 | RA |
2K (1080p), 1920x1080 | RA |
1080i, 1920x1080* | RA |
720p, 1280x720 | RA |
576p (EDTV), 720x576 | RA |
576i (SDTV), 720x576* | RA |
480p (EDTV), 720x480 | RA |
480i (SDTV), 720x480* | RA |
512x384 | RA |
QVGA, 320x240 | RA |
*Note: Interlaced content can be handled at the higher system level and not necessarily by using specialized video coding tools. It is included in this table only for the sake of completeness, as most video content today is in the progressive format.¶
**Note: The set of frame rates presented in this table is taken from Table 2 in [1].¶
The characteristics and requirements of this application scenario are as follows:¶
Support and efficient encoding of a wide range of content types and formats is required:¶
This is a service for delivering television content over IP-based networks. IPTV may be classified into two main groups based on the type of delivery, as follows:¶
In the IPTV scenario, traffic is transmitted over managed (QoS-based) networks. Typical content used in this application is news, movies, cartoons, series, TV shows, etc. One important requirement for both groups is that random access to pictures (i.e., the random access period (RAP)) should be kept small enough (approximately 1-5 seconds). Optional requirements are as follows:¶
For this application, typical values of resolutions, frame rates, and PAMs are presented in Table 2.¶
Resolution* | PAM | Frame Rate, FPS**
---|---|---
2160p (4K), 3840x2160 | RA |
1080p, 1920x1080 | RA |
1080i, 1920x1080* | RA |
720p, 1280x720 | RA |
576p (EDTV), 720x576 | RA |
576i (SDTV), 720x576* | RA |
480p (EDTV), 720x480 | RA |
480i (SDTV), 720x480* | RA |
*Note: Interlaced content can be handled at the higher system level and not necessarily by using specialized video coding tools. It is included in this table only for the sake of completeness, as most video content today is in a progressive format.¶
**Note: The set of frame rates presented in this table is taken from Table 2 in [1].¶
This is a form of video connection over the Internet. This form allows users to establish connections between two or more people by two-way video and audio transmission for communication in real time. For this application, both stationary and mobile devices can be used. The main requirements are as follows:¶
Support of resolution and quality (SNR) scalability is highly desirable. For this application, typical values of resolutions, frame rates, and PAMs are presented in Table 3.¶
Resolution | Frame Rate, FPS | PAM |
---|---|---|
1080p, 1920x1080 | 15, 30 | FIZD |
720p, 1280x720 | 30, 60 | FIZD |
4CIF, 704x576 | 30, 60 | FIZD |
4SIF, 704x480 | 30, 60 | FIZD |
VGA, 640x480 | 30, 60 | FIZD |
360p, 640x360 | 30, 60 | FIZD |
This is a service that allows people to upload and share video data (using live streaming or not) and watch those videos. It is also known as video hosting. A typical User-Generated Content (UGC) scenario for this application is to capture video using mobile cameras such as GoPros or cameras integrated into smartphones (amateur video). The main requirements are as follows:¶
Support of resolution and quality (SNR) scalability is highly desirable. For this application, typical values of resolutions, frame rates, and PAMs are presented in Table 4.¶
Typical values of resolutions and frame rates in Table 4 are taken from [10].¶
Resolution | Frame Rate, FPS | PAM |
---|---|---|
2160p (4K), 3840x2160 | 24, 25, 30, 48, 50, 60 | RA |
1440p (2K), 2560x1440 | 24, 25, 30, 48, 50, 60 | RA |
1080p, 1920x1080 | 24, 25, 30, 48, 50, 60 | RA |
720p, 1280x720 | 24, 25, 30, 48, 50, 60 | RA |
480p, 854x480 | 24, 25, 30, 48, 50, 60 | RA |
360p, 640x360 | 24, 25, 30, 48, 50, 60 | RA |
This is a service that allows users to record and distribute video data from a computer screen. This service requires efficient compression of computer-generated content with high visual quality, up to visually and mathematically (numerically) lossless [11]. Currently, this application includes business presentations (PowerPoint, Word documents, email messages, etc.), animation (cartoons), gaming content, and data visualization. This type of content is characterized by fast motion, rotation, smooth shading, 3D effects, highly saturated colors with full resolution, and clear textures and sharp edges with distinct colors [11]. Related use cases include virtual desktop infrastructure (VDI), screen/desktop sharing and collaboration, supervisory control and data acquisition (SCADA) display, automotive/navigation display, cloud gaming, factory automation display, wireless display, display wall, digital operating room (DiOR), etc. For this application, an important requirement is the support of low-delay configurations with zero structural delay for a wide range of video formats (e.g., RGB) in addition to YCbCr 4:2:0 and YCbCr 4:4:4 [11]. For this application, typical values of resolutions, frame rates, and PAMs are presented in Table 5.¶
Resolution | Frame Rate, FPS | PAM |
---|---|---|
Input color format: RGB 4:4:4 | ||
5k, 5120x2880 | 15, 30, 60 | AI, RA, FIZD |
4k, 3840x2160 | 15, 30, 60 | AI, RA, FIZD |
WQXGA, 2560x1600 | 15, 30, 60 | AI, RA, FIZD |
WUXGA, 1920x1200 | 15, 30, 60 | AI, RA, FIZD |
WSXGA+, 1680x1050 | 15, 30, 60 | AI, RA, FIZD |
WXGA, 1280x800 | 15, 30, 60 | AI, RA, FIZD |
XGA, 1024x768 | 15, 30, 60 | AI, RA, FIZD |
SVGA, 800x600 | 15, 30, 60 | AI, RA, FIZD |
VGA, 640x480 | 15, 30, 60 | AI, RA, FIZD |
Input color format: YCbCr 4:4:4 | ||
5k, 5120x2880 | 15, 30, 60 | AI, RA, FIZD |
4k, 3840x2160 | 15, 30, 60 | AI, RA, FIZD |
1440p (2K), 2560x1440 | 15, 30, 60 | AI, RA, FIZD |
1080p, 1920x1080 | 15, 30, 60 | AI, RA, FIZD |
720p, 1280x720 | 15, 30, 60 | AI, RA, FIZD |
This is a service that provides game content over the Internet to different local devices such as notebooks and gaming tablets. In this category of applications, 3D games are rendered on a cloud server, and the game is streamed to any device with a wired or wireless broadband connection [12]. There are low-latency requirements for transmitting user interactions and receiving game data, with a turnaround delay of less than 100 ms. This allows anyone to play (or resume) full-featured games from anywhere on the Internet [12]. An example of this application is Nvidia Grid [12]. Another application scenario in this category is the broadcast of video games played by people over the Internet, in real time or for later viewing [12]. There are many companies, such as Twitch and YY in China, that enable game broadcasting [12]. Games typically contain a lot of sharp edges and large motion [12]. The main requirements are as follows:¶
Support of resolution and quality (SNR) scalability is highly desirable. For this application, typical values of resolutions, frame rates, and PAMs are similar to those presented in Table 3.¶
This is a type of live broadcasting over IP-based networks. Video streams are sent to many receivers at the same time. A new receiver may connect to the stream at an arbitrary moment, so the random access period should be kept small enough (approximately 1-5 seconds). Data are transmitted publicly in the case of video monitoring and privately in the case of video surveillance. For IP cameras that have to capture, process, and encode video data, complexity -- including computational and hardware complexity, as well as memory bandwidth -- should be kept low to allow real-time processing. In addition, support of a high dynamic range and a monochrome mode (e.g., for infrared cameras), as well as resolution and quality (SNR) scalability, is an essential requirement for video surveillance. In some use cases, high video signal fidelity is required even after lossy compression. Typical values of resolutions, frame rates, and PAMs for video monitoring and surveillance applications are presented in Table 6.¶
Resolution | Frame Rate, FPS | PAM |
---|---|---|
2160p (4K), 3840x2160 | 12, 25, 30 | RA, FIZD |
5Mpixels, 2560x1920 | 12, 25, 30 | RA, FIZD |
1080p, 1920x1080 | 25, 30 | RA, FIZD |
1.23Mpixels, 1280x960 | 25, 30 | RA, FIZD |
720p, 1280x720 | 25, 30 | RA, FIZD |
SVGA, 800x600 | 25, 30 | RA, FIZD |
Taking into account the requirements discussed above for specific video applications, this section proposes requirements for an Internet video codec.¶
The most fundamental requirement is coding efficiency, i.e., compression performance on both "easy" and "difficult" content for the applications and use cases in Section 3. The codec should provide a coding-efficiency gain of at least 25% over state-of-the-art video codecs such as HEVC/H.265 and VP9, in accordance with the methodology described in Section 5 of this document. For higher resolutions, the improvements in coding efficiency are expected to be higher than for lower resolutions.¶
Good-quality specification and well-defined profiles and levels are required to enable device interoperability and facilitate decoder implementations. A profile consists of a subset of the entire bitstream syntax; consequently, it also defines the tools necessary for decoding a conforming bitstream of that profile. A level imposes a set of numerical limits on the values of some syntax elements. An example of codec levels to be supported is presented in Table 7. An actual level definition should include constraints on features that impact the decoder complexity, for example: maximum bitrate, line buffer size, memory usage, etc.¶
Level | Example picture resolution at highest frame rate |
---|---|
1 | 128x96(12,288*)@30.0 |
2 | 352x288(101,376*)@30.0 |
3 | 352x288(101,376*)@60.0 |
4 | 640x360(230,400*)@60.0 |
5 | 720x576(414,720*)@75.0 |
6 | 1,280x720(921,600*)@68.0 |
7 | 1,280x720(921,600*)@120.0 |
8 | 1,920x1,080(2,073,600*)@120.0 |
9 | 1,920x1,080(2,073,600*)@250.0 |
10 | 1,920x1,080(2,073,600*)@300.0 |
11 | 3,840x2,160(8,294,400*)@120.0 |
12 | 3,840x2,160(8,294,400*)@250.0 |
13 | 3,840x2,160(8,294,400*)@300.0 |
*Note: The quantities of pixels are presented for applications in which a picture can have an arbitrary size (e.g., screencasting).¶
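As a non-normative illustration of how the example levels in Table 7 could be used, the sketch below maps a requested picture size and frame rate to the lowest level whose limits cover it. The helper name and data layout are assumptions; an actual level definition would also constrain bitrate, buffering, and other decoder-complexity features, as noted above.¶

```python
# Minimal sketch of a level lookup based on Table 7; the data mirrors the
# table's example limits (luma samples per picture and maximum frame rate),
# and the helper itself is an illustrative assumption.
LEVELS = [  # (level, max luma samples per picture, max frame rate in fps)
    (1, 12_288, 30.0),
    (2, 101_376, 30.0),
    (3, 101_376, 60.0),
    (4, 230_400, 60.0),
    (5, 414_720, 75.0),
    (6, 921_600, 68.0),
    (7, 921_600, 120.0),
    (8, 2_073_600, 120.0),
    (9, 2_073_600, 250.0),
    (10, 2_073_600, 300.0),
    (11, 8_294_400, 120.0),
    (12, 8_294_400, 250.0),
    (13, 8_294_400, 300.0),
]

def minimum_level(width, height, fps):
    """Return the lowest level whose picture-size and frame-rate limits
    cover the requested operating point, or None if nothing fits."""
    samples = width * height
    for level, max_samples, max_fps in LEVELS:
        if samples <= max_samples and fps <= max_fps:
            return level
    return None

print(minimum_level(1920, 1080, 60))  # -> 8
```

A real capability negotiation would, of course, follow the normative level definition rather than this table-driven lookup.¶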
Bitstream syntax should allow extensibility and backward compatibility. New features can be supported easily by using metadata (such as SEI messages, VUI, and headers) without affecting the bitstream compatibility with legacy decoders. A newer version of the decoder shall be able to play bitstreams of an older version of the same or lower profile and level.¶
A bitstream should have a model that allows easy parsing and identification of the sample components (such as Annex B of ISO/IEC 14496-10 [18] or ISO/IEC 14496-15 [19]). In particular, information needed for packet handling (e.g., frame type) should not require parsing anything below the header level.¶
Perceptual quality tools (such as adaptive QP and quantization matrices)should be supported by the codec bitstream.¶
The codec specification shall define a buffer model such as a hypothetical reference decoder (HRD).¶
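A buffer model of this kind can be illustrated with a simple leaky-bucket style sketch: coded bits arrive at the channel rate, whole pictures are removed at their decode times, and the stream must never starve the decoder buffer. The sketch below is a generic, non-normative illustration of that idea (it checks only underflow and caps arrival at the buffer size), not the HRD of any particular codec; the numbers and the initial-delay parameter are assumptions.¶

```python
# Generic leaky-bucket sketch of the idea behind a decoder buffer model:
# bits arrive at the channel rate and whole coded pictures are removed at
# their decode times. All numbers here are illustrative.
def check_buffer(picture_sizes_bits, bitrate_bps, buffer_bits, fps,
                 initial_delay_intervals=3):
    """Return True if, after an initial buffering delay, every coded picture
    has fully arrived by its decode time (no underflow). Arrival is capped at
    the buffer size, i.e., the channel pauses while the buffer is full."""
    bits_per_interval = bitrate_bps / fps
    fullness = min(initial_delay_intervals * bits_per_interval, buffer_bits)
    for size in picture_sizes_bits:
        if fullness < size:      # picture not completely received yet: underflow
            return False
        fullness -= size         # instantaneous removal at the picture's decode time
        fullness = min(fullness + bits_per_interval, buffer_bits)  # refill for one frame interval
    return True

# Example: 30 fps stream at 4 Mbit/s with a 2 Mbit buffer.
sizes = [400_000] + [100_000] * 29   # a large intra picture followed by smaller ones
print(check_buffer(sizes, 4_000_000, 2_000_000, 30))
```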
Specifications providing integration with system and delivery layers should be developed.¶
Input pictures coded by a video codec should have one of the following formats:¶
Color sampling formats:¶
Exemplary input source formats for codec profiles are shown in Table 8.¶
Profile | Bit depths per color component | Color sampling formats |
---|---|---|
1 | 8 and 10 | 4:0:0 and 4:2:0 |
2 | 8 and 10 | 4:0:0, 4:2:0, and 4:4:4 |
3 | 8, 10, and 12 | 4:0:0, 4:2:0, 4:2:2, and 4:4:4 |
In order to meet coding delay requirements, a video codec should support all of the following:¶
Support of configurations with zero structural delay, also referred to as "low-delay" configurations.¶
Encoding and decoding complexity considerations are as follows:¶
The mandatory scalability requirement is as follows:¶
In order to meet the error resilience requirement, a video codec should satisfy all of the following conditions:¶
It is a desired but not mandatory requirement for a video codec to support some of the following features:¶
Desirable scalability requirements are as follows:¶
Tools that enable parallel processing (e.g., slices, tiles, and wave-front propagation processing) at both encoder and decoder sides are highly desirable for many applications.¶
Compression efficiency on noisy content, content with film grain, computer-generated content, and low-resolution materials is desirable.¶
As shown in Figure 1, compression performance testing is performed in three overlapping bitrate ranges, low (LBR), medium (MBR), and high (HBR), that together encompass ten different bitrate values.¶
Initially, for the codec selected as the reference (e.g., HEVC or VP9), a set of ten QP (quantization parameter) values should be specified as in [14], and the corresponding quality values should be calculated. In Figure 1, the QP and quality values are denoted as "QP0"-"QP9" and "Q0"-"Q9", respectively. To guarantee the overlaps of quality levels between the bitrate ranges of the reference and tested codecs, a quality alignment procedure should be performed for each range's outermost (left- and rightmost) quality levels Qk of the reference codec (i.e., for Q0, Q3, Q6, and Q9) and the quality levels Q'k (i.e., Q'0, Q'3, Q'6, and Q'9) of the tested codec. Thus, these quality levels Q'k, and hence the corresponding QP values QP'k (i.e., QP'0, QP'3, QP'6, and QP'9), of the tested codec should be selected using the following formulas:¶
Q'k  = min    { abs(Q'i - Qk) },             i in R
QP'k = argmin { abs(Q'i(QP'i) - Qk(QPk)) },  i in R¶
where R is the range of the QP indexes of the tested codec, i.e., the candidate Internet video codec. The inner quality levels (i.e., Q'1, Q'2, Q'4, Q'5, Q'7, and Q'8), as well as their corresponding QP values of each range (i.e., QP'1, QP'2, QP'4, QP'5, QP'7, and QP'8), should be as equidistantly spaced as possible between the left- and rightmost quality levels without explicitly mapping their values using the procedure described above.¶
[Figure 1: Quality alignment between the tested and reference codecs. The tested codec's points QP'0-QP'9 (with quality levels Q'0-Q'9) are aligned against the reference codec's points QP0-QP9 (with quality levels Q0-Q9) along the bitrate axis, with the ten points grouped into three overlapping ranges: LBR, MBR, and HBR.]
Since the QP mapping results may vary for different sequences, this quality alignment procedure eventually needs to be performed separately for each quality assessment index and each sequence used for codec performance evaluation to fulfill the requirements described above.¶
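The alignment rule above can be read as a nearest-neighbor search over the tested codec's measured quality values. The sketch below illustrates that reading; the data structures and example values are assumptions for illustration only.¶

```python
# Sketch of the quality-alignment step described above: for each outermost
# reference quality level Qk, pick the tested codec's QP whose measured
# quality Q'(QP') is closest to Qk. Input dictionaries are illustrative.
def align_outer_points(reference_quality, tested_quality):
    """reference_quality: {k: Qk} for the outermost indices k (0, 3, 6, 9).
    tested_quality: {QP': Q'} measured over the tested codec's QP range R.
    Returns {k: (QP'k, Q'k)} chosen by the argmin rule."""
    aligned = {}
    for k, q_ref in reference_quality.items():
        qp_best = min(tested_quality, key=lambda qp: abs(tested_quality[qp] - q_ref))
        aligned[k] = (qp_best, tested_quality[qp_best])
    return aligned

# Example with made-up PSNR values (dB) and an assumed QP range R:
reference = {0: 30.1, 3: 34.2, 6: 38.0, 9: 41.5}
tested = {qp: 28.0 + 0.35 * (63 - qp) for qp in range(20, 64)}
print(align_outer_points(reference, tested))
```

As stated above, this selection would be repeated per sequence and per quality assessment index.¶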
To assess the quality of output (decoded) sequences, two indexes (PSNR [3] and MS-SSIM [3] [15]) are computed separately. In the case of the YCbCr color format, PSNR should be calculated for each color plane, whereas MS-SSIM is calculated for the luma channel only. In the case of the RGB color format, both metrics are computed for the R, G, and B channels. Thus, for each sequence, 30 RD-points for PSNR (i.e., three RD-curves, one for each channel) and 10 RD-points for MS-SSIM (i.e., one RD-curve, for the luma channel only) should be calculated in the case of YCbCr. If content is encoded as RGB, 60 RD-points should be calculated (30 for PSNR and 30 for MS-SSIM), i.e., three RD-curves (one for each channel) are computed for PSNR as well as three RD-curves (one for each channel) for MS-SSIM.¶
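As an illustration, per-plane PSNR for 8-bit content can be computed as in the sketch below (MS-SSIM is not reproduced here and would normally come from an existing implementation such as the one referenced in [15]). The frame arrays and helper name are assumptions for illustration.¶

```python
import numpy as np

def psnr(plane_ref, plane_dec, max_value=255.0):
    """PSNR in dB between one reference and one decoded color plane,
    both given as arrays of the same shape (8-bit content assumed here;
    use max_value=1023.0 for 10-bit)."""
    diff = plane_ref.astype(np.float64) - plane_dec.astype(np.float64)
    mse = np.mean(diff * diff)
    if mse == 0:
        return float("inf")          # mathematically lossless plane
    return 10.0 * np.log10((max_value ** 2) / mse)

# For YCbCr input, PSNR is reported per plane (Y, Cb, Cr); MS-SSIM is
# computed on the luma plane only, as stated above.
ref_y = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)   # stand-in frames
dec_y = np.clip(ref_y.astype(int) + np.random.randint(-2, 3, ref_y.shape), 0, 255)
print(round(psnr(ref_y, dec_y), 2))
```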
Finally, to obtain an integral estimation, BD-rate savings [13] should be computed for each range and each quality index. In addition, average values over all three ranges should be provided for both PSNR and MS-SSIM. A list of video sequences that should be used for testing, as well as the ten QP values for the reference codec, is defined in [14]. Testing processes should use the information on the codec applications presented in this document. As the reference for evaluation, state-of-the-art video codecs such as HEVC/H.265 [4] [5] or VP9 must be used. The reference source code of the HEVC/H.265 codec can be found at [6]. The HEVC/H.265 codec must be configured according to [16] and Table 9.¶
PAM | Intra-period, second | HEVC/H.265 encoding mode according to [16]
---|---|---
AI | | Intra Main or Intra Main10
RA | | Random access Main or Random access Main10
FIZD | | Low delay Main or Low delay Main10
According to the coding efficiency requirement described in Section 4.1.1, BD-rate savings calculated for each color plane and averaged for all the video sequences used to test the NETVC codec should be, at least,¶
Since values of the two objective metrics (PSNR and MS-SSIM) are available for some color planes, each value should meet these coding efficiency requirements. That is, the final BD-rate saving denoted as S is calculated for a given color plane as follows:¶
S = min { S_psnr, S_ms-ssim }¶
where S_psnr and S_ms-ssim are BD-rate savings calculated for the given color plane using PSNR and MS-SSIM metrics, respectively.¶
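As a non-normative illustration, the sketch below computes a BD-rate value using the commonly cited Bjøntegaard approach of fitting each RD-curve with a cubic polynomial of log-bitrate versus quality and integrating the difference over the overlapping quality interval, and then applies the S = min { S_psnr, S_ms-ssim } rule above. The exact procedure to be used is the one defined in [13]; the RD-points and function names here are made-up examples.¶

```python
import numpy as np

def bd_rate(rates_ref, qual_ref, rates_test, qual_test):
    """Average bitrate difference (%) of the tested codec against the
    reference over the overlapping quality range (cubic fit of log-rate as
    a function of quality). Negative values correspond to bitrate savings."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(qual_ref, lr_ref, 3)
    p_test = np.polyfit(qual_test, lr_test, 3)
    lo = max(min(qual_ref), min(qual_test))      # overlapping quality interval
    hi = min(max(qual_ref), max(qual_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)  # average log-rate difference
    return (np.exp(avg_diff) - 1.0) * 100.0      # percent bitrate change

def final_saving(s_psnr, s_ms_ssim):
    """Final per-plane saving S = min { S_psnr, S_ms-ssim }, as defined above."""
    return min(s_psnr, s_ms_ssim)

# Example with made-up RD-points (kbit/s, dB) for one color plane and range:
ref = ([1000, 1800, 3200, 6000], [33.0, 35.5, 38.0, 40.5])
test = ([800, 1500, 2700, 5200], [33.1, 35.6, 38.1, 40.6])
print(round(bd_rate(*ref, *test), 1))  # negative => bitrate saving vs. the reference
```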
In addition to the objective quality measures defined above, subjective evaluation must also be performed for the final NETVC codec adoption. For subjective tests, the MOS-based evaluation procedure must be used as described in Section 2.1 of [3]. For perception-oriented tools that primarily impact subjective quality, additional tests may also be individually assigned even for intermediate evaluation, subject to a decision of the NETVC WG.¶
This document itself does not address any security considerations. However, it is worth noting that a codec implementation (for both an encoder and a decoder) should take into consideration the worst-case computational complexity, memory bandwidth, and physical memory size needed to process the potentially untrusted input (e.g., the decoded pictures used as references).¶
This document has no IANA actions.¶
The authors would like to thank Mr. Paul Coverdale, Mr. Vasily Rufitskiy, and Dr. Jianle Chen for many useful discussions on this document and their help while preparing it, as well as Mr. Mo Zanaty, Dr. Minhua Zhou, Dr. Ali Begen, Mr. Thomas Daede, Mr. Adam Roach, Dr. Thomas Davies, Mr. Jonathan Lennox, Dr. Timothy Terriberry, Mr. Peter Thatcher, Dr. Jean-Marc Valin, Mr. Roman Danyliw, Mr. Jack Moffitt, Mr. Greg Coppa, and Mr. Andrew Krupiczka for their valuable comments on different revisions of this document.¶