CN116366851A


Info

Publication number: CN116366851A
Application number: CN202211637941.7A
Authority: CN
Inventors: 庄政彦; 陈俊嘉; 徐志玮; 庄子德; 陈庆晔; 黄毓文
Original assignee: MediaTek Inc
Current assignee: MediaTek Inc
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-06-30

Abstract


Various schemes pertaining to parallelization techniques for video coding are described. An apparatus receives video data. The apparatus then calculates a plurality of figures of merit (FOMs), each FOM representing how well a particular coding tool can perform when encoding the video data. The apparatus further determines a coding tool suitable for encoding the video data by comparing the FOMs. In determining the coding tool, the apparatus uses time-interleaving techniques to process the video data in parallel. The video data may include an array of coding blocks, and the apparatus may receive the video data by scanning the array of coding blocks in a serpentine processing order.

Description


Video data encoding method and apparatus

Cross-Reference

This application claims priority to U.S. Provisional Patent Application No. 63/290,073, filed December 16, 2021, the content of which is incorporated herein by reference in its entirety.

Technical Field

The present disclosure relates generally to video coding and, more particularly, to methods and apparatus for efficient video coding using parallelization techniques.

Background

Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims listed below and are not admitted to be prior art by inclusion in this section.

Video coding generally involves encoding a video (i.e., an original video) into a bitstream by an encoder, transmitting the bitstream to a decoder, and parsing and processing the bitstream by the decoder to produce a reconstructed video. The encoder may employ various coding modes or tools in encoding the video. One purpose of these tools is to reduce the total size of the bitstream that needs to be transmitted to the decoder while still providing the decoder with enough information about the original video that the decoder can generate a reconstructed video faithful to the original. For example, the Versatile Video Coding (VVC) standard, the state-of-the-art video coding standard finalized in 2020, newly defines various coding tools to achieve a coding gain (e.g., a Bjontegaard delta-rate gain) of about 40% over the previous-generation standard, High Efficiency Video Coding (HEVC), which was published in 2013. The new coding tools provided by VVC make high-performance video coding possible and support new video use cases, such as viewport-dependent 360° video streaming, with advanced features such as region-wise random access and signal-to-noise ratio (SNR) scalability.

For example, the VVC standard includes new coding tools related to intra prediction, such as matrix-based intra prediction (MIP), chroma separate tree (CST), intra sub-partitions (ISP), and intra block copy (IBC). New coding tools related to inter prediction, such as adaptive motion vector resolution (AMVR), merge mode with motion vector difference (MMVD), combined inter/intra prediction (CIIP), and geometric partitioning mode (GPM), are also included in VVC. VVC further includes tools applicable to both intra-picture and inter-picture prediction, such as sample adaptive offset (SAO), adaptive loop filter (ALF), cross-component adaptive loop filter (CCALF), and joint coding of chroma residuals (JCCR). In addition, VVC includes new tools related to block partitioning in the encoder, such as ternary tree partitioning (TT), binary tree and ternary tree partitioning (BT_TT), a larger maximum coding tree unit size of 64 x 64 pixels (CTU64), and a larger maximum transform unit size of 32 x 32 pixels (TU32). Other newly developed video coding standards follow a trend similar to that of VVC and include more coding tools to achieve better coding performance.

Accordingly, the coding tools an encoder needs to employ depend on which video coding standard or standards the encoder is designed to support. As video coding standards continue to evolve, more and more coding tools are defined in the standards, and a general-purpose video encoder is therefore expected to implement a wide variety of coding tools. It is thus important that, for each picture or portion thereof to be encoded, the encoder can quickly determine a preferred or otherwise suitable coding tool to apply to the video data at hand, so that the desired video quality is achieved at a reasonable coding cost.

Summary

The following summary is illustrative only and is not intended to be limiting in any way. That is, the following summary is provided to introduce concepts, highlights, benefits, and advantages of the novel and non-obvious techniques described herein. Select implementations are further described below in the detailed description. Thus, the following summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.

An objective of the present disclosure is to provide schemes, concepts, designs, techniques, methods, and apparatus pertaining to video coding using parallelization techniques. It is believed that the various embodiments in the present disclosure achieve benefits including improved coding latency, simplified search memory access, and/or reduced hardware overhead.

In one aspect, a method of encoding video data using a preferred coding tool is presented. The method may involve receiving the video data by a plurality of processing elements (PEs), each PE being configured to perform a coding efficiency evaluation for a respective coding tool. In some embodiments, each PE may be a low-complexity rate-distortion optimizer (LC-RDO). The method may then involve calculating, by each of the PEs performing the coding efficiency evaluation, a respective figure of merit (FOM) specific to the respective coding tool and the video data. In some embodiments, the FOM may be a sum of squared differences (SSD), a sum of absolute differences (SAD), or a sum of absolute transformed differences (SATD). The method may also involve determining a coding tool specific to the video data by comparing the FOMs calculated by the PEs. In some embodiments, the method may further involve determining a set of parameter settings associated with the determined coding tool. Finally, the method may involve encoding the video data using the determined coding tool and parameter settings.
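For reference, the three FOMs named above are commonly defined as follows. The symbols are introduced here for illustration only and do not appear in the source: o(i,j) is an original sample, r(i,j) is the corresponding sample produced with the candidate coding tool, and T is a block transform (a Hadamard transform is assumed, as is common practice for SATD).

```latex
\begin{aligned}
\mathrm{SSD}  &= \sum_{i,j} \bigl(o(i,j) - r(i,j)\bigr)^{2} \\
\mathrm{SAD}  &= \sum_{i,j} \bigl|\,o(i,j) - r(i,j)\,\bigr| \\
\mathrm{SATD} &= \sum_{u,v} \bigl|\,[\,T(o - r)\,](u,v)\,\bigr|
\end{aligned}
```

For all three measures, a lower value indicates less distortion.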

In some embodiments, the video data may be a coding block (CB) that is divided into a plurality of sub-blocks forming an array of columns and rows. In receiving the video data, each PE may receive several consecutive sub-blocks at a time. The number of sub-blocks each PE receives at a time may equal the number of PEs involved, i.e., the number of coding tools being evaluated. In some embodiments, the PEs may receive and process the video data in a serpentine scan order that proceeds column by column or row by row.
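The serpentine scan order can be illustrated with a minimal Python sketch. The function name `serpentine_order` is hypothetical, and the sketch assumes the usual meaning of a serpentine scan, in which the traversal direction reverses on every other column (or row); the source does not spell out the direction details at this point.

```python
from typing import Iterator, Tuple


def serpentine_order(rows: int, cols: int,
                     by_column: bool = True) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) sub-block coordinates in serpentine order.

    Assumption: the scan direction flips on every other column (or row).
    """
    if by_column:
        for c in range(cols):
            # Even columns run top to bottom, odd columns bottom to top.
            rs = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
            for r in rs:
                yield (r, c)
    else:
        for r in range(rows):
            # Even rows run left to right, odd rows right to left.
            cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
            for c in cs:
                yield (r, c)


column_scan = list(serpentine_order(2, 3, by_column=True))
row_scan = list(serpentine_order(2, 3, by_column=False))
```

Unlike a raster scan, consecutive sub-blocks in this order are always spatial neighbors, which is one plausible reason such an order simplifies search memory access.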

In some embodiments, the sub-blocks may be stored in a cache memory having a plurality of memory banks. The banks may be divided into two groups, each of which may have as many banks as there are PEs. In an event that the PEs receive the sub-blocks in a column-by-column serpentine or raster scan, any two adjacent columns of sub-blocks are stored in different ones of the two groups of banks. In an event that the PEs receive the sub-blocks in a row-by-row serpentine or raster scan, any two adjacent rows of sub-blocks are stored in different ones of the two groups of banks.
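One way to satisfy the two-group constraint is sketched below in Python. Only the alternation of adjacent columns (or rows) between the two groups comes from the text above; the function name `bank_for_subblock` and the modulo mapping of a sub-block to a bank within its group are assumptions made for illustration.

```python
from typing import Tuple


def bank_for_subblock(row: int, col: int, num_pes: int,
                      by_column: bool = True) -> Tuple[int, int]:
    """Return (group, bank_within_group) for the sub-block at (row, col).

    From the text: adjacent columns (column-wise scan) or adjacent rows
    (row-wise scan) go to different groups, and each group has num_pes banks.
    Assumption: within a group, banks are assigned round-robin along the
    scan direction.
    """
    line, pos = (col, row) if by_column else (row, col)
    group = line % 2        # alternate groups between adjacent columns/rows
    bank = pos % num_pes    # hypothetical round-robin mapping inside a group
    return group, bank


first = bank_for_subblock(0, 0, num_pes=4)
neighbor = bank_for_subblock(0, 1, num_pes=4)
```

With this mapping, the sub-blocks a PE fetches in one access (as many as there are PEs, taken consecutively along a column) fall into distinct banks of the same group.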

In another aspect, an apparatus is presented that includes a cache memory, a processor, a plurality of processing elements (PEs), and a comparator. The processor is configured to store video data in the cache memory according to a bank allocation scheme specific to the video data. The processor determines the bank allocation scheme based on various factors, such as the size of a coding block of the video data, the size of the sub-blocks of the video data, the number of PEs operating concurrently in a time-interleaved manner, and the scan order (e.g., raster scan or serpentine scan) used to process the sub-blocks. Each PE is configured to apply a respective coding mode or coding tool to the video data and then determine its coding efficiency by calculating a figure of merit (FOM), such as a sum of squared differences (SSD), a sum of absolute differences (SAD), or a sum of absolute transformed differences (SATD). The comparator is configured to compare the FOMs calculated by the PEs and thereby determine the coding tool.

Brief Description of the Drawings

The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of the present disclosure. The drawings illustrate implementations of the present disclosure and, together with the description, serve to explain the principles of the present disclosure. It is noted that the drawings are not necessarily drawn to scale, as some components may be shown out of proportion to their size in an actual implementation in order to clearly illustrate the concepts of the present disclosure.

FIG. 1 is a diagram of an example design in accordance with an implementation of the present disclosure.

FIG. 2 is a diagram of an example design in accordance with an implementation of the present disclosure.

FIG. 3 is a diagram of an example design in accordance with an implementation of the present disclosure.

FIG. 4 is a diagram of an example design in accordance with an implementation of the present disclosure.

FIG. 5 is a diagram of an example design in accordance with an implementation of the present disclosure.

FIG. 6 is a diagram of an example coding efficiency evaluation apparatus in accordance with an implementation of the present disclosure.

FIG. 7 is a flowchart of an example process in accordance with an implementation of the present disclosure.

FIG. 8 is a diagram of an example electronic system in accordance with an implementation of the present disclosure.

Detailed Description

Detailed embodiments and implementations of the claimed subject matter are disclosed herein. It shall be understood, however, that the disclosed embodiments and implementations are merely illustrative of the claimed subject matter, which may be embodied in various forms. The present disclosure may be embodied in many different forms and should not be construed as limited to the exemplary embodiments and implementations set forth herein. Rather, these exemplary embodiments and implementations are provided so that the description of the present disclosure is thorough and complete and fully conveys the scope of the present disclosure to those skilled in the art. In the description below, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments and implementations.

Implementations in accordance with the present disclosure relate to various techniques, methods, schemes, and/or solutions pertaining to efficient parallelization of video coding and search memory access. According to the present disclosure, a number of possible solutions may be implemented separately or jointly. That is, although these possible solutions may be described below separately, two or more of them may be implemented in one combination or another.

1. Parallel Coding Tool Evaluation

As described elsewhere herein above, it is important that an encoder (i.e., a video encoder) quickly determine which coding tool is suitable for encoding the video data at hand. The encoder then encodes the video data using the determined coding tool rather than the other coding modes that the encoder is also capable of performing. The encoder may determine that a certain coding tool is the most suitable based on various factors, such as specific properties of the video to be encoded and specific characteristics of the encoded bitstream. Moreover, different portions of the video data may be encoded using different coding tools or modes. For example, each frame of a video may be divided into non-overlapping blocks, sometimes referred to as coding blocks (CBs), and each frame may be divided into a plurality of slices, each slice having an associated group of the non-overlapping blocks. The video data may be encoded in such a way that each slice (i.e., its coding blocks) is encoded with a respective coding tool.

To determine the coding tool (i.e., the most suitable coding tool for encoding the video data at hand or a slice thereof), the encoder may need to evaluate several candidate coding tools using at least a portion of the video data to be encoded. To determine the coding tool quickly, the evaluation process does not aim to obtain a fine (i.e., highly accurate) coding result. Rather, it aims to obtain a coarse (i.e., less accurate) result for each candidate coding tool in a timely manner, so that the results can be compared and the coding tool determined accordingly. The encoder then encodes the video data using the determined coding tool. This evaluation process is interchangeably referred to hereinafter as the "coding tool evaluation process" or the "coding efficiency evaluation process".

It is worth noting that the determined coding tool generally depends on the video data to be encoded, because a coding tool suitable for encoding one type of video data may not be equally suitable for encoding another type. For example, different coding tools may be determined when encoding video data that mainly contains natural images and when encoding video data that mainly contains screen content.

To evaluate multiple candidate coding tools in a timely manner, the encoder may parallelize the evaluation process. That is, two or more processing elements (PEs) operate concurrently, with each PE evaluating the performance (e.g., coding efficiency) of a respective candidate coding tool. FIG. 1 is a diagram of an example design in accordance with an implementation of the present disclosure, wherein a parallel coding tool evaluation scheme 100 is presented. In scheme 100, parallelization is achieved by four concurrently operating PEs, namely PEs 130, 131, 132, and 133. Each of PEs 130-133 is configured to evaluate the coding efficiency of a respective coding tool as applied to video data stored in search memory 110. For example, PE 130 is configured to perform a coding efficiency evaluation of coding tool T0, and PE 131 is configured to perform a coding efficiency evaluation of coding tool T1. Likewise, PE 132 and PE 133 are configured to perform coding efficiency evaluations of coding tools T2 and T3, respectively. Each of coding tools T0, T1, T2, and T3 may be one of the coding tools defined in VVC, HEVC, or another video coding standard, such as the VVC coding tools described elsewhere herein above.

As described above, PEs 130-133 are intended to evaluate the efficiency of coding tools in a timely manner. Simple evaluation algorithms involving low-complexity hardware and/or software modules are therefore typically used to implement the PEs. For example, each of PEs 130-133 may be a low-complexity rate-distortion optimizer (LC-RDO) configured to evaluate the coding efficiency of a coding tool by performing relatively simple calculations, such as spatial pixel filtering, calculation of absolute pixel-wise differences, calculation of pixel-wise squared differences, and calculation of pixel-wise transformed differences. In general, each of PEs 130-133 may have a pipeline structure or architecture that includes a plurality of processing stages. The pipeline structure is configured to process data by passing the data sequentially from one stage to the next. In some embodiments, PEs 130-133 may fetch video data incrementally from search memory (cache) 110 for processing. For example, each of PEs 130-133 may be an LC-RDO having a pipeline structure that includes a horizontal filtering (HFIR) stage, followed by a vertical filtering (VFIR) stage, followed by a distortion calculation (DIST) stage, followed by a comparison (COMP) stage. The LC-RDO may process data incrementally using the pipeline stages, with each stage processing a different portion of the data during each pipeline cycle.

Each of PEs 130-133 uses the same video data, namely video data 113 stored in search memory 110, to evaluate the coding efficiency of the respective coding tool. In some embodiments, video data 113 may include a coding block (CB) of a video. Coding tool 160, once determined in scheme 100, is used to encode CB 113. Coding tool 160 is determined to be one of coding tools T0, T1, T2, and T3. Coding tool 160 is determined by comparator 150, which is configured to compare the evaluation results generated by PEs 130-133. Each of PEs 130-133 may perform a coding efficiency evaluation by applying its respective coding tool to video data 113, thereby generating an evaluation result. For example, PE 130 may perform a coding efficiency evaluation by applying coding tool T0 to video data 113, thereby generating an evaluation result represented by figure of merit (FOM) 140. Similarly, PEs 131, 132, and 133 may perform coding efficiency evaluations by applying coding tools T1, T2, and T3, respectively, to video data 113, thereby generating evaluation results represented by FOMs 141, 142, and 143, respectively. In some embodiments, each of FOMs 140-143 may be a sum of squared differences (SSD), a sum of absolute differences (SAD), or a sum of absolute transformed differences (SATD) between the resulting encoded video and the original video data 113, with the sum taken over every pixel of video data 113. Comparator 150 may compare FOMs 140-143 and determine which of coding tools T0, T1, T2, and T3 is coding tool 160, which is later used to encode CB 113. For example, each of FOMs 140-143 may be a respective SSD value, and comparator 150 may compare FOMs 140-143 and determine that FOM 142 has the lowest value among them. Comparator 150 may thus decide that coding tool T2 is coding tool 160 for encoding video data 113.
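The comparator's decision can be sketched in Python as follows. This is a minimal illustration rather than the disclosed hardware: the four "tools" are hypothetical stand-ins that merely return an approximation of the block, the evaluation runs sequentially instead of on parallel PEs, and SSD is used as the FOM.

```python
from typing import Callable, Dict, List, Tuple

Block = List[List[int]]


def ssd(original: Block, candidate: Block) -> int:
    """Sum of squared differences over every sample of the block."""
    return sum((o - c) ** 2
               for row_o, row_c in zip(original, candidate)
               for o, c in zip(row_o, row_c))


def select_tool(block: Block,
                tools: Dict[str, Callable[[Block], Block]]
                ) -> Tuple[str, Dict[str, int]]:
    """Evaluate every candidate on the same block; the lowest FOM wins."""
    foms = {name: ssd(block, tool(block)) for name, tool in tools.items()}
    best = min(foms, key=foms.get)
    return best, foms


# Hypothetical stand-ins for coding tools T0..T3 (not real codec tools).
def flat_mean(block: Block) -> Block:
    mean = sum(map(sum, block)) // (len(block) * len(block[0]))
    return [[mean] * len(row) for row in block]


def quantize(step: int) -> Callable[[Block], Block]:
    def tool(block: Block) -> Block:
        return [[(v // step) * step for v in row] for row in block]
    return tool


def all_zero(block: Block) -> Block:
    return [[0] * len(row) for row in block]


tools = {"T0": flat_mean, "T1": quantize(8), "T2": quantize(2), "T3": all_zero}
block = [[10, 12], [14, 16]]
best_tool, foms = select_tool(block, tools)
```

On this toy block the stand-in labeled T2 yields the lowest SSD, mirroring the example above in which FOM 142 is the lowest and T2 is selected.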

In some embodiments, in addition to determining coding tool 160, comparator 150 may also determine a set of coding parameters to be used with coding tool 160 in encoding video data 113. To this end, some of PEs 130-133 may be configured to operate with the same coding tool but with different coding parameter settings. For example, T0 and T1 may be the same coding tool, with PEs 130 and 131 operating with different coding parameter settings applied to that tool, e.g., a first set of coding parameters and a second set of coding parameters. The resulting FOMs 140 and 141 then indicate which of the first and second sets of coding parameters is preferred. The preferred set of coding parameters is included as part of the determined coding tool 160.

In some embodiments, scheme 100 may involve a PE (e.g., PE 130, 131, 132, or 133) that includes a high-complexity rate-distortion optimizer (HC-RDO) in place of, or in addition to, the LC-RDO of the PE. The HC-RDO may be cascaded with the LC-RDO of the PE. Compared with an implementation in which the PE has only an LC-RDO, a PE having an HC-RDO may determine or otherwise calculate the respective FOM (i.e., FOM 140, 141, 142, or 143) with higher accuracy by involving more complex operations, albeit usually at the cost of more processing time. Because of the higher accuracy, coding tool 160 as determined by PEs involving HC-RDOs may differ from coding tool 160 as determined by PEs involving only LC-RDOs, and may be better suited for the encoding, with enhanced coding efficiency and/or performance.

2. Time-Interleaved Cache Access

Search memory 110 is sometimes referred to as a "cache" or "cache memory". Cache 110 is designed as temporary storage for video data, such as CB 113, during the coding tool evaluation process, in which PEs 130-133 may repeatedly access cache 110 to load different portions of CB 113. However, cache 110 cannot provide simultaneous access to each of PEs 130-133. That is, even though scheme 100 shows that PEs 130-133 can access cache 110 via data buses 120, 121, 122, and 123, this property of cache 110 requires that, at any time, only one of data buses 120-123 be "open", i.e., transferring data from cache 110 to one of PEs 130-133. It follows that true parallelization among PEs 130-133 is possible only if cache 110 is replicated into multiple copies, each accessed by a respective one of PEs 130-133. Replicating cache 110 is not an attractive parallelization solution, because the hardware cost of the replicas is high and may be impractical.

FIG. 2 is a diagram of an example design in accordance with an implementation of the present disclosure, wherein practically meaningful parallelization is achieved without replicating cache 110. Specifically, FIG. 2 illustrates a time-interleaved cache access method 200, in which PEs 130-133 can operate concurrently while no more than one of data buses 120-123 is open for accessing cache 110 at any time. That is, at any given time, no more than one of PEs 130-133 receives data (e.g., video data 113) from cache 110.

PEs 130-133 need not load or otherwise read the entirety of video data 113 from cache 110 before starting the coding efficiency evaluation process. Instead, PEs 130-133 may load only a portion of video data 113, such as portion 115 of CB 113. PEs 130-133 may not need to access cache 110 to load further portions of CB 113 until portion 115 has been processed. Each of PEs 130-133 may have an internal memory, often referred to as a "line buffer", to store the portion of video data 113 that is currently loaded. The PE may access the line buffer to retrieve that portion of video data 113 for the coding tool evaluation process. The PE may use the line buffer to hold the portion of video data 113 until its cache access window opens again, at which time the next portion of video data 113 is loaded. The line buffer is then replenished with the newly loaded portion of video data 113.

In some embodiments, CB 113 may be divided into a plurality of non-overlapping sub-blocks, typically of the same size (e.g., 4 pixels high and 4 pixels wide). That is, the sub-blocks of CB 113 may form an array of columns and rows. Portion 115 of CB 113 may include several sub-blocks, e.g., those labeled "0", "1", "2", "3", "4", and "5". Moreover, as described elsewhere herein above, each of PEs 130-133 may be an LC-RDO pipeline consisting of an HFIR stage, a VFIR stage, a DIST stage, and a COMP stage. Data passes through the stages of the LC-RDO pipeline in sequence: it is processed first by the HFIR stage, then by the VFIR stage, then by the DIST stage, and finally by the COMP stage. A timeline 299 is provided in FIG. 2 to show the progression of the first 13 pipeline cycles, i.e., pipeline cycles 1-13.

Referring to FIG. 2, PEs 130-133 access cache 110 in a time-interleaved manner. For example, during the first pipeline cycle it is the turn of PE 130 to access cache 110 (indicated by the "read" stage in the figure), during which PE 130 loads sub-block "0" of CB 113. PEs 131-133 load sub-block "0" of CB 113 during the next three pipeline cycles, i.e., the second, third, and fourth pipeline cycles, respectively. After PEs 131-133 have loaded sub-block "0" in turn, PE 130 accesses cache 110 again in the fifth pipeline cycle, during which PE 130 loads the next sub-block (i.e., sub-block "1" of CB 113). Likewise, PEs 131-133 load sub-block "1" of CB 113 during the next three pipeline cycles (i.e., the sixth, seventh, and eighth), respectively. After PEs 131-133 have loaded sub-block "1" in turn, PE 130 again has its turn to access cache 110 in the ninth pipeline cycle, during which PE 130 loads the next sub-block (i.e., sub-block "2" of CB 113). PEs 131-133 load sub-block "2" of CB 113 during the next three pipeline cycles (i.e., the tenth, eleventh, and twelfth), respectively.

Because the accesses of PEs 130-133 to cache 110 are time-interleaved, the processing of the loaded sub-blocks within PEs 130-133 is also time-interleaved, owing to the pipelined nature of the PEs. For example, PE 130 finishes processing sub-block "0" of CB 113 at the end of the fourth pipeline cycle (when the COMP stage of PE 130 finishes processing sub-block "0"), whereas PEs 131, 132, and 133 finish processing sub-block "0" of CB 113 at the end of the fifth, sixth, and seventh pipeline cycles, respectively.
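The cycle numbers quoted for method 200 follow a simple pattern, which the Python sketch below reproduces. The function names are hypothetical. The sketch assumes a latency of four pipeline cycles counted from the cycle in which a sub-block is read, because that value matches the completion cycles stated above; the exact stage-to-cycle mapping of FIG. 2 is not reproduced.

```python
LATENCY = 4  # assumed: cycles from the read cycle through the end of COMP


def read_cycle_200(pe: int, sub_block: int, num_pes: int = 4) -> int:
    """Pipeline cycle (1-based) in which PE index `pe` reads `sub_block`.

    PE indices 0..3 stand for PEs 130..133. Each PE reads one sub-block
    per turn, and the turns rotate over the PEs.
    """
    return sub_block * num_pes + pe + 1


def done_cycle_200(pe: int, sub_block: int, num_pes: int = 4) -> int:
    """Pipeline cycle at whose end the PE finishes processing `sub_block`."""
    return read_cycle_200(pe, sub_block, num_pes) + LATENCY - 1


# Reads of sub-blocks 0..2 by all four PEs occupy cycles 1..12, one per cycle.
reads = sorted(read_cycle_200(pe, sb) for sb in range(3) for pe in range(4))
```

Because every (PE, sub-block) pair maps to a distinct read cycle, at most one data bus is open in any cycle, which is the property method 200 is designed to guarantee.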

According to time-interleaved cache access method 200, at most one of PEs 130-133 loads sub-block data from cache 110 at any given time. Parallelization scheme 100 can therefore be implemented with only one copy of cache 110 by employing method 200. However, method 200 results in very low PE utilization. As shown in FIG. 2, the PE pipeline stages are idle (i.e., not processing any data) in most pipeline cycles. With four PEs operating in parallel, method 200 yields a PE utilization of approximately 25%. When more than four PEs are involved in the parallelization scheme, method 200 results in even lower PE utilization.
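
The 25% figure can be reproduced with a small counting sketch. It assumes each PE receives one sub-block every N cycles (N being the number of PEs) and that each sub-block occupies each pipeline stage for exactly one cycle; the function name and the four-stage depth are assumptions for illustration.

```python
def utilization_200(num_pes: int, sub_blocks: int, depth: int = 4) -> float:
    """Fraction of (stage, cycle) slots of one PE that do useful work under method 200."""
    # The PE reads a new sub-block every `num_pes` cycles, so the last sub-block
    # leaves the pipeline (sub_blocks - 1) * num_pes + depth cycles after the first read.
    total_cycles = (sub_blocks - 1) * num_pes + depth
    busy_slots = sub_blocks * depth          # each sub-block visits each stage once
    return busy_slots / (total_cycles * depth)


if __name__ == "__main__":
    for n in (4, 8):
        print(f"{n} PEs: utilization = {utilization_200(n, 64):.3f}")
```

Under these assumptions the utilization tends to 1/N as the number of sub-blocks grows, which matches the statement that adding PEs lowers utilization further.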

FIG. 3 is a diagram of an example design in accordance with an implementation of the present disclosure, illustrating another time-interleaved cache access method (i.e., method 300) that greatly improves on the low PE utilization of method 200. As shown in FIG. 3, the PEs are idle far less than under method 200. In fact, after a number of pipeline cycles, the PE utilization of method 300 approaches 100%. Most of the PE idle time of method 200 is eliminated in method 300 by loading more than one sub-block of CB 113 in each cache access window. For example, whereas method 200 has PE 130 load only sub-block "0" of CB 113 from cache 110 during the first pipeline cycle, method 300 has PE 130 load four sub-blocks, namely sub-blocks "0", "1", "2" and "3" of CB 113, during the first pipeline cycle. Because sub-blocks "1", "2" and "3" are loaded and saved to a queue buffer of PE 130 in the same pipeline cycle as sub-block "0", the pipeline operations of PE 130 on sub-blocks "1", "2" and "3" can start earlier. For example, PE 130 can start processing sub-block "1" as early as the second pipeline cycle and finish in the fifth pipeline cycle, three pipeline cycles earlier than under method 200. The completion of sub-block "2" by PE 130 is advanced even further, from the twelfth pipeline cycle in FIG. 2 to the sixth pipeline cycle in FIG. 3.
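
The speed-up can be checked with a sketch that compares completion cycles under the two methods. It assumes a four-stage pipeline, a batch size equal to the number of PEs, an access window that recurs every four cycles for each PE, and queued sub-blocks entering the pipeline one per cycle; all function names are illustrative.

```python
NUM_PES = 4          # PEs 130-133; also the assumed batch size of method 300
PIPELINE_DEPTH = 4   # assumed four-stage PE pipeline


def done_cycle_200(pe: int, sub_block: int) -> int:
    """Method 200: one sub-block per access window."""
    return sub_block * NUM_PES + pe + 1 + PIPELINE_DEPTH - 1


def start_cycle_300(pe: int, sub_block: int) -> int:
    """Method 300: cycle in which `sub_block` enters the pipeline of PE `pe`."""
    batch, offset = divmod(sub_block, NUM_PES)
    window = batch * NUM_PES + pe + 1   # cycle in which the whole batch is fetched
    return window + offset              # queued sub-blocks enter one per cycle


def done_cycle_300(pe: int, sub_block: int) -> int:
    return start_cycle_300(pe, sub_block) + PIPELINE_DEPTH - 1


if __name__ == "__main__":
    for s in range(4):
        print(f"PE 130, sub-block {s}: method 200 done after cycle {done_cycle_200(0, s)}, "
              f"method 300 done after cycle {done_cycle_300(0, s)}")
```

With these assumptions each PE receives a new sub-block in every cycle once its first window has opened, which is why the utilization approaches 100%.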

Specifically, according to time-interleaved cache access method 300, each of PEs 130-133 receives video data 113 in batches of sub-blocks. Each batch contains multiple sub-blocks of video data 113, and the number of sub-blocks in each batch equals the number of PEs operating in parallel in the coding efficiency evaluation. For example, four PEs (i.e., PEs 130-133) are used in parallel in the coding efficiency evaluation of parallelization scheme 100, so each of the four PEs loads a batch of four sub-blocks of CB 113 (e.g., sub-blocks "0-3", sub-blocks "4-7" or sub-blocks "8-11") each time its window for accessing cache 110 opens, as shown in time-interleaved cache access method 300.

In some embodiments, cache 110 may be divided into several banks (i.e., memory banks). The number of banks is an important parameter of a cache, as it represents the number of data entries that can be read from or written to the cache simultaneously. Specifically, at most one data entry can be read from or written to a given bank at any time. Since each of PEs 130-133 is expected to receive four sub-blocks of CB 113 within one pipeline cycle, cache 110 needs to have at least four banks, with the four sub-blocks of a batch stored in four separate banks. As described elsewhere herein below, considerations such as the minimum number of banks that cache 110 must have, and which sub-blocks of video data 113 are stored in which banks, are important design parameters in implementing parallel coding tool evaluation scheme 100 in combination with time-interleaved cache access method 300.
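
The one-entry-per-bank constraint can be expressed as a simple check: a batch can be fetched in a single pipeline cycle only if no two of its sub-blocks map to the same bank. The sketch below is illustrative; the `bank_of` mappings in the example are hypothetical and are not the allocations shown in the figures.

```python
from typing import Callable, Sequence


def conflict_free(batch: Sequence[int], bank_of: Callable[[int], int]) -> bool:
    """True if every sub-block of the batch is stored in a different bank."""
    banks = [bank_of(sub_block) for sub_block in batch]
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    batch = [0, 1, 2, 3]                           # sub-blocks "0-3"
    print(conflict_free(batch, lambda s: s % 4))   # hypothetical cache with four banks
    print(conflict_free(batch, lambda s: s % 2))   # hypothetical cache with two banks
```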

3. Sub-block Scanning Order

As described elsewhere herein above, a coding block may be divided into multiple sub-blocks such that the sub-blocks form an array of columns and rows of the coding block. FIG. 4 is a diagram of an example design in accordance with an implementation of the present disclosure, in which CB 113 is divided into non-overlapping sub-blocks forming an array of columns and rows. Specifically, CB 113 as shown in FIG. 4 is 32 pixels wide and 32 pixels high, and each sub-block is 4x4 pixels. CB 113 is thus divided into 64 sub-blocks (8 columns by 8 rows), as shown in each of diagrams 411, 412, 451 and 452.

According to time-interleaved cache access method 300, each of PEs 130-133 is designed to load or otherwise receive the sub-blocks of CB 113 in batches, each batch containing four consecutive sub-blocks of CB 113. FIG. 4 shows two types of scan order that PEs 130-133 may use to receive the sub-blocks of CB 113. Specifically, PEs 130-133 may receive the sub-blocks of CB 113 using a scan order called "raster scan", as shown in diagrams 411 and 412, or a scan order called "serpentine scan", as shown in diagrams 451 and 452. A raster scan may be performed column by column or row by row. The row-by-row manner is shown in diagram 411, in which PEs 130-133 load the sub-blocks in the first row of CB 113 from left to right, then the second row of CB 113, also from left to right, and so on. The column-by-column manner is shown in diagram 412, in which PEs 130-133 load the sub-blocks in the first column of CB 113 from top to bottom, then the second column of CB 113, also from top to bottom, and so on.
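
A minimal sketch of the two raster orders follows. It represents each sub-block by its (column, row) position with (0, 0) at the top-left corner; the function name and the coordinate convention are assumptions made for illustration.

```python
def raster_scan(cols: int, rows: int, by_column: bool = False) -> list[tuple[int, int]]:
    """Return sub-block positions (col, row) in raster order.

    by_column=False visits each row from left to right (as in diagram 411);
    by_column=True visits each column from top to bottom (as in diagram 412).
    """
    if by_column:
        return [(c, r) for c in range(cols) for r in range(rows)]
    return [(c, r) for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    print(raster_scan(8, 8)[:8])                   # first row of an 8x8 array
    print(raster_scan(8, 8, by_column=True)[:8])   # first column of an 8x8 array
```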

Likewise, a serpentine scan may be performed column by column or row by row. In a serpentine scan, the scan direction alternates with every row or column. The column-by-column serpentine scan is shown in diagram 451, in which PEs 130-133 load the sub-blocks in the first column of CB 113 from top to bottom, then the second column of CB 113 from bottom to top, then the third column of CB 113 from top to bottom again, and so on. The row-by-row serpentine scan is shown in diagram 452, in which PEs 130-133 load the sub-blocks in the first row of CB 113 from left to right, then the second row of CB 113 from right to left, then the third row of CB 113 from left to right again, and so on.
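
The serpentine order can be sketched in the same way, again using (column, row) positions with (0, 0) at the top-left corner. The function name and coordinate convention are illustrative assumptions.

```python
def serpentine_scan(cols: int, rows: int, by_column: bool = True) -> list[tuple[int, int]]:
    """Return sub-block positions (col, row) in serpentine order.

    by_column=True alternates top-to-bottom and bottom-to-top columns (diagram 451);
    by_column=False alternates left-to-right and right-to-left rows (diagram 452).
    """
    order: list[tuple[int, int]] = []
    if by_column:
        for c in range(cols):
            row_range = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
            order.extend((c, r) for r in row_range)
    else:
        for r in range(rows):
            col_range = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
            order.extend((c, r) for c in col_range)
    return order


if __name__ == "__main__":
    print(serpentine_scan(2, 3))                   # two columns of three sub-blocks
    print(serpentine_scan(3, 2, by_column=False))  # two rows of three sub-blocks
```

A useful property of this order is that consecutive sub-blocks are always neighbours in the array, which is what keeps batches that cross a column or row boundary compact.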

As described elsewhere herein above, each of PEs 130-133 needs to load a batch of four sub-blocks at a time (i.e., during one pipeline cycle) according to time-interleaved cache access method 300. As shown in FIG. 4, each column or row of CB 113 can be loaded in exactly two batches. Therefore, with either the raster scan or the serpentine scan, no batch of four sub-blocks crosses a column or row boundary. That is, there is no case in which the four sub-blocks of a batch fetched during a pipeline cycle are located in two adjacent columns or rows of CB 113.

The corresponding cache bank allocation for the sub-blocks of FIG. 4 can also be readily determined. For example, cache bank allocation 422 may be used for the raster scan of diagram 412 and the serpentine scan of diagram 451. As shown in cache bank allocation 422, cache 110 needs to have four banks, namely banks "0", "1", "2" and "3". The sub-blocks of CB 113 are stored in cache 110 according to cache bank allocation 422. That is, the first and fifth sub-blocks of each column are stored in bank "0"; the second and sixth sub-blocks of each column are stored in bank "1"; the third and seventh sub-blocks of each column are stored in bank "2"; and the fourth and eighth sub-blocks of each column are stored in bank "3".
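
The following sketch checks this allocation for the 8x8 sub-block array of FIG. 4. It assumes the banks are numbered 0 to 3 and that the allocation amounts to "bank = row index modulo 4" within each column, and it verifies that every batch of four sub-blocks stays inside one column and touches four different banks. All names are illustrative.

```python
COLS = ROWS = 8   # 32x32 CB split into 4x4 sub-blocks
NUM_PES = 4       # batch size


def column_scan(serpentine: bool) -> list[tuple[int, int]]:
    """Column-by-column order of (col, row); odd columns are reversed if serpentine."""
    order: list[tuple[int, int]] = []
    for c in range(COLS):
        row_range = list(range(ROWS))
        if serpentine and c % 2 == 1:
            row_range.reverse()
        order.extend((c, r) for r in row_range)
    return order


def bank_422(col: int, row: int) -> int:
    """Assumed reading of allocation 422: banks repeat 0, 1, 2, 3 down each column."""
    return row % NUM_PES


def batches_ok(order: list[tuple[int, int]]) -> bool:
    """True if every batch stays within one column and uses four distinct banks."""
    for i in range(0, len(order), NUM_PES):
        batch = order[i:i + NUM_PES]
        same_column = len({c for c, _ in batch}) == 1
        distinct_banks = len({bank_422(c, r) for c, r in batch}) == NUM_PES
        if not (same_column and distinct_banks):
            return False
    return True


if __name__ == "__main__":
    print("raster, column by column:", batches_ok(column_scan(serpentine=False)))
    print("serpentine, column by column:", batches_ok(column_scan(serpentine=True)))
```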

However, for coding blocks having a larger or smaller number of sub-blocks in a row or column, or for a different number of parallel PEs involved in parallel coding tool evaluation scheme 100, batches that cross a column or row boundary may be unavoidable, and the corresponding cache bank allocation becomes more complicated. In these cases, the serpentine scan order is preferred over the raster scan order, because the corresponding cache bank allocation is relatively simple for a serpentine scan. It may be difficult to find or determine a cache bank allocation for the raster scan order, because the address difference within a batch that crosses a column or row boundary can vary widely depending on the size of the coding block. In contrast, with the serpentine scan order the address difference in a column-crossing or row-crossing scenario is bounded.
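
To make the contrast concrete, the sketch below measures, for batches that cross a column boundary, how far apart the rows of the batch are. This "row span" is used here only as a stand-in for the address difference mentioned above, and the array sizes in the example are hypothetical.

```python
def column_scan(cols: int, rows: int, serpentine: bool) -> list[tuple[int, int]]:
    """Column-by-column order of (col, row); odd columns are reversed if serpentine."""
    order: list[tuple[int, int]] = []
    for c in range(cols):
        row_range = list(range(rows))
        if serpentine and c % 2 == 1:
            row_range.reverse()
        order.extend((c, r) for r in row_range)
    return order


def max_row_span(cols: int, rows: int, batch_size: int, serpentine: bool) -> int:
    """Largest (max row - min row) over all batches that touch more than one column."""
    order = column_scan(cols, rows, serpentine)
    worst = 0
    for i in range(0, len(order), batch_size):
        batch = order[i:i + batch_size]
        if len({c for c, _ in batch}) > 1:
            batch_rows = [r for _, r in batch]
            worst = max(worst, max(batch_rows) - min(batch_rows))
    return worst


if __name__ == "__main__":
    for rows in (6, 10):
        print(f"{rows} sub-blocks per column:",
              "raster span =", max_row_span(4, rows, 4, serpentine=False),
              "serpentine span =", max_row_span(4, rows, 4, serpentine=True))
```

In these examples the raster span grows with the column height, whereas the serpentine span does not change, because a serpentine batch that crosses a boundary turns around at the same edge of the array.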

FIG. 5 is a diagram of an example design in accordance with an implementation of the present disclosure, illustrating the possible column-crossing scenarios of a serpentine scan regardless of the size of CB 113. Specifically, diagram 540 illustrates all four possible column-crossing scenarios when four PEs are involved in parallel coding tool evaluation scheme 100. As shown in diagram 540, the maximum address difference among the four possibilities equals four times the sub-block height. Similarly, diagram 530 illustrates all three possible column-crossing scenarios when three PEs are involved in parallel coding tool evaluation scheme 100; the maximum address difference among the three possibilities equals three times the sub-block height. Likewise, diagram 550 illustrates all five possible column-crossing scenarios when five PEs are involved in parallel coding tool evaluation scheme 100; the maximum address difference among the five possibilities equals five times the sub-block height.

Also shown in FIG. 5 are bank allocations 532, 542 and 552 of cache 110, which correspond to the scenarios shown in diagrams 530, 540 and 550, respectively. The banks of cache 110 may be divided into two groups, each of which may have as many banks as there are PEs. For example, in bank allocation 542, cache 110 has two groups of four banks, the first group consisting of banks "0", "1", "2" and "3" and the second group consisting of banks "4", "5", "6" and "7". The banks of the first group are assigned repeatedly, from top to bottom, to the sub-blocks of each odd-numbered column (i.e., the first, third, fifth, seventh, ninth and eleventh columns, etc.), whereas the banks of the second group are assigned repeatedly, also from top to bottom, to the sub-blocks of each even-numbered column (i.e., the second, fourth, sixth, eighth, tenth and twelfth columns, etc.). Accordingly, the sub-blocks of any two adjacent columns are stored in the two different groups of banks. As another example, in bank allocation 552, cache 110 has two groups of five banks, the first group consisting of banks "0", "1", "2", "3" and "4" and the second group consisting of banks "5", "6", "7", "8" and "9". The banks of the first group are assigned repeatedly, from top to bottom, to the sub-blocks of each odd-numbered column, whereas the banks of the second group are assigned repeatedly, also from top to bottom, to the sub-blocks of each even-numbered column. Accordingly, the sub-blocks of any two adjacent columns are stored in the two different groups of banks.
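
The two-group allocation can be sketched as a formula and checked against a column-by-column serpentine scan. The sketch assumes 0-based column and row indices (so the "first" column has index 0), reads "assigned repeatedly from top to bottom" as the row index modulo the number of PEs, and assumes each column holds at least as many sub-blocks as there are PEs. The function names are illustrative.

```python
def bank_two_groups(col: int, row: int, num_pes: int) -> int:
    """Bank index: group 0 for the 1st, 3rd, ... columns, group 1 for the 2nd, 4th, ..."""
    group = col % 2
    return group * num_pes + row % num_pes


def serpentine_columns(cols: int, rows: int) -> list[tuple[int, int]]:
    """Column-by-column serpentine order of (col, row) positions."""
    order: list[tuple[int, int]] = []
    for c in range(cols):
        row_range = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
        order.extend((c, r) for r in row_range)
    return order


def conflict_free(cols: int, rows: int, num_pes: int) -> bool:
    """True if no batch of `num_pes` consecutive sub-blocks hits the same bank twice."""
    order = serpentine_columns(cols, rows)
    for i in range(0, len(order), num_pes):
        batch = order[i:i + num_pes]
        banks = {bank_two_groups(c, r, num_pes) for c, r in batch}
        if len(banks) != len(batch):
            return False
    return True


if __name__ == "__main__":
    for num_pes in (3, 4, 5):
        results = [conflict_free(6, rows, num_pes) for rows in range(num_pes, 13)]
        print(f"{num_pes} PEs, {2 * num_pes} banks: conflict-free for all heights =", all(results))
```

The design choice is that a batch crossing a column boundary takes its sub-blocks from two columns that use disjoint bank groups, while the sub-blocks taken from any single column occupy consecutive rows and therefore different banks within one group.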

4. Illustrative Implementation

FIG. 6 illustrates an example apparatus 600 capable of evaluating the coding efficiency of multiple coding tools using the parallelization methods described above. As shown, apparatus 600 receives video data 601 for evaluating the coding tools and accordingly determines a coding tool 660 suitable for encoding video data 601. In some embodiments, apparatus 600 may also determine a setting 666 of coding parameters to be used with the determined coding tool 660. Video data 601 may include coding block 113, and the determined coding tool 660 may be an embodiment of coding tool 160. Apparatus 600 may be used to implement parallel coding tool evaluation scheme 100 using time-interleaved cache access method 200 or 300.

As shown, apparatus 600 has several components or modules for processing video data 601 and determining coding tool 660, including at least some of the following: a processor 605, a search memory or cache 610, multiple processing elements (e.g., PEs 631-634), a memory 640 and a comparator 650. Cache 610 may include multiple banks, such as banks 611-614, each of which can provide a respective data entry simultaneously with the other banks.

In some embodiments, modules 605-650 listed above are modules of software instructions executed by one or more processing units (e.g., a processor) of a computing device or electronic apparatus. In some embodiments, modules 605-650 are modules of hardware circuits implemented by one or more integrated circuits (ICs) of an electronic apparatus. Although modules 605-650 are illustrated as separate modules, some of them can be combined into a single module.

Processor 605 is configured to receive and analyze video data 601 so as to determine a bank allocation (e.g., bank allocation 422, 532, 542 or 552). That is, the bank allocation is specific to video data 601. Processor 605 is also configured to store the sub-blocks of video data 601 in search memory 610 according to the determined bank allocation.

Cache 610 may include multiple banks, such as banks 611, 612, 613 and 614. The number of banks may match (e.g., equal) the number of banks indicated in the bank allocation determined by processor 605. Cache 610 may have more than the four banks shown in FIG. 6. For example, bank allocation 542 indicates eight different banks for the serpentine scan, and processor 605 may accordingly store video data 601 in eight different banks of cache 610. Cache 610 may include search memory 110.

Each of processing elements 631-634 may be an embodiment of one of PEs 130-133. In some embodiments, each of processing elements 631-634 may be a low-complexity RDO pipeline. In some embodiments, each of processing elements 631-634 may additionally or alternatively include a high-complexity RDO. Processing elements 631-634 may be configured to fetch a portion of video data 601 by accessing cache 610 in a time-interleaved manner (e.g., following time-interleaved method 200 or 300). The portion of video data 601 fetched at one time may include multiple sub-blocks of video data 601 (e.g., sub-blocks 0-3, 4-7 or 8-11 of portion 115 of CB 113). In some embodiments, each of processing elements 631-634 may include a queue buffer configured to temporarily store a batch of sub-blocks fetched from cache 610 until all sub-blocks of the batch have been processed through the pipeline stages of the respective processing element.

Each of processing elements 631-634 may also be configured to calculate a respective figure of merit (FOM) (e.g., FOM 140, 141, 142 or 143) that indicates the coding efficiency of the respective coding tool as applied to video data 601. The FOM is therefore specific to the combination of the respective coding tool and video data 601. The FOM may be a sum of squared differences (SSD), a sum of absolute differences (SAD) or a sum of absolute transformed differences (SATD). The FOMs calculated by processing elements 631-634 may be stored in memory 640 and used as inputs to comparator 650. In some embodiments, processing elements 631-634 may also store the coding parameters used to calculate the FOMs. In some embodiments, each of PEs 631-634 may calculate multiple FOMs for video data 601 using the same coding tool but with different coding parameter settings. That is, in these embodiments, each calculated FOM is specific to the combination of the respective coding tool, the respective coding parameters and video data 601. Each FOM and the corresponding coding parameter setting may be stored in memory 640.
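
For reference, the three FOMs named above can be sketched for a single 4x4 sub-block. The sketch assumes the difference is taken between a source block and a prediction block produced by the coding tool, and it uses an unnormalized 4x4 Hadamard transform for the SATD, which is one common formulation; the document does not specify either detail, and all names are illustrative.

```python
Block = list[list[int]]

# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4: Block = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def sad(a: Block, b: Block) -> int:
    """Sum of absolute differences."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def ssd(a: Block, b: Block) -> int:
    """Sum of squared differences."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def matmul(x: Block, y: Block) -> Block:
    """Plain matrix product of two square integer matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def satd4x4(a: Block, b: Block) -> int:
    """Sum of absolute Hadamard-transformed differences (unnormalized) of 4x4 blocks."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    transformed = matmul(matmul(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)


if __name__ == "__main__":
    source = [[10] * 4 for _ in range(4)]
    prediction = [[9] * 4 for _ in range(4)]
    print(sad(source, prediction), ssd(source, prediction), satd4x4(source, prediction))
```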

Comparator 650 may be an embodiment of comparator 150 and is configured to determine coding tool 660 by comparing the FOMs calculated by processing elements 631-634 and stored in memory 640. The comparison performed by comparator 650 identifies a preferred FOM. For example, the preferred FOM may be the SAD having the lowest value, in which case the coding tool that produced the lowest SAD value is determined to be coding tool 660. In some embodiments, comparator 650 may also determine parameter setting 666, which may be the parameter setting used by processing elements 631-634 that resulted in the preferred FOM (e.g., the lowest SAD value).
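
The comparison step reduces to picking the entry with the smallest FOM. In the sketch below, the stored results are modeled as a dictionary keyed by (coding tool, parameter setting); the tool labels follow the T0-T3 naming used in this document, while the parameter labels and FOM values are made up for illustration.

```python
def select_tool(foms: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Return the (coding tool, parameter setting) pair with the lowest FOM.

    Assumes a lower FOM is better, as with SAD, SSD or SATD. Ties are resolved
    in favor of the entry that was stored first.
    """
    return min(foms, key=lambda key: foms[key])


if __name__ == "__main__":
    stored_foms = {            # hypothetical contents of memory 640
        ("T0", "setting_a"): 120.0,
        ("T1", "setting_a"): 95.0,
        ("T1", "setting_b"): 88.0,
        ("T2", "setting_a"): 140.0,
        ("T3", "setting_a"): 101.0,
    }
    tool, setting = select_tool(stored_foms)
    print(f"selected coding tool {tool} with {setting}")
```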

5. Illustrative Process

FIG. 7 illustrates an example process 700 in accordance with an implementation of the present disclosure. Process 700 may represent an aspect of implementing the various proposed designs, concepts, schemes, systems and methods described above. More specifically, process 700 may represent an aspect of the proposed concepts and schemes pertaining to determining a coding tool among multiple coding tools in accordance with the present disclosure. Process 700 may include one or more operations, actions or functions as illustrated by one or more of blocks 710, 720, 730 and 740. Although illustrated as discrete blocks, the various blocks of process 700 may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation. Moreover, the blocks/sub-blocks of process 700 may be executed in the order shown in FIG. 7 or in a different order. Furthermore, one or more of the blocks/sub-blocks of process 700 may be executed repeatedly or iteratively. Process 700 may be implemented by or in apparatus 600 as well as any variations thereof. Solely for illustrative purposes and without limiting the scope, process 700 is described below in the context of apparatus 600. Process 700 may begin at block 710.

At 710, process 700 may involve each processing element of apparatus 600 (e.g., PEs 631-634) receiving the video data (e.g., video data 113 or 601) to be evaluated in a coding efficiency evaluation. Each processing element is configured to perform the coding efficiency evaluation for a respective coding tool (e.g., coding tool T0, T1, T2 or T3 of FIG. 1). In some embodiments, the PEs of apparatus 600 receive video data 601 by accessing cache 610 in a time-interleaved manner. That is, no more than one PE of apparatus 600 may access cache 610 at any time. In some embodiments, video data 601 may include a coding block (CB), which may be divided into multiple sub-blocks forming an array of columns and rows. The PEs of apparatus 600 may receive the CB in batches of sub-blocks, each batch having multiple sub-blocks. In some embodiments, the number of sub-blocks in a batch equals the number of PEs of apparatus 600 that operate concurrently. In some embodiments, the sub-blocks of video data 601 may be fetched by the PEs of apparatus 600 using a serpentine scan through the columns or rows of the sub-blocks of video data 601. Process 700 may proceed from 710 to 720.

At 720, process 700 may involve each PE of apparatus 600 calculating a respective FOM. In some embodiments, each PE may be an LC-RDO, and the respective FOM may be a sum of squared differences (SSD), a sum of absolute differences (SAD) or a sum of absolute transformed differences (SATD). The FOMs calculated by the PEs of apparatus 600 may be stored in memory 640. In some embodiments, the coding parameters used to calculate the FOMs may also be stored in memory 640. Process 700 may proceed from 720 to 730.

At 730, process 700 may involve comparator 650 comparing the FOMs stored in memory 640 and accordingly determining coding tool 660, which is specific to video data 601. In some embodiments, comparator 650 may also determine parameter setting 666 to be used with the determined coding tool 660. The determined parameter setting 666 may be a set of settings that includes values of multiple coding parameters. Process 700 may proceed from 730 to 740.

At 740, process 700 may involve processor 605 encoding video data 601 using the determined coding tool 660. In some embodiments, processor 605 may encode video data 601 using the determined coding tool 660 together with the determined parameter setting 666.

6. Illustrative Electronic System

Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer-readable storage medium (also referred to as a computer-readable medium). When these instructions are executed by one or more computational or processing units (e.g., one or more processors, cores of processors or other processing units), they cause the processing units to perform the actions indicated in the instructions. Examples of computer-readable media include, but are not limited to, CD-ROMs, flash drives, random-access memory (RAM) chips, hard drives, erasable programmable read-only memories (EPROMs) and electrically erasable programmable read-only memories (EEPROMs). The computer-readable media do not include carrier waves and electronic signals passing wirelessly or over wired connections.

In this specification, the term "software" is meant to include firmware residing in read-only memory or applications stored in storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the present disclosure. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.

FIG. 8 conceptually illustrates an electronic system 800 with which some embodiments of the present disclosure are implemented. Electronic system 800 may be a computer (e.g., a desktop computer, personal computer, tablet computer, etc.), a phone, a PDA or any other sort of electronic device. Such an electronic system includes various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 800 includes a bus 805, processing unit(s) 810, a graphics processing unit (GPU) 815, a system memory 820, a network 825, a read-only memory (ROM) 830, a permanent storage device 835, input devices 840 and output devices 845.

Bus 805 collectively represents all system, peripheral and chipset buses that communicatively connect the numerous internal devices of electronic system 800. For example, bus 805 communicatively connects processing unit(s) 810 with GPU 815, read-only memory 830, system memory 820 and permanent storage device 835.

From these various memory units, processing unit(s) 810 retrieves instructions to execute and data to process in order to execute the processes of the present disclosure. The processing unit(s) may be a single processor or a multi-core processor in different embodiments. Some instructions are passed to and executed by GPU 815. GPU 815 can offload various computations or complement the image processing provided by processing unit(s) 810.

Read-only memory (ROM) 830 stores static data and instructions that are used by processing unit(s) 810 and other modules of the electronic system. Permanent storage device 835, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when electronic system 800 is off. Some embodiments of the present disclosure use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as permanent storage device 835.

Other embodiments use a removable storage device (such as a floppy disk, flash memory device, etc., and its corresponding disk drive) as the permanent storage device. Like permanent storage device 835, system memory 820 is a read-and-write memory device. However, unlike storage device 835, system memory 820 is a volatile read-and-write memory, such as a random-access memory. System memory 820 stores some of the instructions and data that the processor uses at runtime. In some embodiments, processes in accordance with the present disclosure are stored in system memory 820, permanent storage device 835 and/or read-only memory 830. For example, the various memory units include instructions for processing multimedia clips in accordance with some embodiments. From these various memory units, processing unit(s) 810 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 805 is also connected to the input and output devices 840 and 845. The input devices 840 enable a user to communicate information and select commands to the electronic system. The input devices 840 include alphanumeric keyboards and pointing devices (also called "cursor control devices"), cameras (e.g., webcams), microphones or similar devices for receiving voice commands, and the like. The output devices 845 display images generated by the electronic system or otherwise output data. The output devices 845 include printers and display devices, such as cathode ray tube (CRT) or liquid crystal display (LCD) devices, as well as speakers or similar audio output devices. Some embodiments include devices, such as a touchscreen, that function as both input and output devices.

Finally, as shown in FIG. 8, the bus 805 also couples the electronic system 800 to a network 825 through a network adapter (not shown). In this manner, the computer can be part of a network of computers, such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet. Any or all components of the electronic system 800 may be used in conjunction with the present disclosure.

Some embodiments are embodied on a machine-readable or computer-readable medium (alternatively referred to as a computer-readable storage medium, machine-readable medium, or machine-readable storage medium). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid-state hard drives, read-only and recordable optical discs, ultra-density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable medium may store a computer program that is executable by at least one processing unit and that includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as that produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, many of the above-described features and applications are performed by one or more integrated circuits, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself. In addition, some embodiments execute software stored in programmable logic devices (PLDs), ROM, or RAM devices.

As used in this specification and any claims of this application, the terms "computer", "server", "processor", and "memory" all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of this specification, the terms "display" and "displaying" mean displaying on an electronic device. As used in this specification and any claims of this application, the terms "computer-readable medium", "computer-readable media", and "machine-readable medium" are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other transitory signals. While the present disclosure has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the present disclosure can be embodied in other specific forms without departing from the spirit of the present disclosure.

Supplementary Notes

The subject matter described herein sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are merely examples, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermediate components. Likewise, any two components so associated can also be viewed as being "operably connected" or "operably coupled" to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being "operably couplable" to each other to achieve the desired functionality. Specific examples of operably couplable include, but are not limited to, physically mateable and/or physically interacting components, and/or wirelessly interactable and/or wirelessly interacting components, and/or logically interacting and/or logically interactable components.

Further, with respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for the sake of clarity.

Moreover, it will be understood by those skilled in the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims), are generally intended as "open" terms; e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "includes" should be interpreted as "includes but is not limited to," etc. It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to implementations containing only one such recitation, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an"; e.g., "a" and/or "an" should be interpreted to mean "at least one" or "one or more." The same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number; e.g., the bare recitation of "two recitations," without other modifiers, means at least two recitations, or two or more recitations. Furthermore, in those instances where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention; e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc. In those instances where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention; e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc. It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase "A or B" will be understood to include the possibilities of "A" or "B" or "A and B."

From the foregoing, it will be appreciated that various implementations of the present disclosure have been described herein for purposes of illustration, and that various modifications may be made without departing from the scope and spirit of the present disclosure. Accordingly, the various implementations disclosed herein are not intended to be limiting, with the true scope and spirit being indicated by the appended claims.

Claims

1. A video data encoding method, comprising:

receiving video data by a plurality of processing elements (PEs), each processing element configured to perform an encoding efficiency evaluation on a respective encoding tool;

performing, by each of the plurality of processing elements, the encoding efficiency evaluation to calculate a respective figure of merit (FOM) specific to the respective encoding tool and the video data;

determining an encoding tool specific to the video data by comparing the figures of merit calculated by the plurality of processing elements; and

encoding the video data using the determined encoding tool.
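For illustration only, the comparison step recited in claim 1 can be sketched in software as follows. This is a minimal sketch, not the claimed hardware: each "processing element" is modeled as a plain function, the figure of merit is assumed to be a distortion cost where lower is better, and the two toy predictors and all names (`select_encoding_tool`, `mean_predictor`, `first_sample_predictor`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Block = List[int]
# Hypothetical model of a processing element: block of samples -> figure of merit.
ProcessingElement = Callable[[Block], float]


def sad(block: Block, prediction: Block) -> int:
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(a - b) for a, b in zip(block, prediction))


def mean_predictor(block: Block) -> float:
    """Toy tool (assumption): predict every sample by the integer block mean."""
    mean = sum(block) // len(block)
    return float(sad(block, [mean] * len(block)))


def first_sample_predictor(block: Block) -> float:
    """Toy tool (assumption): predict every sample by the first sample."""
    return float(sad(block, [block[0]] * len(block)))


def select_encoding_tool(
    block: Block, elements: Dict[str, ProcessingElement]
) -> Tuple[str, Dict[str, float]]:
    """Evaluate one FOM per tool, then compare them.

    Assumes a lower FOM is better; ties go to the tool listed first.
    """
    foms = {tool: pe(block) for tool, pe in elements.items()}
    best = min(foms, key=foms.get)
    return best, foms


elements = {
    "mean_predictor": mean_predictor,
    "first_sample_predictor": first_sample_predictor,
}
best_tool, foms = select_encoding_tool([10, 12, 14, 16], elements)
print(best_tool, foms)
```

In this sketch the evaluations run one after another; the claims describe separate processing elements, so a hardware realization would evaluate the tools concurrently and feed the results to a comparator.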

2. The video data encoding method of claim 1, wherein no more than one of the plurality of processing elements receives the video data at any given time.

3. The video data encoding method according to claim 1, wherein:

the video data includes a coding block (CB) divided into a plurality of sub-blocks, the plurality of sub-blocks forming an array of columns and rows,

the plurality of processing elements includes a first number of processing elements,

receiving the video data by the plurality of processing elements comprises each of the plurality of processing elements receiving the plurality of sub-blocks of the coding block in batches, each batch including a second number of the plurality of sub-blocks, and

the second number is equal to the first number.

4. The video data encoding method of claim 3, wherein receiving the video data by the plurality of processing elements comprises each of the plurality of processing elements receiving the plurality of sub-blocks using a serpentine scan through the columns or the rows.
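As a non-limiting illustration of claims 3 and 4, the following sketch generates one possible serpentine visiting order over an array of sub-blocks and groups it into fixed-size batches. It assumes a row-wise serpentine (even rows left to right, odd rows right to left); the claims equally allow a column-wise scan, and the function names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of a sub-block in the array


def serpentine_order(rows: int, cols: int) -> List[Coord]:
    """Visit rows top to bottom, reversing direction on every other row."""
    order: List[Coord] = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order


def batches(order: List[Coord], batch_size: int) -> List[List[Coord]]:
    """Split the scan order into consecutive batches of batch_size sub-blocks."""
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


scan = serpentine_order(2, 3)
print(scan)
print(batches(scan, 2))
```

Because the direction reverses at the end of each row, consecutive sub-blocks in the scan are always spatial neighbors, which is the property that distinguishes a serpentine scan from a raster scan.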

5. The video data encoding method according to claim 3, further comprising:

storing, by each of the plurality of processing elements, the second number of sub-blocks in a respective buffer accessible by the respective processing element.

6. The video data encoding method according to claim 3, wherein each of the plurality of processing elements has a pipeline structure comprising a plurality of stages, and wherein calculating the respective figure of merit comprises sequentially processing the second number of sub-blocks through the plurality of stages.

7. The video data encoding method according to claim 1, wherein:

receiving the video data includes receiving video data stored in a memory having a plurality of memory banks, the plurality of memory banks including a first group having a third number of memory banks and a second group having a fourth number of memory banks, and

each pair of adjacent columns or rows of the array is stored in the first group and the second group of memory banks, respectively.
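One way to picture the bank arrangement of claim 7 is the address mapping sketched below. Only the alternation of adjacent columns between two bank groups comes from the claim; the rule that selects a bank inside a group (row index modulo the group size) and the name `bank_for_subblock` are assumptions made for this example.

```python
def bank_for_subblock(row: int, col: int, banks_per_group: int) -> int:
    """Map a sub-block at (row, col) to a global bank index.

    Adjacent columns alternate between group 0 and group 1 (per claim 7).
    Within a group, the bank is chosen as row % banks_per_group, which is
    an assumption of this sketch rather than a detail given in the claims.
    """
    group = col % 2
    bank_in_group = row % banks_per_group
    return group * banks_per_group + bank_in_group


# With 4 banks per group, banks 0..3 form group 0 and banks 4..7 form group 1.
for col in range(2):
    print([bank_for_subblock(row, col, 4) for row in range(4)])
```

Under this assumed mapping, a batch of consecutive sub-blocks in one column falls in distinct banks of a single group, while the neighboring column uses the other group, so the two columns can be accessed without contending for the same bank.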

8. The video data encoding method according to claim 7, wherein:

the plurality of processing elements includes a first number of processing elements, and

each of the third number and the fourth number is equal to the first number.

9. The video data encoding method of claim 1, wherein each of the plurality of processing elements is a low-complexity rate-distortion optimizer (LC-RDO), and wherein the figure of merit comprises a sum of squared differences (SSD), a sum of absolute differences (SAD), or a sum of absolute transformed differences (SATD).
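The three distortion measures named in claim 9 can be illustrated with the short sketch below. It operates on 4x4 blocks given as lists of rows, and the SATD variant shown uses an unnormalized 4x4 Hadamard transform; block size, normalization, and function names are assumptions of this example, since the claims do not fix them.

```python
from typing import List

Matrix = List[List[int]]

# 4x4 Hadamard matrix; it is symmetric, so it equals its own transpose.
H4: Matrix = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def residual(orig: Matrix, pred: Matrix) -> Matrix:
    """Element-wise difference between the original and predicted blocks."""
    return [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two small integer matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def sad(orig: Matrix, pred: Matrix) -> int:
    """Sum of absolute differences."""
    return sum(abs(v) for row in residual(orig, pred) for v in row)


def ssd(orig: Matrix, pred: Matrix) -> int:
    """Sum of squared differences."""
    return sum(v * v for row in residual(orig, pred) for v in row)


def satd_4x4(orig: Matrix, pred: Matrix) -> int:
    """Sum of absolute values of H4 * residual * H4 (no normalization)."""
    transformed = matmul(matmul(H4, residual(orig, pred)), H4)
    return sum(abs(v) for row in transformed for v in row)


flat = [[2] * 4 for _ in range(4)]
zeros = [[0] * 4 for _ in range(4)]
print(sad(flat, zeros), ssd(flat, zeros), satd_4x4(flat, zeros))
```

SAD and SSD work directly on the residual, while SATD first transforms it, so a flat residual concentrates into a single coefficient and an isolated error spreads over all sixteen.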

10. The video data encoding method according to claim 1, further comprising:

determining a set of encoding parameters associated with the determined encoding tool.

11. A video data encoding apparatus comprising:

a cache memory having a plurality of banks;

a processor configured to store video data in the cache memory according to a block allocation scheme specific to the video data;

a plurality of processing elements (PEs), each processing element configured to calculate a respective figure of merit (FOM) specific to a respective encoding tool and the video data; and

a comparator configured to determine an encoding tool specific to the video data by comparing the figures of merit calculated by the plurality of processing elements.

12. The video data encoding apparatus of claim 11, wherein no more than one of the plurality of processing elements accesses the cache memory at any given time.

13. The video data encoding apparatus according to claim 11, wherein:

each of the plurality of processing elements accesses the cache memory to fetch a plurality of sub-blocks of the CB in batches, each batch including a second number of sub-blocks, and

the second number is equal to the first number.

14. The video data encoding apparatus of claim 13, wherein each processing element of the plurality of processing elements accesses the cache memory using a serpentine scan through columns or rows of the plurality of sub-blocks.

15. The video data encoding apparatus of claim 13, wherein each processing element of the plurality of processing elements is configured to store the second number of sub-blocks in a respective buffer accessible by the respective processing element.

16. The video data encoding apparatus of claim 13, wherein each of the plurality of processing elements has a pipeline structure including a plurality of stages, and wherein each of the plurality of processing elements is configured to sequentially process the second number of sub-blocks through the plurality of stages to calculate the respective figure of merit.

17. The video data encoding apparatus according to claim 11, wherein:

the plurality of memory banks includes a first group having a third number of memory banks and a second group having a fourth number of memory banks, and

each pair of adjacent columns or rows of the array is stored in the first group and the second group of memory banks, respectively.

18. The video data encoding apparatus according to claim 17, wherein:

each of the third number and the fourth number is equal to the first number.

19. The video data encoding apparatus of claim 11, wherein each of the plurality of processing elements is a low-complexity rate-distortion optimizer (LC-RDO), and wherein the figure of merit comprises a sum of squared differences (SSD), a sum of absolute differences (SAD), or a sum of absolute transformed differences (SATD).

20. The video data encoding apparatus of claim 11, wherein the comparator further determines a set of encoding parameters associated with the determined encoding tool.