RATE DISTORTION OPTIMISATION IMPLEMENTATION AND METHOD

FIELD OF THE INVENTION
The present disclosure generally relates to rate distortion optimisation for video coding, typically in relation to encoding schemes, such as codecs, involving decisions between different modes to apply during encoding.
BACKGROUND
A hybrid backwards-compatible coding technology has been previously proposed, for example, in WO 2013/171173, WO 2014/170819, WO 2019/141987, and WO 2018/046940, the contents of which are incorporated herein by reference. Further examples of tier-based coding formats include ISO/IEC MPEG-5 Part 2 LCEVC (hereafter 'LCEVC'). LCEVC has been described in WO 2020/188273A1, GB 2018723.3, WO 2020/188242, and the associated standard specification documents including ISO/IEC 23094-2 MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC), First edition October 2021; all of these documents being incorporated by reference herein in their entirety.
In these coding formats a signal is decomposed into multiple "echelons" (also known as "hierarchical tiers") of data, each corresponding to a "Level of Quality", from the highest echelon at the sampling rate of the original signal to a lowest echelon. The lowest echelon is typically a low-quality rendition of the original signal and the other echelons contain information on corrections to apply to a reconstructed rendition to produce the final output.
LCEVC adopts this multi-layer approach where any base codec (for example Advanced Video Coding (AVC), also known as H.264, or High Efficiency Video Coding (HEVC), also known as H.265) can be enhanced via an additional low bitrate stream. LCEVC is defined by two component streams: a base stream, typically decodable by a hardware decoder, and an enhancement stream consisting of one or more enhancement layers suitable for software processing implementation with sustainable power consumption.
In the specific LCEVC example of these tiered formats, the process works by encoding a lower resolution version of a source image using any existing codec (the base codec) and the difference between the reconstructed lower resolution image and the source using a different compression method (the enhancement).
The remaining details that make up the difference with the source are efficiently and rapidly compressed with LCEVC, which uses specific tools designed to compress residual data. The LCEVC enhancement compresses residual information on at least two layers, one at the resolution of the base to correct artefacts caused by the base encoding process and one at the source resolution that adds details to reconstruct the output frames. Between the two reconstructions, the picture is optionally upscaled using either a normative up-sampler or a custom one specified by the encoder in the bitstream. In addition, LCEVC also performs some non-linear operations called residual prediction, which further improve the reconstruction process preceding residual addition, collectively producing a low-complexity smart content-adaptive (i.e., encoder-driven) upscaling.
In each layer of the enhancement stream, there are frames, which are divided into (transform) blocks, with a certain number of blocks being able to form a tile. As part of the encoding process, analysis and various decisions are made to optimise the encoding.
Rate distortion optimisation (RDO) is a method of improving video quality in video compression and refers to the optimisation of the amount of distortion (i.e. loss of video quality) against the amount of data required to encode the video (the rate). In LCEVC, an RDO formula is used to decide between encoding a transform block in different temporal modes.
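As background, the RDO trade-off is classically expressed as a Lagrangian cost J = D + λR, where D is the distortion and R the rate. The sketch below illustrates this classical formulation only; the lambda value, the distortion and rate figures, and the function names are illustrative assumptions, not the formula actually used by an LCEVC encoder.

```python
# Illustrative sketch of a classical Lagrangian rate-distortion cost,
# J = D + lambda * R, used to compare candidate modes. All numeric
# values are assumptions for demonstration only.

def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Combined rate-distortion cost: lower is better."""
    return distortion + lam * rate_bits

def choose_mode(costs: dict) -> str:
    """Pick the mode with the lowest combined cost."""
    return min(costs, key=costs.get)

costs = {
    "intra": rd_cost(distortion=120.0, rate_bits=40.0, lam=0.5),  # 140.0
    "inter": rd_cost(distortion=100.0, rate_bits=90.0, lam=0.5),  # 145.0
}
print(choose_mode(costs))  # intra
```

Note how a mode with lower distortion (inter here) can still lose the comparison when its rate contribution is large, which is exactly the trade-off RDO arbitrates.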
These temporal modes are "Intra" and "Inter". A decision to use the Intra mode results in a set of values representing differences between a preceding block and a block being processed being used during the encoding. Alternatively, a decision to use the Inter mode results in a set of stored values representing differences between a preceding block and an earlier block than the preceding block being used in the encoding.
In development of the LCEVC standard, most of the work has focused on ensuring that this decision between the Intra and Inter temporal modes is the right one, and that choosing Inter or Intra brings the greatest video quality benefit for the least possible bitrate.
From analysing the LCEVC bitstream, we have found that this temporal signalling by indicating a temporal mode can be costly, taking up a significant portion of the bitstream. There is therefore a need to reduce the cost of temporal signalling and other similar processes.
SUMMARY OF INVENTION
According to a first aspect, there is provided a rate distortion optimisation apparatus configured to designate a temporal mode to be applied to an encoded enhancement signal for an encoded video signal, the encoded enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded enhancement signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks, the apparatus being configured to: receive a current frame of the encoded enhancement signals to be processed; calculate a cost of sending temporal mode signalling data for each of a first temporal mode and a second temporal mode for the encoded enhancement signal of a current block of the current frame, the first temporal mode not applying non-zero values from a temporal buffer for generating the encoded enhancement signal and the second temporal mode applying non-zero values from the temporal buffer for generating the encoded enhancement signal; assess a motion estimate of the current frame relative to a motion quantity range; identify a temporal mode preference for the current frame based on the assessment; generate a weighting based on the motion estimate; update, when the temporal mode preference matches the mode of the first temporal mode and the second temporal mode with the lower cost, the calculated cost of the mode of the first temporal mode or the second temporal mode with the lower cost by applying the weighting; and designate the temporal mode for the current block as the mode of the first temporal mode and the second temporal mode with the lower cost based on the calculated cost.
This optimises the rate and distortion decision for temporal signalling by providing a preference for the Intra or Inter mode to apply to a block based on motion in a corresponding frame. This reduces uncertainty in the temporal mode decision, which improves performance relative to the amount of motion in a sequence of blocks and frames. Further, this allows favouring of one decision over another, such as favouring Intra decisions when a sequence is moving and favouring Inter decisions when a sequence is static.
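The per-block decision flow of the first aspect can be sketched as follows. The function name, the mode labels, and the use of a multiplicative weighting applied to the cheaper mode are illustrative assumptions about one possible implementation, not the normative procedure.

```python
# Hypothetical sketch of the per-block designation described above.
# The cheaper mode's cost is reduced by the weighting only when the
# identified preference agrees with it (an assumed interpretation).

def designate_mode(cost_intra: float, cost_inter: float,
                   preference, weighting: float) -> str:
    """Designate a temporal mode for the current block.

    preference: "intra", "inter" or None (no preference);
    weighting: multiplier in [0, 1] applied to the lower cost.
    """
    cheaper = "intra" if cost_intra < cost_inter else "inter"
    if preference == cheaper:
        # Preference matches the lower-cost mode: update that cost.
        if cheaper == "intra":
            cost_intra *= weighting
        else:
            cost_inter *= weighting
    # Designate the mode with the lower (possibly updated) cost.
    return "intra" if cost_intra <= cost_inter else "inter"

print(designate_mode(10.0, 12.0, "intra", 0.5))  # intra
```

In this sketch the weighting only reinforces a decision the cost comparison would already have made; its effect is to widen the margin, reducing uncertainty in borderline cases.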
Each frame of the video signal may have a frame area divided into a plurality of tiles and/or a plurality of blocks, and the set of residuals includes residual values for a subset of the plurality of tiles or a subset of the plurality of blocks.
A "block" is intended to be a coding unit for encoding and decoding. For example, the size of a block may depend upon a directional decomposition transform used in the encoding and decoding.
A "tile" is intended to be a group of blocks that cover a region of a frame. A tile size parameter may be included as overhead in the video signal. Increased tile size can reduce overhead for controlling temporal prediction on a per-tile basis, while reduced tile size can increase the flexibility of temporal prediction.
Additionally, a frame may be partitioned into a plurality of "planes", and each plane may be partitioned into tiles and/or blocks. Planes may for example be colour channels which combine to give a multicolour image.
Assessing the motion estimate of the current frame may be relative to a continuous scale within the motion quantity range, or may be stepped or segmented. Typically, the assessment of the motion estimate of the current frame relative to the motion quantity range may include an assessment of the motion estimate relative to a lower threshold and an upper threshold of the motion quantity range, and the apparatus may be configured to identify the temporal mode preference for the current frame as the first temporal mode when the motion estimate is equal to or lower than the lower threshold and as the second temporal mode when the motion estimate is equal to or greater than the upper threshold. This allows a clear differentiation between modes at predetermined points on the range.
As an alternative to the lower threshold and upper threshold, there may be a first threshold and a second threshold. In that case, optionally, the first threshold may be towards a lower end of the motion quantity range and the second threshold may be at an upper end of the range, or, more simply, the first threshold may be a lower threshold than the second threshold. Regardless, the lower threshold or first threshold may be in one half of the motion quantity range, and the upper threshold or second threshold may be in the other half of the range, such as the lower threshold or first threshold being in the lower half of the motion quantity range, and/or the upper threshold or second threshold being in the upper half of the range.
The motion estimate may be in the range of 0 to 1. A 0 on this range may represent a movement end of the range, such as completely moving. A 1 on this range may represent a static end of the range, such as completely static. Of course, these may be switched in some implementations.
The first temporal mode may correspond to the Intra temporal mode.
The second temporal mode may correspond to the Inter temporal mode.
Typically, the temporal mode preference may be a first temporal mode preference or a second temporal mode preference, also referred to as a preference for the first temporal mode or a preference for the second temporal mode. This may also be considered as favouring the first temporal mode or the second temporal mode.
There may be circumstances where a temporal mode preference for the first temporal mode or the second temporal mode is always arrived at when identifying a temporal mode preference. There may, however, be a third choice of the temporal mode preference, which is no preference, or a plurality of further choices beyond just the first temporal mode and the second temporal mode. Consistent with this, the temporal mode preference for the current frame may be identified as preferring neither the first temporal mode nor the second temporal mode. This may be when the motion estimate is between the lower threshold and the upper threshold on the motion quantity range. This allows no preference to be set, such as when the amount of motion is not easily distinguishable as being more static or more mobile, or is not significant enough either way.
Should there be an identification resulting in no preference or not preferring either the first temporal mode or the second temporal mode, this still allows a temporal mode to be designated as set out above. There is less likely to be an update to the cost of the temporal mode, however.
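The threshold-based identification described above, including the no-preference middle band, can be sketched as follows. The threshold values and the 0-to-1 scale orientation are illustrative assumptions (here following the example mapping in which a low estimate corresponds to the first temporal mode).

```python
# Sketch of identifying a temporal mode preference from a motion
# estimate on a 0..1 motion quantity range. The threshold values are
# illustrative assumptions, not values taken from the standard.

LOWER_THRESHOLD = 0.3   # at or below: prefer the first temporal mode
UPPER_THRESHOLD = 0.7   # at or above: prefer the second temporal mode

def identify_preference(motion_estimate: float):
    """Return the preferred mode, or None between the thresholds."""
    if motion_estimate <= LOWER_THRESHOLD:
        return "first"    # e.g. Intra
    if motion_estimate >= UPPER_THRESHOLD:
        return "second"   # e.g. Inter
    return None           # neither mode preferred

print(identify_preference(0.2))  # first
print(identify_preference(0.5))  # None
print(identify_preference(0.9))  # second
```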
The weighting may be generated with a value between 0 and 1 or may be generated on another scale. Typically, the apparatus may be configured to generate a weighting of between 20% and 80%, 30% and 70%, 40% and 60%, or 50% when the motion estimate of the current frame is equal to the lower threshold or upper threshold. This may be a weighting of 0.5 when a scale between 0 and 1 is used for the weighting. This provides a significant cost reduction when, based on the motion estimate, the temporal preference would be considered significant.
The weighting may be calculated by any suitable means relative to the motion estimate. Typically, the weighting decreases from a mid-point of the motion quantity range towards each end of the motion quantity range (based on how the motion estimate corresponds to the motion quantity range). This provides a reduced cost when the temporal preference is each of the first temporal mode and second temporal mode. This may result in the decrease being symmetrical about the mid-point of the motion quantity range, but can be asymmetric.
Additionally or alternatively, the weighting may be any particular shape as a function of the motion estimate and/or the motion quantity range, such as a polynomial of the third degree. Typically, the weighting may be a staircase function or exponential function of the motion estimate and/or the motion quantity range.
This makes calculation or setting of the weighting simple as can be identified from a formula, set for sections of a range or from a look-up table.
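Two of the weighting shapes discussed above can be sketched as follows: a symmetric piecewise-linear shape decreasing from the mid-point of the range, and a staircase shape set per section of the range. All constants (the 0.5 value at thresholds of 0.3 and 0.7, and the step boundaries) are illustrative assumptions.

```python
# Illustrative weighting shapes that decrease from the mid-point of a
# 0..1 motion quantity range towards each end, passing through 0.5 at
# assumed thresholds of 0.3 and 0.7. All constants are assumptions.

def triangular_weighting(m: float) -> float:
    """Symmetric piecewise-linear shape: 1.0 at the mid-point,
    0.5 at the assumed thresholds, clamped at 0 near the ends."""
    return max(0.0, 1.0 - 2.5 * abs(m - 0.5))

def staircase_weighting(m: float) -> float:
    """Stepped alternative, set for sections of the range."""
    if m <= 0.3 or m >= 0.7:
        return 0.5
    if m <= 0.4 or m >= 0.6:
        return 0.75
    return 1.0

print(round(triangular_weighting(0.3), 6))  # 0.5
print(staircase_weighting(0.5))             # 1.0
```

A lower weighting multiplier near the ends of the range gives a larger cost reduction exactly where the motion evidence for a preference is strongest, consistent with the behaviour described above.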
The apparatus may carry out the designation based only on the motion estimate and calculated cost(s) having received the current frame. Typically though, the apparatus may be further configured to receive a motion standard deviation threshold, and, when the motion standard deviation threshold is a value other than a default value, the apparatus is further configured to identify a variation amount of the motion estimate over a predetermined set of frames, and identify the temporal mode preference as the first temporal mode when the variation amount is higher than the motion standard deviation threshold. If the variation of the motion estimate is high across multiple frames, this pushes the preference to being a preference for the first temporal mode. This allows the significance of motion in the current frame to be more limited by taking into account motion across at least a portion of a sequence.
When this is implemented, this may, of course, result in assessing a motion estimate for each frame of the predetermined set of frames before identifying a variation amount.
The predetermined set of frames may be any plurality of frames. Typically, the predetermined set of frames includes or consists of six frames.
The predetermined set of frames could be a fixed group of frames. Typically, however, the predetermined set of frames is a changing group of frames, the group of frames being based on the current frame. The group of frames may include the current frame, of which the current frame may be the last frame, first frame or a middle frame or may be the frame immediately preceding or immediately after the group of frames.
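The variation check over a predetermined set of frames can be sketched as below, using a sliding window of six frames per the typical value mentioned above. The threshold value and the use of population standard deviation are illustrative assumptions.

```python
# Sketch of the motion standard deviation check across a changing
# group of frames (a sliding window of six, per the typical value).

import statistics
from collections import deque

WINDOW = 6

def prefer_first_mode_on_variation(motion_history,
                                   std_threshold: float) -> bool:
    """True when the motion estimate varies across the window by more
    than the motion standard deviation threshold, pushing the
    preference towards the first temporal mode."""
    if len(motion_history) < WINDOW:
        return False  # not enough frames assessed yet
    return statistics.pstdev(motion_history) > std_threshold

history = deque(maxlen=WINDOW)  # window slides with the current frame
for m in [0.1, 0.9, 0.2, 0.8, 0.1, 0.95]:  # highly varying motion
    history.append(m)
print(prefer_first_mode_on_variation(history, std_threshold=0.2))  # True
```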
Further or alternatively, with regard to the apparatus being able to carry out the designation based only on the motion estimate and calculated cost(s) having received the current frame, typically, the apparatus may be further configured to receive an indicator of a quantity of static elements in the current frame, and wherein, if the motion estimate is equal to or lower than a base threshold and the indicator is equal to or lower than a static element threshold, designating the temporal mode of the current block as the first temporal mode. This pushes the apparatus towards designating the first temporal mode for the current block when there is high motion and a low amount of static elements in the corresponding frame, which reduces other processing requirements.
The apparatus may be still further configured to designate the temporal mode of all the blocks of the current frame as the first temporal mode with the designation of the temporal mode of the current block. This results in a designation of the same temporal mode for all blocks of the current frame, which reduces processing by a significant amount by allowing processing to move on to the next frame.
Additionally or alternatively, the apparatus may be further configured to set a flag when the motion estimate is equal to or lower than a base threshold and the indicator is equal to or lower than a static element threshold, the flag being configured as readable as a temporal buffer refresh flag (such as by being a temporal buffer refresh flag), thereby causing values in the temporal buffer to be reset or refreshed. This allows the bypassing of other processing for a whole frame, such as by designating the temporal mode for each block of the current frame as the first temporal mode.
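The whole-frame shortcut described in the preceding paragraphs can be sketched as follows. The threshold values, and the reading of a low motion estimate as the "moving" end of the scale, are illustrative assumptions.

```python
# Sketch of the whole-frame shortcut: with a motion estimate at or
# below a base threshold and few static elements, every block of the
# frame is designated the first temporal mode and a temporal buffer
# refresh is flagged. Threshold values are illustrative assumptions.

BASE_THRESHOLD = 0.2          # low estimate taken as the "moving" end
STATIC_ELEMENT_THRESHOLD = 10

def frame_shortcut(motion_estimate: float, static_elements: int):
    """Return (designate_whole_frame_first_mode, refresh_buffer)."""
    if (motion_estimate <= BASE_THRESHOLD
            and static_elements <= STATIC_ELEMENT_THRESHOLD):
        return True, True     # skip per-block decisions; reset buffer
    return False, False       # fall back to per-block processing

print(frame_shortcut(0.1, 3))  # (True, True)
print(frame_shortcut(0.5, 3))  # (False, False)
```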
The apparatus may be further configured to receive a refresh parameter in the video signal, and to determine whether to refresh the temporal buffer based on the refresh parameter. The refresh parameter may be part of an encoded frame. Alternatively, the refresh parameter may be part of a control component of the video signal, separate from the encoded frame. For example, the refresh parameter may be included in the video signal in a similar way to the "temporal_enabled" parameter of WO 2020/188272, which is incorporated herein by reference.
The apparatus may be used at a pre-analysis, rate control or encode stage, such as at a rate controller, during analysis, or before or during encode.
According to a second aspect, there is provided a method of designating a temporal mode to be applied to an encoded enhancement signal for an encoded video signal, the encoded enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded enhancement signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks, the method comprising: receiving a current frame of the encoded enhancement signals to be processed; calculating a cost of sending temporal mode signalling data for each of a first temporal mode and a second temporal mode for the encoded enhancement signal of a current block of the current frame, the first temporal mode not applying non-zero values from a temporal buffer for generating the encoded enhancement signal and the second temporal mode applying non-zero values from the temporal buffer for generating the encoded enhancement signal; assessing a motion estimate of the current frame relative to a motion quantity range; identifying a temporal mode preference for the current frame based on the assessment; generating a weighting based on the motion estimate; updating, when the temporal mode preference matches the mode of the first temporal mode and the second temporal mode with the lower cost, the calculated cost of the mode of the first temporal mode or the second temporal mode with the lower cost by applying the weighting; and designating the temporal mode for the current block as the mode of the first temporal mode and the second temporal mode with the lower cost based on the calculated cost.
According to a third aspect, there is provided an encoder configured to encode an input video into one or more encoded enhancement signals, the one or more encoded signals being suitable for combining with a base encoded stream to reconstruct the input video, the encoder being configured to: receive an input video comprising respective frames; receive a decoded base encoded stream; generate one or more enhancement signals, each enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal and each enhancement signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks; determine a temporal mode to be applied to at least one of the enhancement signals, wherein the temporal mode is one of a first temporal mode and a second temporal mode, the first temporal mode not applying non-zero values from a temporal buffer for generating the at least one enhancement signal and the second temporal mode applying non-zero values from the temporal buffer for generating the at least one enhancement signal, the encoder being configured to produce the determination by: receiving a current frame of the at least one enhancement signal to be processed, calculating a cost of sending temporal mode signalling data of a current block of the current frame for each of the first temporal mode and the second temporal mode, assessing a motion estimate of the current frame relative to a motion quantity range, identifying a temporal mode preference of the current frame based on the assessment, generating a weighting based on the motion estimate, updating, when the temporal mode preference matches the mode of the first temporal mode and the second temporal mode with the lower cost, the cost of the mode of the first temporal mode or the second temporal mode with the lower cost by applying the weighting, and determining the temporal mode for the current block of the current frame as the mode of the first temporal mode and the second temporal mode with the lower cost based on the calculated cost; and generate the one or more encoded enhancement signals from the one or more enhancement signals based on data derived from the base encoded stream and the input video according to the determined temporal mode, wherein generating the one or more encoded enhancement signals comprises applying a transform to each of a series of blocks of the plurality of blocks.
The encoder may be configured to generate the one or more encoded enhancement signals only based on data derived from the base encoded stream and the input video according to the determined temporal mode. Typically, however, the encoder may be further configured to generate the one or more encoded enhancement signals further based on the identified temporal preference and/or generated weighting. While the identified temporal preference and generated weighting may feed into the determination of the temporal mode, this explicitly causes the one or more encoded enhancement signals to be generated based on these, meaning they may be passed to a portion of the encoder generating the encoded enhancement signal.
The encoder may further comprise the temporal buffer. The encoder may then be configured to assess a quantity of static elements in the current frame, and, when the motion estimate is at or below a base threshold and the quantity of static elements in the current frame is less than or equal to a static element threshold, to designate the temporal mode of the current block as the first temporal mode and reset the values in the temporal buffer. The encoder may include the rate distortion optimisation apparatus of the first aspect. The rate distortion optimisation apparatus of the first aspect may then be configured to provide the encoder's determination of the temporal mode of the current block.
Values generated during pre-analysis may be received by the rate distortion optimisation apparatus. Further, the values generated during pre-analysis may be passed to the encoder. This may be as part of the generation of the one or more encoded enhancement signals.
According to a fourth aspect, there is provided a method of encoding an input video into one or more encoded enhancement signals, the one or more encoded signals being suitable for combining with a base encoded stream to reconstruct the input video, the method comprising: receiving an input video comprising respective frames; receiving a decoded base encoded stream; generating one or more enhancement signals, each enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal and each enhancement signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks; determining a temporal mode to be applied to at least one of the enhancement signals, wherein the temporal mode is one of a first temporal mode and a second temporal mode, the first temporal mode not applying non-zero values from a temporal buffer for generating the at least one enhancement signal and the second temporal mode applying non-zero values from the temporal buffer for generating the at least one enhancement signal, the determination comprising: receiving a current frame of the at least one enhancement signal to be processed, calculating a cost of sending temporal mode signalling data of a current block of the current frame for each of the first temporal mode and the second temporal mode, assessing a motion estimate of the current frame relative to a motion quantity range, identifying a temporal mode preference of the current frame based on the assessment, generating a weighting based on the motion estimate, updating, when the temporal mode preference matches the mode of the first temporal mode and the second temporal mode with the lower cost, the cost of the mode of the first temporal mode or the second temporal mode with the lower cost by applying the weighting, and determining the temporal mode for the current block of the current frame as the mode of the first temporal mode and the second temporal mode with the lower cost based on the calculated cost; and generating the one or more encoded enhancement signals from the one or more enhancement signals based on data derived from the base encoded stream and the input video according to the determined temporal mode, wherein generating the one or more encoded enhancement signals comprises applying a transform to each of a series of blocks of the plurality of blocks.
According to a fifth aspect, there is provided an encoder configured to encode an input video into one or more encoded signals, the one or more encoded signals being suitable for combining with a base encoded stream to reconstruct the input video, each encoded enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks, the encoder being configured to: receive a binary data set to be encoded into an encoded layer using entropy encoding, wherein the binary data set is a set of elements indicative of one of a first decision or a second decision for each of a sequential set of blocks of the plurality of blocks, the number of blocks in the set of blocks matching the number of elements in the set of elements, the binary data set including a first subset of elements, a second subset of elements and a third subset of elements of which the respective blocks are sequential within each subset and the subsets are sequential; identify if all elements of the second subset are indicative of the other decision of the first decision and second decision to the decision of which all elements of the first subset and third subset are indicative; modify, when it is identified that all the elements of the second subset are indicative of the other decision of the first decision and second decision from the decision of which all elements of the first subset and third subset are indicative, the decision of each element of the second subset of elements to the other decision of the first decision and second decision; and encode the data set with the modified decision into the encoded layer using entropy encoding.
For a bitstream, decision points are often available when generating the bitstream. Some of these decisions have (only) two choices available. This can, for example, result in a decision for one of the two choices being taken for a set of one or more elements, with that choice being opposite to the decisions of the immediately preceding and immediately following corresponding sets of one or more elements. This may be referred to as an "isolated decision", since the set of elements for which that decision is made is separated in the bitstream from any other set of elements for which the same decision has been made. We have found that isolated decisions are costly patterns to signal in a bitstream. By implementing the encoder of the fifth aspect, isolated decisions can be identified and removed. This reduces the cost associated with a bitstream, such as for temporal signalling, thereby optimising temporal signalling by reducing its contribution to the bitstream. For example, when implemented for temporal signalling, we found that this reduced the bitstream cost by about 10%. Overall, an encoder according to this aspect is capable of optimising temporal signalling by reducing the contribution of temporal signalling to the bitstream cost.
The entropy encoding may be any form of entropy encoding, such as Huffman encoding or arithmetic encoding. Typically, however, the entropy encoding is run-length encoding. This enhances the cost saving since removing isolated decisions removes items that would break a run for run-length encoding. This therefore avoids breaking a run, reducing the cost.
Each subset of elements may include elements corresponding to up to a respective ten blocks, five blocks or three blocks, such as ten blocks, five blocks or three blocks.
Typically, each subset of elements includes only a single element corresponding to a single block. This provides an ability to identify individual isolated decisions instead of a group of decisions that are isolated from other groups. This also reduces the number of blocks for which an awareness is needed, and thus reduces the storage capacity required to track decisions on previous blocks by limiting it to only needing to be aware of the decisions of two blocks as well as the decision for a current block.
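The single-element case described above can be sketched as follows: an element whose two neighbours both hold the opposite decision is an isolated decision, and flipping it means it no longer breaks a run for run-length encoding. The function name is an illustrative assumption.

```python
# Sketch of removing "isolated decisions" from a binary temporal
# signalling layer before entropy (run-length) encoding, with
# single-element subsets: each element is compared only with its
# immediate neighbours, as described above.

def remove_isolated(decisions):
    """Flip any element isolated between two equal opposite neighbours."""
    out = list(decisions)
    for i in range(1, len(decisions) - 1):
        # Neighbours agree with each other but not with the middle
        # element: the middle element is an isolated decision.
        if decisions[i - 1] == decisions[i + 1] != decisions[i]:
            out[i] = decisions[i - 1]
    return out

bits = [0, 0, 0, 1, 0, 0, 1, 1, 1]   # the lone 1 breaks a run of 0s
print(remove_isolated(bits))          # [0, 0, 0, 0, 0, 0, 1, 1, 1]
```

After the modification the example collapses from four runs (three 0s, one 1, two 0s, three 1s) to two (six 0s, three 1s), illustrating why this particularly benefits run-length encoding.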
According to a sixth aspect, there is provided a method of encoding an input video into one or more encoded signals, the one or more encoded signals being suitable for combining with a base encoded stream to reconstruct the input video, each encoded enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks, the method comprising: receiving a binary data set to be encoded into an encoded layer using entropy encoding, wherein the binary data set is a set of elements indicative of one of a first decision or a second decision for each of a sequential set of blocks of the plurality of blocks, the number of blocks in the set of blocks matching the number of elements in the set of elements, the binary data set including a first subset of elements, a second subset of elements and a third subset of elements of which the respective blocks are sequential within each subset and the subsets are sequential; identifying if all elements of the second subset are indicative of the other decision of the first decision and second decision to the decision of which all elements of the first subset and third subset are indicative; modifying, when it is identified that all the elements of the second subset are indicative of the other decision of the first decision and second decision from the decision of which all elements of the first subset and third subset are indicative, the decision of each element of the second subset of elements to the other decision of the first decision and second decision; and encoding the data set with the modified decision into the encoded layer using entropy encoding.
According to a seventh aspect, there is provided an encoder according to the third aspect and according to the fifth aspect, wherein the first decision is the designation of the mode of each block of the elements of the respective subset as the first temporal mode and the second decision is the designation of the mode of each block of the elements of the respective subset as the second temporal mode.
This allows the encoders of the third aspect and fifth aspect to form part of the same encoder and to work in a complementary manner. Overall, we have found this reduces the bitstream cost while maintaining or enhancing quality of an eventual output.
Between the step of identifying if all elements of the second subset are indicative of the other decision of the first decision and second decision to the decision of which all elements of the first subset and third subset are indicative and the step of modifying the decision of each element of the second subset of elements to the other decision of the first decision and second decision, the encoder may be further configured, when it is identified that all the elements of the second subset are indicative of the other decision of the first decision and second decision from the decision of which all elements of the first subset and third subset are indicative, to: identify when the mode of the current block was determined as the mode for which the cost was updated, and encode the data set into the encoded layer using entropy encoding without modifying the decision. This gives priority to the designation arrived at in the encoder according to the third aspect over modifying the decision/designation in accordance with the fifth aspect. This avoids reversing a decision previously made, ensuring earlier processing was not carried out unnecessarily.
For LCEVC, this is because, sometimes, a decision which could provide a lower rate for a temporal signalling layer is not necessarily better than a decision which would provide a higher rate for the temporal signalling layer, but produce less distortion and/or have an overall lower rate for the coefficients to be sent.
Consistent with this, the encoder may be further configured, after the step of identifying if all elements of the second subset are indicative of the other decision of the first decision and second decision to the decision of which all elements of the first subset and third subset are indicative, to identify if the cost of the first temporal mode and the cost of the second temporal mode are equal, and wherein, when the cost of the first temporal mode and the cost of the second temporal mode are equal, the encoder may be further configured to calculate a distortion of applying the first temporal mode and a distortion of applying the second temporal mode and only modify the decision if the decision is to be modified to the decision representing the mode of the first temporal mode and the second temporal mode with the lower distortion.
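The tie-break described above can be sketched as follows; the function and its arguments are illustrative assumptions, with modes numbered 1 and 2 standing in for the first and second temporal modes:

```python
def allow_modification(cost_mode_1, cost_mode_2,
                       distortion_mode_1, distortion_mode_2,
                       proposed_mode):
    """Illustrative tie-break: when the rate costs of the two temporal modes
    are equal, permit the run-smoothing modification only if the mode being
    flipped to is the one with the lower distortion. Names are assumptions
    for illustration, not from the source."""
    if cost_mode_1 != cost_mode_2:
        return True  # no cost tie: the modification step proceeds as normal
    lower_distortion_mode = 1 if distortion_mode_1 < distortion_mode_2 else 2
    return proposed_mode == lower_distortion_mode
```

For instance, with equal costs and a lower distortion for mode 1, a proposed flip to mode 2 is rejected, whereas a flip to mode 1 is allowed.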
According to an eighth aspect, there is provided a computer program comprising instructions which, when executed, cause an apparatus to perform the method according to the fourth aspect, sixth aspect or to provide the encoder of the third aspect, fifth aspect or seventh aspect.
According to a ninth aspect, there is provided a non-transitory computer-readable medium comprising the computer program according to the eighth aspect.
BRIEF DESCRIPTION OF DRAWINGS
Example apparatus, encoders and methods are described in detail herein, with reference to the accompanying drawings, in which: Figure 1 is a schematic diagram showing a background example of an encoding process; Figure 2 is a schematic diagram showing a further background example of an encoding process; Figure 3 is a schematic diagram of an example encoder; Figure 4 is a flow diagram of a first example encoding process; Figure 5 is a graph of an example motion range against illustrative weightings; Figure 6 is a flow diagram of an example rate distortion optimisation process; Figure 7 is a flow diagram of a second example encoding process; and Figure 8 is a schematic diagram of an example non-transitory computer-readable medium.
DETAILED DESCRIPTION
This disclosure describes an implementation for integration of a hybrid backwards-compatible coding technology with existing codecs, optionally via a software update. In a non-limiting example, the disclosure relates to an implementation and integration of MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC). LCEVC is a hybrid backwards-compatible coding technology which is a flexible, adaptable, highly efficient and computationally inexpensive coding format combining a different video coding format, a base codec (i.e. an encoder-decoder pair such as AVC/H.264, HEVC/H.265, or any other present or future codec, as well as non-standard algorithms such as VP9, AV1 and others) with one or more enhancement levels of coded data.
Implementations described herein may also be suitable for other hierarchical coding schemes such as VC-6 or SHVC (scalable HEVC).
Although one or more examples have been described in relation to LCEVC, aspects may also be implemented in other hierarchical coding schemes. In such examples, the "base encoder" may correspond to an encoder configured to encode a low layer (corresponding to a low quality) of a hierarchical coding scheme. In such examples, the "enhancement encoder" may correspond to an encoder configured to encode a high layer (corresponding to a high quality, i.e. a higher quality than the low layer) of the hierarchical coding scheme. For example, the base encoder may correspond to an encoder configured to encode a lowest layer of the hierarchical coding scheme and the enhancement encoder may correspond to an encoder configured to encode a (e.g. first) enhancement layer of the hierarchical coding scheme. More generally, the base encoder may correspond to an encoder configured to encode an nth layer of a hierarchical coding scheme and the enhancement encoder may correspond to an encoder configured to encode an (n+1)th layer of the hierarchical coding scheme.
In various examples, a base encoder is itself a multi-layer encoder. Typically, a base encoder may comprise a base encoder and one or more enhancement encoders.
An enhancement encoder comprises multiple encoders in a number of examples. Typically, an enhancement encoder may comprise multiple enhancement encoders.
In some examples, a base encoding is the output of one or more layers of encoding. For example, a base encoding may be a first coding layer combined with one or more further (e.g. enhancement) coding layers.
An enhancement encoding comprises one or more layers of enhancement in various examples.
For example, a base encoding may be a base layer (i.e. output by a single-layer codec such as HEVC, VVC, and so forth) combined with a first layer of LCEVC residuals, whilst the enhancement encoding may be a second layer of LCEVC residuals. In a further example, a base encoding may be a lowest layer encoded in accordance with the SMPTE VC-6 standard combined with one or more VC-6 enhancement layers, whilst the enhancement encoding may be one or more 'higher' layer enhancement layers of the VC-6 standard.
Example hybrid backwards-compatible coding technologies use a down-sampled source signal encoded using a base codec to form a base stream. An enhancement stream is formed using an encoded set of residuals which correct or enhance the base stream for example by increasing resolution or by increasing frame rate. There may be multiple levels of enhancement data in a hierarchical structure. In certain arrangements, the base stream may be decoded by a hardware decoder while the enhancement stream may be suitable for being processed using a software implementation. Thus, streams are considered to be a base stream and one or more enhancement streams, where there are typically two enhancement streams possible but often one enhancement stream is used. It is worth noting that typically the base stream may be decodable by a hardware decoder while the enhancement stream(s) may be suitable for software processing implementation with suitable power consumption. Streams can also be considered as layers.
The video frame is encoded hierarchically as opposed to using block-based approaches as done in the MPEG family of algorithms. Hierarchically encoding a frame includes generating residuals for the full frame, and then a reduced or decimated frame and so on. In the examples described herein, residuals may be considered to be errors or differences at a particular level of quality or resolution.
For context purposes only, as the detailed structure of LCEVC is known and set out in the approved standards specification, Figure 1 illustrates an example encoder 100. The illustrated components may also be implemented as steps of a corresponding encoding process. Those skilled in the art will understand how the examples described herein are also applicable to other multi-layer coding schemes (e.g., those that use a base layer and an enhancement layer) based on the general description of LCEVC that is presented with reference to Figure 1.
In the encoder 100, an input full resolution video 102 is processed to generate various encodings. A first encoding (base encoding 110) is produced by feeding a base encoder 106 (e.g., AVC, HEVC, or any other codec) with a down-sampled version of the input video, which is produced by down-sampling 104 the input video 102. A second encoding (level 1 encoding 116, an example of an enhancement encoding) is produced by applying an encoding operation 114 to the residuals obtained by taking the difference 112 between the reconstructed base codec video and the down-sampled version of the input video. The reconstructed base codec video is obtained by decoding the output of the base encoder 106 with a base decoder 108. A third encoding (level 2 encoding 128, another example of an enhancement encoding) is produced by processing 126 the residuals obtained by taking the difference 124 between an up-sampled version of a corrected version of the reconstructed base coded video and the input video 102. The corrected version of the reconstructed base codec video is obtained by combining the reconstructed base codec video and the residuals obtained by applying a decoding operation 118 to the level 1 encoding 116.
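The dataflow of Figure 1 can be sketched as below. This is a structural sketch only, under stated assumptions: the down-sampling is a crude 2x decimation and the up-sampling a nearest-neighbour repeat (stand-ins for the real kernels), the base codec is passed in as a pair of callables, and the level 1 and level 2 encoding operations (114, 126) are omitted so raw residuals are returned.

```python
import numpy as np

def lcevc_style_encode(frame, base_encode, base_decode):
    """Sketch of the Figure 1 dataflow with stand-in kernels (not the
    LCEVC specification). Returns the base bitstream plus the level 1
    and level 2 residual planes."""
    down = frame[::2, ::2]                   # down-sampling 104 (crude 2x)
    base_bitstream = base_encode(down)       # base encoding 110
    recon = base_decode(base_bitstream)      # reconstructed base video (108)
    level1 = down - recon                    # level 1 residuals via difference 112
    corrected = recon + level1               # corrected reconstruction (118 omitted)
    up = corrected.repeat(2, axis=0).repeat(2, axis=1)  # up-sampling
    level2 = frame - up                      # level 2 residuals via difference 124
    return base_bitstream, level1, level2
```

With a lossless (identity) base codec, the level 1 residuals are zero and the level 2 residuals capture only the up-sampling error, which illustrates how the two enhancement layers divide the correction work.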
The level 2 encoding operation 126 operates with an optional level 2 temporal buffer 132, which may be used to apply temporal processing as described further below. In some examples, the level 2 temporal buffer 132 operates under the control of a temporal selection component 134. The temporal selection component 134 receives one or more of the input video 102 and the output of the down-sampling 104 to select a temporal mode in various examples. This is explained in more detail in later examples.
LCEVC can be rapidly implemented in existing encoders with a software update and is inherently backwards-compatible since devices that have not yet been updated to encode LCEVC are able to play the video using the underlying base codec, which further simplifies deployment.
In this context, there is proposed herein an encoder implementation to integrate encoding with existing systems and devices that perform base encoding and decoding. The integration is easy to deploy. It also enables the support of a broad range of encoding and player vendors, and can be updated easily to support future systems.
The proposed encoder implementation may be provided through an optimised software library for encoding MPEG-5 LCEVC enhanced streams, providing a simple yet powerful control interface or API. This allows developers flexibility and the ability to deploy LCEVC at any level of a software stack, e.g. from low-level command-line tools to integrations with commonly used open-source encoders and players. Examples generally relate to driver-level implementations and a System on a Chip (SoC) level implementation.
The terms LCEVC and enhancement may be used herein interchangeably, for example, the enhancement layer may comprise one or more enhancement streams, that is, the residuals data of the LCEVC enhancement data.
Regarding frames of a video signal, each frame may be composed of three different planes representing a different colour component. For example, each component of a three-channel YUV video may have a different plane. Each plane may then have residual data that relates to a given level of enhancement, e.g. a 'Y' plane may have a set of level 1 residual data and a set of level 2 residual data.
In certain cases, e.g. for monochrome signals, there may only be one plane; in which case, the terms frame and plane may be used interchangeably. The level-1 residuals data and the level-2 residuals data may be partitioned as follows: Residuals data is divided into blocks whose size depends on the size of the transform used. The blocks are, for example, a 2x2 block of elements if a 2x2 directional decomposition transform is used or, for example, a 4x4 block of elements if a 4x4 directional decomposition transform is used. A tile is typically a group of blocks that cover a region of a frame (e.g. an M by N region, which may be a square region). A tile is for example a 32x32 tile of elements, each element being a block in various examples. As such, each frame may be divided into a plurality of tiles, and each tile of the plurality of tiles may be divided into a plurality of blocks. For colour video, each frame may be partitioned into a plurality of planes, where each plane is divided into a plurality of tiles, and each tile of the plurality of tiles is divided into a plurality of blocks.
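The plane-to-tile-to-block partitioning above can be sketched as follows; the function name, the defaults, and the use of ceiling division to cover partial tiles at frame edges are illustrative assumptions:

```python
def partition_plane(height, width, block_size=4, tile_blocks=32):
    """Illustrative partitioning: a plane divides into transform blocks of
    `block_size` x `block_size` elements (2x2 or 4x4 depending on the
    transform), and tiles group `tile_blocks` x `tile_blocks` blocks.
    Returns the tile grid dimensions (tiles_y, tiles_x); partial tiles at
    the edges are counted via ceiling division."""
    blocks_y = height // block_size
    blocks_x = width // block_size
    tiles_y = -(-blocks_y // tile_blocks)  # ceiling division
    tiles_x = -(-blocks_x // tile_blocks)
    return tiles_y, tiles_x
```

For a hypothetical 1920x1080 plane with a 4x4 transform, this gives a 480x270 block grid and a 15x9 tile grid.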
Consistent with the above, in general, the term "residuals" as used herein refers to a difference between a value of a reference array or reference frame and an actual array or frame of data. The array may be a one or two-dimensional array that represents a coding unit. For example, a coding unit may be a 2x2 or 4x4 set of residual values that correspond to similar sized areas of an input video frame.
It should be noted that this generalised example is agnostic as to the encoding operations performed and the nature of the input signal. Reference to "residual data" as used herein refers to data derived from a set of residuals, e.g. a set of residuals themselves or an output of a set of data processing operations that are performed on the set of residuals. Throughout the present description, generally, a set of residuals includes a plurality of residuals or residual elements, each residual or residual element corresponding to a signal element, that is, an element of the signal or original data. The signal may be an image or video. In these examples, the set of residuals corresponds to an image or frame of the video, with each residual being associated with a pixel of the signal, the pixel being the signal element. Examples disclosed herein describe how these residuals may be modified (i.e. processed) to impact the encoding pipeline or the eventually decoded image while reducing overall data size. Residuals or sets may be processed on a per residual element (or residual) basis, or processed on a group basis such as per tile or per coding unit where a tile or coding unit is a neighbouring subset of the set of residuals. In one case, a tile may comprise a group of smaller coding units. A tile may comprise a 16x16 set of picture elements or residuals (e.g. an 8 by 8 set of 2x2 coding units or a 4 by 4 set of 4x4 coding units). Note that the processing may be performed on each frame of a video or on only a set number of frames in a sequence.
In general, each or both enhancement streams may be encapsulated into one or more enhancement bitstreams using a set of Network Abstraction Layer Units (NALUs). The NALUs are meant to encapsulate the enhancement bitstream in order to apply the enhancement to the correct base reconstructed frame. The NALU may for example contain a reference index to the NALU containing the base decoder reconstructed frame bitstream to which the enhancement has to be applied. In this way, the enhancement can be synchronised to the base stream and the frames of each bitstream combined to produce the decoded output video (i.e. the residuals of each frame of enhancement level are combined with the frame of the base decoded stream). A group of pictures may represent multiple NALUs.
Further, it is important that any optimisation used in the coding technology is tailored to the specific requirements or constraints of the enhancement stream and is of low complexity. Such requirements or constraints include: the potential reduction in computational capability resulting from the need for software decoding of the enhancement stream; the need for combination of a decoded set of residuals with a decoded frame; the likely structure of the residual data, i.e. the relatively high proportion of zero values with highly variable data values over a large range; the nuances of a quantized block of coefficients; and, the structure of the enhancement stream being a set of discrete residual frames separated into various components. Note that the constraints placed on the enhancement stream mean that a simple and fast entropy coding operation is essential to enable the enhancement stream to effectively correct or enhance individual frames of the base decoded video. Note that in some scenarios the base stream is also being decoded substantially simultaneously before combination, putting a strain on resources.
Returning to the aspects described above, a further encoder is generally illustrated at 10 in Figure 2. This shows an input video 102, which, in certain circumstances, is the input video described above that is later received by an encoder 100 as part of an encoding procedure, such as the encoding procedure shown in Figure 1. In keeping with this, in the example shown in Figure 2, this input video is passed through the encoder to generate an encoded video 11. In various examples, the encoded video is one or more encoded enhancement signals.
A pre-analysis module 12 is configured to receive the input video 102 in various examples. This assesses the input video and/or one or more sets of residuals as set out in WO 2023/187308, which is incorporated by reference. This typically generates various values in relation to the input video and/or residuals.
In various examples, the pre-analysis module 12 outputs an encoder parameter 20 to an encoder 14, which uses the encoder parameters as described above and below. For example, the pre-analysis module may be located in a control server (not shown), and may communicate with the encoder via an interface.
The pre-analysis module 12, in a number of examples, comprises a perception metric generator. As a first stage of pre-analysis, the perception metric generator generates a detail perception metric based on one or more frames of the input video.
In some examples, the detail perception metric may comprise an edge detection metric. A user may be more likely to notice loss of detail in the edge of an object depicted in a frame when compared to loss of detail in the bulk of the object.
The edge detection metric may be implemented using a transform. For example, this can be a directional decomposition transform, such as a Hadamard-based transform, but the pre-analysis module 12 may select between different transforms. The edge detection metric may alternatively comprise a binary choice, or a selection from a discrete set of options, such as: no edges, few edges, many edges.
In some examples, the detail perception metric comprises a motion metric based on comparing two or more frames. A user may be more likely to notice loss of detail in directional motion when compared to loss of detail in other types of motion. Furthermore, when a frame or portion of a frame is static, it may be easier for viewers to spot tiny details, and therefore it may be important to preserve residual information, e.g. a priority of certain static residual elements may be higher than a comparative set of transient residual elements. Also, sources of noise in an original video recording at higher resolutions (e.g. an L-2 enhancement stream) may lead to many small yet transient residual values (e.g. normally distributed values of -2, -1, 1 or 2); these may be given a lower priority and/or set to 0 prior to residual processing in the enhancement level encoders.
The motion metric comprises a sum of absolute differences (SAD) between a pair of frames in various examples. In certain examples, the motion metric comprises a binary choice, or a selection from a discrete set of options, such as: no motion, low motion, or high motion. The motion metric is able to be evaluated in this manner per frame pair, per block pair or per coding unit pair.
For example, a motion metric for motion between a frame m and a frame n may be based on Jn = Sum(abs(Ix,y,n - Ix,y,m)), where Ix,y,n is a value for coding unit (x,y) of frame n, and Ix,y,m is a value for coding unit (x,y) of frame m.
Furthermore, when the motion metric is based on comparing more than two frames, the motion metric may comprise a weighted sum of SAD values. For example, a detail perception metric for a frame n may be calculated by comparing frame n to each of preceding frames k and m, and the motion metric may be based on: Jn = Sum(abs(Ix,y,n - Ix,y,m)) + Sum(abs(Ix,y,n - Ix,y,k)), or Jn = wm.Sum(abs(Ix,y,n - Ix,y,m)) + wk.Sum(abs(Ix,y,n - Ix,y,k)), where wm and wk are weighting factors.
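The SAD-based metric above can be sketched directly; the function name is an assumption, and with a single preceding frame and no weights it reduces to the plain SAD of the two-frame case:

```python
import numpy as np

def motion_metric(frame_n, *prev_frames, weights=None):
    """Sketch of the motion metric above: a (optionally weighted) sum of
    per-frame SAD values, Jn = sum_i w_i * Sum(abs(I[:,:,n] - I[:,:,i])),
    over one or more preceding frames."""
    if weights is None:
        weights = [1.0] * len(prev_frames)  # unweighted sum of SADs
    return sum(w * np.abs(frame_n - prev).sum()
               for w, prev in zip(weights, prev_frames))
```

For example, two 2x2 frames differing by 1 in every element give a SAD of 4; comparing against two such frames with weights 0.5 and 0.25 gives 3.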
In a number of examples, the motion metric alternatively comprises a binary choice, or a selection from a discrete set of options, such as: no motion, low motion, or high motion. Such a selection is based on comparing the sum of absolute differences to one or more thresholds in some examples. The thresholds can in turn depend on high-level parameters for encoding, such as a required bit rate. For example, when a low bitrate is required, the threshold for determining that there is motion may be relatively high, so that most of the input video is encoded more compactly.
A first frame and a second frame used to generate the motion metric are consecutive frames of the input video in some examples, or the motion metric is generated at a reduced frequency (e.g. comparing motion between two frames separated by N intermediate frames of the input video, comparing motion between randomly sampled frames of the input video, etc.) depending on contextual requirements in various examples. The frequency of generating the motion metric may depend upon the motion metric (for example, decreasing motion metric generation frequency after generating the detail perception metric for a series of frames exhibiting low motion).
The number of times the motion metric is calculated is able to be reduced by reusing the same calculation for forward and backward motion. In other words, when a motion metric is calculated by comparing frames m and n, this motion metric may be used when generating a detail perception metric for frame m and when generating a detail perception metric for frame n. For example, adjacent frames may be paired up, with the motion metric calculated once for each pair of frames (i.e. a motion metric is calculated for frames 1 and 2, for frames 3 and 4, for frames 5 and 6, etc.).
The detail perception metric may comprise a combination of metrics. For example, the detail perception metric may comprise an edge detection metric based on a second frame and a motion metric based on a difference between first and second frames.
In a number of examples, the pre-analysis module 12 further comprises a feature extractor. As part of a first stage of pre-analysis, the feature extractor may extract additional metrics and statistics for use in determining one or more encoder parameters. The extracted features may comprise, for each block or coding unit of a frame: a histogram; a mean value; a minimum value; and a maximum value.
Based on the extracted features, the feature extractor may classify each block or coding unit within the frame, for example by providing a perceptibility rating relative to adjacent blocks or coding units.
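The per-block statistics listed above can be sketched as follows; the bin count and value range are assumptions for an 8-bit plane, not values from the source:

```python
import numpy as np

def extract_block_features(block, histogram_bins=16, value_range=(0, 255)):
    """Illustrative feature extractor for one block or coding unit,
    returning the statistics named above: a histogram, a mean value,
    a minimum value and a maximum value."""
    hist, _ = np.histogram(block, bins=histogram_bins, range=value_range)
    return {
        "histogram": hist,          # occupancy of each value bin
        "mean": float(block.mean()),
        "min": float(block.min()),
        "max": float(block.max()),
    }
```

A classifier could then, for example, compare a block's value spread (max minus min) or histogram concentration against those of adjacent blocks to assign a relative perceptibility rating.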
In certain examples, the pre-analysis module 12 comprises a residual mode selector, for example, as an implementation of an encoder parameter determining unit. This type of encoder parameter is particularly suited to pre-analysis for an LCEVC encoder.
The residual mode may be determined by categorising a frame, block or coding unit which is to be encoded. The categorisation may be based, for example, on certain spatial and/or temporal characteristics of the input image, such as the detail perception metric and optionally also the features extracted by a feature extractor. For example, the residual mode may be chosen by comparing the detail perception metric against one or more thresholds.
Additionally or alternatively, an example pre-analysis module comprises a temporal prediction controller, as an implementation of an encoder parameter determining unit. The temporal prediction controller is configured to determine whether to apply temporal prediction. This type of encoder parameter is again particularly suited to pre-analysis for an LCEVC encoder.
The detail perception metric may be used to estimate a cost of temporal prediction, on a per frame basis and/or on a per portion basis, e.g. per tile and/or per coding unit (i.e. block). The cost of temporal prediction increases if it is expected to cause a loss of perceived quality. On the other hand, the cost of temporal prediction decreases based on the expected improvement of compression in frames encoded using temporal prediction of residuals.
In one case, a cost that is used to determine whether or not to apply temporal prediction may be controllable, e.g. by setting a parameter in a configuration file. The cost may be evaluated with and without temporal prediction, and temporal prediction may be used when it has lower cost than not using temporal prediction. In certain cases, the encoding parameter may comprise a map that indicates whether to apply temporal prediction for a frame, or a set of portions of a frame, of video.
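The per-portion decision described above can be sketched as a map generator; the function name and the cost dictionaries keyed by portion index are illustrative assumptions:

```python
def temporal_decision_map(costs_with, costs_without):
    """Illustrative sketch: evaluate the cost with and without temporal
    prediction for each portion (e.g. tile or coding unit) and apply
    temporal prediction only where it has the lower cost. Returns the
    map of per-portion decisions."""
    return {portion: costs_with[portion] < costs_without[portion]
            for portion in costs_with}
```

The resulting map can serve as the encoding parameter indicating, per portion of a frame, whether to apply temporal prediction.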
In one example, the cost function may simply be the motion metric generated by a perception metric generator. In an example, a temporal prediction controller may further be configured to control whether to perform a temporal refresh for a frame.
In a number of examples, the values generated by the pre-analysis, additionally or alternatively, include any one or more of: a value indicative of an amount of static blocks in a frame; a value indicative of a rate estimate representing a cost of applying one or more forms of processing to a block, tile or frame; a value indicative of an (estimated) motion quantity of a frame; and a value indicative of a variance in quantity of motion over a set of frames. Any one or more of these values may be in the range of 0 to 1, or may be disabled, which, in various examples, is identified by a value of -1 (negative 1).
Regarding a pre-analysis sequence, as an example, the pre-analysis module 12 obtains a first video frame of the input video. The first frame is any frame which is expected to be subsequently encoded by an encoder, and "first" is not indicative of any particular position in the sequence of frames of the input video. The pre-analysis module down-samples the first video frame to obtain a first down-sampled video frame. Subsequently, the pre-analysis module generates a detail perception metric based on the first down-sampled video frame. Following this, the pre-analysis module determines, based on the detail perception metric, an encoder parameter for encoding the first video frame. This step may be implemented using an encoder parameter determining unit, such as a residual mode selector, a temporal prediction controller or a rate controller.

This example may also be applied to two video frames where the pre-analysis module obtains two video frames of the input video (a first video frame and a second video frame). The first and second video frames are any two different frames which are expected to be subsequently encoded by an encoder, and "first" and "second" are not indicative of any particular position in the sequence of frames of the input video, although the second video frame follows (i.e. occurs at least one frame after) the first video frame in the sequence of video frames. For example, the second frame may be one frame after the first frame (i.e. the immediately following frame), or two frames after the first frame, in the sequence of frames of the input video. The same process is followed, but differs in that multiple frames are used to determine the detail perception metric, and therefore the detail perception metric can include a motion metric.
In some examples, the pre-analysis module 12 comprises a rate controller, and in other examples, the rate controller 16 is a separate module to the pre-analysis module to which the pre-analysis module provides input, as illustrated in Figure 2.
Whichever arrangement is applied, the rate controller is configured, in various examples, to manage encoding to achieve a required bit rate, as described above with reference to the output buffer feature present in some encoders.
For example, the rate controller 16 may be the rate controller described in WO 2023/187308. In various examples, the rate controller may be configured to determine one or more quantization parameters. The determined quantization parameters may include any of a quantization bin size, a dead zone parameter, a bin folding parameter, a quantization offset parameter and a quantization matrix parameter.
In various examples, the rate controller is configured to determine an encoding parameter based on the detail perception metric generated by a perception metric generator and optionally also the features extracted by the feature extractor. As one example, a detail perception metric may indicate high perception of details in a specific portion (e.g. tile or block) of one or more frames. This may be due to, for example, edges or motion. At the same time, a feature extracted by the feature extractor may indicate that the pixel values in the specific portion fall within a small part of the total possible value range. In response, a quantization bin size parameter may be decreased and the size of a dead zone may be increased. This may have the effect of increasing the level of detail without increasing the required number of bits for residuals.
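As a sketch of how the bin size and dead zone parameters interact, consider the following scalar quantiser; the function and its boundary handling are assumptions for illustration, not the LCEVC quantiser:

```python
def quantize_dead_zone(coefficient, bin_size, dead_zone):
    """Illustrative scalar quantiser with a dead zone: coefficients whose
    magnitude falls inside the dead zone quantise to zero; beyond it,
    uniform bins of width `bin_size` apply. Decreasing `bin_size`
    increases precision for retained coefficients, while increasing
    `dead_zone` discards more near-zero coefficients."""
    if abs(coefficient) < dead_zone:
        return 0  # small coefficient suppressed by the dead zone
    sign = 1 if coefficient > 0 else -1
    return sign * int((abs(coefficient) - dead_zone) // bin_size + 1)
```

This illustrates the trade-off mentioned above: a smaller bin size spends bits on finer detail, and a larger dead zone reclaims bits by zeroing low-magnitude residual coefficients.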
In cases where multiple encoder parameters are determined, then the determination for one parameter may be used as an input for the determination of another parameter.
For example, if a quantization parameter determined for a frame, block or coding unit would cause at least one corresponding residual at the encoder to not be quantized or quantized to zero, then the residual mode can be determined to prevent transformation or quantization of that residual at the encoder. This avoids unnecessarily performing transformation on the residual at the encoder, and thereby saves encoder processing resources.
The encoder parameter determining unit(s) may be configured to pass encoder parameters to the encoder 14 in real-time. Alternatively, the pre-analysis module 12 may store the determined encoder parameters for subsequent use in an encoder. In other examples, the pre-analysis module may simply generate the detail perception metric using the perception metric generator, and pass the detail perception metric to the encoder. The encoder parameter determining units may instead be arranged as part of the encoder.
In various examples, as shown in Figure 2, the pre-analysis module 12 and the rate controller 16 are connected to a temporal decision module 18. The temporal decision module, based, in some examples, on the pre-analysis and rate control processes, is configured to determine a temporal mode for one or more further encoded enhancement streams for use in reconstructing the input video together with the base stream. The one or more further encoded enhancement streams are generated in a number of examples using an enhancement encoder, which is typically different to a base encoder.
Under certain circumstances, there are at least two temporal modes. These are a first temporal mode and a second temporal mode.
An example first temporal mode is a mode that does not use the temporal buffer or that uses the temporal buffer with all zero values. In some examples, the first temporal mode may be seen as an intra-frame mode as it only uses information from within a current frame. In the first temporal mode, following any applied ranking and transformation, coefficients may be quantized without modification based on information from one or more previous frames.
An example second temporal mode is a mode that makes use of the temporal buffer, e.g. that uses a temporal buffer with possible non-zero values. In various examples, the second temporal mode may be seen as an inter-frame mode as it uses information from outside a current frame, e.g. from multiple frames. In the second temporal mode, following any applied residual prioritization and transformation, previous frame dequantized coefficients may be subtracted from the coefficients to be quantized. In other words, applying non-zero values from the temporal buffer in the second temporal mode comprises deriving a set of non-zero temporal coefficients from a temporal buffer and using the set of non-zero temporal coefficients to modify a current set of coefficients for generating the one or more further encoded streams in some examples.
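The difference between the two temporal modes may be sketched as follows, with coefficients represented as flat lists for simplicity (the function and parameter names are illustrative assumptions):

```python
def modify_coefficients(coeffs, temporal_buffer, second_mode):
    # First (intra) temporal mode: coefficients pass through unmodified.
    # Second (inter) temporal mode: dequantized coefficients held in the
    # temporal buffer for the previous frame are subtracted from the
    # current coefficients before quantization.
    if second_mode:
        return [c - t for c, t in zip(coeffs, temporal_buffer)]
    return coeffs
```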
The temporal decision module 18, in various examples, is configured to determine the temporal mode based on a cost function. The cost function may incorporate a cost of sending temporal mode signalling data for the temporal mode. The cost of sending temporal mode signalling data for the temporal mode may penalise one value of the temporal mode signalling data as compared to other values of the temporal mode signalling data.
In some examples, the cost function comprises a function of the input video and at least one of the one or more further encoded enhancement streams. The temporal decision module 18, in a number of examples, is configured to evaluate the cost function for a frame, tile or block of the input video.
To evaluate the cost function, the temporal decision module 18 is typically configured to encode the one or more further encoded enhancement streams using each of the first temporal mode and the second temporal mode. To evaluate the cost function, the temporal decision module can also be configured to compare one or more metrics determined for each of the first temporal mode and the second temporal mode.
To determine the temporal mode, in some examples, the temporal decision module 18 is configured to obtain temporal mode metadata for a set of blocks of the plurality of blocks. The temporal decision module can also be configured to determine the temporal mode to use for encoding the set of blocks based on the temporal mode metadata.
Ultimately, through a combination of one or more processes carried out by one or more of the pre-analysis module 12, the rate controller 16 and temporal decision module 18, a temporal mode is selected. At this stage, the encoder 14, in various examples, is configured to encode, separately from the one or more further encoded streams, temporal mode signalling data indicating the temporal mode for the one or more further encoded streams.
In some examples, the encoder 14 is configured to encode the temporal signalling data using run-length encoding.
Further, in various examples, the encoder 14 is configured to determine whether to refresh the temporal buffer for a given frame of the input video based on at least one of: a first proportion of blocks of the given frame for which the first temporal mode is to be used for reconstructing the given frame, and a second proportion of the blocks of the given frame for which the second temporal mode is to be used for reconstructing the given frame. Refreshing the temporal buffer for the given frame may comprise setting the values within the temporal buffer to zero.
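A minimal sketch of such a refresh determination, assuming the decision is driven by the first proportion and using an illustrative threshold value not specified in the text:

```python
def should_refresh_buffer(first_mode_blocks, total_blocks, refresh_ratio=0.75):
    # Hypothetical policy: refresh (e.g. zero) the temporal buffer for a
    # given frame when the proportion of its blocks using the first
    # temporal mode reaches the threshold. refresh_ratio is an assumed
    # value for illustration only.
    return first_mode_blocks / total_blocks >= refresh_ratio
```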
In a number of examples, the encoder 14 is configured to determine the temporal mode for a second frame of the input video, after a first frame. The encoder may also be configured to omit a quantized value of a transformed block of the first frame from the one or more further encoded enhancement streams based on the temporal mode determined for the second frame.
The encoder 14 is configured, in certain circumstances, to generate temporal mode signalling data indicating the temporal mode for the one or more further encoded enhancement streams for a decoder. The temporal mode signalling data may be compressed in some examples.
The encoder 100 shown in Figure 1 and the encoder 10 shown in Figure 2 provide a general background for the functioning of an encoder and existing, and known, processes carried out by an encoder. In some circumstances, these are able to be used by, contribute to or form part of an encoder according to an aspect disclosed herein. Turning to the developments provided by aspects disclosed herein, Figure 3 shows an encoder generally illustrated at 1.
The encoder 1 in the example shown in Figure 3 is shown separated into two modules, an optional rate controller 2, and an LOQ (Level Of Quality) Encode module 3. In other examples, these are the same module or are separated into one or more other modules.
In the example shown in Figure 3, the LOQ Encode module 3 includes an encoding unit 30, an RDO decision unit 32 and, optionally, an isolated decision unit 34.
For the encoder 1 shown in the example of Figure 3, in which the rate controller 2 and the isolated decision unit 34 are described as optional, the process(es) of at least one of these is implemented by the encoder. As such, in all examples, the process(es) of at least one of the rate controller and the isolated decision unit are implemented.
In various examples, the LOQ Encode module 3 and the rate controller 2 receive, for processing, a current frame of encoded enhancement for an encoded video signal, the encoded enhancement signal comprising one or more layers of residual data, the residual data having been generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded enhancement signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks. This is in combination with other frames of the encoded enhancement signals in some examples.
In accordance with the processes set out above, in a number of examples, the current frame is encoded using each of the first temporal mode and the second temporal mode. In the example shown in Figure 3, this is achieved by the encoding unit 30 encoding the current frame to encode intra 302, corresponding in some examples to the first temporal mode, and to encode inter 304, corresponding in some examples to the second temporal mode. This may then be passed on to the RDO decision unit 32.
In some examples, the rate controller 2 implements a favouring logic 20, which is set out in more detail below. This is intended to identify if one of the first temporal mode and the second temporal mode is preferred. The outcome of the favouring logic is provided as an input to the RDO decision unit 32 in some examples. While shown inside the rate controller in Figure 3, the favouring logic can be in a different location in other examples, such as outside the rate controller, and/or inside the LOQ Encode module 3 or inside a component of the LOQ Encode module, or outside of the rate controller and LOQ Encode module as its own module or within a further (not shown) module. As a further alternative, the favouring logic may have a plurality of components, with all components inside (only) a single location or module, or with one or more components in two or more of the rate controller, LOQ Encode module and one or more further modules.
The RDO decision unit 32 decides on whether to implement an Inter decision or an Intra decision. Following this, this information is passed to the isolated decision unit 34 in some examples.
In various examples, following an inter/intra decision a process is implemented to identify if that decision is an isolated decision, which is set out in more detail below. This is intended to identify if one or more inter or intra decisions is/are separated from other respective inter or intra decisions, and if they are, in some circumstances, switching the decision to the other temporal mode.
The steps shown in the example of Figure 3 are able to be carried out at a pre-analysis, rate control or encode stage, such as at a rate controller, during analysis, or before or during encode.
Favouring logic
Turning to the favouring logic in more detail, as set out in Figure 4, in some examples, an encoder is able to apply the steps of receiving an input video at step S10. Using one or more known processes, at step S12, one or more enhancement signals are generated from the input video. Following this, a determination of which temporal mode to apply to one or more frames or one or more blocks is conducted at step S14. After the determination, at step S16, one or more encoded enhancement signals are generated, which are generated using one or more known processes.
The favouring logic forms part of the determining temporal mode step S14.
An example process implementing the favouring logic is shown in Figure 6. This is able to be implemented by an encoder or a module of an encoder, such as a rate distortion optimisation apparatus, which, in some examples is a rate controller, or is included in a pre-analysis module or in an encoder such as the pre-analysis module 12 or encoder 14 of the example shown in Figure 2. Further, the rate distortion optimisation apparatus is able to encompass and/or be implemented in a combination of these components in some examples.
The rate distortion optimisation apparatus, generally illustrated at 22 in Figure 6, is configured to implement the example process shown in Figure 6.
In the example process shown in Figure 6, this includes receiving a frame of an encoded enhancement signal at step S200. This is followed, at step S202, by calculating a cost of sending temporal mode signalling data for each of a first temporal mode and a second temporal mode for the encoded enhancement signal of a current block of the frame. In some examples, the cost is calculated by applying the cost function set out above and/or evaluation of the cost function.
In a number of examples, the calculation of the costs is received by the rate distortion optimisation apparatus, such as due to being conducted outside of the rate distortion optimisation apparatus. This is received from a pre-analysis module in various examples. Alternatively, the rate distortion optimisation apparatus includes part of the pre-analysis module that calculates the costs.
At step S204, the frame is assessed. The assessment arrives at a motion estimate of the frame in some examples, such as an assessment against a range of motion, which may range from completely static (i.e. no movement) to completely non-static (i.e. all movement and/or significant amounts of motion). This is achieved by applying or using the motion metric described above in some examples.
In various examples, the output of the assessment may be a motion estimate value. This value is typically in the range of 0 to 1, in which 1 represents static or no movement.
The motion estimate has the identifier m_motionModifier in a number of examples.
In some examples, the motion estimate assessment is received by the rate distortion optimisation apparatus instead of being conducted by the rate distortion optimisation apparatus, such as due to being conducted outside of the rate distortion optimisation apparatus. This is received from a rate controller in various examples. Alternatively, the rate distortion optimisation apparatus includes part of another module that assesses a motion estimate.
Following the motion estimate assessment step, at optional step S206, the motion estimate is compared to a base threshold. In the same step, in some examples, an indicator of a quantity of static elements in the current frame is compared to a static threshold.
The indicator, which has the identifier zeroStaticRatio in some examples, is received by the rate distortion optimisation apparatus at optional step S2062. This is received from the rate controller in various examples.
In some examples, the static threshold, which has the identifier temporal motion modifier static thr in various examples, is pre-set, such as being provided from (and calculated or input into) a pre-analysis module. Alternatively, the static threshold may be user set or may be pre-set in the rate distortion optimisation apparatus, such as being "hard coded" into the apparatus.
Typically, the static threshold is a value from 0 to 1, such as being a floating point value able, in some examples, to be set as identified above.
Similarly, in a number of examples, the base threshold is pre-set in the rate distortion optimisation apparatus. Alternatively, the base threshold may be user-set or may be provided from (and calculated or input into) a pre-analysis module.
Regardless of the mechanism by which it is arrived at, the base threshold, in some examples, is 0.11 (such as in a range of 0 to 1), but can be between 0.05 and 0.30, between 0.05 and 0.25, between 0.05 and 0.20, between 0.07 and 0.18, or between 0.09 and 0.15. A value of 0.11 is one example of the base threshold because this represents a (probability of) very high motion in the current frame.
In some examples, when the motion estimate is less than or equal to the base threshold and the indicator is less than or equal to the static threshold, the temporal mode of the current block is designated as the first temporal mode at optional step S2064. This can be achieved by designating the temporal mode of the current block only, but in various examples, this is achieved by applying the same temporal mode to all the blocks in the current frame or by applying the temporal mode to the current frame as a whole. This is accomplished, in a number of examples, by triggering a temporal buffer refresh.
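The comparison of optional step S206 may be sketched as follows, where the base threshold of 0.11 is the example value given above and the static threshold value is an illustrative assumption:

```python
def force_first_mode(motion_estimate, zero_static_ratio,
                     base_thr=0.11, static_thr=0.5):
    # Step S206 sketch: the motion estimate is in [0, 1] with 1 meaning
    # static, so a value at or below the base threshold indicates very
    # high motion. Combined with a low quantity of static elements, the
    # first temporal mode is designated, e.g. via a buffer refresh.
    # static_thr=0.5 is an assumed illustrative value.
    return motion_estimate <= base_thr and zero_static_ratio <= static_thr
```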
In certain examples, a temporal buffer refresh corresponds to a first set of values stored in the temporal buffer being replaced with a second set of values. To achieve a temporal buffer refresh, in some examples, a temporal refresh parameter is signalled. For example, the temporal buffer may store dequantized coefficients for a previous frame that are loaded when a temporal refresh flag is set (e.g. is equal to 1 indicating "refresh"). In this case, the dequantized coefficients are stored in the temporal buffer and used for temporal prediction for future frames (e.g. for subtraction) while the temporal refresh flag for a frame is unset (e.g. is equal to 0 indicating "no refresh"). In this case, when a frame is received that has an associated temporal refresh flag set to, for example, 1, the contents of the temporal buffer are replaced. This may be performed on a per frame basis and/or applied for portions of a frame such as tiles or blocks (i.e. coding units).
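The refresh behaviour may be sketched as a simple buffer object; the flat-list representation and the names are assumptions for illustration:

```python
class TemporalBuffer:
    def __init__(self, size):
        # The buffer starts zeroed; per the text above, a refresh may
        # also comprise setting the values back to zero.
        self.values = [0.0] * size

    def update(self, dequantized_coeffs, refresh_flag):
        # Refresh flag set (e.g. 1, "refresh"): replace the buffer
        # contents with the dequantized coefficients for this frame.
        if refresh_flag:
            self.values = list(dequantized_coeffs)
        # Flag unset (0, "no refresh"): the held values continue to be
        # used for temporal prediction (e.g. for subtraction).
```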
Following the motion estimate assessment, regardless of whether optional step S206 is carried out, at step S208, in some examples, a temporal mode preference is identified for the current block.
In various examples, the temporal mode preference is arrived at based on where on the motion quantity range the motion estimate is. On the example range of 0 to 1, if the motion estimate is in the lower half of the range (e.g. 0 to 0.5), then this indicates a preference is more likely to be the first temporal mode, and if the motion estimate is in the upper half of the range (e.g. 0.5 to 1), then this indicates a preference is more likely to be the second temporal mode. In some examples, this is applied to identify the temporal mode preference. In other examples, however, the motion estimate is assessed relative to a lower threshold and an upper threshold of the motion quantity range.
In some examples, when the motion estimate is equal to or lower than the lower threshold, then the temporal mode preference is identified as the first temporal mode. In certain examples, when the motion estimate is equal to or higher than the upper threshold, then the temporal mode preference is identified as the second temporal mode.
In a number of examples, the lower threshold, which has the identifier temporal motion modifier low in various examples, is pre-set, such as being provided from (and calculated or input into) a pre-analysis module. Alternatively, the lower threshold may be user set or may be pre-set in the rate distortion optimisation apparatus, such as being "hard coded" into the apparatus. Typically, the lower threshold is a value from 0 to 1, such as being a floating point value able, in some examples, to be set as identified above.
In various examples, the upper threshold, which has the identifier temporal motion modifier high in some examples, is pre-set, such as being provided from (and calculated or input into) a pre-analysis module. Alternatively, the upper threshold may be user set or may be pre-set in the rate distortion optimisation apparatus, such as being "hard coded" into the apparatus. Typically, the upper threshold is a value from 0 to 1, such as being a floating point value able, in some examples, to be set as identified above.
In certain examples, when the motion estimate is between the lower threshold and upper threshold, the temporal mode preference is still indicated based on whether the motion estimate is in the lower half or the upper half of the motion quantity range with the upper and lower thresholds then being a faster way of identifying the temporal mode preference. In such examples, further analysis may be carried out in providing a temporal mode preference, such as to increase certainty.
Typically, however, when the motion estimate is between the lower threshold and upper threshold, the temporal mode preference is identified as neither the first temporal mode nor the second temporal mode, such as "no" or unset temporal mode preference.
As an output, in some examples, the temporal mode preference is provided as a value. In various examples, the values are 0 and 1 and can include a further value of 2. In a number of examples, the value 0 corresponds to the second temporal mode, the value 1 corresponds to the first temporal mode, and, when included, the value 2 corresponds to no temporal mode preference or an unset temporal mode preference.
In various examples, the temporal mode preference has an identifier m_rcPreferredChoice.
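Steps S204 to S208 may then be sketched as a threshold comparison, using the example value mapping given above (0 for the second mode, 1 for the first mode, 2 for no preference); the threshold values are illustrative assumptions:

```python
def temporal_mode_preference(motion_estimate, low_thr=0.25, high_thr=0.75):
    # The motion estimate lies in [0, 1], where 1 represents static:
    # a low estimate (high motion) favours the first (intra) mode and
    # a high estimate (little motion) favours the second (inter) mode.
    if motion_estimate <= low_thr:
        return 1   # first temporal mode preferred
    if motion_estimate >= high_thr:
        return 0   # second temporal mode preferred
    return 2       # no / unset temporal mode preference
```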
As a further optional portion of the process able to be carried out after assessing the motion estimate of the current frame, at optional step S207, a variation in motion over a set of frames is identified. This is carried out following receipt, at optional step S2072, of a motion standard deviation threshold, which is assessed, at optional step S2074, to identify if the motion standard deviation threshold is a default value.
In some examples, when the motion standard deviation threshold is a default value, no further action is taken and the process advances to identifying the temporal mode preference. The default value is -1 (negative 1) in various examples, but can be another value able to be identified as a default value.
In a number of examples, the motion standard deviation threshold, which has the identifier temporal motion modifier stddev in various examples, is pre-set, such as being provided from (and calculated or input into) a pre-analysis module.
Alternatively, the motion standard deviation threshold may be user set or may be pre-set in the rate distortion optimisation apparatus, such as being "hard coded" into the apparatus.
In various examples, the motion standard deviation threshold is a value, such as a value between 0 and 1, which is typically a non-default value. When a non-default value, this is used, at optional step S207, in identifying variation in motion estimate over a set of frames in some examples and comparing the identified variation to the motion standard deviation threshold value. This is achieved, in some examples, by calculating the standard deviation of the motion estimate for a predetermined set of frames. The number of frames in the set of frames varies from example to example, but a typical number is six frames. In some examples, this is a "rolling" set of six frames that includes the current frame along with five other adjacent and consecutive frames, whether they are all before or after the current frame in a sequence of frames or with the current frame part way through the set of frames in relation to the sequence of frames. In other examples, the number of frames is more or less than six.
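The variation check of optional step S207 may be sketched with Python's standard library, assuming a rolling window of six motion estimates and using -1 as the default value that disables the check:

```python
from statistics import pstdev

def motion_variation_exceeds(motion_estimates, stddev_thr):
    # Default value (-1): the check is skipped and the process advances
    # directly to identifying the temporal mode preference.
    if stddev_thr == -1:
        return False
    # Otherwise compare the standard deviation of the motion estimate
    # over the set of frames (e.g. six) against the threshold.
    return pstdev(motion_estimates) >= stddev_thr
```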
When included in the process, should the variation in the motion estimate over the set of frames be greater than or equal to the motion standard deviation threshold, the temporal mode preference of the current block is identified as the first temporal mode at step S208. This is achieved in relation to the optional process including step S207 by applying the first temporal mode to the whole frame, either on a block-by-block basis, by applying the first temporal mode to all the blocks in the frame in one step or by applying the first temporal mode to the whole frame.
Following the identification of the temporal mode preference at step S208, regardless of whether the process progressed directly from step S204 or from implemented optional steps S206 and/or S207, in some examples, at step S210, a weighting is generated. In various examples, the weighting is generated based on a function of the motion estimate and motion quantity range. When progressing from step S206, the process proceeds to the temporal mode preference identification step when the motion estimate and/or the indicator are respectively greater than the base threshold and the static threshold.
In a number of examples, the weighting decreases from a mid-point of the motion quantity range. Typically, as shown in the example plot 50 of Figure 5, the decrease from the mid-point can either be an exponential curve 52 or a staircase function 54. While the plot in Figure 5 shows both of these examples, only a single version, symmetric about the mid-point is applied at any one time in typical examples.
The example plot 50 of Figure 5 has the motion range as the x-axis and weighting on the y-axis. In various examples, the weighting is calculated based on the motion estimate's value on the motion range.
In some examples, when the motion estimate is at the lower threshold, illustrated in an example manner in Figure 5 at 56, on the motion range, the weighting is 50% of its maximum.
In a number of examples, when the motion estimate is at the upper threshold, illustrated in an example manner in Figure 5 at 58, on the motion range, the weighting is 50% of its maximum.
The weighting has a range between 0 and 1 in various examples. When a staircase function is applied, this may give weighting values of 0.95, 0.65 and 0.5 for each of the steps. In other examples, other values are used for the steps.
In certain examples, the weighting has an identifier m_rcPreferredChoiceFavouringFactor.
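The staircase variant of the weighting may be sketched as follows; the step values 0.95, 0.65 and 0.5 are the examples given above, while the threshold positions and the inner step boundary are assumptions for illustration:

```python
def staircase_weighting(motion_estimate, low_thr=0.2, high_thr=0.8):
    # Symmetric about the mid-point of the motion range: the weighting
    # is at its maximum near the mid-point and steps down to 50% of the
    # maximum at the lower and upper thresholds.
    half_band = (high_thr - low_thr) / 2   # mid-point to threshold
    d = abs(motion_estimate - 0.5)         # distance from the mid-point
    if d < half_band / 2:
        return 0.95
    if d < half_band:
        return 0.65
    return 0.5
```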
The weighting, in some examples, is used to modulate a pre-determined favouring factor. In various examples, the favouring factor is a multiplier that is able to be applied to a rate estimate. This, typically, has a value of 0 to 1, and is thus able to reduce the rate estimate or cost when used as a multiplier. In a number of examples, the modulation is applied by using the weighting as a multiplier to the favouring factor, by subtracting the weighting from the favouring factor or by replacing the favouring factor with the weighting.
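The three modulation options may be sketched as follows; the mode strings are hypothetical labels, not identifiers from the source:

```python
def modulate_favouring(favouring_factor, weighting, mode="multiply"):
    # The modulated factor is later applied to the calculated cost of
    # the preferred temporal mode.
    if mode == "multiply":
        return favouring_factor * weighting
    if mode == "subtract":
        return favouring_factor - weighting
    return weighting   # "replace": the weighting supersedes the factor
```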
In some examples, the favouring factor, which has the identifier temporal motion modifier favouring in various examples, is pre-set, such as being provided from (and calculated or input into) a pre-analysis module. Alternatively, the favouring factor may be user set or may be pre-set in the rate distortion optimisation apparatus, such as being "hard coded" into the apparatus.
Typically, the favouring factor is a value from 0 to 1, such as being a floating point value able, in some examples, to be set as identified above.
In various examples, the favouring factor has a default value. This value is 50% or 0.5 when the motion estimate is below the lower threshold and/or above the upper threshold in some examples.
Regardless of whether the weighting is used to modify the favouring factor, is just calculated or is calculated and used to replace the favouring factor, in some examples, at step S212, it is identified whether the temporal mode preference matches the mode of the first temporal mode and the second temporal mode with the lower cost. In various examples, when there is no match, the mode of the current block is designated as the mode with the lower cost.
In a number of examples, when the temporal mode preference matches the mode with the lower cost, the calculated cost is updated, at step S214, by applying the weighting or modulated favouring factor. In some examples, the applying of the weighting or modulated favouring factor is implemented as a multiplier resulting in the cost being multiplied by the weighting or modulated favouring factor.
Following any update of the calculated cost, in certain examples, at step S216, the mode of the current block is designated as the temporal mode with the lower cost. In some examples, this is able to be the mode for which the cost was updated, but is also able to be the mode for which the cost was not updated, such as when the non-updated cost is still lower than the updated cost.
In various examples, the mode designation of the block is achieved by comparing the cost of the first temporal mode and the second temporal mode and identifying the mode with the lower cost.
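Steps S212 to S216 may be sketched as follows, reusing the example value mapping above (1 for the first mode, 0 for the second); the function shape is an illustrative assumption:

```python
def designate_mode(cost_first, cost_second, preference, factor):
    # S212: identify the mode with the lower calculated cost.
    cheaper = 1 if cost_first < cost_second else 0
    # S214: if the temporal mode preference matches the cheaper mode,
    # update that mode's cost by multiplying it by the weighting or
    # modulated favouring factor.
    if preference == cheaper:
        if cheaper == 1:
            cost_first *= factor
        else:
            cost_second *= factor
    # S216: designate the mode with the lower (possibly updated) cost.
    return 1 if cost_first < cost_second else 0
```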
The designation step S216 is able to be carried out in the rate controller 2 or in the LOQ Encode, such as in the RDO decision unit 32, in a number of examples. If carried out in the LOQ Encode, the temporal mode preference, along with the weighting or updated favouring factor, are able to be passed to the LOQ Encode from the rate controller. In examples where a temporal refresh flag is generated, this is also passed to the LOQ Encode from the rate controller.
At this stage, as set out in Figure 4 at step S16, in certain examples, one or more encoded enhancement signals are generated based on the temporal mode applied to the current block. This is achieved using known processes.
It is noted that, in some examples, there is a tendency to apply the second temporal mode, e.g. an Inter temporal mode, more than applying a first temporal mode, e.g. an Intra temporal mode. For example, the Inter temporal mode is able to be a default mode. This allows use of the values in the temporal buffer as the default decision since this keeps processing requirements low. However, the above process allows it to be identified which temporal mode is preferred and for a decision to be made as to whether to apply the preferred option or to apply the other option.
Isolated decisions
Looking in more detail at the isolated decision process that is able to be applied, a process for identifying isolated decisions is generally illustrated at 340 in Figure 7. This process is able to be applied before the one or more encoded enhancement signals are generated at step S16 in Figure 4 or after.
An example process implementing the isolated decision process is shown in Figure 7. This is able to be implemented by an encoder or a module of an encoder, which, in some examples is a rate controller, or is included in a pre-analysis module or in an encoder such as the pre-analysis module 12 or encoder 14 of the example shown in Figure 2. Further, the encoder or module is able to encompass and/or be implemented in a combination of these components in some examples.
The encoder or module, generally illustrated at 340 in Figure 7, is configured to implement the example process shown in Figure 7.
As a first step, at step S300, a binary data set is received. In various examples, the data in the binary data set is indicative of one of a first decision or a second decision for each of a sequential set of blocks of a plurality of blocks, the number of blocks in the set of blocks matching the number of elements in the set of elements.
In some examples, the binary data set is intended to be encoded into an encoded layer in view of the encoder 340 being configured, in various examples, to encode an input video into one or more encoded signals, the one or more encoded signals being suitable for combining with a base encoded stream to reconstruct the input video, each encoded enhancement signal comprising one or more layers of residual data, the residual data being generated based on a comparison of data derived from a decoded version of a video signal and data derived from an input video signal, each encoded signal comprising respective frames, each frame of the respective frames being divided into a plurality of blocks.
In certain examples, the binary data set includes a first subset of elements, a second subset of elements and a third subset of elements of which the respective blocks are sequential within each subset and the subsets are sequential. While other examples are possible, in some examples, each subset of elements includes (only) a single element. Each element thus represents a decision on (only) a single block, for example.
In various examples, the decision is indicative of the temporal mode determined for the respective block. In examples in which each subset of elements includes only a single element, this allows the encoder 340 to carry out the isolated decision process for a current block being processed and using data from the two preceding blocks. The same approach applies for a larger number of blocks in examples in which each subset of elements includes more than a single element.
In other examples, the process may be carried out after all the relevant blocks have been processed.
In the example shown in Figure 7, at step S302, it is identified whether the element(s) of the second subset are indicative of the opposite decision to the element(s) of the first and third subset. In various examples, this is a process that identifies whether, for the one or more blocks relevant to the second subset, the first temporal mode is (to be) designated when, for the one or more blocks relevant to the first subset and the third subset, the second temporal mode is (to be) designated, or alternatively, whether, for the one or more blocks relevant to the second subset, the second temporal mode is (to be) designated when, for the one or more blocks relevant to the first subset and the third subset, the first temporal mode is (to be) designated.
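With single-element subsets, the identification of step S302 reduces to comparing a decision with its two neighbours; a sketch (names assumed):

```python
def is_isolated(decisions, i):
    # decisions is a binary data set: one value (e.g. 0 for the second
    # temporal mode, 1 for the first) per block. Element i is isolated
    # when its neighbours agree with each other but differ from it.
    return (decisions[i - 1] == decisions[i + 1]
            and decisions[i] != decisions[i - 1])
```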
There are examples where the first temporal mode is (to be) designated for the one or more blocks relevant to the second subset, and the first temporal mode is (to be) designated for the one or more blocks relevant to the first subset and the third subset, or, alternatively, the second temporal mode is (to be) designated for the one or more blocks relevant to the second subset, and the second temporal mode is (to be) designated for the one or more blocks relevant to the first subset and the third subset. This is an example where there is no difference that would be identified.
In examples where there is no difference identified, the process advances to a step of encoding the binary data set without further processing being conducted. In some examples, this is achieved by advancing to step S308 in the example shown in Figure 7, at which the binary data set is encoded.
Returning to the identification of a difference in the decision (to be) made, in examples where there is a difference, the process instead advances, optionally, to step S304. At step S304, in some examples, it is identified if the cost of applying the first temporal mode and the cost of applying the second temporal mode to the block(s) represented by the element(s) of the second subset is equal. This is achieved by assessing a calculated cost of applying the first temporal mode and a calculated cost of applying the second temporal mode, such as by the processes set out above for calculating these costs. This assessment, in certain examples, is carried out in respect of costs not updated or before being updated by the favouring logic process, or, in other examples, is carried out in respect of a cost or costs updated by or after being updated by the favouring logic process.
In examples where the costs are equal, at optional step S3042, the distortion caused by applying each mode is then calculated. This is calculated using typical known processes for calculating distortion.
At optional step S3042, it is identified which mode of the first temporal mode and second temporal mode has the lower distortion for the block(s) represented by the element(s) of the second subset. Should the mode with the lower distortion be the mode (to be) designated to the block(s) represented by the element(s) of the second subset, in some examples, this is maintained and the process advances to encoding the binary data set, such as by applying step S310. Alternatively, should the mode with the lower distortion not be the mode (to be) designated to the block(s) represented by the element(s) of the second subset but instead be the other temporal mode, in various examples, the mode (to be) designated to the block(s) represented by the element(s) of the second subset is modified to the mode with the lower distortion. As shown in the example in Figure 7, this can result in modifying the decision, such as modifying the decision to the first temporal mode if the distortion of the first temporal mode is lower. The same applies with respect to the second temporal mode.
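The equal-cost check of step S304 and the distortion tie-break of step S3042 might be sketched as follows. This is an illustration only: the function names are hypothetical, and the sum of squared differences (SSD) is used merely as one typical known distortion measure:

```python
def resolve_equal_cost(decision, cost_mode1, cost_mode2,
                       dist_mode1, dist_mode2):
    """Tie-break a decision when the costs of applying the two temporal
    modes are equal: the mode with the lower distortion is designated.

    `decision` is 0 for the first temporal mode, 1 for the second.
    Returns the (possibly modified) decision.
    """
    if cost_mode1 != cost_mode2:
        return decision  # costs differ, so no tie-break is needed
    # Equal costs: designate the lower-distortion mode (on an exact
    # distortion tie, the first temporal mode is kept here).
    return 0 if dist_mode1 <= dist_mode2 else 1


def ssd(block, reference):
    """Sum of squared differences, one typical distortion measure."""
    return sum((a - b) ** 2 for a, b in zip(block, reference))
```

For instance, with equal costs and a lower first-mode distortion, `resolve_equal_cost(1, 10, 10, 4.0, 9.0)` modifies the decision to the first temporal mode.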
Should the cost of the modes not be equal, or instead of identifying if the cost of the modes is equal, optionally, at step S306, it is identified if the decision of which the element(s) of the second subset are indicative is a decision reached due to application of the favouring logic process above. In some examples, this is achieved by identifying if a temporal preference has been provided or generated.
If the decision of which the element(s) of the second subset are indicative was reached due to the favouring logic process above, the process advances to encoding the binary data set, such as by advancing to step S310. This gives priority to decisions made by the favouring logic over the isolated decision process. This check could, of course, be carried out by identifying whether the decision(s) of which the element(s) of the second subset are indicative were arrived at by applying the favouring logic before step S302. This would avoid needing to carry out step S302.
If optional steps S304 and/or S306 are not included, or when whichever of them is applied has a negative outcome, in the example process shown in Figure 7, the process advances to step S308. At step S308, in various examples, the decision is modified to the opposite of the current decision. This is due to it having been identified that this is an isolated decision. In examples where there is more than one element in each subset, this will be because the decisions, which, for example, are all the same within each respective subset, are isolated decisions relative to the elements of the first and third subsets.
In examples when the block(s) represented by the element(s) in the second subset have already been encoded, this results in the block(s) being re-encoded with the opposing temporal mode applied. In other words, this results in a temporal mode encoded as the first temporal mode being re-encoded as the second temporal mode, or a temporal mode encoded as the second temporal mode being re-encoded as the first temporal mode. In examples in which the block(s) are yet to be encoded, this results in changing the designation of the temporal mode that is to be applied to the opposing temporal mode.
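The modification of step S308 might be sketched as follows, again as an illustration only: the function name and indexing of the second subset within a flat binary data set are assumptions, not part of the disclosure:

```python
def flip_isolated(data_set, start, end):
    """Step S308 sketch: invert the binary temporal-mode decision for
    the elements of the second subset, indexed [start, end), so that
    an isolated decision is replaced by the opposite decision held by
    the neighbouring first and third subsets.
    """
    for i in range(start, end):
        data_set[i] ^= 1  # 0 <-> 1: first <-> second temporal mode
    return data_set
```

For example, `flip_isolated([1, 1, 0, 1, 1], 2, 3)` removes the isolated middle decision, yielding `[1, 1, 1, 1, 1]`.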
Following modifying of the decisions, at step S310, in some examples, the data set is encoded into an encoded layer. This is achieved using an entropy encoding operation. In various examples, the entropy encoding operation is any suitable type of entropy encoding, such as a Huffman encoding operation or a run-length encoding (RLE) operation, or a combination of both a Huffman encoding operation and an RLE operation. Typically, the process applies at least an RLE operation to encode the binary data set.
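As a minimal sketch of the RLE component of such an entropy encoding operation (the actual LCEVC entropy coding defined in ISO/IEC 23094-2 combines RLE with prefix/Huffman coding and differs in detail), a binary data set might be run-length encoded as follows:

```python
def rle_encode_binary(data_set):
    """Run-length encode a binary data set as (value, run_length)
    pairs. Because neighbouring temporal-mode decisions tend to agree
    (particularly after isolated decisions are removed), long runs of
    equal values compress well under this representation.
    """
    if not data_set:
        return []
    runs = []
    current, length = data_set[0], 1
    for value in data_set[1:]:
        if value == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = value, 1
    runs.append((current, length))
    return runs
```

Note how removing an isolated decision shortens the output: `[1, 1, 0, 1, 1]` encodes to three runs, while the modified `[1, 1, 1, 1, 1]` encodes to one.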
Examples
We evaluated the performance of the above process. The tests were designed to isolate, as much as possible, the RDO tools in order to measure their benefit in isolation. The Video Quality Framework (VQF) tests were conducted using the following tools:
* Tune: Tune VQ and preset default
* Base codec: x264 codec
* RDO: Dual pass RDO
* Transform Type: 2x2 (Directional Decomposition, DD)
* Pre-analysis: legacy pre-analysis (no low-complexity)
* Disabled tools: Priority Map; Reduced signalling; Temporal type
* Rate Control: fixed pCRF values = {20, 23, ... 44, 47}
To establish relevant results, 11 sequences were tested. These included five moving sequences, five panning sequences and two static sequences. The moving sequences are sequences in which the camera moves in an erratic manner, usually to follow a scene. For these sequences, the scene often changes rapidly. The panning sequences are sequences in which the camera moves in a more-or-less constant way along a direction. For these sequences, the scene often changes slightly between frames. The static sequences are sequences in which the camera is more or less in the same position for the duration of the sequence (for example when it is intended to be in the same position). In these sequences, there may be parts in movement and parts that are static, but the movement is usually not a consequence of the camera moving but rather due to movement of an object.
In testing, a comparison between a Master (e.g. a current implementation of the overall encoding) and the processes that have been developed was run. This resulted in the average results shown in Table 1 for peak signal-to-noise ratio (PSNR), Structural Similarity (SSIM), Video Multi-Method Assessment Fusion (VMAF) and VMAF-no enhancement gain (VMAF-NEG), which also shows overall results, presented as BD-rate (Bjontegaard delta rate) percentage:

Class     PSNR    SSIM    VMAF    VMAF-NEG
Moving   -1.18   -0.73   -0.26   -0.63
Panning  -3.60   -1.69   -0.81   -1.58
Static   -2.92   -1.05   -0.13   -0.85
Overall  -2.46   -1.12   -0.39   -0.99

Table 1
A test was run comparing the Master and the processes that have been developed without applying the isolated decision process set out above. This means that, of the favouring logic and isolated decision processes, only the favouring logic was included in this test. The average and overall results for this are shown in Table 2, presented as BD-rate percentage:

Class     PSNR    SSIM    VMAF    VMAF-NEG
Moving   -0.83   -0.55   -0.20   -0.44
Panning  -2.77   -1.23   -0.61   -1.07
Static   -2.19   -0.68    0.17   -0.43
Overall  -1.85   -0.80   -0.22   -0.63
Table 2
A test was also run comparing the processes set out above as used in the test from which the Table 1 results were arrived at and the processes set out above as used in the test from which the Table 2 results were arrived at. This provides a comparison between using the isolated decision process and not using the isolated decision process. The average and overall results for this are shown in Table 3, presented as BD-rate percentage:

Class     PSNR    SSIM    VMAF    VMAF-NEG
Moving   -0.36   -0.18   -0.06   -0.19
Panning  -0.85   -0.46   -0.20   -0.52
Static   -0.75   -0.38   -0.30   -0.43
Overall  -0.63   -0.33   -0.17   -0.36
Table 3
An assessment of two different bitrate ranges, corresponding to "quality bands", was also carried out. For a low-quality band, LCEVC encoder equivalent rate control mode, pCRF, values of 35, 32 and 29 were assessed. For a high-quality band, pCRF values of 24, 23 and 20 were assessed. The test for which the results are presented in Table 1 was run for each of the low-quality band and the high-quality band. The low-quality band results, as BD-rate percentage, are shown in Table 4:

Class     PSNR    SSIM    VMAF    VMAF-NEG
Moving   -0.08   -0.26   -0.12   -0.15
Panning  -1.43   -0.78   -0.46   -0.56
Static   -0.80   -0.39   -0.11   -0.16
Overall  -0.72   -0.46   -0.22   -0.28
Table 4
The high-quality band results, as BD-rate percentage, are shown in Table 5:

Class     PSNR    SSIM    VMAF    VMAF-NEG
Moving   -1.76   -1.81   -2.07   -1.59
Panning  -4.18   -3.92   -3.64   -3.51
Static   -3.75   -2.27    4.08   -1.86
Overall  -3.12   -2.60   -0.66   -2.26
Table 5
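The BD-rate figures in Tables 1 to 5 express the average bitrate difference between two rate-quality curves at equal quality. As an illustrative sketch only, the following uses piecewise-linear interpolation of log-rate against quality rather than the cubic polynomial fit of the classical Bjontegaard calculation; the function name and curve format are assumptions:

```python
import math

def bd_rate_percent(anchor, test):
    """Approximate BD-rate (%) between two rate-quality curves, each a
    list of (bitrate, quality) points. A negative result means the test
    curve needs less bitrate than the anchor for the same quality.
    """
    def prep(points):
        pts = sorted(points, key=lambda p: p[1])      # sort by quality
        return [(q, math.log(r)) for r, q in pts]     # (quality, ln rate)

    a, t = prep(anchor), prep(test)
    lo = max(a[0][0], t[0][0])   # overlapping quality interval
    hi = min(a[-1][0], t[-1][0])

    def interp(curve, q):
        """Piecewise-linear interpolation of log-rate at quality q."""
        for (q0, r0), (q1, r1) in zip(curve, curve[1:]):
            if q0 <= q <= q1:
                w = (q - q0) / (q1 - q0)
                return r0 + w * (r1 - r0)
        raise ValueError("quality out of range")

    # Trapezoidal integral of the log-rate difference over [lo, hi].
    n = 100
    qs = [lo + (hi - lo) * k / n for k in range(n)] + [hi]
    diffs = [interp(t, q) - interp(a, q) for q in qs]
    total = 0.0
    for i in range(n):
        total += 0.5 * (diffs[i] + diffs[i + 1]) * (qs[i + 1] - qs[i])
    avg_log_diff = total / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0
```

For example, a test curve requiring half the anchor's bitrate at every quality yields a BD-rate of -50%.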
Objective evaluations for the processes that have been developed show an overall improvement on all metrics. In particular, it can be observed that:
* The average improvement is in the order of 2.5% in PSNR (with peaks of 8%), 0.4% in VMAF (with peaks of nearly 4%) and 1% in VMAF-NEG (with peaks of 5%).
* The static and the panning sequences achieve the highest score, especially in PSNR, reaching an average improvement of nearly 4%, 1.7% in VMAF and 1.6% in VMAF-NEG.
* The majority of the improvement can be observed for the higher quality band (over 3% in PSNR, over 2% in VMAF and nearly 2% in VMAF-NEG). The panning sequences are again the ones benefitting the most, scoring nearly 4% gains on all the metrics.
* The isolated decision logic seems to consistently improve on the favouring logic, especially on the static and panning sequences.
Implementation
In some examples, a non-transitory computer-readable medium, such as a hard-drive, solid-state drive or some other form of storage medium, is provided as generally illustrated at 1000 in Figure 8. This is capable of holding or storing a computer program 1001. The computer program includes instructions, typically in the form of computer code, that, when implemented on a computing device, such as by executing the program, cause an apparatus to perform one or more of the methods and processes set out above or to provide one or more of the encoders set out above. The apparatus may be a computing device such as a chip, server or some other form of computer.
The above examples are to be understood as illustrative examples. Further examples are envisaged. It is to be understood that any feature described in relation to any one example may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the examples, or any combination of any other of the examples. Furthermore, equivalents and modifications not described above may also be employed within the scope of the accompanying claims.