HK1088162B - Picture decoding method - Google Patents

Picture decoding method

Info

Publication number
HK1088162B
Authority
HK
Hong Kong
Prior art keywords
pictures
picture
encoded
transmission unit
identifier
Prior art date
Application number
HK06108434.1A
Other languages
Chinese (zh)
Other versions
HK1088162A1 (en)
Inventor
Miska Hannuksela
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Priority claimed from PCT/FI2004/050015 (WO2004075554A1)
Publication of HK1088162A1
Publication of HK1088162B


Description

Picture decoding method
Technical Field
The invention relates to a method for ordering encoded pictures, the method comprising an encoding step for forming encoded pictures in an encoder, an optional hypothetical decoding step for decoding said encoded pictures in the encoder, a transmission step for transmitting said encoded pictures to a decoder, and a rearranging step for arranging the decoded pictures in decoding order. The invention also relates to a system, an encoder, a decoder, an apparatus, a computer program, a signal, a module and a computer program product.
Background
Published video coding standards include ITU-T H.261, ITU-T H.263, ISO/IEC MPEG-1, ISO/IEC MPEG-2, and ISO/IEC MPEG-4 Part 2. These standards are referred to herein as conventional video coding standards.
Video communication system
Video communication systems can be divided into conversational and non-conversational systems. Conversational systems include video conferencing and video telephony. Examples of such systems include ITU-T Recommendations H.320, H.323, and H.324, which specify video conferencing/telephony systems operating in ISDN, IP, and PSTN networks, respectively. Conversational systems are characterized by their effort to minimize end-to-end latency (from audio-video capture at one terminal to audio-video presentation at the far end) in order to improve the user experience.
Non-conversational systems include playback of stored content, such as digital versatile discs (DVDs) or video files stored in a mass memory of a playback device, digital television, and streaming. A brief review of the most important standards in these technical areas is given below.
The dominant standard in today's digital video consumer electronics is MPEG-2, which includes specifications for video compression, audio compression, storage, and transport. The storage and transport of coded video is based on the concept of an elementary stream. An elementary stream consists of coded data from a single source (e.g. video) plus ancillary data needed for synchronization, identification, and characterization of the source information. An elementary stream is packetized into constant-length or variable-length packets to form a Packetized Elementary Stream (PES). Each PES packet consists of a header followed by stream data called the payload. PES packets from various elementary streams are combined to form either a Program Stream (PS) or a Transport Stream (TS). PS is aimed at applications with negligible transmission errors, such as store-and-play applications. TS is aimed at applications that are susceptible to transmission errors. However, TS assumes that the network throughput remains constant.
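As a rough illustration of the packetization step (a toy layout, not the actual MPEG-2 PES syntax), an elementary stream can be cut into packets, each carrying a small header ahead of its payload:

```python
def packetize_elementary_stream(stream_id: int, data: bytes,
                                max_payload: int = 184) -> list[bytes]:
    """Split coded data into PES-like packets: a header followed by a payload."""
    packets = []
    for i in range(0, len(data), max_payload):
        payload = data[i:i + max_payload]
        # Toy header: start-code prefix, stream id, payload length.
        header = b"\x00\x00\x01" + bytes([stream_id]) + len(payload).to_bytes(2, "big")
        packets.append(header + payload)
    return packets

packets = packetize_elementary_stream(0xE0, bytes(400))
assert len(packets) == 3       # payloads of 184, 184, and 32 bytes
assert packets[0][3] == 0xE0   # each header names its stream
```

A PS or TS multiplexer would then interleave such packets from the audio and video elementary streams.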
There is an ongoing standardization effort in the Joint Video Team (JVT) of ITU-T and ISO/IEC. The work of the JVT is based on an earlier ITU-T standardization project known as H.26L. The goal of the JVT is to release the same standard text as ITU-T Recommendation H.264 and as ISO/IEC International Standard 14496-10 (MPEG-4 Part 10). The draft standard is referred to herein as the JVT coding standard, and a codec according to the draft standard is referred to as the JVT codec.
The codec specification itself conceptually distinguishes between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL comprises the signal processing functionality of the codec, such as transform, quantization, motion search/compensation, and loop filtering. It follows the general concept of most of today's video codecs: a macroblock-based coder that utilizes inter picture prediction with motion compensation and transform coding of the residual signal. The output of the VCL encoder is a slice: a bit string that contains the macroblock data of an integer number of macroblocks, together with slice header information (such as the spatial address of the first macroblock in the slice, the initial quantization parameter, etc.). Macroblocks in a slice are arranged in raster scan order unless a different macroblock allocation is specified using the so-called flexible macroblock ordering syntax. In-picture prediction, such as intra prediction and motion vector prediction, is used only within a slice.
The NAL encapsulates the slice output of the VCL into network abstraction layer units (NAL units or NALUs) that are suitable for transmission over packet networks or for use in packet-oriented multiplex environments. Annex B of the JVT coding standard defines an encapsulation process for transmitting such NALUs over byte-stream-oriented networks.
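A minimal sketch of Annex-B-style encapsulation (simplified; details of the real process such as emulation prevention bytes are omitted) prefixes each NALU with a start code so a receiver can find unit boundaries in a byte stream:

```python
def annexb_encapsulate(nal_units: list[bytes]) -> bytes:
    """Prefix every NAL unit with a 0x000001 start code."""
    return b"".join(b"\x00\x00\x01" + nalu for nalu in nal_units)

def annexb_split(stream: bytes) -> list[bytes]:
    """Recover the NAL units by splitting the byte stream at start codes."""
    return [part for part in stream.split(b"\x00\x00\x01") if part]

nalus = [b"\x65slice-A", b"\x41slice-B"]
assert annexb_split(annexb_encapsulate(nalus)) == nalus
```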
The optional reference picture selection mode of H.263 and the NEWPRED coding tool of MPEG-4 Part 2 make it possible to select the reference frame for motion compensation per picture segment (e.g., per slice in H.263). Furthermore, the optional enhanced reference picture selection mode of H.263 and the JVT coding standard make it possible to select the reference frame for each macroblock separately.
Reference picture selection enables many types of temporal scalability schemes. Fig. 1 illustrates an example of a temporal scalability scheme, referred to herein as recursive temporal scalability. The exemplary scheme can be decoded at three constant frame rates. Fig. 2 depicts a scheme referred to as video redundancy coding, in which a sequence of pictures is divided into two or more independently coded threads in an interleaved manner. The arrows in these and all subsequent figures indicate the direction of motion compensation, and the values below the frames correspond to the relative capture and display times of the frames.
Parameter set concept
One fundamental design concept of the JVT codec is to generate self-contained packets, making mechanisms such as header duplication unnecessary. This is achieved by decoupling information that is relevant to more than one slice from the media stream. Such higher-layer meta-information should be sent reliably and asynchronously, in advance of the RTP packet stream that contains the slice packets. This information can also be sent in-band in applications that have no out-of-band transport channel for the purpose. The combination of the higher-level parameters is called a parameter set. A parameter set contains information such as the picture size, display window, optional coding modes employed, macroblock allocation map, and others.
In order to be able to change picture parameters (such as the picture size) without having to transmit parameter-set updates synchronously with the slice packet stream, the encoder and decoder can maintain a list of more than one parameter set. Each slice header contains a codeword that indicates the parameter set to be used.
This mechanism allows the transmission of parameter sets to be decoupled from the packet stream and to be carried out by external means, for example as a side effect of capability exchange, or through a (reliable or unreliable) control protocol. It may even be possible that they are never transmitted but are fixed by an application design specification.
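The parameter-set indirection can be sketched as follows (the field names are illustrative, not the exact JVT syntax): the decoder holds parameter sets received out of band, and each slice header merely names the set to activate.

```python
# Parameter sets conveyed out of band, indexed by identifier.
parameter_sets = {
    0: {"pic_width": 176, "pic_height": 144},   # QCIF
    1: {"pic_width": 352, "pic_height": 288},   # CIF
}

def active_parameters(slice_header: dict) -> dict:
    """Resolve the parameter set referenced by a slice header codeword."""
    return parameter_sets[slice_header["parameter_set_id"]]

# The picture size changes without retransmitting it in every slice header:
assert active_parameters({"parameter_set_id": 1})["pic_width"] == 352
assert active_parameters({"parameter_set_id": 0})["pic_width"] == 176
```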
Transmission order
In conventional video coding standards, the decoding order of pictures is the same as the display order, except for B pictures. A block in a conventional B picture can be bi-directionally temporally predicted from two reference pictures, one of which precedes the B picture in display order and the other of which succeeds it. Only the latest reference picture in decoding order can succeed the B picture in display order (exception: in the interlaced coding of H.263, both field pictures of a temporally subsequent reference frame can precede a B picture in decoding order). A conventional B picture cannot be used as a reference picture for temporal prediction, so a conventional B picture can be discarded without affecting the decoding of any other pictures.
The JVT coding standard has the following novel technical features compared to earlier standards:
- The decoding order of pictures is decoupled from the display order. The value of the frame_num syntax element indicates the decoding order, and the picture order count indicates the display order.
- The reference pictures for a block in a B picture may precede or succeed the B picture in display order. Consequently, B stands for a bi-predictive picture rather than a bi-directional picture.
- Pictures that are not used as reference pictures are marked explicitly. A picture of any type (intra, inter, B, etc.) can be either a reference picture or a non-reference picture. (Thus, a B picture can be used as a reference picture for temporal prediction of other pictures.)
- A picture may contain slices that are coded with different coding types. In other words, a coded picture may consist of, for example, an intra-coded slice and a B-coded slice.
The separation of the display order from the decoding order has advantages in terms of compression efficiency and error resilience.
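The decoupling of the two orders can be sketched as follows: frame_num establishes the decoding order and the picture order count establishes the display order, and the two can be derived independently of each other.

```python
pictures = [
    {"name": "I0", "frame_num": 0, "poc": 0},
    {"name": "P1", "frame_num": 1, "poc": 4},  # decoded before, but displayed after, B2
    {"name": "B2", "frame_num": 2, "poc": 2},  # bi-predictive picture
]

decoding_order = sorted(pictures, key=lambda p: p["frame_num"])
display_order = sorted(pictures, key=lambda p: p["poc"])

assert [p["name"] for p in decoding_order] == ["I0", "P1", "B2"]
assert [p["name"] for p in display_order] == ["I0", "B2", "P1"]
```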
An example of a prediction structure that can potentially improve compression efficiency is shown in Fig. 3. Boxes represent pictures, capital letters within the boxes represent coding types, numbers within the boxes are picture numbers according to the JVT coding standard, and arrows represent prediction dependencies. Note that picture B17 is a reference picture for picture B18. Compression efficiency is potentially improved compared to conventional coding, because the reference pictures of picture B18 are temporally closer than in conventional coding schemes with a PBBP or PBBBP picture pattern. Compression efficiency is potentially improved compared to a conventional PBP picture pattern, because part of the reference pictures are bi-directionally predicted.
Fig. 4 illustrates an example of the intra picture postponement method, which can be used to improve error resilience. Conventionally, an intra picture is coded immediately after a scene cut or in response to the expiry of an intra picture refresh period. In the intra picture postponement method, an intra picture is not coded immediately after the need to code an intra picture arises; rather, a temporally subsequent picture is selected as the intra picture. Each picture between the coded intra picture and the conventional position of the intra picture is predicted from the next temporally subsequent picture. As Fig. 4 shows, the intra picture postponement method generates two independent inter picture prediction chains, whereas conventional coding produces a single inter picture chain. It is intuitively clear that the two-chain approach is more robust against erasure errors than the one-chain conventional approach. If one chain suffers a packet loss, the other chain may still be received correctly. In conventional coding, a packet loss always causes error propagation to the rest of the inter picture prediction chain.
There are generally two types of ordering and temporal information associated with digital video: decoding and presentation order. The related art will be further described below.
A decoding timestamp (DTS) indicates the time, relative to a reference clock, at which a coded data unit is expected to be decoded. If DTS is coded and transmitted, it serves two purposes: first, if the decoding order of pictures differs from their output order, DTS indicates the decoding order explicitly. Second, DTS guarantees a certain pre-decoder buffering behaviour (buffering of coded data units before the decoder), provided that the reception rate is close to the transmission rate at any moment. The second use of DTS has little or no benefit in networks where the end-to-end latency varies. Instead, the received data are decoded as fast as possible, provided that there is room in the post-decoder buffer (the buffer for decoded pictures) for an uncompressed picture.
The carriage of DTS depends on the communication system and the video coding standard in use. In MPEG-2 systems, DTS can optionally be transmitted as an item in the header of a PES packet. In the JVT coding standard, DTS can optionally be carried as part of the Supplemental Enhancement Information (SEI) and is used in the operation of the optional hypothetical reference decoder. In the ISO base media file format, DTS has a dedicated box type, the Decoding Time to Sample box. In many systems, such as RTP-based streaming systems, DTS is not carried at all, because the decoding order is assumed to be the same as the transmission order and the exact decoding time does not play an important role.
Optional Annex U and Annex W.6.12 of H.263 specify a picture number that is incremented by 1, in decoding order, relative to the previous reference picture. In the JVT coding standard, the frame_num syntax element (hereinafter also referred to as the frame number) is specified similarly to the picture number of H.263. The JVT coding standard also specifies a particular type of intra picture, called an instantaneous decoding refresh (IDR) picture. No subsequent picture may refer to any picture that precedes the IDR picture in decoding order. An IDR picture is often coded in response to a scene change. In the JVT coding standard, the frame number is reset to 0 at each IDR picture, which helps to improve error resilience in case an IDR picture is lost, as illustrated in Figs. 5a and 5b. However, it should be noted that the scene information SEI message of the JVT coding standard can also be used to detect scene changes.
The H.263 picture number can be used to recover the decoding order of reference pictures. Similarly, the JVT frame number can be used to recover the decoding order of frames between an IDR picture (inclusive) and the next IDR picture (exclusive) in decoding order. However, because complementary reference field pairs (consecutive pictures coded as fields that are of opposite parity) share the same frame number, their decoding order cannot be reconstructed from the frame numbers.
The H.263 picture number or JVT frame number of a non-reference picture is specified to be equal to the picture number or frame number of the previous reference picture in decoding order plus 1. If multiple non-reference frames are consecutive in decoding order, they share the same picture or frame number. The picture or frame number of a non-reference picture may also be the same as the picture or frame number of a subsequent reference picture in decoding order. The decoding order of consecutive non-reference pictures can also be restored using the Temporal Reference (TR) coding unit in h.263 or the Picture Order Count (POC) concept in the JVT coding standard.
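One way to use such picture or frame numbers for order recovery (a sketch assuming a small toy wraparound modulus) is to unwrap each received number relative to the previously decoded one:

```python
MAX_FRAME_NUM = 16  # toy modulus; in practice a power of two set by the codec

def unwrapped(frame_num: int, prev_frame_num: int) -> int:
    """Map a wrapped frame number onto a monotone scale after prev_frame_num."""
    return prev_frame_num + (frame_num - prev_frame_num) % MAX_FRAME_NUM

# Frame numbers 14, 15, 0, 1 (wrapping at 16) received out of order:
received = [1, 14, 0, 15]
restored = sorted(received, key=lambda fn: unwrapped(fn, 14))
assert restored == [14, 15, 0, 1]
```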
A presentation timestamp (PTS) indicates the time, relative to a reference clock, at which a picture is expected to be displayed. A presentation timestamp is also called a display timestamp, an output timestamp, or a composition timestamp.
The carriage of PTS depends on the communication system and the video coding standard in use. In MPEG-2 systems, PTS can optionally be transmitted as an item in the header of a PES packet. In the JVT coding standard, PTS can optionally be carried as part of the Supplemental Enhancement Information (SEI) and is used in the operation of the hypothetical reference decoder. In the ISO base media file format, PTS has a dedicated box type, the Composition Time to Sample box, in which the presentation timestamp is coded relative to the corresponding decoding timestamp. In RTP, the RTP timestamp in the RTP packet header corresponds to PTS.
The temporal reference (TR) coding unit of conventional video coding standards is similar to PTS in many respects. In some conventional video coding standards, such as MPEG-2 video, TR is reset to zero at the beginning of a group of pictures (GOP). In the JVT coding standard, there is no notion of time in the video coding layer. Instead, a picture order count (POC) is specified for each frame and field, and it is used, similarly to TR, for example in the direct temporal prediction of B slices. POC is reset to 0 at an IDR picture.
The RTP sequence number is typically a 16-bit unsigned value located in the RTP header, which is incremented by 1 for each transmitted RTP packet and can be used by the receiver to detect packet loss and recover the packet sequence.
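Wraparound-aware comparison of the 16-bit sequence numbers can be sketched as follows; a positive delta greater than one indicates lost packets, while a negative delta indicates a reordered (late) packet:

```python
def seq_delta(current: int, previous: int) -> int:
    """Signed distance between two 16-bit RTP sequence numbers."""
    d = (current - previous) & 0xFFFF
    return d - 0x10000 if d > 0x7FFF else d

assert seq_delta(0, 65535) == 1     # normal increment across the wraparound
assert seq_delta(103, 100) == 3     # gap of 3: two packets are missing
assert seq_delta(65534, 1) == -3    # negative: a late, reordered packet
```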
Transmission of multimedia streams
A multimedia streaming system has a streaming server and a plurality of players accessing the server via a network. Networks are typically packet-oriented and provide little or no means to maintain quality of service. The player typically retrieves pre-stored or live multimedia content from the server and plays the content in real-time while downloading the content. The kind of communication may be point-to-point or multicast. In point-to-point streaming, the server provides a separate connection for each player. In multicast streaming, a server transmits a single data stream to multiple players, and a network element replicates the stream only when needed.
When the player has established a connection to the server and requested a multimedia stream, the server begins to transmit the requested stream. The player does not start playing the stream back immediately; instead, it typically buffers the incoming data for a few seconds. This buffering is referred to herein as initial buffering. Initial buffering helps to maintain uninterrupted playback, because the player can decode and play buffered data in the event of occasional increased transmission delays or temporary drops in network throughput.
In order to avoid unbounded transmission delays, reliable transport protocols are generally not favored in streaming. Instead, such systems prefer an unreliable transport protocol, such as UDP, which on the one hand provides a more stable transmission delay but on the other hand suffers from data corruption or loss.
The RTP and RTCP protocols can be used on top of UDP to control real-time communications. RTP provides the means to detect losses of transmission packets, to reassemble packets into the correct order at the receiving end, and to associate a sampling timestamp with each packet. RTCP conveys information about how large a portion of the packets was received correctly, and can therefore be used for flow control purposes.
Transmission errors
There are two main types of transmission errors, namely bit errors and packet errors. Bit errors are typically associated with circuit-switched channels, such as radio access network connections in mobile communications, and are caused by imperfections of the physical channel, such as radio interference. Such imperfections may result in bit inversions, bit insertions, and bit deletions in the transmitted data. Packet errors are typically caused by elements in packet-switched networks. For example, a packet router may become congested, i.e., it receives too many packets as input and cannot output them at the same rate. In this situation its buffers overflow and some packets are lost. Packet duplication, and packet delivery in an order different from the transmission order, are also possible, but they are generally considered less common than packet loss. Packet errors may also be caused by the implementation of the transport protocol stack in use. For example, some protocols use checksums that are calculated in the transmitter and encapsulated together with the source-coded data. If a bit inversion occurs in the data, the receiver does not end up with the same checksum, and the received packet has to be discarded.
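The checksum behaviour described above can be sketched with a toy 8-bit checksum (real transport protocols use stronger ones, e.g. a 16-bit ones'-complement sum in UDP):

```python
def checksum(data: bytes) -> int:
    """Toy 8-bit checksum over the payload."""
    return sum(data) & 0xFF

def receive(packet: bytes):
    """Return the payload, or None (packet discarded) on checksum mismatch."""
    payload, received_sum = packet[:-1], packet[-1]
    return payload if checksum(payload) == received_sum else None

payload = b"coded video data"
packet = payload + bytes([checksum(payload)])
assert receive(packet) == payload

corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]   # a single inverted bit
assert receive(corrupted) is None                    # the whole packet is dropped
```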
Second (2G) and third (3G) generation mobile networks, including GPRS, UMTS, and CDMA-2000, provide two basic types of radio link connections: acknowledged and unacknowledged. On an acknowledged connection, the receiving end (a mobile station, MS, or a base station subsystem, BSS) checks the integrity of a radio link frame and, in case of a transmission error, sends a retransmission request to the other end of the radio link. Because of link-layer retransmission, the sender has to buffer a radio link frame until a positive acknowledgement for the frame is received. In harsh radio conditions, this buffer may overflow and cause data loss. Nevertheless, it has been shown to be beneficial to use the acknowledged radio link protocol mode for streaming services. On an unacknowledged connection, erroneous radio link frames are typically discarded.
Packet losses can be corrected or concealed. Loss correction refers to the ability to perfectly recover lost data, as if no loss had been introduced. Loss concealment refers to the ability to conceal the effects of transmission errors so that they are not visible in the reconstructed video sequence.
When the player detects a packet loss, it may request retransmission of the packet. Thanks to initial buffering, the retransmitted packet may be received before its scheduled playback time. Some commercial Internet streaming systems implement retransmission requests using proprietary protocols. Work is ongoing in the IETF to standardize a selective retransmission request mechanism as part of RTCP.
A common feature of all these retransmission request protocols is that they are not suitable for multicasting to a large number of players, as this may cause a drastic increase in network traffic. As a result, the multicast streaming device has to rely on non-interactive packet loss control.
Point-to-point streaming systems may also benefit from non-interactive error control techniques. First, some systems do not include any interactive error control mechanism, and preferably have no feedback from the player at all, in order to simplify the system. Second, retransmission of lost packets and other forms of interactive error control typically consume a larger portion of the transmission data rate than non-interactive error control methods. A streaming server must ensure that interactive error control methods do not reserve a major portion of the available network throughput. In practice, the server may have to limit the number of interactive error control operations. Third, the transmission delay may limit the number of interactions between the server and the player, since all interactive error control operations for a particular data sample should preferably be completed before the data sample is played back.
Non-interactive packet loss control mechanisms can be categorized into forward error control and error concealment by post-processing. Forward error control refers to techniques in which the transmitter adds redundancy to the transmitted data so that the receiver can recover at least part of the transmitted data even if there are transmission losses. Error concealment by post-processing is entirely receiver-oriented. These methods attempt to estimate the correct representation of erroneously received data.
Most video compression algorithms generate temporally predicted INTER or P pictures. As a result, data loss in one picture causes visible degradation in the subsequent pictures that are temporally predicted from the corrupted picture. A video communication system can either conceal the loss in the displayed pictures or freeze the latest correct picture on the screen until a frame that is independent of the corrupted frame is received.
In conventional video coding standards, the decoding order is coupled to the output order. In other words, the decoding order of I and P pictures is the same as their output order, and the decoding order of a B picture immediately follows the decoding order of the later one, in output order, of its reference pictures. Consequently, the decoding order can be recovered from the known output order. The output order is typically conveyed in the elementary video bitstream in the temporal reference (TR) field, and also in the system multiplex layer, e.g., in the RTP header. Thus, in conventional video coding standards, transmitting pictures in an order other than the decoding order does not present a problem.
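This rule for conventional streams can be sketched as follows: B pictures are held back until the reference picture that succeeds them in output order arrives, which recovers the decoding order from the output order alone.

```python
def decoding_order(output_pics: list) -> list:
    """Derive conventional decoding order from output order and coding types."""
    order, pending_b = [], []
    for name, ptype in output_pics:
        if ptype == "B":
            pending_b.append(name)   # waits for its later reference picture
        else:
            order.append(name)       # an I/P picture decodes first...
            order.extend(pending_b)  # ...then the B pictures it "closes"
            pending_b = []
    return order + pending_b

# Output order I B B P  ->  decoding order I P B B
assert decoding_order([("I0", "I"), ("B1", "B"), ("B2", "B"), ("P3", "P")]) \
    == ["I0", "P3", "B1", "B2"]
```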
It will be apparent to those skilled in the art that the decoding order of coded pictures could be reconstructed from a frame counter that is carried in the video bitstream, incremented similarly to the H.263 picture number, but not reset to zero at each IDR picture (as it is in the JVT coding standard). However, two problems arise when such a solution is used:
First, Fig. 5a illustrates the situation in which such a continuous numbering scheme is used. If, for example, IDR picture I37 is lost (not received or not decodable), the decoder continues to decode the subsequent pictures, but it uses wrong reference pictures. The resulting error propagates until the next frame that is independent of the corrupted frame is received and decoded correctly. In the example shown in Fig. 5b, the frame number is reset to 0 at the IDR picture. Now, if IDR picture I0 is lost, the decoder notices a large gap in picture numbering after the latest correctly decoded picture, P36. The decoder can then assume that an error has occurred and can freeze the display at picture P36 until the next frame that is independent of the corrupted frame is received and decoded.
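The benefit of resetting the frame number at IDR pictures can be sketched as a simple receiver-side check (illustrative only): a frame number that jumps backwards without an IDR picture having been decoded betrays that the IDR picture itself was lost.

```python
def idr_picture_lost(prev_frame_num: int, cur_frame_num: int) -> bool:
    """With numbering reset at each IDR, a backward jump implies a lost IDR."""
    return cur_frame_num < prev_frame_num

assert idr_picture_lost(36, 1)        # P1 arrives right after P36: I0 is missing
assert not idr_picture_lost(36, 37)   # normal progression, nothing suspicious
```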
Second, if the receiver assumes a predefined numbering scheme, e.g., an increment of 1 for each reference picture in decoding order, problems arise with splicing and sub-sequence removal. With a predefined numbering scheme, the receiver can use the scheme to detect losses. Splicing refers to the operation of inserting one coded sequence into the middle of another coded sequence. A practical example of splicing is the insertion of advertisements into a digital television broadcast. If the receiver relies on a predefined numbering scheme, the transmitter has to update the frame counts during transmission according to the position and frame counts of the spliced sequence. Similarly, if the transmitter decides not to transmit some sub-sequences, for example to avoid congestion in an IP network, the frame counts have to be updated during transmission according to the position and frame counts of the removed sub-sequences. The concept of a sub-sequence is described in detail later in this specification.
It will be apparent to those skilled in the art that the decoding order of NAL units can be reconstructed from NAL unit sequence numbers that are similar to RTP sequence numbers but indicate the decoding order of the NAL units, not the transmission order. However, two problems arise when this solution is adopted:
First, in some cases perfect recovery of the decoding order is not required. For example, the SEI messages of a picture can typically be decoded in any order. If the decoder supports arbitrary slice ordering, the slices of a picture can also be decoded in any order. Consequently, when NAL units are received out of NAL unit sequence number order due to unintentional packet reordering in network elements, the receiver would needlessly wait for the NAL unit with the missing sequence number, even though the NAL units with subsequent sequence numbers could in fact already be decoded. This additional delay may degrade the subjective quality experienced in the video communication system. Moreover, it may trigger unnecessary loss correction or concealment processing.
Second, certain NAL units, such as slices of non-reference pictures and SEI NAL units, can be discarded by network elements without affecting the decoding of other NAL units. Such discarding causes gaps in the received sequence of NAL unit sequence numbers. As with RTP sequence numbers, the receiver would assume a predefined numbering scheme, e.g., an increment of 1 for each NAL unit in decoding order, and would use gaps in the sequence numbers for loss detection. Loss detection based on NAL unit sequence numbers therefore conflicts with the ability to discard NAL units without affecting the decoding of the remaining NAL units.
Sub-sequences
The JVT coding standard also includes a sub-sequence concept, which can enhance temporal scalability beyond the use of non-reference pictures, in that an inter picture prediction chain of pictures can be discarded as a whole without affecting the decodability of the rest of the coded stream.
A sub-sequence is a set of coded pictures within a sub-sequence layer. A picture belongs to one sub-sequence layer and only to one sub-sequence layer. A sub-sequence is not dependent on any other sub-sequence in the same sub-sequence layer, nor on sub-sequences in higher sub-sequence layers. The sub-sequence of layer 0 can be decoded independently of any other sub-sequence and the previous long-term reference pictures. Fig. 6a shows an example of a picture stream containing sub-sequences at layer 1.
A sub-sequence layer contains a subset of the coded pictures in a sequence. Sub-sequence layers are numbered with non-negative integers. A layer with a larger layer number is a higher layer than a layer with a smaller layer number. The layers are ordered hierarchically according to their dependency on each other, so that a layer does not depend on any higher layer but may depend on lower layers. In other words, layer 0 is independently decodable, pictures in layer 1 may be predicted from layer 0, pictures in layer 2 may be predicted from layers 0 and 1, and so on. The subjective quality is expected to increase with the number of decoded layers.
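The layered dependency rule can be sketched as follows: a decoder (or a network element thinning the stream) may keep any prefix of the layers, and dropping the higher layers only reduces the frame rate.

```python
pictures = [("I0", 0), ("B1", 1), ("P2", 0), ("B3", 1), ("P4", 0)]  # (name, layer)

def decodable_pictures(pics: list, max_layer: int) -> list:
    """Pictures that survive when layers above max_layer are discarded."""
    return [name for name, layer in pics if layer <= max_layer]

assert decodable_pictures(pictures, 0) == ["I0", "P2", "P4"]            # base rate
assert decodable_pictures(pictures, 1) == ["I0", "B1", "P2", "B3", "P4"]
```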
The sub-sequence concept is included in the JVT coding standard as follows: a required_frame_num_update_behaviour_flag equal to 1 in the sequence parameter set signals that the coded sequence may not contain all sub-sequences. When this flag is in use, the frame number is not required to increment by 1 for each reference frame. Instead, gaps in frame numbers are marked specifically in the decoded picture buffer. If a "missing" frame number is referred to in inter prediction, a loss of a picture can be inferred. Otherwise, frames corresponding to "missing" frame numbers are handled as if they were normal frames inserted into the decoded picture buffer with the sliding-window buffering mode. All pictures of a removed sub-sequence are assigned a "missing" frame number in the decoded picture buffer, but they are never used in inter prediction of other sub-sequences.
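A sketch of the described buffer behaviour (assumed semantics, simplified to a plain sliding window): "missing" frame numbers are filled with placeholder entries, which is harmless as long as inter prediction never refers to them.

```python
DPB_SIZE = 4  # toy decoded-picture-buffer capacity

def insert_with_gap_fill(dpb: list, prev_frame_num: int, cur_frame_num: int) -> None:
    """Insert a decoded frame, creating placeholders for skipped frame numbers."""
    for missing in range(prev_frame_num + 1, cur_frame_num):
        dpb.append({"frame_num": missing, "placeholder": True})
    dpb.append({"frame_num": cur_frame_num, "placeholder": False})
    del dpb[:-DPB_SIZE]  # sliding-window buffering: keep only the newest entries

dpb: list = []
insert_with_gap_fill(dpb, -1, 0)
insert_with_gap_fill(dpb, 0, 3)   # frame numbers 1 and 2 belong to a removed sub-sequence
assert [e["frame_num"] for e in dpb] == [0, 1, 2, 3]
assert [e["placeholder"] for e in dpb] == [False, True, True, False]
```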
The JVT coding standard also includes an optional sub-sequence related SEI message. The sub-sequence information SEI message is associated with the next slice in decoding order. It indicates the sub-sequence layer and the sub-sequence identifier (sub_seq_id) of the sub-sequence to which the slice belongs.
The slice header of each IDR picture contains an identifier (idr_pic_id). If two IDR pictures are consecutive in decoding order, without any intervening picture, the value of idr_pic_id changes from the first IDR picture to the second. If the current picture resides in a sub-sequence whose first picture in decoding order is an IDR picture, the value of sub_seq_id is the same as the value of idr_pic_id of the IDR picture.
The decoding order of coded pictures in the JVT coding standard cannot generally be reconstructed from the frame numbers and the sub-sequence identifiers. If the transmission order differs from the decoding order, the decoding position of coded pictures belonging to sub-sequence layer 1 relative to sub-sequence layer 0 cannot be inferred from the sub-sequence identifier and the frame number. For example, consider the coding scheme illustrated in Fig. 6b, where the output order is from left to right, boxes represent pictures, the capital letter and number inside a box represent the coding type and frame number according to the JVT coding standard, underlining marks non-reference pictures, and arrows represent prediction dependencies. If the pictures are transmitted in the order I0, P1, P3, I0, P1, B2, B4, P5, it cannot be inferred to which independent group of pictures (independent GOP) picture B2 belongs. An independent GOP is a group of pictures that can be decoded correctly without reference to pictures in any other group of pictures.
It could be argued that in the previous example the correct independent GOP for picture B2 can be inferred from its output timestamp. However, the decoding order of pictures cannot be recovered from the output timestamps and the picture numbers, because the decoding order and the output order need not be related. Consider the following example (Fig. 6c), where the output order is from left to right, boxes represent pictures, capital letters inside boxes represent coding types, numbers inside boxes represent frame numbers according to the JVT coding standard, and arrows represent prediction dependencies. If the pictures are transmitted out of decoding order, it cannot be reliably detected whether picture P4 should be decoded after P3 of the first or of the second independent GOP in output order.
Base and redundant pictures
A base coded picture is the primary coded representation of a picture. A decoded base coded picture covers the entire picture area, i.e. the base coded picture contains all slices and macroblocks of the picture. A redundant coded picture is a redundantly coded representation of a picture, or of a part of a picture, that is decoded only if the corresponding base coded picture is lost or corrupted. A redundant coded picture need not include all the macroblocks of the base coded picture.
Buffer
Streaming clients typically have a receiver buffer capable of storing a relatively large amount of data. Initially, when a streaming session is established, the client does not start playing back the stream immediately, but typically buffers incoming data for a few seconds. This buffering helps maintain continuous playback, because the client can decode and play back buffered data during occasional increases in transmission delay or drops in network throughput. Without initial buffering, the client would have to freeze the display, stop decoding, and wait for incoming data. The buffering is also necessary for automatic or selective retransmission at any protocol layer: if part of a picture is lost, a retransmission mechanism can be used to resend the lost data, and the loss is recovered perfectly if the retransmitted data is received before its scheduled decoding or playback time.
Coded pictures can be ranked according to their importance to the subjective quality of the decoded sequence. For example, non-reference pictures, such as conventional B pictures, are subjectively least important, because their absence does not affect the decoding of any other picture. Subjective ranking can also be performed per data partition or slice group. Coded slices and data partitions that are subjectively most important can be sent earlier than their decoding order indicates, whereas those that are subjectively least important can be sent later than their natural coding order indicates. Consequently, retransmitted copies of the most important slices and data partitions are more likely to be received before their scheduled decoding or playback time than those of the least important ones.
Disclosure of Invention
The present invention enables video data to be rearranged from transmission order into decoding order in video communication systems in which transmitting data out of decoding order is advantageous.
In the present invention, in-band signaling of decoding order is transmitted from a transmitter to a receiver. This signaling may be in addition to or instead of any other signaling in the video bitstream that can be used to restore decoding order, such as frame numbering in the JVT coding standard.
The signaling that supplements the frame numbers of the JVT coding standard is described below. In the following, an independent GOP consists of the pictures from one IDR picture (inclusive) to the next IDR picture (exclusive) in decoding order. Each NAL unit in the stream includes, or is associated with, a video sequence ID that remains constant for all NAL units within an independent GOP.
The video sequence ID of an independent GOP either differs from the video sequence ID of the previous independent GOP in decoding order, or is incremented (in modulo arithmetic) relative to the previous video sequence ID. In the former case, the decoding order of independent GOPs is determined by their reception order; for example, the independent GOP starting with the IDR picture having the smallest RTP sequence number is decoded first. In the latter case, the independent GOPs are decoded in order of increasing video sequence ID.
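The modulo-increment variant above can be sketched as a comparator that decides which of two independent GOPs comes first in decoding order. The ID bit width (here an 8-bit ID, modulus 256) is an assumption chosen for illustration; the text does not fix one.

```python
def vseq_id_precedes(id_a: int, id_b: int, modulus: int = 256) -> bool:
    """Return True if the independent GOP with video sequence ID id_a
    precedes the GOP with id_b in decoding order, under the scheme in
    which IDs are incremented modulo `modulus` (bit width assumed)."""
    # Forward distance from id_a to id_b in modulo arithmetic; a short
    # forward distance means id_a was assigned first.
    forward = (id_b - id_a) % modulus
    return 0 < forward < modulus // 2
```

For instance, `vseq_id_precedes(255, 0)` holds across the wrap-around, while `vseq_id_precedes(4, 3)` does not.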
In the following description the invention is described using an encoder-decoder-based system, but it is clear that the invention can also be implemented in other systems that store video signals. The stored video signal may be an unencoded signal stored before encoding, an encoded signal stored after encoding, or a decoded signal stored after the encoding and decoding processes. A file system receives an audio and/or video bitstream, which is encapsulated, e.g. in decoding order, and stored as a file. In addition, the encoder and the file system can generate metadata that describes the subjective importance of pictures and NAL units and essentially contains information about the sub-sequences. The file can be stored in a database from which a streaming server can read NAL units and encapsulate them into RTP packets. Depending on the optional metadata and the data connection in use, the streaming server can modify the transmission order of the packets so that it differs from the decoding order, remove sub-sequences, decide which SEI messages, if any, to transmit, and so on. At the receiving end, the RTP packets are received and buffered. Typically, the NAL units are first rearranged into the correct order, after which they are delivered to the decoder.
According to the H.264 standard, VCL NAL units are specified as NAL units having nal_unit_type equal to 1 to 5, inclusive. In the standard, NAL unit types 1-5 are defined as follows:
1: coded slice of a non-IDR picture
2: coded slice data partition A
3: coded slice data partition B
4: coded slice data partition C
5: coded slice of an IDR picture
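As a concrete restatement, the five VCL NAL unit types can be expressed as a simple lookup; this is only a sketch of the classification rule given above.

```python
# VCL NAL unit types per the listing above (nal_unit_type 1-5).
VCL_NAL_UNIT_TYPES = {
    1: "coded slice of a non-IDR picture",
    2: "coded slice data partition A",
    3: "coded slice data partition B",
    4: "coded slice data partition C",
    5: "coded slice of an IDR picture",
}

def is_vcl(nal_unit_type: int) -> bool:
    """True if nal_unit_type denotes a VCL NAL unit (range 1-5)."""
    return 1 <= nal_unit_type <= 5
```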
Alternative signaling of decoding order information in a video bitstream, according to a preferred embodiment of the present invention, is described below. A Decoding Order Number (DON) indicates the decoding order of NAL units, not the order in which NAL units are transmitted to the decoder. In the following, DON is assumed, without loss of generality, to be a 16-bit unsigned integer. Let the DON of one NAL unit be D1 and the DON of another NAL unit be D2. If D1 < D2 and D2 - D1 < 32768, or if D1 > D2 and D1 - D2 >= 32768, then the NAL unit with DON equal to D1 precedes the NAL unit with DON equal to D2 in NAL unit decoding order. If D1 < D2 and D2 - D1 >= 32768, or if D1 > D2 and D1 - D2 < 32768, then the NAL unit with DON equal to D2 precedes the NAL unit with DON equal to D1 in NAL unit decoding order. NAL units associated with different base coded pictures do not have the same value of DON, whereas NAL units associated with the same base coded picture may have the same value of DON. If all NAL units of a base coded picture have the same value of DON, the NAL units of a redundant coded picture associated with that base coded picture may have a value of DON different from that of the base coded picture. The NAL unit transmission order of NAL units having the same value of DON may be, for example, as follows:
1. Picture delimiter NAL unit, if any
2. Sequence parameter set NAL units, if any
3. Picture parameter set NAL units, if any
4. SEI NAL units, if any
5. Coded slice and slice data partition NAL units of the base coded picture, if any
6. Coded slice and slice data partition NAL units of redundant coded pictures, if any
7. Filler data NAL units, if any
8. End of sequence NAL unit, if any
9. End of stream NAL unit, if any.
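The nine-category order for NAL units sharing a DON value can be sketched as a sort key. The mapping from the categories to H.264 nal_unit_type codes (9 = picture/access unit delimiter, 7 = SPS, 8 = PPS, 6 = SEI, 1-5 = coded slices and data partitions, 12 = filler data, 10 = end of sequence, 11 = end of stream) is standard H.264 background knowledge, not taken from this text; and since base and redundant coded slices share types 1-5 and are distinguished only in the slice header, that distinction is passed in as a flag here.

```python
def same_don_rank(nal_unit_type: int, is_redundant: bool = False) -> int:
    """Transmission-order rank (1 = first) for NAL units sharing a DON."""
    if nal_unit_type == 9:
        return 1                      # picture (access unit) delimiter
    if nal_unit_type == 7:
        return 2                      # sequence parameter set
    if nal_unit_type == 8:
        return 3                      # picture parameter set
    if nal_unit_type == 6:
        return 4                      # SEI
    if 1 <= nal_unit_type <= 5:       # coded slices / data partitions
        return 6 if is_redundant else 5
    if nal_unit_type == 12:
        return 7                      # filler data
    if nal_unit_type == 10:
        return 8                      # end of sequence
    if nal_unit_type == 11:
        return 9                      # end of stream
    return 10                         # anything else: after the known kinds
```

A packetizer could then order same-DON units with, e.g., `sorted(units, key=lambda u: same_don_rank(u.type, u.redundant))`.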
According to a first aspect of the present invention there is provided a method, substantially characterized in that in the encoding step, a video sequence ID different from the picture ID is defined for the encoded pictures.
According to a second aspect of the present invention there is provided an encoder for encoding pictures and for ordering encoded pictures, comprising an arranger for forming at least one group of coded pictures and defining a picture ID for each picture of the group of pictures, the encoder further comprising a definer for defining a video sequence ID for the encoded pictures separate from the picture ID, the video sequence ID being arranged to be the same for each picture of the same group of pictures.
According to a third aspect of the present invention there is provided a decoder for decoding encoded pictures to form decoded pictures, comprising a rearranger for arranging the encoded pictures in decoding order, the decoder further comprising a processor for determining which pictures belong to the same group of pictures by using a video sequence ID.
According to a fourth aspect of the present invention there is provided a software program comprising computer executable steps for performing a method of ordering encoded pictures, the method comprising an encoding step for forming encoded pictures in an encoder, wherein at least one group of pictures is formed, a picture ID being defined for each picture of the group of pictures, a transmission step for transmitting said encoded pictures to a decoder, a rearranging step for arranging the encoded pictures in decoding order, wherein in the encoding step a video sequence ID separate from the picture ID is defined for the encoded pictures.
According to a fifth aspect of the present invention there is provided a signal comprising encoded pictures, the encoded pictures forming at least a group of pictures, a picture ID being defined for each picture of the group of pictures, wherein a video sequence ID separate from the picture ID is defined in the signal for the encoded pictures, the video sequence ID being the same for each picture of the same group of pictures.
According to a sixth aspect of the invention, there is provided a method for ordering encoded pictures comprising a first and a second encoded picture, at least a first transmission unit being formed on the basis of the first encoded picture and at least a second transmission unit being formed on the basis of the second encoded picture, wherein a first identifier is defined for said first transmission unit and a second identifier is defined for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit.
According to a seventh aspect of the present invention, there is provided an apparatus for ordering encoded pictures comprising a first and a second encoded picture, the apparatus comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, the apparatus further comprising a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit.
According to an eighth aspect of the present invention there is provided an encoder for encoding pictures and for ordering encoded pictures comprising a first and a second encoded picture, the encoder comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and a second transmission unit on the basis of the second encoded picture, the encoder further comprising a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit.
According to a ninth aspect of the present invention, there is provided a decoder for decoding encoded pictures to form decoded pictures, the encoded pictures comprising a first encoded picture transmitted in at least one first transmission unit formed on the basis of the first encoded picture and a second encoded picture transmitted in at least one second transmission unit formed on the basis of the second encoded picture, the decoder further comprising a processor for determining the decoding order of information contained in the first transmission unit and information contained in the second transmission unit on the basis of a first identifier defined for said first transmission unit and a second identifier defined for said second transmission unit.
According to a tenth aspect of the present invention there is provided a system comprising an encoder for encoding pictures and for ordering encoded pictures comprising a first and a second encoded picture, the encoder comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, and a decoder for decoding the encoded pictures, the system further comprising a definer in the encoder for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information included in the first transmission unit and information included in the second transmission unit, and a processor in the decoder for determining, on the basis of said first identifier and said second identifier, the decoding order of the information included in the first transmission unit and the information included in the second transmission unit.
According to an eleventh aspect of the present invention, there is provided a computer program comprising computer executable steps for performing a method for ordering encoded pictures comprising a first and a second encoded picture, thereby forming at least a first transmission unit on the basis of the first encoded picture, and at least a second transmission unit on the basis of the second encoded picture, the computer program further comprising computer executable steps for determining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of a decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit, respectively.
According to a twelfth aspect of the present invention, there is provided a computer program product for storing a computer program comprising computer executable steps for performing a method for ordering encoded pictures comprising a first and a second encoded picture, thereby forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, the computer program further comprising computer executable steps for determining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of a decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit, respectively.
According to a thirteenth aspect of the present invention there is provided a signal comprising at least a first transmission unit formed on the basis of a first coded picture and at least a second transmission unit formed on the basis of a second coded picture, the signal further comprising a first identifier determined for said first transmission unit and a second identifier determined for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information contained in the first transmission unit and information contained in the second transmission unit.
According to a fourteenth aspect of the present invention there is provided a module for ordering encoded pictures for transmission, the encoded pictures comprising first and second encoded pictures, the module comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, the module further comprising a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information included in the first transmission unit and information included in the second transmission unit.
According to a fifteenth aspect of the present invention, there is provided a module for reordering encoded pictures for decoding, the encoded pictures comprising a first encoded picture transmitted in at least one first transmission unit formed on the basis of the first encoded picture and a second encoded picture transmitted in at least one second transmission unit formed on the basis of the second encoded picture, the module further comprising a processor for determining the decoding order of information contained in the first transmission unit and information contained in the second transmission unit on the basis of a first identifier defined for said first transmission unit and a second identifier defined for said second transmission unit.
The invention improves the reliability of the coding system. By using the present invention, the correct decoding order of pictures can be determined more reliably than in existing systems, even if some packets in the video stream are not available in the decoder.
Drawings
Figure 1 illustrates an example of a recursive temporal scalability scheme,
fig. 2 illustrates a scheme known as video redundancy coding, in which a sequence of pictures is divided into two or more independent coding threads in an interleaved manner,
figure 3 illustrates an example of a prediction structure that potentially improves compression efficiency,
figure 4 shows an example of an intra-image delay method that can be used to improve error resilience,
figures 5a and 5b illustrate different counting schemes for pictures of a coded video stream in the prior art,
figure 6a shows an example of a picture stream containing sub-sequences at layer 1,
figure 6b shows an example of a picture stream containing two independent groups of pictures with sub-sequences at layer 1,
figure 6c shows an example of an image stream of different independent groups of images,
figure 7 shows another example of a picture stream containing sub-sequences at layer 1,
figure 8 shows a preferred embodiment of the system according to the invention,
figure 9 shows a preferred embodiment of an encoder according to the invention,
figure 10 shows a preferred embodiment of a decoder according to the invention,
FIG. 11a shows an example of the NAL packetization format used in the present invention, an
Fig. 11b shows another example of NAL packetization format used in the present invention.
Detailed Description
The invention will be described in detail below with reference to the system of Fig. 8, the encoder 1 and the optional Hypothetical Reference Decoder (HRD) 5 of Fig. 9, and the decoder 2 of Fig. 10. The pictures to be encoded may be, for example, pictures of a video stream from a video source 3, such as a video camera or a video recorder. The pictures (frames) of the video stream may be divided into smaller portions, such as slices, and the slices may be further divided into blocks. In the encoder 1, the video stream is encoded to reduce the amount of information to be transmitted over the transmission channel 4 or stored in a storage medium (not shown). Pictures of the video stream are input to the encoder 1. The encoder has an encoding buffer 1.1 (Fig. 9) for temporarily storing some of the pictures to be encoded. The encoder 1 also comprises a memory 1.3 and a processor 1.2, in which the encoding tasks according to the invention can be applied. The memory 1.3 and the processor 1.2 may be shared with the transmission device 6, or the transmission device 6 may have other processors and/or memories (not shown) for its other functions. The encoder 1 performs motion estimation and/or other tasks to compress the video stream. In motion estimation, similarities are searched between the picture to be encoded (the current picture) and previous and/or subsequent pictures. If similarities are found, the compared picture, or a part of it, can be used as a reference picture for the picture to be encoded. In JVT, the display order and the decoding order of pictures are not necessarily the same; a reference picture must be stored in a buffer (e.g. the encoding buffer 1.1) for as long as it is used as a reference picture. The encoder 1 also inserts information on the display order of the pictures into the transport stream. In practice, timing information SEI messages or timestamps external to the JVT syntax (e.g. RTP timestamps) can be used.
The coded pictures are moved from the encoding process, if necessary, to the coded picture buffer 1.2. The encoded pictures are transmitted from the encoder 1 to the decoder 2 via the transmission channel 4. In the decoder 2, the encoded pictures are decoded to form uncompressed pictures that correspond as closely as possible to the encoded pictures. Each decoded picture is buffered in the Decoded Picture Buffer (DPB) 2.1 of the decoder 2, unless the picture is displayed essentially immediately after decoding and is not used as a reference picture. In the system according to the invention, reference picture buffering and display picture buffering are combined and use the same decoded picture buffer 2.1. This eliminates the need to store the same pictures in two different places, thus reducing the memory requirements of the decoder 2.
The decoder 2 also comprises a memory 2.3 and a processor 2.2, in which the decoding tasks according to the invention are applied. The memory 2.3 and the processor 2.2 may be shared with the receiving device 8, or the receiving device 8 may have other processors and/or memories (not shown) for its other functions.
The RTP payload format defines a number of different payload structures as required. However, the structure a received RTP packet contains can be determined unambiguously from the first byte of the payload. This byte is always structured as a NAL unit header, whose NAL unit type field indicates which structure is present. The possible structures are: single NAL unit packet, aggregation packet, and fragmentation unit. A single NAL unit packet contains only one NAL unit in the payload; its NAL header type field equals the original NAL unit type, i.e. it is in the range 1-23, inclusive. The aggregation packet type is used to aggregate multiple NAL units into a single RTP payload. It exists in four versions: single-time aggregation packet type A (STAP-A), single-time aggregation packet type B (STAP-B), multi-time aggregation packet with 16-bit offsets (MTAP16), and multi-time aggregation packet with 24-bit offsets (MTAP24). The NAL unit type numbers assigned to STAP-A, STAP-B, MTAP16, and MTAP24 are 24, 25, 26, and 27, respectively. Fragmentation units are used to split a single NAL unit into multiple RTP packets; they exist in two versions, identified by NAL unit type numbers 28 and 29.
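The dispatch on the first payload byte can be sketched as follows. The low five bits of the NAL unit header octet carry the type field (the remaining bits carry the forbidden bit and the NRI field, a standard H.264 detail not spelled out in the text above).

```python
def classify_rtp_payload(first_byte: int) -> str:
    """Classify an H.264 RTP payload from its NAL unit header octet."""
    nal_unit_type = first_byte & 0x1F  # low 5 bits = NAL unit type field
    if 1 <= nal_unit_type <= 23:
        return "single NAL unit packet"
    names = {24: "STAP-A", 25: "STAP-B", 26: "MTAP16",
             27: "MTAP24", 28: "FU-A", 29: "FU-B"}
    return names.get(nal_unit_type, "undefined")
```

For example, a header octet of 0x65 (type 5, a coded slice of an IDR picture) is a single NAL unit packet, whereas type 24 indicates a STAP-A.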
Three packetization modes are defined for RTP packet transmission:
- single NAL unit mode,
- non-interleaved mode, and
- interleaved mode.
The single NAL unit mode is suitable for conversational systems that comply with ITU-T Recommendation H.241. The non-interleaved mode is suitable for conversational systems that do not comply with ITU-T Recommendation H.241; in non-interleaved mode, NAL units are transmitted in NAL unit decoding order. The interleaved mode is suitable for systems that do not require a very low terminal-to-terminal delay; it allows NAL units to be transmitted out of NAL unit decoding order.
The packetization mode in use may be signaled by the optional packetization mode MIME parameter or by external means. The packetization mode in use governs which NAL unit types are allowed in RTP payloads.
In the interleaved packetization mode, the transmission order of NAL units is allowed to differ from their decoding order. The Decoding Order Number (DON) is a field in the payload structure, or a derived variable, that indicates the NAL unit decoding order.
The coupling of transmission order and decoding order is controlled by the optional interleaving depth MIME parameter as follows. When the value of the optional interleaving depth MIME parameter is equal to 0, NAL units are not allowed to be transmitted out of decoding order, and the transmission order of NAL units is the same as their decoding order. When the value of the optional interleaving depth MIME parameter is greater than 0, NAL units are allowed to be transmitted out of decoding order; in particular,
- the order of NAL units in a multi-time aggregation packet 16 (MTAP16) or a multi-time aggregation packet 24 (MTAP24) need not be the same as the NAL unit decoding order, and
- the order of NAL units produced by decapsulating single-time aggregation packets B (STAP-B), MTAPs, and fragmentation units B (FU-B) in two consecutive packets need not be the same as the NAL unit decoding order.
The RTP payload structures of the single NAL unit packet, STAP-A, and FU-A do not include DON. The STAP-B and FU-B structures include DON, and the MTAP structure allows DON to be derived.
If the transmitter desires to encapsulate one NAL unit per packet and transmit the packets out of decoding order, STAP-B packet types may be used.
In the single NAL unit packetization mode, the transmission order of NAL units is the same as their NAL unit decoding order. In the non-interleaved packetization mode, the transmission order of single NAL unit packets, STAP-As, and FU-As is the same as their NAL unit decoding order, and the NAL units within a STAP appear in NAL unit decoding order.
Since H.264 allows the decoding order to differ from the display order, the values of RTP timestamps may not be monotonically non-decreasing as a function of RTP sequence numbers.
The DON value of the first NAL unit in transmission order may be set to any value. Values of DON are in the range of 0 to 65535, inclusive. After reaching the maximum value, the value of DON wraps around to 0.
The decoding order of two NAL units contained in any STAP-B or MTAP, or in a series of fragmentation units starting with an FU-B, is determined as follows. Let the value of DON of one NAL unit be D1 and the value of DON of another NAL unit be D2. If D1 equals D2, the NAL unit decoding order of the two NAL units can be arbitrary. If D1 < D2 and D2 - D1 < 32768, or if D1 > D2 and D1 - D2 >= 32768, then in NAL unit decoding order the NAL unit with DON equal to D1 precedes the NAL unit with DON equal to D2. If D1 < D2 and D2 - D1 >= 32768, or if D1 > D2 and D1 - D2 < 32768, then in NAL unit decoding order the NAL unit with DON equal to D2 precedes the NAL unit with DON equal to D1. The values of DON-related fields are determined so that the decoding order determined by the values of DON, in the manner described above, conforms to the NAL unit decoding order. If the order of two consecutive NAL units in the NAL unit stream is switched and the new order still conforms to the NAL unit decoding order, the NAL units may have the same value of DON. For example, when the video coding scheme in use allows arbitrary slice order, all coded slice NAL units of a coded picture are allowed to have the same value of DON. Consequently, NAL units having the same value of DON can be decoded in any order, while two NAL units having different values of DON should be passed to the decoder in the order specified above. When two consecutive NAL units in NAL unit decoding order have different values of DON, the value of DON of the second NAL unit in decoding order should be the value of DON of the first NAL unit in decoding order incremented by one. The receiver should not, however, expect the absolute difference between the values of DON of two consecutive NAL units in NAL unit decoding order to be equal to 1, even in error-free transmission.
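The wraparound comparison above can be written directly as a predicate; this is a sketch of the rule as stated, for 16-bit DON values.

```python
def don_precedes(d1: int, d2: int) -> bool:
    """True if the NAL unit with DON d1 precedes the NAL unit with DON
    d2 in NAL unit decoding order (16-bit wraparound rule). Equal DONs
    impose no order, so the function returns False in that case."""
    if d1 == d2:
        return False
    if d1 < d2:
        return d2 - d1 < 32768
    return d1 - d2 >= 32768
```

Note that `don_precedes(65535, 0)` is true, reflecting the wrap back to 0 after the maximum DON value.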
Since it may not be known whether all NAL units are eventually delivered to the receiver when DON values are associated with NAL units, the DON value need not always be incremented by exactly 1. For example, when the bitrate available in the network to which packets are forwarded is insufficient, a gateway may decide not to forward the coded slice NAL units of non-reference pictures or SEI NAL units. In another example, a live broadcast is interrupted from time to time by pre-encoded content such as advertisements. The first intra picture of a pre-encoded clip is transmitted in advance to ensure that it is actually received at the receiver. When transmitting the first intra picture, the originator does not know exactly how many NAL units will appear in decoding order before the first intra picture of the pre-encoded clip. Thus, the values of DON of the NAL units of the first intra picture of the pre-encoded clip have to be estimated when they are transmitted, and gaps in DON values may occur.
Encoding
Let us now consider the encoding and decoding process in more detail. Pictures from the video source 3 enter the encoder 1 and are stored in the encoding buffer 1.1. The encoding process need not start immediately after the first picture has entered the encoding buffer 1.1, but may start after a predetermined number of pictures are available in the encoding buffer 1.1. The encoder 1 then tries to find suitable candidates among the pictures to be used as reference frames. The encoder 1 then performs the encoding to form encoded pictures. The encoded pictures can be, for example, predicted pictures (P), bi-predicted pictures (B), and/or intra-coded pictures (I). Intra-coded pictures can be decoded without using any other pictures, but the other types of pictures need at least one reference picture before they can be decoded. Pictures of any of the above-mentioned types can be used as reference pictures.
The encoder preferably attaches two time stamps to the pictures: a decoding time stamp (DTS) and an output time stamp (OTS). The decoder can use the time stamps to determine the correct decoding time and the time to output (display) the pictures. However, the time stamps are not necessarily transmitted to the decoder, or the decoder may not use them.
The encoder also forms sub-sequences on one or more layers above the lowest layer 0. Sub-sequences on layer 0 can be decoded independently, but pictures on higher layers depend on pictures of one or more lower layers. In the example shown in fig. 6a, there are two layers: layer 0 and layer 1. Pictures I0, P6 and P12 belong to layer 0, while the other pictures shown in fig. 6a, P1-P5 and P7-P11, belong to layer 1. Advantageously, the encoder forms groups of pictures (GOPs) such that each picture of a GOP can be reconstructed using only pictures of the same GOP. In other words, a GOP contains at least one independently decodable picture that is either a reference picture for all other pictures or the first reference picture in a chain of reference pictures. In the example shown in fig. 7, there are two independent groups of pictures. The first independent group of pictures comprises pictures I0(0), P1(0), P3(0) on layer 0 and pictures B2(0), 2 × B3(0), B4(0), 2 × B5(0), B6(0), P5(0), P6(0) on layer 1. The second independent group of pictures comprises pictures I0(1) and P1(1) on layer 0 and pictures 2 × B3(1) and B2(1) on layer 1. The pictures on layer 1 of each independent group of pictures are further arranged in sub-sequences. The first sub-sequence of the first independent group of pictures comprises pictures B3(0), B2(0), B3(0), the second sub-sequence comprises pictures B5(0), B4(0), B5(0), and the third sub-sequence comprises pictures B6(0), P5(0), P6(0). The sub-sequence of the second independent group of pictures comprises pictures B3(1), B2(1), B3(1). The numbers in parentheses indicate the video sequence ID defined for the independent group of pictures to which the picture belongs.
A video sequence ID is transmitted for each picture. It can be transmitted in the video bitstream, for example as supplemental enhancement information data. The video sequence ID can also be transmitted in a header field of a transport protocol, for example in the RTP payload header of the JVT coding standard. If pictures are stored according to the present partitioning into independent GOPs, the video sequence IDs can be stored in the metadata of a video file format, such as the MPEG-4 AVC file format.
Another preferred method for conveying decoding order information in a video bitstream is briefly described below. The encoder initializes the decoding order number (DON) to an appropriate start value, e.g. 0. Here, a cyclic up-counting scheme with a maximum value is assumed. If, for example, the decoding order number is a 16-bit unsigned integer, the maximum value is 65535. The encoder forms one or more NAL units from the primary coded picture. The encoder can define the same decoding order number for each NAL unit of the same picture and, if there are redundant coded pictures, the encoder can assign a different DON to the NAL units of those redundant coded pictures. When the entire primary coded picture and its possible redundant coded pictures have been encoded, the encoder starts processing the next primary coded picture in decoding order. If the value of the decoding order number is smaller than said maximum value, the encoder preferably increments the decoding order number by 1. If the decoding order number equals the maximum value, the encoder sets the decoding order number to a minimum value, preferably 0. The encoder then forms NAL units from said next primary coded picture and assigns the current value of the decoding order number to them. If there are redundant coded pictures of the same primary coded picture, they too are converted into NAL units. The operation continues until all primary coded pictures, and each of the redundant coded pictures if any, have been processed. The transmitting device may start transmitting NAL units before all pictures have been processed.
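The counting scheme just described can be sketched as follows (an illustrative outline, not the normative encoder; the name `assign_dons` and the list-of-lists input shape are assumptions of this sketch):

```python
MAX_DON = 65535  # maximum value of a 16-bit unsigned decoding order number

def assign_dons(pictures, start=0):
    """Assign one DON per primary coded picture in decoding order.
    `pictures` is a list of lists: each inner list holds the NAL units of
    one primary coded picture. Every NAL unit of a picture shares its DON;
    the counter wraps back to 0 after reaching MAX_DON."""
    don, out = start, []
    for nalus in pictures:
        for nalu in nalus:
            out.append((nalu, don))  # same DON for all units of the picture
        don = 0 if don == MAX_DON else don + 1  # cyclic increment by one
    return out
```

Starting at 65535, two NAL units of the first picture both receive DON 65535 and the next picture's unit receives DON 0, demonstrating the wraparound.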
If the encoder knows that the far-end decoder cannot process received slices in arbitrary order (i.e., in an order other than raster scan order), the encoder should assign increasing values of DON to the slices of a primary coded picture in raster scan order. That is, if each slice is transported in a single NAL unit, consecutive NAL units have different values of DON. If a slice is transported as data partitioning NAL units, all data partitioning NAL units of the slice share the same value of DON. For a slice of a redundant coded picture, the encoder assigns a value of DON greater than the DON of the corresponding slice of the corresponding primary coded picture.
In the receiver, the decoding order number can be used to determine the correct decoding order of the encoded pictures.
Figs. 11a and 11b depict examples of NAL packet formats that can be applied in the present invention. The packet comprises a header 11 and a payload 12. The header 11 preferably comprises an error indicator field 11.1 (F, forbidden), a priority field 11.2, and a type field 11.3. The error indicator field 11.1, when equal to 0, indicates a NAL unit without bit errors. When the error indicator field is set, the decoder is advised that bit errors may be present in the payload or in the NALU type byte. Decoders that cannot handle bit errors can then discard such packets. The priority field 11.2 is used to indicate the importance of the picture encapsulated in the payload portion 12 of the packet. In one example application, the priority field can take the following four values. The value 00 indicates that the content of the NALU is not used to reconstruct reference pictures (which can be used for future reference). Such NALUs can be discarded without risking the integrity of the reference pictures. Values above 00 indicate that decoding of the NALU is required to maintain the integrity of the reference pictures. In addition, values above 00 indicate the relative transmission priority, as determined by the encoder. Intelligent network elements can use this information to protect the more important NALUs better. 11 is the highest transmission priority, followed by 10 and 01; 00 is the lowest.
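The three header fields can be unpacked from a single byte as in the following sketch (the 1 + 2 + 5 bit layout is an assumption of this illustration, matching the common NALU header arrangement rather than being quoted from the text above):

```python
def parse_nalu_header(octet: int) -> dict:
    """Split a NALU header byte into the error indicator (1 bit),
    priority (2 bits) and type (5 bits) fields described above."""
    return {
        "error_indicator": (octet >> 7) & 0x1,  # field 11.1 (F)
        "priority": (octet >> 5) & 0x3,         # field 11.2
        "type": octet & 0x1F,                   # field 11.3
    }
```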
The payload portion 12 of the NALU comprises at least a video sequence ID field 12.1, a field indicator 12.2, a size field 12.3, timing information 12.4, and encoded picture information 12.5. The video sequence ID field 12.1 is used to store the number of the video sequence to which the picture belongs. The field indicator 12.2 is used to indicate whether the picture is a first field or a second field when a two-field picture format is used, both fields being encoded as separate pictures. A first field indicator equal to 1 indicates that the NALU belongs to a coded frame or to a coded field that precedes the second coded field of the same frame in decoding order. A first field indicator equal to 0 indicates that the NALU belongs to a coded field that follows the first coded field of the same frame in decoding order. The timing information field 12.4 is used to convey time-related information, if necessary.
NAL units can be transported in packets of different kinds. In a preferred embodiment, the different packet formats comprise simple packets and aggregation packets. Aggregation packets can be further divided into single-time aggregation packets and multi-time aggregation packets.
A simple packet according to the present invention comprises one NALU. A NAL unit stream formed by decapsulating simple packets in RTP sequence number order should conform to the NAL unit transmission order.
Aggregation packets are the packet aggregation scheme of the payload specification. The scheme is introduced to reflect the dramatically different MTU sizes of two different network types: wired IP networks (whose MTU size is typically limited by the Ethernet MTU size, about 1500 bytes), and IP-based or non-IP-based (e.g. H.324/M) wireless networks, in which the preferred transmission unit size is 254 bytes or less. To prevent media transcoding between the two domains and to avoid undesirable packetization overhead, the packet aggregation scheme is introduced.
A single-time aggregation packet (STAP) aggregates NALUs with identical NALU-time. A multi-time aggregation packet (MTAP) aggregates NALUs with potentially different NALU-times. Two different MTAPs are defined that differ in the length of the NALU timestamp offset. The NALU-time is defined as the value the RTP timestamp would have if the NALU were transported in its own RTP packet.
MTAPs and STAPs share the following non-limiting packetization rules according to a preferred embodiment of the invention. The RTP timestamp must be set to the minimum of the NALU-times of all the NALUs to be aggregated. The type field of the NALU type byte must be set to the appropriate value, as given in Table 1. The error indicator field 11.1 must be cleared if all the error indicator fields of the aggregated NALUs are zero; otherwise, it must be set.
TABLE 1

Type    Packet    Timestamp offset field length (in bits)
0x18    STAP      0
0x19    MTAP16    16
0x20    MTAP24    24
The NALU payload of an aggregation packet comprises one or more aggregation units. An aggregation packet can carry as many aggregation units as needed, but the total amount of data in an aggregation packet must obviously fit into one IP packet, and the size should be selected so that the resulting IP packet is smaller than the MTU size.
A single-time aggregation packet (STAP) should be used whenever the aggregated NALUs share the same NALU-time. The NALU payload of a STAP comprises a video sequence ID field 12.1 (e.g. 7 bits) and a field indicator 12.2, followed by single-picture aggregation units (SPAU). The single-time aggregation packet of type B (STAP-B) also includes the DON.
A video sequence according to the present description may be any part of a NALU stream that can be decoded independently of other parts of the NALU stream.
A frame comprises two fields that are encoded as separate pictures. A first field indicator equal to 1 indicates that the NALU belongs to a coded frame or to a coded field that precedes the second coded field of the same frame in decoding order. A first field indicator equal to 0 indicates that the NALU belongs to a coded field that follows the first coded field of the same frame in decoding order.
A single-picture aggregation unit comprises, for example, 16-bit unsigned size information indicating the size of the following NALU in bytes (excluding these two bytes, but including the NALU type byte of the NALU), followed by the NALU itself including its NALU type byte.
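Walking a payload of such size-prefixed units can be sketched as follows (a simplified illustration that assumes the payload starts directly at the first size field, i.e. any preceding VSID/field-indicator bits have already been consumed; `split_aggregation_units` is a name chosen here):

```python
import struct

def split_aggregation_units(payload: bytes) -> list:
    """Split a sequence of aggregation units, each consisting of a 16-bit
    big-endian size field followed by that many bytes of NALU data
    (the size excludes the two size bytes but includes the NALU type byte)."""
    units, offset = [], 0
    while offset < len(payload):
        (size,) = struct.unpack_from(">H", payload, offset)  # 16-bit size
        offset += 2
        units.append(payload[offset:offset + size])  # the NALU itself
        offset += size
    return units
```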
The multi-time aggregation packet (MTAP) has a structure similar to that of the STAP. It comprises a NALU header byte and one or more multi-picture aggregation units. The choice between the different MTAP types is application dependent: the larger the timestamp offset, the more flexible the MTAP, but the higher the overhead.
Two different multi-time aggregation units are defined in this specification. Both comprise, for example, 16-bit unsigned size information of the following NALU (identical to the size information of the STAP). In addition to these 16 bits, a video sequence ID field 12.1 (e.g. 7 bits), a field indicator 12.2, and n bits of timing information for the NALU are included, where n may be, for example, 16 or 24. The timing information field must be set such that the RTP timestamp (NALU-time) of an RTP packet of each NALU in the MTAP can be generated by adding the timing information to the RTP timestamp of the MTAP.
In another alternative embodiment, a multi-time aggregation packet (MTAP) comprises a NALU header byte, a decoding order number base (DONB) field 12.1 (e.g. 16 bits), and one or more multi-picture aggregation units. In this case, the two different multi-time aggregation units are defined as follows. Both comprise, for example, 16-bit unsigned size information of the following NALU (identical to the size information of the STAP). In addition to these 16 bits, a decoding order number difference (DOND) field 12.5 (e.g. 7 bits) and n bits of timing information for the NALU are included, where n may be, for example, 16 or 24. The DON of the following NALU equals DONB + DOND. The timing information field must be set such that the RTP timestamp (NALU-time) of an RTP packet of each NALU in the MTAP can be generated by adding the timing information to the RTP timestamp of the MTAP. DONB may contain the smallest value of DON among the NAL units of the MTAP.
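Recovering the DON of an aggregation unit from DONB and DOND can be shown in one line (the modulo-65536 wrap is an assumption of this sketch, consistent with the 16-bit cyclic DON used elsewhere in this description):

```python
def mtap_unit_don(donb: int, dond: int) -> int:
    """DON of an aggregation unit: packet-level base plus per-unit
    difference, wrapping within the 16-bit DON range."""
    return (donb + dond) % 65536
```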
Transmission
The transmission and/or storage of the encoded pictures (and the optional virtual decoding) can be started immediately after the first encoded picture is ready. This picture is not necessarily the first one in decoder output order, because the decoding order and the output order may not be identical.
The transmission can be started when the first picture of the video stream has been encoded. The encoded pictures are optionally stored in an encoded picture buffer 1.2. The transmission can also start at a later stage, for example after a certain part of the video stream has been encoded.
The decoder 2 should also output the decoded pictures in correct order, for example by using the ordering of picture order counts, and therefore the reordering process has to be defined clearly and normatively.
Unpacking
The unpacking process is implementation dependent. Hence, the following description is a non-restrictive example of a suitable implementation. Other schemes can be used as well. Optimizations relative to the described algorithm are also possible.
The basic concept behind these unpacking rules is to reorder NAL units from the order in which they were received into the NAL unit decoding order.
Decoding
Next, the operation of the receiver 8 is described. The receiver 8 collects all packets belonging to a picture and brings them into a reasonable order. The strictness of the order depends on the protocol applied. The received packets are preferably stored in a receiving buffer 9.1 (pre-decoding buffer). The receiver 8 discards anything that is unusable and passes the rest to the decoder 2. Aggregation packets are handled by unloading their payload into individual RTP packets carrying NALUs. Those NALUs are processed as if they had been received in separate RTP packets, in the order in which they were arranged in the aggregation packet.
For each NAL unit stored in the buffer, the RTP sequence number of the packet that conveyed the NAL unit is preferably stored and associated with the stored NAL unit. Furthermore, the packet type (simple packet or aggregation packet) that conveyed the NAL unit is stored and associated with each stored NAL unit.
In the following, N is the value of the optional num-reorder-VCL-NAL-units parameter (interleaving depth parameter), which specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the packet stream in NAL unit transmission order and follow that VCL NAL unit in RTP sequence number order or in the composition order of an aggregation packet containing the VCL NAL unit. If the parameter is not present, a value of 0 may be implied. When the video stream transmission session is initialized, the receiver buffers at least N VCL NAL units in the receiving buffer 9.1 before passing any packet to the decoder 2.
When the receiving buffer 9.1 contains at least N VCL NAL units, NAL units are removed from the receiving buffer 9.1 and passed to the decoder 2 in the order specified below, until the buffer contains N-1 VCL NAL units.
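This buffering rule can be sketched as follows (simplified: units are released in first-in-first-out order here, standing in for the VSID-distance ordering specified below; the name `deinterleave` and the `(nalu, is_vcl)` input shape are illustrative choices):

```python
from collections import deque

def deinterleave(incoming, n):
    """Buffer NAL units until at least n VCL NAL units are held, then
    release units until only n-1 VCL units remain, as new units arrive.
    `incoming` yields (nalu, is_vcl) pairs; released units are yielded
    in FIFO order for simplicity."""
    buf, vcl_count = deque(), 0
    for nalu, is_vcl in incoming:
        buf.append((nalu, is_vcl))
        vcl_count += bool(is_vcl)
        while vcl_count >= n:  # buffer holds at least n VCL units: release
            unit, vcl = buf.popleft()
            vcl_count -= bool(vcl)
            yield unit
```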
In the following, let PVSID be the video sequence ID (VSID) of the latest NAL unit passed to the decoder. All NAL units in a STAP share the same VSID. The order in which NALUs are passed to the decoder is specified as follows: if the oldest RTP sequence number in the buffer corresponds to a simple packet, the NALU in that simple packet is the next NALU in NAL unit transmission order. If the oldest RTP sequence number in the buffer corresponds to an aggregation packet, the NAL unit transmission order is recovered among the NALUs conveyed in aggregation packets in RTP sequence number order, up to the next simple packet (exclusive). This set of NALUs is referred to hereinafter as the candidate NALUs. If no NALUs conveyed in simple packets reside in the buffer, all NALUs in the buffer belong to the candidate NALUs.
For each NAL unit among the candidate NALUs, the VSID distance is calculated as follows. If the VSID of the NAL unit is larger than PVSID, the VSID distance equals VSID - PVSID. Otherwise, the VSID distance equals 2^(number of bits used to signal the VSID) - PVSID + VSID. NAL units are delivered to the decoder in ascending order of VSID distance. If several NAL units share the same VSID distance, the order in which they are passed to the decoder shall conform to the NAL unit transmission order defined in this specification. The NAL unit transmission order can be recovered as described below.
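The distance computation and ordering can be sketched as follows (a non-normative illustration assuming a 7-bit VSID field, as in the STAP payload description above; the function names are chosen here, and Python's stable sort preserves transmission order among equal distances):

```python
def vsid_distance(vsid: int, pvsid: int, bits: int = 7) -> int:
    """Wraparound distance from PVSID to VSID, as defined above.
    Note that VSID == PVSID yields the maximum distance 2**bits,
    so NALUs of the already-delivered sequence come last."""
    if vsid > pvsid:
        return vsid - pvsid
    return (1 << bits) - pvsid + vsid

def order_candidates(candidates, pvsid, bits=7):
    """Order (vsid, nalu) pairs by ascending VSID distance; ties keep
    their original transmission order because sorted() is stable."""
    return [n for _, n in
            sorted(candidates, key=lambda c: vsid_distance(c[0], pvsid, bits))]
```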
The terms PVSID and VSID are used above. Obviously, PDON (the decoding order number of the previous NAL unit of an aggregation packet in NAL unit transmission order) and DON (the decoding order number) could be used as well.
First, slices and data partitions are associated with pictures according to their frame number, RTP timestamp and first field flag: all NALUs sharing the same values of frame number, RTP timestamp and first field flag belong to the same picture. SEI NALUs, sequence parameter set NALUs, picture parameter set NALUs, end-of-sequence NALUs, end-of-stream NALUs, and filler data NALUs belong to the picture of the next VCL NAL unit in transmission order.
Second, the transmission order of pictures is concluded from the nal_ref_idc, the frame number, the first field flag and the RTP timestamp of each picture. The transmission order of pictures is in ascending order of frame numbers (in modulo arithmetic). If several pictures share the same value of frame number, the pictures having nal_ref_idc equal to 0 are transmitted first. If several pictures share the same value of frame number and all of them have nal_ref_idc equal to 0, the pictures are transmitted in ascending RTP timestamp order. If two pictures share the same RTP timestamp, the picture having a first field flag equal to 1 is transmitted first. Note that a primary coded picture and the corresponding redundant coded pictures are considered here as one coded picture.
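The tie-breaking rules above can be collected into a comparator (a sketch; the modulus `MAX_FRAME_NUM`, the dict keys, and applying the RTP-timestamp tiebreak to reference pictures as well are assumptions of this illustration):

```python
from functools import cmp_to_key

MAX_FRAME_NUM = 16  # example modulus; in practice set by the sequence parameters

def frame_num_lt(a: int, b: int) -> bool:
    """Modulo 'less than' for frame numbers: a precedes b when the forward
    distance from a to b is the shorter way around the cycle."""
    return a != b and (b - a) % MAX_FRAME_NUM < MAX_FRAME_NUM // 2

def cmp_pictures(p: dict, q: dict) -> int:
    """Compare pictures per the rules above: ascending frame number (modulo),
    then nal_ref_idc == 0 first, then ascending RTP timestamp,
    then first field flag == 1 first."""
    if p["frame_num"] != q["frame_num"]:
        return -1 if frame_num_lt(p["frame_num"], q["frame_num"]) else 1
    if (p["nal_ref_idc"] == 0) != (q["nal_ref_idc"] == 0):
        return -1 if p["nal_ref_idc"] == 0 else 1
    if p["rtp_ts"] != q["rtp_ts"]:
        return -1 if p["rtp_ts"] < q["rtp_ts"] else 1
    if p["first_field"] != q["first_field"]:
        return -1 if p["first_field"] == 1 else 1
    return 0

def order_pictures(pics):
    return sorted(pics, key=cmp_to_key(cmp_pictures))
```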
Third, if the video decoder in use does not support arbitrary slice order, the transmission order of slices and A data partitions is in ascending order of the first_mb_in_slice syntax element in the slice header. Moreover, in transmission order, the B and C data partitions immediately follow the corresponding A data partition.
The following additional unpacking rules can be used to implement an operational JVT unpacker: NALUs are presented to the JVT decoder in the order of their RTP sequence numbers. NALUs carried in an aggregation packet are presented in their order within the aggregation packet. All NALUs of an aggregation packet are processed before the next RTP packet is processed.
An intelligent RTP receiver (e.g. in a gateway) can identify lost data partitions A (DPA). If a lost DPA is detected, the gateway may decide not to send the corresponding DPB and DPC partitions, since their information is meaningless for JVT decoding without the DPA. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bitstream.
An intelligent receiver could discard all packets having nal_ref_idc equal to 0. However, it should process such packets if possible, because the user experience suffers if the packets are discarded.
The DPB 2.1 contains memory places for storing a number of pictures. Those places are also called frame stores in this description. The decoder 2 decodes the received pictures in the correct order. To do so, the decoder examines the video sequence ID information of the received pictures. If the encoder has selected the video sequence ID for each independent group of pictures freely, the decoder decodes the pictures of an independent group of pictures in the order in which they are received. If the encoder has defined the video sequence ID for each independent group of pictures using an incrementing (or decrementing) numbering scheme, the decoder decodes the independent groups of pictures in the order of their video sequence IDs. In other words, the independent group of pictures having the smallest (or largest) video sequence ID is decoded first.
The present invention can be applied in many kinds of systems and devices. The transmitting device 6 including the encoder 1 and the optional HRD 5 preferably also includes a transmitter 7 for transmitting the encoded pictures to a transmission channel 4. The receiving device 8 includes a receiver 9 for receiving the encoded pictures, the decoder 2, and a display 10 on which the decoded pictures can be displayed. The transmission channel can be, for example, a wireline communication channel and/or a wireless communication channel. The transmitting device and the receiving device also include one or more processors 1.2, 2.2 which can perform the necessary steps for controlling the encoding/decoding process of the video stream according to the invention. Hence, the method according to the invention can mainly be implemented as machine-executable steps of the processors. The buffering of pictures can be implemented in the memory 1.3, 2.3 of the devices. The program code 1.4 of the encoder can be stored in the memory 1.3. Respectively, the program code 2.4 of the decoder can be stored in the memory 2.3.

Claims (22)

1. A method for ordering encoded pictures, the method comprising an encoding step for forming encoded pictures in an encoder, wherein at least one group of pictures is formed and a picture identification is defined for each picture of the group of pictures, a transmission step for transmitting said encoded pictures to a decoder, and a rearranging step for arranging the encoded pictures in decoding order, wherein in the encoding step a video sequence identification separate from the picture identification is defined for the encoded pictures, wherein the video sequence identification is encoded to the transmitted picture stream, wherein the video sequence identification is the same for each picture of the same group of pictures, wherein in the decoding step the video sequence identification is arranged to be used for determining which pictures belong to the same group of pictures.
2. The method according to claim 1, wherein two or more groups of pictures are formed and different video sequence identifications are defined for said two or more groups of pictures.
3. The method of claim 2, wherein the decoding order of the pictures is defined in accordance with a video sequence identification.
4. The method of claim 2, wherein the video sequence identification is transmitted at a transport layer and the image identification is transmitted at a video layer.
5. A method for decoding an encoded picture stream in a decoder, said stream comprising at least one group of pictures, a picture identity being defined for each picture of the group of pictures and a video sequence identity being defined for the group of pictures separate from the picture identity, wherein the video sequence identity is encoded into the transmitted picture stream, wherein the video sequence identity is used to determine which pictures belong to the same group of pictures upon decoding.
6. The method according to claim 5, wherein one picture of each group of pictures is an independently decodable picture for which said video sequence identification is defined, the pictures of the group of pictures forming at least one sub-sequence, and each picture of the sub-sequence having the same video sequence identification as the independently decodable picture of the same group of pictures.
7. An encoder for encoding pictures and for ordering encoded pictures, comprising an arranger for forming at least one group of pictures of the encoded pictures and defining a picture identification for each picture of the group of pictures, wherein the encoder further comprises a definer for defining a video sequence identification for the encoded pictures separate from the picture identification, wherein the video sequence identification is encoded into a transmitted picture stream, the video sequence identification being arranged to be identical for each picture of the same group of pictures, wherein the video sequence identification is used for determining which pictures belong to the same group of pictures upon decoding.
8. A decoder for decoding encoded pictures to form decoded pictures, comprising a rearranger for arranging the encoded pictures in decoding order, wherein the decoder further comprises a processor for determining which pictures belong to the same group of pictures by using a video sequence identification, wherein the video sequence identification is encoded into a transmitted picture stream.
9. A method for ordering encoded pictures, the encoded pictures comprising a first and a second encoded picture, at least a first transmission unit being formed on the basis of the first encoded picture, at least a second transmission unit being formed on the basis of the second encoded picture, wherein a first identifier is defined for said first transmission unit and a second identifier is defined for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information contained in the first transmission unit and information contained in the second transmission unit, and the first and second identifiers being encoded into a transmitted picture stream.
10. The method of claim 9, wherein the identifier is defined as an integer.
11. The method of claim 10, wherein the identifier is a round-robin capable integer, wherein a larger integer indicates a later decoding order.
12. The method of claim 9, wherein the encoded picture comprises one or more slices, the first transmission unit comprises a first slice, and the second transmission unit comprises a second slice.
13. An apparatus for ordering encoded pictures comprising a first and a second encoded picture, the apparatus comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, wherein the apparatus further comprises a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of a decoding order of information comprised in the first transmission unit and information comprised in the second transmission unit, respectively, and for encoding the first and second identifiers into a transmitted picture stream.
14. The device of claim 13, wherein the device is a gateway device.
15. The device of claim 13, wherein the device is a mobile communication device.
16. The apparatus of claim 13, wherein the apparatus is a streaming server.
17. An encoder for encoding pictures and for ordering encoded pictures comprising a first and a second encoded picture, the encoder comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, wherein the encoder further comprises a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information contained in the first transmission unit and information contained in the second transmission unit, and for encoding the first and second identifiers into a transmitted picture stream.
18. The encoder of claim 17, wherein the encoded picture comprises one or more slices, said arranger being configured to include a first slice in said first transmission unit and a second slice in said second transmission unit.
19. A decoder for decoding encoded pictures for forming decoded pictures, the encoded pictures comprising first and second encoded pictures transmitted in at least a first transmission unit formed on the basis of the first encoded picture and in at least a second transmission unit formed on the basis of the second encoded picture, wherein the decoder further comprises a processor for defining a decoding order of information contained in the first transmission unit and information contained in the second transmission unit on the basis of a first identifier defined for said first transmission unit and a second identifier defined for said second transmission unit, the first and second identifiers being encoded into a transmitted picture stream.
20. A system comprising an encoder for encoding pictures and for ordering encoded pictures comprising a first and a second encoded picture, the encoder comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, and a decoder for decoding the encoded pictures, wherein the system further comprises a definer in the encoder for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the decoding order of information contained in the first transmission unit and information contained in the second transmission unit, respectively, and a processor in the decoder for determining, on the basis of said first identifier and said second identifier, the decoding order of the information contained in the first transmission unit and the information contained in the second transmission unit, the first and second identifiers being encoded into the transmitted picture stream.
21. An apparatus for ordering encoded pictures for transmission, the encoded pictures comprising a first and a second encoded picture, the apparatus comprising an arranger for forming at least a first transmission unit on the basis of the first encoded picture and at least a second transmission unit on the basis of the second encoded picture, wherein the apparatus further comprises a definer for defining a first identifier for said first transmission unit and a second identifier for said second transmission unit, the first and second identifiers being indicative of the respective decoding order of information contained in the first transmission unit and information contained in the second transmission unit, and for encoding the first and second identifiers into a transmitted picture stream.
22. An apparatus for rearranging encoded pictures for decoding, the encoded pictures comprising a first and a second encoded picture to be transmitted in at least a first transmission unit formed on the basis of the first encoded picture and in at least a second transmission unit formed on the basis of the second encoded picture, wherein the apparatus further comprises a processor for determining a decoding order of information contained in the first transmission unit and information contained in the second transmission unit on the basis of a first identifier defined for said first transmission unit and a second identifier defined for said second transmission unit, and the first and second identifiers are encoded into a transmitted picture stream.

Applications Claiming Priority (3)

Application Number                    Priority Date    Filing Date    Title
US44818903                            2003-02-18       2003-02-18
US60/448,189                          2003-02-18
PCT/FI2004/050015 (WO2004075554A1)    2003-02-18       2004-02-17     Picture decoding method

Publications (2)

Publication Number    Publication Date
HK1088162A1           2006-10-27
HK1088162B            2010-04-16

