HK1227166A1 - Post-encoding bitrate reduction of multiple object audio

Info

Publication number
HK1227166A1
Authority
HK
Hong Kong
Prior art keywords
audio object
file
bits
encoded audio
data frame
Application number
HK17100763.6A
Other languages
Chinese (zh)
Other versions
HK1227166B (en)
Inventor
Z. Fejzo
Original Assignee
DTS (British Virgin Islands) Limited
Application filed by DTS (British Virgin Islands) Limited
Publication of HK1227166A1
Publication of HK1227166B


Description

Post-encoding bit rate reduction for multi-object audio
Cross Reference to Related Applications
This application claims priority to U.S. patent application 14/199,706, entitled "Post-encoding bitrate reduction for multi-object audio," filed March 6, 2014, which is incorporated herein by reference in its entirety.
Background
Audio compression techniques minimize the number of digital bits used to create a representation of an input audio signal. Uncompressed high quality digital audio signals tend to contain large amounts of data. The large size of these uncompressed signals often makes them undesirable or unsuitable for storage and transmission.
Compression techniques may be used to reduce the file size of the digital signal. These compression techniques reduce the digital storage space required to store the audio signal for future playback or transmission. Furthermore, these techniques may be used to generate a faithful representation of the audio signal at a reduced file size. This low bit rate version of the audio signal may then be transmitted over a limited bandwidth network channel. This compressed version of the audio signal is decompressed after transmission to reconstruct an acoustically acceptable representation of the input audio signal.
As a general rule, the quality of a reconstructed audio signal is inversely proportional to the number of bits used to encode the input audio signal. In other words, the fewer bits used to encode the audio signal, the greater the difference between the reconstructed audio signal and the input audio signal. Conventional audio compression techniques fix the bit rate, and thus the level of audio quality, at the time of compression encoding. The bit rate is the number of bits used to encode the input audio signal per time segment. Further reduction of the bit rate cannot be achieved without re-encoding the input audio signal at a lower bit rate or decompressing the compressed audio signal and then re-compressing the decompressed signal at a lower bit rate. These conventional techniques are not "scalable" in addressing situations where different applications require bitstreams that are encoded at different bitrates.
One technique used to create scalable bitstreams is differential encoding. Differential encoding encodes an input audio signal into a high-bit-rate bitstream composed of subsets that are themselves lower-bit-rate bitstreams; the low-bit-rate bitstreams are then used to construct higher-bit-rate bitstreams. Differential encoding requires extensive analysis of the scaled bitstream and is computationally intensive, requiring significant processing power to achieve real-time performance.
Another scalable coding technique uses multiple compression methods to create a layered scalable bitstream. This approach uses a hybrid compression technique to cover the desired range of scalable bit rates. However, the limited scalability range and limited resolution make this layered approach unsuitable for many types of applications. For these reasons, the desired scenario of storing a single compressed audio bitstream and delivering content from this single bitstream at different bit rates is often difficult to achieve.
Summary of the Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the post-encoding bitrate reduction system and method generate one or more scaled compressed bitstreams from a single full file. The full file contains a plurality of audio object files that have previously been encoded separately. The processing of the full file is thus performed after the audio object files have been encoded, and it relies on the scalability features of the full file.
The encoding process for each encoded audio object file is scalable, such that bits can be removed from the frames of the encoded audio object file to reduce the file size. This scalability allows data to be encoded at a particular bit rate, after which any percentage of the encoded data can be cut out or discarded while still retaining the ability to correctly decode what remains. For example, if the data is encoded at bit rate Z, half of the bits in each frame may be cut or dropped to obtain half the bit rate (Z/2), and the result can still be decoded correctly.
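By way of illustration only, the following sketch (with hypothetical names; it is not the patented codec) shows the property being described: when the bytes of a frame are stored in most-important-first order, any prefix of the frame remains decodable, so reducing the bit rate amounts to truncation.

```python
def truncate_frame(frame_bytes: bytes, fraction: float) -> bytes:
    """Keep only the leading `fraction` of a frame whose bytes are ordered
    most-important-first, as in a fine-grain scalable codec. The surviving
    prefix is still decodable because the decoder can stop reading anywhere."""
    keep = max(1, int(len(frame_bytes) * fraction))
    return frame_bytes[:keep]

# Halving the bit rate Z -> Z/2 by keeping half of every frame:
frame = bytes(range(200))          # stand-in for one encoded frame
half = truncate_frame(frame, 0.5)
assert len(half) == 100
```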
One example where such fine-grain scalability and working from a single encoded full file are valuable is when streaming to devices with different bandwidths. For example, if there are multiple audio object files located on a server, embodiments of the present system and method encode these audio object files separately at some high bit rate that the content provider wants to achieve. However, if this content is streamed to a different, lower-bandwidth device, such as a cell phone, car, or television, the bit rate needs to be reduced. Working from a single encoded full file, embodiments of the present system and method allow the bit rate to be adjusted for each individual device. Thus, each delivery is tailored differently, yet a single file is used to deliver bitstreams at different bit rates. Furthermore, it is not necessary to re-encode the encoded audio object files.
Rather than re-encoding the audio object file, embodiments of the present system and method process a single version of the encoded full file and then reduce the bit rate. Furthermore, the scaling of the bit rate is done without first decoding the full file back to its uncompressed form and then re-encoding the resulting uncompressed data at a different bit rate. This can all be achieved without re-encoding the encoded audio object file.
Encoding and compression are expensive, computationally demanding processes, whereas the post-encoding bit rate scaling of embodiments of the present system and method is a very lightweight process. This means that embodiments of the present system and method place much lower demands on a server when compared to prior systems and methods that perform multiple encodings simultaneously to serve each different channel bit rate.
Embodiments of the present system and method generate a scaled compressed bitstream from a single full file. A full file at full bitrate is created by merging a plurality of separately encoded audio object files. An audio object is a source signal of a particular sound or combination of sounds. In some embodiments, the full file includes hierarchical metadata corresponding to the encoded audio object file. The hierarchical metadata contains priority information for each encoded audio object file relative to other encoded audio object files. For example, a dialogue audio object in a movie soundtrack may have a higher weight than a street noise audio object (during the same time period). In some embodiments, the entire length of time of each encoded audio object file is used in the full file. This means that even if the encoded audio object files contain silent periods, they are still contained in the full file.
Each audio object file is segmented into data frames. A time period is selected, and the data frame activity of the data frames of each encoded audio object file in that time period is compared across files. This gives a data frame activity comparison for all encoded audio object files over the selected time period. Bits are then assigned from a pool of available bits to each data frame of the encoded audio object files during the selected time period based on the data frame activity comparison and, in some cases, the hierarchical metadata. This results in a bit allocation for the selected time period. In some embodiments, the hierarchical metadata contains encoded audio object file priorities such that the files are ranked in order of priority or importance to the user. It should be noted that bits from the pool of available bits are allocated to all data frames of all encoded audio object files in the selected time period. In other words, each audio object file and the frames therein receive bits for a given time period, but some files receive more bits than others based on their frame activity and other factors.
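A minimal sketch of how such an allocation might be computed, assuming per-frame activity has already been measured and per-object priorities come from the hierarchical metadata (the weighting rule shown is illustrative, not the patent's exact formula):

```python
def allocate_bits(activity, priority, bit_pool):
    """Split `bit_pool` bits across the frames of one time period.
    `activity[i]` and `priority[i]` describe object i's frame in this
    period; every frame receives a share, and more active or
    higher-priority frames receive larger shares."""
    weights = [a * p for a, p in zip(activity, priority)]
    total = sum(weights) or 1.0
    return [int(bit_pool * w / total) for w in weights]

# Dialogue (high priority) vs. street noise during the same period:
print(allocate_bits(activity=[0.9, 0.4], priority=[2.0, 1.0], bit_pool=48000))
```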
The measurement of data frame activity may be based on any number of parameters available in the encoded bitstream. For example, audio levels, video activity, and other measures of frame activity may be used. Further, in some embodiments of the present system and method, data frame activity is measured at the encoder side and embedded in the bitstream, such as one number per frame. In other embodiments, the decoded frames may be analyzed for frame activity.
In some embodiments, data frame activity is compared between frames. Typically during a certain time period, more activity will be present in some data frames, while other data frames will have less activity. The data frame comparison includes selecting a time period and then measuring data frame activity within the data frame during the time period. Each frame of encoded audio objects is examined during a selected time period. The data frame activity in each data frame is then compared to other frames to obtain a data frame activity comparison. The comparison is a measure of the activity of a particular data frame relative to other data frames during the time period.
Embodiments of the present system and method then reduce the full file by pruning bits from the data frames according to the bit allocation to generate pruned frames. This bit reduction uses the scalability of the full file and removes bits from the data frames in reverse rank order, so that each data frame retains the number of bits given by the bit allocation and lower-ranked bits are removed before higher-ranked bits. In some embodiments, the scalability of a frame within an encoded audio object file comes from extracting tones from a frequency domain representation of the audio object file to obtain a time domain residual signal representing the audio object file with at least some of the tones removed. The extracted tones and the time domain residual signal are formatted into a plurality of data blocks, where each data block includes a plurality of bytes of data. Both the data blocks in the data frames of the encoded audio object file and the bits within the data blocks are ordered by psychoacoustic importance to obtain a ranking order from the most important bits to the least important bits.
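The following hypothetical sketch illustrates reverse-rank pruning: each frame is represented as data blocks already sorted from most to least psychoacoustically important, and blocks are dropped from the least important end until the frame fits its allocation (block names and sizes are invented).

```python
def prune_frame(blocks, bit_allocation):
    """Drop blocks from the tail (least important end) of a frame until
    the frame fits its bit allocation. `blocks` is ordered from most to
    least psychoacoustically important; each entry is (name, n_bits)."""
    kept, used = [], 0
    for name, n_bits in blocks:
        if used + n_bits > bit_allocation:
            break                  # everything after this point ranks lower
        kept.append((name, n_bits))
        used += n_bits
    return kept

frame = [("tonal_core", 800), ("residual_hi", 600), ("residual_lo", 400)]
print(prune_frame(frame, bit_allocation=1500))  # residual_lo is pruned first
```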
The bit-reduced encoded audio object files are obtained from the pruned frames. The bit-reduced encoded audio object files are then multiplexed together and packed into a scaled compressed bitstream such that the scaled compressed bitstream has a target bitrate that is less than or equal to the full bitrate, thereby achieving post-encoding bitrate reduction from a single full file.
The measured data frame activity for each data frame over the selected time period is compared to a silence threshold to determine whether there is a minimum amount of activity in each data frame. If the data frame activity for a particular data frame is less than or equal to the silence threshold, that data frame is designated as a silent data frame, and the number of bits used to represent that data frame is maintained without reducing any bits. On the other hand, if the data frame activity for a particular data frame is greater than the silence threshold, the data frame activity is stored in a frame activity buffer. The pool of available bits for the selected time period is determined by subtracting the bits used by the silent data frames during the selected time period from the number of bits allocated to the selected time period.
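A hedged sketch of this silent-frame handling, with an invented activity scale and threshold value:

```python
SILENCE_THRESHOLD = 0.05            # assumed activity units, for illustration

def classify_frames(activities, period_bits, frame_bits):
    """Return (silent_flags, available_pool) for one time period.
    Silent frames keep their existing bits untouched; the remaining
    bits in the period form the pool shared by the active frames."""
    silent = [a <= SILENCE_THRESHOLD for a in activities]
    used_by_silent = sum(b for b, s in zip(frame_bits, silent) if s)
    return silent, period_bits - used_by_silent

flags, pool = classify_frames([0.0, 0.4, 0.7], period_bits=12000,
                              frame_bits=[300, 5000, 6000])
print(flags, pool)   # [True, False, False] 11700
```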
In some embodiments, the scaled compressed bitstream is transmitted over the network channel at a bit rate less than or equal to the target bit rate. The bitstream is received by a receiving device and then decompressed to obtain decoded audio object files. In some cases, the decoded audio object files are mixed to create an audio object mix; the user may mix the decoded audio objects manually, or the mixing may be automatic. Furthermore, the encoded audio object files may be prioritized in the hierarchical metadata based on their spatial positioning in the audio object mix, and two or more decoded audio object files may be interdependent with respect to spatial masking based on their locations in the mix.
Embodiments of the present system and method may also be used to obtain multiple scaled compressed bitstreams from a single full file. This is done by separately encoding a plurality of audio object files at full bit-rate using a scalable bitstream encoder with fine granularity scalability to obtain a plurality of encoded audio object files. This fine-grained scalability feature ranks the bits in each data frame of the encoded audio object file in order of psychoacoustic importance to human hearing. The full file is generated by merging the plurality of independently encoded audio object files and corresponding hierarchical metadata. Each of the plurality of encoded audio object files is persistent and exists for the entire duration of a full file.
A first scaled compressed bitstream at a first target bit rate and a second scaled compressed bitstream at a second target bit rate are constructed from the full file. This produces multiple scaled bitstreams at different target bitrates from a single full file without any re-encoding of the multiple encoded audio object files. Further, the first target bit rate and the second target bit rate are different from each other and both are less than the full bit rate. The first target bit rate is the maximum bit rate at which the first scaled compressed bitstream will be transmitted over the network channel.
As described above, the data frame activity of the data frames of each of the plurality of encoded audio object files over the selected time period is compared across files to obtain a data frame activity comparison. The data frame activity comparison and the first target bit rate are used to allocate bits to each of the data frames of the encoded audio object files during the selected time period to obtain a bit allocation for the selected time period. The full file is reduced by pruning bits from the data frames according to the bit allocation to achieve the first target bit rate and obtain bit-reduced encoded audio object files. These bit-reduced encoded audio object files are multiplexed together and packed into the first scaled compressed bitstream at the first target bit rate. The first scaled compressed bitstream is transmitted to a receiving device at the first target bitrate and decoded to obtain decoded audio objects. These decoded audio objects are mixed to create an audio object mix.
It should be noted that alternative embodiments are possible, and that steps and elements discussed herein may be changed, added, or deleted depending on the particular embodiment. These alternative embodiments include alternative steps and alternative elements that may be used, and structural changes may be made, without departing from the scope of the present invention.
Drawings
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
Fig. 1 is a block diagram illustrating a general overview of an embodiment of a post-encoding bit rate reduction system and method.
Fig. 2 is a block diagram illustrating a general overview of an embodiment of a post-encoding bit rate reduction system that obtains multiple scaled compressed bit streams from a single full file.
Fig. 3 is a block diagram illustrating details of a first embodiment of the post-encoding bit rate reduction system shown in fig. 1 and 2.
Fig. 4 is a block diagram illustrating details of a second embodiment of the post-encoding bit rate reduction system shown in fig. 1 and 2.
Fig. 5 is a block diagram illustrating an exemplary embodiment of the scalable bitstream encoder shown in fig. 1 and 4.
Fig. 6 is a block diagram illustrating an exemplary embodiment of the post-encoding bit rate reduction system and method implemented in a networked environment.
Fig. 7 is a block diagram illustrating details of the frame-by-frame hierarchical bit allocation module shown in fig. 3.
Fig. 8 is a flow chart illustrating the general operation of an embodiment of the post-encoding bit rate reduction system and method shown in fig. 1-7.
Fig. 9 is a flow chart illustrating details of a first embodiment among the embodiments of the post-encoding bit rate reduction system and method illustrated in fig. 1-8.
Fig. 10 illustrates audio frames according to some embodiments of the post-encoding bit rate reduction systems and methods shown in fig. 1-9.
Fig. 11 illustrates an exemplary embodiment of a scalable frame of data generated by the scalable bitstream encoder shown in fig. 1.
Fig. 12 shows an exemplary embodiment of dividing a full file into a plurality of frames and time periods.
Fig. 13 shows details of a frame of a full file over a period of time.
Detailed Description
In the following description of embodiments of the post-encoding bit rate reduction system and method, reference is made to the accompanying drawings. These drawings show by way of illustration specific examples of how embodiments of the post-encoding bit rate reduction system and method may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
I. Introduction
An audio object is a source signal of a particular sound or combination of sounds. In some cases, the audio object also includes its associated presentation metadata. Rendering metadata is data that accompanies an audio object that indicates how the audio object should be rendered in audio space during playback. Such metadata may include multi-dimensional audio spatial information, position information in space, and surrounding placement information.
The audio object may represent various types of sound sources, such as musical instruments and human voices. Further, an audio object may include an audio stem, which is sometimes referred to as a sub-mix, sub-group, or bus. An audio stem may also be a single track containing a set of audio content, such as a string section, a horn section, or street noise.
In a conventional audio content production environment, audio objects are recorded. A professional audio engineer then mixes the audio objects into a final master mix. The resulting mix is then delivered to the end user for playback. In general, such a mix of audio objects is final, and the end user has little or no ability to change the mix.
In contrast to traditional audio content production, multi-object audio (or variants thereof) allows the end-user to mix audio objects after delivery. One way to control and specify such post-delivery mixing in a specific or suggested manner is by utilizing embedded metadata that is transmitted with the audio content. Another way is by providing user controls that allow the end user to directly process and mix audio objects. Multi-object audio allows end users to create unique and highly personalized audio presentations.
The multi-object audio may be stored as a file on a storage device and then transmitted in a bitstream when requested. The audio bitstream may be compressed or encoded to reduce the bit rate required to transmit the bitstream and the storage space required to store the file. In general, by way of explanation and not limitation, compression of a bitstream means that less information is used to represent the bitstream. On the other hand, encoding of the bit stream means that the bit stream is represented in another form, such as with symbols. However, encoding does not always compress the bitstream.
The encoded bitstream is transmitted over a limited bandwidth network channel. Embodiments of the post-encoding bitrate reduction system and method take separately encoded audio objects and merge them with each other and additional data to generate an encoded bitstream. When individually encoded audio objects are transmitted, the bandwidth of the encoded bitstream containing them tends to exceed the capacity of the network channel. In this case, the bitstream must be transmitted over the network channel at a bit rate lower than is suitable for the particular application. This may result in a reduction in the quality of the received audio data.
This degradation of quality is particularly problematic when multiple streams of audio data (such as multiple audio objects) are multiplexed for simultaneous or near-simultaneous transmission over a common network channel. This is because, in some cases, the bandwidth of each encoded audio object is proportionally degraded, which does not take into account the relative content of each audio object or group of audio objects. For example, one audio object may contain music, while another may contain street noise. Scaling down the bandwidth of each audio object will likely have a more detrimental effect on music data than on noise data.
There may be times when the encoded bit stream is transmitted over a network channel at a particular bit rate and channel conditions will change. For example, the bandwidth of the channel may become tightened and a lower transmission bit rate may be required. In these cases, embodiments of the post-encoding bit rate reduction system and method may react to this change in network conditions by adjusting the scaling of the encoded bit stream bit rate. For example, when the bandwidth of the network channel becomes limited, the bit rate of the encoded bit stream decreases so that transmission over the network channel can continue. Rather than re-encoding the audio object, embodiments of the present system and method process a single version of the encoded bitstream and then reduce the bitrate. The resulting scaled bit stream may then be transmitted over the network channel at a reduced bit rate.
Scenarios may arise in which it is desirable to transmit a single encoded bit stream at different bit rates over various network channels. This may occur, for example, when each network channel has different capacity and bandwidth or when the bit stream is received by devices with different capabilities. In this case, embodiments of the present system and method alleviate the need to encode or compress each channel separately. Instead, a single version of the encoded bit stream is used and the scaling of the bit rate is adjusted in response to the capacity of each channel.
The encoded bitstream may be processed in real time or substantially real time. Substantially real-time processing may occur where access to an entire audio file or program is not available, such as during a broadcast of a live sporting event. Alternatively, the audio data may be processed offline and played back in real time. This occurs when there is access to an entire audio file or program, such as in a video-on-demand application. An encoded audio bitstream may comprise a plurality of audio objects, some or all of which include sound information and associated metadata. Such metadata may include, but is not limited to, positional information such as location in space, velocity, and trajectory, and sonic characteristics such as divergence and radiation parameters.
Each audio object or group of audio objects may be encoded separately using the same or different encoding techniques. The encoding may be performed on frames or blocks of the bitstream. A "frame" is a discrete piece of data in time used in the compression and encoding of an audio signal. These data frames can be placed one after the other (like a motion picture film) in a serial sequence to create a compressed audio bitstream. Each frame has a fixed size and represents a constant time interval of the audio signal. The frame size depends on the Pulse Code Modulation (PCM) sampling rate and the bit rate of the encoding.
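As a worked example of that dependence (the sample rate, frame length, and bit rate below are chosen only for illustration):

```python
def frame_bytes(samples_per_frame, sample_rate_hz, bitrate_bps):
    """Size of one fixed-duration frame for a constant bit rate encoding."""
    frame_duration_s = samples_per_frame / sample_rate_hz
    return int(bitrate_bps * frame_duration_s / 8)

# A 1024-sample frame at 48 kHz covers ~21.3 ms; at 192 kbps that is:
print(frame_bytes(1024, 48000, 192000))   # 512 bytes per frame
```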
Each data frame is typically preceded by a header containing information about the data that follows. The header may be followed by error detection and correction data, while the remainder of the frame contains the audio data. The audio data includes PCM data and amplitude (volume) information at specific points in time. To produce recognizable sound, tens of thousands of frames are played in sequence.
Depending on the goals of a particular application, different frames (such as frames of the same object but occurring at different times) may be encoded with different bitrates based on, for example, the audio content of the frames. This method is called Variable Bit Rate (VBR) encoding because the size of the encoded data varies with time. This approach may provide flexibility and improve the quality to bandwidth ratio of the encoded data. Alternatively, the frames may be encoded using the same bit rate. This method is called Constant Bit Rate (CBR) encoding because the size of the encoded data is constant over time.
Although it is possible to transmit audio objects independently in an unencoded and uncompressed manner in order to maintain separation, this is generally not feasible due to the large bandwidth typically required to transmit such large files. Thus, some form of audio compression and encoding is frequently used to facilitate economical delivery of multi-object audio to end users. It has proven difficult to encode an audio signal comprising audio objects so as to reduce its bitrate while still maintaining suitable acoustic separation between the audio objects.
For example, some existing audio compression techniques for multiple audio objects are based on object dependencies. In particular, joint coding techniques frequently use the dependencies of audio objects based on factors such as position, spatial masking, and frequency masking. However, one challenge with these joint coding techniques is that spatial and frequency masking between objects is difficult to predict if the placement of the objects is unknown prior to delivery.
Another type of existing audio compression technique is discrete object-based audio scene coding, which typically requires computationally expensive decoding and rendering systems and high transmission or data storage rates for carrying multiple audio objects separately. Another type of encoding technique for delivering multi-object audio is multi-channel spatial audio coding. However, unlike discrete object based audio scene coding techniques, this spatial audio coding method does not define separable audio objects. Therefore, the spatial audio decoder cannot separate the contribution of each audio object in the downmix audio signal.
Yet another technique for encoding multiple audio objects is Spatial Audio Object Coding (SAOC). However, the SAOC technique cannot completely separate audio objects in a downmix signal that are concurrent in the time-frequency domain. Thus, extensive amplification or attenuation of objects by the SAOC decoder as may be required by the interactive user controls may result in a significant deterioration of the audio quality of the reproduced scene.
It should be noted that for purposes of teaching and ease of explanation, this document refers primarily to the use of audio data. However, the features described herein may also be applied to other forms of data, including video data and data containing time series signals such as seismic and medical data. Furthermore, the features described herein may also be applied to virtually any type of data manipulation, such as storage of data and transmission of data.
II. System Overview
Embodiments of the post-encoding bitrate reduction system and method encode multiple audio object files separately and independently at some full bitrate. Embodiments of the system and method then merge these encoded audio object files, along with their associated hierarchical metadata, to generate a full file. Multiple bitstreams may be obtained from a single full file. The plurality of bitstreams are at a target bit rate that is less than or equal to a full bit rate. This change in bit rate, referred to as scaling, ensures that an optimal quality is maintained at each scaled bit rate. In addition, the scaling of the bit rate is achieved without first decoding the full file back to its uncompressed form and then re-encoding the resulting uncompressed data at a different bit rate.
As explained in detail below, this scaling is implemented in part as follows. First, the audio object files are separately encoded using a scalable bitstream encoder that orders the bits in each frame based on psychoacoustic importance. Such scalable coding provides bit rate changes in a fine-grain manner by removing bits within each frame. Second, at each frame interval, the corresponding frame activity within each encoded object file is considered. Then, based on the relative relationship between these frame activity measures, embodiments of the system and method decide how much of each compressed object file's frame payload is retained. In other words, each frame payload of an audio object file is scaled based on its measured multimedia frame activity and its relationship to the frame activity in all other audio object files to be multiplexed together.
Fig. 1 is a block diagram illustrating a general overview of an embodiment of a post-encoding bit rate reduction system 100. The system 100 resides on a server computing device 110. An embodiment of the system 100 receives as input an audio signal 120. The audio signal 120 may contain various types of content in various forms and types. Further, the audio signal 120 may be in analog, digital, or other form. The type may be a signal that occurs in repeated discrete amounts, in a continuous stream, or some other type. The content of the input signal may be almost any content, including audio data, video data, or both. In some embodiments, audio signal 120 contains a plurality of audio object files.
An embodiment of the system 100 comprises a scalable bitstream encoder 130 which encodes each audio object file contained in the audio signal 120 separately. It should be noted that the scalable bitstream encoder 130 may be a plurality of encoders. As shown in fig. 1, the output from the scalable bitstream encoder 130 is M independently encoded audio object files, including an encoded audio object file (1) through an encoded audio object file (M), where M is a non-zero positive integer. The encoded audio object files (1) to (M) are merged with the associated hierarchical metadata to obtain a full file 140.
Whenever a bitstream having a particular target bitrate 160 is desired, the full file 140 is processed by the bit reduction module 150 to produce the desired bitstream. The bit reduction module 150 processes the full file 140 to produce a scaled compressed bitstream 170 having a bit rate less than or equal to the target bitrate 160. Once the scaled compressed bitstream 170 is generated, it may be transmitted to a receiving device 180. The server computing device 110 communicates with other devices, such as the receiving device 180, via a network 185. The server computing device 110 accesses the network 185 using a first communication link 190, and the receiving device 180 accesses the network 185 using a second communication link 195. In this manner, the scaled compressed bitstream 170 may be requested by and transmitted to the receiving device 180.
In the embodiment shown in fig. 1, the network channel comprises a first communication link 190, a network 185, and a second communication link 195. The network channel has some maximum bandwidth that is communicated to the bit reduction module as a target bit rate 160. The scaled compressed bit stream 170 is transmitted over the network channel at or below the target bit rate so as not to exceed the maximum bandwidth of the channel.
As described above, in some cases, it is desirable to transmit a single full file at different bit rates over multiple network channels having multiple capabilities. Fig. 2 is a block diagram illustrating a general overview of an embodiment of the post-encoding bit rate reduction system 100 that obtains multiple scaled compressed bit streams from a single full file 140. As shown in fig. 2, full file 140 contains M encoded audio object files at some full bitrate. In particular, fig. 2 shows the encoded audio object file at full bitrate (1), the encoded audio object file at full bitrate (2), the encoded audio object file at full bitrate (3), and any additional encoded audio object files (as indicated by the ellipses) comprising the encoded audio object file at full bitrate (M).
The encoded audio object files (1) to (M) are independently encoded at full bitrate by the scalable bitstream encoder 130. The full bit rate is higher than the target bit rate 160. In general, target bitrate 160 is the bitrate used to transmit content over a network channel without exceeding the available bandwidth of the channel.
In some embodiments, full file 140 encodes the M independently encoded audio object files using a high bit rate, such that the size of full file 140 is quite large. This can be problematic if the content of the full file 140 is to be transmitted over a network channel having limited bandwidth. As explained in detail below, to alleviate the difficulties associated with sending large size files, such as the full file 140, over a limited bandwidth channel, the encoded audio object files (1) through (M) are processed by the bit reduction module 150 to create multiple scaled encoded bitstreams from a single full file 140. This is accomplished in part by removing blocks of ordered data in the data frame based on bit allocation.
Although a single target bit rate 160 is shown in fig. 1, in some cases, there may be multiple target bit rates. For example, it may be desirable to transmit full file 140 via various network channels each having a different bit rate. As shown in fig. 2, there are N target bit rates 200, where N is a positive non-zero integer. Target bit rate 200 includes target bit rate (1), target bit rate (2), and so on up to target bit rate (N).
The bit reduction module 150 receives a target bit rate to scale the bit rate of the full file 140 so that the resulting scaled encoded bitstream will best fit a particular limited bandwidth channel. The target bit rates 200 are typically transmitted from an Internet Service Provider (ISP) to inform embodiments of the system 100 and method about the bandwidth requirements and capabilities of the network channel over which the bitstream will be transmitted. Each of the target bit rates 200 is less than or equal to the full bit rate.
In the exemplary embodiment of fig. 2, target bit rate 200 includes N different target bit rates, where N is a non-zero positive integer that may be equal to, less than, or greater than M. Target bit rate 200 includes a target bit rate (1), a target bit rate (2), in some cases additional target bit rates (as indicated by the ellipses), and a target bit rate (N). Typically, the target bit rates 200 will be different from each other, but they may be similar in some embodiments. Further, it should be noted that each of the target bit rates 200 may be sent together or separately over time.
The scaled compressed bitstreams shown in fig. 2 correspond to the target bitrates 200. For example, target bitrate (1) is used to create a scaled compressed bitstream (1) at target bitrate (1), target bitrate (2) is used to create a scaled compressed bitstream (2) at target bitrate (2), and so on through any additional scaled compressed bitstreams (as shown by the ellipses) up to a scaled compressed bitstream (N) at target bitrate (N), where N is the same non-zero positive integer as described above. In some embodiments, the respective target bit rates may be similar or identical, but typically they differ from one another.
It should be noted that, for purposes of teaching, a specific number of encoded audio object files, target bitrates, and scaled compressed bitstreams are shown in fig. 2. However, there are cases where N = 1 and M = 1, and a single scaled compressed bitstream is obtained from the full file 140. In other embodiments, N may be a large number, with many scaled compressed bitstreams obtained from the full file 140. Furthermore, the scaled compressed bitstreams may be created on the fly in response to requests from clients. Alternatively, they may be created in advance and stored on a storage device.
III. System Details
System details of the components of an embodiment of the post-encoding bit rate reduction system 100 will now be discussed. These components include a bit reduction module 150, a scalable bitstream encoder 130, and a frame-by-frame hierarchical bit allocation module. Further, decoding of the scaled compressed bitstream 170 on the receiving device 180 will be discussed. It should be noted that only a few of the ways in which the system can be implemented are described in detail below. Many variations are possible.
Fig. 3 is a block diagram illustrating details of a first embodiment of the post-encoding bit rate reduction system 100 shown in figs. 1 and 2. In this particular embodiment, the audio object files have already been encoded separately and individually and are contained in a full file 140. The full file 140 is input to an embodiment of the post-encoding bit rate reduction system 100. The system 100 receives the separately encoded audio object files at full bitrate 300 for further processing.
The separately encoded audio object files 300 are processed by the bit reduction module 150. As explained in detail below, the bit reduction module 150 reduces the number of bits used to represent the encoded audio object files in order to achieve the target bitrate 200. The bit reduction module 150 receives the separately encoded audio object files 300 and processes them with a frame-by-frame hierarchical bit allocation module 310. This module 310 reduces the number of bits in each frame based on a hierarchical bit allocation scheme. The output of the module 310 is bit-reduced encoded audio object files 320.
The statistical multiplexer 330 takes the bit-reduced encoded audio object files 320 and merges them. In some embodiments, the statistical multiplexer 330 allocates channel capacity or bandwidth (measured in number of bits) to each of the encoded audio object files 1 through M based at least in part on the hierarchical bit allocation scheme. In some embodiments, the encoded audio object files are Variable Bit Rate (VBR) encoded data and the statistical multiplexer 330 outputs Constant Bit Rate (CBR) encoded data.
In some embodiments, the statistical multiplexer 330 also accounts for other characteristics of the encoded audio object files during bit allocation. For example, the audio content (e.g., music, speech, noise, etc.) of the encoded audio object files may be relevant. An encoded audio object file associated with a simple percussive sound, such as noise, may require less bandwidth than an object associated with a music track. As another example, the volume of an object may be used in the bandwidth allocation (so that a loud object may benefit from a larger bit allocation). As yet another example, the frequency content of the audio data associated with an object may also be used in bit allocation (so that a wideband object may benefit from a larger bit allocation).
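A speculative sketch of how such content characteristics might fold into a per-object allocation weight; the categories and multipliers below are invented for illustration and are not specified at this level of detail by the system.

```python
# Illustrative content-type multipliers (assumptions, not specified values).
CONTENT_WEIGHT = {"music": 1.5, "speech": 1.2, "noise": 0.6}

def object_weight(content, loudness, bandwidth_hz):
    """Combine content type, volume, and spectral width into a single
    bandwidth-allocation weight for one encoded audio object."""
    return (CONTENT_WEIGHT.get(content, 1.0)
            * (0.5 + loudness)             # louder objects earn more bits
            * (bandwidth_hz / 20000.0))    # wideband objects earn more bits

print(object_weight("music", loudness=0.8, bandwidth_hz=18000))  # ~1.76
print(object_weight("noise", loudness=0.3, bandwidth_hz=6000))   # ~0.14
```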
A bitstream packager (packer) 340 then processes the multiplexed bit-reduced encoded audio object files 320 and packages them into frames and containers for transmission. The output of the bitstream packager 340 is a scaled compressed bitstream 170 containing variable size frame payloads. The scaled compressed bitstream 170 is at a bit rate less than or equal to the target bit rate 160.
In some embodiments, the audio object files are not yet encoded. Fig. 4 is a block diagram illustrating details of a second embodiment of the post-encoding bit rate reduction system 100 shown in figs. 1 and 2. Unencoded audio object files 400 are received by an embodiment of the system 100. The scalable bitstream encoder 130 independently encodes each audio object file 400 to obtain the full file 140.
The full file 140 is input to the bit reduction module 150. The frame-by-frame hierarchical bit allocation module 310 processes the full file 140 to obtain bit-reduced encoded audio object files 320. The statistical multiplexer 330 takes the bit-reduced encoded audio object files 320 and merges them. The bitstream packager 340 then processes the multiplexed bit-reduced encoded audio object files 320 and packages them into frames and containers for transmission. The output of the bitstream packager 340 is a scaled compressed bitstream 170 containing variable size frame payloads. The scaled compressed bitstream 170 is at a bit rate less than or equal to the target bit rate 160.
Fig. 5 is a block diagram illustrating an exemplary embodiment of the scalable bitstream encoder 130 shown in fig. 1 and 4. These embodiments of scalable bitstream encoder 130 comprise a plurality of scalable bitstream encoders. In the exemplary embodiment shown in fig. 5, the scalable bitstream encoder 500 comprises M encoders, i.e. scalable bitstream encoder (1) to scalable bitstream encoder (M), where M is a non-zero positive integer. The input to the scalable bitstream encoder 500 is the audio signal 120. In these embodiments, audio signal 120 contains a plurality of audio object files. In particular, the audio signal 120 comprises M audio object files, including audio object file (1) to audio object file (M).
In the exemplary embodiment shown in fig. 5, scalable bitstream encoder 500 contains M encoders for each of the M audio object files. Thus, there is an encoder for each audio object. However, in other embodiments, the number of scalable bitstream encoders may be less than the number of audio object files. Regardless of the number of scalable bitstream encoders, each of the plurality of encoders respectively encodes each of the plurality of audio object files to obtain respectively encoded object files 300, i.e., respectively encoded audio object files (1) through respectively encoded audio object files (M).
Fig. 6 is a block diagram illustrating an exemplary embodiment of the post-encoding bit rate reduction system 100 and method implemented in a networked environment. In fig. 6, embodiments of the system 100 and method are shown implemented on a computing device in the form of a media database server 600. The media database server 600 may be almost any device that includes a processor, such as a desktop computer, a notebook computer, or an embedded device such as a mobile phone.
In some embodiments, the system 100 and method are stored on the media database server 600 as a cloud-based service for cross-application, cross-device access. The server 600 communicates with other devices via the network 185. In some embodiments, one of the other devices is receiving device 180. Media database server 600 accesses network 185 using first communication link 190 and receiving device 180 accesses network 185 using second communication link 195. In this manner, the media database server 600 and the receiving device 180 may communicate and transfer data between each other.
The full file 140 containing the encoded audio object files (1) to (M) is located on the media database server 600. The full file 140 is processed by the bit reduction module 150 to obtain a bit reduced encoded audio object file 320. The bit reduced encoded audio object file 320 is processed by the statistical multiplexer 330 and the bitstream wrapper 340 to generate a scaled compressed bitstream 170 at or below a target bitrate. The target bit rate is obtained from a target bit rate 200 shown in fig. 2.
In the embodiment shown in fig. 6, the full file 140 is shown stored on the media database server 600. As described above, the full file 140 contains M encoded audio object files that are independently encoded at the full bitrate. As used in this document, bit rate is defined as the rate at which a stream of binary digits passes through a communication link or channel. In other words, the bit rate describes the rate at which bits are transferred from one location to another. The bit rate is usually expressed as a number of bits per second.
The bit rate may indicate download speed: at a given bit rate, downloading a 3 gigabyte (GB) file takes three times as long as downloading a 1 GB file. The bit rate may also indicate the quality of a media file. By way of example, an audio file compressed at 192 kilobits per second (Kbps) will typically have better or higher quality (in the form of greater dynamic range and clarity) than the same audio file compressed at 128 Kbps, because more bits are used to represent the data for each second of playback. The quality of a multimedia file can therefore be measured and indicated by its associated bit rate.
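A quick sketch of the download-time relationship (the link rate and file sizes are chosen arbitrarily):

```python
def download_seconds(file_gb: float, bitrate_mbps: float) -> float:
    """Transfer time for a file of `file_gb` gigabytes at a given link rate."""
    bits = file_gb * 8e9              # gigabytes -> bits
    return bits / (bitrate_mbps * 1e6)

# At 100 Mbps, 3 GB takes three times as long as 1 GB:
print(download_seconds(1, 100))   # 80.0 seconds
print(download_seconds(3, 100))   # 240.0 seconds
```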
In the embodiments shown in fig. 1-5, the encoded audio object file is encoded at a full bitrate greater than any target bitrate 200. This means that the encoded audio object files of the full file 140 are of higher quality than the encoded audio object files at any target bitrate 200 contained in the scaled compressed bitstream 170.
Full file 140 and each encoded audio object file are input to embodiments of the post-encoding bitrate reduction system 100 and method. As discussed in detail below, embodiments of the system 100 and method use a frame-by-frame bit reduction to reduce the number of bits used to represent an encoded audio object file. This is achieved without re-encoding the object. This results in a reduced-bit file (not shown) containing a plurality of reduced-bit encoded audio object files 320. This means that at least some of the encoded audio object files of the full file 140 are represented as bit reduced encoded audio object files 320 by a reduced number of bits compared to the full file 140. The individual bit reduced encoded audio object files 320 are then processed into a single signal by the statistical multiplexer 330 and packed into a scaled compressed bitstream 170 by the bitstream packer 340. The scaled compressed bit stream 170 is at a bit rate less than or equal to the target bit rate. In addition, the target bit rate is less than the full bit rate.
The scaled compressed bitstream 170 is transmitted to the receiving device 180 via the network 185. This transfer typically occurs when requested by the receiving device 180, but many other situations may occur, including storing the scaled compressed bitstream 170 as a file on the media database server 600. The receiving device 180 can be any network-enabled computing device capable of storing or playing back the scaled compressed bitstream 170. Although receiving device 180 is shown in fig. 6 as residing on a different computing device than embodiments of the post-encoding bit rate reduction system 100 and method, it should be noted that in some embodiments, they may reside on the same computing device (such as media database server 600).
The receiving device 180 processes the received scaled compressed bitstream 170 by using a demultiplexer 610 to separate the encoded audio object files into their individual components. As shown in fig. 6, these individual components include the encoded audio object file (1), the encoded audio object file (2), the encoded audio object file (3), any other encoded audio object files that are present (as indicated by the ellipses), up to and including the encoded audio object file (M). Each of these individual encoded audio object files is sent to a scalable bitstream decoder 620 capable of decoding the encoded audio object files. In some embodiments, the scalable bitstream decoder 620 comprises a separate decoder for each encoded audio object file.
As shown in fig. 6, in some embodiments, the scalable bitstream decoder 620 includes a scalable decoder (1) to decode the encoded audio object file (1), a scalable decoder (2) to decode the encoded audio object file (2), a scalable decoder (3) to decode the encoded audio object file (3), other scalable decoders as needed (as indicated by the ellipses), and a scalable decoder (M) to decode the encoded audio object file (M). It should be noted that in other embodiments, any number of scalable decoders may be used to decode the encoded audio object files.
The output of the scalable bitstream decoder 620 is a plurality of decoded audio object files. In particular, the plurality of decoded audio object files includes a decoded audio object file (1), a decoded audio object file (2), a decoded audio object file (3), other decoded audio object files that may be needed (as indicated by the ellipses), and a decoded audio object file (M). In this regard, the decoded audio object file may be stored for later use or immediate use. Either way, at least a portion of the decoded audio object file is input to the mixing device 630. Typically, the mixing device 630 is controlled by a user mixing the decoded audio object files to generate a personalized audio object mix 640. However, in other embodiments, the mixing of decoded audio object files may be automatically handled by embodiments of the system 100 and method. In other embodiments, the audio object mix 640 is created by a third party vendor.
Fig. 7 is a block diagram illustrating details of the frame-by-frame hierarchical bit allocation module 310 shown in fig. 3. The module 310 receives the separately encoded audio object files 300 that have been encoded at full bitrate. For a particular time period, each frame of each encoded audio object file in that time period is examined across all of the encoded audio object files for that time period (700). Hierarchical information 710 is input to a hierarchy module 720. The hierarchical information 710 includes data on how the frames should be prioritized and how the final bits should be allocated within the frames.
The bits available in the bit pool 730 are used by the allocation module 740 to determine how many bits are available to allocate among the frames during the time period. Based on the hierarchical information 710, the allocation module 740 allocates bits among the frames in that time period. These bits are allocated across the encoded audio object files, sub-bands, and frames based on the hierarchical information 710.
The allocation module 740 generates a bit allocation 750 indicating the number of bits allocated to each frame in the particular time period. Based on the bit allocation, the reduction module 760 prunes bits from each frame as needed to conform to the bit allocation 750 for that particular frame. This results in pruned frames 770 for the given time period. These pruned frames are combined to generate the bit-reduced encoded audio object files 320.
IV. Operational Overview
Fig. 8 is a flow chart illustrating the general operation of an embodiment of the post-encoding bit rate reduction system 100 and method shown in fig. 1-7. The operation begins by inputting a plurality of audio object files (block 800). These audio object files may include source signals in combination with their associated rendering metadata and may represent various sound sources. These sound sources may include individual instruments and human voices, as well as combinations of sound sources, such as audio objects of a drum kit that contain multiple tracks of the various components of the drum kit.
Next, embodiments of the system 100 and method encode each audio object file independently and individually (block 810). Such encoding employs one or more scalable bitstream encoders with fine-grain scalability features. Examples of scalable bitstream encoders with fine-grain scalability features are set forth in U.S. Patent No. 7,333,929, entitled "Modular Scalable Compressed Audio Data Stream," issued February 19, 2008, and U.S. Patent No. 7,548,853, entitled "Scalable Compressed Audio Bit Stream and Codec Using a Hierarchical Filterbank and Multichannel Joint Coding," issued June 16, 2009.
The system 100 and method merge the plurality of separately encoded audio object files and any hierarchical metadata 710 to generate the full file 140 (block 820). The full file 140 is encoded at the full bit rate. It should be emphasized that each audio object file is encoded separately in order to maintain separation and isolation between the plurality of audio object files.
The hierarchical metadata may contain at least three types of hierarchy or priority. One or any combination of these types of priority may be included in the hierarchical metadata. The first type of priority is bit priority within a frame. In these cases, the bits are placed in order of psychoacoustic importance to human hearing. The second type of priority is frame priority within an audio object file. In these cases, the importance or priority of a frame is based on the activity of the frame. If a frame's activity is high relative to other frames during the frame interval, it is ranked higher in the hierarchy than a less active frame.
The third type of priority is audio object file priority within the full file. This includes both cross-object masking and user-defined priority. In cross-object masking, a particular audio object file may be masked by another audio object file based on where the audio objects are rendered in audio space. In this case, one audio object file will have a higher priority than the masked audio object file. With user-defined priority, the user may define one audio object file as more important to them than another. For example, in a movie soundtrack, an audio object file containing dialogue may have higher importance to the user than an audio object file containing street noise or background music.
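One way to picture the three levels of priority together is as a nested structure; the field names below are hypothetical and not the patent's metadata format.

```python
from dataclasses import dataclass, field

@dataclass
class FrameMeta:
    """Priority data for one frame (second level: frame priority)."""
    activity: float           # higher activity ranks the frame higher
    # First level: indices of the frame's bits/blocks ordered by
    # psychoacoustic importance to human hearing.
    bit_ranking: list = field(default_factory=list)

@dataclass
class ObjectMeta:
    """Priority data for one audio object file (third level)."""
    name: str
    user_priority: int        # 1 = most important to the user
    maskable: bool            # True if another object may spatially mask it
    frames: list = field(default_factory=list)

# A movie soundtrack: dialogue outranks street noise for most users.
dialogue = ObjectMeta("dialogue", user_priority=1, maskable=False)
street = ObjectMeta("street_noise", user_priority=3, maskable=True)
street.frames.append(FrameMeta(activity=0.12))
```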
Based on the desired target bit rate, the full file 140 is processed by the bit reduction module 150 to produce a scaled compressed bit stream 170. The scaled compressed bitstream is generated without any re-encoding. Further, the scaled compressed bit stream is designed to be transmitted over a network channel at a bit rate equal to or less than the target bit rate.
The target bit rate is always less than the full bit rate. Further, it should be noted that each audio object is independently encoded at a full bitrate that exceeds any target bitrate 200. In the case where the target bitrate is not known prior to encoding, each audio object is encoded at the maximum available bitrate or at a bitrate that exceeds the highest expected target bitrate to be used during transmission.
To obtain a scaled compressed bitstream, embodiments of the system 100 and method divide the full file 140 into a series of frames. In some embodiments, each audio object file in the full file 140 exists throughout the entire duration of the file 140. This is true even if the audio object file contains periods of silence during playback.
Referring again to fig. 8, embodiments of the system 100 and method select a frame time interval (or time period) and compare the frame activity for the frames during the selected time period (block 830). Such a frame time interval comprises a frame from each audio object. The frame-by-frame comparison for a selected time period generates a data frame activity comparison for that time period. In general, frame activity is a measure of how difficult the audio in a frame is to encode. Frame activity may be determined in a number of ways. In some embodiments, the frame activity is based on the number of extracted tones and the resulting residual energy of the frame. Other embodiments calculate the entropy of the frame to derive the frame activity.
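A minimal sketch of the entropy variant, assuming frame activity is taken as the Shannon entropy of a coarse histogram of the frame's samples (the binning scheme is illustrative):

```python
import math
from collections import Counter

def frame_entropy(samples, n_bins=64):
    """Estimate frame activity as the Shannon entropy of a coarse
    histogram of the frame's samples; busier frames score higher."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0
    bins = Counter(int((s - lo) / span * (n_bins - 1)) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

quiet = [0.0] * 1024                           # silence: entropy 0
busy = [math.sin(0.7 * i) * i % 3 for i in range(1024)]
print(frame_entropy(quiet), frame_entropy(busy))
```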
Bits are designated or allocated from the pool of available bits in the frames for the selected time period (block 840). Bits are allocated based on data frame activity and the hierarchical metadata. Once the bit allocation among the frames for the selected time period is known, bits are distributed among the frames. Each frame is then pruned to conform to its bit allocation by removing any bits that exceed the allocation (block 850). As explained in detail below, this bit reduction is performed in an ordered fashion such that the bits with the highest priority and importance are the last to be pruned.
This bit reduction, which yields the pruned frames in each of the encoded audio object files, generates the bit-reduced encoded audio object files 320 (block 860). The bit-reduced encoded audio object files 320 are then multiplexed together (block 870). The system 100 and method then wrap the multiplexed bit-reduced encoded audio object files 320 with a bitstream wrapper 340 to obtain the scaled compressed bitstream 170 at the target bitrate (block 880).
In some cases, the need to transmit encoded audio objects at several different bit rates may arise. For example, if the full file is stored on the media database server 600, it may be requested by several clients each having different bandwidth requirements. In this case, a plurality of scaled compressed bitstreams may be obtained from a single full file 140. Further, each scaled compressed bit stream may be at a different target bit rate, where each target bit rate is less than the full bit rate. All this can be done without re-encoding the encoded audio object file.
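A minimal sketch of this one-file, many-clients scenario follows; the `scale_to_bitrate` helper, the frame rate, and the client bitrates are all hypothetical stand-ins rather than anything defined by the system 100 and method:

```python
def scale_to_bitrate(frames, target_bps, frame_rate=50):
    """Minimal sketch: trim each frame's ranked byte payload to the
    per-frame budget implied by target_bps. A real scaler also honors
    chunk boundaries and updates lengths and checksums (section V.B)."""
    budget_bytes = target_bps // frame_rate // 8
    return [frame[:budget_bytes] for frame in frames]

# One full file serving three different-bandwidth clients, no re-encoding.
full_file_frames = [bytes(1024) for _ in range(100)]   # stand-in frames
streams = {bps: scale_to_bitrate(full_file_frames, bps)
           for bps in (256_000, 128_000, 64_000)}
```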
Embodiments of the system 100 and method may then transmit one or more of the scaled compressed bitstreams to the receiving device 180 at a bit rate equal to or less than the target bit rate. The receiving device 180 then demultiplexes the received scaled compressed bitstream to obtain a plurality of bit-reduced encoded audio objects. Next, the system 100 and method decode the bit-reduced encoded audio objects using at least one scalable bit rate decoder to obtain a plurality of decoded audio object files. The decoded audio object files may then be mixed by the end user or content provider, or mixed automatically, to generate an audio object mix 640.
V. Details of the operation
Embodiments of the post-encoding bit rate reduction system 100 and method include embodiments that handle silent periods of audio and embodiments that deliver a single full file to a variety of different-bandwidth network channels. The silent period embodiments address those situations in which several audio object files have significant periods of time where the audio is silent or at a very low level relative to the other audio object files. For example, audio content containing music may have long periods where the vocal track is silent or at a very low level. When these audio object files are encoded with a constant bit rate audio codec, a significant amount of data payload is wasted on encoding the silent periods.
The system 100 and method take advantage of the fine-grain scalability of each encoded audio object file to avoid wasting data (or frame) payload during silent periods. This reduces the overall compressed data payload without affecting the quality of the reconstructed compressed audio. In some embodiments, the encoded audio object file has start and stop times. The start time indicates the point in time where the silence starts and the stop time indicates the point in time where the silence ends. In these cases, the system 100 and method may mark the frames between the start and stop times as empty frames. This allows bits to be assigned to the frames of other audio object files during that time period.
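A sketch of marking the frames that fall inside a silent stretch might look like this; the frame duration, function name, and whole-frame containment rule are assumptions for illustration:

```python
def mark_silent_frames(num_frames, silence_start, silence_stop,
                       frame_dur=0.02):
    """Return one flag per frame that is True when the frame's interval
    lies entirely inside [silence_start, silence_stop)."""
    flags = []
    for i in range(num_frames):
        t0, t1 = i * frame_dur, (i + 1) * frame_dur
        flags.append(t0 >= silence_start and t1 <= silence_stop)
    return flags

# Frames flagged True are treated as empty frames; their bits return to
# the pool for the other objects' frames in the same time period.
```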
In other scenarios, an immediate bit rate reduction scheme may be required in addition to or instead of the silent period embodiments. This may occur, for example, when a single high-quality encoded audio file or bitstream containing multiple audio object files is stored on a server that must simultaneously serve clients with different connection bandwidths. The single-full-file embodiments use the fine-grain scalability features of the audio file or bitstream to reduce the overall bitrate of the encoded audio object files while maintaining as much overall quality as possible.
Operational details of embodiments of the system 100 and method will now be discussed. Fig. 9 is a flow chart illustrating details of a first embodiment of the post-encoding bit rate reduction system 100 and method shown in figs. 1-8. The operation begins by inputting a full file containing a plurality of individually encoded audio object files (block 900). Each of the plurality of encoded audio object files is segmented into data frames (block 905).
The system 100 and method then select a time period at the beginning of the full file (block 910). This time period ideally coincides with the time length of each frame. The method processes the data frames for the selected time period and then processes the remaining data frames sequentially by taking time periods in chronological order. In other words, the next time period selected is the one adjacent in time to the previous time period, and the data frames during each time period are processed using the methods described above and below.
Next, the system 100 and method selects data frames for a plurality of encoded audio object files during a selected time period (block 915). Frame activity is measured for each data frame in the audio object file during a selected time period (block 920). As described above, various techniques may be used to measure frame activity.
For each data frame during the time period, the system 100 and method determine whether the measured frame activity is greater than a silence threshold (block 925). If so, the frame activity for the data frame is stored in a frame activity buffer (block 930). If the measured frame activity is less than or equal to the silence threshold, the data frame is designated as a silent data frame (block 935). This designation means that the data frame has been reduced to a minimum payload, and that the number of bits in the frame represents the data frame without further reduction. The silent data frame is then stored in the frame activity buffer (block 940).
The system 100 and method then compare the data frame activity stored in the frame activity buffer for each data frame in the selected time period with that of the other data frames for the current time period (block 945). This produces the data frame activity comparison. The system 100 and method then determine the number of bits used by any silent frames during the time period (block 950). The number of available bits that can be allocated to the remaining data frames during the time period is then determined by subtracting the bits used by any silent data frames from the number of bits allocated to the time period (block 955).
Bit allocation among the remaining data frames is performed by allocating the available bits to the data frames from each encoded audio object file for the selected time period (block 960). This bit allocation is performed based on the data frame activity comparison and the hierarchical metadata. Next, the ordered bits in each data frame are pruned to fit the bit allocation (block 965). In other words, bits are removed from the data frame such that the least important bits are removed first and the most important bits are removed last. This continues until only the number of bits allocated to that particular frame remains. The result is a pruned data frame.
These pruned data frames are stored (block 970) and a determination is made as to whether there are more time periods (block 975). If so, the next sequential time period is selected (block 980), and the process begins again by selecting the data frames for the plurality of encoded audio object files in the new time period (block 915). Otherwise, the pruned data frames are packed into the scaled compressed bitstream (block 985).
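Putting the steps of fig. 9 together, one pass of the loop for a single time period might look like the following sketch; the silence threshold, the minimum silent-frame payload, and the proportional-share allocation rule are all assumptions for illustration:

```python
SILENCE_THRESHOLD = 0.1     # assumed activity floor (blocks 925/935)
MIN_SILENT_BITS = 32        # assumed minimum payload of a silent frame

def process_time_period(frames, activities, period_bits):
    """frames: one ranked-bit frame per audio object for this period;
    activities: their measured frame activities; period_bits: the
    total bit budget for the period."""
    silent = [a <= SILENCE_THRESHOLD for a in activities]
    # Remove the bits consumed by silent frames from the pool (block 955).
    pool = period_bits - MIN_SILENT_BITS * sum(silent)
    active_total = sum(a for a, s in zip(activities, silent) if not s) or 1.0
    pruned = []
    for frame, act, is_silent in zip(frames, activities, silent):
        if is_silent:
            pruned.append(frame[:MIN_SILENT_BITS])       # block 935
            continue
        share = int(pool * act / active_total)           # blocks 945-960
        # Block 965: keep the highest-ranked bits, drop the tail.
        pruned.append(frame[:share])
    return pruned
```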
V.A. Frames and containers
As discussed above, in some embodiments, the full file 140 includes a plurality of encoded audio object files. Some or all of these encoded audio object files may contain any combination of audio data, sound information, and associated metadata. Furthermore, in some embodiments, the encoded audio object file may be divided or partitioned into data frames. The use of data frames or frames may be efficient for streaming applications. In general, a "frame" is a discrete segment of data created by a codec and used in encoding and decoding.
Fig. 10 illustrates an audio frame 1000 according to some embodiments of the post-encoding bit rate reduction system 100 and method illustrated in fig. 1-9. Frame 1000 includes a frame header 1010, which may be configured to indicate the beginning of frame 1000, and a frame trailer 1020, which may be configured to indicate the end of frame 1000. The frame 1000 also includes one or more blocks of encoded audio data 1030 and corresponding metadata 1040. The metadata 1040 includes one or more segment header 1050 blocks that may be configured to indicate the start of a new metadata segment. The metadata 1040 may include hierarchical metadata 710 used by the hierarchy module 720.
The non-grouped audio objects may be included as object fragments 1060. The grouped audio object 1070 may include grouping start and end blocks. These blocks may be configured to indicate the start and end of a new group. Further, the grouped audio object 1070 may include one or more object fragments. In some embodiments, the frame 1000 may then be packaged into a container (such as MP4).
In general, a "container" or package format is a metafile format that specifies how different data elements and metadata coexist in a computer file. A container refers to the way data is organized within a file, regardless of the encoding scheme used. Furthermore, the container "wraps" multiple bitstreams together and synchronizes the frames to ensure that they are presented in the correct order. The container may also be responsible for adding information to the streaming server if needed so that the streaming server knows when to send which part of the file. As shown in fig. 10, frame 1000 may be packaged into a container 1080. Examples of digital container formats that may be used for container 1080 include Transport Stream (TS), Material exchange format (MXF), motion picture experts Group, Part 14 (MP 4), and so forth.
V.B. Fine-grain bitstream scalability
The structure and order of the elements placed in the scaled compressed bitstream 170 provides a wide bit range and fine granularity scalability of the bitstream 170. This structure and order allows the bitstream 170 to be smoothly scaled by external mechanisms such as the bit reduction module 150.
Fig. 11 illustrates an exemplary embodiment of a scalable frame of data generated by the scalable bitstream encoder 130 shown in fig. 1. It should be noted that one or more other types of audio compression codecs based on other decomposition rules may be used to provide fine-grain scalability to embodiments of the post-encoding bitrate reduction system 100 and method. In these cases, the other codec will provide a different set of psychoacoustically relevant elements.
The scalable compressed bitstream 170 used in the example of fig. 11 is composed of a plurality of Resource Interchange File Format (RIFF) data structures (referred to as "chunks"). It should be noted that this is an exemplary embodiment and that other types of data structures may be used. The RIFF file format, well known to those skilled in the art, allows both the type and the amount of data carried by a chunk to be identified. It should be noted that any bitstream format that carries information about the amount and type of data carried in its defined bitstream data structures may be used with embodiments of the system 100 and method.
Fig. 11 shows the layout of the scalable bit rate frame block 1100, along with sub-blocks including the grid 1 block 1105, pitch 1 block 1110, pitch 2 block 1115, pitch 3 block 1120, pitch 4 block 1125, and pitch 5 block 1130. In addition, the sub-blocks include a high-resolution grid block 1135, a time sample 1 block 1140, and a time sample 2 block 1145. These blocks constitute the psychoacoustic data carried within the frame block 1100. Although fig. 11 depicts only the block identification (ID) and block length of the frame block 1100, sub-block ID and sub-block length data are included in each sub-block.
Fig. 11 shows the order of the blocks in a frame of the scalable bitstream. These blocks contain the psychoacoustic audio elements generated by the scalable bitstream encoder 130 shown in fig. 1. In addition to the blocks being arranged in order of psychoacoustic importance, the audio elements within each block are also arranged in order of psychoacoustic importance.
The last block in the frame is a null block 1150. It is used for padding in cases where a frame is required to be of a constant or particular size; the null block 1150 therefore has no psychoacoustic relevance. As shown in fig. 11, the least significant psychoacoustic block is the time sample 2 block 1145. In contrast, the most important psychoacoustic block is the grid 1 block 1105. In operation, if the scalable bit rate frame block 1100 needs to be scaled down, data is removed starting from the psychoacoustically least relevant block at the end of the bitstream (the time sample 2 block 1145) and moving up the psychoacoustic relevance ranking. This moves from right to left in fig. 11. This means that the psychoacoustically most relevant block (the grid 1 block 1105), carrying the highest possible quality in the scalable bit rate frame block 1100, is the least likely to be removed.
It should be noted that the highest target bitrate (along with the highest audio quality) that the bitstream will be able to support is defined at the time of encoding. The lowest bit rate after scaling, however, may be defined by the acceptable audio quality level for the application. Not every psychoacoustic element that is removed uses the same number of bits. By way of example, the scaling resolution for the exemplary embodiment shown in fig. 11 ranges from 1 bit for the elements of lowest psychoacoustic significance to 32 bits for those of highest psychoacoustic significance.
It should also be noted that the mechanism for scaling the bitstream does not require that an entire block be removed at once. As previously indicated, the audio elements within each block are arranged such that the psychoacoustically most important data is placed at the beginning of the scalable bit rate frame block 1100 (closest to the right side of fig. 11). For this reason, audio elements may be removed from the end of a block, one element at a time, by the scaling mechanism while maintaining the best possible audio quality as each element is removed from the scalable bit rate frame block 1100. This is what is meant by "fine-grain scalability".
The system 100 and method remove audio elements within a block as needed and then update the block length field of the particular block from which the audio elements were removed. In addition, the system 100 and method update the frame block length 1155 and the frame checksum 1160. With the updated block length field of each scaled block, along with the updated frame block length 1155 and the updated frame checksum information, the decoder can correctly process and decode the scaled bitstream. Furthermore, the system 100 and method can automatically generate an audio output signal at a fixed data rate even if some blocks in the bitstream are missing audio elements and other blocks are missing from the bitstream entirely. In addition, a frame block identification (frame block ID 1165) is included in the scalable bit rate frame block 1100 for identification purposes. The frame block data 1170 spans (moving from right to left) from the checksum 1160 to the null block 1150.
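The bookkeeping described above might be sketched as follows, with blocks held as (id, elements) pairs ordered from most to least psychoacoustically relevant; the size and checksum handling here are simplified stand-ins, not the actual bitstream format:

```python
import zlib

def scale_frame(chunks, bits_to_remove):
    """chunks: ordered list of (chunk_id, elements), most relevant first
    (grid 1 ... time sample 2); each element is (num_bits, value).
    Elements are popped from the tail of the least relevant chunk
    upward, then the lengths and the checksum are refreshed."""
    remaining = bits_to_remove
    for _, elements in reversed(chunks):         # time sample 2 first
        while elements and remaining > 0:
            bits, _ = elements.pop()             # least important element
            remaining -= bits
    chunk_lengths = {cid: sum(b for b, _ in els) for cid, els in chunks}
    frame_length = sum(chunk_lengths.values())
    frame_checksum = zlib.crc32(repr(chunks).encode())   # stand-in
    return chunks, chunk_lengths, frame_length, frame_checksum
```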
V.C. Bit allocation
An example of bit allocation between frames during a time period will now be discussed. It should be noted that this is just one of several ways in which bit allocation may be performed. Fig. 12 shows an exemplary embodiment of an example of dividing the full file 140 into a plurality of frames and time periods. As shown in fig. 12, the full file 140 is shown divided into a plurality of frames for a plurality of audio objects. The x-axis is the time axis and the y-axis is the encoded audio object file number. In this example, there are M encoded audio objects, where M is a positive non-zero integer. Furthermore, in this illustrative example, each encoded audio object file exists for the entire duration of the full file 140.
Looking from left to right across the time axis, it can be seen that each encoded audio object (numbered 1 to M) is divided into X frames, where X is a positive non-zero integer. Each frame is marked by the label F_{M,X}, where F indicates a frame, M is the audio object file number, and X is the frame number. For example, frame F_{1,2} represents the second frame of encoded audio object file #1.
As shown in fig. 12, a time period 1200 corresponding to the length of a frame is defined for the full file 140. Fig. 13 shows details of the frames of the full file 140 during the time period 1200. Shown in each frame are its frequency components, ordered by their relative importance to the quality of the full file 140. It should be noted that the x-axis is frequency (in kHz) and the y-axis represents the magnitude (in decibels) of a particular frequency. For example, in frame F_{1,1} it can be seen that 7 kHz is the most important frequency component (in this example), followed by the 6 kHz and 8 kHz components, respectively, and so on. Thus, each frame of each audio object contains these ranked frequency components.
The target bit rate is used to determine the number of available bits for the time period 1200. In some embodiments, psychoacoustics (such as masking curves) is used to distribute the available bits across the frequency components in a non-uniform manner. For example, the number of bits available for each of the 1, 19, and 20 kHz frequency components may be 64 bits, while 2048 bits may be available for each of the 7, 8, and 9 kHz frequency components. This is because, according to the masking curve, the human ear is most sensitive to the 7, 8, and 9 kHz frequency components, whereas it is relatively insensitive to very low and very high components, i.e., frequency components at 1 kHz and below and at 19 and 20 kHz. Although psychoacoustics is used here to determine the distribution of available bits across the frequency range, it should be noted that many other techniques may be used to distribute the available bits.
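Using the figures quoted above, the non-uniform split could be tabulated as in this sketch; the default budget for unlisted components is an assumption, and a real masking model would compute these values rather than look them up:

```python
# Example per-component budgets quoted above: the ear's most sensitive
# band gets far more bits than the extremes of the spectrum.
BITS_PER_COMPONENT = {
    1: 64, 19: 64, 20: 64,        # kHz -> bits (insensitive extremes)
    7: 2048, 8: 2048, 9: 2048,    # kHz -> bits (most sensitive band)
}

def component_budget(freq_khz, default=256):
    """Look up the bit budget for a frequency component; components not
    listed fall back to an assumed default value."""
    return BITS_PER_COMPONENT.get(freq_khz, default)
```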
Embodiments of the post-encoding bitrate reduction system 100 and method then measure the frame activity of each frame for a corresponding time period 1200 for each encoded audio object file. The frame activity of each data frame of each encoded audio object file is compared to each other during time period 1200. This is referred to as a data frame activity comparison, which is the frame activity relative to other frames during time period 1200.
In some embodiments, the frame is assigned a frame activity number. As an example, it is assumed that the number of audio object files is 10, so that the frame activity numbers range from 1 to 10. In this example, 10 means the frame with the largest frame activity and 1 means the frame with the smallest activity during the time period 1200. It should be noted that many other techniques may be used to rank the frame activity within each frame during time period 1200. Based on the data frame activity comparison and the available bits from the bit pool, embodiments of the system 100 and method then allocate the available bits among the frames of the encoded audio object file for the time period 1200.
The number of available bits and the data frame activity comparison are used by the system 100 and method to reduce the bits in each frame as needed to match the assigned bits. The system 100 and method take advantage of the fine-grain scalability features and the fact that bits are ranked in order of importance based on the hierarchical metadata. For example, referring to fig. 13, assume that frame F_{1,1} has only enough allocated bits to represent its first four frequency components. This means that the 7, 6, 8, and 3 kHz frequency components will be included in the bit-reduced encoded bitstream. The 5 kHz component of F_{1,1}, and those ranked lower, are discarded.
In some embodiments, the data frame activity comparison is weighted by audio object importance. This information is contained in the hierarchical metadata 710. As an example, assume that encoded audio object file #2 is important to the audio signal; this may occur if the audio is a movie soundtrack and encoded audio object file #2 is the dialog track. Even though encoded audio object file #9 may have the highest relative frame activity ranking of 10 while encoded audio object file #2 has a ranking of 7, the ranking of encoded audio object file #2 may be increased to 10 by the weighting because of the importance of the audio object. It should be noted that many variations of the above techniques, as well as other techniques, may be used to allocate bits.
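The dialog-track example might be expressed as a weighting step like the following sketch; the 1.5 importance weight and the clamp to the 1-10 scale are assumptions chosen to reproduce the example above:

```python
def weighted_activity_rank(rank, importance, max_rank=10):
    """Boost a frame's activity rank (1..max_rank) by its object's
    importance weight from the hierarchical metadata, clamped to scale."""
    return min(max_rank, round(rank * importance))

# A dialog object ranked 7, with an assumed weight of 1.5, is promoted
# to the top of the 1-10 scale, matching the example above.
assert weighted_activity_rank(7, 1.5) == 10
```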
VI. Alternative embodiments and exemplary operating environment
Many other variations from those described herein will be apparent from this document. For example, depending on the embodiment, certain acts, events or functions of any methods and algorithms described herein can be performed in a different order, added, combined, or excluded altogether (such as where not all described acts or events are necessary for the practice of the methods and algorithms). Moreover, in some embodiments, acts or events may be performed concurrently, such as through multi-threaded processing, interrupt processing, or multiple processors or processor cores, or on other parallel architectures, rather than sequentially. Further, different tasks or processes may be performed by different machines and computing systems that may function together.
The various illustrative logical blocks, modules, methods, and algorithm processes and sequences described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and process actions have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality may be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this document.
The various illustrative logical blocks and modules described in connection with the embodiments disclosed herein may be implemented or performed with a machine such as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be a controller, microcontroller, or state machine, combinations of the above, or the like. A processor may also be implemented as a computing device, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
Embodiments of the post-encoding bit rate reduction system 100 and method described herein operate in various types of general purpose or special purpose computing system environments or configurations. In general, a computing environment may include any type of computer system, including but not limited to one or more microprocessor-based computer systems, mainframe computers, digital signal processors, portable computing devices, personal organizers, device controllers, computing engines in appliances, mobile telephones, desktop computers, mobile computers, tablet computers, smart phones, and appliances with embedded computers, to name a few.
Such computing devices may typically be found in devices having at least some minimum computing capability, including but not limited to personal computers, server computers, hand-held computing devices, laptop or mobile computers, communication devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, and the like. In some embodiments, the computing device will include one or more processors. Each processor may be a dedicated microprocessor, such as a Digital Signal Processor (DSP), Very Long Instruction Word (VLIW), or other microcontroller, or may be a conventional Central Processing Unit (CPU) having one or more processing cores, including a dedicated Graphics Processing Unit (GPU) based core in a multi-core CPU.
The processing acts of a method, process, or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in any combination of the two. The software modules may be embodied in a computer-readable medium accessible by a computing device. Computer-readable media include both volatile and nonvolatile media, either removable or non-removable, or some combination thereof. Computer-readable media are used to store information such as computer-readable or computer-executable instructions, data structures, program modules or other data. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.
Computer storage media includes, but is not limited to, computer or machine readable media or storage devices, such as blu-ray discs (BD), Digital Versatile Discs (DVD), Compact Discs (CD), floppy disks, tape drives, hard drives, optical drives, solid state memory devices, RAM memory, ROM memory, EPROM memory, EEPROM memory, flash memory or other memory technology, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computing devices.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable storage medium, media, or physical computer storage known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
As used in this document, the phrase "non-transitory" means "persistent or long-lived". The phrase "non-transitory computer readable medium" includes any and all computer readable media, with the sole exception of a transitory propagating signal. By way of example, and not limitation, this includes non-transitory computer-readable media such as register memory, processor cache, and Random Access Memory (RAM).
The retention of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., may also be implemented by encoding one or more modulated data signals, electromagnetic waves (such as carrier waves), or other transport mechanisms or communication protocols using a variety of communication media and includes any wired or wireless information delivery mechanisms. Generally, these communication media refer to signals whose one or more characteristics are set or changed in such a manner as to encode information or instructions in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, Radio Frequency (RF), infrared, laser, and other wireless media for transmitting, receiving, or both, or modulated data signals or electromagnetic waves. Combinations of any of the above should also be included within the scope of communication media.
Additionally, one or any combination of software, programs, computer program products, which embody some or all, or portions thereof, of the various embodiments of the post-encoding bit rate reduction system 100 and method described herein may be stored, received, transmitted, or read from any desired combination of a computer or machine readable medium or storage device and a communication medium in the form of computer-executable instructions or other data structures.
Embodiments of the post-encoding bitrate reduction system 100 and methods described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or in a cloud of one or more devices that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the instructions described above may be implemented partially or fully as hardware logic circuits, which may or may not include a processor.
As used herein, conditional language, such as "capable," "may," "for example," and the like, is generally intended to convey that certain embodiments include, but not others, certain features, elements, and/or states, unless otherwise stated or understood in context otherwise, as used. Thus, such conditional language is not generally intended to imply that features, elements, and/or states are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether such features, elements, and/or states are included or are to be performed in any particular embodiment. The terms "comprising," "having," and the like are synonymous and are used inclusively in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and the like. Also, the term "or" is used in its inclusive sense (and not in its exclusive sense) such that, when used in conjunction with, for example, a list of connected elements, the term "or" refers to one, some, or all of the elements in the list.
While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or algorithm illustrated may be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments of the present invention described herein may be embodied within a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others.
Furthermore, although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (22)

1. A method performed by one or more processing devices for generating a scaled compressed bitstream from a single full file, comprising:
creating a full document having a full bitrate by combining a plurality of separately encoded audio object files, wherein an audio object is a source signal of a specific sound or a combination of sounds;
segmenting each encoded audio object file into data frames;
comparing the data frame activity of the data frames of each encoded audio object file over a selected time period with each other to obtain a data frame activity comparison of all encoded audio object files over the selected time period;
assigning bits from the pool of available bits to each data frame of the encoded audio object file during the selected time period based on the data frame activity comparison to obtain a bit assignment for the selected time period;
scaling down the full file by pruning bits from the data frames in accordance with the bit allocation to generate pruned frames;
obtaining bit-reduced encoded audio object files from the pruned frames and multiplexing the bit-reduced encoded audio object files together; and
packing the multiplexed reduced-bit encoded audio object files into a scaled compressed bitstream such that the scaled compressed bitstream has a target bitrate that is less than or equal to the full bitrate so as to facilitate post-encoding bitrate reduction of a single full file.
2. The method of claim 1, further comprising:
creating the full file by merging the plurality of individually encoded audio object files and corresponding hierarchical metadata, wherein the hierarchical metadata contains priority information for each encoded audio object file relative to other encoded audio object files; and
assigning bits from the pool of available bits to each data frame based on the data frame activity comparison and the hierarchical metadata to obtain the bit assignment for the selected time period.
3. The method of claim 1, wherein the full length of time of each encoded audio object file is used to create the full file.
4. The method of claim 1, further comprising assigning bits from the pool of available bits to all data frames and all encoded audio object files for the selected time period.
5. The method of claim 2, further comprising:
measuring data frame activity for each data frame over a selected time period; and
comparing the data frame activity of each data frame to a silence threshold to determine whether there is a minimum amount of activity in any data frame.
6. The method of claim 5, further comprising:
designating a particular data frame as a silent data frame having a minimum amount of activity if the data frame activity of the particular data frame is less than or equal to the silence threshold, and maintaining the same number of bits used to represent the silent data frame without any bit reduction; and
storing the data frame activity in a frame activity buffer if the data frame activity for the particular data frame is greater than the silence threshold.
7. The method of claim 6, further comprising determining the pool of available bits for the selected time period by subtracting the bits used by the silent data frames during the selected time period from a plurality of bits allocated to the selected time period.
8. The method of claim 2, further comprising pruning bits of a data frame in reverse ranking order to meet the plurality of bits allocated to the data frame in the bit allocation, such that lower-ranked bits are pruned before higher-ranked bits.
9. The method of claim 8, further comprising:
extracting the tones from the frequency domain representation of the audio object file to obtain a time domain residual signal representing the audio object file with at least some of the tones removed;
formatting the extracted tones and the time domain residual signal into a plurality of data blocks, each data block comprising a plurality of bytes of data; and
ordering both the data blocks in the data frames of the audio object file and the bits in the data blocks in order of psychoacoustic importance to obtain a ranked order from most significant bits to least significant bits.
10. The method of claim 2, further comprising:
transmitting the scaled compressed bitstream over a network channel at a bit rate less than or equal to the target bit rate; and
receiving and decoding the scaled compressed bitstream to obtain decoded audio object files.
11. The method of claim 10, further comprising mixing the decoded audio object files to create an audio object mix, wherein two or more of the decoded audio object files are dependent on each other for spatial masking based on their location in the mix.
12. The method of claim 2, further comprising prioritizing the encoded audio object files in the hierarchical metadata based on spatial positioning in the audio object mix.
13. The method of claim 2, further comprising prioritizing the encoded audio object files based on the importance of each audio object file in the audio object mix to the user.
14. A method for obtaining a plurality of scaled compressed bitstreams from a single full document, comprising:
separately encoding a plurality of audio object files at full bitrate using a scalable bitstream encoder having fine granularity scalability to obtain a plurality of encoded audio object files, the encoder ranking bits in each data frame of the encoded audio object files in an order of psychoacoustic importance to human hearing;
generating a full file having a full bitrate by combining a plurality of independently encoded audio object files and corresponding hierarchical metadata;
constructing a first scaled compressed bitstream of a first target bit rate from the full file; and
constructing a second scaled compressed bitstream of a second target bitrate from the full file such that multiple scaled bitstreams of different target bitrates are obtained from a single full file without any re-encoding of the multiple encoded audio object files;
wherein the first target bit rate and the second target bit rate are different from each other and both are smaller than the full bit rate.
15. The method of claim 14, wherein the first target bit rate is a maximum bit rate at which the first scaled compressed bit stream is to be transmitted.
16. The method of claim 15, wherein each of the plurality of encoded audio object files is persistent and exists for the entire duration of a full file.
17. The method of claim 16, further comprising:
comparing the data frame activity of the data frames of each of the plurality of encoded audio object files over a selected time period to one another to obtain a data frame activity comparison;
allocating bits to each data frame of the encoded audio object file over a selected time period based on the data frame activity comparison and the first target bit rate to obtain a bit allocation for the selected time period;
scaling down the full file by pruning bits from the data frames in accordance with the bit allocation to achieve the first target bit rate and obtain bit-reduced encoded audio object files; and
multiplexing the bit-reduced audio object files together and packing them into the first scaled compressed bitstream at the first target bit rate.
18. The method of claim 17, further comprising:
transmitting the first scaled compressed bitstream to a receiving device at a first target bit rate; and
decoding the first scaled compressed bitstream to obtain decoded audio objects.
19. The method of claim 18, further comprising mixing the decoded audio objects to create an audio object mix.
20. A post-encoding bit rate reduction system, comprising:
a full file containing separately encoded audio object files that have been encoded at a full bitrate and merged with corresponding hierarchical metadata to form the full file;
a bit reduction module for reducing a number of bits allocated to data frames of the encoded audio object file based on a data frame activity comparison of each data frame of each audio object file over a selected time period to obtain a bit-reduced encoded audio object;
a bitstream wrapper for arranging data frames of the bit-reduced encoded audio objects in a container for transmission over a computer network; and
a multiplexer for merging containers containing reduced-bit encoded audio to generate a scaled compressed bitstream at a target bitrate, wherein the target bitrate is less than the full bitrate.
21. An audio signal receiving system comprising:
a scaled compressed bitstream received over a network at a target bitrate, the bitstream comprising a plurality of bit-reduced encoded audio object files that have been individually encoded by a scalable bitstream encoder resident on a computing device and have had bits in the data frames of a full file, encoded at a full bitrate, reduced based on a data frame activity comparison and corresponding hierarchical metadata, wherein the target bitrate is less than or equal to the full bitrate;
a demultiplexer for separating the scaled compressed bitstream into a plurality of encoded audio object files; and
a scalable bitstream decoder for decoding the encoded audio object to obtain a decoded audio object.
22. The audio signal receiving system of claim 21, further comprising a mixing device to mix the decoded audio object files and generate an audio object mix.