System and method for adaptive audio signal generation, encoding and rendering

The present application is a divisional application of the Chinese invention patent application having application number 201280032058.3, filed June 27, 2012, entitled "System and method for adaptive audio signal generation, encoding and rendering".
Cross Reference to Related Applications
This application claims priority to U.S. provisional application No. 61/504,005, filed July 1, 2011, and U.S. provisional application No. 61/636,429, filed April 20, 2012, both of which are hereby incorporated by reference in their entireties for all purposes.
Technical Field
One or more implementations relate generally to audio signal processing and, more particularly, to mixed object and channel-based audio processing for use in cinema, home, and other environments.
Background
The subject matter discussed in the background section should not be assumed to be prior art merely because it is mentioned in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
Since the introduction of sound into film, there has been a steady development of technology for capturing the creator's artistic intent for the motion picture soundtrack and accurately reproducing it in a movie theater environment. The fundamental role of movie sound is to support the story shown on the screen. A typical movie soundtrack includes many different sound elements corresponding to images and elements on the screen: dialog, noise, and sound effects that emanate from different on-screen elements and that combine with background music and ambient effects to create the overall audience experience. The artistic intent of the creators and producers represents their desire to have these sounds reproduced in a manner that corresponds as closely as possible to what is shown on the screen with respect to sound source position, intensity, movement, and other similar parameters.
Current movie creation, distribution, and playback suffer from limitations that restrict the creation of truly immersive and realistic audio. Conventional channel-based audio systems, such as stereo and 5.1 systems, send audio content in the form of speaker feeds to individual speakers in the playback environment. The introduction of digital cinema created new standards for sound on film, such as the incorporation of up to 16 channels of audio, allowing greater creativity for content creators and a more enveloping and realistic listening experience for audiences. The introduction of 7.1 surround systems provided a new format that increases the number of surround channels by splitting the existing left and right surround channels into four zones, thereby increasing the scope for sound designers and mixers to control the positioning of audio elements in the theater.
To further improve the listener experience, the playback of sound in virtual three-dimensional environments has become an area of increased research and development. The spatial presentation of sound utilizes audio objects, which are audio signals with associated parametric source descriptions, such as apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. Object-based audio is increasingly being used for many current multimedia applications, such as digital movies, video games, simulators, and 3D video.
Extending beyond traditional speaker feeds and channel-based audio as a means for distributing spatial audio is critical, and there has been considerable interest in model-based audio descriptions that carry the promise of allowing listeners/exhibitors to freely select a playback configuration that suits their individual needs or budget, with the audio rendered specifically for their chosen configuration. At a high level, there are currently four main spatial audio description formats: speaker feed, in which the audio is described as signals intended for speakers at nominal speaker positions; microphone feed, in which the audio is described as signals captured by virtual or actual microphones in a predefined array; a model-based description, in which the audio is described in terms of a sequence of audio events at described positions; and binaural, in which the audio is described by the signals arriving at the listener's ears. These four description formats are often associated with one or more rendering technologies that convert the audio signals into speaker feeds. Current rendering technologies include: panning, in which the audio stream is converted to speaker feeds using a set of panning laws and known or assumed speaker positions (typically rendered prior to distribution); Ambisonics, in which microphone signals are converted into feeds for a scalable array of speakers (typically rendered after distribution); WFS (wave field synthesis), in which sound events are converted to appropriate speaker signals in order to synthesize the sound field (typically rendered after distribution); and binaural, in which L/R (left/right) binaural signals are delivered to the L/R ears, typically using headphones, or by using speakers together with crosstalk cancellation (rendered before or after distribution). Of these formats, the speaker-feed format is the most common because it is simple and effective. The best sonic results (most accurate, most reliable) are achieved by mixing/monitoring and distributing the speaker feeds directly, since there is no processing between the content creator and the listener. If the playback system is known in advance, a speaker-feed description generally provides the highest fidelity. However, in many practical applications the playback system is unknown. The model-based description is considered the most adaptable because it makes no assumptions about the rendering technology and is therefore most easily applied to any rendering technology. Although the model-based description captures spatial information efficiently, it becomes very inefficient as the number of audio sources increases.
Film systems have for many years been characterized by discrete screen channels in the form of left, center, right, and occasionally 'inner left' and 'inner right' channels. These discrete sources typically have sufficient frequency response and power handling to allow sounds to be accurately placed in different areas of the screen and to permit timbre matching as sounds are moved or panned between positions. Recent developments in improving the listener experience have attempted to accurately reproduce the position of sounds relative to the listener. In a 5.1 setup, the surround 'zones' consist of arrays of speakers, all of which carry the same audio information within each left surround or right surround zone. Such arrays can be effective for 'ambient' or diffuse surround effects; however, many sound effects in daily life originate from randomly placed point sources. For example, in a restaurant, ambient music may apparently be played from all around, while tiny but discrete sounds originate from specific points: a person chatting from one spot, the clatter of a knife on a plate from another. Being able to place such sounds discretely around the auditorium can add a heightened sense of realism without being obtrusive. Overhead sound is also an important component of the surround definition. In the real world, sounds originate from all directions, not always from a single horizontal plane. An added sense of realism can be achieved if sound can be heard from overhead, in other words from the 'upper hemisphere'. Current systems, however, do not offer truly accurate reproduction of sound for different audio types in a variety of different playback environments. Existing systems require extensive processing, knowledge, and configuration of the actual playback environment to attempt an accurate representation of location-specific sounds, rendering current systems impractical for most applications.
What is needed is a system that supports multiple screen channels, resulting in increased definition and improved audio-visual coherence for on-screen sounds or dialog, and the ability to precisely position sources anywhere in the surround zones to improve the audio-visual transition from screen to room. For example, if a character on the screen looks toward a sound source inside the room, the sound engineer ("mixer") should have the ability to precisely position the sound so that it matches the character's line of sight, and the effect should be consistent throughout the audience. In conventional 5.1 or 7.1 surround sound mixing, however, the effect is highly dependent on the listener's seating position, which is disadvantageous in most large-scale listening environments. Increased surround resolution creates new opportunities to use sound in a room-centric way, as opposed to the traditional approach, in which content is created assuming a single listener at the "sweet spot".
In addition to the spatial issues, current state-of-the-art multi-channel systems suffer with regard to timbre. For example, the timbral quality of some sounds, such as hissing steam escaping from a broken pipe, may suffer from being reproduced by an array of speakers. The ability to direct specific sounds to a single speaker gives the mixer the opportunity to eliminate the artifacts of array reproduction and deliver a more realistic experience to the audience. Traditionally, surround speakers do not support the same full range of audio frequencies and levels that the large screen channels support. Historically, this has created problems for mixers, reducing their ability to freely move full-range sounds from the screen into the room. As a result, theater owners have not felt compelled to upgrade their surround channel configurations, which has prevented the widespread adoption of higher-quality equipment.
Disclosure of Invention
Systems and methods are described for a cinema sound format and processing system that includes a new speaker layout (channel configuration) and an associated spatial description format. An adaptive audio system and format are defined that support multiple rendering technologies. Audio streams are transmitted along with metadata that describes the "mixer's intent," including the desired position of the audio stream. The position can be expressed as a named channel (from within a predefined channel configuration) or as three-dimensional position information. This channels-plus-objects format combines the best of the channel-based and model-based audio scene description methods. Audio data for the adaptive audio system comprises a number of independent monophonic audio streams. Each stream has associated metadata that specifies whether the stream is a channel-based or an object-based stream. Channel-based streams have rendering information encoded by means of a channel name; object-based streams have location information encoded through mathematical expressions encoded in further associated metadata. The original independent audio streams are packaged as a single serial bitstream that contains all of the audio data. This configuration allows sound to be rendered according to an allocentric (non-self-centric) frame of reference, in which the rendering location of a sound is based on the characteristics of the playback environment (e.g., room size, shape, etc.) so as to correspond to the mixer's intent. The object position metadata contains the appropriate allocentric frame-of-reference information required to play the sound correctly using the available speaker positions in a room that is set up to play the adaptive audio content. This enables sound to be mixed optimally for a particular playback environment, which may be different from the mixing environment experienced by the sound engineer.
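The channels-plus-objects idea can be illustrated with the following minimal sketch; the field names, framing layout, and serialization used here are assumptions made purely for illustration and are not the bitstream syntax defined in this application. It simply shows independent mono streams, each tagged as channel-based (with a channel name) or object-based (with a 3D position), being packaged into one serial payload.

```python
import json
import struct

def pack_adaptive_audio(streams):
    """streams: list of dicts, each with 'samples' (list of floats) and 'metadata'.
    Channel-based metadata carries a channel name; object-based metadata carries
    a 3D position. Returns a single serial byte string containing all streams."""
    payload = b""
    for stream in streams:
        meta = json.dumps(stream["metadata"]).encode("utf-8")
        audio = struct.pack("<%df" % len(stream["samples"]), *stream["samples"])
        # Each stream is framed as [meta length][meta][audio length][audio].
        payload += struct.pack("<I", len(meta)) + meta
        payload += struct.pack("<I", len(audio)) + audio
    return payload

streams = [
    {"metadata": {"type": "channel", "name": "LeftFront"},
     "samples": [0.0, 0.1, -0.1]},
    {"metadata": {"type": "object", "position": [0.2, 0.9, 0.5]},
     "samples": [0.3, 0.2, 0.1]},
]
bitstream = pack_adaptive_audio(streams)
print(len(bitstream), "bytes")
```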
The adaptive audio system improves audio quality in different rooms through such benefits as improved room equalization and surround bass management, so that the speakers (whether on-screen or off-screen) can be freely addressed by the mixer without concern for timbre matching. The adaptive audio system adds the flexibility and power of dynamic audio objects to traditional channel-based workflows. These audio objects allow creators to control discrete sound elements independently of any specific playback speaker configuration, including overhead speakers. The system also introduces new efficiencies into the post-production process, allowing sound engineers to efficiently capture all of their intent and then monitor in real time, or automatically generate, surround sound 7.1 and 5.1 versions.
The adaptive audio system simplifies distribution by encapsulating the audio essence and artistic intent in a single track file within the digital cinema processor, which can be played back faithfully in a wide range of theater configurations. The system provides optimal reproduction of artistic intent when the mix and render use the same channel configuration, together with a single inventory that adapts downward to the rendering configuration, i.e., downmixing.
These and other advantages are provided by embodiments involving a cinema sound platform that addresses current system limitations and delivers an audio experience beyond currently available systems.
Drawings
Like reference numerals are used to refer to like elements in the following figures. Although the following figures depict various examples, one or more implementations are not limited to the examples depicted in the figures.
FIG. 1 is an overview of the top level of an audio creation and playback environment utilizing an adaptive audio system, according to one embodiment.
Fig. 2 illustrates the combination of channels and object-based data to produce an adaptive audio mix, according to one embodiment.
Fig. 3 is a block diagram illustrating a workflow of creating, packaging, and rendering adaptive audio content, according to one embodiment.
Fig. 4 is a block diagram of the rendering phase of an adaptive audio system, in accordance with one embodiment.
Fig. 5 is a table listing metadata types and associated metadata elements for an adaptive audio system, according to one embodiment.
Fig. 6 is a diagram illustrating post-production and mastering for an adaptive audio system, according to one embodiment.
Fig. 7 is a diagram of an example workflow for a digital cinema packaging process using adaptive audio files, in accordance with one embodiment.
Fig. 8 is a top view of an example layout of suggested speaker locations for use with an adaptive audio system in a typical auditorium.
Fig. 9 is a front view of an example arrangement of suggested speaker locations at a screen for use in a typical auditorium.
Fig. 10 is a side view of an example layout of suggested speaker locations for use with an adaptive audio system in a typical auditorium.
FIG. 11 is an example of the placement of top and side surround speakers relative to a reference point, according to one embodiment.
Detailed Description
Systems and methods are described for an adaptive audio system and associated audio signal and data formats that support multiple rendering techniques. Aspects of one or more embodiments described herein may be implemented in an audio or audiovisual system that processes source audio information in a mixing, rendering and playback system that includes a processing device or one or more computers executing software instructions. Any of the described embodiments may be used alone or in any combination with one another. While various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or implied in one or more places in the specification, embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some or only one of the deficiencies that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
For the purposes of this specification, the following terms have the associated meanings:
Channel or audio channel: a monophonic audio signal or an audio stream plus metadata in which the position is encoded as a channel ID, for example "LeftFront" or "RightTopSurround". A channel may drive multiple speakers, e.g., a "LeftSurround" channel (Ls) would feed all speakers in the Ls array.
Channel configuration: a predefined set of speaker zones with associated nominal positions, e.g., 5.1, 7.1, etc. 5.1 refers to a six-channel surround sound audio system having front left and right channels, a center channel, two surround channels, and a subwoofer channel; 7.1 refers to an eight-channel surround system that adds two additional surround channels to the 5.1 system. Examples of 5.1 and 7.1 configurations include commercially available surround sound systems.
A loudspeaker: an audio transducer or a group of transducers that render an audio signal.
Speaker zone: an array of one or more speakers that can be uniquely referenced and that receives a single audio signal, e.g., "LeftSurround" as typically found in cinema, and in particular for exclusion or inclusion of object rendering.
Speaker channels or speaker feed channels: audio channels associated with speaker zones or named speakers within a defined speaker configuration. The speaker channels are nominally rendered using the associated speaker zones.
Speaker channel group: a set of one or more speaker channels corresponding to a channel configuration (e.g., stereo, mono, etc.).
Object or object channel: one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, and so on. An object channel is an audio stream plus metadata in which the position is encoded as a 3D position in space.
Audio program: the entire set of speaker channels and/or object channels and associated metadata describing the desired spatial audio representation.
Non-self-centric reference: spatial references, in which audio objects are defined relative to features within the rendering environment, such as room walls and corners, standard speaker positions, and screen positions (e.g., the front left corner of a room).
Self-centric (egocentric) reference: spatial reference, in which the perspective of an audio object relative to a (viewer) listener is defined and often specified as an angle relative to the listener (e.g. 30 degrees to the right of the listener).
Frame: frames are short, independently decodable segments into which the overall audio program is divided. The audio frame rate and boundaries are typically aligned with the video frames.
Adaptive audio: channel-based audio signals and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment.
The cinema sound format and processing system (also referred to as an "adaptive audio system") described in the present application utilizes new spatial audio description and rendering technology to allow enhanced audience immersion, more artistic control, system flexibility and scalability, and ease of installation and maintenance. Embodiments of the cinema audio platform include several discrete components, including mixing tools, a packer/encoder, an unpacker/decoder, in-theater final mix and rendering components, new speaker designs, and networked amplifiers. The system includes recommendations for a new channel configuration to be used by content creators and exhibitors. The system utilizes a model-based description that supports several features, such as: a single inventory with downward and upward adaptation to the rendering configuration, i.e., delayed rendering and optimal use of the available speakers; improved sound envelopment, including optimized downmixing to avoid inter-channel correlation; increased spatial resolution through steer-thru arrays (e.g., an audio object dynamically assigned to one or more speakers within a surround array); and support for alternative rendering methods.
FIG. 1 is a top-level overview of an audio creation and playback environment utilizing an adaptive audio system, according to one embodiment. As shown in FIG. 1, the integrated, end-to-end environment 100 includes content creation, packaging, distribution, and playback/rendering components across a large number of endpoint devices and use cases. The overall system 100 begins with content captured from and for a number of different use cases that comprise different user experiences 112. The content capture element 102 includes, for example, movies, TV, live broadcast, user-generated content, recorded content, games, music, and so on, and may include audio/visual or audio-only content. As content progresses through the system 100 from the capture stage 102 to the final user experience 112, it passes through several key processing steps performed by discrete system components. These processing steps include pre-processing of the audio 104, authoring tools and processes 106, and encoding by an audio codec 108 that captures, for example, audio data, additional metadata and rendering information, and object channels. Various processing effects (such as compression (lossy or lossless), encryption, etc.) may be applied to the object channels for efficient and secure distribution over various media. Appropriate endpoint-specific decoding and rendering processes 110 are then applied to reproduce and deliver the particular adaptive audio user experience 112. The audio experience 112 represents the playback of the audio or audio/visual content through suitable speakers and playback devices, and may represent any environment in which a listener is experiencing playback of the captured content, such as a movie theater, concert hall, theater, home or room, listening booth, car, game console, headphone or earphone system, public address (PA) system, or any other playback environment.
Embodiments of the system 100 include an audio codec 108 that is capable of efficient distribution and storage of multi-channel audio programs, and which may therefore be referred to as a 'hybrid' codec. The codec 108 combines traditional channel-based audio data with associated metadata to produce audio objects that facilitate the creation and delivery of audio that is adapted and optimized for rendering and playback in environments that may be different from the mixing environment. This allows the sound engineer to encode his or her intent as to how the final audio should be heard by the listener, based on the listener's actual listening environment.
Conventional channel-based audio codecs operate under the assumption that an audio program will be reproduced by an array of speakers in a predetermined position relative to the listener. To create a complete multi-channel audio program, sound engineers typically mix a large number of separate audio streams (e.g., dialog, music, effects) to create an overall desired impression. Audio mixing decisions are typically made by listening to audio programs reproduced by an array of speakers in a predetermined location (e.g., a particular 5.1 or 7.1 system in a particular theater). The final mixed signal is used as input to an audio codec. For reproduction, a spatially accurate sound field is only achieved when the loudspeakers are placed in predetermined positions.
A new form of audio coding, called audio object coding, provides distinct sound sources (audio objects) to the encoder in the form of separate audio streams. Examples of audio objects include dialog tracks, single instruments, individual sound effects, and other point sources. Each audio object is associated with spatial parameters, which may include, but are not limited to, sound position, sound width, and velocity information. The audio objects and associated parameters are then coded for distribution and storage. Final audio object mixing and rendering is performed at the receiving end of the audio distribution chain, as part of audio program playback. This step may be based on knowledge of the actual speaker positions, so that the result is an audio distribution system that is customizable to user-specific listening conditions. The two coding forms (channel-based and object-based) perform optimally for different input signal conditions. Channel-based audio coders are generally more efficient for coding input signals containing dense mixes of different audio sources, as well as for diffuse sounds. Conversely, audio object coders are more efficient for coding a small number of highly directional sound sources.
In one embodiment, the components and methods of system 100 include an audio encoding, distribution and decoding system configured to produce one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. This combined approach provides greater coding efficiency and rendering flexibility than either the channel-based approach or the object-based approach taken separately.
Other aspects of the described embodiments include extending a predefined channel-based audio codec in a backward compatible manner to include audio object coding elements. A new 'extension layer' containing audio object coding elements is defined and added to a 'base' or 'backward compatible' layer of the channel-based audio codec bitstream. This approach enables one or more bitstreams that include extension layers to be processed by legacy (legacy) decoders, while at the same time providing the user with an enhanced listener experience with new decoders. One example of an enhanced user experience includes control of audio object presentation. An additional advantage of this approach is that audio objects can be added or modified anywhere along the distribution chain without decoding/mixing/re-encoding multi-channel audio encoded with a channel-based audio codec.
With respect to the frame of reference, the spatial effects of audio signals are critical in providing an immersive experience for the listener. Sounds that are meant to emanate from a specific region of the viewing screen or the room should be played through the speaker(s) located at that same relative position. Thus, the primary audio metadatum of a sound event in a model-based description is position, though other parameters such as size, orientation, velocity, and acoustic dispersion can also be described. To convey position, a model-based 3D audio spatial description requires a 3D coordinate system. The coordinate system used for transmission (Euclidean, spherical, etc.) is generally chosen for convenience or compactness; however, other coordinate systems may be used for the rendering process. In addition to a coordinate system, a frame of reference is required to represent the positions of objects in space. For systems that are to reproduce position-based sound accurately in a variety of different environments, selecting the proper frame of reference can be a key factor. With an allocentric (non-egocentric) frame of reference, audio source positions are defined relative to features within the rendering environment, such as room walls and corners, standard speaker positions, and screen position. In an egocentric frame of reference, positions are represented with respect to the listener's perspective, such as "in front of me, slightly to the left," and so on. Scientific studies of spatial perception (audio and otherwise) have shown that the egocentric perspective is used almost universally. For cinema, however, an allocentric frame of reference is generally more appropriate, for several reasons. For example, the precise position of an audio object is most important when there is an associated object on the screen. Using an allocentric reference, for every listening position and for any screen size, the sound will be localized at the same relative position on the screen, e.g., the middle of the left third of the screen. Another reason is that mixers tend to think and mix in allocentric terms, panning tools are laid out with an allocentric frame (the walls of the room), and mixers expect sounds to be rendered that way: this sound should be on screen, this sound should be off screen, or from the left wall, and so on.
Although the allocentric frame of reference is used in the cinema environment, there are some cases in which an egocentric frame of reference may be useful and more appropriate. These include non-diegetic sounds, i.e., sounds that are not present in the "story space," such as mood music, for which an egocentrically uniform presentation may be desirable. Another case is near-field effects (e.g., a buzzing mosquito in the listener's left ear) that require an egocentric representation. There is currently no means of rendering such a sound field without the use of headphones or very-near-field speakers. In addition, infinitely far sound sources (and the resulting plane waves) appear to come from a constant egocentric position (e.g., 30 degrees to the left), and such sounds are easier to describe in egocentric terms than in allocentric terms.
In some cases, an allocentric frame of reference can be used as long as a nominal listening position is defined, but some cases require an egocentric representation that cannot yet be rendered. Although an allocentric reference may be more useful and appropriate, the audio representation should be extensible, since many new features, including egocentric representation, may be more desirable in certain applications and listening environments. Embodiments of the adaptive audio system include a hybrid spatial description approach that includes a recommended channel configuration for optimal fidelity and for rendering diffuse or complex, multi-point sources (e.g., stadium crowd, ambience) using egocentric references, plus an allocentric, model-based sound description to efficiently enable increased spatial resolution and scalability.
System assembly
Referring to fig. 1, original sound content data 102 is first processed in a pre-processing block 104. The pre-processing block 104 of the system 100 includes object channel filtering components. In many cases, the audio objects contain separate sound sources for enabling independent panning of the sound. In some cases, such as when creating audio programs using natural or "production" sound, it may be necessary to extract individual sound objects from a recording containing multiple sound sources. Embodiments include methods for isolating independent source signals from more complex signals. The undesired elements to be separated from the independent source signal may include, but are not limited to, other independent acoustic sources and background noise. In addition, reverberation can be removed to recover a "dry" sound source.
The pre-processor 104 also includes source separation and content-type detection functions. The system provides for the automated generation of metadata through analysis of the input audio. Positional metadata is derived from a multi-channel recording by analyzing the relative levels of correlated input between channel pairs. Detection of content type (such as "speech" or "music") may be achieved, for example, by feature extraction and classification.
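As an illustrative sketch only (not the analysis actually used by the pre-processor), the level-based idea can be shown by estimating a left/right pan position from the relative RMS levels of a channel pair; the function names and the constant-power pan-law assumption are hypothetical.

```python
import math

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pan(left, right):
    """Return an estimated pan position in [-1, 1] (-1 = hard left, +1 = hard right)
    from the level ratio of a channel pair, assuming a constant-power pan law."""
    angle = math.atan2(rms(right), rms(left))   # 0 .. pi/2
    return angle / (math.pi / 4) - 1.0          # pi/4 (equal levels) maps to 0 (center)

# A 440 Hz tone panned toward the left channel.
left = [0.8 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
right = [0.2 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
print(round(estimate_pan(left, right), 2))      # negative: source estimated left of center
```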
Creation tool
The authoring tools block 106 includes features to improve the authoring of audio programs by optimizing the input and codification of the sound engineer's creative intent, allowing him to create once a final audio mix that is optimized for playback in practically any playback environment. This is accomplished through the use of audio objects and positional data that are associated and encoded with the original audio content. In order to place sounds accurately around an auditorium, the sound engineer needs control over how the sound will ultimately be rendered based on the actual constraints and features of the playback environment. The adaptive audio system provides this control by allowing the sound engineer to change how the audio content is designed and mixed through the use of audio objects and positional data.
Audio objects can be considered groups of sound elements that may be perceived to emanate from a particular physical location or locations in the auditorium. Such objects can be static, or they can move. In the adaptive audio system 100, the audio objects are controlled by metadata that, among other things, details the position of the sound at a given point in time. When objects are monitored or played back in a theater, they are rendered according to the positional metadata using the speakers that are present, rather than necessarily being output to a physical channel. A dialog track, for example, may be an audio object, and standard panning data is analogous to positional metadata. In this way, content placed on the screen may effectively pan in the same way as channel-based content, but content placed in the surrounds can be rendered to an individual speaker if desired. While the use of audio objects provides the desired control over discrete effects, other aspects of a movie soundtrack work effectively in a channel-based environment. For example, many ambient effects or reverberation actually benefit from being fed to arrays of speakers. Although these could be treated as objects with sufficient width to fill an array, it is beneficial to retain some channel-based functionality.
In one embodiment, the adaptive audio system supports "beds" in addition to audio objects, where beds are effectively channel-based sub-mixes or stems. These can be delivered for final playback (rendering) either individually, or combined into a single bed, depending on the intent of the content creator. These beds can be created in different channel-based configurations such as 5.1 and 7.1, and are extensible to more expansive formats such as 9.1 and arrays that include overhead speakers.
Fig. 2 illustrates the combination of channels and object-based data to produce an adaptive audio mix, according to one embodiment. As shown in process 200, channel-based data 202, which may be, for example, 5.1 or 7.1 surround sound data provided in the form of Pulse Code Modulated (PCM) data, is combined with audio object data 204 to produce an adaptive audio mix 208. The audio object data 204 is generated by combining elements of the original channel-based data with associated metadata specifying specific parameters regarding the location of the audio object.
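A hedged sketch of the combination shown in process 200 follows: a channel-based bed (e.g., 5.1 PCM) and object channels with positional metadata are carried together as one adaptive audio mix. The field names are illustrative only and are not a structure mandated by the described system.

```python
# One frame (480 samples) of silence per channel keeps the example small.
bed = {
    "config": "5.1",
    "channels": {name: [0.0] * 480 for name in ("L", "R", "C", "LFE", "Ls", "Rs")},
}

objects = [
    {"name": "helicopter",
     "samples": [0.0] * 480,
     "metadata": {"position": [0.5, 0.2, 1.0],   # normalized x, y, z in the room
                  "size": 0.1}},
]

# The adaptive audio mix carries both, leaving final speaker assignment to the renderer.
adaptive_audio_mix = {"beds": [bed], "objects": objects}
print(len(adaptive_audio_mix["beds"]), "bed(s),",
      len(adaptive_audio_mix["objects"]), "object(s)")
```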
As conceptually illustrated in fig. 2, the authoring tool provides the ability to create an audio program that contains a combination of both object channels and speaker channel groups. For example, an audio program may contain one or more speaker channels (or tracks, e.g., stereo or 5.1 tracks), optionally organized into groups, description metadata for the one or more speaker channels, one or more object channels, and description metadata for the one or more object channels. Within an audio program, each speaker channel group and each object channel may be represented using one or more different sampling rates. For example, digital cinema (D-cinema) applications support 48kHz and 96kHz sampling rates, but other sampling rates may also be supported. In addition, uptake (ingest), storage, and editing of channels having different sampling rates can also be supported.
The creation of an audio program requires the step of sound design, which includes combining sound elements as a sum of level-adjusted constituent sound elements in order to create a new, desired sound effect. The authoring tools of the adaptive audio system enable the creation of sound effects as collections of sound objects with relative positions, using a spatial-visual sound design graphical user interface. For example, a visual representation of the sound-generating object (e.g., a car) can be used as a template for assembling audio elements (exhaust note, tire hum, engine noise) as object channels containing the sound and the appropriate spatial positions (at the tailpipe, tires, hood). The individual object channels can then be linked and manipulated as a group. The authoring tool 106 includes several user interface elements to allow the sound engineer to enter control information, view mixing parameters, and improve system functionality. The sound design and authoring process is also improved by allowing object channels and speaker channels to be linked and manipulated as a group. One example is combining an object channel with a discrete, dry sound source with a set of speaker channels that contain an associated reverberant signal.
The audio authoring tool 106 supports the ability to combine multiple audio channels, commonly referred to as mixing. Multiple methods of mixing are supported and may include traditional level-based mixing and loudness-based mixing. In level-based mixing, wideband scaling is applied to the audio channels, and the scaled audio channels are then summed together. The wideband scale factor for each channel is selected to control the absolute level of the resulting mixed signal, as well as the relative levels of the mixed channels within the mixed signal. In loudness-based mixing, one or more input signals are modified using frequency-dependent amplitude scaling, where the frequency-dependent amplitude is chosen to provide the desired perceived absolute and relative loudness while preserving the perceived timbre of the input sound.
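A minimal sketch of level-based mixing as described above, assuming simple per-channel linear gains: each channel receives a wideband (frequency-independent) scale factor and the scaled channels are summed.

```python
def level_based_mix(channels, gains):
    """channels: list of equal-length sample lists; gains: linear scale factors,
    one per channel. Returns the summed mix."""
    length = len(channels[0])
    return [sum(g * ch[i] for g, ch in zip(gains, channels)) for i in range(length)]

dialog = [0.5, 0.4, 0.3]
music = [0.2, 0.2, 0.2]
print(level_based_mix([dialog, music], gains=[1.0, 0.5]))  # [0.6, 0.5, 0.4]
```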
The authoring tool provides the ability to create speaker channels and speaker channel groups, allowing metadata to be associated with each speaker channel group. Each speaker channel group can be tagged according to content type. The content type is extensible via a text description. Content types may include, but are not limited to, dialog, music, and effects. Each speaker channel group may be assigned unique instructions on how to upmix from one channel configuration to another, where upmixing is defined as creating M audio channels from N channels, where M > N. The upmix instructions may include, but are not limited to, the following: an enable/disable flag to indicate whether upmixing is permitted; an upmix matrix to control the mapping between each input and output channel; and default enable and matrix settings that may be assigned based on content type, e.g., enable upmixing for music only. Each speaker channel group may also be assigned unique instructions on how to downmix from one channel configuration to another, where downmixing is defined as creating Y audio channels from X channels, where Y < X. The downmix instructions may include, but are not limited to, the following: a matrix to control the mapping between each input and output channel; and default matrix settings that may be assigned based on content type, e.g., dialog should downmix onto the screen, effects should downmix off the screen. Each speaker channel can also be associated with a metadata flag to disable bass management during rendering. A downmix matrix example is sketched below.
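The downmix instructions can be illustrated with a small matrix example; the 5.1-to-stereo coefficients below are a common convention used purely for illustration, not values specified by the described system.

```python
def apply_matrix(matrix, inputs):
    """matrix: dict output_channel -> {input_channel: coefficient};
    inputs: dict input_channel -> sample list. Returns dict of output sample lists."""
    length = len(next(iter(inputs.values())))
    return {out: [sum(coef * inputs[src][i] for src, coef in row.items())
                  for i in range(length)]
            for out, row in matrix.items()}

# Example 5.1-to-stereo downmix (LFE omitted); -3 dB (0.707) on center and surrounds.
downmix_5_to_2 = {
    "Lo": {"L": 1.0, "C": 0.707, "Ls": 0.707},
    "Ro": {"R": 1.0, "C": 0.707, "Rs": 0.707},
}
inputs = {ch: [0.1, 0.2] for ch in ("L", "R", "C", "Ls", "Rs")}
print(apply_matrix(downmix_5_to_2, inputs))
```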
Embodiments include features that enable the creation of object channels and object channel groups. The present invention allows metadata to be associated with each object channel group. Each object channel group may be labeled according to a content type. The content type is extensible via textual description, where the content type may include, but is not limited to, dialog, music, and effects. Each object channel group may be assigned metadata describing how one or more objects should be rendered.
Positional information is provided to indicate the desired apparent source position. The position can be indicated using either an egocentric or an allocentric frame of reference. An egocentric reference is appropriate when the source position is to be referenced to the listener. For egocentric positions, spherical coordinates are useful for position description. An allocentric reference is the typical frame of reference for cinema or other audio/visual presentations, where the source position is referenced relative to objects in the presentation environment such as a visual display screen or a room boundary. Three-dimensional (3D) trajectory information is provided to enable interpolation of position or to support the use of other rendering decisions, such as enabling a "snap to" mode. Size information is provided to indicate the desired apparent perceived audio source size.
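As a worked illustration of the two frames of reference (under an assumed, normalized room model that is not part of this description), an egocentric position given as an azimuth and distance relative to a nominal listening position can be converted to allocentric room coordinates:

```python
import math

def egocentric_to_allocentric(azimuth_deg, distance, listener_xy=(0.5, 0.5)):
    """Azimuth 0 degrees = straight ahead (toward the screen), positive to the right.
    Positions are normalized room coordinates with the screen along y = 1."""
    az = math.radians(azimuth_deg)
    x = listener_xy[0] + distance * math.sin(az)
    y = listener_xy[1] + distance * math.cos(az)
    return (x, y)

# "30 degrees to the listener's right" expressed as a room (allocentric) position.
print(egocentric_to_allocentric(30.0, 0.4))
```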
Spatial quantization is provided through a "snap to closest speaker" control, indicated by the sound engineer or mixer as an intent to have an object rendered by exactly one speaker (with some potential sacrifice of spatial accuracy). A limit on the allowed spatial distortion can be indicated through elevation and azimuth tolerance thresholds, such that if the thresholds are exceeded, the "snap" function does not occur. In addition to the distance thresholds, a crossfade rate parameter can also be indicated to control how quickly a moving object will transition, or jump, from one speaker to another when the desired position crosses between speakers.
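A sketch of the snap behavior under assumed tolerance values follows; the speaker layout, tolerance defaults, and crossfade handling are illustrative simplifications of the control described above.

```python
import math

def snap_to_speaker(obj_az, obj_el, speakers, az_tol=15.0, el_tol=10.0, xfade_ms=50.0):
    """speakers: dict name -> (azimuth_deg, elevation_deg). Returns (name, xfade_ms)
    when snapping is allowed, or (None, None) when the tolerances are exceeded."""
    # Closest speaker by simple angular distance.
    closest = min(speakers, key=lambda n: math.hypot(obj_az - speakers[n][0],
                                                     obj_el - speakers[n][1]))
    spk_az, spk_el = speakers[closest]
    if abs(obj_az - spk_az) <= az_tol and abs(obj_el - spk_el) <= el_tol:
        return closest, xfade_ms   # render the object entirely from this speaker
    return None, None              # tolerance exceeded: do not snap, pan normally

speakers = {"L": (-30.0, 0.0), "C": (0.0, 0.0), "R": (30.0, 0.0)}
print(snap_to_speaker(-25.0, 2.0, speakers))   # ('L', 50.0)
print(snap_to_speaker(-45.0, 30.0, speakers))  # (None, None)
```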
In one embodiment, dependent spatial metadata is used for certain positional metadata. For example, metadata can be automatically generated for a "slave" object by associating it with a "master" object that the slave object is to follow. A time lag or relative speed can be assigned to the dependent object. Mechanisms may also be provided to allow the definition of an acoustic center of gravity for sets or groups of objects, so that an object can be rendered such that it is perceived to move around another object. In this case, one or more objects may rotate around an object or a defined area (such as a main guide point or a dry area of the room). The acoustic center of gravity would then be used in the rendering stage to help determine position information for each appropriate object-based sound, even though the final position information would be expressed as a position relative to the room, as opposed to a position relative to another object.
When an object is rendered, it is assigned to one or more speakers according to its positional metadata and the locations of the playback speakers. Additional metadata can be associated with the object to limit the speakers that should be used. Use of such a restriction may prohibit the use of the indicated speakers, or merely inhibit them (allowing less energy into the speaker or speakers than would otherwise be applied). The speaker sets to be restricted may include, but are not limited to, any of the named speakers or speaker zones (e.g., L, C, R, etc.), or speaker zones such as the front wall, back wall, left wall, right wall, ceiling, floor, speakers within the room, and so on. Likewise, in specifying a desired mix of multiple sound elements, one or more sound elements may become inaudible, or "masked," due to the presence of other "masking" sound elements. For example, when masked elements are detected, they could be identified to the user via a graphical display.
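The speaker-restriction metadata can be sketched as a post-panning gain adjustment; the zone names and the attenuation amount used for "inhibited" speakers are assumptions made for the example.

```python
def apply_speaker_restrictions(gains, excluded=(), inhibited=(), inhibit_db=-6.0):
    """gains: dict speaker/zone name -> linear panning gain.
    Excluded zones are silenced; inhibited zones receive reduced energy."""
    scale = 10.0 ** (inhibit_db / 20.0)
    restricted = {}
    for name, gain in gains.items():
        if name in excluded:
            restricted[name] = 0.0
        elif name in inhibited:
            restricted[name] = gain * scale
        else:
            restricted[name] = gain
    return restricted

gains = {"L": 0.5, "C": 0.7, "R": 0.5, "ceiling": 0.3}
print(apply_speaker_restrictions(gains, excluded=("ceiling",), inhibited=("C",)))
```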
As described elsewhere, the audio program description may be adapted to be presented over a wide variety of speaker facilities and channel configurations. When an audio program is being composed, it is important to monitor the effect of rendering the program in a desired playback configuration to verify that the desired result is achieved. The present invention includes the ability to select a target playback configuration and monitor the results. In addition, the system may automatically monitor the worst case (i.e., highest) signal level that will be generated in each expected playback configuration and provide an indication in the event that clipping or restriction will occur.
Fig. 3 is a block diagram illustrating the workflow of creating, packaging, and rendering adaptive audio content, according to one embodiment. The workflow 300 of fig. 3 is divided into three distinct task groups, labeled creation/authoring, packaging, and exhibition. In general, the hybrid model of beds and objects shown in fig. 2 allows most sound design, editing, pre-mixing, and final mixing to be performed in the same manner as today, without adding excessive overhead to current processes. In one embodiment, the adaptive audio functionality is provided in the form of software, firmware, or circuitry used in conjunction with sound production and processing equipment, where such equipment may be new hardware systems or updates to existing systems. For example, plug-in applications may be provided for digital audio workstations so that the existing panning techniques within sound design and editing can remain unchanged. In this way, it is possible to lay down both beds and objects within the workstation in 5.1 or similar surround-equipped editing rooms. The object audio and metadata are recorded in the session in preparation for the pre-mix and final-mix stages in the dubbing theater.
As shown in FIG. 3, the creation or authoring task includes the input of mixing controls 302 by a user, e.g., a sound engineer in the following example, to a mixing console or audio workstation 304. In one embodiment, metadata is integrated into the mixing console surface, allowing the faders, panning, and audio processing of the channel strips to work with both beds or stems and audio objects. The metadata can be edited using either the console surface or the workstation user interface, and the sound is monitored using a rendering and mastering unit (RMU) 306. The bed and object audio data and associated metadata are recorded during the mastering session to create a 'print master,' which includes the adaptive audio mix 310 and any other rendered deliverables (such as a surround 7.1 or 5.1 theatrical mix) 308. Existing authoring tools (e.g., digital audio workstations such as Pro Tools) can be used to allow sound engineers to label individual audio tracks within a mix session. Embodiments extend this concept by allowing users to label individual sub-segments within a track to aid in finding or quickly identifying audio elements. The user interface to the mixing console that enables the definition and creation of the metadata may be implemented through graphical user interface elements, physical controls (e.g., sliders and knobs), or any combination thereof.
In the packaging stage, the print master file is wrapped, hashed, and optionally encrypted using industry-standard MXF wrapping procedures in order to ensure the integrity of the audio content for delivery to the digital cinema packaging facility. This step may be performed by a digital cinema processor (DCP) 312 or any appropriate audio processor, depending on the ultimate playback environment, such as a theater 318 equipped with standard surround sound, an adaptive-audio-enabled theater 320, or any other playback environment. As shown in fig. 3, the processor 312 outputs the appropriate audio signals 314 and 316 according to the exhibition environment.
In one embodiment, the adaptive audio print master contains an adaptive audio mix along with a standard DCI-compliant pulse code modulated (PCM) mix. The PCM mix can be rendered by the rendering and mastering unit in the dubbing theater, or created by a separate mix pass if desired. The PCM audio forms the standard main audio track file within the digital cinema processor 312, and the adaptive audio forms an additional track file. Such a track file may conform to existing industry standards, and is ignored by DCI-compliant servers that cannot use it.
In an example cinema playback environment, the DCP containing the adaptive audio track file is recognized by the server as a valid package, ingested into the server, and then streamed to the adaptive audio cinema processor. With both linear PCM and adaptive audio files available, the system can switch between them as needed. For distribution to the exhibition stage, the adaptive audio packaging scheme allows a single type of package to be delivered to the movie theater. The DCP package contains both the PCM and adaptive audio files. The use of security keys, such as a key delivery message (KDM), may be incorporated to enable secure delivery of movie content or other similar content.
As shown in FIG. 3, the adaptive audio methodology is realized by enabling the sound engineer to express his or her intent with regard to the rendering and playback of audio content through the audio workstation 304. By controlling certain input controls, the engineer can specify where and how audio objects and sound elements are played back depending on the listening environment. Metadata is generated in the audio workstation 304 in response to the engineer's mixing inputs 302 to provide rendering cues that control spatial parameters (e.g., position, velocity, intensity, timbre, etc.) and specify which speaker(s) or speaker groups in the listening environment play the respective sounds during exhibition. The metadata is associated with the corresponding audio data in the workstation 304 or the RMU 306 for packaging and transport by the DCP 312.
The software tools and graphical user interfaces that provide control of the workstation 304 by an engineer include at least portions of the authoring tool 106 of FIG. 1.
Hybrid audio codec
As shown in fig. 1, the system 100 includes a hybrid audio codec 108. This component comprises an audio encoding, distribution, and decoding system that is configured to generate a single bitstream containing both conventional channel-based audio elements and audio object coding elements. The hybrid audio coding system is built around a channel-based encoding system that is configured to generate a single (unified) bitstream that is simultaneously compatible with (i.e., decodable by) a first decoder configured to decode (channel-based) audio data encoded in accordance with a first encoding protocol, and one or more secondary decoders configured to decode (object-based) audio data encoded in accordance with one or more secondary encoding protocols. The bitstream can include both encoded data (in the form of data bursts) decodable by the first decoder (and ignored by any secondary decoder) and encoded data (e.g., other bursts of data) decodable by the one or more secondary decoders (and ignored by the first decoder). The decoded audio and associated information (metadata) from the first decoder and the one or more secondary decoders can then be combined in such a way that both the channel-based and object-based information are rendered simultaneously, so as to recreate a facsimile of the environment, channels, spatial information, and objects presented to the hybrid coding system (i.e., within a three-dimensional space or listening environment).
The codec 108 generates a bitstream containing coded audio information and information related to multiple sets of channel positions (speakers). In one embodiment, one set of channel positions is fixed and used for the channel-based encoding protocol, while another set is adaptive and used for the audio-object-based encoding protocol, such that the channel configuration for an audio object may change as a function of time (depending on where the object is placed in the sound field). Thus, the hybrid audio coding system may carry information about two sets of speaker locations for playback, where one set may be fixed and be a subset of the other. Devices supporting legacy coded audio information would decode and render the audio information from the fixed subset, while devices capable of supporting the larger set could decode and render the additional coded audio information that is time-variably assigned to different speakers from the larger set. Furthermore, the system does not depend on the first decoder and the one or more secondary decoders being simultaneously present within a system and/or device. Therefore, a legacy and/or existing device/system containing only a decoder supporting the first protocol would yield a fully compatible sound field to be rendered via traditional channel-based reproduction systems. In this case, the unknown or unsupported portion(s) of the hybrid bitstream protocol (i.e., the audio information represented by a secondary encoding protocol) would be ignored by the decoder of a system or device that supports the first encoding protocol.
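The backward-compatibility behavior can be sketched as follows (identifiers and container layout are illustrative only, not the actual bitstream syntax): a legacy decoder walks the sub-streams of a hybrid bitstream, decodes those tagged with the protocol it supports, and silently ignores the rest.

```python
def decode_hybrid(substreams, supported_protocols=("protocol_1",)):
    """substreams: list of dicts {'protocol': str, 'payload': bytes}.
    Decodes payloads whose protocol is supported; skips the rest without error."""
    decoded = []
    for sub in substreams:
        if sub["protocol"] in supported_protocols:
            decoded.append(sub["payload"])   # hand off to the matching decoder
        # else: unknown/unsupported sub-stream is ignored
    return decoded

hybrid_bitstream = [
    {"protocol": "protocol_1", "payload": b"channel-based frames"},
    {"protocol": "protocol_2", "payload": b"object-based frames + metadata"},
]
print(decode_hybrid(hybrid_bitstream))                                # legacy decoder
print(decode_hybrid(hybrid_bitstream, ("protocol_1", "protocol_2")))  # new decoder
```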
In another embodiment, the codec 108 is configured to operate in a mode in which the first encoding subsystem (supporting the first protocol) contains a combined representation of all of the sound-field information (channels and objects) represented in both the first encoder and in one or more of the secondary encoder subsystems present within the hybrid encoder. This ensures that the hybrid bitstream maintains backward compatibility with decoders that support only the protocol of the first encoder subsystem, by allowing the audio objects (typically carried in one or more secondary encoder protocols) to be represented and rendered within decoders that support only the first protocol.
In yet another embodiment, the codec 108 includes two or more encoding subsystems, wherein each of these subsystems is configured to encode audio data according to a different protocol, and is configured to combine the outputs of the subsystems to produce a mixed-format (unified) bitstream.
One of the benefits of the embodiments is the ability to carry mixed encoded audio bitstreams over a wide range of content distribution systems, where each of the distribution systems traditionally only supports data encoded according to the first encoding protocol. This eliminates the need to modify/change any system and/or transport level protocol to specifically support a hybrid coding system.
Audio coding systems typically utilize standardized bitstream elements to enable the transport of additional (arbitrary) data within the bitstream itself. This additional (arbitrary) data is typically skipped (i.e., ignored) during decoding of the encoded audio included in the bitstream, but may be used for a purpose other than decoding. Different audio coding standards express these additional data fields using unique nomenclature. Bitstream elements of this general type may include, but are not limited to, auxiliary data, skip fields, data stream elements, padding elements, supplemental data, and sub-stream elements. Unless otherwise noted, the use of the expression "auxiliary data" in this document does not imply a specific type or format of additional data, but rather should be interpreted as a generic expression encompassing any or all of the examples associated with the present invention.
Data channels enabled via "auxiliary" bitstream elements of a first encoding protocol within a combined hybrid coding system bitstream may carry one or more secondary (independent or dependent) audio bitstreams (encoded according to one or more secondary encoding protocols). One or more secondary audio bitstreams may be divided into blocks of N samples and multiplexed into the "auxiliary data" field of the first bitstream. The first bit stream may be decoded by a suitable (complementary) decoder. In addition, the auxiliary data of the first bitstream may be extracted, recombined into one or more secondary audio bitstreams, decoded by a processor supporting the syntax of one or more of the secondary bitstreams, and then combined and presented together or independently. Furthermore, the roles of the first and second bit streams may also be reversed, such that a block of data of the first bit stream is multiplexed into the auxiliary data of the second bit stream.
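A sketch of the multiplexing idea described above follows, with the container field names assumed for illustration: a secondary bitstream is split into fixed-size blocks, carried in the "auxiliary data" field of successive frames of the first bitstream, and re-assembled on extraction.

```python
def mux(primary_frames, secondary_bitstream, block_size):
    """Carry block_size-byte slices of the secondary bitstream in the auxiliary
    data field of successive frames of the first bitstream."""
    blocks = [secondary_bitstream[i:i + block_size]
              for i in range(0, len(secondary_bitstream), block_size)]
    return [{"audio": frame, "aux_data": blocks[i] if i < len(blocks) else b""}
            for i, frame in enumerate(primary_frames)]

def demux(frames):
    """Extract and recombine the secondary bitstream from the auxiliary data fields."""
    return b"".join(frame["aux_data"] for frame in frames)

primary = [b"frame0", b"frame1", b"frame2"]
secondary = b"object-coded payload"
frames = mux(primary, secondary, block_size=8)
assert demux(frames) == secondary
print(frames)
```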
Bitstream elements associated with a secondary encoding protocol also carry and convey information (metadata) characterizing the underlying audio, which may include, but is not limited to, desired sound source position, velocity, and size. This metadata is utilized during the decoding and rendering processes to re-create the correct (i.e., original) position for the associated audio object carried within the applicable bitstream. The above-mentioned metadata may also be carried within bitstream elements associated with the first encoding protocol and applied to audio objects contained in the one or more secondary bitstreams present in the hybrid stream.
Bitstream elements associated with one or both of the first and second encoding protocols of the hybrid coding system carry/convey contextual metadata that identifies spatial parameters (i.e., the essence of the signal properties themselves) and additional information describing the type of the underlying audio essence in the form of specific audio classes carried within the hybrid coded audio bitstream. Such metadata may indicate, for example, the presence of spoken dialog, music, dialog over music, applause, singing, and so on, and may be used to adaptively modify the behavior of interconnected pre- or post-processing modules upstream or downstream of the hybrid coding system.
In one embodiment, the codec 108 is configured to operate with a shared or common pool of bits (pool) in which bits available for encoding are "shared" among some or all of the encoding subsystems that support one or more protocols. Such a codec may distribute the available bits (from a common "shared" pool of bits) among the coding subsystems in order to optimize the overall audio quality of the unified bitstream. For example, during a first time interval, the codec may allocate more available bits to the first coding subsystem and less available bits to the remaining subsystems, while during a second time interval, the codec may allocate less available bits to the first coding subsystem and more available bits to the remaining subsystems. The decision of how to distribute the bits between the encoding subsystems may depend, for example, on the results of statistical analysis of the shared pool of bits and/or analysis of the audio content encoded by each subsystem. The codec may allocate bits from the shared pool in such a way that a unified bitstream constructed by multiplexing the outputs of the coding subsystems maintains a constant frame length/bit rate over a particular time interval. The frame length/bit rate of the unified bitstream may also be varied in some cases within a specific time interval.
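The shared-bit-pool idea can be made concrete with a small sketch; the proportional-to-complexity allocation rule below is an assumption chosen for illustration, not the codec's mandated strategy, and any statistical multiplexing scheme could be substituted. The per-frame budget is split among subsystems while the total frame size stays constant.

```python
def allocate_bits(frame_budget, complexities):
    """Split a shared pool of bits among encoding subsystems in proportion
    to their measured content complexity, keeping the unified frame size constant."""
    total = sum(complexities)
    if total == 0:
        # Nothing to encode: spread the budget evenly.
        base = [frame_budget // len(complexities)] * len(complexities)
    else:
        base = [int(frame_budget * c / total) for c in complexities]
    # Hand any rounding remainder to the first subsystem so the frame
    # length/bit rate of the multiplexed output stays exactly constant.
    base[0] += frame_budget - sum(base)
    return base

# First interval: the first (channel-based) subsystem is busier.
print(allocate_bits(192_000, [3.0, 1.0]))   # [144000, 48000]
# Second interval: object activity dominates, so the split reverses.
print(allocate_bits(192_000, [1.0, 3.0]))   # [48000, 144000]
```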
In an alternative embodiment, codec 108 produces a unified bitstream that includes data encoded according to a first encoding protocol that is configured and transmitted as independent sub-streams of an encoded data stream that a decoder supporting the first encoding protocol will decode, and data encoded according to a second protocol that is transmitted as independent or dependent sub-streams of the encoded data stream that a decoder supporting the first protocol will ignore. More generally, in one class of embodiments, a codec produces a unified bitstream that includes two or more independent or dependent substreams (where each substream includes data encoded according to a different or the same encoding protocol).
In yet another alternative embodiment, the codec 108 produces a unified bitstream that includes data encoded according to the first encoding protocol, configured and transmitted with a unique bitstream identifier (which a decoder supporting the first encoding protocol associated with that unique bitstream identifier will decode), and data encoded according to a second protocol, configured and transmitted with its own unique bitstream identifier (which a decoder supporting the first protocol will ignore). More generally, in a class of embodiments, the codec produces a unified bitstream comprising two or more substreams (where each substream includes data encoded according to a different or the same encoding protocol, and where each carries a unique bitstream identifier). The methods and systems for creating a unified bitstream described above provide the ability to unambiguously signal (to the decoder) which interleaving and/or protocol has been used within the hybrid bitstream (e.g., whether the described AUX data, SKIP, DSE, or substream approaches are used).
The hybrid coding system is configured to support de-interleaving/de-multiplexing and re-interleaving/re-multiplexing of bitstreams supporting one or more secondary protocols into a first bitstream (supporting a first protocol) at any processing point found throughout the media delivery system. The hybrid codec is also configured to be able to encode audio input streams having different sample rates into one bitstream. This provides a means for efficiently encoding and distributing audio sources containing signals with inherently different bandwidths. For example, dialog tracks typically have an inherently lower bandwidth than music and effects tracks.
Rendering
Under an embodiment, the adaptive audio system allows multiple (e.g., up to 128) tracks to be packaged, typically as a combination of beds and objects. The basic format of the audio data for the adaptive audio system comprises a number of independent mono audio streams. Each stream has metadata associated with it that specifies whether the stream is a channel-based stream or an object-based stream. Channel-based streams have rendering information encoded by means of a channel name or label, while object-based streams have location information encoded through mathematical expressions in further associated metadata. The original independent audio streams are then packaged as a single serial bitstream that contains all of the audio data in an ordered fashion. This adaptive data configuration allows sound to be rendered according to an allocentric frame of reference, in which the final rendering location of a sound is based on the playback environment so as to correspond to the mixer's intent. Thus, a sound can be specified to originate from a frame of reference of the playback room (e.g., the middle of the left wall), rather than from a specific labeled speaker or speaker group (e.g., left surround). The object location metadata contains the appropriate allocentric frame of reference information required to play the sound correctly using the available speaker positions in a room that is set up to play the adaptive audio content.
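A minimal sketch of this per-stream packaging is shown below; the field names are chosen for illustration only and are not the format's actual syntax. Each mono stream carries a flag marking it as channel-based or object-based, channel streams carry a label, and object streams carry an allocentric position expressed as fractions of the room.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MonoStream:
    """One independent mono audio stream plus its associated metadata
    (field names are illustrative, not the actual adaptive audio syntax)."""
    samples: list                  # the audio essence
    is_object: bool                # False => channel-based, True => object-based
    channel_label: Optional[str] = None                    # e.g. "L", "Ls" for beds
    position: Optional[Tuple[float, float, float]] = None  # allocentric (x, y, z)
                                                           # as fractions of the room

# A channel-based bed element and an object placed at the middle of the left wall.
bed_left = MonoStream(samples=[0.0] * 48000, is_object=False, channel_label="L")
whoosh   = MonoStream(samples=[0.0] * 48000, is_object=True, position=(0.0, 0.5, 0.5))
```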
The renderer takes the bitstream encoding the audio tracks and processes the content according to signal type. Beds are fed to arrays, which will potentially require different delay and equalization processing than individual objects. The process supports rendering these beds and objects to multiple (up to 64) speaker outputs. Fig. 4 is a block diagram of the rendering stage of an adaptive audio system, in accordance with one embodiment. As shown in the system 400 of fig. 4, a number of input signals (such as up to 128 audio tracks, which comprise the adaptive audio signal 402) are provided by certain components of the creation, authoring, and packaging stages of the system 300 (such as the RMU 306 and the processor 312). These signals comprise the channel-based beds and the objects that are utilized by the renderer 404. The channel-based audio (beds) and the objects are input to a level manager 406, which provides control over the amplitude or output level of the different audio components. Certain audio components may be processed by an array correction component 408. The adaptive audio signal then passes through a B-chain processing component 410, which generates a number (e.g., up to 64) of speaker feed output signals. In general, the B-chain feeds refer to the signals processed by power amplifiers, crossovers, and speakers, as opposed to the A-chain content that constitutes the soundtrack on the motion picture film.
In one embodiment, the renderer 404 runs a rendering algorithm that makes intelligent, best use of the surround speakers in the theater. By improving the power handling and frequency response of the surround speakers, and maintaining the same monitoring reference level for each output channel or speaker in the theater, objects panned between the screen and the surround speakers can maintain their sound pressure level and have a closer timbre match without significantly increasing the overall sound pressure level in the theater. An array of appropriately specified surround speakers will typically have sufficient headroom to reproduce the maximum dynamic range available within a 7.1 or 5.1 surround soundtrack (i.e., 20 dB above reference level), however it is unlikely that a single surround speaker will have the same headroom as a large multi-way screen speaker. As a result, there will likely be cases in which an object placed in the surround field requires greater sound pressure than is available from a single surround speaker. In these cases, the renderer will spread the sound across an appropriate number of speakers in order to achieve the required sound pressure level. The adaptive audio system improves the quality and power handling of the surround speakers to provide an improvement in the realism of the rendering. It provides support for bass management of the surround speakers through the use of optional rear subwoofers, allowing each surround speaker to achieve improved power handling while potentially utilizing smaller speaker cabinets. It also allows the addition of side surround speakers closer to the screen than in current practice to ensure that objects can transition smoothly from screen to surround.
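As a back-of-the-envelope illustration of why the renderer spreads such an object across several surround speakers, the sketch below estimates how many speakers must share the object to reach a target level. The 10·log10(N) power-summation rule for mutually incoherent sources is standard acoustics; the specific SPL figures are example values only.

```python
import math

def speakers_needed(target_spl_db, single_speaker_max_spl_db):
    """Estimate how many surround speakers must share an object so the array
    reaches the target SPL, assuming incoherent power summation
    (N speakers add about 10*log10(N) dB over one speaker)."""
    deficit_db = target_spl_db - single_speaker_max_spl_db
    if deficit_db <= 0:
        return 1
    return math.ceil(10 ** (deficit_db / 10.0))

# Example values only: an object needs 6 dB more than one surround speaker can deliver.
print(speakers_needed(target_spl_db=105, single_speaker_max_spl_db=99))  # -> 4
```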
By using metadata that specifies the location information of audio objects together with a particular rendering process, the system 400 provides a comprehensive, flexible method for content creators to move beyond the constraints of existing systems. As stated previously, current systems create and distribute audio that is fixed to particular speaker locations with limited knowledge of the type of content conveyed in the audio essence (the part of the audio that is played back). The adaptive audio system 100 provides a new hybrid approach that includes the option of both speaker-location-specific audio (left channel, right channel, etc.) and object-oriented audio elements that carry generalized spatial information, which may include, but is not limited to, position, size, and velocity. This hybrid approach provides a balanced way to trade fidelity (provided by fixed speaker locations) against flexibility in rendering (generalized audio objects). The system also provides additional useful information about the audio content, paired with the audio essence by the content creator at the time of content creation. This information provides powerful, detailed attributes of the audio that can be used in very powerful ways during rendering. Such attributes may include, but are not limited to, content type (dialog, music, effects, Foley, background/ambience, etc.), spatial attributes (3D position, 3D size, velocity), and rendering information (snap to speaker location, channel weights, gain, bass management information, etc.).
The adaptive audio system described in this application provides powerful information that can be used for rendering by a widely varying number of endpoints. In many cases, the optimal rendering technique to apply depends greatly on the endpoint device. For example, home theater systems and soundbars may have 2, 3, 5, 7, or even 9 separate speakers. Many other types of systems, such as televisions, computers, and music docks, have only two speakers, and nearly all commonly used devices have a binaural headphone output (PC, laptop, tablet, cellular phone, music player, etc.). However, for traditional audio distributed today (mono, stereo, 5.1, or 7.1 channels), endpoint devices often need to make simplistic decisions and compromises in order to render and reproduce audio that is distributed in a channel/speaker-specific form. In addition, little or no information is conveyed about the actual content being distributed (dialog, music, ambience, etc.), and little or no information about the content creator's intent for the audio reproduction. However, the adaptive audio system 100 provides this information, and potentially access to the audio objects, which may be used to create a compelling next generation user experience.
The system 100 allows the content creator to embed the spatial intent of the mix within the bitstream using metadata such as position, size, velocity, and so on, through unique and powerful metadata and an adaptive audio transmission format. This allows a great deal of flexibility in the spatial reproduction of audio. From a spatial rendering standpoint, adaptive audio enables the mix to be adapted to the exact location of the speakers in a particular room, avoiding the spatial distortion that occurs when the geometry of the playback system is not identical to that of the authoring system. In current audio reproduction systems, in which only audio for speaker channels is sent, the intent of the content creator is unknown. The system 100 uses metadata conveyed throughout the creation and distribution pipeline, and an adaptive-audio-aware rendering system can use this metadata information to render the content in a manner that matches the content creator's original intent. Likewise, the mix can be adapted to the exact hardware configuration of the rendering system. Currently, many different possible speaker configurations and types exist in rendering equipment such as televisions, home theaters, soundbars, portable music player docks, and so on. When these systems are sent channel-specific audio information today (i.e., left and right channel audio or multi-channel audio), they must process the audio to appropriately match the capabilities of the rendering equipment. One example is standard stereo audio being sent to a soundbar with more than two speakers. For example, some soundbars have side-firing speakers to create a surround sensation. With adaptive audio, the spatial information and the content type (such as ambient effects) can be used by the soundbar to send only the appropriate audio to these side-firing speakers.
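The soundbar example can be sketched as follows; the content-type tags and the routing rule are assumptions chosen for illustration, not a product specification. Objects tagged as ambience are routed to the side-firing drivers, while dialog and other content stays on the forward-facing drivers.

```python
def route_to_soundbar(objects, has_side_firing):
    """Route object streams to soundbar driver groups using content-type
    metadata (illustrative policy only)."""
    routing = {"front": [], "side": []}
    for obj in objects:
        if has_side_firing and obj["content_type"] == "ambience":
            routing["side"].append(obj["name"])    # create the surround sensation
        else:
            routing["front"].append(obj["name"])   # keep dialog etc. up front
    return routing

objs = [{"name": "dialog", "content_type": "dialog"},
        {"name": "rain",   "content_type": "ambience"}]
print(route_to_soundbar(objs, has_side_firing=True))
# {'front': ['dialog'], 'side': ['rain']}
```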
Adaptive audio systems allow essentially unlimited interpolation between the speakers in a system in all dimensions: front/back, left/right, up/down, near/far. In current audio reproduction systems, no information exists for how to handle audio when it is desired to position the audio so that it is perceived by a listener as lying between two speakers. Currently, with audio assigned only to specific speakers, a spatial quantization factor is introduced. With adaptive audio, the spatial positioning of the audio is known accurately and can be reproduced accordingly on the audio reproduction system.
For headphone rendering, the creator's intent is realized by matching Head Related Transfer Functions (HRTFs) to the spatial position. When audio is reproduced over headphones, spatial virtualization can be achieved by applying a head related transfer function, which processes the audio and adds perceptual cues that create the perception of audio being played in three-dimensional space rather than over headphones. The accuracy of the spatial reproduction depends on the selection of an appropriate HRTF, which can vary based on several factors, including spatial position. Using the spatial information provided by the adaptive audio system can result in the selection of one, or a continuously varying number, of HRTFs, greatly improving the reproduction experience.
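A minimal sketch of headphone virtualization by HRTF filtering follows; the impulse responses here are trivial placeholders, whereas a real system would select measured or modeled HRTFs appropriate to the object's spatial position.

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve a mono object signal with the head-related impulse response
    pair selected for its spatial position, producing a 2-channel signal."""
    return np.stack([np.convolve(mono, hrir_left),
                     np.convolve(mono, hrir_right)])

# Placeholder HRIRs: an earlier, louder left ear suggests a source toward the left.
mono = np.random.randn(1024)
hrir_l = np.array([0.9, 0.3, 0.1, 0.0])
hrir_r = np.array([0.0, 0.6, 0.2, 0.1])
stereo = binaural_render(mono, hrir_l, hrir_r)
print(stereo.shape)  # (2, 1027)
```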
The spatial information conveyed by the adaptive audio system may be used not only by content creators to create compelling entertainment experiences (film, television, music, and so on), but the spatial information may also indicate the listener's position relative to physical objects such as buildings or geographic points of interest. This would allow the user to interact with a virtualized audio experience that relates to the real world, i.e., with increased realism.
Embodiments also enable spatial upmixing by performing enhanced upmixing that reads the metadata when object audio data is not available. Knowing the positions of all objects and their types allows the upmixer to better differentiate elements within the channel-based tracks. Existing upmix algorithms must infer information such as the audio content type (speech, music, ambient effects) and the positions of different elements within the audio stream in order to create a high quality upmix with minimal or no audible artifacts. Often the inferred information may be incorrect or inappropriate. With adaptive audio, the additional information available from the metadata regarding, for example, audio content type, spatial position, velocity, audio object size, and so on, can be used by an upmix algorithm to create a high quality reproduction result. The system also spatially matches the audio to the video by accurately positioning on-screen audio objects to the visual elements. In this case, a compelling audio/video reproduction experience is possible if the reproduced spatial position of certain audio elements matches the image elements on the screen, especially with larger screen sizes. One example is having the dialog in a film or television program spatially coincide with the person or character who is speaking on the screen. With typical speaker-channel based audio, there is no easy way to determine where the dialog should be spatially positioned in order to match the location of the person or character on the screen. Such audio/visual alignment can be achieved with the audio information available in adaptive audio. Visual-position and audio spatial alignment can also be used for non-character/dialog objects such as cars, trucks, animation, and so on.
The system 100 facilitates a spatial masking process, since knowledge of the spatial intent of the mix through the adaptive audio metadata means that the mix can be adapted to any speaker configuration. However, with playback system limitations there is a risk of downmixing objects into the same or nearly the same location. For example, an object intended to be panned in the left rear may be downmixed to the left front if no surround channels are present, but if a louder element occurs in the left front at the same time, the downmixed object will be masked and disappear from the mix. Using adaptive audio metadata, spatial masking can be anticipated by the renderer, and the spatial and/or loudness downmix parameters for each object can be adjusted so that all of the audio elements of the mix remain just as perceptible as in the original mix. Since the renderer understands the spatial relationship between the mix and the playback system, it has the ability to "snap" objects to the closest speaker rather than creating a phantom image between two or more speakers. Although this may slightly distort the spatial representation of the mix, it also allows the renderer to avoid an unintended phantom image. For example, if the angular position of the left speaker of the mixing stage does not correspond to the angular position of the left speaker of the playback system, using the snap-to-closest-speaker function can prevent the playback system from reproducing a constant phantom image of the left channel of the mixing stage.
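The snap-to-closest-speaker behavior can be sketched as a nearest-neighbor search over the known speaker positions; the positions below are illustrative assumptions rather than any standardized layout.

```python
import math

def snap_to_closest_speaker(obj_pos, speaker_positions):
    """Return the name of the speaker nearest to the object's intended
    position instead of phantom-imaging between two or more speakers."""
    return min(speaker_positions,
               key=lambda name: math.dist(obj_pos, speaker_positions[name]))

speakers = {"L": (0.0, 0.0, 0.5), "R": (1.0, 0.0, 0.5), "Ls": (0.0, 0.8, 0.5)}
print(snap_to_closest_speaker((0.1, 0.6, 0.5), speakers))  # -> 'Ls'
```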
For content processing, the adaptive audio system 100 allows the content creator to create individual audio objects and add information about the content that can be conveyed to the reproduction system. This allows a great deal of flexibility in the processing of the audio prior to reproduction. From a content processing and rendering standpoint, the adaptive audio system enables processing to be adapted to the object type. For example, dialog enhancement may be applied to dialog objects only. Dialog enhancement refers to a method of processing audio that contains dialog such that the audibility and/or intelligibility of the dialog is increased and/or improved. In many cases the audio processing that is applied to dialog is inappropriate for non-dialog audio content (i.e., music, ambient effects, etc.) and can result in objectionable audible artifacts. With adaptive audio, an audio object may contain only the dialog in a piece of content, and it can be labeled accordingly so that a rendering solution can selectively apply dialog enhancement to only the dialog content. In addition, if the audio object is only dialog (and not a mixture of dialog and other content, as is often the case), then the dialog enhancement processing can process the dialog exclusively (thereby limiting any processing performed on other content). Likewise, bass management (filtering, attenuation, gain) can be targeted at specific objects based on their type. Bass management refers to selectively isolating and processing only the bass (or lower) frequencies in a particular piece of content. With current audio systems and delivery mechanisms this is a "blind" process applied to all of the audio. With adaptive audio, specific audio objects for which bass management is appropriate can be identified by the metadata, and the rendering processing can be applied appropriately.
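Selective, metadata-driven processing can be sketched as a simple dispatch on the object's content-type tag; the tags, gains, and the trivial "enhancement" below are placeholders standing in for whatever DSP a real renderer would apply.

```python
def process_object(samples, content_type):
    """Apply processing only where the metadata says it is appropriate
    (placeholder operations; a real system uses proper DSP)."""
    if content_type == "dialog":
        # Dialog enhancement applied only to dialog objects.
        return [s * 1.5 for s in samples]
    if content_type == "lfe_effect":
        # Bass management routes/filters only objects flagged as low-frequency.
        return [s * 0.7 for s in samples]
    # Music, ambience, etc. pass through untouched, avoiding artifacts.
    return samples

print(process_object([0.1, -0.2], "dialog"))   # approx [0.15, -0.3]
print(process_object([0.1, -0.2], "music"))    # unchanged
```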
The adaptive audio system 100 also provides for object-based dynamic range compression and selective upmixing. Traditional audio tracks have the same duration as the content itself, while an audio object may occur in the content for only a limited amount of time. The metadata associated with an object may contain information about its average and peak signal amplitude, as well as its onset or attack time (particularly for transient material). This information would allow a compressor to better adapt its compression and time constants (attack, release, etc.) to better suit the content. For selective upmixing, content creators may choose to indicate in the adaptive audio bitstream whether or not an object should be upmixed. This information allows the adaptive audio renderer and upmixer to distinguish which audio elements can be safely upmixed while respecting the creator's intent.
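The compressor adaptation can be illustrated with a small sketch that picks attack and release time constants from the object's metadata; the mapping from "transient" onset or high crest factor to a fast attack is an assumption shown only to make the idea concrete.

```python
def choose_compressor_constants(obj_metadata):
    """Pick attack/release times (ms) from object metadata such as onset
    character and peak-to-mean ratio (illustrative mapping only)."""
    crest_db = obj_metadata["peak_db"] - obj_metadata["mean_db"]
    if obj_metadata.get("onset") == "transient" or crest_db > 12:
        return {"attack_ms": 1, "release_ms": 50}    # catch sharp impacts
    return {"attack_ms": 20, "release_ms": 250}      # gentler on steady material

print(choose_compressor_constants({"peak_db": -6, "mean_db": -24, "onset": "transient"}))
print(choose_compressor_constants({"peak_db": -12, "mean_db": -20}))
```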
Embodiments also allow the adaptive audio system to select a preferred rendering algorithm from a number of available rendering algorithms and/or surround sound formats. Examples of available rendering algorithms include: binaural, stereo dipole, Ambisonics, Wave Field Synthesis (WFS), multi-channel panning, and raw stems with position metadata. Others include dual balance and vector-based amplitude panning.
The binaural distribution format uses a two-channel representation of the sound field in terms of the signals present at the left and right ears. Binaural information can be created via in-ear recording or synthesized using HRTF models. Playback of a binaural representation is typically done over headphones, or by employing crosstalk cancellation. Playback over an arbitrary speaker setup would require signal analysis in order to determine the associated sound field and/or one or more signal sources.
The stereo dipole rendering method is a transaural crosstalk cancellation process used to make binaural signals playable over stereo speakers (e.g., at +/- 10 degrees off center).
Ambisonics is a distribution format and a rendering method that is encoded in a four-channel form called B-format. The first channel, W, is the non-directional pressure signal; the second channel, X, is the directional pressure gradient containing the front and back information; the third channel, Y, contains the left and right information, and Z the up and down information. These channels define a first order sample of the complete sound field at a single point. Ambisonics uses all available speakers to recreate the sampled (or synthesized) sound field within the speaker array such that, while some speakers are pushing, others are pulling.
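The first-order B-format encoding of a mono source at azimuth θ and elevation φ can be written down directly; the sketch below uses the traditional 1/√2 weighting on the W channel, noting that normalization conventions vary between Ambisonics flavours.

```python
import math

def encode_bformat(sample, azimuth_rad, elevation_rad):
    """First-order Ambisonics (B-format) encoding of one mono sample.
    Uses the traditional 1/sqrt(2) weight on W; conventions differ by flavour."""
    w = sample / math.sqrt(2)                                     # omnidirectional pressure
    x = sample * math.cos(azimuth_rad) * math.cos(elevation_rad)  # front/back
    y = sample * math.sin(azimuth_rad) * math.cos(elevation_rad)  # left/right
    z = sample * math.sin(elevation_rad)                          # up/down
    return w, x, y, z

# A source straight ahead (azimuth 0, elevation 0) contributes only to W and X.
print(encode_bformat(1.0, 0.0, 0.0))  # (0.707..., 1.0, 0.0, 0.0)
```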
Wave field synthesis is a rendering method of sound reproduction based on the accurate construction of the desired wave field by secondary sources. WFS is based on the Huygens principle and is implemented as arrays of speakers (tens or hundreds) that ring the listening space and operate in a coordinated, phased fashion to recreate each individual sound wave.
Multi-channel panning is a distribution format and/or rendering method, and may be referred to as channel-based audio. In this case, sound is represented as a number of discrete sources to be played back through an equal number of speakers at defined angles from the listener. The content creator/mixer can create virtual images by panning signals between adjacent channels to provide direction cues; early reflections, reverberation, and the like can be mixed into many channels to provide direction and environment cues.
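Pairwise panning between two adjacent channels is commonly done with a constant-power (sine/cosine) law; the sketch below shows one standard choice rather than a mandated panner for this system.

```python
import math

def constant_power_pan(sample, pan):
    """Pan a mono sample between two adjacent channels.
    pan = 0.0 -> fully in the first channel, 1.0 -> fully in the second.
    The sine/cosine law keeps total power constant across the pan range."""
    theta = pan * math.pi / 2
    return sample * math.cos(theta), sample * math.sin(theta)

left, right = constant_power_pan(1.0, 0.5)      # centered phantom image
print(left, right, left**2 + right**2)          # ~0.707 0.707 1.0
```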
Raw stems with position metadata is a distribution format and may also be referred to as object-based audio. In this format, distinct, "close mic'ed" sound sources are represented along with position and environment metadata. Virtual sources are rendered based on the metadata and on the playback equipment and listening environment.
The adaptive audio format is a hybrid of the multi-channel panning format and the raw stems format. The rendering method in the present embodiment is multi-channel panning. For the audio channels, rendering (panning) happens at authoring time, while for objects, rendering (panning) happens at playback.
Metadata and adaptive audio transmission format
As described above, the metadata is generated during the creation stage to encode certain positional information for the audio objects and to accompany the audio program to aid in rendering the audio program, and in particular to describe the audio program in a way that enables it to be rendered on a wide variety of playback equipment and playback environments. The metadata is generated for a given program and for the editors and mixers that create, collect, edit, and manipulate the audio during post-production. An important feature of the adaptive audio format is the ability to control how the audio will translate to playback systems and environments that differ from the mixing environment. In particular, a given cinema may have lesser capabilities than the mixing environment.
The adaptive audio renderer is designed to make the best use of the equipment available to recreate the mixer's intent. Moreover, the adaptive audio authoring tools allow the mixer to preview and adjust how the mix will be rendered in a variety of playback configurations. All of the metadata values can be conditioned on the playback environment and speaker configuration. For example, a different mix level for a given audio element can be specified based on the playback configuration or mode. In one embodiment, the list of conditioned playback modes is extensible and includes the following: (1) channel-based only playback: 5.1, 7.1 (height), 9.1; and (2) discrete speaker playback: 3D, 2D (no height).
In one embodiment, the metadata controls or specifies different aspects of the adaptive audio content and is organized based on different types, including: program metadata, audio metadata, and presentation metadata (for channels and objects). Each type of metadata includes one or more metadata items that provide values for a property referred to by an Identifier (ID). Fig. 5 is a table listing metadata types and associated metadata elements for an adaptive audio system, according to one embodiment.
As shown in table 500 of fig. 5, the first type of metadata is program metadata, which includes metadata elements that specify the frame rate, track count, extensible channel description, and mix stage description. The frame rate metadata element specifies the rate of audio content frames in frames per second (fps). The raw audio format need not include framing of the audio or metadata, since the audio is provided as full tracks (the duration of a reel or entire feature) rather than audio segments (the duration of an object). The raw format does, however, need to carry all of the information required to enable the adaptive audio encoder to frame the audio and metadata, including the actual frame rate. Table 1 shows the ID, example values, and description of the frame rate metadata element.
TABLE 1
The track count metadata element indicates the number of audio tracks in a frame. An example adaptive audio decoder/processor can support up to 128 simultaneous audio tracks, while the adaptive audio format will support any number of audio tracks. Table 2 shows the ID, example values, and description of the track count metadata element.
TABLE 2
Channel-based audio can be assigned to non-standard channels, and the extensible channel description metadata element enables mixes to use new channel positions. For each extension channel the following metadata should be provided, as shown in table 3:
TABLE 3
The mix stage description metadata element specifies the frequency at which a particular speaker produces half the power of the passband. Table 4 shows the ID, example values, and description of the mix stage description metadata element, where LF = low frequency, HF = high frequency, and the 3 dB point = edge of the speaker passband.
TABLE 4
As shown in fig. 5, the second type of metadata is audio metadata. Each channel-based or object-based audio element consists of audio essence and metadata. The audio essence is a mono audio stream carried on one of many audio tracks. The associated metadata describes how the audio essence is stored (audio metadata, e.g., sample rate) or how it should be rendered (rendering metadata, e.g., the desired audio source position). In general, the audio tracks are continuous for the duration of the audio program. The program editor or mixer is responsible for assigning audio elements to the tracks. Track usage is expected to be sparse, i.e., the median simultaneous track usage may be only 16 to 32. In a typical implementation, the audio will be transmitted efficiently using a lossless encoder. However, alternative implementations are possible, for example transmitting uncoded audio data or lossy-coded audio data. In a typical implementation, the format consists of up to 128 audio tracks, where each track has a single sample rate and a single coding system. Each track lasts for the duration of the feature (there is no explicit reel support). The mapping of objects to tracks (time multiplexing) is the responsibility of the content creator (mixer).
As shown in fig. 5, the audio metadata includes elements for sample rate, bit depth, and coding system. Table 5 shows the ID, example values, and description of the sample rate metadata element.
TABLE 5
Table 6 shows the ID, example values and descriptions (for PCM and lossless compression) of the bit depth metadata element.
TABLE 6
Table 7 shows IDs, example values, and descriptions of encoding system metadata elements.
TABLE 7
As shown in fig. 5, the third type of metadata is rendering metadata. The rendering metadata specifies values that help the renderer match the original mixer's intent as closely as possible regardless of the playback environment. The set of metadata elements differs for channel-based audio and object-based audio. The first rendering metadata field selects between the two types of audio, channel-based or object-based, as shown in table 8.
TABLE 8
The rendering metadata for channel-based audio comprises a position metadata element that specifies the audio source position as one or more speaker positions. Table 9 shows the ID and values of the position metadata element for the channel-based case.
TABLE 9
The rendering metadata for channel-based audio also comprises a rendering control element that specifies certain characteristics for the playback of channel-based audio, as shown in table 10.
TABLE 10
For object-based audio, the metadata includes elements analogous to those for channel-based audio. Table 11 provides the ID and values for the object position metadata element. The object position is described in one of three ways: three-dimensional coordinates; a plane and two-dimensional coordinates; or a line and one-dimensional coordinates. The rendering method can be adapted based on the position information type.
TABLE 11
The ID and values for the object rendering control metadata element are shown in table 12. These values provide additional means to control or optimize the rendering of object-based audio.
TABLE 12
In one embodiment, the metadata described above and shown in fig. 5 is generated and stored as one or more files that are associated or indexed with the corresponding audio content, so that the audio streams are processed by the adaptive audio system interpreting the metadata generated by the mixer. It should be noted that the metadata described above is an exemplary set of IDs, values, and definitions, and other or additional metadata elements may be included for use in the adaptive audio system.
In one embodiment, two (or more) sets of metadata elements are associated with each of the channel-based and object-based audio streams. A first set of metadata is applied to the plurality of audio streams for a first condition of the playback environment, and a second set of metadata is applied to the plurality of audio streams for a second condition of the playback environment. The second or subsequent set of metadata elements replaces the first set of metadata elements for a given audio stream based on the condition of the playback environment. The condition may include factors such as room size, shape, composition of material within the room, current occupancy and density of people in the room, ambient noise characteristics, ambient light characteristics, and any other factor that might affect the sound or even the mood of the playback environment.
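The conditional metadata mechanism can be sketched as choosing, per stream, whichever metadata set matches the current playback conditions; the condition keys and the fallback rule below are assumptions made only for illustration.

```python
def select_metadata(stream, playback_condition):
    """Pick the metadata set whose condition matches the playback environment,
    falling back to the first (default) set (illustrative scheme)."""
    for meta in stream["metadata_sets"]:
        if meta["condition"] == playback_condition:
            return meta
    return stream["metadata_sets"][0]

stream = {"name": "score",
          "metadata_sets": [
              {"condition": "large_room", "gain_db": 0.0,  "position": (0.2, 0.1, 0.6)},
              {"condition": "small_room", "gain_db": -3.0, "position": (0.3, 0.1, 0.5)},
          ]}
print(select_metadata(stream, "small_room")["gain_db"])  # -3.0
```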
Post-production and mastering
The rendering stage 110 of the adaptive audio processing system 100 may include audio post-production steps that lead to the creation of a final mix. In a cinema application, the three main categories of sound used in a movie mix are dialog, music, and effects. Effects consist of sounds that are not dialog or music (e.g., ambient noise, background/scene noise). Sound effects can be recorded or synthesized by the sound designer, or they can be sourced from effects libraries. A sub-group of effects involving specific noise sources (e.g., footsteps, doors, etc.) is known as Foley and is performed by Foley artists. The different types of sound are labeled and panned accordingly by the recording engineers.
Fig. 6 illustrates an example workflow for the post-production process in an adaptive audio system, according to one embodiment. As shown in diagram 600, the individual sound components of music, dialog, Foley, and effects are all brought together in the dubbing theater during the final mix 606, and the re-recording mixer(s) 604 use the premixes (also known as the 'mix minus') along with the individual sound objects and positional data to create stems as a way of grouping, for example, dialog, music, effects, Foley, and background sound. In addition to forming the final mix 606, the music and effects stems can be used as a basis for creating dubbed language versions of the film. Each stem consists of a channel-based bed and several audio objects with metadata. The stems combine to form the final mix. Using object panning information from both the audio workstation and the mixing console, the rendering and mastering unit 608 renders the audio to the speaker locations in the dubbing theater. This rendering allows the mixers to hear how the channel-based beds and the audio objects combine, and also provides the ability to render to different configurations. The mixer can use conditional metadata, which defaults to the relevant profiles, to control how the content is rendered to the surround channels. In this way, the mixers retain full control of how the movie plays back in all scalable environments. A monitoring step may be included after either or both of the re-recording step 604 and the final mix step 606 to allow the mixer to hear and evaluate the intermediate content generated during each of these stages.
During the mastering session, the stems, objects, and metadata are brought together in an adaptive audio package 614, which is produced by the print master 610. This package also contains the backward-compatible (legacy 5.1 or 7.1) surround sound theatrical mix 612. The rendering/mastering unit (RMU) 608 can render this output if desired, thereby eliminating the need for any additional workflow steps in generating existing channel-based deliverables. In one embodiment, the audio files are packaged using standard Material Exchange Format (MXF) wrapping. The adaptive audio mix master file can also be used to generate other deliverables, such as consumer multi-channel or stereo mixes. The intelligent profiles and conditional metadata allow controlled renderings that can significantly reduce the time required to create such mixes.
In one embodiment, a packaging system can be used to create a digital cinema package for the deliverables that include an adaptive audio mix. The audio track files may be locked together to help prevent synchronization errors with the adaptive audio track files. Certain territories require that track files be added during the packaging phase, for example adding Hearing Impaired (HI) or Visually Impaired Narration (VI-N) tracks to the main audio track file.
In one embodiment, the speaker array in the playback environment may comprise any number of surround sound speakers placed and aimed in accordance with established surround sound standards. Any number of additional speakers for accurate rendering of the object-based audio content may also be placed based on the conditions of the playback environment. These additional speakers may be set up by a sound engineer, and this setup is provided to the system in the form of a setup file that the system uses for rendering the object-based components of the adaptive audio to a specific speaker or speakers within the overall speaker array. The setup file includes at least a list of speaker designations and a mapping of channels to individual speakers, information regarding the grouping of speakers, and a runtime mapping based on the relative position of the speakers to the playback environment. The runtime mapping is utilized by the snap-to feature of the system, which renders point-source, object-based audio content to the specific speaker nearest to the perceived location of the sound as intended by the sound engineer.
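For illustration only, such a setup file can be sketched as a small structured description; the keys below are assumptions rather than a published schema. It lists speaker designations, channel-to-speaker mappings, groupings, and the positions used for runtime mapping and snap-to rendering.

```python
# Hypothetical speaker setup description (keys are illustrative, not a published schema).
speaker_setup = {
    "speakers": [
        {"name": "L",    "channel": 1,  "group": "screen",        "position": (0.2, 0.0, 0.6)},
        {"name": "C",    "channel": 2,  "group": "screen",        "position": (0.5, 0.0, 0.6)},
        {"name": "R",    "channel": 3,  "group": "screen",        "position": (0.8, 0.0, 0.6)},
        {"name": "Lss1", "channel": 9,  "group": "side_surround", "position": (0.0, 0.4, 0.6)},
        {"name": "Ts1",  "channel": 17, "group": "top_surround",  "position": (0.4, 0.5, 1.0)},
    ],
}

def speakers_in_group(setup, group):
    """Return the channel numbers for one speaker group, as a renderer
    might when addressing, say, only the side surround array."""
    return [s["channel"] for s in setup["speakers"] if s["group"] == group]

print(speakers_in_group(speaker_setup, "side_surround"))  # [9]
```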
Fig. 7 is a diagram of an example workflow of a digital cinema packaging process using adaptive audio files, in accordance with one embodiment. As shown in diagram 700, the audio files comprising both the adaptive audio files and the 5.1 or 7.1 surround sound audio files are input to a wrapping/encryption block 704. In one embodiment, upon creation of the digital cinema package in block 706, the PCM MXF file (with the appropriate additional tracks appended) is encrypted using the SMPTE specifications in accordance with existing practice. The adaptive audio MXF is packaged as an auxiliary track file and is optionally encrypted using a symmetric content key per the SMPTE specifications. This single DCP 708 can then be delivered to any Digital Cinema Initiatives (DCI) compliant server. In general, any installation that is not suitably equipped will simply ignore the additional track file containing the adaptive audio soundtrack and will use the existing main audio track file for standard playback. An installation equipped with an appropriate adaptive audio processor will be able to ingest and play back the adaptive audio soundtrack where applicable, reverting to the standard audio track as needed. The wrapping/encryption component 704 may also provide input directly to the distribution KDM block 710 for generating an appropriate security key for use by the digital cinema server. Other movie elements or files, such as subtitles 714 and images 716, may be wrapped and encrypted along with the audio files 702. In this case, certain processing steps may be included, such as compression 712 in the case of the image files 716.
For content management, the adaptive audio system 100 allows the content creator to create individual audio objects and add information about the content that can be conveyed to the reproduction system. This allows a great deal of flexibility in the content management of the audio. From a content management standpoint, the adaptive audio methods enable several different features. These include changing the language of the content by simply replacing the dialog object, for space savings, download efficiency, geographical playback adaptation, and so on. Film, television, and other entertainment programs are typically distributed internationally. This often requires that the language in the piece of content be changed depending on where it will be reproduced (French for films shown in France, German for TV programs shown in Germany, etc.). Today this often requires a completely independent audio soundtrack to be created, packaged, and distributed. With adaptive audio and its inherent concept of audio objects, the dialog for a piece of content could be an independent audio object. This allows the language of the content to be easily changed without updating or altering other elements of the audio soundtrack, such as the music, effects, and so on. This would apply not only to foreign languages but also to language inappropriate for certain audiences (e.g., children's television shows, airline movies, etc.), targeted advertising, and so on.
Facility and equipment considerations
The adaptive audio file format and associated processors allow for changes in how theater equipment is installed, calibrated, and maintained. With the introduction of many more potential speaker outputs, each individually equalized and balanced, there is a need for intelligent and time-efficient automatic room equalization, which may be performed with the ability to manually adjust any automated room equalization. In one embodiment, the adaptive audio system uses an optimized 1/12th-octave band equalization engine. Up to 64 outputs can be processed to more accurately balance the sound in the theater. The system also allows scheduled monitoring of the individual speaker outputs, from the cinema processor output all the way to the sound reproduced in the auditorium. Local or network alerts can be created to ensure that appropriate action is taken. The flexible rendering system can automatically remove a damaged speaker or amplifier from the playback chain and render around it, allowing the show to go on.
The cinema processor can connect to the digital cinema server with existing 8xAES main audio connections and an Ethernet connection for streaming adaptive audio data. Playback of 7.1 or 5.1 surround content uses the existing PCM connections. The adaptive audio data is streamed over Ethernet to the cinema processor for decoding and rendering, and communication between the server and the cinema processor allows the audio to be identified and synchronized. In the event of any problem with adaptive audio track playback, the sound reverts to Dolby Surround 7.1 or 5.1 PCM audio.
Although the embodiments have been described with respect to 5.1 and 7.1 surround sound systems, it should be noted that many other present and future surround configurations may also be used in conjunction with the embodiments, including 9.1, 11.1, and 13.1 and beyond.
Adaptive audio systems are designed to allow both content creators and exhibitors to decide how sound content is to be presented in different playback speaker configurations. The ideal number of loudspeaker output channels to use will vary depending on the room size. The recommended speaker arrangement is therefore dependent on many factors such as size, composition, seating configuration, environment, average audience size, etc. Example or representative speaker configurations and layouts are provided in this application for illustrative purposes only and are not intended to limit the scope of any claimed embodiments.
The recommended speaker placements for the adaptive audio system remain compatible with existing cinema systems, which is vital so as not to compromise the playback of existing 5.1 and 7.1 channel-based formats. In order to preserve the intent of the adaptive audio sound engineer, and the intent of mixers of 7.1 and 5.1 content, the positions of the existing screen channels should not be altered too radically in an effort to heighten or accentuate the introduction of new speaker positions. In contrast to using all 64 available output channels, the adaptive audio format can be accurately rendered in the cinema to speaker configurations such as 7.1, thus even allowing the format (and its associated benefits) to be used in existing theaters with no change to amplifiers or speakers.
Different speaker locations may have different effectiveness depending on theater design, so there is currently no industry-specified number or arrangement of ideal sound channels. Adaptive audio is intended to be truly adaptable and capable of accurate playback in a variety of auditoriums, whether they have a limited number of playback channels or many channels with a highly flexible configuration.
Fig. 8 is a top view 800 of an example layout of suggested speaker locations for use with an adaptive audio system in a typical auditorium, and fig. 9 is a front view 900 of an example layout of suggested speaker locations at the screen of the auditorium. The reference position referred to hereinafter corresponds to a position on the center line of the screen, 2/3 of the distance back from the screen to the rear wall. Standard screen speakers 801 are shown in their usual positions relative to the screen. Studies of the perception of elevation in the screen plane have shown that additional speakers 804 behind the screen, such as Left Center (Lc) and Right Center (Rc) screen speakers (in the locations of the Left Extra and Right Extra channels of 70 mm film formats), can be beneficial in creating smoother pans across the screen. Such optional speakers are therefore recommended, particularly in auditoriums with screens wider than 12 m (40 ft). All screen speakers should be angled so that they are aimed at the reference position. The recommended placement of the subwoofer 810 behind the screen should remain unchanged, including maintaining an asymmetric cabinet placement relative to the center of the room to prevent excitation of standing waves. Additional subwoofers 816 may be placed at the rear of the theater.
The surround speakers 802 should be individually wired back to the amplifier rack and, where possible, individually amplified with a dedicated channel of power amplification matched to the speaker's power handling per the manufacturer's specifications. Ideally, surround speakers should be specified to handle an increased SPL for each individual speaker, and also with wider frequency response where possible. As a rule of thumb for an average-sized theater, the spacing of the surround speakers should be between 2 and 3 m (6'6" to 9'9"), with the left and right surround speakers placed symmetrically. However, the spacing of surround speakers is most effectively considered as the angle subtended between adjacent speakers as seen from a given listener, rather than as an absolute distance between speakers. For optimal playback throughout the auditorium, the angular distance between adjacent speakers should be 30 degrees or less, referenced from each of the four corners of the main listening area. Good results can be achieved with spacing of up to 50 degrees. For each surround zone, the speakers should maintain equal linear spacing adjacent to the seating area where possible. The linear spacing beyond the listening area (e.g., between the front row and the screen) can be slightly larger. Fig. 11 is an example of the positioning of the top surround speakers 808 and the side surround speakers 806 relative to the reference position, in accordance with one embodiment.
The additional side surround speakers 806 should be mounted closer to the screen than the currently recommended practice of starting approximately one third of the distance back to the rear of the auditorium. These speakers are not used as side surrounds during playback of Dolby Surround 7.1 or 5.1 soundtracks, but will enable smooth transitions and improved timbre matching when panning objects from the screen speakers to the surround zones. To maximize the impression of space, the surround arrays should be placed as low as practical, subject to the following constraints: the vertical placement of the surround speakers at the front of the array should be reasonably close to the height of the acoustic center of the screen speakers, and high enough to maintain good coverage across the seating area according to the directivity of the speakers. The vertical placement of the surround speakers should be such that they form a straight line from front to back and are (typically) inclined upward so that the relative elevation of the surround speakers above the listeners is maintained toward the rear of the cinema as the seating elevation increases, as shown in fig. 10, which is a side view of an example layout of suggested speaker locations for use with an adaptive audio system in a typical auditorium. In practice, this can be achieved most simply by choosing the elevation for the front-most and rear-most side surround speakers and placing the remaining speakers on a line between these points.
In order to provide optimal coverage of the seating area for each speaker, the side surround speakers 806, rear speakers 816, and top surrounds 808 should be aimed toward the reference position in the theater, under defined guidelines regarding spacing, position, angle, and so on.
Embodiments of the adaptive audio cinema system and format achieve improved levels of audience immersion and engagement beyond those of current systems by offering the mixer powerful new authoring tools, and a new cinema processor featuring a flexible rendering engine that optimizes the audio quality and surround effects of the soundtrack to each room's speaker layout and characteristics. In addition, the system maintains backward compatibility and minimizes the impact on current production and distribution workflows.
While embodiments have been described with respect to examples and implementations in a cinema environment in which adaptive audio content is associated with film content for use in a digital cinema processing system, it should be noted that embodiments may also be implemented in non-cinema environments. Adaptive audio content, which includes object-based audio and channel-based audio, can be used in conjunction with any related content (associated audio, video, graphics, etc.), or it can constitute independent audio content. The playback environment may be any suitable listening environment, from a headset or near field monitor to a small or large room, a car, an open air stage, a concert hall, and so forth.
Aspects of system 100 may be implemented in a suitable computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks comprising any desired number of individual machines, including one or more routers (not shown) to buffer and route data sent between the computers. Such a network may be established over a variety of different network protocols and may be the internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof. In one embodiment, where the network comprises the internet, the one or more machines may be configured to access the internet through a web browser program.
One or more of the components, blocks, processes, or other functional components may be implemented by a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed in this application may be described using any number of combinations of hardware, firmware, and/or data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic, or semiconductor storage media.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, it is to be interpreted in the sense of "including, but not limited to". Words using the singular or plural number also include the plural or singular number, respectively. In addition, the words "in this application," "hereinafter," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
Although one or more implementations have been described by way of example and in terms of particular embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.