HK1190554B - Apparatus and method for generating audio output signals using object based metadata - Google Patents

Apparatus and method for generating audio output signals using object based metadata

Info

Publication number
HK1190554B
Authority
HK
Hong Kong
Prior art keywords
audio
objects
signal
downmix
metadata
Prior art date
Application number
HK14103638.6A
Other languages
Chinese (zh)
Other versions
HK1190554A1 (en)
Inventor
Stephan SCHREINER
Wolfgang FIESEL
Matthias Neusinger
Oliver Hellmuth
Ralph Sperschneider
Original Assignee
Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from EP08017734A (priority patent EP2146522A1)
Application filed by Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Publication of HK1190554A1
Publication of HK1190554B


Abstract

An apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects comprises a processor for processing an audio input signal to provide an object representation of the audio input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal. An object manipulator individually manipulates objects using audio object based metadata referring to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer for finally obtaining an audio output signal having one or several channel signals depending on a specific rendering setup.

Description

Apparatus and method for generating audio output signal using object-based metadata
The present application is a divisional application of the application entitled "Apparatus and method for generating an audio output signal using object-based metadata", filed by Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. on January 17, 2011, with application number 200980127935.3.
Technical Field
The present invention relates to audio processing, and in particular to audio processing in terms of audio object coding, such as spatial audio object coding.
Background
In today's broadcast systems, such as television, it is in some cases desirable not to reproduce the audio track as it was designed by the sound engineer, but rather to perform special adjustments that address constraints imposed at presentation time. One well-known technique for controlling such post-production adjustments is to provide appropriate metadata accompanying those audio tracks.
Conventional sound reproduction systems, such as older home television systems, consist of one loudspeaker or a pair of stereo loudspeakers. More advanced multichannel reproduction systems use five or even more loudspeakers.
If a multi-channel reproduction system is considered, the sound engineer can place several mono sources much more flexibly on a two-dimensional plane and can therefore also use a higher dynamic range for the individual tracks, since speech intelligibility is much easier to achieve due to the well-known cocktail party effect.
However, such high-fidelity audio with a large dynamic range may cause problems on conventional reproduction systems. There may be scenarios in which a customer does not want such a highly dynamic signal, because she or he is listening to the content in a noisy environment (e.g., while driving, in an airplane, or with a mobile entertainment system), she or he is wearing a hearing aid, or she or he does not want to disturb her or his neighbors (e.g., late at night).
Furthermore, broadcasters face the problem that different items in a program (e.g., commercials) may be at different loudness levels because of different crest factors, which requires level adjustment of consecutive items.
In a conventional broadcast transmission chain, the end user receives an already-mixed audio track. Any further manipulation on the receiver side can be performed only in a very restricted manner. Currently, a small feature set of Dolby metadata allows the user to modify some characteristics of the audio signal.
Generally, manipulations based on the above-mentioned metadata are applied without any frequency-selective distinction, since the metadata traditionally attached to an audio signal does not provide sufficient information to do so.
Furthermore, only the complete audio stream itself can be manipulated. In addition, there is no way to pick out and separate individual audio objects inside this audio stream. Especially in unsuitable listening environments, this may be unsatisfactory.
In the midnight mode, it is impossible for the existing audio processor to distinguish between ambient noise and dialogue, because no guiding information is available. Thus, in the case of high-level noise (which must be compressed or limited in volume), the dialogue is manipulated in parallel as well. This may impair speech intelligibility.
Increasing the dialogue level relative to the ambient sound helps to improve the perception of speech, especially for hearing-impaired people. Such a technique only works if the audio signal, in addition to carrying property control information, really allows the dialogue to be separated from the ambient components. If only a stereo downmix signal is available, no further separation can be applied to distinguish and manipulate the speech information separately.
Current downmix solutions allow dynamic stereo level adjustment for the center and surround channels. However, there is no real description from the transmitter side of how to downmix the final multi-channel audio source to a loudspeaker configuration other than stereo. Only a default formula in the decoder performs the signal mixing in a very inflexible way.
In all the described scenarios, there generally exist two different approaches. The first approach is to downmix the set of audio objects into a mono, stereo or multi-channel signal when generating the audio signal to be transmitted. This signal, which is to be transmitted to the user via broadcast, any other transmission protocol, or distribution on a computer-readable storage medium, typically has a number of channels that is smaller than the number of original audio objects, which were downmixed by a sound engineer, for example, in a studio environment. Furthermore, metadata may be attached to allow several different modifications, but these modifications can only be applied to the complete transmitted signal or, if the transmitted signal has several different transmitted channels, to individual transmitted channels as a whole. However, since such transmitted channels are always superpositions of several audio objects, an independent manipulation of a specific audio object, while leaving the other audio objects unmodified, is completely impossible.
The other approach is not to perform an object downmix, but to transmit the audio object signals as separate transmitted channels. Such a scheme works well when the number of audio objects is small. When, for example, only five audio objects exist, it is possible to transmit these five different audio objects separately from each other within a 5.1 scheme. Metadata indicating the specific nature of an object/channel can be associated with these channels. Then, on the receiver side, the transmitted channels can be manipulated based on the transmitted metadata.
The drawback of this approach is that it is not backward compatible and only works well in case of small number of audio objects. As the number of audio objects increases, the required bit rate to transmit all objects as separate distinct tracks rises dramatically. This increased bit rate is particularly undesirable in the case of broadcast applications.
Thus, the current bit rate efficient (bitrate efficient) approach does not allow independent operation of distinct audio objects. Such independent operation is only allowed when each object is sent separately. However, this approach is not bit rate efficient and therefore not feasible especially in broadcast scenarios.
It is an object of the present invention to provide a bit rate efficient yet feasible solution to these problems.
According to a first aspect of the present invention, this object is achieved by an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects, said apparatus comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, in which the at least two different audio objects are available as separate audio object signals, and in which the at least two different audio objects can be manipulated independently of each other; an object manipulator for manipulating the audio object signal or a mixed audio object signal of at least one audio object based on audio-object-based metadata referring to the at least one audio object, to obtain a manipulated audio object signal or a manipulated mixed audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object that has been manipulated in a different way than the at least one audio object.
According to a second aspect of the invention, this object is achieved by a method for generating at least one audio output signal representing a superposition of at least two different audio objects, the method comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, in which the at least two different audio objects are available as separate audio object signals, and in which the at least two different audio objects can be manipulated independently of each other; manipulating the audio object signal or a mixed audio object signal of at least one audio object based on audio-object-based metadata referring to the at least one audio object, to obtain a manipulated audio object signal or a manipulated mixed audio object signal for the at least one audio object; and mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object that has been manipulated in a different way than the at least one audio object.
According to a third aspect of the present invention, this object is achieved by an apparatus for generating an encoded audio signal representing a superposition of at least two different audio objects, said apparatus comprising: a data stream formatter for formatting a data stream such that the data stream comprises an object downmix signal representing a combination of the at least two different audio objects and metadata regarding at least one of the different audio objects as side information.
According to a fourth aspect of the present invention, this object is achieved by a method for generating an encoded audio signal representing a superposition of at least two different audio objects, said method comprising: the data stream is formatted such that the data stream comprises an object downmix signal representing a combination of at least two different audio objects and metadata as side information about at least one of the different audio objects.
Further aspects of the invention relate to a computer program for performing the inventive methods, and to a computer-readable storage medium having stored thereon an object downmix signal and, as side information, object parameter data and metadata for one or more audio objects included in the object downmix signal.
The invention is based on the finding that an individual manipulation of separate audio object signals, or of separate sets of mixed audio object signals, allows an individual object-related processing based on object-related metadata. In accordance with the invention, the result of the manipulation is not output directly to a loudspeaker but is provided to an object mixer, which generates output signals for a certain rendering scenario, where the output signals are generated by a superposition of at least one manipulated object signal, or a set of mixed object signals, together with other manipulated object signals and/or unmodified object signals. Naturally, it is not necessary to manipulate each object; in some cases it may be sufficient to manipulate only one object and not to manipulate a further object of the plurality of audio objects. The result of the object mixing operation is one or more audio output signals based on the manipulated objects. Depending on the specific application scenario, these audio output signals can be transmitted to loudspeakers, stored for further use, or even transmitted to a further receiver.
Preferably, the signal input into the inventive manipulation/mixing apparatus is a downmix signal generated by downmixing a plurality of audio object signals. The downmix operation may be metadata-controlled for each object individually, or may be uncontrolled, i.e., the same for each object. In the former case, the manipulation of an object in accordance with the metadata is an object-controlled, individual and object-specific upmix operation in which a loudspeaker component signal representing this object is generated. Preferably, spatial object parameters are also provided, which allow the reconstruction of an approximated version of the original signals using the transmitted object downmix signal. Then, the processor for processing the audio input signal to provide an object representation of the audio input signal operates on the basis of the parametric data in order to calculate reconstructed versions of the original audio objects, and these approximated object signals can then be individually manipulated by object-based metadata.
Preferably, object rendering information is also provided, where the object rendering information includes information on the intended audio reproduction setup and information on the placement of the individual audio objects within the reproduction scene. Certain embodiments, however, can also operate without such object location data. These configurations are, for example, the provision of stationary object positions, which may be fixedly set or negotiated between the transmitter and the receiver for a complete audio track.
Drawings
Preferred embodiments of the present invention are discussed below in conjunction with the attached drawing figures, wherein:
FIG. 1 illustrates a preferred embodiment of an apparatus for generating at least one audio output signal;
FIG. 2 illustrates a preferred embodiment of the processor of FIG. 1;
FIG. 3a illustrates a preferred embodiment for manipulating object signals;
FIG. 3b shows a preferred embodiment of the object mixer in the manipulator as shown in FIG. 3 a;
FIG. 4 illustrates a processor/operator/object mixer configuration in the case of operations after such downmixing of objects but before final object mixing;
FIG. 5a shows a preferred embodiment of an apparatus for generating an encoded audio signal;
FIG. 5b shows a transmission signal with object downmix, object based metadata, and several spatial object parameters;
FIG. 6 shows a map indicating several audio objects identified by a certain ID, having an object audio file, and a joint audio object information matrix E;
FIG. 7 shows an explanation of an object covariance matrix E of FIG. 6;
FIG. 8 shows a downmix matrix and an audio object encoder controlled by the downmix matrix D;
FIG. 9 shows a target presentation matrix A, which is typically provided by a user, and one example of a scene for a particular target presentation;
FIG. 10 illustrates a preferred embodiment of an apparatus for generating at least one audio output signal according to a further aspect of the present invention;
FIG. 11a shows a further embodiment;
FIG. 11b shows yet a further embodiment;
FIG. 11c shows a further embodiment;
FIG. 12a illustrates an exemplary application scenario; and
FIG. 12b shows a further exemplary application scenario.
Detailed Description
To address the above-mentioned problems, a preferred approach is to provide appropriate metadata with those tracks. Such metadata may consist of information to control the following three factors (three "classical" D):
dialogue volume normalization ("dialnorm")
Dynamic Range control (dynamic range control)
Downmix (downmix)
Such audio metadata helps the receiver to manipulate the received audio signal based on adjustments performed by the listener. To distinguish this kind of audio metadata from other metadata (e.g., descriptive metadata such as author, title, etc.), it is usually called "Dolby metadata" (because up to now it is only implemented by Dolby systems). In the following, only this kind of audio metadata is considered, and it is simply called metadata.
Audio metadata is additional control information carried with the audio program and has data about this audio that is necessary for the receiver. Metadata provides a number of important functions including dynamic range control for imperfect listening environments, level matching between programs, downmix information for multi-channel audio reproduction via fewer speaker channels, and other information.
Metadata provides the tools required to reproduce audio programs accurately and artistically in many different listening situations, from full-scale home theaters to in-flight entertainment, regardless of the number of speaker channels, the quality of the recording and playback equipment, or the relative ambient noise level.
While an engineer or content producer takes great care to provide the highest quality audio possible in their program, she or he has no control over the wide variety of consumer electronics or listening environments that will attempt to reproduce the original soundtrack. Metadata gives the engineer or content producer greater control over how their work is reproduced and enjoyed in almost every imaginable listening environment.
Dolby metadata is a special format to provide information to control the three factors mentioned.
The three most important functions of dolby metadata are:
Dialogue normalization, to achieve a consistent long-term average level of dialogue within a presentation, which is frequently composed of different program types such as feature films, commercials, and so on (a small normalization sketch follows below).
Dynamic range control to satisfy most audiences with pleasing audio compression, while at the same time allowing each individual customer to control the dynamics of the audio signal and adjust the compression to suit her or his personal listening environment.
Downmix to map the sound of a multi-channel audio signal to two or one channel in case no multi-channel audio recording and playback equipment is available.
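As a rough illustration of the first of these functions, the following minimal sketch applies a dialogue normalization gain so that items with different measured dialogue levels play back at a common target level. The function name, the -31 dBFS default target and the example levels are assumptions for illustration only, not values taken from this text.

```python
import numpy as np

def normalize_dialog_level(signal, measured_dialog_level_db, target_level_db=-31.0):
    """Scale one program item so its long-term dialogue level hits a common
    playback target.  measured_dialog_level_db is the transmitted metadata;
    the -31 dBFS default target is only an illustrative assumption."""
    gain_db = target_level_db - measured_dialog_level_db
    return signal * 10.0 ** (gain_db / 20.0)

rng = np.random.default_rng(0)
movie = 0.05 * rng.standard_normal(48000)       # quiet item, dialogue around -27 dBFS
commercial = 0.5 * rng.standard_normal(48000)   # loud item, dialogue around -7 dBFS

# After normalization both items play back at the same dialogue level.
aligned_movie = normalize_dialog_level(movie, measured_dialog_level_db=-27.0)
aligned_commercial = normalize_dialog_level(commercial, measured_dialog_level_db=-7.0)
```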
Dolby metadata is used along with Dolby Digital (AC-3) and Dolby E. The Dolby E audio metadata format is described in [16]. Dolby Digital (AC-3) is intended for the delivery of audio into the home via digital television broadcast (high or standard definition), DVD or other media.
Dolby Digital can carry anything from a single channel of audio up to a full 5.1-channel program, including metadata. In both digital television and DVD, it is commonly used for stereo as well as for full 5.1 discrete audio programs.
Dolby E is specifically intended for the distribution of multi-channel audio within professional production and distribution environments. Dolby E is the preferred method for distributing multi-channel/multi-program audio with video at any point prior to delivery to the consumer. Dolby E can carry up to eight discrete audio channels (including separate metadata for each) configured into any number of individual program configurations within an existing two-channel digital audio infrastructure. Unlike Dolby Digital, Dolby E can handle many encode/decode generations and is synchronous with the video frame rate. Like Dolby Digital, Dolby E carries metadata for each individual audio program encoded within the data stream. The use of Dolby E allows the resulting audio data stream to be decoded, modified and re-encoded without audible degradation. Because the Dolby E stream is synchronous with the video frame rate, it can be routed, switched and edited in a professional broadcast environment.
In addition, several mechanisms are provided in MPEG AAC to perform dynamic range control and to control downmix generation.
In order to handle source material with variable peak levels, average levels and dynamic range in a way that minimizes the variability for the consumer, it is necessary to control the reproduced level such that, for example, the dialogue level or the average music level is set to a level the consumer controls at reproduction, regardless of how the program was originated. Furthermore, not all consumers can listen to these programs in a good (i.e., low-noise) environment, with no constraint on how loud they make the sound. A car environment, for example, has a high level of ambient noise, so the listener can be expected to want to reduce the range of levels that would otherwise be reproduced.
For both of these reasons, dynamic range control has to be available within the specification of AAC. To achieve this, the reduced-bit-rate audio must be accompanied by data used to set and control the dynamic range of the program items. This control has to be specified relative to a reference level and with respect to the important program elements, e.g., the dialogue.
The dynamic range control is characterized as follows:
1. Dynamic Range Control (DRC) is entirely optional. Therefore, apart from having the correct syntax, there is no change in complexity for those not wishing to invoke DRC.
2. The reduced bit rate audio data is transmitted with the full dynamic range of the source material, with the support data assisting the dynamic range control.
3. Dynamic range control data may be sent out every frame to minimize the delay in setting the playback gain.
4. Dynamic range control data is transmitted using the "fill _ element" feature of AAC.
5. The reference level is designated as full scale.
6. The program reference level is transmitted to permit level parity between the playback levels of different sources, and it provides a reference about which dynamic range control may be applied. It is the feature of the source signal most relevant to the subjective impression of the loudness of a program, such as the level of the dialogue content or the average level of a music program.
7. The program reference level represents the level of the program that may be reproduced at a set level relative to the reference level in consumer hardware in order to achieve playback level parity. Relative to this, the quieter portions of the program may be boosted in level and the louder portions may be reduced in level.
8. The program reference level is specified in the range of 0 to-31.75 dB relative to the reference level.
9. The program reference level is coded in a 7-bit field with a 0.25 dB step size.
10. The dynamic range control is specified in the range of ±31.75 dB.
11. The dynamic range control is coded in an 8-bit field (1 sign bit, 7 magnitude bits) with a 0.25 dB step size.
12. The dynamic range control may be applied to all spectral coefficients or bands of the audio channel as a whole, or the coefficients may be split into different scale factor bands, each of which is controlled by a separate set of dynamic range control data.
13. The dynamic range control may be applied to all channels (of a stereo or multi-channel bitstream) as a whole, or the channels may be split into sets, each controlled separately by its own set of dynamic range control data.
14. If an expected dynamic range control data set is missing, the most recently received valid values should be used.
15. Not all elements of the dynamic range control data need to be sent every time. For example, the program reference level may on average be sent only once every 200 milliseconds.
16. Error detection/protection is provided by the transport layer when needed.
17. The user should be provided with a means to scale the amount of the dynamic range control, present in the bitstream, that is applied to the signal level (a decoding sketch of these fields follows this list).
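The following minimal sketch illustrates how the quantized fields described in items 8 to 11 and the user scaling of item 17 might be decoded and applied. The bit layouts, the sign convention and the helper names are assumptions made for illustration; only the 0.25 dB step size and the value ranges come from the list above.

```python
import numpy as np

def decode_program_reference_level(field_7bit: int) -> float:
    """Items 8-9: a 7-bit code with 0.25 dB steps, spanning 0 to -31.75 dB
    relative to the full-scale reference level (bit layout assumed)."""
    assert 0 <= field_7bit < 128
    return -0.25 * field_7bit                  # dB relative to full scale

def decode_drc_gain(field_8bit: int) -> float:
    """Items 10-11: 1 sign bit plus 7 magnitude bits with 0.25 dB steps,
    covering +/-31.75 dB (the sign convention is assumed here)."""
    assert 0 <= field_8bit < 256
    sign = -1.0 if field_8bit & 0x80 else 1.0
    return sign * 0.25 * (field_8bit & 0x7F)   # dB

def apply_drc(frame: np.ndarray, drc_gain_db: float, user_scale: float = 1.0) -> np.ndarray:
    """Item 17: the listener scales how much of the transmitted dynamic
    range control is actually applied (user_scale between 0 and 1)."""
    effective_db = user_scale * drc_gain_db
    return frame * 10.0 ** (effective_db / 20.0)

frame = np.zeros(1024)                          # one frame of decoded audio samples
boosted = apply_drc(frame, decode_drc_gain(0x0C), user_scale=0.5)  # +3 dB gain, applied at 50%
```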
In addition to the possibility of sending separate mono or stereo downmix channels in a 5.1 channel transmission, AAC also allows automatic downmix generation from 5 channel soundtracks. In this case, the LFE channel should be ignored.
The matrix downmix method may be controlled by the editor of a soundtrack with a small set of parameters defining the amount of the rear channels added to the downmix.
The matrix downmix method applies only to downmixing a 5-channel program with a 3-front/2-rear speaker configuration to a stereo or mono program. It is not applicable to any program other than the 3/2 configuration.
In MPEG, several approaches are provided to control the audio presentation at the receiver side.
A generic technology is provided by scene description languages, such as BIFS and LASeR. Both technologies are used to render audiovisual elements from separately coded objects into a playback scene.
BIFS is standardized in [5] and LASeR in [6].
MPEG-D mainly deals with parametric (i.e., metadata) descriptions
to generate multi-channel audio based on downmixed audio representations (MPEG Surround); and
to generate MPEG Surround parameters based on audio objects (MPEG Spatial Audio Object Coding).
MPEG Surround uses inter-channel differences in level, phase and coherence, equivalent to ILD, ITD and IC cues, to capture the spatial image of a multi-channel audio signal relative to a transmitted downmix signal, and encodes these cues in a very compact form so that the cues and the transmitted signal can be decoded to synthesize a high-quality multi-channel representation. The MPEG Surround encoder receives a multi-channel audio signal, where N is the number of input channels (e.g., 5.1). A key aspect of the encoding process is that a downmix signal, xt1 and xt2, which is typically stereo (but could also be mono), is derived from the multi-channel input signal, and it is this downmix signal that is compressed for transmission over the channel rather than the multi-channel signal. The encoder may be able to exploit the downmix process to advantage, so that it creates a faithful equivalent of the multi-channel signal in the mono or stereo downmix and also creates the best possible multi-channel decoding based on the downmix and the encoded spatial cues. Alternatively, the downmix can be supplied externally. The MPEG Surround encoding process is agnostic to the compression algorithm used for the transmitted channels; it could be any of a number of high-performance compression algorithms such as MPEG-1 Layer III, MPEG-4 AAC or MPEG-4 High Efficiency AAC, or it could even be PCM.
MPEG Surround technology supports very efficient parametric coding of multi-channel audio signals. The idea of MPEG SAOC is to apply similar basic assumptions, together with a similar parameter representation, for very efficient parametric coding of individual audio objects (tracks). Additionally, a rendering functionality is included to interactively render the audio objects into an acoustic scene for several types of reproduction systems (1.0, 2.0, 5.0, ... for loudspeakers, or binaural for headphones). SAOC is designed to transmit a number of audio objects in a joint mono or stereo downmix signal in order to later allow a reproduction of the individual objects in an interactively rendered audio scene. For this purpose, SAOC encodes object level differences (OLD), inter-object cross coherences (IOC) and downmix channel level differences (DCLD) into a parameter bitstream (a small sketch of these parameters follows after the application examples below). The SAOC decoder converts this SAOC parametric representation into an MPEG Surround parametric representation, which is then decoded together with the downmix signal by an MPEG Surround decoder to produce the desired audio scene. The user interactively controls this process in order to alter the representation of the audio objects in the resulting audio scene. Among the numerous conceivable applications for SAOC, a few typical scenarios are listed below.
Consumers can create personal interactive remixes using a virtual mixing desk. Certain instruments can, for example, be attenuated for playing along (like karaoke), the original mix can be modified to suit personal taste, the dialogue level in movies/broadcasts can be adjusted for better speech intelligibility, and so on.
For interactive gaming, SAOC is a storage- and computation-efficient way of reproducing soundtracks. Movement around in the virtual scene is reflected by an adaptation of the object rendering parameters. Networked multi-player games benefit from the transmission efficiency of using one SAOC stream to represent all the sound objects that are external to a certain player's terminal.
In the context of such applications, the term "audio object" also covers a "stem" as known in sound production scenarios. In particular, stems are the individual components of a mix, saved separately (usually to disc) for use in a remix. Related stems are typically bounced from the same original location. Examples could be a drum stem (including all related drum instruments in the mix), a vocal stem (including only the vocal tracks), or a rhythm stem (including all rhythm-related instruments such as drums, guitar, keyboard).
Current telecommunication infrastructure is monophonic and can be extended in its functionality. Terminals equipped with a SAOC extension pick up several sound sources (objects) and produce a mono downmix signal, which is transmitted in a compatible way by using the existing (speech) coders. The side information can be conveyed in an embedded, backward-compatible way. Legacy terminals will continue to produce monaural output, while SAOC-enabled terminals can render the acoustic scene and thus increase intelligibility by spatially separating the different talkers ("cocktail party effect").
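As a rough sketch of the SAOC parameters named above (OLD, IOC and DCLD), the following fragment computes simplified versions of them for one subband/time tile from the object subband signals. The normalization conventions and function names are assumptions for illustration, not the normative SAOC definitions.

```python
import numpy as np

def saoc_parameters(objects, downmix_gains, eps=1e-12):
    """objects: (num_objects, num_samples) subband signals for one time tile.
    downmix_gains: (2, num_objects) stereo downmix coefficients per object.
    Returns simplified OLD, IOC and DCLD values (conventions assumed)."""
    powers = np.sum(objects ** 2, axis=1)                      # per-object energy
    old = powers / np.max(powers)                              # object level differences
    ioc = np.corrcoef(objects)                                 # inter-object cross coherence
    dcld = 20.0 * np.log10((np.abs(downmix_gains[0]) + eps) /
                           (np.abs(downmix_gains[1]) + eps))   # per-object downmix channel level difference
    return old, ioc, dcld

rng = np.random.default_rng(3)
objects = rng.standard_normal((3, 2048))                       # three object subband signals
D = np.array([[1.0, 0.7, 0.0],
              [0.0, 0.7, 1.0]])                                # stereo downmix gains per object
old, ioc, dcld = saoc_parameters(objects, D)
```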
The following paragraphs describe an overview of a practically available dolby audio metadata application:
midnight mode
As mentioned in paragraph [ ], there may be scenarios in which the listener may not want a signal with high dynamics. Therefore, she or he may activate the so-called "midnight mode" of her or his receiver. A compressor is then applied to the total audio signal. In order to control the parameters of this compressor, the transmitted metadata is evaluated and applied to the total audio signal.
Clean audio
Another scenario involves hearing-impaired people, who do not want to have high-dynamic ambient noise but who want to have a very clean signal containing the dialogue ("clean audio"). Metadata may also be used to enable this mode.
A solution proposed so far is defined in [15], Annex E. The balance between the stereo main signal and the additional mono dialogue description channel is handled here by an individual set of level parameters. The proposed solution, which is based on a separate syntax, is called a supplementary audio service in DVB.
Downmix
There are several separate metadata parameters that govern the L/R downmix. Certain metadata parameters allow the engineer to select how the stereo downmix is constructed and which analog downmix signal is preferred. Here, the center and surround downmix levels define the final mixing balance of the downmix signal for every decoder.
Fig. 1 shows an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects according to a preferred embodiment of the present invention. The apparatus of fig. 1 comprises a processor 10 for processing an audio input signal 11 to provide an object representation 12 of this audio input signal, wherein at least two different audio objects are separated from each other, wherein at least two different audio objects may be treated as separate audio object signals, and wherein at least two different audio objects may be operated independently of each other.
A manipulation of the object representation is performed in an object manipulator 13, which manipulates the audio object signal, or a mixed representation of the audio object signal, of at least one audio object based on audio-object-based metadata 14 referring to the at least one audio object. The object manipulator 13 is adapted to obtain a manipulated audio object signal, or a manipulated mixed audio object signal, 15 for the at least one audio object.
The signal generated by the object manipulator is input into an object mixer 16 for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a differently manipulated audio object, which has been manipulated in a different way than the at least one audio object. The result of the object mixer comprises one or more audio output signals 17a, 17b, 17c. Preferably, the one or more output signals 17a to 17c are designed for a specific rendering setup, such as a mono rendering setup, a stereo rendering setup, or a multi-channel rendering setup comprising three or more channels, e.g. a surround setup requiring at least five or at least seven different audio output signals.
Fig. 2 shows a preferred implementation of the processor 10 for processing the audio input signal. Preferably, the audio input signal 11 is implemented as an object downmix 11, as obtained by an object downmixer 101a of fig. 5a, which is described later. In this case, the processor additionally receives object parameters 18, as generated, for example, by the object parameter calculator 101b of fig. 5a, also described later. The processor 10 is then in a position to calculate separate object representations 12. The number of object representations 12 may be higher than the number of channels in the object downmix 11. The object downmix 11 may comprise a mono downmix, a stereo downmix, or even a downmix having more than two channels. However, the processor 10 is operative to generate more object representations 12 than the number of individual signals in the object downmix 11. Due to the parametric processing performed by the processor 10, the audio object signals are not true reproductions of the original audio objects that were present before the object downmix 11 was performed; instead, the audio object signals are approximated versions of the original audio objects, where the accuracy of the approximation depends on the kind of separation algorithm performed in the processor 10 and, of course, on the accuracy of the transmitted parameters. Preferred object parameters are those known from spatial audio object coding, and a preferred reconstruction algorithm for generating the individually separated audio object signals is the reconstruction algorithm performed in accordance with this spatial audio object coding standard. Preferred embodiments of the processor 10 and of the object parameters are discussed later in the context of figs. 6 to 9.
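A minimal sketch of what such a parametric separation in the processor 10 could look like is given below. It assumes a standard least-squares (Wiener-style) estimator built from the object covariance matrix E and the downmix matrix D; this specific estimator is a common choice used here for illustration and is not necessarily the exact algorithm of the embodiment.

```python
import numpy as np

def reconstruct_objects(x, D, E, eps=1e-9):
    """Approximate the original objects from the object downmix.

    x: (num_downmix_channels, num_samples) downmix signals for one tile.
    D: (num_downmix_channels, num_objects) downmix matrix.
    E: (num_objects, num_objects) object covariance/energy matrix.
    Returns s_hat: (num_objects, num_samples) approximated object signals.
    """
    # Least-squares / Wiener-style upmix; the small eps regularizes the
    # inversion when objects overlap strongly in time/frequency.
    G = E @ D.T @ np.linalg.inv(D @ E @ D.T + eps * np.eye(D.shape[0]))
    return G @ x

D = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])                     # stereo downmix of three objects
s = np.random.default_rng(4).standard_normal((3, 1024))
E = s @ s.T                                         # transmitted per-tile object parameters
s_hat = reconstruct_objects(D @ s, D, E)            # approximated object representations
```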
Fig. 3a and 3b together illustrate an embodiment in which the object manipulation is performed before the object downmix to the reproduction setup, while fig. 4 illustrates a further embodiment in which the object downmix is performed before the manipulation, and the manipulation is performed before the final object mixing operation. The result of the procedure in fig. 3a, 3b is the same as in fig. 4, but the object manipulation is performed at a different level in the processing architecture. When the manipulation of audio object signals is an issue in the context of efficiency and computational resources, the embodiment of fig. 3a/3b is preferred, since the audio object manipulation has to be performed on only a single audio signal rather than a plurality of audio signals as in fig. 4. In a different implementation, in which there might be a requirement that the object downmix has to be done using unmodified object signals, the configuration of fig. 4 is preferred; there, the manipulation follows the object downmix but is performed before the final object mixing, in order to obtain the output signals for, for example, the left channel L, the center channel C or the right channel R.
Fig. 3a illustrates the situation in which the processor 10 of fig. 2 outputs separate audio object signals. At least one audio object signal, such as the signal for object 1, is manipulated in a manipulator 13a based on metadata for this object 1. Depending on the implementation, other objects, such as object 2, are manipulated as well by a manipulator 13b. Naturally, a situation can also arise in which there actually is an object, such as object 3, which is not manipulated but which is nevertheless generated by the object separation. In the example of fig. 3a, the result of the manipulation is two manipulated object signals and one non-manipulated signal.
These results are input to the object mixer 16, which comprises a first mixer stage implemented with object downmixers 19a, 19b and 19c, and which further comprises a second object mixer stage implemented with devices 16a, 16b and 16 c.
The first stage of the object mixer 16 comprises an object downmixer for each output of fig. 3a, such as object downmixer 19a for output 1 of fig. 3a, object downmixer 19b for output 2 of fig. 3a, and object downmixer 19c for output 3 of fig. 3a. The purpose of the object downmixers 19a to 19c is to "distribute" each object to the output channels. Therefore, each object downmixer 19a, 19b, 19c has an output for a left component signal L, a center component signal C and a right component signal R. If, for example, object 1 were the single object, downmixer 19a would be a straightforward downmixer and the output of block 19a would be the same as the final output L, C, R indicated at 17a, 17b, 17c. The object downmixers 19a to 19c preferably receive rendering information indicated at 30, which may describe the rendering setup, i.e., as in the embodiment of fig. 3b, that only three output loudspeakers exist. These outputs are a left loudspeaker L, a center loudspeaker C and a right loudspeaker R. If, for example, the rendering setup or reproduction setup comprises a 5.1 scheme, then each object downmixer would have six output channels, and there would exist six adders so that a final output signal for the left channel, a final output signal for the right channel, a final output signal for the center channel, a final output signal for the left surround channel, a final output signal for the right surround channel, and a final output signal for the low-frequency enhancement (subwoofer) channel would be obtained.
Specifically, the adders 16a, 16b, 16c are adapted to combine the component signals for the respective channel, which were generated by the corresponding object downmixers. This combination is preferably a straightforward sample-by-sample addition, but, depending on the implementation, weighting factors can be applied as well. Furthermore, the functions in fig. 3a, 3b can be performed in the frequency domain or in a subband domain, so that elements 19a to 19c operate in this frequency domain, and there would be some kind of frequency/time conversion before the signals are actually output to the loudspeakers in a reproduction setup.
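A minimal sketch of this two-stage object mixer, assuming placeholder placement gains for a three-loudspeaker (L, C, R) setup, is given below: the first stage distributes each manipulated object to per-channel component signals, and the second stage adds those components sample by sample.

```python
import numpy as np

def render_objects_to_lcr(objects, placement_gains):
    """objects: (num_objects, num_samples) manipulated object signals.
    placement_gains: (3, num_objects) per-object gains for L, C, R,
    derived from the rendering information 30 (values are placeholders).
    Stage one distributes every object to the three component signals,
    stage two is a plain sample-by-sample addition per output channel."""
    components = placement_gains[:, :, None] * objects[None, :, :]   # (3, num_objects, samples)
    return components.sum(axis=1)                                    # (3, num_samples): L, C, R

objects = np.zeros((3, 1024))                 # three manipulated object signals
gains = np.array([[1.0, 0.0, 0.5],            # L: object 1 fully, object 3 half
                  [0.0, 0.0, 0.0],            # C: nothing in this example
                  [0.0, 1.0, 0.5]])           # R: object 2 fully, object 3 half
left, center, right = render_objects_to_lcr(objects, gains)
```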
Fig. 4 shows an alternative implementation in which elements 19a, 19b, 19c, 16a, 16b, 16c function similarly to the embodiment of fig. 3b. Importantly, however, the manipulation that occurred before the object downmix 19a in fig. 3a now occurs after the object downmix 19a. Thus, the metadata-controlled, object-specific manipulation of the individual objects is done in the downmix domain, i.e., before the actual addition of the manipulated component signals. When fig. 4 is compared to fig. 1, it becomes clear that the object downmixers such as 19a, 19b, 19c will be implemented within the processor 10, and the object mixer 16 will comprise the adders 16a, 16b, 16c. When fig. 4 is implemented and the object downmixers are part of the processor, the processor will receive, in addition to the object parameters 18 of fig. 1, the rendering information 30, i.e., information on the position of each audio object, information on the rendering setup, and additional information as the case may be.
Furthermore, the manipulation can include the downmix operation implemented by blocks 19a, 19b, 19c. In this embodiment, the manipulator includes these blocks, and additional manipulations can take place, but are not required in any case.
Fig. 5a shows an embodiment on the encoder side, which can generate a data stream as schematically shown in fig. 5b. Specifically, fig. 5a shows an apparatus for generating an encoded audio signal 50 representing a superposition of at least two different audio objects. Basically, the apparatus of fig. 5a shows a data stream formatter 51 for formatting a data stream 50 such that the data stream comprises an object downmix signal 52 representing a combination, such as a weighted or unweighted combination, of the at least two audio objects. Furthermore, the data stream 50 comprises, as side information, object-related metadata 53 referring to at least one of the different audio objects. Preferably, the data stream also comprises parametric data 54, which are time- and frequency-selective and which allow a high-quality separation of the object downmix signal into several audio objects; this operation is also termed an object upmix operation and is performed by the processor 10 shown in fig. 1, as discussed earlier.
The object downmix signal 52 is preferably generated by an object downmixer 101a. The parametric data 54 are preferably generated by an object parameter calculator 101b, and the object-selective metadata 53 are generated by an object-selective metadata provider 55. The object-selective metadata provider may be an input for receiving metadata as generated by a music producer in a sound studio, or may receive data as generated by an object-related analysis, which could be applied subsequent to the object separation. Specifically, the object-selective metadata provider could be implemented to analyze an object's output from the processor 10 in order to, for example, find out whether an object is a speech object, a sound object or an ambient sound object. Thus, a speech object could be analyzed by some of the well-known speech detection algorithms known from speech coding, and the object-selective analysis could also be implemented to find out sound objects stemming from instruments. Such sound objects have a highly tonal nature and can therefore be distinguished from speech objects or ambient sound objects. Ambient sound objects have a rather noisy nature reflecting the background sound that typically exists in, for example, cinema movies, where the background noise may be traffic sound or any other stationary noisy signal, or a non-stationary signal with a broadband spectrum, such as is generated when, for example, a shooting scene takes place in a movie.
Based on this analysis, one could amplify the speech object and attenuate the other objects in order to emphasize the speech, since this is useful for hearing-impaired or elderly people to better understand the movie. As stated before, other implementations include the provision of object-specific metadata, such as an object identifier and object-related data, by the sound engineer who generates the actual object downmix signal on a CD or a DVD, such as a stereo downmix or a surround-sound downmix.
Fig. 5b shows an exemplary data stream 50, which has, as main information, the mono, stereo or multi-channel object downmix and which has, as side information, the object parameters 54 and the object-based metadata 53; the latter are stationary in the case of only identifying objects as speech or ambience, or are time-varying in the case of providing level data as object-based metadata, as required by a midnight mode. Preferably, however, the object-based metadata are not provided in a frequency-selective way, in order to save data rate.
Fig. 6 illustrates an embodiment of an audio object map showing a number of N objects. In the exemplary explanation of fig. 6, each object has an object ID, a corresponding object audio file and, importantly, audio object parameter information, which is preferably information relating to the energy of the audio object and to the inter-object correlation of the audio object. This audio object parameter information includes an object covariance matrix E for each subband and for each time block.
An example of such an object audio parameter information matrix E is illustrated in fig. 7. The diagonal elements e_ii include power or energy information of the audio object i in the corresponding subband and the corresponding time block. To this end, the subband signal representing a certain audio object i is input into a power or energy calculator, which may, for example, perform an autocorrelation function (acf) to obtain the value e_11 with or without some normalization. Alternatively, the energy can be calculated as the sum of the squares of the signal over a certain length (i.e., the vector product ss*). The acf can, in some sense, describe the spectral distribution of the energy but, due to the fact that a T/F transform for frequency selection is preferably used anyway, the energy calculation can be performed without an acf for each subband separately. Thus, the main diagonal elements of the object audio parameter matrix E indicate a measure for the power or energy of an audio object in a certain subband and a certain time block.
On the other hand, the off-diagonal elements e_ij indicate a respective correlation measure between the audio objects i and j in the corresponding subband and time block. It is clear from fig. 7 that the matrix E is, for real-valued entries, symmetric with respect to the main diagonal. Generally, this matrix is a Hermitian matrix. The correlation measure element e_ij can be calculated, for example, by a cross-correlation of the two subband signals of the respective audio objects, so that a cross-correlation measure is obtained, which may or may not be normalized. Other correlation measures can be used that are not calculated using a cross-correlation operation but are calculated by other ways of determining the correlation between two signals. For practical reasons, all elements of matrix E are normalized so that they have magnitudes between 0 and 1, where 1 indicates maximum power or maximum correlation, 0 indicates minimum power (zero power), and -1 indicates minimum correlation (out of phase).
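A small sketch of how such an object parameter matrix E could be computed for one subband and one time block is shown below; the optional normalization is a simplified assumption and not necessarily the exact rule used in the standard.

```python
import numpy as np

def object_parameter_matrix(subband_objects, normalize=False):
    """subband_objects: (num_objects, num_samples) subband signals of the
    audio objects for one time block.  The diagonal e_ii carries the energy
    of object i, the off-diagonal e_ij a cross-correlation measure between
    objects i and j.  With normalize=True the entries are scaled into
    [-1, 1]; the exact normalization rule is an assumption here."""
    E = subband_objects @ subband_objects.T        # energies and cross products
    if normalize:
        scale = np.sqrt(np.outer(np.diag(E), np.diag(E)))
        scale[scale == 0.0] = 1.0
        E = E / scale
    return E

E = object_parameter_matrix(np.random.default_rng(5).standard_normal((4, 1024)))
```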
A downmix matrix D of size K × N, where K > 1, determines the K-channel downmix signal in the form of a matrix with K rows through the matrix operation
X = D S    (2)
Fig. 8 illustrates an example of a downmix matrix D having downmix matrix elements d_ij. Such an element d_ij indicates whether a portion of object j, or the whole of object j, is included in the object downmix signal i. When, for example, d_12 is equal to zero, this means that object downmix signal 1 does not include object 2. On the other hand, a value of d_23 equal to 1 indicates that object 3 is fully included in object downmix signal 2.
Values of the downmix matrix elements between 0 and 1 are possible. Specifically, a value of 0.5 indicates that a certain object is included in a downmix signal, but only with half of its energy. Thus, when an audio object, such as object 4, is equally distributed to both downmix signal channels, d_24 and d_14 would be equal to 0.5. This way of downmixing is an energy-conserving downmix operation, which is preferred in some situations. Alternatively, however, a non-energy-conserving downmix can be used, in which the whole audio object is introduced into the left downmix channel and into the right downmix channel, so that the energy of this audio object is doubled with respect to the other audio objects within the downmix signal.
In the lower portion of fig. 8, a schematic diagram of the object encoder 101 of fig. 1 is given. Specifically, the object encoder 101 includes two different parts 101a and 101b. Part 101a is a downmixer, which preferably performs a weighted linear combination of the audio objects 1, 2, ..., N, and the second part of the object encoder 101 is an audio object parameter calculator 101b, which calculates audio object parameter information, such as matrix E, for each time block or subband in order to provide the audio energy and correlation information, which is parametric information and can therefore be transmitted at a low bit rate or stored consuming a small amount of memory resources.
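The following minimal sketch illustrates the two encoder parts just described for a toy example: the downmixer applies the downmix matrix D to the object signals according to X = D S (eq. 2), and the parameter calculator derives a matrix E for the same tile. The matrix values reproduce the examples discussed above (d_12 = 0, d_23 = 1, d_14 = d_24 = 0.5); everything else is illustrative.

```python
import numpy as np

num_objects, num_samples = 4, 1024
S = np.random.default_rng(1).standard_normal((num_objects, num_samples))  # objects 1..4

# Downmix matrix D (K=2 downmix channels, N=4 objects), matching the examples above:
# object 2 is absent from downmix channel 1 (d_12 = 0), object 3 is fully contained
# in downmix channel 2 (d_23 = 1), object 4 is split equally (d_14 = d_24 = 0.5).
D = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 1.0, 0.5]])

X = D @ S          # object downmix signal, eq. (2): X = D S  (part 101a)
E = S @ S.T        # object parameter information for this tile (part 101b)
```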
A user-controlled object rendering matrix A of size M × N determines the M-channel target rendering of the audio objects in the form of a matrix with M rows through the matrix operation
Y = A S    (3)
Since the focus of the following derivation is on a stereo rendering, it will be assumed that M = 2. Given an initial rendering matrix for more than two channels and a downmix rule that maps those several channels to two channels, it is obvious for a person skilled in the art to derive the corresponding rendering matrix A of size 2 × N for stereo rendering. It will also be assumed, for simplicity, that K = 2, so that the object downmix is also a stereo signal. The case of a stereo object downmix is, moreover, the most important special case in terms of application scenarios.
Fig. 9 shows a detailed explanation of the target rendering matrix A. Depending on the application, the target rendering matrix A can be provided by the user. The user has complete freedom to indicate where an audio object should be located virtually for a replay setup. The strength of the audio object concept is that the downmix information and the audio object parameter information are completely independent of a specific localization of the audio objects. This localization of the audio objects is provided by the user in the form of target rendering information. Preferably, the target rendering information is implemented as a target rendering matrix A, which may be in the form given in fig. 9. Specifically, the rendering matrix A has M rows and N columns, where M is equal to the number of channels in the rendered output signal and where N is equal to the number of audio objects. M equals two in the preferred stereo rendering scenario, but if an M-channel rendering is performed, the matrix A has M rows.
Specifically, a matrix element a_ij indicates whether a portion or the whole of object j is to be rendered in the specific output channel i. The lower portion of fig. 9 gives a simple example of the target rendering matrix for a scenario in which there are six audio objects AO1 to AO6, where only the first five audio objects are to be rendered at specific positions and the sixth audio object is not to be rendered at all.
With regard to audio object AO1, the user wants this audio object to be rendered on the left side of a replay scenario. This object is therefore placed at the position of a left loudspeaker in a (virtual) replay room, which results in the first column of the rendering matrix A being (1, 0). As to the second audio object, a_22 is 1 and a_12 is 0, which means that the second audio object is to be rendered on the right side.
The third audio object is to be rendered in the middle between the left loudspeaker and the right loudspeaker, so that 50% of the level or signal of this audio object goes into the left channel and 50% into the right channel, and the corresponding third column of the target rendering matrix A is therefore (0.5, 0.5).
Similarly, any placement between the left loudspeaker and the right loudspeaker can be indicated by the target rendering matrix. As to audio object 4, the placement is more to the right side, since the matrix element a_24 is larger than a_14. Similarly, the fifth audio object AO5 is rendered more towards the left loudspeaker, as indicated by the target rendering matrix elements a_15 and a_25. The target rendering matrix A additionally allows an audio object not to be rendered at all. This is exemplarily illustrated by the sixth column of the target rendering matrix A, which has zero elements.
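The following sketch writes out the target rendering matrix A for the six-object example of fig. 9 and applies it according to Y = A S (eq. 3). The exact gains for AO4 ("more to the right") and AO5 ("more to the left") are not stated in the text, so the values 0.3/0.7 and 0.7/0.3 are assumptions used only for illustration.

```python
import numpy as np

# Columns correspond to AO1 .. AO6, rows to the left and right output channel.
A = np.array([
    [1.0, 0.0, 0.5, 0.3, 0.7, 0.0],   # left channel
    [0.0, 1.0, 0.5, 0.7, 0.3, 0.0],   # right channel
])
# AO1 fully left, AO2 fully right, AO3 centered (0.5 / 0.5),
# AO4 more to the right, AO5 more to the left (0.3/0.7 values assumed),
# AO6 not rendered at all (all-zero column).

S = np.zeros((6, 1024))               # six (approximated) object signals
Y = A @ S                             # stereo target rendering, eq. (3): Y = A S
```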
Next, a preferred embodiment of the present invention is summarized with reference to fig. 10.
Preferably, a technique known from SAOC (spatial audio object coding) is used to split one audio signal into different parts. These parts may, for example, be different audio objects, but they are not limited to this.
If metadata is transmitted for each single part of the audio signal, it allows adjusting just some of the signal components, while other parts remain unchanged or may even be modified with different metadata.
This can be done for different sound objects, but also for separate spatial ranges.
Parameters for the object separation are classical, or even new, metadata (gain, compression, level, ...) for every individual audio object. These data are preferably transmitted.
The decoder processing box is implemented in two distinct stages: in a first stage, the object separation parameters are used to generate (10) the individual audio objects. In the second stage, the processing unit 13 has multiple instances, each of which is dedicated to an individual object. Here, the object-specific metadata should be applied. At the end of the decoder, all individual objects are again combined (16) into a single audio signal. Additionally, a dry/wet controller 20 may allow a smooth fade between the original and the manipulated signal to give the end user a simple possibility to find her or his preferred setting.
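The dry/wet controller 20 mentioned above can be sketched as a simple crossfade between the unprocessed signal and the object-wise manipulated signal; a plain linear crossfade is assumed here purely for illustration.

```python
import numpy as np

def dry_wet_mix(original, manipulated, wet=0.5):
    """Crossfade between the original signal (dry) and the object-wise
    manipulated signal (wet); wet=1.0 keeps only the manipulated version.
    A linear crossfade is assumed for this sketch."""
    wet = float(np.clip(wet, 0.0, 1.0))
    return (1.0 - wet) * original + wet * manipulated
```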
Depending on the specific implementation, fig. 10 illustrates two aspects. In a basic aspect, the object-related metadata only indicates an object description for a specific object. Preferably, the object description is related to an object ID, as indicated at 21 in fig. 10. Thus, the object-based metadata for the upper object manipulated by device 13a is just the information that this object is a "speech" object, while the object-based metadata for the other object processed by item 13b has the information that this second object is an "ambience" object.
This basic object-related metadata for both objects may be sufficient to implement an enhanced clean-audio mode, in which the speech object is amplified and the ambience object is attenuated or, generally speaking, the speech object is amplified relative to the ambience object or the ambience object is attenuated relative to the speech object. Preferably, however, the user can select different processing modes on the receiver/decoder side, which can be programmed via a mode control input. These different modes can be a dialogue level mode, a compression mode, a downmix mode, an enhanced midnight mode, an enhanced clean-audio mode, a dynamic downmix mode, a guided upmix mode, a mode for relocating objects, and so on.
Depending on the implementation, the different modes require different object-based metadata in addition to the basic information indicating the kind or characteristic of an object, such as speech or ambience. In the midnight mode, in which the dynamic range of an audio signal has to be compressed, it is preferred that, for each object such as a speech object and an ambience object, either an actual level or a target level for the midnight mode is provided as metadata. When the actual level of the object is provided, the receiver has to calculate the target level for the midnight mode. When, however, the relative target level is given, the decoder/receiver-side processing is reduced.
In this implementation, each object has a time-varying, object-based sequence of level information, which is used by the receiver to compress the dynamic range so that level differences within a single item are reduced. This automatically results in a final audio signal in which the level differences are, from time to time, reduced as required by a midnight-mode implementation. For clean-audio applications, a target level for the speech object can also be provided. The ambience object might then be set to zero or almost to zero in order to strongly emphasize the speech object within the sound generated by a certain loudspeaker setup. In a high-fidelity application, which is the opposite of the midnight mode, the dynamic range of the object or the dynamic range of the difference between the objects could even be enhanced. In this implementation, it would be preferred to provide target object gain levels, since these target levels guarantee that, in the end, a sound is obtained which was created by an artistic sound engineer within a sound studio and therefore has the highest quality compared to an automatic or user-defined setting.
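A minimal sketch of such a level-based manipulation, assuming frame-wise RMS measurement and illustrative target values, is given below: each separated object is scaled towards its own transmitted target level, which yields a reduced-dynamics mix for an enhanced midnight mode or a strongly speech-weighted mix for clean audio.

```python
import numpy as np

def apply_target_levels(objects, target_levels_db, eps=1e-12):
    """objects: dict name -> (num_samples,) separated object signal.
    target_levels_db: dict name -> desired level in dBFS for this frame,
    as it could be carried in object-based metadata.  Each object is
    scaled so that its RMS level meets its own target."""
    out = {}
    for name, sig in objects.items():
        rms_db = 20.0 * np.log10(np.sqrt(np.mean(sig ** 2)) + eps)
        gain_db = target_levels_db[name] - rms_db
        out[name] = sig * 10.0 ** (gain_db / 20.0)
    return out

rng = np.random.default_rng(2)
objects = {"speech": 0.1 * rng.standard_normal(2048),
           "ambience": 0.4 * rng.standard_normal(2048)}
# Enhanced midnight / clean-audio style targets (values are illustrative only).
leveled = apply_target_levels(objects, {"speech": -20.0, "ambience": -40.0})
```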
In other embodiments in which the object-based metadata relates to an advanced downmix, the object manipulation includes a downmix which is different for the particular rendering setup. This object-based metadata is then fed into the object downmixer blocks 19a to 19c in fig. 3b or fig. 4. In this embodiment, the manipulator may comprise the blocks 19a to 19c, where the downmix of an individual object is performed depending on the rendering setup. Specifically, the object downmix blocks 19a to 19c may be set differently from each other. In such a case, depending on the channel configuration, a speech object may be directed only to the center channel rather than to the left or right channel. The downmixer blocks 19a to 19c may then have a different number of component signal outputs. The downmix may also be implemented dynamically.
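A minimal sketch of such a setup-dependent object downmix is given below (Python/NumPy); the matrices, which route a mono speech object to the center channel only and spread the ambience to left/right, are illustrative assumptions and not the coefficients of any particular embodiment.

```python
import numpy as np

# Hypothetical per-object downmix matrices for an L/R/C output configuration:
# rows are output channels (L, R, C), columns are the object's input channels.
SPEECH_TO_LRC   = np.array([[0.0], [0.0], [1.0]])   # speech goes to the center only
AMBIENCE_TO_LRC = np.array([[0.7], [0.7], [0.0]])   # ambience goes to left/right only

def downmix_objects(object_signals, downmix_matrices):
    """Downmix each object with its own matrix and sum the component signals."""
    output = None
    for signal, matrix in zip(object_signals, downmix_matrices):
        component = matrix @ signal            # object component signals
        output = component if output is None else output + component
    return output

speech   = np.random.randn(1, 1024)            # mono speech object
ambience = np.random.randn(1, 1024)            # mono ambience object
lrc_mix  = downmix_objects([speech, ambience], [SPEECH_TO_LRC, AMBIENCE_TO_LRC])
```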
Furthermore, guided upmix information and information to reset the position of the object may also be provided.
Next, a brief description is given of preferred ways of providing metadata and of applying object-specific metadata.
The audio objects may not be separated as ideally as in typical SAOC applications. For audio manipulation it may be sufficient to have a "mask" for the objects rather than a complete separation.
This may result in fewer/coarser parameters for the separation.
For the application called "midnight mode", the sound engineer needs to define all metadata parameters independently for each object, e.g. so that a constant dialogue volume is produced while the ambient noise is manipulated ("enhanced midnight mode").
This may also be beneficial for people wearing hearing aids ("enhanced clean audio").
New downmix architectures: the different separated objects may be treated differently for each specific downmix situation. For example, a 5.1-channel signal must be downmixed for a stereo home television system, while another receiver may have only a mono playback system. Thus, different objects may be treated differently (and, owing to the metadata provided by the sound engineer, all of this remains under the sound engineer's control during production).
Similarly, a downmix to 3.0 etc. is also preferred.
The generated downmix will not be defined by one or more fixed global parameters; instead, it may be generated from time-varying, object-dependent parameters (a frame-wise sketch follows below).
It is also possible to perform guided upmixing with new object-based metadata.
The objects may be placed at different positions, e.g. to make the spatial image broader when the ambience is attenuated. This will help speech intelligibility for hearing-impaired listeners.
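The frame-wise sketch announced above (Python/NumPy) applies time-varying, object-dependent downmix gains instead of one fixed global parameter; the frame length and the blockwise, interpolation-free application of the gains are assumptions made purely for illustration.

```python
import numpy as np

def timevarying_object_downmix(object_signal, gains_per_frame, frame_len=1024):
    """Apply a time-varying downmix gain to one object, frame by frame.

    object_signal:   array of shape (samples,), one mono object.
    gains_per_frame: one linear gain per frame, e.g. derived from the
                     time-varying object-based metadata.
    """
    out = np.copy(object_signal).astype(float)
    for i, gain in enumerate(gains_per_frame):
        start = i * frame_len
        out[start:start + frame_len] *= gain
    return out

# The ambience object is gradually faded down over four frames.
ambience = np.random.randn(4 * 1024)
faded = timevarying_object_downmix(ambience, [1.0, 0.8, 0.6, 0.4])
```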
The method proposed in this document extends the existing metadata concept implemented by, and mainly used with, Dolby codecs. Now it is possible to apply the known metadata concept not only to the complete audio stream but also to objects extracted within this stream. This gives sound engineers and artists more flexibility and a larger adjustment range and, thereby, better audio quality and more enjoyment for the listener.
Figs. 12a and 12b show different application scenarios for the inventive concept. In a classic scenario, there is sport on television, where the stadium atmosphere is carried in all 5.1 channels and the speaker channel is mapped to the center channel. Such a "mapping" may be performed by directly adding the speaker channel to the center channel of the 5.1 channels carrying the stadium atmosphere. The inventive process now allows such a center channel to exist within the stadium-atmosphere sound description. The addition operation then mixes the center channel from the stadium atmosphere with the speaker. By generating object parameters for the speaker and for the center channel from the stadium atmosphere, the invention allows these two sound objects to be separated at the decoder side and the speaker or the center channel from the stadium atmosphere to be enhanced or attenuated. A further scenario arises when there are two speakers. Such a situation may occur when two persons are commenting on the same soccer game. Specifically, when two speakers are active at the same time, it may be useful to have these two speakers as separate objects and, furthermore, to have them separated from the stadium-atmosphere channels. In such an application, the 5.1 channels and the two speaker channels can be processed as eight different audio objects, or as seven different audio objects when the low-frequency enhancement channel (subwoofer channel) is neglected. Since the straightforward rendering setup is adapted to a 5.1-channel sound signal, the seven (or eight) objects can be downmixed into a 5.1-channel downmix signal and the object parameters can be provided in addition to this 5.1 downmix stream, so that, on the receiver side, the objects can be separated again and, since the object-based metadata identifies the speaker objects as distinct from the stadium-atmosphere objects, object-specific processing is possible before the final 5.1-channel downmix by the object mixer takes place on the receiver side.
In this scenario, one may also have a first object containing a first speaker, a second object containing a second speaker, and a third object containing the complete stadium atmosphere.
Next, implementations of different object-based downmix architectures will be discussed in the context of fig. 11a to 11 c.
When sound generated by, for example, the scenario of fig. 12a or 12b has to be replayed on a conventional 5.1 playback system, the embedded metadata stream can be ignored and the received stream can be played as it is. However, when playback has to take place on a stereo loudspeaker setup, a downmix from 5.1 to stereo has to be performed. If the ambience channels are simply added to left/right, the speaker may end up at a level which is too low. Therefore, it is preferred to reduce the ambience level before, or after, the downmix before the speaker object is (re-)added.
While still having both speakers separated to left/right, hearing-impaired listeners may want to reduce the ambience level in order to obtain better speech intelligibility; this relates to the so-called "cocktail party effect", where a person who hears his or her name focuses attention on the direction from which the name was heard. From a psychoacoustic point of view, this concentration on a particular direction attenuates sound arriving from other directions. Hence, a distinct placement of a particular object, such as a speaker at the left or at the right, or at both left and right so that the speaker appears in the middle between left and right, may improve intelligibility. To this end, the input audio stream is preferably divided into separate objects, where the objects must have a ranking in the metadata stating whether an object is more or less important. Then, the level differences among them can be adjusted in accordance with the metadata, or the object positions can be relocated in accordance with the metadata to improve intelligibility.
To achieve this, metadata are applied not to the transmitted signal but, as the case may be, to the single, separate audio objects before or after the object downmix. With the invention, objects no longer have to be restricted to particular spatial channels in order for these channels to be manipulated individually. On the contrary, the inventive object-based metadata concept does not require a specific object to be present in one specific channel; rather, objects can be downmixed into several channels and still be manipulated individually.
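To make the importance-driven level adjustment and repositioning described above more concrete, the following sketch (Python/NumPy) attenuates objects according to an importance value and pans a speech object to a distinct stereo position; the attenuation law, the 12 dB range, and the constant-power panning are illustrative assumptions, not values from the embodiment.

```python
import numpy as np

def apply_importance(objects, importance, max_attenuation_db=12.0):
    """Attenuate less important objects; importance values from metadata lie in [0, 1]."""
    out = []
    for signal, imp in zip(objects, importance):
        attenuation_db = (1.0 - float(imp)) * max_attenuation_db
        out.append(signal * 10.0 ** (-attenuation_db / 20.0))
    return out

def pan_stereo(mono_object, position):
    """Constant-power pan of a mono object; position -1.0 = left, +1.0 = right."""
    angle = (position + 1.0) * np.pi / 4.0
    return np.vstack([np.cos(angle) * mono_object, np.sin(angle) * mono_object])

speech   = np.random.randn(512)
ambience = np.random.randn(512)
speech_adj, ambience_adj = apply_importance([speech, ambience], [1.0, 0.3])
stereo_speech = pan_stereo(speech_adj, position=-0.5)   # distinct position, left of center
```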
Fig. 11a shows a further implementation of a preferred embodiment. The object downmixer 16 generates m output channels from k × n input channels, where k is the number of objects and each object generates n channels. Fig. 11a corresponds to the architecture of figs. 3a and 3b, in which the manipulations 13a, 13b, 13c take place before the object downmix.
Fig. 11a further comprises level manipulators 19d, 19e, 19f, which may be implemented without metadata control. Alternatively, however, these manipulators may also be controlled by object-based metadata, so that the level modification implemented by the blocks 19d to 19f is also part of the object manipulator 13 of fig. 1. The same is true for the downmix operations 19a to 19c when these downmix operations are controlled by object-based metadata. This case is not shown in fig. 11a, but it could be implemented as well when the object-based metadata is also forwarded to the downmix blocks 19a to 19c. In the latter case, these blocks would also be part of the object manipulator 13 of fig. 11a, and the remaining functionality of the object mixer 16 would be the output-channel-wise combination of the manipulated object component signals for the corresponding output channels. Fig. 11a further comprises a dialogue normalization functionality 25, which may be implemented with conventional metadata, since this dialogue normalization takes place not in the object domain but in the output channel domain.
Fig. 11b shows an implementation of an object-based 5.1-to-stereo downmix. Here, the downmix is performed before the manipulation and, thus, fig. 11b corresponds to the architecture of fig. 4. The level modification 13a, 13b is performed by means of object-based metadata, where, for example, the upper branch corresponds to a speech object and the lower branch corresponds to an ambience object or, as in figs. 12a and 12b, the upper branch corresponds to one or both speakers and the lower branch corresponds to all ambience information. The level manipulator blocks 13a, 13b might then manipulate both objects based on fixedly set parameters, so that the object-based metadata would only be an identification of the objects, but the level manipulators 13a, 13b could also manipulate the levels based on target levels provided by the metadata 14, or based on actual levels provided by the metadata 14. Thus, in order to generate a stereo downmix for a multi-channel input, a downmix formula is applied for each object, and the objects are weighted by given levels before they are remixed into the output signal.
For clean-audio applications, as shown in fig. 11c, an importance level is transmitted as metadata to enable a reduction of the less important signal components. The upper branch would then correspond to the important components, which are amplified, while the lower branch may correspond to the less important components, which can be attenuated. How the specific attenuation and/or amplification of the different objects is performed may be fixedly set by the receiver, but it may also be controlled by object-based metadata, as implemented by the "dry/wet" control 14 in fig. 11c.
In general, dynamic range control may be performed in the object domain, implemented as multi-band compression in a manner similar to the AAC dynamic range control implementation. The object-based metadata may even be frequency-selective data, so that a frequency-selective compression similar to an equalizer implementation is performed.
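A very simplified sketch of such frequency-selective gain control in the object domain is given below (Python/NumPy): the object is split into bands in the FFT domain and each band receives its own gain, e.g. taken from frequency-selective object metadata. The two-band split, the FFT-domain processing, and the 2 kHz band edge are assumptions chosen only to keep the example short; a real implementation would use a proper filter bank.

```python
import numpy as np

def frequency_selective_gain(signal, band_gains_db, band_edges_hz, sample_rate=48000):
    """Apply per-band gains to one object via a simple FFT-domain band split.

    band_gains_db: one gain per band, e.g. frequency-selective object metadata.
    band_edges_hz: inner band edges in Hz, len(band_gains_db) - 1 values.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    edges = [0.0] + list(band_edges_hz) + [float(sample_rate)]
    for gain_db, lo, hi in zip(band_gains_db, edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        spectrum[mask] *= 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum, n=len(signal))

# Attenuate only the band above 2 kHz of a speech object by 6 dB.
speech = np.random.randn(4096)
shaped = frequency_selective_gain(speech, band_gains_db=[0.0, -6.0], band_edges_hz=[2000.0])
```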
As previously mentioned, the dialogue normalization is preferably performed after the downmix, i.e. on the downmix signal. In general, the downmix should be able to process k objects having n input channels into m output channels.
It is not necessarily important to separate the objects into discrete objects; it may be sufficient to "mask out" the signal components that are to be manipulated. This is similar to editing masks in image processing. A generalized "object" then becomes a superposition of several original objects, where this superposition includes a number of objects smaller than the total number of original objects. All objects are again added up at a final stage. There might be no interest in separated single objects, and for some objects the level value may be set to 0 (i.e. a highly negative dB figure) when a certain object has to be removed completely, for example for karaoke applications, where one might be interested in completely removing the vocal object so that a karaoke singer can introduce his or her own voice alongside the remaining instrumental objects.
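The karaoke example above can be sketched as follows (Python/NumPy): the vocal object's level is set to an essentially zero value (a highly negative dB figure) and all remaining objects are summed again; the object names and the -80 dB floor are assumptions made for the sketch.

```python
import numpy as np

def remix_with_gains(objects, gains_db):
    """Sum all objects after applying per-object gains given in dB."""
    return sum(signal * 10.0 ** (g / 20.0) for signal, g in zip(objects, gains_db))

vocals = np.random.randn(2, 2048)
drums  = np.random.randn(2, 2048)
bass   = np.random.randn(2, 2048)

# Karaoke: the vocal object is pulled down to -80 dB (practically removed),
# while the instrumental objects are kept at their original level.
karaoke_mix = remix_with_gains([vocals, drums, bass], gains_db=[-80.0, 0.0, 0.0])
```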
Other preferred applications of the invention, as stated before, are an enhanced midnight mode in which the dynamic range of single objects is reduced, or a high-fidelity mode in which the dynamic range of objects is expanded. In this context, the transmitted signal may be compressed, and it is intended to invert this compression. The dialogue normalization is mainly preferred to take place for the total signal as output to the loudspeakers, but a non-linear attenuation/amplification of the different objects is useful when the dialogue normalization is adjusted. In addition to the parametric data for separating the different audio objects from the object downmix signal, it is preferred to transmit, for each object and in addition to the classical metadata related to the sum signal, level values for the downmix, an importance value indicating an importance level for clean audio, an object identification, actual absolute or relative levels as time-varying information, or absolute or relative target levels as time-varying information, etc.
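As an illustration of what such per-object metadata might look like in practice, the following sketch defines a container holding the fields listed above; the field names and types are purely hypothetical and do not represent a normative bitstream syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectMetadata:
    """Hypothetical per-object metadata, carried alongside classical sum-signal metadata."""
    object_id: int
    kind: str                                                       # e.g. "speech" or "ambience"
    downmix_levels_db: List[float] = field(default_factory=list)    # level values for the downmix
    importance: Optional[float] = None                              # importance level for clean audio
    actual_levels_db: List[float] = field(default_factory=list)     # time-varying actual levels
    target_levels_db: List[float] = field(default_factory=list)     # time-varying target levels

speech_meta = ObjectMetadata(object_id=0, kind="speech", importance=1.0,
                             actual_levels_db=[-20.0, -21.5, -19.0])
```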
The illustrated embodiments are merely illustrative of the principles of the inventions. It is to be understood that modifications and variations to the arrangements of the details described herein will be apparent to those skilled in the art. The scope of the invention is, therefore, indicated by the appended claims rather than by the specific details presented for the purpose of illustration and explanation of the embodiments.
Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disc, DVD or CD having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is therefore a computer program product with a program code stored on a machine-readable carrier, the program code being operative for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are therefore a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.

Claims (5)

HK14103638.6A  2008-07-17  2014-04-16  Apparatus and method for generating audio output signals using object based metadata  HK1190554B (en)

Applications Claiming Priority (4)

Application Number  Priority Date  Filing Date  Title
EP08012939  2008-07-17
EP08012939.8  2008-07-17
EP08017734.8  2008-10-09
EP08017734A  EP2146522A1 (en)  2008-07-17  2008-10-09  Apparatus and method for generating audio output signals using object based metadata

Publications (2)

Publication Number  Publication Date
HK1190554A1 (en)  2014-07-04
HK1190554B (en)  2016-11-18

Family

ID=

Similar Documents

Publication  Publication Date  Title
KR101325402B1 (en) Apparatus and method for generating audio output signals using object based metadata
KR102178231B1 (en) Encoded audio metadata-based equalization
CN107851440B (en) Metadata-based dynamic range control for encoded audio extension
RU2617553C2 (en) System and method for generating, coding and presenting adaptive sound signal data
EP2191463B1 (en) A method and an apparatus of decoding an audio signal
TWI396187B (en) Method and apparatus for encoding and decoding an object-based audio signal
EP2974010B1 (en) Automatic multi-channel music mix from multiple audio stems
US20170098452A1 (en) Method and system for audio processing of dialog, music, effect and height objects
US20110081024A1 (en) System for spatial extraction of audio signals
KR20140028094A (en) Method and apparatus for generating side information bitstream of multi object audio signal
JP2008513845A (en) System and method for processing audio data, program elements and computer-readable medium
AU2013200578B2 (en) Apparatus and method for generating audio output signals using object based metadata
HK1190554B (en) Apparatus and method for generating audio output signals using object based metadata
HK1155884B (en) Apparatus and method for generating audio output signals using object based metadata
HK1140351A (en) Apparatus and method for generating audio output signals using object based metadata
