The present invention relates to the field of audio signal processing and in particular to the generation of multiple output channels from fewer input channels, such as one (mono) channel or two (stereo) input channels.
A similar device is found, for example, in EP 1 021 063.
The growing popularity of DVDs has led to many end users now also owning multichannel playback systems, typically 5.1 multichannel equipment. Such playback systems generally consist of three speakers L (left), C (center) and R (right), which are typically placed in front of the user, two speakers Ls and Rs, which are typically placed behind the user, and an LFE channel, which is also referred to as the low frequency effects channel or subwoofer.
However, there is a huge amount of user-owned or publicly available audio material that exists only as stereo material, which has only two channels, namely the left channel and the right channel.
To play such stereo material over a 5.1 multichannel audio system, there are two options recommended by the ITU.
The first option is to play the left and right channels over the left and right speakers of the multichannel system. The disadvantage of this solution, however, is that the large number of existing speakers is not exploited, i.e. the center speaker and the two rear speakers remain unused despite being present.
The second option, i.e. the use of all speakers of the multichannel system, has an advantage over the first solution only when no upmix errors are made.
All known techniques attempt in various ways to extract the ambient signals from the original stereo signal or to synthesize them from noise or other information, where the synthesis of the ambient signals may also use information not contained in the stereo signal.
There are several techniques for upmixing stereo recordings. One technique is the use of matrix decoders. Matrix decoders such as Dolby Pro Logic II, DTS Neo:6 or Harman Kardon/Lexicon Logic 7 are well known and are included in almost every audio/video receiver sold today.
As already explained, the frequency-domain techniques described by Avendano and Jot can also be used to identify and extract ambience information in stereo audio signals. This method is based on the calculation of an inter-channel coherence index and a nonlinear mapping function, which allow the determination of the time/frequency regions that consist mainly of ambient signal components. The ambient signals are then synthesized and used to feed the surround channels of the multichannel playback system.
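For illustration only, the following minimal sketch shows the kind of coherence-based ambience extraction described by Avendano and Jot, assuming an STFT front end via scipy; the recursion constant alpha and the mapping (1 - coherence)^2 are placeholder assumptions, not the published functions.

```python
# A minimal sketch: coherence-based ambience extraction in the spirit of
# Avendano/Jot. Constants and the nonlinear mapping are placeholders.
import numpy as np
from scipy.signal import stft, istft

def extract_ambience(left, right, fs, nfft=1024, alpha=0.9):
    _, _, L = stft(left, fs, nperseg=nfft)
    _, _, R = stft(right, fs, nperseg=nfft)
    pll = np.zeros(L.shape[0])                  # smoothed auto-spectrum, left
    prr = np.zeros(L.shape[0])                  # smoothed auto-spectrum, right
    plr = np.zeros(L.shape[0], dtype=complex)   # smoothed cross-spectrum
    mask = np.empty(L.shape)
    for t in range(L.shape[1]):
        pll = alpha * pll + (1 - alpha) * np.abs(L[:, t]) ** 2
        prr = alpha * prr + (1 - alpha) * np.abs(R[:, t]) ** 2
        plr = alpha * plr + (1 - alpha) * L[:, t] * np.conj(R[:, t])
        coh = np.abs(plr) / np.sqrt(pll * prr + 1e-12)  # inter-channel coherence
        mask[:, t] = (1.0 - coh) ** 2                   # low coherence -> ambience
    _, amb_l = istft(L * mask, fs, nperseg=nfft)        # ambience for Ls
    _, amb_r = istft(R * mask, fs, nperseg=nfft)        # ambience for Rs
    return amb_l, amb_r
```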
A component of the direct/ambient upmix process is the extraction of an ambient signal that is fed into the two rear channels Ls, Rs. There are certain requirements for a signal to serve as an ambient-like signal in the context of such an upmix. One requirement is that no relevant parts of the direct sound sources should be audible, so that the direct sound sources remain reliably localized in front of the listener. This is especially important when the audio signal contains speech from one or more distinguishable speakers.
If a substantial amount of speech components were reproduced through the rear channels, the speaker or speakers would seem to be positioned away from the front, at some distance from the listener or even behind the listener, resulting in a very disturbing sound perception.
A basic requirement for the sound signal of a motion picture (a soundtrack) is that the audible impression should be in line with the impression produced by the images. Audible indications of localization should therefore not contradict visible indications of localization. Consequently, when a speaker is seen on the screen, the corresponding speech should also be placed in front of the user.
The same is true for all other audio signals, i.e. the requirement is not limited to situations in which audio and video signals are presented simultaneously. Such other audio signals are, for example, radio signals or audiobooks. A listener is used to speech being generated by the front channels; if speech suddenly came from the rear channels, he would probably turn around to restore his accustomed impression.
In order to improve the quality of the ambient signals, the German patent application DE 102006017280.9-55 proposes to subject an extracted ambient signal to transient detection and to achieve transient suppression without significant energy loss in the ambient signal by means of signal substitution, replacing areas containing transients with corresponding signals that lack transients but have approximately the same energy.
For example, US 6 914 988 shows a device for improving speech reproduction in a multichannel system.

The object of the present invention is to provide a concept for generating a multichannel signal with a number of output channels that is flexible on the one hand and delivers a high-quality product on the other.
This object is achieved by a device for generating a multichannel signal according to claim 1, a method for generating a multichannel signal according to claim 23, or a computer program according to claim 24.
The present invention is based on the finding that speech components are suppressed in the rear channels, i.e. in the ambient channels, so that the rear channels are kept free of speech components. To this end, an input signal having one or more channels is upmixed to provide a direct signal channel and an ambient signal channel or, depending on the implementation, a modified ambient signal channel. A speech detector is provided to search for speech components in the input signal, in the direct channel or in the ambient channel, such speech components occurring, for example, in time segments, in frequency segments, or in components of an orthogonal decomposition. A signal modifier is provided to modify the input signal or the ambient channel in order to attenuate the speech components there, while the corresponding signal components in the direct channel are attenuated to a lesser extent or not at all.
However, if the input signal has been modified, the ambient signal generated by the upmixer from it is used directly, since the speech components are already suppressed there, the underlying audio signal having had its speech components suppressed already. In this case, however, if the upmix process also produces a direct channel, that direct channel is calculated not on the basis of the modified input signal but on the basis of the unmodified input signal, in order to achieve speech suppression selectively, i.e. only in the ambient channel, but not in the direct channel, where the speech components are specifically desired.
This prevents the reproduction of speech components in the rear channels or ambient signal channels, which would otherwise disturb or even confuse the listener, and thus ensures that dialogue and other speech that is understandable to a listener, i.e. that has a spectral characteristic typical of speech, is placed in front of the listener.
The same requirements exist for the in-band concept, which likewise requires that direct signals are placed not in the rear channels but in front of the listener and, if necessary, at the listener's side, but not behind the listener, as shown in Fig. 5c, where the direct signal components (and also the ambient signal components) are all placed in front of or, at most, beside the listener.
Thus, the invention involves signal-dependent processing to remove or suppress the speech components in the rear channels or in the ambient signal, using two main steps, namely the detection of speech occurrence and the suppression of speech. The detection of speech occurrence can be performed in the input signal, in the direct channel, or in the ambient channel, and the suppression of speech in the ambient channel can be done directly or indirectly in the input signal that is then used to generate the ambient channel, where this modified input signal is not used to generate the direct channel.
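The two processing paths just described can be illustrated by the following minimal sketch; the upmix, detection, and suppression routines are placeholders of this sketch, and the point shown is only that the direct channel is always derived from the unmodified input signal.

```python
# A minimal sketch of the two signal paths, with placeholder routines.
import numpy as np

def upmix(x):
    """Placeholder direct/ambient split; a real upmixer is far more elaborate."""
    return 0.7 * x, 0.3 * x

def detect_speech(x, fs):
    """Placeholder detector returning (start, stop) sample pairs with speech."""
    return [(0, len(x) // 4)]

def suppress_speech(sig, segments, gain=0.5):
    out = sig.copy()
    for a, b in segments:
        out[a:b] *= gain              # e.g. broadband attenuation of roughly 6 dB
    return out

def generate_channels(x, fs, modify_input=False):
    direct, ambient = upmix(x)        # direct channel from the UNMODIFIED input
    segments = detect_speech(x, fs)
    if modify_input:
        # Indirect path: suppress speech in the input, upmix again, and keep
        # only the ambient output of that second upmix.
        _, ambient_mod = upmix(suppress_speech(x, segments))
    else:
        # Direct path: modify the extracted ambient channel itself.
        ambient_mod = suppress_speech(ambient, segments)
    return direct, ambient_mod
```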
The invention thus achieves that, when a multichannel surround signal is generated from an audio signal with fewer channels that contains speech components, the resulting signals for the rear channels are ensured to contain a minimum amount of speech, so as to preserve the original sound image in front of the user (front image). If a substantial amount of speech components were reproduced by the rear channels, the position of the speaker or speakers would be perceived outside the front range, somewhere between the listener and the front speakers or, in extreme cases, even behind the listener. This would result in a very disturbing sound perception, especially when the audio signals are presented simultaneously with visual signals.
Among the figures:

Fig. 5c shows a multichannel playback scenario in which discrete signal sources can at least partially also be reproduced via the rear channels and in which the ambient channels are reproduced less, or not at all, via the rear speakers than in Fig. 5b;
Fig. 6a shows an embodiment with speech detection in the ambient channel and modification of the ambient channel;
Fig. 6b shows an embodiment with speech detection in the input signal and modification of the ambient channel;
Fig. 6c shows an embodiment with speech detection in the input signal and modification of the input signal;
Fig. 6d shows an embodiment with speech detection in the input signal, a speech analyzer, and modification of the ambient channel;
Fig. 7 shows an embodiment with band-wise calculation of attenuation factors for speech suppression; and
Fig. 8 shows a more detailed illustration of a gain calculation block from Fig. 7.
The device shown in Figure 1 comprises an upmixer 14 for upmixing the input signal 12 to produce at least a direct signal channel 15 and an ambient signal channel 16 or, where appropriate, a modified ambient signal channel 16'. It further comprises a speech detector 18 designed to use, as an analysis signal, the input signal 12 as provided at 18a, or the direct signal channel 15 as provided at 18b, or another signal which is similar to the input signal 12 with respect to the temporal/frequency occurrence of its speech components. The speech detector detects a section of the input signal, of the direct channel, or of the ambient channel in which a speech component occurs, and this detection may rely on a quantitative or a qualitative measure.
A quantitative measure quantifies a speech feature with a numerical value, and this numerical value is compared with a threshold. A qualitative measure makes a per-section decision, which can be based on one or more decision criteria. Such decision criteria can be, for example, different quantitative features that are compared, weighted, or otherwise processed to arrive at a yes/no decision.
The device shown in Figure 1 also includes a signal modifier 20 designed to modify the original input signal, as shown at 20a, or to modify the ambient channel 16. If the ambient channel 16 is modified, the signal modifier 20 outputs a modified ambient channel 21; if the input signal 20a is modified, the modified input signal 20b is fed to the upmixer 14, which then produces the modified ambient channel 16', for example by the same upmix operation that also produces the direct channel 15. If this upmix operation, applied to the modified input signal 20b, were also to deliver a direct channel, that direct channel would be discarded, since the direct channel of the invention is derived from the unmodified input signal 12 rather than from the modified input signal 20b.
The signal modifier is designed to modify sections of the at least one ambient channel or of the input signal, these sections being, for example, time segments, frequency segments, or segments of an orthogonal decomposition. In particular, the sections corresponding to those detected by the speech detector are modified, so that the signal modifier, as shown, produces the modified ambient channel 21 or the modified input signal 20b, in which a speech component is attenuated or eliminated, while the speech component in the corresponding section of the direct channel is attenuated to a lesser extent or not at all.
In addition, the device shown in Figure 1 includes a loudspeaker output device 22 for outputting loudspeaker signals in a playback scenario, such as the 5.1 scenario shown by way of example in Figure 1; a 7.1 scenario, a 3.0 scenario, or another scenario is also possible. In particular, at least one direct channel and at least one modified ambient channel are used to generate the loudspeaker signals for the playback scenario, with the modified ambient channel coming either from the signal modifier 20, as shown at 21, or from the upmixer 14, as shown at 16'.
For example, if two modified ambient channels 21 are supplied, these two modified ambient channels could be fed directly into the two speaker signals Ls, Rs, while the direct channels are fed only into the three front speakers L, R, C, so that a complete split between ambient signal components and direct signal components is achieved. The direct signal components are then all in front of the user and the ambient signal components are all behind the user. Alternatively, ambient signal components can also be fed, typically at a smaller percentage, into the front channels, so that, for example, the direct/ambient scenario shown in Figure 5b results, in which the ambient signal is reproduced not only by the rear speakers but also by the front speakers.
In contrast, if the in-band scenario is preferred, the ambient signal components are also mainly output by the front speakers L, R, C, but direct signal components are at least partially fed into the two rear speakers Ls, Rs as well. In order to place the two direct signal sources 1100 and 1102 in Fig. 5c at the locations shown, the proportion of the source 1100 in the L speaker is chosen as large as its proportion in the Ls speaker, so that, according to a typical panning rule, the source 1100 can be placed in the middle between L and Ls. Depending on the implementation, the loudspeaker output device 22 can thus pass the supplied channels directly through to the individual loudspeakers or can mix the channels in order to map a given channel configuration onto the loudspeaker configuration of the playback scenario.
Fig. 2 shows a time/frequency distribution of an analysis signal in the upper section and of an ambient channel or input signal in the lower section. In particular, time is plotted along the horizontal axis and frequency along the vertical axis. Fig. 2 thus shows, for each signal, 15 time/frequency tiles or time/frequency segments that carry the same numbers in the analysis signal and in the ambient channel/input signal. This means that when the speech detector 18 detects a speech signal in, for example, section 22 of the analysis signal, the signal modifier 20 processes the identically numbered section 22 of the ambient channel/input signal, for example by attenuating it completely or partially, while sections in which no speech component was detected remain unprocessed, so that a selective speech suppression is achieved.
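For illustration only, the following sketch attenuates detected time/frequency tiles in an STFT representation, assuming the detector delivers tile indices (time tile, band tile) as in Fig. 2; the tile geometry and the attenuation factor are assumptions of this sketch.

```python
# A minimal sketch: selective attenuation of detected time/frequency tiles.
import numpy as np
from scipy.signal import stft, istft

def attenuate_tiles(x, fs, speech_tiles, frames_per_tile=8, bands=3, gain=0.25):
    _, _, X = stft(x, fs, nperseg=512)
    band_edges = np.linspace(0, X.shape[0], bands + 1, dtype=int)
    for time_tile, band_tile in speech_tiles:       # tiles flagged as speech
        t0 = time_tile * frames_per_tile
        b0, b1 = band_edges[band_tile], band_edges[band_tile + 1]
        X[b0:b1, t0:t0 + frames_per_tile] *= gain   # partial attenuation
    _, y = istft(X, fs, nperseg=512)
    return y
```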
Alternatively, an orthogonal decomposition can be performed, e.g. by principal component analysis, using the same decomposition in both the ambient channel or input signal and the analysis signal. Then certain components that have been detected as speech components in the analysis signal are attenuated, suppressed, or completely eliminated in the ambient channel or input signal.
Fig. 3 shows an implementation of a speech detector in cooperation with an ambient channel modifier, in which the speech detector provides only time information, i.e., with reference to Fig. 2, it identifies in a broadband manner only the first, second, third, fourth, or fifth time interval and communicates this information to the ambient channel modifier 20 via a control line 18d (Fig. 1). The speech detector 18 and the ambient channel modifier 20, working synchronously or with buffering, achieve that in the signal to be modified, e.g. the signal 12 or the signal 16, the speech component is attenuated, while the corresponding section of the direct channel is attenuated to a lesser extent or not at all.
Alternatively, if the signal modifier subjects the input signal to speech suppression, the upmixer 14 may operate twice, extracting the direct channel 15 on the basis of the original input signal on the one hand and extracting the modified ambient channel 16' on the basis of the modified input signal 20b on the other.
Depending on the implementation, the ambient channel modifier has either a broadband attenuation functionality or a highpass filtering functionality, as described below.
Various implementations of the device of the invention are illustrated below with reference to Figures 6a, 6b, 6c and 6d.
In Fig. 6a, the ambient signal a is extracted from the input signal x, where this extraction is part of the functionality of the upmixer 14. The occurrence of speech is detected in the ambient signal a. The detection result d is used in the ambient channel modifier 20, which calculates the modified ambient signal 21 in which speech portions are suppressed.
Figure 6b shows a configuration differing from Figure 6a in that the input signal, and not the ambient signal, is fed to the speech detector 18 as the analysis signal 18a. The modified ambient channel signal as is calculated similarly to the configuration in Figure 6a, but speech is detected in the input signal. This is motivated by the fact that speech components are generally more clearly detectable in the input signal x than in the ambient signal a.
In Fig. 6c, the speech-modified ambient signal as is extracted from a version xs of the input signal that has already been subjected to speech suppression. Since the speech components in x typically stand out more prominently than in an extracted ambient signal, their suppression there is more reliable and more thorough than in Fig. 6a. The disadvantage of the configuration of Fig. 6c compared to the configuration of Fig. 6a is that possible artifacts of the speech suppression and of the ambient extraction process could add up, depending on the type of extraction process. In Fig. 6c, however, the ambient extractor operates only on the speech-suppressed input signal xs (20b); the modified ambient signal is thus not extracted from the unmodified input signal x.
In the configuration shown in Fig. 6d, the ambient signal a is extracted from the input signal x by the upmixer. The presence of speech is detected in the input signal x. In addition, a speech analyzer 30 calculates additional side information e, which further controls the functionality of the ambient channel modifier 20. This side information is calculated directly from the input signal and may be, for example, the position of speech components in a time/frequency representation, such as the spectrogram of Fig. 2, or may contain further information, as detailed below.
A speech detector operates as a pattern recognition system: a classifier is trained on a set of labeled feature vectors

Ω_xy = {(x_i, y_i)}, i = 1, ..., N, with y_i ∈ Y,

and is then applied to unknown data.
Typically, microphones are used as sensors for a speech recognition system. A preparation may include an A/D conversion, resampling, or noise reduction. Feature extraction is the calculation of characteristic features for each object from the measurements. The features are selected in such a way that they are similar among objects of the same class, so that good intra-class compactness is achieved, and that they are different for objects of different classes, so that good inter-class separability is achieved. A third requirement is that the features should be robust with respect to noise, environmental conditions, and transformations of the signal irrelevant to human perception.
Classification is the process of deciding whether language exists or not, based on the extracted features and a trained classifier.
In the above equation, a set of training vectors Ω_xy is defined, in which feature vectors are denoted by x_i and the set of classes by Y. For basic speech detection, Y thus has two values, namely {speech, non-speech}.
In the training phase, the features x_i are calculated from labeled data, i.e. from audio signals for which the class y to which they belong is known.
In the application phase of the classifier, the features are calculated and projected from the unknown data just as in the training phase, and are classified by the classifier on the basis of the knowledge about the feature distributions of the classes gained in training.
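For illustration only, a minimal train/apply sketch using scikit-learn follows; the two per-frame features computed here (spectral flatness, zero-crossing rate) are illustrative choices, not features prescribed by the invention.

```python
# A minimal sketch of the train/apply scheme with placeholder features.
import numpy as np
from sklearn.svm import SVC

def frame_features(x, frame=1024):
    feats = []
    for i in range(0, len(x) - frame, frame):
        w = x[i:i + frame]
        spec = np.abs(np.fft.rfft(w)) + 1e-12
        sfm = np.exp(np.mean(np.log(spec))) / np.mean(spec)  # spectral flatness
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2       # zero-crossing rate
        feats.append([sfm, zcr])
    return np.array(feats)

# Training phase: features from labeled audio, y in {speech, non-speech}:
#   clf = SVC().fit(frame_features(labeled_audio), labels)
# Application phase: same features from unknown data, then classify:
#   is_speech = clf.predict(frame_features(unknown_audio))
```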
Speech enhancement methods are used to amplify speech in a mixture of speech and background noise. Such methods can be modified to produce the opposite effect, namely speech suppression, as is done for the present invention.
In principle, therefore, any method that amplifies speech or suppresses non-speech components can be used in the opposite way to their known use to suppress speech or to amplify non-speech. The general model of speech amplification or noise suppression is that the input signal is a mixture of the desired signal (speech) and the background noise (non-speech).
However, an important requirement in speech suppression is that, in the context of high-quality upmixing, the resulting audio signal is perceived as a high-quality audio signal. Speech enhancement and noise reduction techniques are known to introduce audible artifacts into the output signal. An example of such an artifact is known as musical noise or musical tones, and results from incorrect estimation of noise floors and fluctuating subband attenuation factors.
Alternatively, blind source separation methods can be used to separate the voice signal components from the ambient signal and then manipulate both separately.
However, for the specific requirement of producing high-quality audio signals, certain methods described below are preferred because they perform significantly better than other methods. One method is broadband attenuation, as indicated in Fig. 3 at 20. The audio signal is attenuated during the time periods in which speech is present. Typical attenuation factors are in the range of -12 dB to -3 dB, with a preferred attenuation of 6 dB. Since other signal components/proportions are equally suppressed, one might assume that the total loss of audio signal energy would be clearly perceptible. This is, however, mitigated by the typical effect that the onset of speech raises the level of the audio signal anyway, so that an attenuation in the range between -12 dB and -3 dB is not perceived as disturbing. Instead, the user finds the resulting effect, namely the suppression of speech components in the rear channels, much more pleasant, since the speech components are then positioned exclusively in the front channels.
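A minimal sketch of such broadband attenuation follows; the 10 ms smoothing ramp used to avoid audible gain steps is an assumption of this sketch.

```python
# A minimal sketch: broadband attenuation during detected speech segments.
import numpy as np

def broadband_attenuate(ambient, speech_segments, fs, atten_db=-6.0, ramp_ms=10):
    gain = np.ones(len(ambient))
    g = 10.0 ** (atten_db / 20.0)            # -6 dB corresponds to approx. 0.5
    for a, b in speech_segments:             # (start, stop) sample indices
        gain[a:b] = g
    ramp = max(1, int(fs * ramp_ms / 1000))
    gain = np.convolve(gain, np.ones(ramp) / ramp, mode="same")  # smooth gain
    return ambient * gain
```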
An alternative method, also indicated in Fig. 3 at 20, is highpass filtering. The audio signal is subjected to highpass filtering where speech is present, with a cutoff frequency in the range between 600 Hz and 3,000 Hz. The choice of the cutoff frequency follows from the signal characteristics of speech. The long-term power spectrum of a speech signal is concentrated in a range below 2.5 kHz. The preferred range of the fundamental frequency of voiced speech lies between 75 Hz and 330 Hz, and for adult speakers a range between 60 Hz and 250 Hz results.
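A minimal sketch of this highpass variant follows, assuming a per-sample speech mask from the detector; the filter order, the example cutoff of 1200 Hz, and the crossfading are assumptions of this sketch.

```python
# A minimal sketch: highpass filtering of the ambient signal during speech.
import numpy as np
from scipy.signal import butter, sosfilt

def highpass_during_speech(ambient, speech_mask, fs, cutoff_hz=1200.0):
    sos = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
    filtered = sosfilt(sos, ambient)
    # speech_mask: per-sample values in [0, 1] delivered by the speech detector
    return speech_mask * filtered + (1.0 - speech_mask) * ambient
```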
Another preferred implementation is sinusoidal signal modeling, which is illustrated by Fig. 4. In a first step 40, the fundamental wave of a speech signal is detected, where this detection can take place in the speech detector 18 or, as shown in Fig. 6d, in the speech analyzer 30. Then, in a second step 41, the harmonics belonging to the fundamental wave are determined. This functionality can be performed in the speech detector/speech analyzer or in the ambient signal modifier.
This sinusoidal signal modeling is often used for sound synthesis, audio coding, source separation, sound manipulation, and noise suppression. Here a signal is represented as a composition of sine waves with time-varying amplitudes and frequencies. Tonal speech signal components are manipulated by identifying and modifying the partial tones, i.e. the fundamental wave and its harmonics (overtones).
The partial tones are identified by a partial-tone finder, as described at 41. Typically, partial-tone finding is performed in the time/frequency domain. A spectrogram is obtained by a short-time Fourier transform, as indicated at 42. Local maxima in each spectrum of the spectrogram are detected, and trajectories are determined by linking local maxima of neighboring spectra. An estimate of the fundamental frequency can aid the peak-picking process; this estimation of the fundamental frequency is performed at 40. A sinusoidal signal representation is then obtained from the trajectories.
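For illustration only, the following sketch performs such partial-tone finding: local maxima are picked per STFT frame and linked to the nearest maximum of the previous frame; the peak threshold and the continuation rule are assumptions of this sketch.

```python
# A minimal sketch: partial-tone trajectories from STFT local maxima.
import numpy as np
from scipy.signal import stft

def find_partial_tracks(x, fs, nfft=2048, max_jump_hz=30.0):
    f, _, X = stft(x, fs, nperseg=nfft)
    mag = np.abs(X)
    tracks = []                                   # each track: list of (frame, freq)
    for frame in range(mag.shape[1]):
        col = mag[:, frame]
        peaks = [f[k] for k in range(1, len(col) - 1)
                 if col[k] > col[k - 1] and col[k] > col[k + 1]
                 and col[k] > 0.1 * col.max()]    # local maxima above threshold
        for pk in peaks:
            near = [tr for tr in tracks if tr[-1][0] == frame - 1
                    and abs(tr[-1][1] - pk) < max_jump_hz]
            if near:
                near[0].append((frame, pk))       # continue existing trajectory
            else:
                tracks.append([(frame, pk)])      # start new trajectory
    return tracks
```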
The purpose of the present invention is to achieve the opposite, namely to suppress the partial tones, where the partial tones comprise the fundamental wave and its harmonics, for a speech segment containing voiced speech. Typically, the high-energy speech components are tonal. Speech is produced at a level of 60 - 75 dB for vowels and 20 - 30 dB lower for consonants. For voiced speech (vowels), the excitation is a periodic pulse-like signal. The excitation signal is filtered by the vocal tract.
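A minimal sketch of suppressing the partial tones follows, assuming an estimated fundamental frequency per frame; the 40 Hz half-bandwidth and the attenuation value are assumptions of this sketch.

```python
# A minimal sketch: attenuating STFT bins at the fundamental and harmonics.
import numpy as np
from scipy.signal import stft, istft

def suppress_harmonics(x, fs, f0_per_frame, nfft=2048, half_bw_hz=40.0, gain=0.1):
    f, _, X = stft(x, fs, nperseg=nfft)
    for frame, f0 in enumerate(f0_per_frame[:X.shape[1]]):
        if f0 <= 0:
            continue                              # unvoiced frame: leave unchanged
        for h in np.arange(f0, fs / 2, f0):       # fundamental and its harmonics
            X[np.abs(f - h) < half_bw_hz, frame] *= gain
    _, y = istft(X, fs, nperseg=nfft)
    return y
```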
The audio signal is split into a number of frequency bands by means of a filter bank or a short-time Fourier transform, as shown in Fig. 7 at 70. Then, as shown at 71a and 71b, time-varying attenuation factors are calculated for all subbands from low-level features, in order to attenuate the subband signals in proportion to the amount of speech they contain. Suitable low-level features are the spectral flatness measure (SFM) and the 4 Hz modulation energy (4HzME). The SFM measures the degree of noisiness of an audio signal and results from the quotient of the geometric mean of all spectral values and the arithmetic mean of all spectral values. The 4HzME is motivated by the fact that speech has a characteristic energy modulation peak at about 4 Hz, which corresponds to the average syllable rate of a speaker.
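For illustration only, the two low-level features can be computed as in the following sketch; the envelope rate and the 2-8 Hz modulation window are assumptions of this sketch.

```python
# A minimal sketch: SFM (geometric/arithmetic mean ratio of the spectrum) and
# 4 Hz modulation energy of a subband envelope.
import numpy as np

def spectral_flatness(frame):
    spec = np.abs(np.fft.rfft(frame)) + 1e-12
    return np.exp(np.mean(np.log(spec))) / np.mean(spec)

def modulation_energy_4hz(subband, fs, env_rate=100):
    hop = int(fs / env_rate)                      # decimate envelope to env_rate Hz
    env = np.array([np.sqrt(np.mean(subband[i:i + hop] ** 2))
                    for i in range(0, len(subband) - hop, hop)])
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_rate)
    band = (freqs > 2.0) & (freqs < 8.0)          # modulation energy around 4 Hz
    return spec[band].sum() / (spec.sum() + 1e-12)
```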
Figure 8 shows a more detailed representation of the gain calculation blocks 71a and 71b of Figure 7. On the basis of a subband x_i, a variety of different low-level features, i.e. LLF1, ..., LLFn, are calculated. These features are then combined in a combiner 80 to obtain a gain factor g_i for the subband.
It should be noted that, depending on the implementation, not necessarily low-level features but any features, such as energy features etc., can be used, which can then be combined in a combiner to obtain a quantitative gain factor g_i, as implemented in Figure 8, such that each band is variably attenuated at any time to achieve speech suppression.
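For illustration only, a minimal combiner in the sense of Fig. 8 could map several feature values of a subband to one gain factor as follows; the weights and the gain floor are assumptions of this sketch and would in practice be tuned or trained.

```python
# A minimal sketch: combining feature values into a subband gain factor gi.
import numpy as np

def combine_to_gain(features, weights, floor=0.25):
    speechiness = float(np.clip(np.dot(features, weights), 0.0, 1.0))
    return 1.0 - (1.0 - floor) * speechiness      # more speech -> smaller gain

# Example: gi = combine_to_gain([sfm_i, mod4_i], weights=[0.5, 0.5])
# The subband is then scaled time-variantly as y_i = gi * x_i per block.
```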
Depending on the circumstances, the method of the invention may be implemented in hardware or software. The implementation may be on a digital storage medium, in particular a floppy disk or CD with electronically readable control signals, which can interact with a programmable computer system in such a way that the method is executed. In general, the invention thus also consists of a computer program product with a program code stored on a machine-readable medium to perform the method of the invention, if the computer program product runs on a computer. In other words, the invention may thus be realized as a computer program with a program code to execute the method, if the computer program runs on a computer.