EP3025514B1


Info

Publication number: EP3025514B1
Application number: EP14748239.2A
Authority: EP
Inventors: Grégory PALLONE; Marc Emerit
Original assignee: Orange SA
Current assignee: Orange SA
Priority date: 2013-07-24
Filing date: 2014-07-04
Publication date: 2019-09-11
Anticipated expiration: 2034-07-04
Also published as: CN105684465A; US20160174013A1; US9848274B2; KR20160034942A; KR102206572B1; KR102310859B1; KR20210008952A; CN105684465B; FR3009158A1; WO2015011359A1; ES2754245T3; JP2016527815A; EP3025514A1; JP6486351B2

Description

Translated from French

The invention relates to the processing of sound data, and more particularly to the spatialization (so-called "3D rendering") of audio signals.

Such an operation is performed, for example, when decoding an encoded 3D audio signal, represented on a certain number of channels, to a different number of channels, two for example, so that the 3D audio effects can be rendered on headphones.

The invention also relates to the transmission and rendering of multichannel audio signals and to their conversion for a rendering device (transducer) imposed by the user's equipment. This is the case, for example, when rendering a 5.1 sound scene on audio headphones or on a pair of loudspeakers.

The invention also relates to the rendering, in a game or video recording for example, of one or more sound samples stored in files, with a view to spatializing them.

In the case of a static monophonic source, binauralization is based on filtering the monophonic signal by the transfer function between the desired position of the source and each of the two ears. The resulting binaural (two-channel) signal can then be fed to headphones and give the listener the sensation of a source at the simulated position. Thus, the term "binaural" refers to the rendering of a sound signal with spatialization effects.

Each of the transfer functions simulating different positions can be measured in an anechoic chamber, yielding a set of HRTFs ("Head Related Transfer Functions") in which no room effect is present.

These transfer functions can also be measured in an ordinary room, yielding a set of BRIRs ("Binaural Room Impulse Responses") in which the room effect, or reverberation, is present. The set of BRIRs thus corresponds to a set of transfer functions between a given position and the ears of a listener (a real person or an artificial head) placed in a room.

The usual technique for measuring BRIRs consists of successively sending a test signal (for example a sweep signal, a pseudo-random binary sequence, or white noise) to each of the real loudspeakers positioned around a head (real or artificial) fitted with microphones in the ears. This test signal makes it possible, in offline (non-real-time) processing, to reconstruct (generally by deconvolution) the impulse response between the position of the loudspeaker and each of the two ears.

The difference between a set of HRTFs and a set of BRIRs lies mainly in the length of the impulse response: on the order of a millisecond for HRTFs, and on the order of a second for BRIRs.

Since the filtering is based on convolving the monophonic signal with the impulse response, the complexity of binauralizing with BRIRs (which contain a room effect) is markedly higher than with HRTFs.

With this technique it is possible to simulate, on headphones or on a limited number of loudspeakers, listening to multichannel content (L channels) played by L loudspeakers in a room. It suffices to consider each of the L loudspeakers as a virtual source ideally positioned relative to the listener, to measure, in the room to be simulated, the transfer functions (for the left and right ears) of each of these L loudspeakers, and then to apply to each of the L audio signals (intended to feed the L real loudspeakers) the BRIR filters corresponding to the loudspeakers. The signals feeding each ear are summed to provide a binaural signal that feeds audio headphones.

Let $I(l)$ (with $l \in [1, L]$) denote the input signal intended to feed the L loudspeakers. Let ${BRIR}^{g/d}(l)$ denote the BRIRs of each loudspeaker for each of the two ears, and let $O^{g/d}$ denote the binaural output signal. The binauralization of the multichannel signal is therefore written:

$O^{g} = \sum_{l = 1}^{L} I (l) * {BRIR}^{g} (l)$

$O^{d} = \sum_{l = 1}^{L} I (l) * {BRIR}^{d} (l)$

where * represents the convolution operator.
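
As an illustration only, the two sums above can be sketched in plain Python. The function names `convolve` and `binauralize`, and the representation of signals as lists of floats, are assumptions made for this sketch and do not come from the patent. A direct-form convolution is used for readability rather than an FFT-based block implementation.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def binauralize(inputs, brir_g, brir_d):
    """Sum, over the L loudspeakers, each input convolved with its BRIR.

    inputs : list of L input signals I(l), all of the same length
    brir_g : list of L left-ear impulse responses (hypothetical layout)
    brir_d : list of L right-ear impulse responses (hypothetical layout)
    Returns the pair (O^g, O^d).
    """
    n_out = len(inputs[0]) + len(brir_g[0]) - 1
    out_g = [0.0] * n_out
    out_d = [0.0] * n_out
    for x, hg, hd in zip(inputs, brir_g, brir_d):
        # Two convolutions per loudspeaker: one for each ear.
        for n, v in enumerate(convolve(x, hg)):
            out_g[n] += v
        for n, v in enumerate(convolve(x, hd)):
            out_d[n] += v
    return out_g, out_d
```

The loop body makes the cost visible: each loudspeaker contributes two convolutions, so L loudspeakers require 2·L of them.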

In what follows, the index $l$, such that $l \in [1, L]$, refers to one of the L loudspeakers. There is thus one BRIR for each signal $l$.

Thus, with reference to Figure 1, two convolutions (one for each ear) are performed for each loudspeaker (steps S11 to S1L).

For L loudspeakers, binauralization therefore requires 2·L convolutions. The complexity $C_{conv}$ can be calculated for the case of a fast block-based implementation, for example one based on a fast Fourier transform (FFT). The document "Submission and Evaluation Procedures for 3D Audio" (MPEG 3D Audio) gives one possible formula for calculating $C_{conv}$:

$C_{conv} = (L + 2) . (nBlocs) . (6 . \log_{2} (2 Fs / nBlocs))$

In this equation, L represents the number of FFTs needed to transform the input signals to the frequency domain (one FFT per input signal); the first 2 represents the number of inverse FFTs needed to obtain the time-domain binaural signal (two inverse FFTs for the two binaural channels); the 6 is a complexity coefficient per FFT; the second 2 accounts for the zero padding needed to avoid the problems due to circular convolution; Fs is the size of each BRIR; nBlocs reflects the use of block processing, which is more realistic in an approach where latency must not be excessively high; and . represents multiplication.

Thus, for a typical use with nBlocs=10, Fs=48000 and L=22, the complexity per multichannel signal sample for a direct FFT-based convolution is C_conv = 19049 multiply-adds.
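
The quoted figure can be checked by evaluating the formula directly. The sketch below does only that; the function name `c_conv` and its parameter names are chosen here for illustration.

```python
import math


def c_conv(n_speakers, n_blocs, fs):
    """Evaluate C_conv = (L + 2) . nBlocs . (6 . log2(2 Fs / nBlocs))."""
    return (n_speakers + 2) * n_blocs * (6 * math.log2(2 * fs / n_blocs))


c = c_conv(n_speakers=22, n_blocs=10, fs=48000)
# (22 + 2) * 10 * 6 = 1440 and log2(9600) is about 13.2288,
# so c is about 19049.5; truncating gives the 19049 quoted in the text.
print(math.floor(c))
```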

This complexity is currently too high for a realistic implementation on common processors (mobile processors, for example), so it is necessary to reduce this complexity without severely degrading the binaural rendering.

For the spatialization to be of good quality, the entire time-domain signal of the BRIRs must be applied.

US 2006/045294 A1 discloses a sound spatialization system for multichannel signals that delivers at least two output channels and includes partitioning the impulse responses into two parts, the first containing the direct arrival and the early reflections, and the second containing the reverberation. The second parts of at least two impulse responses are summed and weighted to obtain a single transfer function representing the reverberation to be applied to the input signals, in order to reduce the complexity of the convolution operations. US 2006/045294 A1 also discloses FFT block filtering.

STEWART REBECCA ET AL: "Generating a Spatial Average Reverberation Tail Across Multiple Impulse Responses", CONFERENCE: 35TH INTERNATIONAL CONFERENCE: AUDIO FOR GAMES; FEBRUARY 2009, AES, NEW YORK, USA, 1 February 2009 (2009-02-01), discloses a similar partitioning of impulse responses, as well as averaging of the impulse-response parts representing the reverberation, in order to reduce redundancy in impulse-response databases.

GARDNER W G: "EFFICIENT CONVOLUTION WITHOUT INPUT-OUTPUT DELAY", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, AUDIO ENGINEERING SOCIETY, NEW YORK, NY, US, vol. 43, no. 3, 1 March 1995 (1995-03-01), pages 127-135, describes an efficient implementation, in the form of FFT blocks, of convolution with impulse responses, notably in the context of reverberation simulation. The article suggests reusing the transforms of the input signals wherever possible to reduce computation costs, and shows that partitioning an impulse response into sub-blocks whose lengths are multiples of one another reduces computational complexity.

The present invention improves the situation.

It aims to greatly reduce the complexity of binauralizing a multichannel signal with a room effect while preserving the audio quality as far as possible.

To this end, the present invention proposes a sound spatialization method as defined in claims 1 to 10.

The invention also relates to a computer program comprising instructions for implementing the method.

The invention can be implemented by a sound spatialization device as defined in claim 12.

The invention can also be implemented in a module for decoding sound signals, as input signals, comprising the above spatialization device.

Other advantages and features of the invention will become apparent on reading the following detailed description of example embodiments of the invention and on examining the drawings, in which:

Figure 1 illustrates a prior-art spatialization method,
Figure 2 schematically illustrates the steps of a method within the meaning of the invention, in an example embodiment,
Figure 3 represents a binaural room impulse response BRIR,
Figure 4 schematically illustrates the steps of a method within the meaning of the invention, in an example embodiment,
Figure 5 schematically illustrates the steps of a method within the meaning of the invention, in an example embodiment,
Figure 6 schematically represents a device comprising means for implementing the method within the meaning of the invention.

Reference is first made to Figure 6 to illustrate one possible context for implementing the present invention in a connected terminal device TER (for example a telephone, smartphone or the like, or a connected tablet, a connected computer, or other). Such a device TER comprises means for receiving (typically an antenna) compression-encoded audio signals Xc, and a decoding device DECOD delivering decoded signals X ready to be processed by a spatialization device before the audio signals are rendered (for example binaurally on a headset with earpieces CAS). Of course, in some cases it may be advantageous to keep the signals partially decoded (for example in the subband domain) if the spatialization processing is carried out in the same domain (frequency-domain processing in the subband domain, for example).

Still referring to Figure 6, the spatialization device takes the form of a combination of elements:

hardware, typically comprising one or more circuits CIR cooperating with a working memory MEM and a processor PROC,
and software, for which Figures 2 and 4 are example flowcharts illustrating the general algorithm.

Here, the cooperation between the hardware and software elements produces a technical effect, notably a saving in the complexity of the spatialization for substantially the same audio rendering (the same sensation for a listener), as will be seen below.

Reference is now made to Figure 2 to describe processing within the meaning of the invention, as implemented by computing means.

In a first step S21, the data are prepared. This preparation is optional: the signals can be processed according to steps S22 and following without this pre-processing.

In particular, this preparation consists of truncating each BRIR so as to ignore the inaudible samples at the beginning and at the end of the impulse response.

For the truncation at the start of the impulse response (TRONC S), in step S211, this preparation consists of determining the instant at which the direct sound waves begin, and can be implemented by the following steps:

A cumulative sum of the energies of each of the BRIR(l) filters is calculated. Typically, this energy is calculated as the sum of the squared amplitudes of samples 1 to j, with j in [1; J], where J is the number of samples of a BRIR filter.
The energy value valMax of the maximum-energy filter (among the filters for the left ear and for the right ear) is calculated.
For each of the loudspeakers l, the index at which the energy of each of the BRIR(l) filters exceeds a certain threshold in dB, calculated relative to valMax (e.g. valMax-50dB), is calculated.
The truncation index iT retained for all the BRIRs is the minimum index among all the BRIR indices, and it is taken as the instant at which the direct sound waves begin.
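
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each BRIR is a list of samples, indices are zero-based (so the returned index equals the number of leading samples to drop), and the function name `start_truncation_index` and its `threshold_db` parameter are hypothetical, with the 50 dB default mirroring only the example value given in the text.

```python
def start_truncation_index(brirs, threshold_db=50.0):
    """Return the common truncation index iT for a set of BRIRs.

    brirs : list of impulse responses, covering every loudspeaker
            and both ears.
    """
    # Step 1: cumulative energy (running sum of squared amplitudes).
    cumulative = []
    for h in brirs:
        acc, cum = 0.0, []
        for sample in h:
            acc += sample * sample
            cum.append(acc)
        cumulative.append(cum)

    # Step 2: total energy of the most energetic filter.
    val_max = max(cum[-1] for cum in cumulative)

    # Step 3: first index where each filter exceeds valMax - threshold_db.
    threshold = val_max * 10 ** (-threshold_db / 10)
    indices = []
    for cum in cumulative:
        for j, energy in enumerate(cum):
            if energy > threshold:
                indices.append(j)
                break

    # Step 4: keep the smallest index so no filter loses audible signal.
    return min(indices, default=0)
```

Taking the minimum over all filters is what keeps the BRIRs synchronized: every response is shortened by the same number of samples.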

The index iT obtained therefore corresponds to the number of samples to ignore for each of the BRIRs. An abrupt truncation at the start of the impulse response with a rectangular window can lead to audible artifacts if it is applied in a part that carries too much energy. It may therefore be preferable to apply a suitable fade-in window; however, if care has been taken in choosing the threshold, this windowing becomes unnecessary because it is inaudible (only inaudible signal is cut).

The synchronism between BRIRs makes it possible to apply a constant delay to all the BRIRs for the sake of simplicity of implementation, even though a complexity optimization is possible.

The truncation of each BRIR to ignore the inaudible samples at the end of the impulse response (TRONC E), in step S212, can be carried out using steps similar to those described above, adapted to the end of the impulse response. An abrupt truncation at the end of the impulse response with a rectangular window can lead to audible artifacts on impulsive signals, where the reverberation tail may be audible. Thus, in one embodiment, a suitable fade-out window is applied.

In step S22, a synchronous isolation ISOL A/B is performed. This synchronous isolation consists of separating, for each BRIR, the "direct sound" and "early reflections" part (or Direct, denoted A) from the "diffuse sound" part (or Diffuse, denoted B). Indeed, the processing to be performed on the "diffuse sound" part can advantageously differ from that performed on the "direct sound" part, insofar as it is preferable to have better processing quality on the "direct sound" part than on the "diffuse sound" part. This makes it possible to optimize the quality/complexity ratio.

In particular, to carry out the synchronous isolation, a sample index "iDD" is determined that is unique and common to all the BRIRs (hence the term "synchronous"), from which the remainder of the impulse response is considered to correspond to a diffuse field. The impulse responses BRIR(l) are therefore partitioned into two parts, A(l) and B(l), whose concatenation corresponds to BRIR(l).

Figure 3 shows the partitioning index iDD at sample 2000. The part to the left of this index iDD corresponds to part A. The part to the right of this index iDD corresponds to part B.

In one embodiment, these two parts are isolated, without windowing, so that they can undergo different processing. In a variant, windowing is applied between the parts A(l) and B(l).

The index iDD may be specific to the room for which the BRIRs were determined. The calculation of this index may therefore depend on the spectral envelope, on the correlation of the BRIRs, or on the echogram of these BRIRs. For example, iDD can be determined by a formula of the type

$iDD = \sqrt{V_{salle}}$

where $V_{salle}$ is the volume of the measurement room.

In one embodiment, iDD is a fixed value, typically 2000. In a variant, iDD varies, advantageously dynamically, depending on the environment from which the input signals are captured.

The output signal for the left (g) and right (d) ears, denoted $O^{g/d}$, is therefore written:

$\begin{array}{l} O^{g / d} = \sum_{l = 1}^{L} I (l) * {BRIR}^{g / d} (l) = O_{A}^{g / d} + z^{- iDD} . O_{B}^{g / d} \\ = \sum_{l = 1}^{L} I (l) * A^{g / d} (l) + z^{- iDD} . \sum_{l = 1}^{L} I (l) * B^{g / d} (l) \end{array}$

where $z^{-iDD}$ corresponds to a delay of iDD samples.
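
The identity above, for a single input and a single ear, can be checked numerically with a small sketch. The names `convolve` and `split_convolve` are illustrative, the sample values in the usage lines are arbitrary, and the sketch assumes 0 < iDD < len(brir).

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def split_convolve(x, brir, i_dd):
    """Convolve x with the A and B parts of brir separately, then recombine."""
    a, b = brir[:i_dd], brir[i_dd:]       # concatenating a and b gives brir
    y = [0.0] * (len(x) + len(brir) - 1)
    for n, v in enumerate(convolve(x, a)):
        y[n] += v                          # contribution O_A
    for n, v in enumerate(convolve(x, b)):
        y[n + i_dd] += v                   # contribution O_B delayed by iDD
    return y


x = [1.0, 2.0, 3.0]
brir = [1.0, 0.5, 0.25, 0.125]
full = convolve(x, brir)
split = split_convolve(x, brir, i_dd=2)
print(all(abs(p - q) < 1e-12 for p, q in zip(full, split)))
```

Because convolution is linear, the same check holds term by term for the sum over the L loudspeakers.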

This delay is applied to the signals by storing the values calculated for

$\sum_{l = 1}^{L} I (l) * B^{g / d} (l)$

in temporary memory (for example in a buffer) and releasing them at the required moment.
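
One possible way to realize such a buffer in a sample-by-sample setting is a first-in, first-out delay line. The class name `DelayLine` and its interface are assumptions made for this sketch; the text does not prescribe a particular buffer structure.

```python
from collections import deque


class DelayLine:
    """FIFO buffer that returns each sample `delay` pushes after it went in."""

    def __init__(self, delay):
        # Pre-fill with zeros so the first `delay` outputs are silence.
        self.buf = deque([0.0] * delay)

    def push(self, sample):
        """Store the newest sample and release the oldest one."""
        self.buf.append(sample)
        return self.buf.popleft()


line = DelayLine(delay=2)
delayed = [line.push(v) for v in [1.0, 2.0, 3.0, 4.0]]
print(delayed)  # [0.0, 0.0, 1.0, 2.0]
```

In the context of the text, the delay would be iDD samples and the pushed values would be the samples of the diffuse-part sum.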

In one embodiment, the sample indices chosen for A and B may also take frame lengths into account when the method is integrated into an audio coder. Indeed, typical frame sizes of 1024 samples can lead to a choice in which A is 1024 samples long and B is 2048 samples long, provided that B is indeed a diffuse-field zone for all the BRIRs.

In particular, it may be advantageous for the size of B to be a multiple of the size of A because, if the filtering is implemented by FFT blocks, the FFT computed for A can be reused for B.

A diffuse field is characterized by the fact that it is statistically identical at every point of the room. Its frequency response therefore varies little with the loudspeaker to be simulated. The present invention exploits this property to replace all the diffuse filters D(l) of all the BRIRs with a single "mean" filter B_mean, which greatly reduces the complexity due to the multiple convolutions. To that end, the diffuse-field part B can be modified in step S23B, again with reference to Figure 2.

In step S23B1, the value of the mean filter B_mean is calculated. First, since the complete system is very rarely calibrated ideally, a weighting gain can be applied and carried over to the input signal, so that a single convolution per ear is performed for the diffuse-field part. The BRIRs are therefore decomposed into energy-normalized filters, and the normalization gain

$\sqrt{E_{B^{g/d}(l)}}$

is carried over to the input signal:

$\begin{array}{l} O_{B}^{g/d} = \sum_{l=1}^{L} [I(l) * B^{g/d}(l)] = \sum_{l=1}^{L} [I(l) * (\sqrt{E_{B^{g/d}(l)}} \cdot {B_{norm}}^{g/d}(l))] \\ = \sum_{l=1}^{L} [(\sqrt{E_{B^{g/d}(l)}} \cdot I(l)) * {B_{norm}}^{g/d}(l)] \end{array}$

with

${B_{norm}}^{g/d}(l) = \frac{B^{g/d}(l)}{\sqrt{E_{B^{g/d}(l)}}}$

where $E_{B^{g/d}(l)}$ represents the energy of $B^{g/d}(l)$.

Then, ${B_{norm}}^{g/d}(l)$ is approximated by a single mean filter ${B_{mean}}^{g/d}$, which no longer depends on the loudspeaker l but which can also be normalized in energy:

$O_{B}^{g/d} \approx {\hat{O}}_{B}^{g/d} = \sum_{l=1}^{L} [(\sqrt{E_{B^{g/d}(l)}} \cdot I(l)) * (\frac{{B_{mean}}^{g/d}}{\sqrt{E_{{B_{mean}}^{g/d}}}})]$

with

${B_{mean}}^{g/d} = \frac{1}{L} \sum_{l=1}^{L} [{B_{norm}}^{g/d}(l)]$

In one embodiment, this mean filter can be obtained by averaging the time-domain samples. In a variant, it can be obtained by any other type of averaging, for example by averaging the power spectral densities.
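The time-domain variant of this averaging can be sketched as follows. This is a minimal illustration, assuming the diffuse parts are equal-length lists of samples for one ear; the helper names (`energy`, `normalise`, `mean_filter`) and the two toy filters are hypothetical.

```python
import math


def energy(h):
    """Energy of a filter: sum of its squared samples."""
    return sum(v * v for v in h)


def normalise(h):
    """Return the unit-energy filter B_norm and the gain sqrt(E_B)."""
    gain = math.sqrt(energy(h))
    return [v / gain for v in h], gain


def mean_filter(filters):
    """Sample-wise average of equal-length filters (B_mean)."""
    count = len(filters)
    return [sum(taps) / count for taps in zip(*filters)]


# Hypothetical diffuse parts B(l) for L = 2 loudspeakers, one ear.
diffuse = [[3.0, 4.0, 0.0], [0.0, 0.6, 0.8]]
normalised, gains = zip(*(normalise(h) for h in diffuse))
b_mean = mean_filter(normalised)
```

Here `gains` holds the normalization gains $\sqrt{E_{B(l)}}$ that are carried over to the input signals, and `b_mean` is the single filter shared by all channels.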

In one embodiment, the energy of the mean filter $E_{{B_{mean}}^{g/d}}$ can be measured directly from the constructed filter ${B_{mean}}^{g/d}$. In a variant, it can also be estimated under the assumption that the filters ${B_{norm}}^{g/d}(l)$ are decorrelated. In that case, since unit-energy signals are summed:

$E_{{B_{mean}}^{g/d}} = \sum {(\frac{1}{L} \sum_{l=1}^{L} [{B_{norm}}^{g/d}(l)])}^{2} = \frac{1}{L^{2}} \cdot (L \cdot E_{{B_{norm}}^{g/d}}) = \frac{1}{L}$

The energy can be calculated over all the samples corresponding to the diffuse-field part.
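The 1/L estimate can be illustrated with filters that are exactly decorrelated. The sketch below assumes L = 4 unit-energy filters built as mutually orthogonal unit impulses, a deliberately idealized stand-in for real diffuse tails; with measured BRIRs the equality would only hold approximately.

```python
def energy(h):
    """Energy of a filter: sum of its squared samples."""
    return sum(v * v for v in h)


L = 4
# Idealized assumption: orthogonal unit impulses as decorrelated filters.
filters = [[1.0 if n == l else 0.0 for n in range(L)] for l in range(L)]

b_mean = [sum(taps) / L for taps in zip(*filters)]
measured = energy(b_mean)   # energy measured on the constructed filter
estimated = 1.0 / L         # estimate under the decorrelation hypothesis
```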

In step S23B2, the value of the weighting factor $W^{g/d}(l)$ is calculated. A single weighting factor to be applied to the input signal is calculated, taking into account the normalizations of the diffuse filters and of the mean filter:

${\hat{O}}_{B}^{g/d} = \sum_{l=1}^{L} [(\frac{\sqrt{E_{B^{g/d}(l)}}}{\sqrt{E_{{B_{mean}}^{g/d}}}} \cdot I(l)) * {B_{mean}}^{g/d}] = \sum_{l=1}^{L} [(\frac{1}{W^{g/d}(l)} \cdot I(l)) * {B_{mean}}^{g/d}]$

with

$W^{g/d}(l) = \frac{\sqrt{E_{{B_{mean}}^{g/d}}}}{\sqrt{E_{B^{g/d}(l)}}}$

Since the mean filter is constant, it can be taken out of the sum:

${\hat{O}}_{B}^{g/d} = \sum_{l=1}^{L} [(\frac{1}{W^{g/d}(l)} \cdot I(l))] * {B_{mean}}^{g/d}$

Thus, the L convolutions with the diffuse-field part are replaced by a single convolution with a mean filter, at the cost of a weighted sum of the input signals.
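The saving rests on the linearity of convolution, which the following sketch checks on toy data. The input channels, the inverse weights 1/W(l) and the mean filter are hypothetical values chosen only for illustration, and `convolve` is a plain direct-form helper rather than an FFT-based one.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


# Hypothetical data: L = 2 input channels, inverse weights 1/W(l), B_mean.
inputs = [[1.0, 2.0, 0.0], [0.5, -1.0, 1.0]]
inv_weights = [2.0, 0.5]
b_mean = [0.5, 0.25]

# L convolutions followed by a sum.
per_channel = [convolve([g * v for v in x], b_mean)
               for x, g in zip(inputs, inv_weights)]
reference = [sum(samples) for samples in zip(*per_channel)]

# Weighted sum of the inputs followed by a single convolution.
mixed = [sum(g * x[n] for x, g in zip(inputs, inv_weights))
         for n in range(len(inputs[0]))]
single = convolve(mixed, b_mean)
```

Both paths give the same output, but the second needs one convolution regardless of L.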

In step S23B3, a gain G correcting the gain of the mean filter ${B_{mean}}^{g/d}$ can optionally be calculated. Indeed, when the input signals are convolved with the non-approximated filters, whatever the correlation between the input signals, filtering by the decorrelated filters $B^{g/d}(l)$ yields signals to be summed that are themselves decorrelated. Conversely, when the input signals are convolved with the approximated mean filter, the energy of the signal resulting from the summation of the filtered signals depends on the correlation existing between the input signals.

For example:
* if all the input signals I(l) are identical and of unit energy, and the filters B(l) are all decorrelated (being diffuse fields) and of unit energy, then:

$E_{O_{B}^{g/d}} = \text{energy}(\sum_{l=1}^{L} [I(l) * {B_{norm}}^{g/d}(l)]) = L$

* if all the input signals I(l) are decorrelated and of unit energy, and the filters B(l) are all of unit energy but replaced by identical filters

$\frac{{B_{mean}}^{g/d}}{\sqrt{E_{{B_{mean}}^{g/d}}}},$

then:

$\begin{array}{l} E_{{\hat{O}}_{B}^{g/d}} = \text{energy}(\sum_{l=1}^{L} [I(l) * (\frac{{B_{mean}}^{g/d}}{\sqrt{E_{{B_{mean}}^{g/d}}}})]) \\ = \text{energy}(\frac{1}{\sqrt{E_{{B_{mean}}^{g/d}}}} \cdot \sum_{l=1}^{L} [I(l) * {B_{mean}}^{g/d}]) = {(\frac{1}{\sqrt{E_{{B_{mean}}^{g/d}}}})}^{2} \cdot (L \cdot \frac{1}{L}) = L \end{array}$

because the energies of decorrelated signals add. This case is equivalent to the previous one in the sense that the signals resulting from the filtering are all decorrelated, thanks to the filters in the first case and thanks to the input signals in the second case.

* if all the input signals I(l) are identical and of unit energy, and the filters B(l) are all of unit energy but replaced by identical filters

$\frac{{B_{mean}}^{g/d}}{\sqrt{E_{{B_{mean}}^{g/d}}}},$

then:

$\begin{array}{l} E_{{\hat{O}}_{B}^{g/d}} = \text{energy}(\sum_{l=1}^{L} [I(l) * (\frac{{B_{mean}}^{g/d}}{\sqrt{E_{{B_{mean}}^{g/d}}}})]) \\ = \text{energy}(\frac{1}{\sqrt{E_{{B_{mean}}^{g/d}}}} \cdot \sum_{l=1}^{L} [I(l) * {B_{mean}}^{g/d}]) = {(\frac{1}{\sqrt{E_{{B_{mean}}^{g/d}}}})}^{2} \cdot (L^{2} \cdot \frac{1}{L}) \\ = L^{2} \end{array}$

because the energies of identical signals add quadratically (their amplitudes add).
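The three cases can be reproduced with idealized signals. The sketch below assumes L = 3, uses orthogonal unit impulses as "decorrelated" unit-energy sequences and a single unit impulse as the "identical" one; these are illustrative stand-ins, not data from the patent.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def output_energy(inputs, filters):
    """Energy of the sum over l of I(l) * F(l)."""
    outputs = [convolve(x, h) for x, h in zip(inputs, filters)]
    summed = [sum(samples) for samples in zip(*outputs)]
    return sum(v * v for v in summed)


L = 3
# Idealized assumption: orthogonal unit impulses are mutually decorrelated.
distinct = [[1.0 if n == l else 0.0 for n in range(L)] for l in range(L)]
same = [[1.0, 0.0, 0.0] for _ in range(L)]

case_1 = output_energy(same, distinct)  # identical inputs, decorrelated filters
case_2 = output_energy(distinct, same)  # decorrelated inputs, identical filters
case_3 = output_energy(same, same)      # identical inputs, identical filters
```

With these inputs `case_1` and `case_2` both equal L, while `case_3` equals L², which is the excess energy the compensation gain G is meant to correct.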

Thus:

* if two loudspeakers are active simultaneously and fed with decorrelated signals, applying steps S23B1 and S23B2 introduces no gain relative to the conventional method;
* if two loudspeakers are active simultaneously and fed with identical signals, applying steps S23B1 and S23B2 introduces a gain of $10 \cdot \log_{10}(L^{2}/L) = 10 \cdot \log_{10}(2^{2}/2) = 3.01$ dB relative to the conventional method;
* if three loudspeakers are active simultaneously and fed with identical signals, applying steps S23B1 and S23B2 introduces a gain of $10 \cdot \log_{10}(L^{2}/L) = 10 \cdot \log_{10}(3^{2}/3) = 4.77$ dB relative to the conventional method.
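The figures quoted above follow directly from the ratio L²/L. A short check, with `excess_gain_db` as a hypothetical helper name:

```python
import math


def excess_gain_db(active):
    """Excess level, in dB, when `active` identical signals share one filter."""
    return 10.0 * math.log10(active ** 2 / active)


two_speakers = excess_gain_db(2)    # about 3.01 dB
three_speakers = excess_gain_db(3)  # about 4.77 dB
```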

The cases mentioned above correspond to the extreme cases of identical or decorrelated signals. They are nevertheless realistic: a source positioned midway between two loudspeakers, virtual or real, will feed an identical signal to both loudspeakers (for example with a VBAP technique, for "Vector Base Amplitude Panning"). When positioning in a 3D system, three loudspeakers can receive the same signal at the same level.

A compensation can therefore be applied in order to preserve the energy of the binauralized signals.

Ideally, this compensation gain G is determined as a function of the input signal (that is, G(I(l))) and is applied to the sum of the weighted input signals:

${\hat{O}}_{B}^{g/d} = G \cdot \sum_{l=1}^{L} [\frac{1}{W^{g/d}(l)} \cdot I(l)] * {B_{mean}}^{g/d}$

The gain G(I(l)) can be estimated by calculating the correlation between each pair of signals. It can also be estimated by comparing the energies of the signals before and after summation. In this case, the gain G can vary dynamically over time, depending for example on the correlations between the input signals, which themselves vary over time.
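One possible reading of the energy-comparison estimate is sketched below. The formula used, G = sqrt(sum of the individual energies / energy of the sum), is an assumption made for illustration; the text does not give an explicit expression, and `compensation_gain` is a hypothetical name.

```python
import math


def energy(x):
    """Energy of a signal block: sum of its squared samples."""
    return sum(v * v for v in x)


def compensation_gain(signals):
    """Assumed estimator: ratio of energies before and after summation."""
    summed = [sum(samples) for samples in zip(*signals)]
    before = sum(energy(x) for x in signals)
    after = energy(summed)
    if after == 0.0:
        return 1.0  # silent or fully cancelling block: leave the level alone
    return math.sqrt(before / after)


g_identical = compensation_gain([[1.0, 0.0], [1.0, 0.0]])     # about 0.707
g_decorrelated = compensation_gain([[1.0, 0.0], [0.0, 1.0]])  # 1.0
```

Under this assumption two identical channels give about -3 dB and two decorrelated channels give 0 dB, in line with the extreme cases discussed earlier. In practice the estimate would be recomputed per block so that G follows the input correlation over time.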

In a simplified embodiment, a constant gain can be set, for example $G = -3\,\text{dB} = 10^{-3/20}$, which avoids a correlation estimate that may be costly. The constant gain G can then be applied offline to the weighting factors (thus giving

$\frac{G}{W^{g/d}(l)}$

), or to the filter ${B_{mean}}^{g/d}$, which avoids applying an additional gain on the fly.
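Folding the constant gain into the weights ahead of time can be sketched as follows; the weight values are hypothetical placeholders.

```python
# Constant compensation gain of -3 dB expressed as a linear factor.
G = 10 ** (-3 / 20)

# Hypothetical weighting factors W(l) for L = 3 input channels, one ear.
weights = [0.5, 1.0, 2.0]

# Offline step: the factors G / W(l) are computed once and stored,
# so no extra multiplication is needed while the signal is processed.
folded = [G / w for w in weights]
```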

Once the transfer functions A and B have been isolated and the filters ${B_{mean}}^{g/d}$ (and optionally the weights $W^{g/d}(l)$ and G) have been calculated, these transfer functions and filters are applied to the input signals.

In a first embodiment, described with reference to Figure 4, the multichannel signal is processed by applying the Direct (A) and Diffuse (B) filters for each ear as follows:

* Efficient filtering (for example FFT-based direct convolution) by the Direct filters (A) is applied to the input multichannel signal (steps S4A1 to S4AL), as described in the state of the art. A signal ${\hat{O}}_{A}^{g/d}$ is obtained.
* Depending on the relations between the input signals, in particular their correlation, the gain of the mean filter ${B_{mean}}^{g/d}$ can optionally be corrected in step S4B11 by applying the gain G to the signal obtained after summation of the previously weighted input signals (steps M4B1 to M4BL).
* In step S4B1, efficient filtering by the mean diffuse filter B_mean is applied to the multichannel signal. This step takes place after summation of the previously weighted input signals (steps M4B1 to M4BL). The signal ${\hat{O}}_{B}^{g/d}$ is obtained.
* In step S4B2, a delay iDD is applied to the signal ${\hat{O}}_{B}^{g/d}$ in order to compensate for the delay introduced when the part B was isolated.
* The signals ${\hat{O}}_{A}^{g/d}$ and ${\hat{O}}_{B}^{g/d}$ are summed.
* If a truncation removing the inaudible samples at the beginning of the impulse responses has been carried out, a delay iT corresponding to the removed inaudible samples is applied to the input signal in step S41.
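The steps of this first embodiment can be put together for one ear as a rough sketch. It assumes direct-form convolution in place of FFT blocks, omits the optional truncation delay iT, and uses hypothetical names (`binauralise`, `direct`, `inv_weights`); it is not the patented implementation.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def add(a, b):
    """Sample-wise sum of two sequences of possibly different lengths."""
    size = max(len(a), len(b))
    a = list(a) + [0.0] * (size - len(a))
    b = list(b) + [0.0] * (size - len(b))
    return [u + v for u, v in zip(a, b)]


def binauralise(inputs, direct, inv_weights, b_mean, i_dd, gain=1.0):
    """One-ear output: sum of I(l)*A(l) plus the delayed diffuse path."""
    out_a = []
    for x, a in zip(inputs, direct):          # S4A1..S4AL: direct filters
        out_a = add(out_a, convolve(x, a))
    mixed = []
    for x, g in zip(inputs, inv_weights):     # M4B1..M4BL: weighting, then sum
        mixed = add(mixed, [g * v for v in x])
    mixed = [gain * v for v in mixed]         # S4B11: optional gain G
    out_b = convolve(mixed, b_mean)           # S4B1: single diffuse convolution
    out_b = [0.0] * i_dd + out_b              # S4B2: delay of iDD samples
    return add(out_a, out_b)


# Sanity check with L = 1: the split must reproduce the full convolution.
brir = [1.0, 0.5, 0.25, 0.125]
signal = [1.0, -1.0, 2.0]
out = binauralise([signal], [brir[:2]], [1.0], brir[2:], i_dd=2)
expected = convolve(signal, brir)
```

With a single channel, unit weight and B_mean equal to the diffuse part itself, the output coincides with the conventional convolution by the whole impulse response; with several channels the diffuse path is an approximation.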

In a variant, with reference to Figure 5, the signals are calculated not only for the left and right ears (indices g and d above) but for k rendering devices (typically loudspeakers).

In a second embodiment, the gain G is applied before the summation of the input signals, that is, during the weighting steps (steps M4B1 to M4BL).

In a third embodiment, a decorrelation is applied to the input signals. The signals are thus decorrelated after convolution by the filter B_mean, whatever the original correlations between the input signals. An efficient decorrelation implementation (for example using a feedback delay network) can be used to avoid costly decorrelation filters.

Thus, assuming realistically that BRIRs 48000 samples long can be:

* truncated between sample 150 and sample 3222 by the technique described in step S21,
* decomposed into two parts, a direct field A of 1024 samples and a diffuse field B of 2048 samples, by the technique described in step S22,

the binauralization complexity can be approximated by:

$C_{inv} = C_{invA} + C_{invB} = (L + 2) \cdot (6 \cdot \log_{2}(2 \cdot NA)) + (L + 2) \cdot (6 \cdot \log_{2}(2 \cdot NB))$

where NA and NB are the sizes, in samples, of A and B.

Thus, for nBlocs = 10, Fs = 48000, L = 22, NA = 1024 and NB = 2048, the complexity per multichannel signal sample for FFT-based convolution is C_inv = 3312 multiply-adds. This result should, however, be compared with a simple solution implementing only the truncation, that is, for nBlocs = 10, Fs = 3072, L = 22:

$C_{tronc} = (L + 2) \cdot (nBlocs) \cdot (6 \cdot \log_{2}(2 \cdot Fs / nBlocs)) = 13339$

There is therefore a complexity factor of 19049/3312 = 5.75 between the state of the art and the present invention, and still a factor of 13339/3312 = 4 between the state of the art with truncation and the present invention.
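These figures can be recomputed from the formulas. The sketch below assumes that the untruncated state-of-the-art figure of 19049 follows the same block formula as C_tronc with a length of 48000 samples; the function names are illustrative.

```python
import math


def block_complexity(speakers, n_blocks, length):
    """(L + 2) . nBlocs . 6 . log2(2 . length / nBlocs)."""
    return (speakers + 2) * n_blocks * 6 * math.log2(2 * length / n_blocks)


def split_complexity(speakers, size_a, size_b):
    """(L + 2) . 6 . log2(2 . NA) + (L + 2) . 6 . log2(2 . NB)."""
    return ((speakers + 2) * 6 * math.log2(2 * size_a)
            + (speakers + 2) * 6 * math.log2(2 * size_b))


c_conv = block_complexity(22, 10, 48000)  # untruncated BRIRs (assumed formula)
c_tronc = block_complexity(22, 10, 3072)  # truncation only
c_inv = split_complexity(22, 1024, 2048)  # direct/diffuse split

ratio_vs_conv = c_conv / c_inv            # about 5.75
ratio_vs_tronc = c_tronc / c_inv          # about 4.03
```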

If the size of B is a multiple of the size of A and the filtering is implemented by FFT blocks, the FFT computed for A can be reused for B. What is needed is then L FFTs on NA points, which serve both for the filtering by A and for the filtering by B, two inverse FFTs on NA points to obtain the time-domain binaural signal, and the multiplication of the frequency spectra.

In this case, the complexity can be approximated (the additions are neglected; (L + 1) corresponds to the multiplication of the spectra, L for A and 1 for B) by:

$C_{inv2} = (L + 2) \cdot (6 \cdot \log_{2}(2 \cdot NA)) + (L + 1) = 1607$

With this approach, a further factor of 2 is gained, and therefore factors of about 12 and 8 relative to the untruncated and truncated state of the art respectively.
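The value 1607 and the resulting factors can be verified with a few lines. The reference figures 19049, 13339 and 3312 are taken from the text; the variable names are illustrative.

```python
import math

L, NA = 22, 1024

# (L + 2) FFTs of size 2.NA plus (L + 1) spectrum multiplications.
c_inv2 = (L + 2) * 6 * math.log2(2 * NA) + (L + 1)

gain_vs_split = 3312 / c_inv2           # about 2.06
gain_vs_untruncated = 19049 / c_inv2    # about 11.85
gain_vs_truncated = 13339 / c_inv2      # about 8.30
```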

The invention can find a direct application in the MPEG-H 3D Audio standard.

Of course, the present invention is not limited to the embodiment described above; it extends to other variants while remaining within the scope of protection defined in the appended claims.

For example, an embodiment has been described above in which the Direct signal A is not approximated by a mean filter. Of course, a mean filter of A can be used to perform the convolutions (steps S4A1 to S4AL) with the signals coming from the loudspeakers.

An embodiment based on the processing of multichannel content generated for L loudspeakers has been described above. Of course, the multichannel content can be generated by any type of audio source, for example a voice, a musical instrument, or any noise.

An embodiment based on BRIR values determined in a room has been described above. Of course, the present invention can be implemented for any type of external environment (for example a concert hall or the open air).

An embodiment based on the application of two transfer functions has been described above. Of course, the present invention can be implemented with more than two transfer functions. For example, a part relating to directly emitted sounds, a part relating to the first reflections and a part relating to diffuse sounds can be isolated in synchronism.

Claims

Audio spatializing method wherein at least one filtering operation is applied to at least two input signals (I(1), I(2), ..., I(L)) in order to deliver at least two output signals (O(1), O(2), ..., O(K)), the filtering operation comprising:
- weighting (M4B1, M4B2, ..., M4BL) said at least two input signals with respective weighting weights (W^k(1), ..., W^k(L)), each weighting weight being specific to each of the input signals;
- for each impulse response incorporating a room effect among a plurality of impulse responses incorporating a room effect, said impulse response incorporating a room effect being respectively associated with one input signal among said at least two input signals (I(1), I(2), ..., I(L)) and with one output signal among said at least two output signals (O(1), O(2), ..., O(K)):
∘ partitioning (S22), in a time domain, said impulse response into a first portion (A) and a second portion (B), said partitioning being carried out such that:
said first portion represents direct audio propagations and first audio reflections of said propagations and extends over a first number of samples; and
said second portion represents a diffuse audio field present after said first reflections and extends over a second number of samples, said second number of samples being a multiple of said first number of samples;
∘ determining a first transfer function (A^k(1), A^k(2), ..., A^k(L)) from said first portion;
∘ determining a second transfer function from said second portion;
- for each output signal (O(1), O(2), ..., O(K)) among said at least two output signals (O(1), O(2), ..., O(K)):
∘ determining (S23B1) a third transfer function (B_mean^k) from an average of said second transfer functions corresponding to the output signal (O(1), O(2), ..., O(K));
∘ applying (S4A1, S4A2, ..., S4AL) to each input signal (I(1), I(2), ..., I(L)) the first transfer function (A^k(1), A^k(2), ..., A^k(L)) corresponding to the input signal (I(1), I(2), ..., I(L)) and to the output signal (O(1), O(2), ..., O(K));
∘ applying (S4B1) to each input signal the third transfer function (B_mean^k) corresponding to the output signal (O(1), O(2), ..., O(K)); wherein the application of the first and third transfer functions is carried out using FFT blocks;
- summing the signals resulting from applying said first and third transfer functions in order to obtain said at least two output signals (O(1), O(2), ..., O(K)).
Method according to Claim 1, characterized in that an energy-compensating gain (G) is applied (S4B11) to the weighting weights (W^k(1), ..., W^k(L)).
Method according to one of the preceding claims, characterized in that said partitioning of said impulse response comprises operations for:
- determining (S211) a start time of presence of direct audio waves,
- determining a start time of presence of said diffuse audio field after the first reflections and
- selecting (S22), in said impulse response, a portion of the response that extends temporally between said start time of presence of direct audio waves to said start time of presence of the diffuse field, said selected response portion corresponding to said first transfer function.
Method according to Claim 3, characterized in that said filtering comprises applying at least one compensating delay (S4B2) corresponding to a temporal offset between said start time of direct audio waves and said start time of presence of the diffuse field.
Method according to Claim 4, characterized in that said first and third transfer functions are applied in parallel to said input signals and in that said at least one compensating delay is applied to the input signals filtered by said third transfer functions.
Method according to one of the preceding claims, wherein said third transfer function is given by: $B_{mean}^{k} = \frac{1}{L} \sum_{l = 1}^{L} [B_{norm}^{k} (l)]$
with:
k an index relative to an output signal,
l ∈ [1;L] an index relative to one input signal among said input signals,
L a number of input signals, and
$B_{norm}^{k} (l)$ a normalised transfer function obtained from one second transfer function among said second transfer functions.
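As an illustration of the averaging in this claim, the following sketch computes $B_{mean}^{k}$ for one output index k as the sample-by-sample mean of the L normalised transfer functions. It assumes that the normalised functions are already available as equal-length sequences; the normalisation itself is not shown, and the name `mean_transfer_function` is hypothetical.

```python
def mean_transfer_function(b_norm):
    """Return B_mean^k = (1/L) * sum over l of B_norm^k(l).

    b_norm -- list of L equal-length sequences, one per input signal l,
              all relating to the same output signal k
    """
    count = len(b_norm)  # L, the number of input signals
    if count == 0:
        raise ValueError("at least one normalised transfer function is required")
    length = len(b_norm[0])
    if any(len(b) != length for b in b_norm):
        raise ValueError("all normalised transfer functions must have the same length")
    # Average the L functions sample by sample.
    return [sum(b[n] for b in b_norm) / count for n in range(length)]


# Illustrative values only, with L = 2.
b_mean = mean_transfer_function([[1.0, 0.0], [0.0, 1.0]])
```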
Method according to Claim 6, characterized in that at least one output signal O^k of said method is given by: $O^{k} = \sum_{l = 1}^{L} (I (l) * A^{k} (l)) + z^{- iDD} \cdot \sum_{l = 1}^{L} (\frac{1}{W^{k} (l)} \cdot I (l)) * B_{mean}^{k}$
with:
I(l) one input signal among said input signals,
A^k(l) one first transfer function among said first transfer functions,
W^k(l) one weighting weight among said weighting weights,
z^-iDD corresponds to the application of said compensating delay,
where · is a multiplication, and
where * is the convolution operator.
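The output expression of this claim can be sketched as follows for a single output index k. This is a minimal illustration, not the claimed implementation: it uses direct time-domain convolution for brevity, whereas Claim 1 specifies FFT blocks, and it assumes that the compensating delay iDD is a whole number of samples. All function and parameter names are hypothetical.

```python
def convolve(x, h):
    """Linear convolution of two finite sequences (the * operator)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add(a, b):
    """Sample-wise sum of two sequences, zero-padding the shorter one."""
    size = max(len(a), len(b))
    return [(a[n] if n < len(a) else 0.0) + (b[n] if n < len(b) else 0.0)
            for n in range(size)]


def spatialize_output(inputs, a_k, w_k, b_mean_k, idd):
    """Compute O^k = sum_l I(l)*A^k(l) + z^-iDD . (sum_l I(l)/W^k(l)) * B_mean^k."""
    direct = []  # accumulates sum_l I(l) * A^k(l)
    mixed = []   # accumulates sum_l (1 / W^k(l)) . I(l)
    for signal, a, w in zip(inputs, a_k, w_k):
        direct = add(direct, convolve(signal, a))
        mixed = add(mixed, [s / w for s in signal])
    # One convolution with B_mean^k, then the compensating delay of idd samples.
    diffuse = [0.0] * idd + convolve(mixed, b_mean_k)
    return add(direct, diffuse)


# Illustrative values only, with L = 2 input signals.
out = spatialize_output(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    a_k=[[1.0], [2.0]],
    w_k=[1.0, 2.0],
    b_mean_k=[0.5],
    idd=1,
)
```

Because $B_{mean}^{k}$ is common to all input signals, the weighted inputs are summed first and convolved once, so the diffuse part costs one convolution per output signal rather than L.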
Method according to Claim 6, characterized in that it comprises a step of decorrelating the input signals, prior to applying the third transfer functions, and in that at least one output signal O^k of said method is given by: $O^{k} = \sum_{l = 1}^{L} (I (l) * A^{k} (l)) + z^{- iDD} \cdot \sum_{l = 1}^{L} (\frac{1}{W^{k} (l)} \cdot I_{d} (l)) * B_{mean}^{k}$
with:
I(l) one input signal among said input signals,
I_d(l) one input signal among said input signals having been subjected to said decorrelating step,
A^k(l) one first transfer function among said first transfer functions,
W^k(l) one weighting weight among said weighting weights,
z^-iDD corresponds to the application of said compensating delay,
where · is a multiplication, and
where * is the convolution operator.
Method according to Claim 6, characterized in that it comprises a step of determining an energy-compensating gain depending on the input signals, and in that at least one output signal O^k is given by: $O^{k} = \sum_{l = 1}^{L} (I (l) * A^{k} (l)) + z^{- iDD} \cdot \sum_{l = 1}^{L} (G (I (l)) \cdot \frac{1}{W^{k} (l)} \cdot I (l)) * B_{mean}^{k}$
with:
I(l) one input signal among said input signals,
G(I(l)) said determined energy-compensating gain,
A^k(l) one first transfer function among said first transfer functions,
W^k(l) one weighting weight among said weighting weights,
z^-iDD corresponds to the application of said compensating delay,
where · is a multiplication, and
where * is the convolution operator.
Method according to one of Claims 6 to 9, characterized in that said weight is given by: $W^{k} (l) = \frac{\sqrt{E_{B_{mean}^{k}}}}{\sqrt{E_{B^{k} (l)}}}$
with k the index relative to an output signal,
l ∈ [1;L] the index relative to one input signal among said input signals,
L the number of input signals,
$E_{B_{mean}^{k}}$ an energy relative to $B_{mean}^{k}$, and
$E_{B^{k} (l)}$ an energy relative to one second transfer function among said second transfer functions.
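By way of illustration, the weight of this claim can be computed as in the sketch below. The claim refers only to "an energy relative to" each transfer function; taking that energy as the sum of squared samples is an assumption made here, and the names `energy` and `weighting_weights` are hypothetical.

```python
import math


def energy(h):
    """Energy of a transfer function, assumed here to be the sum of squared samples."""
    return sum(s * s for s in h)


def weighting_weights(b_mean_k, b_k):
    """Return [W^k(1), ..., W^k(L)] with W^k(l) = sqrt(E_Bmean^k) / sqrt(E_B^k(l)).

    b_mean_k -- the third transfer function B_mean^k for output k
    b_k      -- list of the L second transfer functions B^k(l) for output k
    """
    e_mean = energy(b_mean_k)
    return [math.sqrt(e_mean) / math.sqrt(energy(b)) for b in b_k]


# Illustrative values only: E_Bmean = 25, E_B(1) = 1, E_B(2) = 25.
weights = weighting_weights([3.0, 4.0], [[1.0, 0.0], [0.0, 5.0]])
```

A second transfer function with zero energy would cause a division by zero in this sketch; how such a case should be handled is not addressed by the claim.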
Computer program containing instructions for implementing the method according to one of Claims 1 to 10, when these instructions are executed by a processor.
Audio spatializing device, comprising at least one filter applied to at least two input signals (I(1), I(2), ..., I(L)), the device being able to deliver at least two output signals (O(1), O(2), ..., O(K)),
the device comprising weighting modules (M4B1, M4B2, ..., M4BL) for weighting said at least two input signals with respective weighting weights (W^k(1), ..., W^k(L)), each weighting weight being specific to each of the input signals;
the device furthermore being configured to:
- for each impulse response incorporating a room effect among a plurality of impulse responses incorporating a room effect, said impulse response incorporating a room effect being respectively associated with one input signal among said at least two input signals (I(1), I(2), ..., I(L)) and with one output signal among said at least two output signals (O(1), O(2), ..., O(K)):
∘ partition (S22), in a time domain, said impulse response into a first portion (A) and a second portion (B), said partitioning being carried out such that:
said first portion represents direct audio propagations and first audio reflections of said propagations and extends over a first number of samples; and
said second portion represents a diffuse audio field present after said first reflections and extends over a second number of samples, said second number of samples being a multiple of said first number of samples;
∘ determine a first transfer function (A^k(1), A^k(2), ..., A^k(L)) from said first portion;
∘ determine a second transfer function from said second portion;
the filter comprising:
- for each output signal (O(1), O(2), ..., O(K)) among said at least two output signals (O(1), O(2), ..., O(K)):
∘ determining (S23B1) a third transfer function (B_mean^k) from an average of said second transfer functions corresponding to the output signal (O(1), O(2), ..., O(K));
∘ applying (S4A1, S4A2, ..., S4AL) to each input signal first transfer functions corresponding to the output signal (O(1), O(2), ..., O(K));
∘ applying (S4B1) to each input signal the third transfer function (B_mean^k) corresponding to the output signal (O(1), O(2), ..., O(K));
wherein the application of the first and third transfer functions is carried out using FFT blocks;
wherein the signals resulting from applying said first and third transfer functions are summed in order to obtain said at least two output signals (O(1), O(2), ..., O(K)).
Module for decoding audio signals, comprising a device according to Claim 12 for spatializing said audio signals when input as input signals.