EP1970894A1

Info

Publication number: EP1970894A1
Application number: EP08151708A
Authority: EP
Inventors: Olivier Rosec; Didier Cadic
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2007-03-12
Filing date: 2008-02-20
Publication date: 2008-09-17
Also published as: US8121834B2; US20080255830A1

Abstract

The method involves applying a first modification operation to an initial audio signal to deliver an intermediate audio signal. A second modification operation, implemented by a pitch synchronous overlap and add (PSOLA) technique, is applied to the intermediate signal to deliver a final audio signal according to a modification factor. The factor is determined so as to take into account the effects of the first operation on the fundamental frequency of the initial signal, so that the fundamental frequency obtained for the final signal conforms to a modification setpoint relative to the fundamental frequency of the initial signal. An independent claim is also included for a computer program comprising instructions to perform a method for modifying acoustic characteristics of an initial audio signal with respect to modification setpoints.

Description

Translated from French

The present invention relates generally to the field of audio signal processing, and more specifically to techniques for modifying the characteristic parameters of an audio signal. The invention thus relates to a method and a device for modifying the acoustic characteristics of an audio signal according to modification setpoints relating at least to the fundamental frequency and the spectral envelope of the signal. The invention applies in particular to speech signals.

In the remainder of the description, cited documents are referred to by an abbreviated reference in square brackets ([...]); the full references are given in the list of documents at the end of the description.

Techniques for modifying digitized speech are very useful in many speech processing applications. In speech synthesis, they make it possible to carry out the prosodic modifications (changes to voice pitch and speaking rate) that are often needed to give the synthetic speech signal an acceptable intonation. In voice conversion, the objective is to modify the speech signal of a source speaker so that it appears to have been spoken by a desired target speaker; this requires adapting both timbre and pitch. Voice transformation applications should also be mentioned: these aim to modify the perceived speech using only a set of target descriptors (low/high voice, male/female/child voice, robotic voice, etc.).

Most known speech modification techniques essentially aim to modify three types of parameters:

The perceived height of the voice (pitch), measured by the fundamental frequency of the speech signal considered, that is, the vibration frequency of the vocal cords.
The speaking rate, directly related to the duration of pronunciation of the phonemes in the speech signal considered. The duration considered may be, for example, the total duration of an ordinary sentence.
The timbre of the voice, which can be defined as the perceptual attribute that distinguishes two sounds otherwise similar in pitch, intensity and duration. Timbre contains both an informative component (related to the phonemes pronounced) and an identity component (related to the speaker: for example, a hoarse, clear or soft voice). Timbre is often described by the spectral envelope of the speech signal. Recall that the spectral envelope is a curve that follows the amplitudes of the spectral peaks observed in the speech signal.

The three types of parameters mentioned above are not independent of one another, in the sense that a modification applied to one of them must affect the others. These parameters must therefore be modified consistently. In particular, pitch and timbre must be modified jointly to preserve the naturalness of the resulting speech. It has been shown, for example, in document [Syr85] (see the list of referenced documents at the end of the description) that the first formant and the fundamental frequency are closely related, so that any change to one of these parameters must be accompanied by an appropriate modification of the other. Recall that a formant corresponds to a resonance of the vocal tract and is characterized by its center frequency and its bandwidth. This center frequency appears as a peak in the spectral envelope.

Speech signal modification techniques are known that modify the perceived pitch without jointly modifying the timbre. Examples are techniques of the TD-PSOLA type or of the HNM type.

The technique known by the acronym TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add), described for example in patent document EP0363233 or in document [Mou95], is based on decomposing the speech signal into short-term, pitch-synchronous analysis signals, which are then repositioned on the time axis and joined progressively by overlap-add. The TD-PSOLA technique makes it possible to apply prosodic modifications to the speech signal, such as lengthening or shortening its duration (time-stretching) or changing its fundamental frequency (pitch), while maintaining good sound quality. "Good sound quality" here means the absence of dropouts, noise or other artifacts that make the signal unpleasant to listen to; it does not cover the naturalness of the voice timbre.
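The decomposition and overlap-add steps described above can be sketched in a few lines. This is only a toy illustration, not the method of EP0363233 or [Mou95]: it assumes a constant pitch period that is already known (in samples), a Hann window two periods long, and a nearest-mark selection rule. The function name `td_psola_pitch_shift` and its parameters are hypothetical.

```python
import math

def td_psola_pitch_shift(signal, period, beta):
    """Toy TD-PSOLA pitch change for a signal with a constant pitch period.

    signal: list of float samples
    period: analysis pitch period in samples (assumed known and constant)
    beta:   fundamental frequency factor (> 1 raises the pitch)
    Returns a list of the same length, so the duration is unchanged.
    """
    n = len(signal)
    # Analysis pitch marks: one per period, away from the signal edges.
    analysis_marks = list(range(period, n - period, period))
    # Hann window spanning two periods, centred on a pitch mark.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * k / (2 * period))
           for k in range(2 * period + 1)]
    out = [0.0] * n
    new_period = period / beta  # spacing of the synthesis marks
    t = float(period)
    while t < n - period:
        centre = int(round(t))
        # Reuse the short-term signal whose analysis mark is closest.
        mark = min(analysis_marks, key=lambda m: abs(m - centre))
        for k in range(-period, period + 1):
            idx = centre + k
            if 0 <= idx < n:
                out[idx] += win[k + period] * signal[mark + k]
        t += new_period
    return out
```

With `beta` below 1 the same short-term signals are spaced further apart, which lowers the pitch. In both cases the shape of each short-term signal, and therefore the spectral envelope, is left untouched.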

However, with the TD-PSOLA technique, although the duration modification factors can reach a value of 2 without noticeable distortion of the signal, the scope for modifying the fundamental frequency remains relatively limited if the naturalness of the resulting speech signal is to be preserved. Indeed, in TD-PSOLA, pitch modifications are not accompanied by any modification of timbre. Yet, as mentioned above, pitch and timbre must be modified jointly to preserve the naturalness of the resulting speech.

The voice modification technique based on the HNM model is described, for example, in document [Sty96]. The Harmonic plus Noise Model (HNM) has also been used for prosodic and even spectral modifications. It assumes that a voiced segment (also called a frame) of the speech signal S(n) can be decomposed into a harmonic part, representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids with amplitudes A^l and phases Φ^l, and a noise part, representing friction noise and the period-to-period variation of the glottal excitation, modeled by Gaussian white noise exciting an AR (auto-regressive) filter obtained by LPC (Linear Predictive Coding) analysis. For an unvoiced frame, the harmonic part is absent and the signal is simply modeled by white noise shaped by AR filtering. At synthesis, depending on the desired pitch setpoints, the amplitudes and phases of the harmonic part are re-estimated so as to preserve the timbre (that is, the spectral envelope) of the original signal as well as possible. This re-estimation is valid for the amplitude information provided that a sufficiently smooth spectral envelope is available. Re-estimating the phases, on the other hand, is much more complex and must take into account the phase spectra of the glottal source and of the filter characterizing the vocal tract, both of which are difficult to extract. Because of this difficulty, the HNM model fails to preserve the coherence of the modified signals and therefore degrades the quality of the resulting speech.
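As an illustration of the harmonic part only, the sketch below sums L sinusoids at integer multiples of the fundamental frequency. The cosine form, the function name `harmonic_part` and its parameters are assumptions made for the example; the noise part (white noise through an AR filter) and the re-estimation of amplitudes and phases are not modeled.

```python
import math

def harmonic_part(amplitudes, phases, f0, fs, n_samples):
    """Sum of L harmonic sinusoids: sum over l of A_l * cos(2*pi*l*f0*n/fs + phi_l).

    amplitudes, phases: lists of length L (harmonic l = 1..L)
    f0: fundamental frequency in Hz, fs: sampling rate in Hz
    """
    out = []
    for n in range(n_samples):
        s = 0.0
        for l, (a, phi) in enumerate(zip(amplitudes, phases), start=1):
            s += a * math.cos(2 * math.pi * l * f0 * n / fs + phi)
        out.append(s)
    return out
```

Changing `f0` while keeping the same amplitude list would move the harmonics without resampling the spectral envelope at their new positions, which is why the text insists on re-estimating amplitudes and phases at synthesis.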

Other known voice modification techniques, unlike the preceding ones, act jointly on the perceived pitch and on the timbre.

Resampling, for instance, is a technique for adapting a signal (not necessarily speech) to a change in its sampling frequency. Applied to a speech signal, it jointly modifies pitch, timbre and speaking rate while maintaining excellent sound quality. The resampling technique is described, for example, in document [Mou95]. According to this document, to speed up the signal by a factor P (P an integer), a low-pass filter is first applied, then the signal is decimated by discarding P-1 samples out of every P. To slow down an audio or speech signal by a factor Q (Q an integer), Q-1 zeros are inserted between every two signal samples, then a low-pass filter with an appropriate cutoff frequency is applied.
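The two integer-factor operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [Mou95]: the low-pass filter is a 31-tap Hamming-windowed sinc with cutoff at half the new sampling rate, and all function names are hypothetical.

```python
import math

def lowpass_fir(cutoff, num_taps=31):
    """Hamming-windowed sinc low-pass; cutoff is a fraction of the sampling rate (0..0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        x = k - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unit gain at 0 Hz

def fir_filter(signal, taps):
    """Convolution aligned on the filter centre; output has the input length."""
    mid = (len(taps) - 1) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            i = n + mid - k
            if 0 <= i < len(signal):
                acc += h * signal[i]
        out.append(acc)
    return out

def speed_up(signal, p):
    """Low-pass filter, then keep one sample out of every p (discard p-1)."""
    filtered = fir_filter(signal, lowpass_fir(0.5 / p))
    return filtered[::p]

def slow_down(signal, q):
    """Insert q-1 zeros after each sample, then low-pass filter.

    The filter gain is multiplied by q to compensate for the inserted zeros.
    """
    stuffed = []
    for s in signal:
        stuffed.append(s)
        stuffed.extend([0.0] * (q - 1))
    taps = [q * t for t in lowpass_fir(0.5 / q)]
    return fir_filter(stuffed, taps)
```

The low-pass filter comes before decimation to avoid aliasing and after zero insertion to remove the spectral images; a longer filter would give a sharper cutoff at a higher computational cost.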

In general, the resampling factor, denoted γ, is not an integer, but it can be approximated by a rational number P/Q. When γ = P/Q, it suffices to combine the two operations: upsampling by a factor Q followed by downsampling by a factor P.
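One way to obtain such a rational approximation is shown below, using Python's standard `fractions` module. The function name `rational_factor` and the bound `max_q` on the denominator are assumptions for the example; the text does not say how P and Q are chosen.

```python
from fractions import Fraction

def rational_factor(gamma, max_q=100):
    """Return integers (P, Q) such that P/Q approximates gamma with Q <= max_q."""
    frac = Fraction(gamma).limit_denominator(max_q)
    return frac.numerator, frac.denominator
```

For γ = 1.5 this gives P = 3 and Q = 2, so the signal would be upsampled by 2 and then downsampled by 3. A smaller `max_q` keeps the intermediate sampling rate low at the price of a coarser approximation of γ.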

In general, when the resampling factor γ applied is greater than (respectively less than) 1, the amplitude spectrum of the speech signal is expanded (respectively contracted); that is, the positions of the harmonics and formants of the signal on the frequency axis are scaled by γ. Such a spectral transformation therefore affects the timbre of the voice and, since the fundamental frequency is scaled by the same coefficient γ, it also acts on the pitch. Resampling is therefore an effective and relatively simple technique for modifying a speech signal, since it modifies timbre and pitch jointly without introducing audible artifacts: resampling preserves the temporal coherence of the signal and therefore does not distort the information conveyed.
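The scaling can be made concrete with a small helper. The function name `resample_effect` is hypothetical and the numbers in the usage note are illustrative only. The duration is divided by γ, consistent with the speed-up and slow-down behaviour of resampling described earlier.

```python
def resample_effect(f0_hz, formants_hz, duration_s, gamma):
    """Effect of resampling by gamma, with playback at the original sampling rate.

    All frequencies are multiplied by gamma; the duration is divided by gamma.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return f0_hz * gamma, [f * gamma for f in formants_hz], duration_s / gamma
```

For example, with γ = 2 a voice at 100 Hz with formants at 500 Hz and 1500 Hz lasting 1 s would come out at 200 Hz with formants at 1000 Hz and 3000 Hz and last 0.5 s: the formants move exactly as far as the fundamental frequency.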

However, resampling alone cannot produce relevant transformations of both fundamental frequency and timbre. Resampling the speech signal shifts the formants homothetically, in the same direction as the fundamental frequency. Yet observations of natural speech signals show that the range of variation of the fundamental frequency is much wider than that of the formant frequencies. Applying a resampling factor equal to the desired fundamental frequency modification factor therefore expands or compresses the spectral envelope too much, which noticeably degrades the naturalness of the voice and causes, for example, "tube voice" or "Donald Duck voice" effects.

Another known technique acts jointly on the perceived pitch and on the timbre. It is described in document [Kai00] and relies on a spectral adjustment operation that uses a Gaussian mixture model to model the spectral envelope and the pitch jointly. Depending on the desired fundamental frequency setpoint, the spectral envelope is corrected, which better preserves the naturalness of the transformed speech, especially when large fundamental frequency modifications are made. This type of technique allows relatively precise and well-controlled transformations of the amplitude spectrum. The phase information of the transformed signals, however, is poorly controlled, which noticeably degrades the quality of the resulting signal.

It follows from the prior art briefly described above that there is a real need for a speech signal modification technique that jointly modifies at least the perceived pitch and the timbre of the speech signal, so as to provide a high-quality speech signal in terms of the naturalness of the resulting perceived voice.

According to a first aspect, the present invention relates to a method of modifying the acoustic characteristics of an initial audio signal according to modification setpoints relating at least to the fundamental frequency and the spectral envelope of the initial signal. According to the invention, this method is noteworthy in that:

a first modification operation is applied to the initial signal to deliver an intermediate audio signal, the first modification operation being intended to deform the spectral envelope of the initial signal according to said spectral envelope modification setpoint, and
a second modification operation is applied to the intermediate signal to deliver a final audio signal, said second operation being intended to modify at least the fundamental frequency of the intermediate signal according to a modification factor that is determined so as to take into account the effects of the first modification operation on the fundamental frequency of the initial audio signal, so that the fundamental frequency obtained for the final signal conforms to said setpoint relating to the fundamental frequency.

The principle underlying the invention is thus to modify the characteristics of an audio signal according to predefined modification setpoints concerning the spectral envelope and the fundamental frequency of the signal, by combining two successive and distinct modification operations whose effects are predetermined. One of these operations acts mainly on the spectral envelope of the signal considered (and therefore on the perceived timbre in the case of a speech signal); it also affects the fundamental frequency, but not in a way that satisfies the predefined fundamental frequency setpoint. The other modification operation acts essentially on the fundamental frequency of the signal considered (and therefore on the perceived pitch in the case of a speech signal). Advantageously, according to the invention, this second modification operation is parameterized so as to modify the fundamental frequency of the audio signal obtained after the first modification, so that the fundamental frequency of the final modified signal conforms to the initial setpoint relating to the fundamental frequency.

Thus, by combining these two successive audio signal modification steps, a final modified signal is obtained whose spectral envelope and fundamental frequency fully conform to the initial setpoints. Applied to a speech signal, the invention makes it possible, for example, to guarantee the naturalness of a modified voice, since the predefined signal modification setpoints for timbre and pitch can actually be applied without a change of timbre (respectively of pitch) degrading the pitch (respectively the timbre) and producing a modified voice that lacks naturalness and/or does not correspond to the desired target.

According to a preferred embodiment of the invention, the modification setpoints for the initial audio signal comprise a factor γ for stretching/contracting the spectral envelope of the initial signal along the frequency axis, and factors β and α for modifying the fundamental frequency and the duration of the initial signal, respectively. In this embodiment, the first modification operation produces on the initial audio signal, in addition to the desired modification of the spectral envelope, a modification of the fundamental frequency and a modification of the duration, according to second factors β' and α' respectively. The second modification operation is then chosen so as to modify the fundamental frequency and the duration of the intermediate audio signal according to third factors β" and α" respectively, such that α'·α" = α and β'·β" = β.

Thus, by choosing the parameters α" and β" of the second modification operation according to the formulas above, as a function of the known modification factors α' and β' that result from applying the first modification operation to the initial audio signal, a final modified audio signal is obtained whose duration, fundamental frequency and spectral envelope conform to the initial modification setpoints α, β and γ, and therefore to the desired target signal.
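Solving the two constraints α'·α" = α and β'·β" = β for the parameters of the second operation is a simple division, sketched below. The function and parameter names are hypothetical, and the numbers in the usage note are illustrative values rather than values taken from the patent.

```python
def second_operation_factors(alpha, beta, alpha_first, beta_first):
    """Solve alpha' * alpha'' = alpha and beta' * beta'' = beta for (alpha'', beta'').

    alpha, beta:             duration and fundamental frequency setpoints
    alpha_first, beta_first: factors alpha', beta' produced by the first operation
    """
    if alpha_first <= 0 or beta_first <= 0:
        raise ValueError("factors of the first operation must be positive")
    return alpha / alpha_first, beta / beta_first
```

For instance, if the first operation shortens the signal by α' = 0.8 and raises the fundamental frequency by β' = 1.25, while the setpoints are α = 1 and β = 1.5, the second operation must apply α" = 1.25 and β" = 1.2.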

According to particular features of embodiments of the invention:

The first modification operation is carried out by a resampling technique of factor γ, with γ greater than 1 corresponding to a stretching of the spectral envelope of the signal, and γ between 0 and 1 corresponding to a contraction of the spectral envelope of the signal. The second factors β' and α' are defined as a function of the resampling factor γ by the equations $β' = γ$ and $α' = \frac{1}{γ}$, and the third factors β" and α" are obtained by the equations $β'' = \frac{β}{γ}$ and $α'' = α \cdot γ$.
The second modification operation is implemented by a PSOLA-type technique, for example TD-PSOLA.

According to a variant implementation of the method according to the invention, the second modification operation is carried out before the first modification operation, the second factors β' and α' being determined beforehand as a function of the factor γ.

According to a second aspect, the invention relates to an audio processing device adapted to modify the acoustic characteristics of an initial audio signal according to modification setpoints relating at least to the fundamental frequency and the spectral envelope of the initial signal. According to the invention, this device comprises:

means for modifying the initial audio signal according to a first modification operation, in order to deliver an intermediate audio signal, the first modification operation being intended to deform the spectral envelope of the initial signal according to said setpoint for modifying the spectral envelope of the signal, and
means for modifying the intermediate signal according to a second modification operation, in order to deliver a final audio signal, said second operation being intended to modify at least the fundamental frequency of the intermediate signal, so that the fundamental frequency obtained for the final signal conforms to said setpoint relating to the fundamental frequency, the fundamental frequency of said intermediate signal being modified according to a modification factor which is determined so as to take into account the effects of the first modification operation on the fundamental frequency of the initial audio signal.

The present invention also relates to an audio processing computer program comprising instructions adapted to implement a method according to the invention when the program is loaded and executed in a computer system.

The advantages of this audio processing device and of this computer program are identical to those mentioned above in connection with the method of the invention.

The invention will be better understood on reading the detailed description which follows, given solely by way of example and with reference to the drawings, in which:

Figure 1 is a general flowchart illustrating a method of modifying the acoustic characteristics of an audio signal, according to the invention;
Figure 2, composed of Figures 2A to 2D, represents different stages in the processing of a speech signal according to the algorithm known by the acronym TD-PSOLA.

Figure 1 represents a general flowchart illustrating a method, according to the invention, of modifying the acoustic characteristics of an audio signal. The present invention is applicable to audio signals in general (for example musical signals); however, it is particularly effective for speech signals. Consequently, in the present description of embodiments of the invention, the audio signal to be modified is a speech signal.

With reference to Figure 1, a method for modifying the acoustic characteristics of a speech signal, called the "initial signal", according to modification setpoints relating to predefined parameters of the speech signal, begins with an initial step E10 of determining the modification setpoints to apply as a function of the desired speech signal, that is, as a function of a "target" signal.

According to the embodiment described, the modification setpoints for the initial speech signal comprise a factor γ for stretching/contracting the spectral envelope of the initial signal along the frequency axis, and factors α and β for modifying, respectively, the duration and the fundamental frequency of the initial signal. The factors α and β are chosen such that a value greater than 1 corresponds to an increase in the duration or the fundamental frequency of the signal, respectively, and a value between 0 and 1 corresponds to a decrease in the duration or the fundamental frequency of the signal, respectively.

Thus, when the audio signal to be modified is a speech signal, the setpoint factors α, β and γ modify, respectively, the following parameters of the rendered speech: the speech rate, the perceived pitch of the voice, and the perceived timbre of the voice.

The choice of the parameters α, β and γ depends on the desired transformation. By way of illustration, when large modifications are made, for example to transform an adult voice into a child's voice, the factor γ for stretching/contracting the spectral envelope of the signal and the factor β for modifying the fundamental frequency can reach the values 1.2 and 3 respectively.

A statistical study of the variations of the fundamental frequency (pitch) and of the formant frequencies is provided in document [Hub99] (see in particular the table in Appendix A, p. 1540, of that document). This study can be used to determine "reasonable" values for the parameters γ and β. Thus, to transform a male voice into a female voice, a spectral envelope stretching/contraction factor γ of 1.2 and a fundamental frequency modification factor β of 1.8 are suitable (the duration does not need to be modified in this particular case).

The factor α for modifying the duration of the signal depends essentially on the desired speech rate. In many voice transformation applications, modification of the speech rate is considered secondary and is therefore ignored, which corresponds to a factor α equal to 1. On the other hand, to obtain very specific effects, for example a transformation toward the voices of giant or dwarf characters, factors that slow down or speed up the speech rate can be used. In such cases, typical values of the factor α lie between 0.5 and 2.

Returning to Figure 1, after step E10 of determining the modification setpoints as a function of the desired signal transformation, the next step E11 consists in determining accordingly, on the one hand, the two successive modification operations to be applied, starting from the initial speech signal, and on the other hand their respective parameters.

Thus, according to the invention, a first modification operation is applied to the initial signal S(n) in order to deliver an intermediate audio signal S1(n), this first modification operation being intended to deform the spectral envelope of the initial signal S(n) according to the setpoint γ for modifying the spectral envelope. Note that the audio or voice signals considered here are in sampled digital form (n denoting an arbitrary sample).

According to the chosen embodiment, the first modification operation (also called the "first transformation"), designated 'MOD_OP1', is implemented by a resampling technique of factor γ, with γ greater than 1 corresponding to a stretching of the spectral envelope of the signal, and γ between 0 and 1 corresponding to a contraction of the spectral envelope of the signal. Such a resampling method is known and is described, for example, in document [Mou95] cited previously; see in particular section 3.2.1 of that document, entitled "Time-domain and frequency-domain resampling". However, unlike the resampling technique described in [Mou95], which uses resampling to modify the pitch of the voice, the present invention uses resampling essentially to modify the spectral envelope of the initial signal S(n) according to the setpoint γ for modifying the spectral envelope.

However, it is known that such a resampling technique produces on the initial speech signal, in addition to the modification of the spectral envelope desired according to the invention, a modification of the fundamental frequency and a modification of the duration, according to second factors denoted here β' and α' respectively. These second factors β' and α' are defined as a function of the resampling factor γ by the following equations:

$β' = γ$ and $α' = \frac{1}{γ}$ (1)
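The side effects of resampling on pitch and duration can be checked numerically. The following is a minimal sketch, assuming NumPy and a plain linear-interpolation resampler with no anti-aliasing filter; the helper names `resample` and `dominant_frequency` and the test tone are illustrative assumptions, not part of the described method.

```python
import numpy as np

def resample(signal, gamma):
    """Read `signal` at positions 0, gamma, 2*gamma, ... (linear interpolation).

    Hypothetical helper: no anti-aliasing filter is applied, so for gamma > 1
    any content above fs / (2 * gamma) would alias in a real signal.
    """
    n_out = int(np.floor((len(signal) - 1) / gamma)) + 1
    positions = np.arange(n_out) * gamma
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest peak of the windowed magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    return np.fft.rfftfreq(len(signal), 1.0 / fs)[np.argmax(spectrum)]

fs = 16000        # sampling rate in Hz (assumed)
f0 = 200.0        # frequency of the test tone in Hz (assumed)
gamma = 1.25      # resampling factor

s = np.sin(2 * np.pi * f0 * np.arange(fs) / fs)   # one second of tone
s1 = resample(s, gamma)                            # intermediate signal

# Played back at the same rate fs, the tone is shorter and higher.
duration_ratio = len(s1) / len(s)                                  # close to 1 / gamma
frequency_ratio = dominant_frequency(s1, fs) / dominant_frequency(s, fs)  # close to gamma
print(duration_ratio, frequency_ratio)
```

With a pure tone the two ratios come out close to 1/γ and γ, which is the behaviour that the second operation has to compensate.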

Thus, according to the invention, the second modification operation 'MOD_OP2', to be applied to the signal S1(n) obtained after the first transformation MOD_OP1 (the "intermediate signal"), must be chosen so as to take into account the effects of MOD_OP1 on the fundamental frequency, so that the fundamental frequency obtained for the final signal S2(n) conforms to the setpoint β relating to the fundamental frequency. Of course, if there is also a setpoint α concerning the duration, as in the present embodiment, the second transformation MOD_OP2 must also take into account the effects of the first transformation MOD_OP1 on the duration of the initial signal.

Thus, in the embodiment described, the second modification operation is intended to modify the fundamental frequency and the duration of the intermediate signal S1(n) according to third factors β" and α" respectively, such that:

$α' \cdot α'' = α$ and $β' \cdot β'' = β$ (2)

In this way, the overall transformation carried out between the initial signal S(n) and the final signal S2(n), from the point of view of fundamental frequency and duration, corresponds to a transformation with respective factors β and α, according to equations (2) above.

In the chosen embodiment, in which the first modification operation MOD_OP1 is a resampling technique of factor γ whose effects on the fundamental frequency and the duration follow equations (1) above, the third factors β" and α" of the second transformation MOD_OP2 are obtained from the following equations:

$β'' = \frac{β}{γ}$ and $α'' = α \cdot γ$
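The compensation step can be written out directly from the relations above. This is a small sketch; the function name `second_stage_factors` is a hypothetical helper introduced for illustration, and the numeric case reuses the male-to-female values quoted earlier in the description (γ = 1.2, β = 1.8, α = 1).

```python
def second_stage_factors(alpha, beta, gamma):
    """Return (alpha2, beta2), the factors to hand to the second operation.

    Assumes the first operation is resampling of factor gamma, so that its
    side effects are alpha1 = 1 / gamma (duration) and beta1 = gamma (pitch).
    """
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("alpha, beta and gamma must be strictly positive")
    alpha1 = 1.0 / gamma          # duration change caused by resampling
    beta1 = gamma                 # pitch change caused by resampling
    alpha2 = alpha / alpha1       # equals alpha * gamma
    beta2 = beta / beta1          # equals beta / gamma
    return alpha2, beta2

# Male-to-female setpoints from the description: gamma = 1.2, beta = 1.8, alpha = 1.
alpha2, beta2 = second_stage_factors(alpha=1.0, beta=1.8, gamma=1.2)
print(alpha2, beta2)   # approximately 1.2 and 1.5
```

In this case the second operation only needs to raise the pitch by about 1.5 rather than 1.8, and must lengthen the intermediate signal by about 1.2 to restore the original duration.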

In practice, in a preferred embodiment, the second modification operation MOD_OP2 is implemented by a PSOLA (Pitch-Synchronous Overlap and Add) technique, and in particular by a PSOLA technique applied in the time domain, that is, TD-PSOLA (time-domain PSOLA). The TD-PSOLA technique is described further below in connection with Figure 2.

The second modification operation MOD_OP2 can also be carried out using techniques such as LP-PSOLA (Linear Prediction PSOLA) or FD-PSOLA (Frequency Domain PSOLA), or using a technique of the HNM (Harmonic plus Noise Model) type or of the phase vocoder type. It is even possible to use two independent techniques, one for modifying the fundamental frequency and one for modifying the duration.

On the other hand, whatever technique is used to modify the fundamental frequency, it must globally preserve the spectral envelope of the processed signal (here the intermediate signal S1(n)), since the spectral envelope of the initial signal S(n) is modified essentially by the first modification operation MOD_OP1.

Returning to Figure 1, once step E11 of choosing the modification operations MOD_OP1 and MOD_OP2 and their respective parameters has been completed, the actual modification of the initial speech signal S(n) is carried out in the following steps E12 and E13.

Thus, in step E12, the initial signal S(n) is modified according to the transformation MOD_OP1, yielding an intermediate signal S1(n) whose spectral envelope is modified (stretched or contracted) with respect to the initial signal according to the setpoint γ for modifying the spectral envelope, and whose fundamental frequency and duration are modified according to the second factors β' and α' respectively.

Finally, in step E13, the intermediate signal S1(n) is processed according to the transformation MOD_OP2, which modifies the fundamental frequency and the duration of the intermediate signal in order to obtain the final signal S2(n), whose duration, fundamental frequency and spectral envelope conform to the respective modification setpoints α, β and γ.

In the embodiment chosen and presented, the step of modifying the spectral envelope (MOD_OP1), that is, the timbre of the speech signal, precedes the step of modifying the prosodic parameters (pitch and speech rate) tied respectively to the fundamental frequency and the duration of the signal. However, the order of these operations can be reversed, provided that the modification factors of the first step take into account the effects of the second step on the fundamental frequency and, where applicable, on the duration of the processed signal, so that the modification setpoints for the initial signal are met overall. In particular, in the implementation described above, the second factors β' and α' of step MOD_OP2, this time executed first, would then be determined beforehand as a function of the factor γ of step MOD_OP1, executed second.

Figure 2 represents the main stages in the processing of a speech signal according to the TD-PSOLA algorithm. Fig. 2A represents the speech signal S(n) to be modified.

During a first step, illustrated by Fig. 2B, the signal S(n) is segmented into frames in a so-called pitch-synchronous manner, that is, each segment has a duration corresponding to the inverse of the fundamental frequency of the signal.

The glottal closure instants, also called analysis instants, are located near the energy maxima of the speech signal, and TD-PSOLA processing preserves the characteristics of the speech signal well near the ends of the segments obtained by pitch-synchronous analysis. Thus, when these instants are located with satisfactory accuracy, the performance of TD-PSOLA is optimized. Such a pitch-synchronous segmentation is obtained, for example, by group-delay-based techniques or by the method proposed by D. Vincent, O. Rosec and T. Chonavel in the publication "Glottal closure instant estimation using an appropriateness measure of the source and continuity constraints", IEEE ICASSP'06, vol. 1, pp. 381-384, Toulouse, France, May 2006.

This pitch-synchronous marking step is preferably performed offline, that is, not in real time, which reduces the computational load of a real-time implementation.

Depending on the desired modification factors for the fundamental frequency and the duration, the instants separating the segments are modified according to the following rules:

to lengthen the duration, certain segments are duplicated in order to artificially increase the number of glottal pulses;
to shorten the duration, certain segments are deleted;
to raise the fundamental frequency, that is, to obtain a higher-pitched rendering, the analysis instants are brought closer together, which may require duplicating segments to preserve the total duration; and
to lower the fundamental frequency, that is, to obtain a lower-pitched rendering, the analysis instants are moved further apart, which may require deleting segments to preserve the total duration.
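The four rules above can be illustrated by computing synthesis instants and the analysis segment assigned to each. This is a simplified sketch assuming NumPy, constant factors and a nearest-mark mapping; the function `synthesis_marks` and its arguments are hypothetical names, and the detailed rules of [Mou95] are not reproduced here.

```python
import numpy as np

def synthesis_marks(analysis_marks, alpha, beta):
    """Place synthesis instants and pick one analysis segment for each.

    analysis_marks: increasing sample positions of the analysis instants
                    (at least two marks are assumed).
    alpha: duration factor, beta: fundamental frequency factor.
    Returns (positions of the synthesis instants, index of the analysis
    segment reused at each of them).
    """
    marks = np.asarray(analysis_marks, dtype=float)
    periods = np.diff(marks)
    periods = np.append(periods, periods[-1])      # reuse last period at the end
    start = marks[0]
    out_end = start + alpha * (marks[-1] - start)  # target end of the output
    synth, mapping = [], []
    t = start
    while t <= out_end:
        t_analysis = start + (t - start) / alpha   # matching time in the input
        i = int(np.argmin(np.abs(marks - t_analysis)))
        synth.append(t)
        mapping.append(i)
        t += periods[i] / beta                     # closer marks raise the pitch
    return np.array(synth), np.array(mapping)

marks = np.arange(0, 1001, 100)                    # constant period of 100 samples

# Pitch doubled at constant duration: marks every 50 samples, segments reused.
up_marks, up_map = synthesis_marks(marks, alpha=1.0, beta=2.0)

# Duration halved at constant pitch: every other segment is dropped.
short_marks, short_map = synthesis_marks(marks, alpha=0.5, beta=1.0)
print(len(up_marks), short_map.tolist())
```

In the first case the analysis segments are duplicated so that the duration is preserved, and in the second case segments are skipped, matching the duplication and deletion behaviour stated in the rules.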

A detailed description of these rules can be found in document [Mou95], in particular in sections 4.2.1 to 4.2.3 of that document.

At the end of this step, the signal obtained comprises an integer number of segments, or frames, each with a duration equal to one period, that is, the inverse of the modified fundamental frequency, as shown in Fig. 2B.

The modification processing then comprises windowing the signal around the analysis instants, that is, the instants separating the segments. This windowing step is illustrated in Fig. 2C.

During this windowing, a windowed portion of the signal is selected around each analysis instant. This portion is called the "short-term signal" and, in the example, spans a duration equal to twice the modified fundamental period, as shown in Fig. 2C.

Finally, the modification processing comprises a summation of the short-term signals, which are re-centered on the synthesis instants and added together, as shown in Fig. 2D.
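The marking, windowing and overlap-add steps described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the procedure of [Mou95]: the factors α and β are constant, the pitch marks are already available, the window is a Hann window spanning two modified periods, and segments are duplicated or deleted by simply picking the analysis mark nearest to each synthesis instant. The function name `psola` and its parameters are hypothetical.

```python
import math


def psola(signal, marks, alpha, beta):
    """Simplified pitch-synchronous overlap-add (illustrative sketch).

    signal: list of float samples
    marks:  sorted analysis instants (sample indices), at least two
    alpha:  duration modification factor (assumed constant)
    beta:   fundamental frequency modification factor (assumed constant)
    """
    # Local pitch period at each analysis mark (last one is repeated).
    periods = [marks[i + 1] - marks[i] for i in range(len(marks) - 1)]
    periods.append(periods[-1])

    out_len = int(round(len(signal) * alpha))
    out = [0.0] * out_len

    t_syn = int(round(marks[0] * alpha))  # first synthesis instant
    while t_syn < out_len:
        # Map the synthesis instant back onto the original time axis.
        t_ana = t_syn / alpha
        # Nearest analysis mark: this implicitly duplicates or skips segments.
        k = min(range(len(marks)), key=lambda i: abs(marks[i] - t_ana))
        # Modified period: the short-term signal spans two of these.
        h = max(1, int(round(periods[k] / beta)))
        for j in range(-h, h + 1):
            src = marks[k] + j   # sample taken around the analysis instant
            dst = t_syn + j      # re-centered on the synthesis instant
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                w = 0.5 * (1.0 + math.cos(math.pi * j / h))  # Hann weight
                out[dst] += w * signal[src]
        t_syn += h  # next synthesis instant, one modified period later
    return out


# Usage: a sinusoid with a 50-sample period, marked once per period.
period = 50
signal = [math.sin(2 * math.pi * n / period) for n in range(1000)]
marks = list(range(0, 1000, period))
higher = psola(signal, marks, alpha=1.0, beta=2.0)  # same length, half period
longer = psola(signal, marks, alpha=2.0, beta=1.0)  # twice as long
```

Because Hann windows of half-width `h` spaced `h` samples apart sum to one, unit factors leave the interior of a perfectly periodic signal unchanged in this sketch.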

In the embodiments of the invention described above by way of example, the modification coefficients were chosen to be constant. However, the general method according to the invention described above can also be used to modify an audio signal with non-constant coefficients α, β and γ. In that case, the signal can, for example, be divided into frames (preferably pitch-synchronous), with constant modification coefficients determined for each frame. Steps E12 and E13 are then performed independently on each frame. The frames are then combined by a conventional overlap-add technique so as to reconstruct the desired transformed signal.
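As an illustration of the per-frame case, the sketch below derives, for each frame, the factors to be applied by the second operation from that frame's setpoints. It assumes the first operation is a resampling by γ and uses the relations stated in the claims (β' = γ, α' = 1/γ, α'·α" = α, β'·β" = β). The function names and the list-of-tuples layout are hypothetical.

```python
def second_stage_factors(alpha, beta, gamma):
    """Return (alpha2, beta2), the duration and fundamental frequency
    factors for the second operation, assuming the first operation is a
    resampling by gamma (an assumption taken from the claims)."""
    if gamma <= 0:
        raise ValueError("gamma must be strictly positive")
    beta1 = gamma          # side effect of resampling on the fundamental frequency
    alpha1 = 1.0 / gamma   # side effect of resampling on the duration
    alpha2 = alpha / alpha1  # so that alpha1 * alpha2 == alpha
    beta2 = beta / beta1     # so that beta1 * beta2 == beta
    return alpha2, beta2


def per_frame_factors(setpoints):
    """setpoints: list of (alpha, beta, gamma) tuples, one per frame,
    each assumed constant within its frame."""
    return [second_stage_factors(a, b, g) for (a, b, g) in setpoints]


# Usage: three frames with a gradually increasing envelope stretch.
frames = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.25), (1.0, 1.2, 1.5)]
factors = per_frame_factors(frames)
```

With α = β = 1 and γ = 1.25, for instance, the second operation must lengthen the frame by 1.25 and lower its fundamental frequency by a factor of 0.8 to undo the side effects of the resampling.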

A method of modifying an audio signal according to the invention, as described above, is in practice implemented by a device for processing audio signals, and more particularly speech signals. Such a device therefore comprises hardware means, in particular electronic means, and/or software means suitable for implementing a method according to the invention.

According to a preferred implementation, the steps of the method of modifying an audio signal according to the invention are determined by the instructions of a computer program used in such a processing device, which typically consists of a computer system, for example a personal computer.

The method according to the invention is then carried out when the aforementioned program is loaded into computing means incorporated in the audio processing device, the operation of which is then controlled by the execution of the program.

The term "computer program" here means one or more computer programs forming a set (software) whose purpose is to implement the invention when executed by an appropriate computer system.

Accordingly, the invention also relates to such a computer program, in particular in the form of software stored on an information medium. Such an information medium may consist of any entity or device capable of storing a program according to the invention.

For example, the medium in question may comprise a hardware storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a hard disk. Alternatively, the information medium may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute, or to be used in the execution of, the method in question.

The information medium may also be a transmissible intangible medium, such as an electrical or optical signal that can be conveyed via an electrical or optical cable, by radio or by other means. A program according to the invention may in particular be downloaded over an Internet-type network.

From a design point of view, a computer program according to the invention may use any programming language and may be in the form of source code, object code, or code intermediate between source code and object code (for example, a partially compiled form), or in any other form desirable for implementing a method according to the invention.

Of course, the present invention is in no way limited to the embodiments described and shown in this description; on the contrary, it encompasses any variant within the reach of a person skilled in the art.

References cited

[Syr85]: A.K. Syrdal and S.A. Steele, "Vowel F1 as a function of speaker fundamental frequency", 110th Meeting of JASA, vol. 78, Fall 1985.
[Mou95]: E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech", Speech Communication, vol. 16, pp. 175-205, 1995.
[Sty96]: Y. Stylianou, "Harmonic plus Noise Model for speech, combined with statistical methods, for speech and speaker modification", PhD thesis, Ecole Nationale Supérieure des Télécommunications, France, 1996.
[Kai00]: A. Kain and Y. Stylianou, "Stochastic modeling of spectral adjustment for high quality pitch modification", in Proceedings of ICASSP'00, vol. 2, pp. 949-952, June 2000.
[Hub99]: J. E. Huber, E. T. Stathopoulos, G. M. Curione, T. A. Ash and K. Johnson, "Formants of children, women, and men: the effect of vocal intensity variation", Journal of the Acoustical Society of America, 106 (3), pp. 1532-1542, September 1999.

Claims


A method of modifying the acoustic characteristics of an initial audio signal according to modification setpoints relating at least to the fundamental frequency and the spectral envelope of the initial signal, characterized in that:

- a first modification operation (E12) is applied to the initial signal (S(n)) in order to deliver an intermediate audio signal (S1(n)), the first modification operation being intended to deform the spectral envelope of the initial signal according to said setpoint for modifying the spectral envelope, and

- a second modification operation (E13) is applied to the intermediate signal (S1(n)) in order to deliver a final audio signal (S2(n)), said second operation being intended to modify at least the fundamental frequency of the intermediate signal, according to a modification factor which is determined so as to take into account the effects of the first modification operation on the fundamental frequency of the initial audio signal, so that the fundamental frequency obtained for the final signal conforms to said setpoint relating to the fundamental frequency.

The method of claim 1, wherein:

- the setpoints for modifying the initial audio signal comprise a factor γ for stretching/contracting the spectral envelope of the initial signal along the frequency axis, and factors β and α for modifying, respectively, the fundamental frequency and the duration of the initial signal;

- the first modification operation produces on the initial audio signal, in addition to the desired modification of the spectral envelope, a modification of the fundamental frequency and a modification of the duration, according to second factors β' and α' respectively; and

- the second modification operation is intended to modify the fundamental frequency and the duration of the intermediate audio signal, according to third factors β" and α" respectively, such that: α'·α"=α and β'·β"=β.

The method of claim 2, wherein:

- the first modification operation is implemented by a resampling-type technique with factor γ, where γ greater than 1 corresponds to a stretching of the spectral envelope of the signal and γ between 0 and 1 corresponds to a contraction of the spectral envelope of the signal;

- the second factors β' and α' are defined as a function of the resampling factor γ by the following equations:

β' = γ and α' = \frac{1}{γ};

and

- the third factors β" and α" are obtained from the following equations:

β" = \frac{β}{γ} and α" = α·γ.

The method of any one of the preceding claims, wherein the second modification operation is implemented by a PSOLA-type technique.

The method of any one of claims 2 to 4, wherein the second modification operation is carried out before the first modification operation, the second factors β' and α' being determined beforehand as a function of the factor γ.

The method of any one of claims 2 to 5, wherein the audio signal to be modified is a speech signal, the setpoint modification factors α, β and γ making it possible to modify, respectively, the following parameters relating to the sound rendering of the speech signal: the speech rate, the perceived voice pitch, and the perceived voice timbre.

An audio processing computer program, characterized in that it comprises program instructions suitable for implementing a method according to any one of claims 1 to 6 when said program is executed by a computer system.

An audio processing device adapted to modify the acoustic characteristics of an initial audio signal according to modification setpoints relating at least to the fundamental frequency and the spectral envelope of the initial signal, characterized in that it comprises:

- means for modifying the initial audio signal according to a first modification operation, in order to deliver an intermediate audio signal, the first modification operation being intended to deform the spectral envelope of the initial signal according to said setpoint for modifying the spectral envelope of the signal, and

- means for modifying the intermediate signal according to a second modification operation, in order to deliver a final audio signal, said second operation being intended to modify at least the fundamental frequency of the intermediate signal, according to a modification factor which is determined so as to take into account the effects of the first modification operation on the fundamental frequency of the initial audio signal, so that the fundamental frequency obtained for the final signal conforms to said setpoint relating to the fundamental frequency.

The device of claim 8, characterized in that it comprises means suitable for implementing a modification method according to any one of claims 2 to 6.