




Technical Field
The invention belongs to the field of digital speech signal processing, and in particular relates to a method for real-time decomposition/synthesis of digital speech based on auditory perception characteristics.
Background Art
In daily life there are many kinds of noise. The performance of devices such as speech enhancement and speech recognition systems deteriorates markedly in noisy environments, which limits their application scenarios. The human ear, however, still works normally in noise and has strong sensitivity to sound and strong immunity to interference. It is therefore urgent to realize the auditory perception characteristics of the human ear, and of the basilar membrane in particular, in speech signal processing systems. The perceptual characteristics of the basilar membrane of the human ear are:
1. Frequency selectivity: every frequency has a corresponding resonance point on the basilar membrane. A higher-frequency sound causes a larger vibration near the base of the basilar membrane, while for a lower-frequency sound the strongest response is located at the apex of the basilar membrane.
2. Frequency analysis: the basilar membrane decomposes and maps the various frequencies in a sound onto different positions for perception, yielding a frequency distribution map; at the same time it converts sound intensity into vibration amplitude at the corresponding position. The basilar membrane thus separates the components of different amplitudes and frequencies in a sound and generates the corresponding neural information, which amounts to encoding frequency and intensity, so that the brain can analyze and summarize this information to form different auditory sensations.
3. Bandwidth characteristics: the filtering characteristics differ at every position of the basilar membrane. The apex of the basilar membrane is sensitive to low frequencies, with high resolution and small bandwidth at low frequencies; the base is sensitive to high frequencies, with high resolution and large bandwidth at high frequencies.
The filtering characteristics at each position of the basilar membrane can be described by an auditory filter, so the way the auditory system processes speech can be simulated by a bank of auditory filters. Auditory filters are a class of filters proposed by fitting psychoacoustic experimental data of the auditory system. Using such an auditory filter bank, speech can be decomposed into different subbands, thereby realizing the decomposition and synthesis of speech.
To describe the bandwidth of an auditory filter, the concept of the equivalent rectangular bandwidth (ERB) is often used in research. The ERB is defined as follows: for the same white-noise input, the bandwidth of the rectangular filter that passes the same energy as the filter under test is the equivalent rectangular bandwidth. The ERB is roughly linear in the center frequency fc of the auditory filter, and the relationship can be described by expression (1-1):
ERB(fc) = 24.7(1 + 4.37fc/1000)  (1-1)
The center frequencies fc of the M filters correspond to M positions on the basilar membrane of the human ear, and they are uniformly distributed along the basilar membrane. To better describe this distribution, the concept of the ERB-rate scale (ERBs) is introduced: first the values on the ERBs scale are obtained through expression (1-2),
ERBs(f) = 21.4·log10(1 + 4.37f/1000)  (1-2)
then the ERBs values are divided equally, and finally the center frequencies fc are derived back from them.
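The mapping between frequency and the ERB-rate scale described above can be sketched in Python (a minimal illustration; the function names are ours, and the ERB-rate form is the one obtained by integrating 1/ERB(f) from (1-1)):

```python
import math

def erb(fc):
    """Equivalent rectangular bandwidth in Hz at centre frequency fc, per (1-1)."""
    return 24.7 * (1.0 + 4.37 * fc / 1000.0)

def erbs(f):
    """ERB-rate scale value of frequency f, per (1-2)."""
    return 21.4 * math.log10(1.0 + 4.37 * f / 1000.0)

def erbs_inv(e):
    """Inverse of erbs(): map an ERB-rate value back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def center_frequencies(fL, fH, M):
    """M centre frequencies, equally spaced on the ERB-rate scale between fL and fH."""
    lo, hi = erbs(fL), erbs(fH)
    return [erbs_inv(lo + m * (hi - lo) / (M - 1)) for m in range(M)]

# Example: the 64 channels spanning 50 Hz to 7500 Hz used in the text
fcs = center_frequencies(50.0, 7500.0, 64)
```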
A typical auditory filter bank consists of M Gammatone filters (a Gammatone filterbank); the time-domain expression of each Gammatone filter is:
g(t) = A·t^(N−1)·e^(−2πbt)·cos(2πfc·t + φ)·u(t)  (1-3)
where u(t) is the unit step function; the parameter A is generally a fixed value used mainly for normalization; N is the order of the filter and controls the relative shape of the envelope of the Gammatone function, generally set to N = 4; b is the bandwidth of the function and controls its extent in the time domain (the larger b is, the narrower the range over which the function oscillates), with b = ERB(fc); fc is the center frequency of the filter; φ is the initial phase, which is generally set to 0 because it has little effect on filter performance and the human ear is insensitive to phase.
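The impulse response (1-3) can be sketched directly (a minimal illustration with our own function name; b is taken as ERB(fc), the N = 4 convention stated above):

```python
import math

def gammatone_ir(t, fc, N=4, A=1.0, phi=0.0):
    """Gammatone impulse response g(t) = A t^(N-1) e^(-2 pi b t) cos(2 pi fc t + phi) u(t)."""
    if t < 0:  # u(t): the response is causal
        return 0.0
    b = 24.7 * (1.0 + 4.37 * fc / 1000.0)  # b = ERB(fc), per expression (1-1)
    return (A * t ** (N - 1) * math.exp(-2.0 * math.pi * b * t)
            * math.cos(2.0 * math.pi * fc * t + phi))
```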
Applying the Laplace transform to expression (1-3) yields the s-domain expression:
where Bc = 2πb and wc = 2πfc;
gn is the normalization parameter;
Applying the impulse-invariance method to expression (1-4),
the z-domain expression of the digital filter is obtained:
From expression (1-9) the time-domain iterative equation (1-13) can be obtained. The filter structure, shown in Figure 1, consists of a four-stage cascade, where a1~a4, b1, b2 are the tap coefficients of the filters at each stage and g1~g4 are the normalization coefficients of each stage; the boxes denote delay operations in the transform (Z) domain. The input signal of each stage is weighted by the tap coefficients, delayed, and summed before being passed to the next stage.
x1(k) = x(k)  (1-12)
yn(k) = xn(k) + an·xn(k−1) − b1·yn(k−1) − b2·yn(k−2)  (1-13)
xn+1(k) = gn·yn(k)  (1-14)
y(k) = g4·y4(k)  (1-15)
After passing through the above M Gammatone filters, the input speech is decomposed into M speech signals, each with output ym(k), where m is the index of the Gammatone filter; the index m is omitted in formulas (1-12) to (1-15) above.
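The per-channel iteration (1-12) to (1-15) can be sketched as a cascade of second-order sections (a minimal Python illustration; the coefficient values passed in are placeholders, as the real ones come from expressions (1-10), (1-11) and the normalization parameters):

```python
def gammatone_channel(x, a, b1, b2, g):
    """Run input samples x through a cascade of len(a) second-order sections.

    Per stage n, following (1-12)-(1-15):
        y_n(k)     = x_n(k) + a[n]*x_n(k-1) - b1*y_n(k-1) - b2*y_n(k-2)
        x_{n+1}(k) = g[n]*y_n(k)
    a, g are the per-stage tap and normalization coefficients; b1, b2 are the
    shared denominator taps of the channel.
    """
    N = len(a)
    x_prev = [0.0] * N                         # x_n(k-1) for each stage
    y_hist = [[0.0, 0.0] for _ in range(N)]    # [y_n(k-1), y_n(k-2)] per stage
    out = []
    for sample in x:
        s = sample                             # x_1(k) = x(k), per (1-12)
        for n in range(N):
            y = s + a[n] * x_prev[n] - b1 * y_hist[n][0] - b2 * y_hist[n][1]
            x_prev[n] = s
            y_hist[n][1] = y_hist[n][0]
            y_hist[n][0] = y
            s = g[n] * y                       # feeds the next stage, per (1-14)
        out.append(s)                          # y(k) = g_N * y_N(k), per (1-15)
    return out
```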
In practical applications, the system sometimes also needs to restore the decomposed speech (after noise reduction, recognition, or other processing) to the original speech. Since each channel has a group delay, the group delay Dm can be obtained, the delay of each channel adjusted accordingly, and the speech finally synthesized; the calculation expression is as follows:
The amplitude response of the transfer function of this method is shown in Figure 2. The amplitude is high in the low-frequency range and decreases slowly as the frequency rises. Here the weights of all channels in the synthesized speech are equal, the number of channels is M = 64, and the channel center frequencies range from 50 Hz to 7500 Hz.
The disadvantages of the above method are:
1. The method fixes the order of the Gammatone function at N = 4, which is only a special case of the Gammatone filter; no implementation is given for other values of N.
2. Some key parameters of the method, chiefly the parameter b, the normalization parameters gn, and the channel group delays Dm, are obtained through simulation and lack a theoretical basis for their calculation, which reduces the operability and repeatability of the method.
3. The amplitudes of the individual Gammatone filters in this method are equal, i.e., the weight of every channel is set to 1 when synthesizing speech. However, the loudness perceived by the human ear differs across channels, as shown by the equal-loudness curves of the human ear in Figure 3 (abscissa: frequency in Hz; ordinate: sound pressure level in dB): to reach the same loudness, high frequencies require higher amplitudes and low frequencies require lower amplitudes. As a result, the speech at some frequencies in the synthesized output is suppressed.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a real-time decomposition/synthesis method for digital speech based on auditory perception characteristics. The method provides an implementation of the Gammatone filter for an arbitrary order and derives the normalization parameters gn of the Gammatone filter; according to the delay characteristics of the basilar membrane of the human ear, it also gives the delay Dm of each channel. Finally, the invention improves the speech decomposition and synthesis method with reference to the equal-loudness curves of the human ear, so that the final synthesis result approaches the effect of an ideal bandpass filter.
The real-time decomposition/synthesis method for digital speech based on auditory perception characteristics proposed by the present invention is characterized by the following specific steps:
1) Construct a Gammatone digital filter model of arbitrary order:
Assume the number of filters is M. These M filters correspond to M positions on the basilar membrane of the human ear; they are uniformly distributed along the basilar membrane and logarithmically distributed in the frequency domain. Specifically:
1.1) The sampling rate of the input speech is known to be fs;
let the speech frequency range passed by the filters be [fL, fH], with 0 ≤ fL < fH ≤ fs/2;
1.2) According to expression (1-2), the values of the center frequencies fc on the ERBs scale are distributed over [ERBs(fL), ERBs(fH)]; dividing this interval into M−1 equal parts gives M equally spaced ERBs values, as shown in formula (1):
ERBsm = ERBs(fL) + (m−1)·[ERBs(fH) − ERBs(fL)]/(M−1)  (1)
where m ∈ [1, M] is the channel number;
1.3) From the result of formula (1), the center frequencies fc of the M filters are obtained from their values on the ERBs scale, as shown in formula (2):
fc(m) = (10^(ERBsm/21.4) − 1)·1000/4.37  (2)
1.4) On the relationship between b(fc) and ERB(fc): starting from b = ERB(fc) and applying Parseval's theorem, the bandwidth function b(fc) at center frequency fc of an N-th order Gammatone filter is obtained as shown in formula (2a):
b(fc) = ERB(fc)·2^(2N−2)·((N−1)!)² / (π·(2N−2)!)  (2a)
where b is the bandwidth of the function and N is an arbitrary positive integer;
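The order-dependent bandwidth scaling of step 1.4) can be sketched as follows (a minimal illustration with our own function name; the scaling factor is the Parseval-derived one, which for N = 4 reduces to the familiar b ≈ 1.019·ERB(fc)):

```python
import math

def bandwidth(fc, N):
    """b(fc) for an N-th order Gammatone filter, per formula (2a):
    ERB(fc) scaled by 2^(2N-2) * ((N-1)!)^2 / (pi * (2N-2)!)."""
    erb = 24.7 * (1.0 + 4.37 * fc / 1000.0)
    factor = (2.0 ** (2 * N - 2)) * math.factorial(N - 1) ** 2 \
        / (math.pi * math.factorial(2 * N - 2))
    return erb * factor
```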
1.5) Form an N-th order Gammatone filter from N cascaded second-order bandpass filters. Applying the Laplace transform to the time-domain expression (1-3) of each Gammatone filter yields the s-domain expression shown in formula (2b):
Factoring formula (2b) into a product over its poles and zeros gives expression (2c):
Using the impulse-invariance method, the z-domain expression (2d) of the N-th order Gammatone digital filter is obtained:
where n = 1, 2, …, N; sn is the zero of the numerator of the expression; and an, b1, b2 are the tap coefficients of the filters at each stage;
The expression for an is shown in (1-10); the expressions for b1 and b2 are shown in (1-11):
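A sketch of the denominator taps b1, b2, assuming the standard impulse-invariance pole mapping (the numerator taps an, which depend on the zeros sn, are omitted here because their expressions (1-10) are not reproduced in this text):

```python
import math

def denominator_taps(fc, b, fs):
    """Shared denominator taps b1, b2 of each second-order section.

    Impulse invariance maps the analogue pole s = -Bc + j*wc (with Bc = 2*pi*b
    and wc = 2*pi*fc) to the z-plane pole z = exp(s/fs), so the denominator is
    1 + b1*z^-1 + b2*z^-2 with the values computed below.
    """
    T = 1.0 / fs
    Bc = 2.0 * math.pi * b
    wc = 2.0 * math.pi * fc
    b1 = -2.0 * math.exp(-Bc * T) * math.cos(wc * T)
    b2 = math.exp(-2.0 * Bc * T)
    return b1, b2
```

Since 0 < b2 < 1 for any b > 0, the poles lie inside the unit circle and each section is stable.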
1.6) Compute the normalization parameters gn: the maximum gain of the second-order filter at each stage of the Gammatone filter is shown in formula (2e):
The normalization parameter gn is shown in formula (2f):
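Step 1.6) can be sketched as follows, assuming (as stated later in the detailed description) that each stage's magnitude response peaks at the center frequency fc, so the maximum gain can be evaluated there (the function name and the biquad form 1 + an·z⁻¹ over 1 + b1·z⁻¹ + b2·z⁻² are our assumptions):

```python
import cmath
import math

def stage_norm(an, b1, b2, fc, fs):
    """Normalization coefficient g_n = 1 / Gain_n, in the spirit of (2e), (2f):
    Gain_n is the stage's magnitude response evaluated at the centre frequency."""
    z = cmath.exp(1j * 2.0 * math.pi * fc / fs)   # e^{j*wc*T} on the unit circle
    H = (1.0 + an / z) / (1.0 + b1 / z + b2 / (z * z))
    return 1.0 / abs(H)
```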
1.7) Construct the Gammatone digital filter model of arbitrary order from the N cascaded second-order bandpass filters obtained in step 1.5), and obtain the parameter values of the model. Denoting the m-th Gammatone filter bank by m, the parameter values of each filter bank are obtained from expressions (1-10), (1-11), (2e), and (2f), where the filter tap coefficients of each channel and the normalization coefficients of each channel are as shown in formulas (3) to (6):
2) Speech decomposition stage;
Using the Gammatone digital filter model constructed in step 1), decompose the speech in imitation of the basilar membrane of the human ear: the input speech is decomposed in real time onto M subbands, using M Gammatone filter channels with either a floating-point or a fixed-point algorithm to decompose the input speech into M signals;
3) Speech synthesis stage;
A delay is introduced into the Gammatone filter bank to better match the characteristics of the human ear; the delay of the basilar membrane is inversely proportional to frequency, and the group delay of the Gammatone filter is described by expression (16):
where the group delay tm of channel m is in seconds and the center frequency fc of the m-th filter is in Hz;
The specific steps include:
3.1) Compute the delay of each channel: with speech sampling rate fs, the delay dm of each channel after sampling is calculated with expressions (17) and (18):
dm = D − [fs·tm]  (17)
where D is the maximum of [fs·tm] over all channels;
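Step 3.1) can be sketched as follows. Note the hedge: the patent's expression (16) for tm is not reproduced in this text, so the code below substitutes the envelope group delay of an N-th order Gammatone filter, (N−1)/(2πb(fc)), which is consistent with the stated inverse relation between basilar-membrane delay and frequency but is our assumption, not the patent's formula:

```python
import math

def channel_delays(fcs, fs, N=4):
    """Integer alignment delays d_m = D - [fs*t_m], per expression (17)."""
    def b_of(fc):
        # b = 1.019 * ERB(fc): the N = 4 value of the bandwidth function (2a)
        return 1.019 * 24.7 * (1.0 + 4.37 * fc / 1000.0)
    # t_m assumed as the envelope group delay (N-1)/(2*pi*b) -- see lead-in note
    taps = [round(fs * (N - 1) / (2.0 * math.pi * b_of(fc))) for fc in fcs]
    D = max(taps)                       # expression (18): D = max over [fs*t_m]
    return [D - t for t in taps]        # expression (17)
```

Because tm decreases with frequency, the lowest-frequency channel gets delay 0 and higher channels get progressively larger alignment delays.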
3.2) Weight the contribution of each channel to the overall filter; the synthesized speech is then calculated with expression (8). Let wm be the weight of channel m; this weight is folded into gN, and the adjusted gN is calculated with the following expression:
The final synthesized speech output is then as shown in formula (20):
y(k) = Σ (m = 1 … M) ym(k − dm)  (20)
where ym(k − dm) = 0 when k ≤ dm; the real-time speech decomposition and synthesis task is complete.
The features and beneficial effects of the present invention are:
1) The present invention has a systematic and detailed theoretical derivation and gives theoretical calculation methods for each parameter, enhancing the operability of the algorithm's implementation.
2) The present invention not only performs the speech decomposition operation but also provides the inverse transform of the decomposition, i.e., it supports the subsequent synthesis of the speech.
3) All operations of the present invention are completed in the time domain, avoiding operations such as the Fourier transform and its inverse.
4) The present invention solves the real-time problem: it can decompose and synthesize speech in real time, broadening its range of applications.
5) To address the problem that excessive computational complexity hinders hardware implementation of the algorithm, the present invention proposes a complete fixed-point scheme, saving considerable resources for the hardware implementation. Pipelining is also used, which reduces the critical-path delay and the computational complexity of the method.
Brief Description of the Drawings
Figure 1 is a block diagram of the Gammatone digital filter used in the speech decomposition stage of the existing method.
Figure 2 is the overall amplitude response curve of the speech synthesis stage in the existing method.
Figure 3 shows the equal-loudness curves of the human ear.
Figure 4 is a block diagram of the fixed-point filtering algorithm used by the present invention.
Figure 5 is the overall amplitude response curve of the speech synthesis stage in the present invention.
Detailed Description
The real-time decomposition/synthesis method for digital speech based on auditory perception characteristics proposed by the present invention is further described below with reference to the accompanying drawings and specific embodiments:
The main difference between this method and the prior art is that a bank of Gammatone filters is used to simulate the basilar membrane of the human ear; the filtering characteristics at every position on the basilar membrane can be described by a Gammatone filter. The method also draws on the delay characteristics and equal-loudness curve characteristics of the basilar membrane of the human ear, thereby realizing the decomposition and synthesis of speech.
The specific steps of the method are as follows:
1) Construct a Gammatone digital filter model of arbitrary order (including the bandwidth and center frequency, i.e., position parameter information, of each filter):
Assume the number of filters is M. These M filters correspond to M positions on the basilar membrane of the human ear; they are uniformly distributed along the basilar membrane and logarithmically distributed in the frequency domain. Specifically:
1.1) The sampling rate of the input speech is known to be fs;
let the speech frequency range passed by the filters be [fL, fH], with 0 ≤ fL < fH ≤ fs/2;
1.2) According to expression (1-2), the values of the center frequencies fc on the ERBs scale are distributed over [ERBs(fL), ERBs(fH)]; dividing this interval into M−1 equal parts gives M equally spaced ERBs values, as shown in formula (1):
ERBsm = ERBs(fL) + (m−1)·[ERBs(fH) − ERBs(fL)]/(M−1)  (1)
where m ∈ [1, M] is the channel number;
1.3) From the result of formula (1), the center frequencies fc of the M filters are obtained from their values on the ERBs scale, as shown in formula (2):
fc(m) = (10^(ERBsm/21.4) − 1)·1000/4.37  (2)
1.4) On the relationship between b(fc) and ERB(fc): starting from b = ERB(fc) and applying Parseval's theorem, the bandwidth function b(fc) at center frequency fc of an N-th order Gammatone filter is obtained as shown in formula (2a):
b(fc) = ERB(fc)·2^(2N−2)·((N−1)!)² / (π·(2N−2)!)  (2a)
where b is the bandwidth of the function and N is an arbitrary positive integer;
1.5) Form an N-th order Gammatone filter from N cascaded second-order bandpass filters. Applying the Laplace transform to the time-domain expression (1-3) of each Gammatone filter yields the s-domain expression shown in formula (2b):
Factoring formula (2b) into a product over its poles and zeros gives expression (2c):
Using the impulse-invariance method, the z-domain expression (2d) of the N-th order Gammatone digital filter is obtained:
where n = 1, 2, …, N; sn is the zero of the numerator of the expression; and an, b1, b2 are the tap coefficients of the filters at each stage;
The expression for an is shown in (1-10); the expressions for b1 and b2 are shown in (1-11):
Expressions (1-4) and (1-9) are thus generalized to the case where N is any positive integer. The above result constructs an N-th order Gammatone filter from N cascaded second-order bandpass filters.
1.6) Compute the normalization parameters gn: (since the amplitude response curve of the Gammatone filter is approximately symmetric, its maximum amplitude is attained at the center frequency fc,) the maximum gain of the second-order filter at each stage of the Gammatone filter is therefore as shown in formula (2e):
The normalization parameter gn is shown in formula (2f):
1.7) Construct the Gammatone digital filter model of arbitrary order from the N cascaded second-order bandpass filters obtained in step 1.5), and obtain the parameter values of the model. Denoting the m-th Gammatone filter bank by m, the parameter values of each filter bank are obtained from expressions (1-10), (1-11), (2e), and (2f), where the filter tap coefficients of each channel and the normalization coefficients of each channel are as shown in formulas (3) to (6):
2) Speech decomposition stage;
Using the Gammatone digital filter model constructed in step 1), decompose the speech in imitation of the basilar membrane of the human ear: the input speech is decomposed in real time onto M subbands, the minimum processing unit being a single speech sample; the entire process is carried out in the time domain (no transform of the speech to the frequency domain is needed), yielding M channels of speech data;
First assume the input speech is x(k) with sampling rate fs. The input speech is decomposed into M signals by M Gammatone filter channels using either a floating-point or a fixed-point algorithm, with the output signal of each channel denoted ym(k). Specifically:
For software simulation, a floating-point algorithm passes the input speech in turn through the M Gammatone filter channels to obtain M groups of speech output signals, as shown in formulas (7) to (10):
where m ∈ [1, M] is the channel number and n ∈ [1, 4] indicates the specific stage of the four-stage cascade described by the expression; ym(k) is the speech output of each channel, and the remaining symbols denote the speech input of each channel, the filter tap coefficients of each channel, and the normalization coefficients of each channel, respectively;
For hardware implementation, a fixed-point algorithm passes the input speech in turn through the M Gammatone filter channels to obtain M groups of speech output signals.
(To address the problem that excessive computational complexity hinders hardware implementation of the algorithm, the present invention proposes a complete fixed-point scheme, saving considerable resources for the hardware implementation; this algorithm likewise passes the input speech in turn through the M Gammatone filter channels to obtain M groups of speech output signals. Figure 4 is a block diagram of the fixed-point filtering algorithm used by the present invention; the flow is similar to Figure 1, but all parameters are the results of fixed-point processing. After the improvement, the computation time period of the algorithm is shortened to 1/4 of the original, raising the computing capability of the algorithm by a factor of 4 and thereby reducing the consumption of computing resources and power. The scheme includes the following steps:
2.1) Apply fixed-point processing to the parameters of each filter bank, i.e., scale each parameter by E = 2^p and then round to an integer, as shown in formulas (11) to (14):
where [·] denotes the integer nearest to its argument;
2.2) Apply fixed-point processing to the input speech signal of the n-th stage of the m-th Gammatone filter channel and to its intermediate computation results: from expression (2e), the variation of the maximum gain Gain with the center frequency fc is obtained, and from it the overall maximum gain Gainmax. Therefore, when the input speech is L bits wide, the bit width of the intermediate results is set to Q bits, where the value of Q is:
Q = L + [log2(Gainmax)]  (15)
where [·] here denotes the smallest integer not less than its argument; this yields the fixed-point filtering algorithm shown in Figure 4, together with the input and output of each speech channel;
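Steps 2.1) and 2.2) can be sketched together (a minimal illustration with our own function names):

```python
import math

def fixed_point_params(params, p):
    """Scale each coefficient by E = 2**p and round to the nearest integer,
    following formulas (11)-(14)."""
    E = 2 ** p
    return [round(v * E) for v in params]

def intermediate_bitwidth(L, gain_max):
    """Bit width Q of intermediate results, Q = L + ceil(log2(Gain_max)),
    per formula (15): headroom so the maximum accumulated gain cannot overflow."""
    return L + math.ceil(math.log2(gain_max))
```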
3) Speech synthesis stage;
In step 2) the speech signal passes through the N-th order Gammatone filters and is decomposed onto M subbands; the decomposed speech signals can be subjected to processing such as speech enhancement and speech recognition (for example, common speech enhancement algorithms such as beamforming and computational auditory scene analysis); after processing, the channel signals can be recombined by direct superposition, thereby restoring the speech more faithfully.
In the synthesis stage, the present invention draws on the neural delay characteristics of the basilar membrane of the human ear and gives the channel delay (time-domain delay) of the Gammatone filters. The neural delay of the basilar membrane means that the time needed for the basilar membrane to receive a speech signal and pass it on to the brain differs for sounds of different frequencies; introducing a certain amount of delay into the Gammatone filter bank therefore better matches the characteristics of the human ear. The basilar-membrane delay is inversely proportional to frequency. Based on the above analysis, the group delay of the Gammatone filter (how fast the phase changes with frequency) is described by expression (16):
where the group delay tm of channel m is in seconds and the center frequency fc of the m-th filter is in Hz.
The speech synthesis process of the present invention draws on the delay characteristics of the basilar membrane: before synthesis, an appropriate delay is introduced into the output of each channel, and the outputs are then added directly. This greatly reduces the mutual interference between channels and allows the synthesis and decomposition of the digital speech to be computed point by point, achieving real-time processing. The specific steps include:
3.1) Compute the delay of each channel: with speech sampling rate fs, the delay dm of each channel after sampling is calculated with expressions (17) and (18):
dm = D − [fs·tm]  (17)
where D is the maximum of [fs·tm] over all channels.
3.2) (According to the equal-loudness curves of the human ear shown in Figure 3, reaching the same loudness requires higher amplitudes at high frequencies and lower amplitudes at low frequencies.) Weight the contribution of each channel to the overall filter; the synthesized speech is then calculated with expression (8). Let wm be the weight of channel m; in practice this weight can be folded into gN, and the adjusted gN is calculated with the following expression:
The final synthesized speech output is then as shown in formula (20):
y(k) = Σ (m = 1 … M) ym(k − dm)  (20)
where ym(k − dm) = 0 when k ≤ dm; the real-time speech decomposition and synthesis task is complete.
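The delay-and-superpose synthesis of step 3.2) can be sketched as follows (a minimal illustration; weights are passed explicitly here rather than folded into gN, and samples before a channel's alignment delay are treated as zero, as stated above):

```python
def synthesize(channels, weights, delays):
    """Synthesis in the spirit of formula (20): y(k) = sum_m w_m * y_m(k - d_m).

    channels[m] is the output of the m-th filter channel, weights[m] its
    equal-loudness weight w_m, and delays[m] its integer delay d_m from (17).
    """
    length = max(len(c) + d for c, d in zip(channels, delays))
    y = [0.0] * length
    for ch, w, d in zip(channels, weights, delays):
        for k, v in enumerate(ch):
            y[k + d] += w * v      # channel m contributes nothing before d_m
    return y
```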
Figure 5 shows the amplitude response curve of the speech synthesis stage after the improvement by the method of the present invention. After the channel weights are adjusted according to the equal-loudness curves of the human ear, the overall amplitude response curve of the speech synthesis method approaches that of an ideal bandpass filter, with number of channels M = 64 and channel center frequencies distributed from 50 Hz to 7500 Hz; the amplitude response is large within the frequency range up to 7500 Hz and decays rapidly beyond the upper frequency limit.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201611026399.6A (CN106601249B) | 2016-11-18 | 2016-11-18 | A real-time decomposition/synthesis method of digital speech based on auditory perception characteristics |
| Publication Number | Publication Date |
|---|---|
| CN106601249A | 2017-04-26 |
| CN106601249B | 2020-06-05 |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |