CN102054480A

Movatterモバイル変換

Info

Publication number: CN102054480A
Application number: CN2009102359018A
Authority: CN
Inventors: 茹婷婷; 谢湘; 匡镜明
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2009-10-29
Filing date: 2009-10-29
Publication date: 2011-05-11
Anticipated expiration: 2029-10-29
Also published as: CN102054480B

Abstract

本发明涉及一种基于分数阶傅立叶变换的单声道混叠语音分离方法，属于音频信号处理技术领域。首先对混叠语音信号进行预处理，去除其静音段信号，找出浊音帧。然后，基于分数阶傅立叶变换，浊音帧信号进行基音检测，分离出混叠语音的基频，最后各条基频结合语音信号的正弦模型来合成语音，从而得到分离后的各个语音信号。本发明可有效的分离并提取出多个混叠语音的基频，最终实现混叠语音的有效分离；采用基于FrFT代替传统的FFT来提取基音频率，减少了谐波频谱的延展，得到更为准确的原始信号的基频。本发明尤其适用于分离含有两个人语音的单声道混叠语音。

The invention relates to a monophonic aliasing speech separation method based on fractional Fourier transform, and belongs to the technical field of audio signal processing. Firstly, the aliased speech signal is preprocessed to remove the silent segment signal and find out the voiced sound frame. Then, based on the fractional Fourier transform, the voiced sound frame signal is subjected to pitch detection to separate the fundamental frequency of the aliased speech, and finally each fundamental frequency is combined with the sinusoidal model of the speech signal to synthesize the speech, thereby obtaining the separated speech signals. The present invention can effectively separate and extract the fundamental frequencies of a plurality of aliased voices, and finally realize the effective separation of aliased voices; the fundamental frequency is extracted by replacing the traditional FFT based on FrFT, which reduces the extension of the harmonic spectrum and obtains more Accurate fundamental frequency of the original signal. The invention is particularly suitable for separating monophonic aliased speech containing two human voices.

Description

Translated fromChinese

一种基于分数阶傅立叶变换的单声道混叠语音分离方法A Monophonic Aliasing Speech Separation Method Based on Fractional Fourier Transform

技术领域technical field

本发明涉及一种利用分数阶傅立叶变换进行单声道混叠语音分离的方法，属于音频信号处理技术领域。The invention relates to a method for separating monophonic aliasing speech by using fractional Fourier transform, and belongs to the technical field of audio signal processing.

背景技术Background technique

在语音和听觉信号处理领域中，有一个重要的问题是如何从混叠语音信号中分离出人们感兴趣的语音。混叠语音分离在语音通信、声学目标检测、声音信号增强等方面都有重要的理论意义和使用价值，但由于构成混叠语音的各个源语音信号在时域和频域上完全重叠，常用的语音增强方法难以将人们所感兴趣的语音(称为目标语音)从干扰语音中分离出来。In the field of speech and auditory signal processing, an important problem is how to separate the speech of interest from the aliased speech signal. Aliasing speech separation has important theoretical significance and practical value in speech communication, acoustic target detection, sound signal enhancement, etc. It is difficult for speech enhancement methods to separate the speech that people are interested in (called the target speech) from the interference speech.

分数阶傅立叶变换(Fractional Fourier Transform，FrFT)对于分析某些非平稳信号具有十分优良的特性，成为一种近年来引起信号处理界广泛关注的工具。作为非平稳信号的语音，FrFT或者类似的变换在语音信号处理中的应用目前主要集中在以下几个方面：语音分析，可以给出比传统的傅立叶变换方法更高的时频分辨率；基音估计，可以给出比传统方法更精确的基音估计；语音增强；语音识别；以及说话人识别等。Fractional Fourier Transform (FrFT) has excellent characteristics for analyzing some non-stationary signals, and has become a tool that has attracted widespread attention in the signal processing field in recent years. Speech as a non-stationary signal, the application of FrFT or similar transforms in speech signal processing is currently mainly focused on the following aspects: speech analysis, which can give higher time-frequency resolution than traditional Fourier transform methods; pitch estimation , can give more accurate pitch estimation than traditional methods; speech enhancement; speech recognition; and speaker recognition, etc.

在混叠语音分离方面的研究，主要分为听觉场景分析(Auditory Scene Analysis，ASA)和盲源分离(Blind Source Separation，BSS)两类。听觉场景分析的研究有两种方法：一种是从人的听觉生理及心理特性出发，研究人在声音识别过程中的规律，即听觉场景分析；另一种是利用对人听觉感知的研究成果建立模型，对模型进行数学分析并用计算机来实现它，这是计算听觉场景分析(Computational Auditory Scene Analysis，CASA)所要研究的内容。盲源分离是指在源信号、传输通道特性未知的情况下，仅由观测信号和源信号的一些先验知识(如概率密度)来估计出源信号各个分量的过程。盲源分离的独立分量分析方法首先是由P.Comon提出，它是基于神经网络和统计学的基础发展起来的一种技术，是一个十分活跃的前沿领域。Research on aliasing speech separation is mainly divided into two categories: Auditory Scene Analysis (ASA) and Blind Source Separation (BSS). There are two methods for the study of auditory scene analysis: one is to study the rules in the process of sound recognition from the perspective of human auditory physiological and psychological characteristics, that is, auditory scene analysis; the other is to use the research results of human auditory perception Building a model, mathematically analyzing the model and implementing it with a computer are what Computational Auditory Scene Analysis (CASA) is about. Blind source separation refers to the process of estimating the components of the source signal only by some prior knowledge (such as probability density) of the observed signal and the source signal when the characteristics of the source signal and the transmission channel are unknown. The independent component analysis method of blind source separation was first proposed by P.Comon. It is a technology developed on the basis of neural network and statistics, and it is a very active frontier field.

现有的混叠语音分离方法主要存在以下不足：The existing aliasing speech separation methods mainly have the following deficiencies:

(1)听觉场景分析和计算听觉场景分析的研究还处于起步阶段。特别是在计算听觉场景分析研究中，所建立的模型只能用于验证听觉场景分析研究中的一些不够明了的理论，即人脑处理听觉信号的机制。(1) The research on auditory scene analysis and computational auditory scene analysis is still in its infancy. Especially in computational auditory scene analysis research, the established models can only be used to verify some unclear theories in auditory scene analysis research, that is, the mechanism of human brain processing auditory signals.

针对盲源分离方法的研究非常活跃，但对这个问题还没有得到很好的解决，其涉及到多通道卷积混叠系统和盲反卷积系统的稳定性及相位不确定性问题，尤其是当源的数目未知时盲反卷积问题以及带噪声的情况。Research on blind source separation methods is very active, but this problem has not been well resolved, which involves the stability and phase uncertainty of multi-channel convolution aliasing systems and blind deconvolution systems, especially The blind deconvolution problem when the number of sources is unknown and the noisy case.

(2)混叠语音的基频分离提取是听觉场景分析中实现混叠语音分离的关键，但现有的混叠语音基频分离提取方法只考虑浊音与浊音的混叠，不考虑清音与浊音的混叠。这是因为在语音信号的清音帧中，激励信号是无周期性的，因此估计清音帧的基频并没有实际意义。不仅如此，清音帧估计出来的基频通常随机性强，不具有连续性，而从混叠语音中分离提取出的基频是以基频的连续性来判断其归属，所以，清音帧估计出的基频会影响基音归属判断，进而影响基频的平滑处理效果。(2) The fundamental frequency separation and extraction of aliased speech is the key to realize the separation of aliased speech in auditory scene analysis, but the existing method of fundamental frequency separation and extraction of aliased speech only considers the aliasing of voiced and voiced sounds, and does not consider unvoiced and voiced sounds aliasing. This is because in the unvoiced frame of the speech signal, the excitation signal is aperiodic, so estimating the fundamental frequency of the unvoiced frame has no practical significance. Not only that, the base frequency estimated by the unvoiced frame is usually highly random and not continuous, while the base frequency extracted from the aliased speech is judged by the continuity of the base frequency. Therefore, the estimated base frequency of the unvoiced frame is The fundamental frequency will affect the pitch attribution judgment, and then affect the smoothing effect of the fundamental frequency.

发明内容Contents of the invention

本发明的目的是为克服现有技术的缺陷，解决如何从单声道混叠语音信号中分离出目标语音的问题，提出一种新的基于分数阶傅立叶变换的单声道混叠语音分离方法。The purpose of the present invention is to overcome the defective of prior art, solve the problem how to separate target speech from monophonic aliasing speech signal, propose a kind of new monophonic aliasing speech separation method based on fractional order Fourier transform .

本发明所采用的技术方案如下：The technical scheme adopted in the present invention is as follows:

一种基于分数阶傅立叶变换的单声道混叠语音分离方法，包括以下步骤：A monophonic aliasing speech separation method based on fractional Fourier transform, comprising the following steps:

步骤一、对混叠语音信号进行预处理，去除其静音段信号，找出浊音帧。Step 1: Perform preprocessing on the aliased speech signal, remove the silent segment signal, and find out the voiced sound frame.

首先，对混叠语音信号进行端点检测，去除其静音段信号，把剩余的混叠段信号作为处理对象。First, endpoint detection is performed on the aliased speech signal, the silent segment signal is removed, and the remaining aliased segment signal is taken as the processing object.

然后，对剩余混叠段信号进行分帧处理，并进行清浊音判断，标出浊音帧。Then, the remaining aliasing segment signals are divided into frames, unvoiced and voiced are judged, and voiced frames are marked.

步骤二、基于分数阶傅立叶变换，对经步骤一处理后的浊音帧信号进行基音检测，分离出混叠语音的基音轨迹，也就是每个源信号的基频，过程如下：Step 2. Based on the fractional Fourier transform, perform pitch detection on the voiced sound frame signal processed in step 1, and separate the pitch track of the aliased speech, that is, the fundamental frequency of each source signal. The process is as follows:

首先，根据每帧信号的连续性计算出FrFT的阶数。然后，对浊音帧信号重新进行FrFT变换，求得谐波积谱，再用动态规划方法提取出其中一个人的基频，即一个源信号的基频。First, the order of FrFT is calculated according to the continuity of each frame signal. Then, perform FrFT transformation on the voiced sound frame signal again to obtain the harmonic product spectrum, and then use the dynamic programming method to extract the fundamental frequency of one of them, that is, the fundamental frequency of a source signal.

当搜出一个人的基频之后，在谐波积谱中减去此人的基频和谐波所对应的谱成分，然后再使用一次动态规划，即可得到另一个人的基频，，即另一个源信号的基频；When a person's fundamental frequency is found, subtract the spectral component corresponding to the fundamental frequency and harmonics of the person from the harmonic product spectrum, and then use dynamic programming again to obtain the fundamental frequency of another person, That is, the fundamental frequency of the other source signal;

重复上述过程，即可得到每个源信号的基频。By repeating the above process, the fundamental frequency of each source signal can be obtained.

步骤三、由于语音信号能够用一组正弦信号的叠加表示，因此，根据经步骤二得到的各条基频，结合语音信号的正弦模型来合成语音，从而得到分离后的各个语音信号。Step 3. Since the voice signal can be expressed by superposition of a group of sinusoidal signals, the voice is synthesized according to each fundamental frequency obtained in step 2 and combined with the sinusoidal model of the voice signal, so as to obtain the separated voice signals.

本发明的积极效果和优点在于：Positive effect and advantage of the present invention are:

1.使用本发明方法，可有效的分离并提取出多个混叠语音的基频，从而实现混叠语音的有效分离。1. Using the method of the present invention, the fundamental frequencies of multiple aliased voices can be effectively separated and extracted, thereby realizing effective separation of aliased voices.

2.采用基于FrFT代替传统的FFT(短时傅立叶变换)来提取基音频率，减少了谐波频谱的延展。2. Using FrFT instead of the traditional FFT (short-time Fourier transform) to extract the pitch frequency, reducing the extension of the harmonic spectrum.

3.由于每帧信号都有其固有的调制频率，使用FrFT可以选择合适的阶数使其符合信号固有的调频率，从而得到更为准确的原始信号的基频。3. Since each frame signal has its inherent modulation frequency, FrFT can be used to select an appropriate order to match the inherent modulation frequency of the signal, thereby obtaining a more accurate fundamental frequency of the original signal.

本发明尤其适用于分离含有两个人语音的单声道混叠语音。The invention is particularly suitable for separating monophonic aliased speech containing two human voices.

附图说明Description of drawings

图1为本发明方法的实现流程框图。Fig. 1 is a block diagram of the implementation flow of the method of the present invention.

图2为本发明方法中的基于分数阶傅立叶变换的混叠语音基音检测流程图。Fig. 2 is a flow chart of aliasing speech pitch detection based on fractional Fourier transform in the method of the present invention.

具体实施方式Detailed ways

下面结合附图对本发明的优选实施方式作进一步说明。Preferred embodiments of the present invention will be further described below in conjunction with the accompanying drawings.

一种基于分数阶傅立叶变换的单声道混叠语音分离方法，其实现流程如图1所示，包括以下步骤：A monophonic aliasing speech separation method based on fractional Fourier transform, its implementation process is as shown in Figure 1, including the following steps:

首先，对混叠语音信号进行端点检测，去除其静音段信号，把剩余的混叠段信号作为处理对象。端点检测可采用短时能量和过零率相结合的方法。First, endpoint detection is performed on the aliased speech signal, the silent segment signal is removed, and the remaining aliased segment signal is taken as the processing object. Endpoint detection can use the combination of short-term energy and zero-crossing rate.

然后，对剩余混叠段信号进行分帧处理，分帧时的帧长为20ms，帧移为10ms。此时，进行清浊音判断，并标出浊音帧。混叠语音信号的清浊音判断与单个语音的判断稍有不同，两个混叠语音的清浊情况有3种：双浊音、一清一浊、双清音。混叠语音的清浊音判断分为两步：先判断两个混叠信号是否为双清音，若是，判断结束，若不是，再判断两混叠信号是一清一浊还是双浊音。对于一清一浊，只对浊音帧进行后续处理，不处理清音帧。对于双清音信号，同样不对其进行处理。Then, the remaining aliasing section signals are divided into frames, the frame length is 20ms, and the frame shift is 10ms. At this time, unvoiced and voiced sound is judged, and the voiced sound frame is marked. The unvoiced and voiced judgment of aliased speech signals is slightly different from the judgment of a single voice. There are three types of unvoiced voices for two aliased voices: double voiced, one voiced and one voiced, and double unvoiced. The unvoiced and voiced judgment of aliased speech is divided into two steps: firstly judge whether the two aliased signals are double voiced, if so, the judgment ends, if not, then judge whether the two aliased signals are one voiced and one voiced or double voiced. For one unvoiced and one voiced, only the voiced frames are processed, and the unvoiced frames are not processed. As for the double voiceless signal, it is also not processed.

步骤二、采用基于分数阶傅立叶变换方式，对经步骤一处理后的浊音帧进行基音检测，分离出混叠语音的基音轨迹，也就是分离出每个源信号的基频。其实现流程如图2所示。Step 2: Perform pitch detection on the voiced sound frame processed in step 1 by adopting a method based on fractional Fourier transform, and separate the pitch track of the aliased speech, that is, separate the fundamental frequency of each source signal. Its implementation process is shown in Figure 2.

首先，根据每帧信号的连续性，计算出FrFT的阶数。考虑到目的是求解语音信号的基频，而且是用帧问连续的特性来搜索基频，FrFT的阶数α_i与前后两帧的基频密切相关，因此用下式表示：First, according to the continuity of each frame signal, the order of FrFT is calculated. Considering that the purpose is to find the fundamental frequency of the speech signal, and the fundamental frequency is searched by the continuous characteristics between frames, the order α_i of FrFT is closely related to the fundamental frequency of the two frames before and after, so it is expressed by the following formula:

${α α}_{i i} = = 11 - - | | \frac{{p p}_{i i} - - {p p}_{i i - - 11}}{{p p}_{i i} + + {p p}_{i i + + 11}} | | - - - - - - 1.1 1.1$

其中，p_i-1，p_i，p_i+1分别为前一帧、当前帧和下一帧的估计基频，p_i-1，p_i，p_i+1可通过短时傅立叶变换获得。Among them, p_i-1 , p_i , p_i+1 are the estimated fundamental frequencies of the previous frame, the current frame and the next frame respectively, and p_i-1 , p_i , p_i+1 can be obtained by short-time Fourier transform .

然后，对经步骤一处理后得到的浊音帧信号重新进行FrFT变换，求得谐波积谱，再用动态规划方法提取出其中一条基音轨迹，也就是其中一个人的基频。具体过程如下：Then, perform FrFT transformation again on the voiced sound frame signal processed in step 1 to obtain the harmonic product spectrum, and then use the dynamic programming method to extract one of the pitch tracks, that is, the fundamental frequency of one of the people. The specific process is as follows:

(1)对浊音帧信号x(n)，采用下式进行N点(例如1024点)的分数阶傅立叶变换，得到其幅度谱X(α，k)：(1) For the voiced sound frame signal x(n), use the following formula to perform fractional Fourier transform of N points (for example, 1024 points) to obtain its amplitude spectrum X(α, k):

X(α，k)＝FrFT_N{x(n)} 1.2X(α,k)=FrFT_N {x(n)} 1.2

再将幅度谱X(α，k)变换到对数域，得到对数幅度谱SLog(α，k)：Then transform the magnitude spectrum X(α, k) into the logarithmic domain to obtain the logarithmic magnitude spectrum SLog(α, k):

SLog(α，k)＝log₁₀(|X(α，k)|²) 1.3SLog(α, k) = log₁₀ (|X(α, k)|² ) 1.3

将一帧信号内的所有谐波对数谱SLog(α，k)进行求和，得到谐波积谱ρ(α，f)：Sum all the harmonic logarithmic spectra SLog(α, k) in a frame signal to obtain the harmonic product spectrum ρ(α, f):

$ρ ρ ((α α,, f f)) = = \frac{11}{H h} {Σ Σ}_{h h = = 11}^{H h} SLog SLog ((α α,, hf hf)) - - - - - - 1.4 1.4$

式1.4中，H为抽样带宽内的谐波个数，h为谐波索引的值，f为每帧的基频，α为每帧的阶数。In formula 1.4, H is the number of harmonics within the sampling bandwidth, h is the value of the harmonic index, f is the fundamental frequency of each frame, and α is the order of each frame.

(2)考虑到两个语音的混叠，从谐波积谱ρ(α，f)中提取出可能含有基频成分的M个候选峰值。考虑到计算量的问题，M的取值要大于等于3。当M大于等于3时，得到的结果基本没有变化。(2) Considering the aliasing of the two speeches, extract M candidate peaks that may contain fundamental frequency components from the harmonic product spectrum ρ(α,f). Considering the amount of calculation, the value of M should be greater than or equal to 3. When M is greater than or equal to 3, the obtained results basically do not change.

动态规划方法中需要设定一个指标函数，对每条路径均计算其指标函数的值，最大值所对应的路径即为所要求的其中一条基音轨迹。为了防止在基音周期的估计过程中出现半频错误或倍频错误，将指标函数c(α，f)设定为：In the dynamic programming method, an index function needs to be set, and the value of the index function is calculated for each path, and the path corresponding to the maximum value is one of the required pitch trajectories. In order to prevent half-frequency errors or octave errors in the estimation process of the pitch period, the index function c(α, f) is set as:

c(α，f)＝k(f)*(P(α，f)-P(α，f/2)) 1.5c(α,f)=k(f)*(P(α,f)-P(α,f/2)) 1.5

式1.5中，f为每帧信号的估计基频，k(f)为伴随f递减的函数。设定加权值k(f)能够避免倍频错误，引入P(α，f/2)能够避免半频错误。因此，将(α_i，f_i)记为μ_i，路径的评分函数S_i(μ_i)设定为：In Equation 1.5, f is the estimated fundamental frequency of each frame signal, and k(f) is a function that decreases with f. Setting the weighted value k(f) can avoid multiplication errors, and introducing P(α, f/2) can avoid half-frequency errors. Therefore, record (α_i , f_i ) as μ_i , and the scoring function S_i (μ_i ) of the path is set as:

${S S}_{i i} (({μ μ}_{i i})) = = {S S}_{i i - - 11} (({μ μ}_{i i - - 11}^{* *})) + + c c (({μ μ}_{i i})) - - - - - - 1.6 1.6$

${μ μ}_{i i - - 11}^{* *} = = {arg arg}_{{μ μ}_{i i - - 11}}^{max max} [[{s the s}_{i i - - 11} (({μ μ}_{i i - - 11})) + + c c (({μ μ}_{i i}))]] - - - - - - 1.7 1.7$

式1.6、1.7中，i表示帧号，

是在选择合适的阶数以及得到第i-1帧基频时的参数。由于正常人说话的基频范围为50Hz 400Hz，因此在此范围内搜索基频，在每帧信号的两个峰值点里均能够找到选择使评分函数S_i(μ_i)最大的f值，即认为是这一帧信号中其中一个人的基频。同理，当搜索所有的信号之后，可以连成一条基音轨迹，从而得到其中一个人的基音轨迹(即此人的基频)。In formulas 1.6 and 1.7, i represents the frame number,

It is a parameter when selecting an appropriate order and obtaining the fundamental frequency of the i-1th frame. Since the fundamental frequency range of normal people's speech is 50Hz to 400Hz, the fundamental frequency is searched within this range, and the f value that maximizes the scoring function S_i (μ_i ) can be found in the two peak points of each frame signal, namely It is considered to be the fundamental frequency of one of the persons in this frame signal. Similarly, after all the signals are searched, a pitch track can be connected to obtain the pitch track of one of them (that is, the fundamental frequency of the person).

当搜出一个人的基频之后，在谐波积谱ρ(α，p)中减去此人的基频和谐波所对应的谱成分，然后再使用一次动态规划方法，即可得到另一个人的基音轨迹(即此人的基频)，从而分离出混叠语音的基音轨迹。After searching out a person's fundamental frequency, subtract the spectral components corresponding to the fundamental frequency and harmonics of the person from the harmonic product spectrum ρ(α, p), and then use the dynamic programming method again to obtain another A person's pitch locus (that is, the person's fundamental frequency), thereby separating the pitch locus of the aliased speech.

求取谐波所对应的谱成分的方法如下：The method to obtain the spectral components corresponding to the harmonics is as follows:

在谐波积谱中减去谐波所对应的谱成分时，首先要知道谐波个数H_i，由此即能获知究竟需要减去几个谱成分。根据式1.8，可得到第i帧信号的谐波个数H_i，When subtracting the spectral components corresponding to the harmonics from the harmonic product spectrum, the number of harmonics H_i must first be known, so that it can be known how many spectral components need to be subtracted. According to formula 1.8, the number of harmonics H_i of the i-th frame signal can be obtained,

${H h}_{i i} = = \frac{{f f}_{s the s} / / 22}{{f f}_{i i}} - - - - - - ((1.8 1.8))$

式1.8中，f_i为第i帧的基频，f_s为采样率。则谐波频率f′和基频f的关系如下：In formula 1.8, f_i is the fundamental frequency of frame i, and f_s is the sampling rate. Then the relationship between the harmonic frequency f' and the fundamental frequency f is as follows:

f′＝h*f，h＝2，3，4，...，H 1.9f'=h*f, h=2, 3, 4,..., H 1.9

式1.9中，H为谐波个数。得到了谐波频率f′，即获知了谐波所对应的谱成分。In formula 1.9, H is the number of harmonics. The harmonic frequency f' is obtained, that is, the spectral component corresponding to the harmonic is obtained.

步骤三、由于语音信号能够用一组正弦信号的叠加表示，因此，根据经步骤二得到的各条基频f_i，结合语音信号的正弦模型来合成语音，从而得到分离后的各个语音信号。Step 3. Since the speech signal can be represented by a superposition of a group of sinusoidal signals, the speech is synthesized according to each fundamental frequency f_i obtained in step 2 and the sinusoidal model of the speech signal, so as to obtain separated speech signals.

Claims

1. monophony aliasing speech separating method based on fraction Fourier conversion is characterized in that may further comprise the steps:

Step 1, the aliasing voice signal is carried out pre-service, remove its quiet segment signal, find out unvoiced frame;

Step 2, based on fraction Fourier conversion, the unvoiced frame signal after step 1 is handled is carried out pitch Detection, isolate the pitch contour of aliasing voice, the fundamental frequency of each source signal just, process is as follows:

At first, calculate the exponent number of FrFT, then, the unvoiced frame signal is carried out the FrFT conversion again, try to achieve the long-pending spectrum of harmonic wave, extract one of them people's fundamental frequency again with dynamic programming method, i.e. the fundamental frequency of a source signal according to the continuity of every frame signal;

After the fundamental frequency of finding a people, in the long-pending spectrum of harmonic wave, deduct this person's fundamental frequency and the pairing spectrum composition of harmonic wave, and then use a dynamic programming, can obtain another person's fundamental frequency, i.e. the fundamental frequency of another source signal;

Repeat said process, can obtain the fundamental frequency of each source signal;

Step 3, according to each the bar fundamental frequency that obtains through step 2, come synthetic speech in conjunction with the sinusoidal model of voice signal, thus each voice signal after obtaining separating.

2. a kind of monophony aliasing speech separating method based on fraction Fourier conversion as claimed in claim 1 is characterized in that, in the described step 1, after removing quiet segment signal, the method for residue alias band signal being carried out the processing of branch frame is as follows:

Frame length when dividing frame is 20ms, and frame moves and is 10ms, at this moment, carries out pure and impure sound and judges, and mark unvoiced frame; The pure and impure sound judgement of aliasing voice was divided into for two steps: judge earlier whether two aliasing signals are two voicelesss sound, if, judge and finish, if not, judge that again two aliasing signals are one clear one turbid or pair voiced sounds; For one clear one turbid, only unvoiced frame is carried out subsequent treatment, do not handle unvoiced frames; For two voiceless sound signals, it is not handled equally.

3. a kind of monophony aliasing speech separating method based on fraction Fourier conversion as claimed in claim 1 or 2 is characterized in that, in step 2, and when calculating the exponent number of FrFT, the exponent number α of FrFT_iRepresent with following formula with the fundamental frequency of front and back two frames:

α_{i} = 1 - | \frac{p_{i} - p_{i - 1}}{p_{i} + p_{i + 1}} | - - - 1.1

Wherein, p_I-1, p_i, p_I+1Be respectively the estimation fundamental frequency of former frame, present frame and next frame.

4. a kind of monophony aliasing speech separating method as claimed in claim 1 or 2 based on fraction Fourier conversion, it is characterized in that, behind the exponent number that calculates FrFT, the unvoiced frame signal that obtains after handling through step 1 is carried out the FrFT conversion again, try to achieve the long-pending spectrum of harmonic wave, extract wherein pitch contour with dynamic programming method again, fundamental frequency just, its detailed process is as follows:

(1) to unvoiced frame signal x (n), adopt following formula to carry out the fraction Fourier conversion that N is ordered, obtain its amplitude spectrum X (α, k):

X(α，k)＝FrFT_N{x(n)} 1.2

With amplitude spectrum X (α k) transforms to log-domain, obtain logarithm amplitude spectrum SLog (α, k):

SLog(α，k)＝log₁₀(|X(α，k)|²) 1.3

With all the harmonic wave logarithmic spectrum SLog in the frame signal (α k) sues for peace, obtain the long-pending spectrum of harmonic wave ρ (α, f):

ρ (α, f) = \frac{1}{H} Σ_{h = 1}^{H} SLog (α, hf) - - - 1.4

In the formula 1.4, H is the harmonic wave number in the sampling bandwidth, and h is the value of harmonic wave index, and f is the fundamental frequency of every frame, and α is the exponent number of every frame;

(2) (α extracts M the candidate peak that may contain fundamental component in f), and the value of M is greater than and equals 3 from the long-pending spectrum of harmonic wave ρ;

Need to set a target function in the dynamic programming method, every paths is all calculated the value of its target function, the pairing path of maximal value is desired wherein fundamental frequency; With target function c (α f) is set at:

c(α，f)＝k(f)*(P(α，f)-P(α，f/2)) 1.5

In the formula 1.5, f is the estimation fundamental frequency of every frame signal, the function of k (f) for following f to successively decrease; With (α_i, f_i) be designated as μ_i, the score function S in path_i(μ_i) be set at:

S_{i} (μ_{i}) = S_{i - 1} (μ_{i - 1}^{*}) + c (μ_{i}) - - - 1.6

μ_{i - 1}^{*} = \arg_{μ_{i - 1}}^{\max} [s_{i - 1} (μ_{i - 1}) + c (μ_{i})] - - - 1.7

In the formula 1.6,1.7, i represents frame number,

It is the parameter when selecting suitable exponent number one-level to obtain i-1 frame fundamental frequency; Because the fundamental frequency scope that the normal person speaks is 50Hz-400Hz, therefore in this scope, search for fundamental frequency, in two peak points of every frame signal, all can find and select to make score function S_i(μ_i) maximum f value, promptly think the fundamental frequency of one of them people in this frame signal; In like manner, after all signals of search, can be linked to be a pitch contour, thereby obtain one of them people's fundamental frequency;

After the fundamental frequency of finding a people, (α deducts this person's fundamental frequency and the pairing spectrum composition of harmonic wave in p), and then uses dynamic programming method one time, can obtain another person's fundamental frequency, thereby isolate the pitch contour of aliasing voice at the long-pending spectrum of harmonic wave ρ;

After the fundamental frequency of finding a people, (α deducts this person's fundamental frequency and the pairing spectrum composition of harmonic wave in p), and then uses dynamic programming method one time at the long-pending spectrum of harmonic wave ρ, can obtain another person's pitch contour, thereby isolate the pitch contour of aliasing voice;

The method of asking for the pairing spectrum composition of harmonic wave is as follows:

When in the long-pending spectrum of harmonic wave, deducting the pairing spectrum composition of harmonic wave, at first to know harmonic wave number H_i, can know thus needs to deduct several spectrum compositions actually; According to formula 1.8, can obtain the harmonic wave number H of i frame signal_i,

H_{i} = \frac{f_{s} / 2}{f_{i}} - - - 1.8

In the formula 1.8, f_iBe the fundamental frequency of i frame, f_sBe sampling rate; Then the relation of harmonic frequency f ' and fundamental frequency f is as follows:

f′＝h*f，h＝2，3，4，...，H 1.9

In the formula 1.9, H is the harmonic wave number, has obtained harmonic frequency f ', has promptly known the pairing spectrum composition of harmonic wave.