CN1099662C - Continuous voice identification technology for Chinese putonghua large vocabulary - Google Patents

Continuous voice identification technology for Chinese putonghua large vocabulary

Info

Publication number
CN1099662C
Authority
CN
China
Prior art keywords
current phoneme
phoneme
vowel
judging whether
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN97116890A
Other languages
Chinese (zh)
Other versions
CN1211026A (en)
Inventor
杜利民
皮晓波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority to CN97116890A
Publication of CN1211026A
Application granted
Publication of CN1099662C
Anticipated expiration
Expired - Fee Related


Abstract

Translated from Chinese

The invention relates to a speech recognition method in speech signal processing. During acoustic modeling, a phoneme set for Chinese character speech suited to speech recognition and context-dependent acoustic models are designed; Mandarin Chinese phonetic knowledge and data-driven methods are used to share model parameters and make effective use of the speech training data, yielding cross-word context-dependent acoustic models. During recognition, distinctive features that can be extracted from the speech signal, such as the unvoiced/voiced distinction, are used to reduce the blindness of model matching and to improve the search efficiency and accuracy of speech recognition.

Figure 97116890

Description

Translated from Chinese
A Method for Large-Vocabulary Continuous Speech Recognition of Mandarin Chinese

Technical field

The large-vocabulary continuous speech recognition technique for Mandarin Chinese of the present invention belongs to the technical field of speech signal processing and recognition.

Background

There is currently no commercial system for large-vocabulary continuous speech recognition of Mandarin Chinese. Large-vocabulary continuous speech recognition systems for other languages, such as IBM's 1996 speech recognition product VoiceType, use statistical modeling based on hidden Markov models. Systems built on hidden Markov models can achieve good recognition rates, but they have inherent drawbacks: (1) recognition depends entirely on a very large number of model-matching computations, which makes the system exceptionally complex and fragile; (2) estimating the model parameters requires a large amount of training data, and that data must cover, in a statistical sense, all possible variants of speech, which makes training a large-vocabulary continuous speech recognition system quite difficult. Because of these drawbacks, once the recognition rate of a system based entirely on hidden Markov models reaches a certain level, it is hard to improve further.

Summary of the invention

The object of the invention is to propose, within the recognition framework of hidden Markov models and using both Mandarin Chinese phonetic knowledge and phonetic distinctive features extracted by signal processing, an acoustic modeling method and an improved search method for large-vocabulary continuous speech recognition of Mandarin Chinese, which can improve the performance of a speech recognition system and reduce its computational complexity.

The first part of the invention designs a phoneme set for Chinese character speech suited to speech recognition together with context-dependent acoustic models, and uses Mandarin phonetic knowledge and data-driven methods to share model parameters and make effective use of the speech training data, thereby building cross-word context-dependent acoustic models.

Coarticulation is very pronounced in continuous speech. An automatic speech recognizer relies on the acoustic features of each recognition unit being consistent; coarticulation in continuous speech greatly reduces that consistency and so degrades recognition performance. Building context-dependent acoustic models improves the consistency of each unit's acoustic features, but it also sharply increases the number of recognition unit models. For example, Mandarin initials and finals give only about 60 context-independent recognition units, but once the left and right contexts of each unit are taken into account, the number of context-dependent units reaches tens of thousands. In practice it is very hard to prepare a statistically sufficient amount of training data for so many acoustic models, so the parameters of each model cannot be estimated reliably, and recognition performance suffers.

The invention first designs unit models suited to speech recognition at the phoneme level of Chinese syllables, and on that basis builds context-dependent acoustic models, in which each acoustic model represents a recognition unit influenced by both its preceding and its following context. For example, the Mandarin word 中国 (zhongguo), modeled at the phoneme level, is

zhongguo = sil ts’ U ng k u o_o sil

Taking the influence of the context on both sides into account gives the context-dependent recognition unit sequence

zhongguo = sil sil-ts’+U ts’-U+ng U-ng+k ng-k+u k-u+o_o u-o_o+sil sil

where - marks the preceding context and + marks the following context.
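The expansion from the phoneme sequence to the context-dependent sequence can be sketched in a few lines of Python. This is only an illustration of the notation in the example above; the function name `expand_context` is hypothetical, and the rule that the first and last units stay context-free is inferred from the zhongguo example rather than stated as a general rule in the text.

```python
def expand_context(phones):
    """Expand a phone sequence into left-center+right context-dependent units.

    Assumption (read off the zhongguo example): the first and last units of
    the sequence keep their context-independent names.
    """
    units = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        if i == 0 or i == last:
            units.append(phone)  # edge unit: no context attached
        else:
            units.append(f"{phones[i - 1]}-{phone}+{phones[i + 1]}")
    return units


zhongguo = ["sil", "ts'", "U", "ng", "k", "u", "o_o", "sil"]
print(" ".join(expand_context(zhongguo)))
```

Because the contexts are taken from the neighbouring units of the whole utterance, a unit at a word boundary picks up the first or last phoneme of the adjacent word, which is what makes the models cross-word.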

In one implementation of the invention, 47 context-independent recognition units are designed for Mandarin speech (23 consonants, 22 vowels, 1 silence unit, and 1 transition unit). After context-dependent processing they yield about 6000 context-dependent acoustic models, a number that can be handled effectively.

For the context-dependent acoustic models so generated, the invention uses context decision trees, driven by Mandarin phonetic knowledge and by the training data, to share parameters at the state level of the acoustic models and to use the data effectively. Figure 1 shows the context decision tree for state 1 of one context-dependent acoustic model. When a phoneme in a given context is passed through this tree, a probability distribution is selected to represent the output distribution of its state 1. The procedure is as follows. First, the question at root node 0 is answered: is the phoneme preceded by a middle consonant (ts, ts_h, s, ts`, ts`_h, s`, z`)? If the phoneme is U and its context is ts`-U+sil, the answer is yes and the decision moves to node 1. The question at node 1 is: is the phoneme followed by silence? Since U is followed by the silence unit, the answer is again yes and the decision moves to node 3. Node 3 is a leaf of the tree, so the decision ends: in the context ts`-U+sil, probability distribution 1 is selected to represent state 1 of the phoneme U.
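The walk through the tree described above can be sketched as follows. Only the path root node 0, node 1, node 3 and its result (distribution 1) come from the text; the "no" branches and their distribution numbers 2 and 3 are placeholders invented for this sketch, since Figure 1 is not reproduced here. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Consonant class named in the root question of the example.
MIDDLE_CONSONANTS = {"ts", "ts_h", "s", "ts`", "ts`_h", "s`", "z`"}


@dataclass
class Leaf:
    distribution: int  # index of the shared output distribution


@dataclass
class Node:
    question: Callable[[str, str], bool]  # (left, right) -> yes/no
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def parse_unit(unit):
    """Split 'left-center+right' into its three parts."""
    left, rest = unit.split("-", 1)
    center, right = rest.split("+", 1)
    return left, center, right


def select_distribution(tree, unit):
    """Answer questions from the root down until a leaf is reached."""
    left, _center, right = parse_unit(unit)
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node.distribution


# Tree for state 1 of U. Leaf(2) and Leaf(3) are placeholders (assumption).
tree_u_state1 = Node(
    question=lambda left, right: left in MIDDLE_CONSONANTS,   # node 0
    yes=Node(
        question=lambda left, right: right == "sil",          # node 1
        yes=Leaf(1),                                          # node 3
        no=Leaf(2),
    ),
    no=Leaf(3),
)

print(select_distribution(tree_u_state1, "ts`-U+sil"))
```

Every context that reaches the same leaf shares one output distribution, which is how the tree ties parameters across contexts that have too little training data of their own.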

The block diagram for building the context-dependent acoustic models is shown in Figure 2. The steps are as follows:

(A) Collect speech training data.

(B) Extract a feature vector sequence from each item of speech training data; the speech features may be LPC cepstra or mel-frequency cepstral coefficients (MFCC).

(C) Design context-independent phoneme acoustic models for Mandarin Chinese (for example 23 consonants, 22 vowels, 1 silence unit, and 1 transition unit), and estimate their parameters from the collected speech training data with the Baum-Welch (B-W) algorithm.

(D) Use the trained context-independent phoneme acoustic models to segment all of the speech training data into states, as shown in Figure 3.

(E) Construct the set of phonetic questions needed for model parameter sharing and effective use of the data.

Based on Mandarin phonetic knowledge about consonants, the following consonant questions were designed. Each question asks whether the phoneme before (or after) the current phoneme belongs to the named class.

Manner of articulation:
1) voiced consonant
2) syllable-final nasal
3) nasal that can serve as a syllable initial
4) lateral
5) voiced fricative
6) aspirated consonant
7) stop or affricate
8) fricative or affricate
9) stop
10) fricative
11) affricate

Place of articulation:
1) labial or apical
2) labial
3) apical
4) apical-front (dental) or apical-back (retroflex)
5) apical-front (dental)
6) apical-back (retroflex)
7) alveolo-palatal or velar
8) alveolo-palatal
9) velar

In total (11+9)*2 = 40 questions.

Based on Chinese phonetic knowledge about vowels, the following vowel questions were designed. Each question asks whether the phoneme before (or after) the current phoneme belongs to the named class.

Vowel tongue position and lip shape:
1) rounded vowel
2) open-mouth (kaikou) vowel
3) even-teeth (qichi) vowel
4) closed-mouth (hekou) vowel
5) round-mouth (cuokou) vowel
6) front vowel
7) central vowel (front-back position)
8) back vowel
9) high vowel
10) mid vowel (height)
11) low vowel
12) the vowel "iI"
13) the vowel "Ii"
14) the vowel "ae", "a", or "@"
15) the vowel "H" or "j"
16) the vowel "A" or "o"
17) the vowel "@" or "ae"
18) the vowel "o_o" or "u"
19) the vowel "i" or "y"
20) the vowel "U" or "A_A"

In total 20*2 = 40 questions.

(F) Build a context decision tree for each state of each model. The flow chart for building a context decision tree is shown in Figure 4. The steps are as follows:

1) Using the result of the earlier state segmentation, load the speech data corresponding to each state of each model.

2) Create the root node of a context decision tree; the probability distribution of this node is estimated from all of the speech training data.

3) From the designed Mandarin phonetic questions, select one question to split the speech data held by the current node. The split is chosen to give the largest increase in the following log-likelihood:

$$L = \sum_{e=1}^{E} \sum_{t=1}^{T_e} \sum_{s \in S} \ln\bigl(\Pr(o_t^e; \mu_s, \Sigma_s)\bigr)\, \gamma_s^e(t) \approx \ln\bigl(\Pr(O; S)\bigr)$$

where E is the number of samples in the speech data, T_e is the number of speech frames in sample e, S is the set of states that generate these samples, o is the observation vector, μ is the mean, Σ is the covariance matrix, and γ is the state occupancy probability.

The expression above simplifies to

$$L = \sum_{s \in S} -\frac{1}{2}\Bigl(n\bigl(1 + \ln(2\pi)\bigr) + \ln\lvert\Sigma_s\rvert\Bigr) \sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_s^e(t)$$

where n is the dimension of the observation vector.

4) If the increase in likelihood exceeds the threshold L_min, return to 3) and continue splitting the data; otherwise go to 5).

5) Select two nodes; if merging them reduces the likelihood by less than a threshold, merge the two nodes; otherwise go to 6).

6) Write the finished context decision tree to a file.

(G) Using the Baum-Welch algorithm and all of the speech training data, train with the shared models and data clusters determined by the context decision trees, obtaining the parameters of the cross-word context-dependent acoustic models.
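Steps 3) and 4) of the tree construction can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not in the text: each node is modeled by a single Gaussian with a diagonal covariance, every frame has occupancy γ = 1, and a small variance floor avoids taking the logarithm of zero. The function names, the question names, and the toy data are hypothetical.

```python
import math


def node_log_likelihood(frames):
    """Simplified likelihood -1/2 (n(1 + ln 2*pi) + ln|Sigma|) * sum(gamma).

    Assumptions: diagonal covariance, gamma = 1 per frame, variance floor 1e-6.
    """
    count = len(frames)
    n = len(frames[0])
    log_det = 0.0
    for d in range(n):
        mean = sum(f[d] for f in frames) / count
        var = sum((f[d] - mean) ** 2 for f in frames) / count
        log_det += math.log(max(var, 1e-6))
    return -0.5 * (n * (1.0 + math.log(2.0 * math.pi)) + log_det) * count


def best_split(samples, questions, min_gain):
    """Pick the question with the largest likelihood increase (step 3).

    samples: list of ((left, right), frame); questions: name -> predicate.
    Returns (name, gain), or None when no gain exceeds min_gain (step 4).
    """
    parent = node_log_likelihood([f for _, f in samples])
    best = None
    for name, ask in questions.items():
        yes = [f for (l, r), f in samples if ask(l, r)]
        no = [f for (l, r), f in samples if not ask(l, r)]
        if not yes or not no:
            continue  # this question does not split the data
        gain = node_log_likelihood(yes) + node_log_likelihood(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best if best is not None and best[1] > min_gain else None


MIDDLE = {"ts", "ts_h", "s", "ts`", "ts`_h", "s`", "z`"}
questions = {
    "L_middle_consonant": lambda l, r: l in MIDDLE,
    "R_silence": lambda l, r: r == "sil",
}
# Toy one-dimensional frames for one state, keyed by (left, right) context.
samples = [
    (("ts`", "sil"), [0.0]), (("ts`", "ng"), [0.1]), (("s", "sil"), [-0.1]),
    (("k", "sil"), [5.0]), (("k", "ng"), [5.1]), (("u", "ng"), [4.9]),
]
print(best_split(samples, questions, min_gain=1.0))
```

In the toy data the frames cluster by left context, so the left-context question gives by far the larger gain and is chosen; the merging pass of step 5) would apply the same likelihood function to pairs of leaves.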

The second part of the invention incorporates extractable distinctive features of Mandarin speech into the search process of speech recognition, reducing the blindness of model matching and improving the efficiency and accuracy of the search.

In a system based entirely on the hidden Markov model framework, the recognition result depends on finding, in the system's model inventory, the sequence of acoustic models that best matches the input speech. Because no phonetic information about the input signal is available in advance, the recognition algorithm can only assume that every frame may be a transition point between any of the system's recognition units. This assumption causes a large amount of redundant computation. Adding the distinctive information that can be extracted from the speech signal to the recognition process greatly reduces this redundant computation and improves the search efficiency and accuracy of speech recognition.

Chinese speech has many distinctive features, which differ in how they are extracted and in how hard they are to extract. For distinctive features that can be extracted reliably, the method of the invention can significantly improve the search efficiency and accuracy of speech recognition. For example, using the unvoiced/voiced distinction, the recognition search proceeds as follows:

1) Read in one frame of the speech signal.

2) Determine whether the current frame is unvoiced or voiced. If it is unvoiced, match only the unvoiced models and set the accumulated probability in the voiced models to zero; if it is voiced, match only the voiced models and set the accumulated probability in the unvoiced models to zero.

3) Check whether the whole speech signal has been processed. If so, stop and output the recognition result; otherwise go to 1).
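The gating in step 2) can be sketched in Python as below. The sketch works with log scores, so "setting the accumulated probability to zero" becomes a score of negative infinity. How the frame's voicing is decided, and how silence and transition units should be treated, is not specified in the text; here the voicing decision is simply passed in, and all names are hypothetical.

```python
NEG_INF = float("-inf")


def gate_frame_scores(log_scores, model_is_voiced, frame_is_voiced):
    """Keep accumulated log scores only for models that match the frame's voicing.

    log_scores: model name -> accumulated log probability so far.
    model_is_voiced: model name -> True for voiced models, False for unvoiced.
    A probability of zero is represented as -inf in the log domain.
    """
    return {
        model: (score if model_is_voiced[model] == frame_is_voiced else NEG_INF)
        for model, score in log_scores.items()
    }


def models_to_match(model_is_voiced, frame_is_voiced):
    """List the models that still need a likelihood computation for this frame."""
    return [m for m, voiced in model_is_voiced.items() if voiced == frame_is_voiced]


voicing = {"s": False, "k": False, "a": True, "u": True}
scores = {"s": -3.0, "k": -4.2, "a": -2.5, "u": -5.0}
print(gate_frame_scores(scores, voicing, frame_is_voiced=True))
print(models_to_match(voicing, frame_is_voiced=True))
```

Only the models returned by `models_to_match` need an output-probability evaluation for the frame, which is where the saving in model-matching computation comes from.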

Advantages of the invention:

1) In building accurate context-dependent acoustic models, the invention designs a phoneme set for Chinese character speech suited to speech recognition together with context-dependent acoustic models, and uses Mandarin phonetic knowledge and data-driven methods to share model parameters and use the data effectively. This resolves the conflict between the accuracy and the trainability of the acoustic models, so that even with limited training data the acoustic models can be made as accurate as possible, which improves system performance.

2) When these acoustic models are used for recognition, the extractable distinctive feature information of Mandarin speech reduces the number of model matches in the recognition search, which improves both the accuracy and the computational efficiency of speech recognition.

Description of the drawings

Figure 1 is a schematic diagram of the context decision tree for state 1 of the vowel U;

Figure 2 is a block diagram of building the acoustic models;

Figure 3 is a schematic diagram of the state segmentation of the speech data;

Figure 4 is a flow chart of context decision tree construction;

Figure 5 is a block diagram of the recognition decoder.

Detailed description of the embodiment

Construction of a large-vocabulary continuous speech recognition system for Mandarin Chinese. System training:

(A) Collect speech training data.

(B) Extract a feature vector sequence from each item of speech training data; the speech features may be LPC cepstra or mel-frequency cepstral coefficients (MFCC).

(C) Design context-independent phoneme acoustic models for Mandarin Chinese (for example 23 consonants, 22 vowels, 1 silence unit, and 1 transition unit), and estimate their parameters from the collected speech training data with the Baum-Welch (B-W) algorithm.

(D) Use the trained context-independent models to segment all of the speech training data into states, as shown in Figure 3.

(E) Construct the phonetic questions needed for parameter sharing.

(F) Build a context decision tree for each state of each model.

(G) Using the Baum-Welch algorithm and all of the speech training data, train with the shared models and data clusters determined by the context decision trees, obtaining the parameters of the cross-word context-dependent acoustic models.

(H) Construct the speech recognition system according to the block diagram in Figure 5.

(I) Optimize the parameters of the speech recognition system.

Claims (2)

Translated fromChinese
1.一种汉语普通话大词汇连续语音声学模型的建模方法,其特征在于:通过设计适合语音识别的汉字语音音素集和语境相关的声学模型,利用汉语普通话语音学知识和数据驱动的方法进行模型参数共享和语音训练数据的有效利用,建立跨字词语境相关的语音声学模型;语音声学模型建立在汉字的音素层面上,在处理连续语音的协同发音影响时,不仅考虑汉字音节内部紧邻的音素之间的协同发音问题,还考虑跨字词边界的音素之间的协同发音问题;用汉语普通话语音学知识通过建立语音学问题组的方式建立模型参数共享和数据利用的语境判决树;语境判决树的使用采用似然概率最大的准则;1. A modeling method of Mandarin Chinese large vocabulary continuous speech acoustic model, characterized in that: by designing a Chinese character speech phoneme set suitable for speech recognition and a context-related acoustic model, utilizing Chinese Mandarin phonetics knowledge and data-driven methods Carry out model parameter sharing and effective use of speech training data, and establish a speech acoustic model related to cross-word context; the speech acoustic model is established on the phoneme level of Chinese characters, and not only considers the internal syllables of Chinese characters when dealing with the influence of co-pronunciation of continuous speech The problem of co-pronunciation between adjacent phonemes, and the problem of co-pronunciation between phonemes that cross word boundaries are also considered; use Mandarin Chinese phonetics knowledge to establish a phonetic problem group to establish model parameter sharing and contextual judgment of data utilization tree; the use of the context decision tree adopts the criterion with the largest likelihood probability;建立跨字词语境相关声学模型的步骤如下:The steps to build a cross-word context-dependent acoustic model are as follows:(A)采集语音训练数据;(A) collecting speech training data;(B)对每个语音训练数据提取特征矢量序列,语音特征可以选择LPC倒谱或摩尔倒谱MFCC;(B) extract feature vector sequence for each speech training data, speech feature can select LPC cepstrum or moore cepstrum MFCC;(C)针对汉语普通话设计语境无关的音素声学模型,例如设计23个辅音、22个元音、1个寂静音、1个过渡音,用B-W算法,用采集的语音训练数据,估计获得汉语普通话语境无关的音素声学模型参数;(C) Design a context-independent phoneme acoustic model for Mandarin Chinese, such as designing 23 consonants, 22 vowels, 1 silent sound, and 1 transitional sound, using the B-W algorithm and using the collected speech 
training data to estimate the obtained Chinese Mandarin context-independent phoneme acoustic model parameters;(D)用训练好的语境无关的音素声学模型,对所有的语音训练数据进行状态分割;(D) Use the trained context-independent phoneme acoustic model to perform state segmentation on all speech training data;(E)构造模型参数共享和数据有效利用所需要的语音学问题组;(E) Constructing the phonetic problem groups required for model parameter sharing and effective data utilization;根据汉语普通话语音学有关辅音的知识,我们设计了如下的辅音问题:According to the knowledge about consonants in the phonetics of Mandarin Chinese, we designed the following consonant questions:辅音发音方式:1)判断当前音素前(后)面是否是浊辅音2)判断当前音素前(后)面是否是韵尾鼻音3)判断当前音素前(后)面是否是可以作声母的鼻音4)判断当前音素前(后)面是否是边音5)判断当前音素前(后)面是否是浊擦音6)判断当前音素前(后)面是否是送气音7)判断当前音素前(后)面是否是塞音或塞擦音8)判断当前音素前(后)面是否是擦音或塞擦音9)判断当前音素前(后)面是否是塞音10)判断当前音素前(后)面是否是擦音11)断当前音素前(后)面是否是塞擦音Consonant pronunciation method: 1) Judging whether the front (back) side of the current phoneme is a voiced consonant 2) Judging whether the front (back) side of the current phoneme is a rhyme-final nasal sound 3) Judging whether the front (back) side of the current phoneme is a nasal sound that can be used as an initial consonant 4 ) Judging whether the front (back) face of the current phoneme is a side sound 5) Judging whether the front (back) face of the current phoneme is a voiced fricative 6) Judging whether the front (back) face of the current phoneme is an aspirated sound ) whether the face is a stop or a fricative 8) judging whether the front (back) face of the current phoneme is a fricative or a fricative 9) judging whether the front (back) face of the current phoneme is a stop 10) judging the front (back) face of the current phoneme Is it a fricative 11) Whether the front (back) side of the current phoneme is an affricate辅音发音部位:1)判断当前音素前(后)面是否是唇音或舌尖音2)判断当前音素前(后)面是否是唇音3)判断当前音素前(后)面是否是舌尖音4)判断当前音素前(后)面是否是舌尖前音或舌尖后音5)判断当前音素前(后)面是否是舌尖前音6)判断当前音素前(后)面是否是舌尖后音7)判断当前音素前(后)面是否是舌面音或舌根音8)判断当前音素前(后)面是否是舌面音9)判断当前音素前(后)面是否是舌根音总共(11+9)*2=40个问题;Consonant articulation position: 1) Judging 
whether the phoneme before (after) the current phoneme is a labial or apical sound
2) Judging whether the phoneme before (after) the current phoneme is a labial sound
3) Judging whether the phoneme before (after) the current phoneme is an apical sound
4) Judging whether the phoneme before (after) the current phoneme is a front apical or back apical sound
5) Judging whether the phoneme before (after) the current phoneme is a front apical sound
6) Judging whether the phoneme before (after) the current phoneme is a back apical sound
7) Judging whether the phoneme before (after) the current phoneme is a tongue-surface or tongue-root sound
8) Judging whether the phoneme before (after) the current phoneme is a tongue-surface sound
9) Judging whether the phoneme before (after) the current phoneme is a tongue-root sound
A total of (11+9)*2 = 40 questions;

We designed the following vowel questions based on knowledge of vowels in Chinese phonetics:

Vowel tongue position and lip shape:
1) Judging whether the phoneme before (after) the current phoneme is a rounded vowel
2) Judging whether the phoneme before (after) the current phoneme is an open-mouth (kaikouhu) vowel
3) Judging whether the phoneme before (after) the current phoneme is an even-teeth (qichihu) vowel
4) Judging whether the phoneme before (after) the current phoneme is a closed-mouth (hekouhu) vowel
5) Judging whether the phoneme before (after) the current phoneme is a round-mouth (cuokouhu) vowel
6) Judging whether the phoneme before (after) the current phoneme is a front vowel
7) Judging whether the phoneme before (after) the current phoneme is a central vowel (front-back position)
8) Judging whether the phoneme before (after) the current phoneme is a back vowel
9) Judging whether the phoneme before (after) the current phoneme is a high vowel
10) Judging whether the phoneme before (after) the current phoneme is a mid vowel (high-low position)
11) Judging whether the phoneme before (after) the current phoneme is a low vowel
12) Judging whether the phoneme before (after) the current phoneme is the vowel "iI"
13) Judging whether the phoneme before (after) the current phoneme is the vowel "Ii"
14) Judging whether the phoneme before (after) the current phoneme is the vowel "ae", "a" or "@"
15) Judging whether the phoneme before (after) the current phoneme is the vowel "H" or "j"
16) Judging whether the phoneme before (after) the current phoneme is the vowel "A" or "o"
17) Judging whether the phoneme before (after) the current phoneme is the vowel "@" or "ae"
18) Judging whether the phoneme before (after) the current phoneme is the vowel "o_o" or "u"
19) Judging whether the phoneme before (after) the current phoneme is the vowel "i" or "y"
20) Judging whether the phoneme before (after) the current phoneme is the vowel "U" or "A_A"
A total of 20*2 = 40 questions

(F) Construct a context decision tree for each state of each model. The steps for constructing a context decision tree are as follows:

1) According to the results of the preceding state segmentation, load the speech data corresponding to each state of each model;

2) Create the root node of the context decision tree; the probability distribution of this node is estimated from all of the speech training data;

3) From the designed Mandarin Chinese phonetic questions, select one question to split the speech data contained in the current node. The question is chosen so that the following likelihood increases the most:

$$L = \sum_{e=1}^{E} \sum_{t=1}^{T_e} \sum_{s \in S} \ln\bigl(\Pr(o_t^e; \mu_s, \Sigma_s)\bigr)\, \gamma_s^e(t) \approx \ln\bigl(\Pr(O; S)\bigr)$$

where E is the number of samples contained in the speech data, T_e is the number of speech frames contained in each sample, S is the set of states that generate these samples, o is the observation vector, μ is the mean, Σ is the covariance matrix, and γ is the state occupation probability;

The above formula can be simplified to:

$$L = \sum_{s \in S} -\frac{1}{2}\Bigl(n\bigl(1 + \ln(2\pi)\bigr) + \ln\lvert\Sigma_s\rvert\Bigr) \sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_s^e(t)$$

4) If the increase in likelihood is greater than the threshold L_min, return to step 3) and continue splitting the data; otherwise go to step 5);

5) Select two nodes; if merging them reduces the likelihood by less than a threshold, merge the two nodes; otherwise go to step 6);

6) Write the constructed context decision tree to a file;

(G) Using the B-W (Baum-Welch) algorithm and all of the speech training data, train according to the shared models and data clusters determined by the context decision trees, to obtain the parameters of said cross-word context-dependent acoustic model.

2. Mandarin Chinese large-vocabulary continuous speech recognition performed with the Mandarin Chinese large-vocabulary continuous speech acoustic model according to claim 1, characterized in that: during the search process of speech recognition, distinctive features of Mandarin Chinese speech that can be extracted, such as unvoiced/voiced, are incorporated to reduce the blindness of model matching and to improve the search efficiency and accuracy of speech recognition.
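
The node-splitting criterion in step 3) of (F) can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes NumPy, treats the node as a single tied state with one Gaussian, takes n in the simplified formula to be the feature dimension, and adds a small covariance floor for numerical stability. The names `node_log_likelihood`, `best_question`, `min_gain` (standing in for L_min) and the layout of the question table are hypothetical.

```python
import numpy as np


def node_log_likelihood(frames, occupancy, floor=1e-6):
    """Simplified likelihood of one node modeled by a single Gaussian:
    L = -1/2 * (n * (1 + ln(2*pi)) + ln|Sigma|) * sum(gamma).
    frames: (T, n) observation vectors; occupancy: (T,) state occupation
    probabilities. The covariance floor is an assumption of this sketch."""
    total = occupancy.sum()
    n = frames.shape[1]
    mean = (occupancy[:, None] * frames).sum(axis=0) / total
    centered = frames - mean
    cov = (occupancy[:, None] * centered).T @ centered / total
    cov += floor * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * (1.0 + np.log(2.0 * np.pi)) + logdet) * total


def best_question(frames, occupancy, contexts, questions, min_gain=0.0):
    """Pick the question whose yes/no split increases the likelihood most.
    contexts: one (left_phoneme, right_phoneme) pair per frame.
    questions: name -> (side, phoneme_set), with side "left" or "right".
    Returns (name, gain); name is None when no gain exceeds min_gain."""
    parent = node_log_likelihood(frames, occupancy)
    n = frames.shape[1]
    best_name, best_gain = None, float("-inf")
    for name, (side, phones) in questions.items():
        idx = 0 if side == "left" else 1
        yes = np.array([ctx[idx] in phones for ctx in contexts], dtype=bool)
        # Skip splits that leave too few frames to estimate a covariance.
        if yes.sum() <= n or (~yes).sum() <= n:
            continue
        gain = (node_log_likelihood(frames[yes], occupancy[yes])
                + node_log_likelihood(frames[~yes], occupancy[~yes])
                - parent)
        if gain > best_gain:
            best_name, best_gain = name, gain
    if best_gain <= min_gain:
        return None, best_gain
    return best_name, best_gain
```

A full tree builder would call `best_question` repeatedly on the resulting child nodes until no split exceeds the threshold, then apply the merge test of step 5); both are omitted here for brevity.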
CN97116890A | 1997-09-05 | 1997-09-05 | Continuous voice identification technology for Chinese putonghua large vocabulary | Expired - Fee Related | CN1099662C (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN97116890A / CN1099662C (en) | 1997-09-05 | 1997-09-05 | Continuous voice identification technology for Chinese putonghua large vocabulary

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN97116890A / CN1099662C (en) | 1997-09-05 | 1997-09-05 | Continuous voice identification technology for Chinese putonghua large vocabulary

Publications (2)

Publication Number | Publication Date
CN1211026A (en) | 1999-03-17
CN1099662C (en) | 2003-01-22

Family

ID=5174181

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN97116890A (Expired - Fee Related, CN1099662C (en)) | Continuous voice identification technology for Chinese putonghua large vocabulary | 1997-09-05 | 1997-09-05

Country Status (1)

Country | Link
CN (1) | CN1099662C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1674092B (en) * | 2004-03-26 | 2010-06-09 | 松下电器产业株式会社 | Modeling, decoding method and system for cross-word consonants and finals for continuous digit recognition
KR101780760B1 (en) * | 2011-06-30 | 2017-10-10 | 구글 인코포레이티드 | Speech recognition using variable-length context
US9336771B2 (en) | 2012-11-01 | 2016-05-10 | Google Inc. | Speech recognition using non-parametric models
US9858922B2 (en) | 2014-06-23 | 2018-01-02 | Google Inc. | Caching speech recognition scores
US9299347B1 (en) | 2014-10-22 | 2016-03-29 | Google Inc. | Speech recognition using associative mapping
CN105206271A (en) * | 2015-08-25 | 2015-12-30 | 北京宇音天下科技有限公司 | Intelligent equipment voice wake-up method and system for realizing method
CN108922543B (en) * | 2018-06-11 | 2022-08-16 | 平安科技(深圳)有限公司 | Model base establishing method, voice recognition method, device, equipment and medium
CN111599347B (en) * | 2020-05-27 | 2024-04-16 | 广州科慧健远医疗科技有限公司 | Standardized sampling method for extracting pathological voice MFCC (functional peripheral component interconnect) characteristics for artificial intelligent analysis

Also Published As

Publication number | Publication date
CN1211026A (en) | 1999-03-17

Legal Events

Date | Code | Title | Description
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| C06 | Publication |
| PB01 | Publication |
| C14 | Grant of patent or utility model |
| GR01 | Patent grant |
| C19 | Lapse of patent right due to non-payment of the annual fee |
| CF01 | Termination of patent right due to non-payment of annual fee |
