CN1099662C


Info

Publication number: CN1099662C
Application number: CN97116890A
Authority: CN
Inventors: 杜利民; 皮晓波
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 1997-09-05
Filing date: 1997-09-05
Publication date: 2003-01-22
Anticipated expiration: 2017-09-05
Also published as: CN1211026A

Abstract

The invention relates to a speech recognition method in speech signal processing. During acoustic modeling, a phoneme set for Chinese syllables suited to speech recognition and context-dependent acoustic models are designed; Mandarin Chinese phonetic knowledge and data-driven methods are used to share model parameters and make effective use of the speech training data, so that cross-word context-dependent acoustic models are established. During recognition, distinctive features of speech that can be extracted, such as unvoiced/voiced, are used to reduce the blindness of model matching and to improve the search efficiency and accuracy of speech recognition.

Description

Translated from Chinese

A Method for Large-Vocabulary Continuous Speech Recognition of Mandarin Chinese

Technical field

The large-vocabulary continuous speech recognition technology for Mandarin Chinese of the present invention belongs to the technical field of speech signal processing and recognition.

Background art

There is currently no commercial system for large-vocabulary continuous speech recognition of Mandarin Chinese. Large-vocabulary continuous speech recognition systems for other languages, such as IBM's 1996 speech recognition product VoiceType, use statistical modeling based on hidden Markov models. Systems built on hidden Markov models can achieve good recognition rates, but they have inherent shortcomings: (1) The recognition process depends entirely on a very large number of model-matching computations, which makes the system unusually complex and fragile. (2) Estimating the model parameters requires a large amount of training data, and these data must cover, in the statistical sense, all possible variants of speech, which makes training a large-vocabulary continuous speech recognition system quite difficult. Because of these shortcomings, once the recognition rate of a system based entirely on hidden Markov models reaches a certain level, it is hard to improve it further.

Summary of the invention

The purpose of the present invention is to propose, within the recognition framework of hidden Markov models, an acoustic modeling method and an improved search method for large-vocabulary continuous speech recognition of Mandarin Chinese. The methods use Mandarin Chinese phonetic knowledge and phonetic distinctive features extracted by signal processing, and they improve the performance of the speech recognition system while reducing its computational complexity.

The first part of the invention designs a phoneme set for Chinese syllables suited to speech recognition together with context-dependent acoustic models, and uses Mandarin Chinese phonetic knowledge and data-driven methods to share model parameters and make effective use of the speech training data, thereby establishing cross-word context-dependent acoustic models of speech.

Coarticulation is very pronounced in continuous speech. An automatic speech recognizer relies on the acoustic features of each recognition unit being consistent, and coarticulation in continuous speech greatly reduces that consistency, which degrades the performance of the recognition system. Building context-dependent acoustic models improves the consistency of the acoustic features of each recognition unit, but it also causes the number of recognition unit models to grow sharply. For example, Mandarin Chinese has only about 60 context-independent recognition units for initials and finals, but once the influence of the left and right context of each unit is considered, the number of context-dependent recognition units reaches tens of thousands. In practice it is very hard to prepare a statistically sufficient amount of training data for so many acoustic models, so the parameters of each model cannot be estimated reliably, and recognition performance suffers.

The invention first designs unit models suited to speech recognition at the phoneme level of Chinese syllables, and on this basis builds context-dependent acoustic models; that is, each acoustic model is represented as a recognition unit influenced by the context on both its left and right sides. For example, for the Mandarin word 中国 (zhongguo, "China"), the recognition unit models at the phoneme level are

zhongguo = sil ts' U ng k u o_o sil

Taking the context on both sides into account yields the context-dependent recognition unit models

zhongguo = sil sil-ts'+U ts'-U+ng U-ng+k ng-k+u k-u+o_o u-o_o+sil sil

where "-" marks the preceding context and "+" marks the following context.
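The expansion in the zhongguo example can be reproduced with a few lines of code. The sketch below is only an illustration: the function name `expand_context` is hypothetical, and it assumes, as the example suggests, that the first and last units of an utterance (here `sil`) are left without context.

```python
def expand_context(phones):
    """Turn a phoneme sequence into left-phone+right context-dependent units.

    Assumption taken from the zhongguo example: the utterance-initial and
    utterance-final units are kept context-independent.
    """
    units = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        if i == 0 or i == last:
            units.append(phone)
        else:
            units.append(f"{phones[i - 1]}-{phone}+{phones[i + 1]}")
    return units


zhongguo = ["sil", "ts'", "U", "ng", "k", "u", "o_o", "sil"]
print(" ".join(expand_context(zhongguo)))
```

Because the left and right neighbours are taken from the whole utterance rather than from a single word, the same function also produces cross-word contexts when several words are concatenated.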

In one implementation of the invention, 47 context-independent recognition units are designed for Mandarin Chinese speech (23 consonants, 22 vowels, 1 silence unit and 1 transition unit). After context-dependent processing, about 6000 context-dependent acoustic models are generated, a number that can be handled effectively.

For the context-dependent acoustic models thus generated, the invention uses context decision trees, driven by Mandarin Chinese phonetic knowledge and by the training data, to share parameters at the state level of the acoustic models and to make effective use of the data. Figure 1 shows the context decision tree for state 1 of one context-dependent acoustic model. When a phoneme in a given context is passed through this tree, a probability distribution is selected to represent the output probability distribution of its state 1. The process is as follows. First, the question at root node 0 is answered: is the phoneme preceded by a middle consonant (ts, ts_h, s, ts`, ts`_h, s`, z`)? If the phoneme is U and its context is ts`-U+sil, the answer is yes and the decision moves to node 1. The question at node 1 is: is the phoneme followed by silence? Because U is followed by the silence unit, the answer is again yes and the decision moves to node 3. Node 3 is a leaf of the context decision tree, so the decision process ends: for phoneme U in the context ts`-U+sil, probability distribution 1 is selected to represent its state 1.
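The walk through the tree described above can be sketched as follows. Only nodes 0, 1 and 3, their two questions, and probability distribution 1 come from the text; the other leaves (`node2`, `node4`) and their distribution numbers are hypothetical placeholders, since Figure 1 is not reproduced here. The names `Node` and `pick_distribution` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Consonant class asked about at root node 0, as listed in the text.
MIDDLE_CONSONANTS = {"ts", "ts_h", "s", "ts`", "ts`_h", "s`", "z`"}


@dataclass
class Node:
    question: Optional[Callable[[str, str], bool]] = None  # (left, right) -> answer
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    distribution: Optional[int] = None  # set only on leaf nodes


def pick_distribution(root, left, right):
    """Follow yes/no answers from the root until a leaf is reached."""
    node = root
    while node.distribution is None:
        node = node.yes if node.question(left, right) else node.no
    return node.distribution


node3 = Node(distribution=1)   # leaf named in the text
node4 = Node(distribution=2)   # hypothetical leaf
node2 = Node(distribution=3)   # hypothetical leaf
node1 = Node(question=lambda left, right: right == "sil", yes=node3, no=node4)
node0 = Node(question=lambda left, right: left in MIDDLE_CONSONANTS, yes=node1, no=node2)

# Phoneme U in the context ts`-U+sil
print(pick_distribution(node0, "ts`", "sil"))
```

Every context that reaches the same leaf shares one output distribution, which is how the tree ties states together and pools their training data.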

The block diagram for building the context-dependent acoustic models is shown in Figure 2. The steps are as follows:

(A) Collect speech training data;

(B) Extract a sequence of feature vectors from each item of speech training data; the speech features may be LPC cepstra or mel-frequency cepstral coefficients (MFCC);

(C) Design acoustic models of context-independent phonemes for Mandarin Chinese (for example 23 consonants, 22 vowels, 1 silence unit and 1 transition unit), and estimate their parameters from the collected speech training data with the Baum-Welch (B-W) algorithm;

(D) Use the trained context-independent phoneme models to perform state segmentation on all of the speech training data, as shown in Figure 3;

(E) Construct the set of phonetic questions needed for model parameter sharing and effective use of the data.

Based on what Mandarin Chinese phonetics says about consonants, we designed the following consonant questions. Each question is asked once about the phoneme before the current phoneme and once about the phoneme after it.

Manner of articulation of consonants. Is the neighbouring phoneme:
1) a voiced consonant?
2) a syllable-final nasal?
3) a nasal that can serve as an initial?
4) a lateral?
5) a voiced fricative?
6) an aspirated consonant?
7) a stop or an affricate?
8) a fricative or an affricate?
9) a stop?
10) a fricative?
11) an affricate?

Place of articulation of consonants. Is the neighbouring phoneme:
1) a labial or an apical consonant?
2) a labial?
3) an apical consonant?
4) an apical-front or apical-back consonant?
5) an apical-front consonant?
6) an apical-back consonant?
7) a palatal (tongue-surface) or velar (tongue-root) consonant?
8) a palatal (tongue-surface) consonant?
9) a velar (tongue-root) consonant?

In total, (11+9)*2 = 40 questions.
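One way to see where the factor of 2 in (11+9)*2 comes from is to generate the questions from phonetic classes, one question per class and per side. The sketch below is an assumption about representation, not the patent's data format: the names `Question` and `make_questions` are hypothetical, the first class reuses the consonant list given for Figure 1, and the membership of the second class is illustrative and incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    side: str            # "left" (phoneme before) or "right" (phoneme after)
    class_name: str
    members: frozenset   # phonemes belonging to the class

    def ask(self, left, right):
        neighbour = left if self.side == "left" else right
        return neighbour in self.members


def make_questions(phone_classes):
    """Create one left-context and one right-context question per class."""
    return [
        Question(side, name, frozenset(members))
        for name, members in phone_classes.items()
        for side in ("left", "right")
    ]


example_classes = {
    # class listed in the description of Figure 1
    "middle_consonant": {"ts", "ts_h", "s", "ts`", "ts`_h", "s`", "z`"},
    # illustrative, incomplete membership
    "velar": {"k"},
}
questions = make_questions(example_classes)
print(len(questions))  # two classes, two sides each
```

With 11 manner classes and 9 place classes, the same construction gives the 40 consonant questions counted above.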

Based on what Chinese phonetics says about vowels, we designed the following vowel questions. Each question is again asked once about the phoneme before the current phoneme and once about the phoneme after it.

Tongue position and lip shape of vowels. Is the neighbouring phoneme:
1) a rounded vowel?
2) an open-mouth (kaikou) vowel?
3) an even-teeth (qichi) vowel?
4) a closed-mouth (hekou) vowel?
5) a round-mouth (cuokou) vowel?
6) a front vowel?
7) a central vowel (front-back position)?
8) a back vowel?
9) a high vowel?
10) a mid vowel (height)?
11) a low vowel?
12) the vowel "iI"?
13) the vowel "Ii"?
14) the vowel "ae", "a" or "@"?
15) the vowel "H" or "j"?
16) the vowel "A" or "o"?
17) the vowel "@" or "ae"?
18) the vowel "o_o" or "u"?
19) the vowel "i" or "y"?
20) the vowel "U" or "A_A"?

In total, 20*2 = 40 questions.

(F) Construct a context decision tree for each state of each model. The flowchart for constructing a context decision tree is shown in Figure 4. The steps are as follows:

1) Using the result of the preceding state segmentation, load the speech data corresponding to each state of each model;

2) Create the root node of a context decision tree; the probability distribution of this node is estimated from all of the speech training data;

3) From the designed Mandarin Chinese phonetic questions, select one question with which to split the speech data contained in the current node. The question chosen is the one that increases the following likelihood the most:

$L = \sum_{e=1}^{E} \sum_{t=1}^{T_e} \sum_{s \in S} \ln\left(\Pr(o_t^e; \mu_s, \Sigma_s)\right) \gamma_s^e(t) \approx \ln\left(\Pr(O; S)\right)$

Here $E$ is the number of samples in the speech data, $T_e$ is the number of speech frames in sample $e$, $S$ is the set of states that generate these samples, $o_t^e$ is the observation vector of frame $t$ in sample $e$, $\mu_s$ is the mean, $\Sigma_s$ is the covariance matrix, and $\gamma_s^e(t)$ is the state occupancy probability;

The expression above can be simplified to:

$L = \sum_{s \in S} -\frac{1}{2} \left( n (1 + \ln(2\pi)) + \ln(|\Sigma_s|) \right) \sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_s^e(t)$

where $n$ is the dimension of the observation vector.

4) If the increase in likelihood is greater than the threshold $L_{min}$, return to 3) and continue splitting the data; otherwise go to 5);

5) Select two nodes; if merging them reduces the likelihood by less than a threshold, merge the two nodes; otherwise go to 6);

6) Write the constructed context decision tree to a file;

(G) Using the Baum-Welch (B-W) algorithm and all of the speech training data, train according to the shared models and data clusters produced by the context decision trees, to obtain the parameters of the cross-word context-dependent acoustic models.
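Steps 3) and 4) of the tree construction can be sketched as below, using the simplified likelihood: for one Gaussian per cluster, the score is -1/2 (n(1 + ln 2π) + ln|Σ|) times the total occupancy. The sketch makes several assumptions that the text does not state: covariance matrices are diagonal, every frame has occupancy 1 (hard state segmentation), and variances are floored at `VAR_FLOOR`. The names `cluster_loglik` and `best_split` are hypothetical.

```python
import math

VAR_FLOOR = 1e-6  # assumed floor that keeps ln|Sigma| finite for tiny clusters


def cluster_loglik(frames):
    """Simplified log-likelihood of frames under one diagonal Gaussian."""
    count, dim = len(frames), len(frames[0])
    log_det = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / count
        var = sum((f[d] - mean) ** 2 for f in frames) / count
        log_det += math.log(max(var, VAR_FLOOR))
    return -0.5 * (dim * (1.0 + math.log(2.0 * math.pi)) + log_det) * count


def best_split(items, questions, min_gain):
    """Pick the question with the largest likelihood gain above min_gain.

    items: list of ((left, right), frames) for one state in different contexts.
    questions: dict mapping a name to a predicate(left, right).
    Returns (None, min_gain) when no question beats the threshold.
    """
    parent = cluster_loglik([f for _, frames in items for f in frames])
    best_name, best_gain = None, min_gain
    for name, ask in questions.items():
        yes = [f for (l, r), frames in items if ask(l, r) for f in frames]
        no = [f for (l, r), frames in items if not ask(l, r) for f in frames]
        if not yes or not no:
            continue  # the question does not divide the data
        gain = cluster_loglik(yes) + cluster_loglik(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Toy one-dimensional data: the right context, not the left, explains the spread.
items = [
    (("ts`", "sil"), [[0.0], [0.2]]),
    (("k", "sil"), [[0.1], [-0.1]]),
    (("ts`", "ng"), [[5.0], [5.2]]),
    (("k", "ng"), [[4.9], [5.1]]),
]
questions = {
    "left_is_ts`": lambda left, right: left == "ts`",
    "right_is_sil": lambda left, right: right == "sil",
}
print(best_split(items, questions, min_gain=1.0))
```

The merge test in step 5) can reuse `cluster_loglik`: the likelihood lost by merging two leaves is the sum of their two scores minus the score of their pooled frames.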

The second part of the invention incorporates extractable distinctive features of Mandarin Chinese speech into the search process of speech recognition, reducing the blindness of model matching and improving the efficiency and accuracy of the search.

In a system based entirely on the hidden Markov model framework, the recognition result is obtained by searching the system's model inventory for the sequence of acoustic models that best matches the input speech. Because no phonetic information about the input speech signal is available in advance, the recognition algorithm can only assume that every frame may be a transition point between any of the system's recognition units. This assumption causes a large amount of redundant computation during recognition. Adding the distinguishing information that can be extracted from the speech signal to the recognition process greatly reduces this redundant computation and improves the search efficiency and accuracy of speech recognition.

Chinese speech has many distinctive features, which differ in how they are extracted and in how hard they are to extract. For distinctive features that can be extracted reliably, the method of the invention can significantly improve the search efficiency and accuracy of speech recognition. For example, using the unvoiced/voiced distinctive feature, the search process of speech recognition is as follows:

1) Read in one frame of the speech signal.

2) Determine whether the current frame is unvoiced or voiced. If it is unvoiced, match only the unvoiced models and set the accumulated probability in the voiced models to zero; if it is voiced, match only the voiced models and set the accumulated probability in the unvoiced models to zero.

3) Check whether the whole speech signal has been processed. If so, stop and output the recognition result; otherwise go to 1).
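The loop above can be illustrated with a heavily simplified frame-synchronous search. This is a sketch under stated assumptions, not the decoder of Figure 5: each model is reduced to a single state with a hypothetical log-likelihood function, switching between models costs a fixed log penalty, the voicing decision is supplied as an input, and a probability of zero is represented as negative infinity in the log domain.

```python
NEG_INF = float("-inf")  # log of probability zero


def decode(frames, voiced_flags, models, switch_penalty=-1.0):
    """Frame-synchronous search that only scores models matching the voicing.

    models: dict name -> (is_voiced, loglik_fn), loglik_fn(frame) -> float.
    Returns the best-scoring model per frame and the number of likelihood
    evaluations actually performed.
    """
    scores = {name: 0.0 for name in models}
    best_per_frame, evaluations = [], 0
    for frame, frame_is_voiced in zip(frames, voiced_flags):
        best_prev = max(scores.values())
        new_scores = {}
        for name, (is_voiced, loglik) in models.items():
            if is_voiced != frame_is_voiced:
                new_scores[name] = NEG_INF  # accumulated probability set to zero
                continue
            # Either stay in this model or enter it from the best other path.
            entry = max(scores[name], best_prev + switch_penalty)
            new_scores[name] = entry + loglik(frame)
            evaluations += 1
        scores = new_scores
        best_per_frame.append(max(scores, key=scores.get))
    return best_per_frame, evaluations


# Hypothetical one-dimensional models: two voiced, one unvoiced.
models = {
    "a": (True, lambda x: -(x - 1.0) ** 2),
    "o": (True, lambda x: -(x - 3.0) ** 2),
    "s": (False, lambda x: -(x + 1.0) ** 2),
}
frames = [1.0, 0.9, -1.0, 3.1]
voiced_flags = [True, True, False, True]
path, evaluations = decode(frames, voiced_flags, models)
print(path, evaluations)
```

In this toy run only 7 of the 12 possible frame-model likelihoods are evaluated, which is the kind of saving the voicing constraint is meant to give; the size of the saving in a real system depends on how the models divide between the two classes.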

Advantages of the invention:

1) In building accurate context-dependent acoustic models, the invention designs a phoneme set for Chinese syllables suited to speech recognition together with context-dependent acoustic models, and uses Mandarin Chinese phonetic knowledge and data-driven methods to share model parameters and make effective use of the data. This resolves the conflict between the accuracy and the trainability of the acoustic models, so that acoustic models as accurate as possible can be built even with limited training data, which improves system performance.

2) When recognition is performed with these acoustic models, the distinctive feature information that can be extracted from Mandarin Chinese speech is used to reduce the number of model matches in the search, which improves the accuracy and computational efficiency of speech recognition.

Description of the drawings

Figure 1 is a schematic diagram of the context decision tree for state 1 of the vowel U;

Figure 2 is a block diagram of building the acoustic models;

Figure 3 is a schematic diagram of the state segmentation of speech data;

Figure 4 is a flowchart of context decision tree construction;

Figure 5 is a block diagram of the recognition decoder.

Detailed description

Construction of a Mandarin Chinese large-vocabulary continuous speech recognition system. System training:

(A) Collect speech training data.

(B) Extract a sequence of feature vectors from each item of speech training data; the speech features may be LPC cepstra or mel-frequency cepstral coefficients (MFCC).

(C) Design acoustic models of context-independent phonemes for Mandarin Chinese (for example 23 consonants, 22 vowels, 1 silence unit and 1 transition unit), and estimate their parameters from the collected speech training data with the Baum-Welch (B-W) algorithm.

(D) Use the trained context-independent models to perform state segmentation on all of the speech training data, as shown in Figure 3.

(E) Construct the phonetic questions needed for parameter sharing.

(F) Construct a context decision tree for each state of each model.

(G) Using the Baum-Welch (B-W) algorithm and all of the speech training data, train according to the shared models and data clusters produced by the context decision trees, to obtain the parameters of the cross-word context-dependent acoustic models.

(H) Construct the speech recognition system according to the block diagram in Figure 5.

(I) Optimize the parameters of the speech recognition system.