
Speech recognizer with a lexical tree based N-gram language model

Info

Publication number: CN1201286C (application CN99817058.5A)
Authority: CN (China)
Other versions: CN1406374A (en)
Inventors: Lin Zhiwei, Yan Yonghong, Zhao Qingwei, Yuan Baosheng (names transliterated)
Original and current assignee: Intel Corp
Legal status: Expired - Fee Related


Abstract


In some embodiments, the invention includes a method of creating a vocabulary tree and identifying starting phonemes in the vocabulary tree. The method of these embodiments further includes estimating the probabilities of words in the vocabulary tree that have a particular starting phoneme and storing at least some of the estimated probabilities, wherein backoff weights are not stored with the estimated probabilities. The estimated probabilities may be stored in a lookup table. In other embodiments, the invention includes a method of receiving phonemes and identifying them in a vocabulary tree. The method of these embodiments also includes estimating the probabilities of words containing those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved probabilities do not include backoff weights stored with the estimated probabilities. Likewise, the estimated probabilities may be stored in a lookup table. The estimated probabilities may be used in establishing pruning thresholds. These methods may be implemented by instructions on a computer-readable medium.

Description

Method for performing speech recognition using a vocabulary-tree-based n-gram language model

Technical Field

The present invention relates to speech recognition systems and, more particularly, to an n-gram language model based on a vocabulary tree.

Background Art

One component of a speech recognizer is the language model. A language model includes the probabilities of words appearing in a vocabulary and the probabilities of one word following another word or words. Indeed, a popular approach to capturing the syntactic structure of a given language is to use conditional probabilities to capture the sequential information embedded in the word strings of sentences. For example, if the current word is w1, then a language model can be built that gives the probability that some other word w2, w3, ... wN will follow w1. Conditional probabilities are typically computed by examining how often words occur next to each other in a training corpus (for example, newspapers). For example, the conditional probability P21 = P(w2|w1) is the probability that word w2 follows word w1. The probability P21 is called a bigram. A trigram language model probability is the conditional probability that a word follows two other words in sequence. For example, P210 = P(w2|w1 w0) is the probability that word w2 follows word w1, which in turn follows word w0. A unigram or 1-gram probability is simply the probability that a word will occur. For example, p1 = p(w1) is the probability that word w1 will occur at a particular time regardless of the preceding words.
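As a concrete illustration of the counting just described (not part of the patent; the toy corpus and the function name ngram_probs are hypothetical), the following minimal Python sketch computes maximum-likelihood conditional probabilities from adjacency counts in a training corpus:

    from collections import Counter

    def ngram_probs(corpus_words, n=2):
        """Maximum-likelihood n-gram estimates from adjacency counts:
        P(w_n | w_1 ... w_{n-1}) = count(w_1 ... w_n) / count(w_1 ... w_{n-1})."""
        history_counts = Counter()
        ngram_counts = Counter()
        for i in range(len(corpus_words) - n + 1):
            history_counts[tuple(corpus_words[i:i + n - 1])] += 1
            ngram_counts[tuple(corpus_words[i:i + n])] += 1
        return {g: c / history_counts[g[:-1]] for g, c in ngram_counts.items()}

    # Toy training text; real models are trained on large corpora (e.g. newspapers).
    corpus = "the fund grew and the mutual fund grew".split()
    bigrams = ngram_probs(corpus, n=2)
    print(bigrams[("fund", "grew")])  # P(grew | fund) = 1.0, since every "fund" is followed by "grew"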

The number of possible word combinations involved in unigrams, bigrams, trigrams, and so on grows geometrically. The terms "lower gram" and "higher gram" as used herein refer to the order of the gram. For example, a unigram is lower than a bigram, and a bigram is lower than a trigram. A trigram is higher than a bigram, and a bigram is higher than a unigram. For a large vocabulary, the total number of trigram combinations, and even of bigram combinations, is unmanageably large. It turns out, however, that a great many of these trigrams and bigrams have conditional probabilities so small (almost zero) that they are not worth putting into the language model. Backoff weights have been used to adjust the probabilities of lower grams. For example, when a trigram probability is not included in the language model, the bigram probability multiplied by a backoff weight (bowt) can be used instead. If no backoff weight exists, the lower gram can simply be substituted for the higher gram. Accordingly, a word-based n-gram language model can be expressed as equation (1), as follows:

    p(wn | wn-1 ... w1) =
        { p(wn | wn-1 ... w1)                       if the n-gram exists
        { bowt(wn-1 ... w1) · p(wn | wn-1 ... w2)   if the n-gram does not exist but its backoff weight does
        { p(wn | wn-1 ... w2)                       otherwise
                                                                        (1)

As noted above, although equation (1) is a fairly general n-gram representation, grams higher than trigrams are rarely considered in practice.

A typical n-gram language model file is stored in the following format:

For 1-grams: p(w1) w1 bowt(w1)

For i-grams (for i = 1, ..., n-1): p(wi|wi-1 ... w1) w1 ... wi bowt(w1 ... wi-1)

For n-grams: p(wn|wn-1 wn-2 ... w1) w1 ... wn
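As a sketch of how a model stored in this format is consulted at decode time (hypothetical helper and variable names; real model files store log probabilities, and plain probabilities are used here only for clarity), equation (1) resolves a conditional probability by falling back to lower grams:

    def backoff_prob(w, history, probs, bowts):
        """Resolve p(w | history) from a stored n-gram model with backoff.
        `probs` maps tuples (history..., w) to stored probabilities;
        `bowts` maps history tuples to backoff weights."""
        key = tuple(history) + (w,)
        if key in probs:                        # n-gram stored directly
            return probs[key]
        if not history:                         # unseen unigram: small floor
            return 1e-10
        lower = backoff_prob(w, history[1:], probs, bowts)
        return bowts.get(tuple(history), 1.0) * lower   # bowt, or plain substitution

    # Hypothetical entries: the bigram p(grew|fund) is stored, the trigram is backed off.
    probs = {("fund", "grew"): 0.4, ("grew",): 0.1}
    bowts = {("mutual", "fund"): 0.7, ("fund",): 0.5}
    print(backoff_prob("grew", ("mutual", "fund"), probs, bowts))  # 0.7 * 0.4 = 0.28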

A vocabulary tree is used to organize the possible words. For example, suppose that in a vocabulary tree any of the words w2, w3, ... wN may follow the word w1. Conditional probabilities can be computed to help decide which of the words w2, w3, ... wN follows word w1. For a large vocabulary, the number of possibilities is enormous. Various techniques have been developed to reduce the number of possibilities involved by using a "pruning beam" to cut off low-probability paths whose conditional probabilities fall below a threshold set relative to the maximum.

Words are detected as a series of phonemes. A phoneme here refers to a digital electrical signal representing a sound. However, which word was spoken is generally not known until the last phoneme of the word has been detected. The result is a delay in pruning the received words and thus an overall slowdown in decoding the received words.

In the article by S. Ortmanns et al., "Language-Model Look-Ahead for Large Vocabulary Speech Recognition," ICSLP 96 (1996), pp. 2095-98, a look-ahead technique is proposed that incorporates language model probabilities earlier in the pruning process of a beam search strategy. However, the authors of that article failed to recognize how best to keep the storage of the estimated probabilities of the vocabulary tree at a manageable level. For example, the article by Ortmanns et al. concludes that the size of a table storing the computed (estimated) probabilities would be prohibitively large. See p. 2097 of the article.

Therefore, large vocabulary continuous speech recognizers (LVCSR) need a better vocabulary-tree n-gram language model format.

Summary of the Invention

According to one aspect of the present invention, there is provided a method of performing speech recognition, comprising: creating a vocabulary tree; identifying a first phoneme in the vocabulary tree; estimating probabilities of words in the vocabulary tree having the first phoneme; and storing at least some of the estimated probabilities, wherein the stored estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximately backed off to (n-1)-gram estimated probabilities.

According to another aspect of the present invention, there is also provided a method of performing speech recognition, comprising: receiving phonemes and identifying them in a vocabulary tree; and estimating the probabilities of words containing those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximately backed off to (n-1)-gram estimated probabilities.

In some embodiments, the invention includes a method of creating a vocabulary tree and identifying starting phonemes in the vocabulary tree. The method of these embodiments further includes estimating the probabilities of words in the vocabulary tree that have a particular starting phoneme and storing at least some of the estimated probabilities, wherein backoff weights are not stored with the estimated probabilities. The estimated probabilities may be stored in a lookup table.

In other embodiments, the invention includes a method of receiving phonemes and identifying them in a vocabulary tree. The method of these embodiments also includes estimating the probabilities of words containing those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved probabilities do not include backoff weights stored with the estimated probabilities. Likewise, the estimated probabilities may be stored in a lookup table.

The estimated probabilities may be used in establishing pruning thresholds.

These methods may be implemented by instructions on a computer-readable medium.

Further embodiments are described herein and summarized in the claims.

Brief Description of the Drawings

The invention will be more fully understood from the following detailed description and the accompanying drawings of embodiments of the invention. The invention should not, however, be limited to the embodiments described, which are presented for explanation and understanding only.

FIG. 1 is a schematic diagram illustrating a vocabulary tree according to some embodiments of the present invention.

FIG. 2 is a high-level block diagram of a computer system that may be used in some embodiments of the invention.

FIG. 3 is a high-level schematic diagram of a handheld computer system that may be used in some embodiments of the invention.

Detailed Description

The present invention relates to a vocabulary-tree-based n-gram language model format for LVCSR. With the present invention, once a starting phoneme is detected, the probability of a word can be estimated. Pruning of paths that fall below a threshold can begin before the successor word has been identified. The invention serves to speed up the search process in LVCSR. The language model plays a key role in the decoding process, both in accuracy and in performance. Accordingly, the performance of a speech recognition system depends on its language model.

The invention contemplates various ways of organizing a vocabulary tree. As an example, FIG. 1 shows a schematic diagram of part of a vocabulary tree. The vocabulary tree of FIG. 1 links many words together according to their phonemes; different words may share some of the same phonemes. A predecessor word w0 is represented by a rectangle. There may or may not be words before w0. In the vocabulary, certain phonemes can begin a successor word. These starting phonemes are Bph1, Bph2, ... Bphx, which may be fewer than the total number of phonemes.

Multiple words can begin with the same phoneme. For ease of discussion, words sharing the same phonemes have similar labels. For example, the words w11, w12, and w13 each begin with the starting phoneme Bph1. More specifically, phonemes Bph1, ph2, ph3, and ph4 form the word w11 (for example, the word "fund"); phonemes Bph1, ph2, ph3, ph4, and ph5 form the word w12 (for example, "funds"); and phonemes Bph1, ph2-ph4, and ph6-ph10 form the word w13 (for example, "fundamental"). (Note that the actual number of phonemes in a word may differ from what is shown here.) In practice, many more words typically begin with the same phoneme, but for ease of discussion only three words associated with Bph1 are shown. In the example of FIG. 1, assume that word w12 is the word that is ultimately detected. In that case, the path containing words w0 and w12 will be the actual path, and the other paths will be potential paths.
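Such a vocabulary tree is essentially a phoneme prefix tree (trie). The sketch below builds one for the three words of FIG. 1; the pronunciations and the class layout are assumptions for illustration, since the real phoneme inventory comes from the recognizer's dictionary:

    class TreeNode:
        def __init__(self, phoneme=None):
            self.phoneme = phoneme
            self.children = {}   # phoneme -> TreeNode
            self.word = None     # set on a node where a word ends

    def build_lexical_tree(lexicon):
        """Build a vocabulary tree: words sharing a phoneme prefix
        (fund / funds / fundamental) share the corresponding nodes."""
        root = TreeNode()
        for word, phonemes in lexicon.items():
            node = root
            for ph in phonemes:
                node = node.children.setdefault(ph, TreeNode(ph))
            node.word = word                   # terminal node identifies the word
        return root

    lexicon = {                                # hypothetical pronunciations
        "fund":        ["f", "ah", "n", "d"],
        "funds":       ["f", "ah", "n", "d", "z"],
        "fundamental": ["f", "ah", "n", "d", "ax", "m", "eh", "n", "t", "ax", "l"],
    }
    root = build_lexical_tree(lexicon)
    print(list(root.children))                 # one starting phoneme: ['f']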

In some embodiments, once the first phoneme of the successor to word w0 has been recognized, the probability of the successor word can be estimated, so that pruning can begin before the successor word is known for certain.

In some embodiments of the invention, a vocabulary-tree-based n-gram language model format can be used that applies efficiently to a language model look-ahead mechanism used with, for example, a tree-based Viterbi decoding algorithm. For a tree-based Viterbi beam search algorithm, the estimated language model probability πv(s) for tree state s and predecessor word string wn-1 wn-2 ... w1 can generally be estimated by equation (2), as follows:

    πv(s) = max over w ∈ W(s) of [ λw · p(w | wn-1 wn-2 ... w1) ]        (2)

where W(s) is the set of words reachable from vocabulary tree state s, λw represents a weight (a fraction), v is the predecessor word string, and p(w|wn-1 wn-2 ... w1) is the n-gram word conditional probability. πv(s) may also be called the estimated probability Pestimated used in establishing the pruning threshold. The estimated probability may also be called the look-ahead probability. As a result of applying language model look-ahead, a tighter pruning beam can be obtained to speed up the decoding process. The fractional weight λw can be set to 1 or can lie between 0 and 1. In some embodiments, λw may be greater than 1. The fractional weight can be determined empirically, through trial and error, or by calculation. The fractional weight may or may not be the same for each starting phoneme. Although the invention is expressed in terms of n-grams, in practice trigrams, bigrams, unigrams, and/or other grams may be used.
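As an illustrative sketch of equation (2) (assuming the TreeNode structure from the earlier sketch and a hypothetical cond_prob function that supplies p(w|wn-1 ... w1) for the current predecessor history), the look-ahead probability of every tree state can be computed in one pass over its subtree:

    def lookahead(node, cond_prob, lam=1.0):
        """Equation (2): pi_v(s) is the maximum of lam * p(w | history)
        over all words W(s) reachable from state s. Caches the value on
        each node; it must be recomputed when the predecessor history
        changes."""
        best = lam * cond_prob(node.word) if node.word else 0.0
        for child in node.children.values():
            best = max(best, lookahead(child, cond_prob, lam))
        node.pi = best
        return best

    # Usage with the tree built above and hypothetical bigram values:
    bigram = {"fund": 0.02, "funds": 0.01, "fundamental": 0.001}
    lookahead(root, lambda w: bigram.get(w, 0.0))
    print(root.children["f"].pi)   # 0.02, the best word below phoneme 'f'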

Each phoneme node corresponds to a state of the tree. As speech proceeds, more and more phonemes in the tree are detected, and the estimated probabilities may need to be recomputed so that pruning can continue.

Ordinarily, the estimated (computed) language model probabilities mentioned above must be calculated and generated dynamically at run time. That process is time consuming, even when a cache is introduced to reduce the overall computational overhead. Precomputing the estimated probabilities and storing them in a lookup table speeds up the process significantly.

In the example of FIG. 1, assume Bph1 is the first phoneme of the successor word. In that case, the bigram instance of equation (2) is given by equation (3), as follows:

    Pestimated = λw · max{P(w11|w0), P(w12|w0), P(w13|w0)}        (3)

Depending on the situation, words whose probabilities or conditional probabilities are below the threshold, or at or below the threshold, are pruned. The threshold can be derived in various ways, for example by multiplying Pestimated by a number or by subtracting a number from Pestimated.
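For example, with a multiplicative threshold, beam pruning in the log domain reduces to an additive margin below the best score. A minimal sketch, with hypothetical hypothesis tuples:

    import math

    def prune(hyps, factor=1e-4):
        """Keep hypotheses whose log score (path score plus look-ahead
        estimate) is above best + log(factor), i.e. whose probability is
        within `factor` of the current maximum."""
        best = max(score for _, score in hyps)
        cutoff = best + math.log(factor)
        return [(h, s) for h, s in hyps if s > cutoff]

    hyps = [("state1", -2.0), ("state2", -7.5), ("state3", -30.0)]
    print(prune(hyps))    # state3 falls outside the beam and is pruned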

To speed up the decoding process, we define a vocabulary-tree-based n-gram language model format for storing precomputed estimated probabilities, deploying a backoff mechanism that limits the memory requirement to a manageable range. In the general case, the estimated probability Pestimated can be obtained from equation (4), as follows:

    Pestimated(Sj | wn-1 wn-2 ... w1) =
        { max over w ∈ W(s) of [ λw · p(w | wn-1 ... w1) ]       if the n-gram exists
        { bowt(wn-1 ... w1) · Pestimated(Sj | wn-1 ... w2)       if only the backoff weight exists
        { Pestimated(Sj | wn-1 ... w2)                           otherwise
                                                                        (4)

where Sj is the j-th state of a potential successor word. Equation (4) comprises the three rows in braces. In the general case, the top row of equation (4) is equation (2). Of course, equation (4) can likewise be applied to different grams, such as unigrams, bigrams, and trigrams. Equation (4) provides an approximation of equation (2). Pestimated is stored in a storage area, for example in a lookup table, only when the top row of equation (4) is satisfied, so that the lookup table can be kept to a manageably small size.

In equation (4), the backoff weights need not be stored, because they are exactly the same as the weights stored in the standard word-based n-gram language model. During decoding, the backoff weights can be obtained from an ordinary file. During decoding, if the first row of equation (4) is not satisfied, the lower-order estimated probability, with a backoff weight where appropriate, is used instead.
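A decode-time retrieval sketch under these rules follows (hypothetical data structures: `table` holds only entries whose n-gram exists directly, per the top row of equation (4), and `bowts` holds backoff weights read from the ordinary word-based n-gram file):

    def lookup_estimated(table, bowts, history, state):
        """Retrieve P_estimated per equation (4): use a stored entry if
        the top row was satisfied at table-build time; otherwise back
        off to the shorter history, multiplying by the backoff weight
        (which is not duplicated in the look-ahead table)."""
        key = (tuple(history), state)
        if key in table:
            return table[key]
        if not history:
            return 1e-10                       # floor for an unseen state
        shorter = lookup_estimated(table, bowts, history[1:], state)
        return bowts.get(tuple(history), 1.0) * shorter

    # Hypothetical: state "s_f" (the subtree under phoneme 'f'), bigram entry only.
    table = {(("fund",), "s_f"): 0.02}
    bowts = {("mutual", "fund"): 0.7}
    print(lookup_estimated(table, bowts, ("mutual", "fund"), "s_f"))  # 0.7 * 0.02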

The probability used for pruning can be just the estimated probability of the successor word, or the estimated probability added to the probability of the predecessor word (for example, in FIG. 1, p(w0) + Pestimated).

In some embodiments, the lookup table stores the tree-based n-gram language model estimated probabilities as shown below; however, other formats may also be used.

1-gram:

p(s1) s1

...

i-gram (for i = 1, ..., n-1):

p(si|wi-1 ... w1) w1 ... wi-1 si

...

n-gram:

p(sn|wn-1 ... w1) w1 ... wn-1 sn

Because the total number of nodes in the compressed vocabulary tree is of the same order as the total number of words in the dictionary, the total storage for the vocabulary-tree-based n-gram language model with equation (4) as its approximation is of the same order as that of the corresponding conventional word-based n-gram language model. Processing techniques used for ordinary n-gram language models can be applied to the new vocabulary-tree-based language model files of the present invention.

In some embodiments, the estimated probabilities are computed before recognition and stored in a lookup table. To reduce the size of the table, however, in some embodiments only those entries derived directly from n-gram probabilities (without backoff) are stored. Entries that would be derived from backoff probabilities (an n-gram backed off to an (n-1)-gram) are approximately backed off to the (n-1)-gram estimated probabilities. With this compression, the size of the table can be reduced to a manageable level.

When the last phoneme (or terminal node) of a word is reached, the successor word can be identified. For example, in FIG. 1, once phoneme ph5 is reached, the word is known to be w12. Once the word is known, the estimated probability can be replaced by the actual probability. This can be done by adding the true conditional probability (for example, p(w12|w0) in FIG. 1) and subtracting the estimated probability. In some embodiments, the probability accumulated during the search can start from the first word hypothesis, for example, p(w1 w2 w3 ... wi) = p(w1) + p(w2|w1) + p(w3|w2) + ... + p(wi|wi-1). Logarithms of the probabilities can be used so that multiplication becomes addition: log(p1 · p2) = log(p1) + log(p2).

The true probability can be determined once the last phoneme has been recognized, and it can be expressed as Ptrue = p(wpredecessor) + Pestimated + P(wactual|wpredecessor) - Pestimated. In the example of FIG. 1, assuming word w12 is the actual word, the true probability is Ptrue = p(w0) + Pestimated + P(w12|w0) - Pestimated, where Pestimated can be obtained by the methods described above.
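A sketch of this bookkeeping in the log domain (hypothetical function names; the patent states only the arithmetic): the search accumulates the estimate into the path score, then swaps it for the true conditional probability at the terminal phoneme:

    import math

    def accumulate(path_score, estimated_log):
        """While inside a word, the path score carries the look-ahead
        estimate: products of probabilities become sums of logs."""
        return path_score + estimated_log

    def finalize_word(path_score, estimated_log, true_cond_log):
        """At the terminal phoneme: P_true = path + p(w_actual |
        w_predecessor) - P_estimated, all in the log domain."""
        return path_score - estimated_log + true_cond_log

    score = accumulate(math.log(0.1), math.log(0.02))   # p(w0), then the estimate
    score = finalize_word(score, math.log(0.02), math.log(0.01))
    print(math.exp(score))   # approximately 0.001 = p(w0) * p(w12 | w0)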

The nodes of a vocabulary tree can be collapsed, or compressed, by eliminating redundant nodes. For example, in FIG. 1, phonemes Bph1, ph2, ph3, and ph4 could be collapsed into one state (node). In practice, however, Bph1 usually has other branching words, so it may not be possible to collapse it with ph2-ph4. Phonemes ph6-ph10 can be collapsed into one state. In some embodiments, there are two vocabulary trees: the original one used by the speech recognizer, and a compressed vocabulary tree used for the language model. The compressed vocabulary tree can be used to create the lookup table during training. In training, the vocabulary tree can be created from a dictionary according to known techniques.
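A compression sketch follows (reusing the TreeNode structure from the earlier sketch; a simplification offered for illustration, not the patent's prescribed procedure) that collapses single-child, non-word-ending chains such as ph6-ph10 into one state:

    def compress(node):
        """Collapse linear phoneme chains: a child that has exactly one
        child and does not end a word is merged into its successor, so
        the compressed tree has roughly one node per dictionary word."""
        for ph in list(node.children):
            child = node.children[ph]
            merged = [child.phoneme]
            while len(child.children) == 1 and child.word is None:
                child = next(iter(child.children.values()))
                merged.append(child.phoneme)
            child.phoneme = merged             # the merged phoneme sequence
            node.children[ph] = child
            compress(child)
        return node

    compress(root)
    # 'fund' stays a branch point (it ends a word and continues to 'funds'),
    # while the tail of 'fundamental' collapses into a single state.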

Various computer systems can be used in training and in the speech recognition system. Merely as an example, FIG. 2 shows a schematic diagram of a computer system 10 having a processor 14, memory 16, and an input/output and control block 18. The processor 14 contains a substantial amount of storage; memory 16 may represent memory that is off the chip of processor 14, or partly on and partly off that chip. (Alternatively, memory 16 could be entirely on the chip of processor 14.) At least some of the input/output and control block 18 may be on the same chip as processor 14, or on a separate chip. A microphone 26, a monitor 30, additional memory 34, input devices (such as a keyboard and mouse 38), a network connection 42, and speakers 44 can all be coupled to the input/output and control block 18. Memory 34 may represent various kinds of storage, such as a hard disk, CD-ROM, or DVD. "Lookup table" can take any form and is not used as a limiting term. The stored estimated probabilities may be kept all together or scattered across different locations. Part or all of the table can be copied and placed in different memories. The lookup table may be located in memory 16, memory 34, or elsewhere. Lookup tables 22 and 24 represent all or part of a lookup table. It is again emphasized that the system of FIG. 2 is for illustration only, and the invention is not limited to use with such a computer system. Computer system 10 and the other computer systems used to implement the invention can take various forms, such as desktop, mainframe, and portable computers.

For example, FIG. 3 shows a handheld device 60 with a display 62 that can be used to implement some or all of the functions of FIG. 2. The handheld device may at times be connected to another computer system (such as that of FIG. 2). The shapes and relative sizes of the objects in FIGS. 2 and 3 are not intended to suggest actual shapes and relative sizes.

The various memories can be considered computer-readable media on which instructions can be stored that, when executed, implement some embodiments of the invention.

Additional Information and Embodiments

A vocabulary-tree-based bigram language model in the format described above has been implemented. By using the precomputed language model look-ahead, we not only save the computational cost of estimating the probabilities, a saving of up to 15% of the total computation time of the decoding task, but also save about 50 MB of memory that would otherwise be needed for the cache required when generating those probabilities dynamically. (These numbers are examples, however, not requirements.) Moreover, the new language model format makes it possible to handle higher-order language model look-ahead in reasonable time and memory.

Reference in this specification to "an embodiment," "one embodiment," "some embodiments," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some, but not necessarily all, embodiments of the invention. The various appearances of "an embodiment," "one embodiment," or "some embodiments" are not necessarily all referring to the same embodiments.

If the specification states that a component, feature, structure, or characteristic "may," "might," or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claims refer to "a" or "an" element, that does not mean there is only one of the element. If the specification or claims refer to "an additional" element, that does not preclude there being more than one additional element.

Those skilled in the art will appreciate that many variations may be made to the foregoing description and drawings within the scope of the invention. Accordingly, it is the following claims, together with any amendments thereto, that define the scope of the invention.

Claims (15)

1. A method of performing speech recognition, comprising:
creating a vocabulary tree;
identifying a first phoneme in the vocabulary tree;
estimating probabilities of words in the vocabulary tree having the first phoneme; and
storing at least some of the estimated probabilities, wherein the stored estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximately backed off to (n-1)-gram estimated probabilities.
2. The method according to claim 1, wherein an estimated probability is stored only if the corresponding n-gram exists.
3. The method according to claim 1, wherein the estimated probabilities are stored in a lookup table.
4. The method according to claim 3, wherein the lookup table comprises the following information:
1-gram: p(s1) s1
i-gram (for i = 1, ..., n-1): p(si|wi-1 ... w1) w1 ... wi-1 si
n-gram: p(sn|wn-1 ... w1) w1 ... wn-1 sn
5. The method according to claim 1, wherein the estimated probability Pestimated is obtained according to the following equation:

    Pestimated(Sj | wn-1 ... w1) =
        { max over w ∈ W(s) of [ λw · p(w | wn-1 ... w1) ]       if the n-gram exists
        { bowt(wn-1 ... w1) · Pestimated(Sj | wn-1 ... w2)       if only the backoff weight exists
        { Pestimated(Sj | wn-1 ... w2)                           otherwise

where Sj is the j-th state of the word associated with the first phoneme, W(s) is the set of words reachable from vocabulary tree state s, and λw represents a fractional weight, and wherein an estimated probability is stored only if the first row of the above equation is satisfied.
6. The method according to claim 5, wherein λw is 1.
7. The method according to claim 5, wherein λw is between 0 and 1 and is selected for each first phoneme.
8. A method of performing speech recognition, comprising:
receiving phonemes and identifying them in a vocabulary tree; and
estimating probabilities of words containing the phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximately backed off to (n-1)-gram estimated probabilities.
9. The method according to claim 8, wherein the estimated probabilities are stored in a lookup table.
10. The method according to claim 9, wherein the lookup table comprises the following information, where s is a state of the vocabulary tree and p is a probability:
1-gram: p(s1) s1
i-gram (for i = 1, ..., n-1): p(si|wi-1 ... w1) w1 ... wi-1 si
n-gram: p(sn|wn-1 ... w1) w1 ... wn-1 sn
11. The method according to claim 8, wherein backoff weight information can be derived from the weights stored in a word-based n-gram language model.
12. The method according to claim 8, wherein the estimated probabilities are used in establishing a pruning threshold.
13. The method according to claim 8, wherein the estimated probabilities are determined according to the following equation:

    Pestimated(Sj | wn-1 ... w1) =
        { max over w ∈ W(s) of [ λw · p(w | wn-1 ... w1) ]       if the n-gram exists
        { bowt(wn-1 ... w1) · Pestimated(Sj | wn-1 ... w2)       if only the backoff weight exists
        { Pestimated(Sj | wn-1 ... w2)                           otherwise

where Sj is the j-th state of the word associated with the first phoneme, W(s) is the set of words reachable from vocabulary tree state s, and λw represents a fractional weight, and wherein only results of the first row are stored.
CN99817058.5A | 1999-12-23 | 1999-12-23 | Speech recognizer with a lexical tree based N-gram language model | Expired - Fee Related | CN1201286C (en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
PCT/CN1999/000217 (WO2001048737A2) | 1999-12-23 | 1999-12-23 | Speech recognizer with a lexical tree based n-gram language model

Publications (2)

Publication Number | Publication Date
CN1406374A | 2003-03-26
CN1201286C | 2005-05-11

Family

Family ID: 4575158

Family Applications (1)

Application Number | Status | Publication
CN99817058.5A | Expired - Fee Related | CN1201286C (en)

Country Status (3)

Country | Link
CN | CN1201286C (en)
AU | AU1767600A (en)
WO | WO2001048737A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101271450B * | 2007-03-19 | 2010-09-29 | Toshiba Corp | Method and device for tailoring a language model

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
GB0420464D0 | 2004-09-14 | 2004-10-20 | Zentian Ltd | A speech recognition circuit and method
GB2453366B * | 2007-10-04 | 2011-04-06 | Toshiba Res Europ Ltd | Automatic speech recognition method and apparatus
WO2010105428A1 | 2009-03-19 | 2010-09-23 | Google Inc. | Input method editor
JP5521028B2 | 2009-03-19 | 2014-06-11 | Google Inc. | Input method editor
US8655647B2 | 2010-03-11 | 2014-02-18 | Microsoft Corporation | N-gram selection for practical-sized language models
US8589164B1 * | 2012-10-18 | 2013-11-19 | Google Inc. | Methods and systems for speech recognition processing using search query information
CN111128172B * | 2019-12-31 | 2022-12-16 | Cloudminds Robotics Co., Ltd. | Voice recognition method, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP3009709B2 * | 1990-07-13 | 2000-02-14 | Nippon Telegraph and Telephone Corp | Japanese speech recognition method
DE4130631A1 * | 1991-09-14 | 1993-03-18 | Philips Patentverwaltung | Method for recognizing the spoken words in a speech signal
JPH0772840B2 * | 1992-09-29 | 1995-08-02 | IBM Japan Ltd | Speech model configuration method, speech recognition method, speech recognition device, and speech model training method
US5621859A * | 1994-01-19 | 1997-04-15 | BBN Corporation | Single tree method for grammar directed, very large vocabulary speech recognizer
JPH08123479A * | 1994-10-26 | 1996-05-17 | ATR Onsei Honyaku Tsushin Kenkyusho KK | Continuous speech recognition device
JP3304665B2 * | 1995-02-17 | 2002-07-22 | Matsushita Electric Industrial Co Ltd | Voice recognition device
CA2211636C * | 1995-03-07 | 2002-01-22 | British Telecommunications Public Limited Company | Speech recognition
US5832428A * | 1995-10-04 | 1998-11-03 | Apple Computer, Inc. | Search engine for phrase recognition based on prefix/body/suffix architecture
US5758024A * | 1996-06-25 | 1998-05-26 | Microsoft Corporation | Method and system for encoding pronunciation prefix trees
US5822730A * | 1996-08-22 | 1998-10-13 | Dragon Systems, Inc. | Lexical tree pre-filtering in speech recognition
KR100509797B1 * | 1998-04-29 | 2005-08-23 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
EP1078355B1 * | 1998-05-11 | 2002-10-02 | Siemens Aktiengesellschaft | Method and array for introducing temporal correlation in hidden markov models for speech recognition
JPH11344991A * | 1998-05-30 | 1999-12-14 | Brother Ind Ltd | Voice recognition device and storage medium


Also Published As

Publication number | Publication date
CN1406374A | 2003-03-26
WO2001048737A2 | 2001-07-05
AU1767600A | 2001-07-09
WO2001048737A3 | 2002-11-14

Similar Documents

Publication | Title
CN1285068C | Text normalization method using a context-free grammar
US7711561B2 | Speech recognition system and technique
US7831428B2 | Speech index pruning
US7856350B2 | Reranking QA answers using language modeling
JP5331801B2 | Method and apparatus for calculating language model look-ahead probability
US7860719B2 | Disfluency detection for a speech-to-speech translation system using phrase-level machine translation with weighted finite state transducers
CN1211779C | Method and apparatus for determining non-target languages in a speech recognition system
EP1575029A2 | Generating large units of graphonemes with mutual information criterion for letter to sound conversion
EP1922653A1 | Word clustering for input data
CN1156820C | Recognition system using a vocabulary tree
US20080147381A1 | Compound word splitting for directory assistance services
US7401019B2 | Phonetic fragment search in speech data
WO2022203773A1 | Lookup-table recurrent language model
CN1201286C | Speech recognizer with a lexical tree based N-gram language model
CN102027534A | Language model score look-ahead value assignment device, language model score look-ahead value assignment method, and program storage medium
KR100480790B1 | Method and apparatus for continuous speech recognition using a bi-directional n-gram language model
WO2012131822A1 | Voice recognition result shaping device, voice recognition result shaping method, and program
JP2000259645A | Voice processing device and voice data search device
JP4990822B2 | Dictionary correction device, system, and computer program
JP2886121B2 | Statistical language model generation device and speech recognition device
Maskey et al. | A phrase-level machine translation approach for disfluency detection using weighted finite state transducers
JP2938865B1 | Voice recognition device
JP4521631B2 | Storage medium recording a tree-structure dictionary and a program for creating language score tables for the tree-structure dictionary
JP2000267693A | Voice processor and index preparation device
JP4047900B1 | Dependency analyzer and program thereof

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
C17 | Cessation of patent right
CF01 | Termination of patent right due to non-payment of annual fee (granted publication date: 2005-05-11; termination date: 2012-12-23)

