JP4839291B2

Info

Publication number: JP4839291B2
Application number: JP2007252817A
Authority: JP
Inventors: 俊樹遠藤; 正樹内藤; 恒河井
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2007-09-28
Filing date: 2007-09-28
Publication date: 2011-12-21
Anticipated expiration: 2027-09-28
Also published as: JP2009086063A

Abstract

PROBLEM TO BE SOLVED: To reduce the number of editing operations a user must perform when candidate words obtained from the speech recognition result of input speech are displayed on a screen and the user edits the displayed candidate words.

SOLUTION: The speech recognition device comprises: a speech recognition section 13; a candidate word generation section 16 that extracts candidate words from the speech recognition result and groups them; a candidate word edit/display section 17 that displays the grouped candidate words on the screen and updates the recognition result according to the user's edits; and an edit operation section 18 with which the user edits the candidate words displayed on the screen. The candidate word generation section 16 groups the candidate words of a first candidate word string (the candidate word string forming the sentence with the maximum reliability) and a second candidate word string (the other candidate word strings) so that the sum of the temporal distances between the candidate words of the first candidate word string and the candidate words of the second candidate word string within the same group is minimized.

COPYRIGHT: (C)2009, JPO&INPIT

Description

Translated from Japanese

The present invention relates to a speech recognition apparatus and a computer program.

Conventionally, in speech recognition using a computer, it is difficult to achieve a recognition rate of 100% because of the influence of the speaker's manner of utterance, background noise at the time of speech input, and the like. For this reason, the speech recognition apparatus described in Patent Document 1, for example, compares each of a plurality of words contained in the input speech with a plurality of words stored in advance in a dictionary, takes the word with the highest competition probability among the competing candidates as the speech recognition result, and displays the speech recognition result on the screen as a word string of a plurality of words. It then selects, from the competing candidates, one or more competing words whose competition probabilities are close to that of the word with the highest competition probability, and displays them on the screen adjacent to the corresponding highest-probability word. In response to a manual operation by the user, an appropriate correction word is selected from the one or more competing words displayed on the screen, and the selected competing word replaces the highest-probability word in the speech recognition result.
Patent Document 1: JP 2006-146008 A

However, the conventional speech recognition apparatus described above has the following problem.
In the process of simplifying the recognition results of a plurality of sequences, the speech recognition apparatus described in Patent Document 1 groups candidate words that belong to the individual sequences and do not share the same notation. In doing so, it classifies candidate words with similar readings that belong to different sequences into the same group whenever they overlap in time. However, when words with similar readings are located close together in time, as in "庭には二羽鶏がいた" ("niwa ni wa niwa niwatori ga ita", "there were two chickens in the garden"), the same word may become a candidate word in a plurality of sections, as shown in FIG. 12. The same word is then displayed in a plurality of sections among the candidate words shown on the screen from the speech recognition result, and the number of displayed candidate words increases. Consequently, when the user edits the text on the screen by selecting the correct answer from the displayed candidate words, deleting a candidate word, changing a candidate word to a new one, and so on, the user has to edit the same word many times, which takes time and effort. In addition, because the same candidate word appears repeatedly in consecutive time sections, the user must check the preceding and following candidate words when selecting a candidate word, which is a burden on the user.

The present invention has been made in view of these circumstances. Its object is to provide a speech recognition apparatus and a computer program that can reduce the number of operations a user performs to edit candidate words when candidate words from the speech recognition result of input speech are displayed on a screen and the user edits the displayed candidate words.

To solve the above problem, a speech recognition apparatus according to the present invention comprises: speech recognition means for performing processing to recognize input speech and generating a recognition result consisting of a sequence of recognized words; candidate word extraction means for extracting candidate words from the recognition result; candidate word grouping means for grouping the candidate words; candidate word display means for displaying the grouped candidate words on a screen; editing operation means with which a user edits the candidate words displayed on the screen; and updating means for updating the recognition result according to the user's edits. The candidate word grouping means has: means for extracting the sequence of candidate words forming the sentence with the maximum reliability in the recognition result (first candidate word string) and the sequences of the other candidate words (second candidate word string); means for calculating the temporal distance between candidate words in the first candidate word string and candidate words in the second candidate word string; and means for grouping the candidate words of the first and second candidate word strings so that the sum of the temporal distances between the candidate words in the first candidate word string and the candidate words in the second candidate word string within the same group is minimized.

In the speech recognition apparatus according to the present invention, the temporal distance represents the temporal overlap between candidate words.
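The grouping by temporal distance described above can be illustrated with a minimal sketch. This section does not give the distance formula, so the sketch assumes a hypothetical distance of one minus the overlap divided by the shorter candidate's duration, and it assigns each candidate of the second candidate word string to the first-string candidate at the smallest distance, which minimizes the within-group sum when assignments are independent. The names `Candidate`, `temporal_distance`, and `group_by_temporal_distance` are illustrative and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def temporal_distance(a: Candidate, b: Candidate) -> float:
    """Assumed distance: 0.0 for full overlap of the shorter word, 1.0 for none."""
    overlap = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    shorter = min(a.end - a.start, b.end - b.start)
    if shorter <= 0:
        return 1.0
    return 1.0 - overlap / shorter


def group_by_temporal_distance(first, second):
    """Put each second-string candidate into the group of its nearest first-string candidate."""
    groups = [[c] for c in first]  # one group per first-string candidate
    for cand in second:
        best = min(range(len(first)),
                   key=lambda i: temporal_distance(first[i], cand))
        groups[best].append(cand)
    return groups


first = [Candidate("kyou", 0.0, 0.5), Candidate("kaihi", 0.5, 1.0)]
second = [Candidate("kaigi", 0.52, 1.0), Candidate("kyo", 0.0, 0.45)]
groups = group_by_temporal_distance(first, second)
```

Here `kaigi` joins the group of `kaihi` and `kyo` joins the group of `kyou`, because those pairs overlap in time while the cross pairs do not.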

A speech recognition apparatus according to the present invention comprises: speech recognition means for performing processing to recognize input speech and generating a recognition result consisting of a sequence of recognized words; candidate word extraction means for extracting candidate words from the recognition result; candidate word grouping means for grouping the candidate words; candidate word display means for displaying the grouped candidate words on a screen; editing operation means with which a user edits the candidate words displayed on the screen; and updating means for updating the recognition result according to the user's edits. The candidate word grouping means has: means for extracting the sequence of candidate words forming the sentence with the maximum reliability in the recognition result (first candidate word string) and the sequences of the other candidate words (second candidate word string); means for calculating the temporal overlap between candidate words in the first candidate word string and candidate words in the second candidate word string; means for calculating the phoneme edit distance between candidate words in the first candidate word string and candidate words in the second candidate word string; means for calculating the distance between candidate words using the temporal overlap and the phoneme edit distance between the candidate words; and means for grouping together those candidate words for which the distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string is smallest.

In the speech recognition apparatus according to the present invention, the phoneme edit distance between the candidate words is the number of edits per phoneme.
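As a rough illustration of an edit count per phoneme, the sketch below computes a standard Levenshtein distance over phoneme sequences and divides it by a length. The choice of normalizer is an assumption: this section does not say which length is used, so the sketch divides by the longer of the two sequences. The function names and the romanized phoneme lists are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if pa == pb else 1)
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            substitution))
        prev = curr
    return prev[-1]


def edits_per_phoneme(a, b):
    """Edit distance divided by the longer sequence length (assumed normalizer)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


kaigi = ["k", "a", "i", "g", "i"]
kaihi = ["k", "a", "i", "h", "i"]
score = edits_per_phoneme(kaigi, kaihi)  # one substitution over five phonemes
```

Normalizing by length keeps long and short candidate words comparable; a single substitution weighs less in a long word than in a short one.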

In the speech recognition apparatus according to the present invention, the phoneme edit distance between the candidate words represents the acoustic similarity between phonemes in the pairs of phonemes that must be substituted to make the phonemes of the candidate words match.

In the speech recognition apparatus according to the present invention, the acoustic similarity between phonemes is the degree of agreement between the phonemes in place of articulation and manner of articulation.

In the speech recognition apparatus according to the present invention, the phoneme edit distance between the candidate words represents how easily the phonemes are misrecognized for one another.

The speech recognition apparatus according to the present invention comprises means for concatenating a plurality of candidate words in one candidate word string when those candidate words are determined to belong to one group.

A computer program according to the present invention causes a computer to realize: a speech recognition function of performing processing to recognize input speech and generating a recognition result consisting of a sequence of recognized words; a candidate word extraction function of extracting candidate words from the recognition result; a candidate word grouping function of grouping the candidate words; a candidate word display function of displaying the grouped candidate words on a screen; an editing operation function with which a user edits the candidate words displayed on the screen; and an updating function of updating the recognition result according to the user's edits. The candidate word grouping function extracts the sequence of candidate words forming the sentence with the maximum reliability in the recognition result (first candidate word string) and the sequences of the other candidate words (second candidate word string), calculates the temporal distance between candidate words in the first candidate word string and candidate words in the second candidate word string, and groups the candidate words of the first and second candidate word strings so that the sum of the temporal distances between the candidate words in the first candidate word string and the candidate words in the second candidate word string within the same group is minimized.

A computer program according to the present invention causes a computer to realize: a speech recognition function of performing processing to recognize input speech and generating a recognition result consisting of a sequence of recognized words; a candidate word extraction function of extracting candidate words from the recognition result; a candidate word grouping function of grouping the candidate words; a candidate word display function of displaying the grouped candidate words on a screen; an editing operation function with which a user edits the candidate words displayed on the screen; and an updating function of updating the recognition result according to the user's edits. The candidate word grouping function extracts the sequence of candidate words forming the sentence with the maximum reliability in the recognition result (first candidate word string) and the sequences of the other candidate words (second candidate word string), calculates the temporal overlap between candidate words in the first candidate word string and candidate words in the second candidate word string, calculates the phoneme edit distance between candidate words in the first candidate word string and candidate words in the second candidate word string, calculates the distance between candidate words using the temporal overlap and the phoneme edit distance between the candidate words, and groups together those candidate words for which the distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string is smallest.
This allows the speech recognition apparatus described above to be realized using a computer.

According to the present invention, the same word can be prevented from becoming a candidate word in a plurality of sections, so that the same word is not displayed in a plurality of sections among the candidate words shown on the screen from the speech recognition result. This has the effect of reducing the number of operations the user performs to edit candidate words.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the overall configuration of a speech recognition apparatus 1 according to an embodiment of the present invention. In FIG. 1, the speech recognition apparatus 1 includes a speech input unit 11, an acoustic feature extraction unit 12, a speech recognition unit 13, an acoustic model storage unit 14, a language model storage unit 15, a candidate word generation unit 16, a candidate word editing/display unit 17, and an editing operation unit 18.

The speech input unit 11 consists of a microphone, an amplifier, an analog-to-digital converter (AD converter), and the like. It picks up the speech uttered by the user with the microphone, amplifies the input analog speech signal to an appropriate level, and converts it into digital speech data. This speech data is sent to the acoustic feature extraction unit 12.

Note that the speech input unit 11 may instead include a communication interface connected to a communication line such as a telephone line or an IP (Internet Protocol) network, and send digital speech data received over the communication line to the acoustic feature extraction unit 12. If the speech data is encoded, the decoded speech data is sent to the acoustic feature extraction unit 12.

The acoustic feature extraction unit 12 extracts, from the speech data received from the speech input unit 11, the acoustic features used in the subsequent speech recognition processing. The acoustic feature data is sent to the speech recognition unit 13.

The speech recognition unit 13 performs speech recognition processing on the acoustic feature data received from the acoustic feature extraction unit 12. This processing uses the acoustic model stored in the acoustic model storage unit 14 and the language model stored in the language model storage unit 15. The acoustic model and the language model are built in advance, as a preparation stage, by training on training data and are stored in the storage units 14 and 15, respectively.

Through speech recognition processing using the acoustic model and the language model, the speech recognition unit 13 recognizes words from the acoustic feature data and creates a recognition result consisting of a sequence of recognized words. At this time, it creates recognition results not only for the most probable word sequence but also for the other recognized word sequences. For each recognition result, the speech recognition unit 13 calculates the certainty (reliability) of the recognition result from an acoustic score (acoustic likelihood) and a linguistic probability (language probability). The language probability is the probability that a sequence of a fixed number of words (for example, three) appears. From the recognition results it has created, the speech recognition unit 13 takes those whose reliability falls within a predetermined rank and uses them to create a recognition result in word network format.

FIG. 2 shows an example of the structure of a recognition result in word network format. A word network format such as that shown in FIG. 2 is conventionally called a lattice format. The example in FIG. 2 is for the case where the user reads aloud the sentence "今日の午後３時に会議です" ("There is a meeting at 3 p.m. today"). As shown in FIG. 2, a plurality of recognition results (those whose reliability falls within a predetermined rank) are used, and the boundaries of temporally corresponding words contained in each recognition result are connected in the form of a network. The content of the recognition result in FIG. 2 is for convenience of explanation.

The speech recognition unit 13 includes, in the recognition result in word network format, part-of-speech information indicating the part of speech of each word, together with the acoustic likelihood, language probability, and reliability of each word. The recognition result in word network format is sent to the candidate word generation unit 16.

The candidate word generation unit 16 extracts candidate words from the recognition result in word network format received from the speech recognition unit 13 and groups the extracted candidate words. It outputs candidate word data, consisting of the sequence of candidate words generated from the recognition result in word network format, to the candidate word editing/display unit 17. In the candidate word data, the recognition result is expressed as a sequence of candidate words, and the candidate words are grouped.

The candidate word editing/display unit 17 displays the candidate word data received from the candidate word generation unit 16 on the screen. The editing operation unit 18 has various operation keys for editing: for example, an operation key for the user to select the correct candidate word from the candidate words displayed on the screen, an operation key for the user to delete a candidate word, an operation key for the user to input a new candidate word, and an operation key for the user to indicate that editing of the recognition result is finished. The editing operation unit 18 notifies the candidate word editing/display unit 17 of the edits the user made with the operation keys. The candidate word editing/display unit 17 updates the recognition result according to the edits notified by the editing operation unit 18, and then updates the screen with the candidate word data corresponding to the updated recognition result. In this way, a recognition result reflecting the user's edits is displayed on the screen.

FIG. 3 shows a configuration example of the candidate word generation unit 16 shown in FIG. 1. In FIG. 3, the candidate word generation unit 16 has a candidate word extraction unit 30, a candidate word grouping unit 31, an identical candidate word unification unit 32, a candidate word addition unit 33, and a candidate word group storage unit 34.

The candidate word extraction unit 30 extracts candidate words from the words contained in the recognition result in word network format. It extracts either an individual word or a plurality of consecutive words as one candidate word.

The candidate word grouping unit 31 groups the candidate words extracted from the recognition result in word network format. Candidate words are grouped based on the closeness of their readings, time information, and the like. The candidate word grouping unit 31 aligns the start and end times of the candidate words in the same group with the start and end times of the candidate word with the maximum reliability. As a result, the recognition result in word network format shown in FIG. 2 becomes a candidate word network as shown in FIG. 4, which is a configuration example of the recognition result after candidate word grouping. Whereas FIG. 2 is connected as a network in units of words, FIG. 4 is connected as a network in units of candidate words, and the candidate words are grouped. This simplifies the recognition result.

For the recognition result after candidate word grouping, the identical candidate word unification unit 32 merges candidate words with the same notation in the same group into one candidate word and recalculates the reliability of that candidate word. The reliability of the candidate word after unification is obtained from the reliabilities of the candidate words before unification, for example as their average, their sum, or their maximum. The identical candidate word unification unit 32 further limits the number of candidate words in each time section to a predetermined number, taken in descending order of probability. As a result, the recognition result after candidate word grouping shown in FIG. 4 is simplified as shown in FIG. 5, which is a configuration example of the recognition result after unification of identical candidate words. In the example of FIG. 5, the candidate words in each time section are listed from the top in descending order of reliability.
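The unification step can be sketched as follows. The sketch assumes a group is a list of (notation, reliability) pairs and exposes the three combination rules named in the text (average, sum, maximum) through a hypothetical `combine` parameter; the function name `unify_group` and the default limit of three are illustrative assumptions, not values from the patent.

```python
from collections import defaultdict


def unify_group(candidates, combine="max", limit=3):
    """Merge same-notation candidates in one group and keep the top `limit`."""
    by_text = defaultdict(list)
    for text, reliability in candidates:
        by_text[text].append(reliability)

    combiners = {
        "mean": lambda xs: sum(xs) / len(xs),
        "sum": sum,
        "max": max,
    }
    merge = combiners[combine]

    merged = [(text, merge(values)) for text, values in by_text.items()]
    # Highest recomputed reliability first, as in the FIG. 5 ordering.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:limit]


group = [("kaihi", 0.5), ("kaigi", 0.3), ("kaihi", 0.25), ("kaiki", 0.1)]
top_two = unify_group(group, combine="sum", limit=2)
```

With `combine="sum"`, the two `kaihi` entries collapse into one with reliability 0.75, and only the two most reliable candidates are kept.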

The candidate word addition unit 33 adds candidate words to the recognition result after unification of identical candidate words, based on the history of past candidate word groups. The candidate word group storage unit 34 stores the history of past candidate word groups. The candidate word addition unit 33 reads from the candidate word group storage unit 34 the group history for the candidate word with the maximum reliability in the recognition result after unification. If the history that was read contains a candidate word that does not exist in the corresponding group of the recognition result after unification, the candidate word addition unit 33 adds that candidate word to the group in the recognition result. Conversely, if a candidate word in the group of the recognition result after unification does not exist in the group history read from the candidate word group storage unit 34, that candidate word is added to the group history in the candidate word group storage unit 34.
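A minimal sketch of this two-way exchange between a group and its stored history is given below. It assumes the history is a dictionary keyed by the notation of the most reliable candidate, that each group is an ordered list with that candidate first, and that the history stores the whole group including the top candidate; the name `sync_group_with_history` and this storage layout are assumptions made for illustration.

```python
def sync_group_with_history(group, history):
    """Add past candidates to the group and new candidates to the history.

    group   -- list of notations, most reliable candidate first (modified in place)
    history -- dict mapping a top candidate to its past group members (modified in place)
    """
    top = group[0]
    past = history.setdefault(top, [])

    # Compute both differences before mutating either list.
    missing_in_group = [word for word in past if word not in group]
    missing_in_history = [word for word in group if word not in past]

    group.extend(missing_in_group)
    past.extend(missing_in_history)
    return group


history = {"kaihi": ["kaihi", "kaigi"]}
group = ["kaihi", "kaiki"]
sync_group_with_history(group, history)
```

After the call, the group has gained `kaigi` from the history and the history has gained `kaiki` from the group, so a candidate seen once remains available in later utterances.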

The candidate word generation unit 16 outputs the candidate word data corresponding to the recognition result after candidate word addition to the candidate word editing/display unit 17.

FIG. 6 shows a configuration example of the candidate word editing/display unit 17 shown in FIG. 1. In FIG. 6, the candidate word editing/display unit 17 has a candidate word data analysis/update unit 41, a candidate word group/candidate word selection history storage unit 42, and a candidate word display unit 43.

The candidate word data analysis/update unit 41 analyzes the candidate word data and creates and holds a provisional recognition result by concatenating the candidate word with the maximum reliability in each time section. The provisional recognition result and the data of the candidate words in the same group as each candidate word are sent to the candidate word display unit 43. Only the amount that can be displayed on the screen is sent to the candidate word display unit 43 at this time.
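Building the provisional recognition result amounts to taking the most reliable candidate in each time section and concatenating them in order. The sketch below assumes the candidate word data is a list of time sections, each a list of (notation, reliability) pairs; this layout and the name `provisional_result` are illustrative assumptions.

```python
def provisional_result(sections):
    """Return the most reliable candidate of each time section, in order."""
    return [max(section, key=lambda pair: pair[1])[0] for section in sections]


sections = [
    [("kyou", 0.9), ("kyo", 0.4)],
    [("no", 0.8)],
    [("kaigi", 0.3), ("kaihi", 0.5)],
]
words = provisional_result(sections)
sentence = " ".join(words)  # spaces stand in for the visible candidate boundaries
```

The remaining candidates of each section are not discarded; they stay available as the alternatives shown to the user for that position.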

The candidate word display unit 43 displays the recognition result received from the candidate word data analysis/update unit 41 on the screen of the display device. The boundaries between candidate words are made explicit, for example with blanks. If a candidate word has other candidate words grouped with it, this is indicated, for example by underlining. In addition, the candidate words of the same group are displayed in a screen separate from the one showing the recognition result, listed there in descending order of reliability.

When the candidate word data analysis/update unit 41 receives the user's edits from the editing operation unit 18, it updates the recognition result accordingly. For example, it changes the recognition result according to edits such as selecting the correct candidate word, deleting a candidate word, changing the order of candidate words, or inputting a new candidate word. When the correct candidate word is selected, the edited position is replaced with that candidate word and the other candidate words are deleted. When a candidate word is deleted, all candidate words at the edited position are deleted. When a new candidate word is input, the input candidate word is inserted at the edited position. The candidate word data analysis/update unit 41 sends the edited recognition result and the data of the candidate words in the same group as each candidate word to the candidate word display unit 43.
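The select, delete, and insert updates described above can be sketched as operations on a list of positions, each holding the notations still in play at that position. The function name `apply_edit`, the operation labels, and the decision to remove a deleted position entirely are assumptions for illustration; reordering of candidate words is omitted from this sketch.

```python
def apply_edit(sections, index, op, word=None):
    """Return a new list of sections with one edit applied at `index`."""
    result = [list(section) for section in sections]  # leave the input untouched
    if op == "select":
        if word not in result[index]:
            raise ValueError(f"{word!r} is not a candidate at position {index}")
        result[index] = [word]         # keep only the chosen candidate
    elif op == "delete":
        del result[index]              # drop every candidate at this position
    elif op == "insert":
        result.insert(index, [word])   # place the newly typed word here
    else:
        raise ValueError(f"unknown edit operation: {op!r}")
    return result


sections = [["kaihi", "kaigi"], ["desu", "deshita"]]
selected = apply_edit(sections, 0, "select", "kaigi")
deleted = apply_edit(sections, 1, "delete")
inserted = apply_edit(sections, 1, "insert", "wa")
```

Returning a new list rather than editing in place makes it easy to keep the previous recognition result around, for example to support an undo operation, although the source does not mention one.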

When the candidate word data analysis/update unit 41 receives an instruction from the editing operation unit 18 to move the edit position, it sends the recognition result corresponding to the destination and the data of the candidate words in the same group as each candidate word to the candidate word display unit 43.

The candidate word group/candidate word selection history storage unit 42 holds the groups of candidate words and the probability with which the user has selected each candidate word (the user selection probability). By referring to the storage unit 42, the candidate word data analysis/update unit 41 can reorder the displayed candidate words of the group at the edit position in descending order of user selection probability. Whether this reordering by user selection probability is applied is selectable.
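As a minimal sketch of this optional reordering, the following assumes the selection history is available as a plain dictionary from word to probability. The function name, the `use_history` flag, and the default probability of 0 for unseen words are all assumptions made for illustration.

```python
def order_candidates(candidates, selection_prob, use_history=True):
    """Return the group's candidates, optionally sorted by user selection probability.

    candidates:     words of one group, in their default (reliability) order
    selection_prob: hypothetical mapping word -> probability the user picked it
    use_history:    the on/off switch for the reordering
    """
    if not use_history:
        return list(candidates)
    # sorted() is stable, so words with equal probability keep their default order.
    return sorted(candidates, key=lambda w: selection_prob.get(w, 0.0), reverse=True)
```

With the switch off, the default order is returned unchanged, which matches the statement that the reordering can be enabled or disabled.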

Next, the candidate word grouping processing according to the present embodiment is described in detail using several examples.

From the candidate words that the candidate word extraction unit 30 extracted from the word-network recognition result, the candidate word grouping unit 31 extracts the sequence of candidate words forming the sentence with the maximum reliability in that recognition result (hereinafter, the "first candidate word string") and the remaining sequences of candidate words (hereinafter, "second candidate word strings"). In the example of FIG. 2, the first candidate word string is the solid-line path “今日の午後３時に会費です” (kyō no gogo sanji ni kaihi desu, "the membership fee is at 3 p.m. today"), and the second candidate word strings are the candidate word strings other than that one.

Next, the candidate word grouping unit 31 calculates the temporal distance between each candidate word in the first candidate word string and each candidate word in the second candidate word string. It then groups the candidate words of the two strings so that the sum of the temporal distances between first-string and second-string candidate words within the same group is minimized. The combination of candidate words that minimizes this temporal distance can be found efficiently by DTW (Dynamic Time Warping).
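The text names DTW but does not give its details, so the following is a textbook DTW sketch rather than the patented procedure. It assumes a precomputed cost matrix whose entry `cost[i][j]` is the temporal distance between the i-th word of the first string and the j-th word of the second string, and it assumes the usual three step directions (diagonal, down, right). The function names are hypothetical.

```python
def dtw_align(cost):
    """Return (total_cost, path) of the minimum-cost monotone alignment."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best
    # Backtrack from the last cell; ties favour the diagonal step.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


def groups_from_path(path):
    """Collect, for each first-string index, the second-string indices aligned to it."""
    groups = {}
    for i, j in path:
        groups.setdefault(i, []).append(j)
    return groups
```

Because the path never moves backwards in either string, the word order of both strings is preserved, and each pair on the path places the two words in the same group.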

As the temporal distance between a candidate word in the first candidate word string and one in the second, the value obtained by subtracting the temporal overlap of the two candidate words from 1 can be used, for example. The temporal overlap O(w_i, w_j) of candidate words w_i and w_j can be calculated, for example, by equation (1). Equation (1) defines O(w_i, w_j) as the proportion of the overlapping duration l(w_i ∩ w_j) relative to the sum of the durations l(w_i) and l(w_j) of the candidate words. FIG. 7 shows the relationship among the duration l(w_i) of candidate word w_i, the duration l(w_j) of candidate word w_j, and their overlap l(w_i ∩ w_j).
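Equation (1) itself is not reproduced in this text, so the sketch below follows the prose definition literally: overlap duration divided by the sum of the two durations. That normalization is an assumption, and the published equation may include a constant factor. Under this literal reading, two identical intervals give O = 0.5. Words are represented as hypothetical `(start, end)` pairs in seconds.

```python
def temporal_overlap(a, b):
    """O(w_i, w_j): overlap duration over the sum of both durations (literal reading)."""
    (start_a, end_a), (start_b, end_b) = a, b
    shared = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    total = (end_a - start_a) + (end_b - start_b)
    return shared / total if total > 0 else 0.0


def temporal_distance(a, b):
    """Temporal distance as described in the text: 1 minus the temporal overlap."""
    return 1.0 - temporal_overlap(a, b)
```

Intervals that do not intersect have overlap 0 and therefore the maximum temporal distance of 1.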

FIG. 8 shows an example of candidate word grouping according to Example 1, applied to the recognition result of FIG. 2. The strings processed are the first candidate word string “今日の午後３時に会費です” (kyō no gogo sanji ni kaihi desu) and the second candidate word string “京の午後３時に会議です” (kyō no gogo sanji ni kaigi desu). The pairs with a large temporal overlap, namely “今日の” and “京の”, “午後” and “午後”, “３時に” and “３時に”, “会費” and “会議”, and “です” and “です”, are each placed in the same group, forming five groups in total, groups 1 to 5.

According to Example 1, the candidate words of the first and second candidate word strings are grouped so that the sum of the temporal distances between them is minimized. This groups the candidate words while preserving their order and prevents the same word from becoming a candidate word in more than one segment. As a result, the same word is not displayed in multiple segments among the candidate words shown on the screen from the speech recognition result.

Example 2 uses, as in Example 1, the temporal overlap O(w_i, w_j) between candidate word w_i in the first candidate word string and candidate word w_j in the second, and additionally uses the phoneme edit distance (Levenshtein distance) L(w_i, w_j) between candidate words as a measure of how close their readings are.

As in Example 1, the candidate word grouping unit 31 calculates the temporal overlap O(w_i, w_j) between candidate word w_i in the first candidate word string and candidate word w_j in the second. It then calculates the phoneme edit distance L(w_i, w_j) between the two candidate words. The phoneme edit distance L(w_i, w_j) is the number of edits per phoneme. The number of edits is the total number of phoneme deletions, insertions, and substitutions needed to make the phonemes of w_i and w_j match. With N_wi the number of phonemes in w_i, N_wj the number of phonemes in w_j, and N_e the number of edits, L(w_i, w_j) can be calculated by equation (2).

When calculating the phoneme edit distance L(w_i, w_j) between candidate words w_i and w_j by equation (2), the candidate word grouping unit 31 determines the phoneme count N_wi of w_i and the phoneme count N_wj of w_j. It also determines the number of edits N_e needed to make the phonemes of w_i and w_j match. It then substitutes the larger of N_wi and N_wj into the denominator of equation (2) and N_e into the numerator, and evaluates equation (2); that is, L(w_i, w_j) = N_e / max(N_wi, N_wj).
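The calculation just described can be sketched with a standard Levenshtein computation over phoneme lists. The function names are hypothetical, and unit cost for every insertion, deletion, and substitution is the textbook choice assumed here. Edit counts produced by this sketch need not match those in the figures, whose phoneme transcriptions and alignments are not given in the text.

```python
def edit_count(p, q):
    """N_e: minimum number of phoneme insertions, deletions, and substitutions."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phoneme_edit_distance(p, q):
    """L(w_i, w_j) = N_e / max(N_wi, N_wj), the number of edits per phoneme."""
    longest = max(len(p), len(q))
    return edit_count(p, q) / longest if longest else 0.0
```

For the phoneme lists /n/a/k/a/d/a/ and /n/a/k/a/t/a/, one substitution is needed, so the distance is 1/6.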

FIG. 9 shows a calculation example of the edit distance L(w_i, w_j), here between “神田” (Kanda) and “蒲田” (Kamata). As shown in FIG. 9, the number of edits between “神田” and “蒲田” is 4, so the edit distance per phoneme is 4/6.

Next, the candidate word grouping unit 31 calculates the distance D(w_i, w_j) between candidate words from the temporal overlap O(w_i, w_j) and the phoneme edit distance L(w_i, w_j). D(w_i, w_j) can be calculated by equation (3), where α and β are weighting coefficients with values in the range 0 to 1, set so that D(w_i, w_j) takes a value in the range 0 to 1. The distance D(w_i, w_j) is a weighted sum of the temporal overlap O(w_i, w_j) and the phoneme edit distance L(w_i, w_j) between the candidate words.

Next, the candidate word grouping unit 31 groups together the candidate word w_i of the first candidate word string and the candidate word w_j of the second candidate word string whose distance D(w_i, w_j) is smallest.
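Equation (3) is not reproduced in this text, so the exact form of the weighted sum is unknown. The sketch below assumes that the overlap enters as the temporal distance 1 - O, giving D = α(1 - O) + βL, and that choosing α + β = 1 is one way to keep D in the range 0 to 1. Both are assumptions, as are the function names and the numeric values in the usage note.

```python
def combined_distance(overlap, edit_dist, alpha=0.5, beta=0.5):
    """Assumed form of equation (3): alpha * (1 - O) + beta * L."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    return alpha * (1.0 - overlap) + beta * edit_dist


def closest_candidate(scores, alpha=0.5, beta=0.5):
    """scores: {word: (O, L)} for the first-string words; returns the word with minimal D."""
    return min(scores, key=lambda w: combined_distance(*scores[w], alpha, beta))
```

Given invented values such as `{"午後": (0.10, 1.0), "３時に": (0.40, 0.25), "会費": (0.05, 0.9)}` for a second-string word, `closest_candidate` returns `"３時に"`, the word that both overlaps most and sounds closest.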

FIG. 10 shows an example of candidate word grouping according to Example 2, applied to the recognition result of FIG. 2. The strings processed are the first candidate word string “今日の午後３時に会費です” (kyō no gogo sanji ni kaihi desu) and the second candidate word string “京都ご讃辞に会議です” (Kyōto go-sanji ni kaigi desu). As shown in FIG. 10, three candidate words in the first candidate word string overlap in time with the candidate word “ご讃辞に” of the second string: “午後”, “３時に”, and “会費”. In the example of FIG. 10, the grouping processing of Example 2 determines that “３時に” and “ご讃辞に” belong to the same group.

FIG. 11 shows another example of candidate word grouping according to Example 2. Here the first candidate word string is “京都ご讃辞に会議です” and the second candidate word string is “今日の午後３時に会費です”. As shown in FIG. 11, for both “午後” and “３時に” in the second candidate word string, the only candidate word in the first string that overlaps in time and has a short edit distance is “ご讃辞に”. Both “午後” and “３時に” are therefore grouped with “ご讃辞に”. When several candidate words of one candidate word string are determined to belong to one group in this way, those candidate words are concatenated. In the example of FIG. 11, “午後” and “３時に” are concatenated into the single candidate word “午後３時に”, which forms group 2 together with “ご讃辞に”.
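The concatenation step can be sketched as merging consecutive words that received the same group number. The function name and the list-of-group-ids representation are assumptions. In the test values, only group 2 for “午後” and “３時に” comes from the text; the other group numbers are filled in for illustration.

```python
from itertools import groupby


def merge_same_group(words, group_ids):
    """Concatenate consecutive words of one string that fall into the same group.

    words:     candidate words of one candidate word string, in order
    group_ids: group number assigned to each word (same length as words)
    """
    merged = []
    for gid, items in groupby(zip(words, group_ids), key=lambda pair: pair[1]):
        merged.append(("".join(word for word, _ in items), gid))
    return merged
```

Because the grouping preserves word order, words of the same group are always adjacent, so merging consecutive runs is sufficient.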

According to Example 2, candidate words are grouped together when the distance D(w_i, w_j), the weighted sum of the temporal overlap O(w_i, w_j) and the phoneme edit distance L(w_i, w_j) between candidate word w_i in the first candidate word string and candidate word w_j in the second, is smallest. This groups the candidate words while preserving their order and prevents the same word from becoming a candidate word in more than one segment. As a result, the same word is not displayed in multiple segments among the candidate words shown on the screen from the speech recognition result.

The following methods are possible variations of Example 2.

[Example 2-1]
In Example 2-1, when calculating the phoneme edit distance L(w_i, w_j) between candidate words, the number of edits N_e is replaced, as the measure of how close the readings are, by "the acoustic similarity between the phonemes in each pair of phonemes that must be substituted to make the phonemes of candidate word w_i and candidate word w_j match". For example, for the three similar-sounding words “/n/a/k/a/d/a/ (中田, Nakada)”, “/n/a/k/a/t/a/ (中田, Nakata)”, and “/n/a/k/a/m/a/ (仲間, nakama)”, the phoneme pairs that must be substituted to make the respective words match are /d/ and /t/, /d/ and /m/, and /t/ and /m/.

As preparation, an edit cost C_pi,pj reflecting acoustic similarity is determined in advance for every combination of phonemes and held in memory. The edit cost C_pi,pj represents the acoustic similarity between phoneme p_i and phoneme p_j.

When calculating the phoneme edit distance L(w_i, w_j) between candidate words w_i and w_j by equation (2), the candidate word grouping unit 31 identifies the phoneme pairs that must be substituted to make the phonemes of w_i and w_j match. It reads the edit cost C_pi,pj of each such pair from memory and sums these edit costs. It then substitutes this sum into the numerator of equation (2) in place of the number of edits N_e, substitutes the larger of the phoneme counts N_wi and N_wj into the denominator, and evaluates equation (2). Finally, it calculates the distance D(w_i, w_j) between the candidate words by equation (3). The temporal overlap O(w_i, w_j) between candidate words is calculated as in Example 1.

According to Example 2-1, the edit distance can differ even when the number of edits between candidate words is the same. For example, for the three similar-sounding words “/n/a/k/a/d/a/ (中田, Nakada)”, “/n/a/k/a/t/a/ (中田, Nakata)”, and “/n/a/k/a/m/a/ (仲間, nakama)”, the number of edits is 1 for every pair of words. If the edit cost C_/d/,/t/ of phonemes /d/ and /t/ is smaller than both the edit cost C_/d/,/m/ of /d/ and /m/ and the edit cost C_/t/,/m/ of /t/ and /m/, the edit distance of the pair “/n/a/k/a/d/a/ (Nakada)” and “/n/a/k/a/t/a/ (Nakata)” is smaller than that of the other pairs.
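A weighted edit distance of this kind can be sketched as follows. The text only specifies costs for substituted phoneme pairs, so charging insertions and deletions a unit cost, defaulting unknown pairs to 1.0, and the function names are all assumptions. The cost values in the usage note are invented for illustration and do not come from the source.

```python
def weighted_edit_cost(p, q, sub_cost, indel=1.0):
    """Minimum total edit cost, with per-pair substitution costs looked up in sub_cost.

    sub_cost maps frozenset({a, b}) -> cost; unknown pairs default to 1.0 (assumption).
    """
    prev = [j * indel for j in range(len(q) + 1)]
    for i, a in enumerate(p, 1):
        cur = [i * indel]
        for j, b in enumerate(q, 1):
            sub = 0.0 if a == b else sub_cost.get(frozenset((a, b)), 1.0)
            cur.append(min(prev[j] + indel, cur[j - 1] + indel, prev[j - 1] + sub))
        prev = cur
    return prev[-1]


def weighted_phoneme_distance(p, q, sub_cost):
    """Equation (2) with the summed edit costs in place of the edit count N_e."""
    longest = max(len(p), len(q))
    return weighted_edit_cost(p, q, sub_cost) / longest if longest else 0.0
```

With hypothetical costs of 0.3 for /d/ and /t/ and 0.8 for the two pairs involving /m/, the Nakada/Nakata pair gets distance 0.3/6, while the pairs involving nakama get 0.8/6, even though each pair differs by a single edit.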

As the acoustic similarity between phonemes, one option is to examine, for each phoneme, the position in the vocal tract where the closure needed to articulate the consonant occurs (place of articulation) and the way that closure is made (manner of articulation), and to use the degree of agreement between phonemes in place and manner of articulation.

[Example 2-2]
In Example 2-2, the confusability between phonemes (confusion matrix) is used for the phoneme edit distance L(w_i, w_j) between candidate words.

As preparation, the confusability between phonemes is obtained in advance by speech recognition processing for every combination of phonemes, and an edit distance is determined for each phoneme combination and held in memory. Phoneme pairs that are easily confused are given a shorter edit distance, and pairs that are rarely confused are given a longer one.
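The text states only that more confusable pairs get shorter distances, not how the distances are derived. One possible mapping, shown here purely as an assumption, is one minus the symmetric confusion rate computed from recognition counts. The function name, the count layout, and the formula are all hypothetical.

```python
def confusion_to_distance(counts):
    """Turn confusion counts into per-pair edit distances in [0, 1].

    counts[a][b] is the number of times reference phoneme a was recognized as b.
    Distance = 1 - (confusions between a and b) / (occurrences of a and b).
    """
    totals = {a: sum(row.values()) for a, row in counts.items()}
    phones = sorted(counts)
    dist = {}
    for x, a in enumerate(phones):
        for b in phones[x + 1:]:
            confused = counts[a].get(b, 0) + counts[b].get(a, 0)
            seen = totals[a] + totals[b]
            rate = confused / seen if seen else 0.0
            dist[frozenset((a, b))] = 1.0 - rate
    return dist
```

The resulting table can then be looked up per phoneme pair when summing the edit distances for equation (2).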

When calculating the phoneme edit distance L(w_i, w_j) between candidate words w_i and w_j by equation (2), the candidate word grouping unit 31 extracts the phoneme pairs between w_i and w_j. It reads the edit distance of each such pair from memory and sums these edit distances. It then substitutes this sum into the numerator of equation (2) in place of the number of edits N_e, substitutes the larger of the phoneme counts N_wi and N_wj into the denominator, and evaluates equation (2). Finally, it calculates the distance D(w_i, w_j) between the candidate words by equation (3). The temporal overlap O(w_i, w_j) between candidate words is calculated as in Example 1.

As described above, the present embodiment groups candidate words while preserving their order, prevents the same word from becoming a candidate word in more than one segment, and thereby prevents the same word from being displayed in multiple segments among the candidate words shown on the screen from the speech recognition result.

The speech recognition apparatus 1 according to the present embodiment may be implemented by dedicated hardware, or may be configured as a computer system such as a personal computer that realizes the functions of the apparatus shown in FIG. 1 by executing a program for implementing each of them.

An input device, a display device, and the like (none shown) are connected to the speech recognition apparatus 1 as peripheral devices. Here, the input device refers to an input device such as a keyboard, a mouse, or the keys of a mobile phone terminal. The display device refers to a CRT (Cathode Ray Tube), a liquid crystal display device, or the like.
These peripheral devices may be connected directly to the speech recognition apparatus 1 or connected via a communication line.

A program for realizing the functions of the speech recognition apparatus 1 shown in FIG. 1 may also be recorded on a computer-readable recording medium, and the processing for speech recognition may be performed by having a computer system read and execute the program recorded on that medium. The "computer system" here may include an OS and hardware such as peripheral devices.
When a WWW system is used, the "computer system" also includes the homepage-providing environment (or display environment).
A "computer-readable recording medium" refers to a writable nonvolatile memory such as a flexible disk, a magneto-optical disk, a ROM, or a flash memory; a portable medium such as a DVD (Digital Versatile Disk); or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.
The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium with the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line.
The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes those functions in combination with a program already recorded in the computer system.

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like within a scope that does not depart from the gist of the present invention.
For example, the speech recognition apparatus 1 described above may be combined with various document-creating devices such as a word processor or an e-mail device.

FIG. 1 is a block diagram showing the overall configuration of the speech recognition apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a configuration example of a recognition result in word-network format.
FIG. 3 is a configuration example of the candidate word generation unit 16 shown in FIG. 1.
FIG. 4 is a configuration example of the recognition result after the candidate word grouping processing according to an embodiment of the present invention.
FIG. 5 is a configuration example of the recognition result after identical candidate words have been unified according to the same embodiment.
FIG. 6 is a configuration example of the candidate word editing/display unit 17 shown in FIG. 1.
FIG. 7 is an explanatory diagram of the temporal overlap l(w_i ∩ w_j) between candidate words according to an embodiment of the present invention.
FIG. 8 is an example of the candidate word grouping processing according to Example 1 of the present invention.
FIG. 9 is a calculation example of the edit distance L(w_i, w_j) according to Example 2 of the present invention.
FIG. 10 is an example of the candidate word grouping processing according to Example 2 of the present invention.
FIG. 11 is another example of the candidate word grouping processing according to Example 2 of the present invention.
FIG. 12 is an example of conventional candidate word grouping processing.

Explanation of symbols

1 ... speech recognition apparatus, 11 ... speech input unit, 12 ... acoustic feature extraction unit, 13 ... speech recognition unit, 14 ... acoustic model storage unit, 15 ... language model storage unit, 16 ... candidate word generation unit, 17 ... candidate word editing/display unit, 18 ... editing operation unit, 30 ... candidate word extraction unit, 31 ... candidate word grouping unit

Claims


A speech recognition apparatus comprising:
speech recognition means for performing processing to recognize input speech, generating a recognition result consisting of a sequence of recognized words, and calculating the likelihood of correctness (reliability) of the recognition result from an acoustic score (acoustic likelihood) and a linguistic probability (language probability);
candidate word extraction means for extracting candidate words from the recognition result;
candidate word grouping means for grouping the candidate words;
candidate word display means for displaying the grouped candidate words on a screen;
editing operation means by which a user edits the candidate words displayed on the screen; and
updating means for updating the recognition result in accordance with the user's edits,
wherein the candidate word grouping means includes:
means for extracting, from the recognition result, the sequence of candidate words forming the sentence with the maximum reliability (first candidate word string) and the other sequences of candidate words (second candidate word string);
means for calculating the temporal distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string; and
means for grouping the candidate words of the first candidate word string and the second candidate word string so that the sum of the temporal distances between candidate words of the first candidate word string and candidate words of the second candidate word string within the same group is minimized,
wherein the temporal distance is defined by the following equation:
temporal distance = 1 − O(w_i, w_j),
(the equation defining O(w_i, w_j) is not reproduced in this text)
where l(w_i) is the duration of candidate word w_i, l(w_j) is the duration of candidate word w_j, and l(w_i ∩ w_j) is the duration for which candidate word w_i and candidate word w_j overlap.

Speech recognition means for recognizing input speech, generating a recognition result consisting of a string of recognized words, and calculating the likelihood (reliability) of the recognition result from an acoustic score (acoustic likelihood) and a linguistic probability (language probability);
Candidate word extracting means for extracting candidate words from the recognition result;
Candidate word grouping means for grouping the candidate words;
Candidate word display means for displaying the grouped candidate words on a screen;
Editing operation means for the user to edit the candidate words displayed on the screen; and
Updating means for updating the recognition result in accordance with the user's edits,
The candidate word grouping means includes:
Means for extracting, from the recognition result, the string of candidate words constituting the sentence with the maximum reliability (first candidate word string) and a string of the other candidate words (second candidate word string);
Means for calculating the temporal overlap between a candidate word in the first candidate word string and a candidate word in the second candidate word string;
Means for calculating the phoneme edit distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string;
Means for calculating the distance between candidate words using the temporal overlap between the candidate words and the phoneme edit distance between the candidate words; and
Means for grouping together the candidate words in the first candidate word string and the second candidate word string for which the distance between them is minimal,
The temporal overlap "O(w_i, w_j)" is defined by the following equation:

where l(w_i) is the time length of candidate word w_i, l(w_j) is the time length of candidate word w_j, and l(w_i∩w_j) is the length of time for which candidate word w_i and candidate word w_j overlap,
The phoneme edit distance "L(w_i, w_j)" is defined by the following equation:

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j, and N_e is the total number of phoneme deletions, insertions, and substitutions required to make the phonemes of candidate word w_i and candidate word w_j match,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation:

where α and β are weighting coefficients with values in the range from 0 to 1, set so that D(w_i, w_j) takes a value in the range from 0 to 1,
A speech recognition apparatus characterized by that.
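
The combined distance described above can be sketched in Python as follows. N_e is computed as an ordinary Levenshtein count of phoneme deletions, insertions and substitutions, which matches the wording of the claim. The equations for L(w_i, w_j) and D(w_i, w_j) are not reproduced in this text, so the normalization L = N_e / max(N_wi, N_wj) and the weighted sum D = α(1 − O) + βL are assumptions chosen only so that both values stay between 0 and 1 when α + β ≤ 1. The function names and default weights are likewise hypothetical.

```python
def edit_ops(p: list[str], q: list[str]) -> int:
    """N_e: minimum number of phoneme deletions, insertions and
    substitutions needed to turn sequence p into sequence q."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def phoneme_distance(p: list[str], q: list[str]) -> float:
    """Assumed form of L(w_i, w_j): N_e / max(N_wi, N_wj)."""
    longest = max(len(p), len(q))
    return edit_ops(p, q) / longest if longest else 0.0


def combined_distance(overlap: float, p: list[str], q: list[str],
                      alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed form of D(w_i, w_j): alpha * (1 - O) + beta * L."""
    return alpha * (1.0 - overlap) + beta * phoneme_distance(p, q)


# Hypothetical example: two candidates with a temporal overlap of 0.6.
d = combined_distance(0.6, ["k", "y", "o", "u"], ["k", "y", "o", "o"])
```

With both terms in play, two candidates that occupy the same time span but sound very different are kept further apart than candidates that are both simultaneous and phonetically close.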

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j, and N_e is the total of the editing costs C_{pi,pj},
The editing cost C_{pi,pj} is the acoustic similarity between phoneme p_i and phoneme p_j, the pair of phonemes to be substituted in order to make the phonemes of candidate word w_i and candidate word w_j match,
Storage means is provided for storing the editing cost in advance for all combinations of phonemes,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation:

4. The speech recognition apparatus according to claim 3, wherein the acoustic similarity between phonemes is the degree of agreement between the phonemes in place of articulation and manner of articulation.
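
A minimal Python sketch of a substitution cost based on articulation, assuming a small hand-written feature table. The claims say only that the cost reflects agreement in place and manner of articulation and that it is stored in advance for all phoneme combinations. The particular scoring below (half a point for each of the two features that differs), the unit cost for insertions and deletions, and the names ARTICULATION, COST and weighted_edit_cost are assumptions made for illustration.

```python
# Hypothetical feature table: phoneme -> (place, manner of articulation).
ARTICULATION = {
    "p": ("bilabial", "plosive"),
    "m": ("bilabial", "nasal"),
    "t": ("alveolar", "plosive"),
    "s": ("alveolar", "fricative"),
    "n": ("alveolar", "nasal"),
    "k": ("velar", "plosive"),
}


def substitution_cost(a: str, b: str) -> float:
    """0.0 for identical phonemes; otherwise 0.5 per differing feature."""
    if a == b:
        return 0.0
    differing = sum(x != y for x, y in zip(ARTICULATION[a], ARTICULATION[b]))
    return differing / 2


# Costs stored in advance for every phoneme combination in the table.
COST = {(a, b): substitution_cost(a, b)
        for a in ARTICULATION for b in ARTICULATION}


def weighted_edit_cost(p: list[str], q: list[str], indel: float = 1.0) -> float:
    """N_e as a sum of editing costs, using the stored substitution costs.
    The insertion/deletion cost `indel` is an assumption."""
    prev = [j * indel for j in range(len(q) + 1)]
    for i, a in enumerate(p, start=1):
        curr = [i * indel]
        for j, b in enumerate(q, start=1):
            curr.append(min(prev[j] + indel,
                            curr[j - 1] + indel,
                            prev[j - 1] + COST[(a, b)]))
        prev = curr
    return prev[-1]
```

Under this scoring, replacing "t" with "s" (same place, different manner) costs less than replacing "p" with "s" (both features differ), so candidates that differ only by acoustically close phonemes come out nearer to each other.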

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j,
N_e is the total of the misrecognition-likelihood values between phonemes over the phoneme pairings between candidate word w_i and candidate word w_j,
Storage means is provided for storing in advance, for all combinations of phonemes, the misrecognition-likelihood value between phonemes, which is smaller for phonemes that are easily misrecognized as one another and larger for phonemes that are unlikely to be misrecognized as one another,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation:

6. The speech recognition apparatus according to any one of claims 2 to 5, further comprising means for concatenating a plurality of candidate words when the plurality of candidate words in one candidate word string are determined to belong to one group.
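
The concatenation step can be pictured with a short Python sketch. It assumes that each word of one candidate word string has already been assigned a group index and that the words are listed in time order. The (group index, text) pair layout and the function name link_words are hypothetical.

```python
from itertools import groupby


def link_words(assigned: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Join consecutive words of one candidate word string that were
    assigned to the same group into a single candidate."""
    return [(group, "".join(text for _, text in items))
            for group, items in groupby(assigned, key=lambda pair: pair[0])]


# Hypothetical assignment: "to" and "kyo" both fell into group 0.
linked = link_words([(0, "to"), (0, "kyo"), (1, "wa")])
```

After linking, each group holds at most one candidate from this string, which keeps the on-screen list of alternatives aligned with the words of the first candidate word string.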

A speech recognition function for recognizing input speech, generating a recognition result consisting of a string of recognized words, and calculating the likelihood (reliability) of the recognition result from an acoustic score (acoustic likelihood) and a linguistic probability (language probability);
A candidate word extraction function for extracting candidate words from the recognition result;
A candidate word grouping function for grouping the candidate words;
A candidate word display function for displaying the grouped candidate words on a screen;
An editing operation function for the user to edit the candidate words displayed on the screen; and
An update function for updating the recognition result in accordance with the user's edits, the computer program causing a computer to implement these functions,
The candidate word grouping function:
Extracts, from the recognition result, the string of candidate words constituting the sentence with the maximum reliability (first candidate word string) and a string of the other candidate words (second candidate word string),
Calculates the temporal distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string, and
Groups the candidate words constituting the first candidate word string and the second candidate word string so that the sum of the temporal distances between the candidate words in the first candidate word string and the candidate words in the second candidate word string within the same group is minimized,
The temporal distance is defined by the following equation:
Temporal distance = 1 − O(w_i, w_j),

where l(w_i) is the time length of candidate word w_i, l(w_j) is the time length of candidate word w_j, and l(w_i∩w_j) is the length of time for which candidate word w_i and candidate word w_j overlap,
A computer program characterized by the above.
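
The update function recited above might look like the following Python sketch, under the assumption that each group is a list of candidate texts whose first entry comes from the first candidate word string and that the user's edit is recorded as a mapping from group index to the chosen candidate's position. Both the data layout and the name update_result are hypothetical and are not taken from the specification.

```python
def update_result(groups: list[list[str]], selected: dict[int, int]) -> list[str]:
    """Rebuild the recognized word string: each group contributes the
    candidate the user picked, or its first candidate if left untouched."""
    return [candidates[selected.get(index, 0)]
            for index, candidates in enumerate(groups)]


# Hypothetical example: the user replaces the second word only.
result = update_result([["kyou", "kyo"], ["wa", "ga"]], {1: 1})
```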

A speech recognition function for recognizing input speech, generating a recognition result consisting of a string of recognized words, and calculating the likelihood (reliability) of the recognition result from an acoustic score (acoustic likelihood) and a linguistic probability (language probability);
A candidate word extraction function for extracting candidate words from the recognition result;
A candidate word grouping function for grouping the candidate words;
A candidate word display function for displaying the grouped candidate words on a screen;
An editing operation function for the user to edit the candidate words displayed on the screen; and
An update function for updating the recognition result in accordance with the user's edits, the computer program causing a computer to implement these functions,
The candidate word grouping function:
Extracts, from the recognition result, the string of candidate words constituting the sentence with the maximum reliability (first candidate word string) and a string of the other candidate words (second candidate word string),
Calculates the temporal overlap between a candidate word in the first candidate word string and a candidate word in the second candidate word string,
Calculates the phoneme edit distance between a candidate word in the first candidate word string and a candidate word in the second candidate word string,
Calculates the distance between candidate words using the temporal overlap between the candidate words and the phoneme edit distance between the candidate words, and
Groups together the candidate words in the first candidate word string and the second candidate word string for which the distance between them is minimal,
The temporal overlap "O(w_i, w_j)" is defined by the following equation:

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j, and N_e is the total number of phoneme deletions, insertions, and substitutions required to make the phonemes of candidate word w_i and candidate word w_j match,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation:

where α and β are weighting coefficients with values in the range from 0 to 1, set so that D(w_i, w_j) takes a value in the range from 0 to 1,
A computer program characterized by the above.

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j, and N_e is the total of the editing costs C_{pi,pj},
The editing cost C_{pi,pj} is the acoustic similarity between phoneme p_i and phoneme p_j, the pair of phonemes to be substituted in order to make the phonemes of candidate word w_i and candidate word w_j match,
The computer is provided with storage means for storing the editing cost in advance for all combinations of phonemes,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation:

where N_wi is the number of phonemes of candidate word w_i, N_wj is the number of phonemes of candidate word w_j,
N_e is the total of the misrecognition-likelihood values between phonemes over the phoneme pairings between candidate word w_i and candidate word w_j,
The computer is provided with storage means for storing in advance, for all combinations of phonemes, the misrecognition-likelihood value between phonemes, which is smaller for phonemes that are easily misrecognized as one another and larger for phonemes that are unlikely to be misrecognized as one another,
The distance "D(w_i, w_j)" between the candidate words is defined by the following equation: