JPS6237797B2


Info

Publication number: JPS6237797B2
Application number: JP18875680A
Authority: JP
Inventors: Kyoshi Iwata; Yasuhiro Nara; Akihiro Kimura
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1980-12-26
Filing date: 1980-12-26
Publication date: 1987-08-14
Also published as: JPS57110000A

Description

[Detailed Description of the Invention]

The present invention relates to a speech recognition method, and in particular to a speech recognition method that performs recognition using a feature pattern obtained by applying mask operations to a two-dimensional frequency-spectrum versus time pattern and extracting its feature components.

Conventionally, speech recognition has worked from a speech characteristic curve such as that shown in FIG. 1 (which shows the utterance "i-e-a-o-un"), that is, a frequency spectrum-time curve with frequency and time on the horizontal plane and acoustic power on the vertical axis. Recognition was performed by, for example, extracting formant components from the spectral distribution of each frame (a frame being, for example, 8 ms long), or by converting each spectral pattern into a standardized sound type and matching the resulting sound-type sequence against word reference patterns.

However, such conventional methods have the following drawbacks.

(1) They are strongly affected by the magnitude of the power, that is, by whether the voice is loud or soft. In addition, the third and fourth formant components vary widely between speakers. Normalization is therefore required.

(2) Because recognition is performed on isolated single frames, it is difficult to obtain the expected phonemes. For example, when "i-e-a" is pronounced continuously, the frames are often converted into a mixed phoneme sequence such as "IIIEIEEAEAEAA". When each frame is treated on its own, the expected conversion "I...E...A..." is hard to achieve.

(3) As item (2) shows, a phoneme is influenced by the frames before and after it, yet this is not taken into account at all.

(4) As FIG. 1 shows, the frequency peaks (formants) also vary spectrally in both the frequency direction and the time direction, but conventional frame-by-frame recognition does not take this into account.

The present invention therefore aims to remedy these drawbacks. To this end, the speech recognition method of the present invention, which analyzes a speech input on the basis of its frequency spectrum and performs recognition by matching the analysis result against a dictionary, is characterized by comprising: spectrum conversion means for creating from the speech input a spectrum representing its frequency components and the magnitude of their power; mask operation means for performing mask operations on this spectrum; and component extraction means for extracting, on the basis of the mask operations, at least a vertical component indicating a pattern change in the vertical direction of the spectrum and a horizontal component indicating a pattern change in the horizontal direction, wherein the dictionary is matched using a pattern obtained from the vertical and horizontal components.

Before describing the present invention in detail, the frequency spectrum patterns used in the invention are explained with reference to FIGS. 1, 2 and 3.

As noted above, FIG. 1 shows in three dimensions how the frequency spectrum changes over time when "i-e-a-o-un" is uttered. As FIG. 1 makes clear, sharp peaks exist at the formants, where the energy is concentrated. FIG. 2 shows a power spectrum of this kind as a two-dimensional grayscale display, in which portions with greater power are drawn darker. Unlike FIG. 1, however, FIG. 2 shows part of the utterance "dai". FIG. 2 shows that the formants move gradually, and capturing this movement provides useful information for speech recognition. If a threshold is applied to this grayscale display so that only the portions at or above a certain level are shown, a pattern such as that of FIG. 3 is obtained. Note that FIG. 3 shows an example of spectrum data for the utterance "Mexico".
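
The thresholding step that turns a grayscale display like FIG. 2 into a pattern like FIG. 3 can be sketched as follows. This is a minimal illustration, not the circuit of the patent: the function name `threshold_spectrum`, the toy data, and the choice to zero out sub-threshold cells are all assumptions.

```python
def threshold_spectrum(spectrum, level):
    """Keep power values at or above `level`; set everything else to 0.

    `spectrum` is a list of frames (time), each a list of per-channel
    power values (frequency). This layout is an assumption.
    """
    return [[p if p >= level else 0 for p in frame] for frame in spectrum]


# Toy data: rows are time frames, columns are frequency channels.
toy = [
    [1, 7, 2, 0],
    [0, 8, 3, 1],
    [0, 2, 9, 1],
]
kept = threshold_spectrum(toy, 5)
print(kept)
```

Only the cells that reach the threshold survive, which is what makes the moving peak stand out in the toy data.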

The following can be seen from FIG. 3.

(1) While the same sound continues, the spectral pattern is nearly uniform over time.

(2) Vowel portions have distinct formant characteristics.

(3) When different syllables follow one another, the spectral pattern changes markedly at their boundary. (For consecutive vowels, however, the change is less distinct.)

The present invention applies mask operations to this spectral pattern and performs speech recognition on the basis of the change information thus obtained.

An example of the masks used in the present invention, and an example of the discrimination used to detect changes, are described with reference to FIG. 4.

FIG. 4(a) shows an example of the mask pattern used in the present invention: a mask divided into nine regions a₁ to a₉ arranged 3×3. As shown in FIG. 4(b), the regions on the left side are multiplied by +1, the regions on the right side by −1, and the products are summed. If the sum is positive, the power at the central region a changes from large to small going from the left in the horizontal direction (frequency direction); if it is negative, the power changes from small to large. In other words, when the absolute value of the mask result is large, the spectrum is in a steady state in the time direction.

Similarly, if the result computed with the mask of FIG. 4(c) is positive, the power at the central region a changes from large to small in the vertical direction (time direction); if it is negative, the power changes from small to large. In other words, when the absolute value of the result is large, the spectrum is in a changing state in the time direction.

If the result computed with the mask of FIG. 4(d) is positive, the power at the central region a changes from large to small toward the lower right; if it is negative, the power changes from small to large. This corresponds to a spectral peak moving toward lower frequencies over time. If the result computed with the mask of FIG. 4(e) is positive, the power at the central region a changes from large to small toward the lower left; if it is negative, the power changes from small to large. This corresponds to a spectral peak moving toward higher frequencies over time.

Each of the masks of FIG. 4(b) to (e) can thus be regarded as expressing a first-order derivative of the power spectrum.
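
The first-order mask operation described for FIG. 4 can be sketched in a few lines. Only the coefficients of mask (b) are stated in the text (left side +1, right side −1); the zero centre column and the layouts used here for masks (c), (d) and (e) are assumptions inferred from the described sign conventions, and the names `MASK_B` to `MASK_E` and `apply_mask` are hypothetical.

```python
# FIG. 4(b) follows the text; the zero centre column and the layouts of
# (c), (d) and (e) are assumptions, since the source does not list them.
MASK_B = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]    # frequency direction: left minus right
MASK_C = [[1, 1, 1],
          [0, 0, 0],
          [-1, -1, -1]]  # time direction: upper minus lower
MASK_D = [[1, 1, 0],
          [1, 0, -1],
          [0, -1, -1]]   # toward the lower right
MASK_E = [[0, 1, 1],
          [-1, 0, 1],
          [-1, -1, 0]]   # toward the lower left


def apply_mask(spectrum, mask, t, f):
    """Sum of coefficient times power over the 3x3 window centred on (t, f)."""
    return sum(
        mask[dt + 1][df + 1] * spectrum[t + dt][f + df]
        for dt in (-1, 0, 1)
        for df in (-1, 0, 1)
    )


# Rows are time frames (vertical axis), columns are frequency channels.
falling = [[9, 5, 1],
           [9, 5, 1],
           [9, 5, 1]]
print(apply_mask(falling, MASK_B, 1, 1))  # positive: power falls from left to right
print(apply_mask(falling, MASK_C, 1, 1))  # zero: no change along the time axis
```

In the toy data the power depends only on frequency, so the frequency-direction mask responds while the time-direction mask does not.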

Alternatively, as shown in FIG. 5(a), a mask can be used in which the left and right regions are multiplied by −1 and the central column by +2 before summing. As shown in FIG. 5(e), this amounts to splitting the central column in two and taking the change between the first-order differences on either side of it. In other words, the rate of change of the power spectrum, that is, its second-order derivative, can be obtained. By computing in this way with the masks of FIG. 5(b) to (d) as well as that of FIG. 5(a), changes in the power spectrum in the horizontal, vertical and diagonal directions can be detected.
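
A small sketch may help show why the −1, +2, −1 mask behaves as a second-order difference. The coefficients of `MASK_5A` follow the text for FIG. 5(a); the helper `apply_mask` and the toy spectra `peak` and `ramp` are illustrative assumptions.

```python
MASK_5A = [[-1, 2, -1],
           [-1, 2, -1],
           [-1, 2, -1]]  # left and right columns -1, centre column +2


def apply_mask(spectrum, mask, t, f):
    """Sum of coefficient times power over the 3x3 window centred on (t, f)."""
    return sum(
        mask[dt + 1][df + 1] * spectrum[t + dt][f + df]
        for dt in (-1, 0, 1)
        for df in (-1, 0, 1)
    )


peak = [[1, 9, 1]] * 3   # a ridge at the centre frequency channel
ramp = [[1, 5, 9]] * 3   # power rising linearly with frequency

second_peak = apply_mask(peak, MASK_5A, 1, 1)
second_ramp = apply_mask(ramp, MASK_5A, 1, 1)

# The same value written as a difference of two first-order differences,
# using column sums: (centre - left) - (right - centre).
left, centre, right = 3 * 1, 3 * 9, 3 * 1
as_difference = (centre - left) - (right - centre)
print(second_peak, second_ramp, as_difference)
```

The mask responds strongly at the ridge and gives zero on the linear ramp, which is the behaviour expected of a second-order difference.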

Using the masks of FIG. 4(b) to (e) to analyze a spectral pattern such as that of FIG. 3, and keeping only the components at or above a certain threshold, yields a pattern such as that shown in FIG. 6. In this figure, the vertical, horizontal and diagonal lines indicate that a change at or above the threshold exists in the vertical, horizontal or diagonal direction, respectively. The following can be seen from FIG. 6.
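
The step of keeping only the mask responses at or above a threshold can be sketched as below. For brevity only two masks are used, and they are labelled by the direction of change they measure ("freq" and "time") rather than by the component names of FIG. 6. The mask layouts, the names `MASKS` and `directional_pattern`, and the threshold value are assumptions for illustration.

```python
# Assumed layouts: left minus right (frequency), upper minus lower (time).
MASKS = {
    "freq": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "time": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
}


def apply_mask(spectrum, mask, t, f):
    """Sum of coefficient times power over the 3x3 window centred on (t, f)."""
    return sum(
        mask[dt + 1][df + 1] * spectrum[t + dt][f + df]
        for dt in (-1, 0, 1)
        for df in (-1, 0, 1)
    )


def directional_pattern(spectrum, threshold):
    """Map each interior cell (t, f) to the set of directions whose
    absolute mask response is at or above `threshold`."""
    pattern = {}
    for t in range(1, len(spectrum) - 1):
        for f in range(1, len(spectrum[0]) - 1):
            hits = {
                name
                for name, mask in MASKS.items()
                if abs(apply_mask(spectrum, mask, t, f)) >= threshold
            }
            if hits:
                pattern[(t, f)] = hits
    return pattern


# A sound that starts after the first frame: silence, then uniform power.
onset = [
    [0, 0, 0, 0],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
pattern = directional_pattern(onset, threshold=10)
print(pattern)
```

The onset produces a response only along the time axis, which matches the idea that sound boundaries show up as changes in the time direction.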

Vertical components appear clearly at formants, as in vowel portions.

Horizontal components are detected at the boundaries between consonants and vowels (where there is a change along the time axis).

The presence of diagonal components makes it possible to detect movement of the formant frequencies.

Speech portions can be analyzed without using the absolute value of the power.

These properties can therefore be used to separate vowels from consonants clearly. Comparing the resulting pattern with a dictionary then allows accurate phoneme identification, and in particular clear identification of the high-power vowel portions.
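
One way to see why the mask outputs need not depend on the absolute power level is the short derivation below. It rests on two assumptions that the source does not state: that P is a log-power spectrum, so that a louder or softer voice adds a constant offset c, and that the mask coefficients sum to zero, as the +1/−1 and −1/+2/−1 masks described above do. The symbols R, P, m and c are introduced here only for illustration.

```latex
R(t,f) = \sum_{i=-1}^{1}\sum_{j=-1}^{1} m_{ij}\,P(t+i,\,f+j),
\qquad \sum_{i,j} m_{ij} = 0
\;\Longrightarrow\;
\sum_{i,j} m_{ij}\,\bigl(P(t+i,\,f+j) + c\bigr)
  = R(t,f) + c\sum_{i,j} m_{ij}
  = R(t,f)
```

If P were linear power instead, a gain change would scale R rather than leave it unchanged, so a fixed threshold would no longer be independent of loudness.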

Next, the configuration of an embodiment of the present invention is described with reference to FIGS. 7 and 8.

FIG. 7 is a block diagram showing the configuration of an embodiment of the present invention, and FIG. 8 is a flowchart explaining its operation.

In the figures, 1 is a speech input unit, 2 is a spectrum conversion unit, 3 is a first frame buffer, 4 is a second frame buffer, 5 is a third frame buffer, 6 is a mask operation circuit, 7 is a four-direction component extraction circuit, 8 is a sound-section start/end detection circuit, 9 is a vowel/consonant determination circuit, 10 is a dictionary unit, 11 is a matching unit, and 12 is a phoneme identification unit.

When speech to be identified is input, the speech input unit 1 converts it into an electrical signal such as that shown in FIG. 1. The spectrum conversion unit 2 creates a spectral pattern such as that shown in FIG. 3 from the electrical signal received from the speech input unit 1.

The first frame buffer 3, the second frame buffer 4 and the third frame buffer 5 are composed of, for example, shift registers, and serve as buffers for the 3×3 mask operation. The data for the upper row of the mask is held in the first frame buffer 3, the data for the middle row in the second frame buffer 4, and the data for the lower row in the third frame buffer 5.
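
A software analogue of the three frame buffers is a sliding window that always holds the three most recent frames. The sketch below uses Python's `collections.deque` in place of shift registers; the function name `sliding_windows` and the oldest-first ordering are assumptions, and the sketch does not claim which hardware buffer holds which mask row.

```python
from collections import deque


def sliding_windows(frames):
    """Yield each run of three consecutive frames, oldest first."""
    window = deque(maxlen=3)  # the oldest frame drops out automatically
    for frame in frames:
        window.append(frame)
        if len(window) == 3:
            yield list(window)


# Four one-channel frames produce two overlapping 3-frame windows.
frames = [[1], [2], [3], [4]]
windows = list(sliding_windows(frames))
print(windows)
```

Each yielded window supplies the three rows that a 3×3 mask needs for one frame position.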

The mask operation circuit 6 performs the operation corresponding to each mask, namely the operations corresponding to the masks of FIG. 4(b), (c), (d) and (e).

The four-direction component extraction circuit 7 uses the results from the mask operation circuit 6 to detect the state of change in four directions (vertical, horizontal and the two diagonals) and extracts those changes that are at or above a given threshold in each direction.

The sound-section start/end detection circuit 8 detects the extent of the sounded portion of the speech input signal on the basis of the signal received from the four-direction component extraction circuit 7, and is used for delimiting syllables, vowel portions and the like.

The vowel/consonant determination circuit 9 determines whether a portion is a vowel or a consonant. It makes judgments such as "consonant if there is a silent portion", "consonant if many horizontal components are present" and "vowel if many vertical components are present", which simplifies the dictionary search.
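
The three judgments listed for the vowel/consonant determination circuit can be written as a small rule function. This is only a sketch: the source does not say how "many" is quantified or in what order the rules are tried, so the parameter `many`, the rule order, the "undecided" fallback and the name `classify_portion` are all assumptions.

```python
def classify_portion(n_vertical, n_horizontal, has_silence, many=3):
    """Apply the three rules of thumb in an assumed order.

    n_vertical / n_horizontal: counts of extracted components.
    many: hypothetical cut-off for "many components".
    """
    if has_silence:
        return "consonant"
    if n_horizontal >= many:
        return "consonant"
    if n_vertical >= many:
        return "vowel"
    return "undecided"


print(classify_portion(n_vertical=5, n_horizontal=0, has_silence=False))
print(classify_portion(n_vertical=0, n_horizontal=4, has_silence=False))
```

A coarse label of this kind lets the later dictionary search look only at the matching subset of patterns.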

The dictionary unit 10 is a dictionary storing the various phoneme patterns. Phoneme identification is performed by comparing a pattern analyzed as described above with the patterns stored in the dictionary unit 10.

The matching unit 11 matches the pattern received from the vowel/consonant determination circuit 9 against the patterns stored in the dictionary unit 10. Because the pattern arrives from the vowel/consonant determination circuit 9 already classified as either a consonant portion (silent portion) or a vowel portion (voiced-consonant portion), the effort of searching the dictionary is reduced.

The phoneme identification unit 12 identifies which phoneme the speech input signal corresponds to on the basis of the matching result from the matching unit 11. Final recognition of the speech is performed by matching the per-frame output of the phoneme identification unit 12 against a word dictionary using nonlinear time matching.
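
The source does not name the algorithm behind "nonlinear time matching". Dynamic time warping is one common form of it, and the sketch below uses it purely as an assumed stand-in. The 0/1 mismatch cost, the function names `dtw_distance` and `best_word`, and the toy word dictionary are all hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two label sequences,
    with cost 0 for equal labels and 1 otherwise."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # Allow a label to be stretched on either side, or a diagonal step.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


def best_word(frame_labels, word_dictionary):
    """Return the dictionary word whose phoneme sequence is nearest."""
    return min(word_dictionary,
               key=lambda w: dtw_distance(frame_labels, word_dictionary[w]))


# Per-frame phoneme labels, with each phoneme held for several frames.
frame_labels = list("IIIEEAAA")
word_dictionary = {"IEA": list("IEA"), "IOA": list("IOA")}
print(best_word(frame_labels, word_dictionary))
```

Because the alignment may stretch each phoneme over any number of frames, differences in speaking rate do not by themselves add to the distance.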

The operation of FIG. 7 is briefly described below.

When speech is delivered to the speech input unit 1, the spectrum conversion unit 2 creates from it a spectral pattern such as that shown in FIG. 3 and passes it to the first to third frame buffers 3 to 5. Using this spectral pattern and 3×3 masks, the mask operation circuit 6 performs the mask operations shown in FIG. 4(b) to (e). The four-direction component extraction circuit 7 then applies a threshold to the results, extracts only the changes at or above a fixed value, and creates a pattern such as that shown in FIG. 6 from the changes in the vertical, horizontal and two diagonal directions.

The vowel/consonant determination circuit 9 then first checks, frame by frame, whether a horizontal component is present. If one is present, the frame is the beginning of a sound, so the circuit next checks for a vertical component. If a vertical component is present, the frame is identified as a vowel portion or a voiced-consonant portion, and this is passed to the matching unit 11, which searches the dictionary unit 10. If no vertical component is present, the frame is identified as a consonant portion or a silent portion, and the matching unit 11 searches the dictionary unit 10 with this as a guide. The phoneme identification unit 12 evaluates these search results to identify the speech. When this is finished, the same identification is performed on the four-direction components obtained from the mask operation circuit 6 for the next frame.

If no horizontal component is detected, the same analysis is performed on the next frame. The speech recognition process ends when all frames have been analyzed in this way.
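
The per-frame decision flow just described can be sketched as a loop. The sketch covers only the branching on horizontal and vertical components; the dictionary search and phoneme identification are left out. The frame representation (a dict of two boolean flags), the label strings and the name `analyze_frames` are assumptions.

```python
def analyze_frames(frames):
    """Return (frame index, coarse label) for each frame that has a
    horizontal component; frames without one are skipped."""
    results = []
    for index, frame in enumerate(frames):
        if not frame["horizontal"]:
            continue  # no horizontal component: go on to the next frame
        if frame["vertical"]:
            label = "vowel_or_voiced_consonant"
        else:
            label = "consonant_or_silence"
        results.append((index, label))
    return results


frames = [
    {"horizontal": False, "vertical": False},
    {"horizontal": True, "vertical": True},
    {"horizontal": True, "vertical": False},
]
labels = analyze_frames(frames)
print(labels)
```

In a fuller version, each coarse label would select which part of the dictionary the matching step searches.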

As described above, because the present invention performs pattern recognition by detecting changes through mask operations, the analysis is not affected by the magnitude of the power and also takes the influence of the preceding and following frames into account. It further provides the advantages noted above in connection with FIG. 6, which conventional methods do not offer.

Although the above description uses 3×3 masks, the mask size is of course not limited to this.

[Brief Description of the Drawings]

FIG. 1 is a frequency spectrum characteristic diagram of speech; FIG. 2 is a frequency spectrum characteristic diagram showing the magnitude of the power as shading; FIG. 3 is a frequency spectrum characteristic diagram created by applying a threshold to clarify its features; FIG. 4 is an explanatory diagram of the masks used in the present invention; FIG. 5 is an explanatory diagram of other masks used in the present invention; FIG. 6 is a frequency spectrum characteristic diagram analyzed with the masks of FIG. 4; FIG. 7 is a configuration diagram of an embodiment of the present invention; and FIG. 8 is a flowchart showing its operation.

In the figures, 1 is a speech input unit, 2 is a spectrum conversion unit, 3 is a first frame buffer, 4 is a second frame buffer, 5 is a third frame buffer, 6 is a mask operation circuit, 7 is a four-direction component extraction circuit, 8 is a sound-section start/end detection circuit, 9 is a vowel/consonant determination circuit, 10 is a dictionary unit, 11 is a matching unit, and 12 is a phoneme identification unit.

Claims

Translated from Japanese

[Claims]

1. A speech recognition method that analyzes a speech input on the basis of its frequency spectrum and performs speech recognition by matching the analysis result against a dictionary, characterized by comprising: spectrum conversion means for creating from the speech input a spectrum representing its frequency components and the magnitude of their power; mask operation means for performing mask operations on this spectrum; and component extraction means for extracting, on the basis of the mask operations, at least a vertical component indicating a pattern change in the vertical direction of the spectrum and a horizontal component indicating a pattern change in the horizontal direction, wherein the dictionary is matched using a pattern obtained from the vertical and horizontal components.