JP2010256391A


Info

Publication number: JP2010256391A
Application number: JP2009102722A
Authority: JP
Inventors: Takeshi Hanamura; 剛花村; Takashi Koike; 隆治小池
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-04-21
Filing date: 2009-04-21
Publication date: 2010-11-11

Abstract

PROBLEM TO BE SOLVED: To objectively and easily extract highly significant key points from human speech.

SOLUTION: An utterance feature analysis control unit 50 receives text information, section type information, and the like generated from the speech by a speech recognition unit 41; a pupil diameter obtained by a pupil diameter analysis unit 42 through analysis of an eyeball image; and facial movement estimated from a normal face image by a facial movement amount estimation unit 43. As the result of analyzing the utterance features, it calculates the relative volume and relative pitch of the speech, the speech rate, meaningless sections, portions that could not be converted into text, a speaker emotional response value, and a listener emotional response value. A display device 20 reflects the loudness and pitch of the speech, the speech rate, the presence of meaningless sections and untranscribable portions, and the speaker emotional response value (and/or the listener emotional response value) in the form of the text characters, as characteristics of the text information of the speech, and expresses them at the positions of the corresponding text characters.

COPYRIGHT: (C)2011, JPO&INPIT

Description

Translated from Japanese

The present invention relates to an apparatus for converting speech into text, and more particularly to a technique for extracting highly significant key points from uttered speech.

Conventionally, to extract highly significant key points (the parts where a person is speaking with interest and concentration) from human speech, an operator recorded the speech, played it back, and analyzed it by listening to the playback. This required an enormous amount of time and labor. Speech recognition technology has been used to make this manual work more efficient.

For example, speech recognition systems that convert human speech into text and analyze the text information are known (see Non-Patent Documents 1 and 2 and Patent Document 1). In the speech recognition systems of Non-Patent Documents 1 and 2, an acoustic model matches the frequency patterns of phonemes and syllables against the human speech while referring to a word dictionary that defines a prepared vocabulary (set of words) to be recognized and its pronunciation, and thereby generates input speech. A decoder then converts the input speech generated by the acoustic model into text, that is, a word string, while referring to a language model that defines word sequences. The speech recognition system of Patent Document 1 makes it possible to associate edited text information with the original speech, even after the text information of the speech has been edited, by using the result of matching the characters in the text information against the characters in the speech recognition result.

Using such a speech recognition system, human speech can be converted into text and the text information can be edited. An operator can also extract highly significant key points by analyzing the text information obtained by the speech recognition system.

However, the operator's analysis of the text information is manual work, just like the method of listening to played-back speech described above, so the accuracy with which highly significant key points are extracted depends on subjective judgment such as the person's experience and intuition. In addition, the quality of human speech is not constant and varies with the surrounding environment, the person's physical condition, and so on, so speech is not necessarily organized for easy analysis or expressed in an easily understood way. Analysis by an operator therefore cannot extract key points with high accuracy. Accordingly, a system is desired that can extract highly significant key points from human speech with high accuracy without depending on subjective judgment such as a person's experience and intuition.

Meanwhile, a technique is known for evaluating emotional responses by calculating the pupil diameter from an eyeball image and capturing its variation (see, for example, Patent Document 2). In this technique, a camera captures an eyeball image of a person watching video content, the variation in pupil diameter is calculated from the eyeball image, this variation is treated as the person's emotional response to the video content, and the degree of interest in the video content is calculated.

This technique is based on the finding that human emotional responses appear as variations in pupil diameter. However, emotional responses include primary emotions, which are instinctive responses, and higher-order emotional responses, which involve consciousness, and these responses appear superimposed in the pupil diameter. Capturing variations in pupil diameter alone therefore cannot determine with high accuracy whether a person is really responding with interest.

Patent Document 1: JP 2007-133033 A
Patent Document 2: JP 2004-282471 A

Non-Patent Document 1: "Julius", [online], Julius development team, [retrieved March 10, 2009], Internet <URL: http://julius.sourceforge.jp/index.php?q=documents.html#beginner>
Non-Patent Document 2: Tatsuya Kawahara and one other, "Continuous Speech Recognition Software Julius", [online], [retrieved March 10, 2009], Internet <URL: http://julius.sourceforge.jp/paper/JSAI05.pdf>

The present invention has been made to solve the above problems, and its object is to provide a speech information processing apparatus capable of objectively and easily extracting highly significant key points from human speech.

To achieve the above object, a speech information processing apparatus according to the present invention converts a speaker's speech into text information and comprises: a speech recognition unit that converts the speech into text information using a dictionary defining a vocabulary, the pronunciation of the vocabulary, and section types for assigning sections of the text information to the vocabulary, that sets silent sections in which the signal level of the speech is below a predetermined value, that sets statement sections, which are the parts of the time sections in which the speaker vocalized where a meaningful utterance was actually made, based on the section types defined for the vocabulary in the dictionary and on the vocabulary contained in the text information, and that sets the time sections in which the speaker vocalized, excluding the statement sections, as other-utterance sections; an utterance voice characteristic data calculation unit that calculates utterance voice characteristic data for each section based on the speech; a speaker emotional response value calculation unit that receives physiological response data, which changes with the speaker's physiological state, and calculates, for each section based on the physiological response data, a speaker emotional response value indicating the degree of the speaker's emotion; and a display unit that, when displaying the text information in the statement sections and other-utterance sections as text characters, displays the text characters for each section distinguished by the speech recognition unit in a form corresponding to the value of the utterance voice characteristic data calculated by the utterance voice characteristic data calculation unit and to the speaker emotional response value calculated by the speaker emotional response value calculation unit, and displays the silent sections in a preset form.
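The section-setting logic of the speech recognition unit can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the threshold `SILENCE_LEVEL`, the toy dictionary `SECTION_TYPES`, the `Segment` record, and the rule that a voiced segment counts as a statement section (a section of meaningful utterance) when it contains at least one word typed "statement" in the dictionary are all hypothetical.

```python
from dataclasses import dataclass

# Assumed threshold: segments whose mean signal level is below it are silent.
SILENCE_LEVEL = 0.05

# Hypothetical dictionary entries: word -> section type.
SECTION_TYPES = {
    "price": "statement",
    "battery": "statement",
    "um": "other",
    "uh": "other",
}


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    level: float  # mean signal level of the speech in this segment
    text: str     # recognized text, empty if nothing was recognized


def classify(segment: Segment) -> str:
    """Return "silent", "statement", or "other" for one time segment."""
    if segment.level < SILENCE_LEVEL:
        return "silent"
    words = segment.text.lower().split()
    if any(SECTION_TYPES.get(word) == "statement" for word in words):
        return "statement"
    # Voiced, but no meaningful vocabulary: an other-utterance section.
    return "other"
```

For instance, a segment containing only fillers such as "um uh" at a normal signal level would be labeled "other", while a segment whose level is below the threshold would be labeled "silent" regardless of its text.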

Further, in the speech information processing apparatus according to the present invention, the utterance voice characteristic data calculation unit calculates the volume, pitch, and rate of the speech for each section based on the speech; the speaker emotional response value calculation unit calculates the speaker emotional response value for each section based on at least one of data associated with the speaker's eye movement, facial movement, pulse rate, and amount of perspiration; and the display unit displays the text characters in respective forms corresponding to the values of the volume, pitch, and rate of the speech calculated by the utterance voice characteristic data calculation unit and to the speaker emotional response value calculated by the speaker emotional response value calculation unit, displays the silent sections as blanks, and displays, in a preset form, any statement section or other-utterance section that the speech recognition unit could not convert into text.
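One way to picture the display rule is a function that maps each section to a display description. The sketch below is illustrative only: the text above says that the form of the characters depends on volume, pitch, rate, and the emotional response value, but it does not say which visual attribute carries which quantity. The attribute names (`font_size`, `raised`, `condensed`, `highlight`), the base size of 12, the 0.5 emotion cutoff, and the `[?]` placeholder are assumptions.

```python
def display_form(section_type, text, volume=1.0, pitch=1.0, rate=1.0, emotion=0.0):
    """Map one section to a display description (all style rules are assumed)."""
    if section_type == "silent":
        # Silent sections are displayed as blanks.
        return {"text": " "}
    if not text:
        # Preset form for a voiced section that could not be converted to text.
        return {"text": "[?]"}
    return {
        "text": text,
        "font_size": int(round(12 * volume)),  # louder speech -> larger characters
        "raised": pitch > 1.0,                 # higher pitch -> raised characters
        "condensed": rate > 1.0,               # faster speech -> tighter spacing
        "highlight": emotion >= 0.5,           # strong emotional response -> highlight
    }
```

Here `volume`, `pitch`, and `rate` are treated as values relative to a per-speaker baseline of 1.0, which is also an assumption of the sketch.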

Further, the speech information processing apparatus according to the present invention comprises: a section importance calculation unit that calculates the importance of the text information for each section based on the utterance voice characteristic data and the speaker emotional response value; a frequent important word extraction unit that identifies high-importance sections based on the importance of the text information and a predetermined value, and extracts words from the text information of the identified sections; and a search unit that searches a database using the extracted words as search terms, wherein the display unit further displays the results of the database search.
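The chain from section importance to database search can be sketched as below. The importance formula is not given in this passage, so the weighted sum of relative volume and emotional response value used here is purely an assumption, as are the weights, the threshold of 0.9, the sample sections, and the in-memory list standing in for the database.

```python
from collections import Counter


def importance(section, w_volume=0.5, w_emotion=0.5):
    """Assumed importance score: weighted sum of volume and emotional response."""
    return w_volume * section["volume"] + w_emotion * section["emotion"]


def frequent_important_words(sections, threshold, top_n=1):
    """Count words only in sections whose importance reaches the threshold."""
    counts = Counter()
    for section in sections:
        if importance(section) >= threshold:
            counts.update(section["text"].lower().split())
    return [word for word, _ in counts.most_common(top_n)]


def search(database, terms):
    """Return the database entries that contain any of the search terms."""
    return [doc for doc in database if any(term in doc.lower() for term in terms)]


# Hypothetical per-section analysis results.
sections = [
    {"text": "battery life battery", "volume": 1.4, "emotion": 0.8},
    {"text": "um well", "volume": 0.6, "emotion": 0.1},
    {"text": "battery price", "volume": 1.2, "emotion": 0.7},
]
database = ["Battery replacement guide", "Screen care tips"]

terms = frequent_important_words(sections, threshold=0.9)
results = search(database, terms)
```

With these sample values, the first and third sections pass the threshold, "battery" is the most frequent word in them, and the search returns the one entry that mentions it.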

Further, the speech information processing apparatus according to the present invention comprises a listener emotional response value calculation unit that calculates a listener emotional response value for each section based on at least one of data associated with the eye movement, facial movement, pulse rate, and amount of perspiration of a listener who listens to the speaker's utterance, wherein the display unit further displays the text characters in a form corresponding to the listener emotional response value calculated by the listener emotional response value calculation unit.

Further, the speech information processing apparatus according to the present invention comprises, for each of a plurality of speakers, a speech recognition unit, an utterance voice characteristic data calculation unit, a speaker emotional response value calculation unit, and a display unit that each perform the processing described above, and further comprises a section importance calculation unit that, for a section of an utterance by one of the plurality of speakers, calculates the importance of that speaker's utterance based on that speaker's utterance voice characteristic data and speaker emotional response value, the speaker emotional response values of the other speakers, and the listener emotional response value, wherein the display unit further displays the text characters of the speaker for whom the importance was calculated in a form corresponding to the importance.

Further, a system including the speech information processing apparatus according to the present invention comprises: an irradiator that irradiates the area around the speaker's eyes with near-infrared light; and a camera that has a filter that transmits the near-infrared light and an image sensor that receives the light exiting the filter, and that outputs the speaker's image obtained through the filter and the image sensor as a blindfolded face image, wherein the display unit of the speech information processing apparatus displays the speaker's blindfolded face image output by the camera.

As described above, according to the present invention, the time sections of the speech are divided into statement sections, other-utterance sections, and silent sections; utterance voice characteristic data and emotional response data are calculated for each of these sections; and the text information converted from the speech is displayed in a form corresponding to the characteristics of the speech and in a form corresponding to the speaker's emotional response. This makes it possible to objectively and easily extract highly significant key points from human speech.

FIG. 1 is a schematic diagram showing the hardware configuration of a speech information processing apparatus according to a first embodiment (Example 1) of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the speech information processing apparatus according to the first embodiment (Example 1) of the present invention.
FIG. 3 is a diagram showing the configuration of a camera that captures the speaker's face.
FIG. 4 is a diagram showing a list of the DBs created in the storage unit.
FIG. 5 is a block diagram showing the configuration of the utterance feature analysis control unit.
FIG. 6 shows (1) a graph of the relative volume V(t) of the speech and (2) a graph of the emotional response value Es(t).
FIG. 7 is a diagram explaining a configuration example of the analysis result DB.
FIG. 8 is an example of a screen displayed on the display device.
FIG. 9 is a diagram explaining a display example of the presentation information.
FIG. 10 is a block diagram showing the functional configuration of a speech information processing apparatus according to a second embodiment (Example 2) of the present invention.
FIG. 11 is a block diagram showing the functional configuration of a speech information processing apparatus according to a third embodiment (Example 3) of the present invention.

Hereinafter, embodiments for carrying out the present invention will be described with reference to the drawings.

[Example 1]
First, a first embodiment (Example 1) of the present invention will be described. FIG. 1 is a schematic diagram showing the hardware configuration of the speech information processing apparatus of Example 1. The speech information processing apparatus 1 comprises: a CPU 101; a storage unit 102 consisting of a ROM and a RAM that store various programs, DBs (databases), tables, and the like; a storage device (hard disk device) 103 that stores various application programs, DBs, data, and the like; a communication unit 104 that transmits and receives data via a network; an input interface unit 105 that receives the speaker's speech collected by a microphone 14, the speaker's eyeball image captured by a camera 15, and the speaker's face images (a normal face image and a blindfolded face image) captured by a camera 17; an operation/input unit 106 that controls the input of predetermined data in response to the speaker's operation of a mouse, keyboard, or the like; a display output interface unit 107 that outputs, to a display 12, screen information prompting the speaker to speak and to perform key operations; and an audio output interface unit 108 that outputs, to a loudspeaker 11, audio information prompting the speaker to speak and to perform key operations. These components are interconnected via a system bus 109.

The storage device 103 stores an OS (operating system) program that provides the basic functions of the speech information processing apparatus 1; a communication program that communicates with external devices via the communication unit 104; a speech information processing program that performs a series of processes, namely presenting guidance information that prompts the speaker to speak, analyzing the speech to extract highly significant key points from it, and converting them into presentation information for display; and other programs. When the speech information processing apparatus 1 performs processing, the CPU 101 reads these programs from the storage device 103 into the RAM of the storage unit 102 and executes them.

When the CPU 101 reads the OS program from the storage device 103 and executes it, the OS program manages, as the basic functions of the speech information processing apparatus 1, the storage unit 102, the storage device 103, the communication unit 104, the input interface unit 105, the operation/input unit 106, the display output interface unit 107, and the audio output interface unit 108. The communication program, the speech information processing program, and so on are executed while the OS program is being executed by the CPU 101.

The control unit 100 consists of the CPU 101 and the storage unit 102. The CPU 101 reads and executes the various programs stored in the storage unit 102 and the storage device 103, thereby exercising overall control of the speech information processing apparatus 1. Thus, with the hardware configuration shown in FIG. 1, the control unit 100 of the speech information processing apparatus 1 performs various processes according to the speech information processing program.

FIG. 2 is a block diagram showing the functional configuration of the speech information processing apparatus 1 according to Example 1, namely the functional configuration when the control unit 100 shown in FIG. 1 executes processing under the speech information processing program. The speech information processing apparatus 1 comprises a guidance information presentation unit 31, an input unit 32, a speech recognition unit 41, a pupil diameter analysis unit 42, a facial movement amount estimation unit 43, an utterance feature analysis control unit 50, a storage unit 60, and a presentation information conversion unit 70. The system that implements this speech information processing comprises the speech information processing apparatus 1, the loudspeaker 11, the display 12, blindfold goggles (headset) 13 equipped with the microphone 14, the camera 15, and an irradiator 16, the camera 17, and a display device 20. The speech information processing apparatus 1 and the display device 20 are connected via a network 21 such as the Internet.

Guidance audio information is output from the speech information processing apparatus 1 to the loudspeaker 11. Through spoken questions and the like, the speaker is thereby asked for opinions, prompted to speak, and prompted to perform key operations such as making selections. Guidance screen information is output from the speech information processing apparatus 1 to the display 12, which likewise prompts the speaker to speak and to perform key operations such as making selections.

(Blindfold goggles)
The blindfold goggles 13 are a device worn on the speaker's face: the microphone 14 collects sound, the camera 15 captures the eyeball, and the irradiator 16 irradiates the area around the eyes with near-infrared light. Specifically, the microphone 14 is placed near the speaker's mouth so that it can pick up the speaker's voice, the camera 15 is placed near the speaker's eyes so that it can capture the eyeball, and the irradiator 16 is placed near the speaker's eyes so that it can irradiate the area around the eyes with near-infrared light. The irradiator 16 irradiates the area around the speaker's eyes with near-infrared light so that the camera 17, described later, can acquire, in addition to an ordinary face image (hereinafter, the normal face image), a face image in which the area around the eyes is masked and the speaker cannot be identified (hereinafter, the blindfolded face image). A further reason is that the eyeball image analysis described later uses eyeball images captured under near-infrared irradiation to calculate the pupil diameter, the viewpoint position, and so on. The near-infrared light used here is an electromagnetic wave of high enough intensity to cause halation around the speaker's eyes. Because the irradiator 16 emits near-infrared light, the speaker can speak without feeling glare, and the camera 17 can capture a face image with a natural expression. The irradiator 16 is attached, for example, to the frame of the glasses that form part of the blindfold goggles 13. The near-infrared light is then emitted, for example, from the horizontal or vertical direction relative to the speaker's eyes, so the irradiator 16 does not obstruct the speaker's field of view.

(Face camera)
The camera 17 captures the speaker's face and outputs a normal face image and a blindfolded face image. FIG. 3 shows the configuration of the camera 17. FIG. 3(1) shows an example built from a single camera: the camera 17-1 comprises a lens 171, a beam-splitting prism 172, a near-infrared cut filter 173, an image sensor 174, a near-infrared transmission filter 175, and an image sensor 176. The speaker's face image signal that has passed through the lens 171 is split into two paths by the beam-splitting prism 172. The face image signal on the first path enters the near-infrared cut filter 173, which removes the near-infrared light, that is, the near-infrared light that the irradiator 16 directed at the area around the speaker's eyes. The face image signal with the near-infrared light removed then enters the image sensor 174 and is output as the normal face image. The face image signal on the second path enters the near-infrared transmission filter 175, which passes the near-infrared light, that is, the near-infrared light that the irradiator 16 directed at the area around the speaker's eyes. The face image signal carrying the transmitted near-infrared light then enters the image sensor 176 and is output as the blindfolded face image.

FIG. 3(2) shows an example built from two cameras: the camera 17-2 consists of a first camera unit and a second camera unit. The first camera unit comprises a lens 171-1, the near-infrared cut filter 173, which removes the near-infrared light that the irradiator 16 directed at the area around the speaker's eyes, and the image sensor 174. The second camera unit comprises a lens 171-2, the near-infrared transmission filter 175, which passes the near-infrared light that the irradiator 16 directed at the area around the speaker's eyes, and the image sensor 176. The first camera unit outputs the normal face image, and the second camera unit outputs the blindfolded face image.

In this way, the cameras 17-1 and 17-2 output a normal face image, in which the individual can be identified from the face including the speaker's eyes, and a blindfolded face image, in which the area around the speaker's eyes is masked by the effect of the high-intensity near-infrared light so that the individual cannot be identified. To obtain a face image that conceals the speaker, the speech information processing apparatus 1 need only receive the blindfolded face image directly from the cameras 17-1 and 17-2. No post-processing to generate a concealed image from the normal face image is required, and a face image that conceals the speaker can be obtained easily. Because the speaker's eyes and the surrounding area are not captured in the blindfolded face image, personal information can be protected easily. Note that the cameras 17-1 and 17-2 do not necessarily have to include the near-infrared transmission filter 175.

(Speech information processing apparatus)
Next, the speech information processing apparatus 1 shown in FIG. 2 will be described. The guidance information presentation unit 31 of the speech information processing apparatus 1 reads guidance information from the guidance information DB 61 stored in the storage unit 60 and, based on this guidance information, outputs audio information to the loudspeaker 11 and screen information to the display 12 to prompt the speaker to speak and to perform key operations such as making selections. The speaker is thereby asked for opinions through questions and the like, and the speaker's speech is collected by the microphone 14 of the blindfold goggles 13. The guidance information DB 61 in the storage unit 60 stores audio and screen sequence information for eliciting utterances.

The input unit 32 inputs the speaker's utterance voice from the microphone 14 mounted on the blindfold goggles 13, the eyeball video from the camera 15, and the speaker's normal face video and blindfolded face video from the camera 17. It stores these pieces of information in the storage unit 60 as mutually synchronized information. As a result, the utterance voice DB 62, the eyeball video DB 63, the normal face video DB 64, and the blindfolded face video DB 65 are generated in the storage unit 60 as information for utterance feature analysis control.

FIG. 4 shows a list of the DBs stored in the storage unit 60. As shown in FIG. 4, the storage unit 60 stores the guidance information DB 61, the utterance voice DB 62, the eyeball video DB 63, the normal face video DB 64, the blindfolded face video DB 65, the text information DB 66, the pupil diameter / facial movement DB 67, the analysis result DB 68, and so on. Although not shown, information such as key operations selected by the speaker and other information needed for processing by the voice information processing apparatus 1 are also stored in the storage unit 60.

The guidance information DB 61 stores, as guidance information, audio information and screen information for prompting the speaker to speak. The utterance voice DB 62 stores the speaker's utterance voice, the eyeball video DB 63 stores the speaker's eyeball video, the normal face video DB 64 stores the speaker's normal face video, and the blindfolded face video DB 65 stores the speaker's blindfolded face video. The pieces of information stored in the utterance voice DB 62, the eyeball video DB 63, the normal face video DB 64, and the blindfolded face video DB 65 are synchronized with one another. The text information DB 66 stores, via the utterance feature analysis control unit 50, the text information recognized by the speech recognition unit 41 together with section type information, section volume information, and the like. The pupil diameter / facial movement DB 67 stores, via the utterance feature analysis control unit 50, the pupil diameter analyzed by the pupil diameter analysis unit 42 and the facial movement estimated by the facial movement amount estimation unit 43. The analysis result DB 68 stores the results of analysis by the utterance feature analysis control unit 50. The text information, section type information, section volume information, pupil diameter, facial movement, and analysis results are described in detail later.

Next, the series of operations for presenting guidance information to the speaker and for inputting and storing the utterance voice and related data will be described. When the speaker whose utterance features are to be analyzed performs a start key operation, the input unit 32 inputs that start key operation (the input of key operations is omitted from FIG. 2). The guidance information presentation unit 31 reads guidance information from the guidance information DB 61 in the storage unit 60 and outputs to the loudspeaker 11 audio information from it, for example "What do you think the economy will be like six months from now?" The speaker hears this audio from the loudspeaker 11 and states an opinion in response to the question. Following the start key operation, the input unit 32 inputs the utterance voice from the microphone 14, the eyeball video from the camera 15, and the normal face video and blindfolded face video from the camera 17, and stores them in the storage unit 60 as synchronized information. Next, the guidance information presentation unit 31 outputs to the loudspeaker 11 further audio information from the guidance information, for example "Why do you think so?" Similarly, after the input unit 32 has input the utterance voice, the guidance information presentation unit 31 outputs the audio information "Which of the following do you think describes the economy six months from now?" to the loudspeaker 11 and outputs the screen information "1. Upward 2. Unchanged 3. Downward" to the display 12. The input unit 32 inputs the information selected by the speaker's key operation and stores it in the storage unit 60. When the speaker performs an end key operation, the input unit 32 inputs that end key operation and ends the input and storage of the utterance voice, eyeball video, normal face video, and blindfolded face video.

In this way, the guidance information presentation unit 31 reads guidance information from the guidance information DB 61 in the storage unit 60 and presents it to the speaker. The input unit 32 then inputs the utterance voice given as an opinion on the question or the like, together with the eyeball video, normal face video, blindfolded face video, and so on, and stores them in the storage unit 60 as synchronized information.

(Speech recognition unit)
In the voice information processing apparatus 1 of FIG. 2, the speech recognition unit 41 reads the speaker's utterance voice from the utterance voice DB 62 in the storage unit 60 and, based on the characteristics of the utterance voice, converts the speech recognition result into text information indicating the content of the utterance. Specifically, the speech recognition unit 41 has a dictionary that defines the vocabulary to be recognized (a set of words and phrases) and its pronunciation. Using this dictionary, it extracts, for the utterance voice of a given section, the word or phrase whose frequency pattern of phonemes (each roughly equivalent to one Roman letter) or syllables (each equivalent to one kana character) matches best. It then uses the kanji, hiragana, or katakana character string assigned to that word or phrase in the dictionary to convert the utterance voice of the section into text information segmented by word or phrase. Techniques for converting utterance voice into text information are known; for example, the techniques described in Non-Patent Documents 1 and 2 mentioned above are used.

Based on the characteristics of the utterance voice, the speech recognition unit 41 also classifies the sections of the utterance voice into three types: speech sections, other-utterance sections, and silent sections. For each section it obtains the start time t1, the end time t2, and the section time length (expressed in seconds or the like), and generates these as section type information. A speech section is a section in which the speaker actually made a meaningful utterance. An other-utterance section is a section in which the speaker made a sound but which is not a speech section. A silent section is a section in which the speaker made no sound. Specifically, the speech recognition unit 41 identifies sections in which the input signal level of the utterance voice does not reach a predetermined value and generates section type information marking those sections as silent sections. In addition, the dictionary of the speech recognition unit 41 defines, for each word and phrase, whether it belongs to speech or to other utterance. For sections other than silent sections, the speech recognition unit 41 looks up the word or phrase recognized from the utterance voice in the dictionary, determines from the classification defined there whether the section is a speech section or an other-utterance section, and generates the section type information.

The speech recognition unit 41 converts the utterance voice into text and classifies its sections into speech sections, other-utterance sections, and silent sections. In a section other than a silent section, however, the utterance voice may be unrecognizable, that is, it may not be possible to convert it into text. In that case, the speech recognition unit 41 generates section type information marking that section as an unrecognizable section.
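The section classification described above can be sketched as follows. This is only an illustrative sketch: the names `classify_section` and `WORD_CLASS`, the threshold value, and the sample dictionary entries are assumptions made for the example and do not come from the specification, which leaves the predetermined level and the dictionary contents open.

```python
SILENT = "silent"
SPEECH = "speech"
OTHER = "other_utterance"
UNRECOGNIZABLE = "unrecognizable"

# Hypothetical dictionary: each recognizable word or phrase is tagged
# as speech or as other utterance (fillers and the like).
WORD_CLASS = {
    "maybe": SPEECH,
    "the same as now": SPEECH,
    "hmm": OTHER,
    "um": OTHER,
}


def classify_section(level, recognized, threshold=0.1):
    """Classify one section of the utterance voice.

    level      -- input signal level of the section (assumed scale 0..1)
    recognized -- word or phrase returned by the recognizer, or None
    threshold  -- assumed "predetermined value" for the silence test
    """
    if level < threshold:
        return SILENT  # signal level never reaches the threshold
    if recognized is None or recognized not in WORD_CLASS:
        return UNRECOGNIZABLE  # voiced, but could not be turned into text
    return WORD_CLASS[recognized]  # speech or other utterance
```

For instance, `classify_section(0.6, "um")` falls into the other-utterance class, while a voiced section with no recognition result is marked unrecognizable.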

The speech recognition unit 41 also calculates, from the utterance voice, the average volume of each speech section, other-utterance section, and silent section, and generates this average volume as section volume information. Specifically, it integrates the volume over each section and divides by the section time length to obtain the average volume of that section. The text information, section type information, and section volume information are output to the utterance feature analysis control unit 50 together with the utterance voice.

(Pupil diameter analysis unit)
The pupil diameter analysis unit 42 reads the speaker's eyeball video from the eyeball video DB 63 in the storage unit 60 and analyzes it to calculate the pupil diameter. Specifically, the pupil diameter analysis unit 42 binarizes the entire eyeball image of one frame, sets a measurement window around the eyeball in the image, obtains the area of the pupil portion inside the measurement window from the binarized data, and calculates the pupil diameter. The pupil diameter calculated in this way is output to the utterance feature analysis control unit 50. Techniques for analyzing eyeball images are known; for example, the technique described in Patent Document 2 mentioned above is used.
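As a rough illustration of the binarize, window, and area steps, the sketch below works on a grayscale frame given as a list of pixel rows. The function name, the darkness threshold, and in particular the conversion from area to diameter (treating the pupil as a circle) are assumptions for the example; the specification defers the actual method to Patent Document 2.

```python
import math


def pupil_diameter(gray, window, threshold=50):
    """Estimate the pupil diameter in pixels from one grayscale frame.

    gray      -- list of rows of pixel intensities (0 = black)
    window    -- (top, left, bottom, right) measurement window, end-exclusive
    threshold -- assumed intensity below which a pixel counts as pupil
    """
    # Binarize the whole frame: 1 for dark (pupil candidate), 0 otherwise.
    binary = [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

    # Count pupil pixels inside the measurement window only.
    top, left, bottom, right = window
    area = sum(sum(row[left:right]) for row in binary[top:bottom])

    # Assumption: the pupil is roughly circular, so area = pi * (d / 2) ** 2.
    return 2.0 * math.sqrt(area / math.pi)
```

Dark pixels outside the window (eyelashes, shadows) do not contribute to the area, which is the purpose of the measurement window.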

(Facial movement amount estimation unit)
The facial movement amount estimation unit 43 reads the speaker's normal face video from the normal face video DB 64 in the storage unit 60 and analyzes it to estimate facial movement data (hereinafter simply "facial movement"). The facial movement may instead be estimated by reading and analyzing the blindfolded face video from the blindfolded face video DB 65. Here, facial movement refers to motion accompanying head shaking or body movement, and consists of data on the magnitude and direction of the movement of the face. Specifically, the facial movement amount estimation unit 43 compares the read normal face video with a pre-registered face video of the speaker, extracts a similar region (see (1) below), and, based on the face video within the extracted region, detects the orientation of the face for each frame of the normal face video (see (2) below). It then divides the change in face orientation detected between two frames by the elapsed time to calculate the facial movement per unit time. This facial movement is a facial movement vector consisting of parameters c and d, which indicate the translation speed on a given imaging plane when the face is projected onto that plane (c corresponds to the horizontal component and d to the vertical component of the translation speed on the imaging plane). The facial movement estimated in this way is output to the utterance feature analysis control unit 50. Techniques for calculating facial movement are known; see, for example, (1) to (4) below.
(1) "King's Miniature Garden::blog, Face detection API 'facekit' usable from JavaScript (registered trademark)", [online], March 21, 2007, Internet <http://d.hatena.ne.jp/masayoshi/20070321>
(2) "Tutorial: OpenCV haartraining (Rapid Object Detection With A Cascade of Boosted Classifiers Based on Haar-like Features)", [online], Internet <http://note.sonots.com/SciSoftware/haartraining.html>
(3) Kumi Jinzenji and two others, "Global motion calculation method for sprite generation and its application to coding", IEICE Transactions D-2, Vol. J83-D-2, No. 2, pp. 535-544, February 2000
(4) Japanese Patent No. 3551908
Another technique for calculating facial movement is to capture the orientation of the face in real time, as realized in (1) above, and calculate the facial movement data from it. The algorithm of (2) above is used in that case.
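The per-unit-time calculation of the facial movement vector (c, d) can be sketched as below. The sketch assumes that face detection has already produced the position of the face on the imaging plane for each frame; the function name and the use of pixel coordinates and seconds are assumptions for illustration only.

```python
def facial_movement(pos_prev, pos_curr, t_prev, t_curr):
    """Return the facial movement vector (c, d) between two frames.

    pos_prev, pos_curr -- (x, y) face position on the imaging plane
    t_prev, t_curr     -- capture times of the two frames, in seconds
    c is the horizontal and d the vertical translation speed.
    """
    elapsed = t_curr - t_prev
    if elapsed <= 0:
        raise ValueError("frames must be given in increasing time order")
    c = (pos_curr[0] - pos_prev[0]) / elapsed  # horizontal component
    d = (pos_curr[1] - pos_prev[1]) / elapsed  # vertical component
    return c, d
```

A face that shifts 6 pixels right and 3 pixels up (negative y) in half a second therefore yields c = 12 and d = -6 pixels per second.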

The text information and related data from the speech recognition unit 41, the pupil diameter from the pupil diameter analysis unit 42, and the facial movement from the facial movement amount estimation unit 43 are output to the utterance feature analysis control unit 50 as information that is mutually synchronized and associated in time.

(Utterance feature analysis control unit)
The utterance feature analysis control unit 50 inputs the text information and related data from the speech recognition unit 41, the pupil diameter from the pupil diameter analysis unit 42, and the facial movement from the facial movement amount estimation unit 43 as information that is mutually synchronized and associated in time. It then stores the text information and related data, the pupil diameter, and the facial movement in the storage unit 60. As a result, the text information DB 66 and the pupil diameter / facial movement DB 67 are generated in the storage unit 60.

The utterance feature analysis control unit 50 also analyzes the utterance features based on the various pieces of input information, generates an analysis result, and stores it in the storage unit 60. As a result, the analysis result DB 68 is generated in the storage unit 60.

The utterance feature analysis performed by the utterance feature analysis control unit 50 is described below. FIG. 5 is a block diagram showing the configuration of the utterance feature analysis control unit 50. The utterance feature analysis control unit 50 includes a relative volume calculation unit 51, a relative pitch calculation unit 52, an utterance speed calculation unit 53, a meaningless section identification unit 54, an untranscribable part identification unit 55, a speaker emotional response value calculation unit 56, a section importance calculation means 57, and a frequent important word extraction means 58.

Using the input text information, section type information, section volume information, and utterance voice, the relative volume calculation unit 51 obtains the maximum value Vmax and the minimum value Vmin of the volume v(t) of the utterance voice over the speech sections and other-utterance sections, which contain text segmented by phrase or word, and calculates the relative volume V(t) by the following equation.
V(t) = (v(t) - Vmin) / (Vmax - Vmin)

For each speech section and other-utterance section, the relative volume calculation unit 51 also calculates the integral of the per-unit-time relative volume V(t) over the section and divides it by the section time length to obtain the section average volume V.
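The min-max normalization and the section averaging can be sketched as follows, assuming the volume is sampled at a fixed interval `dt` starting at time 0. The function names, the sampling assumption, the rectangle-rule integral, and the handling of a constant signal (Vmax equal to Vmin) are choices made for this example; the specification does not state them.

```python
def relative_series(values):
    """Min-max normalize samples: (v - Vmin) / (Vmax - Vmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant signal maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def section_average(series, dt, t1, t2):
    """Integrate the series over [t1, t2) and divide by the section length.

    series -- samples taken every dt seconds, first sample at time 0
    """
    first, last = round(t1 / dt), round(t2 / dt)
    integral = sum(series[first:last]) * dt  # rectangle-rule integral
    return integral / (t2 - t1)
```

The same two helpers apply unchanged to the pitch a(t), since the relative pitch A(t) and the section average pitch A are defined in the same way.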

FIG. 6(1) is a graph showing the relative volume V(t) of the utterance voice. The vertical axis shows the relative volume V(t) and the horizontal axis shows time t. Along the time axis, a denotes a speech section, b an other-utterance section, and c a silent section. FIG. 6(1) shows that the relative volume is higher in the speech sections a, in which the texts "Maybe," and "the same as now" were uttered, than in the other-utterance sections b, in which the texts "Hmm" and "Um" were uttered. In this way, the relative volume calculation unit 51 calculates the relative volume V(t) and the section average volume V of the utterance voice in the speech sections and other-utterance sections. The relative volume V(t) and the section average volume V are also calculated for silent sections.

Returning to FIG. 5, the relative pitch calculation unit 52 uses the input text information, section type information, and utterance voice to obtain the maximum value Amax and the minimum value Amin of the pitch a(t) of the utterance voice over the speech sections and other-utterance sections, which contain text segmented by phrase or word, and calculates the relative pitch A(t) by the following equation.
A(t) = (a(t) - Amin) / (Amax - Amin)

For each speech section and other-utterance section, the relative pitch calculation unit 52 also calculates the integral of the per-unit-time relative pitch A(t) over the section and divides it by the section time length to obtain the section average pitch A. In this way, the relative pitch calculation unit 52 calculates the relative pitch A(t) and the section average pitch A of the utterance voice in the speech sections and other-utterance sections. The relative pitch A(t) and the section average pitch A are also calculated for silent sections.

Using the input text information and section type information, the utterance speed calculation unit 53 obtains, for the speech sections and other-utterance sections containing text segmented by phrase or word, the number of phonemes (or syllables) C corresponding to the phrases and words converted into text. Specifically, the utterance speed calculation unit 53 obtains the number of phonemes (or syllables) C from the dictionary that the speech recognition unit 41 used when segmenting the utterance voice into text by phrase and word.

The utterance speed calculation unit 53 then uses the start time t1 and end time t2 of each section in the input section type information, together with the obtained number of phonemes (or syllables) C, to calculate the utterance speed VV by the following equation.
VV = C / (t2 - t1)
In this way, the utterance speed calculation unit 53 calculates the utterance speed in the speech sections and other-utterance sections.
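As a small worked example of the equation VV = C / (t2 - t1), the sketch below computes the utterance speed of one section. The function name and the guard against a non-positive section length are assumptions added for the example.

```python
def utterance_speed(count, t1, t2):
    """Utterance speed VV: phonemes (or syllables) per second.

    count  -- number of phonemes or syllables C in the section
    t1, t2 -- start and end time of the section, in seconds
    """
    if t2 <= t1:
        raise ValueError("section end time must be later than start time")
    return count / (t2 - t1)
```

A section containing 12 syllables that runs from 1.0 s to 4.0 s has an utterance speed of 4 syllables per second.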

Based on the input section type information, the meaningless section identification unit 54 identifies the other-utterance sections and silent sections as meaningless sections. The other-utterance sections and silent sections identified as meaningless sections are displayed on the display device 20 in a predetermined form whose size is proportional to the section time length in the section type information.

Based on the input text information and section type information, the untranscribable part identification unit 55 identifies the unrecognizable sections as untranscribable parts. For each untranscribable part identified in this way, the display device 20 shows an indication that it could not be converted into text.

The speaker emotional response value calculation unit 56 calculates the speaker's emotional response value Es(t) from the input pupil diameter and facial movement. Specifically, it normalizes the input pupil diameter and facial movement to obtain the relative pupil diameter p(t) and the relative facial movement f(t), and, with P and F as the weights for these time series and S as a constant, calculates the emotional response value Es(t) by the following equation.
Es(t) = P · p(t) + F · f(t) + S
In general, the emotional response value Es(t) calculated in this way reflects the significance of what the speaker is saying. When the speaker is uttering highly significant content, Es(t) tends to be large, because a person's pupil diameter and facial movement increase when the person is showing interest. Conversely, when the speaker is uttering content of low significance, Es(t) tends to be small.
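The weighted sum for Es(t) can be sketched as below. The specification says only that the pupil diameter and facial movement are normalized; the min-max normalization used here, the treatment of facial movement as a scalar magnitude, the function names, and the default weights are all assumptions for the example.

```python
def normalize(values):
    """Assumed min-max normalization to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def emotional_response(pupil, face, P=0.5, F=0.5, S=0.0):
    """Es(t) = P * p(t) + F * f(t) + S for synchronized time series.

    pupil -- pupil diameter samples
    face  -- facial movement magnitudes sampled at the same times
    """
    p = normalize(pupil)
    f = normalize(face)
    return [P * p_t + F * f_t + S for p_t, f_t in zip(p, f)]
```

Extending the sum with further terms such as a pulse or perspiration series would follow the same pattern, with one weight per normalized series.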

The speaker emotional response value calculation unit 56 may also input a pulse value and an amount of perspiration in addition to the pupil diameter and facial movement. In that case, it calculates the emotional response value Es(t) by the following equation.
Es(t) = P · p(t) + F · f(t) + B · b(t) + C · c(t) + S
Here, b(t) is the time series of the pulse value and B is its weight, and c(t) is the time series of the amount of perspiration and C is its weight. A person's pulse value and amount of perspiration increase when the person is showing interest, so Es(t) tends to be large when the speaker is uttering highly significant content. In this case, the voice information processing apparatus 1 inputs pulse data from a pulse sensor to obtain the pulse value, and inputs perspiration data from a perspiration measurement sensor to obtain the amount of perspiration.

For each speech section, other-utterance section, and silent section in the input section type information, the speaker emotional response value calculation unit 56 also calculates the integral of the per-unit-time emotional response value Es(t) and divides it by the section time length in the section type information, thereby obtaining the section average emotional response value Es as a representative value for each section.

In this way, the speaker emotional response value calculation unit 56 inputs physiological response data that changes with the speaker's physiological state, such as the pupil diameter and facial movement, and calculates the emotional response value Es(t) and the section average emotional response value Es. The physiological response data is not limited to these; any data that varies with the significance of the utterance may be used. For example, variation data of the gaze position may be used instead of the speaker's pupil diameter, in which case the gaze position is calculated from the eyeball video. Brain wave values may also be used, for example.

FIG. 6(2) is a graph showing the speaker's emotional response value Es(t). The vertical axis shows the emotional response value Es(t) and the horizontal axis shows time t. This graph corresponds in time to the relative volume V(t) of the utterance voice shown in FIG. 6(1). From FIGS. 6(1) and 6(2), the speech sections a that correspond to the regions where Es(t) is large (the bracketed regions) can be regarded as highly significant sections. The utterance feature analysis control unit 50 can therefore compare the speaker's emotional response value Es(t) with a preset threshold, identify the time regions in which Es(t) exceeds the threshold, and extract the text corresponding to those time regions from the text information. The text extracted in this way constitutes the highly significant key points.
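The threshold-based extraction can be sketched as follows. The data layout (Es sampled every `dt` seconds, text items carrying start and end times) and the rule that a text item qualifies if any sample inside its section exceeds the threshold are assumptions for this example; the specification only says that text corresponding to the above-threshold time regions is extracted.

```python
def extract_key_text(es, dt, items, threshold):
    """Return the text items whose sections contain Es(t) above the threshold.

    es        -- emotional response samples, one every dt seconds from time 0
    items     -- list of (text, t1, t2) from the text information
    threshold -- preset threshold for Es(t)
    """
    key_points = []
    for text, t1, t2 in items:
        first, last = round(t1 / dt), round(t2 / dt)
        # Assumed rule: one sample above the threshold is enough.
        if any(value > threshold for value in es[first:last]):
            key_points.append(text)
    return key_points
```

With a filler such as "um" spoken while Es(t) stays low and the phrases that follow spoken while it is high, only the latter are returned as key points.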

Returning to FIG. 5, the section importance calculation means 57 calculates the importance W (the degree of significance) of the text information for each utterance voice section, using the input section type information, the section average volume V calculated by the relative volume calculation unit 51, the section average pitch A calculated by the relative pitch calculation unit 52, the utterance speed VV calculated by the utterance speed calculation unit 53, and the speaker's section average emotional response value Es calculated by the speaker emotional response value calculation unit 56. Specifically, the section importance calculation means 57 calculates the importance W by the following formula.
W = fv(V) + fa(A) + fvv(VV) + fes(Es)
Here, fv(V), fa(A), fvv(VV), and fes(Es) are functions of the section average volume V, the section average pitch A, the utterance speed VV, and the speaker's section average emotional response value Es, respectively. Each can be weighted, or expressed as a higher-order function, according to the characteristics of its variable. As a simplified example, when the importance W is calculated as a linear combination of the utterance voice characteristic data, the section importance calculation means 57 may use the following expressions, with Wv, Wa, Wvv, and Wes as weighting coefficients.
fv(V) = Wv · V
fa(A) = Wa · A
fvv(VV) = Wvv · VV
fes(Es) = Wes · Es
Here, Wv, Wa, Wvv, and Wes may be negative values. Because the importance W calculated in this way indicates the importance of the phrases and words in the text information, it can be used to determine the important parts of the utterance sequence. Therefore, by using the importance W, highly significant key parts can be extracted objectively and easily from a person's utterance voice. The importance W is displayed as a graph on the display device 20. It is also used when the text information of the corresponding section is displayed in a predetermined form.
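The linear-combination case of the importance formula can be sketched as follows. This is only an illustrative sketch: the class and function names (`SectionFeatures`, `Weights`, `importance`) and the numeric values in the usage note are assumptions made for the example and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass
class SectionFeatures:
    """Per-section utterance characteristics (names are hypothetical)."""
    volume: float           # section average volume V
    pitch: float            # section average pitch A
    speed: float            # utterance speed VV
    speaker_emotion: float  # speaker's section average emotional response value Es


@dataclass
class Weights:
    """Weighting coefficients Wv, Wa, Wvv, Wes; negative values are allowed."""
    wv: float
    wa: float
    wvv: float
    wes: float


def importance(f: SectionFeatures, w: Weights) -> float:
    """W = Wv*V + Wa*A + Wvv*VV + Wes*Es (the simplified linear case)."""
    return (w.wv * f.volume
            + w.wa * f.pitch
            + w.wvv * f.speed
            + w.wes * f.speaker_emotion)
```

For example, `importance(SectionFeatures(86, 420, 2, 90), Weights(1.0, 0.25, 10.0, 0.5))` adds the four weighted terms 86, 105, 20, and 45. A higher-order variant would replace each product with a nonlinear function of the same variable.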

The frequent important word extraction means 58 compares the importance W calculated for each utterance voice section by the section importance calculation means 57 across the utterance voice sections, sorts the sections in descending order of importance W, and identifies a predetermined number of sections with the highest importance W among all the utterance voice sections. The frequent important word extraction means 58 then extracts words from the text information in the identified sections and sets those words as frequent important words. Alternatively, it counts the number of times each extracted word is uttered over all the utterance voice sections, and sets the word as a frequent important word when that count exceeds a predetermined number. The frequent important word extraction means 58 may instead normalize the importance W of each utterance voice section to a range with a maximum of 100 and a minimum of 0, set a threshold (for example, 70) in advance, and identify the utterance voice sections whose normalized importance exceeds the threshold. In this case as well, as described above, the frequent important word extraction means 58 either sets the words extracted from the text information in the identified sections as frequent important words, or sets an extracted word as a frequent important word only when its utterance count over all the utterance voice sections exceeds the predetermined number. The frequent important words extracted in this way are used to search databases and are displayed on the display device 20.
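The threshold-based variant of this extraction can be sketched roughly as below. The sketch assumes that each section's text has already been split into words (word segmentation is outside its scope), and all function names and parameters are hypothetical. Setting `min_count=0` corresponds to treating every word in a selected section as a frequent important word, while a larger `min_count` applies the utterance-count condition.

```python
from collections import Counter


def normalize_0_100(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all sections equally important
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def select_sections(importances, threshold=70.0):
    """Indices of sections whose normalized importance exceeds the threshold."""
    scaled = normalize_0_100(importances)
    return [i for i, s in enumerate(scaled) if s > threshold]


def frequent_important_words(section_words, importances,
                             threshold=70.0, min_count=1):
    """Words from the selected sections whose count over ALL sections exceeds min_count."""
    selected = select_sections(importances, threshold)
    candidates = {w for i in selected for w in section_words[i]}
    counts = Counter(w for words in section_words for w in words)
    return sorted(w for w in candidates if counts[w] > min_count)
```

The top-N variant described first would replace `select_sections` with a sort of the section indices by importance followed by a slice of the first N entries.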

Note that the utterance feature analysis control unit 50 may include, in addition to the relative volume calculation unit 51 and the other units shown in FIG. 5, a listener emotional response value calculation unit that calculates the emotional response value of a listener who listens to the utterance. In this case, the listener emotional response value calculation unit receives the listener's pupil diameter, facial movement, and the like, and calculates the listener's emotional response value Eo(t) by a formula similar to that used for the speaker's emotional response value Es(t). It also integrates Eo(t) over the section and divides the result by the section time length to obtain the section average emotional response value Eo. When there are multiple listeners, Eo(t) is calculated for each listener. In addition, when calculating the importance W, the section importance calculation means 57 may include the listener's section average emotional response value Eo and calculate W by the following formula.
W = fv(V) + fa(A) + fvv(VV) + fes(Es) + feo(Eo)
Here, fv(V), fa(A), fvv(VV), fes(Es), and feo(Eo) are functions of the section average volume V, the section average pitch A, the utterance speed VV, the speaker's section average emotional response value Es, and the listener's section average emotional response value Eo, respectively. As a simplified example, the section importance calculation means 57 may calculate the importance W as a linear combination of the utterance voice characteristic data using the following expressions, where Wv, Wa, Wvv, Wes, and Weo are weighting coefficients.
fv(V) = Wv · V
fa(A) = Wa · A
fvv(VV) = Wvv · VV
fes(Es) = Wes · Es
feo(Eo) = Weo · Eo
Here, Wv, Wa, Wvv, Wes, and Weo may be negative values. By including the listener's section average emotional response value Eo in this way, the section importance calculation means 57 can calculate a more reliable importance W. Therefore, by using the importance W, highly significant key parts can be extracted even more objectively and easily from a person's utterance voice.
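The section average described above (integrate Eo(t) over the section, then divide by the section time length) might be computed from sampled values as in the following sketch. It assumes that the samples span exactly one section and uses the trapezoidal rule for the integral; neither the sampling scheme nor the integration rule is stated in the specification, and the function name is hypothetical.

```python
def section_average(times, values):
    """Time average of a sampled signal: (integral over the section) / (section length).

    times  -- sample times in seconds, strictly increasing, covering one section
    values -- emotional response values Eo(t) at those times
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    duration = times[-1] - times[0]
    if duration <= 0:
        raise ValueError("section time length must be positive")
    # Trapezoidal rule: area of each strip between neighbouring samples.
    area = sum(0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
               for i in range(len(times) - 1))
    return area / duration
```

With several listeners, the same function would simply be applied to each listener's Eo(t) in turn.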

As described above, the relative volume V(t) of the utterance voice and the other values analyzed by the utterance feature analysis control unit 50 are stored as analysis results in the analysis result DB 68 of the storage unit 60 and are also output to the presentation information conversion unit 70.

(Analysis result DB)
FIG. 7 is a diagram showing a configuration example of the analysis result DB 68 in the storage unit 60. The analysis result DB 68 consists of the section number Sr, the start time t1, the end time t2, the section type, the text, the volume V, the number of phonemes (or number of syllables) C, the pitch A, the speaker's section average emotional response value Es, and the listener's section average emotional response value Eo. The section number Sr is a number assigned in time order to the sections of the three types distinguished by the voice recognition unit 41. The start time t1 and the end time t2 are written as mm:ss.pp, where mm is the minutes, ss is the seconds, and pp is the fractional part of the seconds. The section type is L for a speech section, V for an other-utterance section, and S for a silence section. The volume V and the pitch A are the section average volume V and the section average pitch A of each section. N/A indicates that there is no data.
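One possible in-memory representation of a row of the analysis result DB 68 is sketched below. The field names, the `parse_time` and `optional_number` helpers, and the choice to store times as seconds are all assumptions made for illustration. Only the column list, the mm:ss.pp time notation, the L/V/S type codes, and the N/A marker are taken from the description of FIG. 7.

```python
from dataclasses import dataclass
from typing import Optional


def parse_time(stamp: str) -> float:
    """Convert an 'mm:ss.pp' time stamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def optional_number(cell: str) -> Optional[float]:
    """Return None for 'N/A' cells, otherwise the numeric value."""
    return None if cell.strip() == "N/A" else float(cell)


@dataclass
class AnalysisRecord:
    sr: int                 # section number Sr, in time order
    t1: float               # start time in seconds
    t2: float               # end time in seconds
    kind: str               # "L" speech, "V" other utterance, "S" silence
    text: Optional[str]     # recognized text, if any
    volume: Optional[float] # section average volume V
    count: Optional[float]  # number of phonemes (or syllables) C
    pitch: Optional[float]  # section average pitch A
    es: Optional[float]     # speaker's section average emotional response value
    eo: Optional[float]     # listener's section average emotional response value

    @property
    def duration(self) -> float:
        """Section time length in seconds."""
        return self.t2 - self.t1
```

Keeping missing values as `None` rather than 0 avoids confusing a silence section with a section whose measured volume happens to be zero.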

Returning to FIG. 2, the presentation information conversion unit 70 either receives the analysis results from the utterance feature analysis control unit 50 or reads them from the analysis result DB 68 of the storage unit 60, and also reads data from each DB of the storage unit 60. The presentation information conversion unit 70 then selects the necessary data from the analysis results and the other data, generates presentation information, and transmits it to the display device 20 via the network 21. The data transmitted as presentation information is the data that the display device 20 needs for its screen display, and is set in advance.

The presentation information conversion unit 70 also searches an arbitrary database via the network 21, using as search words the frequent important words that the frequent important word extraction means 58 of the utterance feature analysis control unit 50 extracted from the utterance voice, which are one of the analysis results. The presentation information conversion unit 70 then transmits the information obtained as the database search results to the display device 20 as presentation information. The database search results are, for example, text, web pages, images, maps, news, videos, and e-mails directly related to the frequent important words, as well as link information (URI: Uniform Resource Identifier) indicating where such information is located. As a result, the display device 20 can display information directly related to the frequent important words of the highly significant key parts of the utterance voice.

(Display device)
Next, the display device 20 shown in FIG. 2 will be described. The display device 20 receives the presentation information from the voice information processing device 1 via the network 21, converts it into data for screen display, and displays it on the screen. As described above, the presentation information is the data preset in the presentation information conversion unit 70 of the voice information processing device 1 from among the analysis results, guidance information, utterance voice, eyeball video, normal face video, blindfolded face video, text information and the like, pupil diameter, and facial movement.

FIG. 8 is an example of a screen displayed on the display device 20. The display device 20 receives the presentation information, converts it into data for screen display, and displays the screen shown in FIG. 8. Specifically, the display device 20 receives, as presentation information, the text information, the section type information, the relative volume V(t) of the utterance voice, the speaker's emotional response value Es(t), and the listener's emotional response value Eo(t), and converts them into screen display data for the graph shown in the "time variation of voice and emotion" area at the upper left of FIG. 8. It then displays the text information, the section types, and the emotional response value Es(t) together with the relative volume V(t) of the utterance voice. When a predetermined key operation is input, the display device 20 displays the listener's emotional response value Eo(t) in addition to the speaker's emotional response value Es(t), in a way that distinguishes the two, for example by changing the color of the graph.

The display device 20 displays the slider bars shown in the "threshold and parameter setting slider bars" area at the upper right of FIG. 8. When the position of a slider bar is changed by a key operation, the display device 20 sets the thresholds (the threshold for the speaker's emotional response value Es(t) and the threshold for the listener's emotional response value Eo(t)) and the parameter (the voice playback speed) according to that position.

The display device 20 receives the utterance voice as presentation information and displays buttons below the "threshold and parameter setting slider bars" area. When a button is designated by a key operation, the display device 20 performs the processing corresponding to that button. For example, when the play button is pressed, the display device 20 plays back the utterance voice at the voice playback speed set with the slider bar and outputs it to a speaker (not shown in FIG. 2). When the pause button is pressed, playback is paused.

The display device 20 receives the blindfolded face video as presentation information and displays it in the "video" area at the lower left of FIG. 8. The blindfolded face video displayed on the display device 20 is video input directly from the camera 17 shown in FIG. 3, not video generated by processing the normal face video. The blindfolded face video therefore cannot be restored to the normal face video, so highly confidential video can be provided.

The display device 20 receives, as presentation information, the text information, the section type information, the speaker's relative volume V(t) and section average volume V, the speaker's relative pitch A(t) and section average pitch A, the number of phonemes (or number of syllables) C and the utterance speed VV for each section, the unintentional sections, the non-textable portions, the speaker's emotional response value Es(t) and section average emotional response value Es, and the listener's emotional response value Eo(t) and section average emotional response value Eo, and displays them on the screen in the form shown in the "converted text" area at the lower right of FIG. 8. Moving the cursor provided on the left side of the "converted text" area up and down by key operation designates a position in the text. When there is too much text to display at once, only the portion of the text that is being played back is displayed. The display of the "converted text" will be described later.

The display device 20 displays the check boxes shown in the "function options (check the functions you need)" area at the lower right of FIG. 8, and the check boxes are checked by key operation. When a check box is checked by a key operation, the display device 20 marks that check box as checked and reflects the corresponding function in the display form of the "converted text" area. That is, each of the functions corresponding to the check boxes, namely "utterance strength display", "utterance pitch display", "utterance speed display", "unintentional section display", "non-textable portion display", "speaker emotion threshold", and "listener emotion threshold", is reflected in the display form of the "converted text" area. Conversely, when a check is cleared by a key operation, the display device 20 clears the mark in that check box and cancels the function. When the "speaker emotion threshold" check box is not checked, the speaker's emotional response value is not displayed, so the "speaker emotion threshold" slider bar in the "threshold and parameter setting slider bars" area is displayed at its left end. The same applies to the "listener emotion threshold". Details will be described later.

As described above, when the play button is pressed by a key operation, the display device 20 plays back the utterance voice at the voice playback speed set with the slider bar and outputs it from the speaker. At this time, the display device 20 displays a cursor marking the utterance voice being played back in the "time variation of voice and emotion" area and in the "converted text" area. FIG. 8 shows the display while the utterance voice for "変わ" (the beginning of "変わらない", "no different") is being output. The display device 20 then displays the information for that section in a small window at the "変わ" position in the "converted text" area. Specifically, the small window shows the section time length Δt from the section type information: 2.2 sec, the type from the section type information: L (speech section), the section average volume V: 86, the number of phonemes (or number of syllables) C: 5, the section average pitch A: 420, and the speaker's section average emotional response value Es: 90.

FIG. 9 is a diagram illustrating a display example of the presentation information, and shows the "converted text" area of FIG. 8. In FIG. 9, "ん〜多分今と変わらないと思います。つまり・・・・" ("Hmm, I think it will probably be no different from now. In other words...") is the text information obtained from the utterance voice. The polyline drawn along the characters of the text information indicates the pitch.

Referring to FIG. 9, when displaying the text information, the display device 20 changes the size of the text characters of each section according to the section average volume V calculated for that section. Specifically, the text characters are displayed larger when the section average volume V is high and smaller when it is low, so that the character size is proportional to the section average volume V. Because the strength of the speaker's voice is thus reflected in the size of the text characters, the parts that the speaker spoke loudly or softly can be recognized easily, and highly significant key parts can be extracted objectively. In general, a part spoken in a loud or a soft voice can be regarded as a highly significant key part.

Along with the text information, the display device 20 displays the section average pitch A calculated for each section as a line graph aligned with the characters of the text information. Specifically, the graph point is drawn above the text characters when the section average pitch A is high and below them when it is low. Because the pitch of the speaker's voice is thus displayed as a line graph at the positions of the corresponding text characters, the parts that the speaker spoke in a high or a low voice can be recognized easily, and highly significant key parts can be extracted objectively. In general, a part spoken in a high or a low voice can be regarded as a highly significant key part.

When displaying the text information, the display device 20 changes the width of the text characters of each section according to the utterance speed VV calculated for that section. Specifically, the text characters are displayed wider when the utterance speed VV is high and narrower when it is low, so that the character width is proportional to the utterance speed VV. Because the speaker's utterance speed is thus reflected in the width of the text characters, the parts that the speaker spoke quickly or slowly can be recognized easily, and highly significant key parts can be extracted objectively. In general, a part spoken quickly or slowly can be regarded as a highly significant key part.

When displaying the text information, the display device 20 represents each identified unintentional section (other-utterance section or silence section) with a display corresponding to its section time length, in the form of blank characters or an onomatopoeic expression. In FIG. 9, silence sections are displayed as underlines and other-utterance sections are displayed as characters of the text information. Other-utterance sections may instead be displayed as underlines rather than as text characters. Because the unintentional sections, that is, the sections other than those in which the speaker actually said something meaningful, are thus displayed in a form different from the text characters, they can be recognized easily within the text, and it can be judged objectively that those sections are not highly significant parts.

When displaying the text information, the display device 20 displays any non-textable portion contained in the text information with specific characters. For example, a non-textable portion is displayed as +++. This makes it easy to recognize the places where the speaker spoke but the speech could not be converted into text.

When displaying the text information, the display device 20 shades the text characters of each section according to the speaker's section average emotional response value Es calculated for that section. Specifically, when the section average emotional response value Es is greater than the threshold (the speaker emotion threshold set with the slider bar in FIG. 8), the text characters are displayed dark so that they stand out. When Es is less than or equal to the threshold, the text characters are displayed pale so that they appear semi-transparent. Because the emotional response expressed by the speaker's pupil diameter and facial movement is thus reflected in the shading of the text characters, the parts where the speaker shows emotion can be recognized easily, and highly significant key parts can be extracted objectively. In general, a part where emotion is shown can be regarded as a highly significant key part.

When displaying the text information, the display device 20 colors the background of the text characters of each section according to the listener's section average emotional response value Eo calculated for that section. Specifically, when the section average emotional response value Eo is greater than the threshold (the listener emotion threshold set with the slider bar in FIG. 8), the background of the text characters is displayed in a dark shade of a predetermined color. When Eo is less than or equal to the threshold, the background is displayed in a pale shade of the predetermined color. Because the emotional response expressed by the pupil diameter and facial movement of the listener who is hearing the speaker's utterance is thus reflected in the background color of the text characters, the parts where the listener shows emotion can be recognized easily, and highly significant key parts can be extracted objectively. In general, a part where emotion is shown can be regarded as a highly significant key part. Note that in FIGS. 8 and 9 the "listener emotion threshold" check box in the "function options (check the functions you need)" area of FIG. 8 is not checked, so the coloring according to the listener's section average emotional response value Eo is not displayed.

For each check box in the "function options (check the functions you need)" area shown in FIG. 8, the display device 20 performs the display for the corresponding function when the check box is checked, and does not perform it when the check box is not checked.

As described above, with the "converted text" display of FIG. 9 within the example display screen of FIG. 8, the strength and pitch of the speaker's voice, the utterance speed, the presence or absence of unintentional sections, the non-textable portions, and the speaker emotional response value (and/or the listener emotional response value) are reflected in the form of the text characters and expressed in association with them. The characteristics of the text information of the utterance voice can thus be described in terms of data such as the strength of the voice and of physiological response data. Therefore, based on these characteristics, highly significant key parts can be extracted objectively and comprehensively from the utterance voice. An operator looking at the text characters can intuitively recognize the strength and pitch of the voice, the utterance speed, and so on from the form of the text characters.
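The mapping from section data to the form of the text characters can be sketched as follows. The proportional rules for character size and width and the threshold rules for shading and background follow the description. The concrete scale factors (0.25, 0.5), the opacity and alpha levels (1.0, 0.4, 0.8, 0.2), and the names `TextStyle` and `style_for_section` are purely illustrative assumptions, and passing `eo=None` stands in for the "listener emotion threshold" check box being unchecked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextStyle:
    font_size: float                   # proportional to section average volume V
    char_width: float                  # proportional to utterance speed VV
    opacity: float                     # dark (1.0) or pale, from speaker value Es
    background_alpha: Optional[float]  # None when listener coloring is switched off


def style_for_section(volume, speed, es, eo, es_threshold, eo_threshold,
                      size_per_volume=0.25, width_per_speed=0.5):
    """Derive a display style for one section (all constants are assumptions)."""
    # Dark when Es exceeds the speaker emotion threshold, pale otherwise.
    opacity = 1.0 if es > es_threshold else 0.4
    if eo is None:
        background = None  # listener coloring not displayed
    else:
        # Dark background above the listener emotion threshold, pale otherwise.
        background = 0.8 if eo > eo_threshold else 0.2
    return TextStyle(font_size=size_per_volume * volume,
                     char_width=width_per_speed * speed,
                     opacity=opacity,
                     background_alpha=background)
```

A renderer would apply one such style per section, so that neighbouring sections with different volume, speed, or emotional response values are visually distinct.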

Although the display device 20 has been described as receiving the presentation information from the voice information processing device 1 via the network 21, the voice information processing device 1 may instead have the functions of the display device 20 and display the data shown in FIG. 8 on a display unit (not shown).

As described above, according to the voice information processing device 1 of Example 1 of the embodiment of the present invention, the input unit 32 receives the utterance voice, the eyeball video, the normal face video, and the blindfolded face video and stores them in the storage unit 60. The utterance feature analysis control unit 50 receives the text information, section type information, and the like generated from the utterance voice by the voice recognition unit 41, the pupil diameter obtained by the pupil diameter analysis unit 42 by analyzing the eyeball video, and the facial movement estimated from the normal face video by the facial movement amount estimation unit 43. It then obtains the relative volume and relative pitch of the utterance voice, the utterance speed, the unintentional sections, the non-textable portions, the speaker emotional response value, and the listener emotional response value, and stores them in the storage unit 60 as the analysis results of the utterance features. The display device 20 receives predetermined presentation information from among the data stored in the storage unit 60, reflects the strength and pitch of the speaker's voice, the utterance speed, the presence or absence of unintentional sections, the non-textable portions, and the speaker emotional response value (and/or the listener emotional response value) in the form of the text characters as characteristics of the text information of the utterance voice, and displays them at the positions corresponding to the text characters. This makes it possible to extract highly significant key parts from the utterance voice objectively and comprehensively. In addition, because no manual work is required, highly significant key parts can be extracted from the utterance voice easily.

[Example 2]
Next, a second embodiment (Example 2) of the present invention will be described. FIG. 10 is a block diagram showing the functional configuration of the voice information processing device according to Example 2. This voice information processing device 2 represents the functional configuration when the control unit 100 shown in FIG. 1 executes processing according to the voice information processing program. The voice information processing device 2 includes a guidance information presentation unit 31, an input unit 33, a voice recognition unit 41, a pupil diameter analysis unit 42, a facial movement amount estimation unit 44, an utterance feature analysis control unit 50, a storage unit 80, and a presentation information conversion unit 70. Compared with the voice information processing device 1 of Example 1 shown in FIG. 2, the voice information processing device 2 of Example 2 differs in that its input unit 33, facial movement amount estimation unit 44, and storage unit 80 are different from the corresponding components of the voice information processing device 1. The rest of the configuration is the same. The system that implements this voice information processing consists of the voice information processing device 2, a speaker 11, a display 12, blindfold goggles (a headset) 19 equipped with a microphone 14, a camera 15, an irradiator 16, and an acceleration sensor 18, and the display device 20. The voice information processing device 2 and the display device 20 are connected by a network 21 such as the Internet. Compared with the system of Example 1 shown in FIG. 2, the system of Example 2 differs in that it includes the acceleration sensor 18 and does not include the camera 17 that captures the normal face video and the blindfolded face video.

The acceleration sensor 18 is mounted on the blindfold goggles 19, captures the facial motion of the speaker, and outputs acceleration sensor data for the x, y, and z directions to the input unit 33 of the voice information processing apparatus 2. The acceleration sensor 18 is a three-axis device that measures acceleration along the length, width, and height directions, and can therefore measure acceleration applied to an object in any direction in three-dimensional space.

The input unit 33 of the voice information processing apparatus 2 receives the speaker's utterance voice from the microphone 14 of the blindfold goggles 19, the eyeball image from the camera 15, and the acceleration sensor data from the acceleration sensor 18. It stores these inputs in the storage unit 80 as mutually synchronized information. As a result, an utterance voice DB 62, an eyeball image DB 63, and an acceleration sensor data DB are created in the storage unit 80.

The facial motion amount estimation unit 44 reads the acceleration sensor data from the acceleration sensor data DB in the storage unit 80, analyzes the data to generate the speaker's facial motion (a facial motion vector), and outputs the facial motion to the utterance feature analysis control unit 50. Specifically, the facial motion amount estimation unit 44 integrates the acceleration sensor data for the x, y, and z directions over time to obtain velocity components in the x, y, and z directions. It then projects the vector formed by these three velocity components onto a predetermined imaging plane and calculates a facial motion vector consisting of parameters c and d, which represent the translational velocity on that imaging plane (c corresponds to the horizontal component and d to the vertical component of the translational velocity on the imaging plane). Techniques for generating motion information such as moving speed, moving distance, and moving direction from x-, y-, and z-direction acceleration sensor data are already known. For details, see, for example, Japanese Patent Application Laid-Open No. 2006-320566.
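The integrate-then-project step described above can be sketched as follows. This is only an illustrative sketch, not the method of the cited publication: the function names, the fixed sampling interval `dt`, the trapezoidal integration rule, the zero initial velocity, and the choice of orthonormal basis vectors `horizontal` and `vertical` for the imaging plane are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def integrate_acceleration(samples: Sequence[Vec3], dt: float) -> List[Vec3]:
    """Integrate x, y, z acceleration samples over time (trapezoidal rule).

    Assumes a fixed sampling interval dt and zero velocity at the first sample.
    Returns one velocity vector per input sample.
    """
    if not samples:
        return []
    velocities: List[Vec3] = [(0.0, 0.0, 0.0)]
    for prev, cur in zip(samples, samples[1:]):
        last = velocities[-1]
        velocities.append(
            tuple(v + 0.5 * (a0 + a1) * dt for v, a0, a1 in zip(last, prev, cur))
        )
    return velocities


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def facial_motion_vectors(
    samples: Sequence[Vec3],
    dt: float,
    horizontal: Vec3 = (1.0, 0.0, 0.0),  # assumed horizontal axis of the imaging plane
    vertical: Vec3 = (0.0, 1.0, 0.0),    # assumed vertical axis of the imaging plane
) -> List[Tuple[float, float]]:
    """Project each velocity vector onto the imaging plane, giving (c, d) pairs."""
    return [
        (dot(v, horizontal), dot(v, vertical))
        for v in integrate_acceleration(samples, dt)
    ]
```

In practice, integrating raw accelerometer output accumulates drift, so a real implementation would likely need gravity compensation and filtering; those steps are outside what the text describes and are omitted here.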

The text information and the like from the speech recognition unit 41, the pupil diameter from the pupil diameter analysis unit 42, and the facial motion from the facial motion amount estimation unit 44 are output to the utterance feature analysis control unit 50 as information that is mutually synchronized and associated in time.

The display device 20 receives presentation information from the voice information processing apparatus 2 via the network 21, converts it into data for screen display, and displays it on the screen. Here, the presentation information consists of whichever of the following data have been preset in the presentation information conversion unit 70 of the voice information processing apparatus 2: analysis results, guidance information, utterance voice, eyeball images, text information and the like, pupil diameter, and facial motion.

As described above, the voice information processing apparatus 2 of Example 2 according to the embodiment of the present invention provides the same effects as the voice information processing apparatus 1 of Example 1: highly significant key points can be extracted from the utterance voice objectively and comprehensively. In addition, because no manual work is required, highly significant key points can be extracted from the utterance voice easily.

[Example 3]
Next, a third embodiment (Example 3) of the present invention will be described. FIG. 11 is a block diagram showing the functional configuration of the voice information processing apparatus according to Example 3. Example 3 takes as its example an interview in which two speakers A and B talk while a listener listens. Based on the utterance voice and physiological response data of speakers A and B and the physiological response data of the listener, it calculates the importance W of each utterance voice section and analyzes the state of the dialogue between speakers A and B. The voice information processing apparatus 3 is shown in terms of the functional configuration used when the control unit 100 shown in FIG. 1 executes processing under the voice information processing program. The voice information processing apparatus 3 includes a guidance information presentation unit 31, input units 32-1, 32-2, and 83, speech recognition units 41-1 and 41-2, pupil diameter analysis units 42-1 and 42-2, facial motion amount estimation units 43-1 and 43-2, a pupil diameter analysis / facial motion amount estimation unit 85, an utterance feature analysis control unit 86, a storage unit 84, and a presentation information conversion unit 70. Compared with the voice information processing apparatus 1 of Example 1 shown in FIG. 2, the voice information processing apparatus 3 differs in that, in addition to the guidance information presentation unit 31 and the storage unit 84, it includes two sets of input units 32-1 and 32-2, speech recognition units 41-1 and 41-2, pupil diameter analysis units 42-1 and 42-2, and facial motion amount estimation units 43-1 and 43-2 for speakers A and B, and further includes the input unit 83 and the pupil diameter analysis / facial motion amount estimation unit 85 for the listener.

The system that implements this voice information processing comprises the voice information processing apparatus 3; a loudspeaker 11; a display 12; blindfold goggles 13-1 for speaker A, provided with a microphone 14-1, a camera 15-1, and an irradiator 16-1; a camera 17-1 that captures the normal face and the blindfolded face of speaker A; blindfold goggles 13-2 for speaker B, provided with a microphone 14-2, a camera 15-2, and an irradiator 16-2; a camera 17-2 that captures the normal face and the blindfolded face of speaker B; a camera 81 that captures the listener's eyeball and outputs an eyeball image; and a camera 82 that captures the listener's normal face and outputs a normal face image. The voice information processing apparatus 3 and the display device 20 are connected via a network 21 such as the Internet. Compared with the system of Example 1 shown in FIG. 2, the system of Example 3 differs in that it includes two sets of blindfold goggles 13-1 and 13-2 and cameras 17-1 and 17-2 for speakers A and B, as well as the cameras 81 and 82 for the listener. In FIG. 11, parts common to FIG. 2 are given the same reference numerals as in FIG. 2, and their detailed description is omitted.

The storage unit 84 stores, via the input unit 32-1, the utterance voice, eyeball image, normal face image, and blindfolded face image of speaker A and, via the input unit 32-2, the utterance voice, eyeball image, normal face image, and blindfolded face image of speaker B. The storage unit 84 also stores, via the input unit 83, the eyeball image and normal face image of the listener.

The speech recognition units 41-1 and 41-2, the pupil diameter analysis units 42-1 and 42-2, and the facial motion amount estimation units 43-1 and 43-2 are the same as the speech recognition unit 41, the pupil diameter analysis unit 42, and the facial motion amount estimation unit 43 shown in FIG. 2. The text information and the like, the pupil diameters, and the facial motions of speakers A and B are stored in the storage unit 84.

The pupil diameter analysis / facial motion amount estimation unit 85 works in the same way as the pupil diameter analysis unit 42 and the facial motion amount estimation unit 43 shown in FIG. 2. It reads the listener's eyeball image from the storage unit 84 and calculates the pupil diameter, and it reads the listener's normal face image from the storage unit 84 and estimates the facial motion. The pupil diameter analysis / facial motion amount estimation unit 85 then outputs the listener's pupil diameter and facial motion to the utterance feature analysis control unit 86. The listener's pupil diameter and facial motion are also stored in the storage unit 84.

As shown in FIG. 5, the utterance feature analysis control unit 86 includes a relative volume calculation unit 51, a relative pitch calculation unit 52, an utterance speed calculation unit 53, a meaningless section identification unit 54, a non-transcribable portion identification unit 55, a speaker emotional response value calculation unit 56, section importance calculation means 57, and frequent important word extraction means 58. In addition to these, it includes the listener emotional response value calculation unit described earlier. The relative volume calculation unit 51, the relative pitch calculation unit 52, the utterance speed calculation unit 53, the meaningless section identification unit 54, the non-transcribable portion identification unit 55, and the speaker emotional response value calculation unit 56 calculate the relative volume V(t) and the other values for each of speakers A and B, and identify the meaningless sections and non-transcribable portions for each of them. The listener emotional response value calculation unit calculates the listener's emotional response value Eo(t) and section-average emotional response value Eo.

For each utterance voice section, the section importance calculation means 57 calculates the importance W of the text information using the section-average volume V, the section-average pitch A, the utterance speed VV, the section-average emotional response values Es of speakers A and B, and the section-average emotional response value Eo of the listener. Specifically, for the utterance voice section T1, in which speaker A speaks while speaker B and the listener listen, the section importance calculation means 57 calculates the importance W1(T1) of speaker A's utterance using speaker A's utterance voice characteristic data and section-average emotional response value Es1(T1), speaker B's section-average emotional response value Es2(T1), the listener's section-average emotional response value Eo(T1), and the like. For the utterance voice section T2 immediately after T1, in which speaker B speaks while speaker A and the listener listen, it calculates the importance W2(T2) of speaker B's utterance using speaker B's utterance voice characteristic data and section-average emotional response value Es2(T2), speaker A's section-average emotional response value Es1(T2), the listener's section-average emotional response value Eo(T2), and the like. In the same way, it calculates the importance W1(T3) of speaker A's utterance for the utterance voice section T3 that immediately follows, in which speaker A speaks, and the importance W2(T4) of speaker B's utterance for the utterance voice section T4 immediately after that, in which speaker B speaks.

The section importance calculation means 57 then outputs the importance W1 of speaker A's utterances and the importance W2 of speaker B's utterances to the presentation information conversion unit 70 as analysis results, and also stores them in the storage unit 84.

In addition to generating the presentation information described above, the presentation information conversion unit 70 receives, as analysis results, the importance W1 of speaker A's utterances and the importance W2 of speaker B's utterances. The presentation information conversion unit 70 determines whether the importance W1 of speaker A's utterances is increasing from one utterance to the next. If it is, that is, if the following expression is satisfied, the unit determines that the importance W1 of speaker A's utterances is on an increasing trend.
W1(T1) ≤ W1(T3)
Likewise, the presentation information conversion unit 70 determines whether the importance W2 of speaker B's utterances is increasing from one utterance to the next. If it is, that is, if the following expression is satisfied, the unit determines that the importance W2 of speaker B's utterances is on an increasing trend.
W2(T2) ≤ W2(T4)
When the presentation information conversion unit 70 determines that both conditions are satisfied, that is, that the importance W1 of speaker A's utterances and the importance W2 of speaker B's utterances are both increasing, it determines that speakers A and B are both uttering highly useful key points in the same phase of the dialogue; in other words, that this sequence of the dialogue has become an important phase that is producing a synergistic effect.
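The two trend conditions and the combined determination can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the importance values are assumed to be supplied as per-speaker lists in turn order (W1(T1), W1(T3), ... and W2(T2), W2(T4), ...), and requiring at least two turns per speaker is an assumption added here so that a trend is defined.

```python
from typing import Sequence


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if each value is <= the next, matching W(Tn) <= W(Tn+2) in the text."""
    return all(a <= b for a, b in zip(values, values[1:]))


def is_synergy_phase(w1: Sequence[float], w2: Sequence[float]) -> bool:
    """True if both speakers' utterance importance is on an increasing trend.

    w1: importance of speaker A's utterances in turn order.
    w2: importance of speaker B's utterances in turn order.
    Assumption: a trend needs at least two turns per speaker.
    """
    if len(w1) < 2 or len(w2) < 2:
        return False
    return is_non_decreasing(w1) and is_non_decreasing(w2)
```

For example, `is_synergy_phase([0.4, 0.6], [0.5, 0.5])` holds because the source conditions use a non-strict inequality, so an unchanged importance still counts as satisfying the condition.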

The presentation information conversion unit 70 then transmits these determination results to the display device 20 via the network 21. The determination results are displayed as a graph, together with identifiers that identify speakers A and B, in the same way as the utterance voice characteristic data shown on the text of the dialogue sequence. They are also displayed on the corresponding text in a predetermined form according to the importance.

As described above, the voice information processing apparatus 3 of Example 3 according to the embodiment of the present invention provides the same effects as the voice information processing apparatus 1 of Example 1: highly significant key points can be extracted from the utterance voice objectively and comprehensively. In addition, because no manual work is required, highly significant key points can be extracted from the utterance voice easily. Furthermore, because the importance values W1 and W2 of a dialogue among multiple speakers are quantified along the time axis on the display device 20, highly significant key points can be extracted objectively and comprehensively from the utterance voices of multiple speakers.

Although Example 3 has been described for the case of two speakers, the present invention does not limit the number of speakers. Example 3 may also estimate facial motion from acceleration sensor data, as in Example 2 shown in FIG. 10. When there are three or more speakers, the section importance calculation means 57 of the utterance feature analysis control unit 86 calculates the importance W1 of speaker A's utterances, the importance W2 of speaker B's utterances, the importance W3 of speaker C's utterances, and so on. Using these importance values W1, W2, W3, and so on for each utterance voice section, the presentation information conversion unit 70 applies the conditional expressions described above to identify a phase of utterance voice sections in which importance is increasing, identifies the group of speakers who are speaking in that phase, and counts them. If the count is greater than a predetermined number, the presentation information conversion unit 70 determines that the utterance content (the discussion) in that phase is an important part, that is, a highly significant key point. If the count is equal to or less than the predetermined number, it determines that the discussion in that phase is not an important part, that is, a part that is not highly useful. This makes it possible to judge whether the utterance content is a highly significant key point, that is, how much weight the discussion carries.
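The multi-speaker determination can be sketched along the following lines. This is a hedged illustration, not the patented implementation: the function names, the dictionary layout (speaker identifier mapped to that speaker's importance values in turn order within the phase), and the requirement of at least two turns per speaker are assumptions made here. The comparison against the threshold is strict, following the text ("greater than a predetermined number").

```python
from typing import Dict, List, Sequence


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if importance never drops from one turn to the speaker's next turn."""
    return all(a <= b for a, b in zip(values, values[1:]))


def rising_speakers(importance_by_speaker: Dict[str, Sequence[float]]) -> List[str]:
    """Speakers whose utterance importance is on an increasing trend.

    Assumption: a speaker needs at least two turns for a trend to exist.
    """
    return sorted(
        speaker
        for speaker, values in importance_by_speaker.items()
        if len(values) >= 2 and is_non_decreasing(values)
    )


def is_key_discussion(
    importance_by_speaker: Dict[str, Sequence[float]], threshold: int
) -> bool:
    """True if more than `threshold` speakers show increasing importance."""
    return len(rising_speakers(importance_by_speaker)) > threshold
```

With hypothetical values such as `{"A": [0.2, 0.5], "B": [0.3, 0.3], "C": [0.9, 0.1]}`, speakers A and B count as rising, so the phase is judged important for a threshold of 1 but not for a threshold of 2.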

1, 2, 3 Voice information processing apparatus
11 Loudspeaker
12 Display
13, 19 Blindfold goggles
14 Microphone
15, 17, 81, 82 Camera
16 Irradiator
18 Acceleration sensor
20 Display device
21 Network
31 Guidance information presentation unit
32, 33, 83 Input unit
41 Speech recognition unit
42 Pupil diameter analysis unit
43, 44 Facial motion amount estimation unit
50, 86 Utterance feature analysis control unit
51 Relative volume calculation unit
52 Relative pitch calculation unit
53 Utterance speed calculation unit
54 Meaningless section identification unit
55 Non-transcribable portion identification unit
56 Speaker emotional response value calculation unit
57 Section importance calculation means
58 Frequent important word extraction means
60, 80, 84 Storage unit
61 Guidance information DB
62 Utterance voice DB
63 Eyeball image DB
64 Normal face image DB
65 Blindfolded face image DB
66 Text information etc. DB
67 Pupil diameter / facial motion DB
68 Analysis result DB
70 Presentation information conversion unit
85 Pupil diameter analysis / facial motion amount estimation unit
100 Control unit
101 CPU
102 Storage unit
103 Storage device
104 Communication unit
105 Input interface unit
106 Operation / input unit
107 Display output interface unit
108 Audio output interface unit
109 System bus
171 Lens
172 Spectral prism
173 Near-infrared cut filter
174, 176 Image sensor
175 Near-infrared transmission filter

Claims

Translated from Japanese

A voice information processing apparatus that transcribes the utterance voice of a speaker and converts it into text information, comprising:
a speech recognition unit that converts the utterance voice into text information using a dictionary defining vocabulary, the pronunciation of the vocabulary, and, for the vocabulary, section types for setting sections of the text information; sets, as a silence section, a section in which the signal level of the utterance voice is below a predetermined value; sets, based on the section types defined in the dictionary for the vocabulary and on the vocabulary contained in the text information, a speech section, which is the part of the time during which the speaker was vocalizing in which the speaker actually made a meaningful utterance; and sets the remainder of the time during which the speaker was vocalizing, excluding the speech section, as an other-utterance section;
an utterance voice characteristic data calculation unit that calculates utterance voice characteristic data for each of the sections based on the utterance voice;
a speaker emotional response value calculation unit that receives physiological response data, which varies with the physiological state of the speaker, and calculates, for each of the sections, a speaker emotional response value indicating the degree of the speaker's emotion based on the physiological response data; and
a display unit that, when displaying the text information of the speech section and the other-utterance section as text characters, displays the text characters for each section distinguished by the speech recognition unit in a form corresponding to the value of the utterance voice characteristic data calculated by the utterance voice characteristic data calculation unit and to the speaker emotional response value calculated by the speaker emotional response value calculation unit, and displays the silence section in a preset form.

The voice information processing apparatus according to claim 1, wherein
the utterance voice characteristic data calculation unit calculates, for each of the sections, the volume, pitch, and speed of the utterance voice based on the utterance voice,
the speaker emotional response value calculation unit calculates, for each of the sections, the speaker emotional response value based on at least one of data associated with the speaker's eye movement, facial motion, pulse rate, and amount of perspiration, and
the display unit displays the text characters in respective forms corresponding to the values of the volume, pitch, and speed of the utterance voice calculated by the utterance voice characteristic data calculation unit and to the speaker emotional response value calculated by the speaker emotional response value calculation unit, displays the silence section as blank space, and displays, in a preset form, any speech section or other-utterance section that the speech recognition unit could not transcribe.

The voice information processing apparatus according to claim 1 or 2, further comprising:
a section importance calculation unit that calculates the importance of the text information for each of the sections based on the utterance voice characteristic data and the speaker emotional response value;
a frequent important word extraction unit that identifies sections of high importance based on the importance of the text information and a predetermined value, and extracts words from the text information of the identified sections; and
a search unit that searches a database using the extracted words as search terms,
wherein the display unit further displays the search results of the database.

The voice information processing apparatus according to any one of claims 1 to 3, further comprising
a listener emotional response value calculation unit that calculates, for each of the sections, a listener emotional response value based on at least one of data associated with the eye movement, facial motion, pulse rate, and amount of perspiration of a listener who listens to the speaker's utterance,
wherein the display unit further displays the text characters in a form corresponding to the listener emotional response value calculated by the listener emotional response value calculation unit.

The voice information processing apparatus according to claim 4, comprising,
for each of a plurality of speakers, a speech recognition unit, an utterance voice characteristic data calculation unit, a speaker emotional response value calculation unit, and a display unit that each perform the said processing,
and further comprising a section importance calculation unit that, for a section of utterance by one of the plurality of speakers, calculates the importance of the utterance by that one speaker based on the utterance voice characteristic data and speaker emotional response value of that one speaker, the speaker emotional response values of the other speakers, and the listener emotional response value,
wherein the display unit further displays, in a form corresponding to the importance, the text characters of the one speaker for whom the importance was calculated.

A system including the voice information processing apparatus according to any one of claims 1 to 5, the system comprising:
an irradiator that irradiates the area around the speaker's eyes with near-infrared light; and
a camera that has a filter transmitting the near-infrared light and an image sensor receiving the light emitted from the filter, and that outputs an image of the speaker obtained through the filter and the image sensor as a blindfolded face image,
wherein the display unit of the voice information processing apparatus displays the blindfolded face image of the speaker output by the camera.