JP3110215B2

Info

Publication number: JP3110215B2
Application number: JP05198594A
Authority: JP
Inventors: 敏彦宮崎; 輝平山; 知明神; 雅代浅野
Original assignee: Oki Electric Industry Co Ltd; Osaka Gas Co Ltd
Current assignee: Oki Electric Industry Co Ltd; Osaka Gas Co Ltd
Priority date: 1993-08-10
Filing date: 1993-08-10
Publication date: 2000-11-20
Anticipated expiration: 2015-11-20
Also published as: JPH0756494A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a pronunciation training device for practicing the pronunciation of a native language or a foreign language.

[0002]

[Prior Art] An effective way to train pronunciation of a native or foreign language is for the trainee to listen to speech uttered by a model speaker and then imitate it. However, training is not fully effective if the trainee cannot correctly judge how closely his or her own speech approximates the model voice. A pronunciation training device has therefore already been proposed (Japanese Unexamined Patent Publication No. 61-255379) that uses a speech recognition device to convert the model voice and the trainee voice into data, compares and evaluates them, and presents the evaluation of the trainee's pronunciation ("good" or "bad").

[0003] However, when the evaluation is poor, for example, the trainee has only the uttered voice to go on, and so often cannot tell how to speak in order to obtain a good evaluation.

[0004] Accordingly, pronunciation training devices have already been proposed that provide, as model information for utterance, not only the voice but also a moving image capturing the movement of the vocal organs during utterance, and let the trainee learn by playing back that model moving image. Devices have likewise been proposed that capture the movement of the trainee's own vocal organs during utterance and play back and present the trainee's utterance moving image together with the model moving image (for example, simultaneously).

[0005] With the latter type of device, the trainee can compare not only the voice but also the movement of the vocal organs against the model, which makes it easier to learn how to pronounce well.

[0006]

[Problems to Be Solved by the Invention] However, a pronunciation training device that plays back and presents a moving image of the trainee's vocal organs together with a moving image of the model speaker also has the following problem.

[0007] Although the trainee imitates the model voice, the trainee's speaking rate and duration differ from those of the model voice. Consequently, the image content at any given instant also differs between the model moving image and the moving image capturing the trainee's vocal organs. This difference grows as time advances, particularly for a relatively long sentence of six or seven words or more. Even if the two moving images are played back in parallel with their start times aligned, the points at which the same word (phrase) is being pronounced drift apart on the time axis toward the latter half. As a result, the trainee cannot directly compare the movement of the model speaker's vocal organs with his or her own while the same word (phrase) is being pronounced. In other words, with the conventional pronunciation training device there is a risk that the training benefit of presenting moving images of the vocal organs is not fully realized.

[0008] The present invention has been made in view of the above, and aims to provide a pronunciation training device that can make training by presenting images of the vocal organs more effective than before.

[0009]

[Means for Solving the Problems] To solve this problem, the present invention provides a pronunciation training device having a function of simultaneously displaying, on display means, a model utterance image showing the movement of the vocal organs while the model speaker speaks and a trainee utterance image showing the movement of the vocal organs while the trainee speaks, characterized by comprising image synchronous reproduction means that, based on time information of the model speaker's utterance and time information of the trainee's utterance, performs interpolation or thinning on at least one of the model utterance image and the trainee utterance image so that the two images are displayed in synchronization.

[0010]

[Operation] Even when the model utterance image and the trainee utterance image are displayed simultaneously on the display means, the image content shown at a given instant differs if the duration and rate of the model speaker's utterance differ from those of the trainee's. Taking this into account, the present invention provides image synchronous reproduction means that, based on time information of the model speaker's utterance and of the trainee's utterance, performs interpolation or thinning on at least one of the two images so that the model utterance image and the trainee utterance image are displayed "in synchronization".

[0011]

[Embodiment] An embodiment of the pronunciation training device according to the present invention is described in detail below with reference to the drawings. In this device, all components may be implemented in hardware, or some components, such as those performing signal processing, may be implemented in software on an information processing device (a personal computer or workstation). Shown functionally, with the focus on the function of displaying moving images of the vocal organs, the device can be represented by the block diagram of FIG. 1.

[0012] In FIG. 1, the pronunciation training device of this embodiment consists of an audio processing section, an image processing section, and a section common to both.

[0013] The audio processing section can be divided into a part for the model voice and a part for the trainee voice. The former comprises model voice storage means 1, model voice generating means 2, model voice recognition means 3, and a speaker 4; the latter comprises a microphone 5 and trainee voice recognition means 6. The image processing section comprises model utterance image storage means 7 for the model utterance image, a video camera 8 and trainee utterance image storage means 9 for the trainee utterance image, and image synchronous reproduction means 10. The section common to the audio and image sections comprises key input means 11 and display means 12.

[0014] The model voice storage means 1 stores information on speech uttered by the model speaker; for example, each sentence is stored as a basic playback unit with an identification code attached. The model voice information may be stored simply by analog-to-digital conversion of the audio signal, or after compression coding or the like. For example, while the model voice playback mode is selected through the key input means 11, the model voice storage means 1 outputs the model voice information designated through the key input means 11 to the model voice generating means 2. In this embodiment, the stored voice information also includes display voice information to be shown while the model voice is output, for example phonetic symbol information and character information.

[0015] Based on the model voice information supplied from the model voice storage means 1, the model voice generating means 2 forms and outputs the signals supplied to the model voice recognition means 3, the speaker 4, and the display means 12. The signal supplied to the model voice recognition means 3 is an electrical (audio) signal equivalent to what a microphone would pick up if the model speaker were speaking. The signal supplied to the speaker 4 is naturally an electrical (audio) signal capable of driving the speaker 4, and may be the same as the signal supplied to the model voice recognition means 3. The signal supplied to the display means 12 is the display voice information, such as phonetic symbol information and character information, converted into a displayable format (for example, a television signal).

[0016] The model voice recognition means 3 performs speech recognition on the model voice signal supplied from the model voice generating means 2. As shown in FIG. 3, described later, it obtains the recognized character strings (phrases, words, and the like) contained in the model voice signal, together with the pair of start and end times of each recognized character string measured from the start of the model voice signal, and supplies them to the image synchronous reproduction means 10.

[0017] The speaker 4 is driven by the model voice signal supplied from the model voice generating means 2 and outputs the model voice as sound.

[0018] The microphone 5 picks up the voice uttered by the trainee, converts it into an electrical (audio) signal, and supplies it to the trainee voice recognition means 6 and the trainee utterance image storage means 9.

[0019] The trainee voice recognition means 6 performs speech recognition on the trainee voice signal supplied from the microphone 5. As shown in FIG. 4, described later, it obtains the recognized character strings (phrases, words) contained in the trainee voice signal, together with the pair of start and end times of each recognized character string measured from the start of the trainee voice signal, and supplies them to the image synchronous reproduction means 10.

[0020] If the pronunciation training device has no mode in which the model voice recognition means 3 and the trainee voice recognition means 6 operate at the same time, a single speech recognition means can be switched to serve as either the model voice recognition means 3 or the trainee voice recognition means 6.

[0021] The model utterance image storage means 7 stores a moving image signal (for example, a television signal) capturing the movement of the model speaker's vocal organs during utterance. This moving image signal (hereinafter, the model utterance image) also contains, for example, phonetic symbol information and character information. Each model utterance image stored in the model utterance image storage means 7 corresponds to the voice information with the same identification code stored in the model voice storage means 1, and the playback time of the model utterance image is made equal to the sound output time of that voice information. When the image synchronous reproduction means 10 instructs playback of a given model utterance image, the model utterance image storage means 7 plays back the stored image in accordance with the timing control signal supplied from the image synchronous reproduction means 10 and supplies it to the display means 12.

[0022] The video camera 8 supplies a moving image signal capturing the movement of the trainee's vocal organs during utterance (for example, a television signal; hereinafter, the trainee utterance image) to the trainee utterance image storage means 9. For example, the microphone 5 and the video camera 8 are placed close together so that the video camera 8 can capture, from the front, the movement of the vocal organs of the trainee speaking toward the microphone 5.

[0023] The trainee utterance image storage means 9 includes, for example, a voiced/silent detection circuit that identifies the utterance period in the audio signal from the microphone 5 (silent intervals within the utterance are short and are treated as part of the utterance period), and stores the trainee utterance image arriving from the video camera 8 during that period. When the image synchronous reproduction means 10 instructs playback of the trainee utterance image, the trainee utterance image storage means 9 plays back the stored image in accordance with the timing control signal supplied from the image synchronous reproduction means 10 and supplies it to the display means 12.
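The patent does not describe how the voiced/silent detection works internally. As a minimal sketch only, assuming a simple per-frame energy threshold and a tolerance for short silent runs (the function name `utterance_period` and the parameters `threshold` and `max_gap` are hypothetical, not from the source), the utterance period could be found like this:

```python
def utterance_period(energies, threshold, max_gap):
    """Return (first, last) indices of the utterance period, or None.

    energies  : per-frame audio energy values (hypothetical input format)
    threshold : energy at or above which a frame counts as voiced (assumption)
    max_gap   : longest silent run, in frames, still treated as part of
                the utterance (assumption)
    """
    start = None
    last_voiced = None
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i          # first voiced frame opens the period
            last_voiced = i
        elif start is not None and i - last_voiced > max_gap:
            break                  # silence too long: the utterance has ended
    if start is None:
        return None                # no voiced frame at all
    return (start, last_voiced)


# Two short silent runs are bridged; the long one ends the period.
period = utterance_period([0, 0, 5, 6, 0, 7, 8, 0, 0, 0, 0, 9],
                          threshold=3, max_gap=2)
```

Video frames whose timestamps fall inside the returned period would then be the ones kept by the storage means.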

[0024] The model utterance image storage means 7 and the trainee utterance image storage means 9 may store the image signals after compression coding or the like.

[0025] The image synchronous reproduction means 10, for example, uses the recognized character strings (phrases, words, and the like) and the start/end time pairs of each recognized character string supplied from the model voice recognition means 3 and the trainee voice recognition means 6 to determine, for each character string, the number of image frames of the model utterance image and of the trainee utterance image at the playback speed designated through the key input means 11. Based on the frame counts thus obtained, it then decides how image frames are to be output (frame interpolation or thinning) so that playback of the two images is synchronized for each character string, and controls the model utterance image storage means 7 and the trainee utterance image storage means 9 so that, for each character string, they output the same number of image frames to the display means 12 at the same timing.

[0026] The image synchronous reproduction means 10 of this embodiment performs synchronous playback control using the trainee's utterance timing as the reference.

[0027] The key input means 11 is used to instruct playback of the model voice, to designate which model voice to play back, to designate the playback speed of the model utterance image and the trainee utterance image, and so on.

[0028] When the model voice generating means 2 supplies a signal in a format capable of displaying the display voice information, such as phonetic symbol information and character information (for example, a television signal), the display means 12 displays it. When the model utterance image and the trainee utterance image are supplied simultaneously from the model utterance image storage means 7 and the trainee utterance image storage means 9, the display means 12 displays both at once: for example, the model utterance image in the upper half of the screen and the trainee utterance image in the lower half, or the model utterance image in the left half and the trainee utterance image in the right half. The display means 12 may perform all of the processing needed to combine the images. Alternatively, when the images are stored in, or played back from, the storage means 7 and 9, each image may be made valid over only half of the screen (with the other half set to, for example, the pedestal level), and the display means 12 may then combine them.
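The two screen layouts mentioned above can be illustrated with a small sketch. It assumes each image is already half-screen sized and represented as a list of pixel rows; the function name `compose` and the layout labels are hypothetical and chosen only for this illustration.

```python
def compose(model_img, trainee_img, layout="top_bottom"):
    """Combine two half-screen images into one screen image.

    Each image is a list of rows, each row a list of pixel values.
    layout="top_bottom": model image above, trainee image below.
    layout="left_right": model image on the left, trainee image on the right.
    """
    if layout == "top_bottom":
        return [row[:] for row in model_img] + [row[:] for row in trainee_img]
    if layout == "left_right":
        if len(model_img) != len(trainee_img):
            raise ValueError("images must have the same number of rows")
        return [m + t for m, t in zip(model_img, trainee_img)]
    raise ValueError(f"unknown layout: {layout}")


model = [[1, 1], [1, 1]]      # placeholder pixels for the model image
trainee = [[2, 2], [2, 2]]    # placeholder pixels for the trainee image
stacked = compose(model, trainee, "top_bottom")
side_by_side = compose(model, trainee, "left_right")
```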

[0029] Although unrelated to the features of the present invention, the display means 12 also displays guidance messages and the like for the trainee as appropriate. This is not mentioned further below.

[0030] FIG. 2 shows the processing flow (including the trainee's actions) of the pronunciation training device of the embodiment configured as above. The processing is described below mainly with reference to FIG. 2, and also with reference to the explanatory diagrams of FIGS. 3 to 7.

[0031] When the trainee uses the key input means 11 to instruct output of a given model voice (step 100), the designated model voice information is read from the model voice storage means 1 and supplied to the model voice generating means 2 (step 101). The model voice generating means 2 converts the information into the respective signals and supplies them to the model voice recognition means 3, the speaker 4, and the display means 12 (step 102).

[0032] As a result, the speaker 4 outputs the model voice (step 103), the display means 12 displays the phonetic symbols or character string (the display voice information) (step 104), and the model voice recognition means 3 performs recognition and obtains a result such as that shown in FIG. 3 (step 105).

[0033] FIG. 3 shows the case where the model voice designated by the trainee is "I have a pen." The character strings (phrases) recognized from the model voice signal are "I", "have", "a", and "pen"; "I" is pronounced from 0 ms to 98 ms, "have" from 107 ms to 462 ms, "a" from 471 ms to 553 ms, and "pen" from 559 ms to 820 ms. Because speech recognition operates on audio data sampled at a fixed sampling period and determines which character string each run of audio data corresponds to, the start and end times of each character string are easily obtained.
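One possible way to hold such a recognition result is sketched below. The `Segment` class and its field names are hypothetical; only the strings and millisecond values are taken from the FIG. 3 description above. The sketch also derives the duration of each string and the silent gaps between consecutive strings, which the later frame-count calculation relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One recognized character string with its timing (hypothetical type)."""
    text: str
    start_ms: int   # measured from the start of the voice signal
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Values quoted in the text for FIG. 3 (model voice "I have a pen.")
MODEL_SEGMENTS = [
    Segment("I", 0, 98),
    Segment("have", 107, 462),
    Segment("a", 471, 553),
    Segment("pen", 559, 820),
]


def silent_gaps(segments):
    """Silent interval (ms) between each pair of consecutive strings."""
    return [b.start_ms - a.end_ms for a, b in zip(segments, segments[1:])]


durations = [s.duration_ms for s in MODEL_SEGMENTS]   # [98, 355, 82, 261]
gaps = silent_gaps(MODEL_SEGMENTS)                    # [9, 9, 6]
```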

[0034] After listening to the model voice from the speaker 4, and after checking the phonetic symbols or character information on the display means 12 if necessary, the trainee imitates the model voice and speaks toward the microphone 5 (step 106). The microphone 5 picks up the trainee's voice, converts it into an electrical (audio) signal, and supplies it to the trainee voice recognition means 6 and the trainee utterance image storage means 9 (step 107). Meanwhile, the video camera 8 captures the trainee utterance image and supplies it to the trainee utterance image storage means 9 (step 108).

[0035] As a result, the trainee voice recognition means 6 outputs a recognition result such as that shown in FIG. 4 (step 109), and the trainee utterance image for the utterance period is stored in the trainee utterance image storage means 9 (step 110).

[0036] FIG. 4 shows the recognition result when the trainee speaks more slowly than the model speaker: "I" is pronounced from 0 ms to 112 ms, "have" from 128 ms to 502 ms, "a" from 516 ms to 609 ms, and "pen" from 619 ms to 985 ms.
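The drift described in paragraph [0007] can be checked against these numbers. The sketch below, with hypothetical names, compares the start time of each string in the FIG. 3 (model) and FIG. 4 (trainee) values quoted in the text and shows how far apart the same word would be if both videos were simply started together.

```python
# (text, start_ms, end_ms) as quoted for FIG. 3 (model) and FIG. 4 (trainee)
MODEL = [("I", 0, 98), ("have", 107, 462), ("a", 471, 553), ("pen", 559, 820)]
TRAINEE = [("I", 0, 112), ("have", 128, 502), ("a", 516, 609), ("pen", 619, 985)]


def start_offsets(model, trainee):
    """How much later (ms) the trainee starts each string than the model."""
    return [(m[0], t[1] - m[1]) for m, t in zip(model, trainee)]


offsets = start_offsets(MODEL, TRAINEE)
# [("I", 0), ("have", 21), ("a", 45), ("pen", 60)]
end_offset = TRAINEE[-1][2] - MODEL[-1][2]   # 165 ms at the end of "pen"
```

Even in this four-word sentence the offset grows from 0 ms to 60 ms at the start of "pen" and 165 ms at its end, which is about five frames at 1/30 s per frame.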

[0037] When the trainee then uses the key input means 11 to request simultaneous display of the model speaker's and his or her own utterance images at a specified playback speed (step 111), the image synchronous reproduction means 10 computes, from the recognition results of the model voice recognition means 3 and the trainee voice recognition means 6, the timing control signals (including signals designating the image frames to be played back) to be supplied to the storage means 7 and 9 so that the model utterance image and the trainee utterance image are played back at the designated playback speed and at the same rate as each other. It supplies these timing control signals to the storage means 7 and 9 (step 112), whereupon the display means 12 displays the model utterance image and the trainee utterance image simultaneously and in synchronization (step 113).

[0038] A specific example of the above processing performed by the image synchronous reproduction means 10 (step 112) is now described in detail.

[0039] The image synchronous reproduction means 10 first calculates, from the recognition result of the model voice recognition means 3 shown in FIG. 3, the number of image frames of the model utterance image required for each character string period and each silent period between character strings at the designated playback speed. It likewise calculates, from the recognition result of the trainee voice recognition means 6 shown in FIG. 4, the number of image frames of the trainee utterance image required for each character string period and each inter-string silent period at the designated playback speed. Because the simultaneous display is used to check whether the vocal organs are moving properly, 1x playback is rarely requested; slow playback stretched by a factor of about 10 is requested more often. The following description assumes that 10x slow playback has been designated.

[0040] FIGS. 5 and 6 show the number of image frames calculated in this way for the model utterance image and for the trainee utterance image, respectively. As shown in FIG. 3, for example, the character string "I" takes 98 ms to utter, so in 10x slow playback the model utterance image for "I" is output over 980 ms. Since one image frame lasts 1/30 s (when the television signal follows, for example, the NTSC system), 980 ms corresponds to 29 frames (0.98 ÷ (1/30) = 29.4, converted to an integer). FIGS. 5 and 6 show the results of applying this calculation to each character string.
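The frame-count arithmetic can be written out as a short sketch. The text says only that 29.4 is converted to an integer; rounding to the nearest integer is an assumption here, chosen because it reproduces both the 29 frames given for the model's "I" (98 ms) and the 34 frames given later for the trainee's "I" (112 ms). The function name `frames_needed` is hypothetical.

```python
import math


def frames_needed(duration_ms, slow_factor=10, fps=30):
    """Playback frames required to show `duration_ms` of speech slowed
    down by `slow_factor`, at `fps` frames per second.

    Rounds to the nearest integer (an assumption; the source only says
    the value is converted to an integer).
    """
    exact = duration_ms * slow_factor * fps / 1000
    return math.floor(exact + 0.5)


model_i = frames_needed(98)     # 0.98 s * 30 = 29.4 -> 29
trainee_i = frames_needed(112)  # 1.12 s * 30 = 33.6 -> 34
```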

[0041] Next, based on the number of model utterance image frames required for each character string period and inter-string silent period shown in FIG. 5, the image synchronous reproduction means 10 decides how many times each stored frame of the model utterance image is to be repeated during playback. Similarly, based on the number of trainee utterance image frames required for each period shown in FIG. 6, it decides how many times each stored frame of the trainee utterance image is to be repeated.

[0042] Images are captured every 1/30 s. As shown in FIG. 7, for example, only four frames are captured while the trainee utters the character string "I", and no frame is captured during the silent period between "I" and "have". The number of playback frames required for "I" is 34. It is therefore better to adjust the number of repetitions of each frame than simply to repeat every frame 10 times for 10x slow playback. For example, in one preferred approach, of the four frames for "I", the first to third frames are each played back 10 times, the fourth frame is played back 6 times in consideration of the silent period between "I" and "have", and the first frame of "have" is then played back 13 times, again in consideration of that silent period. In another preferred approach, the repetition count of each frame is chosen so that the frames belonging to a character string appear evenly across the total number of playback frames for that string.
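The second approach, spreading a string's captured frames evenly over its required playback frames, could look like the following sketch. The function name `spread_evenly` and the choice to give the leftover repetitions to the earliest frames are assumptions; the source does not specify how any remainder is assigned.

```python
def spread_evenly(total_frames, captured_frames):
    """Repetition count for each captured frame so that the counts sum to
    `total_frames` and differ from one another by at most one."""
    if captured_frames <= 0:
        raise ValueError("at least one captured frame is required")
    base, extra = divmod(total_frames, captured_frames)
    # The first `extra` frames get one additional repetition (assumption).
    return [base + 1 if i < extra else base for i in range(captured_frames)]


# Trainee's "I": 4 captured frames must fill 34 playback frames.
repeats = spread_evenly(34, 4)   # [9, 9, 8, 8]
```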

[0043] Once the repetition counts needed to realize 10x slow playback have been determined in this way for each frame of the stored model utterance image and trainee utterance image, the image synchronous reproducing means 10 takes the number of frames of the trainee utterance image as the reference and corrects the repetition count of each frame of the model utterance image, for each character string and each inter-string period. For example, as is clear from a comparison of FIG. 5 and FIG. 6, the character string "I" requires the model utterance image to be lengthened by 5 frames, so the repetition counts of the three model utterance frames for "I", previously determined as 10, 10, and 9, are corrected to, for example, 12, 12, and 10. The same processing is applied to the other character string periods and to the silent periods between character strings.
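One way to carry out such a correction is to scale the existing repetition counts proportionally to the reference total. The sketch below is an assumption, not the patent's stated rule: the helper name `rescale_counts` and the largest-remainder rounding are hypothetical. It does, however, reproduce the 10, 10, 9 to 12, 12, 10 example given in the text for a target of 34 frames.

```python
def rescale_counts(counts: list[int], target_total: int) -> list[int]:
    """Scale repetition counts so that they sum to target_total,
    using largest-remainder rounding (an assumed rounding rule)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Integer quotient and remainder of each proportionally scaled count.
    parts = [divmod(c * target_total, total) for c in counts]
    result = [q for q, _ in parts]
    leftover = target_total - sum(result)
    # Give the leftover repetitions to the frames with the largest remainders;
    # ties keep their original order because Python's sort is stable.
    order = sorted(range(len(counts)), key=lambda i: parts[i][1], reverse=True)
    for i in order[:leftover]:
        result[i] += 1
    return result


# Model frames for "I": 29 playback frames, lengthened to the trainee's 34.
print(rescale_counts([10, 10, 9], 34))
```

Integer arithmetic is used throughout so that the corrected counts always sum exactly to the reference frame count.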

[0044] The image synchronous reproducing means 10 then controls playback from the trainee utterance image storage means 9 and the model utterance image storage means 7 so that each frame is played the determined number of times, and causes the display means 12 to display the trainee utterance image and the model utterance image simultaneously in a synchronized state.
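The playback control described here amounts to expanding each stored frame by its repetition count and stepping through both expanded sequences in lockstep. The following is a minimal sketch under that reading; the function names, the string frame labels, and the trainee counts are illustrative assumptions, and a real device would hand each pair to the display means rather than collect it in a list.

```python
from itertools import chain, repeat


def expand(frames, counts):
    """Repeat each frame by its count, giving the playback order."""
    return list(chain.from_iterable(repeat(f, n) for f, n in zip(frames, counts)))


def synchronized_pairs(model_frames, model_counts, trainee_frames, trainee_counts):
    """Pair model and trainee frames for simultaneous display, tick by tick."""
    model_seq = expand(model_frames, model_counts)
    trainee_seq = expand(trainee_frames, trainee_counts)
    if len(model_seq) != len(trainee_seq):
        raise ValueError("both sequences must have the same playback length")
    return list(zip(model_seq, trainee_seq))


# Illustrative counts that both sum to 34 playback frames.
pairs = synchronized_pairs(
    ["m1", "m2", "m3"], [12, 12, 10],
    ["t1", "t2", "t3", "t4"], [9, 9, 8, 8],
)
print(len(pairs), pairs[0], pairs[-1])
```

The length check makes the synchronization requirement explicit: both images must have been corrected to the same total number of playback frames before display.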

[0045] Therefore, according to the above embodiment, the trainee utterance image and the model utterance image, both of which show the movement of the vocal organs, can be displayed in synchronization. As a result, the trainee can visually compare the temporal change of his or her own vocal organs during utterance with that of the modeler, in synchronization. The trainee can thus carry out detailed training in correct utterance without a teacher and can judge whether his or her manner of pronunciation is appropriate, so the training effect is higher than with conventional pronunciation training devices.

[0046] The present invention is not limited to the above embodiment, and allows various modifications such as those exemplified below.

[0047] (1) The voice information stored in the model voice storage means 1 may include information such as that shown in FIG. 3, so that the model voice recognition means 3 can be omitted.

[0048] (2) In the above embodiment, the trainee utterance image is taken as the reference and the playback of the model utterance image is corrected to achieve synchronization. Conversely, the model utterance image may be taken as the reference and the playback of the trainee utterance image corrected. Alternatively, whichever of the trainee utterance image and the model utterance image has more frames may be taken as the reference and the playback of the other corrected. In addition, synchronization may be achieved not only by adding playback frames (interpolation) but also by thinning out playback frames.
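The variant that takes the longer image as the reference, and the choice between adding and thinning frames, can both be expressed with a single index mapping. The sketch below is an assumption about one possible realization: `resample_indices` and `sync_to_longer` are hypothetical names, and interpolation is modelled simply as frame repetition.

```python
def resample_indices(src_len: int, dst_len: int) -> list[int]:
    """Source frame index to show at each of dst_len playback positions.
    dst_len > src_len repeats frames; dst_len < src_len thins them out."""
    if src_len <= 0 or dst_len <= 0:
        raise ValueError("lengths must be positive")
    return [i * src_len // dst_len for i in range(dst_len)]


def sync_to_longer(model, trainee):
    """Use whichever sequence has more frames as the reference (assumed policy)."""
    target = max(len(model), len(trainee))
    model_out = [model[i] for i in resample_indices(len(model), target)]
    trainee_out = [trainee[i] for i in resample_indices(len(trainee), target)]
    return model_out, trainee_out


print(resample_indices(3, 6))  # frames added by repetition
print(resample_indices(6, 3))  # frames thinned out
```

Because the same mapping handles both directions, switching the reference from one image to the other only changes which sequence is passed as the target length.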

[0049] (3) The playback speed of the utterance images may be fixed. In a device whose playback speed is fixed at a slow-motion rate, a high-speed camera may be used as the video camera 8 and its images played back at normal speed to realize slow playback. In this case, the model utterance image stored in the model utterance image storage means 7 is also one captured by a high-speed video camera.

[0050] (4) In the above embodiment, the processing for the predetermined playback speed is performed first, followed by the correction processing for synchronized playback of the trainee utterance image and the model utterance image, but the processing may be performed in the reverse order. In addition, some devices with a fixed playback speed do not require the processing for the predetermined playback speed.

[0051] (5) The flowchart of FIG. 2 shows, for one playback unit (for example, one sentence), the processing from the output of the model voice through to the synchronized display of the trainee utterance image and the model utterance image. Alternatively, the model voice for an arbitrary number of playback units may be output first, the trainee may then utter that number of units, and finally the synchronized display may be performed in sequence for that number of units.

[0052] (6) In the above embodiment, synchronization is performed in units of the character strings contained in one playback unit, but it may instead be performed over the playback unit as a whole. That is, the only processing performed may be to make the start time and end time of the whole playback unit coincide between the two utterance images, without considering synchronization of individual character strings. In this case, the degree of synchronization is somewhat lower than in the above embodiment, but the processing configuration can be made simpler.
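Matching only the start and end times of the whole playback unit corresponds to a single linear time mapping between the two recordings. The sketch below illustrates that idea; the function name `warp_time` and the example times are hypothetical and are not taken from the patent.

```python
def warp_time(t, ref_start, ref_end, other_start, other_end):
    """Linearly map a time t in the reference recording onto the other
    recording so that both start and end together."""
    if ref_end <= ref_start:
        raise ValueError("reference interval must have positive length")
    ratio = (t - ref_start) / (ref_end - ref_start)
    return other_start + ratio * (other_end - other_start)


# Illustrative times in seconds: the reference utterance spans 0.0 to 2.0,
# the other utterance spans 0.5 to 1.5.
for t in (0.0, 1.0, 2.0):
    print(t, warp_time(t, 0.0, 2.0, 0.5, 1.5))
```

Only one scale factor is needed for the whole unit, which is why this variant is simpler than per-string synchronization, at the cost of individual character strings possibly drifting out of alignment in between.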

[0053]

[Effects of the Invention] As described above, according to the present invention, the temporal change of the trainee's vocal organs during utterance is displayed in synchronization with that of the modeler, so the training effect obtained by presenting images of the vocal organs can be made higher than before.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram showing the configuration of the embodiment.

FIG. 2 is a flowchart showing an example of the processing flow of the embodiment.

FIG. 3 is an explanatory diagram showing time information of the modeler's utterance.

FIG. 4 is an explanatory diagram showing time information of the trainee's utterance.

FIG. 5 is an explanatory diagram showing the number of frames of the model utterance image required at the predetermined playback speed.

FIG. 6 is an explanatory diagram showing the number of frames of the trainee utterance image required at the predetermined playback speed.

FIG. 7 is an explanatory diagram showing the relationship between the temporal change of the trainee's utterance and the image capture points.

[Explanation of Symbols]

3: model voice recognition means; 6: trainee voice recognition means; 7: model utterance image storage means; 8: video camera; 9: trainee utterance image storage means; 10: image synchronous reproducing means; 12: display means.

Continuation of the front page

(51) Int. Cl.⁷ (identification symbol, FI): H04N 7/18
(72) Inventor: Satoshi Myojin, c/o Ogis Research Institute Co., Ltd., 3-2-95 Chiyozaki, Nishi-ku, Osaka-shi, Osaka
(72) Inventor: Masayo Asano, c/o Oki Techno Systems Laboratory Co., Ltd., 3-8-10 Uchiyama, Chikusa-ku, Nagoya-shi, Aichi
(56) References: JP-A-60-195584 (JP, A); JP-A-60-245000 (JP, A); JP-A-63-146088 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G09B 5/02; G09B 5/06; G09B 19/04, 19/16; G10L 11/04; G10L 13/00; G10L 21/06; H04N 7/18

Claims

Translated from Japanese

(57) [Claims]

1. A pronunciation training device having a function of simultaneously displaying, on display means, a model utterance image showing the movement of the vocal organs when a modeler utters and a trainee utterance image showing the movement of the vocal organs when a trainee utters, characterized by comprising image synchronous reproducing means that, based on time information of the modeler's utterance and time information of the trainee's utterance, causes interpolation or thinning to be performed on at least one of the model utterance image and the trainee utterance image so that the model utterance image and the trainee utterance image are displayed in synchronization.