JP2007329794A


Info

Publication number: JP2007329794A
Application number: JP2006160530A
Authority: JP
Inventors: Shinji Hizuka; 真二肥塚
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-06-09
Filing date: 2006-06-09
Publication date: 2007-12-20

Abstract

PROBLEM TO BE SOLVED: To provide a voice recording device that, despite its simple configuration, divides recorded speech into sections at reliable speech breaks so that a desired utterance can be retrieved easily.

SOLUTION: A control part 14 analyzes operation data input from an operation part 18 and extracts page change-over event information and data change-over event information for the presentation data. The control part 14 also analyzes the input voice data and extracts the voice sections of a specific speaker. It then generates voice condition data including the page information of the presentation data, presenter identification information, and recording time. When page change-over event information is extracted, the control part 14 changes the page information in the voice condition data; when the speaker has changed and data change-over event information is extracted, it changes the presenter identification information.

COPYRIGHT: (C)2008, JPO&INPIT

Description

Translated from Japanese

The present invention relates to an apparatus that records and makes use of audio such as conference speech.

Conventionally, meetings and presentations have been documented as recorded audio. Such a record is a single file containing audio signals lasting from several tens of minutes to several hours.

When reviewing such a record later, the user searches for a desired utterance by referring to time information (the elapsed time from the start of recording). However, time information alone does not reveal what was recorded (for example, which speaker is talking), so it is difficult to find the desired utterance at playback.

To address this, there is a device that identifies speakers from the characteristics of their voices, divides the recorded data into sections for each speaker, and allows each section to be played back individually (see, for example, Patent Document 1).

The apparatus described in Patent Document 1 extracts various parameters for identifying a speaker from the voice data, assigns identification numbers to these parameters, and groups them. The grouped parameters (speaker identification information) are stored together with the voice data. The stored voice data is displayed graphically in time series for each speaker, so that a desired utterance can be located visually.

Patent Document 1: JP-A-8-153118

However, identifying a speaker from voice characteristics is not easy. If a speaker is misidentified, the database is built with incorrect speaker identification information. Moreover, portions where the speaker could not be identified remain as one continuous section, so finding a desired utterance is still difficult.

An object of the present invention is to provide a voice recording apparatus that, despite its simple configuration, divides the recording into sections at reliable breaks between utterances and thereby makes a desired utterance easy to find.

The audio recording apparatus of the present invention comprises: an audio input unit that receives an audio signal from outside; a storage unit that stores presentation data including a plurality of video data and in which the audio signal is recorded; a video output unit that outputs video data to the outside; an operation unit that accepts a video switching operation by a user; a video playback unit that sequentially switches among the plurality of video data according to the video switching operation and outputs them to the video output unit; an audio recording unit that records the audio signal input from the audio input unit in the storage unit in time series as recorded data; and an audio playback unit that plays back the recorded data recorded in the storage unit. The audio recording unit further records the timing at which the video switching operation was performed during recording of the audio signal, and the audio playback unit plays back the recorded data in units of the sections divided at the timings at which the video switching operation was performed.

In the present invention, presentation data (material data) including video data is stored. The user gives a presentation while performing video switching operations, and the audio recording unit records the audio during the presentation. The audio recording unit detects each video switching operation and records its timing. The timing may be recorded either by splitting the audio data being recorded at that point or by attaching information that indicates the switching timing. When the audio data is played back, each section delimited by the recorded timings is played back as one unit.
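The section division described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the second option in the text (switch timings stored as metadata rather than physically splitting the audio), and the function names `split_into_sections` and `section_at` are hypothetical.

```python
from bisect import bisect_right


def split_into_sections(switch_times, total_duration):
    """Return (start, end) pairs, in seconds, delimited by the switch timings."""
    # Keep only timings strictly inside the recording, sorted and deduplicated.
    cuts = sorted({t for t in switch_times if 0.0 < t < total_duration})
    bounds = [0.0] + cuts + [total_duration]
    return list(zip(bounds[:-1], bounds[1:]))


def section_at(sections, t):
    """Return the index of the section that contains playback position t."""
    starts = [start for start, _ in sections]
    return max(bisect_right(starts, t) - 1, 0)


# Hypothetical example: three slide switches during a 300-second recording.
sections = split_into_sections([95.0, 40.0, 210.5], 300.0)
current = section_at(sections, 100.0)
```

Playback "in units of sections" then amounts to playing from `sections[i][0]` to `sections[i][1]` for the chosen index `i`.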

Furthermore, in the present invention, the storage unit stores feature data of the voices of one or more speakers as references, and the apparatus further comprises: a feature extraction unit that extracts feature data of a speaker's voice from the audio signal of each section of the recorded data; and a speaker identification unit that identifies the speaker of each section by comparing the extracted feature data with the references and adds information on the identified speaker to the recorded data.

In the present invention, voice features (formants and the like) of specific speakers are stored as references. The feature extraction unit extracts the voice features of each speaker from the voice data being recorded. The extracted voice features are compared with the reference voice features, the speaker of each section is identified, and the speaker information is attached to the voice data.

According to the present invention, the video switching operation is detected, its timing is recorded, and at playback the audio is reproduced section by section, the sections being delimited by the recorded timings. As a result, despite the simple configuration, the recording is divided at reliable breaks between utterances, and a desired utterance can be found efficiently.

An audio recording and playback apparatus according to an embodiment of the present invention will be described with reference to the drawings. This apparatus is typically implemented on a personal computer and is used mainly for presentations, where it records and plays back the speech of presenters and other participants.

FIG. 1 is a diagram showing an outline of a presentation using the audio recording and playback apparatus of this embodiment. Presentation data (presentation material) is stored in the audio recording and playback apparatus 1, which is a personal computer. The presenter 3 operates the apparatus 1 to switch the image shown by the projector 5 while explaining its content. The voice uttered by the presenter 3 is acquired by the apparatus 1 as voice data, and the various operations performed by the presenter 3 are acquired by the apparatus 1 as operation data.

The voice data is recorded in time series in a database (the storage unit 15 described later). The operation data is analyzed by the audio recording and playback apparatus 1, and video switching operations (operations that switch the presentation material) are extracted. The voice data is segmented (divided into sections) at the timings at which these video switching operations are extracted.

The voice data is also analyzed by the audio recording and playback apparatus 1, and the voice features of the presenter 3 are extracted. These voice features are compared with reference voice features recorded in the database in advance, and the speaker is identified. Information on the identified speaker is recorded as additional information that identifies the attributes of each section of the voice data.

In addition, the image displayed by the projector 5 in each section is captured (as a snapshot) and recorded in the database as preview video data associated with the voice data.

FIG. 2 is a block diagram showing a specific configuration of the audio recording and playback apparatus. The audio recording and playback apparatus 1 includes a microphone 11, a sound collecting amplifier 12, an A/D converter 13, a control unit 14, a storage unit 15, a RAM 16, a video output I/F 17, an operation unit 18, a D/A converter 19, a sound emitting amplifier 20, and a speaker 21. The apparatus 1 is connected to the projector 5 via the video output I/F 17.

The microphone 11 picks up ambient sound, including the speech of the speaker at the apparatus, converts it into an electric signal, and generates a collected sound signal. The sound collecting amplifier 12 amplifies the collected sound signal, and the A/D converter 13 converts it from analog to digital form.

The control unit 14 performs overall control of the audio recording and playback apparatus. The control unit 14 reads an operating program from the storage unit 15 and loads it into the RAM 16 to perform various processes. The control unit 14 also records the digital collected sound signal in the storage unit 15 as voice data.

The storage unit 15 consists of a large-capacity magnetic disk or the like and functionally includes a voice data recording unit 151, a voice status data recording unit 152, a presentation data recording unit 153, and a voice feature data recording unit 154. It also stores a playback application 155 and an editing application 156. The collected voice data is recorded in the voice data recording unit 151.

The control unit 14 loads the playback application 155 from the storage unit 15 into the RAM 16, reads the voice data recorded in the voice data recording unit 151, and performs playback processing. It likewise loads the editing application 156 from the storage unit 15 into the RAM 16 and performs editing processing. In playback processing, the control unit 14 outputs the voice data read from the voice data recording unit 151 to the D/A converter 19. The D/A converter 19 converts the voice data input from the control unit 14 into an analog sound emission signal, the sound emitting amplifier 20 amplifies this signal and supplies it to the speaker 21, and the speaker 21 emits the amplified signal as sound. In this way, the apparatus records the voices around it (mainly that of the presenter) and plays back the recorded audio.

In editing processing, the control unit 14 outputs to the video output I/F 17 a video signal for displaying a screen such as that shown in FIG. 5. Details of the editing processing will be described later.

The operation unit 18 consists of a keyboard and a mouse, generates operation data corresponding to the operations of the user (presenter), and outputs it to the control unit 14. For example, when the user moves the cursor on the display screen (shown by the projector 5) with the mouse and clicks at a given position, click information is supplied to the control unit 14, which determines the content of the operation input from the click position and click state and performs the corresponding process.

When the user instructs, via the operation unit 18, that presentation data (including an application for displaying the material) recorded in the presentation data recording unit 153 of the storage unit 15 be read, the control unit 14 reads the designated file (material file) from the presentation data recording unit 153 and generates a video signal. The control unit 14 outputs this video signal to the projector 5 via the video output I/F 17. The projector 5 displays an image (on a screen or the like) according to the input video signal. A general-purpose display or the like may be used instead of the projector 5. The user can thus display the material file and give a presentation.

The control unit 14 extracts voice features from the voice data input from the A/D converter 13. The voice features typically represent the speaker's formants, pitch, and the like. They are extracted from the frequency spectrum (power spectrum) obtained by Fourier-transforming the voice data and from the cepstrum obtained by taking the logarithm of this power spectrum and then applying an inverse Fourier transform.
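The cepstrum computation described above (Fourier transform, logarithm of the power spectrum, inverse Fourier transform) can be sketched as follows. This is an illustrative sketch only: the patent does not specify frame length, windowing, or the pitch search range, so those values are assumptions, and a naive pure-Python DFT stands in for the FFT routine a practical system would use. The names `dft`, `real_cepstrum`, and `estimate_pitch` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame):
    """Inverse DFT of the log power spectrum of one frame."""
    n = len(frame)
    # A small offset avoids log(0) for empty frequency bins.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in dft(frame)]
    return [sum(log_power[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def estimate_pitch(frame, sample_rate, fmin, fmax):
    """Pick the quefrency with the largest cepstral value inside [fmin, fmax]."""
    ceps = real_cepstrum(frame)
    lo = int(sample_rate / fmax)
    hi = min(int(sample_rate / fmin), len(ceps) - 1)
    q = max(range(lo, hi + 1), key=lambda i: ceps[i])
    return sample_rate / q


# Synthetic test frame: an impulse every 40 samples at 8 kHz, i.e. a 200 Hz pulse train.
frame = [1.0 if t % 40 == 0 else 0.0 for t in range(240)]
pitch = estimate_pitch(frame, 8000, fmin=120.0, fmax=400.0)
```

The low-quefrency part of the same cepstrum reflects the spectral envelope, which is where formant-related information would be read from; that step is omitted here.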

Prior to the presentation, the control unit 14 extracts the voice features of each speaker and records them in the voice feature data recording unit 154 of the storage unit 15. The identification information of each speaker (that is, which speaker each set of voice feature data belongs to) is registered in advance by a presentation participant (such as the facilitator). For example, to register the voice features of a speaker A in the storage unit 15, the facilitator has speaker A speak and uses the operation unit 18 to record information about speaker A (such as the person's name) in the storage unit 15. When the apparatus is used within a company, or when the presentation participants do not change, the voice features of each employee may be recorded in the storage unit 15 in advance.

During the presentation, the control unit 14 extracts voice features from the input voice data and compares them with the voice features of each speaker recorded in the storage unit 15. It thereby extracts the utterances of a specific speaker (presenter) and generates voice status data that identifies the attributes of each part of the recorded voice data. These attributes include the recording time of the voice data, page information of the presentation material, presenter identification information, and the like. The voice status data is recorded in the voice status data recording unit 152 of the storage unit 15.

The control unit 14 analyzes the operation data input from the operation unit 18 and detects event information for a specific operation (a page switching operation on the material). Based on this page switching event information and the voice features, the control unit 14 generates page switching information and generates voice status data. The control unit 14 also analyzes the operation data to detect event information for material file switching. Based on this material file switching event information and the voice features, it generates presenter identification information and generates voice status data. Because the voice status data and the voice data are recorded in time series, the voice data is segmented at each material page switch and for each presenter, at the timings at which video switching operations are extracted.

Next, the recording flow of the audio recording and playback apparatus will be described with reference to FIG. 3.

FIG. 3 is a flowchart showing the recording process of the control unit 14. The voice features of each conference participant are registered in the storage unit 15 before this recording process is performed.

The control unit 14 monitors the input of the audio signal. When it detects a presentation start trigger, it starts recording (S1 → S2). The presentation start trigger can be obtained either by detecting that an audio signal has been input or by the user giving an instruction to start the presentation via the operation unit 18.

When recording starts, the control unit 14 acquires the recording start time (from a built-in timer or the like) and saves it as the title of one voice data file (S3).

The control unit 14 acquires the input audio signal and the current time (S4) and supplies the audio signal as voice data and the current time as time data to the storage unit 15, which stores the voice data sequentially (S5).

The control unit 14 analyzes the voice data (S6) as follows. The control unit 14 extracts voice features from the input voice data, reads the voice features of the registered speakers from the storage unit 15, and determines, by a technique such as pattern matching, whether the extracted voice features match the stored ones. If they match, the input voice data is judged to be an utterance section of a registered speaker (for example, speaker A), and the speaker is identified. If they do not match, the data is judged to be an utterance by an unregistered person or a silent (noise) section. In this way, the control unit 14 determines whether the speaker has changed since the previous voice data acquisition.
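The matching step in S6 can be illustrated with a short sketch. The patent says only "a technique such as pattern matching", so the nearest-reference search with a Euclidean distance threshold used here is an assumption made for illustration, as are the names `identify_speaker` and `speaker_changed` and the feature values.

```python
import math


def identify_speaker(features, references, threshold):
    """Return the registered speaker closest to `features`, or None if no match.

    `references` maps a speaker name to a reference feature vector.
    None stands for an unregistered speaker or a silent (noise) section.
    """
    best_name, best_dist = None, float("inf")
    for name, ref in references.items():
        dist = math.dist(features, ref)  # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


def speaker_changed(previous, current):
    """True when the identified speaker differs from the previous acquisition."""
    return previous != current


# Hypothetical two-dimensional reference features for speakers A and B.
references = {"A": [1.0, 2.0], "B": [4.0, 6.0]}
first = identify_speaker([1.1, 2.1], references, threshold=0.5)
second = identify_speaker([3.9, 6.2], references, threshold=0.5)
unknown = identify_speaker([10.0, 10.0], references, threshold=0.5)
```

Returning `None` for a non-match keeps unregistered speech and noise distinct from every registered speaker, which is what the change detection relies on.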

Next, the control unit 14 analyzes the operation data (S7). The control unit 14 detects specific operations in the input operation data and extracts event information. Specifically, for the material file currently being output to the video output I/F 17, it detects operations that advance to the next page (page switching event information) and operations that switch the material file (material file switching event information).

From the results of the voice data analysis and the operation data analysis, the control unit 14 determines whether the current timing is a segmentation timing, a speaker change timing, or neither (S8). If neither page switching event information nor material file switching event information has been detected, the control unit 14 judges that the timing is neither and repeats the process from voice data acquisition until recording ends (S8 → S14 → S4). Even if a speaker change is detected from the voice features, the process is repeated from voice data acquisition unless such event information has been extracted.

If page switching event information has been detected but material file switching event information has not, the control unit 14 judges this to be a segmentation timing and performs the chapter end process of S9. Even if both page switching event information and material file switching event information have been detected, the control unit 14 judges that the speaker has not changed unless a speaker change has been detected, treats this as a segmentation timing, and performs the chapter end process of S9. Because switching the material file automatically switches the page as well, the case in which material file switching event information is detected without page switching event information is assumed not to occur.

If the control unit 14 has detected both page switching event information and material file switching event information, and has also detected a change of speaker, it judges that the speaker has changed and performs the presentation end process of S10.
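The S8 decision described in the preceding paragraphs reduces to a small truth table, sketched below. The enum and function names are hypothetical. Since the text assumes a file switch never occurs without a page switch, the sketch simply treats a file switch as implying a page switch.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "repeat from S4"
    CHAPTER_END = "S9"
    PRESENTATION_END = "S10"


def decide(page_switched, file_switched, speaker_changed):
    """Map the S6/S7 analysis results to the branch taken at S8."""
    # A material file switch automatically switches the page as well.
    page_switched = page_switched or file_switched
    if not page_switched:
        # No operation event: keep recording, even if the voice changed.
        return Action.CONTINUE
    if file_switched and speaker_changed:
        # New file and new voice: the presenter has changed.
        return Action.PRESENTATION_END
    # Page switch only, or file switch by the same speaker.
    return Action.CHAPTER_END
```

Note that a detected speaker change never triggers segmentation on its own; it only qualifies a file switch event, which is what makes the operation events the reliable section boundaries.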

In the chapter end process of S9, the control unit 14 generates the page information of the presentation material for that chapter; in the presentation end process of S10, it generates presenter identification information. Then, in S11, it generates voice status data. To associate the group of voice data belonging to the same chapter and presenter, the control unit 14 generates voice status data containing the page information of the presentation material, the presenter identification information, and the recording time for that group, and supplies it to the storage unit 15. The storage unit 15 records the voice status data from the control unit 14 in the voice status data recording unit 152.

In S12, the control unit 14 captures the presentation data. In this process, the control unit 14 captures the image of the material page currently displayed on the screen (takes a snapshot) and additionally records it in the presentation data recording unit 153 as preview video data associated with the voice status data.

The generation and recording of voice status data, the recording of voice data, and the capture of presentation data are repeated until a recording end trigger is detected, and voice status data is generated and recorded each time the chapter switches or the presenter changes. When the recording end trigger is detected (S14), the control unit 14 generates and records the final voice status data, and also generates grouping instruction data that groups the voice status data already recorded in the voice status data recording unit 152 under the title acquired at the start of recording, and records it in the voice status data recording unit 152 (S15). The recording end trigger is obtained by detecting the user's instruction to end the presentation via the operation unit 18.
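One possible in-memory shape for the voice status data and its grouping under the recording title is sketched below. The patent does not define a data format, so the field names and the classes `VoiceStatus` and `Recording` are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceStatus:
    """Attributes of one section: recording time, page, and presenter."""
    start_time: float  # seconds from the start of recording
    page: int
    presenter: str


@dataclass
class Recording:
    """Groups all voice status records under the title taken at S3."""
    title: str
    statuses: list = field(default_factory=list)

    def close_section(self, start_time, page, presenter):
        # Called at each chapter end (S9) or presentation end (S10).
        self.statuses.append(VoiceStatus(start_time, page, presenter))

    def sections_by_presenter(self):
        grouped = {}
        for status in self.statuses:
            grouped.setdefault(status.presenter, []).append(status)
        return grouped


# Hypothetical session: the title is the recording start time.
rec = Recording(title="2006-06-09 10:00:00")
rec.close_section(start_time=0.0, page=1, presenter="A")
rec.close_section(start_time=42.5, page=2, presenter="A")
rec.close_section(start_time=130.0, page=1, presenter="B")
by_presenter = rec.sections_by_presenter()
```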

With this configuration and processing, temporally continuous voice data is recorded in the voice data recording unit 151 as a voice data file, as shown in FIG. 4. The voice data file is divided by presenter and by chapter (page) according to the voice status data recorded in the voice status data recording unit 152.

For example, the voice data file of presenter A is divided into the voice data for material 1, material 2, material 3, material 4, material 5, and material 6, and the voice data of silent or noise sections. Each segment of voice data is associated with its start time (start time data). Similarly, the voice data file of presenter B is divided into the voice data for material 7, material 8, material 9, and material 10, and the voice data of silent or noise sections, with start time data associated with each segment.

Next, the configuration and processing used when creating the voice data file will be described.

FIG. 5 shows the screen displayed while the editing application is running; (A) shows the initial state and (B) shows the state after editing.

When the user operates the operation unit 18 after the meeting to run the editing application, the control unit 14 acquires the voice status data from the voice status data recording unit 152 of the storage unit 15 and displays a screen such as that shown in FIG. 5(A).

As shown in FIG. 5(A), the editing screen includes a title display unit 201 and a time chart display unit 202. The time chart display unit 202 includes bar graphs 203 representing the individual voice data segments, material page display units 204, presenter display units 205, and content display units 206.

(1) Title display unit 201
In the initial state, as shown in FIG. 5(A), the title display unit 201 shows the recording date of the voice data file, which corresponds to the file name of the voice status data. When the user selects the title display unit 201 with the mouse, it becomes editable. When the user then enters the name of the presentation topic, for example "Product Sales Review Meeting", with the keyboard or the like, the title display unit 201 shows "Product Sales Review Meeting", as in FIG. 5(B). When the editing application is closed, the control unit 14 asks whether the change should be applied and, if the user chooses to apply it, associates the title "Product Sales Review Meeting" with the voice status data. Alternatively, the voice status data file itself may be renamed to "Product Sales Review Meeting" and stored in the storage unit 15. Because the title changes from a bare date to a specific topic name, the content of the voice data file can be recognized easily later on.

(2) Time chart display unit 202
The time chart display unit 202 arranges the segmented voice data in time series for each material page, according to the segmentation information obtained from the voice status data file name, and displays each piece as a bar graph 203. The length of each bar graph 203 represents the time length of the corresponding segmented voice data. The material page display unit 204 is also displayed as information identifying the material page.
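As a rough sketch of the time chart layout, the segments can be grouped by material page and drawn with a width proportional to their duration. The `Segment` and `Bar` records and the `px_per_second` scale are assumptions made for illustration; the patent specifies only that bar length represents time length.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """Hypothetical record for one piece of segmented voice data."""
    page: int        # material page shown while the segment was recorded
    start_s: float   # seconds from the start of recording
    end_s: float


@dataclass
class Bar:
    page: int
    x: float         # left edge in pixels
    width: float     # proportional to the segment's duration


def layout_bars(segments: List[Segment], px_per_second: float = 2.0) -> Dict[int, List[Bar]]:
    """Group segments by material page and place them in time order."""
    rows: Dict[int, List[Bar]] = {}
    for seg in sorted(segments, key=lambda s: s.start_s):
        bar = Bar(
            page=seg.page,
            x=seg.start_s * px_per_second,
            width=(seg.end_s - seg.start_s) * px_per_second,
        )
        rows.setdefault(seg.page, []).append(bar)
    return rows
```

Each dictionary entry corresponds to one row of the chart, so a page that is shown twice during the presentation gets two bars on the same row.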

(3) Material page display unit 204
In the initial state, as shown in FIG. 5(A), each material page display unit 204 displays the material page name obtained from the presentation data recording unit 153. When the user selects a material page display unit 204 with the mouse, it becomes editable. When the user then enters a title for each material page with a keyboard or the like, the material page display unit 204 displays that material page name ("Overview", "Concept", and so on) as shown in FIG. 5(B). When the editing application exits, the control unit 14 asks whether to apply this change and, if the user chooses to apply it, associates each material page name with its material page and stores it in the storage unit 15.

At this time, if the user performs an operation such as double-clicking a piece of segmented voice status data with the mouse, the control unit 14 recognizes the operation, reads the corresponding segmented voice data from the storage unit 15, and plays it back. The playback sound is emitted from the speaker 21. By listening to it, the user can check what the speaker said in each piece of segmented voice data.

(4) Presenter display unit 205
In the initial state, as shown in FIG. 5(A), the presenter display unit 205 displays the speaker names obtained from the voice status data (Speaker A, Speaker B). When the user selects the presenter display unit 205 with the mouse, it becomes editable. When the user then enters the personal name of each speaker with a keyboard or the like, the presenter display unit 205 displays the personal names ("Mr. A", "Mr. B") as shown in FIG. 5(B).

(5) Content display unit 206
In the initial state, as shown in FIG. 5(A), the content display unit 206 shows only an empty frame. When the user selects the content display unit 206 with the mouse, it becomes editable. When the user then enters the presentation content with a keyboard or the like, the content display unit 206 displays that content ("Description of Product A", "Marketing") as shown in FIG. 5(B). Each content display unit 206 is displayed in a different color or pattern. If the user selects the bar graph 203 of a piece of segmented voice data while one of the content display units 206 is selected, the two are associated with each other and the bar graph is displayed in the same color and pattern as that content display unit 206.
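The association between a content display unit and the bar graphs can be sketched as a small selection state machine. All names here (`ContentLabel`, `ContentPanel`, the integer identifiers, the default color) are hypothetical and chosen only to illustrate the selection order described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContentLabel:
    text: str    # e.g. "Description of Product A"
    color: str   # each label gets its own color or pattern


@dataclass
class ContentPanel:
    labels: Dict[int, ContentLabel] = field(default_factory=dict)
    bar_to_label: Dict[int, int] = field(default_factory=dict)
    selected_label: Optional[int] = None

    def select_label(self, label_id: int) -> None:
        self.selected_label = label_id

    def select_bar(self, bar_id: int) -> None:
        # A bar is associated only while a content label is selected.
        if self.selected_label is not None:
            self.bar_to_label[bar_id] = self.selected_label

    def bar_color(self, bar_id: int, default: str = "gray") -> str:
        # Associated bars take the color of their content label.
        label_id = self.bar_to_label.get(bar_id)
        return self.labels[label_id].color if label_id is not None else default
```

Storing only the bar-to-label mapping keeps the color in one place, so changing a label's color would recolor every associated bar.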

(6) Material page preview display unit 207
When the user clicks a piece of segmented voice status data with the mouse, the control unit 14 recognizes the operation, reads the preview video data associated with the corresponding segmented voice data, and displays it on the material page preview display unit 207.

This allows the user to grasp the presentation content easily, without having to listen again to each piece of segmented voice data one by one.
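The two mouse operations on a segment (a single click for the page preview, a double-click for playback) can be sketched as two handlers that share a segment identifier. The lookup tables and callback parameters below are assumptions for illustration; the patent says only that the control unit 14 reads the associated data.

```python
from typing import Callable, Dict, Tuple


def make_handlers(
    preview_by_segment: Dict[int, str],
    voice_by_segment: Dict[int, str],
    show_preview: Callable[[str], None],
    play_voice: Callable[[str], None],
) -> Tuple[Callable[[int], None], Callable[[int], None]]:
    """Return (on_click, on_double_click) handlers for a segment id."""

    def on_click(segment_id: int) -> None:
        # Single click: show the material page that was on screen.
        show_preview(preview_by_segment[segment_id])

    def on_double_click(segment_id: int) -> None:
        # Double click: play back the segmented voice data.
        play_voice(voice_by_segment[segment_id])

    return on_click, on_double_click
```

Passing the display and playback functions in as callbacks keeps the sketch independent of any particular GUI or audio library.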

With the configuration and processing described above, a presentation voice data file that is easier to understand can be created with little effort. In addition, the user can easily listen again to only the parts of the presentation that are needed.

In this embodiment, the control unit 14 extracts both the voice feature amounts and the operation event information; however, a dedicated component (DSP) for extracting the voice feature amounts may also be provided.

Diagram outlining a presentation using the voice recording and playback device of this embodiment
Configuration diagram of the voice recording and playback device
Flowchart showing the recording process
Conceptual diagram of the voice data recorded in the storage unit 15
Diagram showing the screen displayed when the editing application is run

Explanation of symbols

1 - Voice recording and playback device
3 - Presenter
5 - Projector

Claims

Translated from Japanese

A voice recording device comprising:
a voice input unit that receives a voice signal from outside;
a storage unit that stores presentation data including a plurality of video data and in which the voice signal is recorded;
a video output unit that outputs video data to the outside;
an operation unit that accepts a video switching operation by a user;
a video playback unit that sequentially switches the plurality of video data in accordance with the video switching operation and outputs them to the video output unit;
a voice recording unit that records the voice signal input from the voice input unit in the storage unit in time series as recording data; and
a voice reproduction unit that reproduces the recording data recorded in the storage unit,
wherein the voice recording unit further records the timing at which the video switching operation is performed during recording of the voice signal, and
the voice reproduction unit reproduces the recording data in units of sections divided at the timing at which the video switching operation is performed.
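For illustration only, and not as part of the claim language, dividing a recording at the recorded switching timings can be sketched as follows. The function names, the use of seconds as the time unit, and the in-memory sample list are assumptions; the claim does not prescribe any of them.

```python
from typing import List, Sequence, Tuple


def split_into_sections(duration_s: float, switch_times_s: List[float]) -> List[Tuple[float, float]]:
    """Return (start, end) sections bounded by the recorded switching times."""
    # Keep only timings strictly inside the recording, sorted and de-duplicated.
    cuts = sorted({t for t in switch_times_s if 0.0 < t < duration_s})
    bounds = [0.0, *cuts, duration_s]
    return list(zip(bounds[:-1], bounds[1:]))


def section_samples(samples: Sequence[int], sample_rate: int, section: Tuple[float, float]) -> Sequence[int]:
    """Slice the samples that belong to one section for playback."""
    start_s, end_s = section
    return samples[int(start_s * sample_rate):int(end_s * sample_rate)]
```

A player built on this sketch would reproduce one `(start, end)` section at a time, which matches the idea of playback in units of sections.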

The voice recording device according to claim 1, wherein the storage unit stores feature data of the voices of one or more speakers as references, the device further comprising:
a feature extraction unit that extracts feature data of a speaker's voice from the voice signal of each section of the recording data; and
a speaker identification unit that identifies the speaker of each section by comparing the extracted feature data with the references, and adds information on the identified speaker to the recording data.
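For illustration only, comparing a section's feature data with stored references can be sketched as a nearest-reference search. Cosine similarity, the 0.8 threshold, and the plain numeric vectors are assumptions chosen for this sketch; the claim does not specify the feature type or the comparison method.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    features: Sequence[float],
    references: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cut-off below which no speaker is assigned
) -> Optional[str]:
    """Return the best-matching reference speaker, or None if none is close enough."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, ref in references.items():
        score = cosine_similarity(features, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

The returned name would then be attached to the corresponding section of the recording data as the speaker information.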