JP2023016504A


Info

Publication number: JP2023016504A
Application number: JP2021120856A
Authority: JP
Inventors: トアンドゥクグェン; Tuan Duc Nguen
Original assignee: Aimesoft Jsc
Current assignee: Aimesoft Jsc
Priority date: 2021-07-21
Filing date: 2021-07-21
Publication date: 2023-02-02
Also published as: WO2023002300A1; JP2023162179A

Abstract

To provide a slide playback program and the like that can enhance the sense of presence in a presentation that uses speech synthesis. SOLUTION: A slide playback program causes a computer to: acquire presentation data including a plurality of items of slide data, each including a speech text and a display element; output, in a prescribed sequence, the display elements included in the respective items of slide data; and output a read-aloud voice of the speech text included in the slide data being output, accompanied by a video of a person. SELECTED DRAWING: Figure 10

Description

Translated from Japanese

The present invention relates to a slide playback program and the like that sequentially display and output a plurality of slides included in presentation data.

In recent years, in business negotiations and the like, images are displayed on a display device and products are explained while the images are switched in sequence. Each displayed image is called a slide, and a collection of slides is called presentation data.

A presentation device using speech synthesis technology has also been proposed (Patent Document 1). The presentation device described in Patent Document 1 automatically reads text data aloud by speech synthesis in synchronization with the switching of slides.

Patent Document 1: JP-A-2001-5476

However, audio alone lacks a sense of presence, and listeners may find the content difficult to understand. The present invention has been made in view of these circumstances. Its object is to provide a slide playback program and the like that can give a greater sense of presence in a presentation that uses speech synthesis.

A slide playback program according to one aspect of the present application causes a computer to execute processing of: acquiring presentation data including a plurality of items of slide data, each including spoken text and a display element; outputting the display elements included in the respective items of slide data in a predetermined order; and outputting a read-aloud voice of the spoken text included in the slide data being output, accompanied by a video of a person.

According to one aspect of the present application, displaying a video of a person who is speaking makes a presentation with a sense of presence possible.

FIG. 1 is an explanatory diagram showing a configuration example of a presentation system.
FIG. 2 is a block diagram showing a hardware configuration example of a playback device.
FIG. 3 is an explanatory diagram showing an example of a basic setting DB.
FIG. 4 is an explanatory diagram showing an example of a model DB.
FIG. 5 is an explanatory diagram showing an example of a speech setting DB.
FIG. 6 is an explanatory diagram showing an example of a screen setting DB.
FIG. 7 is an explanatory diagram showing an example of a transition setting DB.
FIG. 8 is a flowchart showing an example of the procedure of main processing.
FIG. 9 is a flowchart showing an example of the procedure of command execution processing.
FIG. 10 is a flowchart showing an example of the procedure of playback processing.
FIG. 11 is a flowchart showing an example of the procedure of VR model creation processing.
FIG. 12 is an explanatory diagram showing an example of a presentation setting screen.
FIG. 13 is an explanatory diagram showing an example of a model creation screen.
FIG. 14 is an explanatory diagram showing an example of a speech setting screen.
FIG. 15 is an explanatory diagram showing an example of a presenter setting screen.
FIG. 16 is an explanatory diagram showing an example of a slide show setting screen.
FIG. 17 is an explanatory diagram showing an example of a slide playback screen and a presenter screen.
FIG. 18 is a flowchart showing another example of the procedure of playback processing.
FIG. 19 is a flowchart showing another example of the procedure of playback processing.
FIG. 20 is a flowchart showing an example of the procedure of script execution processing.
FIG. 21 is an explanatory diagram showing another example of the slide playback screen and the presenter screen.
FIG. 22 is a flowchart showing another example of the procedure of playback processing.
FIG. 23 is a flowchart showing another example of the procedure of playback processing.
FIG. 24 is a flowchart showing another example of the procedure of playback processing.

(Embodiment 1)
Embodiments are described below with reference to the drawings. First, the presentation data used in the following description is explained. Presentation data includes a plurality of slides. A slide is display data for presentation software, intended to be shown on a computer display or projected by a projector. A slide contains objects (display elements). Objects are text, figures, videos, tables, graphs, and the like. Each object has size, position, and tilt as attributes. A slide can also contain text (spoken text) that is not displayed when the slide is projected by a projector. This text is also called speaker notes, presenter notes, or simply notes. Speaker notes are not included in the image projected by the projector, but they can be displayed on the display of the computer running the presentation software.
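The data model described above can be sketched as follows. This is a minimal illustration in Python, not the file format of any actual presentation software; the class and field names (`DisplayObject`, `Slide`, `Presentation`, `speaker_notes`, `tilt_deg`) and the sample values are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DisplayObject:
    """A display element on a slide (text, figure, video, table, graph, ...)."""
    kind: str               # e.g. "text", "figure", "video", "table", "graph"
    width: float            # size
    height: float
    x: float                # position
    y: float
    tilt_deg: float = 0.0   # tilt (unit is an assumption)


@dataclass
class Slide:
    """One slide: projected objects plus speaker notes that are not projected."""
    objects: list[DisplayObject] = field(default_factory=list)
    speaker_notes: str = ""  # spoken text; hidden from the projected image

    def projected_content(self) -> list[DisplayObject]:
        # Only display objects reach the projector; the notes never do.
        return list(self.objects)


@dataclass
class Presentation:
    slides: list[Slide] = field(default_factory=list)


# Hypothetical two-slide deck.
deck = Presentation(slides=[
    Slide(objects=[DisplayObject("text", 400, 80, 20, 20)],
          speaker_notes="Welcome to the product overview."),
    Slide(objects=[DisplayObject("graph", 300, 200, 50, 120)],
          speaker_notes="Sales grew over the last year."),
])
```

Keeping the notes in a separate field from the display objects mirrors the distinction drawn in the text: the notes are what gets read aloud, while only the objects are projected.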

FIG. 1 is an explanatory diagram showing a configuration example of a presentation system. The presentation system 100 includes a playback device 1 and a speech synthesis server 2. The playback device 1 and the speech synthesis server 2 are connected via a network N so that they can communicate with each other. Although FIG. 1 shows only one playback device 1, there may be two or more. In FIG. 1, the playback device K is equivalent to the playback device 1 and is drawn as a conceptual diagram of its internal processing. The playback device 1 and the playback device K are both referred to as the playback device. The playback device may also be connected to a projector (for example, by a wired connection such as a USB cable or a VGA cable, or by a wireless connection such as Wi-Fi or Bluetooth (registered trademark)). In that case, the data shown on the display unit of the playback device, described later, is transmitted to the projector, and the output of the projector is projected onto a screen or the like to display the image.

The playback device is the device that the user uses for a presentation. The playback device is a notebook computer, a panel computer, a tablet computer, a smartphone, or the like. The logical processing of the playback device is illustrated by the playback device K. The playback device has the hardware configuration described later and holds presentation data K1, a VR (Virtual Reality) model DB K2, and setting data K3. The slide playback program K4 of one embodiment of the present application reads these data, transmits the text of the presenter notes to the speech synthesis server 2, and obtains the speech synthesis result. Furthermore, it displays slides from the slide data using a slide display program K5 (for example, Microsoft PowerPoint or Google Slides) and displays a VR avatar K6 using a VR engine. The slide playback program K4 displays and plays back the slide display K5, the VR avatar K6, and the speech synthesis result K7. Along with the slide display, the playback of the speech synthesis result, and the avatar display, the slide playback program K4 also controls slide page transitions automatically and synchronizes the display and playback of these elements. The speech synthesis server 2 has a speech synthesis engine. The speech synthesis server 2 receives text data from the playback device 1, uses a speech synthesis model to synthesize speech that reads the received text aloud, and returns the speech data to the playback device 1. The speech synthesis server 2 is a server computer, a workstation, or the like. The speech synthesis server 2 may also be a multicomputer consisting of a plurality of computers, a virtual machine constructed virtually by software, or a quantum computer. Furthermore, the functions of the speech synthesis server 2 may be realized by a cloud service.

FIG. 2 is a block diagram showing a hardware configuration example of the playback device. The playback device 1 includes a control unit 11, a main storage unit 12, an auxiliary storage unit 13, a communication unit 14, an input unit 15, a display unit 16, an audio output unit 17, and a reading unit 18. These units are connected to one another by a bus B.

The control unit 11 has one or more arithmetic processing units such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), or a GPU (Graphics Processing Unit). By reading and executing the control program 1P (slide playback program, program product) stored in the auxiliary storage unit 13, the control unit 11 performs the various information processing and control processing of the playback device 1 and realizes functional units such as an acquisition unit and an output unit.

The main storage unit 12 is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), a flash memory, or the like. The main storage unit 12 mainly holds, on a temporary basis, the data that the control unit 11 needs in order to execute arithmetic processing.

The auxiliary storage unit 13 is a hard disk, an SSD (Solid State Drive), or the like, and stores the control program 1P and the various DBs (databases) that the control unit 11 needs in order to execute processing. The auxiliary storage unit 13 stores a basic setting DB 131, a model DB 132, a speech setting DB 133, a screen setting DB 134, a transition setting DB 135, VR model data 136, and presentation data 137. The auxiliary storage unit 13 may be an external storage device connected to the playback device 1. The various DBs and other data stored in the auxiliary storage unit 13 may instead be stored on a database server or in cloud storage separate from the playback device 1. Alternatively, the contents of the basic setting DB 131, the model DB 132, the speech setting DB 133, the screen setting DB 134, and the transition setting DB 135 may be combined and stored in the auxiliary storage unit 13 as a single file.

The communication unit 14 communicates with the speech synthesis server 2 via the network N. The control unit 11 may also use the communication unit 14 to download the control program 1P from another computer via the network N or the like and store it in the auxiliary storage unit 13.

The input unit 15 includes a keyboard, a mouse, and the like. The display unit 16 includes a liquid crystal display panel or the like. The display unit 16 displays, among other things, the slides that make up the presentation data 137. The input unit 15 and the display unit 16 may be integrated to form a touch panel display. Furthermore, the playback device 1 may output its display to an external display device.

The audio output unit 17 includes a loudspeaker. The audio output unit 17 converts digital audio data into an analog audio signal and outputs it from the speaker.

The reading unit 18 reads a portable storage medium 1a, such as a CD (Compact Disc)-ROM or a DVD (Digital Versatile Disc)-ROM. The control unit 11 may read the control program 1P from the portable storage medium 1a via the reading unit 18 and store it in the auxiliary storage unit 13. The control unit 11 may also read the control program 1P from a semiconductor memory 1b.

Next, the databases are described. FIG. 3 is an explanatory diagram showing an example of the basic setting DB. The basic setting DB 131 stores the basic settings for slide playback. The basic setting DB 131 includes a model ID column and a URI column. The model ID column stores the ID of the VR model displayed as the presenter. The URI column stores the URI (Uniform Resource Identifier) of the presentation data.

FIG. 4 is an explanatory diagram showing an example of the model DB. The model DB 132 stores information on the VR models displayed as presenters. The model DB 132 includes a model ID column, a name column, a photo column, and a model column. The model ID column stores a model ID that uniquely identifies a VR model. The model ID is the primary key of the model DB 132, and the model ID column of the basic setting DB 131 described above stores the model ID as a foreign key. The name column stores the name of the VR model. The photo column stores the still image used to create the VR model. For a VR model prepared in advance, for example, the photo column need not store a still image. The model column stores information about the VR model itself. In the example shown in FIG. 4, the model column stores the name of the file corresponding to the VR model data 136. A VR model may also be generated from a video. In that case, a video column is provided instead of or in addition to the photo column. The video column stores the video used to create the VR model.

FIG. 5 is an explanatory diagram showing an example of the speech setting DB. The speech setting DB 133 stores the settings for the spoken voice. The speech setting DB 133 includes an engine column, a pitch column, a speed column, a language column, a gender column, and a voice model column. The engine column stores identification information of the speech synthesis engine used for speech synthesis. The pitch column stores the pitch of the synthesized speech. The speed column stores the speaking speed. The language column stores the language to be spoken. The gender column stores the gender of the spoken voice. The voice model column stores identification information (specifying information) of the voice model used for speech synthesis when the speech synthesis engine provides a plurality of voice models.

FIG. 6 is an explanatory diagram showing an example of the screen setting DB. The screen setting DB 134 stores the settings of the presenter screen on which the avatar image is displayed. The screen setting DB 134 includes a background image column, a width column, a height column, and a position column. The background image column stores information on the image displayed behind the avatar. The background image may be a still image or a video. In the example shown in FIG. 6, the background image column stores the name of a still image or video file. The width column stores the width of the presenter screen. The height column stores the height of the presenter screen. The position column stores where the presenter screen is displayed within the entire screen.

FIG. 7 is an explanatory diagram showing an example of the transition setting DB. The transition setting DB 135 stores the settings applied when one slide transitions to the next. The transition setting DB 135 includes a delay column and a switching column. The delay column stores the interval (hereinafter, "transition interval time") between the end of the read-aloud of the spoken text of the displayed slide and the transition to the next slide. The switching column stores the effect or motion used when switching from the current slide to the next slide.
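The five setting databases and the primary key and foreign key relationship between the model DB and the basic setting DB can be sketched as plain records. This is a minimal illustration only: the Python class names, field names, value types, units, and sample values below are assumptions, since the text names the columns but not their representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRecord:            # model DB (FIG. 4)
    model_id: str             # primary key
    name: str
    photo: Optional[str]      # source still image; None for a pre-prepared model
    model_file: str           # name of the file holding the VR model data


@dataclass
class BasicSetting:           # basic setting DB (FIG. 3)
    model_id: str             # foreign key into the model DB
    uri: str                  # URI of the presentation data


@dataclass
class SpeechSetting:          # speech setting DB (FIG. 5)
    engine: str
    pitch: int
    speed: int
    language: str
    gender: str
    voice_model: str


@dataclass
class ScreenSetting:          # screen setting DB (FIG. 6)
    background_image: str
    width: int
    height: int
    position: str


@dataclass
class TransitionSetting:      # transition setting DB (FIG. 7)
    delay_sec: float          # transition interval time (seconds assumed)
    effect: str               # switching effect or motion


def resolve_presenter(basic: BasicSetting,
                      models: dict[str, ModelRecord]) -> ModelRecord:
    """Follow the foreign key from the basic settings to the model record."""
    try:
        return models[basic.model_id]
    except KeyError:
        raise ValueError(f"unknown model ID: {basic.model_id}") from None


# Hypothetical sample data.
models = {"m001": ModelRecord("m001", "Presenter A", "a.jpg", "presenter_a.dat")}
basic = BasicSetting(model_id="m001", uri="file:///slides/demo.pptx")
presenter = resolve_presenter(basic, models)
```

Because all five records are small and flat, they could equally be serialized together into one file, which is the single-file option the description of the auxiliary storage unit mentions.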

Next, the processing performed in the presentation system 100 is described. FIG. 8 is a flowchart showing an example of the procedure of main processing. The control unit 11 of the playback device 1 reads the settings (step S1). The settings are stored in the basic setting DB 131, the speech setting DB 133, the screen setting DB 134, and the transition setting DB 135. The control unit 11 generates a setting screen based on the settings it has read and displays it on the display unit 16 (step S2). Because there are many setting items, they are divided into groups, and the setting screen shows each group on its own tab. The control unit 11 accepts an operation input from the user via the input unit 15 (step S3). The control unit 11 determines whether the operation input is a tab switch on the setting screen (step S4). If the control unit 11 determines that the operation input is a tab switch (YES in step S4), it switches the displayed tab to the specified tab (step S5). If the control unit 11 determines that the operation input is not a tab switch (NO in step S4), it determines whether the operation input is the entry of a setting (step S6). If the control unit 11 determines that the operation input is the entry of a setting (YES in step S6), it accepts the entry (step S7) and returns the processing to step S3. At this point, the accepted entry is reflected on the setting screen. If the control unit 11 determines that the operation input is not the entry of a setting (NO in step S6), it determines whether the operation input is an end instruction (step S8). If the control unit 11 determines that the operation input is not an end instruction (NO in step S8), it executes the command corresponding to the input (step S9) and returns the processing to step S3. If the control unit 11 determines that the operation input is an end instruction (YES in step S8), it ends the processing.
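The branching of the main processing in FIG. 8 amounts to a small event loop. The following Python sketch is illustrative only: the event encoding as `(kind, payload)` tuples, the function name `main_loop`, and the initial tab name are assumptions. It also assumes that control returns to input acceptance after a tab switch, which the text does not state explicitly.

```python
def main_loop(events, settings, run_command):
    """Process (kind, payload) events until an end instruction (steps S3 to S9)."""
    active_tab = "presentation"            # hypothetical initial tab
    for kind, payload in events:           # S3: accept an operation input
        if kind == "switch_tab":           # S4 YES -> S5: switch the displayed tab
            active_tab = payload
        elif kind == "setting":            # S6 YES -> S7: accept the setting entry
            key, value = payload
            settings[key] = value
        elif kind == "end":                # S8 YES: end the processing
            break
        else:                              # S8 NO -> S9: run the matching command
            run_command(kind, payload)
    return active_tab, settings


# Usage with a scripted input sequence; the last event is never reached.
executed = []
tab, cfg = main_loop(
    [("switch_tab", "speech"),
     ("setting", ("pitch", 2)),
     ("play", None),
     ("end", None),
     ("setting", ("pitch", 9))],
    {},
    lambda kind, payload: executed.append(kind),
)
```

The order of the tests matters: any input that is not a tab switch, a setting entry, or an end instruction falls through to command execution, which matches the order of the decisions in steps S4, S6, and S8.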

FIG. 9 is a flowchart showing an example of the procedure of command execution processing. The control unit 11 determines whether the command to be executed is slide playback (step S21). If the control unit 11 determines that the command is slide playback (YES in step S21), it plays back the slides (step S22). After playback is complete, the control unit 11 returns the processing to the caller. If the control unit 11 determines that the command is not slide playback (NO in step S21), it determines whether the command is VR model creation (step S23). If the control unit 11 determines that the command is VR model creation (YES in step S23), it creates a VR model (step S24). After the model is created, the control unit 11 returns the processing to the caller. If the control unit 11 determines that the command is not VR model creation (NO in step S23), it returns the processing to the caller.

FIG. 10 is a flowchart showing an example of the procedure of playback processing. The control unit 11 determines whether the settings required for playback have been made (step S31). If the control unit 11 determines that the required settings have not been made (NO in step S31), it displays an error (step S41) and returns the processing to the caller. The cases in which the required settings are judged not to have been made include the case where presentation data to be played back has been specified but the existence of that data cannot be confirmed. If the control unit 11 determines that the required settings have been made (YES in step S31), it acquires the VR model data (step S32). The control unit 11 acquires slide data (step S33). The control unit 11 displays the slide on the display unit 16 (step S34). The control unit 11 transmits the spoken text included in the slide data to the speech synthesis server 2 (step S35). The speech synthesis server 2 creates read-aloud voice data for the spoken text and transmits the created voice data to the playback device 1. The control unit 11 receives the voice data from the speech synthesis server 2 (step S36). The control unit 11 outputs a video (step S37). Specifically, the control unit 11 generates an avatar video (person video) from the VR model data and displays it on the display unit 16 while outputting the read-aloud voice of the spoken text from the audio output unit 17. The control unit 11 determines whether the output of the read-aloud voice has finished (step S38). If the control unit 11 determines that the output has not finished (NO in step S38), it executes step S38 again. If the control unit 11 determines that the output has finished (YES in step S38), it determines whether there is next slide data (step S39). If the control unit 11 determines that there is next slide data (YES in step S39), it determines whether the transition interval time (predetermined time) has elapsed (step S40). If the control unit 11 determines that the transition interval time has not elapsed (NO in step S40), it executes step S40 again. If the control unit 11 determines that the transition interval time has elapsed (YES in step S40), it returns the processing to step S33. If the control unit 11 determines that there is no next slide data (NO in step S39), it returns the processing to the caller.
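The per-slide loop of steps S33 to S40 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function and parameter names are hypothetical, slides are represented as dictionaries with a `notes` key, the speech synthesis server is replaced by an injected `synthesize` callable, and the polling in steps S38 and S40 is replaced by blocking calls. The settings check and VR model loading (steps S31 and S32) are omitted.

```python
import time
from typing import Callable, Sequence


def play_presentation(
    slides: Sequence[dict],
    synthesize: Callable[[str], bytes],
    show_slide: Callable[[dict], None],
    speak_with_avatar: Callable[[bytes], None],
    transition_delay_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Show each slide, read its notes aloud with the avatar, then advance."""
    for index, slide in enumerate(slides):     # S33: acquire slide data
        show_slide(slide)                      # S34: display the slide
        audio = synthesize(slide["notes"])     # S35, S36: send text, receive audio
        speak_with_avatar(audio)               # S37, S38: assumed to block until done
        if index < len(slides) - 1:            # S39: is there a next slide?
            sleep(transition_delay_sec)        # S40: wait the transition interval


# Usage with stand-ins that record what would happen.
log = []
play_presentation(
    slides=[{"title": "Intro", "notes": "Hello."},
            {"title": "End", "notes": "Thanks."}],
    synthesize=lambda text: text.encode("utf-8"),   # stand-in for the server
    show_slide=lambda s: log.append(("show", s["title"])),
    speak_with_avatar=lambda audio: log.append(("speak", audio)),
    transition_delay_sec=0.5,
    sleep=lambda sec: log.append(("wait", sec)),
)
```

Passing the synthesis, display, and sleep functions as parameters keeps the sketch runnable without a network or a display, and it makes the ordering visible: the wait occurs only between slides, after the read-aloud of the current slide has finished.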

FIG. 11 is a flowchart showing an example of the procedure of VR model creation processing. The control unit 11 of the playback device 1 acquires the image used to create the VR model (step S51). The image is a portrait photograph of a person. The control unit 11 creates a VR model from the acquired image (step S52). The control unit 11 recognizes the face in the photograph and generates a two-dimensional or three-dimensional VR model. The control unit 11 recognizes the eyes and mouth and creates animation that makes the model appear to blink and speak. Known techniques can be used to create the VR model, so a detailed description is omitted. The VR model may be created not by the playback device 1 but by an external server or a cloud service. The control unit 11 stores the created VR model itself in the auxiliary storage unit 13 and stores attribute data such as the name of the VR model in the model DB 132 (step S53), and then returns the processing to the caller.

Next, examples of the screens that the playback device 1 displays on the display unit 16 are described. FIG. 12 is an explanatory diagram showing an example of the presentation setting screen. The presentation setting screen d01 is the screen for making the minimum settings required to play back slides. The presentation setting screen d01 includes a model selection menu d011, a presentation data specification field d012, a browse button d013, and a play button d014. The model selection menu d011 is a pull-down menu for selecting the model of the presenter to be shown in the video. The presentation data specification field d012 is where the URI of the presentation data to be played back is entered. Selecting the browse button d013 opens a file selection dialog box, in which a file stored in the auxiliary storage unit 13 can be selected as the presentation data to be played back. Selecting the play button d014 starts slide playback (the slide show).

FIG. 13 is an explanatory diagram showing an example of the model creation screen. The model creation screen d02 is the screen used to create a VR model. The model creation screen d02 includes a name input field d021, a file selection button d022, and a create button d023. The name of the VR model to be created is entered in the name input field d021. Selecting the file selection button d022 opens a file selection dialog box, in which the photo file of the person on whom the VR model is based can be selected. Selecting the create button d023 creates VR data from the photo file. At this time, the playback device 1 recognizes the region of the photograph in which the person appears, and sets and stores everything outside that region as the background image.

FIG. 14 is an explanatory diagram showing an example of the speech setting screen. The speech setting screen d03 is a screen for configuring the read-aloud voice for the spoken text. The speech setting screen d03 includes an engine selection menu d031, a pitch input field d032, a speed input field d033, a language selection menu d034, a gender setting field d035, and a model selection menu d036. The engine selection menu d031 is a pull-down menu for selecting the speech synthesis engine used to create the read-aloud voice from the spoken text. The voice pitch (height) setting is entered in the pitch input field d032. Entering 0 creates the voice at the default pitch, a positive value creates it at a higher pitch than the default, and a negative value creates it at a lower pitch than the default. The speech speed is set in the speed input field d033. Entering 0 plays the voice at the default speed, a positive value plays it faster than the default, and a negative value plays it slower than the default. The language selection menu d034 is a menu for selecting the language of the voice to be created. The selected language must match the language in which the spoken text is written. The gender setting field d035 sets the gender of the voice. The model selection menu d036 is a pull-down menu for selecting a voice model. The voice models selectable in the model selection menu d036 vary depending on the settings of the engine selection menu d031, the language selection menu d034, and the gender setting field d035.
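The dependency between the speech settings and the selectable voice models can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class names, the `selectable_models` helper, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceModel:
    """One entry of a hypothetical voice model catalog."""
    name: str
    engine: str
    language: str
    gender: str


@dataclass
class SpeechSettings:
    """Values entered on the speech setting screen d03 (field names assumed)."""
    engine: str
    language: str
    gender: str
    pitch: int = 0  # 0 = default, positive = higher, negative = lower
    speed: int = 0  # 0 = default, positive = faster, negative = slower


def selectable_models(settings, catalog):
    """Return the models that match the chosen engine, language, and gender."""
    wanted = (settings.engine, settings.language, settings.gender)
    return [m for m in catalog if (m.engine, m.language, m.gender) == wanted]


# Hypothetical catalog used only for illustration.
CATALOG = [
    VoiceModel("voice-a", "engine-x", "ja", "female"),
    VoiceModel("voice-b", "engine-x", "ja", "male"),
    VoiceModel("voice-c", "engine-x", "en", "female"),
    VoiceModel("voice-d", "engine-y", "ja", "female"),
]
```

Changing any of the three filter settings would change the list offered by the model selection menu d036, while pitch and speed are independent offsets around a default of 0.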

If a model of the presenter's own voice is registered in the speech synthesis engine as a voice model, the presenter's own voice can be used. In this case, identification information such as the presenter's name (speaker identification information) is stored in the voice model column of the utterance setting DB 133. The voice model is created using, for example, WaveNet. WaveNet is built from a DNN (Deep Neural Network) and can learn the characteristics of a speaker's voice and synthesize speech.

FIG. 15 is an explanatory diagram showing an example of the presenter setting screen. The presenter setting screen d04 is a screen for configuring the presenter screen. The presenter setting screen d04 includes a background selection menu d041, a width setting field d042, a height setting field d043, and a position selection menu d044. The background selection menu d041 is a pull-down menu for selecting the image displayed behind the presenter on the presenter screen. The width of the presenter screen is entered in the width setting field d042. The height of the presenter screen is entered in the height setting field d043. The unit of width and height is, for example, pixels. The position selection menu d044 is a pull-down menu for selecting the display position of the presenter screen. The display position is relative to the screen on which the slide is displayed and is, for example, upper right, lower right, upper left, or lower left.
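As a rough sketch of how a relative display position could be turned into coordinates, the following assumes a pixel coordinate system with the origin at the top-left corner of the slide screen. The function name and the position strings are hypothetical.

```python
def presenter_origin(slide_w, slide_h, width, height, position):
    """Return the top-left corner (x, y) of the presenter screen.

    position is one of "upper-right", "lower-right", "upper-left",
    "lower-left" (assumed spelling). All sizes are in pixels.
    """
    x_by_side = {"left": 0, "right": slide_w - width}
    y_by_side = {"upper": 0, "lower": slide_h - height}
    vertical, horizontal = position.split("-")
    return x_by_side[horizontal], y_by_side[vertical]
```

For a 1920 x 1080 slide screen and a 320 x 240 presenter screen, "upper-right" places the presenter screen flush with the top and right edges.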

FIG. 16 is an explanatory diagram showing an example of the slide show setting screen. The slide show setting screen d05 is a screen for configuring slide playback. The slide show setting screen d05 includes a time setting field d051. When the playback device 1 finishes playing the read-aloud voice of the spoken text corresponding to the displayed slide, it displays the next slide, but an interval can be placed between the end of the voice playback and the display of the next slide. The time from the end of playback to the display of the next slide is entered in seconds in the time setting field d051.

FIG. 17 is an explanatory diagram showing examples of the slide playback screen and the presenter screen. In FIG. 17, the presenter screen d07 is displayed at the upper right of the slide playback screen d06. The presenter screen d07 includes a close button d071, a volume icon d072, a progress bar d073, a play/pause icon d074, and a display page icon d075. These are displayed when the mouse pointer is moved over the presenter screen d07. When the close button d071 is selected, slide playback stops and the presenter screen d07 closes. When the volume icon d072 is selected, a track bar is displayed, and the volume can be adjusted by dragging the knob of the track bar. The progress bar d073 shows the slide playback position as a track bar. Dragging the knob d0731 moves the displayed slide backward or forward. Pressing the left or right arrow key on the keyboard switches the displayed slide in the same way. Selecting the play/pause icon d074 during playback pauses playback, and selecting it while paused resumes playback. The display page icon d075 shows the sequence number of the slide displayed on the slide playback screen d06.

This embodiment has the following effects. In this embodiment, spoken text is set in each item of slide data constituting the presentation data, and the slides are played back in order while the read-aloud voice of the spoken text is output, so the presentation can be automated. In addition, since the presenter screen showing the video of the VR model is displayed together with the slides, a greater sense of presence can be given than when simply watching a video. Since slide playback can be paused, supplementary explanations can be given on matters not covered by the slides or the spoken text. It is also possible to accept and answer questions during the presentation. Furthermore, since a VR model can be created from a photograph, generating the VR model from a photograph of the actual presenter and performing speech synthesis with a WaveNet trained on the presenter's voice makes it possible to read the spoken text aloud with the presenter's own video (person video) and the presenter's own voice. This can give viewers the impression that the presenter is presenting in person on the spot. Finally, since the spoken text is written in the speaker notes, the content is easy to revise, and revisions are reflected in the presentation immediately. This makes quick responses and repeated fine adjustments easy.

(Embodiment 2)
This embodiment relates to a mode in which the spoken text is translated. The following description focuses on the differences from the embodiment described above. This embodiment addresses the case where the language in which the spoken text is written differs from the language of the read-aloud voice (the output language). For example, the spoken text is written in Japanese, and the presentation is given with English selected in the language selection menu d034 on the speech setting screen d03.

FIG. 18 is a flowchart showing another example of the playback procedure. Part of the flowchart shown in FIG. 18 is the same as FIG. 10. The control unit 11 determines whether the settings required for playback have been completed (step S61). If the control unit 11 determines that the settings required for playback have not been completed (NO in step S61), it displays an error (step S74) and returns the process to the caller. If the control unit 11 determines that the settings required for playback have been completed (YES in step S61), it acquires VR model data (step S62). The control unit 11 acquires slide data (step S63). The control unit 11 displays the slide on the display unit 16 (step S64). The control unit 11 determines the language in which the spoken text included in the slide data is written (step S65). The language can be determined by well-known techniques. One approach, for example, is to count the characters belonging to each language and calculate their ratio; since this is a known technique, its description is omitted. The control unit 11 determines whether the determined language matches the language of the read-aloud voice (step S66). If the control unit 11 determines that the written language does not match the language of the read-aloud voice (NO in step S66), it translates the spoken text (step S67). The translation may be performed by the playback device 1 or by a known cloud service. In the latter case, the control unit 11 transmits the spoken text to a translation service site and receives the translated spoken text. If the control unit 11 determines that the written language matches the language of the read-aloud voice (YES in step S66), it advances the process to step S68. The control unit 11 transmits the spoken text or the translated spoken text to the speech synthesis server 2 (step S68). The processing of steps S69 to S73 is the same as that of steps S36 to S40 shown in FIG. 10, so its description is omitted. The language determination mentioned above is disclosed in, for example, Nguyen Tuan Duc, "Latent Relational Web Search Engine Based on the Relational Similarity between Entity Pairs," 2012, University of Tokyo, doctoral dissertation No. Kō 28480.
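Steps S65 to S67 can be illustrated with a deliberately simple character-ratio check. This is only a sketch under strong assumptions: it distinguishes just Japanese and English by Unicode range, which is not necessarily the method of the cited reference, and `translate` is a hypothetical callable standing in for the translation service.

```python
def detect_language(text):
    """Guess "ja" or "en" from the share of characters in each script.

    Returns None when the text contains no countable characters.
    """
    counts = {"ja": 0, "en": 0}
    for ch in text:
        code = ord(ch)
        # Hiragana and katakana (U+3040-U+30FF) or CJK ideographs (U+4E00-U+9FFF).
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            counts["ja"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    if sum(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


def prepare_text(spoken_text, output_language, translate):
    """Translate only when the written language differs (steps S66 and S67)."""
    source = detect_language(spoken_text)
    if source is not None and source != output_language:
        return translate(spoken_text, source, output_language)
    return spoken_text
```

The text returned by `prepare_text` is what would be sent on to speech synthesis in step S68.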

In addition to the effects of the embodiment described above, this embodiment has the following effect. A presentation can be given even when the language in which the spoken text is written differs from the language of the read-aloud voice. The text data contained in the slide may also be extracted, translated into the language of the read-aloud voice, and displayed.

(Embodiment 3)
This embodiment relates to a mode in which the pointer of a pointing device is controlled during slide playback. The following description focuses on the differences from the embodiments described above. In this embodiment, an instruction for controlling the pointer (a control instruction) can be written in the spoken text. For example, the spoken text is written as follows.

"AM Talk is a virtual presenter application that uses multimodal AI and RPA technology. AM Talk can play slides automatically. It reads the slide content aloud in a synthesized voice and automatically controls slide page turning.<script>mouse_move(PRESENWIN, CENTER)</script>It can generate the presenter's facial animation."

FIG. 19 is a flowchart showing another example of the playback procedure. Part of the flowchart shown in FIG. 19 is the same as FIG. 10. The control unit 11 determines whether the settings required for playback have been completed (step S91). If the control unit 11 determines that the settings required for playback have not been completed (NO in step S91), it displays an error (step S104) and returns the process to the caller. If the control unit 11 determines that the settings required for playback have been completed (YES in step S91), it acquires VR model data (step S92). The control unit 11 acquires slide data (step S93). The control unit 11 displays the slide on the display unit 16 (step S94). The control unit 11 searches the spoken text included in the slide data for a script (step S95). From the search result, the control unit 11 determines whether a script is written in the spoken text (step S96). If the control unit 11 determines that no script is written in the spoken text (NO in step S96), it advances the process to step S97. The processing of steps S97 to S102 is the same as that of steps S35 to S40 shown in FIG. 10, so its description is omitted. If the control unit 11 determines that a script is written in the spoken text (YES in step S96), it executes the script execution subroutine (step S103). The control unit 11 then executes step S101 and the subsequent steps.

FIG. 20 is a flowchart showing an example of the script execution procedure. The control unit 11 splits the spoken text included in the slide data before and after the script (step S111). The control unit 11 transmits the split pieces of spoken text individually to the speech synthesis server 2 (step S112). The control unit 11 receives voice data from the speech synthesis server 2 (step S113). At this time, the control unit 11 stores the voice data, in the order in which the text is written, in a temporary storage area provided in the main storage unit 12 or the auxiliary storage unit 13, so that the voice data for the spoken text before the script can be distinguished from the voice data for the spoken text after the script. It is also desirable to store, in the temporary storage area, data from which the script execution timing can be determined. For example, the array "TEXT1, SCRIPT1, TEXT2" is stored. TEXT1 denotes the spoken text before the script, TEXT2 the spoken text after the script, and SCRIPT1 the script. By referring to this array, the control unit 11 can execute the script partway through the voice output. The control unit 11 starts the video output (step S114). The control unit 11 selects the execution data (step S115). The control unit 11 determines whether the execution data is voice data, that is, whether voice output is to be performed (step S116). If the control unit 11 determines that voice output is to be performed (YES in step S116), it performs the voice output (step S117). The control unit 11 determines whether the voice output has ended (step S118). If the control unit 11 determines that the voice output has not ended (NO in step S118), it performs step S118 again. If the control unit 11 determines that the voice output has ended (YES in step S118), it determines whether there is further processing to execute (step S119). The processing to execute is either voice output or script execution. If the control unit 11 determines that there is further processing to execute (YES in step S119), it returns the process to step S115. If the control unit 11 determines that there is no further processing to execute (NO in step S119), it returns the process to the caller. If the control unit 11 determines that voice output is not to be performed (NO in step S116), it executes the script (step S120) and moves the process to step S119. The determinations in steps S116 and S119 can be made, for example, by referring to the array described above.
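The splitting of step S111 and the loop of steps S115 to S120 can be sketched as follows. This simplifies the flow described above: speech synthesis is folded into a hypothetical `speak` callable that is assumed to block until the voice output ends, and `run_script` is likewise a hypothetical callable. Only the `<script>...</script>` tag syntax is taken from the example spoken text.

```python
import re

SCRIPT_RE = re.compile(r"<script>(.*?)</script>", re.DOTALL)


def split_spoken_text(spoken_text):
    """Split spoken text into ordered ("TEXT", ...) and ("SCRIPT", ...) items."""
    items = []
    pos = 0
    for match in SCRIPT_RE.finditer(spoken_text):
        before = spoken_text[pos:match.start()].strip()
        if before:
            items.append(("TEXT", before))
        items.append(("SCRIPT", match.group(1).strip()))
        pos = match.end()
    rest = spoken_text[pos:].strip()
    if rest:
        items.append(("TEXT", rest))
    return items


def run_items(items, speak, run_script):
    """Process the items in written order (steps S115 to S120)."""
    for kind, payload in items:
        if kind == "TEXT":
            speak(payload)       # voice output; assumed to return when it ends
        else:
            run_script(payload)  # script execution
```

The list returned by `split_spoken_text` plays the role of the "TEXT1, SCRIPT1, TEXT2" array: its order alone determines when each script runs relative to the voice output.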

With this playback processing, for the spoken text above, the read-aloud voice for "AM Talk is a virtual presenter application that uses multimodal AI ... automatically controls slide page turning." is output first. The script is then executed, and the pointer of the pointing device moves to the center of the presenter screen. Finally, the read-aloud voice for "It can generate the presenter's facial animation." is output.

In addition to the effects of the embodiments described above, this embodiment has the following effects. In this embodiment, a script can control operations such as moving the pointer of the pointing device. Because the viewers are shown the part of the slide that deserves attention, the presentation becomes more effective. Pointer movement control is only one example of a script, and other controls are possible. For example, one slide effect adds text to the display each time the mouse is clicked, instead of displaying all the text in the slide at once. Such an effect can be carried out without human intervention by writing a mouse-click script in the spoken text and executing it. The translation function shown in Embodiment 2 may also be provided in this embodiment. A script may also emulate keyboard operations.

(Embodiment 4)
This embodiment relates to a mode in which the presenter rendered with the VR model is controlled. The following description focuses on the differences from the embodiments described above. This embodiment is related to Embodiment 3.

In the embodiments described above, the presenter rendered with the VR model on the presenter screen moves the eyes and mouth. In this embodiment, gestures are also possible. To make the presenter gesture, a script is written in the spoken text.

The gestures performed by the presenter are assumed to be, for example, a command to point in a predetermined direction and a command to return to the normal posture. The predetermined directions are upper right, directly above, upper left, lower left, directly below, lower right, and so on. For example, prstr_pose(argument) is provided as a function to be written in the script. The arguments are UR (upper right), DA (directly above), UL (upper left), LL (lower left), DB (directly below), LR (lower right), and NR (normal). When the argument LL is specified, the presenter points toward the lower left of the presenter screen. When the argument NR is specified, the presenter returns from the pointing posture to the normal posture.
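The argument set of prstr_pose can be modeled as an enumeration. The sketch below only validates an argument and returns the corresponding pose. How the pose is rendered by the VR model is outside its scope, and the `Pose` class and the error handling are assumptions.

```python
from enum import Enum


class Pose(Enum):
    """Arguments accepted by prstr_pose, as listed in the description."""
    UR = "upper right"
    DA = "directly above"
    UL = "upper left"
    LL = "lower left"
    DB = "directly below"
    LR = "lower right"
    NR = "normal"


def prstr_pose(argument):
    """Resolve a script argument such as "LL" to a Pose (hypothetical behavior)."""
    try:
        return Pose[argument]
    except KeyError:
        raise ValueError(f"unknown pose argument: {argument!r}") from None
```

Rejecting unknown arguments early keeps a typo in the spoken text from being silently ignored during playback.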

In this embodiment, the playback processing performed by the playback device 1 is the same as in the embodiments described above, so its description is omitted. Control that makes the presenter gesture by means of the VR model is possible with known techniques, so its description is also omitted.

FIG. 21 is an explanatory diagram showing another example of the slide playback screen and the presenter screen. In FIG. 21, the presenter screen d07 is displayed at the bottom center of the slide playback screen d06. The presenter shown on the presenter screen d07 is pointing directly upward. On the presenter screen d07 of FIG. 21 the presenter holds a pointer stick, but the stick does not necessarily have to be displayed.

In addition to the effects of the embodiments described above, this embodiment has the following effect. Having the presenter gesture can be expected to draw the viewers' attention to the content of the slide.

(Embodiment 5)
This embodiment relates to a mode of cooperative operation with other application software. The following description focuses on the differences from the embodiments described above.

First, the processing performed when the display elements of a slide include a video will be described. FIGS. 22 and 23 are flowcharts showing another example of the playback procedure. Part of the flowcharts shown in FIGS. 22 and 23 is the same as FIG. 19. The control unit 11 determines whether the settings required for playback have been completed (step S131). If the control unit 11 determines that the settings required for playback have not been completed (NO in step S131), it displays an error (step S151) and returns the process to the caller. If the control unit 11 determines that the settings required for playback have been completed (YES in step S131), it acquires VR model data (step S132). The control unit 11 acquires slide data (step S133). The control unit 11 searches the display elements included in the slide data for a video (step S134). From the search result, the control unit 11 determines whether the display elements include a video (step S135). If the control unit 11 determines that the display elements include a video (YES in step S135), it searches the spoken text for a script (step S136). From the search result, the control unit 11 determines whether a script is written in the spoken text (step S137). If the control unit 11 determines that no script is written in the spoken text (NO in step S137), it displays the slide and starts playing the video that is a display element (step S138). If necessary, the control unit 11 launches a video playback application to play the video. At this time, it is desirable to minimize the slide playback screen and the presenter screen and to display the video playback screen in full screen. The control unit 11 determines whether the video playback has ended (step S139). If the control unit 11 determines that the video playback has not ended (NO in step S139), it executes step S139 again. If the control unit 11 determines that the video playback has ended (YES in step S139), it restores the screen display to its state before the video playback and advances the process to step S149 (FIG. 23). A determination of NO in step S137 presupposes that no spoken text has been written. Even if something is written in the spoken text, its content is ignored and no read-aloud voice is output. This is because, when the spoken text contains no script, the control unit 11 cannot determine when to play the video.

If the control unit 11 determines that a script is written in the spoken text (YES in step S137), it displays the slide on the display unit 16 (step S140). The control unit 11 executes the script execution subroutine (step S141). Here it is presupposed that the script written in the spoken text contains a video playback instruction. If no video playback instruction is written in the spoken text, the video is not played. In the script execution processing, a video playback instruction, once executed, does not complete until the video playback has finished. After executing the script execution subroutine, the control unit 11 moves the process to step S149.

If the control unit 11 determines from the search result that the display elements include no video (NO in step S135), it searches the spoken text included in the slide data for a script (step S142). From the search result, the control unit 11 determines whether a script is written in the spoken text (step S143). If the control unit 11 determines that no script is written in the spoken text (NO in step S143), it transmits the spoken text to the speech synthesis server 2 (step S144). The control unit 11 receives voice data from the speech synthesis server 2 (step S145). The control unit 11 displays the slide on the display unit 16 (step S146). The control unit 11 moves the process to step S147 (FIG. 23). The control unit 11 outputs the presenter video (step S147). Steps S148 to S150 are the same as steps S38 to S40 in FIG. 10, so their description is omitted. If the control unit 11 determines that a script is written in the spoken text (YES in step S143), it moves the process to step S140.
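The branching of FIGS. 22 and 23 reduces to two yes/no questions per slide: whether the display elements include a video, and whether the spoken text contains a script. The sketch below captures only that decision. The function name and the returned labels are hypothetical, and the step numbers in the comments refer to the description above.

```python
def choose_action(has_video, has_script):
    """Pick the playback branch for one slide."""
    if has_script:
        # S137 YES or S143 YES: display the slide, then run the script
        # subroutine (S140, S141). A video plays only if the script says so.
        return "show_slide_then_run_script"
    if has_video:
        # S137 NO: display the slide and play the video; any spoken text
        # is ignored (S138, S139).
        return "show_slide_and_play_video"
    # S143 NO: ordinary read-aloud playback (S144 to S147).
    return "synthesize_and_read_aloud"
```

Note that a script takes priority regardless of whether a video is present, which matches both YES branches converging on step S140.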

The playback of video has been described above; the same applies when a URL (link information) is included in the display elements. However, unlike the case of video, even when the spoken text contains no script, the Internet browser is not launched immediately to output the data specified by the URL. Instead, the spoken text is searched for a URL. If the search finds a URL in the spoken text, this is interpreted as a script that launches the Internet browser to output the URL written there. As with video playback, when the Internet browser is displayed, it is desirable to minimize the slide playback screen and the presenter screen and to display the Internet browser in full screen. If no script that ends the display in the Internet browser and returns to slide playback is written in the spoken text, the control unit 11 ends the display in the Internet browser and returns to slide playback after a predetermined time has elapsed.
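Searching the spoken text for a URL could be done with a regular expression, as in the sketch below. The pattern is an assumption: it accepts only http and https URLs made of ASCII URL characters, so that it stops at the first Japanese character when a URL is embedded in Japanese text without spaces.

```python
import re

# ASCII characters allowed in a URL; matching stops at any other character.
# Trailing ASCII punctuation such as "." is still captured and may need trimming.
URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def find_urls(spoken_text):
    """Return all http(s) URLs found in the spoken text, in order."""
    return URL_RE.findall(spoken_text)
```

An empty result would mean the slide is handled as an ordinary slide without launching the Internet browser.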

本実施の形態は上述の実施の形態が奏する効果に加えて、以下の効果を奏する。スライドの再生途中で、他のアプリケーションの実行が可能となるので、発表内容をより充実させることが可能となる。なお、他のアプリケーションにおいても、スクリプトの実行が可能である場合、他のアプリケーションでもスクリプトを実行させれば、スライド再生で行える動作が多彩となり、発表内容をさらに充実させることが可能となる。 In addition to the effects of the embodiments described above, this embodiment has the following effect. Because other applications can be executed partway through slide reproduction, the content of the presentation can be made richer. Furthermore, if the other application can itself execute scripts, having it execute scripts as well widens the range of operations available during slide reproduction and makes it possible to enrich the presentation further.

なお、表示要素に動画を含めていない場合でも、ＵＲＬで動画ファイル等を指定すれば、インターネットブラウザを、利用して又は介して、動画の再生が可能である。また、動画再生中にスクリプトの実行を可能とし、スクリプトでマウスポインタの位置制御とクリック操作を行えば、動画を一時停止して、発話テキストを読み上げ音声を出力し、音声が終了したら、動画の再生を再開するなどの動作も可能である。 Even when the display elements do not include a moving image, a moving image can be reproduced using or via the Internet browser if a moving image file or the like is specified by a URL. In addition, if scripts can be executed during moving image reproduction and a script controls the position of the mouse pointer and performs click operations, operations such as pausing the moving image, outputting the read-aloud voice of the spoken text, and resuming reproduction of the moving image once the voice has finished also become possible.

（実施の形態６） (Embodiment 6)
本実施の形態はスライドデータに発表者ノートが含まれていない場合の動作に関する形態である。以下の説明において、上述の実施の形態と異なる点を主に説明する。本実施の形態においては、スライドデータに発表者ノートが含まれていない場合、スライドデータに含まれるオブジェクトを利用して、発話テキストを作成する。
This embodiment relates to the operation when the slide data does not contain the presenter's notes. In the following description, differences from the above-described embodiments will be mainly described. In this embodiment, if the slide data does not contain the presenter's notes, the objects included in the slide data are used to create the spoken text.

図２４は再生処理の他の手順例を示すフローチャートである。図２４は、図１０に示した再生処理に新たな処理を追加することを示している。再生処理において、制御部１１はスライド表示（ステップＳ３４）を行った後、スライドデータに発表者ノートが含まれているか否かを判定する（ステップＳ１６１）。制御部１１はスライドデータに発表者ノートが含まれていると判定した場合（ステップＳ１６１でＹＥＳ）、処理を図１０のステップＳ３５へ移す。制御部１１はスライドデータに発表者ノートが含まれていないと判定した場合（ステップＳ１６１でＮＯ）、スライドを構成するオブジェクトを対象に、テキストオブジェクトを探索する（ステップＳ１６２）。制御部１１はテキストオブジェクトがあるか否かを判定する（ステップＳ１６３）。制御部１１はテキストオブジェクトがあると判定した場合（ステップＳ１６３でＹＥＳ）、テキストオブジェクトのテキストから発話テキストを作成する（ステップＳ１６４）。例えば、箇条書きのテキストが得られた場合、助詞や助動詞等を補い、文章作成し、発話テキストとする。制御部１１は処理を図１０のステップＳ３５へ移す。制御部１１はテキストオブジェクトがないと判定した場合（ステップＳ１６３でＮＯ）、画像オブジェクトに対して文字認識を行なう（ステップＳ１６５）。例えば、ＯＣＲ（Optical character recognition）技術を用いる。制御部１１は認識処理の結果、文字が得られたか否かを判定する（ステップＳ１６６）。制御部１１は文字が得られたと判定した場合（ステップＳ１６６でＹＥＳ）、処理をステップＳ１６４へ移す。制御部１１は文字が得られなかったと判定した場合（ステップＳ１６６でＮＯ）、スライドデータに含まれる画像オブジェクトを選択し、画像を説明するキャプションを生成し（ステップＳ１６７）、処理を図１０のステップＳ３５へ移す。キャプションの生成には、画像キャプション自動生成ＡＩを用いる。例えば、画像キャプション自動生成ＡＩはＣＮＮ（Convolutional Neural Network）とＬＳＴＭ（Long Short Term Memory）とを組み合わせた深層学習モデルを用いる。当該学習モデルでは次の手順で学習を行なう。学習済みＣＮＮで画像の特徴量を抽出する。ＬＳＴＭで文章の特徴量を抽出する。ＣＮＮとＬＳＴＭの特徴量を結合する。Softmax関数で次に来る単語を予測する。これらのステップを繰り返すことで、画像のキャプションを学習モデルは生成する。学習モデルが生成したキャプションが正解のキャプションに近づくように、学習モデルを訓練する。訓練済みの学習モデルにおいて、ＣＮＮに画像を入力し、ＬＳＴＭに文開始記号を入力すると、キャプションを生成することができる。
FIG. 24 is a flowchart showing another example of the reproduction procedure. FIG. 24 adds new processing to the reproduction process shown in FIG. 10. In the reproduction process, after displaying the slide (step S34), the control unit 11 determines whether the slide data includes the presenter's notes (step S161). When the control unit 11 determines that the slide data includes the presenter's notes (YES in step S161), it moves the process to step S35 in FIG. 10. When the control unit 11 determines that the slide data does not include the presenter's notes (NO in step S161), it searches the objects that make up the slide for text objects (step S162). The control unit 11 determines whether there is a text object (step S163). When the control unit 11 determines that there is a text object (YES in step S163), it creates spoken text from the text of the text object (step S164). For example, when bulleted text is obtained, particles, auxiliary verbs, and the like are supplied to form sentences, which become the spoken text. The control unit 11 moves the process to step S35 in FIG. 10. When the control unit 11 determines that there is no text object (NO in step S163), it performs character recognition on the image objects (step S165), for example using OCR (optical character recognition) technology. The control unit 11 determines whether characters were obtained as a result of the recognition processing (step S166). When the control unit 11 determines that characters were obtained (YES in step S166), it moves the process to step S164. When the control unit 11 determines that no characters were obtained (NO in step S166), it selects an image object included in the slide data, generates a caption describing the image (step S167), and moves the process to step S35 in FIG. 10. An automatic image caption generation AI is used to generate the caption. For example, the automatic image caption generation AI uses a deep learning model that combines a CNN (Convolutional Neural Network) and an LSTM (Long Short Term Memory). The learning model is trained by the following procedure. A pre-trained CNN extracts the feature values of the image. The LSTM extracts the feature values of the text. The CNN and LSTM feature values are combined. A softmax function predicts the next word. By repeating these steps, the learning model generates a caption for the image. The learning model is trained so that the caption it generates approaches the correct caption. In the trained learning model, inputting an image to the CNN and a sentence start symbol to the LSTM produces a caption.
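The fallback order of steps S161 to S167 can be sketched as below. This is only an illustrative outline under stated assumptions: the `Slide` class and the function `build_speech_text` are hypothetical names, and the sentence builder, OCR routine, and caption model are passed in as callables because the description does not tie them to any particular library. Returning an empty string for a slide with no images at all is likewise an assumption; the flowchart does not cover that case.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:
    """Hypothetical container for the parts of one slide used here."""
    notes: str = ""                                   # presenter's notes
    text_objects: list = field(default_factory=list)  # text of text objects
    images: list = field(default_factory=list)        # raw image objects


def build_speech_text(slide, to_sentences, ocr, caption):
    """Choose the spoken text for a slide, following S161 to S167.

    to_sentences: callable turning a list of text fragments into prose (S164)
    ocr:          callable returning recognized text for one image (S165)
    caption:      callable returning a caption for one image (S167)
    """
    # S161: presenter's notes take priority when they exist.
    if slide.notes.strip():
        return slide.notes

    # S162 to S164: otherwise build sentences from the text objects.
    texts = [t for t in slide.text_objects if t.strip()]
    if texts:
        return to_sentences(texts)

    # S165, S166: otherwise try character recognition on the image objects.
    recognized = [r for r in (ocr(img) for img in slide.images) if r.strip()]
    if recognized:
        return to_sentences(recognized)

    # S167: otherwise caption one image object.
    if slide.images:
        return caption(slide.images[0])

    # Assumption: nothing to say for a slide with no usable content.
    return ""
```

Passing the three processing stages as parameters lets the same control flow be exercised with simple stand-ins, and lets a CNN and LSTM captioner of the kind described be plugged in later without changing the branching.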

上述の説明において、実施の形態１の再生処理を変形する例を述べたが、それに限らない。他の実施形態の再生処理を変形することも可能である。 In the above description, an example of modifying the reproduction processing of the first embodiment was described, but the present invention is not limited to this. It is also possible to modify the playback process of other embodiments.

本実施の形態は上述の実施の形態が奏する効果に加えて、以下の効果を奏する。発表者ノートを用意しなくとも、ＶＲモデルを用いた発表者による発表の自動化が可能となる。 This embodiment has the following effects in addition to the effects of the above-described embodiments. It is possible to automate the presentation by the presenter using the VR model without preparing the presenter's notes.

各実施の形態で記載されている技術的特徴（構成要件）はお互いに組み合わせ可能であり、組み合わせすることにより、新しい技術的特徴を形成することができる。
今回開示された実施の形態はすべての点で例示であって、制限的なものではないと考えられるべきである。本発明の範囲は、上記した意味ではなく、特許請求の範囲によって示され、特許請求の範囲と均等の意味及び範囲内でのすべての変更が含まれることが意図される。
The technical features (constituent elements) described in the embodiments can be combined with one another, and new technical features can be formed by combining them.
The embodiments disclosed here are illustrative in all respects and should not be considered restrictive. The scope of the present invention is indicated by the claims rather than by the meaning described above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

１００プレゼンテーションシステム
１再生装置
１Ｐ制御プログラム
１１制御部
１２主記憶部
１３補助記憶部
１３１基本設定ＤＢ
１３２モデルＤＢ
１３３発話設定ＤＢ
１３４画面設定ＤＢ
１３５遷移設定ＤＢ
１３６ＶＲモデルデータ
１３７プレゼンテーションデータ
１４通信部
１５入力部
１６表示部
１７音声出力部
１８読み取り部
１ａ可搬型記憶媒体
１ｂ半導体メモリ
２音声合成サーバ
Ｂバス
Ｎネットワーク
REFERENCE SIGNS LIST
100 presentation system
1 playback device
1P control program
11 control unit
12 main storage unit
13 auxiliary storage unit
131 basic setting DB
132 model DB
133 utterance setting DB
134 screen setting DB
135 transition setting DB
136 VR model data
137 presentation data
14 communication unit
15 input unit
16 display unit
17 audio output unit
18 reading unit
1a portable storage medium
1b semiconductor memory
2 speech synthesis server
B bus
N network

本願の一態様に係るスライド再生プログラムは、発話テキストと表示要素とを含むスライドデータを複数含むプレゼンテーションデータを取得し、被写体に人物を含む１枚の静止画像を取得し、取得した前記静止画像において人物の領域を認識し、前記領域以外を背景として設定し、前記静止画像に基づいて、人物動画を作成し、複数の前記スライドデータそれぞれに含む前記表示要素を所定の順番で出力するとともに、出力している前記スライドデータに含む前記発話テキストの読み上げ音声を、前記人物動画を付して出力する処理をコンピュータに行わせることを特徴とする。
A slide reproduction program according to one aspect of the present application causes a computer to perform processing of: acquiring presentation data including a plurality of slide data, each including spoken text and a display element; acquiring one still image that includes a person as a subject; recognizing the region of the person in the acquired still image and setting everything outside that region as background; creating a person moving image based on the still image; outputting the display elements included in the respective slide data in a predetermined order; and outputting the read-aloud voice of the spoken text included in the slide data being output, accompanied by the person moving image.

再生装置はユーザがプレゼンテーションに用いる装置である。再生装置はノートパソコン、パネルコンピュータ、タブレットコンピュータ、スマートフォン等で構成する。再生装置の論理的な処理は再生装置Ｋで示す。再生装置は後述のハードウェア構成で、プレゼンテーションデータＫ１、ＶＲ（Virtual Reality：バーチャルリアリティー）モデルＤＢＫ２、設定データＫ３を保持している。本願における一つの実施形態のスライド再生プログラムＫ４はこれらのデータを読み込み、発表者ノートのテキストを音声合成サーバ２に送信し、音声合成結果を得る。更に、スライドデータからスライド表示プログラム（例えば、ＭｉｃｒｏｓｏｆｔＰｏｗｅｒＰｏｉｎｔ，Ｇｏｏｇｌｅプレゼンテーションなど）でスライドを表示し、ＶＲエンジンでＶＲアバターＫ６を表示させる。スライド再生プログラムＫ４はスライド表示Ｋ５、ＶＲアバターＫ６及び音声合成結果Ｋ７を表示、再生する。また、スライド再生プログラムＫ４はスライド表示、音声合成結果の再生、アバター表示と同時に、スライドのページ遷移の制御も自動的に行い、これらの要素の表示、再生を同期化する。音声合成サーバ２は音声合成エンジンを備える。音声合成サーバ２は再生装置１からテキストデータを受け付け、音声合成モデルを用いて受け付けたテキストを読み上げる音声を合成し、音声データを再生装置１へ返信する。音声合成サーバ２はサーバコンピュータ、ワークステーション等で構成する。また、音声合成サーバ２を複数のコンピュータからなるマルチコンピュータ、ソフトウェアによって仮想的に構築された仮想マシン又は量子コンピュータで構成してもよい。さらに、音声合成サーバ２の機能をクラウドサービスで実現してもよい。
The playback device is a device that a user uses for a presentation. The playback device is a notebook computer, a panel computer, a tablet computer, a smartphone, or the like. The logical processing of the playback device is denoted as playback device K. With the hardware configuration described later, the playback device holds presentation data K1, a VR (Virtual Reality) model DB K2, and setting data K3. The slide reproduction program K4 of one embodiment of the present application reads these data, transmits the text of the presenter's notes to the speech synthesis server 2, and obtains the speech synthesis result. It further displays the slides from the slide data with a slide display program (for example, Microsoft PowerPoint or Google Presentation) and displays the VR avatar K6 with a VR engine. The slide reproduction program K4 displays and reproduces the slide display K5, the VR avatar K6, and the speech synthesis result K7. At the same time as the slide display, the reproduction of the speech synthesis result, and the avatar display, the slide reproduction program K4 also automatically controls the page transitions of the slides and synchronizes the display and reproduction of these elements. The speech synthesis server 2 includes a speech synthesis engine. The speech synthesis server 2 receives text data from the playback device 1, uses a speech synthesis model to synthesize a voice that reads the received text aloud, and returns the voice data to the playback device 1. The speech synthesis server 2 is a server computer, a workstation, or the like. The speech synthesis server 2 may also be configured as a multicomputer made up of a plurality of computers, a virtual machine constructed virtually by software, or a quantum computer. Furthermore, the functions of the speech synthesis server 2 may be implemented as a cloud service.
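The synchronization between slide display, synthesized speech, and page transition described above might be organized along the following lines. This is a hedged sketch rather than the program K4 itself: `play_presentation` and all of its parameters are hypothetical, the speech synthesis server, slide display program, and VR engine are represented by plain callables, and `speak_with_avatar` is assumed to block until the voice has finished.

```python
import time


def play_presentation(slides, synthesize, show_slide, speak_with_avatar,
                      pause_after=0.0, sleep=time.sleep):
    """Play slides in order, advancing only after each voice has finished.

    slides:            iterable of (display_element, speech_text) pairs
    synthesize:        callable sending text to a synthesis service and
                       returning voice data (stands in for server 2)
    show_slide:        callable displaying one display element
    speak_with_avatar: callable playing voice data together with the
                       avatar; assumed to return when playback ends
    pause_after:       optional extra wait before the next slide
    sleep:             injectable wait function, so tests need not wait
    """
    for display_element, speech_text in slides:
        voice = synthesize(speech_text)   # obtain the read-aloud voice
        show_slide(display_element)       # show the slide
        speak_with_avatar(voice)          # voice and avatar, blocking
        if pause_after > 0:
            sleep(pause_after)            # wait before the page transition
```

Because the loop only moves on after `speak_with_avatar` returns, the page transition stays tied to the end of the voice output, and `pause_after` adds an optional further wait before the next slide is shown.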

Claims

発話テキストと表示要素とを含むスライドデータを複数含むプレゼンテーションデータを取得し、
複数の前記スライドデータそれぞれに含む前記表示要素を所定の順番で出力するとともに、出力している前記スライドデータに含む前記発話テキストの読み上げ音声を、人物動画を付して出力する
処理をコンピュータに行わせることを特徴とするスライド再生プログラム。
1. A slide reproduction program that causes a computer to perform processing of: acquiring presentation data including a plurality of slide data, each including spoken text and a display element; and outputting the display elements included in the respective slide data in a predetermined order while outputting a read-aloud voice of the spoken text included in the slide data being output, accompanied by a person moving image.

１枚の静止画像を取得し、
取得した前記静止画像に基づいて、前記人物動画を作成する
ことを特徴とする請求項１に記載のスライド再生プログラム。
2. The slide reproduction program according to claim 1, wherein one still image is acquired, and the person moving image is created based on the acquired still image.

出力言語を取得し、
前記発話テキストを前記出力言語に翻訳し、翻訳した発話テキストの読み上げ音声を出力する
ことを特徴とする請求項１又は請求項２に記載のスライド再生プログラム。
3. The slide reproduction program according to claim 1 or claim 2, wherein an output language is acquired, the spoken text is translated into the output language, and a read-aloud voice of the translated spoken text is output.

出力している前記表示要素に対応する前記発話テキストの読み上げ音声の出力完了後に、前記スライドデータの次のスライドデータの前記表示要素を出力する
ことを特徴とする請求項１から請求項３のいずれか一項に記載のスライド再生プログラム。
4. The slide reproduction program according to any one of claims 1 to 3, wherein, after output of the read-aloud voice of the spoken text corresponding to the display element being output is completed, the display element of the slide data following that slide data is output.

前記読み上げ音声の出力完了後、さらに所定時間の経過後に、前記スライドデータの次のスライドデータの前記表示要素を出力する
ことを特徴とする請求項４に記載のスライド再生プログラム。
5. The slide reproduction program according to claim 4, wherein the display element of the slide data following that slide data is output after a predetermined time has further elapsed following completion of output of the read-aloud voice.

性別を含む音声合成モデルを特定する特定情報、並びに、声の高さ及び発話の速さを受け付け、
前記特定情報に対応した前記音声合成モデルに基づき、受け付けた声の高さ、及び、発話の速さで、前記発話テキストの読み上げ音声を出力する
ことを特徴とする請求項１から請求項５のいずれか一項に記載のスライド再生プログラム。
6. The slide reproduction program according to any one of claims 1 to 5, wherein specific information that specifies a speech synthesis model and includes gender, together with a voice pitch and a speaking rate, is received, and the read-aloud voice of the spoken text is output at the received voice pitch and speaking rate based on the speech synthesis model corresponding to the specific information.

前記音声合成モデルは、特定の話者の発話音声を学習して生成したモデルを含み、
前記特定情報は話者を特定する話者識別情報を含み、該話者識別情報に対応する前記音声合成モデルに基づき、前記読み上げ音声を出力する
ことを特徴とする請求項６に記載のスライド再生プログラム。
7. The slide reproduction program according to claim 6, wherein the speech synthesis model includes a model generated by learning the speech of a specific speaker, the specific information includes speaker identification information that identifies a speaker, and the read-aloud voice is output based on the speech synthesis model corresponding to the speaker identification information.

前記表示要素が動画である場合、該動画の再生を行う
ことを特徴とする請求項１から請求項７のいずれか一項に記載のスライド再生プログラム。
8. The slide reproduction program according to any one of claims 1 to 7, wherein when the display element is a moving image, the moving image is reproduced.

前記スライドデータは、制御命令を含めることが可能であり、
出力対象となっている前記スライドデータにポインティングデバイスにより制御されるポインタの前記制御命令が含まれている場合、当該制御命令に従い、前記ポインタを制御する
ことを特徴とする請求項１から請求項８のいずれか一項に記載のスライド再生プログラム。
9. The slide reproduction program according to any one of claims 1 to 8, wherein the slide data can include a control instruction, and when the slide data to be output includes a control instruction for a pointer controlled by a pointing device, the pointer is controlled according to that control instruction.

前記表示要素は全画面表示で出力し、
出力対象となっている前記スライドデータに、他のアプリケーションソフトへ遷移するリンク情報が含まれている場合、前記表示要素を表示している画面を最小化し、前記アプリケーションソフトへ制御を渡し、
前記アプリケーションソフトから制御が戻った場合、前記表示要素を全画面表示で再出力する
ことを特徴とする請求項１から請求項９のいずれか一項に記載のスライド再生プログラム。
10. The slide reproduction program according to any one of claims 1 to 9, wherein the display element is output in full-screen display; when the slide data to be output includes link information for transitioning to other application software, the screen displaying the display element is minimized and control is passed to the application software; and when control returns from the application software, the display element is output again in full-screen display.

出力対象となっている前記スライドデータに、人物に所定のジェスチャーを行わせる制御命令が含まれている場合、当該制御命令に従ったジェスチャーを行う前記人物動画を出力する
ことを特徴とする請求項１から請求項１０のいずれか一項に記載のスライド再生プログラム。
11. The slide reproduction program according to any one of claims 1 to 10, wherein when the slide data to be output includes a control instruction that causes the person to perform a predetermined gesture, the person moving image in which the gesture is performed according to the control instruction is output.

前記プレゼンテーションデータは、前記発話テキストを含まないスライドデータを含み、当該スライドデータに含む前記表示要素から発話テキストを生成する
ことを特徴とする請求項１から請求項１１のいずれか一項に記載のスライド再生プログラム。
12. The slide reproduction program according to any one of claims 1 to 11, wherein the presentation data includes slide data that does not include the spoken text, and spoken text is generated from the display element included in that slide data.

発話テキストと表示要素とを含むスライドデータを複数含むプレゼンテーションデータを取得する取得部と、
複数の前記スライドデータそれぞれに含む前記表示要素を所定の順番で出力するとともに、出力している前記スライドデータに含む前記発話テキストの読み上げ音声を、人物動画を付して出力する出力部と
を備えることを特徴とするスライド再生装置。
13. A slide reproduction device comprising: an acquisition unit that acquires presentation data including a plurality of slide data, each including spoken text and a display element; and an output unit that outputs the display elements included in the respective slide data in a predetermined order while outputting a read-aloud voice of the spoken text included in the slide data being output, accompanied by a person moving image.

コンピュータが、
発話テキストと表示要素とを含むスライドデータを複数含むプレゼンテーションデータを取得し、
複数の前記スライドデータそれぞれに含む前記表示要素を所定の順番で出力するとともに、出力している前記スライドデータに含む前記発話テキストの読み上げ音声を、人物動画を付して出力する
処理を行うことを特徴とするスライド再生方法。
14. A slide reproduction method in which a computer performs processing of: acquiring presentation data including a plurality of slide data, each including spoken text and a display element; and outputting the display elements included in the respective slide data in a predetermined order while outputting a read-aloud voice of the spoken text included in the slide data being output, accompanied by a person moving image.