JP2024078382A


Info

Publication number: JP2024078382A
Application number: JP2023102537A
Authority: JP
Inventors: 耕司桑田; Koji Kuwata
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2022-11-29
Filing date: 2023-06-22
Publication date: 2024-06-10

Abstract

[Problem] To provide, in a video conference system that displays close-ups of the most recent multiple speakers, a recorded video in which the frequency of switching between the speakers displayed in close-up is suppressed. [Solution] A video conference system that records video of a video conference includes: a direction detection unit that detects the direction of sound based on audio of the video conference acquired by a microphone array; an image detection unit that detects images of persons from a first video of the video conference captured by one or more cameras; an identification unit that identifies, based on the direction of sound and the images of persons, the speaking order of users who participate in the video conference using the video conference system; and a recorded video creation unit that creates, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference in which images of a predetermined number of users, including a first user who has newly spoken, are displayed in a predetermined area based on the speaking order. [Selected Figure] FIG. 8

Description

Translated from Japanese

The present invention relates to a video conference system and a method for creating a recorded video.

Video conference systems that realize remote conferences by transmitting and receiving audio acquired by a microphone and images captured by a camera over a communication network have become widespread.

Patent Document 1 discloses a conference image playback system that, when generating a conference image of a video conference from an input image, enlarges or reduces the region containing the speaker so that the speaker is displayed at an appropriate size.

Patent Document 2 discloses a video conference system that combines a panoramic camera with a microphone array, captures the entire conference room with the panoramic camera, and automatically displays a close-up of the speaker's image when someone is speaking.

In a video conference system that displays a close-up image of the speaker at the local site, frequent changes of speaker cause the close-up image to switch at a dizzying rate. In addition, conventional video conference systems record the conference video as is, so the close-up image of the speaker also switches at a dizzying rate in the recorded video.

This problem also exists, for example, in video conference systems that display close-ups of the most recent multiple speakers.

One embodiment of the present invention has been made in view of the above problem and provides, in a video conference system that displays close-ups of the most recent multiple speakers, a recorded video in which the frequency of switching between the speakers displayed in close-up is suppressed.

In order to solve the above problem, a video conference system according to one embodiment of the present invention is a video conference system that records video of a video conference, and includes: a direction detection unit that detects the direction of sound based on audio of the video conference acquired by a microphone array; an image detection unit that detects images of persons from a first video of the video conference captured by one or more cameras; an identification unit that identifies, based on the direction of sound and the images of persons, the speaking order of users who participate in the video conference using the video conference system; and a recorded video creation unit that creates, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference in which images of a predetermined number of users, including a first user who has newly spoken, are displayed in a predetermined area based on the speaking order.
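As an illustration of how an identification unit might combine the direction of sound with the detected images of persons, the following is a minimal sketch and not the method of the embodiment. It assumes that each detected person has already been assigned an azimuth (in degrees) around the device and that the direction detection unit reports an azimuth per utterance. The names `Person`, `identify_speaker`, `speech_order`, and the 20-degree tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    name: str
    azimuth_deg: float  # assumed direction of the detected person's image centre


def angular_gap(a: float, b: float) -> float:
    """Smallest angle between two directions on a 360-degree circle."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def identify_speaker(sound_deg: float, people: List[Person],
                     tolerance_deg: float = 20.0) -> Optional[str]:
    """Return the detected person closest to the sound direction, if close enough."""
    best = min(people, key=lambda p: angular_gap(sound_deg, p.azimuth_deg), default=None)
    if best is None or angular_gap(sound_deg, best.azimuth_deg) > tolerance_deg:
        return None  # sound did not come from the direction of any detected person
    return best.name


def speech_order(sound_directions: List[float], people: List[Person]) -> List[str]:
    """Turn a sequence of sound directions into a sequence of speakers."""
    order: List[str] = []
    for deg in sound_directions:
        who = identify_speaker(deg, people)
        # Consecutive detections of the same speaker count as one utterance.
        if who is not None and (not order or order[-1] != who):
            order.append(who)
    return order
```

Sounds that do not match any detected person within the tolerance are ignored in this sketch, which is one possible way to keep non-speech noise out of the speaking order.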

According to one embodiment of the present invention, in a video conference system that displays close-ups of the most recent multiple speakers, it is possible to provide a recorded video in which the frequency of switching between the speakers displayed in close-up is suppressed.

FIG. 1 is a diagram illustrating an example of a system configuration of a communication system according to an embodiment.
FIG. 2 is a diagram illustrating another configuration example of a video conference system according to an embodiment.
FIG. 3 is a diagram illustrating an image of a conference video of a video conference system according to an embodiment.
FIG. 4 is a diagram illustrating an example of a transition of a conference video according to an embodiment.
FIG. 5 is a diagram illustrating an example of a transition of a recorded video according to an embodiment.
FIG. 6 is a diagram illustrating an example of a hardware configuration of a video conference terminal according to an embodiment.
FIG. 7 is a diagram illustrating an example of a hardware configuration of a computer according to an embodiment.
FIG. 8 is a diagram illustrating an example of a functional configuration of a video conference system according to an embodiment.
FIG. 9 is a diagram illustrating another example of a functional configuration of a video conference system according to an embodiment.
FIG. 10 is a flowchart illustrating an example of a process for creating a conference video and a recorded video according to an embodiment.
FIG. 11 is a flowchart illustrating an example of a process for creating a recorded video according to the first embodiment.
FIG. 12 is a flowchart illustrating an example of a process for creating a recorded video according to the second embodiment.
FIG. 13 is a flowchart illustrating an example of a process for creating a recorded video according to the third embodiment.
FIG. 14 is a flowchart illustrating an example of a priority determination process according to the third embodiment.
FIG. 15 is a diagram illustrating an image of a process for creating a conference video according to an embodiment.
FIG. 16 is a diagram illustrating an image of a process for creating a recorded video according to an embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

<System Configuration>
FIG. 1 is a diagram showing an example of a system configuration of a communication system according to an embodiment. The communication system 1 is a system in which, for example, one or more users A, B, C, D, ... use a video conference system 100 installed at the local site to hold a video conference with other users who use another video conference system 110 at another site. Note that a video conference may also be called a web conference. The video conference system 100 according to this embodiment also has a function of recording video of the video conference.

In the following description, the term "user at the local site" refers to any one of users A, B, C, D, .... The number of users at the local site and the number of other users at the other site shown in FIG. 1 are examples.

In the example of FIG. 1, the communication system 1 includes the video conference system 100 provided at the local site, the other video conference system 110 provided at the other site, and a conference server 10. The video conference system 100, the other video conference system 110, and the conference server 10 are connected to a communication network 2 such as the Internet or a LAN (Local Area Network).

For example, a user at the local site uses the video conference system 100 to participate in a video conference provided by the conference server 10. Another user uses the other video conference system 110 to participate in the same video conference provided by the conference server 10. This allows the video conference system 100 and the other video conference system 110 to exchange conference video via the conference server 10 and conduct a video conference.

The video conference provided by the conference server 10 may be any video conference (or web conference) in which conference video is exchanged between the parties. The video conference system 100 may also conduct a video conference directly with the other video conference system 110 over the communication network 2, without going through the conference server 10.

In the example of FIG. 1, the video conference system 100 includes a video conference terminal 101 and a display device 102 connected to the video conference terminal 101.

The video conference terminal 101 includes, for example, a microphone array configured by arranging a plurality of microphones, and has a function of detecting the direction of a speaker using the microphone array. The video conference terminal 101 also includes one or more cameras that capture images of user A, user B, user C, user D, ... in the vicinity of the video conference terminal 101. For example, the video conference system 100 may include a panoramic camera capable of capturing a 360-degree panoramic image of its surroundings and capture the entire conference room. Alternatively, the video conference system 100 may include a plurality of cameras and combine the images captured by those cameras to capture user A, user B, user C, user D, ... in the vicinity of the video conference terminal 101.

The video conference terminal 101 is also connected to the communication network 2 and has a video conference function for participating in a video conference provided by the conference server 10. For example, the video conference terminal 101 transmits a conference video based on the audio acquired by the microphone array and the video captured by the cameras to the other video conference system 110 via the conference server 10. The video conference terminal 101 also displays, on the display device 102, the conference video received from the other video conference system 110 via the conference server 10. The video conference terminal 101 also has a speaker and can output the conference audio included in the conference video. As another example, the display device 102 may be equipped with a speaker, and the video conference terminal 101 may output the conference audio using the speaker of the display device 102.

The display device 102 is a device that displays the display screen output by the video conference terminal 101. The display device 102 may be any of a variety of devices capable of displaying the display screen output by the video conference terminal 101, such as a display, an IWB (Interactive White Board), or a projector.

Here, an IWB is a display equipped with a touch sensor and is also called an electronic whiteboard. An IWB allows users to write directly on the displayed screen with, for example, a pen or a finger, and can also save the displayed content as data. An IWB can also be used as a large display, like a projector.

The other video conference system 110 may have any configuration as long as it can participate in the video conference provided by the conference server 10 and exchange conference video with the video conference system 100. For example, the other video conference system 110 may be an information processing device such as a PC (Personal Computer), a tablet terminal, or a smartphone, or may have a configuration similar to that of the video conference system 100.

FIG. 2 is a diagram showing another configuration example of a video conference system according to an embodiment. For example, as shown in FIG. 2(A), the video conference system 100 may be configured with a PC (Personal Computer) 202 having a video conference function and a web conference device 201 equipped with a microphone array, one or more cameras, a speaker, and the like.

Like the video conference terminal 101 described with reference to FIG. 1, the web conference device 201 includes a microphone array configured by arranging a plurality of microphones, and has a function of detecting the direction of a speaker using the microphone array. The web conference device 201 also has one or more cameras that capture images of user A, user B, user C, user D, ... in the vicinity of the web conference device 201. Furthermore, the web conference device 201 can output the conference audio through its speaker based on the conference audio data output from the PC 202.

The web conference device 201 is connected to the PC 202 by, for example, a USB (Universal Serial Bus) cable 203, creates a conference video similar to the one created by the video conference terminal 101, and transmits the created conference video to the PC 202. The PC 202 is connected to the communication network 2 and conducts a video conference using the conference video output from the web conference device 201.

The video conference system 100 may also be configured, for example, as shown in FIG. 2(B), with an IWB 211 having a video conference function and the web conference device 201 described above. The web conference device 201 is connected to the IWB 211 by, for example, the USB cable 203, creates a conference video similar to the one created by the video conference terminal 101, and transmits the created conference video to the IWB 211. The IWB 211 is connected to the communication network 2 and conducts a video conference with the other video conference system 110 using the conference video output from the web conference device 201.

(Example of conference video)
FIG. 3 is a diagram showing an example of a conference video of the video conference system according to an embodiment. The figure shows an image of the conference video created by, for example, the video conference terminal 101 described with reference to FIG. 1 or the web conference device 201 described with reference to FIGS. 2(A) and 2(B).

As shown in FIG. 3, the conference video 300 includes, for example, an overall display area 301 that displays all of the users participating in the video conference, and a close-up display area 302 that displays close-up images of a predetermined number of users.

The overall display area 301 displays, for example, a panoramic video of the entire conference room in which the video conference is taking place, captured by the camera of the video conference terminal 101 (or the web conference device 201). The close-up display area 302 displays close-ups of the predetermined number of users who have spoken most recently.

In the example of FIG. 3, three display frames 302-1, 302-2, and 302-3 are displayed in the close-up display area 302. In this case, for example, when user A, user B, and user C speak in this order at the local site shown in FIG. 1, the close-up display area 302 displays close-up (enlarged) images of user A, user B, and user C. Note that the number of display frames (the predetermined number) displayed in the close-up display area 302 may be two, or may be four or more. The following description assumes that three display frames are displayed in the close-up display area 302.
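The composition of a conference video from an overall display area and a row of close-up display frames can be sketched as a small layout calculation. This is only an illustrative sketch: the text does not specify where the two areas sit or how large they are, so placing the overall area at the top, giving it 30 percent of the height, and the function name `layout` are all assumptions.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def layout(width: int, height: int, frames: int = 3,
           overall_percent: int = 30) -> Tuple[Rect, List[Rect]]:
    """Split a canvas into one overall area and `frames` close-up frames."""
    # Assumption: the overall (panoramic) area spans the top of the canvas.
    overall_h = height * overall_percent // 100
    overall = (0, 0, width, overall_h)
    # Assumption: the close-up frames share the remaining height equally side by side.
    frame_w = width // frames
    closeups = [(i * frame_w, overall_h, frame_w, height - overall_h)
                for i in range(frames)]
    return overall, closeups
```

Changing `frames` to 2 or 4 corresponds to the alternative numbers of display frames mentioned above.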

As described above, in the video conference system 100 that displays a close-up image of the speaker at the local site, frequent changes of speaker cause the close-up image to switch at a dizzying rate.

FIG. 4 is a diagram showing an example of the transition of a conference video according to an embodiment. Here, it is assumed that when a new user speaks at the local site, the video conference system 100 displays the image of the newly speaking user in whichever of the three display frames 302-1, 302-2, and 302-3 currently shows the user who spoke earliest.

For example, assume that users A, B, and C have spoken in that order in a video conference and that the video conference system 100 is outputting the conference video 300 shown in FIG. 3. When user D newly speaks in this state, the video conference system 100 outputs a conference video 410 in which, as shown in FIG. 4(A), the image of user D is displayed in display frame 302-1, which had been showing user A, the earliest speaker among users A, B, and C.

When user A newly speaks in this state, the video conference system 100 outputs a conference video 420 in which, as shown in FIG. 4(B), the image of user A is displayed in display frame 302-2, which had been showing user B, the earliest speaker among users D, B, and C. When user B newly speaks in this state, the video conference system 100 outputs a conference video 430 in which, as shown in FIG. 4(C), the image of user B is displayed in display frame 302-3, which had been showing user C, the earliest speaker among users D, A, and C. Similarly, when user C newly speaks in this state, the video conference system 100 outputs a conference video 440 in which, as shown in FIG. 4(D), the image of user C is displayed in display frame 302-1, which had been showing user D, the earliest speaker among users D, A, and B.
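The replacement rule walked through for FIG. 4 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `replace_oldest` and `run_live`, the integer time stamps, and the behaviour when the speaker is already displayed (the frames are left unchanged) are assumptions.

```python
from typing import Dict, List


def replace_oldest(frames: List[str], last_spoken: Dict[str, int],
                   speaker: str, now: int) -> List[str]:
    """Return the close-up frames after `speaker` starts talking at time `now`."""
    last_spoken[speaker] = now
    if speaker in frames:
        # Assumption: a speaker who is already shown keeps the same frame.
        return list(frames)
    # Index of the frame whose occupant spoke longest ago.
    oldest = min(range(len(frames)), key=lambda i: last_spoken[frames[i]])
    updated = list(frames)
    updated[oldest] = speaker
    return updated


def run_live(initial: List[str], speakers: List[str]) -> List[List[str]]:
    """Replay `speakers` and return the frame contents after each utterance."""
    # Assumption: the initial occupants spoke in frame order (302-1 first).
    last_spoken = {user: t for t, user in enumerate(initial)}
    frames, history = list(initial), []
    for t, speaker in enumerate(speakers, start=len(initial)):
        frames = replace_oldest(frames, last_spoken, speaker, t)
        history.append(frames)
    return history
```

Calling `run_live(["A", "B", "C"], ["D", "A", "B", "C"])` replays the FIG. 4 sequence; every utterance replaces one frame, which is the rapid switching the description identifies as the problem.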

As described above, in the video conference system 100 that displays a close-up image of the speaker at the local site, frequent changes of speaker cause the close-up image to switch at a dizzying rate. In addition, conventional video conference systems record the conference video as is, so the close-up image of the speaker also switches at a dizzying rate in the recorded video.

Therefore, the video conference system 100 has a function of creating a recorded video with a reduced frequency of speaker switching, based on a second video obtained by delaying the first video of the video conference captured by the camera by a predetermined time and on the speaking order of the users identified from the first video. Here, the predetermined time is, for example, about 1 to 5 minutes, preferably about 2 to 3 minutes, but is not limited to this.
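The relationship between the first video and the delayed second video can be pictured as a first-in, first-out queue of frames. The sketch below is illustrative only: the class name `DelayBuffer` and the choice to express the delay as a number of frames are assumptions, and a real device would more likely buffer encoded video than raw frames.

```python
from collections import deque
from typing import Any, Deque, Optional


class DelayBuffer:
    """Delays a stream of frames by a fixed number of frames."""

    def __init__(self, delay_frames: int) -> None:
        self._delay = delay_frames
        self._queue: Deque[Any] = deque()

    def push(self, frame: Any) -> Optional[Any]:
        """Store the newest frame of the first video.

        Returns the frame captured `delay_frames` pushes earlier (a frame of
        the second video), or None while the buffer is still filling.
        """
        self._queue.append(frame)
        if len(self._queue) > self._delay:
            return self._queue.popleft()
        return None
```

For example, under an assumed frame rate of 30 frames per second, a delay of 2 minutes corresponds to `DelayBuffer(2 * 60 * 30)`. While the second video is being composed, the speaking order identified from the undelayed first video is already known for that interval, which is what makes the look-ahead described next possible.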

FIG. 5 is a diagram showing an example of the transition of a recorded video according to an embodiment. Here, it is assumed that when a new user speaks in the second video, the video conference system 100 replaces the content of whichever of the three display frames 302-1, 302-2, and 302-3 shows the user who spoke earliest with the image of the newly speaking user. However, when a first user newly speaks in the second video, the video conference system 100 excludes from replacement the display frame showing the second user, who will speak next, and the display frame showing the third user, who will speak after that. In addition, when the first user who newly speaks in the second video is already displayed in the close-up display area 302, the video conference system 100 does not change the layout of the close-up display area 302.

For example, assume that users A, B, and C have spoken in that order in the second video delayed by the predetermined time, and that the video conference system 100 has created the conference video 300 shown in FIG. 3. Also assume that the video conference system 100 has identified, based on the undelayed first video, that the subsequent speaking order is user D, user A, user B, and user C.

When user D newly speaks in the second video in this state, the video conference system 100 creates, for example, a recorded video 510 as shown in FIG. 5(A). Here, the image of user A, who will speak after user D, is displayed in display frame 302-1, so the video conference system 100 excludes display frame 302-1 from replacement. The image of user B, who will speak after user A, is displayed in display frame 302-2, so the video conference system 100 also excludes display frame 302-2 from replacement. As a result, the video conference system 100 displays the image of the newly speaking user D in the remaining display frame 302-3.

When user A newly speaks in the second video in this state, the video conference system 100 creates, for example, a recorded video 520 as shown in FIG. 5(B). Here, the image of user A is already displayed in the recorded video 510, so the video conference system 100 does not change the layout of the close-up display area 302.

When user B newly speaks in the second video in this state, the video conference system 100 creates, for example, a recorded video 530 as shown in FIG. 5(C). Here too, the image of user B is already displayed in the recorded video 510, so the video conference system 100 does not change the layout of the close-up display area 302.

Furthermore, when user C newly speaks in the second video in this state, the video conference system 100 creates, for example, a recorded video 540 as shown in FIG. 5(D). For example, the video conference system 100 displays the image of the newly speaking user C in display frame 302-3, which had been showing user D, the earliest speaker among users A, B, and D displayed in the close-up display area 302 of the recorded video 530.

In this way, the video conference system 100 can create and record recorded videos 510, 520, 530, and 540 in which the speakers displayed in close-up are switched less frequently than in the conference videos 410, 420, 430, and 440 described with reference to FIG. 4.
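The look-ahead rule walked through for FIG. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `replace_with_lookahead`, `run_recorded`, and `count_switches` are hypothetical, the upcoming speakers are taken directly from the remainder of a known list (standing in for the speaking order identified from the first video), and the fallback used when every frame is protected is an assumption.

```python
from typing import Dict, List, Sequence


def replace_with_lookahead(frames: List[str], last_spoken: Dict[str, int],
                           speaker: str, now: int,
                           upcoming: Sequence[str], protect: int = 2) -> List[str]:
    """Return the close-up frames for the recorded video after `speaker` talks."""
    last_spoken[speaker] = now
    if speaker in frames:
        return list(frames)  # already displayed: the layout is not changed
    # Frames showing the next `protect` speakers are excluded from replacement.
    protected = set(upcoming[:protect])
    candidates = [i for i, user in enumerate(frames) if user not in protected]
    if not candidates:
        # Assumption: if every frame is protected, fall back to all frames.
        candidates = list(range(len(frames)))
    # Among the remaining frames, replace the one whose occupant spoke earliest.
    victim = min(candidates, key=lambda i: last_spoken[frames[i]])
    updated = list(frames)
    updated[victim] = speaker
    return updated


def run_recorded(initial: List[str], speakers: List[str]) -> List[List[str]]:
    """Replay `speakers` on the delayed video, looking ahead in the same list."""
    last_spoken = {user: t for t, user in enumerate(initial)}
    frames, history = list(initial), []
    for k, speaker in enumerate(speakers):
        frames = replace_with_lookahead(frames, last_spoken, speaker,
                                        len(initial) + k, speakers[k + 1:])
        history.append(frames)
    return history


def count_switches(initial: List[str], history: List[List[str]]) -> int:
    """Count how many utterances changed the close-up layout."""
    previous, switches = list(initial), 0
    for frames in history:
        if frames != previous:
            switches += 1
        previous = frames
    return switches
```

Replaying the FIG. 5 sequence with `run_recorded(["A", "B", "C"], ["D", "A", "B", "C"])` changes the layout only when users D and C speak, whereas the rule of FIG. 4 changes it on every one of the four utterances.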

<Hardware Configuration>
Next, an example of the hardware configuration of each device according to this embodiment will be described.

(Hardware configuration of video conference terminal)
FIG. 6 is a diagram showing an example of a hardware configuration of a video conference terminal according to an embodiment. The video conference terminal 101 includes, for example, a CPU (Central Processing Unit) 601, a ROM (Read Only Memory) 602, a RAM (Random Access Memory) 603, an SSD (Solid State Drive) 604, a network I/F (Interface) 605, an external device connection I/F 606, a display I/F 607, an operation unit 608, a video codec 609, a sound processing unit 610, a microphone array 611, a speaker 612, a video processing unit 613, cameras 614a, 614b, ..., a video delay buffer 615, and a bus 616.

The CPU 601 is a computing device (processor) that controls the various functions of the video conference terminal 101 by executing predetermined programs. The ROM 602 is a non-volatile memory that stores, for example, programs used to start up the CPU 601. The RAM 603 is a volatile memory used, for example, as a work area for the CPU 601. The SSD 604 is an example of a storage device that stores, for example, programs, data, or setting information for the video conference terminal 101.

The network I/F 605 is a communication interface for connecting the video conference terminal 101 to, for example, the communication network 2. The external device connection I/F 606 is an interface for connecting various external devices to the video conference terminal 101. Here, the external devices include, for example, an external storage device for storing the recorded video created by the video conference terminal 101. The display I/F 607 is an interface for connecting the display device 102 and the like to the video conference terminal 101. The operation unit 608 is an input device that accepts user operations, such as operation buttons, switches, or a touch panel.

映像コーデック６０９は、例えば、ビデオ会議で送受信する会議映像を符号化するＣｏｄｅｒ、及び符号化された会議映像を復号するＤｅｃｏｄｅｒ等を含む。なお、会議映像の符号化、及び復号はソフトウェアで行われるものであってもよい。Thevideo codec 609 includes, for example, a coder that encodes the conference video transmitted and received in the video conference, and a decoder that decodes the encoded conference video. Note that the encoding and decoding of the conference video may be performed by software.

音処理ユニット６１０は、例えば、マイクアレイ６１１を用いて、指向性を制御するビームフォーミング等の様々な音処理を実行するデバイスである。また、音処理ユニット６１０は、スピーカ６１２を用いて、会議音声等の様々な音を出力する音処理も実行する。The sound processing unit 610 is a device that uses, for example, the microphone array 611 to perform various kinds of sound processing, such as beamforming to control directivity. The sound processing unit 610 also performs sound processing to output various sounds, such as conference audio, through the speaker 612.

映像処理ユニット６１３は、１つ以上のカメラ６１４ａ、６１４ｂ、・・・から、ビデオ会議端末１０１の周辺を撮影した画像を取得し、取得した画像に対して、例えば、画像合成、画質補正、又は歪み補正等の画像処理を行うデバイスである。カメラ６１４ａ、６１４ｂ、・・・は、ビデオ会議端末１０１の周辺の画像を撮影する撮影装置である。映像遅延バッファ６１５は、カメラで撮影したビデオ会議の第１の映像を所定の時間遅延させた第２の映像を生成するバッファである。バス６１６は、上記の各構成要素に共通に接続され、例えば、アドレス信号、データ信号、及び各種の制御信号等を伝送する。The video processing unit 613 is a device that acquires images of the surroundings of the video conference terminal 101 from one or more cameras 614a, 614b, ..., and performs image processing such as image synthesis, image quality correction, or distortion correction on the acquired images. The cameras 614a, 614b, ... are imaging devices that capture images of the surroundings of the video conference terminal 101. The video delay buffer 615 is a buffer that generates a second video by delaying the first video of the video conference captured by the cameras by a predetermined time. The bus 616 is connected in common to each of the above components and transmits, for example, address signals, data signals, and various control signals.

（ウェブ会議デバイスのハードウェア構成） (Hardware configuration of web conference device)
ウェブ会議デバイス２０１は、例えば、図６に示したビデオ会議端末１０１のハードウェア構成から、ディスプレイＩ／Ｆ６０７、映像コーデック６０９を省略したハードウェア構成を有している。ウェブ会議デバイス２０１は、例えば、外部機器接続Ｉ／Ｆ６０６を介して、ＰＣ２０２、又はＩＷＢ２１１に接続される。The web conference device 201 has a hardware configuration in which the display I/F 607 and the video codec 609 are omitted from the hardware configuration of the video conference terminal 101 shown in FIG. 6. The web conference device 201 is connected to the PC 202 or the IWB 211 via, for example, the external device connection I/F 606.

（コンピュータのハードウェア構成） (Hardware configuration of computer)
ＰＣ２０２は、例えば、図７に示すような、コンピュータ７００のハードウェア構成を有している。また、会議サーバ１０は、例えば、１つ以上のコンピュータ７００によって構成される。The PC 202 has, for example, the hardware configuration of a computer 700 as shown in FIG. 7. The conference server 10 is composed of, for example, one or more computers 700.

図７は、一実施形態に係るコンピュータのハードウェア構成を示す図である。コンピュータ７００は、例えば、ＣＰＵ７０１、ＲＯＭ７０２、ＲＡＭ７０３、ＨＤ（Hard Disk)７０４、ＨＤＤ（Hard Disk Drive)コントローラ７０５、ディスプレイ７０６、外部機器接続Ｉ／Ｆ７０７、ネットワークＩ／Ｆ７０８、キーボード７０９、ポインティングデバイス７１０、ＤＶＤ－ＲＷ（Digital Versatile Disk ReWritable)ドライブ７１２、メディアＩ／Ｆ７１４、及び、バスライン７１５等を備えている。Figure 7 is a diagram showing the hardware configuration of a computer according to one embodiment. Thecomputer 700 includes, for example, aCPU 701, aROM 702, aRAM 703, a HD (Hard Disk) 704, a HDD (Hard Disk Drive)controller 705, adisplay 706, an external device connection I/F 707, a network I/F 708, akeyboard 709, apointing device 710, a DVD-RW (Digital Versatile Disk ReWritable) drive 712, a media I/F 714, and abus line 715.

これらのうち、ＣＰＵ７０１は、コンピュータ７００の全体の動作を制御する演算装置である。ＲＯＭ７０２は、ＩＰＬ等のＣＰＵ７０１の駆動に用いられるプログラムを記憶する不揮発性のメモリである。ＲＡＭ７０３は、ＣＰＵ７０１のワークエリア等として使用される揮発性のメモリである。ＨＤ７０４は、ＯＳ（Operating System）やアプリケーション等のプログラムや、各種のデータ等を記憶する大容量の記憶装置である。ＨＤＤコントローラ７０５は、ＣＰＵ７０１の制御にしたがってＨＤ７０４に対する各種データの読み出し又は書き込みを制御する。Of these, theCPU 701 is an arithmetic unit that controls the overall operation of thecomputer 700. TheROM 702 is a non-volatile memory that stores programs used to drive theCPU 701, such as IPL. TheRAM 703 is a volatile memory that is used as a work area for theCPU 701, etc. TheHD 704 is a large-capacity storage device that stores programs such as the OS (Operating System) and applications, as well as various types of data. TheHDD controller 705 controls the reading and writing of various types of data from and to theHD 704 under the control of theCPU 701.

ディスプレイ７０６は、カーソル、メニュー、ウィンドウ、文字、又は画像などの各種情報を表示する。外部機器接続Ｉ／Ｆ７０７は、各種の外部機器を接続するためのインタフェースである。ネットワークＩ／Ｆ７０８は、通信ネットワークを利用してデータ通信をするための通信インタフェースである。キーボード７０９は、文字、数値、各種指示などの入力のための複数のキーを備えた入力手段の一種である。ポインティングデバイス７１０は、各種指示の選択や実行、処理対象の選択、カーソルの移動などを行う入力手段の一種である。Thedisplay 706 displays various information such as a cursor, a menu, a window, characters, or an image. The external device connection I/F 707 is an interface for connecting various external devices. The network I/F 708 is a communication interface for data communication using a communication network. Thekeyboard 709 is a type of input means equipped with multiple keys for inputting characters, numbers, various instructions, etc. Thepointing device 710 is a type of input means for selecting and executing various instructions, selecting a processing target, moving the cursor, etc.

ＤＶＤ－ＲＷドライブ７１２は、着脱可能な記録媒体の一例としてのＤＶＤ－ＲＷ７１１に対する各種データの読み出し又は書き込みを制御する。なお、ＤＶＤ－ＲＷ７１１は、ＤＶＤ－ＲＷに限らず、他の着脱可能な記録媒体であっても良い。メディアＩ／Ｆ７１４は、フラッシュメモリ等のメディア７１３に対するデータの読み出し又は書き込み（記憶）を制御する。バスライン７１５は、図７に示されているＣＰＵ７０１等の各構成要素を電気的に接続するためのアドレスバス、データバス、及び各種の制御信号等を含む。The DVD-RW drive 712 controls the reading and writing of various data from the DVD-RW 711, which is an example of a removable recording medium. Note that the DVD-RW 711 is not limited to a DVD-RW, and may be other removable recording media. The media I/F 714 controls the reading and writing (storing) of data from themedia 713, such as a flash memory. Thebus line 715 includes an address bus, a data bus, and various control signals for electrically connecting the various components, such as theCPU 701 shown in FIG. 7.

＜機能構成＞ <Functional configuration>
続いて、本実施形態に係るビデオ会議システム１００の機能構成の例について説明する。Next, an example of the functional configuration of the video conference system 100 according to the present embodiment will be described.

図８は、一実施形態に係るウェブ会議システムの機能構成の一例を示す図である。図８の例では、ビデオ会議システム１００は、ビデオ会議端末１０１と、ビデオ会議端末１０１に接続される表示装置１０２とを含む。FIG. 8 is a diagram illustrating an example of the functional configuration of a web conference system according to an embodiment. In the example of FIG. 8, avideo conference system 100 includes avideo conference terminal 101 and adisplay device 102 connected to thevideo conference terminal 101.

（ビデオ会議端末の機能構成） (Functional configuration of video conference terminal)
ビデオ会議端末１０１は、例えば、通信部８０１、音声取得部８０２、方向検知部８０３、映像取得部８０４、画像検知部８０５、特定部８０６、映像遅延部８０７、録画映像作成部８０８、録画映像管理部８０９、会議映像作成部８１０、ＵＩ（User Interface）部８１１、会議制御部８１２、表示制御部８１３、及び音声出力部８１４等を有する。The video conference terminal 101 has, for example, a communication unit 801, an audio acquisition unit 802, a direction detection unit 803, a video acquisition unit 804, an image detection unit 805, an identification unit 806, a video delay unit 807, a recorded video creation unit 808, a recorded video management unit 809, a conference video creation unit 810, a UI (User Interface) unit 811, a conference control unit 812, a display control unit 813, and an audio output unit 814.

通信部８０１は、例えば、ＣＰＵ６０１が実行するプログラム、及びネットワークＩ／Ｆ６０５等によって実現され、ビデオ会議端末１０１を通信ネットワーク２に接続し、会議サーバ１０等の他の装置と通信する通信処理を実行する。Thecommunication unit 801 is realized, for example, by a program executed by theCPU 601 and the network I/F 605, and performs communication processing to connect thevideo conference terminal 101 to thecommunication network 2 and communicate with other devices such as theconference server 10.

音声取得部８０２は、例えば、ＣＰＵ６０１が実行するプログラム、マイクアレイ６１１、及び音処理ユニット６１０等によって実現され、ビデオ会議端末１０１の周辺の音声を取得する音声取得処理を実行する。また、音声取得部８０２は、例えば、マイクアレイ６１１によるビームフォーミング、取得した音声の音質調整、又は取得した音声の音量調整等も行う。Theaudio acquisition unit 802 is realized, for example, by a program executed by theCPU 601, themicrophone array 611, thesound processing unit 610, etc., and executes an audio acquisition process to acquire audio around thevideo conference terminal 101. Theaudio acquisition unit 802 also performs, for example, beamforming using themicrophone array 611, sound quality adjustment of the acquired audio, or volume adjustment of the acquired audio.

方向検知部８０３は、例えば、ＣＰＵ６０１が実行するプログラム、及び音処理ユニット６１０等によって実現され、音声取得部８０２がマイクアレイ６１１で取得したビデオ会議の音声に基づいて音の方向を検知する方向検知処理を実行する。例えば、方向検知部８０３は、マイクアレイ６１１の複数のマイクで取得した音声データを解析して、音源がどの方向にあるかを推定する。Thedirection detection unit 803 is realized, for example, by a program executed by theCPU 601 and thesound processing unit 610, and executes a direction detection process to detect the direction of sound based on the sound of the video conference acquired by thesound acquisition unit 802 using themicrophone array 611. For example, thedirection detection unit 803 analyzes the sound data acquired by the multiple microphones of themicrophone array 611 to estimate the direction of the sound source.
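The direction estimation described above can be illustrated with a minimal sketch. The text does not state which estimation method the direction detection unit 803 uses, so the following assumes a simple two-microphone time-difference-of-arrival approach: the inter-microphone lag that maximizes the cross-correlation is converted to an angle under a far-field assumption. The function names, the speed-of-sound constant, and the sign convention (positive angles point toward the left microphone) are all hypothetical.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def best_lag(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive result means the sound reached the right microphone later.
    Lags are tried in order of increasing magnitude so ties resolve toward 0.
    """
    best, best_score = 0, float("-inf")
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def estimate_angle_deg(
    left: Sequence[float],
    right: Sequence[float],
    sample_rate_hz: float,
    mic_spacing_m: float,
) -> float:
    """Estimate the source bearing from two microphone signals (far-field model)."""
    # The physical delay cannot exceed the time sound needs to cross the array.
    max_lag = math.ceil(mic_spacing_m / SPEED_OF_SOUND_M_S * sample_rate_hz)
    delay_s = best_lag(left, right, max_lag) / sample_rate_hz
    # Clamp to the valid asin domain to absorb rounding of the integer lag.
    ratio = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND_M_S / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

A real microphone array would typically use more than two microphones and a more robust correlation measure; this sketch only shows how a sound direction can be derived from the sample offset between channels.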

映像取得部８０４は、例えば、ＣＰＵ６０１が実行するプログラム、１つ以上のカメラ６１４ａ、６１４ｂ、・・・、及び映像処理ユニット６１３等によって実現される。映像取得部８０４は、例えば、ビデオ会議端末１０１の周辺を撮影した第１の映像を取得する映像取得処理を実行する。また、映像取得部８０４は、取得した第１の映像の画質補正、又は歪み補正等も行う。Thevideo acquisition unit 804 is realized, for example, by a program executed by theCPU 601, one or more cameras 614a, 614b, ..., and avideo processing unit 613. Thevideo acquisition unit 804 executes a video acquisition process to acquire a first video image captured around thevideo conference terminal 101. Thevideo acquisition unit 804 also performs image quality correction, distortion correction, etc., on the acquired first video image.

画像検知部８０５は、例えば、ＣＰＵ６０１が実行するプログラム、及び映像処理ユニット等によって実現され、映像取得部８０４が、１つ以上のカメラ６１４ａ、６１４ｂ、・・・で撮影した第１の映像から人物の画像を検知する画像検知処理を実行する。例えば、画像検知部８０５は、入力した映像から、人物が映っている領域を推定するように、予め機械学習した学習済の機械学習モデル等を用いて、人物が映っている領域を推定することにより、人物の画像を検知してもよい。The image detection unit 805 is realized, for example, by a program executed by the CPU 601, a video processing unit, etc., and executes an image detection process that detects an image of a person from the first video that the video acquisition unit 804 captured with one or more cameras 614a, 614b, .... For example, the image detection unit 805 may detect an image of a person by estimating the area in which a person appears, using a machine learning model trained in advance to estimate such areas from an input video.

ここで、機械学習とは、コンピュータに人のような学習能力を獲得させるための技術であり、コンピュータが、データ識別等の判断に必要なアルゴリズムを、事前に取り込まれる学習データから自律的に生成し、新たなデータについてこれを適用して予測を行う技術のことをいう。機械学習のための学習方法は、教師あり学習、教師なし学習、半教師学習、強化学習、深層学習のいずれかの方法でもよく、さらに、これらの学習方法を組み合わせた学習方法でもよく、機械学習のための学習方法は問わない。Here, machine learning refers to a technology that allows a computer to acquire human-like learning capabilities, in which the computer autonomously generates algorithms necessary for judgments such as data identification from training data that is previously loaded, and applies these to new data to make predictions. The learning method for machine learning may be any of supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning, or may be a combination of these learning methods; any learning method for machine learning is acceptable.

なお、画像検知部８０５は、例えば、公知のパターン認識技術等を用いて、映像取得部８０４が取得した第１の映像から、人物の画像を検知してもよい。Note that the image detection unit 805 may also detect an image of a person from the first video acquired by the video acquisition unit 804 by using, for example, a known pattern recognition technique.

特定部８０６は、例えば、ＣＰＵ６０１が実行するプログラム等によって実現される。特定部８０６は、方向検知部８０３が検知した音の方向と、画像検知部８０５が検知した人物の画像とに基づいて、自拠点で発話した利用者、及び自拠点で発話した人物の発話順序を特定する特定処理を実行する。Theidentification unit 806 is realized, for example, by a program executed by theCPU 601. Theidentification unit 806 executes an identification process to identify the user who spoke at the local base and the order in which the people who spoke at the local base spoke, based on the direction of the sound detected by thedirection detection unit 803 and the image of the person detected by theimage detection unit 805.
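As a rough illustration of how a sound direction and detected person images could be combined, the sketch below matches a sound bearing to the nearest detected person and appends that person to a speech-order list. The data layout (`DetectedPerson`, a bearing per person), the 15-degree tolerance, and the rule that consecutive detections of the same speaker count as one turn are assumptions made for this example, not details given in the text. Bearings are assumed to lie in a range that does not wrap around (for example -90 to 90 degrees).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class DetectedPerson:
    user_id: str        # hypothetical label such as "A"
    bearing_deg: float  # assumed bearing of the center of the person's image region


def match_speaker(
    sound_deg: float,
    people: Sequence[DetectedPerson],
    tolerance_deg: float = 15.0,
) -> Optional[str]:
    """Return the person whose bearing is closest to the sound direction.

    Returns None when nobody is detected within the tolerance, for example
    when the sound does not come from a participant.
    """
    best = min(people, key=lambda p: abs(p.bearing_deg - sound_deg), default=None)
    if best is None or abs(best.bearing_deg - sound_deg) > tolerance_deg:
        return None
    return best.user_id


def update_speech_order(order: List[str], speaker: Optional[str]) -> List[str]:
    """Append a new utterance unless it continues the current speaker's turn."""
    if speaker is not None and (not order or order[-1] != speaker):
        return order + [speaker]
    return order
```

For example, with people detected at -40, 0, and 35 degrees, a sound arriving from 32 degrees is attributed to the person at 35 degrees, and that person is appended to the speech order if they were not already the most recent speaker.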

映像遅延部８０７は、例えば、ＣＰＵ６０１が実行するプログラム、及び映像遅延バッファ６１５等によって実現され、映像取得部８０４が取得した第１の映像を所定の時間遅延させて、第２の映像を出力する映像遅延処理を実行する。例えば、映像遅延部８０７は、第１の映像を所定の時間保持した後に、第２の映像を出力する映像遅延バッファ６１５に、第１の映像を入力する。The video delay unit 807 is realized, for example, by a program executed by the CPU 601, the video delay buffer 615, etc., and executes a video delay process that delays the first video acquired by the video acquisition unit 804 by a predetermined time and outputs it as the second video. For example, the video delay unit 807 inputs the first video to the video delay buffer 615, which holds the first video for a predetermined time and then outputs it as the second video.
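The delay behavior can be pictured as a fixed-length first-in, first-out queue. The following is a minimal sketch under the assumption that the predetermined time is expressed as a number of frames; the class name `VideoDelayBuffer` and its `push` method are hypothetical and are not taken from the text.

```python
from collections import deque
from typing import Deque, Generic, Optional, TypeVar

Frame = TypeVar("Frame")


class VideoDelayBuffer(Generic[Frame]):
    """Hypothetical FIFO that delays frames by a fixed number of frames."""

    def __init__(self, delay_frames: int) -> None:
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        self._delay_frames = delay_frames
        self._queue: Deque[Frame] = deque()

    def push(self, frame: Frame) -> Optional[Frame]:
        """Store a first-video frame and return the delayed frame, if any.

        Returns None until the buffer holds `delay_frames` earlier frames.
        """
        self._queue.append(frame)
        if len(self._queue) > self._delay_frames:
            return self._queue.popleft()
        return None
```

With a delay of two frames, pushing frames 0 through 4 yields no output for the first two pushes and then frames 0, 1, and 2, so the output stream is the input stream shifted by the configured delay.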

録画映像作成部８０８は、例えば、ＣＰＵ６０１が実行するプログラム、及び映像処理ユニット等によって実現される。録画映像作成部８０８は、第１の映像を所定の時間遅延させた第２の映像から、特定部８０６が特定した発話順序に基づいて、例えば、図５（Ａ）～（Ｄ）で説明した録画映像５１０、５２０、５３０、５４０等を作成する録画映像作成処理を実行する。なお、録画映像作成部８０８が実行する録画映像作成処理については、複数の実施形態を例示して後述する。The recordedvideo creation unit 808 is realized, for example, by a program executed by theCPU 601, a video processing unit, etc. The recordedvideo creation unit 808 executes a recorded video creation process to create, for example, recordedvideos 510, 520, 530, 540, etc., described in Figs. 5(A) to (D) from a second video obtained by delaying the first video by a predetermined time, based on the speech order identified by theidentification unit 806. Note that the recorded video creation process executed by the recordedvideo creation unit 808 will be described later with multiple exemplary embodiments.

録画映像管理部８０９は、例えば、ＣＰＵ６０１が実行するプログラム等によって実現され、録画映像作成部８０８が作成した録画映像を、例えば、ＳＳＤ６０４、又は外部機器接続Ｉ／Ｆ６０６に接続した外部記憶装置等に録画（記憶）する。The recordedvideo management unit 809 is realized, for example, by a program executed by theCPU 601, and records (stores) the recorded video created by the recordedvideo creation unit 808, for example, in theSSD 604 or an external storage device connected to the external device connection I/F 606.

会議映像作成部８１０は、例えば、ＣＰＵ６０１が実行するプログラム、及び映像処理ユニット等によって実現される。会議映像作成部８１０は、第１の映像から、例えば、図４（Ａ）～（Ｄ）で説明した会議映像４１０、４２０、４３０、４４０等を作成する会議映像作成処理を実行する。The conferencevideo creation unit 810 is realized, for example, by a program executed by theCPU 601, a video processing unit, etc. The conferencevideo creation unit 810 executes a conference video creation process to create, for example,conference videos 410, 420, 430, 440, etc., described in Figures 4 (A) to (D) from the first video.

ＵＩ部８１１は、例えば、ＣＰＵ６０１が実行するプログラム、及び操作部６０８等によって実現され、利用者によるビデオ会議端末１０１に対する様々な操作を受け付ける。TheUI unit 811 is realized, for example, by a program executed by theCPU 601 and theoperation unit 608, and accepts various operations on thevideo conference terminal 101 by the user.

会議制御部８１２は、例えば、ＣＰＵ６０１が実行するプログラム、及び映像コーデック６０９等によって実現され、通信部８０１を介して会議サーバ１０に接続し、他のビデオ会議システム１１０とビデオ会議を行う会議制御処理を実行する。例えば、会議制御部８１２は、ビデオ会議中に、会議映像作成部８１０が作成した会議映像を、会議サーバ１０を介して、他のビデオ会議システム１１０に送信する。また、会議制御部８１２は、会議サーバ１０を介して、他のビデオ会議システム１１０から会議映像を受信する。なお、会議制御部８１２は、既存の様々なビデオ会議、又はウェブ会議の仕組みを利用するものであってよい。Theconference control unit 812 is realized, for example, by a program executed by theCPU 601, thevideo codec 609, etc., and connects to theconference server 10 via thecommunication unit 801, and executes conference control processing to hold a video conference with anothervideo conference system 110. For example, during a video conference, theconference control unit 812 transmits conference video created by the conferencevideo creation unit 810 to the othervideo conference system 110 via theconference server 10. Theconference control unit 812 also receives conference video from the othervideo conference system 110 via theconference server 10. Theconference control unit 812 may utilize various existing video conference or web conference mechanisms.

表示制御部８１３は、例えば、ＣＰＵ６０１が実行するプログラム、及びディスプレイＩ／Ｆ６０７等によって実現され、会議制御部８１２が他のビデオ会議システム１１０から受信した会議映像を、表示装置１０２等に表示させる。Thedisplay control unit 813 is realized, for example, by a program executed by theCPU 601 and the display I/F 607, etc., and causes the conference video received by theconference control unit 812 from anothervideo conference system 110 to be displayed on thedisplay device 102, etc.

音声出力部８１４は、例えば、ＣＰＵ６０１が実行するプログラム、音処理ユニット６１０、及びスピーカ６１２等によって実現される。音声出力部８１４は、例えば、会議制御部８１２が他のビデオ会議システム１１０から受信した会議映像に含まれる会議音声を出力する音声出力処理を実行する。なお、音声出力部８１４は、表示装置１０２が備えるスピーカ等によって実現されるものであってもよい。Theaudio output unit 814 is realized, for example, by a program executed by theCPU 601, thesound processing unit 610, thespeaker 612, and the like. Theaudio output unit 814 executes audio output processing to output the conference audio included in the conference video received by theconference control unit 812 from anothervideo conference system 110. Note that theaudio output unit 814 may also be realized by a speaker or the like provided in thedisplay device 102.

図９は、一実施形態に係るウェブ会議システムの機能構成の別の一例を示す図である。図８で説明したビデオ会議端末１０１の各機能構成は、例えば、図９に示すように、ウェブ会議デバイス２０１と、ＰＣ２０２に、分散して設けられていてもよい。また、ＰＣ２０２は、ＩＷＢ２１１等のウェブ会議機能を有する電子機器であってもよい。FIG. 9 is a diagram showing another example of the functional configuration of a web conference system according to an embodiment. The functional configuration of thevideo conference terminal 101 described in FIG. 8 may be distributed between aweb conference device 201 and aPC 202, for example, as shown in FIG. 9. ThePC 202 may be an electronic device having a web conference function, such as anIWB 211.

（ウェブ会議デバイスの機能構成） (Functional configuration of web conference device)
ウェブ会議デバイス２０１は、例えば、通信部９１１、音声取得部８０２、方向検知部８０３、映像取得部８０４、画像検知部８０５、特定部８０６、映像遅延部８０７、録画映像作成部８０８、録画映像管理部８０９、会議映像作成部８１０、ＵＩ部８１１、及び音声出力部８１４等を有する。これらの各機能構成のうち、通信部９１１以外の機能構成は、図８で説明した各機能構成と同様なので、ここでは説明を省略する。The web conference device 201 includes, for example, a communication unit 911, an audio acquisition unit 802, a direction detection unit 803, a video acquisition unit 804, an image detection unit 805, an identification unit 806, a video delay unit 807, a recorded video creation unit 808, a recorded video management unit 809, a conference video creation unit 810, a UI unit 811, and an audio output unit 814. Of these functional components, those other than the communication unit 911 are the same as the functional components described with reference to FIG. 8, and their description is therefore omitted here.

通信部９１１は、例えば、ウェブ会議デバイス２０１が備えるＣＰＵが実行するプログラム、及び外部機器接続Ｉ／Ｆ等によって実現され、ＰＣ２０２（又はＩＷＢ２１１等）と通信する通信処理を実行する。例えば、通信部９１１は、会議映像作成部８１０が作成した会議映像を、ＰＣ２０２等に送信する。また、通信部９１１は、ＰＣ２０２等から他のビデオ会議システム１１０からの会議映像に含まれる会議音声を受信し、音声出力部８１４に出力する。Thecommunication unit 911 is realized, for example, by a program executed by the CPU of theweb conference device 201 and an external device connection I/F, and executes communication processing to communicate with the PC 202 (or theIWB 211, etc.). For example, thecommunication unit 911 transmits the conference video created by the conferencevideo creation unit 810 to thePC 202, etc. Thecommunication unit 911 also receives conference audio included in the conference video from anothervideo conference system 110 from thePC 202, etc., and outputs it to theaudio output unit 814.

（ＰＣの機能構成） (Functional configuration of PC)
ＰＣ２０２は、例えば、第１の通信部９０１、第２の通信部９０２、ＵＩ部９０３、会議制御部８１２、及び表示制御部８１３等を有する。The PC 202 includes, for example, a first communication unit 901, a second communication unit 902, a UI unit 903, a conference control unit 812, and a display control unit 813.

第１の通信部９０１は、例えば、ＣＰＵ７０１が実行するプログラム、及びネットワークＩ／Ｆ７０８等によって実現され、ＰＣ２０２を通信ネットワーク２に接続し、会議サーバ１０等の他の装置と通信する第１の通信処理を実行する。Thefirst communication unit 901 is realized, for example, by a program executed by theCPU 701 and the network I/F 708, and executes a first communication process that connects thePC 202 to thecommunication network 2 and communicates with other devices such as theconference server 10.

第２の通信部９０２は、例えば、ＣＰＵ７０１が実行するプログラム、及び外部機器接続Ｉ／Ｆ７０７等によって実現され、外部機器接続Ｉ／Ｆ７０７に接続されたウェブ会議デバイス２０１等と通信する第２の通信処理を実行する。Thesecond communication unit 902 is realized, for example, by a program executed by theCPU 701 and the external device connection I/F 707, etc., and executes a second communication process to communicate with theweb conference device 201, etc. connected to the external device connection I/F 707.

ＵＩ部９０３は、例えば、ＣＰＵ７０１が実行するプログラム等によって実現され、ＰＣ２０２に対する利用者の操作を受け付ける。TheUI unit 903 is realized, for example, by a program executed by theCPU 701, and accepts user operations on thePC 202.

会議制御部８１２は、例えば、ＣＰＵ７０１が実行するプログラム等によって実現され、第１の通信部９０１を介して会議サーバ１０に接続し、他のビデオ会議システム１１０とビデオ会議を行う会議制御処理を実行する。例えば、会議制御部８１２は、第２の通信部９０２が、ウェブ会議デバイス２０１から受信した会議映像を、会議サーバ１０を介して、他のビデオ会議システム１１０に送信する。また、会議制御部８１２は、会議サーバ１０を介して、他のビデオ会議システム１１０から会議映像を受信し、受信した会議映像を表示制御部８１３に表示させる。さらに、会議制御部８１２は、受信した会議映像に含まれる会議音声をウェブ会議デバイス２０１に送信して、会議音声を出力させる。Theconference control unit 812 is realized, for example, by a program executed by theCPU 701, and executes a conference control process to connect to theconference server 10 via thefirst communication unit 901 and hold a video conference with anothervideo conference system 110. For example, theconference control unit 812 transmits the conference video received by thesecond communication unit 902 from theweb conference device 201 to the othervideo conference system 110 via theconference server 10. Theconference control unit 812 also receives conference video from the othervideo conference system 110 via theconference server 10, and causes thedisplay control unit 813 to display the received conference video. Furthermore, theconference control unit 812 transmits the conference audio included in the received conference video to theweb conference device 201, and causes theweb conference device 201 to output the conference audio.

なお、図８、９に示したビデオ会議システム１００の機能構成は一例である。例えば、図８、９に示した各装置が備える各機能構成は、ビデオ会議システム１００に含まれるいずれの装置が備えていてもよい。Note that the functional configuration of thevideoconferencing system 100 shown in FIGS. 8 and 9 is an example. For example, the functional configurations of the devices shown in FIGS. 8 and 9 may be provided in any of the devices included in thevideoconferencing system 100.

＜処理の流れ＞ <Processing flow>
続いて、本実施形態に係る録画映像作成方法の処理の流れについて説明する。Next, the processing flow of the recorded video creation method according to this embodiment will be described.

（会議映像、及び録画映像の作成処理） (Creation of conference video and recorded video)
図１０は、一実施形態に係る会議映像、及び録画映像の作成処理の例を示すフローチャートである。この処理は、他のビデオ会議システム１１０とビデオ会議中に、ビデオ会議システム１００が実行する会議映像の作成処理、及び録画映像の作成処理の概要を示している。FIG. 10 is a flowchart showing an example of a process for creating a conference video and a recorded video according to an embodiment. This process gives an overview of the conference video creation process and the recorded video creation process that the video conference system 100 executes during a video conference with another video conference system 110.

ステップＳ１００１において、方向検知部８０３は、音声取得部８０２がマイクアレイ６１１で取得した音声に基づいて音の方向を検知する。In step S1001, thedirection detection unit 803 detects the direction of the sound based on the sound acquired by thesound acquisition unit 802 using themicrophone array 611.

ステップＳ１００２において、画像検知部８０５は、１つ以上のカメラ６１４ａ、６１４ｂ、・・・で撮影した第１の映像から人物の画像を検知する。例えば、画像検知部８０５は、自拠点でビデオ会議に参加している人物の画像を検知する。In step S1002, the image detection unit 805 detects an image of a person from the first video captured by one or more cameras 614a, 614b, .... For example, the image detection unit 805 detects an image of a person participating in the video conference at the local site.

ステップＳ１００３において、特定部８０６は、方向検知部８０３が検知した音の方向と、画像検知部８０５が検知した人物の画像とに基づいて、自拠点で発話した利用者、及び自拠点で発話した人物の発話順序を特定する。In step S1003, the identification unit 806 identifies the users who spoke at the local site and the order in which they spoke, based on the direction of the sound detected by the direction detection unit 803 and the image of the person detected by the image detection unit 805.

ステップＳ１００４において、会議映像作成部８１０は、第１の映像から、他の利用者より後に発話した所定の数の利用者の画像をクローズアップ表示エリア３０２に表示する会議映像を作成する。なお、ここでは、所定の数が「３」であるものとして以下の説明を行う。In step S1004, the conference video creation unit 810 creates, from the first video, a conference video that displays in the close-up display area 302 the images of a predetermined number of users who spoke more recently than the other users. Note that the following explanation assumes that the predetermined number is "3."

例えば、図１の自拠点において、利用者Ａ、利用者Ｂ、利用者Ｃの順に発話したものとする。この場合、会議映像作成部８１０は、図３に示すように、他の利用者（利用者Ｄ）より後に発話した３人の利用者（利用者Ａ、利用者Ｂ、利用者Ｃ）の画像をクローズアップ表示エリア３０２に表示する会議映像３００を作成する。For example, assume that at the home location in FIG. 1, users A, B, and C speak in that order. In this case, the conferencevideo creation unit 810 creates aconference video 300 that displays images of the three users (user A, user B, and user C) who spoke after another user (user D) in the close-updisplay area 302, as shown in FIG. 3.
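The selection in this example can be sketched as picking the most recent distinct speakers from the speech order. This is a minimal sketch, assuming the speech order is available as a list of user labels with the oldest utterance first; the function name `latest_speakers` and the choice to return the most recent speaker first are assumptions for illustration, since the text does not specify how the selected users are ordered within the close-up display area.

```python
from typing import List, Sequence


def latest_speakers(speech_order: Sequence[str], count: int = 3) -> List[str]:
    """Return up to `count` distinct users who spoke most recently.

    `speech_order` lists utterances from oldest to newest; the result lists
    the selected users from most recent to least recent.
    """
    if count <= 0:
        return []
    recent: List[str] = []
    for user in reversed(speech_order):
        if user not in recent:
            recent.append(user)
        if len(recent) == count:
            break
    return recent
```

For the example above, a speech order of D, A, B, C with a count of 3 selects users C, B, and A, leaving out user D, who spoke earliest.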

ステップＳ１００５において、会議映像作成部８１０は、作成した会議映像を、例えば、会議制御部８１２に出力する。これにより、会議制御部８１２は、会議映像作成部８１０が出力した会議映像を、自拠点の会議映像として、会議サーバ１０を介して他のビデオ会議システム１１０に送信する。In step S1005, the conferencevideo creation unit 810 outputs the created conference video to, for example, theconference control unit 812. As a result, theconference control unit 812 transmits the conference video output by the conferencevideo creation unit 810 to the othervideo conference system 110 via theconference server 10 as a conference video of its own base.

また、録画映像作成部８０８は、ステップＳ１００４、Ｓ１００５の処理とは別に、ステップＳ１００６の処理を実行する。ステップＳ１００６において、録画映像作成部８０８は、第１の映像を遅延させた第２の映像から、発話順序に基づいて、新たに発話した第１の利用者を含む所定の数の利用者の画像をクローズアップ表示エリア３０２に表示する録画映像を作成する。例えば、録画映像作成部８０８は、図５（Ａ）～（Ｄ）で説明した録画映像５１０、５２０、５３０、５４０等を作成する。The recordedvideo creation unit 808 also executes the process of step S1006, separately from the processes of steps S1004 and S1005. In step S1006, the recordedvideo creation unit 808 creates a recorded video in which images of a predetermined number of users, including the first user who has recently spoken, are displayed in the close-updisplay area 302 based on the order of speech, from the second video, which is a delayed version of the first video. For example, the recordedvideo creation unit 808 creates recordedvideos 510, 520, 530, 540, etc., as described in Figures 5 (A) to (D).

ステップＳ１００７において、録画映像管理部８０９は、録画映像作成部８０８が作成した録画映像を、例えば、ＳＳＤ６０４、又は外部機器接続Ｉ／Ｆ６０６に接続された外部記憶装置等に録画（記憶）する。In step S1007, the recordedvideo management unit 809 records (stores) the recorded video created by the recordedvideo creation unit 808, for example, in theSSD 604 or an external storage device connected to the external device connection I/F 606.

図１０の処理により、ビデオ会議システム１００は、第１の映像に基づいて、例えば、図４（Ａ）～（Ｄ）で説明した会議映像を作成するとともに、第２の映像に基づいて、例えば、図５（Ａ）～（Ｄ）で説明した録画映像を作成する。By the process of FIG. 10, thevideo conferencing system 100 creates, for example, the conference video described in FIG. 4 (A) to (D) based on the first video, and creates, for example, the recorded video described in FIG. 5 (A) to (D) based on the second video.

（録画映像の作成処理） (Recorded video creation process)
続いて、例えば、図１０のステップＳ１００６において、ビデオ会議システム１００が実行する録画映像の作成処理の例について、複数の実施形態を例示して説明する。Next, examples of the recorded video creation process that the video conference system 100 executes in, for example, step S1006 of FIG. 10 will be described with reference to a number of exemplary embodiments.

［第１の実施形態］ [First embodiment]
図１１は、第１の実施形態に係る録画映像の作成処理の例を示すフローチャートである。この処理は、例えば、図１０のステップＳ１００６において、録画映像作成部８０８が実行する録画映像の作成処理の一例を示している。FIG. 11 is a flowchart showing an example of a recorded video creation process according to the first embodiment. This process is an example of the recorded video creation process that the recorded video creation unit 808 executes in, for example, step S1006 of FIG. 10.

ステップＳ１１０１において、第１の映像を所定の時間遅延させた第２の映像において、新たに第１の利用者が発話すると、録画映像作成部８０８は、ステップＳ１１０２以降の処理を実行する。In step S1101, when the first user newly speaks in the second video, which is the first video delayed by a predetermined time, the recordedvideo creation unit 808 executes the process from step S1102 onwards.

ステップＳ１１０２において、録画映像作成部８０８は、所定のエリアの表示枠に空きがあるか否かを判断する。例えば、録画映像作成部８０８は、図５（Ａ）に示すような録画映像５１０のクローズアップ表示エリア３０２に空きがあるか否かを判断する。なお、クローズアップ表示エリア３０２は、所定のエリアの一例である。図５（Ａ）の例では、全ての表示枠３０２－１、３０２－２、３０２－３に利用者が表示されているので、録画映像作成部８０８は空きがないと判断する。In step S1102, the recordedvideo creation unit 808 determines whether there is any free space in the display frame of the specified area. For example, the recordedvideo creation unit 808 determines whether there is any free space in the close-updisplay area 302 of the recordedvideo 510 as shown in FIG. 5(A). Note that the close-updisplay area 302 is an example of a specified area. In the example of FIG. 5(A), users are displayed in all of the display frames 302-1, 302-2, and 302-3, so the recordedvideo creation unit 808 determines that there is no free space.

所定のエリアの表示枠に空きがある場合、録画映像作成部８０８は、処理をステップＳ１１０３に移行させる。一方、所定のエリアの表示枠に空きがない場合、録画映像作成部８０８は、処理をステップＳ１１０４に移行させる。If there is free space in the display frame of the specified area, the recordedvideo creation unit 808 transitions the process to step S1103. On the other hand, if there is no free space in the display frame of the specified area, the recordedvideo creation unit 808 transitions the process to step S1104.

ステップＳ１１０３に移行すると、録画映像作成部８０８は、空いている表示枠に第１の利用者の画像をクローズアップ表示した録画映像を作成する。When the process proceeds to step S1103, the recordedvideo creation unit 808 creates a recorded video in which a close-up image of the first user is displayed in the available display frame.

一方、ステップＳ１１０４に移行すると、録画映像作成部８０８は、第１の利用者の画像が所定のエリアに表示されているか否かを判断する。例えば、図５（Ａ）に示すような録画映像５１０の状態から、第２の映像において新たに利用者Ａが発話したものとする。この場合、クローズアップ表示エリア３０２には、既に利用者Ａの画像が表示されているので、録画映像作成部８０８は、第１の利用者の画像が所定のエリアに表示されていると判断する。On the other hand, when the process proceeds to step S1104, the recordedvideo creation unit 808 determines whether or not the image of the first user is displayed in the specified area. For example, assume that user A newly speaks in the second video from the state of the recordedvideo 510 shown in FIG. 5(A). In this case, since the image of user A is already displayed in the close-updisplay area 302, the recordedvideo creation unit 808 determines that the image of the first user is displayed in the specified area.

第１の利用者の画像が所定のエリアに表示されている場合、録画映像作成部８０８は、処理をステップＳ１１０５に移行させる。一方、第１の利用者の画像が所定のエリアに表示されていない場合、録画映像作成部８０８は、処理をステップＳ１１０６に移行させる。If the image of the first user is displayed in the specified area, the recordedvideo creation unit 808 transitions the process to step S1105. On the other hand, if the image of the first user is not displayed in the specified area, the recordedvideo creation unit 808 transitions the process to step S1106.

ステップＳ１１０５に移行すると、録画映像作成部８０８は、現在の所定のエリアのレイアウトを維持して、録画映像を作成する。例えば、図５（Ｂ）に示すような録画映像５２０の状態から、第２の映像において新たに利用者Ａが発話したものとする。この場合、録画映像作成部８０８は、クローズアップ表示エリア３０２のレイアウトを変更せずに、例えば、図５（Ｃ）に示すような録画映像５３０を作成する。When the process proceeds to step S1105, the recordedvideo creation unit 808 creates a recorded video while maintaining the current layout of the specified area. For example, assume that user A newly speaks in the second video from the state of recordedvideo 520 shown in FIG. 5(B). In this case, the recordedvideo creation unit 808 creates, for example, recordedvideo 530 shown in FIG. 5(C) without changing the layout of the close-updisplay area 302.

一方、ステップＳ１１０６に移行すると、録画映像作成部８０８は、特定部８０６が特定した発話順序に基づいて、第１の利用者の次に発話する第２の利用者の画像が、所定のエリアに表示されているか否かを判断する。第２の利用者の画像が、所定のエリア（クローズアップ表示エリア３０２）に表示されている場合、録画映像作成部８０８は、処理をステップＳ１１０７に移行させる。一方、第２の利用者の画像が、所定のエリアに表示されていない場合、録画映像作成部８０８は、処理をステップＳ１１０８に移行させる。On the other hand, when the process proceeds to step S1106, the recordedvideo creation unit 808 determines whether or not an image of the second user who will speak after the first user is displayed in a specified area based on the speaking order identified by theidentification unit 806. If an image of the second user is displayed in the specified area (close-up display area 302), the recordedvideo creation unit 808 proceeds to step S1107. On the other hand, if an image of the second user is not displayed in the specified area, the recordedvideo creation unit 808 proceeds to step S1108.

ステップＳ１１０７に移行すると、録画映像作成部８０８は、第２の利用者の画像の表示枠を維持する。When proceeding to step S1107, the recordedvideo creation unit 808 maintains the display frame of the image of the second user.

ステップＳ１１０８に移行すると、録画映像作成部８０８は、特定部８０６が特定した発話順序に基づいて、第２の利用者の次に発話する第３の利用者の画像が、所定のエリアに表示されているか否かを判断する。第３の利用者の画像が、所定のエリア（クローズアップ表示エリア３０２）に表示されている場合、録画映像作成部８０８は、処理をステップＳ１１０９に移行させる。一方、第３の利用者の画像が、所定のエリアに表示されていない場合、録画映像作成部８０８は、処理をステップＳ１１１０に移行させる。When the process proceeds to step S1108, the recordedvideo creation unit 808 determines whether or not an image of a third user who will speak after the second user is displayed in a specified area based on the speaking order identified by theidentification unit 806. If an image of the third user is displayed in the specified area (close-up display area 302), the recordedvideo creation unit 808 proceeds to step S1109. On the other hand, if an image of the third user is not displayed in the specified area, the recordedvideo creation unit 808 proceeds to step S1110.

When the process proceeds to step S1109, the recorded video creation unit 808 keeps the display frame that shows the image of the third user.

In step S1110, the recorded video creation unit 808 creates a recorded video in which the image of the first user is displayed in close-up in the display frame that has the oldest timestamp among the remaining display frames. Here, each display frame is assumed to carry, for example, a timestamp indicating the time at which its image was last updated.

Through the process of FIG. 11, if an image of the second user is displayed in the predetermined area when the first user speaks, the recorded video creation unit 808 creates a recorded video that displays at least the image of the first user and the image of the second user in the predetermined area.

Likewise, if an image of the third user is displayed in the predetermined area when the first user speaks, the recorded video creation unit 808 creates a recorded video that displays at least the image of the first user and the image of the third user in the predetermined area.

Furthermore, if both an image of the second user and an image of the third user are displayed in the predetermined area when the first user speaks, the recorded video creation unit 808 creates a recorded video that displays the image of the first user, the image of the second user, and the image of the third user in the predetermined area.

In addition, if an image of the first user is already displayed in the predetermined area when the first user speaks, the recorded video creation unit 808 creates the recorded video without changing the display of the predetermined area.
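The frame selection of steps S1105 to S1110 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the recorded video creation unit 808: the `Frame` class, the `select_frame` function, and the use of floating-point timestamps are hypothetical, and the sketch assumes that every display frame is already occupied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    user: str          # user currently shown in this display frame
    updated_at: float  # hypothetical timestamp of the last image update


def select_frame(frames: List[Frame], first: str,
                 second: Optional[str], third: Optional[str]) -> Optional[int]:
    """Return the index of the frame to overwrite with `first`,
    or None when the layout should be left unchanged."""
    # S1105: the new speaker is already displayed, so keep the layout.
    if any(f.user == first for f in frames):
        return None
    # S1106-S1109: keep the frames that show the next speaker (second)
    # and the speaker after next (third).
    protected = {second, third}
    candidates = [i for i, f in enumerate(frames) if f.user not in protected]
    if not candidates:
        # Assumption: the source does not describe this case; do nothing.
        return None
    # S1110: among the remaining frames, overwrite the oldest one.
    return min(candidates, key=lambda i: frames[i].updated_at)


frames = [Frame("A", 1.0), Frame("B", 2.0), Frame("C", 3.0)]
print(select_frame(frames, "D", "A", "B"))  # index of the frame showing C
```

Keeping the protected users in a set makes the two checks (second user, third user) symmetrical, which mirrors how steps S1107 and S1109 both simply preserve a frame.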

[Second embodiment]
FIG. 12 is a flowchart showing an example of the recorded video creation process according to the second embodiment. This is another example of the recorded video creation process that the recorded video creation unit 808 executes in, for example, step S1006 of FIG. 10. In this process, step S1201 is added after step S1101 of the recorded video creation process according to the first embodiment described with reference to FIG. 11. The processing from step S1102 onward is the same as in the first embodiment, so its description is omitted here.

In step S1101, when the first user newly speaks in the second video, which is the first video delayed by a predetermined time, the recorded video creation unit 808 executes the process of step S1201.

In step S1201, the recorded video creation unit 808 determines whether the speaking time of the first user in the second video is less than a predetermined time (for example, about 1 to 3 seconds). Because the second video is a delayed version of the first video, the video conference system 100 can obtain the first user's speaking time in advance from the first video. Alternatively, because the recorded video does not have to be created in real time, the recorded video creation unit 808 may wait for the predetermined time and then determine from the second video whether the first user's speaking time is less than the predetermined time.

If the speaking time is not less than the predetermined time, the recorded video creation unit 808 executes the processing from step S1102 onward. If the speaking time is less than the predetermined time, the recorded video creation unit 808 advances the process to step S1105.

Through the process of FIG. 12, if the speaking time of the first user is less than the predetermined time, the video conference system 100 creates the recorded video while maintaining the layout of the close-up display area 302. Therefore, by setting the predetermined time to an appropriate value, the video conference system 100 can keep short utterances such as "yes" and "no" from frequently switching the speakers displayed in close-up.
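The check added in step S1201 amounts to a simple threshold test. The sketch below is a hedged illustration: the function name `should_update_layout` and the default threshold of 2.0 seconds are assumptions (the text only gives "about 1 to 3 seconds" as an example range).

```python
def should_update_layout(speaking_seconds: float,
                         min_seconds: float = 2.0) -> bool:
    """S1201: return False when the utterance is shorter than the
    threshold, so that the layout is kept (step S1105); return True
    when the processing from step S1102 onward should run."""
    return speaking_seconds >= min_seconds


# A short "yes" does not change the layout; a longer remark may.
print(should_update_layout(0.4))  # False
print(should_update_layout(5.0))  # True
```

Because the second video lags the first, the speaking time passed in here can be measured on the first video before the corresponding frames reach the recording path.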

[Third embodiment]
FIG. 13 is a flowchart showing an example of the recorded video creation process according to the third embodiment. This is a more specific example of the recorded video creation process that the video conference system 100 executes in, for example, step S1006 of FIG. 10. The following description assumes that the video conference system 100 consists of the video conference terminal 101 and a display device, as shown in FIG. 8.

In step S1301, the video conference system 100 performs initial system setup. For example, the video conference system 100 initializes the video conference terminal 101.

In step S1302, the video conference terminal 101 initializes the cameras 614-1, 614-2, ..., the microphone array 611, the speaker 612, and so on.

In step S1303, the video conference terminal 101 confirms the connection with another video conference system 100 and starts the video conference. The video conference terminal 101 also sets the recording conditions and starts recording. Preferably, recording can be paused or stopped at any time.

In steps S1304 and S1305, when user A speaks in the second video, the video conference terminal 101 displays user A in close-up in the close-up display area 302 of the recorded video. Here, the close-up display area 302 is assumed to have three display frames.

In steps S1306 and S1307, when user B speaks in the second video, the video conference terminal 101 displays user B in close-up in the close-up display area 302 of the recorded video.

In steps S1308 and S1309, when user C speaks in the second video, the video conference terminal 101 displays user C in close-up in the close-up display area 302 of the recorded video. At this point, the close-up display area 302 of the recorded video shows the images of user A, user B, and user C.

In step S1310, once users are displayed in all display frames of the close-up display area 302, the video conference terminal 101 acquires the speech order identified by the identification unit 806 and determines (updates) the priority of the display frames in the close-up display area 302. In FIG. 13, the speech order (-->D-->A) indicates that the next user to speak is user D and that the user to speak after user D is user A. The following description assumes that the upcoming speakers proceed in the order D-->A-->B-->C-->E-->A. The priority (B>A>C) indicates that the display frame showing the image of user B has the highest priority and the display frame showing the image of user C has the lowest priority.

In steps S1311 and S1312, when user D speaks in the second video, the video conference terminal 101 acquires the speech order (-->A-->B) and determines the priority of users A, B, and C. For example, the speech order (-->A-->B) means that users A and B should have higher priority than user C, and because A speaks before B, the video conference terminal 101 sets the priority to (A>B>C).

In step S1313, the video conference terminal 101 displays the image of user D in close-up in display frame 302-3, which shows the image of user C, the user with the lowest priority. This creates, for example, the recorded video 510 shown in FIG. 5(A). In step S1314, the video conference terminal 101 then updates the priority to (A>B>D).

In steps S1315 and S1316, when user A speaks in the second video, the image of user A is already displayed in the close-up display area 302 of the recorded video 510, so the video conference terminal 101 maintains the layout of the close-up display area 302. This creates, for example, the recorded video 520 shown in FIG. 5(B).

In step S1317, the video conference terminal 101 acquires the speech order (-->B-->C) and determines the priority of users A, B, and D. For example, the speech order (-->B-->C) means that user B should have higher priority than users A and D, and because user D spoke further in the past than user A, the video conference terminal 101 updates the priority to (B>A>D).

In steps S1318 and S1319, when user B speaks in the second video, the image of user B is already displayed in the close-up display area 302 of the recorded video 510, so the video conference terminal 101 maintains the layout of the close-up display area 302. This creates, for example, the recorded video 530 shown in FIG. 5(C).

In step S1320, the video conference terminal 101 acquires the speech order (-->C-->E) and determines the priority of users A, B, and D. For example, the speech order (-->C-->E) contains no displayed user whose priority should be raised, so the video conference terminal 101 ranks the displayed users from the most recent past speaker to the least recent and updates the priority to (B>A>D).

In steps S1321 and S1322, when user C speaks in the second video, the video conference terminal 101 acquires the speech order (-->E-->A) and determines the priority of users A, B, and D. For example, the speech order (-->E-->A) means that user A should have higher priority than users B and D, and because user D spoke further in the past than user B, the video conference terminal 101 sets the priority to (A>B>D).

In step S1323, the video conference terminal 101 displays the image of user C in close-up in display frame 302-3, which shows the image of user D, the user with the lowest priority. This creates, for example, the recorded video 540 shown in FIG. 5(D). In step S1324, the video conference terminal 101 then updates the priority of users A, B, and C. Preferably, because user B spoke further in the past than user C, the video conference terminal 101 updates the priority to (A>C>B).
The video conference terminal 101 repeats the same processing until recording is complete.
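The replacement rule used in steps S1313, S1316, S1319, and S1323 (overwrite the frame of the lowest-priority user unless the speaker is already displayed) can be sketched as follows. The function name `place_speaker` and the list representation are hypothetical, and the positions of users A and B in the first two frames are an assumption; the text only states that users C and D occupy display frame 302-3.

```python
from typing import List


def place_speaker(frames: List[str], priority: List[str],
                  speaker: str) -> List[str]:
    """Return the frame contents after `speaker` talks.

    frames   -- users shown in display frames 302-1, 302-2, 302-3
    priority -- the same users, ordered from highest to lowest priority
    """
    if speaker in frames:
        return list(frames)        # already displayed: keep the layout
    lowest = priority[-1]          # user whose frame may be overwritten
    updated = list(frames)
    updated[updated.index(lowest)] = speaker
    return updated


frames = ["A", "B", "C"]
frames = place_speaker(frames, ["A", "B", "C"], "D")  # S1313: D replaces C
frames = place_speaker(frames, ["A", "B", "D"], "A")  # S1316: layout kept
frames = place_speaker(frames, ["B", "A", "D"], "B")  # S1319: layout kept
frames = place_speaker(frames, ["A", "B", "D"], "C")  # S1323: C replaces D
print(frames)  # ['A', 'B', 'C']
```

Over the four utterances D, A, B, C, only display frame 302-3 changes, which is the reduction in switching that the lookahead on the speech order is meant to achieve.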

(Priority determination process)
FIG. 14 is a flowchart showing an example of the priority determination process according to the third embodiment. This is an example of the priority determination process that the video conference terminal 101 executes in, for example, steps S1310, S1313, S1316, and S1319 of FIG. 13.

At the start of the process shown in FIG. 14, it is assumed that user Z has just been newly displayed in the close-up display area 302, so that the close-up display area 302 shows the images of user X, user Y, and user Z.

In step S1401, the video conference terminal 101 determines whether user X or user Y is the next speaker. If user X or user Y is the next speaker, the video conference terminal 101 advances the process to step S1402. If neither user X nor user Y is the next speaker, the video conference terminal 101 advances the process to step S1405.

When the process proceeds to step S1402, the video conference terminal 101 determines whether user X or user Y is the speaker after next. If user X or user Y is the speaker after next, the video conference terminal 101 advances the process to step S1403. If neither user X nor user Y is the speaker after next, the video conference terminal 101 advances the process to step S1404.

When the process proceeds to step S1403, the video conference terminal 101 determines whether user X is the next speaker. If user X is the next speaker, the video conference terminal 101 sets the priority to "X>Y>Z". If user X is not the next speaker, the video conference terminal 101 sets the priority to "Y>X>Z".

When the process proceeds to step S1404, the video conference terminal 101 determines whether user X is the next speaker. If user X is the next speaker, the video conference terminal 101 sets the priority to "X>Z>Y". If user X is not the next speaker, the video conference terminal 101 sets the priority to "Y>Z>X".

When the process proceeds to step S1405, the video conference terminal 101 determines whether user X or user Y is the speaker after next. If user X or user Y is the speaker after next, the video conference terminal 101 advances the process to step S1406. If neither user X nor user Y is the speaker after next, the video conference terminal 101 advances the process to step S1407.

When the process proceeds to step S1406, the video conference terminal 101 determines whether user X is the speaker after next. If user X is the speaker after next, the video conference terminal 101 sets the priority to "X>Z>Y". If user X is not the speaker after next, the video conference terminal 101 sets the priority to "Y>Z>X".

When the process proceeds to step S1407, the video conference terminal 101 determines whether user Y spoke further in the past than user X. If so, the video conference terminal 101 sets the priority to "Z>X>Y". Otherwise, the video conference terminal 101 sets the priority to "Z>Y>X".

Through the process of FIG. 14, the video conference terminal 101 can determine (update) the priority order of user X, user Y, and user Z based on the speech order so that the next speaker and the speaker after next rank high. The process shown in FIG. 14 is only an example, however. The video conference terminal 101 may use another method to determine the users' priority order based on the speech order so that the next speaker and the speaker after next rank high.
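The decision tree of steps S1401 to S1407 can be written compactly as below. This is a sketch under assumptions rather than the terminal's actual code: the function name `decide_priority`, its parameters, and the `last_spoke` mapping (user name to the position of that user's most recent utterance, where a smaller value means further in the past) are all hypothetical.

```python
from typing import List, Mapping, Optional


def decide_priority(x: str, y: str, z: str,
                    next_speaker: Optional[str],
                    after_next: Optional[str],
                    last_spoke: Mapping[str, int]) -> List[str]:
    """Order the displayed users x, y, z from highest to lowest priority.
    z is the user who was displayed most recently."""
    if next_speaker in (x, y):                            # S1401
        lead, other = (x, y) if next_speaker == x else (y, x)
        if after_next in (x, y):                          # S1402 -> S1403
            return [lead, other, z]
        return [lead, z, other]                           # S1404
    if after_next in (x, y):                              # S1405 -> S1406
        lead, other = (x, y) if after_next == x else (y, x)
        return [lead, z, other]
    # S1407: neither x nor y speaks soon; the older speaker ranks last.
    if last_spoke[y] < last_spoke[x]:
        return [z, x, y]
    return [z, y, x]


# Priorities stated in the FIG. 13 walkthrough:
print(decide_priority("A", "B", "D", "A", "B", {}))                # S1314
print(decide_priority("B", "D", "A", "B", "C", {}))                # S1317
print(decide_priority("A", "D", "B", "C", "E", {"D": 1, "A": 2}))  # S1320
print(decide_priority("A", "B", "C", "E", "A", {}))                # S1324
```

Under these assumptions the four calls return (A>B>D), (B>A>D), (B>A>D), and (A>C>B), matching the priorities given for steps S1314, S1317, S1320, and S1324.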

(Illustration of the conference video and the recorded video)
FIG. 15 is a diagram illustrating the conference video creation process according to one embodiment. As shown in FIG. 15, for example, the conference video creation unit 810 creates the conference video at the same timing as the audio data and the camera video stream. Conventionally, this conference video has been recorded as is, so the images of users displayed in the close-up display area 302 switch frequently. In addition, with this method, the time t required for speaker detection and similar factors delay the close-up display of a user who has newly started speaking.

FIG. 16 is a diagram illustrating the recorded video creation process according to one embodiment. As shown in FIG. 16, for example, the recorded video creation unit 808 lays out the recorded video using a camera video stream that has been delayed by a predetermined time (the recorded video delay time). This allows the recorded video creation unit 808 to lay out the recorded video, based on the speech order identified in advance, so that the images of users displayed in the close-up display area 302 switch less often. In addition, this method is not affected by the time t required for speaker detection, so it can also eliminate the delay before a newly speaking user is displayed in close-up.
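The delayed camera video stream can be modeled as a fixed-length first-in, first-out buffer. The sketch below is illustrative only: the class name `DelayBuffer` and the choice to express the delay as a number of frames (rather than as a time) are assumptions.

```python
from collections import deque
from typing import Deque, Generic, Optional, TypeVar

T = TypeVar("T")


class DelayBuffer(Generic[T]):
    """Hold each item for `delay` pushes before releasing it."""

    def __init__(self, delay: int) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self._delay = delay
        self._queue: Deque[T] = deque()

    def push(self, frame: T) -> Optional[T]:
        """Store a frame of the first video; return the frame of the
        second (delayed) video, or None while the buffer is filling."""
        self._queue.append(frame)
        if len(self._queue) > self._delay:
            return self._queue.popleft()
        return None


buf: DelayBuffer[str] = DelayBuffer(delay=2)
for name in ["f0", "f1", "f2", "f3"]:
    print(name, "->", buf.push(name))
```

While a frame waits in the buffer, speaker detection can run on the undelayed stream, so the speech order for the next few utterances is already known when that frame is laid out.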

As described above, according to the embodiments of the present invention, a video conference system 100 that displays the most recent speakers in close-up can provide a recorded video in which the speakers displayed in close-up switch less frequently.

<Supplement>
Each function of the embodiments described above can be implemented by one or more processing circuits. In this specification, a "processing circuit" includes a processor programmed to execute each function by software, such as a processor implemented by an electronic circuit, as well as devices designed to execute the functions described above, such as an ASIC (Application Specific Integrated Circuit), a DSP (digital signal processor), an FPGA (field programmable gate array), and conventional circuit modules.

<Appendix>
This specification discloses the video conference system and the recorded video creation method set out in the following items.
(Item 1)
A video conference system that records video of a video conference, comprising:
a direction detection unit that detects a sound direction based on audio of the video conference acquired by a microphone array;
an image detection unit that detects an image of a person from a first video of the video conference captured by one or more cameras;
an identification unit that identifies, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation unit that creates, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference that displays, in a predetermined area and based on the speech order, images of a predetermined number of users including a first user who has newly spoken.
(Item 2)
The video conference system according to Item 1, wherein, if an image of a second user who speaks next after the first user is displayed in the predetermined area when the first user speaks in the second video, the recorded video creation unit creates the recorded video that displays at least the image of the first user and the image of the second user in the predetermined area.
(Item 3)
The video conference system according to Item 1 or 2, wherein, if an image of a third user who speaks next after a second user, who in turn speaks next after the first user, is displayed in the predetermined area when the first user speaks in the second video, the recorded video creation unit creates the recorded video that displays at least the image of the first user and the image of the third user in the predetermined area.
(Item 4)
The video conference system according to any one of Items 1 to 3, wherein, if an image of the first user is displayed in the predetermined area when the first user speaks in the second video, the recorded video creation unit creates the recorded video without changing the display of the predetermined area.
(Item 5)
The video conference system according to any one of Items 1 to 4, wherein, if a user's speaking time in the second video is less than a predetermined time, the recorded video creation unit creates the recorded video without changing the display of the predetermined area.
(Item 6)
The video conference system according to any one of Items 1 to 5, further comprising a conference video creation unit that creates, from the first video, a conference video of the video conference that displays, in the predetermined area, images of the predetermined number of users who spoke later than the other users.
(Item 7)
The video conference system according to Item 6, wherein the recorded video creation unit creates the recorded video such that the images of users displayed in the predetermined area change less often than in the conference video.
(Item 8)
The video conference system according to any one of Items 1 to 7, further comprising a delay buffer that holds the first video for the predetermined time and then outputs the second video.
(Item 9)
A video conference system that records video of a video conference and includes a first device that controls the video conference and a second device that has a microphone array and one or more cameras and is connected to the first device, wherein the second device comprises:
a direction detection unit that detects a sound direction based on audio acquired by the microphone array;
an image detection unit that detects an image of a person from a first video captured by the cameras;
an identification unit that identifies, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation unit that creates, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference that displays, in a predetermined area and based on the speech order, images of a predetermined number of users including a first user who has newly spoken.
(Item 10)
A recorded video creation method in which a video conference system that records video of a video conference executes:
a direction detection process of detecting a sound direction based on audio of the video conference acquired by a microphone array;
an image detection process of detecting an image of a person from a first video of the video conference captured by one or more cameras;
an identification process of identifying, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation process of creating, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference that displays, in a predetermined area and based on the speech order, images of a predetermined number of users including a first user who has newly spoken.

Although embodiments of the present invention have been described above, the present invention is not limited to these specific embodiments, and various modifications and applications are possible within the scope of the gist of the present invention as set forth in the claims.

1 Communication system
10 Conference server
100 Video conference system
101 Video conference terminal
201 Web conference device (second device)
202 PC (first device)
211 IWB (first device)
302 Close-up display area (predetermined area)
611 Microphone array
614-1, 614-2 Camera
615 Video delay buffer
803 Direction detection unit
805 Image detection unit
806 Identification unit
807 Video delay unit
808 Recorded video creation unit
810 Conference video creation unit

JP 2009-182980 A
JP 2017-34502 A

Claims

A video conference system that records video of a video conference, comprising:
a direction detection unit that detects a sound direction based on audio of the video conference acquired by a microphone array;
an image detection unit that detects an image of a person from a first video of the video conference captured by one or more cameras;
an identification unit that identifies, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation unit that creates, from a second video obtained by delaying the first video by a predetermined time, a recorded video of the video conference that displays, in a predetermined area and based on the speech order, images of a predetermined number of users including a first user who has newly spoken.

The video conference system according to claim 1, wherein, if an image of a second user who speaks next after the first user is displayed in the predetermined area when the first user speaks in the second video, the recorded video creation unit creates the recorded video that displays at least the image of the first user and the image of the second user in the predetermined area.

The video conference system according to claim 1 or 2, wherein, if an image of a third user who speaks next after a second user, who in turn speaks next after the first user, is displayed in the predetermined area when the first user speaks in the second video, the recorded video creation unit creates the recorded video that displays at least the image of the first user and the image of the third user in the predetermined area.

前記録画映像作成部は、前記第２の映像において、前記第１の利用者が発話したときに、前記第１の利用者の画像が前記所定のエリアに表示されている場合、前記所定のエリアの表示を変更せずに、前記録画映像を作成する、請求項１に記載のビデオ会議システム。The video conference system according to claim 1, wherein, when the first user speaks in the second video and the image of the first user is already displayed in the predetermined area, the recorded video creation unit creates the recorded video without changing the display of the predetermined area.
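
One possible reading of the display rules in claims 2 to 4 is sketched below. Because the second video is delayed, the speakers who follow the first user are already known when the first user starts speaking. The eviction rule used here (replace the displayed user whose next utterance is farthest away or absent) is an assumption chosen to be consistent with those claims, not a rule stated in them; `update_display` and its parameters are hypothetical names.

```python
def update_display(displayed, new_speaker, upcoming):
    """Return the users shown in the close-up area after new_speaker starts.

    displayed   : user ids currently shown (assumed unique, fixed length)
    new_speaker : user who starts speaking in the delayed (second) video
    upcoming    : speakers who follow new_speaker, in order, known
                  because the second video lags the first video
    """
    if new_speaker in displayed:
        # Claim 4: the speaker is already shown, so nothing changes.
        return list(displayed)

    def next_turn(user):
        # Position of the user's next utterance; users with none sort last.
        return upcoming.index(user) if user in upcoming else len(upcoming)

    # Assumed rule: evict the user whose next utterance is farthest away.
    # max() returns the first maximal element, so ties evict the earliest slot.
    victim = max(displayed, key=next_turn)
    return [new_speaker if u == victim else u for u in displayed]
```

For example, with two slots showing `["B", "C"]`, new speaker `"A"`, and upcoming speakers `["C", "D"]`, user `"C"` stays on screen because `"C"` speaks next, which matches the situation described in claim 2.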

前記録画映像作成部は、前記第２の映像において、利用者の発話時間が所定の時間未満である場合、前記所定のエリアの表示を変更せずに、前記録画映像を作成する、請求項１に記載のビデオ会議システム。The video conference system according to claim 1, wherein, when a user's speaking time in the second video is less than a predetermined time, the recorded video creation unit creates the recorded video without changing the display of the predetermined area.
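
The short-utterance rule of claim 5 could be approximated by dropping brief utterances before the display is updated, as in the minimal sketch below. The tuple layout `(user, start, end)` and the 1.0 second threshold are assumptions; the claim only says "a predetermined time".

```python
def filter_short_utterances(utterances, min_duration_s=1.0):
    """Keep only utterances long enough to trigger a display change.

    utterances : list of (user_id, start_s, end_s) tuples (assumed layout)
    """
    return [u for u in utterances if u[2] - u[1] >= min_duration_s]
```

The end time of an utterance is only available because the recorded video is built from the delayed second video.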

前記第１の映像から、他の利用者よりも後に発話した前記所定の数の利用者の画像を前記所定のエリアに表示する前記ビデオ会議の会議映像を作成する会議映像作成部を有する、請求項１に記載のビデオ会議システム。The video conference system according to claim 1, further comprising a conference video creation unit that creates, from the first video, a conference video of the video conference that displays, in the predetermined area, images of the predetermined number of users who have spoken more recently than the other users.
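
The selection in claim 6 amounts to choosing the most recent distinct speakers from the speech order. A minimal sketch, assuming the speech order is a list of user ids and using the hypothetical name `most_recent_speakers`:

```python
def most_recent_speakers(speech_order, slots):
    """Return up to `slots` distinct users, ordered oldest to newest speaker."""
    if slots <= 0:
        return []
    recent = []
    for user in speech_order:
        if user in recent:
            recent.remove(user)  # move a repeated speaker to the newest position
        recent.append(user)
    return recent[-slots:]
```

Unlike the recorded video, this selection uses only past utterances, since the conference video is created from the undelayed first video.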

前記録画映像作成部は、前記所定のエリアに表示される利用者の画像の変化が、前記会議映像より少ない前記録画映像を作成する、請求項６に記載のビデオ会議システム。The video conference system according to claim 6, wherein the recorded video creation unit creates the recorded video with fewer changes in the user images displayed in the predetermined area than in the conference video.

前記第１の映像を前記所定の時間保持した後に、前記第２の映像を出力する映像遅延バッファを有する、請求項１に記載のビデオ会議システム。The video conference system according to claim 1, further comprising a video delay buffer that outputs the second video after holding the first video for the predetermined time.
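
A video delay buffer of the kind recited in claim 8 can be sketched as a first-in, first-out queue. Expressing the predetermined time as a frame count, and the class and method names, are assumptions made for this illustration.

```python
from collections import deque


class VideoDelayBuffer:
    """Holds frames of the first video and releases them after a fixed delay."""

    def __init__(self, delay_frames):
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        self._delay = delay_frames
        self._frames = deque()

    def push(self, frame):
        """Store a frame; return the delayed frame, or None while filling."""
        self._frames.append(frame)
        if len(self._frames) > self._delay:
            return self._frames.popleft()
        return None
```

With `delay_frames=2`, the frame pushed first is returned when the third frame arrives, so the output stream (the second video) lags the input by two frames.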

ビデオ会議を制御する第１の装置と、マイクアレイと１つ以上のカメラとを備え、第１の装置に接続される第２の装置と、を含み前記ビデオ会議の映像を録画するビデオ会議システムであって、
前記第２の装置は、
前記マイクアレイで取得した音声に基づいて音の方向を検知する方向検知部と、
前記カメラで撮影した第１の映像から人物の画像を検知する画像検知部と、
前記音の方向と前記人物の画像とに基づいて、前記ビデオ会議システムを利用して前記ビデオ会議に参加する利用者の発話順序を特定する特定部と、
前記第１の映像を所定の時間遅延させた第２の映像から、前記発話順序に基づいて、新たに発話した第１の利用者を含む所定の数の利用者の画像を所定のエリアに表示する前記ビデオ会議の録画映像を作成する録画映像作成部と、
を有する、ビデオ会議システム。 A video conference system that records video of a video conference, the system including a first device that controls the video conference and a second device that has a microphone array and one or more cameras and is connected to the first device,
wherein the second device comprises:
a direction detection unit that detects a sound direction based on audio acquired by the microphone array;
an image detection unit that detects an image of a person from a first video captured by the camera;
an identification unit that identifies, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation unit that creates, based on the speech order, a recorded video of the video conference from a second video obtained by delaying the first video by a predetermined time, the recorded video displaying, in a predetermined area, images of a predetermined number of users including a first user who has newly spoken.

ビデオ会議の映像を録画するビデオ会議システムが、
マイクアレイで取得した前記ビデオ会議の音声に基づいて音の方向を検知する方向検知処理と、
１つ以上のカメラで撮影した前記ビデオ会議の第１の映像から人物の画像を検知する画像検知処理と、
前記音の方向と前記人物の画像とに基づいて、前記ビデオ会議システムを利用して前記ビデオ会議に参加する利用者の発話順序を特定する特定処理と、
前記第１の映像を所定の時間遅延させた第２の映像から、前記発話順序に基づいて、新たに発話した第１の利用者を含む所定の数の利用者の画像を所定のエリアに表示する前記ビデオ会議の録画映像を作成する録画映像作成処理と、
を実行する、録画映像作成方法。 A recorded video creation method in which a video conference system that records video of a video conference executes:
a direction detection process of detecting a sound direction based on audio of the video conference acquired by a microphone array;
an image detection process of detecting an image of a person from a first video of the video conference captured by one or more cameras;
an identification process of identifying, based on the sound direction and the image of the person, a speech order of users participating in the video conference using the video conference system; and
a recorded video creation process of creating, based on the speech order, a recorded video of the video conference from a second video obtained by delaying the first video by a predetermined time, the recorded video displaying, in a predetermined area, images of a predetermined number of users including a first user who has newly spoken.