JP7694217B2


Info

Publication number: JP7694217B2
Application number: JP2021116986A
Authority: JP
Inventors: 卓也益子; 淳悦伊藤; 彰太貫; 太郎三浦
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2021-07-15
Filing date: 2021-07-15
Publication date: 2025-06-18
Anticipated expiration: 2041-07-15
Also published as: JP2023013073A

Description

The present invention relates to a communication terminal, a determination method, and a program.

Conference systems in which conferences are held using communication terminals are known. For example, Patent Document 1 discloses a conference audio system composed of a plurality of microphones, a microcomputer equipped with audio level detection means and audio data storage means, and a speaker.

Patent Document 1: International Publication No. 2007/013180

The conference audio system described in Patent Document 1 is equipped with an automatic mute release device that releases the mute when sound is picked up by a microphone. However, because that system releases the mute merely because sound was picked up, sound that carries no intention to speak, such as a cough or an incidental noise, can be picked up by the microphone and cause the mute to be released while the user does not intend to speak.

The present invention has been made in view of the above, and aims to provide a communication terminal that can perform operations related to mute release in a manner that matches the user's intention.

A communication terminal according to the present invention is a communication terminal capable of constructing, together with other communication terminals, an audio transmission/reception system in which audio data is mutually transmitted and received. The communication terminal has: an audio data generation unit that generates first audio data from input audio input to the communication terminal; a mute state determination unit that determines whether the communication terminal is in a mute state, which is a state in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and a mute release determination unit that, in the mute state, determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of the audio represented by second audio data transmitted from the other communication terminals. The communication terminal is characterized in that the mute release determination unit determines that the mute state should be released when the audio represented by the first audio data contains a predetermined keyword.

A determination method according to the present invention is a determination method performed by a communication terminal capable of constructing, together with other communication terminals, an audio transmission/reception system in which audio data is mutually transmitted and received. The method has: an audio data generation step in which an audio data generation unit generates first audio data from input audio input to the communication terminal; a mute state determination step in which a mute state determination unit determines whether the communication terminal is in a mute state, which is a state in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and a mute release determination step in which, when the mute state determination unit determines that the communication terminal is in the mute state, a mute release determination unit determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of the audio represented by second audio data transmitted from the other communication terminals. The method is characterized in that the mute release determination unit determines that the mute state should be released when the audio represented by the first audio data contains a predetermined keyword.

A program according to the present invention is a program to be executed by a communication terminal capable of constructing, together with other communication terminals, an audio transmission/reception system in which audio data is mutually transmitted and received. The program has: an audio data generation step in which an audio data generation unit generates first audio data from input audio input to the communication terminal; a mute state determination step in which a mute state determination unit determines whether the communication terminal is in a mute state, which is a state in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and a mute release determination step in which, when the mute state determination unit determines that the communication terminal is in the mute state, a mute release determination unit determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of the audio represented by second audio data transmitted from the other communication terminals. In the program, the mute release determination unit determines that the mute state should be released when the audio represented by the first audio data contains a predetermined keyword.

Figure 1 is a diagram showing the configuration of a conference system according to a first embodiment.
Figure 2 is a block diagram showing the configuration of a communication terminal according to the first embodiment.
Figure 3 is a block diagram showing the configuration of a conference server according to the first embodiment.
Figure 4 is a flowchart showing a control routine of the communication terminal according to the first embodiment.
Figure 5 is a diagram showing the configuration of a conference system according to a second embodiment.
Figure 6 is a block diagram showing the configuration of a communication terminal according to the second embodiment.
Figure 7 is a table showing an example of keywords stored in the communication terminal according to the second embodiment.
Figure 8 is a block diagram showing the configuration of a voice recognition server according to the second embodiment.
Figure 9 is a flowchart showing a control routine of the communication terminal according to the second embodiment.
Figure 10 is a flowchart showing a control routine of the voice recognition server according to the second embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. In the drawings, identical components are given identical reference numerals, and redundant descriptions of those components are omitted.

Figure 1 is a diagram showing a conference system 100 as an audio transmission/reception system according to the first embodiment. The following description covers the case in which the conference system 100 is constructed by connecting three communication terminals 10, 11, and 12 and a conference server 14 so that they can communicate via a network NW. The number of communication terminals constituting the conference system 100 is of course not limited to the three shown in Figure 1, and may be any number that the capacity of the system allows.

The network NW is a wired or wireless communication network capable of two-way data communication, such as a WAN (Wide Area Network), a LAN (Local Area Network), or a public communication line (public line).

Each of the communication terminals 10, 11, and 12 is a communication terminal that can connect to the conference server 14 via the network NW. Once the conference server 14 has connected the communication terminals 10, 11, and 12 so that they can communicate with one another, they can transmit and receive audio data to and from one another via the conference server 14. In this embodiment, each of the communication terminals 10, 11, and 12 is a PC (Personal Computer) capable of transmitting and receiving audio data.

The conference server 14 is a communication device that establishes an individual connection with each of the communication terminals 10, 11, and 12 via the network NW and places the communication terminals 10, 11, and 12 in a state in which they can transmit and receive audio data to and from one another.

In this embodiment, a conference application for constructing the conference system 100 is installed on each of the communication terminals 10, 11, and 12. By responding to connection requests issued through this application by each of the communication terminals 10, 11, and 12, the conference server 14 can place the communication terminals 10, 11, and 12 in a state in which they can transmit and receive audio data to and from one another.

The conference application may be acquired by each of the communication terminals 10, 11, and 12 through communication via the network NW, for example, or may be acquired via an optical disc such as a DVD or a storage medium such as a USB device.

Figure 2 is a block diagram showing the configuration of the communication terminal 10. The communication terminals 11 and 12 have the same configuration as the communication terminal 10.

The control unit 15 is a processing device including a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory). The CPU realizes various functions by reading and executing programs stored in the ROM. The control unit 15 issues instructions to, and controls, each part of the terminal in response to operations by the person using the communication terminal 10 (hereinafter also referred to as the user). In this embodiment, the control unit 15 executes the processing of the conference application described above.

The input device 16 accepts input operations from the user of the communication terminal 10. The input device 16 is, for example, a device such as a keyboard or a mouse used to input information such as characters and numbers.

The microphone 17 is an audio input device that picks up sound from the user of the communication terminal 10, for example the voice uttered by the user, and converts it into an electrical signal. In other words, the microphone 17 is an audio data generation unit that generates audio data, serving as the first audio data, from the input audio input to the communication terminal 10.

The speaker 18 is an audio output device that, under the control of the control unit 15, outputs the audio represented by the audio data, serving as the second audio data, transmitted from the communication terminals 11 and 12. In this embodiment, the user of the communication terminal 10 can hold a voice call with the users of the communication terminals 11 and 12 through the microphone 17 and the speaker 18.

The camera 19 is an imaging device that captures images under the control of the control unit 15. The camera 19 is, for example, a camera that captures the user of the communication terminal 10.

The display 21 is a display device that displays a screen under the control of the control unit 15. While the communication terminal 10 is communicatively connected to the communication terminals 11 and 12, for example, the display 21 shows a conference user interface such as a window that presents the video from the camera 19, the ON/OFF status of the audio mute on the communication terminal 10, and the user names of the communication terminals 11 and 12 participating in the conference.

The display 21 may be a touch panel display that combines a touch panel, which accepts input operations from the user of the communication terminal 10 in the role of the input device 16, with a display that shows a screen under the control of the control unit 15. When the display 21 is a touch panel display, it functions as an input device in addition to, or in place of, the input device 16.

The functional blocks of the control unit 15 are described below.

The communication unit 23 is a functional unit that transmits and receives data to and from the communication terminals 11 and 12 in accordance with instructions from the control unit 15. Together with a communication interface device such as a NIC (Network Interface Card), the communication unit 23 forms a communication interface for exchanging data via the network NW, and transmits and receives data via the network NW.

The communication unit 23 can serve as a transmission unit that transmits to the conference server 14 the audio data obtained when audio input through the microphone 17 is converted by the control unit 15. The communication unit 23 can also serve as a reception unit that receives audio data transmitted from the other communication terminals via the conference server 14.

The mute state determination unit 24 determines whether the communication terminal 10 is in the mute state, which is the state in which the communication terminal 10 does not transmit audio data in the conference system 100. The mute state determination unit 24 determines that the communication terminal 10 is in the mute state when, for example, the user has selected audio muting by operating the input device 16.

The mute release determination unit 25 determines whether to release the mute state based on the audio input to the microphone 17 of the terminal itself, that is, the communication terminal 10. Specifically, the mute release determination unit 25 determines whether the mute state should be released by determining whether the user of the communication terminal 10 has made an utterance with the intention of speaking.

For example, the mute release determination unit 25 determines whether the user of the communication terminal 10 is making an utterance with the intention of speaking according to whether the audio level indicating the intensity of the audio input to the microphone 17 of the terminal itself (hereinafter also referred to as the first audio level) has reached or exceeded a predetermined threshold (hereinafter also referred to as the first threshold).
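The specification does not state how the first audio level is computed. As a minimal sketch, assuming the level is the RMS of one frame of samples expressed in dBFS, the comparison against the first threshold might look like the following. The function names `audio_level_dbfs` and `local_user_is_speaking` are hypothetical and do not appear in the specification.

```python
import math


def audio_level_dbfs(samples):
    """Return the RMS level of one audio frame in dBFS.

    Assumption: samples are floats in [-1.0, 1.0]. An empty or silent
    frame is reported as negative infinity.
    """
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def local_user_is_speaking(first_level, first_threshold):
    """True when the first audio level has reached or exceeded the first threshold."""
    return first_level >= first_threshold
```

The comparison is inclusive because the text says the level must be "equal to or higher than" the first threshold.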

The first threshold can be set, for example, from the history of audio levels observed when the user of the terminal itself has spoken. The first threshold is also set higher than the audio level of small noises and environmental sounds, such as the user clearing their throat or a mouse click.

The mute release determination unit 25 also determines whether the users of the other terminals, that is, the communication terminals 11 and 12, are making utterances with the intention of speaking, based on the audio represented by the audio data transmitted from the communication terminals 11 and 12.

For example, the mute release determination unit 25 determines whether the users of the communication terminals 11 and 12 are making utterances with the intention of speaking according to whether the audio level indicating the intensity of the audio represented by the audio data transmitted from the other terminals, that is, the communication terminals 11 and 12 (hereinafter also referred to as the second audio level), has fallen to or below a predetermined threshold (hereinafter also referred to as the second threshold).

The second threshold is set, based on the history of audio levels observed when the users of the communication terminals 11 and 12 have spoken, so as to be lower than those audio levels.
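The specification only constrains the two thresholds relative to observed levels: the first is derived from the local user's speech history and lies above small noises, and the second lies below the remote users' speech levels. One way this could be sketched, assuming levels in dB and a fixed safety margin (both the margin and the function names are illustrative assumptions, not taken from the specification):

```python
def first_threshold(local_speech_levels, noise_levels, margin_db=3.0):
    """Pick a first threshold from the local user's speech history.

    Assumption: start slightly below the quietest recorded speech, but
    never drop to or below the loudest recorded noise plus the margin.
    """
    candidate = min(local_speech_levels) - margin_db
    noise_floor = max(noise_levels) + margin_db
    return max(candidate, noise_floor)


def second_threshold(remote_speech_levels, margin_db=3.0):
    """Pick a second threshold below the remote users' recorded speech levels."""
    return min(remote_speech_levels) - margin_db
```

With local speech at -20 dB and -18 dB and noise at -45 dB and -40 dB, this sketch yields a first threshold of -23 dB. If the noise were louder than the quietest speech, the noise floor would win, which preserves the requirement that the first threshold exceed the noise level.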

In this embodiment, when the communication terminal 10 is in the mute state, the mute release determination unit 25 determines that the mute state should be released if it determines, based on the first audio level, that the user of the communication terminal 10 has made an utterance with the intention of speaking and also determines, based on the second audio level, that the users of the communication terminals 11 and 12 are not making utterances with the intention of speaking.
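The combined condition in this embodiment can be sketched as a single predicate. This is an illustration under the assumption that both levels and both thresholds are plain numbers on the same scale; the name `should_release_mute` is hypothetical.

```python
def should_release_mute(is_muted, first_level, second_level,
                        first_threshold, second_threshold):
    """Decide whether the mute state should be released.

    Release is indicated only when the terminal is muted, the local level
    is at or above the first threshold (local user is speaking), and the
    remote level is at or below the second threshold (nobody else is).
    """
    if not is_muted:
        return False
    local_speaking = first_level >= first_threshold
    remote_silent = second_level <= second_threshold
    return local_speaking and remote_silent
```

A loud local sound alone is not enough: if a remote participant is talking at the same moment, the predicate returns `False` and the terminal stays muted.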

When the mute release determination unit 25 determines that the mute state should be released, the control unit 15 causes the speaker 18 to output a notification sound that notifies the user of the communication terminal 10 that the communication terminal 10 is in the mute state.

The notification sound may be, for example, a simple alarm sound such as a double beep, or a spoken message such as "You are muted." In addition to outputting the notification sound from the speaker 18, the control unit 15 may display the message "You are muted" on the display 21.

For example, on noticing from the notification sound output by the speaker 18 while speaking that the communication terminal 10 is in the mute state, the user of the communication terminal 10 can operate the input device 16 to release the mute state and then speak again.

The control unit 15 may also release the mute state of the communication terminal 10 at the same time as it causes the speaker 18 to output the notification sound. This spares a user who has noticed from the notification sound that the communication terminal 10 is muted the trouble of performing an operation to release the mute state.

In other words, the user of the communication terminal 10 can simply keep speaking, aware both that the communication terminal 10 had been in the mute state and that the mute state has been released. The control unit 15 does not necessarily have to output the notification sound when it releases the mute state; it may simply release the mute state of the communication terminal 10 without any notification sound.

In addition, when the mute state has been released, whether manually by the user of the communication terminal 10 or automatically, the control unit 15 may inform the user that the mute state has been released by displaying a message such as "Mute has been released" on the display 21.
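The passages above describe three variants of what the control unit 15 does once release is indicated: notify only, notify and release, or release silently, optionally followed by an on-screen message. A small sketch of that choice, in which the `ReleaseMode` enum and the action strings are hypothetical names introduced purely for illustration:

```python
from enum import Enum


class ReleaseMode(Enum):
    NOTIFY_ONLY = "notify_only"              # user releases the mute manually
    NOTIFY_AND_UNMUTE = "notify_and_unmute"  # sound plus automatic release
    UNMUTE_ONLY = "unmute_only"              # automatic release, no sound


def actions_on_release_decision(mode, show_release_message=False):
    """List the actions taken once release of the mute state is indicated."""
    actions = []
    if mode in (ReleaseMode.NOTIFY_ONLY, ReleaseMode.NOTIFY_AND_UNMUTE):
        actions.append("play_notification_sound")
    if mode in (ReleaseMode.NOTIFY_AND_UNMUTE, ReleaseMode.UNMUTE_ONLY):
        actions.append("release_mute")
        if show_release_message:
            actions.append("display_release_message")
    return actions
```

In this sketch the release message is only produced for the automatic variants; in the notify-only variant the same message would be shown later, when the user releases the mute by hand.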

Figure 3 is a block diagram showing the configuration of the conference server 14. The control unit 27 is a processing device that includes a CPU, a ROM, and a RAM, and issues instructions to, and controls, each part of the conference server 14.

As described above, the control unit 27 responds to the connection requests transmitted from each of the communication terminals 10, 11, and 12 through the conference application, thereby placing the communication terminals 10, 11, and 12 in a state in which they can transmit and receive audio data to and from one another.

The mixing unit 28 of the control unit 27 has a mixer function: once the communication terminals 10, 11, and 12 are able to transmit and receive audio data to and from one another, it combines the audio data transmitted from each of the communication terminals 10, 11, and 12 to generate a single piece of audio data. The audio data generated by the mixing unit 28 is transmitted to each of the communication terminals 10, 11, and 12.
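The specification does not say how the mixing unit 28 combines the streams. A common approach, shown here only as an assumption, is to add the frames sample by sample and clip the result to the valid range. The function name `mix_frames` is hypothetical.

```python
def mix_frames(frames):
    """Mix equal-length frames of float samples into one frame.

    Assumption: samples lie in [-1.0, 1.0]; the sum is clipped to that
    range so that simultaneous loud speakers do not overflow.
    """
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same length")
    return [max(-1.0, min(1.0, sum(column))) for column in zip(*frames)]
```

For example, mixing `[0.5, -0.5]` with `[0.25, -0.75]` gives `[0.75, -1.0]`: the first sample is a plain sum and the second is clipped.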

The communication unit 29 is a communication interface that transmits and receives data to and from external devices in accordance with instructions from the control unit 27. The communication unit 29 is, for example, a NIC for connecting to the network NW. The communication unit 29 can serve as a reception unit that receives the audio data transmitted from each of the communication terminals 10, 11, and 12. The communication unit 29 can also serve as a transmission unit that transmits the audio data combined by the mixing unit 28 to each of the communication terminals 10, 11, and 12.

An example of the specific operation of the communication terminal 10 in this embodiment is described below using a flowchart.

Figure 4 is a flowchart showing the notification sound output routine RT1 executed by the control unit 15 of the communication terminal 10. The control unit 15 starts the notification sound output routine RT1 using, for example, the establishment of a connection via the conference server 14 between its own terminal, that is, the communication terminal 10, and the communication terminals 11 and 12 as the start trigger.

The control unit 15 first determines, via the mute state determination unit 24, whether the communication terminal 10 is in the mute state (step S101). If the mute state determination unit 24 determines that the communication terminal 10 is not in the mute state (step S101: NO), the control unit 15 ends the notification sound output routine RT1.

If the mute state determination unit 24 determines that the communication terminal 10 is in the mute state (step S101: YES), the control unit 15 determines, via the mute release determination unit 25, whether the first audio level indicating the intensity of the audio input to the microphone 17 has reached or exceeded the first threshold (step S102).

If the mute release determination unit 25 determines that the first audio level has not reached or exceeded the first threshold (step S102: NO), that is, that the user of the communication terminal 10 is not making an utterance with the intention of speaking, the control unit 15 ends the notification sound output routine RT1.

If the mute release determination unit 25 determines that the first audio level has reached or exceeded the first threshold (step S102: YES), that is, that the user of the communication terminal 10 is making an utterance with the intention of speaking, the control unit 15 determines, via the mute release determination unit 25, whether the second audio level has fallen to or below the second threshold (step S103).

If the mute release determination unit 25 determines that the second audio level has not fallen to or below the second threshold (step S103: NO), that is, that the users of the communication terminals 11 and 12 are making utterances with the intention of speaking, the control unit 15 ends the notification sound output routine RT1.

If the mute release determination unit 25 determines that the second audio level has fallen to or below the second threshold (step S103: YES), that is, that the users of the communication terminals 11 and 12 are not making utterances with the intention of speaking, the control unit 15 causes the speaker 18 to output a notification sound indicating that the communication terminal 10 is in the mute state (step S104).

In step S104, as described above, the control unit 15 causes the speaker 18 to output a notification sound, such as an alarm or a spoken message, that notifies the user of the communication terminal 10 that the communication terminal 10 is in the mute state. After step S104, the control unit 15 ends the notification sound output routine RT1.
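Steps S101 to S104 of the notification sound output routine RT1 can be sketched as one pass through three guarded checks. This is an illustration only: the function name, the returned step labels, and the `play_notification` callback are assumptions introduced here, and a real terminal would presumably run the routine repeatedly while the connection lasts.

```python
def notification_sound_routine(is_muted, first_level, second_level,
                               first_threshold, second_threshold,
                               play_notification):
    """Run one pass of routine RT1 and report where it ended."""
    # Step S101: is the terminal in the mute state?
    if not is_muted:
        return "S101: NO"
    # Step S102: has the local (first) level reached the first threshold?
    if first_level < first_threshold:
        return "S102: NO"
    # Step S103: has the remote (second) level fallen to the second threshold?
    if second_level > second_threshold:
        return "S103: NO"
    # Step S104: tell the user that the terminal is muted.
    play_notification()
    return "S104"
```

Passing the notification as a callback keeps the sketch independent of any particular audio API; the automatic-release variant could pass a callback that also releases the mute state.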

As described above, according to this embodiment, when the communication terminal 10 is in the mute state, if the mute release determination unit 25 determines, based on the first audio level, that the user of the communication terminal 10 has made an utterance with the intention of speaking, and determines, based on the second audio level, that the users of the communication terminals 11 and 12 are not making utterances with the intention of speaking, the control unit 15 causes the speaker 18 to output a notification sound indicating that the communication terminal 10 is in the mute state.

As a result, when the user of the communication terminal 10 makes an utterance with the intention of speaking in a situation where the users of the communication terminals 11 and 12 are not doing so, the user can learn that the communication terminal 10 is in the mute state.

Furthermore, in the variant in which the control unit 15 releases the mute state at the same time as it causes the speaker 18 to output the notification sound, the user of the communication terminal 10 can speak smoothly without performing any operation to release the mute state.

Therefore, according to this embodiment, the mute state is not released merely because the user's own sound was picked up, nor is it released while another conference participant is speaking, so operations related to mute release can be performed in a manner that matches the user's intention.

Although this embodiment has been described for the case in which each of the communication terminals 10, 11, and 12 is a PC, the terminals are not limited to this and may be any terminals capable of transmitting and receiving audio data to and from one another via the conference server 14. For example, each of the communication terminals 10, 11, and 12 may be a tablet terminal or a smartphone. Each of the communication terminals 10, 11, and 12 may also be, for example, an IP (Internet Protocol) telephone or a landline telephone (analog telephone) on which the mute state can be switched ON and OFF.

The communication terminals 10, 11, and 12 may also be of different types from one another, as long as they can exchange audio data with one another via the conference server 14. For example, in the conference system 100, the communication terminal 10 may be a PC, the communication terminal 11 a smartphone, and the communication terminal 12 an IP telephone.

In this embodiment, the conference application described above is installed on each of the communication terminals 10, 11, and 12, and the determination of the mute state and the determination of the user's speech are performed by the control unit of each terminal; however, the configuration is not limited to this. For example, these determinations for each of the communication terminals 10, 11, and 12 may be performed by the conference server 14 through a web application running in a web browser.

A conference system 200 serving as an audio transmission/reception system according to Embodiment 2 is described below with reference to FIGS. 5 to 10. The conference system 200 differs from Embodiment 1 in that it includes a voice recognition server 33 and in the configuration of the communication terminals 30, 31, and 32. In all other respects, the conference system 200 has the same configuration as in Embodiment 1.

FIG. 5 is a diagram showing the configuration of the conference system 200. The following description covers the case in which the conference system 200 is constructed by connecting three communication terminals 30, 31, and 32, the conference server 14, and the voice recognition server 33 so that they can communicate via a network NW. Of course, the number of communication terminals constituting the conference system 200 is not limited to the three shown in FIG. 5 and may be any number that the capacity of the system allows.

The voice recognition server 33 converts audio data transmitted from the communication terminal 30 into text data and transmits the text data to the communication terminal 30. In this embodiment, the voice recognition server 33 is provided separately from the conference server 14.

FIG. 6 is a block diagram showing the configuration of the communication terminal 30. The control unit 34 differs from Embodiment 1 in the configuration of the mute release determination unit 35 and is otherwise configured in the same way as in Embodiment 1. The communication terminals 31 and 32 have the same configuration as the communication terminal 30.

In this embodiment, the mute release determination unit 35 comprises an audio level determination unit 35A and a keyword determination unit 35B.

When the communication terminal 30 is in the mute state, the audio level determination unit 35A determines whether the first audio level described above has become equal to or higher than a first threshold, and whether the second audio level described above has become equal to or lower than a second threshold.

When the audio level determination unit 35A determines that the first audio level has become equal to or higher than the first threshold, the mute release determination unit 35 determines that the user of the communication terminal 30 has made an utterance with the intention of speaking. When the audio level determination unit 35A determines that the second audio level has become equal to or lower than the second threshold, the mute release determination unit 35 determines that the users of the communication terminals 31 and 32 have not made an utterance with the intention of speaking.
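By way of illustration only, the two threshold comparisons described above can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the names `LevelThresholds`, `local_user_intends_to_speak`, `remote_users_silent`, and `level_conditions_met` are hypothetical, and the specification does not state the unit of the audio levels, so plain floating-point values are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelThresholds:
    """Hypothetical container; the specification names only the two thresholds."""
    first: float   # first threshold, applied to the terminal's own input audio
    second: float  # second threshold, applied to audio from the other terminals


def local_user_intends_to_speak(first_level: float, t: LevelThresholds) -> bool:
    # First audio level equal to or higher than the first threshold.
    return first_level >= t.first


def remote_users_silent(second_level: float, t: LevelThresholds) -> bool:
    # Second audio level equal to or lower than the second threshold.
    return second_level <= t.second


def level_conditions_met(first_level: float, second_level: float,
                         t: LevelThresholds) -> bool:
    # Both level-based conditions hold at the same time.
    return (local_user_intends_to_speak(first_level, t)
            and remote_users_silent(second_level, t))
```

Both comparisons are inclusive, matching the "equal to or higher than" and "equal to or lower than" wording of the description.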

The keyword determination unit 35B compares the character string indicated by the text data transmitted from the voice recognition server 33 with the keywords stored in the keyword DB 36 and determines whether the character string contains a predetermined keyword. Specifically, the keyword determination unit 35B determines whether the character string contains a word that signals an intention to speak.

When the keyword determination unit 35B determines that the character string indicated by the text data contains a word that signals an intention to speak, the mute release determination unit 35 determines that the user of the communication terminal 30 is making an utterance with the intention of speaking.

The keyword DB 36 is a database that holds a plurality of such words signaling an intention to speak. The keyword DB 36 may be stored in an external storage device such as an external hard disk, in which case the control unit 34 may obtain the keywords from that external storage device.

The keywords held in the keyword DB 36 are now described with reference to FIG. 7.

FIG. 7 shows a keyword table TB1, an example of the keywords held in the keyword DB 36. In the keyword table TB1, "keyword type" indicates the kind of situation in which the stored words are used, and "keyword example" gives example words for each keyword type.

In the keyword table TB1, "greeting words" are words used mainly at the start of a conference, such as 「おはようございます」 ("Good morning") and 「よろしくおねがいします」 (a customary opening courtesy, roughly "I look forward to working with you").

In the keyword table TB1, "words used when initiating a conversation oneself" are words used mainly when the user cuts into a conversation or raises a topic, such as 「ちょっとすみません」 ("Excuse me for a moment") and 「よろしいでしょうか」 ("May I?").

In the keyword table TB1, "words used when addressed by another person" are words used mainly when the user is asked for an explanation or agrees with another person's opinion, such as 「それは」 ("That is...") and 「わかりました」 ("Understood").
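For illustration, the keyword table TB1 and the comparison performed by the keyword determination unit 35B can be sketched as a small in-memory mapping. The dictionary `KEYWORD_TB1`, its English type labels, and the function `find_keyword_type` are hypothetical names introduced here. Plain substring matching is also an assumption, since the description states only that the character string is compared with the stored keywords.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for keyword table TB1 (keyword type -> examples).
# The example words are the ones given in the description of FIG. 7.
KEYWORD_TB1: Dict[str, List[str]] = {
    "greeting": ["おはようございます", "よろしくおねがいします"],
    "self_initiated": ["ちょっとすみません", "よろしいでしょうか"],
    "response_to_others": ["それは", "わかりました"],
}


def find_keyword_type(text: str,
                      table: Dict[str, List[str]] = KEYWORD_TB1) -> Optional[str]:
    """Return the first keyword type whose example occurs in text, else None."""
    for keyword_type, keywords in table.items():
        if any(keyword in text for keyword in keywords):
            return keyword_type
    return None
```

A recognized string such as 「ちょっとすみません、質問があります」 would be classified as `"self_initiated"`, whereas a string containing none of the stored words, such as a transcribed cough, yields `None`.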

Refer again to FIG. 6. When the mute release determination unit 35 determines, based on the second audio level, that the users of the communication terminals 31 and 32 are not making an utterance with the intention of speaking, the control unit 34 extracts, for a fixed period (for example, roughly the first 2 to 3 seconds), the audio whose level is equal to or higher than the first threshold, converts it into audio data, and transmits the audio data to the voice recognition server 33.
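One possible reading of this extraction step is sketched below, assuming that the input audio is divided into fixed-length frames with one level value per frame and that the captured window starts at the first frame reaching the first threshold. The function `extract_leading_speech`, the frame length, and the 2.5 second default are hypothetical; the description specifies only "roughly the first 2 to 3 seconds".

```python
from typing import List, Sequence


def extract_leading_speech(
    frame_levels: Sequence[float],
    first_threshold: float,
    frame_seconds: float = 0.02,   # assumed frame length
    capture_seconds: float = 2.5,  # assumed value within the 2 to 3 second range
) -> List[int]:
    """Return the indices of the frames to send for voice recognition.

    The window starts at the first frame whose level is equal to or higher
    than the first threshold and covers at most capture_seconds of audio.
    """
    max_frames = round(capture_seconds / frame_seconds)
    for start, level in enumerate(frame_levels):
        if level >= first_threshold:
            end = min(start + max_frames, len(frame_levels))
            return list(range(start, end))
    return []  # no frame reached the threshold, so nothing is sent
```

Limiting the window to the opening seconds keeps the amount of audio sent to the voice recognition server 33 small while still covering the opening words that the keyword table is designed to catch.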

In this embodiment, as described above, when the keyword determination unit 35B determines that the character string indicated by the text data contains a word that signals an intention to speak, the mute release determination unit 35 determines that the user of the communication terminal 30 is making an utterance with the intention of speaking.

When the mute release determination unit 35 determines that the user of the communication terminal 30 is making an utterance with the intention of speaking, the control unit 34 causes the speaker 18 to output a notification sound indicating that the communication terminal 30 is in the mute state, as in Embodiment 1. As in Embodiment 1, the control unit 34 may also release the mute state of the communication terminal 30 when it outputs the notification sound.
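The two behaviours described above (notification only, or notification together with release of the mute state) can be illustrated with a small decision function. This is a sketch under stated assumptions: `MuteAction`, `decide_action`, and the `auto_unmute` flag are hypothetical names, and the specification does not say how the choice between the two behaviours is configured.

```python
from enum import Enum, auto


class MuteAction(Enum):
    NONE = auto()               # do nothing
    NOTIFY = auto()             # output the notification sound only
    NOTIFY_AND_UNMUTE = auto()  # output the sound and release the mute state


def decide_action(is_muted: bool, intends_to_speak: bool,
                  auto_unmute: bool = False) -> MuteAction:
    """Map the determination results to the action taken by the control unit."""
    if not is_muted or not intends_to_speak:
        return MuteAction.NONE
    return MuteAction.NOTIFY_AND_UNMUTE if auto_unmute else MuteAction.NOTIFY
```

Here `is_muted` corresponds to the result of the mute state determination and `intends_to_speak` to the result of the mute release determination.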

FIG. 8 is a block diagram showing the configuration of the voice recognition server 33. The control unit 37 is a processing device that includes a CPU, ROM, and RAM, and it issues instructions to and controls each part of the voice recognition server 33.

The voice recognition unit 38 within the control unit 37 performs voice recognition on the audio data transmitted from the communication terminal 30. Specifically, as described above, the voice recognition unit 38 converts that audio data, through voice conversion, into text data consisting of a character string.

The voice recognition unit 38 can recognize speech as text by, for example: extracting features such as frequency and intensity from the audio data transmitted from the communication terminal 30 (acoustic analysis); matching the extracted features against previously learned information on sounds and words to extract phonemes, the smallest units of speech (acoustic model); extracting combinations of sounds from an information database and recognizing them as words (pronunciation dictionary); and combining the phonemes extracted by the acoustic model with the words recognized through the pronunciation dictionary to recognize a meaningful sentence (language model).

The communication unit 39 is a communication interface that exchanges data with the communication terminals 31 and 32 in accordance with instructions from the control unit 37. The communication unit 39 is, for example, a NIC for connecting to the network NW. The communication unit 39 can serve as a receiving unit that receives the audio data transmitted from the communication terminal 30 and as a transmitting unit that transmits the text data generated by voice recognition to the communication terminal 30.

The mass storage device 41 is composed of, for example, a hard disk drive, an SSD (solid state drive), or flash memory, and stores the operating system and various programs such as software. In this embodiment, the mass storage device 41 holds, among other things, the sound and word information used by the acoustic model and the pronunciation dictionary for the voice recognition described above.

An example of the specific operation of the communication terminal 30 and of the voice recognition server 33 in this embodiment is described below with reference to flowcharts.

FIG. 9 is a flowchart showing a notification sound output routine RT2 executed by the control unit 34 of the communication terminal 30. Only the points that differ from the notification sound output routine RT1 executed by the control unit 15 of the communication terminal 10 according to Embodiment 1 are described.

In step S103, when the mute release determination unit 25 determines that the second audio level has become equal to or lower than the second threshold (step S103: YES), the control unit 34 extracts roughly the first 2 to 3 seconds of the audio whose first audio level is equal to or higher than the first threshold, converts it into audio data, and transmits the audio data to the voice recognition server 33 (step S201).

After step S201, the control unit 34 determines whether text data has been received from the voice recognition server 33 (step S202). If the control unit 34 determines that text data has not been received (step S202: NO), it repeats step S202.

When the control unit 34 determines that text data has been received from the voice recognition server 33 (step S202: YES), it determines, via the keyword determination unit 35B, whether the text data contains a keyword stored in the keyword DB 36 (step S203). In other words, the keyword determination unit 35B determines whether the speech input to the microphone 17 of the terminal itself is a word that signals an intention to speak.

When the keyword determination unit 35B determines that the text data contains a word that signals an intention to speak (step S203: YES), that is, when the mute release determination unit 35 determines that the user of the communication terminal 30 is making an utterance with the intention of speaking, the control unit 34 causes the speaker 18 to output a notification sound indicating that the communication terminal 10 is in the mute state (step S204).

When the keyword determination unit 35B determines that the text data contains no keyword (step S203: NO), the control unit 34 ends the notification sound output routine RT2. The control unit 34 also ends the notification sound output routine RT2 after step S204.
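The portion of the routine RT2 described above (steps S103 and S201 to S204) can be sketched as follows. This is an illustrative sketch only: the function `notification_routine_rt2` and its callable parameters are hypothetical, a single blocking call stands in for the repeated check of step S202, and the handling of "step S103: NO" belongs to the routine RT1 of Embodiment 1, so the sketch simply returns in that case.

```python
from typing import Callable, Iterable


def notification_routine_rt2(
    second_level: float,
    second_threshold: float,
    send_audio_for_recognition: Callable[[], str],  # hypothetical: S201 + S202
    keywords: Iterable[str],
    play_notification: Callable[[], None],          # hypothetical: S204
) -> bool:
    """Return True if the notification sound was output."""
    # S103: continue only while the other terminals are at or below the threshold.
    if second_level > second_threshold:
        return False  # assumption: handling of S103: NO is defined by RT1
    # S201/S202: send the extracted audio and wait for the text data.
    text = send_audio_for_recognition()
    # S203: does the text contain a stored keyword?
    if not any(keyword in text for keyword in keywords):
        return False  # S203: NO, the routine ends
    # S204: notify the user that the terminal is in the mute state.
    play_notification()
    return True
```

Passing the recognition call and the notification output as callables keeps the sketch independent of any particular audio or network library.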

FIG. 10 is a flowchart showing a voice recognition routine RT3 executed by the control unit 37 of the voice recognition server 33. The control unit 37 starts the voice recognition routine RT3 when triggered, for example, by the establishment of a connection between the voice recognition server 33 and the communication terminal 30 via the network NW.

The control unit 37 determines whether audio data has been received from the communication terminal 30 (step S301). When the control unit 37 determines that audio data has been received (step S301: YES), it converts the speech represented by the audio data into text data via the voice recognition unit 38 (step S302).

When the control unit 37 determines that no audio data has been received from the communication terminal 30 (step S301: NO), it ends the voice recognition routine RT3.

After step S302, the control unit 37 transmits the text data produced by the voice recognition unit 38 to the communication terminal 30 (step S303). After step S303, the control unit 37 ends the voice recognition routine RT3.
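The routine RT3 (steps S301 to S303) can likewise be sketched in a few lines. The function `voice_recognition_routine_rt3` and its parameters are hypothetical. The recognizer is passed in as a callable because the description does not tie the voice recognition unit 38 to any particular recognition engine.

```python
from typing import Callable, Optional


def voice_recognition_routine_rt3(
    received_audio: Optional[bytes],
    recognize: Callable[[bytes], str],  # hypothetical stand-in for unit 38
    send_text: Callable[[str], None],   # hypothetical stand-in for unit 39
) -> bool:
    """Return True if text data was sent back to the terminal."""
    # S301: has audio data been received from the communication terminal?
    if received_audio is None:
        return False  # S301: NO, the routine ends
    # S302: convert the speech represented by the audio data into text data.
    text = recognize(received_audio)
    # S303: transmit the text data to the communication terminal.
    send_text(text)
    return True
```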

As described above, according to this embodiment, when the communication terminal 30 is in the mute state and the mute release determination unit 35 determines that the first audio level is equal to or higher than the first threshold and that the second audio level is equal to or lower than the second threshold, the control unit 34 transmits audio data representing the audio at or above the first threshold to the voice recognition server 33.

The control unit 34 then refers to the text data transmitted from the voice recognition server 33, and when the keyword determination unit 35B determines that the character string indicated by the text data contains a word that signals an intention to speak, the control unit 34 causes the speaker 18 to output a notification sound indicating that the communication terminal 30 is in the mute state.

As a result, when the user of the communication terminal 30 utters speech at a certain audio level while the users of the communication terminals 11 and 12 are not speaking, and the speech input to the communication terminal 30 is a word that signals an intention to speak, the user can learn that the communication terminal 30 is in the mute state.

Furthermore, in a configuration in which the control unit 34 causes the speaker 18 to output the notification sound and also releases the mute state of the communication terminal 30, the user of the communication terminal 30 can speak without performing an operation to release the mute state.

Therefore, according to this embodiment, as in Embodiment 1, the mute state is not released merely because the user's own voice has been picked up, nor is it released while another conference participant is speaking, so operations related to unmuting can be performed in a manner consistent with the user's intention.

In this embodiment, the voice recognition server 33, which carries part of the function for releasing the mute state of the communication terminal 30 (the mute release function), exists separately from the conference server 14. In other words, even if the conference server 14 changes, the voice recognition server 33 does not need to be changed each time.

Consequently, even when, for example, conference systems built on different protocols are used from one conference to the next, no protocol-specific processing, such as generating audio data conforming to a different protocol for each conference, is needed for the mute release function to work. Providing the voice recognition server 33 separately from the conference server 14 therefore makes the mute release function, and any application that incorporates it, more versatile.

For example, the mute release function can be added as an add-on to various conference applications such as ZOOM (registered trademark), Skype (registered trademark), Teams (registered trademark), BlueJeans (registered trademark), and Webex (registered trademark), and can be realized by transmitting the audio data of the conference held in each application to the voice recognition server 33.

In the notification sound output routine RT2 described above, the control unit 34 transmits audio having a level equal to or higher than the first threshold to the voice recognition server 33 as audio data (step S201) when the audio level determination unit 35A determines that the second audio level is equal to or lower than the second threshold (step S103: YES); however, step S103 need not be executed.

That is, the control unit 34 may cause the speaker 18 to output the notification sound when the first audio level is equal to or higher than the first threshold and the audio at or above the first threshold contains a word that signals an intention to speak. This allows the control unit 34 to notify the user of, or release, the mute state of the communication terminal 30 based solely on the audio input to the terminal itself.

In this embodiment, the voice recognition server 33 may include the keyword determination unit 35B in place of the communication terminal 30, and the mass storage device 41 may hold the keyword DB 36. For example, the control unit 37 of the voice recognition server 33 may convert the audio data transmitted from the communication terminal 30 into text data in the voice recognition unit 38, have the keyword determination unit 35B determine whether the character string indicated by the text data contains a keyword that signals an intention to speak, and transmit the result of that determination to the communication terminal 30.

In that case, the control unit 34 of the communication terminal 30 may cause the speaker 18 to output the mute state notification sound when the keyword determination result transmitted from the voice recognition server 33 indicates that the character string contains a word that signals an intention to speak.
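In this variation, the division of work between the server and the terminal might look like the following sketch. The function names, the `keyword_detected` field, and the dictionary used as the determination result are all hypothetical; the description states only that the result of the determination is transmitted to the communication terminal 30.

```python
from typing import Callable, Dict, Iterable


def server_handle_audio(audio: bytes,
                        recognize: Callable[[bytes], str],
                        keywords: Iterable[str]) -> Dict[str, bool]:
    """Server side: recognize the audio and return only the determination result."""
    text = recognize(audio)
    return {"keyword_detected": any(keyword in text for keyword in keywords)}


def terminal_handle_result(result: Dict[str, bool],
                           play_notification: Callable[[], None]) -> None:
    """Terminal side: output the notification sound on a positive result."""
    if result.get("keyword_detected", False):
        play_notification()
```

With this split the terminal no longer needs the keyword DB 36 or the recognized text itself, only a yes/no result.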

In this embodiment, the voice recognition server 33 may be built into each of the communication terminals 30, 31, and 32. If the communication terminal 30 is an IP telephone, for example, the voice recognition server 33 may be built into a private branch exchange (PBX) that connects a plurality of telephones. The voice recognition server 33 may also be built into the conference server 14.

The series of processes in the control units of the communication terminals, the conference server 14, and the voice recognition server 33 described in Embodiments 1 and 2 may be implemented as a program executed by a computer. The program may be recorded on a computer-readable recording medium.

The type of recording medium is not particularly limited and may be, for example, an optical disc, a hard disk, or a semiconductor memory such as flash memory or an SSD. The program may also be downloaded to and installed on a communication terminal via communication.

The control routines shown in Embodiments 1 and 2 above are merely examples and can be selected and modified as appropriate according to the application, conditions of use, and so on.

10, 11, 12, 30, 31, 32 Communication terminal
14 Conference server
15, 27, 34, 37 Control unit
16 Input device
17 Microphone
18 Speaker
19 Camera
21 Display
23, 29, 39 Communication unit
24 Mute state determination unit
25, 35 Mute release determination unit
26 Mixing unit
33 Voice recognition server
35A Audio level determination unit
35B Keyword determination unit
36 Keyword DB
38 Voice conversion unit
41 Mass storage device

Claims

Translated from Japanese

1. A communication terminal capable of constructing, together with another communication terminal, an audio transmission/reception system in which audio data is transmitted and received between the terminals, the communication terminal comprising:
an audio data generation unit that generates first audio data from input audio input to the communication terminal;
a mute state determination unit that determines whether the communication terminal is in a mute state, in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and
a mute release determination unit that, when the communication terminal is in the mute state, determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of audio indicated by second audio data transmitted from the other communication terminal,
wherein the mute release determination unit determines that the mute state should be released when the audio indicated by the first audio data contains a predetermined keyword.

2. The communication terminal according to claim 1, wherein the mute release determination unit determines that the mute state should be released when the first audio level is equal to or higher than a first threshold and the second audio level is equal to or lower than a second threshold.

3. The communication terminal according to claim 1 or 2, wherein the predetermined keyword is a word that signals an intention to speak.

4. The communication terminal according to any one of claims 1 to 3, wherein a notification sound is output when the mute release determination unit determines that the mute state should be released.

5. The communication terminal according to any one of claims 1 to 4, wherein the mute state in the audio transmission/reception system is released when the mute release determination unit determines that the mute state should be released.

6. A determination method performed by a communication terminal capable of constructing, together with another communication terminal, an audio transmission/reception system in which audio data is transmitted and received between the terminals, the determination method comprising:
an audio data generation step in which an audio data generation unit generates first audio data from input audio input to the communication terminal;
a mute state determination step in which a mute state determination unit determines whether the communication terminal is in a mute state, in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and
a mute release determination step in which, when the mute state determination unit determines that the communication terminal is in the mute state, a mute release determination unit determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of audio indicated by second audio data transmitted from the other communication terminal,
wherein the mute release determination unit determines that the mute state should be released when the audio indicated by the first audio data contains a predetermined keyword.

7. A program to be executed by a communication terminal capable of constructing, together with another communication terminal, an audio transmission/reception system in which audio data is transmitted and received between the terminals, the program comprising:
an audio data generation step in which an audio data generation unit generates first audio data from input audio input to the communication terminal;
a mute state determination step in which a mute state determination unit determines whether the communication terminal is in a mute state, in which the communication terminal does not transmit the first audio data in the audio transmission/reception system; and
a mute release determination step in which, when the mute state determination unit determines that the communication terminal is in the mute state, a mute release determination unit determines whether the mute state should be released based on a first audio level indicating the intensity of the input audio and a second audio level indicating the intensity of audio indicated by second audio data transmitted from the other communication terminal,
wherein the mute release determination unit determines that the mute state should be released when the audio indicated by the first audio data contains a predetermined keyword.