JP2018055287A


Info

Publication number: JP2018055287A
Application number: JP2016188846A
Authority: JP
Inventors: 建鋒徐; Kenho Jo; 茂之酒澤; Shigeyuki Sakasawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2016-09-27
Filing date: 2016-09-27
Publication date: 2018-04-05
Anticipated expiration: 2036-09-27
Also published as: JP6600288B2

Abstract

[Problem] To provide an integration device that can set appropriate weights when integrating the results of applying a plurality of classifiers for action recognition in a video signal. [Solution] The integration device comprises a calculation unit and an integration unit. A classifier is applied to each of a plurality of time-series signals based on the video signal, yielding, for each time-series signal, a likelihood time series of each action type in the video signal; the calculation unit computes a likelihood vector of the action types by a weighted average of this likelihood time series. The integration unit obtains an integrated likelihood vector of the action types for the video signal by a weighted average of these likelihood vectors across the time-series signals. The calculation unit uses a weighted average based on the randomness of the distribution of the action-type elements at each time of the likelihood time series, and the integration unit uses a weighted average based on normalization coefficients derived from the range taken by each time-series signal's likelihood vector. [Selected drawing] FIG. 2

Description


The present invention relates to an integration device and a program that can set appropriate weights when integrating the results of applying a plurality of classifiers, such as deep convolutional neural networks, for action recognition in a video signal.

Convolutional neural networks (ConvNets) are known as feedforward neural networks that are not fully connected. Each layer consists of a convolution layer and a pooling layer, which allows patterns to be learned hierarchically.

As disclosed in Non-Patent Document 1, deep convolutional neural networks, in which a predetermined large number of such layers are stacked, have been used for image recognition and have greatly improved recognition accuracy. Specifically, Non-Patent Document 1 feeds image pixels (generally in three-channel RGB form) into a deep convolutional neural network, outputs feature values, and then applies a classifier, thereby realizing an image recognition task.

Non-Patent Document 1: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, pp. 1097-1105, 2012.
Non-Patent Document 2: Feng Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, P. E. Barbano, "Toward automatic phenotyping of developing embryos from videos," IEEE Transactions on Image Processing, vol. 14, no. 9, pp. 1360-1371, Sept. 2005.
Non-Patent Document 3: Karen Simonyan, Andrew Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," Advances in Neural Information Processing Systems 27, pp. 568-576, 2014.
Non-Patent Document 4: Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, "Long-term recurrent convolutional networks for visual recognition and description," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, 2015.
Non-Patent Document 5: Gunnar Farneback, "Two-Frame Motion Estimation Based on Polynomial Expansion," Image Analysis, Lecture Notes in Computer Science, vol. 2749, pp. 363-370, June 2003.

As described above, applying deep convolutional neural networks, which have succeeded in image recognition (recognition of still images) and similar tasks, to action recognition in video signals (moving images) has been studied.

The simplest approach, taken for example in Non-Patent Document 2, feeds each frame of the video into the network as an independent still image. However, because Non-Patent Document 2 uses neither temporal correlation nor motion information, it is applicable only to tasks in which motion information is unimportant (for example, recognizing stationary objects). It is therefore considered unsuitable for action recognition, where motion information matters.

Non-Patent Document 3, on the other hand, attempts so-called late fusion: two independent deep convolutional neural networks are trained, their two (intermediate) decisions are integrated, and the final result is determined. FIG. 1 schematically illustrates the method of Non-Patent Document 3.

In Non-Patent Document 3, the first deep convolutional neural network, as in Non-Patent Document 2, takes each frame of the video as an independent still image (an RGB image). FIG. 1 shows a video signal V1 consisting of RGB images; applying output process 11-1P, that is, the deep convolutional neural network, to the video signal V1 outputs a score time series FS1 for the action patterns 1, 2, ..., N. Applying calculation process 12-1P to the score time series FS1 then outputs a likelihood vector TS1 of the action patterns 1, 2, ..., N for the video signal V1.

In Non-Patent Document 3, the second deep convolutional neural network takes as input the dense optical flow computed between adjacent frames of the video (referred to here as a flow image). FIG. 1 shows a time-series signal V2 consisting of flow images; applying output process 11-2P, that is, the deep convolutional neural network, to the time-series signal V2 outputs a score time series FS2 for the action patterns 1, 2, ..., N. Applying calculation process 12-2P to the score time series FS2 then outputs a likelihood vector TS2 of the action patterns 1, 2, ..., N for the flow-image time-series signal V2.

Non-Patent Document 3 further integrates the scores (likelihoods of each action pattern) output by the two deep convolutional neural networks and uses the integrated score to determine the recognition result for the whole video. Non-Patent Document 3 reports that the network operating on flow images exploits motion information and is strongly complementary to the network operating on RGB images, so that the accuracy of the final recognition result improves.

In FIG. 1, the likelihood vectors TS1 and TS2, the scores output from the two deep convolutional neural networks, are integrated by integration process 13P into an integrated score INT_SC (an integrated likelihood vector), and evaluation process 14P applied to INT_SC yields the final evaluation result OUT (which of the action patterns 1, 2, ..., N the original video data corresponds to).

However, the method of Non-Patent Document 3 has the following problem. Non-Patent Document 3 integrates in the two steps described below, but the weighting used in the integration is not necessarily appropriate.

(First step): integration in calculation processes 12-1P and 12-2P of FIG. 1
The RGB image time series V1 and the flow image time series V2 are input to their respective deep convolutional neural networks, the likelihoods FS1 and FS2 are output as time-series scores for each action in each frame, and integration (calculation processes 12-1P and 12-2P) is then performed.

Specifically, this integration (calculation processes 12-1P and 12-2P) averages the per-frame, per-action time-series scores FS1 and FS2 over time for each action, and outputs the resulting per-action average scores as the scores TS1 and TS2 (likelihood vectors) averaged along the time axis of the RGB image time series and the flow image time series, respectively. The present inventor identified as a problem that uniform weights are used in this averaging.

(Second step): integration in integration process 13P of FIG. 1
The average score TS1 of the RGB image time series and the average score TS2 of the flow image time series output in the first step are averaged with weights for each action type 1, 2, ..., N, and the resulting per-action average scores are output as the overall score INT_SC (the integrated likelihood vector). The present inventor identified as a problem that the weights for this weighted average are fixed values based on prior knowledge. For example, Non-Patent Document 4, which is related to Non-Patent Document 3, reports that weights of 1/3 and 2/3 for the RGB image time series and the flow image time series give higher accuracy than uniform weights of 1/2 and 1/2. These are nevertheless fixed weights.
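The two prior-art steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `temporal_mean` and `fuse_fixed` and the toy score values are hypothetical, and the score series are plain Python lists rather than real network outputs.

```python
from statistics import fmean

def temporal_mean(fs):
    """First step: uniformly average an L x N score series over time."""
    n_actions = len(fs[0])
    return [fmean(frame[a] for frame in fs) for a in range(n_actions)]

def fuse_fixed(ts_list, weights):
    """Second step: average per-stream likelihood vectors with fixed weights."""
    total = sum(weights)
    n_actions = len(ts_list[0])
    return [sum(w * ts[a] for w, ts in zip(weights, ts_list)) / total
            for a in range(n_actions)]

# Toy data (made up): 3 frames, 2 action types per stream.
fs1 = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]  # RGB stream scores FS1
fs2 = [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]  # flow stream scores FS2

ts1 = temporal_mean(fs1)                         # TS1
ts2 = temporal_mean(fs2)                         # TS2
int_sc = fuse_fixed([ts1, ts2], [1 / 3, 2 / 3])  # INT_SC with fixed weights
predicted = max(range(len(int_sc)), key=int_sc.__getitem__)
```

Both the uniform temporal weights and the fixed 1/3 and 2/3 stream weights are independent of the video content, which is the limitation the text points out.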

The following circumstances explain why, as the present inventor found, fixed weights in the first and second steps are not necessarily appropriate.

Human actions contain moments that are important and moments that are not. A gesture, for example, is said to consist, in order, of a preparation phase, a main phase, and an ending phase. The preparation and ending phases generally reflect little of the essence of the action, so applying uniform weights that include these less important phases is thought to lower recognition accuracy.

Furthermore, time-series signals extracted from video (for example, the RGB image time-series signal V1 and the flow image time-series signal V2 illustrated in FIG. 1) are very strongly correlated between adjacent frames, so once a frame is misclassified, the same wrong result is likely to be produced for the following consecutive frames as well (a so-called joint-failure phenomenon). If such joint failures occur in a certain number of frames and the time series is nevertheless averaged with uniform weights, the averaged result is also likely to be wrong.

For these reasons, averaging with uniform weights in the first step is not necessarily optimal; it is desirable to analyze the video content and adopt adaptive weights. Likewise, in the second step it is desirable to set the weights adaptively by analyzing the video rather than to use fixed weights.

In view of the above problems of the prior art, an object of the present invention is to provide an integration device and a program that can set appropriate weights when integrating the results of applying a plurality of classifiers, such as deep convolutional neural networks, for action recognition in a video signal.

To achieve the above object, the present invention is an integration device comprising a calculation unit and an integration unit. A classifier is applied to each of a plurality of time-series signals (Vi; i = 1, 2, ..., M) extracted from or related to a video signal (V), yielding for each time-series signal (Vi) a likelihood time series (FSi) of each action type (act = 1, 2, ..., N) in the video signal (V). The calculation unit calculates a likelihood vector (TSi) of the action types (act) for each time-series signal (Vi) by a weighted average over this likelihood time series. The integration unit obtains an integrated likelihood vector (INT_SC) of the action types (act) for the video signal (V) by a weighted average of the likelihood vectors (TSi) across the corresponding time-series signals (Vi). The calculation unit calculates the likelihood vector (TSi) by a weighted average based on the randomness (v(t), H(t)) of the distribution of the action-type (act) elements at each time (t) of the likelihood time series (FSi), and the integration unit obtains the integrated likelihood vector (INT_SC) by a weighted average based on normalization coefficients (w^Vi) derived from the range taken by the likelihood vector (TSi) of each time-series signal (Vi).

The present invention is also a program that causes a computer to function as the integration device.

According to the integration device of the present invention, the calculation unit and the integration unit make it possible to set appropriate weights when integrating the results of applying a plurality of classifiers, such as deep convolutional neural networks, for action recognition in a video signal.

FIG. 1 is a diagram for explaining the prior-art method.
FIG. 2 is a functional block diagram of an integration device according to one embodiment.
FIG. 3 is a diagram showing a schematic example of the data flow in the processing of the units in FIG. 2.
FIG. 4 is a diagram showing an example of the calculated optical flow and its x component (dx) and y component (dy).
FIG. 5 is a diagram showing an example of an RGB image and the flow image synthesized from it.
FIG. 6 is a diagram showing, for the time-series signals of RGB images and flow images respectively, an example of the tendency for the variance to be large in correctly classified frames and small in misclassified frames, regardless of video type.

FIG. 2 is a functional block diagram of an integration device according to one embodiment. The integration device 10 comprises an output unit 11, a calculation unit 12, an integration unit 13, and an evaluation unit 14. FIG. 3 shows a schematic example of the data flow in the processing of the units in FIG. 2; its configuration is the same as that of FIG. 1. That is, the integration device 10 of the present invention realizes adaptive calculation of the weights, while the overall processing framework can be the same as that of Non-Patent Document 3 and the like cited above.

Specifically, output processes 11-1P and 11-2P of FIG. 1 are replaced by output units 11-1 and 11-2 of FIG. 3, calculation processes 12-1P and 12-2P of FIG. 1 by calculation units 12-1 and 12-2 of FIG. 3, integration process 13P of FIG. 1 by integration unit 13 of FIG. 3, and evaluation process 14P of FIG. 1 by evaluation unit 14 of FIG. 3. In the present invention the calculation unit 12 and the integration unit 13 perform their own weight calculation, but the overall framework can be the same as in FIG. 1.

The processing of each unit in FIG. 2 is described below with reference to the example of FIG. 3 as appropriate. Although FIG. 3 takes as an example the case of only two kinds of time-series signals, V1 and V2, the present invention is not limited to two kinds and, as shown in FIG. 2, applies to a general set of M kinds of time-series signals V1, V2, ..., VM.

[Output unit 11]
The output unit 11 applies a classifier such as a deep convolutional neural network to each of a plurality of (mutually different kinds of) time-series signals V1, V2, ..., VM (identified as Vi by the index i = 1, 2, ..., M) extracted from (or related to) the video signal V to be analyzed by the integration device 10, obtains a time-series signal of the likelihood of each action identified by the index act = 1, 2, ..., N, and outputs it to the calculation unit 12.

That is, the output unit 11 obtains a likelihood time-series signal FSi from the time-series signal Vi, and FSi consists of the likelihood time-series signals fs(i,act) of the individual actions act = 1, 2, ..., N. This relationship is expressed as follows.
FSi = (fs(i,1), fs(i,2), …, fs(i,N))
If the length (number of data points on the time axis) of each fs(i,act) is L and fs(i,act) is written as a column vector of size L×1, then FSi can be expressed as a matrix of size L×N.
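As a small illustration of this layout, the sketch below stacks N per-action series of length L into an L×N structure, so that row t is the per-frame likelihood vector. The helper name `stack_action_series` and the numbers are hypothetical, chosen only for this example.

```python
def stack_action_series(series_per_action):
    """Build FSi (L rows, N columns) from N per-action series of length L."""
    lengths = {len(s) for s in series_per_action}
    if len(lengths) != 1:
        raise ValueError("all per-action series must have the same length L")
    # zip(*...) turns N columns of length L into L rows of length N.
    return [list(row) for row in zip(*series_per_action)]

fs_i1 = [0.7, 0.6, 0.2]  # fs(i,1): likelihood of action 1 over L = 3 frames
fs_i2 = [0.3, 0.4, 0.8]  # fs(i,2): likelihood of action 2 over the same frames
FSi = stack_action_series([fs_i1, fs_i2])
frame_vector = FSi[2]    # likelihood vector of the third frame
```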

The time-series signals Vi (i = 1, 2, ..., M) extracted from (or related to) the video signal V can be a predetermined number M of signals, each of a predetermined kind, and may include the original video signal V itself. For example, FIG. 3 follows FIG. 1 and shows the case M = 2, where V1 is the original video signal V itself (for example, a time series of RGB signals) and V2 is the time series of flow images obtained from the video signal V.

Other time-series signals Vi extracted from or related to the original video signal V may also be used, for example a time series of depth images (depth maps). Different kinds of time-series signals (for example, the RGB image time series V1 and the flow image time series V2) need not share the same spatial resolution and/or temporal resolution (frame rate). Moreover, as long as the signals are based on the same video signal V, they may differ from one another in shooting angle or be out of synchronization.

In the example of FIG. 3, the output unit 11 of FIG. 2 applied to the signal Vi (i = 1, 2) is drawn separately as output unit 11-i. Likewise, the calculation unit 12 of FIG. 2 applied to the signal Vi (i = 1, 2) is drawn separately in FIG. 3 as calculation unit 12-i.

As described above, the output unit 11 applies a classifier such as a deep convolutional neural network to each time-series signal Vi to obtain the time-series signal FSi of the likelihood of each action. The classifier can be applied to each signal Vi in the same way as in existing methods such as Non-Patent Documents 2 and 3 cited above, so a detailed description is omitted.

The classifier is trained in advance on a large amount of training data, and the classifier built by that training is used in the output unit 11. The output unit 11 can thereby obtain (a time series of) likelihood vectors over the action types act = 1, 2, ..., N, where, for example, act = 1 means that the signal Vi shows "dancing" and act = 2 means that it shows "swimming". The details of the pre-training are also the same as in existing methods such as Non-Patent Documents 2 and 3 cited above, so a detailed description is omitted.

The output unit 11 may also perform the processing that extracts (some or all of) the time-series signals Vi from the video signal V. For example, when a time-series signal Vi is obtained as a time series of flow images, the method of Non-Patent Document 5 cited above may be used. An outline of that method follows.

Given two adjacent RGB frames of the video signal, the method of Non-Patent Document 5 computes one flow image of the same resolution.

First, the RGB image is converted to a luminance image Y by Equation 1.
Y = 0.299 × R + 0.587 × G + 0.114 × B  (Equation 1)
Here, R, G, and B are the R, G, and B values of a pixel in the input frame, and Y is the luminance value of that pixel.
The basic premise of Non-Patent Document 5 is that, within a small region of the image, the Y component of any pixel can be expressed on a quadratic polynomial basis as in Equation 2.
f₁(x) = x^T A₁ x + b₁^T x + c₁  (Equation 2)
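The luminance conversion of Equation 1 can be sketched as follows. The function names and the representation of a frame as nested lists of (R, G, B) tuples are assumptions made for this example only.

```python
def rgb_to_luma(r, g, b):
    """Equation 1: luminance Y of one pixel from its R, G, B values."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def frame_to_luma(frame):
    """Convert a frame given as rows of (R, G, B) tuples into rows of Y values."""
    return [[rgb_to_luma(*pixel) for pixel in row] for row in frame]

# A made-up 2 x 2 frame: white, black, pure red, pure blue.
frame = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 0, 255)]]
luma = frame_to_luma(frame)
```

Because the three coefficients sum to 1, a white pixel keeps its full value and a black pixel stays at zero.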

Here, x is the position coordinate of the target pixel in the Y component of the first frame, A₁, b₁, and c₁ are coefficients calculated for the region, and f₁(x) is the Y component of the target pixel.
Similarly, the corresponding region of the second frame is given by Equation 3.
f₂(x) = f₁(x − d) = (x − d)^T A₁ (x − d) + b₁^T (x − d) + c₁
= x^T A₁ x + (b₁ − 2A₁d)^T x + d^T A₁ d − b₁^T d + c₁
= x^T A₂ x + b₂^T x + c₂  (Equation 3)

Here, d is the optical flow (difference in position coordinates) of the target pixel x, and A₂, b₂, and c₂ are coefficients calculated for the region.
Comparing the coefficients of x in Equation 3 gives b₂ = b₁ − 2A₁d, so the optical flow d can be calculated by Equation 4.
d = −(1/2) A₁^(−1) (b₂ − b₁)  (Equation 4)
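Equation 4 can be checked numerically for a single pixel. The sketch below is a simplified illustration, not the full method of Non-Patent Document 5: it assumes a symmetric 2×2 matrix A₁ that is already known, uses made-up coefficients, and the helper names `solve_2x2` and `flow_from_coefficients` are hypothetical.

```python
def solve_2x2(a, rhs):
    """Solve the 2x2 linear system a @ x = rhs by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("A1 is (near-)singular; the flow is not determined")
    return [(rhs[0] * a22 - a12 * rhs[1]) / det,
            (a11 * rhs[1] - a21 * rhs[0]) / det]

def flow_from_coefficients(a1, b1, b2):
    """Equation 4: d = -(1/2) * A1^(-1) * (b2 - b1)."""
    rhs = [-0.5 * (b2[0] - b1[0]), -0.5 * (b2[1] - b1[1])]
    return solve_2x2(a1, rhs)

# Consistency check against Equation 3, where b2 = b1 - 2 * A1 * d.
a1 = [[2.0, 0.5], [0.5, 1.0]]  # symmetric, made-up coefficients
b1 = [1.0, -1.0]
d_true = [0.5, -0.25]          # displacement we expect to recover
a1d = [a1[0][0] * d_true[0] + a1[0][1] * d_true[1],
       a1[1][0] * d_true[0] + a1[1][1] * d_true[1]]
b2 = [b1[0] - 2 * a1d[0], b1[1] - 2 * a1d[1]]
d = flow_from_coefficients(a1, b1, b2)
```

Solving the linear system directly avoids forming the inverse of A₁ explicitly, and the singularity check flags regions where the flow cannot be recovered from these coefficients.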

FIG. 4 shows an example of the x component (dx) and y component (dy) of the calculated optical flow. In FIG. 4, (1) draws both components as a vector field on the original image, and (2) and (3) show each component as a grayscale image. In the present invention, a third component dz can additionally be calculated, for example by Equation 5, where | | denotes the absolute value.
dz = |dx² + dy²|  (Equation 5)

From dx, dy, and dz above, one flow image can be synthesized with these as its three image channels. FIG. 5 shows an example of (1) an RGB image and (2) the flow image synthesized from it.
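The channel composition can be sketched as below, assuming the flow components are already available as two same-sized planes. The names `flow_pixel` and `flow_image` are hypothetical, and any rescaling of the channels to an 8-bit image range is left out because the text does not specify it.

```python
def flow_pixel(dx, dy):
    """Return the (dx, dy, dz) channels of one pixel, with dz from Equation 5."""
    dz = abs(dx * dx + dy * dy)
    return (dx, dy, dz)

def flow_image(dx_plane, dy_plane):
    """Combine per-pixel dx and dy planes into one three-channel flow image."""
    return [[flow_pixel(dx, dy) for dx, dy in zip(row_x, row_y)]
            for row_x, row_y in zip(dx_plane, dy_plane)]

# Made-up 2 x 2 flow components.
dx_plane = [[3.0, 0.0], [-1.0, 0.5]]
dy_plane = [[4.0, 0.0], [2.0, -0.5]]
image = flow_image(dx_plane, dy_plane)
```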

[Calculation unit 12]
For the likelihood score time series FSi of each action act of each time-series signal Vi obtained by the output unit 11, the calculation unit 12 calculates an adaptive weight w(t) for each frame Vi(t) of the signal Vi (where t = 1, 2, ... denotes the time t in the signal Vi, that is, the frame number). It then averages the time series FSi using these per-frame adaptive weights w(t) to obtain the time-averaged likelihood score TSi of each action act, and outputs it to the integration unit 13. (Strictly, since the weight w(t) is obtained separately for each signal Vi, it depends on i and should be written as, for example, w(t)[i]. To keep the notation simple, Vi is treated as fixed and clear from context, and the weight is abbreviated as w(t) in the description of the calculation unit 12 and elsewhere.)

First, the idea behind the adaptive weight w(t) calculated by the calculation unit 12 of the present invention is explained.

In the present invention, as a preliminary analysis on sample data, each frame Vi(t) was classified independently from its score FSi(t) (the likelihood vector of the actions act at time t of the likelihood score time series FSi), that is, the frame was assigned to the action type act with the maximum likelihood in the vector FSi(t). Statistics were then collected over the frames with wrong results and the frames with correct results, and the statistical characteristics of the two groups were analyzed.

The analysis showed that in misclassified frames the scores are distributed relatively uniformly and their variance is small, whereas in correctly classified frames the variance of the scores is relatively large.
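One way to act on this observation is to let frames with a peaked score distribution count more in the temporal average. The sketch below is an assumption, not the weight formula of the invention, which is not given at this point in the text: it takes w(t) proportional to the population variance of FSi(t), and the function names and toy scores are hypothetical.

```python
from statistics import pvariance

def frame_variances(fs):
    """Population variance of the N scores in each frame of an L x N series."""
    return [pvariance(frame) for frame in fs]

def weighted_temporal_mean(fs, weights):
    """Average the per-frame score vectors over time with per-frame weights."""
    total = sum(weights)
    if total == 0:
        # Every frame is perfectly uniform: fall back to uniform weights.
        weights, total = [1.0] * len(fs), float(len(fs))
    n_actions = len(fs[0])
    return [sum(w * frame[a] for w, frame in zip(weights, fs)) / total
            for a in range(n_actions)]

fs = [[0.90, 0.05, 0.05],  # peaked scores: large variance
      [0.34, 0.33, 0.33],  # near-uniform scores: small variance
      [0.80, 0.10, 0.10]]
w = frame_variances(fs)    # hypothetical choice: w(t) = var(FSi(t))
ts = weighted_temporal_mean(fs, w)
```

With these toy values the near-uniform middle frame contributes almost nothing, so the first component of `ts` is higher than under a uniform time average.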

For example, FIG. 6 covers 101 kinds of actions. Each frame Vi(t) of the time-series data Vi was checked independently as correct or incorrect from the value of its likelihood vector FSi(t), and for each action (horizontal axis) the figure shows the mean variance of the scores of the correct frames and the mean variance of the incorrect frames (vertical axis). For both [1] RGB images (V1 in FIG. 3) and [2] flow images (V2 in FIG. 3), the correct frames have the larger variance.

In Fig. 6, for both [1] and [2], the mean variance of the correct frames, m_var_correct(act), is shown in the upper part and the mean variance of the incorrect frames, m_var_incorrect(act), is shown in the lower part.

In Fig. 6, the variance for each action act is obtained as follows. First, the likelihood vector FSi(t) of the frame Vi(t) at time t is taken to be an N-dimensional vector with one element fsi_[act](t) per action type act = 1, 2, …, N, consisting of N components as in (5A) below.
FSi(t) = (fsi_[1](t), fsi_[2](t), …, fsi_[N](t)) (5A)

The mean variance of the correct frames for action act in Fig. 6, m_var_correct(act), is the mean of the set of variance values var(FSi(t)) that satisfy (5B) below. That is, it is the mean variance of the likelihood vectors FSi(t) over the frames Vi(t) for which the action type act is judged "correct". Here, var(X) denotes the variance of the sample set X.
{var(FSi(t)) | max{fsi_[a](t) | a = 1, 2, …, N} = fsi_[act](t)} (5B)
Similarly, the mean variance of the incorrect frames for action act in Fig. 6, m_var_incorrect(act), is the mean of the set of variance values var(FSi(t)) that satisfy (5C) below. That is, it is the mean variance of the likelihood vectors FSi(t) over the frames Vi(t) for which the action type act is not judged "correct" (is judged "incorrect").
{var(FSi(t)) | max{fsi_[a](t) | a = 1, 2, …, N} ≠ fsi_[act](t)} (5C)
Here, the frames Vi(t) for which the likelihood vectors FSi(t) in (5B) and (5C) are obtained are the frames of a predetermined test video. Note that this test video does not need to be given ground-truth labels indicating which action type act applies.
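The statistic plotted in Fig. 6 can be sketched as follows. This is a minimal illustration rather than the inventors' implementation: the function names, the use of the population variance for var(X), the 0-based action index, and the toy scores are all assumptions. The code splits frames by whether the given action attains the maximum score, as in (5B) and (5C), and averages the per-frame variance in each group.

```python
def frame_variance(scores):
    """Population variance of one frame's likelihood vector FSi(t) (assumed form of var)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((s - mean) ** 2 for s in scores) / n


def average(values):
    """Mean of a list, or None when the list is empty."""
    return sum(values) / len(values) if values else None


def mean_variance_split(frames, act):
    """Mean per-frame variance over the frames satisfying (5B) and (5C).

    frames: list of likelihood vectors, one per frame.
    act: 0-based action index (the text numbers actions from 1).
    """
    in_5b, in_5c = [], []
    for scores in frames:
        if scores[act] == max(scores):
            in_5b.append(frame_variance(scores))
        else:
            in_5c.append(frame_variance(scores))
    return average(in_5b), average(in_5c)


# Hypothetical toy data: three frames, three action types.
frames = [
    [0.8, 0.1, 0.1],  # peaked, maximum at action 0
    [0.4, 0.3, 0.3],  # nearly uniform, maximum at action 0
    [0.2, 0.6, 0.2],  # maximum at action 1
]
m_5b, m_5c = mean_variance_split(frames, act=0)
```

A peaked frame such as the first one contributes a much larger variance than a nearly uniform frame such as the second, which is the property the weighting in the calculation unit 12 relies on.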

Based on the idea described above, the calculation unit 12 can obtain the adaptive weight w(t) by executing Steps 1 to 3 below in order. As noted earlier, the dependence of the weight w(t) on i (i.e., on the signal Vi) is omitted from the notation in the following description.

[Step 1] (Calculation of "randomness"): Given the scores S(act, t) of a frame, the variance v(t) is calculated as the "randomness" using (Equation 6).

Here, as described above, act is the ID of the action type, t is the ID of the frame, and N is the number of actions. The score S(act, t) is, when the i dependence is written out, the act component of the likelihood vector given in (5A) above.
S(act, t) = "act component of the likelihood vector FSi(t)"
= fsi_[act](t)
Instead of the variance, the entropy H(t) may be calculated as the "randomness" using (Equation 7).
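Since (Equation 6) and (Equation 7) are not reproduced in this text, the following is only a sketch of the two randomness measures under stated assumptions: the variance is taken as the population variance over the N action scores, and the entropy as the Shannon entropy of the scores after rescaling them to sum to 1 (which presumes non-negative scores with a positive total). The function names are illustrative.

```python
import math


def score_variance(scores):
    """Variance v(t) of the N action scores S(act, t) of one frame (assumed population form)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((s - mean) ** 2 for s in scores) / n


def score_entropy(scores):
    """Shannon entropy H(t) of the scores treated as a probability distribution.

    Assumes non-negative scores with a positive sum; they are rescaled to sum to 1.
    """
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical frames: one confident, one ambiguous.
peaked = [0.90, 0.05, 0.05]
flat = [0.34, 0.33, 0.33]
v_peaked, v_flat = score_variance(peaked), score_variance(flat)
h_peaked, h_flat = score_entropy(peaked), score_entropy(flat)
```

The two measures move in opposite directions: a peaked frame has a high variance but a low entropy. A weight built on the entropy would therefore presumably have to decrease as H(t) grows, whereas a weight built on the variance increases with v(t).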

[Step 2] (Setting the weights): Using the score variance v(t) of each frame calculated in Step 1 as input, the weight w(t) of each frame is calculated using (Equation 8).

Here, as described above, t is the frame ID, and the sum is taken over t = 1, 2, ..., T, where T is the number of frames (the total number of frames of the time-series signal Vi under consideration). (Equation 8) uses the variance v(t), but the entropy H(t) may be used instead.

Furthermore, in one optional embodiment, the weight w(t) of each frame may be further corrected from the value obtained with (Equation 8) on the basis of prior knowledge. For example, in the first part the weight may be made to rise gradually from (nearly) zero until it reaches the original value of (Equation 8), and in the last part it may be made to fall gradually from the original value of (Equation 8) to (nearly) zero. That is, the first part is the interval 1 ≦ t ≦ T1 and the last part is the interval T2 ≦ t ≦ T, where T1 < T2. Specifically, the obtained weight w(t) may be corrected, for example, with (Equation 8A) for the first part and (Equation 8B) for the last part. In (Equation 8A) and (Equation 8B), the symbol "←" means, as in common computer-program notation, that the value on the left side (the variable w(t)) is updated with the value on the right side (i.e., the corrected value w(t) on the left is the value of the right side calculated from the uncorrected value w(t)).
w(t) ← w(t) × (t/T1) (Equation 8A)
w(t) ← w(t) × {1 − (t − T2)/(T − T2)} (Equation 8B)
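Step 2 and its optional correction can be sketched as follows. (Equation 8) is not reproduced in this text, so `variance_weights` assumes the simplest form consistent with the mention of a sum over t, namely w(t) = v(t) / Σ v(t'); the uniform fallback for an all-zero input is also an assumption. `ramp_correct` follows (Equation 8A) and (Equation 8B) as written above. Both function names are illustrative.

```python
def variance_weights(variances):
    """Assumed form of Equation 8: w(t) = v(t) divided by the sum of v over all frames."""
    total = sum(variances)
    if total == 0:
        # Every frame is equally uninformative: fall back to a uniform average.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def ramp_correct(weights, t1, t2):
    """Apply Equation 8A for 1 <= t <= t1 and Equation 8B for t2 <= t <= T.

    Frame numbers t are 1-based as in the text. Requires t1 < t2 < T.
    """
    big_t = len(weights)
    corrected = list(weights)
    for t in range(1, big_t + 1):
        if t <= t1:
            corrected[t - 1] *= t / t1
        elif t >= t2:
            corrected[t - 1] *= 1 - (t - t2) / (big_t - t2)
    return corrected


# Hypothetical per-frame variances for a five-frame signal.
w = variance_weights([0.02, 0.10, 0.10, 0.10, 0.08])
w_ramped = ramp_correct(w, t1=2, t2=4)
```

With these formulas the first frame is scaled by 1/T1 and the last frame by zero. The text does not say whether the corrected weights are renormalized afterwards, so this sketch leaves them as they are.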

[Step 3] (Calculation of the time-series score): Using the scores S(act, t) and the weights w(t) of all frames as input, the adaptively time-averaged score SA(act) is calculated using (Equation 9). As in (Equation 8), the sum is taken over t = 1, 2, ..., T.
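Step 3 can be sketched as a weighted sum over frames. (Equation 9) is not reproduced in this text; the code assumes SA(act) = Σ w(t) · S(act, t) with weights that already sum to 1, so no further division is applied. The function name and the toy data are illustrative.

```python
def adaptive_time_average(scores, weights):
    """Assumed form of Equation 9: SA(act) = sum over t of w(t) * S(act, t).

    scores: T x N nested list, scores[t][act].
    weights: list of T per-frame weights.
    """
    n_actions = len(scores[0])
    return [
        sum(w * frame[act] for w, frame in zip(weights, scores))
        for act in range(n_actions)
    ]


# Hypothetical two-frame signal with three action types.
scores = [
    [0.4, 0.3, 0.3],  # ambiguous frame, given a small weight
    [0.1, 0.8, 0.1],  # confident frame, given a large weight
]
weights = [0.2, 0.8]
sa = adaptive_time_average(scores, weights)
```

Compared with a uniform average, which would give the second action a score of 0.55, the adaptive average lets the confident frame dominate.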

[Integration unit 13]
The integration unit 13 obtains an integrated score INT_SC by integrating the time-averaged scores (likelihood vectors) TSi that the calculation unit 12 calculated for each time-series signal Vi using the adaptive weights w(t)[i] (with the i dependence written out here), and outputs it to the evaluation unit 14.

Before the specific processing of the integration unit 13 is described, the idea behind its integration processing is described first.

When correct frames and incorrect frames are compared across signals of different types, such as the RGB image time series V1 and the flow image time series V2 of Fig. 3, the conclusion that "the score variance is relatively large in correct frames" does not necessarily hold, because the absolute magnitude of the variance differs between the signal types.

On the contrary, as shown in Fig. 7, the variance of the scores of [1] the RGB images is clearly larger than that of [2] the flow images, regardless of whether a frame is correct or incorrect. In other words, even a correct frame of the flow images has a more uniform score distribution than an incorrect frame of the RGB images.

At the same time, estimation from RGB images and estimation from flow images are strongly complementary, so rather than discarding either result it is desirable to combine them fairly so that both can be used. The weights are therefore set so as to normalize the time-series scores of the RGB images and of the flow images.

Put differently, consider a situation in which the estimates from the RGB images (time-series signal V1) have a large spread, so that the action type act is estimated with strong confidence although the accuracy is not necessarily high, whereas the estimates from the flow images (time-series signal V2) have a small spread, so that the action type act is estimated with only weak confidence although the accuracy is high. In such a situation, each score is normalized so that the estimates from the two signals V1 and V2 complement each other.

When two types of time-series signals V1 and V2 are integrated, the specific normalization and integration processing in the integration unit 13 can be carried out, for example, as in items 1 to 4 below.

1. Normalizing by the maximum score
As normalization weights based on the maximum score, the weight w^V1 of the signal V1 and the weight w^V2 of the signal V2 can be obtained as in the following set of expressions (Equation 10). Here, TSi(act) (i = 1, 2) is the value of the component for action type act in the score TSi (likelihood vector) adaptively time-averaged by the calculation unit 12.

2. Normalizing by the average score
As an alternative to item 1 above, the weight w^V1 of the signal V1 and the weight w^V2 of the signal V2 can also be obtained as normalization weights based on the average score, as in the following set of expressions (Equation 11). N is the total number of action types act, as described above.

3. Restricting the weights to a reasonable range
Furthermore, whichever of "1. maximum" and "2. average" above is used to obtain the weight w^V1 of the signal V1 and the weight w^V2 of the signal V2, when the condition of (Equation 12) below is met, it is preferable to restrict the weight w^V1 (and w^V2) to a preset reasonable range [THL, THH] as given in (Equation 12).

4. Calculating the final integrated score
Using the weight w^V1 of the signal V1 and the weight w^V2 of the signal V2 obtained by the processing of "1 or 2" and "3" above, the integrated score INT_SC (whose element for action type act in the likelihood vector is written INT_SC(act)) can be calculated as in (Equation 13) below.
INT_SC(act) = w^V1 TS1(act) + w^V2 TS2(act) (Equation 13)
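Items 1, 3, and 4 can be sketched together for two signals. Only (Equation 13) is given in this text; (Equation 10) and (Equation 12) are not reproduced, so `max_normalization_weights` and `clamp_weights` are hypothetical stand-ins. The first assumes each signal is weighted by the reciprocal of its maximum score with the two weights rescaled to sum to 1, and the second assumes w^V1 is clipped to [THL, THH] with w^V2 set to 1 − w^V1. The threshold values are arbitrary.

```python
def max_normalization_weights(ts1, ts2):
    """Hypothetical stand-in for Equation 10.

    Each signal is weighted by the reciprocal of its maximum score and the
    two weights are rescaled to sum to 1.
    """
    m1, m2 = max(ts1), max(ts2)
    w1 = m2 / (m1 + m2)
    return w1, 1.0 - w1


def clamp_weights(w1, th_low, th_high):
    """Hypothetical stand-in for Equation 12: keep w1 inside [THL, THH]."""
    w1 = min(max(w1, th_low), th_high)
    return w1, 1.0 - w1


def integrate(ts1, ts2, w1, w2):
    """Equation 13: INT_SC(act) = w^V1 * TS1(act) + w^V2 * TS2(act)."""
    return [w1 * a + w2 * b for a, b in zip(ts1, ts2)]


# Hypothetical time-averaged scores for three action types.
ts_rgb = [0.90, 0.06, 0.04]   # large spread
ts_flow = [0.30, 0.40, 0.30]  # small spread
w1, w2 = max_normalization_weights(ts_rgb, ts_flow)
w1, w2 = clamp_weights(w1, th_low=0.2, th_high=0.8)
int_sc = integrate(ts_rgb, ts_flow, w1, w2)
```

Under this assumed normalization the weighted maxima of the two signals become equal, so the signal with the larger spread no longer dominates the sum merely because of its scale.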

The description above took the integration of two signals V1 and V2 as an example, but the weights can be calculated in exactly the same way when three or more signals are integrated.

For example, the weights w^V1, w^V2, and w^V3 for integrating three signals V1, V2, and V3 (for example, an RGB image time-series signal, a flow image time-series signal, and a depth image time-series signal) by "1. normalizing by the maximum score" above can be obtained with the following set of expressions (Equation 14).

From (Equation 14) above, it is also clear how to calculate each weight when normalizing by the average rather than by the maximum. More generally, when M signals V1, V2, …, VM are used, the weight coefficient w^Vi of each signal Vi may be obtained as follows, as a further generalization of (Equation 14).
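The generalization to M signals can be sketched as below. Neither (Equation 14) nor its general form is reproduced in this text, so the code is a hypothetical illustration: it assumes each weight is proportional to the reciprocal of a scale statistic of TSi (the maximum, or the mean) and that the weights are rescaled to sum to 1. The function names and sample vectors are illustrative.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def normalization_weights(ts_list, stat=max):
    """Hypothetical normalization weights for M signals.

    ts_list: list of M likelihood vectors TSi.
    stat: function reducing a vector to its scale (max, or mean).
    Weights are proportional to 1 / stat(TSi) and sum to 1.
    """
    inverse = [1.0 / stat(ts) for ts in ts_list]
    total = sum(inverse)
    return [x / total for x in inverse]


# Hypothetical time-averaged scores for three signals.
ts_list = [
    [0.9, 0.3, 0.3],  # e.g. RGB images
    [0.3, 0.1, 0.2],  # e.g. flow images
    [0.6, 0.3, 0.3],  # e.g. depth images
]
w_max = normalization_weights(ts_list, stat=max)
w_mean = normalization_weights(ts_list, stat=mean)
```

Passing the statistic as a parameter keeps the maximum-based and average-based variants in one function, which mirrors the remark that the average case follows directly from the maximum case.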

[Evaluation unit 14]
The evaluation unit 14 uses the integrated score INT_SC (the likelihood vector whose components are INT_SC(act)) obtained by the integration unit 13 as described above to evaluate which action type (which of act = 1, 2, …, N) applies to the original video signal V from which the time-series signals V1, V2, …, VM were extracted (or to which they were given as related). Specifically, as in (Equation 15) below, the action whose component value INT_SC(act), a likelihood value in the integrated score INT_SC, is largest can be given as the evaluation result act = act_{[evaluation result]}.
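The selection described for the evaluation unit 14 amounts to an argmax over the integrated score. A minimal sketch, with an illustrative function name and made-up scores; returning a 1-based ID matches the numbering act = 1, 2, …, N used in the text, and ties are resolved in favor of the lowest ID, which the text does not specify.

```python
def evaluate(int_sc):
    """Return the 1-based action ID whose integrated score INT_SC(act) is largest."""
    best_index = max(range(len(int_sc)), key=lambda a: int_sc[a])
    return best_index + 1  # actions are numbered from 1 in the text


# Hypothetical integrated score for three action types.
act_result = evaluate([0.22, 0.48, 0.30])
```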

As described above, according to the present invention, adaptive weights are calculated and used to obtain the final evaluation result when the action type of the original video signal V is evaluated from the time-series signals FS1, FS2, …, FSM (likelihood score time-series signals for each action type act) obtained by applying a classifier such as a deep convolutional neural network to each of the plurality of time-series signals V1, V2, …, VM extracted from or related to the video signal V.

Accordingly, apart from the processing that calculates the weights adaptively, the present invention can adopt the same framework as the conventional method. When the deep convolutional neural networks for the time-series signals V1, V2, …, VM are applied to action recognition of the video signal V, there is therefore no side effect such as a need to retrain the models, and the recognition accuracy can be improved with almost no effect on the amount of computation or memory consumption.

Supplementary matters concerning the present invention are described below.

(1) The integration device 10 of Fig. 2 has been described as obtaining adaptive weights in both the calculation unit 12 and the integration unit 13. In the present invention, however, the adaptive weights may be calculated in only one of the two units, with the other using fixed weights as in the conventional method of Non-Patent Document 3 cited above. Even then, recognition accuracy can be improved over the conventional method (the method of Fig. 1, which, as explained under "Problem", uses conventional fixed weights in the processing corresponding to both the calculation unit 12 and the integration unit 13).

That is, the calculation unit 12 may use a uniform average along the time axis as in the conventional method while the integration unit 13 uses the adaptive weights w^Vi (i = 1, 2, …, M) of the present invention described above. Conversely, the calculation unit 12 may use the adaptive weights w(t) based on the variance v(t) of the present invention described above while the integration unit 13 uses fixed weights as in the conventional method.

From the viewpoint of recognition accuracy, however, it is preferable to adopt the adaptive weights of the present invention described above in both the calculation unit 12 and the integration unit 13, rather than in only one of them.

(2) The present invention has been described as concerning the adaptive setting of weights used when the action type act (= 1, 2, …, N) in the video signal V is recognized by integrating a plurality of classifiers. More generally, the present invention applies in exactly the same way not only to action types in the video signal V (for example, actions of humans or other living beings that carry some intention) but also to the identification of motion types act (= 1, 2, …, N). That is, the present invention applies in exactly the same way to the adaptive calculation of the weights for integrating a plurality of classifiers when identifying the type of a general motion in the video signal V (a motion produced by a non-living or other general object, with or without intention, such as "a car is running").

(3) The present invention can also be provided as a program that causes a computer to function as all or any part of the units of the integration device 10. The computer may have a well-known hardware configuration including a CPU (central processing unit), memory, and various interfaces. The CPU, which loads and executes the predetermined program, realizes each unit by executing the instructions corresponding to the function of that unit.

10 … integration device, 11 … output unit, 12 … calculation unit, 13 … integration unit, 14 … evaluation unit

Claims


An integration device comprising: a calculation unit that calculates, for each of a plurality of time-series signals (Vi; i = 1, 2, …, M) extracted from or related to a video signal (V), a likelihood vector (TSi) of each motion type (act) by taking a weighted average, along the likelihood time series, of the likelihood time series (FSi) of each motion type (act = 1, 2, …, N) in the video signal (V), the likelihood time series being obtained by applying a classifier to each of the time-series signals (Vi); and
an integration unit that obtains an integrated likelihood vector (INT_SC) of each motion type (act) in the video signal (V) by taking a weighted average of the likelihood vectors (TSi) of each motion type (act) over the corresponding time-series signals (Vi),
wherein the calculation unit calculates the likelihood vector (TSi) by a weighted average based on the randomness (v(t), H(t)) of the distribution of the elements for the motion types (act) at each time (t) of the likelihood time series (FSi), and
the integration unit obtains the integrated likelihood vector (INT_SC) of each motion type by a weighted average based on normalization coefficients (w^Vi) that are based on the ranges taken by the respective likelihood vectors (TSi) of the time-series signals (Vi).

An integration device comprising: a calculation unit that calculates, for each of a plurality of time-series signals (Vi; i = 1, 2, …, M) extracted from or related to a video signal (V), a likelihood vector (TSi) of each motion type (act) by taking a weighted average, along the likelihood time series, of the likelihood time series (FSi) of each motion type (act = 1, 2, …, N) in the video signal (V), the likelihood time series being obtained by applying a classifier to each of the time-series signals (Vi); and
an integration unit that obtains an integrated likelihood vector (INT_SC) of each motion type (act) in the video signal (V) by taking a weighted average of the likelihood vectors (TSi) of each motion type (act) over the corresponding time-series signals (Vi),
wherein the calculation unit calculates the likelihood vector (TSi) by a weighted average based on the randomness (v(t), H(t)) of the distribution of the elements for the motion types (act) at each time (t) of the likelihood time series (FSi).

The integration device according to claim 1 or 2, wherein the calculation unit evaluates the randomness (v(t), H(t)) as a variance (v(t)).

The integration device according to claim 1 or 2, wherein the calculation unit evaluates the randomness (v(t), H(t)) as an entropy (H(t)).

An integration device comprising: a calculation unit that calculates, for each of a plurality of time-series signals (Vi; i = 1, 2, …, M) extracted from or related to a video signal (V), a likelihood vector (TSi) of each motion type (act) by taking a weighted average, along the likelihood time series, of the likelihood time series (FSi) of each motion type (act = 1, 2, …, N) in the video signal (V), the likelihood time series being obtained by applying a classifier to each of the time-series signals (Vi); and
an integration unit that obtains an integrated likelihood vector (INT_SC) of each motion type (act) in the video signal (V) by taking a weighted average of the likelihood vectors (TSi) of each motion type (act) over the corresponding time-series signals (Vi),
wherein the integration unit obtains the integrated likelihood vector (INT_SC) of each motion type by a weighted average based on normalization coefficients (w^Vi) that are based on the ranges taken by the respective likelihood vectors (TSi) of the time-series signals (Vi).

The integration device according to claim 1 or 5, wherein the integration unit determines the normalization coefficient (w^Vi) based on the maximum of the element values (TSi(act)) for the motion types (act) that make up the likelihood vector (TSi) of each time-series signal (Vi).

The integration device according to claim 1 or 5, wherein the integration unit determines the normalization coefficient (w^Vi) based on the average of the element values (TSi(act)) for the motion types (act) that make up the likelihood vector (TSi) of each time-series signal (Vi).

The integration device according to any one of claims 1, 5, 6, and 7, wherein, when determining the normalization coefficients (w^Vi), the integration unit corrects any coefficient that does not fall within a predetermined range [THL, THH] between a lower and an upper limit so that it falls within that range.

The integration device according to any one of claims 1 to 8, further comprising an evaluation unit that outputs, as the result for the video signal (V), the motion type corresponding to the maximum among the elements of the likelihood vector (INT_SC) of each motion type (act) obtained by the integration unit.

The integration device according to any one of claims 1 to 9, wherein the applied classifier is a deep convolutional neural network.

A program that causes a computer to function as the integration device according to any one of claims 1 to 10.