

Technical Field
The present invention relates to the technical field of video action prediction, and in particular to a video action prediction method based on a multimodal LSTM with a self-attention mechanism.
Background Art
Vision-based action recognition has long been one of the difficulties and hotspots of computer vision research, involving image processing, deep learning, artificial intelligence and other disciplines. It not only has high academic research value, but also has broad application prospects for video analysis and understanding amid the booming Internet industry of the 5G era. The current focus of the action recognition field is how to correctly recognize the complete action contained in a video. In practical applications, however, people would rather have a surveillance system warn of potential risks at the monitored site, so that dangerous behaviors can be stopped before they cause serious consequences, instead of recognizing actions that have already been completed or detecting the consequences they cause. To achieve this, the surveillance system must be given vision and the ability to predict actions.
Action prediction refers to predicting the category of an action as early as possible, before the action in the video is completed, by extracting and processing features of a continuously arriving video stream. The main difference between action prediction and action recognition lies in the completeness of the recognized object: the former operates on video clips taken before the action occurs, which do not contain the action that is about to happen, whereas the latter operates on the complete video containing the action. Action prediction is therefore the more challenging task. First, some actions look similar in their early stages; for example, "shaking hands" and "waving" both begin with raising a hand, and such similar motions make the features extracted from the video stream poor at distinguishing the two actions. Second, because of how the action prediction task is defined, the time required to complete the whole action is unknown, so different actions cannot be distinguished by their duration. Consequently, from the observed portion of the video it is neither possible to obtain features with the key semantics needed to distinguish actions that start similarly, nor to recover the complete temporal structure of the action. Third, since the selected video clip precedes the action segment to be predicted, such input data often has only a weak connection with the action to be predicted.
Action prediction methods usually extract features from videos and model the mapping between features and action categories to predict future actions. The quality of the prediction therefore depends largely on how well the features describe incomplete actions and on whether the model can learn the temporal motion pattern specific to the target action. Before deep learning methods emerged, traditional machine learning methods such as bag-of-words models and support vector machines were used for action prediction. In recent years, deep learning methods have become mainstream in computer vision: convolutional networks can extract high-level features with rich semantics that can be used for recognition and detection, and these features can be further fused or encoded to improve model performance.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art by providing a video action prediction method based on a multimodal LSTM with a self-attention mechanism.
The object of the present invention can be achieved by the following technical solution:
A video action prediction method based on a multimodal LSTM with a self-attention mechanism, comprising the following steps:
Step 1: prepare the training data set and preprocess the original videos to obtain RGB images and optical flow images;
Step 2: extract RGB features and optical flow features from the RGB images and optical flow images through a TSN network, and obtain object-detection-related features from the training data set through a Faster-RCNN object detector;
Step 3: build a multimodal LSTM network model based on a self-attention mechanism, feed the RGB features, optical flow features and object-detection-related features obtained in Step 2 into the model for training, and output the corresponding action class distribution tensors;
Step 4: build a fusion network to assign weights to the action class distribution tensors, and combine the weights with the distribution tensors to obtain the final video action prediction result.
Further, Step 1 comprises the following sub-steps:
Step 101: select the data sets used for training and for obtaining the object-detection-related features;
Step 102: decompose the original video at a set frame rate to extract RGB images;
Step 103: extract optical flow images from the original video using the TVL1 algorithm.
Further, the data sets in Step 101 are the EPIC-KITCHENS data set and the EGTEA Gaze+ data set.
Further, the frame rate set in Step 102 is 30 fps.
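For illustration, the following is a minimal preprocessing sketch of Steps 102 and 103. It assumes the opencv-contrib-python package, which exposes the TVL1 estimator under cv2.optflow, and a source video already recorded at the set 30 fps; the flow clipping range and the file naming are illustrative choices, not part of the claimed method.

```python
# Minimal sketch: write RGB frames and TVL1 optical flow images for one video.
# Assumes opencv-contrib-python (cv2.optflow) and a source video at the set 30 fps.
import os
import cv2
import numpy as np

def extract_rgb_and_flow(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()   # TVL1 optical flow estimator
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"rgb_{idx:06d}.jpg"), frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = tvl1.calc(prev_gray, gray, None)          # H x W x 2 (dx, dy)
            # Clip to [-20, 20] and rescale to [0, 255]; a common convention, assumed here.
            flow = np.clip(flow, -20, 20)
            flow = ((flow + 20) / 40.0 * 255.0).astype(np.uint8)
            cv2.imwrite(os.path.join(out_dir, f"flow_x_{idx:06d}.jpg"), flow[..., 0])
            cv2.imwrite(os.path.join(out_dir, f"flow_y_{idx:06d}.jpg"), flow[..., 1])
        prev_gray = gray
        idx += 1
    cap.release()
```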
Further, Step 2 comprises the following sub-steps:
Step 201: pre-train the original TSN network to obtain a pre-trained TSN network model;
Step 202: remove the classification layer from the original TSN network and load the pre-trained TSN network model to obtain a TSN network based on the two-stream principle;
Step 203: feed the RGB images and optical flow images into the two-stream TSN network, and extract the corresponding RGB features and optical flow features from the output of its global pooling layer;
Step 204: train a Faster-RCNN object detector with the object annotations of the data set to obtain the object-detection-related features.
Further, the training of the two-stream TSN network in Step 202 uses an initial learning rate of 0.001 and trains for 160 epochs with a standard cross-entropy loss and stochastic gradient descent; after the 80th epoch, the learning rate is reduced by a factor of 10.
Further, the data set in Step 204 is the EGTEA Gaze+ data set.
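For illustration, the sketch below shows the idea behind Steps 202 and 203: remove the classification layer and keep the global-pooling output as the per-frame feature. The method uses a BN-Inception TSN backbone; here a torchvision ResNet-50 stands in purely to show the mechanism, and a recent torchvision (0.13 or later) is assumed for the weights API.

```python
# Hedged sketch: strip the classifier from a pretrained backbone and keep the
# global-pooling output as the per-frame feature (ResNet-50 as a stand-in backbone).
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to and including global average pooling,
        # drop the final fully connected classification layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    @torch.no_grad()
    def forward(self, frames):            # frames: (N, 3, H, W)
        feat = self.features(frames)      # (N, 2048, 1, 1)
        return feat.flatten(1)            # (N, 2048)

extractor = FeatureExtractor().eval()
rgb_feat = extractor(torch.randn(14, 3, 224, 224))   # 14 sampled frames -> (14, 2048)
```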
Further, Step 3 comprises the following sub-steps:
Step 301: build a multimodal LSTM network model based on a self-attention mechanism, comprising an encoder composed of a position encoding module and a self-attention module, together with a multi-layer independent LSTM network, wherein:
the position encoding module encodes the absolute and relative positions of the frames in the video to obtain a position-encoded feature sequence;
the self-attention module further mines the semantics of the position-encoded feature sequence to obtain a global description of the video;
Step 302: feed the RGB features, optical flow features and object-detection-related features obtained in Step 2 into the self-attention-based multimodal LSTM network model for training, and output the corresponding action class distribution tensors.
Further, the training in Step 302, in which the features obtained in Step 2 are fed into the self-attention-based multimodal LSTM network model, uses a learning rate of 0.005 and a momentum of 0.9, and trains for 100 epochs with a standard cross-entropy loss and stochastic gradient descent.
Further, the LSTM network has 2 layers.
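For illustration, the following is a minimal sketch of one modality branch of the structure described above: a trigonometric (sinusoidal) position encoding, a single self-attention layer, and a two-layer LSTM. The 1024-dimensional feature size, the number of attention heads, the hidden size and the class count are illustrative assumptions (the class count depends on the data set), and a recent PyTorch is assumed for batch_first attention.

```python
# Hedged sketch of one modality branch: sinusoidal position encoding,
# one self-attention layer, and a two-layer LSTM producing per-step class scores.
import math
import torch
import torch.nn as nn

NUM_CLASSES = 100                                   # placeholder; depends on the data set

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=64):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                           # x: (B, T, D)
        return x + self.pe[: x.size(1)]

class ModalityBranch(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=NUM_CLASSES, heads=8):
        super().__init__()
        self.pos = PositionalEncoding(feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, 1024, num_layers=2, batch_first=True)
        self.cls = nn.Linear(1024, num_classes)

    def forward(self, x):                           # x: (B, T, feat_dim)
        x = self.pos(x)                             # encode absolute/relative positions
        x, _ = self.attn(x, x, x)                   # self-attention over the sequence
        out, _ = self.lstm(x)
        return self.cls(out)                        # (B, T, num_classes): one prediction per step

branch = ModalityBranch()
scores = branch(torch.randn(4, 14, 1024))           # 14 observed steps -> per-step class scores
```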
Compared with the prior art, the present invention has the following advantages:
(1) The method considers three kinds of video features: RGB features model spatial information, optical flow features model temporal motion information, and object-detection-related features model which objects the person in the video interacts with. Since the feature sequence is very sensitive to position information, an independent position encoding module based on trigonometric functions encodes the absolute and relative positions of the frames in the video. The position-encoded feature sequence is then processed by the self-attention module, which further mines the semantics of the sequence to obtain a global description of the video. The output of the self-attention module serves as the input of the LSTM network, which can effectively carry historical information and produce predictions at different anticipation times; the output of the LSTM network is the distribution over action classes. To avoid overfitting, the feature extraction network and the prediction network are trained separately: the three extracted features are the input of the prediction network, which is trained with a cross-entropy loss. The trained model is evaluated on the test split of each data set. Compared with recent action prediction methods, the method exceeds them in accuracy and overcomes their poor performance at longer anticipation times.
(2) The self-attention mechanism used in the method originates from research in natural language processing, where it has proven effective on text, speech and similar data. The data in computer vision are mainly images and videos, so applying the self-attention mechanism to the action prediction task helps to narrow the gap between the two communities.
(3) The method shows that text sequences and video sequences are both temporal in nature, and this shared property is the basis for applying position encoding and self-attention encoding.
Description of Drawings
Fig. 1 is a structural diagram of the overall network model of the present invention;
Fig. 2 is a structural diagram of the multimodal LSTM network model of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Specific Embodiment
1. Video preprocessing and training data preparation
The method is evaluated on the EPIC-KITCHENS and EGTEA Gaze+ data sets. The original videos are decomposed at a frame rate of 30 fps to extract RGB images, and the TVL1 algorithm is then used to extract the corresponding optical flow images.
2. Feature extraction
The method extracts RGB and optical flow features with a TSN network based on the two-stream principle. The TSN network is first trained on an action recognition task to obtain a pre-trained model. The classification layer of the original TSN network is then removed, the pre-trained model is loaded, and the corresponding RGB and optical flow features are extracted from the output of the global pooling layer. The object-detection-related features are obtained by training a Faster-RCNN object detector with the object annotations of the data set; the detector output discards the bounding box coordinates and keeps only the object class information, because the algorithm only cares about which objects the person in the video interacts with, i.e., it models only the information useful for predicting the action class.
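For illustration, the sketch below turns per-frame detector output into an object feature that drops box coordinates and keeps only class information, as described above. Encoding the detections as a fixed-length vector of per-class maximum scores is an illustrative assumption; the exact encoding is not specified in the text.

```python
# Hedged sketch: build an object feature from Faster-RCNN detections of one frame,
# ignoring box coordinates and keeping only class information.
import numpy as np

def object_feature(detections, num_object_classes):
    """detections: list of (class_id, score, box) tuples for one frame."""
    feat = np.zeros(num_object_classes, dtype=np.float32)
    for class_id, score, _box in detections:       # the box is deliberately ignored
        feat[class_id] = max(feat[class_id], score)
    return feat                                     # (num_object_classes,)

# Example: two detections of class 3 and one of class 7 in a frame.
frame_dets = [(3, 0.91, (10, 20, 50, 80)), (3, 0.40, (5, 5, 30, 30)), (7, 0.77, (60, 60, 90, 90))]
print(object_feature(frame_dets, num_object_classes=10))
```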
3. Action prediction
The network model structure of the action prediction algorithm is shown in Fig. 1. The basic framework is an encoder, composed of the position encoding module and the self-attention module, followed by a two-layer independent LSTM network. The encoder further encodes the extracted feature sequence and captures its contextual information to obtain richer semantics. The LSTM network carries out the actual action prediction: it loads the previously observed video frames and produces action class distributions for different anticipation times. The input-output relationship of the LSTM is shown in Fig. 2: for a video clip, 14 frames are sampled backwards from before the action starts, at an interval of 0.25 s. The three kinds of features enter three sub-networks, each composed of an encoder and an LSTM network, and are trained separately. Finally, an attention fusion network composed of three fully connected layers assigns a weight to each sub-network; each weight is multiplied by the corresponding action class distribution tensor to produce the final output of the whole model.
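For illustration, the following is a minimal sketch of the modality attention fusion just described. It assumes, as suggested by the legend of Fig. 1, that the concatenated hidden and cell states of the three branches feed a small network of three fully connected layers whose softmax output weights the three class-distribution tensors; the layer widths and the state dimension are illustrative assumptions.

```python
# Hedged sketch of the modality attention fusion: three fully connected layers
# produce one weight per modality, and the final scores are the weighted sum
# of the three per-branch class-distribution tensors.
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, state_dim, num_modalities=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * num_modalities, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_modalities),      # one score per modality
        )

    def forward(self, states, branch_scores):
        # states: (B, num_modalities * state_dim), concatenated hidden/cell states
        # branch_scores: (B, num_modalities, num_classes), per-branch distributions
        w = torch.softmax(self.net(states), dim=-1)            # (B, num_modalities)
        return (w.unsqueeze(-1) * branch_scores).sum(dim=1)    # (B, num_classes)

fusion = ModalityAttentionFusion(state_dim=2048)    # 2048 = hidden + cell state, assumed
states = torch.randn(4, 3 * 2048)
branch_scores = torch.randn(4, 3, 100)
final_scores = fusion(states, branch_scores)        # (4, 100)
```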
4. Training strategy and related parameters
The two-stream TSN network is trained with 3 segments for 160 epochs, using a standard cross-entropy loss and stochastic gradient descent. The initial learning rate is 0.001 and is reduced by a factor of 10 after the 80th epoch. The experiments run on a single GEFORCE 1060 GPU. The Faster-RCNN object detector is trained on the EPIC-KITCHENS data set; since the EGTEA Gaze+ data set lacks bounding box annotations, no object-detection-related features are used on that data set, and the model there considers only RGB and optical flow features. The prediction network has three sub-networks, each also trained with a standard cross-entropy loss and stochastic gradient descent, with a fixed learning rate of 0.005 and a momentum of 0.9, for 100 epochs.
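For illustration, the sketch below wires up the prediction-network settings stated above (SGD, fixed learning rate 0.005, momentum 0.9, standard cross-entropy loss, 100 epochs). The two-layer LSTM classifier and the random tensors are stand-ins for the full branch and the real pre-extracted features, and supervising only the last observed step is a simplification of the multi-anticipation-time training.

```python
# Minimal training-loop sketch for one prediction sub-network with the stated settings.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_CLASSES = 100                                   # placeholder; depends on the data set

class TinyBranch(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=NUM_CLASSES):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 512, num_layers=2, batch_first=True)
        self.cls = nn.Linear(512, num_classes)

    def forward(self, x):                           # x: (B, T, feat_dim)
        out, _ = self.lstm(x)
        return self.cls(out[:, -1])                 # scores from the last observed step

model = TinyBranch()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Dummy data: 64 clips, 14 observed steps, 1024-d features, one action label each.
loader = DataLoader(TensorDataset(torch.randn(64, 14, 1024),
                                  torch.randint(0, NUM_CLASSES, (64,))),
                    batch_size=16, shuffle=True)

for epoch in range(100):
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()
        optimizer.step()
```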
5. Experimental results and analysis
Tables 1 and 2 show the results of the proposed method and other prediction algorithms on the EPIC-KITCHENS and EGTEA Gaze+ data sets; the evaluation metric is Top-5 accuracy. On the EGTEA Gaze+ data set, the proposed method outperforms the compared methods at all anticipation times. On the EPIC-KITCHENS data set, it exceeds the other compared algorithms at all anticipation times except 0.5 s and 0.25 s, where it is slightly below the RU algorithm. To further verify the effectiveness of the self-attention mechanism, Table 3 compares the predictions of the full model (B) and of the model with the encoder removed (A) on the three individual features. The results show that the proposed self-attention-based encoder effectively improves the model: it remedies the poor performance of other algorithms at long anticipation times, increases robustness, and also raises accuracy.
Table 1: Action anticipation results of the proposed method and other prediction algorithms on the EPIC-KITCHENS data set
Table 2: Action anticipation results of the proposed method and other prediction algorithms on the EGTEA Gaze+ data set
Table 3: Comparison of the prediction results of the full model (B) and the model with the encoder removed (A) on the three individual features, i.e., with and without the encoder on a single modality
In Fig. 1 of this embodiment, "Linear layer" denotes a linear layer, "Flow feature" the optical flow features, "RGB feature" the RGB features, "Obj feature" the object-detection-related features, "Multiplication" multiplication, "Anticipation output distribution" the predicted output distribution, "BN-Inception" the BN-Inception network structure, "Faster-RCNN" the Faster-RCNN object detector, "Position encoding" the position encoding module, "Sum" summation, "The concatenation of hidden and cell states" the concatenation of the hidden and cell states, "Self-attention" the self-attention module, "Rolling LSTM unit" a running LSTM network unit, "Unrolling LSTM unit" an LSTM network unit that has not yet run, "Multi-model LSTM" the multimodal LSTM network model, and "Modality ATTention fusion network" the modality attention fusion network;
In Fig. 2 of this embodiment, "Observation time" denotes the observation time, "Anticipation time" the anticipation time, "Time interval" the time interval, "Anticipate output" the predicted output, "Observed segment" the observed segment, "Action occurring" the occurrence of the action, and "Action starting time" the starting time of the action.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.