CN110188239A

Movatterモバイル変換

Info

Publication number: CN110188239A
Application number: CN201910294018.XA
Authority: CN
Inventors: 迟禄; 严慧; 田贵宇; 穆亚东; 陈刚; 王成成; 黄波; 韩峻; 糜俊青
Original assignee: Mid Star Technology Ltd By Share Ltd; Peking University; Nanjing University of Science and Technology
Current assignee: Zhongxing Micro Technology Co ltd; Peking University; Nanjing University of Science and Technology
Priority date: 2018-12-26
Filing date: 2019-04-12
Publication date: 2019-08-30
Anticipated expiration: 2039-04-12
Also published as: CN110188239B

Abstract

Translated fromChinese

本发明涉及一种基于跨模态注意力机制的双流视频分类方法和装置。与传统的双流方法不同，本发明在预测结果之前就将两个模态(甚至更多模态)的信息进行了融合，因此能够更加高效充分，同时，由于在较早阶段就进行了信息交互，单个分支在后阶段已经具有了另一分支的重要信息，单分支的精度已经与传统双流方法持平甚至超过，单分支的参数量要比传统双流方法少很多；相比非局部神经网络，本发明设计的注意力模块能够跨模态，而不仅仅只在单模态内部使用注意力机制，本发明提出的方法在两个模态相同的情况下其效果等价于非局部神经网络。

The invention relates to a dual-stream video classification method and device based on a cross-modal attention mechanism. Different from the traditional two-stream method, the present invention fuses the information of two modalities (or even more modalities) before predicting the result, so it can be more efficient and sufficient. At the same time, due to the information interaction at an earlier stage , a single branch already has the important information of another branch in the later stage, the accuracy of the single branch has been equal to or even surpassed the traditional two-stream method, and the parameters of the single branch are much less than the traditional two-stream method; compared with the non-local neural network, this The attention module designed by the invention can cross modalities, instead of only using the attention mechanism within a single modality. The method proposed by the invention is equivalent to a non-local neural network when the two modalities are the same.

Description

Translated fromChinese

一种基于跨模态注意力机制的双流视频分类方法和装置A two-stream video classification method and device based on cross-modal attention mechanism

技术领域technical field

本发明涉及一种视频分类的方法，尤其涉及一种使用注意力机制的双流视频分类方法和装置，属于计算机视觉领域。The invention relates to a video classification method, in particular to a dual-stream video classification method and device using an attention mechanism, belonging to the field of computer vision.

技术背景technical background

随着深度学习在图像领域的飞速发展，视频领域也逐渐引入深度学习方法并取得了一定成就。但现在的技术水平还远未达到理想效果，面临的问题主要有以下两方面：With the rapid development of deep learning in the image field, deep learning methods have gradually been introduced in the video field and have achieved certain achievements. However, the current technical level is far from reaching the desired effect, and the problems faced mainly include the following two aspects:

第一，当前的技术还未能充分利用动态信息。视频与图像所不同的地方在于，帧与帧之间的动态信息对视频来说是独特而且十分重要的。比如，即使对于人类来说，只看一帧图像是难以判断各种细分类的舞蹈(比如探戈和萨尔萨舞)，而如果加入了动作轨迹信息，那么这一任务将会变得容易很多。同样地，在一些体育运动的分类也是依赖于动作轨迹。First, current technology has yet to take full advantage of dynamic information. The difference between video and image is that the dynamic information between frames is unique and very important to video. For example, even for humans, it is difficult to judge various sub-categories of dances (such as tango and salsa) only by looking at one frame of images, and if the motion trajectory information is added, this task will become much easier. . Likewise, classification in some sports also relies on motion trajectories.

第二，当前的技术还很难迅速准确定位到关键物体上。注意力机制在自然语言处理中已经有广泛应用，但在视频分类中的研究还比较缺乏。通过注意力机制，神经网络能够过滤掉无关物体而更加关注关键物体。比如“舞剑”这一类别，如果检测到关键物体“剑”，那么分类就变得简单了。通常情况下，移动物体更能够吸引人类的目光，而该区域也往往蕴含着视频分类的关键信息，比如“做蛋糕”与“做披萨”两类，关键物体“蛋糕”或“披萨”正是位于移动的双手附近。Second, the current technology is difficult to quickly and accurately locate key objects. Attention mechanism has been widely used in natural language processing, but the research in video classification is still relatively lacking. Through the attention mechanism, the neural network can filter out irrelevant objects and pay more attention to key objects. For example, in the category of "dancing sword", if the key object "sword" is detected, then the classification becomes simple. Usually, moving objects are more attractive to human eyes, and this area often contains key information for video classification, such as "making cake" and "making pizza". The key object "cake" or "pizza" is exactly Located near the moving hands.

有很多现有技术在不断尝试解决上述两种问题。关于如何利用动态信息，当前的技术主要有两种：一种是设计与时间维度相关的神经网络结构，比如循环神经网络(RNN)、三维卷积神经网络(3D-Conv)等，通过数据驱动的方式来训练一个能够捕捉到帧与帧之间信息的网络结构；另一种是显式地利用动态信息，即先抽取光流，之后利用这些光流单独训练一个神经网络分支，与RGB分支的结果进行加权求和，这也就是目前使用比较广泛的双流视频分类技术。而关于如何捕捉关键线索，即将注意力机制引入到视频分类中去，其研究相对较少，比较有代表性的便是非局部神经网络(Non-local Neural Networks)，但该网络只能关注到单模态内部的重要信息，对于“移动物体”没有特殊的建模方式。There are many existing technologies that are constantly trying to solve the above two problems. Regarding how to use dynamic information, there are two main current technologies: one is to design a neural network structure related to the time dimension, such as recurrent neural network (RNN), three-dimensional convolutional neural network (3D-Conv), etc., through data-driven The way to train a network structure that can capture the information between frames; the other is to explicitly use dynamic information, that is, to extract the optical flow first, and then use these optical flows to train a neural network branch separately, and the RGB branch The results are weighted and summed, which is the currently widely used dual-stream video classification technology. As for how to capture key clues, that is, to introduce the attention mechanism into video classification, there are relatively few studies. The more representative one is the non-local neural network (Non-local Neural Networks), but the network can only focus on a single Important information inside the modal, there is no special way to model "moving objects".

发明内容Contents of the invention

本发明主要提出一种新颖的基于跨模态注意力机制的双流视频分类方法，能够高效地利用多模态信息进行视频分类，并且能关注到移动的物体上，使视频分类变得更加简单高效。本发明提出的技术具有通用性，可以广泛用于现有的视频分类问题甚至是其他多模态模型。The present invention mainly proposes a novel dual-stream video classification method based on a cross-modal attention mechanism, which can efficiently use multi-modal information for video classification, and can focus on moving objects, making video classification simpler and more efficient . The technique proposed in the present invention is general and can be widely used in existing video classification problems and even other multimodal models.

本发明具体要解决的技术问题包括：1.充分利用多模态信息进行视频分类；2.更加关注关键物体，使视频分类更加准确；3.使用更少的参数达到更高的精度。The specific technical problems to be solved in the present invention include: 1. Make full use of multi-modal information for video classification; 2. Pay more attention to key objects to make video classification more accurate; 3. Use fewer parameters to achieve higher precision.

与传统的双流方法不同，本发明在预测结果之前就将两个模态(甚至更多模态，比如提取的声音以及使用物体检测模型提取的中间特征图等)的信息进行了融合，因此能够更加高效充分，同时，由于在较早阶段就进行了信息交互，单个分支在后阶段已经具有了另一分支的重要信息，单分支的精度已经与传统双流方法持平甚至超过，单分支的参数量要比传统双流方法少很多；相比非局部神经网络，本发明设计的注意力模块能够跨模态，而不仅仅只在单模态内部使用注意力机制，本发明提出的方法在两个模态相同的情况下其效果等价于非局部神经网络。Different from the traditional two-stream method, the present invention fuses the information of two modalities (or even more modalities, such as the extracted sound and the intermediate feature map extracted using the object detection model, etc.) before predicting the result, so it can It is more efficient and sufficient. At the same time, due to the information interaction in the earlier stage, a single branch already has important information of the other branch in the later stage. The accuracy of the single branch has been equal to or even exceeded that of the traditional two-stream method. The parameter amount of the single branch It is much less than the traditional two-stream method; compared with the non-local neural network, the attention module designed by the present invention can be cross-modal, not just using the attention mechanism within the single-mode, the method proposed by the present invention can be used in two modes In the case of the same state, its effect is equivalent to that of a non-local neural network.

本发明的一种基于跨模态注意力机制的双流视频分类方法，包括以下步骤：A kind of double-stream video classification method based on cross-modal attention mechanism of the present invention, comprises the following steps:

1)建立RGB分支与光流分支的神经网络结构，其中包含跨模态注意力模块；1) Establish the neural network structure of the RGB branch and the optical flow branch, which includes a cross-modal attention module;

2)根据待分类视频得到RGB与光流，将其分别输入RGB分支与光流分支的神经网络结构中；2) Obtain RGB and optical flow according to the video to be classified, and input them into the neural network structure of RGB branch and optical flow branch respectively;

3)对于输入的RGB与光流，RGB分支与光流分支的神经网络结构通过跨模态注意力模块进行信息交互，实现跨模态的信息融合；3) For the input RGB and optical flow, the neural network structure of the RGB branch and the optical flow branch performs information interaction through the cross-modal attention module to achieve cross-modal information fusion;

4)根据RGB分支与光流分支的神经网络结构得到的信息融合后的结果进行视频分类。4) Carry out video classification according to the result of information fusion obtained by the neural network structure of the RGB branch and the optical flow branch.

本方法设计的跨模态注意力机制(跨模态注意力模块)主要由三部分组成：关键值(Key)、查询(Query)和数值(Value)。其中，关键值是指所有信息的索引，查询是指查询信息的索引，数值是指所有信息。跨模态的注意力机制可以描述为：从当前模态中生成查询，从另一个模态中生成关键值-数值匹配对，根据查询与关键值的相似程度来从另一个模态中获取重要信息。因此，跨模态的注意力机制实际上是一种根据当前模态从另一模态中选择性获取信息的过程，获取的信息往往是在当前模态中比较微弱甚至是缺失的但对最终结果是非常重要的。The cross-modal attention mechanism (cross-modal attention module) designed by this method is mainly composed of three parts: key value (Key), query (Query) and value (Value). Among them, the key value refers to the index of all information, the query refers to the index of query information, and the value refers to all information. The cross-modal attention mechanism can be described as: generate a query from the current modality, generate key-value matching pairs from another modality, and obtain important values from another modality according to how similar the query is to the key value information. Therefore, the cross-modal attention mechanism is actually a process of selectively obtaining information from another modality according to the current modality. Results are very important.

图1是一个跨模态注意力机制的实现例子。X和Y表示输入，分别来自RGB分支和光流分支。Q(查询)、K(字典)和V(数值)是X或Y通过1x1卷积生成，其形状大小在图中已经标注，在矩阵相乘之前进行矩阵转置、变形等操作使其可以进行矩阵乘法运算。Q与K相乘，得到M，表示每一个像素在整张特征图上的注意力权重分布，得到该分布M后，将M与V相乘，即从V中有选择性地获取信息Z，得到Z后进行非线性变换(比如采用ReLU等激活函数)，将变换后的结果与原始输入进行残差连接，得到最终输出结果(Output)。Figure 1 is an implementation example of a cross-modal attention mechanism. X and Y denote the input, from the RGB branch and the optical flow branch, respectively. Q (query), K (dictionary) and V (value) are generated by X or Y through 1x1 convolution. Matrix multiplication operation. Q is multiplied by K to obtain M, which represents the attention weight distribution of each pixel on the entire feature map. After obtaining the distribution M, multiply M by V, that is, selectively obtain information Z from V, After obtaining Z, perform nonlinear transformation (such as using activation functions such as ReLU), and perform residual connection between the transformed result and the original input to obtain the final output result (Output).

通过这种操作方式，在当前的网络阶段，RGB分支能够遍历光流分支的所有位置。因此，RGB分支能够综合光流分支的所有信息进而能够有选择性地选取重要信息与当前信息进行融合，而不仅仅是在最后阶段进行加权求和(普通双流方法)。同时，由于该操作的输入与输出的形状是完全一致的而且能够处理任意形状的输入，该操作具有很好的兼容性，能够插入到几乎所有网络的任何阶段，这样也能充分利用多尺度信息。为了进一步提升其兼容性，为其加入了残差连接，即在上述操作的结果上与操作前的输入直接相加，这样在理论上就能保证加入跨模态注意力模块的模型不会比原始模型精度低。In this way of operation, in the current network stage, the RGB branch can traverse all the positions of the optical flow branch. Therefore, the RGB branch can synthesize all the information of the optical flow branch and can selectively select important information to fuse with the current information, instead of just performing weighted summation in the final stage (ordinary two-stream method). At the same time, since the shape of the input and output of the operation is exactly the same and can handle input of any shape, the operation has good compatibility and can be inserted into any stage of almost all networks, which can also make full use of multi-scale information. . In order to further improve its compatibility, a residual connection is added to it, that is, the result of the above operation is directly added to the input before the operation, so that in theory it can be guaranteed that the model added to the cross-modal attention module will not compare to The original model has low accuracy.

图2是本发明设计的网络结构，该网络结构基于双流模型，加入了跨模态注意力模块。该模型的两个分支分别是RGB分支与Flow分支，分别负责处理图像的表面特征与动态信息。其具体流程如下：Fig. 2 is a network structure designed by the present invention, which is based on a two-stream model and a cross-modal attention module is added. The two branches of the model are the RGB branch and the Flow branch, which are respectively responsible for processing the surface features and dynamic information of the image. The specific process is as follows:

Step1：初始网络参数。该网络的参数是以在ImageNet数据集上预训练的模型为初始值，之后再在kinetics数据集上进行训练，直到收敛。Step1: Initial network parameters. The parameters of the network are based on the pre-trained model on the ImageNet dataset as the initial value, and then trained on the kinetics dataset until convergence.

Step2：数据处理。该网络的输入需要RGB与光流两种输入，对于RGB，是直接从原始视频中截取帧，之后缩放到指定分辨率(224x224)；对于光流，是采用相邻两帧的RGB图像通过OpenCV中GPU版本的TVL1光流算法抽取的，连续若干帧(如连续五帧)光流堆叠在一起作为Flow分支的输入，其分辨率与RGB的分辨率一致；Step2: Data processing. The input of the network requires both RGB and optical flow inputs. For RGB, the frame is directly intercepted from the original video, and then scaled to the specified resolution (224x224); for optical flow, the RGB image of two adjacent frames is passed through OpenCV Extracted by the TVL1 optical flow algorithm of the GPU version in China, several consecutive frames (such as five consecutive frames) of optical flow are stacked together as the input of the Flow branch, and its resolution is consistent with that of RGB;

Step3：得到数据后，将RGB与光流分别输入到两个分支中，在运行过程中，两个分支通过跨模态注意力模块(即图1中所示的结构，在图2中用CMA₁～CMA_n来表示)进行信息交互，从而达到多层次充分利用多模态信息的作用。Step3: After getting the data, input RGB and optical flow into two branches respectively. During the operation, the two branches pass through the cross-modal attention module (that is, the structure shown in Figure 1, and CMA is used in Figure 2₁ ~ CMA_n ) for information interaction, so as to achieve the function of making full use of multi-modal information at multiple levels.

Step4：两个分支最终会得到两个结果，两个结果也可以像普通双流方法一样进行加权求和。由于该模型能够在较早的阶段进行信息融合，因此，只采用RGB分支的结果进行视频分类也能达到甚至超过普通双流模型中两个分支同时预测的精度，而此时光流分支并不需要进行后面的运算(在图2中用虚线表示)，也节省了大量参数，因此该模型是十分高效的。另外，如果同时采用该模型两个分支结果进行预测精度会进一步提升。Step4: The two branches will eventually get two results, and the two results can also be weighted and summed like the ordinary two-stream method. Since the model can perform information fusion at an earlier stage, only using the results of the RGB branch for video classification can reach or even exceed the accuracy of the simultaneous prediction of the two branches in the ordinary dual-stream model, while the optical flow branch does not need to be performed at this time. The following operations (indicated by dashed lines in Figure 2) also save a lot of parameters, so the model is very efficient. In addition, if the results of the two branches of the model are used at the same time, the prediction accuracy will be further improved.

基于同一发明构思，本发明还提供一种与上面方法对应的基于跨模态注意力机制的双流视频分类装置，其包括：Based on the same inventive concept, the present invention also provides a dual-stream video classification device based on a cross-modal attention mechanism corresponding to the above method, which includes:

网络构建模块，负责建立RGB分支与光流分支的神经网络结构，其中包含跨模态注意力模块；The network building module is responsible for establishing the neural network structure of the RGB branch and the optical flow branch, which includes a cross-modal attention module;

数据处理模块，负责根据待分类视频得到RGB与光流，将其分别输入RGB分支与光流分支的神经网络结构中；The data processing module is responsible for obtaining RGB and optical flow according to the video to be classified, and inputting them into the neural network structure of the RGB branch and the optical flow branch respectively;

信息融合模块，负责对于输入的RGB与光流，RGB分支与光流分支的神经网络结构通过跨模态注意力模块进行信息交互，实现跨模态的信息融合；The information fusion module is responsible for the input RGB and optical flow, the neural network structure of the RGB branch and the optical flow branch conducts information interaction through the cross-modal attention module, and realizes cross-modal information fusion;

视频分类模块，负责根据RGB分支与光流分支的神经网络结构得到的信息融合后的结果进行视频分类。The video classification module is responsible for video classification according to the result of information fusion obtained by the neural network structure of the RGB branch and the optical flow branch.

与现有技术相比，本发明的优点：Compared with prior art, the advantages of the present invention:

(1)增加了多模态的信息交互，能够多层次充分利用多模态信息；(1) Multi-modal information interaction is added to make full use of multi-modal information at multiple levels;

(2)通过跨模态注意力机制使两个模态能够有选择性地选择对方的信息，从而达到高效利用互补信息，更为准确地捕捉到关键物体；(2) Through the cross-modal attention mechanism, the two modalities can selectively select each other's information, so as to achieve efficient use of complementary information and capture key objects more accurately;

(3)能够用更少的参数量达到甚至超过传统双流方法的精度，同时也可以综合两个分支的结果进行预测，其分类精度会进一步提升；(3) It can achieve or even exceed the accuracy of the traditional two-stream method with fewer parameters, and can also synthesize the results of the two branches for prediction, and its classification accuracy will be further improved;

(4)本发明设计的跨模态注意力模块具有很好的兼容性，与大多数现有技术均不冲突，几乎可以插入到任何现有的网络架构中去，能够稳定提升视频分类精度。(4) The cross-modal attention module designed by the present invention has good compatibility, does not conflict with most existing technologies, can be inserted into almost any existing network architecture, and can stably improve video classification accuracy.

附图说明Description of drawings

图1为跨模态注意力模块示例图；Figure 1 is an example diagram of a cross-modal attention module;

图2为本发明提出的视频分类网络结构图。FIG. 2 is a structural diagram of a video classification network proposed by the present invention.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面通过具体实施例和附图，对本发明做进一步详细说明。In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

1、跨模态注意力模块的配置1. Configuration of cross-modal attention module

跨模态注意力模块能够处理任意维度的输入，且能够保证输入与输出的形状一致，因此具有极好的兼容性。以2维的配置为例，Q、K、V分别通过1x1的2维卷积操作得到(对于3维模型，这里的卷积是1x1x1的3维卷积操作)，为了降低计算复杂度以及节省GPU空间，上述卷积操作在获得Q、K、V的同时在通道维度进行了降维操作。为了进一步简化运算，可以在卷积操作之前先经过一次最大化池化(max-pooling)操作，在空间维度降低到原来的1/4。得到Z后，需要再经过一次卷积操作将维度升到与输入维度一致，之后再经过一次BatchNormalization(BN，批标准化)，将BN的参数全都初始化为零，这样就使该模块的初始状态对之前网络的运算结果无任何影响。The cross-modal attention module can handle input of any dimension, and can ensure that the shape of the input and output is consistent, so it has excellent compatibility. Taking the 2-dimensional configuration as an example, Q, K, and V are respectively obtained by 1x1 2-dimensional convolution operation (for 3-dimensional models, the convolution here is 1x1x1 3-dimensional convolution operation), in order to reduce computational complexity and save In GPU space, the above convolution operation performs dimensionality reduction in the channel dimension while obtaining Q, K, and V. In order to further simplify the operation, a max-pooling operation can be performed before the convolution operation, and the spatial dimension is reduced to 1/4 of the original. After getting Z, it needs to go through another convolution operation to increase the dimension to be consistent with the input dimension, and then go through BatchNormalization (BN, batch normalization) to initialize all the parameters of BN to zero, so that the initial state of the module is right The results of previous network operations are not affected in any way.

2、网络的配置2. Network configuration

两个分支均以ResNet-50为基础网络。在节省GPU显存的情况下为了尽可能利用多尺度、更精确的空间信息，在res3和res4两个阶段均匀插入了5个(也可以是其它数目)跨模态注意力模块，这一配置与非局部神经网络也是一致的。RGB分支只输入一帧图像，而光流分支输入的是连续五帧光流图像。RGB分支的权重是直接以ImageNet pretrained的参数为初始化，而光流分支，由于其输入的形状与在ImageNet上训练模型的输入形状不同，需要进行适当改动，这里将第一层卷积网络的卷积核参数在通道维度上取平均，之后将该平均值复制五次得到五通道的卷积核，其它层的参数可以直接复制过来，这样就能很好地将在ImageNet上训练的参数迁移过来。Both branches use ResNet-50 as the base network. In order to use multi-scale and more accurate spatial information as much as possible while saving GPU memory, 5 (or other numbers) cross-modal attention modules are evenly inserted in the res3 and res4 stages. This configuration is the same as Non-local neural networks are also consistent. The RGB branch only inputs one frame of image, while the optical flow branch inputs five consecutive frames of optical flow images. The weight of the RGB branch is directly initialized with the parameters of ImageNet pretrained, and the optical flow branch, because its input shape is different from the input shape of the training model on ImageNet, needs to be modified appropriately. Here, the volume of the first layer of convolutional network The kernel parameters are averaged in the channel dimension, and then the average value is copied five times to obtain a five-channel convolution kernel. The parameters of other layers can be copied directly, so that the parameters trained on ImageNet can be well transferred. .

网络是基于时序分段网络框架(Temporal Segment Networks，TSN)，因为该框架能够简单高效地建模长序列关系。将整个视频平均分成m段，每一段随机选取一帧作为网络的输入，这样便会得到m个结果，而最终的视频预测结果是基于这m个结果的平均值。The network is based on the Temporal Segment Networks (TSN) framework because it can model long sequence relationships simply and efficiently. Divide the entire video into m segments on average, and each segment randomly selects a frame as the input of the network, so that m results will be obtained, and the final video prediction result is based on the average of these m results.

3、数据处理3. Data processing

原始视频数据分辨率并不完全一致。将其统一缩放至256x256分辨率大小。使用OpenCV中的GPU版本TVL1算法抽取光流，将其结果在[-20,20]处截断，然后缩放到[-1,1]之间。还进行了数据增强，比如随机裁剪、缩放、镜像处理等，需要注意的是，两个分支对于同一输入的数据增强是一致的，比如，如果对RGB图像的左上角进行了裁剪，那么对光流的左上角同一位置也进行裁剪。在时间维度上，RGB图像对应连续五帧光流的第一帧光流。Raw video data resolutions are not exactly the same. Scale it uniformly to a 256x256 resolution size. Use the GPU version TVL1 algorithm in OpenCV to decimate the optical flow, truncate the result at [-20,20], and then scale it to [-1,1]. Data enhancement is also performed, such as random cropping, scaling, mirroring, etc. It should be noted that the two branches are consistent with the data enhancement of the same input. For example, if the upper left corner of the RGB image is cropped, then the light The upper left corner of the stream is also clipped at the same position. In the time dimension, the RGB image corresponds to the first frame of optical flow in five consecutive frames.

4、网络的训练4. Network training

由于通常情况下光流分支的收敛速度比RGB分支的收敛速度慢，因此，首先在kinetics数据集上训练了光流分支，这样也有有助于光流分支为RGB分支提供更准确的信息。之后，开始迭代训练，即RGB分支与光流分支交替优化。在训练RGB分支的过程中，冻结光流分支的所有参数，包括光流分支中的跨模态注意力模块，只更新RGB分支的参数，训练光流分支时反之。每次迭代的训练次数不超过30个epoch。实践发现，对于刚训练完的分支，其精度往往要高于另一个分支，因此对于两个分支结果的加权权重，给更高精度分支赋予更高权重(5:1)，这样会得到较高的精度。通常情况下，迭代一次便能达到很高的精度。Since the convergence speed of the optical flow branch is usually slower than that of the RGB branch, the optical flow branch is first trained on the kinetics dataset, which also helps the optical flow branch to provide more accurate information for the RGB branch. After that, iterative training starts, that is, the RGB branch and the optical flow branch are alternately optimized. In the process of training the RGB branch, freeze all the parameters of the optical flow branch, including the cross-modal attention module in the optical flow branch, and only update the parameters of the RGB branch, and vice versa when training the optical flow branch. The number of training for each iteration does not exceed 30 epochs. Practice has found that for the branch that has just been trained, its accuracy is often higher than that of the other branch, so for the weighted weight of the results of the two branches, assign a higher weight (5:1) to the higher-precision branch, which will get a higher accuracy. Typically, very high accuracy can be achieved with a single iteration.

训练过程中，采用了标准的交叉熵损失函数以及随机梯度下降优化方法。batch大小为128，训练过程中也同时更新BN参数，为了获得更准确的BN统计量，采用了同步BN。学习率初始化为0.01，当训练准确率到达一个平稳阶段时，学习率降低为当前学习率的十分之一。为了防止过拟合，加入了dropout＝0.7与weight decay＝0.0005.训练过程中的K设为3.During training, the standard cross-entropy loss function and stochastic gradient descent optimization method are used. The batch size is 128, and the BN parameters are also updated during the training process. In order to obtain more accurate BN statistics, synchronous BN is used. The learning rate is initialized to 0.01. When the training accuracy reaches a stable stage, the learning rate is reduced to one-tenth of the current learning rate. In order to prevent overfitting, dropout=0.7 and weight decay=0.0005 are added. K during training is set to 3.

测试过程中，裁剪出图像的四个角以及中心位置并进行反转，这样便得到了10个样本，将这10个样本输入到网络中便会得到10个结果，对这10个结果进行平均便是最终的视频分类结果。TSN中的K设为25。During the test, the four corners and the center of the image are cut out and reversed, so that 10 samples are obtained, and these 10 samples are input into the network to obtain 10 results, and the 10 results are averaged is the final video classification result. K in TSN is set to 25.

5、迁移学习5. Transfer Learning

本发明提出的网络结构是基于kinetics进行训练的，kinetics具有400类，其他数据集，比如ucf101，只有101类，且这101类与kinetics 400类中有重叠部分也有非重叠部分。为了将该模型迁移到新的视频分类上，只需要在新的数据集上对最后一层全连接层进行微调即可，这样便能达到很好的效果。对于其他数据集也可以采用类似方法，该模型具有较好的迁移能力。The network structure proposed by the present invention is trained based on kinetics. Kinetics has 400 categories, and other data sets, such as ucf101, have only 101 categories, and these 101 categories have overlapping parts and non-overlapping parts with the 400 categories of kinetics. In order to transfer the model to a new video classification, it is only necessary to fine-tune the last fully connected layer on the new data set, so that good results can be achieved. Similar methods can be adopted for other datasets, and the model has good transferability.

ResNet50是不加跨模态注意力模块的模型，CMA-ResNet50是在此基础上加了该模块的模型。-R表示RGB分支，-S表示RGB分支与光流分支的融合。下面表1是以ResNet50为主干网络的实验：ResNet50 is a model without a cross-modal attention module, and CMA-ResNet50 is a model with this module added on this basis. -R indicates the RGB branch, and -S indicates the fusion of the RGB branch and the optical flow branch. Table 1 below is an experiment with ResNet50 as the backbone network:

表1.以ResNet50为主干网络的实验结果Table 1. Experimental results with ResNet50 as the backbone network

模型Model准确率(％)Accuracy(%)ResNet50-RResNet50-R67.7367.73ResNet50-SResNet50-S71.2171.21CMA-ResNet50-RCMA-ResNet50-R72.1772.17CMA-ResNet50-SCMA-ResNet50-S72.6272.62

P3D是一个3维卷积神经网络模型，CMA-P3D是在此基础上加了3维跨模态注意力模块的模型。下面表2是以P3D为主干网络的实验：P3D is a 3D convolutional neural network model, and CMA-P3D is a model with a 3D cross-modal attention module added to it. Table 2 below is an experiment with P3D as the backbone network:

表2.以P3D为主干网络的实验结果Table 2. Experimental results with P3D as the backbone network

模型Model准确率(％)Accuracy(%)P3D-RP3D-R71.5071.50P3D-SP3D-S74.6274.62CMA-P3D-RCMA-P3D-R74.8674.86CMA-P3D-SCMA-P3D-S75.9875.98

从上面两张表中可以看出，无论是二维还是三维的跨模态注意力模块，均能稳定提升精度，而且加了该模块的RGB分支比对比实验中的双流模型(ResNet50-S/P3D-S)精度都要高，加了该模块的双流模型精度会进一步提升。It can be seen from the above two tables that whether it is a two-dimensional or three-dimensional cross-modal attention module, it can stably improve the accuracy, and the dual-stream model (ResNet50-S/P3D- S) The accuracy is high, and the accuracy of the dual-stream model with this module will be further improved.

本发明另一实施例提供一种基于跨模态注意力机制的双流视频分类装置，其包括：Another embodiment of the present invention provides a dual-stream video classification device based on a cross-modal attention mechanism, which includes:

本发明不局限于ResNet-50神经网络，可以应用于各种神经网络(比如VGG、DenseNet、SENet等)，也可以应用于3D神经网络(如I3D、P3D等)。同时，跨模态注意力模块内部的操作也不局限于上文中所描述的实现方式，比如生成关键值/查询/数值的过程中，可以用更复杂的操作来代替1x1卷积操作(比如叠加多层卷积操作)，在得到Z后也可以进行更加复杂的操作(同样可以采用更多层的卷积操作)。对于与主分支的合并方式，上文采用的是残差连接，也可以采用其他方式，比如与主分支的特征进行拼接等。The present invention is not limited to the ResNet-50 neural network, and can be applied to various neural networks (such as VGG, DenseNet, SENet, etc.), and can also be applied to 3D neural networks (such as I3D, P3D, etc.). At the same time, the internal operations of the cross-modal attention module are not limited to the implementation methods described above. For example, in the process of generating key values/queries/values, more complex operations can be used instead of 1x1 convolution operations (such as superposition Multi-layer convolution operation), and more complex operations can be performed after obtaining Z (more convolution operations can also be used). For the method of merging with the main branch, the residual connection is used above, and other methods can also be used, such as splicing with the features of the main branch.

以上实施例仅用以说明本发明的技术方案而非对其进行限制，本领域的普通技术人员可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明的原理和范围，本发明的保护范围应以权利要求书所述为准。The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Those skilled in the art can modify or equivalently replace the technical solution of the present invention without departing from the principle and scope of the present invention. The scope of protection should be determined by the claims.

Claims

Translated fromChinese

1.一种基于跨模态注意力机制的双流视频分类方法，其特征在于，包括以下步骤：1. A double-stream video classification method based on a cross-modal attention mechanism, characterized in that, comprising the following steps:

2.根据权利要求1所述的方法，其特征在于，所述跨模态注意力模块包含关键值、查询和数值，从当前模态中生成查询，从另一个模态中生成关键值-数值匹配对，根据查询与关键值的相似程度来从另一个模态中获取重要信息。2. The method according to claim 1, wherein the cross-modal attention module includes key values, queries and values, generates queries from the current modality, and generates key value-values from another modality Matching pairs to fetch important information from another modal based on how similar the query is to a key-value.

3.根据权利要求2所述的方法，其特征在于，所述跨模态注意力模块为：X和Y表示输入，分别来自RGB分支和光流分支，查询Q、字典K和数值V是X或Y通过1x1卷积生成；Q与K相乘得到M，表示每一个像素在整张特征图上的注意力权重分布；将M与V相乘，即从V中有选择性地获取信息Z，得到Z后进行非线性变换，将变换后的结果与原始输入进行残差连接，得到最终结果。3. The method according to claim 2, wherein the cross-modal attention module is: X and Y represent input, respectively from the RGB branch and the optical flow branch, and the query Q, the dictionary K and the value V are X or Y is generated by 1x1 convolution; Q and K are multiplied to obtain M, which represents the attention weight distribution of each pixel on the entire feature map; M and V are multiplied to selectively obtain information Z from V, After obtaining Z, a nonlinear transformation is performed, and the transformed result is residually connected with the original input to obtain the final result.

4.根据权利要求3所述的方法，其特征在于，所述跨模态注意力模块在通过卷积操作获得Q、K、V的同时在通道维度进行降维操作，以降低计算复杂度以及节省GPU空间；在卷积操作之前先经过一次最大化池化操作以简化运算，得到Z后再经过一次卷积操作将维度升到与输入维度一致，之后再经过一次BN，将BN的参数全都初始化为零。4. The method according to claim 3, wherein the cross-modal attention module performs a dimensionality reduction operation in the channel dimension while obtaining Q, K, and V through a convolution operation, to reduce computational complexity and Save GPU space; before the convolution operation, a maximum pooling operation is performed to simplify the operation. After obtaining Z, a convolution operation is performed to increase the dimension to be consistent with the input dimension, and then a BN is performed to convert all BN parameters. Initialized to zero.

5.根据权利要求1所述的方法，其特征在于，所述根据待分类视频得到RGB与光流，包括：5. The method according to claim 1, wherein said obtaining RGB and optical flow according to the video to be classified comprises:

a)对于RGB，直接从原始的待分类视频中截取帧，之后缩放到指定分辨率，作为RGB分支的神经网络结构的输入；a) For RGB, the frame is directly intercepted from the original video to be classified, and then scaled to the specified resolution as the input of the neural network structure of the RGB branch;

b)对于光流，采用相邻两帧的RGB图像通过光流算法抽取，连续若干帧光流堆叠在一起作为光流分支的神经网络结构的输入，其分辨率与RGB的分辨率一致。b) For optical flow, the RGB images of two adjacent frames are extracted through the optical flow algorithm, and several consecutive frames of optical flow are stacked together as the input of the neural network structure of the optical flow branch, and its resolution is consistent with that of RGB.

6.根据权利要求1所述的方法，其特征在于，步骤4)采用以下方式之一进行视频分类：6. The method according to claim 1, characterized in that, step 4) adopts one of the following methods for video classification:

a)只采用RGB分支的结果进行视频分类；a) Only use the results of the RGB branch for video classification;

b)通过将两个分支得到的两个结果进行加权求和来进行视频分类。b) Video classification is performed by weighted summation of the two results obtained by the two branches.

7.根据权利要求1所述的方法，其特征在于，所述RGB分支与光流分支的神经网络结构均以ResNet-50为基础网络，在res3和res4两个阶段均匀插入若干个跨模态注意力模块；采用时序分段网络框架，将整个视频平均分成m段，每一段随机选取一帧作为网络的输入，这样便会得到m个结果，而最终的视频预测结果是基于这m个结果的平均值。7. The method according to claim 1, wherein the neural network structures of the RGB branch and the optical flow branch are all based on ResNet-50, and several cross-modal Attention module: using the time series segmentation network framework, the entire video is divided into m segments on average, and each segment randomly selects a frame as the input of the network, so that m results will be obtained, and the final video prediction result is based on these m results average of.

8.根据权利要求1所述的方法，其特征在于，所述RGB分支与光流分支的神经网络结构的训练过程包括：首先训练光流分支，之后开始迭代训练即RGB分支与光流分支交替优化；在训练RGB分支的过程中冻结光流分支的所有参数，包括光流分支中的跨模态注意力模块，只更新RGB分支的参数，训练光流分支时反之；对于两个分支结果的加权权重，给更高精度的分支赋予更高权重；训练过程中采用标准的交叉熵损失函数以及随机梯度下降优化方法。8. The method according to claim 1, wherein the training process of the neural network structure of the RGB branch and the optical flow branch comprises: first training the optical flow branch, and then starting iterative training, that is, the RGB branch and the optical flow branch alternately Optimization; freeze all parameters of the optical flow branch in the process of training the RGB branch, including the cross-modal attention module in the optical flow branch, only update the parameters of the RGB branch, and vice versa when training the optical flow branch; for the results of the two branches Weighted weights, giving higher weights to higher-precision branches; standard cross-entropy loss functions and stochastic gradient descent optimization methods are used in the training process.

9.根据权利要求1所述的方法，其特征在于，所述RGB分支与光流分支的神经网络结构通过对最后一层全连接层进行微调而迁移到新的数据集上，实现迁移学习。9. The method according to claim 1, wherein the neural network structure of the RGB branch and the optical flow branch is migrated to a new data set by fine-tuning the last fully connected layer to realize transfer learning.

10.一种基于跨模态注意力机制的双流视频分类装置，其特征在于，包括：10. A dual-stream video classification device based on a cross-modal attention mechanism, comprising: