CN112149504A - A hybrid convolutional residual network combined with attention for action video recognition

A hybrid convolutional residual network combined with attention for action video recognition

Info

Publication number
CN112149504A
Authority
CN
China
Prior art keywords
layer
attention
convolution
feature map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010849991.6A
Other languages
Chinese (zh)
Other versions
CN112149504B (en)
Inventor
杨慧敏 (Yang Huimin)
田秋红 (Tian Qiuhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chaowei Imaging Technology Co ltd
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010849991.6A
Publication of CN112149504A
Application granted
Publication of CN112149504B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese

The invention discloses an action video recognition method combining a mixed-convolution residual network with attention. The method comprises: 1) reading the actions of people in an action video and converting the action video into original video frame images; 2) performing data augmentation on the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images; 3) constructing an attention module, using the attention module to build mixed convolution blocks, cascading the mixed convolution blocks to build a mixed-convolution residual network model that combines the mixed-convolution residual network with attention, and using the model to perform spatiotemporal feature learning on the video frame images to obtain key feature maps; 4) classifying the key feature maps with a Softmax classification layer. While extending the network depth, the invention retains the feature information of the video frames, fully fuses spatiotemporal features, improves the correlation of important channel features, and effectively improves the prediction performance of action recognition.

Figure 202010849991

Description

Translated from Chinese
A hybrid convolutional residual network combined with attention for action video recognition

Technical Field

The invention relates to an action video recognition method in the technical field of intelligent video analysis, and in particular to an action video recognition method based on a hybrid-convolution residual network combined with an attention mechanism.

Background

Action recognition has application value in video processing, pattern recognition, virtual reality, and related areas, and is one of the important research topics in computer vision. Action recognition in videos is a key problem in video understanding tasks. It needs not only to capture features in the spatial dimension but also to encode the temporal relationships between multiple consecutive frames. Therefore, effectively extracting high-resolution spatiotemporal features from action videos is of great significance for improving the accuracy of action recognition. However, a video is a sequence of consecutive frames with temporal relationships; each pixel is highly similar to its neighboring pixels, and the spatiotemporal correlation is very strong. Traditional convolutional neural networks have excellent feature-extraction performance on single images but cannot extract spatiotemporal features from videos.

When the video input consists of consecutive images, there are currently three main approaches: (1) 2D CNNs combined with RNN/LSTM, (2) two-stream CNNs, and (3) 3D CNNs. Two-stream CNNs use two independent networks to capture spatial features and temporal motion information. Although this approach works well, it cannot effectively fuse appearance and motion information because the two networks are trained separately. RNN/LSTM handles sequence information better and is therefore often combined with a CNN for action recognition; however, such methods only preserve the high-level features of the top layer and ignore the correlations in low-level features. Using a 3D CNN to capture spatiotemporal information is effective, but 3D CNN models have a huge number of parameters and contain a large amount of redundant spatial data, so training them is very challenging. In recent years, many studies have tried to introduce attention mechanisms from different perspectives to enhance the robustness of action recognition; however, stacking attention in deep networks leads to repeated dot products, which degrades the value of the features.

Summary of the Invention

To solve the problems in the background art, the purpose of the present invention is to provide an action recognition method for action videos that combines a hybrid-convolution residual network with an attention mechanism. An MC-RAN module is designed which, based on a hybrid-convolution residual network, fuses the 2D and 1D convolutions obtained by decoupling a 3D convolution with an adapted spatial attention module M_SS and channel attention module M_CS, respectively, improving the correlation of important channel features and increasing the global correlation of the feature maps so as to improve action recognition performance.

The technical scheme adopted by the present invention is as follows:

The present invention comprises the following steps:

1) Read the actions of the person in the action video, then convert the action video into original video frame images;

2) Perform data augmentation on the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images;

Step 2) is specifically as follows:

Temporal sampling: for each action video, 16 consecutive frames are randomly sampled for training; if fewer than 16 consecutive frames are available, the action video is played in a loop until the number of consecutive frames reaches 16;

Random cropping: the original video frame images are resized to 128×171 pixels and then randomly cropped to 112×112 pixels;

Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
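A minimal NumPy/OpenCV sketch of the three augmentation steps above, given only as an illustration: the function names are invented for this sketch, and the brightness range is an assumption, since the description only states that the brightness is adjusted randomly.

import random
import cv2
import numpy as np

def temporal_sample(frames, clip_len=16):
    # Randomly sample clip_len consecutive frames; loop the video if it is shorter than clip_len.
    while len(frames) < clip_len:
        frames = frames + frames
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]

def random_crop(frames, resize_hw=(128, 171), crop_hw=(112, 112)):
    # Resize every frame to 128x171 and apply one random 112x112 crop to the whole clip.
    resized = [cv2.resize(f, (resize_hw[1], resize_hw[0])) for f in frames]
    top = random.randint(0, resize_hw[0] - crop_hw[0])
    left = random.randint(0, resize_hw[1] - crop_hw[1])
    return [f[top:top + crop_hw[0], left:left + crop_hw[1]] for f in resized]

def random_brightness(frames, max_delta=32.0):
    # Shift the brightness of the whole clip by one random offset (the range is an assumption).
    delta = random.uniform(-max_delta, max_delta)
    return [np.clip(f.astype(np.float32) + delta, 0, 255).astype(np.uint8) for f in frames]

def augment_clip(frames):
    return random_brightness(random_crop(temporal_sample(frames)))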

3) Construct an attention module, use the attention module to build mixed convolution blocks, cascade the mixed convolution blocks to build a mixed-convolution residual network model that combines a hybrid-convolution residual network with attention, and use the model to perform spatiotemporal feature learning on the video frame images to obtain key feature maps;

The mixed convolution block is expressed as:

X_{t+1} = X_t + W(X_t)

where X_t and X_{t+1} denote the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W denotes the mixed-convolution residual function with the attention mechanism added;

Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolutional layer and four mixed convolution blocks. A mixed convolution block comprises an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer, where the (2+1)D convolutional layer is formed by adding the attention module to a 2D convolutional layer. The input X_t of the mixed convolution block is fed into the MC-RAN module; the feature map output by the MC-RAN module is added to the input X_t through the addition layer, and the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block. Each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling;

The i-th 3D convolutional layer of size N_{i-1}×t×d×d is decomposed into M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:

M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]

where d denotes the spatial (width/height) size parameter of the 3D convolutional layer, t denotes the temporal length, and [ ] denotes rounding down (floor).
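A small Python helper illustrating the computation of M_i. The closed-form expression is the usual parameter-matching rule for decomposing a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution; it is reconstructed here because the original formula is given only as an image, so treat it as an assumption.

import math

def mixed_conv_mid_channels(n_in, n_out, t, d):
    # Intermediate channel count M_i for splitting an (n_in x t x d x d) 3D convolution into
    # M_i spatial (1 x d x d) filters followed by n_out temporal (t x 1 x 1) filters,
    # chosen so that the parameter count roughly matches the original 3D convolution.
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

# Example: decomposing a 3x3x3 convolution from 64 to 128 channels
print(mixed_conv_mid_channels(64, 128, t=3, d=3))  # -> 230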

The (2+1)D convolutional layer mainly consists of the first 2D convolutional layer, the spatial attention module M_SS, the temporal convolutional layer, and the channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module;

The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolutional layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;

The spatial attention module M_SS is constructed as follows: given an input feature map F of size C×H×W, where C denotes the number of channels of each frame in the input feature map and H and W denote the width and height of each frame, first, global average pooling is used to compress the channels of each frame in the input feature map, producing a 2D spatial descriptor Z of size 1×H×W; then the third 2D convolutional layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the region of interest, yielding the spatial attention weight map W_SS;

The spatial attention weight map W_SS can be expressed as:

W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))

where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7×7 kernel, AvgPool() denotes global average pooling, and F denotes the input feature map;
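A minimal PyTorch-style sketch of the spatial attention module M_SS as described above: channel-wise average pooling, a 7×7 convolution, a sigmoid, and batch normalization. It operates frame by frame on a 5D video tensor, and applying the resulting weight map by element-wise multiplication is an assumption; the class name and layout are illustrative rather than the patent's exact implementation.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Spatial attention M_SS: W_SS(F) = BN(sigmoid(conv7x7(channel-wise AvgPool(F))))
    def __init__(self):
        super().__init__()
        # A (1, 7, 7) kernel applies an independent 7x7 spatial convolution to every frame.
        self.conv = nn.Conv3d(1, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))
        self.bn = nn.BatchNorm3d(1)

    def forward(self, x):                    # x: (N, C, T, H, W)
        z = x.mean(dim=1, keepdim=True)      # compress channels -> 2D spatial descriptor per frame
        w = torch.sigmoid(self.conv(z))      # response over the region of interest
        return self.bn(w)                    # spatial weight map W_SS, shape (N, 1, T, H, W)

# Usage (assumed): feat = feat * SpatialAttention()(feat)   # broadcast over the channel dimension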

The channel attention module M_CS is constructed as follows: given an input feature map Q of size C×H×W, where C denotes the number of channels of each frame in the input feature map, first, a global average pooling operation is applied to Q, producing a channel vector Q' of size 1×1×C; then a multilayer perceptron processes the channel vector Q' to learn its weights;

The channel vector Q' can be calculated by the following formula:

Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)

where F(i,j) denotes the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension;

Finally, a fourth batch normalization layer is added after the multilayer perceptron for dimension conversion, yielding the channel attention weight map W_CS;

The channel attention weight map W_CS can be expressed as:

W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))

where MLP() denotes a multilayer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r×C and C×C/r respectively, r is the compression ratio, δ() is the rectified linear unit (ReLU), and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
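A corresponding PyTorch-style sketch of the channel attention module M_CS: global average pooling, a two-layer perceptron with compression ratio r, a sigmoid, and batch normalization. The value r=16 follows the embodiment described later; the class name and the element-wise application of the weights are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel attention M_CS: W_CS(F) = BN(sigmoid(W1 * ReLU(W0 * GAP(F) + b0) + b1))
    def __init__(self, channels, r=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)          # global average pooling -> channel vector Q'
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),     # W0, b0 (hidden layer of size C/r)
            nn.ReLU(inplace=True),                  # delta()
            nn.Linear(channels // r, channels),     # W1, b1
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                           # x: (N, C, T, H, W)
        n, c = x.shape[:2]
        q = self.gap(x).view(n, c)                  # Q': (N, C)
        w = self.bn(torch.sigmoid(self.mlp(q)))     # learned channel weights, normalized
        return w.view(n, c, 1, 1, 1)                # channel weight map W_CS, broadcastable over T, H, W

# Usage (assumed): feat = feat * ChannelAttention(feat.shape[1])(feat)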

4) Classify the key feature maps using a Softmax classification layer.

Step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; the key feature maps are fed into the Softmax layer for classification.
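A minimal sketch of this classification step. The patent only states that the key feature maps are fed into a Softmax layer; the global average pooling, dropout, and fully connected layer shown here are assumptions consistent with the training settings described in the embodiment (dropout 0.5).

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Pool the key feature maps and classify them with a fully connected layer plus Softmax.
    def __init__(self, channels, num_classes, dropout=0.5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)     # collapse the T, H, W dimensions
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat):                    # feat: (N, C, T, H, W)
        x = self.pool(feat).flatten(1)          # (N, C)
        logits = self.fc(self.dropout(x))
        return torch.softmax(logits, dim=1)     # class probabilities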

The input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolutional layer; the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after the 3D max-pooling layer.

Beneficial effects of the present invention:

1) The present invention designs the MC-RAN module which, based on a hybrid-convolution residual network, fuses the 2D and 1D convolutions obtained by decoupling a 3D convolution with an adapted spatial attention module and channel attention module, respectively. This fully fuses spatiotemporal features, improves the correlation of important channel features, and increases the global correlation of the feature maps, thereby improving action recognition performance.

2) The mixed-convolution residual network model proposed by the present invention can retain feature information while increasing the network depth. Comparative experiments were conducted on the public datasets UCF101 and HMDB51; after pre-training on the Kinetics dataset, the Top-1 accuracy on the UCF101 and HMDB51 test sets reaches 96.8% and 74.8%, respectively.

Brief Description of the Drawings

Figure 1 shows examples from part of the dataset used in an embodiment of the present invention;

Figure 2 is the module design diagram of an embodiment of the present invention;

Figure 3 shows the structure of the spatial attention module in an embodiment of the present invention;

Figure 4 shows the structure of the channel attention module in an embodiment of the present invention;

Figure 5 is the cascade diagram of the mixed convolution blocks in an embodiment of the present invention;

Figure 6 shows feature maps from an embodiment of the present invention; (a), (b), (c), and (d) are original video frames; (e), (f), (g), and (h) are the corresponding feature maps.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and embodiments.

The present invention provides an action video recognition method that combines a hybrid-convolution residual network with attention, using the open-source UCF101 dataset as the experimental dataset; examples from the dataset are shown in Figure 1. The figure shows the video frame images converted from part of one of the action videos; the images are saved in .jpg format, and the final image size is 320×240.

An embodiment of the present invention is as follows:

Step 1: Use the VideoCapture function in OpenCV to read the action video and convert it into video frame images; video frame images of some action videos are shown in Figure 1.
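A short OpenCV sketch of step 1: reading the video with VideoCapture and saving the frames as .jpg images at the 320×240 size mentioned above. The output path and file naming are illustrative.

import cv2

def video_to_frames(video_path, out_dir):
    # Read an action video with OpenCV's VideoCapture and save each frame as a .jpg image.
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                  # end of the video
            break
        frame = cv2.resize(frame, (320, 240))      # final image size used in the embodiment
        cv2.imwrite(f"{out_dir}/frame_{idx:05d}.jpg", frame)
        idx += 1
    cap.release()
    return idx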

Step 2: The present invention first performs data preprocessing for the action recognition model and then pre-trains on the Kinetics dataset rather than training the model from scratch, in order to improve the model's accuracy.

2.1) The data preprocessing for action recognition is as follows:

Data augmentation is applied to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images;

Temporal sampling: for each action video, 16 consecutive frames are randomly sampled for training; if fewer than 16 consecutive frames are available, the action video is played in a loop until the number of consecutive frames reaches 16;

Random cropping: the original video frame images are resized to 128×171 pixels and then randomly cropped to 112×112 pixels;

Brightness adjustment: the brightness of the original video frame images is adjusted randomly.

2.2) The model pre-training process for action recognition is as follows:

The preprocessed video frame images are fed into the mixed-convolution residual network model for feature extraction in the spatial and channel dimensions. The input of the model has shape 16×112×112×3 per sample, and the output is the class label. Stochastic gradient descent (SGD) is used to optimize the loss value, with the initial learning rate set to 0.01; when the validation loss saturates, the learning rate is divided by 10. The momentum coefficient is 0.9, the dropout rate is 0.5, the weight decay is 10e-3, and batch normalization is used to accelerate model training. Training is performed on a server with 8 Tesla V100 GPUs, with a batch_size of 8 per GPU and a total batch_size of 64.
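A minimal PyTorch sketch of the optimizer and learning-rate schedule described above. Only the learning rate, momentum, weight decay, and the divide-by-10 rule come from the text; the ReduceLROnPlateau patience value and the helper names in the commented loop are assumptions.

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def build_optimizer(model):
    # SGD with the reported hyperparameters: lr 0.01, momentum 0.9, weight decay 10e-3;
    # the learning rate is divided by 10 when the validation loss saturates.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=10e-3)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)
    return optimizer, scheduler

# Training loop outline (per epoch), with hypothetical helpers:
#   train_one_epoch(model, optimizer)   # batch size 8 per GPU, 8 GPUs, 64 in total
#   val_loss = evaluate(model)
#   scheduler.step(val_loss)            # lr /= 10 when the validation loss plateaus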

Step 3: Construct the attention module. The attention module uses an attention mechanism to focus on the locations indicated by prior knowledge, removes the interference of background and noise on action recognition, and automatically assigns different amounts of attention to different locations of the input feature map according to the prior knowledge;

The attention module is used to build mixed convolution blocks, and the mixed convolution blocks are cascaded to build a mixed-convolution residual network model that combines a hybrid-convolution residual network with attention; the model performs spatiotemporal feature learning on the video frame images to obtain key feature maps;

The mixed convolution block is expressed as:

X_{t+1} = X_t + W(X_t)

where X_t and X_{t+1} denote the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W denotes the mixed-convolution residual function with the attention mechanism added.

Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolutional layer and four mixed convolution blocks; a mixed convolution block comprises an MC-RAN module and an addition layer. The MC-RAN module comprises, connected in sequence, a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer. The input X_t of the mixed convolution block is fed into the MC-RAN module; the feature map output by the MC-RAN module is added to the input X_t through the addition layer, and the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block. Each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling.
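A minimal PyTorch-style sketch of one mixed convolution block (the MC-RAN module followed by the residual addition, the second ReLU, and the cascaded 3D max-pooling layer), reusing the SpatialAttention and ChannelAttention classes from the sketches above. The kernel sizes are illustrative, the input and output channel counts are kept equal so that the residual addition is valid, the downsampling strides of conv3_1 to conv5_1 are omitted, and applying the attention maps by multiplication is an assumption; mid_channels plays the role of M_i.

import torch
import torch.nn as nn

class MCRANBlock(nn.Module):
    # Mixed convolution block: (2+1)D conv with attention -> BN -> ReLU -> 3D conv -> BN,
    # then residual addition with the block input, ReLU, and 3D max pooling.
    def __init__(self, channels, mid_channels):
        super().__init__()
        # (2+1)D convolution: 2D spatial conv + spatial attention + 1D temporal conv + channel attention
        self.spatial_conv = nn.Conv3d(channels, mid_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_att = SpatialAttention()
        self.temporal_conv = nn.Conv3d(mid_channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.channel_att = ChannelAttention(channels)
        self.bn1 = nn.BatchNorm3d(channels)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu2 = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):                              # x: (N, C, T, H, W)
        out = self.spatial_conv(x)
        out = out * self.spatial_att(out)              # weight the features in the spatial dimension
        out = self.temporal_conv(out)
        out = out * self.channel_att(out)              # weight the features in the channel dimension
        out = self.relu1(self.bn1(out))
        out = self.bn2(self.conv3d(out))
        out = self.relu2(out + x)                      # residual addition: X_{t+1} = X_t + W(X_t)
        return self.pool(out)                          # 3D max pooling for downsampling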

a. The i-th 3D convolutional layer of size N_{i-1}×t×d×d is decomposed into M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:

M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]

where d denotes the spatial (width/height) size parameter of the 3D convolutional layer, t denotes the temporal length, and [ ] denotes rounding down (floor);

b. Spatial downsampling is performed at the first convolutional layer conv1 with a stride of 1×2×2. For the third mixed convolution block conv3_1, the fourth mixed convolution block conv4_1, and the fifth mixed convolution block conv5_1, spatiotemporal downsampling is performed on the first 2D convolutional layer and the temporal convolutional layer of the (2+1)D convolution, with strides of 1×2×2 and 2×1×1, respectively. Table 1 shows the network structure of the first convolutional layer and the mixed convolution blocks.

Table 1 shows the network layer structure of the first convolutional layer and the mixed convolution blocks.

Figure BDA0002644396770000062 (Table 1: network layer structure of the first convolutional layer and the mixed convolution blocks, provided as an image in the original document)

c. The cascade diagram of the mixed convolution blocks is shown in Figure 5. The (2+1)D convolutional layer is formed by adding the attention module to a 2D convolutional layer; it mainly consists of the first 2D convolutional layer, the spatial attention module M_SS, the temporal convolutional layer, and the channel attention module M_CS connected in cascade. The attention module applies attention to the input feature map in the spatial and channel dimensions respectively, and is composed of the spatial attention module M_SS and the channel attention module M_CS.

The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution kernel; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;

The spatial attention module M_SS is constructed as follows: given an input feature map F of size C×H×W, where C denotes the number of channels of each frame in the input feature map and H and W denote the width and height of each frame, first, global average pooling is used to compress the channels of each frame in the input feature map, producing a 2D spatial descriptor Z of size 1×H×W; the element of Z at coordinate (i,j) is computed as:

Z(i,j) = (1/C) · Σ_{k=1}^{C} F_{i,j}(k)

where F_{i,j}(k) denotes the value of the k-th channel of the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension. Then a third 2D convolutional layer with a 7×7 kernel convolves the 2D spatial descriptor to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the region of interest, yielding the spatial attention weight map W_SS.

The spatial attention weight map W_SS can be expressed as:

W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))

where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7×7 kernel, AvgPool() denotes global average pooling, and F denotes the input feature map.

The channel attention module M_CS is constructed as follows: given an input feature map Q of size H×W×C, where C denotes the number of channels of each frame in the input feature map, first, a global average pooling operation is applied to Q, producing a channel vector Q' of size 1×1×C; then a multilayer perceptron FC with a hidden layer processes the channel vector Q' to learn its weights, and the weights are used as the correlations. To limit the complexity of the channel attention module and save parameters, the size of the hidden activation layer is set to 1×1×C/r, where r is the compression ratio and is set to 16.

The channel vector Q' can be calculated by the following formula:

Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)

where F(i,j) denotes the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension;

Finally, a fourth batch normalization layer is added after the multilayer perceptron for dimension conversion, yielding the channel attention weight map W_CS.

The channel attention weight map W_CS can be expressed as:

W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))

where MLP denotes a multilayer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r×C and C×C/r respectively, σ() is the sigmoid activation function, δ() is the rectified linear unit (ReLU), and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.

Step 4: After the video frame images pass through the first convolutional layer and the four mixed convolution blocks, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; feature-map visualizations after adding the attention module are shown in Figure 6. The key feature maps are fed into the Softmax layer for classification. The trained network is used to evaluate each video in the validation set and obtain the corresponding class label. After training, the proposed mixed-convolution residual network model is compared with different network models; the experimental results are shown in Table 2. The results show that, without increasing the number of parameters, the mixed-convolution residual network model improves both the Top-1 and Top-5 recognition accuracy.

Table 2 compares the recognition results of the mixed-convolution residual network model with other models.

Network model | Parameters | Top-1 recognition rate (%) | Top-5 recognition rate (%) | Average recognition rate (%)
ResNet [39] | 63.72M | 60.1 | 81.9 | 71.0
(2+1)D-ResNet [12] | 63.88M | 66.8 | 88.1 | 77.45
MC-ResNet [28] | 63.88M | 67.3 | 89.2 | 78.25
RAN [26] | 63.97M | 61.7 | 83.2 | 72.45
(2+1)D-RAN | 63.98M | 67.8 | 89.3 | 78.55
MC-RAN | 63.98M | 68.8 | 89.9 | 79.35

The above specific embodiments are intended to explain the present invention rather than to limit it; any modification or change made to the present invention within the spirit of the present invention and the protection scope of the claims falls within the protection scope of the present invention.

Claims (6)

1. A motion video identification method combining a hybrid-convolution residual network and attention, characterized in that the method comprises the following steps:
1) reading the motions of a person in the motion video, and converting the motion video into original video frame images;
2) performing data augmentation on the video frames of the motion video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain video frame images;
3) constructing an attention module, constructing mixed convolution blocks by using the attention module, constructing a mixed-convolution residual network model combining a hybrid-convolution residual network with attention by cascading the mixed convolution blocks, and performing spatiotemporal feature learning on the video frame images by using the mixed-convolution residual network model to obtain key feature maps;
the mixed convolution block being expressed as:
X_{t+1} = X_t + W(X_t)
wherein X_t and X_{t+1} represent the input and output of the t-th MC-RAN module, X_t and X_{t+1} have the same feature dimensions, and W represents the mixed-convolution residual function to which an attention mechanism is added;
4) classifying the key feature maps by using a Softmax classification layer.
2. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that step 2) is specifically as follows:
temporal sampling: for each motion video, randomly sampling 16 consecutive frames of the motion video for training; if fewer than 16 consecutive frames are available, playing the motion video in a loop until the number of consecutive frames reaches 16;
random cropping: resizing the original video frame images to 128×171 pixels, and then randomly cropping the original video frame images to 112×112 pixels;
brightness adjustment: randomly adjusting the brightness of the original video frame images.
3. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that:
step 3) is specifically as follows: a 3D ResNet network structure is selected as the basic network structure, the original 3D convolution module in the 3D ResNet network structure is replaced by a first convolutional layer and four mixed convolution blocks, and each mixed convolution block comprises an MC-RAN module and an addition layer; the MC-RAN module comprises a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer, wherein the (2+1)D convolutional layer is formed by adding an attention module to a 2D convolutional layer; the input X_t of the mixed convolution block is input into the MC-RAN module, the feature map output by the MC-RAN module and the input X_t are added through the addition layer, the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block, and each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling;
the i-th 3D convolutional layer of size N_{i-1}×t×d×d consists of M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:
M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]
wherein d represents the spatial (width/height) size parameter of the 3D convolutional layer, t represents the temporal length, and [ ] represents rounding down (floor).
4. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 3, characterized in that:
the (2+1)D convolutional layer mainly consists of a first 2D convolutional layer, a spatial attention module M_SS, a temporal convolutional layer, and a channel attention module M_CS connected in cascade, and the spatial attention module M_SS and the channel attention module M_CS form the attention module;
the spatial attention module M_SS obtains a spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolutional layer; the channel attention module M_CS obtains a channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;
the spatial attention module M_SS is constructed as follows: when the size of the input feature map F is C×H×W, C represents the number of channels of each frame in the input feature map, and H and W represent the width and height of each frame in the input feature map; first, global average pooling is used to compress the channels of each frame in the input feature map to generate a 2D spatial descriptor Z of size 1×H×W; then, a third 2D convolutional layer is used to convolve the 2D spatial descriptor Z to obtain a target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the target region of interest to obtain the spatial attention weight map W_SS;
the spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))
wherein BN() represents batch normalization, σ() represents the sigmoid activation function, f^{7×7}() represents a convolution operation with a 7×7 kernel, AvgPool() represents global average pooling, and F represents the input feature map;
the channel attention module M_CS is constructed as follows: when the size of the input feature map Q is C×H×W and C represents the number of channels of each frame in the input feature map, first, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1×1×C; subsequently, the channel vector Q' is processed by using a multilayer perceptron to learn the weights of the channel vector Q';
the channel vector Q' can be calculated by the following formula:
Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)
wherein F(i,j) represents the feature map at coordinates (i,j), i represents a pixel in the H dimension, and j represents a pixel in the W dimension;
finally, a fourth batch normalization layer is added after the multilayer perceptron to perform dimension conversion to obtain the channel attention weight map W_CS;
the channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))
wherein MLP() represents a multilayer perceptron with a hidden layer, W_0 and W_1 are the weights of MLP() with sizes C/r×C and C×C/r respectively, r is the compression ratio, δ() is the rectified linear unit, and b_0 and b_1 represent the bias terms of MLP() with sizes C/r and C respectively.
5. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; the key feature maps are input into a Softmax layer for classification.
6. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that: the input feature map of the first MC-RAN module is the output feature map of the video frame images from step 2) after passing through the first convolutional layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max-pooling layer.
CN202010849991.6A | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention | Active | CN112149504B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010849991.6A (CN112149504B) | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010849991.6A (CN112149504B) | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention

Publications (2)

Publication Number | Publication Date
CN112149504A | 2020-12-29
CN112149504B (en) | 2024-03-26

Family

ID=73889023

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010849991.6A (Active, CN112149504B) | Motion video identification method combining mixed convolution residual network and attention | 2020-08-21 | 2020-08-21

Country Status (1)

Country | Link
CN | CN112149504B (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766172A (en)*2021-01-212021-05-07北京师范大学Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en)*2021-01-282021-05-14内蒙古科技大学Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en)*2021-01-292021-05-18山东大学Video behavior identification method and system based on channel attention guide time modeling
CN112883264A (en)*2021-02-092021-06-01联想(北京)有限公司Recommendation method and device
CN113128395A (en)*2021-04-162021-07-16重庆邮电大学Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113139530A (en)*2021-06-212021-07-20城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113160117A (en)*2021-02-042021-07-23成都信息工程大学Three-dimensional point cloud target detection method under automatic driving scene
CN113283338A (en)*2021-05-252021-08-20湖南大学Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162A (en)*2021-06-032021-08-24北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113343760A (en)*2021-04-292021-09-03暖屋信息科技(苏州)有限公司Human behavior recognition method based on multi-scale characteristic neural network
CN113468531A (en)*2021-07-152021-10-01杭州电子科技大学Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113673559A (en)*2021-07-142021-11-19南京邮电大学 A spatiotemporal feature extraction method of video characters based on residual network
CN113837263A (en)*2021-09-182021-12-24浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850135A (en)*2021-08-242021-12-28中国船舶重工集团公司第七0九研究所 A method and system for dynamic gesture recognition based on time shift framework
CN113850182A (en)*2021-09-232021-12-28浙江理工大学Action identification method based on DAMR-3 DNet
CN114037930A (en)*2021-10-182022-02-11苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114140654A (en)*2022-01-272022-03-04苏州浪潮智能科技有限公司 Image action recognition method, device and electronic device
CN114155469A (en)*2021-12-072022-03-08湖南科技大学Deep video frame rate up-conversion detection device based on double-current convolutional neural network
CN114373476A (en)*2022-01-112022-04-19江西师范大学 A sound scene classification method based on multi-scale residual attention network
CN114724252A (en)*2022-04-242022-07-08中国计量大学 A Video Action Recognition Method Based on Improved MobileNet
CN114758265A (en)*2022-03-082022-07-15深圳集智数字科技有限公司Escalator operation state identification method and device, electronic equipment and storage medium
CN114783053A (en)*2022-03-242022-07-22武汉工程大学Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en)*2022-05-312022-08-02中国矿业大学Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114882889A (en)*2022-04-132022-08-09厦门快商通科技股份有限公司Speaker recognition model training method, device, equipment and readable medium
CN114937225A (en)*2022-05-302022-08-23杭州电子科技大学Behavior identification method based on channel-time characteristics
CN115035605A (en)*2022-08-102022-09-09广东履安实业有限公司 Action recognition method, device, device and storage medium based on deep learning
CN115049969A (en)*2022-08-152022-09-13山东百盟信息技术有限公司Poor video detection method for improving YOLOv3 and BiConvLSTM
CN115331140A (en)*2022-07-292022-11-11南京邮电大学Channel grouping-based space-time feature separation and extraction method in action recognition
CN115527275A (en)*2022-10-312022-12-27浙江理工大学 Behavior Recognition Method Based on P2CS_3DNet
CN116304984A (en)*2023-03-142023-06-23烟台大学 Multimodal intent recognition method and system based on contrastive learning
CN116385964A (en)*2023-03-242023-07-04杭州电子科技大学Video crowd counting method based on combination of attention and spatial transformation network
CN116416479A (en)*2023-06-062023-07-11江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN116502044A (en)*2023-04-272023-07-28电子科技大学Radio frequency fingerprint identification method based on residual error network model
CN119919879A (en)*2024-12-302025-05-02深圳市检验检疫科学研究院 A convolutional-long short-term memory neural network object detection and recognition method
CN120451964A (en)*2025-04-272025-08-08中国农业科学院果树研究所Wild pear variety identification system and method based on leaf image deep learning
CN120451964B (en)*2025-04-272025-10-10中国农业科学院果树研究所 Wild pear variety recognition system and method based on leaf image deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109886225A (en)*2019-02-272019-06-14浙江理工大学 An online detection and recognition method of image gesture action based on deep learning
CN109886090A (en)*2019-01-072019-06-14北京大学 A Video Pedestrian Re-identification Method Based on Multi-temporal Convolutional Neural Networks
CN110110646A (en)*2019-04-302019-08-09浙江理工大学A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en)*2019-06-032019-09-17浙江理工大学 A Key Frame Extraction Method of Gesture Image Based on Image Similarity
CN110457524A (en)*2019-07-122019-11-15北京奇艺世纪科技有限公司Model generating method, video classification methods and device
CN110807808A (en)*2019-10-142020-02-18浙江理工大学 A Item Recognition Method Based on Physics Engine and Deep Fully Convolutional Network
CN111091045A (en)*2019-10-252020-05-01重庆邮电大学 A Sign Language Recognition Method Based on Spatio-temporal Attention Mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109886090A (en)*2019-01-072019-06-14北京大学 A Video Pedestrian Re-identification Method Based on Multi-temporal Convolutional Neural Networks
CN109886225A (en)*2019-02-272019-06-14浙江理工大学 An online detection and recognition method of image gesture action based on deep learning
CN110110646A (en)*2019-04-302019-08-09浙江理工大学A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en)*2019-06-032019-09-17浙江理工大学 A Key Frame Extraction Method of Gesture Image Based on Image Similarity
CN110457524A (en)*2019-07-122019-11-15北京奇艺世纪科技有限公司Model generating method, video classification methods and device
CN110807808A (en)*2019-10-142020-02-18浙江理工大学 A Item Recognition Method Based on Physics Engine and Deep Fully Convolutional Network
CN111091045A (en)*2019-10-252020-05-01重庆邮电大学 A Sign Language Recognition Method Based on Spatio-temporal Attention Mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
包嘉欣; 田秋红; 杨慧敏: "Sign language recognition based on skin color segmentation and an improved VGG network", 计算机系统应用 (Computer Systems & Applications), no. 06*
王晨浩: "Research on multi-granularity lip-reading recognition technology", CNKI*
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism", 电子技术与软件工程 (Electronic Technology & Software Engineering), no. 04*

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766172B (en)*2021-01-212024-02-02北京师范大学Facial continuous expression recognition method based on time sequence attention mechanism
CN112766172A (en)*2021-01-212021-05-07北京师范大学Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en)*2021-01-282021-05-14内蒙古科技大学Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en)*2021-01-292021-05-18山东大学Video behavior identification method and system based on channel attention guide time modeling
CN113160117A (en)*2021-02-042021-07-23成都信息工程大学Three-dimensional point cloud target detection method under automatic driving scene
CN112883264A (en)*2021-02-092021-06-01联想(北京)有限公司Recommendation method and device
CN112883264B (en)*2021-02-092025-04-25联想(北京)有限公司 A recommendation method and device
CN113128395B (en)*2021-04-162022-05-20重庆邮电大学Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113128395A (en)*2021-04-162021-07-16重庆邮电大学Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113343760A (en)*2021-04-292021-09-03暖屋信息科技(苏州)有限公司Human behavior recognition method based on multi-scale characteristic neural network
CN113283338A (en)*2021-05-252021-08-20湖南大学Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162B (en)*2021-06-032022-06-28北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113288162A (en)*2021-06-032021-08-24北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113139530A (en)*2021-06-212021-07-20城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en)*2021-06-212021-09-03城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113673559A (en)*2021-07-142021-11-19南京邮电大学 A spatiotemporal feature extraction method of video characters based on residual network
CN113673559B (en)*2021-07-142023-08-25南京邮电大学 A Spatiotemporal Feature Extraction Method of Video Characters Based on Residual Network
CN113468531A (en)*2021-07-152021-10-01杭州电子科技大学Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113850135A (en)*2021-08-242021-12-28中国船舶重工集团公司第七0九研究所 A method and system for dynamic gesture recognition based on time shift framework
CN113837263A (en)*2021-09-182021-12-24浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113837263B (en)*2021-09-182023-09-26浙江理工大学Gesture image classification method based on feature fusion attention module and feature selection
CN113850182A (en)*2021-09-232021-12-28浙江理工大学Action identification method based on DAMR-3 DNet
CN113850182B (en)*2021-09-232024-08-09浙江理工大学 Action recognition method based on DAMR_3DNet
CN114037930A (en)*2021-10-182022-02-11苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114037930B (en)*2021-10-182022-07-12苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114155469A (en)*2021-12-072022-03-08湖南科技大学Deep video frame rate up-conversion detection device based on double-current convolutional neural network
CN114155469B (en)*2021-12-072024-09-06湖南科技大学Depth video frame rate up-conversion detection device based on double-flow convolutional neural network
CN114373476A (en)*2022-01-112022-04-19江西师范大学 A sound scene classification method based on multi-scale residual attention network
CN114140654A (en)*2022-01-272022-03-04苏州浪潮智能科技有限公司 Image action recognition method, device and electronic device
CN114758265A (en)*2022-03-082022-07-15深圳集智数字科技有限公司Escalator operation state identification method and device, electronic equipment and storage medium
CN114758265B (en)*2022-03-082024-11-19深圳须弥云图空间科技有限公司 Escalator operation status identification method, device, electronic equipment and storage medium
CN114783053A (en)*2022-03-242022-07-22武汉工程大学Behavior identification method and system based on space attention and grouping convolution
CN114882889A (en)*2022-04-132022-08-09厦门快商通科技股份有限公司Speaker recognition model training method, device, equipment and readable medium
CN114724252A (en)*2022-04-242022-07-08中国计量大学 A Video Action Recognition Method Based on Improved MobileNet
CN114937225A (en)*2022-05-302022-08-23杭州电子科技大学Behavior identification method based on channel-time characteristics
CN114842542B (en)*2022-05-312023-06-13中国矿业大学 Facial Action Unit Recognition Method and Device Based on Adaptive Attention and Spatiotemporal Correlation
CN114842542A (en)*2022-05-312022-08-02中国矿业大学Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115331140A (en)*2022-07-292022-11-11南京邮电大学Channel grouping-based space-time feature separation and extraction method in action recognition
CN115035605B (en)*2022-08-102023-04-07广东履安实业有限公司 Action recognition method, device, equipment and storage medium based on deep learning
CN115035605A (en)*2022-08-102022-09-09广东履安实业有限公司 Action recognition method, device, device and storage medium based on deep learning
CN115049969A (en)*2022-08-152022-09-13山东百盟信息技术有限公司Poor video detection method for improving YOLOv3 and BiConvLSTM
CN115527275A (en)*2022-10-312022-12-27浙江理工大学 Behavior Recognition Method Based on P2CS_3DNet
CN116304984A (en)*2023-03-142023-06-23烟台大学 Multimodal intent recognition method and system based on contrastive learning
CN116385964A (en)*2023-03-242023-07-04杭州电子科技大学Video crowd counting method based on combination of attention and spatial transformation network
CN116385964B (en)*2023-03-242025-07-08杭州电子科技大学 A video crowd counting method based on the combination of attention and spatial transformer network
CN116502044A (en)*2023-04-272023-07-28电子科技大学Radio frequency fingerprint identification method based on residual error network model
CN116416479B (en)*2023-06-062023-08-29江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN116416479A (en)*2023-06-062023-07-11江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN119919879A (en)*2024-12-302025-05-02深圳市检验检疫科学研究院 A convolutional-long short-term memory neural network object detection and recognition method
CN120451964A (en)*2025-04-272025-08-08中国农业科学院果树研究所Wild pear variety identification system and method based on leaf image deep learning
CN120451964B (en)*2025-04-272025-10-10中国农业科学院果树研究所 Wild pear variety recognition system and method based on leaf image deep learning

Also Published As

Publication number | Publication date
CN112149504B (en) | 2024-03-26

Similar Documents

Publication | Publication Date | Title
CN112149504B (en)Motion video identification method combining mixed convolution residual network and attention
CN112446476B (en) Neural network model compression method, device, storage medium and chip
CN108509978B (en)Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN114596520B (en) A method and device for first-person video action recognition
Kim et al.Fully deep blind image quality predictor
CN112329658A (en)Method for improving detection algorithm of YOLOV3 network
CN113011329A (en)Pyramid network based on multi-scale features and dense crowd counting method
CN113255616B (en)Video behavior identification method based on deep learning
CN114973049B (en)Lightweight video classification method with unified convolution and self-attention
CN113269787A (en)Remote sensing image semantic segmentation method based on gating fusion
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN117456330A (en)MSFAF-Net-based low-illumination target detection method
CN112131959A (en)2D human body posture estimation method based on multi-scale feature reinforcement
CN113850182B (en) Action recognition method based on DAMR_3DNet
Hongmeng et al.A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN115205966A (en) A spatiotemporal Transformer action recognition method for sign language recognition
CN116630387A (en) Monocular Image Depth Estimation Method Based on Attention Mechanism
CN114519383A (en)Image target detection method and system
CN114972851A (en)Remote sensing image-based ship target intelligent detection method
Ma et al.Convolutional transformer network for fine-grained action recognition
TWI836972B (en)Underwater image enhancement method and image processing system using the same
CN116403133A (en)Improved vehicle detection algorithm based on YOLO v7
CN117392578A (en)Action detection method and system based on two-stage space-time attention
CN118609163A (en) A lightweight real-time human posture recognition method based on MobileViT
CN116883896A (en)Monitoring video anomaly detection method and system based on dynamic self-supervision network

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TR01 | Transfer of patent right

Effective date of registration: 20240826

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co., Ltd.

Country or region after: China

Address before: No. 928, No. 2 street, Jianggan Economic Development Zone, Hangzhou City, Zhejiang Province, 310018

Patentee before: ZHEJIANG SCI-TECH University

Country or region before: China

TR01 | Transfer of patent right

Effective date of registration: 20250523

Address after: 518000 Guangdong Province, Shenzhen City, Nanshan District, Yuehai Street, Binhai Community, Haitian Second Road No. 25, Shenzhen Bay Venture Capital Building 801

Patentee after: Shenzhen Chaowei Imaging Technology Co., Ltd.

Country or region after: China

Address before: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Dragon totem Technology (Hefei) Co., Ltd.

Country or region before: China

