CN112149504A - A hybrid convolutional residual network combined with attention for action video recognition

A hybrid convolutional residual network combined with attention for action video recognition

Info

Publication number
CN112149504A
Authority
CN
China
Prior art keywords
layer
attention
convolution
feature map
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010849991.6A
Other languages
Chinese (zh)
Other versions
CN112149504B (en)
Inventor
杨慧敏 (Yang Huimin)
田秋红 (Tian Qiuhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chaowei Imaging Technology Co ltd
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010849991.6A
Publication of CN112149504A
Application granted
Publication of CN112149504B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese

The invention discloses an action video recognition method combining a mixed-convolution residual network with attention. The method comprises: 1) reading the actions of people in an action video and converting the action video into original video frame images; 2) performing data augmentation on the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images; 3) constructing an attention module, using the attention module to build mixed convolution blocks, cascading the mixed convolution blocks to build a mixed-convolution residual network model that combines the mixed-convolution residual network with attention, and using the model to perform spatiotemporal feature learning on the video frame images to obtain key feature maps; 4) classifying the key feature maps with a Softmax classification layer. While extending the network depth, the invention retains the feature information of the video frames, fully fuses spatiotemporal features, improves the correlation of important channel features, and effectively improves the prediction performance of action recognition.

Figure 202010849991

Description

Translated from Chinese
A hybrid convolutional residual network combined with attention for action video recognition

Technical Field

The invention relates to an action video recognition method in the technical field of intelligent video analysis, and in particular to an action video recognition method based on a hybrid-convolution residual network combined with an attention mechanism.

Background

Action recognition has application value in video processing, pattern recognition, virtual reality, and related areas, and is one of the important research topics in computer vision. Action recognition in videos is a key problem in video understanding tasks. It needs not only to capture features in the spatial dimension but also to encode the temporal relationships between multiple consecutive frames. Therefore, effectively extracting high-resolution spatiotemporal features from action videos is of great significance for improving the accuracy of action recognition. However, a video is a sequence of consecutive frames with temporal relationships; each pixel is highly similar to its neighboring pixels, and the spatiotemporal correlation is very strong. Traditional convolutional neural networks have excellent feature-extraction performance on single images but cannot extract spatiotemporal features from videos.

When the video input consists of consecutive images, there are currently three main approaches: (1) 2D CNNs combined with RNN/LSTM, (2) two-stream CNNs, and (3) 3D CNNs. Two-stream CNNs use two independent networks to capture spatial features and temporal motion information. Although this approach works well, it cannot effectively fuse appearance and motion information because the two networks are trained separately. RNN/LSTM handles sequence information better and is therefore often combined with a CNN for action recognition; however, such methods only preserve the high-level features of the top layer and ignore the correlations in low-level features. Using a 3D CNN to capture spatiotemporal information is effective, but 3D CNN models have a huge number of parameters and contain a large amount of redundant spatial data, so training them is very challenging. In recent years, many studies have tried to introduce attention mechanisms from different perspectives to enhance the robustness of action recognition; however, stacking attention in deep networks leads to repeated dot products, which degrades the value of the features.

Summary of the Invention

To solve the problems in the background art, the purpose of the present invention is to provide an action recognition method for action videos that combines a hybrid-convolution residual network with an attention mechanism. An MC-RAN module is designed which, based on a hybrid-convolution residual network, fuses the 2D and 1D convolutions obtained by decoupling a 3D convolution with an adapted spatial attention module M_SS and channel attention module M_CS, respectively, improving the correlation of important channel features and increasing the global correlation of the feature maps so as to improve action recognition performance.

The technical scheme adopted by the present invention is as follows:

The present invention comprises the following steps:

1) Read the actions of the person in the action video, then convert the action video into original video frame images;

2) Perform data augmentation on the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images;

Step 2) is specifically as follows:

Temporal sampling: for each action video, 16 consecutive frames are randomly sampled for training; if fewer than 16 consecutive frames are available, the action video is played in a loop until the number of consecutive frames reaches 16;

Random cropping: the original video frame images are resized to 128×171 pixels and then randomly cropped to 112×112 pixels;

Brightness adjustment: the brightness of the original video frame images is adjusted randomly.
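A minimal NumPy/OpenCV sketch of the three augmentation steps above, given only as an illustration: the function names are invented for this sketch, and the brightness range is an assumption, since the description only states that the brightness is adjusted randomly.

import random
import cv2
import numpy as np

def temporal_sample(frames, clip_len=16):
    # Randomly sample clip_len consecutive frames; loop the video if it is shorter than clip_len.
    while len(frames) < clip_len:
        frames = frames + frames
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]

def random_crop(frames, resize_hw=(128, 171), crop_hw=(112, 112)):
    # Resize every frame to 128x171 and apply one random 112x112 crop to the whole clip.
    resized = [cv2.resize(f, (resize_hw[1], resize_hw[0])) for f in frames]
    top = random.randint(0, resize_hw[0] - crop_hw[0])
    left = random.randint(0, resize_hw[1] - crop_hw[1])
    return [f[top:top + crop_hw[0], left:left + crop_hw[1]] for f in resized]

def random_brightness(frames, max_delta=32.0):
    # Shift the brightness of the whole clip by one random offset (the range is an assumption).
    delta = random.uniform(-max_delta, max_delta)
    return [np.clip(f.astype(np.float32) + delta, 0, 255).astype(np.uint8) for f in frames]

def augment_clip(frames):
    return random_brightness(random_crop(temporal_sample(frames)))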

3) Construct an attention module, use the attention module to build mixed convolution blocks, cascade the mixed convolution blocks to build a mixed-convolution residual network model that combines a hybrid-convolution residual network with attention, and use the model to perform spatiotemporal feature learning on the video frame images to obtain key feature maps;

The mixed convolution block is expressed as:

X_{t+1} = X_t + W(X_t)

where X_t and X_{t+1} denote the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W denotes the mixed-convolution residual function with the attention mechanism added;

Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolutional layer and four mixed convolution blocks. A mixed convolution block comprises an MC-RAN module and an addition layer. The MC-RAN module comprises a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer, where the (2+1)D convolutional layer is formed by adding the attention module to a 2D convolutional layer. The input X_t of the mixed convolution block is fed into the MC-RAN module; the feature map output by the MC-RAN module is added to the input X_t through the addition layer, and the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block. Each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling;

The i-th 3D convolutional layer of size N_{i-1}×t×d×d is decomposed into M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:

M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]

where d denotes the spatial (width/height) size parameter of the 3D convolutional layer, t denotes the temporal length, and [ ] denotes rounding down (floor).
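A small Python helper illustrating the computation of M_i. The closed-form expression is the usual parameter-matching rule for decomposing a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution; it is reconstructed here because the original formula is given only as an image, so treat it as an assumption.

import math

def mixed_conv_mid_channels(n_in, n_out, t, d):
    # Intermediate channel count M_i for splitting an (n_in x t x d x d) 3D convolution into
    # M_i spatial (1 x d x d) filters followed by n_out temporal (t x 1 x 1) filters,
    # chosen so that the parameter count roughly matches the original 3D convolution.
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

# Example: decomposing a 3x3x3 convolution from 64 to 128 channels
print(mixed_conv_mid_channels(64, 128, t=3, d=3))  # -> 230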

The (2+1)D convolutional layer mainly consists of the first 2D convolutional layer, the spatial attention module M_SS, the temporal convolutional layer, and the channel attention module M_CS connected in cascade; the spatial attention module M_SS and the channel attention module M_CS together form the attention module;

The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolutional layer; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;

The spatial attention module M_SS is constructed as follows: given an input feature map F of size C×H×W, where C denotes the number of channels of each frame in the input feature map and H and W denote the width and height of each frame, first, global average pooling is used to compress the channels of each frame in the input feature map, producing a 2D spatial descriptor Z of size 1×H×W; then the third 2D convolutional layer convolves the 2D spatial descriptor Z to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the region of interest, yielding the spatial attention weight map W_SS;

The spatial attention weight map W_SS can be expressed as:

W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))

where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7×7 kernel, AvgPool() denotes global average pooling, and F denotes the input feature map;
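A minimal PyTorch-style sketch of the spatial attention module M_SS as described above: channel-wise average pooling, a 7×7 convolution, a sigmoid, and batch normalization. It operates frame by frame on a 5D video tensor, and applying the resulting weight map by element-wise multiplication is an assumption; the class name and layout are illustrative rather than the patent's exact implementation.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Spatial attention M_SS: W_SS(F) = BN(sigmoid(conv7x7(channel-wise AvgPool(F))))
    def __init__(self):
        super().__init__()
        # A (1, 7, 7) kernel applies an independent 7x7 spatial convolution to every frame.
        self.conv = nn.Conv3d(1, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))
        self.bn = nn.BatchNorm3d(1)

    def forward(self, x):                    # x: (N, C, T, H, W)
        z = x.mean(dim=1, keepdim=True)      # compress channels -> 2D spatial descriptor per frame
        w = torch.sigmoid(self.conv(z))      # response over the region of interest
        return self.bn(w)                    # spatial weight map W_SS, shape (N, 1, T, H, W)

# Usage (assumed): feat = feat * SpatialAttention()(feat)   # broadcast over the channel dimension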

The channel attention module M_CS is constructed as follows: given an input feature map Q of size C×H×W, where C denotes the number of channels of each frame in the input feature map, first, a global average pooling operation is applied to Q, producing a channel vector Q' of size 1×1×C; then a multilayer perceptron processes the channel vector Q' to learn its weights;

The channel vector Q' can be calculated by the following formula:

Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)

where F(i,j) denotes the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension;

Finally, a fourth batch normalization layer is added after the multilayer perceptron for dimension conversion, yielding the channel attention weight map W_CS;

The channel attention weight map W_CS can be expressed as:

W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))

where MLP() denotes a multilayer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r×C and C×C/r respectively, r is the compression ratio, δ() is the rectified linear unit (ReLU), and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.
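A corresponding PyTorch-style sketch of the channel attention module M_CS: global average pooling, a two-layer perceptron with compression ratio r, a sigmoid, and batch normalization. The value r=16 follows the embodiment described later; the class name and the element-wise application of the weights are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel attention M_CS: W_CS(F) = BN(sigmoid(W1 * ReLU(W0 * GAP(F) + b0) + b1))
    def __init__(self, channels, r=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)          # global average pooling -> channel vector Q'
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),     # W0, b0 (hidden layer of size C/r)
            nn.ReLU(inplace=True),                  # delta()
            nn.Linear(channels // r, channels),     # W1, b1
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                           # x: (N, C, T, H, W)
        n, c = x.shape[:2]
        q = self.gap(x).view(n, c)                  # Q': (N, C)
        w = self.bn(torch.sigmoid(self.mlp(q)))     # learned channel weights, normalized
        return w.view(n, c, 1, 1, 1)                # channel weight map W_CS, broadcastable over T, H, W

# Usage (assumed): feat = feat * ChannelAttention(feat.shape[1])(feat)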

4) Classify the key feature maps using a Softmax classification layer.

Step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; the key feature maps are fed into the Softmax layer for classification.
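A minimal sketch of this classification step. The patent only states that the key feature maps are fed into a Softmax layer; the global average pooling, dropout, and fully connected layer shown here are assumptions consistent with the training settings described in the embodiment (dropout 0.5).

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Pool the key feature maps and classify them with a fully connected layer plus Softmax.
    def __init__(self, channels, num_classes, dropout=0.5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)     # collapse the T, H, W dimensions
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat):                    # feat: (N, C, T, H, W)
        x = self.pool(feat).flatten(1)          # (N, C)
        logits = self.fc(self.dropout(x))
        return torch.softmax(logits, dim=1)     # class probabilities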

The input feature map of the first MC-RAN module is the output feature map obtained after the video frame images from step 2) pass through the first convolutional layer; the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after the 3D max-pooling layer.

Beneficial effects of the present invention:

1) The present invention designs the MC-RAN module which, based on a hybrid-convolution residual network, fuses the 2D and 1D convolutions obtained by decoupling a 3D convolution with an adapted spatial attention module and channel attention module, respectively. This fully fuses spatiotemporal features, improves the correlation of important channel features, and increases the global correlation of the feature maps, thereby improving action recognition performance.

2) The mixed-convolution residual network model proposed by the present invention can retain feature information while increasing the network depth. Comparative experiments were conducted on the public datasets UCF101 and HMDB51; after pre-training on the Kinetics dataset, the Top-1 accuracy on the UCF101 and HMDB51 test sets reaches 96.8% and 74.8%, respectively.

Brief Description of the Drawings

Figure 1 shows examples from part of the dataset used in an embodiment of the present invention;

Figure 2 is the module design diagram of an embodiment of the present invention;

Figure 3 shows the structure of the spatial attention module in an embodiment of the present invention;

Figure 4 shows the structure of the channel attention module in an embodiment of the present invention;

Figure 5 is the cascade diagram of the mixed convolution blocks in an embodiment of the present invention;

Figure 6 shows feature maps from an embodiment of the present invention; (a), (b), (c), and (d) are original video frames; (e), (f), (g), and (h) are the corresponding feature maps.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and embodiments.

The present invention provides an action video recognition method that combines a hybrid-convolution residual network with attention, using the open-source UCF101 dataset as the experimental dataset; examples from the dataset are shown in Figure 1. The figure shows the video frame images converted from part of one of the action videos; the images are saved in .jpg format, and the final image size is 320×240.

An embodiment of the present invention is as follows:

Step 1: Use the VideoCapture function in OpenCV to read the action video and convert it into video frame images; video frame images of some action videos are shown in Figure 1.
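A short OpenCV sketch of step 1: reading the video with VideoCapture and saving the frames as .jpg images at the 320×240 size mentioned above. The output path and file naming are illustrative.

import cv2

def video_to_frames(video_path, out_dir):
    # Read an action video with OpenCV's VideoCapture and save each frame as a .jpg image.
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                  # end of the video
            break
        frame = cv2.resize(frame, (320, 240))      # final image size used in the embodiment
        cv2.imwrite(f"{out_dir}/frame_{idx:05d}.jpg", frame)
        idx += 1
    cap.release()
    return idx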

Step 2: The present invention first performs data preprocessing for the action recognition model and then pre-trains on the Kinetics dataset rather than training the model from scratch, in order to improve the model's accuracy.

2.1) The data preprocessing for action recognition is as follows:

Data augmentation is applied to the video frames of the action video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain the video frame images;

Temporal sampling: for each action video, 16 consecutive frames are randomly sampled for training; if fewer than 16 consecutive frames are available, the action video is played in a loop until the number of consecutive frames reaches 16;

Random cropping: the original video frame images are resized to 128×171 pixels and then randomly cropped to 112×112 pixels;

Brightness adjustment: the brightness of the original video frame images is adjusted randomly.

2.2) The model pre-training process for action recognition is as follows:

The preprocessed video frame images are fed into the mixed-convolution residual network model for feature extraction in the spatial and channel dimensions. The input of the model has shape 16×112×112×3 per sample, and the output is the class label. Stochastic gradient descent (SGD) is used to optimize the loss value, with the initial learning rate set to 0.01; when the validation loss saturates, the learning rate is divided by 10. The momentum coefficient is 0.9, the dropout rate is 0.5, the weight decay is 10e-3, and batch normalization is used to accelerate model training. Training is performed on a server with 8 Tesla V100 GPUs, with a batch_size of 8 per GPU and a total batch_size of 64.
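A minimal PyTorch sketch of the optimizer and learning-rate schedule described above. Only the learning rate, momentum, weight decay, and the divide-by-10 rule come from the text; the ReduceLROnPlateau patience value and the helper names in the commented loop are assumptions.

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def build_optimizer(model):
    # SGD with the reported hyperparameters: lr 0.01, momentum 0.9, weight decay 10e-3;
    # the learning rate is divided by 10 when the validation loss saturates.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=10e-3)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)
    return optimizer, scheduler

# Training loop outline (per epoch), with hypothetical helpers:
#   train_one_epoch(model, optimizer)   # batch size 8 per GPU, 8 GPUs, 64 in total
#   val_loss = evaluate(model)
#   scheduler.step(val_loss)            # lr /= 10 when the validation loss plateaus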

Step 3: Construct the attention module. The attention module uses an attention mechanism to focus on the locations indicated by prior knowledge, removes the interference of background and noise on action recognition, and automatically assigns different amounts of attention to different locations of the input feature map according to the prior knowledge;

The attention module is used to build mixed convolution blocks, and the mixed convolution blocks are cascaded to build a mixed-convolution residual network model that combines a hybrid-convolution residual network with attention; the model performs spatiotemporal feature learning on the video frame images to obtain key feature maps;

The mixed convolution block is expressed as:

X_{t+1} = X_t + W(X_t)

where X_t and X_{t+1} denote the input and output of the t-th MC-RAN module; X_t and X_{t+1} have the same feature dimensions, and W denotes the mixed-convolution residual function with the attention mechanism added.

Step 3) is specifically as follows: the 3D ResNet network structure is selected as the basic network structure, and the original 3D convolution modules in the 3D ResNet structure are replaced by a first convolutional layer and four mixed convolution blocks; a mixed convolution block comprises an MC-RAN module and an addition layer. The MC-RAN module comprises, connected in sequence, a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer. The input X_t of the mixed convolution block is fed into the MC-RAN module; the feature map output by the MC-RAN module is added to the input X_t through the addition layer, and the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block. Each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling.
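A minimal PyTorch-style sketch of one mixed convolution block (the MC-RAN module followed by the residual addition, the second ReLU, and the cascaded 3D max-pooling layer), reusing the SpatialAttention and ChannelAttention classes from the sketches above. The kernel sizes are illustrative, the input and output channel counts are kept equal so that the residual addition is valid, the downsampling strides of conv3_1 to conv5_1 are omitted, and applying the attention maps by multiplication is an assumption; mid_channels plays the role of M_i.

import torch
import torch.nn as nn

class MCRANBlock(nn.Module):
    # Mixed convolution block: (2+1)D conv with attention -> BN -> ReLU -> 3D conv -> BN,
    # then residual addition with the block input, ReLU, and 3D max pooling.
    def __init__(self, channels, mid_channels):
        super().__init__()
        # (2+1)D convolution: 2D spatial conv + spatial attention + 1D temporal conv + channel attention
        self.spatial_conv = nn.Conv3d(channels, mid_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_att = SpatialAttention()
        self.temporal_conv = nn.Conv3d(mid_channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.channel_att = ChannelAttention(channels)
        self.bn1 = nn.BatchNorm3d(channels)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu2 = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):                              # x: (N, C, T, H, W)
        out = self.spatial_conv(x)
        out = out * self.spatial_att(out)              # weight the features in the spatial dimension
        out = self.temporal_conv(out)
        out = out * self.channel_att(out)              # weight the features in the channel dimension
        out = self.relu1(self.bn1(out))
        out = self.bn2(self.conv3d(out))
        out = self.relu2(out + x)                      # residual addition: X_{t+1} = X_t + W(X_t)
        return self.pool(out)                          # 3D max pooling for downsampling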

a. The i-th 3D convolutional layer of size N_{i-1}×t×d×d is decomposed into M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:

M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]

where d denotes the spatial (width/height) size parameter of the 3D convolutional layer, t denotes the temporal length, and [ ] denotes rounding down (floor);

b. Spatial downsampling is performed at the first convolutional layer conv1 with a stride of 1×2×2. For the third mixed convolution block conv3_1, the fourth mixed convolution block conv4_1, and the fifth mixed convolution block conv5_1, spatiotemporal downsampling is performed on the first 2D convolutional layer and the temporal convolutional layer of the (2+1)D convolution, with strides of 1×2×2 and 2×1×1, respectively. Table 1 shows the network structure of the first convolutional layer and the mixed convolution blocks.

Table 1 shows the network layer structure of the first convolutional layer and the mixed convolution blocks.

Figure BDA0002644396770000062 (Table 1: network layer structure of the first convolutional layer and the mixed convolution blocks, provided as an image in the original document)

c. The cascade diagram of the mixed convolution blocks is shown in Figure 5. The (2+1)D convolutional layer is formed by adding the attention module to a 2D convolutional layer; it mainly consists of the first 2D convolutional layer, the spatial attention module M_SS, the temporal convolutional layer, and the channel attention module M_CS connected in cascade. The attention module applies attention to the input feature map in the spatial and channel dimensions respectively, and is composed of the spatial attention module M_SS and the channel attention module M_CS.

The spatial attention module M_SS obtains the spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolution kernel; the channel attention module M_CS obtains the channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;

The spatial attention module M_SS is constructed as follows: given an input feature map F of size C×H×W, where C denotes the number of channels of each frame in the input feature map and H and W denote the width and height of each frame, first, global average pooling is used to compress the channels of each frame in the input feature map, producing a 2D spatial descriptor Z of size 1×H×W; the element of Z at coordinate (i,j) is computed as:

Z(i,j) = (1/C) · Σ_{k=1}^{C} F_{i,j}(k)

where F_{i,j}(k) denotes the value of the k-th channel of the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension. Then a third 2D convolutional layer with a 7×7 kernel convolves the 2D spatial descriptor to obtain the target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the region of interest, yielding the spatial attention weight map W_SS.

The spatial attention weight map W_SS can be expressed as:

W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))

where BN() denotes batch normalization, σ() denotes the sigmoid activation function, f^{7×7}() denotes a convolution operation with a 7×7 kernel, AvgPool() denotes global average pooling, and F denotes the input feature map.

The channel attention module M_CS is constructed as follows: given an input feature map Q of size H×W×C, where C denotes the number of channels of each frame in the input feature map, first, a global average pooling operation is applied to Q, producing a channel vector Q' of size 1×1×C; then a multilayer perceptron FC with a hidden layer processes the channel vector Q' to learn its weights, and the weights are used as the correlations. To limit the complexity of the channel attention module and save parameters, the size of the hidden activation layer is set to 1×1×C/r, where r is the compression ratio and is set to 16.

The channel vector Q' can be calculated by the following formula:

Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)

where F(i,j) denotes the feature map at coordinates (i,j), i indexes pixels along the H dimension, and j indexes pixels along the W dimension;

Finally, a fourth batch normalization layer is added after the multilayer perceptron for dimension conversion, yielding the channel attention weight map W_CS.

The channel attention weight map W_CS can be expressed as:

W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))

where MLP denotes a multilayer perceptron with a hidden layer, W_0 and W_1 are the MLP weights with sizes C/r×C and C×C/r respectively, σ() is the sigmoid activation function, δ() is the rectified linear unit (ReLU), and b_0 and b_1 are the MLP bias terms with sizes C/r and C respectively.

Step 4: After the video frame images pass through the first convolutional layer and the four mixed convolution blocks, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; feature-map visualizations after adding the attention module are shown in Figure 6. The key feature maps are fed into the Softmax layer for classification. The trained network is used to evaluate each video in the validation set and obtain the corresponding class label. After training, the proposed mixed-convolution residual network model is compared with different network models; the experimental results are shown in Table 2. The results show that, without increasing the number of parameters, the mixed-convolution residual network model improves both the Top-1 and Top-5 recognition accuracy.

Table 2 compares the recognition results of the mixed-convolution residual network model with other models.

Network model | Parameters | Top-1 recognition rate (%) | Top-5 recognition rate (%) | Average recognition rate (%)
ResNet [39] | 63.72M | 60.1 | 81.9 | 71.0
(2+1)D-ResNet [12] | 63.88M | 66.8 | 88.1 | 77.45
MC-ResNet [28] | 63.88M | 67.3 | 89.2 | 78.25
RAN [26] | 63.97M | 61.7 | 83.2 | 72.45
(2+1)D-RAN | 63.98M | 67.8 | 89.3 | 78.55
MC-RAN | 63.98M | 68.8 | 89.9 | 79.35

The above specific embodiments are intended to explain the present invention rather than to limit it; any modification or change made to the present invention within the spirit of the present invention and the protection scope of the claims falls within the protection scope of the present invention.

Claims (6)

1. A motion video identification method combining a hybrid-convolution residual network and attention, characterized in that the method comprises the following steps:
1) reading the motions of a person in the motion video, and converting the motion video into original video frame images;
2) performing data augmentation on the video frames of the motion video using temporal sampling, random cropping, and brightness adjustment, respectively, to obtain video frame images;
3) constructing an attention module, constructing mixed convolution blocks by using the attention module, constructing a mixed-convolution residual network model combining a hybrid-convolution residual network with attention by cascading the mixed convolution blocks, and performing spatiotemporal feature learning on the video frame images by using the mixed-convolution residual network model to obtain key feature maps;
the mixed convolution block being expressed as:
X_{t+1} = X_t + W(X_t)
wherein X_t and X_{t+1} represent the input and output of the t-th MC-RAN module, X_t and X_{t+1} have the same feature dimensions, and W represents the mixed-convolution residual function to which an attention mechanism is added;
4) classifying the key feature maps by using a Softmax classification layer.
2. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that step 2) is specifically as follows:
temporal sampling: for each motion video, randomly sampling 16 consecutive frames of the motion video for training; if fewer than 16 consecutive frames are available, playing the motion video in a loop until the number of consecutive frames reaches 16;
random cropping: resizing the original video frame images to 128×171 pixels, and then randomly cropping the original video frame images to 112×112 pixels;
brightness adjustment: randomly adjusting the brightness of the original video frame images.
3. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that:
step 3) is specifically as follows: a 3D ResNet network structure is selected as the basic network structure, the original 3D convolution module in the 3D ResNet network structure is replaced by a first convolutional layer and four mixed convolution blocks, and each mixed convolution block comprises an MC-RAN module and an addition layer; the MC-RAN module comprises a (2+1)D convolutional layer, a first batch normalization layer, a first ReLU activation layer, a 3D convolutional layer, and a second batch normalization layer, wherein the (2+1)D convolutional layer is formed by adding an attention module to a 2D convolutional layer; the input X_t of the mixed convolution block is input into the MC-RAN module, the feature map output by the MC-RAN module and the input X_t are added through the addition layer, the summed feature map, after being processed by a second ReLU activation layer, is taken as the output X_{t+1} of the mixed convolution block, and each mixed convolution block is followed by a cascaded 3D max-pooling layer for downsampling;
the i-th 3D convolutional layer of size N_{i-1}×t×d×d consists of M_i second 2D convolutional layers of size N_{i-1}×1×d×d and N_i temporal convolutional layers of size M_i×t×1×1, where M_i is calculated by the following formula:
M_i = [ (t·d²·N_{i-1}·N_i) / (d²·N_{i-1} + t·N_i) ]
wherein d represents the spatial (width/height) size parameter of the 3D convolutional layer, t represents the temporal length, and [ ] represents rounding down (floor).
4. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 3, characterized in that:
the (2+1)D convolutional layer mainly consists of a first 2D convolutional layer, a spatial attention module M_SS, a temporal convolutional layer, and a channel attention module M_CS connected in cascade, and the spatial attention module M_SS and the channel attention module M_CS form the attention module;
the spatial attention module M_SS obtains a spatial weight map W_SS of the input feature map in the spatial dimension through a third 2D convolutional layer; the channel attention module M_CS obtains a channel weight map W_CS of the input feature map in the channel dimension by adding a multilayer perceptron;
the spatial attention module M_SS is constructed as follows: when the size of the input feature map F is C×H×W, C represents the number of channels of each frame in the input feature map, and H and W represent the width and height of each frame in the input feature map; first, global average pooling is used to compress the channels of each frame in the input feature map to generate a 2D spatial descriptor Z of size 1×H×W; then, a third 2D convolutional layer is used to convolve the 2D spatial descriptor Z to obtain a target region of interest in the input feature map; finally, a third batch normalization layer is added after the third 2D convolutional layer to perform a dimension transformation on the target region of interest to obtain the spatial attention weight map W_SS;
the spatial attention weight map W_SS can be expressed as:
W_SS(F) = BN(σ(f^{7×7}(AvgPool(F))))
wherein BN() represents batch normalization, σ() represents the sigmoid activation function, f^{7×7}() represents a convolution operation with a 7×7 kernel, AvgPool() represents global average pooling, and F represents the input feature map;
the channel attention module M_CS is constructed as follows: when the size of the input feature map Q is C×H×W and C represents the number of channels of each frame in the input feature map, first, a global average pooling operation is performed on the input feature map Q to generate a channel vector Q' of size 1×1×C; subsequently, the channel vector Q' is processed by using a multilayer perceptron to learn the weights of the channel vector Q';
the channel vector Q' can be calculated by the following formula:
Q' = (1/(H×W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} F(i,j)
wherein F(i,j) represents the feature map at coordinates (i,j), i represents a pixel in the H dimension, and j represents a pixel in the W dimension;
finally, a fourth batch normalization layer is added after the multilayer perceptron to perform dimension conversion to obtain the channel attention weight map W_CS;
the channel attention weight map W_CS can be expressed as:
W_CS(F) = BN(MLP(AvgPool(F))) = BN(σ(W_1·δ(W_0·AvgPool(F) + b_0) + b_1))
wherein MLP() represents a multilayer perceptron with a hidden layer, W_0 and W_1 are the weights of MLP() with sizes C/r×C and C×C/r respectively, r is the compression ratio, δ() is the rectified linear unit, and b_0 and b_1 represent the bias terms of MLP() with sizes C/r and C respectively.
5. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that step 4) is specifically as follows: after the video frame images pass through the four MC-RAN modules, the spatiotemporal features in the video frame images have been fused and the mixed-convolution residual network model has obtained the key features; the key feature maps are input into a Softmax layer for classification.
6. The motion video identification method combining a hybrid-convolution residual network and attention according to claim 1, characterized in that: the input feature map of the first MC-RAN module is the output feature map of the video frame images from step 2) after passing through the first convolutional layer, and the input feature map of each subsequent MC-RAN module is the output feature map of the previous MC-RAN module after passing through the 3D max-pooling layer.
CN202010849991.6A | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention | Active | CN112149504B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010849991.6A (CN112149504B) | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010849991.6A (CN112149504B) | 2020-08-21 | 2020-08-21 | Motion video identification method combining mixed convolution residual network and attention

Publications (2)

Publication Number | Publication Date
CN112149504A | 2020-12-29
CN112149504B (en) | 2024-03-26

Family

ID=73889023

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010849991.6A (Active, CN112149504B) | Motion video identification method combining mixed convolution residual network and attention | 2020-08-21 | 2020-08-21

Country Status (1)

Country | Link
CN | CN112149504B (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766172A (en)*2021-01-212021-05-07北京师范大学Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en)*2021-01-282021-05-14内蒙古科技大学Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en)*2021-01-292021-05-18山东大学Video behavior identification method and system based on channel attention guide time modeling
CN112883264A (en)*2021-02-092021-06-01联想(北京)有限公司Recommendation method and device
CN113128395A (en)*2021-04-162021-07-16重庆邮电大学Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113139530A (en)*2021-06-212021-07-20城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113160117A (en)*2021-02-042021-07-23成都信息工程大学Three-dimensional point cloud target detection method under automatic driving scene
CN113283338A (en)*2021-05-252021-08-20湖南大学Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162A (en)*2021-06-032021-08-24北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113343760A (en)*2021-04-292021-09-03暖屋信息科技(苏州)有限公司Human behavior recognition method based on multi-scale characteristic neural network
CN113468531A (en)*2021-07-152021-10-01杭州电子科技大学Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113673559A (en)*2021-07-142021-11-19南京邮电大学 A spatiotemporal feature extraction method of video characters based on residual network
CN113837263A (en)*2021-09-182021-12-24浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113850135A (en)*2021-08-242021-12-28中国船舶重工集团公司第七0九研究所 A method and system for dynamic gesture recognition based on time shift framework
CN113850182A (en)*2021-09-232021-12-28浙江理工大学Action identification method based on DAMR-3 DNet
CN114037930A (en)*2021-10-182022-02-11苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114140654A (en)*2022-01-272022-03-04苏州浪潮智能科技有限公司 Image action recognition method, device and electronic device
CN114155469A (en)*2021-12-072022-03-08湖南科技大学Deep video frame rate up-conversion detection device based on double-current convolutional neural network
CN114373476A (en)*2022-01-112022-04-19江西师范大学 A sound scene classification method based on multi-scale residual attention network
CN114724252A (en)*2022-04-242022-07-08中国计量大学 A Video Action Recognition Method Based on Improved MobileNet
CN114758265A (en)*2022-03-082022-07-15深圳集智数字科技有限公司Escalator operation state identification method and device, electronic equipment and storage medium
CN114783053A (en)*2022-03-242022-07-22武汉工程大学Behavior identification method and system based on space attention and grouping convolution
CN114842542A (en)*2022-05-312022-08-02中国矿业大学Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114882889A (en)*2022-04-132022-08-09厦门快商通科技股份有限公司Speaker recognition model training method, device, equipment and readable medium
CN114937225A (en)*2022-05-302022-08-23杭州电子科技大学Behavior identification method based on channel-time characteristics
CN115035605A (en)*2022-08-102022-09-09广东履安实业有限公司 Action recognition method, device, device and storage medium based on deep learning
CN115049969A (en)*2022-08-152022-09-13山东百盟信息技术有限公司Poor video detection method for improving YOLOv3 and BiConvLSTM
CN115331140A (en)*2022-07-292022-11-11南京邮电大学Channel grouping-based space-time feature separation and extraction method in action recognition
CN115527275A (en)*2022-10-312022-12-27浙江理工大学 Behavior Recognition Method Based on P2CS_3DNet
CN116304984A (en)*2023-03-142023-06-23烟台大学 Multimodal intent recognition method and system based on contrastive learning
CN116385964A (en)*2023-03-242023-07-04杭州电子科技大学Video crowd counting method based on combination of attention and spatial transformation network
CN116416479A (en)*2023-06-062023-07-11江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN116502044A (en)*2023-04-272023-07-28电子科技大学Radio frequency fingerprint identification method based on residual error network model
CN119919879A (en)*2024-12-302025-05-02深圳市检验检疫科学研究院 A convolutional-long short-term memory neural network object detection and recognition method
CN120451964A (en)*2025-04-272025-08-08中国农业科学院果树研究所Wild pear variety identification system and method based on leaf image deep learning
CN120451964B (en)*2025-04-272025-10-10中国农业科学院果树研究所 Wild pear variety recognition system and method based on leaf image deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109886225A (en)*2019-02-272019-06-14浙江理工大学 An online detection and recognition method of image gesture action based on deep learning
CN109886090A (en)*2019-01-072019-06-14北京大学 A Video Pedestrian Re-identification Method Based on Multi-temporal Convolutional Neural Networks
CN110110646A (en)*2019-04-302019-08-09浙江理工大学A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en)*2019-06-032019-09-17浙江理工大学 A Key Frame Extraction Method of Gesture Image Based on Image Similarity
CN110457524A (en)*2019-07-122019-11-15北京奇艺世纪科技有限公司Model generating method, video classification methods and device
CN110807808A (en)*2019-10-142020-02-18浙江理工大学 A Item Recognition Method Based on Physics Engine and Deep Fully Convolutional Network
CN111091045A (en)*2019-10-252020-05-01重庆邮电大学 A Sign Language Recognition Method Based on Spatio-temporal Attention Mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109886090A (en)*2019-01-072019-06-14北京大学 A Video Pedestrian Re-identification Method Based on Multi-temporal Convolutional Neural Networks
CN109886225A (en)*2019-02-272019-06-14浙江理工大学 An online detection and recognition method of image gesture action based on deep learning
CN110110646A (en)*2019-04-302019-08-09浙江理工大学A kind of images of gestures extraction method of key frame based on deep learning
CN110245593A (en)*2019-06-032019-09-17浙江理工大学 A Key Frame Extraction Method of Gesture Image Based on Image Similarity
CN110457524A (en)*2019-07-122019-11-15北京奇艺世纪科技有限公司Model generating method, video classification methods and device
CN110807808A (en)*2019-10-142020-02-18浙江理工大学 A Item Recognition Method Based on Physics Engine and Deep Fully Convolutional Network
CN111091045A (en)*2019-10-252020-05-01重庆邮电大学 A Sign Language Recognition Method Based on Spatio-temporal Attention Mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
包嘉欣; 田秋红; 杨慧敏: "Sign language recognition based on skin color segmentation and an improved VGG network", 计算机系统应用 (Computer Systems & Applications), no. 06*
王晨浩: "Research on multi-granularity lip-reading recognition technology", CNKI*
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism", 电子技术与软件工程 (Electronic Technology & Software Engineering), no. 04*

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112766172B (en)*2021-01-212024-02-02北京师范大学Facial continuous expression recognition method based on time sequence attention mechanism
CN112766172A (en)*2021-01-212021-05-07北京师范大学Face continuous expression recognition method based on time sequence attention mechanism
CN112800957A (en)*2021-01-282021-05-14内蒙古科技大学Video pedestrian re-identification method and device, electronic equipment and storage medium
CN112818843A (en)*2021-01-292021-05-18山东大学Video behavior identification method and system based on channel attention guide time modeling
CN113160117A (en)*2021-02-042021-07-23成都信息工程大学Three-dimensional point cloud target detection method under automatic driving scene
CN112883264A (en)*2021-02-092021-06-01联想(北京)有限公司Recommendation method and device
CN112883264B (en)*2021-02-092025-04-25联想(北京)有限公司 A recommendation method and device
CN113128395B (en)*2021-04-162022-05-20重庆邮电大学Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113128395A (en)*2021-04-162021-07-16重庆邮电大学Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113343760A (en)*2021-04-292021-09-03暖屋信息科技(苏州)有限公司Human behavior recognition method based on multi-scale characteristic neural network
CN113283338A (en)*2021-05-252021-08-20湖南大学Method, device and equipment for identifying driving behavior of driver and readable storage medium
CN113288162B (en)*2021-06-032022-06-28北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113288162A (en)*2021-06-032021-08-24北京航空航天大学Short-term electrocardiosignal atrial fibrillation automatic detection system based on self-adaptive attention mechanism
CN113139530A (en)*2021-06-212021-07-20城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en)*2021-06-212021-09-03城云科技(中国)有限公司Method and device for detecting sleep post behavior and electronic equipment thereof
CN113673559A (en)*2021-07-142021-11-19南京邮电大学 A spatiotemporal feature extraction method of video characters based on residual network
CN113673559B (en)*2021-07-142023-08-25南京邮电大学 A Spatiotemporal Feature Extraction Method of Video Characters Based on Residual Network
CN113468531A (en)*2021-07-152021-10-01杭州电子科技大学Malicious code classification method based on deep residual error network and mixed attention mechanism
CN113850135A (en)*2021-08-242021-12-28中国船舶重工集团公司第七0九研究所 A method and system for dynamic gesture recognition based on time shift framework
CN113837263A (en)*2021-09-182021-12-24浙江理工大学 Gesture image classification method based on feature fusion attention module and feature selection
CN113837263B (en)*2021-09-182023-09-26浙江理工大学Gesture image classification method based on feature fusion attention module and feature selection
CN113850182A (en)*2021-09-232021-12-28浙江理工大学Action identification method based on DAMR-3 DNet
CN113850182B (en)*2021-09-232024-08-09浙江理工大学 Action recognition method based on DAMR_3DNet
CN114037930A (en)*2021-10-182022-02-11苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114037930B (en)*2021-10-182022-07-12苏州大学 Video action recognition method based on spatiotemporal enhancement network
CN114155469A (en)*2021-12-072022-03-08湖南科技大学Deep video frame rate up-conversion detection device based on double-current convolutional neural network
CN114155469B (en)*2021-12-072024-09-06湖南科技大学Depth video frame rate up-conversion detection device based on double-flow convolutional neural network
CN114373476A (en)*2022-01-112022-04-19江西师范大学 A sound scene classification method based on multi-scale residual attention network
CN114140654A (en)*2022-01-272022-03-04苏州浪潮智能科技有限公司 Image action recognition method, device and electronic device
CN114758265A (en)*2022-03-082022-07-15深圳集智数字科技有限公司Escalator operation state identification method and device, electronic equipment and storage medium
CN114758265B (en)*2022-03-082024-11-19深圳须弥云图空间科技有限公司 Escalator operation status identification method, device, electronic equipment and storage medium
CN114783053A (en)*2022-03-242022-07-22武汉工程大学Behavior identification method and system based on space attention and grouping convolution
CN114882889A (en)*2022-04-132022-08-09厦门快商通科技股份有限公司Speaker recognition model training method, device, equipment and readable medium
CN114724252A (en)*2022-04-242022-07-08中国计量大学 A Video Action Recognition Method Based on Improved MobileNet
CN114937225A (en)*2022-05-302022-08-23杭州电子科技大学Behavior identification method based on channel-time characteristics
CN114842542B (en)*2022-05-312023-06-13中国矿业大学 Facial Action Unit Recognition Method and Device Based on Adaptive Attention and Spatiotemporal Correlation
CN114842542A (en)*2022-05-312022-08-02中国矿业大学Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115331140A (en)*2022-07-292022-11-11南京邮电大学Channel grouping-based space-time feature separation and extraction method in action recognition
CN115035605B (en)*2022-08-102023-04-07广东履安实业有限公司 Action recognition method, device, equipment and storage medium based on deep learning
CN115035605A (en)*2022-08-102022-09-09广东履安实业有限公司 Action recognition method, device, device and storage medium based on deep learning
CN115049969A (en)*2022-08-152022-09-13山东百盟信息技术有限公司Poor video detection method for improving YOLOv3 and BiConvLSTM
CN115527275A (en)*2022-10-312022-12-27浙江理工大学 Behavior Recognition Method Based on P2CS_3DNet
CN116304984A (en)*2023-03-142023-06-23烟台大学 Multimodal intent recognition method and system based on contrastive learning
CN116385964A (en)*2023-03-242023-07-04杭州电子科技大学Video crowd counting method based on combination of attention and spatial transformation network
CN116385964B (en)*2023-03-242025-07-08杭州电子科技大学 A video crowd counting method based on the combination of attention and spatial transformer network
CN116502044A (en)*2023-04-272023-07-28电子科技大学Radio frequency fingerprint identification method based on residual error network model
CN116416479B (en)*2023-06-062023-08-29江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN116416479A (en)*2023-06-062023-07-11江西理工大学南昌校区Mineral classification method based on deep convolution fusion of multi-scale image features
CN119919879A (en)*2024-12-302025-05-02深圳市检验检疫科学研究院 A convolutional-long short-term memory neural network object detection and recognition method
CN120451964A (en)*2025-04-272025-08-08中国农业科学院果树研究所Wild pear variety identification system and method based on leaf image deep learning
CN120451964B (en)*2025-04-272025-10-10中国农业科学院果树研究所 Wild pear variety recognition system and method based on leaf image deep learning

Also Published As

Publication number | Publication date
CN112149504B (en) | 2024-03-26

Similar Documents

Publication | Publication Date | Title
CN112149504B (en)Motion video identification method combining mixed convolution residual network and attention
CN112446476B (en) Neural network model compression method, device, storage medium and chip
CN108509978B (en)Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN114596520B (en) A method and device for first-person video action recognition
Kim et al.Fully deep blind image quality predictor
CN112329658A (en)Method for improving detection algorithm of YOLOV3 network
CN113011329A (en)Pyramid network based on multi-scale features and dense crowd counting method
CN113255616B (en)Video behavior identification method based on deep learning
CN114973049B (en)Lightweight video classification method with unified convolution and self-attention
CN113269787A (en)Remote sensing image semantic segmentation method based on gating fusion
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN117456330A (en)MSFAF-Net-based low-illumination target detection method
CN112131959A (en)2D human body posture estimation method based on multi-scale feature reinforcement
CN113850182B (en) Action recognition method based on DAMR_3DNet
Hongmeng et al.A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN115205966A (en) A spatiotemporal Transformer action recognition method for sign language recognition
CN116630387A (en) Monocular Image Depth Estimation Method Based on Attention Mechanism
CN114519383A (en)Image target detection method and system
CN114972851A (en)Remote sensing image-based ship target intelligent detection method
Ma et al.Convolutional transformer network for fine-grained action recognition
TWI836972B (en)Underwater image enhancement method and image processing system using the same
CN116403133A (en)Improved vehicle detection algorithm based on YOLO v7
CN117392578A (en)Action detection method and system based on two-stage space-time attention
CN118609163A (en) A lightweight real-time human posture recognition method based on MobileViT
CN116883896A (en)Monitoring video anomaly detection method and system based on dynamic self-supervision network

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TR01 | Transfer of patent right

Effective date of registration: 20240826

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co., Ltd.

Country or region after: China

Address before: No. 928, No. 2 street, Jianggan Economic Development Zone, Hangzhou City, Zhejiang Province, 310018

Patentee before: ZHEJIANG SCI-TECH University

Country or region before: China

TR01 | Transfer of patent right

Effective date of registration: 20250523

Address after: 518000 Guangdong Province, Shenzhen City, Nanshan District, Yuehai Street, Binhai Community, Haitian Second Road No. 25, Shenzhen Bay Venture Capital Building 801

Patentee after: Shenzhen Chaowei Imaging Technology Co., Ltd.

Country or region after: China

Address before: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Dragon totem Technology (Hefei) Co., Ltd.

Country or region before: China

