CN109446923A

Movatterモバイル変換

Info

Publication number: CN109446923A
Application number: CN201811176393.6A
Authority: CN
Inventors: 李侃; 李杨; 王欣欣
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2019-03-08
Anticipated expiration: 2038-10-10
Also published as: CN109446923B

Abstract

Translated fromChinese

本发明提出了一种基于训练特征融合的深度监督卷积神经网络行为识别方法，属于人工智能计算机视觉领域。本方法提取目标视频的多层卷积特征，设计局部演化池化层，利用局部演化池化层将视频卷积特征映射到一个包含时间信息的向量上，从而提取到目标视频的局部演化描述符；通过使用VLAD编码方法，将多个局部演化描述符编码成基于元动作的视频级表示；利用卷积网络多层级之间信息的互补性，将多层级分类结果集成得到最终分类结果。本发明充分利用时间信息构建视频级表示，有效提高了视频行为识别的准确率。同时，通过集成多层级的预测结果提高了网络中间层的判别性，从而提高了网络整体的性能。

The invention proposes a deep supervision convolutional neural network behavior recognition method based on training feature fusion, which belongs to the field of artificial intelligence computer vision. This method extracts the multi-layer convolution features of the target video, designs a local evolution pooling layer, and uses the local evolution pooling layer to map the video convolution features to a vector containing time information, so as to extract the local evolution descriptor of the target video. ; By using the VLAD coding method, multiple local evolution descriptors are encoded into a meta-action-based video-level representation; the multi-level classification results are integrated to obtain the final classification result by utilizing the complementarity of information between the multiple layers of the convolutional network. The present invention makes full use of time information to construct a video-level representation, and effectively improves the accuracy of video behavior recognition. At the same time, the discriminativeness of the middle layer of the network is improved by integrating the multi-level prediction results, thereby improving the overall performance of the network.

Description

Translated fromChinese

基于训练特征融合的深度监督卷积神经网络行为识别方法Deeply supervised convolutional neural network behavior recognition method based on training feature fusion

技术领域technical field

本发明涉及一种基于视频的行为识别方法，特别涉及一种基于训练特征融合的深度卷积神经网络行为识别方法，属于人工智能计算机视觉领域。The invention relates to a video-based behavior recognition method, in particular to a deep convolution neural network behavior recognition method based on training feature fusion, and belongs to the field of artificial intelligence computer vision.

背景技术Background technique

目前，人体行为识别是智能视频分析领域的研究热点，也是视频理解任务的重要研究方向。近年来，在视频监控、异常事件监测、基于内容的视频检索等方面取得了广泛关注。然而，由于人类行为的复杂性、多变性、视频背景信息的干扰等因素，使得如何为视频建立适当的时空级表示成为关键。At present, human behavior recognition is a research hotspot in the field of intelligent video analysis and an important research direction for video understanding tasks. In recent years, it has gained extensive attention in video surveillance, abnormal event monitoring, and content-based video retrieval. However, due to the complexity and variability of human behavior, the interference of video background information, etc., how to build an appropriate spatiotemporal level representation for videos becomes the key.

早期研究主要致力于识别理想场景下的简单动作，采用基于人工设计特征的行为识别方法，例如，基于三维直方图(HOG3D)的方法、基于光流直方图(HOF)的方法、基于运动边界直方图的方法等等。这些方法通过以描述时空兴趣点(STIP)为中心的区域特征来构建视频的表示，并用来识别视频中的动作。Early research mainly focused on recognizing simple actions in ideal scenarios, using action recognition methods based on artificially designed features, e.g., histogram-based three-dimensional (HOG3D)-based methods, histogram-based optical flow (HOF)-based methods, and motion-boundary-based methods. method of graphs, etc. These methods construct representations of videos by describing regional features centered on spatiotemporal interest points (STIPs) and are used to identify actions in videos.

随着多媒体技术的快速发展，网络以及监控视频中的数据迅速增长，基于真实场景的人体行为识别技术越来越受到关注。由于人体外形、视角、光照和背景变化及摄像头的移动等问题，传统的基于人工设计特征的行为识别方法已经难以在这些真实场景下取得理想效果。With the rapid development of multimedia technology and the rapid growth of data in the network and surveillance video, human behavior recognition technology based on real scenes has attracted more and more attention. Due to the problems of human shape, perspective, illumination and background changes, and camera movement, traditional behavior recognition methods based on artificially designed features have been difficult to achieve ideal results in these real scenes.

近年来，随着深度学习在计算机视觉领域的快速发展和应用，一系列基于深度模型的人体视频行为识别方法被提出。例如，从单帧的层次识别视频中的行为、通过使用RGB帧和光流的双流网络捕捉视频中的运动信息、通过探索视频流上的三维卷积网络来学习视频片段的时空特征等，以及后来提出的双流膨胀三维卷积网络(I3D)，它将卷积神经网络结构中二维的卷积和池化核集扩展为三维，这使得网络无缝地学习视频的时空特征成为可能。In recent years, with the rapid development and application of deep learning in the field of computer vision, a series of human video behavior recognition methods based on deep models have been proposed. For example, identifying behaviors in videos from a single-frame level, capturing motion information in videos by using two-stream networks with RGB frames and optical flow, learning spatiotemporal features of video segments by exploring 3D convolutional networks on video streams, etc., and later The proposed two-stream dilated 3D convolutional network (I3D), which expands the 2D convolutional and pooling kernel sets in the convolutional neural network structure to 3D, makes it possible for the network to seamlessly learn the spatiotemporal features of videos.

然而，现有的卷积神经网络结构只能够对单帧或视频短片段进行建模，缺少直接对视频的长时序结构信息进行建模的能力。因此，现有的基于深度模型的行为识别方法采用了不同的策略来获取视频长时序的时空特征。这些策略主要分为两类：(1)深度卷积特征编码及池化方法，即，利用深度卷积网络来提取帧或视频片段的卷积特征，然后采用时空编码或者池化的方法构建全局的视频级表示。但是，这种方法构造的视频表示是无序的，没有考虑到视频帧与帧之间的时序和演变关系。(2)通过考虑视频的时序结构来构建视频级表示，即，将多个帧或视频片段的深度特征输入到时序模型如LSTM、GRU或排序函数中，将其融合成视频级表示。但是，这种方法会在一定程度上缺失视频的空间局部信息。However, the existing convolutional neural network structures can only model single frames or short video clips, and lack the ability to directly model long-term structural information of videos. Therefore, existing deep model-based action recognition methods employ different strategies to obtain spatiotemporal features of long-term video. These strategies are mainly divided into two categories: (1) Deep convolutional feature coding and pooling methods, that is, deep convolutional networks are used to extract convolutional features of frames or video clips, and then spatiotemporal coding or pooling methods are used to construct a global video-level representation of . However, the video representation constructed by this method is unordered and does not take into account the temporal and evolutionary relationships between video frames. (2) A video-level representation is constructed by considering the temporal structure of the video, i.e., the depth features of multiple frames or video segments are input into a temporal model such as LSTM, GRU, or a ranking function, which is fused into a video-level representation. However, this method will lack the spatial local information of the video to a certain extent.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于为了克服现有技术存在的缺陷，针对目前基于深度特征的长时序视频表示方法中存在的问题，从如何为视频建立适当的时空级表示的层面出发来识别人物行为，提出了一种基于训练特征融合的深度监督卷积神经网络行为识别方法。The purpose of the present invention is to overcome the defects of the prior art, aiming at the problems existing in the current deep feature-based long-sequence video representation methods, from the perspective of how to establish an appropriate spatiotemporal level representation for the video to identify character behavior, and proposes a A deeply supervised convolutional neural network behavior recognition method based on training feature fusion.

本发明通过以下技术方案实现。The present invention is realized by the following technical solutions.

一种基于训练特征融合的深度监督卷积神经网络行为识别方法，包括以下步骤：A deep-supervised convolutional neural network behavior recognition method based on training feature fusion, comprising the following steps:

步骤1：采集用于训练的视频数据，形成训练数据集。Step 1: Collect video data for training to form a training data set.

对训练视频数据集中的视频进行预处理，提取全部视频帧，并将其裁剪成相同尺寸。Preprocess the videos in the training video dataset, extract all video frames, and crop them to the same size.

步骤2：对训练数据集中的视频进行帧采样。Step 2: Frame sampling of the videos in the training dataset.

对训练数据集中的每个视频进行均匀的帧采集。在整个视频跨度上，以为时间间隔，均匀采集T个RGB帧[I₁,I₂,...,I_T]，其中，T_z为某视频总时长，令I_t表示第t个采集到的视频帧，第t帧对应到第t时刻。Uniform frame acquisition for each video in the training dataset. over the entire video span, with is the time interval, and evenly collects T RGB frames [I₁ , I₂ ,..., I_T ], where T_z is the total duration of a video, let I_t represent the t-th collected video frame, and the t-th The frame corresponds to time t.

步骤3：扩充训练数据集。Step 3: Expand the training dataset.

将从每个视频采集到的视频帧全部进行反转操作，使之成为新的视频，从而扩充训练数据集，使视频数据集中的视频数目为之前的2倍。All the video frames collected from each video are reversed to make them into new videos, thereby expanding the training data set, so that the number of videos in the video data set is twice as many as before.

步骤4：提取训练视频帧的多层卷积特征。Step 4: Extract the multi-layer convolutional features of the training video frames.

首先，从标准的CNN(卷积神经网络)架构中选取M个卷积层，用于提取视频帧的多层卷积特征。由于识别行为通常需要诸如物体部分或身体部分在内的高水平语义信息，因此本发明从卷积网络的顶层卷积层中选择用于产生特征图的M个卷积层。First, M convolutional layers are selected from the standard CNN (Convolutional Neural Network) architecture for extracting multi-layer convolutional features of video frames. Since recognition actions usually require high-level semantic information such as object parts or body parts, the present invention selects M convolutional layers for generating feature maps from the top convolutional layers of a convolutional network.

之后，将采集到的视频V的T个RGB帧[I₁,I₂,...,I_T]输入到该卷积网络中，并提取每个RGB帧在这M个卷积层中产生的特征图。对于每个RGB帧，在每个选定卷积层都会获得空间大小为N×N，包含C个通道的特征图。对于整个视频V，将会获得M×T个空间大小为N×N，包含C个通道的特征图。After that, input T RGB frames [I₁ , I₂ ,..., I_T ] of the captured video V into the convolutional network, and extract each RGB frame to generate in the M convolutional layers feature map. For each RGB frame, a feature map of spatial size N×N containing C channels is obtained at each selected convolutional layer. For the entire video V, M×T feature maps of size N×N with C channels will be obtained.

步骤5：对视频帧的多层特征图进行特征聚合，得到视频级表示。具体方法如下：Step 5: Perform feature aggregation on the multi-layer feature maps of video frames to obtain video-level representations. The specific method is as follows:

步骤5.1：使用局部演化排序池化方法，提取视频V的局部演化描述符。Step 5.1: Extract the local evolution descriptors of video V using the local evolution ranking pooling method.

将视频V的多帧在同一卷积层下得到的T个特征图作为输入，然后将每帧的特征图分解为一组局部空间特征，最后对每个空间位置的局部空间特征的演化信息进行建模生成局部演化描述符。具体方法如下：The T feature maps obtained from multiple frames of video V under the same convolutional layer are used as input, and then the feature map of each frame is decomposed into a set of local spatial features, and finally the evolution information of the local spatial features of each spatial position is analyzed. Modeling generates local evolution descriptors. The specific method is as follows:

步骤5.1.1：经步骤4，视频V的T帧[I₁,I₂,...,I_T]中的每一帧在某选定卷积层均获取空间大小N×N并且包含C个通道的特征图，这些特征图表示为[fm₁,fm₂,...,fm_T]。分别连接每个特征图上每个空间位置上所有通道的值，t∈{1,...,T}，从而将每个特征图分解为多个局部空间特征。对于每一帧，将获得N×N个C维的局部空间特征。Step 5.1.1: After step 4, each frame in_T frames [I₁ , I₂ , . feature maps of channels, these feature maps are denoted as [fm₁ ,fm₂ ,...,fm_T ]. The values of all channels at each spatial location on each feature map are concatenated separately, t ∈ {1,...,T}, thereby decomposing each feature map into multiple local spatial features. For each frame, N×N C-dimensional local spatial features will be obtained.

步骤5.1.2：对T帧[I₁,I₂,...,I_T]的每个空间位置的演化信息进行建模，生成视频V局部演化描述符。具体方法如下：Step 5.1.2: Model the evolution information of each spatial position of T frames [I₁ , I₂ , ..., I_T ] to generate video V local evolution descriptors. The specific method is as follows:

步骤5.1.2.1：对于某一个特定的空间位置，将T帧的局部空间特征按照时间顺序排列表示为[r_i1,r_i2,…,r_it,...,r_iT]，其中i＝{1,...,N×N}，为第t时刻上第i个空间位置的局部空间特征，为C维的实数向量空间，即r_it为C维实数向量空间上的向量。Step 5.1.2.1: For a specific spatial position, the local spatial features of T frames are arranged in time order as [r_i1 ,r_i2 ,...,r_it ,...,r_iT ], where i={ 1,...,N×N}, is the local spatial feature of the i-th spatial position at the t-th time, is a C-dimensional real number vector space, that is, r_it is a vector on the C-dimensional real number vector space.

步骤5.1.2.2：建模第i个空间位置的演化信息。定义一个排序(Rank)函数，为每一个时刻计算一个分数值：Step 5.1.2.2: Model the evolution information of the ith spatial location. Define a Rank function to calculate a score value for each moment:

S(t,i∣e)＝e^Td_it (1)S(t,i∣e)=e^T d_it (1)

其中，为第t时刻上第i个空间位置的平均局部空间特征，本发明设定一个约束关系：后面时刻对应的分数值大于前面的时刻对应的分数值，即参数e可以反映这些局部空间特征的时间顺序。对参数e进行学习可以认为是一个凸优化问题：in, is the average local spatial feature of the i-th spatial position at the t-th time, The present invention sets a constraint relationship: the score value corresponding to the later moment is greater than the score value corresponding to the previous moment, that is, The parameter e can reflect the temporal order of these local spatial features. Learning the parameter e can be thought of as a convex optimization problem:

目标函数E(e)的第一项是通用的二次正则化项，第二项是软计数损失函数hinge-loss。The first term of the objective function E(e) is the general quadratic regularization term, and the second term is the soft count loss function hinge-loss.

步骤5.1.2.3：优化目标函数E(e)，将一系列局部空间特征映射到向量e^★上。e^★包含对这些局部空间特征的排序信息，即为局部演化描述符。本方法使用近似技术解决方程式的优化问题，从而将该操作嵌入CNN网络当中。最终，上述目标函数的解简化为：Step 5.1.2.3: Optimize the objective function E(e), and map a series of local spatial features to the vector e^★ . e^★ contains the ranking information of these local spatial features, which is the local evolution descriptor. This method uses approximation techniques to solve the optimization problem of the equations, thereby embedding this operation in the CNN network. Ultimately, the solution to the above objective function simplifies to:

其中，α_t＝2(T-t+1)-(T+1)(H_T-H_t-1)，为参数，该权重通过排序池化(RankPooling)得到。上述解看作第i个空间位置在T个采集到的时刻上的局部空间特征的加权相加。where α_t =2(T-t+1)-(T+1)(H_T -H_t-1 ), is a parameter, and the weight is obtained by ranking pooling (RankPooling). The above solution can be regarded as the weighted addition of the local spatial features of the ith spatial position at the T collected moments.

步骤5.1.2.4：基于上述排序函数的近似解，设计局部演化排序池化层。该层输入T帧N×N×C大小的卷积特征图，输出N×N个C维的局部演化描述符向量[e₁,e₂,...,e_N×N]。Step 5.1.2.4: Based on the approximate solution of the above ranking function, design a local evolution ranking pooling layer. This layer inputs a convolutional feature map of size N×N×C of T frames, and outputs N×N C-dimensional local evolution descriptor vectors [e₁ ,e₂ ,...,e_N×N ].

步骤5.2：使用基于局部演化描述符的VLAD(局部聚合向量)编码方法，将视频的局部演化描述符编码为基于元动作的视频级的表示。Step 5.2: The local evolution descriptor of the video is encoded into a meta-action based video-level representation using the local evolution descriptor-based VLAD (local aggregation vector) encoding method.

本方法基于“一个动作是由一组元动作组成”的想法，提出了基于局部演化描述符的VLAD编码方法，将多个局部演化描述符编码成基于元动作的表示，从而构建紧凑的语义级别的表示。具体步骤如下：Based on the idea that an action is composed of a set of meta-actions, this method proposes a local evolution descriptor-based VLAD encoding method, which encodes multiple local evolution descriptors into meta-action-based representations, thereby constructing a compact semantic level representation. Specific steps are as follows:

步骤5.2.1：使用K个元动词单词，将特征空间划分为K单元，设每个单元的锚定点为a_k。Step 5.2.1: Using the K metaverb words, the feature space It is divided into K units, and the anchor point of each unit is set as a_k .

步骤5.2.2：将步骤5.1中得到的视频V的一系列局部演化描述符[e₁,e₂,...,e_N×N]中的每个局部演化描述符，分配给上述K个单元中的其中一个单元，并记录局部演化描述符e_i与锚定点a_k之间的残差向量。Step 5.2.2: Assign each local evolution descriptor in a series of local evolution descriptors [e₁ , e₂ , ..., e_N×N ] of the video V obtained in step 5.1 to the above K one of the cells, and record the residual vector between the local evolution descriptor e_i and the anchor point a_k .

步骤5.2.3：将残差向量进行求和。Step 5.2.3: Sum the residual vectors.

式(4)中，表示描述符e_i的软分配，锚定点a_k在该公式中是一个可通过训练调节的超参数；e_i-a_k表示局部演化描述符与第k个锚定点之间的残差。通过公式得到的h_k表示第k个单元中的聚合描述符。In formula (4), represents the soft assignment of the descriptor e_i , and the anchor point a_k is a hyperparameter that can be adjusted by training in this formulation; e_i_-ak represents the residual between the locally evolved descriptor and the kth anchor point. The h_k obtained by formula represents the aggregated descriptor in the kth unit.

步骤5.2.4：得到该视频的局部演化描述符与每个锚定点间的残差之和，视频V可表示为v＝[h₁,h₂,...,h_K]，C为实数空间的维度，K为元动作单元的个数，所以，ν为实数空间上C×K大小的矩阵。Step 5.2.4: Obtain the sum of the residuals between the local evolution descriptor of the video and each anchor point, the video V can be expressed as v=[h₁ , h₂ ,...,h_K ], C is the dimension of the real number space, and K is the number of meta-action units. Therefore, ν is a matrix of size C×K in the real number space.

基于上式可微分，且允许将误差梯度反向传播到网络的较低层，因此本发明设计了基于局部演化描述符的VLAD编码层。Based on the differentiability of the above formula and allowing back-propagation of the error gradient to the lower layers of the network, the present invention designs a VLAD encoding layer based on local evolution descriptors.

步骤6：对于选取的M个卷积层，并行在每一层进行上述步骤5、步骤6操作，得到该视频在每个选定卷积层的视频级特征表示。Step 6: For the selected M convolutional layers, perform the above steps 5 and 6 operations on each layer in parallel to obtain the video-level feature representation of the video in each selected convolutional layer.

对多个卷积层得到的视频级表示进行动作识别，是本发明提出的基于深度监督的动作识别方法。Action recognition on video-level representations obtained from multiple convolutional layers is a deep-supervised action recognition method proposed in the present invention.

步骤7：将步骤6中得到的在每一层的视频级表示输入到相应的分类器中，得到视频V在M个选定卷积层上的分类结果。具体方法如下：Step 7: Input the video-level representation at each layer obtained in Step 6 into the corresponding classifier to obtain the classification results of video V on the M selected convolutional layers. The specific method is as follows:

步骤7.1：为了整合网络的卷积和聚合操作中的所有参数，定义：Step 7.1: To integrate all parameters in the convolution and aggregation operations of the network, define:

其中，B表示卷积层的总数。设b＝{1,...,B}，表示第b个卷积层的参数。M表示本发明选取的卷积层的个数，由于在每个选取的卷积层上均得到一个分类结果，故每个选取的卷积层与一个特征聚合操作和一个分类器相连，所以特征聚合操作的个数为M，分类器的个数也为M。设m＝{1,...,M}，故表示第m个选取的卷积层上的特征聚合操作的权重，表示第m个选取的卷积层上所连分类器的权重。where B represents the total number of convolutional layers. Let b={1,...,B}, Represents the parameters of the bth convolutional layer. M represents the number of convolutional layers selected in the present invention. Since a classification result is obtained on each selected convolutional layer, each selected convolutional layer is connected with a feature aggregation operation and a classifier, so the features The number of aggregation operations is M, and the number of classifiers is also M. Let m={1,...,M}, so represents the weight of the feature aggregation operation on the m-th selected convolutional layer, Indicates the weight of the classifier connected to the m-th selected convolutional layer.

步骤7.2：定义合并所有输出层分类错误的损失函数：Step 7.2: Define a loss function that incorporates all output layer classification errors:

其中，L表示动作分类的视频级交叉熵损失函数，定义为：where L represents the video-level cross-entropy loss function for action classification, defined as:

其中，g为视频V的真实标签，g∈A，A＝{A₁,...,A_z}定义了所有动作类别，类别数量为Z，A_i表示动作集A中的第i个动作类别，s^m表示第m个卷积层预测得到的动作类别。Among them, g is the real label of the video V, g∈A, A={A₁ ,...,A_z } defines all action categories, the number of categories is Z, and A_i represents the ith action in the action set A category, s^m denotes the action category predicted by the mth convolutional layer.

步骤8：将M个选定卷积层的分类结果进行集成。Step 8: Integrate the classification results of the M selected convolutional layers.

本发明提出了一种分类集成方法来融合多层级的预测结果，该方法对在各个卷积层得到分值使用对应的权值求和，以充分利用多层级信息的互补性。对应的权值通过基于注意力的方法分配。具体方法如下：The present invention proposes a classification and integration method to fuse multi-level prediction results. The method sums up the scores obtained in each convolution layer using corresponding weights, so as to make full use of the complementarity of multi-level information. The corresponding weights are assigned by an attention-based method. The specific method is as follows:

步骤8.1：令融合后的预测结果F表示为：Step 8.1: Let the fused prediction result F be expressed as:

其中，表示集成权重，其中是一个Z维的向量，通过注意力(Attention)机制分配权重得到，s^m表示第m个卷积层预测的动作类别。in, represents the ensemble weight, where is a Z-dimensional vector, obtained by assigning weights through the Attention mechanism, and s^m represents the action category predicted by the mth convolutional layer.

集成层的损失函数定义为：The loss function of the ensemble layer is defined as:

其中，y＝argmax(F)表示最终预测得到的动作类别，为最终预测动作类别为A_i的概率。Among them, y=argmax(F) represents the final predicted action category, is the probability that the final predicted action category is A_i .

步骤8.2：在训练集上最小化以下目标函数，学习得到所有的参数W,w_c，w_f：Step 8.2: Minimize the following objective function on the training set and learn all parameters W, w_c , w_f :

步骤9：使用梯度下降算法优化上述损失函数，通过反向传播调整模型参数，直至损失函数收敛。此时，该基于可训练特征融合的深度卷积神经网络行为识别模型已训练完成。Step 9: Use the gradient descent algorithm to optimize the above loss function, and adjust the model parameters through backpropagation until the loss function converges. At this point, the deep convolutional neural network behavior recognition model based on trainable feature fusion has been trained.

步骤10：使用步骤9中训练好的模型，对未知视频V′中的人物行为进行识别。具体步骤如下：Step 10: Use the model trained in Step 9 to identify the behavior of the characters in the unknown video V'. Specific steps are as follows:

步骤10.1：将未知视频V′按照步骤1和步骤2中的方法进行预处理以及帧采样，得到对V′均匀采集的T个RGB帧[I′₁,I′₂,...,I′_T]。Step 10.1: Perform preprocessing and frame sampling on the unknown video V' according to the methods in Step 1 and Step 2 to obtain T RGB frames [I'₁ , I'₂ ,...,I' uniformly collected for V'_T ].

步骤10.2：按照步骤4所述方法，提取未知视频的多层卷积特征。对于V′的每个RGB帧，在每个选定卷积层都将会获取空间大小为N×N，包含C个通道的特征图。对于整个未知视频V′，将会获得M×T个空间大小为N×N，包含C个通道的特征图。Step 10.2: According to the method described in step 4, extract the multi-layer convolution features of the unknown video. For each RGB frame of V′, a feature map of spatial size N×N containing C channels will be acquired at each selected convolutional layer. For the entire unknown video V′, M×T feature maps with a spatial size of N×N and C channels will be obtained.

步骤10.3，按照步骤5、步骤6所述方法，得到V′在M个选定卷积层每一层上的视频级特征表示。具体步骤如下：Step 10.3, according to the methods described in steps 5 and 6, to obtain the video-level feature representation of V′ on each of the M selected convolutional layers. Specific steps are as follows:

首先，按照步骤5.1所述方法，使用局部演化排序池化方法得到V′在每一选定卷积层上的N×N个C维的局部演化描述符向量[e′₁,e′₂,...,e′_N×N]，First, according to the method described in step 5.1, use the local evolution sorting pooling method to obtain N×N C-dimensional local evolution descriptor vectors [e′₁ ,e′₂ , ...,e′_N×N ],

然后，按照步骤5.2所述方法，使用基于局部演化描述符的VLAD编码方式将[e′₁,e′₂,...,e′_N×N]编码为基于元动作的视频级表示v′＝[h′₁,h′₂,...,h′_K]，Then, according to the method described in step 5.2, [e′₁ ,e′₂ ,...,e′_N×N ] is encoded into the meta-action based video-level representation v′ using the local evolution descriptor-based VLAD encoding method =[h′₁ ,h′₂ ,...,h′_K ],

最后，按照步骤6所述方法，在M个选定卷积层上并行进行上述操作，在每一层上得到V′的视频级表示。Finally, following the method described in step 6, the above operations are performed in parallel on the M selected convolutional layers, and a video-level representation of V' is obtained on each layer.

步骤10.4：按照步骤7所述方法，获得V′在M个选定卷积层上的分类结果，s^′m表示V′在第m个卷积层上预测得到的动作类别结果。按照步骤8所述方法，使用分类集成方法对多层的分类结果进行集成，得到最终对未知视频的分类结果。F′表示融合后的预测结果：Step 10.4: According to the method described in Step 7, obtain the classification results of V' on the M selected convolutional layers, where^s'm represents the action category result predicted by V' on the mth convolutional layer. According to the method described in step 8, the classification results of the multi-layers are integrated by using the classification integration method to obtain the final classification result of the unknown video. F' represents the prediction result after fusion:

其中，是一个Z维的向量，s^′m表示第m个卷积层预测的动作类别。in, is a Z-dimensional vector, and^s′m represents the action category predicted by the mth convolutional layer.

上述过程执行完毕后，即可得到对未知视频中人物行为的预测结果。After the above process is completed, the prediction result of the behavior of the characters in the unknown video can be obtained.

有益效果beneficial effect

本发明对比现有技术，具有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

(1)所提出特征聚合操作将局部演化排序池化操作和基于局部演化描述符的VLAD编码操作合二为一，并提出局部演化排序池化层和基于局部演化描述符的VLAD编码层，简化了方法的实施；(1) The proposed feature aggregation operation combines the local evolution sorting pooling operation and the local evolution descriptor-based VLAD coding operation into one, and proposes a local evolution sorting pooling layer and a local evolution descriptor-based VLAD coding layer to simplify implementation of the method;

(2)所提出的局部演化排序池化方法，通过建模每个空间位置的时间演化信息来捕获更多关于动作的细节；(2) The proposed local evolution ranking pooling method captures more details about actions by modeling the temporal evolution information of each spatial location;

(3)所提出的基于局部演化描述符的VLAD编码方式通过将局部演化描述符投影到一个语义空间，生成了更具有判别力的视频表示；(3) The proposed local evolution descriptor-based VLAD encoding method generates more discriminative video representations by projecting the local evolution descriptors into a semantic space;

(4)所提出深度监督动作识别方法在单个网络中构建多层级的视频表示，并产生多个预测结果；(4) The proposed deep-supervised action recognition method builds multi-level video representations in a single network and produces multiple prediction results;

(5)所提出多层级分类结果集成方法通过集成多层级的预测结果提高了网络中间层的判别性，从而提高了网络整体的性能。(5) The proposed multi-level classification result integration method improves the discriminability of the middle layer of the network by integrating the multi-level prediction results, thereby improving the overall performance of the network.

附图说明Description of drawings

图1为本发明整体逻辑结构图。FIG. 1 is an overall logical structure diagram of the present invention.

图2为本发明方法的步骤详述及参数传播。包含模型训练步骤以及本发明所提出的特征聚合方法，以及深度监督动作识别方法。FIG. 2 is a detailed description of the steps and parameter propagation of the method of the present invention. It includes model training steps, the feature aggregation method proposed by the present invention, and the deep supervised action recognition method.

图3为本发明方法的流程图。Figure 3 is a flow chart of the method of the present invention.

具体实施方式Detailed ways

下面将结合附图对本发明的具体实施方法做进一步的详细说明。The specific implementation method of the present invention will be further described in detail below with reference to the accompanying drawings.

本发明的执行环境是有计算机实现以下三个主要功能构成：一、多层卷积特征提取功，该功能是提取视频每一帧的在多层特征图。二、特征聚合功能，包含局部演化描述池化层，该层的功能是将每一层得到的多帧特征图编码为局部演化描述符；以及基于局部演化描述符的VLAD编码层，该层的功能是将局部演化描述符编码成基于元动作的视频级表示。三、深度监督动作识别方法，该方法的功能是用上面得到的多层的视频级表示来识别视频中的人物动作，并将多层的分类结果进行集成得到最终的预测结果。本发明的整体逻辑结构图见图1。The execution environment of the present invention is composed of the following three main functions realized by a computer: 1. The multi-layer convolution feature extraction function, which is to extract the multi-layer feature map of each frame of the video. Second, the feature aggregation function, including the local evolution description pooling layer, the function of this layer is to encode the multi-frame feature maps obtained by each layer into local evolution descriptors; and the VLAD encoding layer based on local evolution descriptors, the layer's The function is to encode local evolution descriptors into meta-action-based video-level representations. 3. Deeply supervised action recognition method. The function of this method is to use the multi-layer video-level representation obtained above to recognize the human action in the video, and integrate the multi-layer classification results to obtain the final prediction result. The overall logical structure diagram of the present invention is shown in FIG. 1 .

如图3所示，为本发明一种基于可训练特征融合的深度监督卷积神经网络行为识别方法的流程图。As shown in FIG. 3 , it is a flow chart of a deep-supervised convolutional neural network behavior recognition method based on the fusion of trainable features of the present invention.

下面对本发明提出一种基于可训练特征融合的深度监督卷积神经网络行为识别方法的具体实施例作更详细的描述。The specific embodiment of the deep supervised convolutional neural network behavior recognition method based on the fusion of trainable features provided by the present invention will be described in more detail below.

根据附图3中(b)所示模型训练阶段流程图，模型训练阶段的具体实施方法为：According to the flow chart of the model training stage shown in (b) in the accompanying drawing, the specific implementation method of the model training stage is:

步骤1：对训练视频数据集中的视频进行预处理，提取全部视频帧，并裁剪成尺寸为224px×224px。Step 1: Preprocess the videos in the training video dataset, extract all video frames, and crop them to a size of 224px × 224px.

步骤2：对训练视频中的每个视频，以时间间隔为均匀采集10个RGB帧[I₁,I₂,...,I₁₀]，T_z为某视频总时长，I_t表示某视频第t个采集到视频帧，为了方便起见，某训练视频的第t帧对应为其的第t时刻。Step 2: For each video in the training video, the time interval is Evenly collect 10 RGB frames [I₁ , I₂ ,..., I₁₀ ], T_z is the total duration of a video, and It represents the_t -th video frame collected from a video. For convenience, the length of a training video is The t-th frame corresponds to its t-th time instant.

步骤3：将数据集中每个视频采集到的视频帧均进行反转操作，使之成为新的视频以扩充训练数据集，使得视频数据集中视频数目为之前的2倍。Step 3: Invert the video frames collected from each video in the data set to make them into new videos to expand the training data set, so that the number of videos in the video data set is twice as many as before.

步骤4：提取训练视频帧的多层卷积特征，本发明在预训练好的CNN架构中选取了3个卷积层：Mixed5_a层、Mixed5_b层和Mixed5_c层用于产生视频帧的特征图。将采集到的视频V的10个RGB帧[I₁,I₂,...,I₁₀]输入到该卷积网络中，对于每个RGB帧，在每个选定卷积层都会获得空间大小为64×64，包含3个通道的特征图。对于整个视频V，将会获得3×10个空间大小为64×64，包含3个通道的特征图。Step 4: Extract the multi-layer convolutional features of the training video frame. The present invention selects three convolutional layers in the pre-trained CNN architecture: the Mixed5_a layer, the Mixed5_b layer and the Mixed5_c layer for generating the feature map of the video frame. 10 RGB frames [I₁ ,I₂ ,...,I₁₀ ] of the captured video V are input into this convolutional network, for each RGB frame, a space is obtained at each selected convolutional layer The size is 64×64 and contains feature maps of 3 channels. For the entire video V, 3 × 10 feature maps with spatial size of 64 × 64 and 3 channels will be obtained.

步骤5：对视频帧的多层特征图进行特征聚合，得到视频级的表示，具体方法如下：Step 5: Perform feature aggregation on the multi-layer feature map of the video frame to obtain a video-level representation. The specific method is as follows:

步骤5.1，将每个训练视频采集的RGB帧输入到局部演化排序池化层得到每个训练视频的局部演化描述符。Step 5.1, input the RGB frames collected from each training video to the local evolution sorting pooling layer to obtain the local evolution descriptor of each training video.

步骤5.1.1，经过步骤4中，训练视频V的10帧[I₁,I₂,...,I₁₀]中的每一帧在Mixed5_a层均获取空间大小64×64并且包含3个通道的特征图，这些特征图可表示为[fm₁,fm₂,...,fm₁₀]。连接m_t上每个空间位置上所有通道的值，t∈{1,...,10}，从而将fm_t特征图分解为64×64个3维的局部空间特征。Step 5.1.1, after step 4, each frame in the 10 frames [I₁ , I₂ ,..., I₁₀ ] of the training video V obtains a spatial size of 64 × 64 in the Mixed5_a layer and contains 3 channels The feature maps of , which can be represented as [fm₁ , fm₂ ,...,fm₁₀ ]. Concatenate the values of all channels at each spatial location on m_t , t ∈ {1,...,10}, thereby decomposing the fm_t feature map into 64 × 64 3-dimensional local spatial features.

步骤5.1.2，对T帧[I₁,I₂,...,I₁₀]的每个空间位置的演化信息进行建模，生成视频V局部演化描述符，具体方法如下：Step 5.1.2, model the evolution information of each spatial position of the T frame [I₁ , I₂ ,..., I₁₀ ], and generate a local evolution descriptor of the video V. The specific method is as follows:

步骤5.1.2.1，将某一特定空间位置i的局部空间特征按照时间顺序进行排序，得到表示[r_i1,r_i2,…r_it,…,r_i10]，其中i＝{1,...,64}，为第t时刻上第i个空间位置的局部空间特征，为3维的实数向量空间，即r_it为3维实数向量空间上的向量。Step 5.1.2.1, sort the local spatial features of a specific spatial position i in chronological order to obtain the representation [r_i1 ,r_i2 ,…r_it ,…,r_i10 ], where i={1,… ,64}, is the local spatial feature of the i-th spatial position at the t-th time, is a 3-dimensional real vector space, that is, r_it is a vector on the 3-dimensional real vector space.

步骤5.1.2.2，使用排序函数S(t,i∣e)＝e¹⁰d_it为每一个时刻t计算一个分数值，其中为第t时刻上第i个空间位置的平均局部空间特征，1～10对应为时刻，设q∈{1,...,10}为t∈{1,...,10}之后的时刻，则有S(q,i∣e)>S(t,i∣e)。找出所有满足条件的q>t，计算E(e)：Step 5.1.2.2, use the sorting function S(t,i∣e)=e¹⁰ d_it to calculate a score value for each time t, where is the average local spatial feature of the i-th spatial position at the t-th time, 1 to 10 correspond to moments. Let q∈{1,...,10} be the moment after t∈{1,...,10}, then S(q,i∣e)>S(t, i∣e). Find all q>t satisfying the condition, and calculate E(e):

步骤5.1.2.3，优化E(e)，将一系列局部空间特征映射到一个向量e^★。e^★即为该训练视频的局部演化描述符：Step 5.1.2.3, optimize E(e), map a series of local spatial features to a vector e^★ . e^★ is the local evolution descriptor of the training video:

e^＊＝argmin_eE(e)e^* = argmin_e E(e)

使用近似技术简化E(e)的解为：Simplifying the solution of E(e) using approximation techniques is:

其中，α_t＝2(10-t+1)-(10+1)(H₁₀-H_t-1)，该权重通过排序池化(RankPooling)得到。上述解可以看作第i个空间位置在所有采集到的10个时刻的局部空间特征的加权相加。where α_t =2(10-t+1)-(10+1)(H₁₀ -H_t-1 ), The weights are obtained through RankPooling. The above solution can be regarded as the weighted addition of the local spatial features of the ith spatial position at all 10 collected moments.

步骤5.1.2.4，学得的e向量即为该训练视频第i个空间位置的局部演化描述符，输入整个训练视频，在Mixed5_a层将得到64×64个3维的局部演化描述符向量[e₁,e₂,...,e_64×64]。Step 5.1.2.4, the learned e vector is the local evolution descriptor of the ith spatial position of the training video, input the entire training video, and get 64 × 64 3-dimensional local evolution descriptor vectors in the Mixed5_a layer [e₁ ,e₂ ,...,e_64×64 ].

步骤5.2：将每个训练视频的局部演化描述符向量输入到基于局部演化描述符的VLAD编码层得到每个训练视频的视频级表示。Step 5.2: Input the local evolution descriptor vector of each training video into the VLAD coding layer based on the local evolution descriptor to obtain the video-level representation of each training video.

步骤5.2.1，使用32个元动词单词将特征空间划分为32个单元，然后将局部演化描述符e₁,e₂,...,e_64×64分配给这32个单元中的其中一个单元。记录局部演化描述符e_i与每个元动作锚定点a_k之间的残差向量(e_i-a_k)。Step 5.2.1, using 32 metaverb words to convert the feature space Divide into 32 units, then assign local evolution descriptors e₁ ,e₂ ,...,e_64×64 to one of these 32 units. The residual_vector (_ei_−ak ) between the local evolution descriptor ei and each meta-action anchor_ak is recorded.

步骤5.2.2，将这些残差向量求和，得到第k个单元中的聚合描述符h_k。Step 5.2.2, sum these residual vectors to obtain the aggregated descriptor h_k in the kth unit.

步骤5.2.3，该训练视频可以表示为v＝[h₁,h₂,...,h₃₂]，v为实数空间上3×32大小的矩阵。Step 5.2.3, the training video can be expressed as v=[h₁ , h₂ ,...,h₃₂ ], v is a matrix of size 3 × 32 in the real space.

步骤6：并行在Mixed5_a层Mixed5_b层和Mixed5_c层执行上述步骤5中操作，得到每个训练视频在这3个卷积层上的视频级表示。Step 6: Perform the operations in Step 5 above on the Mixed5_a layer, Mixed5_b layer and Mixed5_c layer in parallel to obtain the video-level representation of each training video on these 3 convolutional layers.

步骤7：获得训练视频在多个卷积层的分类结果。Step 7: Obtain the classification results of training videos at multiple convolutional layers.

将步骤6中得到的在每一层的视频级表示输入到相应的分类器中得到在该卷积层的分类结果。具体方法如下：The video-level representation at each layer obtained in step 6 is input into the corresponding classifier to obtain the classification result at the convolutional layer. The specific method is as follows:

步骤7.1，定义参数，整个网络卷积层的总数为B，第b个卷积层的参数表示为本发明选取的卷积层为Mixed5_a层Mixed5_b层和Mixed5_c层3层，由于在每个选取的卷积层上均得到一个分类结果，故每个选取的卷积层与一个特征聚合操作和一个分类器相连，所以特征聚合操作的个数为3，分类器的个数也为3。则第m个选取的卷积层上的特征聚合操作的权重为第m个选取的卷积层上所连分类器的权重为Step 7.1, define the parameters, the total number of convolutional layers in the entire network is B, and the parameters of the bth convolutional layer are expressed as The convolutional layers selected in the present invention are the Mixed5_a layer, the Mixed5_b layer and the Mixed5_c layer. Since a classification result is obtained on each selected convolutional layer, each selected convolutional layer is associated with a feature aggregation operation and a classification The classifiers are connected, so the number of feature aggregation operations is 3, and the number of classifiers is also 3. Then the weight of the feature aggregation operation on the m-th selected convolutional layer is The weight of the classifier connected to the m-th selected convolutional layer is

步骤7.2，合并所有输出层分类错误的损失函数定义为：Step 7.2, the loss function for merging all output layer classification errors is defined as:

其中L表示动作分类的视频级交叉熵损失函数。where L represents the video-level cross-entropy loss function for action classification.

设A＝{A₁,...,A₅₁}定义了训练数据集中所有的动作类别，类别数量为51类。该训练视频的真实标签为g∈A，s^m表示第m个卷积层预测的动作类别。则交叉熵损失函数为：Let A={A₁ ,...,A₅₁ } define all action categories in the training data set, and the number of categories is 51 categories. The ground-truth label of this training video is g∈A, and sm denotes the action category predicted by the^mth convolutional layer. Then the cross entropy loss function is:

步骤8：将多层的分类结果进行集成。Step 8: Integrate the classification results of multiple layers.

步骤8.1，集成后的预测结果为：其中表示集成权重，其中是一个Z维的向量，通过注意力分配权重得到。集成层的损失函数定义为：Step 8.1, the integrated prediction result is: in represents the ensemble weight, where is a Z-dimensional vector obtained by assigning weights through attention. The loss function of the ensemble layer is defined as:

其中，表示最终预测得到的动作类别，P(y＝A_i∣V,W,wcm,wf为最终预测动作类别为Ai的概率。in, Indicates the final predicted action category, P(y=A_i ∣V,W,wcm,wf is the probability that the final predicted action category is Ai.

步骤8.2，最小化目标函数Step 8.2, Minimize the objective function

学习得到所有的参数W,w_c，w_f。 Learn to get all parameters W, w_c , w_f .

步骤9：使用梯度下降算法优化损失函数通过反向传播调整模型参数，直至损失函数收敛，此时该基于可训练特征融合的深度卷积神经网络行为识别模型已训练完成。Step 9: Optimizing Loss Function Using Gradient Descent Algorithm Adjust the model parameters through backpropagation until the loss function converges, at which point the training of the deep convolutional neural network behavior recognition model based on trainable feature fusion has been completed.

步骤10：使用步骤9中训练好的模型对未知视频V′中的人物行为进行识别，具体步骤如下：Step 10: Use the model trained in Step 9 to identify the behavior of characters in the unknown video V', and the specific steps are as follows:

步骤10.1，对输入的未知视频按照步骤1和步骤2进行预处理以及帧采样，提取未知视频全部视频帧并裁剪成尺寸为224px×224px。以时间间隔为均匀采集10个RGB帧[I′₁,I′₁,...,I′₁₀]，0.4s为未知视频总时长，I′_t表示某视频第t个采集到视频帧。Step 10.1, perform preprocessing and frame sampling on the input unknown video according to steps 1 and 2, extract all video frames of the unknown video and crop them into a size of 224px×224px. with time interval of Evenly collect 10 RGB frames [I′₁ , I′₁ ,...,I′₁₀ ], 0.4s is the total duration of the unknown video, and I′_t represents the t-th video frame collected from a certain video.

步骤10.2，按照步骤4中的方法，提取未知视频的多层卷积特征，对于V′的每个RGB帧，在每个选定卷积层都会获得空间大小为64×64，包含3个通道的特征图。对于整个未知视频V′，将会获得3×10个空间大小为64×64，包含3个通道的特征图。Step 10.2, according to the method in step 4, extract the multi-layer convolutional features of the unknown video, for each RGB frame of V′, a spatial size of 64×64 is obtained in each selected convolutional layer, including 3 channels feature map. For the entire unknown video V', 3 × 10 feature maps with a spatial size of 64 × 64 and 3 channels will be obtained.

步骤10.3：按照步骤5、步骤6中的方法，得到V′在3个选定卷积层每一层上的视频级特征表示，具体步骤如下：Step 10.3: According to the methods in Step 5 and Step 6, obtain the video-level feature representation of V′ on each of the three selected convolutional layers. The specific steps are as follows:

首先按照步骤5.1中的方法，使用局部演化排序池化方法得到V′在每一选定卷积层上的64×64个3维的局部演化描述符向量[e′₁,e′₁,...,e′_64×64]，First, according to the method in step 5.1, use the local evolution sorting pooling method to obtain 64×64 3-dimensional local evolution descriptor vectors [e′₁ ,e′₁ ,. ..,e′_64×64 ],

然后按照步骤5.2中的方法，使用基于局部演化描述符的VLAD编码方式将[e′₁,e′₁,...,e′_64×64]编码为基于元动作的视频级表示v′＝[h′₁,h′₂,...,h′₃₂]，v′为实数空间上3×32大小的矩阵。Then, according to the method in step 5.2, [e′₁ ,e′₁ ,...,e′_64×64 ] is encoded into a meta-action based video-level representation v′= [h′₁ ,h′₂ ,...,h′₃₂ ], v' is a 3×32 matrix in the real space.

最后按照步骤6中的方法，在3个选定卷积层Mixed5_a层、Mixed5_b层和Mixed5_c层上并行进行上述操作，在每一层上得到V′的视频级表示。Finally, according to the method in step 6, the above operations are performed in parallel on the 3 selected convolutional layers Mixed5_a, Mixed5_b, and Mixed5_c, and a video-level representation of V' is obtained on each layer.

步骤10.4，按照步骤7中的方法获得V′在3个选定卷积层上的分类结果，s^′m表示未知视频V′在第m个卷积层上预测得到的动作类别结果。按照步骤8中的方法，使用分类集成方法对多层的分类结果进行集成，得到最终对未知视频的分类结果：其中表示集成权重。Step 10.4, according to the method in step 7, obtain the classification results of V' on the three selected convolutional layers,^s'm represents the action category result predicted by the unknown video V' on the mth convolutional layer. According to the method in step 8, use the classification integration method to integrate the multi-layer classification results to obtain the final classification result of the unknown video: in represents the ensemble weight.

上述过程执行完毕后，即可得到对未知视频中人物行为的预测结果为“跑步”。After the above process is completed, the prediction result of the behavior of the characters in the unknown video can be obtained as "running".

以上所述的具体描述，对发明的目的、技术方案和有益效果进行了进一步详细说明，所应理解的是，以上所述仅为本发明的具体实施例而已，并不用于限定本发明的保护范围，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above-mentioned specific descriptions further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above-mentioned descriptions are only specific embodiments of the present invention, and are not intended to limit the protection of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.