


Technical Field
The invention relates to a video classification method based on dynamic and static features, and belongs to the cross-disciplinary fields of behavior recognition and machine learning.
Background Art
In recent years, behavior recognition and classification in video has become an important research topic in computer vision, with significant theoretical and practical value.
With the development of the economy and society and the advancement of science and technology, the recognition, analysis and understanding of human activity in videos has become an important subject in both the social and natural sciences, with wide applications in security surveillance, smart city construction, sports and health care. Compared with action recognition in still images, video analysis must cope with background changes, the tracking of dynamic objects and high-dimensional data, and is therefore more challenging.
Human behavior recognition in video mainly involves two parts: the processing of static information such as the background, and the tracking and recognition of dynamic objects. For video classification, two questions must be settled: how to extract the static and the dynamic information without the two interfering with each other while still allowing them to be combined, and what the respective contributions of the dynamic and static feature vectors should be when they are combined.
At present, the most commonly used tracking method is optical flow, and commonly used neural networks include RNNs, LSTMs and the like. Optical flow can detect moving targets without any prior knowledge of the scene, but it is computationally expensive, performs poorly in real time and places high demands on hardware. Training a standard RNN to learn long-term temporal dependencies is also not ideal.
To date, considerable research work is still needed on methods for recognizing and classifying behavior in video.
Summary of the Invention
Technical problem: the technical problem to be solved by the invention is to extract the dynamic and static features of a video and to fuse the two kinds of information, so as to effectively improve the accuracy of behavior classification in video.
Technical solution: the video classification method based on dynamic and static features of the present invention includes the following steps:
Step 1) A video supplied by the user is taken as input and decomposed into video clips of l frames each, with an interval of 5 frames between consecutive clips;
Step 2) The moving objects in the input video of step 1) are tracked with the dense trajectory (DT) algorithm, and the density-based spatial clustering of applications with noise (DBSCAN) algorithm is used to isolate each frame, so as to capture and track the dynamic information in the video. The DT algorithm densely samples feature points at multiple scales of an image through grid division; the DBSCAN algorithm starts from a selected core point and keeps expanding into density-reachable regions, yielding a maximal region that contains core points and boundary points;
Step 3) A motion box is constructed in every frame of every video clip; motion boxes are added or deleted so that every frame contains the same number of motion boxes, and the boxes of consecutive frames are connected along the trajectories tracked in step 2) to generate motion tubes;
Step 4) The optical flow vectors inside each motion tube are computed, the histogram of oriented gradients (HoG) method is used to collect the motion-direction statistics of every motion tube, and 100,000 direction-describing vectors are then selected and clustered with the k-means algorithm, generating the description of the dynamic information. A HoG feature is a feature descriptor used for object detection in computer vision and image processing, formed by computing and accumulating histograms of gradient directions over local image regions; the k-means algorithm takes a distance from the data points to the prototypes as the objective function to be optimized and derives the iterative update rules by extremizing this function;
Step 5) The static features are processed as follows: a convolutional neural network (CNN) is trained on the ImageNet dataset; the CNN consists of 5 convolutional layers, 2 fully connected layers and a softmax output layer, with the rectified linear unit (ReLU) as the activation function. The CNN is applied to every frame of the decomposed video clips, deep features are retrieved, and a static feature vector is output from the softmax layer of the CNN. The output static feature vectors build one static description per clip, giving the time series of static features C=[ct0, ct1, ..., ctn-1], where n is the number of video clips;
Step 6) The static description and the dynamic description are fused by the Cholesky transform, and the fused vectors are passed through a gated recurrent unit (GRU) network to complete the classification of the video. The Cholesky transform here means finding, through algebraic and matrix transformations, the mathematical relation between two variables whose relation is unknown: a further vector is constructed that is related to both the dynamic description vector and the static description vector, so that this vector can represent the static and dynamic description vectors jointly;
wherein,
the step 2) is specifically as follows:
Step 21) The key points in each frame are sampled with a 5*5 sampling grid. Let the coordinates of a key point in frame t be Pt=(xt, yt); its coordinates in frame t+1 are then Pt+1=(xt+1, yt+1)=(xt, yt)+(M*ω)|(x̄t, ȳt), where P is the key point, M is the median-filter kernel, ω is the dense optical flow field, and (x̄t, ȳt) is the rounded value of (xt, yt);
Step 22) If the 5*5 sampling grid in step 21) does not contain a feature point, this feature point is added manually to the tracked trajectory;
Step 23) The coordinates of the key points of every frame in each video clip are recorded, giving the sequence S=(ΔPt, ΔPt+1, ..., ΔPt+l-1), where ΔPt=(Pt+1−Pt)=(xt+1−xt, yt+1−yt); the resulting vector is normalized by the sum of the magnitudes of the displacement vectors, giving S'=(ΔPt, ..., ΔPt+l-1)/Σj‖ΔPj‖, where l is the number of frames per clip from step 1);
Step 24) Each region within a frame is separated, a neighborhood radius ε and a core-point threshold MinPoints are chosen, and the 20% of points farthest from the cluster are removed so that the DT algorithm acts on the whole region.
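A rough sketch of how steps 21)-23) could be approximated with standard OpenCV primitives; Farneback optical flow plus a per-channel median blur stands in for the median-filtered dense flow field (M*ω), and the function names, parameters and grid step are illustrative assumptions rather than values taken from the patent.

```python
import cv2
import numpy as np

def track_keypoints(clip, grid_step=5):
    """Track densely sampled key points through one clip and return the
    displacement sequence of step 23), normalized by the sum of magnitudes."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in clip]
    h, w = gray[0].shape
    ys, xs = np.mgrid[0:h:grid_step, 0:w:grid_step]        # step 21): dense grid
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)

    displacements = []
    for t in range(len(gray) - 1):
        flow = cv2.calcOpticalFlowFarneback(gray[t], gray[t + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median filtering of the flow field plays the role of the kernel M.
        fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)
        fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        delta = np.stack([fx[yi, xi], fy[yi, xi]], axis=1)  # (M*omega) at rounded P_t
        displacements.append(delta)
        pts = pts + delta                                   # P_{t+1} = P_t + (M*omega)

    S = np.stack(displacements, axis=1)                     # (points, l-1, 2)
    norm = np.linalg.norm(S, axis=2).sum(axis=1)[:, None, None]
    return S / np.maximum(norm, 1e-8)                       # step 23) normalization
```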
The step 3) is specifically as follows:
Step 31) A motion box centered on the key point P is built and represented by the vector b=(x, y, r, f), where x and y are the abscissa and ordinate of the upper-left corner of the box, r is the side length of the box, and f denotes the frame;
Step 32) The average number n of motion boxes per frame in each video clip is computed. Suppose the number of motion boxes accumulated from the first frame up to frame w reaches n; the remaining motion boxes in frame w are discarded, and starting from frame w+1 a new frame containing n motion boxes is sought. This step is repeated until every frame contains the same number of motion boxes, which is represented by the sequence:
g(vi,t)={[bt,1,1, bt,1,2, ..., bt,1,k], [bt,2,1, bt,2,2, ..., bt,2,k], ..., [bt,n,1, bt,n,2, ..., bt,n,k]};
where bt,j,k is the k-th motion box in the j-th frame of the t-th video clip; after step 32) every frame contains the same number k of motion boxes;
Step 33) Once every frame contains the same number of motion boxes, the motion tubes are built; the distance matrix D=(Di,j) of each video clip is set up,
where Di,j is the Euclidean distance between the i-th motion box of frame k and the j-th motion box of frame k+1. From this distance matrix the two motion boxes with the shortest distance in adjacent frames are selected, and the shortest-distance boxes between successive frames are connected by motion tubes. For each motion tube a 5-column matrix Mi is constructed that contains the frame number, the motion-box number, the motion-box coordinates and the motion-box size;
this matrix records the motion-box information of the k-th frame of video clip Mi, where n is the number of motion boxes, x and y are the upper-left corner coordinates of a box, r is the side length of the box, and z is the next frame connected to frame k.
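The box linking of step 33) can be sketched as follows: boxes in adjacent frames are matched by the smallest Euclidean distance between their upper-left coordinates and chained into motion tubes. The exact column layout of the matrix Mi is not fully specified in the text above, so the record format and all names here are illustrative assumptions.

```python
import numpy as np

def link_motion_boxes(boxes_per_frame):
    """boxes_per_frame: list over the frames of one clip; each entry is an
    (n, 3) array of (x, y, r) motion boxes, with the same n in every frame
    (guaranteed by step 32)). Returns one record per motion tube, each row
    holding [frame k, box index, x, y, r, next frame z]."""
    n = boxes_per_frame[0].shape[0]
    tubes = [[] for _ in range(n)]
    current = np.arange(n)                      # box used by each tube in frame k
    for k in range(len(boxes_per_frame) - 1):
        a = boxes_per_frame[k][current, :2]     # (x, y) of boxes used in frame k
        b = boxes_per_frame[k + 1][:, :2]
        D = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # D[i, j]
        nearest = D.argmin(axis=1)              # shortest-distance box in frame k+1
        for i in range(n):
            x, y, r = boxes_per_frame[k][current[i]]
            tubes[i].append([k, int(current[i]), x, y, r, k + 1])
        current = nearest
    return [np.array(t) for t in tubes]
```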
The step 4) is specifically as follows:
Step 41) Every motion tube of every video clip is identified and its optical flow vectors are computed. After creating the HoG feature a suitable bin value is taken, and the number of tube motion directions falling in each angular bin is counted, building one histogram for every motion tube. The HoG method is a feature descriptor used for object detection in computer vision and image processing, formed by computing and accumulating histograms of gradient directions over local image regions;
Step 42) 100,000 HoG vectors are selected from all videos and clustered with the k-means method; for each video clip the following formula is used:
p = argminj ‖Tj − hn,k‖, j ∈ {1, 2, ..., 1000};
where hn,k is the k-th HoG vector of the n-th video clip and Tj is the j-th cluster head; this yields the histogram of the whole dynamic information as a time series H=[Ht0, Ht1, ..., Htn-1], where n is the number of video clips.
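A sketch of how the codebook and per-clip histogram of steps 41)-42) might be obtained with scikit-learn; the per-tube HoG direction histograms are assumed to be computed already (for example with 100 bins, as in the empirical settings below), and the function names and sampling seed are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(all_tube_hogs, k=1000, sample_size=100000, seed=0):
    """all_tube_hogs: array of per-tube HoG direction histograms collected over
    all videos; 100,000 of them are sampled and clustered into k cluster heads."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_tube_hogs),
                     size=min(sample_size, len(all_tube_hogs)), replace=False)
    return MiniBatchKMeans(n_clusters=k, n_init=3,
                           random_state=seed).fit(all_tube_hogs[idx])

def clip_dynamic_descriptor(clip_tube_hogs, codebook):
    """Assign each tube's HoG vector of one clip to its nearest cluster head
    (p = argmin_j ||T_j - h_{n,k}||) and return the occurrence histogram H_t."""
    assignments = codebook.predict(clip_tube_hogs)
    hist = np.bincount(assignments, minlength=codebook.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```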
The step 6) is specifically as follows:
Step 61) Let H denote the static vector and M the dynamic vector; the dynamic and static vectors are fused with the Cholesky transform, giving the fused time series of the dynamic and static descriptions C=[ct0, ct1, ..., ctn-1];
Step 62) The parameter Ct denotes the fusion vector of each video clip; the update gate and reset gate of the GRU network process the input data, and the generated time series C=[ct0, ct1, ..., ctn-1] is fed into the GRU network to complete the final video classification.
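The text above does not spell out the exact fusion formula, so the following is only one common reading of Cholesky-based fusion, given as a hedged sketch: both descriptions are standardized and mixed with the weights of the Cholesky factor of a 2x2 correlation matrix, whose parameter rho controls the relative contribution of the two descriptions. This assumes the static and dynamic vectors have the same dimension (both 1000-dimensional here, consistent with the remark in the detailed description that k=1000 is chosen to match the static information); all names are illustrative.

```python
import numpy as np

def cholesky_fuse(static_vec, dynamic_vec, rho=0.5):
    """Fuse one clip's static and dynamic description vectors into c_t.
    rho in (0, 1) trades off the two descriptions; this is an assumed
    interpretation of the Cholesky-transform fusion, not a formula from the text."""
    s = (static_vec - static_vec.mean()) / (static_vec.std() + 1e-8)
    d = (dynamic_vec - dynamic_vec.mean()) / (dynamic_vec.std() + 1e-8)
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    # The second row of L gives the mixing weights (rho, sqrt(1 - rho**2)).
    return L[1, 0] * s + L[1, 1] * d

# Applying this per clip yields the fused sequence C = [c_t0, ..., c_tn-1].
```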
In step 1), l is empirically set to 15.
In step 24), ε and MinPoints are empirically set to 8 and 10.
In step 41), the number of bins is empirically set to 100.
In step 42), k is empirically set to 1000.
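With the empirical values just listed (ε=8, MinPoints=10), the per-frame isolation and removal of the farthest 20% of points in step 24) could be sketched with scikit-learn as follows; applying the 20% removal per cluster, and every function name used here, are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def isolate_regions(points, eps=8.0, min_points=10, drop_ratio=0.2):
    """points: (N, 2) array of tracked key-point coordinates in one frame.
    Returns {cluster label: kept points}, discarding DBSCAN noise and, within
    each cluster, the 20% of points farthest from the cluster centre."""
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(points)
    regions = {}
    for lab in set(labels):
        if lab == -1:                                   # DBSCAN noise points
            continue
        pts = points[labels == lab]
        dist = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
        keep = dist.argsort()[: max(1, int(len(pts) * (1.0 - drop_ratio)))]
        regions[lab] = pts[keep]
    return regions
```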
Beneficial effects: compared with the prior art, the above technical solution of the present invention has the following technical effects:
The invention uses the DT tracking algorithm and the DBSCAN algorithm to track and cluster the key points of the video frames, constructs motion tubes from the optical flow to connect the video frames, fuses the dynamic and static description information with the Cholesky transform, and completes the final video classification with a GRU network. With these methods the moving objects in a video can be classified with good accuracy and effectiveness. Specifically:
(1) By using the DT tracking algorithm and the DBSCAN algorithm, the invention can effectively suppress interference from the video background and track the required key points, which increases the accuracy of key-point capture.
(2) By finding the shortest Euclidean distance between key points in adjacent frames and connecting the two closest motion boxes, the invention links all frames within a video clip and thus tracks the trajectories of the key moving objects more accurately.
(3) By fusing the static and dynamic feature description vectors with the Cholesky transform and selecting the fusion that gives the best accuracy, the invention improves the accuracy of video classification.
Description of the Drawings
Figure 1 shows the flow of the video classification method based on dynamic and static features.
Figure 2 shows the dynamic-information histogram generated by HoG.
Figure 3 is a schematic diagram of the GRU neural network.
Detailed Description of Embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings:
The video classification method based on dynamic and static features of the present invention includes the following steps:
Step 1) A video supplied by the user is taken as input and decomposed into video clips of l frames each, with an interval of 5 frames between consecutive clips;
Step 2) The moving objects in the input video of step 1) are tracked with the dense trajectory (DT) algorithm, and the DBSCAN clustering algorithm is used to isolate each frame, so as to capture and track the dynamic information in the video. The DT algorithm densely samples feature points at multiple scales of an image through grid division; the DBSCAN algorithm starts from a selected core point and keeps expanding into density-reachable regions, yielding a maximal region that contains core points and boundary points;
Step 3) A motion box is constructed in every frame of every video clip; motion boxes are added or deleted so that every frame contains the same number of motion boxes, and the boxes of consecutive frames are connected along the trajectories tracked in step 2) to generate motion tubes;
Step 4) The optical flow vectors inside each motion tube are computed, the HoG method is used to collect the motion-direction statistics of every motion tube, and 100,000 direction-describing vectors are then selected and clustered with the k-means algorithm, generating the description of the dynamic information. A HoG feature is a feature descriptor used for object detection in computer vision and image processing, formed by computing and accumulating histograms of gradient directions over local image regions; the k-means algorithm takes a distance from the data points to the prototypes as the objective function to be optimized and derives the iterative update rules by extremizing this function;
Step 5) The static features are processed as follows: a CNN is trained on the ImageNet dataset; the CNN consists of 5 convolutional layers, 2 fully connected layers and a softmax output layer, with ReLU as the activation function. The CNN is applied to every frame of the decomposed video clips, deep features are retrieved, and a static feature vector is output from the softmax layer of the CNN. The output static feature vectors build one static description per clip, giving the time series of static features C=[ct0, ct1, ..., ctn-1], where n is the number of video clips;
Step 6) The static description and the dynamic description are fused by the Cholesky transform, and the fused vectors are passed through a GRU network to complete the classification of the video. The Cholesky transform here means finding, through algebraic and matrix transformations, the mathematical relation between two variables whose relation is unknown: a further vector is constructed that is related to both the dynamic description vector and the static description vector, so that this vector can represent the static and dynamic description vectors jointly;
wherein,
the step 2) is specifically as follows:
Step 21) The key points in each frame are sampled with a 5*5 sampling grid. Let the coordinates of a key point in frame t be Pt=(xt, yt); its coordinates in frame t+1 are then Pt+1=(xt+1, yt+1)=(xt, yt)+(M*ω)|(x̄t, ȳt), where P is the key point, M is the median-filter kernel, ω is the dense optical flow field, and (x̄t, ȳt) is the rounded value of (xt, yt);
Step 22) If the 5*5 sampling grid in step 21) does not contain a feature point, this feature point is added manually to the tracked trajectory;
Step 23) The coordinates of the key points of every frame in each video clip are recorded, giving the sequence S=(ΔPt, ΔPt+1, ..., ΔPt+l-1), where ΔPt=(Pt+1−Pt)=(xt+1−xt, yt+1−yt); the resulting vector is normalized by the sum of the magnitudes of the displacement vectors, giving S'=(ΔPt, ..., ΔPt+l-1)/Σj‖ΔPj‖, where l is the number of frames per clip from step 1);
Step 24) Each region within a frame is separated with the DBSCAN clustering algorithm by choosing a neighborhood radius ε and a core-point threshold MinPoints; the boundary-noise removal step discards the 20% of points farthest from the cluster so that the DT algorithm acts on the whole region.
The step 3) is specifically as follows:
Step 31) A motion box centered on the key point P is built and represented by the vector b=(x, y, r, f), where x and y are the abscissa and ordinate of the upper-left corner of the box, r is the side length of the box, and f denotes the frame;
Step 32) The average number n of motion boxes per frame in each video clip is computed. Suppose the number of motion boxes accumulated from the first frame up to frame w reaches n; the remaining motion boxes in frame w are discarded, and starting from frame w+1 a new frame containing n motion boxes is sought. This step is repeated until every frame contains the same number of motion boxes, which is represented by the sequence:
g(vi,t)={[bt,1,1, bt,1,2, ..., bt,1,k], [bt,2,1, bt,2,2, ..., bt,2,k], ..., [bt,n,1, bt,n,2, ..., bt,n,k]}; where bt,j,k is the k-th motion box in the j-th frame of the t-th video clip; after step 32) every frame contains the same number k of motion boxes;
Step 33) Once every frame contains the same number of motion boxes, the motion tubes are built; the distance matrix D=(Di,j) of each video clip is set up,
where Di,j is the Euclidean distance between the i-th motion box of frame k and the j-th motion box of frame k+1. From this distance matrix the two motion boxes with the shortest distance in adjacent frames are selected, and the shortest-distance boxes between successive frames are connected by motion tubes. For each motion tube a 5-column matrix Mi is constructed that contains the frame number, the motion-box number, the motion-box coordinates and the motion-box size;
this matrix records the motion-box information of the k-th frame of video clip Mi, where n is the number of motion boxes, x and y are the upper-left corner coordinates of a box, r is the side length of the box, and z is the next frame connected to frame k.
The step 4) is specifically as follows:
Step 41) Every motion tube of every video clip is identified and its optical flow vectors are computed. After creating the HoG feature a suitable bin value is taken, and the number of tube motion directions falling in each angular bin is counted, building one histogram for every motion tube. The HoG method is a feature descriptor used for object detection in computer vision and image processing, formed by computing and accumulating histograms of gradient directions over local image regions;
Step 42) 100,000 HoG vectors are selected from all videos and clustered with the k-means method; for each video clip the following formula is used:
p = argminj ‖Tj − hn,k‖, j ∈ {1, 2, ..., 1000};
where hn,k is the k-th HoG vector of the n-th video clip and Tj is the j-th cluster head; this yields the histogram of the whole dynamic information as a time series H=[Ht0, Ht1, ..., Htn-1], where n is the number of video clips.
The step 6) is specifically as follows:
Step 61) Let H denote the static vector and M the dynamic vector; the dynamic and static vectors are fused with the Cholesky transform, giving the fused time series of the dynamic and static descriptions C=[ct0, ct1, ..., ctn-1];
Step 62) The parameter Ct denotes the fusion vector of each video clip; the update gate and reset gate of the GRU network process the input data, and the generated time series C=[ct0, ct1, ..., ctn-1] is fed into the GRU network to complete the final video classification.
In a specific implementation, Figure 1 shows the flow of the video classification method based on dynamic and static features. The user first inputs a video, which is then divided into clips of 15 frames each.
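As an illustrative sketch only (not part of the claimed method), the decomposition into 15-frame clips with a 5-frame interval between clip start points could look as follows; OpenCV is assumed to be available, and the function and variable names are not taken from the patent.

```python
import cv2
import numpy as np

def split_into_clips(video_path, l=15, stride=5):
    """Decompose a video into clips of l frames each; consecutive clips start
    5 frames apart, matching step 1) of the method."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    clips = []
    for start in range(0, len(frames) - l + 1, stride):
        clips.append(np.stack(frames[start:start + l]))   # (l, H, W, 3) per clip
    return clips
```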
The feature points of every video frame are captured and tracked with the DT tracking algorithm and the DBSCAN clustering algorithm. The DT algorithm densely samples feature points at multiple scales of an image through grid division. The DBSCAN algorithm starts from a selected core point and keeps expanding into density-reachable regions, so that a maximal region containing core points and boundary points is obtained, in which any two points are connected.
To construct motion tubes for the moving objects, a Euclidean distance matrix is needed to find the two motion boxes with the shortest Euclidean distance between adjacent frames, so that within each video clip every frame is connected by the constructed motion tubes. In this way the trajectories of the feature points of each video clip can be tracked.
Next, the coordinates of the motion boxes inside the motion tubes of these video clips are recorded; for each motion tube a 5-column matrix Mi is constructed that contains the frame number, the motion-box number, the motion-box coordinates and the motion-box size.
The parameters of this matrix have the following meaning: it records the motion-box information of the k-th frame of video clip Mi, where n is the number of motion boxes, x and y are the upper-left corner coordinates of a box, and r is the side length of the box.
After every motion tube of every video clip has been identified, its optical flow vectors are computed. The HoG histogram is created as shown in Figure 2 with bin=100, so that each angular bin spans 3.6 degrees, and the number of tube motion directions falling in each bin is counted; a histogram can therefore be built for every motion tube. The HoG method is a feature descriptor used for object detection in computer vision and image processing, formed by computing and accumulating histograms of gradient directions over local image regions. Then 100,000 HoG vectors are selected and clustered with the k-means method, taking k=1000 (this value is chosen to match the fusion with the static information). The k-means algorithm is a hard clustering algorithm and a typical representative of prototype-based objective-function clustering: it takes a distance from the data points to the prototypes as the objective function to be optimized and derives the iterative update rules by extremizing this function. For each video clip the following formula is used:
p = argminj ‖Tj − hn,k‖, j ∈ {1, 2, ..., 1000}, which gives the histogram, where hn,k is the k-th HoG vector of the n-th video clip and Tj is the j-th cluster head; the time series of the whole dynamic information is thus obtained: H=[Ht0, Ht1, ..., Htn-1];
A deep CNN is trained on ImageNet, and the static features are likewise expressed as a time series I=[it0, it1, ..., itn-1], where n is the number of video clips. The Cholesky transform is used to fuse the dynamic and static vectors, giving the fused time series of the dynamic and static descriptions C=[ct0, ct1, ..., ctn-1], which is fed into the GRU neural network to complete the final video classification.
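The architecture described above (5 convolutional layers, 2 fully connected layers, a softmax output and ReLU activations, trained on ImageNet) closely matches AlexNet, so a pretrained torchvision AlexNet is used below as a hedged stand-in; averaging the per-frame softmax outputs into one vector per clip is an assumption, as is every name in the sketch.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in network: pretrained AlexNet (torchvision >= 0.13 weight API).
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def cnn_static_descriptor(clip):
    """Apply the CNN to every frame of one clip and pool the softmax-layer
    outputs into a single 1000-dimensional static description i_t."""
    frames = torch.stack([preprocess(f[:, :, ::-1].copy()) for f in clip])  # BGR -> RGB
    probs = torch.softmax(cnn(frames), dim=1)       # softmax output per frame
    return probs.mean(dim=0)                        # one static vector per clip
```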
Figure 3 shows the internal structure of each GRU cell. The previous output in the sequence passes through the reset gate rt and the update gate zt, giving rt=σ(Wr·[ht-1, ct]) and zt=σ(Wz·[ht-1, ct]). The reset gate rt is applied to the previous output ht-1, the result is concatenated with the current input ct and multiplied by the weight Wh̃, and a tanh yields the candidate state h̃t=tanh(Wh̃·[rt*ht-1, ct]); the final output is ht=(1−zt)*ht-1+zt*h̃t. The weights Wr, Wz and Wh̃ are stored concatenated and must be split apart during learning.
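A minimal numpy sketch of one GRU cell, written only to mirror the gate equations above; bias terms are omitted and all weight names and dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(c_t, h_prev, W_r, W_z, W_h):
    """One GRU step on the fused input c_t; each weight matrix acts on the
    concatenation of the previous output h_{t-1} and the current input c_t."""
    x = np.concatenate([h_prev, c_t])
    r_t = sigmoid(W_r @ x)                                        # reset gate
    z_t = sigmoid(W_z @ x)                                        # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, c_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # output h_t
```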
Through the final GRU network, the classification of the video is completed.
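For completeness, a hedged PyTorch sketch of this final stage: the fused per-clip sequence C is fed through a GRU and the last hidden state is mapped to class scores. The hidden size and the number of classes are illustrative and depend on the target dataset.

```python
import torch
import torch.nn as nn

class FusedSequenceClassifier(nn.Module):
    """Classify a video from its fused time series C = [c_t0, ..., c_tn-1]."""
    def __init__(self, input_dim=1000, hidden_dim=256, num_classes=10):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, fused_seq):          # fused_seq: (batch, n_clips, input_dim)
        _, h_n = self.gru(fused_seq)
        return self.fc(h_n[-1])            # (batch, num_classes) class scores

# Example: scores = FusedSequenceClassifier()(torch.randn(2, 8, 1000))
```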