CN113313030A - Human behavior identification method based on motion trend characteristics - Google Patents

Human behavior identification method based on motion trend characteristics

Info

Publication number
CN113313030A
Authority
CN
China
Prior art keywords
video
features
data set
feature extraction
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110597647.7A
Other languages
Chinese (zh)
Other versions
CN113313030B (en)
Inventor
董敏
曹瑞东
毕盛
方政霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110597647.7A
Publication of CN113313030A
Application granted
Publication of CN113313030B
Expired - Fee Related (current legal status)
Anticipated expiration

Abstract

Translated from Chinese


The invention discloses a method for recognizing human behavior based on motion trend features, comprising the steps of: 1) acquiring a video data set for human behavior recognition; 2) extracting video frames from the videos in the video data set to create a data set; 3) constructing a motion trend feature extraction model, and using the motion trend feature extraction model to perform feature extraction and recognition on the data set of step 2) to realize model training; 4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene. The invention is beneficial to the recognition of complex, long-duration behaviors that require inference in video scenes, and has practical application value.


Description

Human behavior identification method based on motion trend characteristics
Technical Field
The invention relates to the technical field of human behavior recognition analysis based on a video scene, in particular to a human behavior recognition method based on motion trend characteristics.
Background
Visual information is among the most readily available information in real life, and video is its main carrier, around which a great number of research hotspots and applications have emerged. Among these, human behavior recognition in videos receives particular attention, for example in smart nursing, sports event judging, and sign language recognition; such tasks focus more on the actions of the human body itself than on the interaction between the human body and objects. A video consists of multiple frames of images with a temporal relationship, so how to simultaneously capture the spatial semantic features of the images in a video and the temporal motion features across its frames becomes the key to human behavior recognition.
The behavior recognition task in videos differs from image classification in that classification in videos requires temporal modeling, so various behavior recognition networks have tried different ways of combining temporal and spatial modeling. Two-Stream CNN, proposed in 2014, uses a two-stream network: one network expresses spatial features, the other expresses temporal features, and the spatial and temporal features extracted by the two networks are then fused and classified. Many temporal modeling models followed. The TSN model, proposed in 2016, divides a video into several segments, performs sparse sampling, makes a prediction on each segment, and then fuses the segment results to obtain a video-level prediction; this mechanism gives the model the ability to capture long-range temporal information, but it still does not link the temporal and spatial dimensions of behavior features and lacks spatio-temporal fusion capability. Starting from C3D, proposed in 2015, 3D convolutional neural networks (3D CNNs) have been used for spatio-temporal feature extraction, including the R3D, I3D, NL-I3D, SlowFast, NL-SlowFast and X3D models. 3D convolution extends spatial semantic features to spatio-temporal features by adding a temporal dimension to the convolution kernel. Although 3D CNN models can fuse spatio-temporal features well, they only learn features over a small sliding window rather than the entire video, so it is difficult for them to obtain video-level predictions; moreover, 3D CNN models are very expensive in computation, demanding on the computing platform, difficult to train, and very time-consuming in actual inference.
Human behaviors can be divided into two types. One type can be judged from a single static frame of the video; such behaviors are called behaviors that require no inference. The other type can only be judged by recognizing the features of multiple frames in the video and the motion relationships between them; such behaviors are called behaviors that require inference. Recognizing behaviors that require inference places higher demands on the temporal modeling of a human behavior recognition model. Therefore, a human behavior recognition method for video scenes must not only consider the model's ability to fuse spatio-temporal features, but also design a strategy for capturing long-range temporal motion features so as to obtain video-level predictions of behaviors, while balancing recognition accuracy and cost as far as possible.
Disclosure of Invention
The invention aims to overcome the shortcomings of current human behavior recognition methods for video scenes in fusing spatio-temporal features and capturing long-range temporal motion information. It provides a human behavior recognition method based on motion trend features that strengthens the recognition of both behaviors that require inference and behaviors that require no inference, improves the model's accuracy in recognizing complex and long-duration behaviors in video scenes, and allows the model to be better applied in practical systems.
In order to achieve the above purpose, the technical solution provided by the invention is as follows: a human behavior recognition method based on motion trend features, comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to create a data set;
3) constructing a motion trend feature extraction model, and performing feature extraction and recognition on the data set created in step 2) with the motion trend feature extraction model to realize model training; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional network for Online video understanding) model by adding the calculation of motion features, modifying the ECO model's extraction of spatio-temporal features into the extraction of motion trend features, and adding a feature fusion module, the purpose of these improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene to complete behavior recognition.
Further, in step 1), public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained public open-source video data sets are organized according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, each folder contains the videos of that category in .avi format, and the Jester video data set is a collection of videos in .webm format named by video serial number.
Further, the step 2) comprises the following steps:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video by using OpenCV to obtain a video frame data set, and counting the frame number of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, where each line of the file is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
Further, in step 3), the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame preprocessing module performs the following operations: sparse sampling is performed on all video frames from the same video, the frames are divided evenly into 16 segments by frame count, 1 frame is randomly selected from each segment for a total of 16 sampled frames, and these 16 frames are center-cropped to obtain 16 images of 224×224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the 16 frames; the backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively; the 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operations: for the spatial semantic features obtained above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n; multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features; the backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer, and motion trend features of size [512,4,7,7] are extracted through this module;
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features; this module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer, and spatio-temporal features of size [512,4,7,7] are extracted through this module;
the feature fusion module concatenates the motion trend features and the spatio-temporal features to obtain video-level features with a final size of [1024,4,7,7];
the video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7; the global pooling layer is average pooling with a stride of 1, yielding features of size [1024,1,1,1];
a Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
Further, the step 4) comprises the following steps:
4.1) collecting actual scene video data, extracting video frames of the obtained video data, and making a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing parameters of all feature extraction related layers, training on the data set in the step 4.1), and applying the trained motion feature extraction model to human behavior recognition in an actual scene to obtain an accurate recognition result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs sparse sampling on the video to be recognized, which reduces data redundancy, reduces the input size of the model, and reduces the amount of computation during inference.
2. The model of the invention computes feature differences over multi-frame spatial semantic features, which enhances the expression of the motion trend of behaviors and further reduces the number of feature maps, thereby reducing computation; at the same time, motion trend features amplify the differences between behaviors of different categories and give better recognition results on smaller data sets. The model fuses motion trend features with spatio-temporal features, so it retains complete static image semantic information while capturing dynamic motion features, and can simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference.
3. The model of the invention has a modular structure: the backbone networks of the spatial semantic feature extraction module, the spatio-temporal feature extraction module and the motion trend feature extraction module can be flexibly replaced with other networks, and a lightweight or high-accuracy network can be chosen according to the available computing resources.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The human behavior recognition method based on motion trend features provided by this embodiment comprises the following steps:
1) acquiring a video data set for human behavior recognition, specifically as follows:
1.1) obtaining public open-source video data sets by downloading the HMDB51, UCF101 and Jester video data sets;
1.2) organizing the obtained public open-source video data sets according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, and each folder contains the videos of that category in .avi format; the Jester video data set is a collection of videos in .webm format named by video serial number.
2) Extracting video frames from the videos in the video data set of step 1) to create a data set, specifically as follows:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video by using OpenCV to obtain a video frame data set, and counting the frame number of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, where each line of the file is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
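As an illustration of steps 2.1) and 2.2) only, the following is a minimal sketch in Python using OpenCV; the directory layout, file naming and function names are illustrative assumptions rather than part of the invention.

```python
import os
import cv2


def extract_frames(video_path, out_dir):
    """Step 2.1): extract every frame of one video with OpenCV and return the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{count:05d}.jpg"), frame)
        count += 1
    cap.release()
    return count


def write_split_file(entries, list_path):
    """Step 2.2): write one line per video -- a tuple of (frame-folder address, frame count, category)."""
    with open(list_path, "w") as f:
        for frame_dir, num_frames, label in entries:
            f.write(f"{frame_dir} {num_frames} {label}\n")
```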
3) Constructing a motion trend feature extraction model, as shown in Fig. 1, and performing feature extraction and recognition on the data set of step 2) with the motion trend feature extraction model to realize model training; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional network for Online video understanding) model by adding the calculation of motion features, modifying the ECO model's extraction of spatio-temporal features into the extraction of motion trend features, and adding a feature fusion module, the purpose of these improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference.
The constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame pre-processing module performs the following operations: performing sparse sampling on all video frames from the same video, averagely dividing the video frames into 16 segments according to the number of the frames, namely N in the figure 1 is equal to 16, randomly selecting 1 frame from each segment, and sampling 16 frames of images in total; the 16 frame image is center-cropped to obtain 16 frames of 224 × 224 pixels.
The spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the 16 frames. The backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively. The 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features.
The motion feature calculation module performs the following operations: for the spatial semantic features above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n. Multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28].
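This differencing amounts to a per-channel temporal difference of the stacked spatial semantic features. A minimal sketch follows, assuming PyTorch (the patent does not prescribe a framework) and a [16, 96, 28, 28] layout for the stacked features.

```python
import torch


def motion_features(spatial_feats: torch.Tensor) -> torch.Tensor:
    """Motion feature calculation: D_(n-1)k(x, y) = F_nk(x, y) - F_(n-1)k(x, y).

    spatial_feats: stacked spatial semantic features of shape [16, 96, 28, 28]
    (time, channel, height, width). Returns the stacked feature differences in the
    [96, 15, 28, 28] layout described above (channel, time difference, height, width).
    """
    diffs = spatial_feats[1:] - spatial_feats[:-1]   # [15, 96, 28, 28]
    return diffs.permute(1, 0, 2, 3).contiguous()    # [96, 15, 28, 28]
```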
The motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features above. The backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer; motion trend features of size [512,4,7,7] are extracted through this module.
The spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features above. This module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer; spatio-temporal features of size [512,4,7,7] are extracted through this module.
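A minimal sketch of the spatio-temporal feature extraction module, assuming PyTorch; the motion trend feature extraction module is analogous but reuses part of a 3D-ResNet-18 backbone, whose exact layer configuration is not reproduced here. Class and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpatioTemporalModule(nn.Module):
    """Two 3D convolutions (512 output channels, 3x3x3 kernels, padding 1, stride 2),
    each followed by BatchNorm3d and ReLU; maps [B, 96, 16, 28, 28] to [B, 512, 4, 7, 7]."""

    def __init__(self, in_channels: int = 96, out_channels: int = 512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: spatial semantic features rearranged to [batch, channel, time, height, width]
        return self.block(x)
```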
The feature fusion module concatenates the motion trend features and the spatio-temporal features above to obtain video-level features with a final size of [1024,4,7,7].
The video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7; the global pooling layer is average pooling with a stride of 1, yielding features of size [1024,1,1,1].
A Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
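A minimal sketch of the feature fusion module and the classification head, assuming PyTorch. Note that the text specifies a 1×7×7 pooling kernel but an output of size [1024,1,1,1]; this sketch uses global average pooling over all three dimensions so as to match the stated output size, which is an assumption.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Concatenate motion trend and spatio-temporal features ([B, 512, 4, 7, 7] each)
    into [B, 1024, 4, 7, 7], pool to [B, 1024, 1, 1, 1], then Flatten -> Dropout(0.3)
    -> fully connected classification layer."""

    def __init__(self, num_classes: int, channels: int = 1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(output_size=(1, 1, 1))
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, trend: torch.Tensor, spatiotemporal: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([trend, spatiotemporal], dim=1)  # [B, 1024, 4, 7, 7]
        pooled = self.pool(fused)                          # [B, 1024, 1, 1, 1]
        flat = torch.flatten(pooled, start_dim=1)          # [B, 1024]
        return self.fc(self.dropout(flat))
```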
4) Performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene, specifically as follows:
4.1) collecting actual scene video data, extracting video frames of the obtained video data, and making a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing parameters of all feature extraction related layers, training on the data set in the step 4.1), and applying the trained motion feature extraction model to human behavior recognition in an actual scene to obtain an accurate recognition result.
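A minimal sketch of the fine-tuning in step 4.2), assuming PyTorch; it also assumes the model exposes its classification layer as an `fc` attribute, which is a naming assumption rather than something prescribed by the invention.

```python
import torch.nn as nn
import torch.optim as optim


def prepare_for_finetuning(model: nn.Module, num_target_classes: int, lr: float = 1e-3):
    """Freeze all feature-extraction layers and train only a new classification head."""
    for param in model.parameters():
        param.requires_grad = False                                      # freeze feature-extraction layers
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)      # new head for the actual scene
    optimizer = optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)   # only the new head is updated
    return model, optimizer
```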
In conclusion, the invention provides a human behavior recognition method based on motion trend features that effectively reduces data redundancy, captures long-range temporal information, and enhances the expression of the motion trend of behaviors; it retains the complete static image semantic information of the sampled video frames while capturing the dynamic motion features of behaviors, which facilitates the recognition of human behaviors, and some of its modules can be flexibly replaced according to the scene. The method therefore has broad research and practical application value and is worth popularizing.
The above-mentioned embodiments are merely preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall be covered within the protection scope of the present invention.

Claims (5)

Translated from Chinese
1. A human behavior recognition method based on motion trend features, characterized by comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to create a data set;
3) constructing a motion trend feature extraction model, and using the motion trend feature extraction model to perform feature extraction and recognition on the data set created in step 2) to realize model training; wherein the motion trend feature extraction model improves on the ECO model structure, the improvements being that the calculation of motion features is added, the ECO model's extraction of spatio-temporal features is modified into the extraction of motion trend features, and a feature fusion module is added, the purpose of the improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene to complete behavior recognition.
2. The human behavior recognition method based on motion trend features according to claim 1, characterized in that in step 1), public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained public open-source video data sets are organized according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, each folder contains the videos of that category in .avi format, and the Jester video data set is a collection of videos in .webm format named by video serial number.
3. The human behavior recognition method based on motion trend features according to claim 1, characterized in that step 2) comprises the following steps:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video with OpenCV to obtain a video frame data set, and counting the number of frames of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, each line of which is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
4. The human behavior recognition method based on motion trend features according to claim 1, characterized in that in step 3), the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame preprocessing module performs the following operations: sparsely sampling all video frames from the same video, dividing them evenly into 16 segments by frame count, randomly selecting 1 frame from each segment for a total of 16 sampled frames, and center-cropping these 16 frames to obtain 16 images of 224×224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the above 16 frames; the backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively; the 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operations: for the spatial semantic features obtained above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n; multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the above motion features; the backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer; motion trend features of size [512,4,7,7] are extracted through this module;
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the above spatial semantic features; this module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer; spatio-temporal features of size [512,4,7,7] are extracted through this module;
the feature fusion module concatenates the above motion trend features and spatio-temporal features to obtain video-level features with a final size of [1024,4,7,7];
the video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7, the global pooling layer being average pooling with a stride of 1, yielding features of size [1024,1,1,1];
a Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
5. The human behavior recognition method based on motion trend features according to claim 1, characterized in that step 4) comprises the following steps:
4.1) collecting actual scene video data, extracting video frames from the obtained video data, and creating a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing the parameters of all feature-extraction-related layers, training on the data set of step 4.1), and applying the trained motion feature extraction model to human behavior recognition in the actual scene to obtain accurate recognition results.
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features | Expired - Fee Related | CN113313030B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B)

Publications (2)

Publication Number | Publication Date
CN113313030A | 2021-08-27
CN113313030B | 2023-02-14

Family

ID=77376213

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110597647.7A | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B, Expired - Fee Related) | 2021-05-31 | 2021-05-31

Country Status (1)

Country | Link
CN | CN113313030B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104143074A (en)* | 2013-05-07 | 2014-11-12 | 李东舸 | Method and equipment for generating motion feature codes on the basis of motion feature information
EP3284013A1 (en)* | 2015-04-16 | 2018-02-21 | University of Essex Enterprises Limited | Event detection and summarisation
US20170220854A1 (en)* | 2016-01-29 | 2017-08-03 | Conduent Business Services, Llc | Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
CN107609460A (en)* | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN108108699A (en)* | 2017-12-25 | 2018-06-01 | 重庆邮电大学 | Merge deep neural network model and the human motion recognition method of binary system Hash
CN111382677A (en)* | 2020-02-25 | 2020-07-07 | 华南理工大学 | Human action recognition method and system based on 3D attention residual model
CN111680618A (en)* | 2020-06-04 | 2020-09-18 | 西安邮电大学 | Dynamic gesture recognition method, storage medium and device based on video data characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BANGLI LIU ET AL: "Human-Human Interaction recognition based on spatial and motion trend feature", IEEE International Conference on Image Processing *
张爱辉 et al.: "Improvement of PCRM and its application in human behavior recognition" (PCRM的改进及其在人体行为识别中的应用), 《计算机工程与设计》 (Computer Engineering and Design) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113963315A (en)* | 2021-11-16 | 2022-01-21 | 重庆邮电大学 | A method and system for real-time video multi-person behavior recognition in complex scenes
CN114171032A (en)* | 2021-11-24 | 2022-03-11 | 厦门快商通科技股份有限公司 | Cross-channel voiceprint model training method, identification method, device and readable medium
CN114973107A (en)* | 2022-06-24 | 2022-08-30 | 山东省人工智能研究院 | Unsupervised cross-domain video action identification method based on multi-discriminator cooperation and strong and weak sharing mechanism

Also Published As

Publication number | Publication date
CN113313030B (en) | 2023-02-14

Similar Documents

Publication | Title
CN112541503B (en) | Real-time semantic segmentation method based on context attention mechanism and information fusion
CN112507898B (en) | A Multimodal Dynamic Gesture Recognition Method Based on Lightweight 3D Residual Network and TCN
CN111563508B (en) | Semantic segmentation method based on spatial information fusion
CN111046821B (en) | Video behavior recognition method and system and electronic equipment
JP2023003026A (en) | Method for identifying rural village area classified garbage based on deep learning
CN115116054B (en) | A method for identifying pests and diseases based on multi-scale lightweight networks
CN110046550B (en) | Pedestrian attribute recognition system and method based on multi-layer feature learning
CN110569814B (en) | Video category identification method, device, computer equipment and computer storage medium
CN111382677B (en) | Human behavior recognition method and system based on 3D attention residual error model
CN111563507A (en) | Indoor scene semantic segmentation method based on convolutional neural network
CN109300121A (en) | Method and system for constructing a diagnostic model of cardiovascular disease and the diagnostic model
CN113313030A (en) | Human behavior identification method based on motion trend characteristics
CN109993269B (en) | Single image crowd counting method based on attention mechanism
CN113283400B (en) | Skeleton action identification method based on selective hypergraph convolutional network
CN110532911B (en) | Covariance measurement driven small sample GIF short video emotion recognition method and system
CN113705293B (en) | Image scene recognition method, device, equipment and readable storage medium
CN116524596B (en) | Sports video action recognition method based on action granularity grouping structure
CN112528058B (en) | Fine-grained image classification method based on image attribute active learning
CN116229323A (en) | Human body behavior recognition method based on improved depth residual error network
CN117689731B (en) | Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN114694174A (en) | A human interaction behavior recognition method based on spatiotemporal graph convolution
CN116189281B (en) | End-to-end human behavior classification method and system based on spatiotemporal adaptive fusion
CN117011943A (en) | Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN114513653B (en) | Video processing method, device, equipment, computer program product and storage medium
CN119964244A (en) | PSC-TNet video action recognition method based on fusion of spatial features and frame difference information

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2023-02-14)
