CN113313030A - Human behavior identification method based on motion trend characteristics - Google Patents

Human behavior identification method based on motion trend characteristics

Info

Publication number
CN113313030A
Authority
CN
China
Prior art keywords
video
features
data set
feature extraction
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110597647.7A
Other languages
Chinese (zh)
Other versions
CN113313030B (en)
Inventor
董敏
曹瑞东
毕盛
方政霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110597647.7A
Publication of CN113313030A
Application granted
Publication of CN113313030B
Expired - Fee Related (current legal status)
Anticipated expiration

Abstract

Translated from Chinese


The invention discloses a method for recognizing human behavior based on motion trend features, comprising the steps of: 1) acquiring a video data set for human behavior recognition; 2) extracting video frames from the videos in the video data set to create a data set; 3) constructing a motion trend feature extraction model, and using the motion trend feature extraction model to perform feature extraction and recognition on the data set of step 2) to realize model training; 4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene. The invention is beneficial to the recognition of complex, long-duration behaviors that require inference in video scenes, and has practical application value.


Description

Human behavior identification method based on motion trend characteristics
Technical Field
The invention relates to the technical field of human behavior recognition analysis based on a video scene, in particular to a human behavior recognition method based on motion trend characteristics.
Background
Visual information is among the most readily available information in real life, and video is its main carrier, around which a great number of research hotspots and applications have emerged. Among these, human behavior recognition in videos receives particular attention, for example in smart nursing, sports event judging, and sign language recognition; such tasks focus more on the actions of the human body itself than on the interaction between the human body and objects. A video consists of multiple frames of images with a temporal relationship, so how to simultaneously capture the spatial semantic features of the images in a video and the temporal motion features across its frames becomes the key to human behavior recognition.
The behavior recognition task in videos differs from image classification in that classification in videos requires temporal modeling, so various behavior recognition networks have tried different ways of combining temporal and spatial modeling. Two-Stream CNN, proposed in 2014, uses a two-stream network: one network expresses spatial features, the other expresses temporal features, and the spatial and temporal features extracted by the two networks are then fused and classified. Many temporal modeling models followed. The TSN model, proposed in 2016, divides a video into several segments, performs sparse sampling, makes a prediction on each segment, and then fuses the segment results to obtain a video-level prediction; this mechanism gives the model the ability to capture long-range temporal information, but it still does not link the temporal and spatial dimensions of behavior features and lacks spatio-temporal fusion capability. Starting from C3D, proposed in 2015, 3D convolutional neural networks (3D CNNs) have been used for spatio-temporal feature extraction, including the R3D, I3D, NL-I3D, SlowFast, NL-SlowFast and X3D models. 3D convolution extends spatial semantic features to spatio-temporal features by adding a temporal dimension to the convolution kernel. Although 3D CNN models can fuse spatio-temporal features well, they only learn features over a small sliding window rather than the entire video, so it is difficult for them to obtain video-level predictions; moreover, 3D CNN models are very expensive in computation, demanding on the computing platform, difficult to train, and very time-consuming in actual inference.
Human behaviors can be divided into two types. One type can be judged from a single static frame of the video; such behaviors are called behaviors that require no inference. The other type can only be judged by recognizing the features of multiple frames in the video and the motion relationships between them; such behaviors are called behaviors that require inference. Recognizing behaviors that require inference places higher demands on the temporal modeling of a human behavior recognition model. Therefore, a human behavior recognition method for video scenes must not only consider the model's ability to fuse spatio-temporal features, but also design a strategy for capturing long-range temporal motion features so as to obtain video-level predictions of behaviors, while balancing recognition accuracy and cost as far as possible.
Disclosure of Invention
The invention aims to overcome the shortcomings of current human behavior recognition methods for video scenes in fusing spatio-temporal features and capturing long-range temporal motion information. It provides a human behavior recognition method based on motion trend features that strengthens the recognition of both behaviors that require inference and behaviors that require no inference, improves the model's accuracy in recognizing complex and long-duration behaviors in video scenes, and allows the model to be better applied in practical systems.
In order to achieve the above purpose, the technical solution provided by the invention is as follows: a human behavior recognition method based on motion trend features, comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to create a data set;
3) constructing a motion trend feature extraction model, and performing feature extraction and recognition on the data set created in step 2) with the motion trend feature extraction model to realize model training; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional network for Online video understanding) model by adding the calculation of motion features, modifying the ECO model's extraction of spatio-temporal features into the extraction of motion trend features, and adding a feature fusion module, the purpose of these improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene to complete behavior recognition.
Further, in step 1), public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained public open-source video data sets are organized according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, each folder contains the videos of that category in .avi format, and the Jester video data set is a collection of videos in .webm format named by video serial number.
Further, the step 2) comprises the following steps:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video by using OpenCV to obtain a video frame data set, and counting the frame number of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, where each line of the file is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
Further, in step 3), the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame preprocessing module performs the following operations: sparse sampling is performed on all video frames from the same video, the frames are divided evenly into 16 segments by frame count, 1 frame is randomly selected from each segment for a total of 16 sampled frames, and these 16 frames are center-cropped to obtain 16 images of 224×224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the 16 frames; the backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively; the 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operations: for the spatial semantic features obtained above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n; multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features; the backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer, and motion trend features of size [512,4,7,7] are extracted through this module;
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features; this module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer, and spatio-temporal features of size [512,4,7,7] are extracted through this module;
the feature fusion module concatenates the motion trend features and the spatio-temporal features to obtain video-level features with a final size of [1024,4,7,7];
the video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7; the global pooling layer is average pooling with a stride of 1, yielding features of size [1024,1,1,1];
a Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
Further, the step 4) comprises the following steps:
4.1) collecting actual scene video data, extracting video frames of the obtained video data, and making a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing parameters of all feature extraction related layers, training on the data set in the step 4.1), and applying the trained motion feature extraction model to human behavior recognition in an actual scene to obtain an accurate recognition result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention performs sparse sampling on the video to be recognized, which reduces data redundancy, reduces the input size of the model, and reduces the amount of computation during inference.
2. The model of the invention computes feature differences over multi-frame spatial semantic features, which enhances the expression of the motion trend of behaviors and further reduces the number of feature maps, thereby reducing computation; at the same time, motion trend features amplify the differences between behaviors of different categories and give better recognition results on smaller data sets. The model fuses motion trend features with spatio-temporal features, so it retains complete static image semantic information while capturing dynamic motion features, and can simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference.
3. The model of the invention has a modular structure: the backbone networks of the spatial semantic feature extraction module, the spatio-temporal feature extraction module and the motion trend feature extraction module can be flexibly replaced with other networks, and a lightweight or high-accuracy network can be chosen according to the available computing resources.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The human behavior recognition method based on motion trend features provided by this embodiment comprises the following steps:
1) acquiring a video data set for human behavior recognition, specifically as follows:
1.1) obtaining public open-source video data sets by downloading the HMDB51, UCF101 and Jester video data sets;
1.2) organizing the obtained public open-source video data sets according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, and each folder contains the videos of that category in .avi format; the Jester video data set is a collection of videos in .webm format named by video serial number.
2) Extracting video frames from the videos in the video data set of step 1) to create a data set, specifically as follows:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video by using OpenCV to obtain a video frame data set, and counting the frame number of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, where each line of the file is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
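As an illustration of steps 2.1) and 2.2) only, the following is a minimal sketch in Python using OpenCV; the directory layout, file naming and function names are illustrative assumptions rather than part of the invention.

```python
import os
import cv2


def extract_frames(video_path, out_dir):
    """Step 2.1): extract every frame of one video with OpenCV and return the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{count:05d}.jpg"), frame)
        count += 1
    cap.release()
    return count


def write_split_file(entries, list_path):
    """Step 2.2): write one line per video -- a tuple of (frame-folder address, frame count, category)."""
    with open(list_path, "w") as f:
        for frame_dir, num_frames, label in entries:
            f.write(f"{frame_dir} {num_frames} {label}\n")
```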
3) Constructing a motion trend feature extraction model, as shown in Fig. 1, and performing feature extraction and recognition on the data set of step 2) with the motion trend feature extraction model to realize model training; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional network for Online video understanding) model by adding the calculation of motion features, modifying the ECO model's extraction of spatio-temporal features into the extraction of motion trend features, and adding a feature fusion module, the purpose of these improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference.
The constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame pre-processing module performs the following operations: performing sparse sampling on all video frames from the same video, averagely dividing the video frames into 16 segments according to the number of the frames, namely N in the figure 1 is equal to 16, randomly selecting 1 frame from each segment, and sampling 16 frames of images in total; the 16 frame image is center-cropped to obtain 16 frames of 224 × 224 pixels.
The spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the 16 frames. The backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively. The 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features.
The motion feature calculation module performs the following operations: for the spatial semantic features above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n. Multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28].
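This differencing amounts to a per-channel temporal difference of the stacked spatial semantic features. A minimal sketch follows, assuming PyTorch (the patent does not prescribe a framework) and a [16, 96, 28, 28] layout for the stacked features.

```python
import torch


def motion_features(spatial_feats: torch.Tensor) -> torch.Tensor:
    """Motion feature calculation: D_(n-1)k(x, y) = F_nk(x, y) - F_(n-1)k(x, y).

    spatial_feats: stacked spatial semantic features of shape [16, 96, 28, 28]
    (time, channel, height, width). Returns the stacked feature differences in the
    [96, 15, 28, 28] layout described above (channel, time difference, height, width).
    """
    diffs = spatial_feats[1:] - spatial_feats[:-1]   # [15, 96, 28, 28]
    return diffs.permute(1, 0, 2, 3).contiguous()    # [96, 15, 28, 28]
```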
The motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features above. The backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer; motion trend features of size [512,4,7,7] are extracted through this module.
The spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features above. This module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer; spatio-temporal features of size [512,4,7,7] are extracted through this module.
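A minimal sketch of the spatio-temporal feature extraction module, assuming PyTorch; the motion trend feature extraction module is analogous but reuses part of a 3D-ResNet-18 backbone, whose exact layer configuration is not reproduced here. Class and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpatioTemporalModule(nn.Module):
    """Two 3D convolutions (512 output channels, 3x3x3 kernels, padding 1, stride 2),
    each followed by BatchNorm3d and ReLU; maps [B, 96, 16, 28, 28] to [B, 512, 4, 7, 7]."""

    def __init__(self, in_channels: int = 96, out_channels: int = 512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: spatial semantic features rearranged to [batch, channel, time, height, width]
        return self.block(x)
```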
The feature fusion module concatenates the motion trend features and the spatio-temporal features above to obtain video-level features with a final size of [1024,4,7,7].
The video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7; the global pooling layer is average pooling with a stride of 1, yielding features of size [1024,1,1,1].
A Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
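A minimal sketch of the feature fusion module and the classification head, assuming PyTorch. Note that the text specifies a 1×7×7 pooling kernel but an output of size [1024,1,1,1]; this sketch uses global average pooling over all three dimensions so as to match the stated output size, which is an assumption.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Concatenate motion trend and spatio-temporal features ([B, 512, 4, 7, 7] each)
    into [B, 1024, 4, 7, 7], pool to [B, 1024, 1, 1, 1], then Flatten -> Dropout(0.3)
    -> fully connected classification layer."""

    def __init__(self, num_classes: int, channels: int = 1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(output_size=(1, 1, 1))
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, trend: torch.Tensor, spatiotemporal: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([trend, spatiotemporal], dim=1)  # [B, 1024, 4, 7, 7]
        pooled = self.pool(fused)                          # [B, 1024, 1, 1, 1]
        flat = torch.flatten(pooled, start_dim=1)          # [B, 1024]
        return self.fc(self.dropout(flat))
```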
4) Performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene, specifically as follows:
4.1) collecting actual scene video data, extracting video frames of the obtained video data, and making a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing parameters of all feature extraction related layers, training on the data set in the step 4.1), and applying the trained motion feature extraction model to human behavior recognition in an actual scene to obtain an accurate recognition result.
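A minimal sketch of the fine-tuning in step 4.2), assuming PyTorch; it also assumes the model exposes its classification layer as an `fc` attribute, which is a naming assumption rather than something prescribed by the invention.

```python
import torch.nn as nn
import torch.optim as optim


def prepare_for_finetuning(model: nn.Module, num_target_classes: int, lr: float = 1e-3):
    """Freeze all feature-extraction layers and train only a new classification head."""
    for param in model.parameters():
        param.requires_grad = False                                      # freeze feature-extraction layers
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)      # new head for the actual scene
    optimizer = optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)   # only the new head is updated
    return model, optimizer
```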
In conclusion, the invention provides a human behavior recognition method based on motion trend features that effectively reduces data redundancy, captures long-range temporal information, and enhances the expression of the motion trend of behaviors; it retains the complete static image semantic information of the sampled video frames while capturing the dynamic motion features of behaviors, which facilitates the recognition of human behaviors, and some of its modules can be flexibly replaced according to the scene. The method therefore has broad research and practical application value and is worth popularizing.
The above-mentioned embodiments are merely preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall be covered within the protection scope of the present invention.

Claims (5)

Translated from Chinese
1. A human behavior recognition method based on motion trend features, characterized by comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to create a data set;
3) constructing a motion trend feature extraction model, and using the motion trend feature extraction model to perform feature extraction and recognition on the data set created in step 2) to realize model training; wherein the motion trend feature extraction model improves on the ECO model structure, the improvements being that the calculation of motion features is added, the ECO model's extraction of spatio-temporal features is modified into the extraction of motion trend features, and a feature fusion module is added, the purpose of the improvements being to simultaneously strengthen the recognition of behaviors that require inference and behaviors that require no inference;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to the human behavior recognition task in the actual scene to complete behavior recognition.
2. The human behavior recognition method based on motion trend features according to claim 1, characterized in that in step 1), public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained public open-source video data sets are organized according to a customized file structure standard, wherein the first-level folder names of the HMDB51 and UCF101 video data sets are the categories to which the human behaviors belong, each folder contains the videos of that category in .avi format, and the Jester video data set is a collection of videos in .webm format named by video serial number.
3. The human behavior recognition method based on motion trend features according to claim 1, characterized in that step 2) comprises the following steps:
2.1) traversing all video files under the folder of each video data set, extracting the video frames of each video with OpenCV to obtain a video frame data set, and counting the number of frames of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split file provided by the official website of each video data set, and saving the training set and validation set information in a file, each line of which is a tuple containing the video frame folder address, the number of video frames and the category corresponding to the video.
4. The human behavior recognition method based on motion trend features according to claim 1, characterized in that in step 3), the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer, the details being as follows:
the video frame preprocessing module performs the following operations: sparsely sampling all video frames from the same video, dividing them evenly into 16 segments by frame count, randomly selecting 1 frame from each segment for a total of 16 sampled frames, and center-cropping these 16 frames to obtain 16 images of 224×224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that uses a weight-sharing mechanism to extract spatial semantic features from the above 16 frames; the backbone network of this module adopts the same structure as the ECO model: it first contains two convolutional layers, each followed by a pooling layer, the two convolutional layers being 7×7 and 3×3 and the two pooling layers being 3×3 max pooling, followed by three inception layers whose output channel numbers are 256, 320 and 96 respectively; the 16 images pass through this backbone to yield 16 features of size [96,28,28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operations: for the spatial semantic features obtained above, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the feature value at row x, column y of the feature map F_{nk} of the k-th channel of the spatial semantic features at time step n; multiple feature differences D_{(n-1)k} are obtained, and these feature differences form stacked motion features of size [96,15,28,28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the above motion features; the backbone network of this module is part of a 3D-ResNet-18 network and contains 6 3D convolutions, all with 3×3×3 kernels, each followed by 1 BN3D layer and 1 ReLU layer; motion trend features of size [512,4,7,7] are extracted through this module;
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the above spatial semantic features; this module consists of two layers of 3D convolution, both with 512 output channels, 3×3×3 kernels, padding 1 and stride 2, each followed by 1 BN3D layer and 1 ReLU layer; spatio-temporal features of size [512,4,7,7] are extracted through this module;
the feature fusion module concatenates the above motion trend features and spatio-temporal features to obtain video-level features with a final size of [1024,4,7,7];
the video-level features of size [1024,4,7,7] are passed through a global pooling layer with a kernel size of 1×7×7, the global pooling layer being average pooling with a stride of 1, yielding features of size [1024,1,1,1];
a Flatten operation is performed on the features of size [1024,1,1,1], followed by a Dropout(0.3) layer, and finally classification is performed through a fully connected layer.
5. The human behavior recognition method based on motion trend features according to claim 1, characterized in that step 4) comprises the following steps:
4.1) collecting actual scene video data, extracting video frames from the obtained video data, and creating a data set;
4.2) fine-tuning the trained motion trend feature extraction model, freezing the parameters of all feature-extraction-related layers, training on the data set of step 4.1), and applying the trained motion feature extraction model to human behavior recognition in the actual scene to obtain accurate recognition results.
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features | Expired - Fee Related | CN113313030B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110597647.7A | 2021-05-31 | 2021-05-31 | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B)

Publications (2)

Publication Number | Publication Date
CN113313030A | 2021-08-27
CN113313030B | 2023-02-14

Family

ID=77376213

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110597647.7A | Human Behavior Recognition Method Based on Movement Trend Features (CN113313030B, Expired - Fee Related) | 2021-05-31 | 2021-05-31

Country Status (1)

Country | Link
CN | CN113313030B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104143074A (en)* | 2013-05-07 | 2014-11-12 | 李东舸 | Method and equipment for generating motion feature codes on the basis of motion feature information
EP3284013A1 (en)* | 2015-04-16 | 2018-02-21 | University of Essex Enterprises Limited | Event detection and summarisation
US20170220854A1 (en)* | 2016-01-29 | 2017-08-03 | Conduent Business Services, Llc | Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
CN107609460A (en)* | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN108108699A (en)* | 2017-12-25 | 2018-06-01 | 重庆邮电大学 | Merge deep neural network model and the human motion recognition method of binary system Hash
CN111382677A (en)* | 2020-02-25 | 2020-07-07 | 华南理工大学 | Human action recognition method and system based on 3D attention residual model
CN111680618A (en)* | 2020-06-04 | 2020-09-18 | 西安邮电大学 | Dynamic gesture recognition method, storage medium and device based on video data characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BANGLI LIU ET AL: "Human-Human Interaction recognition based on spatial and motion trend feature", IEEE International Conference on Image Processing *
张爱辉 et al.: "Improvement of PCRM and its application in human behavior recognition" (PCRM的改进及其在人体行为识别中的应用), 《计算机工程与设计》 (Computer Engineering and Design) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113963315A (en)* | 2021-11-16 | 2022-01-21 | 重庆邮电大学 | A method and system for real-time video multi-person behavior recognition in complex scenes
CN114171032A (en)* | 2021-11-24 | 2022-03-11 | 厦门快商通科技股份有限公司 | Cross-channel voiceprint model training method, identification method, device and readable medium
CN114973107A (en)* | 2022-06-24 | 2022-08-30 | 山东省人工智能研究院 | Unsupervised cross-domain video action identification method based on multi-discriminator cooperation and strong and weak sharing mechanism

Also Published As

Publication number | Publication date
CN113313030B (en) | 2023-02-14

Similar Documents

Publication | Title
CN112541503B (en) | Real-time semantic segmentation method based on context attention mechanism and information fusion
CN112507898B (en) | A Multimodal Dynamic Gesture Recognition Method Based on Lightweight 3D Residual Network and TCN
CN111563508B (en) | Semantic segmentation method based on spatial information fusion
CN111046821B (en) | Video behavior recognition method and system and electronic equipment
JP2023003026A (en) | Method for identifying rural village area classified garbage based on deep learning
CN115116054B (en) | A method for identifying pests and diseases based on multi-scale lightweight networks
CN110046550B (en) | Pedestrian attribute recognition system and method based on multi-layer feature learning
CN110569814B (en) | Video category identification method, device, computer equipment and computer storage medium
CN111382677B (en) | Human behavior recognition method and system based on 3D attention residual error model
CN111563507A (en) | Indoor scene semantic segmentation method based on convolutional neural network
CN109300121A (en) | Method and system for constructing a diagnostic model of cardiovascular disease and the diagnostic model
CN113313030A (en) | Human behavior identification method based on motion trend characteristics
CN109993269B (en) | Single image crowd counting method based on attention mechanism
CN113283400B (en) | Skeleton action identification method based on selective hypergraph convolutional network
CN110532911B (en) | Covariance measurement driven small sample GIF short video emotion recognition method and system
CN113705293B (en) | Image scene recognition method, device, equipment and readable storage medium
CN116524596B (en) | Sports video action recognition method based on action granularity grouping structure
CN112528058B (en) | Fine-grained image classification method based on image attribute active learning
CN116229323A (en) | Human body behavior recognition method based on improved depth residual error network
CN117689731B (en) | Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN114694174A (en) | A human interaction behavior recognition method based on spatiotemporal graph convolution
CN116189281B (en) | End-to-end human behavior classification method and system based on spatiotemporal adaptive fusion
CN117011943A (en) | Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN114513653B (en) | Video processing method, device, equipment, computer program product and storage medium
CN119964244A (en) | PSC-TNet video action recognition method based on fusion of spatial features and frame difference information

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2023-02-14)
