Technical Field
The present invention relates to the fields of computer vision and deep learning, and in particular to a behavior recognition method and system that performs spatio-temporal dual-branch fusion by means of three-dimensional human body recovery.
Background Art
Behavior recognition refers to the visual research of automatically analyzing, judging and classifying human behavior in images or videos. With the recent popularization of VR games and of applications such as sports analysis and visual surveillance, behavior recognition has become one of the most active research fields. The intelligent behavior recognition algorithms based on deep learning that are commonly used at present offer fast response and high processing efficiency, greatly save manpower and time, and are widely applied in intelligent security, medical rescue, teaching analysis and other fields.
The RGB modality is widely used in action recognition because it is easy to collect, but RGB data places high demands on background, scale, illumination and other conditions. As neural networks grow deeper, the 2D information extracted from video is not sufficient to support full 3D vision; the lack of 3D information leads to one-sided information and misclassification, which hinders rapid deployment in practice. Techniques for supplementing 3D information have therefore emerged, which rely on the original data and enrich the feature details of the network by extracting spatial information. Traditional methods based on hand-crafted features mainly start from RGB images, manually extract common useful features such as texture and gradient as well as salient features of the human body itself, and then use boosting, SVMs and probabilistic models to recognize human behavior.
In recent years, many classical methods based on supplementary 3D information have appeared. Representative ones include: methods based on 2D two-stream networks, which mainly use video frames and labeled optical flow on two 2D CNNs to extract motion features; methods based on recurrent neural networks (RNNs), which pay more attention to the changing parts of features across sequence frames and mainly combine RNNs with CNNs to learn more global 3D information; and methods based on 3D convolutional neural networks (3D CNNs), which convolve the entire video sequence as a whole and use 3D convolution kernels to obtain the spatio-temporal features of the video. These methods provide new ways of extracting 3D information for behavior recognition. Compared with traditional feature extraction, the extended methods are richer and help improve the recognition accuracy of the network.
All of these methods optimize the algorithm by providing more 3D information, but their understanding of the temporal structure of long-term motion and of the relationship between frames is insufficient, so the behavior information contained in higher- and lower-level features is easily lost. The extraction of spatial features also needs to be more accurate in order to further reduce the influence of the background on the human body; otherwise, joint misalignment and pseudo-3D imaging from a limited viewpoint may occur. How to use datasets to supplement 3D information and better fuse behavior information has therefore become an urgent problem to be solved.
Summary of the Invention
In order to solve the above technical problems, the present invention provides a behavior recognition method and system that performs spatio-temporal dual-branch fusion by means of three-dimensional human body recovery.
The technical solution of the present invention is a behavior recognition method for spatio-temporal dual-branch fusion using three-dimensional human body recovery, comprising:
Step S1: for a video in the original behavior recognition dataset, input a set I of consecutive frame images of fixed frame length into a three-dimensional reconstruction network; first use a pose normalization module to generate a depth map and pose keypoints, then use the pose keypoints to adjust the human pose of the depth map to obtain a pose-normalized depth map; input the pose-normalized depth map into a key pose extraction module, extract key pose features with an attention-based local perception module, and fuse them with the pose-normalized depth map to obtain a human pose model; adjust the position of the human body in the human pose model so that the body lies at the center of the image, obtaining an adjusted human pose model; stack the adjusted human pose models in order according to the frame length to obtain a three-dimensional human body overlay map;
Step S2: input the three-dimensional human body overlay map into a spatial extraction network based on deep residual fusion and pose information attention; first use a two-dimensional pose processing module to obtain pose appearance features, then pass the pose appearance features through a deep residual fusion module and a spatial-domain pose attention module to capture the features of important regions of the overall spatial pose and obtain highly interpretable spatial pose features; finally, pass the spatial pose features through a two-dimensional prediction mapping module to obtain a three-dimensional pose feature vector F_S;
Step S3: select key frames (F, S) of each video from the consecutive frame image set I as a training sample group, where F denotes the input frame length and S denotes the key frame image set selected from I; input (F, S) into a temporal extraction network based on cross-domain deep residuals and trajectory information attention; first use a three-dimensional trajectory processing module to extract trajectory appearance features of the training sample group, then pass the trajectory appearance features through a cross-domain deep residual module and a temporal-domain trajectory attention module to capture trajectory change information in the temporal domain and obtain temporal trajectory features; finally, input the temporal trajectory features into a three-dimensional prediction mapping module to obtain a cross-domain trajectory feature vector F_T;
Step S4: input S into the spatial extraction network based on deep residual fusion and pose information attention and into the temporal extraction network based on cross-domain deep residuals and trajectory information attention to obtain the pose feature vector F_S and the trajectory feature vector F_T, respectively; input them into an adaptive spatio-temporal fusion understanding module to obtain a behavior fusion feature F_Fusion, from which the behavior recognition result of S on the labeled image set G_T corresponding to S can be obtained.
Compared with the prior art, the present invention has the following advantages:
1. The present invention discloses a behavior recognition method for spatio-temporal dual-branch fusion using three-dimensional human body recovery. With a pose stacking scheme based on pose keypoints and attention-based local perception, a keypoint-based adjustment of the depth map is designed to normalize the human pose, and a feature fusion method using attention-based local perception is proposed to supplement detailed features of local human poses. The poses are stacked in order according to the frame length, which enriches multi-view pose information in the spatial domain and provides abundant three-dimensional pose cues for spatial pose extraction and behavior recognition.
2. The three-dimensional human body overlay map designed by the present invention enhances pose information in the spatial domain. Taking the pose appearance features extracted from the human body overlay map as input, the spatial extraction network based on deep residual fusion and pose information attention outputs features that make the three-dimensional overlay maps close to each other under the same behavior and distinct between different behaviors. The spatial-domain pose attention module enriches high-dimensional feature information, alleviates pseudo-3D imaging from local viewpoints, and realizes recognition of deep pose representations under multiple viewpoints.
3. The present invention constructs trajectory appearance features from sequence sample groups, learns the dynamic correlation features of sequences through the temporal extraction network based on cross-domain deep residuals and trajectory information attention, and captures trajectory change information in the temporal domain. The temporal-domain trajectory attention module filters out redundant information of trajectory changes between consecutive frames and improves the action recognition accuracy for pose changes.
4. After obtaining the pose features and trajectory features, the present invention inputs them into the adaptive spatio-temporal fusion understanding module, which outputs fusion features that efficiently fuse high- and low-dimensional features and remove redundant information, so that the behavior of the pose during trajectory changes can be understood. During network training, the feature participation of the two branches is adjusted adaptively and the branch weights are changed dynamically, reducing unnecessary interference information and improving the action recognition performance.
Brief Description of the Drawings
Fig. 1 is a flow chart of a behavior recognition method for spatio-temporal dual-branch fusion using three-dimensional human body recovery in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of the three-dimensional reconstruction network in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the process of human pose stacking based on three-dimensional recovery in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the structure of the spatial extraction network based on deep residual fusion and pose information attention in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the structure of the temporal extraction network based on cross-domain deep residuals and trajectory information attention in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the structure of the adaptive spatio-temporal fusion understanding module in an embodiment of the present invention;
Fig. 7 is a structural block diagram of a behavior recognition system for spatio-temporal dual-branch fusion using three-dimensional human body recovery in an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention provides a behavior recognition method for spatio-temporal dual-branch fusion using three-dimensional human body recovery, which can improve the detection accuracy of human actions.
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below through specific embodiments and with reference to the accompanying drawings.
Embodiment 1
As shown in Fig. 1, an embodiment of the present invention provides a behavior recognition method for spatio-temporal dual-branch fusion using three-dimensional human body recovery, comprising the following steps:
Step S1: for a video in the original behavior recognition dataset, input a set I of consecutive frame images of fixed frame length into the three-dimensional reconstruction network; first use the pose normalization module to generate a depth map and pose keypoints, then use the pose keypoints to adjust the human pose of the depth map to obtain a pose-normalized depth map; input the pose-normalized depth map into the key pose extraction module, extract key pose features with the attention-based local perception module and fuse them with the pose-normalized depth map to obtain a human pose model; adjust the position of the human body in the human pose model so that the body lies at the center of the image, obtaining an adjusted human pose model; stack the adjusted human pose models in order according to the frame length to obtain a three-dimensional human body overlay map;
Step S2: input the three-dimensional human body overlay map into the spatial extraction network based on deep residual fusion and pose information attention; first use the two-dimensional pose processing module to obtain pose appearance features, then pass the pose appearance features through the deep residual fusion module and the spatial-domain pose attention module to capture the features of important regions of the overall spatial pose and obtain highly interpretable spatial pose features; finally, pass the spatial pose features through the two-dimensional prediction mapping module to obtain the three-dimensional pose feature vector F_S;
Step S3: select key frames (F, S) of each video from the consecutive frame image set I as a training sample group, where F denotes the input frame length and S denotes the key frame image set selected from I; input (F, S) into the temporal extraction network based on cross-domain deep residuals and trajectory information attention; first use the three-dimensional trajectory processing module to extract trajectory appearance features of the training sample group, then pass the trajectory appearance features through the cross-domain deep residual module and the temporal-domain trajectory attention module to capture trajectory change information in the temporal domain and obtain temporal trajectory features; finally, input the temporal trajectory features into the three-dimensional prediction mapping module to obtain the cross-domain trajectory feature vector F_T;
Step S4: input S into the spatial extraction network based on deep residual fusion and pose information attention and into the temporal extraction network based on cross-domain deep residuals and trajectory information attention to obtain the pose feature vector F_S and the trajectory feature vector F_T, respectively; input them into the adaptive spatio-temporal fusion understanding module to obtain the behavior fusion feature F_Fusion, from which the behavior recognition result of S on the labeled image set G_T corresponding to S can be obtained.
In one embodiment, the above step S1, in which a set I of consecutive frame images of fixed frame length from a video in the original behavior recognition dataset is fed into the three-dimensional reconstruction network, the pose normalization module first generates a depth map and pose keypoints, the pose keypoints are then used to adjust the human pose of the depth map to obtain a pose-normalized depth map, the pose-normalized depth map is input into the key pose extraction module, key pose features are extracted with the attention-based local perception module and fused with the pose-normalized depth map to obtain a human pose model, the position of the human body in the human pose model is adjusted so that the body lies at the center of the image to obtain an adjusted human pose model, and the adjusted human pose models are stacked in order according to the frame length to obtain a three-dimensional human body overlay map, specifically comprises:
Step S11: construct the three-dimensional reconstruction network, which comprises a pose normalization module and a key pose extraction module; input the consecutive frame image set I into the three-dimensional reconstruction network, first use the pose normalization module to generate a depth map and pose keypoints, and then use the pose keypoints to adjust the human pose of the depth map to obtain a pose-normalized depth map D;
In this step, normalizing the pose through the keypoint depth map strengthens the consistency of the subsequent pose stacking;
Step S12: input D into the key pose extraction module, use the attention-based local perception module to extract the key pose feature D_xy, and then fuse D_xy with D to obtain the human pose model F_att:
$D_{xy} = \mathrm{SM}\big((f_x D)^{T} \cdot f_y D\big)$    (1)
$F_{att} = \beta \sum (D_{xy} \cdot f_z D) + D$    (2)
In formula (1), f_x D and f_y D denote feature extractors with different degrees of attention, and SM denotes the softmax function;
In formula (2), Σ denotes aggregation along the feature dimension, β denotes the feature adjustment rate during training, whose initial value is set to 0, and "+" denotes feature superposition; D denotes the pose-normalized depth map;
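Read together, formulas (1) and (2) act like a self-attention block over spatial positions of the pose-normalized depth map, with a zero-initialized residual scale β. The following is a minimal PyTorch sketch of this reading; the module name, the choice of 1×1 convolutions for f_x, f_y and f_z, and the channel reduction ratio are assumptions for illustration and are not prescribed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLocalPerception(nn.Module):
    """Sketch of formulas (1)-(2): D_xy = SM((f_x D)^T (f_y D)), F_att = beta * sum(D_xy * f_z D) + D."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # f_x, f_y, f_z realized here as 1x1 convolutions (an assumption, not specified in the text)
        self.f_x = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.f_y = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.f_z = nn.Conv2d(channels, channels, kernel_size=1)
        # beta: feature adjustment rate, initialized to 0 as stated after formula (2)
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        b, c, h, w = d.shape
        q = self.f_x(d).flatten(2)                                   # (B, C', H*W)
        k = self.f_y(d).flatten(2)                                   # (B, C', H*W)
        v = self.f_z(d).flatten(2)                                   # (B, C,  H*W)
        # formula (1): pairwise attention over spatial positions
        d_xy = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)    # (B, H*W, H*W)
        # formula (2): aggregate values with the attention map and add the residual depth map
        out = torch.bmm(v, d_xy.transpose(1, 2)).view(b, c, h, w)
        return self.beta * out + d
```

For a pose-normalized depth feature map d of shape (B, C, H, W), AttentionLocalPerception(C)(d) returns a tensor of the same shape, corresponding to F_att.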
Fig. 2 is a schematic diagram of the structure of the three-dimensional reconstruction network. The network processes the depth map with two branches: one branch uses the human pose keypoints to normalize the pose position of the depth map, and the other branch uses attention-based local perception to extract pose features from the pose-normalized depth map. The two branches are combined by feature fusion to supplement local pose information of the human body, providing effective preprocessing for pose stacking;
Step S13: place the person at the center of the human pose model and unify the image size to obtain the adjusted human pose model; stack the adjusted human pose models in order according to the input frame length to obtain the three-dimensional human body overlay map P.
Fig. 3 shows a schematic diagram of the process of human pose stacking based on three-dimensional recovery.
The embodiment of the present invention designs a new data format for behavior recognition, the three-dimensional human body overlay map P, which, similarly to optical flow data, enables the network to learn spatial pose features. However, optical flow data is difficult to annotate and time-consuming to compute. The present invention stacks the human pose maps in order according to the frame length around the centered figure; the resulting overlay maps have similar features under the same behavior and differ between different behaviors, and capture multi-view 3D pose information, so that the network can learn more information and the accuracy of the network model is improved.
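As an illustration of step S13, the sketch below centers each pose-model frame on the person, resizes it to a unified size and stacks the frames in order along a new leading axis. The use of a foreground-mask bounding box and bilinear resizing is an assumption, since the text does not fix the centering or resizing operations.

```python
import numpy as np
import cv2  # assumed available for resizing; any image library would serve the same purpose


def center_on_person(frame: np.ndarray, mask: np.ndarray, out_size: int = 224) -> np.ndarray:
    """Crop the frame around the person (given a foreground mask) and resize to a unified size."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:                      # no person detected: fall back to the full frame
        crop = frame
    else:
        x0, x1 = xs.min(), xs.max() + 1
        y0, y1 = ys.min(), ys.max() + 1
        crop = frame[y0:y1, x0:x1]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)


def build_overlay_map(pose_frames, person_masks, frame_length: int, out_size: int = 224) -> np.ndarray:
    """Stack `frame_length` centered pose-model frames in temporal order into a 3D overlay map P."""
    centered = [
        center_on_person(f, m, out_size)
        for f, m in zip(pose_frames[:frame_length], person_masks[:frame_length])
    ]
    # ordered stacking along a new leading axis: shape (frame_length, out_size, out_size[, C])
    return np.stack(centered, axis=0)
```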
In one embodiment, the above step S2, in which the three-dimensional human body overlay map is input into the spatial extraction network based on deep residual fusion and pose information attention, the two-dimensional pose processing module is first used to obtain pose appearance features, the pose appearance features are then passed through the deep residual fusion module and the spatial-domain pose attention module to capture the features of important regions of the overall spatial pose and obtain highly interpretable spatial pose features, and finally the spatial pose features are passed through the two-dimensional prediction mapping module to obtain the three-dimensional pose feature vector F_S, specifically comprises:
Step S21: construct the spatial extraction network based on deep residual fusion and pose information attention, which comprises a two-dimensional pose processing module, a deep residual fusion module, a spatial-domain pose attention module and a two-dimensional prediction mapping module; first use the two-dimensional pose processing module to generate coarse-grained pose features from the three-dimensional human body overlay map P, and then use the edges of the coarse-grained pose features to extract sharp pose appearance features;
Fig. 4 shows a schematic diagram of the structure of the spatial extraction network based on deep residual fusion and pose information attention;
Step S22: input the pose appearance features into the deep residual fusion module, and generate the feature F_Resp carrying spatial pose information through the pose encoder, as shown in formula (3); the pose encoder comprises multiple layers of residual modules with feature skip-connection fusion, in which the residual feature of each layer is fused by skip connection with the residual feature output by the previous layer, and the resulting output serves as the input of the residual module of the next layer;
where Res() denotes a residual module applied to the feature of the i-th downsampling block of the pose encoder, and F_Resp denotes the final feature output obtained by layer-wise feature skip-connection fusion across the residual modules;
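A minimal sketch of the layer-wise skip-connection fusion of step S22 is given below, assuming simple two-convolution residual stages and a strided 1×1 projection so that the previous layer's output matches the current residual feature; these layer choices are illustrative, as the text only specifies that each layer's residual feature is fused with the previous layer's output before entering the next residual module.

```python
import torch
import torch.nn as nn


class ResidualSkipFusionEncoder(nn.Module):
    """Pose encoder sketch: each stage's residual feature is fused with the previous stage's output."""

    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        self.proj = nn.ModuleList()   # strided 1x1 projections so the skip path matches the residual feature
        in_ch = channels[0]
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
            ))
            self.proj.append(nn.Conv2d(in_ch, out_ch, 1, stride=2))
            in_ch = out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for stage, proj in zip(self.stages, self.proj):
            res = stage(x)               # residual feature of the current layer
            skip = proj(x)               # previous layer's output, projected to the matching shape
            x = torch.relu(res + skip)   # skip-connection fusion; becomes the next layer's input
        return x                         # F_Resp: final fused feature output
```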
Step S23: input F_Resp into the spatial-domain pose attention module to obtain the spatial pose feature F_Space, as shown in formulas (4) to (6):
$\mathrm{Att}(F_{Resp}) = \sigma\big(f_c(F_{merge})\big)$    (5)
In formula (4), Att() denotes the spatial-domain pose attention computation, and F_Space denotes the attention residual fusion applied to F_Resp;
In formula (5), f_c() denotes feature restoration of the concatenated dimension, and Att(F_Resp) captures the locally changing pose information in the three-dimensional human body overlay map;
In formula (6), A_i denotes capturing the point of attention i within a fixed region, Max() denotes capturing the largest point of attention in the fixed region, Σ denotes smoothing the fixed-region features, and the concatenation operator denotes feature concatenation;
The embodiment of the present invention constructs the spatial extraction network based on deep residual fusion and pose information attention for the three-dimensional human pose overlay map. A deep residual fusion module is designed, which fuses the residual features of adjacent residual modules to extract texture features between poses, and the spatial-domain pose attention module is applied to the final residual fusion feature; the extracted features are aggregated with the final residual fusion feature, the key poses of the overall space are extracted, and highly interpretable spatial pose features are obtained, so that when the extracted pose information is fused with the trajectory information, the locally changing parts of the motion can be better captured.
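Read together, formulas (4) to (6) describe a spatial attention of the channel-pooling kind: the maximum and the smoothed (averaged) responses along the channel dimension are concatenated into F_merge, restored by f_c, passed through a sigmoid, and fused back into F_Resp as a residual. The sketch below follows that reading; realizing f_c as a 7×7 convolution is an assumption borrowed from common spatial-attention designs, not something stated in the text.

```python
import torch
import torch.nn as nn


class SpatialPoseAttention(nn.Module):
    """Sketch of formulas (4)-(6): attention residual fusion over F_Resp."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # f_c(): restores the 2-channel concatenation to a single-channel attention map (assumed 7x7 conv)
        self.f_c = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_resp: torch.Tensor) -> torch.Tensor:
        max_pool, _ = f_resp.max(dim=1, keepdim=True)     # Max(): strongest response at each position
        avg_pool = f_resp.mean(dim=1, keepdim=True)       # smoothed fixed-region response
        f_merge = torch.cat([max_pool, avg_pool], dim=1)  # feature concatenation, formula (6)
        att = self.sigmoid(self.f_c(f_merge))             # Att(F_Resp) = sigma(f_c(F_merge)), formula (5)
        return f_resp + att * f_resp                      # attention residual fusion -> F_Space, formula (4)
```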
Step S24: convert F_Space into the three-dimensional pose feature vector F_S through the two-dimensional prediction mapping module, where F_S has a dimension equal to the number of predicted categories.
By training on the three-dimensional human body overlay map, the present invention designs a dedicated extraction scheme that exploits the property of the overlay map that features are close under the same behavior and differ between different behaviors, so that the network makes effective use of the stacked pose viewpoints while extracting high-dimensional features, alleviates the difficulty of distinguishing similar behavior poses in 2D, and improves the network's ability to extract features of the changing human pose.
In one embodiment, the above step S3, in which the key frames (F, S) of each video are selected from the consecutive frame image set I as a training sample group, F denotes the input frame length, S denotes the key frame image set selected from I, (F, S) is input into the temporal extraction network based on cross-domain deep residuals and trajectory information attention, the three-dimensional trajectory processing module is first used to extract trajectory appearance features of the training sample group, the trajectory appearance features are then passed through the cross-domain deep residual module and the temporal-domain trajectory attention module to capture trajectory change information in the temporal domain and obtain temporal trajectory features, and finally the temporal trajectory features are input into the three-dimensional prediction mapping module to obtain the cross-domain trajectory feature vector F_T, specifically comprises:
Step S31: construct the temporal extraction network based on cross-domain deep residuals and trajectory information attention, which comprises a three-dimensional trajectory processing module, a cross-domain deep residual module, a temporal-domain trajectory attention module and a three-dimensional prediction mapping module;
Select the key frames (F, S) of each video from the consecutive frame image set I as the training sample group, where F denotes the input frame length and S denotes the key frame image set selected from I; apply the three-dimensional trajectory processing module to (F, S) to generate sequence correlation features, adjust cross-domain edge features, and generate trajectory appearance features;
Fig. 5 shows a schematic diagram of the structure of the temporal extraction network based on cross-domain deep residuals and trajectory information attention;
Step S32: input the trajectory appearance features into the cross-domain deep residual module, and generate the cross-temporal trajectory information feature F_t through the trajectory encoder;
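The trajectory encoder of step S32 is not detailed in the text; as one plausible reading, the sketch below encodes the stacked trajectory appearance features with 3D convolutions so that the output F_t spans both the temporal and spatial domains. All layer choices here are assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Illustrative cross-temporal trajectory encoder: input (B, C, T, H, W) -> feature F_t."""

    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, base, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.BatchNorm3d(base),
            nn.ReLU(inplace=True),
            nn.Conv3d(base, 2 * base, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(2 * base),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)   # F_t: cross-temporal trajectory information feature
```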
Step S33: input F_t into the temporal-domain trajectory attention module to obtain the temporal trajectory feature F_tmerge, as shown in formulas (7) to (8);
In formula (7), global feature fusion is applied to the cross-temporal trajectory information feature F_t of the current c-th frame; H and W denote the height and width of the current frame image, respectively;
In formula (8), the dependencies between channel configurations are learned twice using W_1 and W_2, where W_1 assigns a weight value to each feature of the fused result and W_2 further assigns weight values to the weighted features; σ() denotes feature weighting, and the resulting weights are applied to the fused feature of formula (7);
In the embodiment of the present invention, a temporal dimension is added to the temporal extraction network based on cross-domain deep residuals and trajectory information attention to extract features of the local pose-change trajectories. The cross-temporal trajectory information features are input into the temporal-domain trajectory attention module, which performs global fusion on the features and learns the dependencies between channel configurations, filters out redundant information of trajectory changes between consecutive frames, and improves the action recognition accuracy for pose changes.
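Under this description, formulas (7) and (8) can be read as a squeeze-and-excitation style attention applied per frame: global average fusion over H×W (formula (7)), followed by two weight layers W_1 and W_2 and a sigmoid σ() whose output re-weights the fused feature (formula (8)). The sketch below follows this reading; the reduction ratio and the exact tensor layout are assumptions.

```python
import torch
import torch.nn as nn


class TemporalTrajectoryAttention(nn.Module):
    """Sketch of formulas (7)-(8) for F_t with layout (B, C, T, H, W)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // reduction)   # W_1: first weight assignment
        self.w2 = nn.Linear(channels // reduction, channels)   # W_2: second weight assignment
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_t: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = f_t.shape
        # formula (7): global feature fusion over H x W for each frame (1/(H*W) * sum)
        fused = f_t.mean(dim=(3, 4))                                                  # (B, C, T)
        # formula (8): learn channel dependencies twice, then weight the fused feature
        weights = self.sigmoid(self.w2(torch.relu(self.w1(fused.transpose(1, 2)))))  # (B, T, C)
        weights = weights.transpose(1, 2).view(b, c, t, 1, 1)
        return f_t * weights                                                          # F_tmerge
```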
Step S34: input F_tmerge into the three-dimensional prediction mapping module to generate the cross-domain trajectory feature vector F_T, whose dimension is the same as that of F_S.
The present invention constructs trajectory appearance features from sequence sample groups and learns the dynamic correlation features of sequences through the temporal extraction network based on cross-domain deep residuals and trajectory information attention, extending the controllable range of extraction to the temporal domain, enriching the learned motion information, capturing trajectory change information in the temporal domain, and improving the action recognition accuracy for pose changes.
In one embodiment, the above step S4, in which S is input into the spatial extraction network based on deep residual fusion and pose information attention and into the temporal extraction network based on cross-domain deep residuals and trajectory information attention to obtain the pose feature vector F_S and the trajectory feature vector F_T, respectively, which are then input into the adaptive spatio-temporal fusion understanding module to obtain the behavior fusion feature F_Fusion, from which the behavior recognition result of S on the labeled image set G_T corresponding to S is obtained, specifically comprises:
Step S41: input S into the spatial extraction network based on deep residual fusion and pose information attention to obtain the three-dimensional pose feature vector F_S, and into the temporal extraction network based on cross-domain deep residuals and trajectory information attention to obtain the cross-domain trajectory feature vector F_T;
Input F_S and F_T into the adaptive spatio-temporal fusion understanding module, as shown in formulas (9) to (10), to understand the behavior of the pose during trajectory changes and obtain the behavior fusion feature F_Fusion;
$F_{Fusion} = f(F_T, \omega_1) + f(F_S, \omega_2)$    (9)
$\omega_2 = 1 - \omega_1$    (10)
In formula (9), f() denotes a weight allocator, ω_1 and ω_2 denote the feature participation degrees of the two branches F_S and F_T, and their initial values are set in the first iteration;
The embodiment of the present invention designs the adaptive spatio-temporal fusion understanding module to understand the behavior of the pose during trajectory changes; the dynamic loss obtained on batches of data samples is used to adjust the feature participation of the two network branches, implicitly realizing the complementarity of features in different spaces.
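A minimal sketch of formulas (9) and (10) is given below, keeping the branch participation ω_1 in (0, 1) through a sigmoid over a learnable parameter so that it can be updated by the training loss, with ω_2 = 1 - ω_1. Treating f() as a simple scalar weighting of each branch is an assumption; the text only calls it a weight allocator.

```python
import torch
import torch.nn as nn


class AdaptiveSpatioTemporalFusion(nn.Module):
    """Sketch of formulas (9)-(10): F_Fusion = f(F_T, w1) + f(F_S, w2), with w2 = 1 - w1."""

    def __init__(self, init_w1: float = 0.5):
        super().__init__()
        # store the logit of the initial participation so that w1 stays in (0, 1) during training
        self.w1_logit = nn.Parameter(torch.logit(torch.tensor(init_w1)))

    def forward(self, f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        w1 = torch.sigmoid(self.w1_logit)    # feature participation of one branch
        w2 = 1.0 - w1                        # formula (10)
        return w1 * f_t + w2 * f_s           # formula (9), with f() as scalar weighting
```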
Step S42: iteratively update ω_1 and ω_2 during network training to adjust the feature participation of the two network branches, select the optimal result of the iterations as the category result on S, and compare the category result with the labels of the labeled image set G_T corresponding to S to obtain the final prediction result.
Fig. 6 shows a schematic diagram of the structure of the adaptive spatio-temporal fusion understanding module.
After obtaining the pose features and trajectory features, the embodiment of the present invention inputs them into the adaptive spatio-temporal fusion understanding module, which outputs fusion features that efficiently fuse high- and low-dimensional features and remove redundant information, so that the behavior of the pose during trajectory changes can be understood. During network training, the feature participation of the two branches is adjusted adaptively and the branch weights are changed dynamically, reducing unnecessary interference information and improving the action recognition performance.
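As an illustration of how the branch participation can be updated adaptively during training as described above, the sketch below backpropagates a cross-entropy loss computed against the labels of G_T through both branches and the fusion module of the previous sketch. The network, data-loader and optimizer names are placeholders, not components named by the text.

```python
import torch
import torch.nn as nn


def train_one_epoch(spatial_net, temporal_net, fusion, train_loader, optimizer):
    """One epoch of joint training; `fusion` is e.g. the AdaptiveSpatioTemporalFusion sketch above."""
    criterion = nn.CrossEntropyLoss()
    for overlay_maps, key_frames, labels in train_loader:   # labels come from the annotated set G_T
        f_s = spatial_net(overlay_maps)     # three-dimensional pose feature vector F_S
        f_t = temporal_net(key_frames)      # cross-domain trajectory feature vector F_T
        logits = fusion(f_t, f_s)           # behavior fusion feature F_Fusion, dimension = number of classes
        loss = criterion(logits, labels)    # dynamic loss on the batch
        optimizer.zero_grad()
        loss.backward()                     # gradients also update the branch participation w1 (hence w2 = 1 - w1)
        optimizer.step()
```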
Embodiment 2
As shown in Fig. 7, an embodiment of the present invention provides a behavior recognition system for spatio-temporal dual-branch fusion using three-dimensional human body recovery, comprising the following modules:
a three-dimensional human body overlay map acquisition module 51, configured to input, for a video in the original behavior recognition dataset, a set I of consecutive frame images of fixed frame length into the three-dimensional reconstruction network; first use the pose normalization module to generate a depth map and pose keypoints, then use the pose keypoints to adjust the human pose of the depth map to obtain a pose-normalized depth map; input the pose-normalized depth map into the key pose extraction module, extract key pose features with the attention-based local perception module and fuse them with the pose-normalized depth map to obtain a human pose model; adjust the position of the human body in the human pose model so that the body lies at the center of the image to obtain an adjusted human pose model; and stack the adjusted human pose models in order according to the frame length to obtain a three-dimensional human body overlay map;
a three-dimensional pose feature vector acquisition module 52, configured to input the three-dimensional human body overlay map into the spatial extraction network based on deep residual fusion and pose information attention; first use the two-dimensional pose processing module to obtain pose appearance features, then pass the pose appearance features through the deep residual fusion module and the spatial-domain pose attention module to capture the features of important regions of the overall spatial pose and obtain highly interpretable spatial pose features; and finally pass the spatial pose features through the two-dimensional prediction mapping module to obtain the three-dimensional pose feature vector F_S;
a cross-domain trajectory feature vector acquisition module 53, configured to select the key frames (F, S) of each video from the consecutive frame image set I as a training sample group, where F denotes the input frame length and S denotes the key frame image set selected from I; input (F, S) into the temporal extraction network based on cross-domain deep residuals and trajectory information attention; first use the three-dimensional trajectory processing module to extract trajectory appearance features of the training sample group, then pass the trajectory appearance features through the cross-domain deep residual module and the temporal-domain trajectory attention module to capture trajectory change information in the temporal domain and obtain temporal trajectory features; and finally input the temporal trajectory features into the three-dimensional prediction mapping module to obtain the cross-domain trajectory feature vector F_T;
a behavior recognition module 54, configured to input S into the spatial extraction network based on deep residual fusion and pose information attention and into the temporal extraction network based on cross-domain deep residuals and trajectory information attention to obtain the pose feature vector F_S and the trajectory feature vector F_T, respectively, and input them into the adaptive spatio-temporal fusion understanding module to obtain the behavior fusion feature F_Fusion, from which the behavior recognition result of S on the labeled image set G_T corresponding to S is obtained.
The above embodiments are provided only for the purpose of describing the present invention and are not intended to limit the scope of the present invention. The scope of the present invention is defined by the appended claims. Various equivalent replacements and modifications made without departing from the spirit and principles of the present invention shall fall within the scope of the present invention.