






Technical Field
The present invention relates to the field of computer vision, and in particular to a monocular three-dimensional hand gesture estimation method based on a fully convolutional neural network.
Background
The high degrees of freedom and natural, intuitive character of the human hand make it an important subject of research in intelligent human-computer interaction. Vision-based gesture interaction systems free users from dependence on interaction middleware: gestures alone are used to interact with real or virtual scenes and to operate intelligent terminals, which greatly improves continuity and convenience of use. In recent years, technology giants such as Google, Microsoft and Facebook have invested heavily in intelligent interactive wearable terminals based on augmented reality (AR), virtual reality (VR) and mixed reality (MR), and visual gesture interaction is one of the core technologies involved.
Hand gesture estimation refers to the process of predicting the positions of hand key points from hand images; monocular three-dimensional hand gesture estimation in particular requires predicting the three-dimensional positions of the hand key points in the imaging space from a single hand color image or depth map, which is a highly challenging task. As an important part of visual gesture interaction, gesture estimation helps the computer capture the relative positional relationship between the hand and other real or virtual objects in the scene, so that changes in the real scene can be analyzed and predicted, or corresponding feedback can be produced in the virtual world. Because it places low requirements on the input image and the overall method is highly flexible, monocular three-dimensional hand gesture estimation has broad application prospects and practical value.
Prior art closest to the present invention:
Monocular three-dimensional hand gesture estimation: this task requires predicting the three-dimensional spatial coordinates of hand key points from a single input hand image and is a high-level visual understanding task. The high degrees of freedom of hand poses, self-occlusion, and the three-dimensional scale ambiguity of a single image make the task difficult. Existing deep-learning-based methods can be divided by input image type into color-image-based methods and depth-map-based methods. Depth-map-based methods can exploit the depth information of the hand surface, which alleviates the scale-ambiguity problem to some extent and yields good results. Methods based on hand color images can in turn be divided, by how they are realized, into learning-based methods and model-based methods. The former design complex neural networks and train them on large amounts of image data with accurate three-dimensional key-point annotations, so that the network regresses the key-point coordinates directly from the image. Model-based methods instead introduce a parameterized hand model, use a neural network to regress the parameters of that model, and perform semi-supervised training with hand images, which can achieve a higher upper bound on accuracy.
Disadvantages of the prior art:
1. Although existing depth-map-based monocular three-dimensional hand gesture estimation methods achieve high overall accuracy, they rely on high-quality depth maps; the imaging conditions for depth maps are demanding and easily disturbed by the scene, which greatly limits the application scenarios of monocular gesture estimation.
2. Existing learning-based methods for three-dimensional hand gesture estimation from monocular color images use relatively complex network structures, some of whose operators are not compatible with mainstream deployment frameworks. They also rely heavily on training with large amounts of accurately annotated data, their results are easily affected by the scale-ambiguity problem, and their generalization ability is poor (Yang L, Li S, Lee D, et al. Aligning latent spaces for 3D hand pose estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2335-2343).
3. Existing model-based methods for three-dimensional hand gesture estimation from monocular color images rely on a neural network to regress the parameters of a hand model and are supervised only indirectly through the hand color image. This makes the overall method difficult to train, with the training result easily affected by the initialization of the model parameters and by the indirect image supervision, which increases the development difficulty of practical applications and hinders extension (Boukhayma A, Bem R, Torr P H S. 3D hand shape and pose from images in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10843-10852).
4. Existing monocular color-image three-dimensional hand gesture estimation methods consider only accuracy gains in their network design and ignore the runtime requirements of different resource budgets, so real-time operation in low-resource scenarios is difficult to achieve.
Summary of the Invention
To solve the problems that monocular three-dimensional hand gesture estimation based on hand color images is easily affected by scale ambiguity, that three-dimensional hand gesture estimation networks have complex structures and poor scalability, and that existing networks run inefficiently in real time, the present invention proposes a monocular three-dimensional hand gesture estimation method based on a fully convolutional neural network. The network of the present invention performs the task of predicting the image coordinates and relative depth values of hand key points from an input hand image: the input image first passes through successive convolutions and down-sampling operations to obtain feature maps that are semantically rich but of low spatial resolution; successive convolutions and up-sampling operations then further refine the semantic features and restore the high spatial resolution of the feature maps; finally, an output unit predicts the final two-dimensional coordinates of the key points and their relative depths from the high-resolution feature maps.
The present invention is realized by at least one of the following technical solutions.
A monocular three-dimensional hand gesture estimation method based on a fully convolutional neural network, comprising the following steps:
S1. Acquiring a hand image and preprocessing the image;
S2. Constructing a fully convolutional three-dimensional hand gesture estimation network and training it;
S3. Feeding the preprocessed image into the fully convolutional three-dimensional hand gesture estimation network to predict the final two-dimensional coordinates of the key points and the relative depth of each key point;
S4. Post-processing the predicted two-dimensional key-point coordinates and relative depths to compute the three-dimensional coordinates of the hand key points.
Further, the preprocessing includes image scaling and image padding.
Further, the fully convolutional three-dimensional hand gesture estimation network includes the following units:
an input compression unit, used to extract coarse local features from the input image while reducing its resolution;
a basic convolution unit, used to extract basic features;
a down-sampling unit, used to spatially down-sample the feature maps and enlarge their receptive field;
an up-sampling unit, used to spatially up-sample the feature maps, restore their spatial information with the help of lateral connections from the outputs of the down-sampling units, and further enrich the semantic features;
an output unit, used to extract information from the final high-resolution feature maps and predict the final two-dimensional coordinates and normalized relative depths of the hand key points.
Further, the basic convolution unit includes:
1) a 7×7 channel-by-channel (depthwise) convolution: a depthwise convolution with a large kernel that keeps the number of input feature-map channels unchanged;
2) a layer normalization operation: the mean and variance of all pixel values of the input feature map are turned into trainable parameters;
3) a 1×1 convolution: used to expand or reduce the number of feature-map channels by a factor of 4;
4) an activation function: the GELU function is used to compute the output activations of the convolutional layer;
5) a residual connection: the unit input is added to the activated output to form the final unit output.
Further, the output unit includes:
1) a two-dimensional coordinate prediction branch: a 3×3 convolution kernel extracts features from the feature map to obtain hand key-point heatmaps, and a differentiable maximum-index operation (soft-argmax) locates the highest-response position, which is the predicted two-dimensional coordinate of the key point;
2) a relative depth prediction branch: a 3×3 convolution kernel extracts features from the feature map to obtain latent depth maps of the hand key points, and global spatial average pooling (mean) yields the normalized relative depth value of each key point.
Further, the training of the fully convolutional three-dimensional hand gesture estimation network includes the following steps:
21) image preprocessing: the input images used for training are scaled and padded, and data augmentation is applied;
22) annotation preprocessing: during training, the processing of the hand key-point annotations mainly includes modifying the corresponding two-dimensional key-point annotations to match the image-rotation data augmentation, and converting the absolute depth annotations into normalized relative depth annotations; specifically, the depth annotation of the root key point is subtracted from the depth annotations of all key points, and the result is divided by the hand reference length;
23) network forward inference: the processed image is fed into the fully convolutional three-dimensional hand gesture estimation network to obtain the predicted two-dimensional coordinates and relative depths of the key points;
24) loss computation: the output of the forward inference and the preprocessed annotations are fed into the loss function to obtain the loss value;
25) gradient back-propagation: the gradients of the loss value with respect to the network parameters of the fully convolutional three-dimensional hand gesture estimation network are computed, and the network parameters are updated with the back-propagation algorithm;
26) iterative training: the above steps are repeated batch by batch over the input images until the loss value no longer decreases, at which point the training of the fully convolutional three-dimensional hand gesture estimation network is complete.
Further, during training the fully convolutional three-dimensional hand gesture estimation network uses the Smooth-L1 loss function to compute the error between the predicted two-dimensional key-point coordinates and relative depths and their annotations, computes the gradients of the error with respect to each network parameter, and updates the network parameters through the back-propagation algorithm.
Further, the data augmentation used in image preprocessing during training of the fully convolutional three-dimensional hand gesture estimation network includes:
1) random image rotation: the input image is rotated around its center point by a random angle between -180° and 180°;
2) random image flipping: the input image is randomly flipped horizontally and vertically;
3) random color transformation: the H, S and V channels of the input image are scaled by random factors in the ranges 75%–125%, 50%–150% and 50%–150%, respectively;
4) random noise: Gaussian noise with mean 0 and variance between 0 and 0.1 is applied to each position of the input image with a probability of 50%.
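The four augmentations above can be realized with standard image operations. The following is a minimal NumPy/OpenCV sketch, assuming a BGR uint8 input image and assuming that the noise variance refers to pixel values rescaled to [0, 1]; the function name and these conventions are illustrative, not part of the invention.

```python
import cv2
import numpy as np

def augment(img):
    h, w = img.shape[:2]
    # 1) Random rotation about the image centre, -180 to 180 degrees.
    #    The 2D key-point annotations must be transformed with the same matrix M
    #    (see the annotation preprocessing step).
    angle = np.random.uniform(-180, 180)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    # 2) Random horizontal and vertical flips.
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)   # horizontal
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 0)   # vertical
    # 3) Random HSV scaling: H by 75%-125%, S and V by 50%-150%.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] *= np.random.uniform(0.75, 1.25)
    hsv[..., 1] *= np.random.uniform(0.50, 1.50)
    hsv[..., 2] *= np.random.uniform(0.50, 1.50)
    hsv[..., 0] %= 180                      # OpenCV hue range is [0, 180)
    img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # 4) Gaussian noise (mean 0, variance sampled from [0, 0.1] on a [0, 1] scale),
    #    applied per pixel with probability 50%.
    x = img.astype(np.float32) / 255.0
    sigma = np.sqrt(np.random.uniform(0.0, 0.1))
    mask = (np.random.rand(h, w, 1) < 0.5)
    x = np.clip(x + mask * np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return (x * 255).astype(np.uint8)
```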
Further, the post-processing includes:
1) converting the normalized relative key-point depths output by the fully convolutional three-dimensional hand gesture estimation network into absolute key-point depths;
2) computing the three-dimensional key-point coordinates from the two-dimensional image coordinates of the key points output by the fully convolutional three-dimensional hand gesture estimation network and the converted absolute key-point depths.
Further, the three-dimensional key-point coordinates are specifically computed as

(X_i, Y_i, Z_i)^T = z_i · K^{-1} · (u_i, v_i, 1)^T,

where (X_i, Y_i, Z_i) are the desired three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the fully convolutional three-dimensional hand gesture estimation network, z_i is the absolute depth value of the i-th key point, and K is the intrinsic matrix of the imaging device.
Compared with the prior art, the beneficial effects of the present invention are:
1. By decoupling the hand scale information from the neural network's prediction of key-point depths, the present invention effectively handles the scale-ambiguity problem in monocular three-dimensional hand gesture estimation. In practical applications, given accurate prior scale information, the present invention can accurately recover the actual depths of the hand key points in the scene relative to the imaging device, effectively raising the accuracy ceiling of the three-dimensional hand gesture estimation method and its ability to generalize across scenes.
2. The present invention builds a three-dimensional hand gesture estimation network with a fully convolutional structure out of basic convolution operations, which simplifies the network design of deep-learning-based three-dimensional hand gesture estimation applications and allows most existing deployment frameworks to optimize and accelerate the network. At the same time, the network can proportionally adjust the number of its modules and the channels of its convolutional layers to suit application scenarios with different resource budgets, reducing the difficulty of subsequent maintenance and optimization of the algorithm.
3. The present invention introduces commonly used lightweight designs into the three-dimensional hand gesture estimation network structure, which greatly improves its runtime efficiency and enables real-time three-dimensional hand gesture estimation in low-resource scenarios.
Description of the Drawings
Fig. 1 is a structural diagram of the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 2 is a structural diagram of the input compression unit in the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 3 is a structural diagram of the basic convolution unit in the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 4 is a structural diagram of the up-sampling unit in the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 5 is a structural diagram of the down-sampling unit in the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 6 is a structural diagram of the output unit in the fully convolutional three-dimensional hand gesture estimation network of this embodiment;
Fig. 7 is a flowchart of the training and inference of the fully convolutional three-dimensional hand gesture estimation network of this embodiment.
Detailed Description
The present invention is further described in detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
The fully convolutional three-dimensional hand gesture estimation network proposed by the present invention realizes a monocular three-dimensional hand gesture estimation application, which consists mainly of a training process and an inference process of the neural network; once training is complete, the network can be used for testing and inference. As shown in Fig. 7, a monocular three-dimensional hand gesture estimation method based on a fully convolutional neural network includes the following steps:
S1. Constructing a fully convolutional three-dimensional hand gesture estimation network and training it; the fully convolutional three-dimensional hand gesture estimation network includes the following units:
Input compression unit: its composition is shown in Fig. 2. The input image first passes through a convolutional layer with a 4×4 kernel and a stride of 4 to obtain a compressed feature map, and then through a layer normalization operation, which turns the mean and variance of the compressed feature map into learnable parameters.
Basic convolution unit: the basic feature-extraction unit, whose composition is shown in Fig. 3. The input C-channel feature map is first processed by a 7×7 depthwise convolution, followed by layer normalization; a 1×1 convolution then expands the number of channels to 4C, the result passes through the GELU activation function, and a final 1×1 convolution restores the number of channels to C; the result is summed with the unit input and output.
Down-sampling unit: responsible for spatially down-sampling the feature maps; its structure is shown in Fig. 5. The input feature map is first layer-normalized and then processed by a convolutional layer with a 2×2 kernel and a stride of 2, yielding a feature map with halved spatial resolution.
Up-sampling unit: responsible for spatially up-sampling the feature maps; its structure is shown in Fig. 4. The input feature map is first layer-normalized and then processed by a transposed convolution (deconvolution) with a 2×2 kernel and a stride of 2, doubling the spatial resolution and realizing spatial up-sampling of the feature map.
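For illustration, the units just described map naturally onto a few lines of PyTorch. The sketch below follows the stated kernel sizes, strides and channel ratios; the module names and the channel change inside the down-/up-sampling units (C_i to C_{i+1} and back) are assumptions made only to give a runnable example.

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """Layer normalization over the channel dimension of an NCHW feature map."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)          # NCHW -> NHWC
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)       # NHWC -> NCHW

class InputStem(nn.Module):
    """Input compression unit: 4x4 convolution with stride 4, then layer normalization."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=4)
        self.norm = ChannelLayerNorm(out_ch)
    def forward(self, x):
        return self.norm(self.conv(x))

class ConvBlock(nn.Module):
    """Basic convolution unit: 7x7 depthwise conv -> LayerNorm -> 1x1 conv to 4C
    -> GELU -> 1x1 conv back to C, with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.dwconv = nn.Conv2d(ch, ch, kernel_size=7, padding=3, groups=ch)
        self.norm = ChannelLayerNorm(ch)
        self.pw1 = nn.Conv2d(ch, 4 * ch, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(4 * ch, ch, kernel_size=1)
    def forward(self, x):
        y = self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))
        return x + y

class Downsample(nn.Module):
    """Down-sampling unit: LayerNorm then 2x2 conv with stride 2 (resolution halved).
    Mapping C_i -> C_{i+1} here is an assumption; the text does not fix it."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm = ChannelLayerNorm(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
    def forward(self, x):
        return self.conv(self.norm(x))

class Upsample(nn.Module):
    """Up-sampling unit: LayerNorm then 2x2 transposed conv with stride 2 (resolution doubled)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm = ChannelLayerNorm(in_ch)
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
    def forward(self, x):
        return self.deconv(self.norm(x))
```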
Output unit: responsible for extracting information from the final high-resolution feature maps and predicting the final two-dimensional coordinates and relative depths of the hand key points; its structure is shown in Fig. 6. One branch converts the feature map into a heatmap for each hand key point through a same-padding convolution with a 3×3 kernel and a stride of 1, and the highest-response position in the heatmap, found by a differentiable maximum-index operation (soft-argmax), is the predicted two-dimensional coordinate of the key point; the other branch converts the feature map into a latent depth map for each key point through a same-padding convolution with a 3×3 kernel and a stride of 1, and global spatial average pooling (mean) then yields the normalized relative depth value of each key point.
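A sketch of the output unit follows. The soft-argmax is realized as the expectation of the softmax-normalized heatmap over the spatial grid, so the predicted coordinates live on the heatmap grid; the joint count of 21 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class OutputHead(nn.Module):
    def __init__(self, in_ch, num_joints=21):
        super().__init__()
        self.heatmap_conv = nn.Conv2d(in_ch, num_joints, kernel_size=3, stride=1, padding=1)
        self.depth_conv = nn.Conv2d(in_ch, num_joints, kernel_size=3, stride=1, padding=1)

    def forward(self, feat):
        b, _, h, w = feat.shape
        # Branch 1: heatmaps -> differentiable soft-argmax over the pixel grid.
        heat = self.heatmap_conv(feat)                           # (B, K, H, W)
        prob = heat.flatten(2).softmax(dim=-1).view(b, -1, h, w)
        xs = torch.arange(w, dtype=feat.dtype, device=feat.device)
        ys = torch.arange(h, dtype=feat.dtype, device=feat.device)
        u = (prob.sum(dim=2) * xs).sum(dim=-1)                   # expected column index, (B, K)
        v = (prob.sum(dim=3) * ys).sum(dim=-1)                   # expected row index,    (B, K)
        uv = torch.stack([u, v], dim=-1)                         # (B, K, 2) 2D coordinates
        # Branch 2: latent depth maps -> global spatial average pooling.
        depth = self.depth_conv(feat).mean(dim=(2, 3))           # (B, K) normalized relative depth
        return uv, depth
```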
As shown in Fig. 1, the overall structure of the fully convolutional three-dimensional hand gesture estimation network consists of a down-sampling stage and an up-sampling stage. In the down-sampling stage, the input image is first compressed by the input compression unit and then processed by three cascaded down-sampling feature-extraction modules, where the i-th module consists of N_i basic convolution units with C_i feature channels followed by one down-sampling unit, with i = 1, 2, 3 ordered from the highest to the lowest feature-map spatial resolution. The processed feature maps are further refined by N_4 basic convolution units with C_4 channels and then enter the up-sampling stage. In the up-sampling stage, the features are first processed by N_4 basic convolution units and then by three cascaded up-sampling feature-extraction modules, where each module consists of N_i basic convolution units with C_i feature channels and one up-sampling unit; each up-sampling feature-extraction module also receives the output of the down-sampling feature-extraction module of the same spatial resolution and sums it with its own features. Finally, the output unit produces the predicted two-dimensional coordinates and normalized relative depth values of the hand key points. In this embodiment, N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 take the values 1, 1, 1, 1, 48, 96, 192 and 384, respectively, and the resulting fully convolutional three-dimensional hand gesture estimation network is suitable for application scenarios with high requirements on runtime efficiency.
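Building on the module sketches above, the encoder-decoder with lateral sums could be assembled roughly as follows, using the Embodiment 1 configuration. Where exactly the lateral connections tap the encoder, and whether the N_4 blocks at the junction of the two stages are shared, are not fixed by the text; the assumptions taken here are noted in the comments.

```python
import torch.nn as nn
# Assumes InputStem, ConvBlock, Downsample, Upsample and OutputHead
# from the sketches above are in scope.

class FCHandNet(nn.Module):
    def __init__(self, num_joints=21, blocks=(1, 1, 1, 1), chans=(48, 96, 192, 384)):
        super().__init__()
        c1, c2, c3, c4 = chans
        n1, n2, n3, n4 = blocks
        self.stem = InputStem(3, c1)
        # Down-sampling stage: N_i basic blocks followed by a down-sampling unit.
        self.enc1 = nn.Sequential(*[ConvBlock(c1) for _ in range(n1)])
        self.down1 = Downsample(c1, c2)
        self.enc2 = nn.Sequential(*[ConvBlock(c2) for _ in range(n2)])
        self.down2 = Downsample(c2, c3)
        self.enc3 = nn.Sequential(*[ConvBlock(c3) for _ in range(n3)])
        self.down3 = Downsample(c3, c4)
        # N_4 blocks at the junction of the two stages, instantiated once here.
        self.bottleneck = nn.Sequential(*[ConvBlock(c4) for _ in range(n4)])
        # Up-sampling stage: up-sampling unit plus N_i basic blocks, with lateral
        # sums taken from the encoder features of matching resolution (assumption:
        # the tap point is before each down-sampling unit).
        self.up3 = Upsample(c4, c3)
        self.dec3 = nn.Sequential(*[ConvBlock(c3) for _ in range(n3)])
        self.up2 = Upsample(c3, c2)
        self.dec2 = nn.Sequential(*[ConvBlock(c2) for _ in range(n2)])
        self.up1 = Upsample(c2, c1)
        self.dec1 = nn.Sequential(*[ConvBlock(c1) for _ in range(n1)])
        self.head = OutputHead(c1, num_joints)

    def forward(self, x):
        s1 = self.enc1(self.stem(x))          # 1/4 resolution, c1 channels
        s2 = self.enc2(self.down1(s1))        # 1/8
        s3 = self.enc3(self.down2(s2))        # 1/16
        f = self.bottleneck(self.down3(s3))   # 1/32
        f = self.dec3(self.up3(f) + s3)       # back to 1/16, lateral sum
        f = self.dec2(self.up2(f) + s2)       # 1/8
        f = self.dec1(self.up1(f) + s1)       # 1/4
        return self.head(f)                   # 2D coordinates + normalized relative depths
```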
The training of the fully convolutional three-dimensional hand gesture estimation network includes the following steps:
1) image preprocessing: the input images used for training are padded to a 1:1 aspect ratio and scaled to 256×256, and data augmentation including random image rotation, random image flipping, random color transformation and random noise is applied.
2) annotation preprocessing: during training, the processing of the hand key-point annotations mainly includes modifying the corresponding two-dimensional key-point annotations to match the image-rotation data augmentation, and converting the absolute depth annotations into normalized relative depth annotations; specifically, the depth annotation of the root key point is subtracted from the depth annotations of all key points, and the result is divided by the hand reference length (steps 2) to 5) are sketched in code after this list);
3) network forward inference: the processed image is fed into the fully convolutional three-dimensional hand gesture estimation network to obtain the predicted two-dimensional coordinates and relative depths of the key points.
4) loss computation: the output of the forward inference and the preprocessed annotations are fed into the Smooth-L1 loss function to compute the loss value.
5) gradient back-propagation: the gradients of the loss value with respect to the network parameters of the fully convolutional three-dimensional hand gesture estimation network are computed, and the network parameters are updated with the back-propagation algorithm.
6) iterative training: the above steps are repeated batch by batch over the input images until the loss value no longer decreases, at which point the training of the fully convolutional three-dimensional hand gesture estimation network is complete.
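The following is a condensed sketch of one training iteration covering steps 2) to 5): construction of the normalized relative depth labels, forward inference, the Smooth-L1 loss, and the parameter update. The joint layout (index 0 as the root key point) and the equal weighting of the two loss terms are assumptions.

```python
import torch
import torch.nn.functional as F

def relative_depth_labels(z, l_ref, root=0):
    """z: (B, K) absolute joint depths; l_ref: (B,) hand reference lengths.
    Returns the normalized relative depth (z_i - z_root) / l_ref."""
    return (z - z[:, root:root + 1]) / l_ref.view(-1, 1)

def train_step(net, optimizer, img, uv_gt, z_gt, l_ref):
    d_gt = relative_depth_labels(z_gt, l_ref)           # normalized relative depth labels
    uv_pred, d_pred = net(img)                          # network forward inference
    loss = F.smooth_l1_loss(uv_pred, uv_gt) + F.smooth_l1_loss(d_pred, d_gt)
    optimizer.zero_grad()
    loss.backward()                                     # gradient back-propagation
    optimizer.step()
    return loss.item()
```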
S2. Acquiring a hand image and preprocessing it, specifically padding the input image to a 1:1 aspect ratio and scaling it to a resolution of 256×256.
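A minimal sketch of this preprocessing step is given below; zero padding and centering of the original content are assumptions, since the text only specifies padding to 1:1 and scaling to 256×256.

```python
import cv2

def preprocess(img, size=256):
    h, w = img.shape[:2]
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    img = cv2.copyMakeBorder(img, top, side - h - top, left, side - w - left,
                             borderType=cv2.BORDER_CONSTANT, value=0)  # pad to 1:1
    return cv2.resize(img, (size, size))                               # scale to 256x256
```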
S3. Feeding the preprocessed image into the trained fully convolutional three-dimensional hand gesture estimation network to obtain the two-dimensional coordinates and normalized relative depth values of the hand key points.
S4. Post-processing the two-dimensional hand key-point coordinates and relative depth values predicted by the fully convolutional three-dimensional hand gesture estimation network to compute the three-dimensional key-point coordinates and complete the gesture estimation task. The post-processing includes:
1) Converting the normalized relative depths into absolute depths. First, the provided hand reference length (in practice it can be obtained by other means within the scene and depends on scene factors) is substituted into the following equation:
‖(z_r + d_m · l_ref) · K^{-1} · (u_m, v_m, 1)^T − z_r · K^{-1} · (u_r, v_r, 1)^T‖ = l_ref,

which states that the three-dimensional distance between the root key point and the middle-finger MCP key point equals the reference length; here l_ref denotes the hand reference length, K is the intrinsic matrix of the imaging device, z_r is the absolute depth of the root key point, (u_r, v_r) are the two-dimensional coordinates of the root key point predicted by the network, (u_m, v_m) are the two-dimensional coordinates of the middle-finger MCP key point predicted by the network (i.e., the other joint related to the hand reference length), and d_m is the normalized relative depth value of the middle-finger MCP key point predicted by the network. Solving this equation gives the absolute depth z_r of the root key point, from which, together with the normalized relative depth values of all key points predicted by the network, the absolute depths of all key points can be recovered (a numerical sketch of the full post-processing follows this list):
z_i = z_r + d_i · l_ref,

where d_i denotes the normalized relative depth value of the i-th key point predicted by the network and z_i is the corresponding absolute depth value.
2) Computing the three-dimensional key-point coordinates:

(X_i, Y_i, Z_i)^T = z_i · K^{-1} · (u_i, v_i, 1)^T,

where (X_i, Y_i, Z_i) are the desired three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the network, z_i is the absolute depth value of the i-th key point, and K is the intrinsic matrix of the imaging device.
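The full post-processing chain can be sketched numerically as follows. It assumes, consistently with the equation above, that l_ref is the three-dimensional distance between the root key point and the middle-finger MCP key point, which makes the root depth the positive root of a quadratic; the joint indices and the example reference length are illustrative, not prescribed by the text.

```python
import numpy as np

def postprocess(uv, d_rel, K, l_ref, root=0, mcp=9):
    """uv: (N, 2) predicted pixel coordinates, d_rel: (N,) normalized relative
    depths, K: (3, 3) camera intrinsic matrix. Joint indices root/mcp are illustrative."""
    K_inv = np.linalg.inv(K)
    homo = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)   # (N, 3) homogeneous coords
    rays = (K_inv @ homo.T).T                                    # back-projection rays
    a, b = rays[root], rays[mcp]
    c = d_rel[mcp] * l_ref
    # ||(z_r + c) * b - z_r * a|| = l_ref  ->  quadratic A*z_r^2 + B*z_r + C = 0.
    diff = b - a
    A = diff @ diff
    B = 2.0 * c * (diff @ b)
    C = c * c * (b @ b) - l_ref * l_ref
    z_root = (-B + np.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)     # take the positive root
    z = z_root + d_rel * l_ref                                   # absolute depths of all joints
    return rays * z[:, None]                                     # (N, 3) 3D key-point coordinates

# Example usage (hypothetical values):
# joints_3d = postprocess(uv_pred, d_pred, K, l_ref=0.08)  # e.g. an 8 cm reference length
```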
The fully convolutional three-dimensional hand gesture estimation network has the following property: according to the requirements of the application scenario, the number N of cascaded basic convolution units and the number of channels C inside the basic convolution units can be adjusted, thereby changing the depth and width of the network and the performance of the overall method, so that it can be deployed in application scenarios with different resource budgets.
Embodiment 2
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 2, 2, 2, 2, 96, 144, 216 and 324, respectively. The rest of the process is the same as in Embodiment 1; the resulting fully convolutional three-dimensional hand gesture estimation network is suitable for scenarios with relatively balanced requirements on runtime efficiency and accuracy.
Embodiment 3
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 3, 2, 3, 2, 96, 192, 384 and 768, respectively. The rest of the process is the same as in Embodiment 1; the resulting fully convolutional three-dimensional hand gesture estimation network is suitable for application scenarios with high requirements on accuracy and lower requirements on runtime efficiency.
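The three embodiments differ only in the depth/width configuration. For reference, the values above can be collected as follows for use with the network sketch in Embodiment 1; the names are illustrative.

```python
# Configuration values taken directly from Embodiments 1-3.
CONFIGS = {
    "embodiment_1_fast":     dict(blocks=(1, 1, 1, 1), chans=(48, 96, 192, 384)),   # efficiency first
    "embodiment_2_balanced": dict(blocks=(2, 2, 2, 2), chans=(96, 144, 216, 324)),  # balanced speed/accuracy
    "embodiment_3_accurate": dict(blocks=(3, 2, 3, 2), chans=(96, 192, 384, 768)),  # accuracy first
}
# Example: net = FCHandNet(**CONFIGS["embodiment_2_balanced"])
```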
The above embodiments are only described in detail to help understand the technical solutions of the present invention. Any improvement or replacement made by those skilled in the art without departing from the principles of the present invention falls within the protection scope of the present invention.