Technical Field
The present invention relates to the field of pose estimation, and in particular discloses a method for 6D pose estimation of indoor target objects based on an augmented autoencoder.
Background Art
Object detection and 6D object pose estimation from a single color image play a very important role in industrial and mobile robot manipulation and in human-computer interaction for virtual reality and augmented reality, and occlusion is one of the most challenging problems in 6D pose estimation.
Current mainstream pose estimation methods can be divided into template-matching-based methods, point-based methods, descriptor-based methods, feature-learning-based methods and end-to-end methods based on convolutional neural networks. These methods are not very robust when handling occlusion in complex environments.
Template-matching-based methods require extensive sampling of the detected target object and the extraction of a sufficient number of robustly matchable templates; only after matching against the templates can a rough object pose be obtained, and the result is finally refined with ICP. Although template matching can estimate the pose of low-texture objects fairly efficiently, it becomes very cumbersome for objects whose pose varies widely, because a large number of templates are needed for matching, and it cannot solve the problem of object occlusion.
Point-based methods generally build descriptors from a small number of point pairs on the point cloud: a PPF descriptor is computed for every pair of points and a model hash table is built with the descriptor as the key and the point pair as the value; the rigid-body transformation matrix is then computed by matching the two point clouds to obtain the object pose. This approach, however, is very time-consuming and labor-intensive.
Descriptor-based methods improve the accuracy of the matched points and thereby the accuracy of the object pose, but both the point-based and the descriptor-based methods are extremely time-consuming and labor-intensive, depend heavily on the quality of the points, and require rich texture features.
Feature-learning-based methods estimate the object pose by learning object features: traditional machine learning methods (such as random forests) learn object features and regress the object pose, as in the Latent-Class Hough Forests line of work, but such methods have difficulty handling symmetric objects and occluded objects.
End-to-end methods based on convolutional neural networks have recently become popular, but they require a large amount of training data, and three-dimensional annotations in particular are very difficult to obtain. Methods of this type first use a convolutional neural network to extract feature points and then compute the pose (the three-dimensional rotation matrix R and the three-dimensional translation matrix T) with a PnP method. Most of them, however, target a single object and do not consider occlusion between multiple objects; methods for multiple objects, such as Singleshot6D and SSD-6D, have been proposed, but they do not handle occlusion well. PVNet, proposed by Zhejiang University, handles occlusion well, but it is based on pixel-wise voting, which consumes considerable resources and requires a great deal of post-processing of the results, so the algorithm is rather complex.
In summary, the problems of the prior art are as follows: template-matching-based methods perform poorly on occluded objects and require complex subsequent processing; point-based and descriptor-based methods place high demands on point quality and texture features; feature-learning-based methods have difficulty handling symmetric and occluded objects; and end-to-end convolutional-neural-network methods do not resolve well the occlusion of multiple targets in cluttered scenes and between objects and require extensive post-processing, so they cannot meet the requirements of practical applications.
Summary of the Invention
In view of the problems of the prior art, the present invention proposes a method and system for 6D pose estimation of indoor target objects based on an augmented autoencoder.
To achieve the above objective, the technical solution of the present invention is a method and system for 6D pose estimation of indoor target objects based on an augmented autoencoder, and the specific technical solution comprises the following steps.
The method of the present invention is divided into three stages:
Multi-target object detection stage:
A single color image is first input into an improved version of Faster R-CNN; the RPN network then extracts candidate boxes, and a fully convolutional network outputs the target class probabilities and the two-dimensional bounding boxes.
Augmented autoencoder (AAE) object keypoint prediction stage:
A probabilistic expectation connects the multi-target object detection stage with the keypoint prediction stage of the augmented autoencoder. By training an improved version of the stacked denoising autoencoder (SDAE), the region of interest is encoded and decoded into a noise-free region of interest of the same size, and a fully connected layer (fc) then predicts the keypoints of the target object on the two-dimensional image.
Stage of computing the 6D pose of the target object:
The 6D pose of the target object is computed from the keypoints.
The specific steps of the multi-target object detection stage are as follows:
1-1. A single color image is input into the ResNet101 feature extractor network of Faster R-CNN for feature extraction, yielding a feature map that is used by the subsequent region proposal network (RPN) and the fully convolutional layers (FCN);
1-2. The resulting feature map is fed to the RPN network, which uses 9 anchors. Because the target classes in the LINEMOD dataset used are mostly small targets, the anchor scales are set to 128*128, 192*192 and 256*256 pixels with aspect ratios of 1:1, 1:2 and 2:1, yielding candidate boxes;
1-3. The feature map obtained in step 1-1 and the candidate boxes obtained in step 1-2 are fed to the downsampling pooling module, which maps each region of interest to a feature map of fixed size 7*7 pixels;
1-4. The feature map from step 1-3 is fed to fully convolutional layers that replace the fully connected layers, yielding the class of the target object and its two-dimensional bounding box; the class of the target object is expressed as a probability, and the bounding box is the rectangular region defined by the top-left and bottom-right coordinate points of the target object in the image (a minimal sketch of this detection stage follows this list).
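As a concrete illustration of steps 1-1 to 1-4, the following is a minimal sketch built on a recent version of torchvision. It is an assumption about one possible implementation, not the exact network of the invention: it keeps torchvision's stock box head rather than the fully convolutional head of step 1-4, and it uses the standard ResNet-101 trunk.

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# ResNet-101 trunk (avgpool and fc removed) as the feature extractor of step 1-1.
resnet = torchvision.models.resnet101(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.out_channels = 2048

# Step 1-2: three anchor scales (128, 192, 256) and three aspect ratios (1:1, 1:2, 2:1).
anchor_generator = AnchorGenerator(sizes=((128, 192, 256),),
                                   aspect_ratios=((1.0, 0.5, 2.0),))

# Step 1-3: pool every region of interest to a fixed 7x7 feature map.
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=14,                    # 13 LINEMOD classes + background
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
model.eval()

# Step 1-4: per-image class scores/labels and 2D bounding boxes.
with torch.no_grad():
    detections = model([torch.rand(3, 600, 1000)])
print(detections[0].keys())   # dict_keys(['boxes', 'labels', 'scores'])
```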
The specific steps of object keypoint prediction by the augmented autoencoder are as follows:
2-1. An improved version of the stacked denoising autoencoder (SDAE) is adopted; the SDAE is a denoising autoencoder (DAE) trained layer by layer. To make the network training converge, the ReLU method is used, and the network structure and latent-variable parameters of the SDAE are modified to obtain the augmented autoencoder, thereby improving the denoising capability;
2-2. The regions of interest (RoIs) obtained in the multi-target object detection stage are input to the augmented autoencoder for training; each region of interest is resized to 128*128 pixels;
2-3. The 128*128 region of interest is input to the encoder of the augmented autoencoder. The encoder maps the input to a latent code that contains all the features of the input; it consists of 6 convolutional layers, 6 ReLU activation layers, 1 Flatten layer and 1 fully connected layer, and the latent dimension is set to 128;
2-4. The latent code produced by the encoder is input to the decoder, which decodes it; the decoder consists of 6 convolutional layers, 6 ReLU activation layers, 1 Flatten layer and 1 fully connected layer, with the latent dimension set to 128, and a new region of interest I is obtained, still represented by latent variables;
2-5. The preceding steps yield the latent representation of the new region of interest; on this basis a fully connected layer is added to predict the projections, onto the region of interest, of the 8 keypoints of the three-dimensional bounding box of the object (a minimal sketch of this keypoint head follows this list).
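As a sketch of step 2-5, the fully connected head below maps the 128-dimensional latent code to the 16 coordinates (8 points, each with x and y) of the projected 3D bounding-box corners. The class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Minimal sketch: map the 128-d latent code produced by the AAE encoder
    to the 8 projected corners of the object's 3D bounding box (x, y pairs)."""
    def __init__(self, latent_dim: int = 128, num_keypoints: int = 8):
        super().__init__()
        self.fc = nn.Linear(latent_dim, num_keypoints * 2)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, 128) latent codes -> (N, 8, 2) keypoint coordinates
        return self.fc(z).view(-1, 8, 2)

head = KeypointHead()
keypoints = head(torch.randn(4, 128))   # shape: (4, 8, 2)
```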
The specific steps of the stage of computing the 6D pose of the object are as follows:
3-1. The projections, onto the region of interest, of the 8 keypoints of the object's three-dimensional bounding box, as predicted by the augmented autoencoder, are input to the EPnP algorithm;
3-2. The feature points in the world coordinate system are extracted from the point cloud model (.ply) provided with the LINEMOD dataset; each feature point is a three-dimensional coordinate point expressed as (x, y, z);
3-3. The camera intrinsic matrix provided with the LINEMOD dataset is extracted; the camera parameters are fixed;
3-4. The camera distortion parameter matrix is set to an 8-dimensional all-zero matrix;
3-5. The three-dimensional coordinate points, the 8 keypoints, the camera intrinsic matrix and the camera distortion parameter matrix are input to OpenCV's SolvePnP to solve for the three-dimensional rotation matrix R and the three-dimensional translation matrix T, thereby obtaining the 6D pose of the target object (a minimal sketch follows this list).
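A minimal sketch of steps 3-1 to 3-5 using OpenCV's solvePnP with the EPnP flag is given below. The 3D corner coordinates, the pose used to synthesize the 2D keypoints and the intrinsic values are illustrative placeholders standing in for the LINEMOD model data and the network predictions.

```python
import numpy as np
import cv2

# Eight corners of a toy 10 cm cube standing in for the model's 3D bounding box (step 3-2).
object_points = np.array([[-.05, -.05, -.05], [.05, -.05, -.05], [.05, .05, -.05], [-.05, .05, -.05],
                          [-.05, -.05,  .05], [.05, -.05,  .05], [.05, .05,  .05], [-.05, .05,  .05]])

# Camera intrinsics (step 3-3); the values here are placeholders in the LINEMOD style.
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(8)                       # step 3-4: 8-dimensional all-zero distortion

# Synthesize "predicted" 2D keypoints by projecting with a known pose (stand-in for the AAE output).
rvec_gt = np.array([0.2, -0.1, 0.3])
tvec_gt = np.array([0.02, -0.01, 0.60])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, dist_coeffs)

# Step 3-5: recover R and T from the 3D-2D correspondences with EPnP.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)                      # 3x3 rotation matrix
T = tvec.reshape(3)                             # translation vector
print("solved:", ok)
```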
The specific steps for setting the network loss functions involved in implementing the method of the present invention are as follows:
The network loss function consists of four parts: the loss function of the multi-target object detection stage, the loss function of the augmented autoencoder for reconstructing the region of interest, the loss function of the augmented autoencoder for predicting the keypoints of the target object, and the loss function for computing the pose of the target object. Each part is composed as follows:
(1) The loss function of the multi-target object detection stage, denoted Loss_1, comprises the class loss function L_cls and the two-dimensional bounding-box loss function L_box of the target object, as shown in formula (1):
$$Loss_1 = Loss(p, u, b^u, b^v, \theta, x, y, z) = L_{cls}(p, u) + [u \ge 1]\, L_{box}(b^u, b^v) \qquad (1)$$
where the class loss function L_cls uses the cross-entropy loss, as shown in formula (2):

$$L_{cls}(p, u) = -\log(p_u) \qquad (2)$$

For each region of interest, L_cls is used to output the probability of each class, p = (p_0, ..., p_C); there are C+1 target classes in total, and u denotes the class;
The loss function L_box of the two-dimensional bounding box uses the Smooth L1 regression loss, as shown in formulas (3) and (4):

$$L_{box}(b^u, b^v) = \sum_{i} \mathrm{Smooth}_{L1}\big(b^u_i - b^v_i\big) \qquad (3)$$

where

$$\mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (4)$$

In formula (3), b^u denotes the ground-truth value of the 2D bounding box, b^v denotes the predicted value of the 2D bounding box, and x denotes the difference between the ground-truth value and the predicted value (a small implementation sketch of the Smooth L1 term follows);
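The following is a small sketch of the Smooth L1 term in formulas (3) and (4), checked against PyTorch's built-in loss; the beta argument assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn.functional as F

def smooth_l1(x: torch.Tensor) -> torch.Tensor:
    """Formula (4): 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * absx ** 2, absx - 0.5)

# Equivalent to the built-in element-wise loss with beta=1 and no reduction.
pred = torch.tensor([0.2, 1.5, -3.0])
gt   = torch.tensor([0.0, 0.0,  0.0])
assert torch.allclose(smooth_l1(pred - gt),
                      F.smooth_l1_loss(pred, gt, reduction="none", beta=1.0))
```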
(2) The loss function of the augmented autoencoder for reconstructing the region of interest is L_Rois, denoted Loss_2. Loss_2 uses the MSELoss regression loss, defined as shown in formulas (5) and (6):

$$Loss_2 = L_{Rois} = \sum_{i \in [1, n]} \mathrm{MSELoss}\big(I_{Rois} - I_{Rois\_Restore}\big) \qquad (5)$$
where

$$\mathrm{MSELoss}(x) = \frac{1}{n}\sum_{j=1}^{n} x_j^{\,2} \qquad (6)$$
In the formulas, I_Rois denotes the ground-truth region of interest, and I_Rois_Restore denotes the reconstructed value of the region of interest, i.e., the region of interest encoded and decoded by the augmented autoencoder;
(3) The loss function of the augmented autoencoder for predicting the keypoints of the target object is L_Keypoints, denoted Loss_3. Loss_3 also uses the Smooth L1 regression loss, defined as shown in formula (7):

$$Loss_3 = L_{Keypoints} = \sum_{i=1}^{8} \mathrm{Smooth}_{L1}\big(\hat{k}_i - k_i\big) \qquad (7)$$

where \hat{k}_i denotes the predicted keypoints, k_i denotes the ground-truth keypoints, and Smooth_L1 is as shown in formula (4);
(4) The loss function for computing the pose of the target object is L_pose, denoted Loss_4. Loss_4 also uses the Smooth L1 regression loss, defined as shown in formula (8):

$$Loss_4 = L_{pose} = \alpha_1\, \mathrm{Smooth}_{L1}\big(R - \hat{R}\big) + \alpha_2\, \mathrm{Smooth}_{L1}\big(T - \hat{T}\big) \qquad (8)$$

where R denotes the predicted three-dimensional rotation matrix, \hat{R} denotes the ground-truth three-dimensional rotation matrix, T denotes the predicted three-dimensional translation matrix, \hat{T} denotes the ground-truth three-dimensional translation matrix, and α_1 and α_2 are weights used to balance the rotation and translation loss terms of the pose estimation;
The total loss function of the model is therefore given by formula (9):

$$Loss = Loss_1 + Loss_2 + Loss_3 + Loss_4 \qquad (9)$$
The specific steps of using a probabilistic expectation to connect the multi-target object detection stage with the object keypoint prediction stage of the augmented autoencoder in the method of the present invention are as follows:
First, let w denote the weight parameters to be learned by the multi-target object detection stage and v the weight parameters to be learned by the augmented autoencoder stage. Since there is no direct derivative relationship between the computed 6D pose of the object and the weight parameters w of the multi-target object detection part, i.e., forward and backward propagation cannot be carried out directly between the multi-target object detection part and the augmented autoencoder part, a reinforcement learning approach, the interaction of actions with rewards and penalties, is adopted, namely:
(1) First, the evaluation strategy for the computed 6D pose of the object, namely the two-dimensional reprojection or the average three-dimensional distance of the model vertices, is taken as the reward and penalty function;
(2) The result obtained by the reward-penalty function is used as the reward or penalty;
(3) The computed 6D pose is taken as the action; when the reward or penalty is not satisfied, Loss_3 is back-propagated to update the weight parameters v until the reward or penalty is satisfied; the pose comprises the three-dimensional rotation matrix R and the three-dimensional translation matrix T;
The reward-penalty term has a probabilistic relationship with the class output by the multi-target object detection: for example, if the class result output by the multi-target object detection is object A with probability x%, then the augmented autoencoder corresponding to object A is used, i.e., the reward-penalty result that affects the final output is probabilistically related to the multi-target object detection. Therefore, to realize the forward and backward propagation of Loss_3, it is not necessary to differentiate the reward-penalty term directly; it suffices to differentiate the probabilities and then take the expectation over all the differentiated probabilities. Formula (10) computes the derivative with respect to the learnable weight parameters w and v:

$$\frac{\partial}{\partial w}\, \mathbb{E}_{J \sim P(J|w)}\big[ l_{pose}(\cdot) \big] = \mathbb{E}_{J \sim P(J|w)}\Big[ l_{pose}(\cdot)\, \frac{\partial \log P(J|w)}{\partial w} \Big] \qquad (10)$$
where
$$P(J|w) = \exp(-Loss_1) \qquad (11)$$

$$Reward = l_{pose}(\cdot) = \mathrm{2D\,Projection}(K, R, T) \qquad (12)$$
or
$$Reward = l_{pose}(\cdot) = \mathrm{ADD}(R, T) \qquad (13)$$
In formulas (10) and (11) above, J denotes a training sample and P(J|w) denotes the probability of selecting a class, obtained by normalizing Loss_1 as exp(-Loss_1); J ~ P(J|w) indicates that the sample J follows the probability P(J|w). Since the parameter w affects the probability P(J|w), it in turn affects, with a certain probability, which augmented autoencoder is selected for the computation, and thereby the final result l_pose(·), i.e., the Reward. Formula (10) is therefore the derivative of the expectation over the probabilities with respect to the parameter w, i.e., the parameter w is updated so as to minimize the objective; the parameter v is computed in the same way as the parameter w. l_pose(·) denotes the result of the reward function, either the two-dimensional reprojection 2D Projection(K, R, T) or the average three-dimensional distance of the model vertices ADD(R, T). Because the expectation is taken over the probabilities output by Faster R-CNN, the final Loss_1 is defined as shown in formula (14):

$$Loss_1 = \mathbb{E}_{J \sim P(J|w)}\big[ l_{pose}(\cdot) \big] \qquad (14)$$
v is solved in the same way as w, and Loss_2 is solved in the same way as Loss_1 (a minimal sketch of this score-function update follows).
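The following is a minimal sketch of the score-function ("differentiate the probability, then take the expectation") update described above. It is an interpretation of the scheme, with illustrative stand-ins for the detector class scores and the pose reward.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 13, requires_grad=True)    # stand-in for the detector class scores (w)

def pose_reward(labels: torch.Tensor) -> torch.Tensor:
    # Stand-in for the 2D-projection or ADD based reward of the selected autoencoder.
    return torch.rand(labels.shape[0])

probs = torch.softmax(logits, dim=-1)
dist = torch.distributions.Categorical(probs=probs)
chosen = dist.sample()                              # which per-class autoencoder is selected
reward = pose_reward(chosen)

# The reward itself is not differentiable with respect to w, so the gradient is taken
# through log P(J|w) and weighted by the detached reward (score-function estimator).
loss_w = -(reward.detach() * dist.log_prob(chosen)).mean()
loss_w.backward()                                   # gradients flow only through log P(J|w)
```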
The evaluation strategy adopted in the multi-target object detection stage of the present invention is the intersection-over-union (IoU) method. IoU refers to the degree of overlap α between the predicted two-dimensional bounding box of the target object and the ground-truth two-dimensional bounding box; IoU > α indicates a positive sample. IoU is defined as shown in formula (15):

$$IoU = \frac{area(pr\_bbox \cap gt\_bbox)}{area(pr\_bbox \cup gt\_bbox)} \qquad (15)$$

where pr_bbox denotes the predicted two-dimensional bounding box, gt_bbox denotes the ground-truth two-dimensional bounding box, and the intersection and union are the overlapping and combined regions of the areas occupied by the two-dimensional bounding boxes.
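A direct implementation sketch of formula (15) for axis-aligned boxes in [x1, y1, x2, y2] form:

```python
def iou(box_a, box_b):
    """Formula (15): intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 100, 100], [50, 50, 150, 150]))   # about 0.143
```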
The evaluation strategy of the 6D pose estimation stage of the present invention uses the two-dimensional reprojection metric and the average three-dimensional distance of the model vertices, as shown in formulas (16) and (17), respectively:

$$\mathrm{2D\,Projection} = \frac{1}{m}\sum_{x \in M}\big\| \pi\big(K(R_{pred}\,x + T_{pred})\big) - \pi\big(K(R_{gt}\,x + T_{gt})\big) \big\| \qquad (16)$$

where m denotes the number of vertices of the 3D object model, M is the set of vertices of the 3D object model, K denotes the camera intrinsics, x is a mesh vertex of the model (the .ply point cloud model), and π(·) denotes the projection onto the image plane;

$$\mathrm{ADD} = \frac{1}{m}\sum_{x \in M}\big\| (R_{pred}\,x + T_{pred}) - (R_{gt}\,x + T_{gt}) \big\| \qquad (17)$$

where m denotes the number of vertices of the 3D object model, M is the set of vertices of the 3D object model, R_pred denotes the predicted rotation matrix, T_pred denotes the predicted translation matrix, R_gt denotes the ground-truth rotation matrix, and T_gt denotes the ground-truth translation matrix;
In the LINEMOD dataset, the egg box (EggBox) and the drill (Driller) are symmetric objects, and their evaluation uses ADD-S, as shown in formula (18):

$$\mathrm{ADD\text{-}S} = \frac{1}{m}\sum_{x_1 \in M}\,\min_{x_2 \in M}\big\| (R_{gt}\,x_1 + T_{gt}) - (R_{pred}\,x_2 + T_{pred}) \big\| \qquad (18)$$
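The three evaluation metrics of formulas (16)-(18) can be sketched as below; `pts` is an (m, 3) array of model vertices, and the perspective division in the 2D reprojection term follows the interpretation given above.

```python
import numpy as np

def reproj_2d(K, R_pred, T_pred, R_gt, T_gt, pts):
    """Formula (16): mean 2D reprojection distance of model vertices, in pixels."""
    def proj(R, T):
        cam = pts @ R.T + T            # (m, 3) points in the camera frame
        uv = cam @ K.T
        return uv[:, :2] / uv[:, 2:3]  # perspective division
    return np.linalg.norm(proj(R_pred, T_pred) - proj(R_gt, T_gt), axis=1).mean()

def add(R_pred, T_pred, R_gt, T_gt, pts):
    """Formula (17): average 3D distance between correspondingly transformed vertices."""
    return np.linalg.norm((pts @ R_pred.T + T_pred) - (pts @ R_gt.T + T_gt), axis=1).mean()

def add_s(R_pred, T_pred, R_gt, T_gt, pts):
    """Formula (18): ADD-S for symmetric objects (closest-point distance)."""
    a = pts @ R_gt.T + T_gt
    b = pts @ R_pred.T + T_pred
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # (m, m) pairwise distances
    return d.min(axis=1).mean()

R_id, t0, pts = np.eye(3), np.zeros(3), np.random.rand(100, 3)
print(add(R_id, t0, R_id, t0, pts))   # 0.0 for identical poses
```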
The specific steps for creating the datasets used for training and testing in the present invention are as follows:
The dataset used for training is the original LINEMOD dataset, the dataset used for testing is the LINEMOD occlusion dataset, and the training and testing datasets are multi-target LINEMOD datasets, where the training dataset is synthesized from the original dataset. The specific process is as follows (a pasting sketch is given after this list):
① Based on the mask images provided with the original LINEMOD dataset, the two-dimensional bounding-box region of the target in the image is computed;
② Based on the coordinate position of the two-dimensional bounding box in the image, the corresponding position in the JPEG color image is computed;
③ In the LINEMOD image, the two-dimensional bounding box of the target is the foreground and the rest is the background; the background is replaced with an image from VOC2012;
④ Steps ①②③ are repeated, and the 13 classes of targets in LINEMOD are pasted at random into VOC2012 images according to step ② above, so that each image contains targets of all 13 classes;
⑤ Data augmentation is applied to the generated multi-target LINEMOD images.
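A minimal sketch of steps ① to ③, computing a 2D box from a mask and pasting the masked crop onto a VOC2012 background; the function names and the compositing strategy are illustrative assumptions.

```python
import numpy as np

def bbox_from_mask(mask):
    """Step 1: 2D bounding box (x1, y1, x2, y2) of the non-zero pixels of a LINEMOD mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def paste_object(background, object_rgb, object_mask, top_left):
    """Steps 2-3: paste the masked object crop onto a VOC2012 background image."""
    y, x = top_left
    h, w = object_mask.shape
    roi = background[y:y + h, x:x + w]
    keep = object_mask[..., None] > 0                 # foreground pixels of the target
    roi[:] = np.where(keep, object_rgb, roi)          # pixels outside the mask stay VOC
    return background

# Toy usage: a 5x5 "object" pasted into a 20x20 "background".
bg = np.zeros((20, 20, 3), dtype=np.uint8)
obj = np.full((5, 5, 3), 255, dtype=np.uint8)
mask = np.ones((5, 5), dtype=np.uint8)
print(bbox_from_mask(mask), paste_object(bg, obj, mask, (3, 4)).sum())
```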
In summary, the beneficial effects of the present invention are as follows:
The present invention belongs to the field of object pose estimation and discloses a method and system for 6D pose estimation of indoor target objects based on an augmented autoencoder. The method is divided into three stages. In the multi-target object detection stage, a single color image is first input into an improved version of Faster R-CNN; the RPN network then extracts candidate boxes, and a fully convolutional (FCN) network outputs the target class probabilities and the two-dimensional bounding boxes. In the stage where a probabilistic expectation connects the augmented autoencoder (AAE) that predicts the object keypoints, an improved version of the stacked denoising autoencoder (SDAE) is trained to encode and decode the region of interest into a noise-free region of interest of the same size, and a fully connected layer (fc) then predicts the keypoints of the target object on the two-dimensional image. In the pose computation stage, PnP computes the 6D pose of the target from the keypoints. After training on the LINEMOD dataset, the present invention is highly robust to cluttered backgrounds and object occlusion, is insensitive to illumination and color, and does not require objects to have rich texture features.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is an overall flow diagram of the method proposed by the present invention, shown together with the results and a partial enlargement of the results;
Fig. 2 is a diagram of the network structure of the augmented autoencoder (AAE) proposed by the present invention.
Detailed Description of the Embodiments
In order to make the technical solution of the present invention clearer, the content of the invention is described in more detail below with reference to embodiments, but the scope of protection of the invention is not limited to the following examples; all features disclosed in this specification, or all steps in any method or process disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.
The principle is further described below with reference to the drawings.
Fig. 1 shows the overall flow of the method proposed by the present invention, presented in the form of implementation results. Taking the Ape category in LINEMOD as an example, the specific operation steps are as follows:
A method and system for 6D pose estimation of indoor target objects based on an augmented autoencoder, the method being divided into three stages: in the multi-target object detection stage, a single color image is first input into an improved version of Faster R-CNN, the RPN network then extracts candidate boxes, and a fully convolutional (FCN) network outputs the target class probabilities and the two-dimensional bounding boxes; in the stage where a probabilistic expectation connects the augmented autoencoder (AAE) that predicts the object keypoints, an improved version of the stacked denoising autoencoder (SDAE) is trained to encode and decode the region of interest into a noise-free region of interest of the same size, and a fully connected layer (fc) then predicts the keypoints of the target object on the two-dimensional image; in the pose computation stage, PnP computes the 6D pose of the target from the keypoints.
The multi-target object detection stage is implemented in three main steps: the ResNet101 residual network extracts features; the RPN region proposal network extracts a certain number of RoIs (regions of interest); and finally the RoIHead module performs class prediction and two-dimensional bounding-box regression on the RoIs. RoIHead mainly comprises the RoIAlign and FCN modules. The three steps are explained in more detail below:
Step 1: feature extraction with the ResNet101 residual network:
(1) Color images from the multi-target LINEMOD dataset are input as training samples. This multi-target dataset differs from the original LINEMOD dataset: the original dataset consists of single-target images and single-target annotations, whereas the multi-target dataset is synthesized by the method of the present invention, and its color images are 640*480 pixels in width and height. Two things need to be done before the images are input: Faster R-CNN requires an image size of 1000*600 pixels, so the LINEMOD color images are resized to 1000*600 and the annotation information is changed automatically accordingly; at the same time, in order to increase the generalization of the model, data augmentation is applied to the images by changing the brightness and contrast, randomly adding mask occlusions (shown as small black patches in the image), randomly adding Gaussian noise, and so on. The processed multi-target color images are input to the ResNet101 residual network to extract features;
(2) Unlike Faster R-CNN, which uses VGG16 as the feature extractor, the present invention uses the deeper and more expressive ResNet101 residual network as the feature extractor in order to extract more and better features to represent the image. ResNet101 extracts a certain number of feature maps (a minimal sketch follows the layer listing below). The network structure of ResNet101 is:
Convolution layer 1 uses a 7*7 kernel, input 3, output 64, stride 2, followed by a downsampling layer with stride 2;
Convolution layer 2 uses a 1*1 kernel, input 64, output 64, stride 1;
Convolution layer 3 uses a 3*3 kernel, input 64, output 64, stride 1;
Convolution layer 4 uses a 1*1 kernel, input 64, output 256, stride 1;
Convolution layer 5 is the downsampled result after layer 1 and uses a 1*1 kernel, input 64, output 256, stride 1;
Convolution layer 6 uses a 1*1 kernel, input 256, output 64, stride 1;
Convolution layer 7 uses a 3*3 kernel, input 64, output 64, stride 1;
Convolution layer 8 uses a 1*1 kernel, input 64, output 256, stride 1;
Convolution layers 9-11 repeat layers 6-8 once;
Convolution layer 12 uses a 1*1 kernel, input 256, output 128, stride 2;
Convolution layer 13 uses a 3*3 kernel, input 128, output 128, stride 1;
Convolution layer 14 uses a 1*1 kernel, input 128, output 256, stride 1;
Convolution layer 15 is the result after layer 11 and uses a 1*1 kernel, input 256, output 512, stride 2;
Convolution layers 16-21 repeat layers 12-14 twice;
Convolution layer 23 uses a 1*1 kernel, input 512, output 256, stride 2;
Convolution layer 24 uses a 3*3 kernel, input 256, output 256, stride 1;
Convolution layer 25 uses a 1*1 kernel, input 256, output 1024, stride 1;
Convolution layer 26 is the result after layer 21 and uses a 1*1 kernel, input 512, output 1024, stride 2;
Convolution layer 27 uses a 1*1 kernel, input 1024, output 512, stride 1;
Convolution layer 28 uses a 3*3 kernel, input 256, output 256, stride 1;
Convolution layer 29 uses a 1*1 kernel, input 256, output 1024, stride 1;
Convolution layers 30-93 repeat layers 27-29 twenty-one times;
Convolution layer 94 uses a 1*1 kernel, input 1024, output 512, stride 1;
Convolution layer 95 uses a 3*3 kernel, input 256, output 256, stride 1;
Convolution layer 96 uses a 1*1 kernel, input 256, output 2048, stride 1;
Convolution layer 97 is the result of layer 93 and uses a 1*1 kernel, input 1024, output 2048, stride 1;
Convolution layer 98 uses a 1*1 kernel, input 2048, output 512, stride 1;
Convolution layer 99 uses a 3*3 kernel, input 512, output 512, stride 1;
Convolution layer 100 uses a 1*1 kernel, input 512, output 2048, stride 1;
Convolution layers 101-103 are the result of repeating layers 98-100 once, followed by an average pooling layer;
Finally a fully connected layer is attached, with input 2048 and an output equal to the number of classes; the LINEMOD dataset used in the present invention has 13 classes.
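For comparison with the layer listing above, a minimal torchvision-based sketch of the ResNet-101 feature extractor is given below. It is an assumption that the stock torchvision model is used as a stand-in, with only the final fully connected layer resized to the 13 classes.

```python
import torch
import torch.nn as nn
import torchvision

# Stock ResNet-101 with the classification head resized to 13 LINEMOD classes.
resnet101 = torchvision.models.resnet101(weights=None)
resnet101.fc = nn.Linear(2048, 13)

# For the detection pipeline only the convolutional trunk is used as the backbone.
feature_extractor = nn.Sequential(*list(resnet101.children())[:-2])
features = feature_extractor(torch.rand(1, 3, 600, 1000))
print(features.shape)   # (1, 2048, H', W'), roughly 1/32 of the input resolution
```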
In the second step, the RPN region proposal network extracts regions of interest (RoIs) from the feature maps extracted by ResNet101. The core of the RPN operation is the anchors; the RPN uses 9 anchors, with sizes of 128*128, 192*192 and 256*256 pixels: 3 anchors with aspect ratio 1:1, 3 anchors with aspect ratio 1:2, and 3 anchors with aspect ratio 2:1. There are four main steps (see the sketch after this list):
(1) For each image, its feature map is used to compute, for the (H/16)×(W/16)×9 (roughly 20000) anchors, the probability that each anchor belongs to the foreground, together with the corresponding position parameters;
(2) The 12000 anchors with the highest probabilities are selected;
(3) The regressed position parameters are used to correct the positions of these 12000 anchors, giving RoIs;
(4) Non-maximum suppression (NMS) is applied, and the 2000 RoIs with the highest probabilities are selected;
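A minimal sketch of steps (2) to (4) using torchvision's NMS operator; the 0.7 IoU threshold is an assumed, commonly used value and is not stated in the text.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms_top_n=12000, post_nms_top_n=2000, iou_thresh=0.7):
    # boxes: (N, 4) anchors already shifted by the regressed offsets; scores: (N,) foreground prob.
    order = scores.topk(min(pre_nms_top_n, scores.numel())).indices   # step (2)
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_thresh)[:post_nms_top_n]            # step (4)
    return boxes[keep], scores[keep]

# Toy usage with random boxes and scores.
boxes = torch.rand(5000, 4) * 500
boxes[:, 2:] += boxes[:, :2]          # ensure x2 > x1 and y2 > y1
scores = torch.rand(5000)
rois, roi_scores = select_proposals(boxes, scores)
print(rois.shape[0] <= 2000)
```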
In the third step, RoIHead continues the classification and position-parameter regression on top of the 2000 candidate boxes given by the RPN and the feature maps extracted by ResNet101. It mainly comprises two parts: RoIAlign, which fixes the size of the RoIs to 7*7 pixels, and the FCN, which outputs the class probabilities and the two-dimensional bounding boxes. The specific process is as follows:
(1) Faster R-CNN uses the RoIPooling method, but this method introduces a certain deviation after two quantization steps, and this deviation inevitably affects the regression and localization of the subsequent layers. The present invention therefore borrows RoIAlign from Mask R-CNN. RoIAlign computes the pixels inside the RoI region using bilinear interpolation, reducing the deviation introduced by the two quantizations. In general, the difference between RoIPooling and RoIAlign is small for large targets, but RoIAlign is more accurate for small targets. Because the classes in the LINEMOD dataset used in the present invention are basically all small targets, the RoIAlign method is used to quantize the RoIs into regions of interest of fixed size 7*7 (see the RoIAlign sketch after this list);
(2) The present invention uses an FCN (fully convolutional network) instead of fully connected layers to map the fixed-size RoIs output by RoIAlign into a fixed-length feature vector; compared with fully connected layers, the FCN greatly reduces the number of network parameters. After an average pooling layer, two branches are output: one outputs the class probabilities of the target object, and the other regresses the two-dimensional bounding box of the object.
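A minimal sketch of the RoIAlign step described in (1), using torchvision's roi_align with bilinear sampling; the 1/16 spatial scale and the toy tensor sizes are assumptions.

```python
import torch
from torchvision.ops import roi_align

features = torch.rand(1, 2048, 38, 50)              # backbone feature map (toy sizes)
rois = torch.tensor([[0., 10., 10., 200., 180.]])   # (batch_index, x1, y1, x2, y2) in image pixels

# RoIAlign to a fixed 7x7 output; spatial_scale maps image coordinates onto the feature map.
pooled = roi_align(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)   # torch.Size([1, 2048, 7, 7])
```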
As shown in Fig. 2, the stage in which a probabilistic expectation connects the augmented autoencoder (AAE) that predicts the object keypoints is mainly divided into two sub-stages: the augmented autoencoder reconstructs the region of interest, and it then predicts the object keypoints. The specific process is as follows:
Reconstruction of the region of interest by the augmented autoencoder (AAE) consists of encoding and decoding (Encoder-Decoder) the RoI regions of interest obtained by Faster R-CNN. The augmented autoencoder (AAE) adopted in the present invention is an improved version of the stacked denoising autoencoder (SDAE); the SDAE is a denoising autoencoder (DAE) trained layer by layer, and Dropout or ReLU can be used; to make the network training converge, the present invention uses ReLU. First, the encoder encodes the region of interest; encoding is a dimensionality-reducing downsampling process that encodes the region of interest into a 128-dimensional latent code. The decoder then performs an upsampling operation on this 128-dimensional latent code, similar to a transposed convolution, and reconstructs a region of interest of the same size. From the input in Fig. 2 it can be seen that the color and occlusion of the input region of interest are very pronounced, whereas the region of interest reconstructed by the AAE shows the target very clearly. If the color, occlusion and so on are regarded as noise with respect to the target object (such as Ape), then the improved stacked denoising autoencoder (SDAE) removes this noise (color, occlusion, etc.) and recovers a region of interest free of any noise; performing the subsequent operations on this noise-free reconstructed region of interest is clearly more direct than operating on a region of interest that contains noise. The specific network structure for reconstructing the region of interest is (a sketch of the encoder follows the listing):
Convolution layer 1 uses a 5*5 kernel to extract features from the input region of interest, input 3, output 64, stride 2, padding 2;
Convolution layer 2 uses a 5*5 kernel, input 64, output 128, stride 2, padding 2;
Convolution layer 3 uses a 5*5 kernel, input 128, output 256, stride 2, padding 2;
Convolution layer 4 uses a 5*5 kernel, input 256, output 512, stride 2, padding 2;
Convolution layer 5 uses a 5*5 kernel, input 512, output 512, stride 2, padding 2;
Convolution layer 6 uses a 5*5 kernel, input 512, output 512, stride 2, padding 2;
A Flatten layer then maps the output of layer 6 into a one-dimensional feature vector for convenient computation;
A fully connected layer (fc) then outputs the 128-dimensional latent code, with input 2048.
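A direct sketch of the encoder listed above in PyTorch (an interpretation of the listing; for a 128*128 input the flattened size is 512*2*2 = 2048, matching the stated fc input):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3,   64,  5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(64,  128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, 5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 5, stride=2, padding=2), nn.ReLU(inplace=True),
    nn.Flatten(),                       # 512 x 2 x 2 = 2048 for a 128x128 input
    nn.Linear(2048, 128),               # 128-dimensional latent code
)
z = encoder(torch.rand(1, 3, 128, 128))
print(z.shape)   # torch.Size([1, 128])
```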
In the previous step the encoder has already extracted the features of the region of interest and stored them in the output 128-dimensional latent code. The 128-dimensional latent code output by the encoder is input to the decoder for the upsampling operation, recovering a region of interest of the same size as the original, and the recovered region of interest contains only the target object without any other noise. The specific network structure is as follows (a sketch of the decoder follows the listing):
Convolution layer 1 uses a 5*5 kernel, input 512, output 512, stride 1, padding 2;
Convolution layer 2 uses a 5*5 kernel, input 512, output 512, stride 1, padding 2;
Convolution layer 3 uses a 5*5 kernel, input 512, output 256, stride 1, padding 2;
Convolution layer 4 uses a 5*5 kernel, input 256, output 128, stride 1, padding 2;
Convolution layer 5 uses a 5*5 kernel, input 128, output 64, stride 1, padding 2;
Convolution layer 6 uses a 5*5 kernel, input 64, output 512, stride 1, padding 2;
Convolution layer 7 uses a 5*5 kernel, input 512, output 3, stride 1, padding 2;
Between convolution layers 1-7 there is a ReLU activation layer and an upsampling layer Upsample(scale_factor=2.0, mode=nearest), where scale_factor is the scaling factor used to control the ratio of the image width and height, and mode=nearest means that the nearest-neighbor method is used;
A Sigmoid() function layer then maps the variables to the interval (0, 1), which facilitates computation and network convergence.
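A sketch of the decoder listed above (an interpretation of the listing: the 128-dimensional latent code is first expanded to 512*2*2 by a fully connected layer, and a nearest-neighbor Upsample with scale factor 2 follows each of convolution layers 1-6, so that seven stride-1 convolutions and a final Sigmoid produce a 128*128*3 reconstruction):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 512 * 2 * 2)       # expand the latent code to 512x2x2
        chans = [512, 512, 512, 256, 128, 64, 512, 3]      # channel sizes from the listing
        layers = []
        for i in range(7):
            layers.append(nn.Conv2d(chans[i], chans[i + 1], 5, stride=1, padding=2))
            if i < 6:                                       # ReLU + Upsample between layers 1-7
                layers += [nn.ReLU(inplace=True),
                           nn.Upsample(scale_factor=2.0, mode="nearest")]
        layers.append(nn.Sigmoid())                         # map the output to (0, 1)
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 512, 2, 2)
        return self.net(x)

recon = Decoder()(torch.rand(1, 128))
print(recon.shape)   # torch.Size([1, 3, 128, 128])
```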
Predicting the object keypoints with the augmented autoencoder (AAE) means predicting the keypoints of the target object inside the region of interest on the basis of the newly reconstructed region of interest; after reconstruction, the region of interest is already noise-free. The encoder then encodes the noise-free region of interest into a 128-dimensional latent code, extracting all the features of the region of interest, and a fully connected layer (FC) outputs the required dimensions; just as the fully connected layer of the last layer of ResNet101 outputs the 13 classes, here the output is the 8 keypoints of the target object, with the effect shown in Fig. 2.
The specific implementation process is as follows:
(1) First, the pre-trained model for recovering the region of interest is used and connected to the encoder, with the network structure parameters unchanged; this step encodes the noise-free region of interest into a 128-dimensional latent code;
(2) A fully connected layer then outputs a 16-dimensional vector, with the 128-dimensional latent code as input; this 16-dimensional vector represents the coordinates (x, y) of the keypoints of the target object, 8 points in total.
The specific steps of the stage of computing the 6D pose of the object are as follows:
(1) The projections, onto the region of interest, of the 8 keypoints of the object's three-dimensional bounding box, as predicted by the augmented autoencoder, are input to the EPnP algorithm;
(2) The feature points in the world coordinate system are extracted from the point cloud model provided with the LINEMOD dataset; each feature point is a three-dimensional coordinate expressed as (x, y, z);
(3) The camera intrinsic matrix provided with the LINEMOD dataset is extracted; the camera parameters are fixed;
(4) The camera distortion parameter matrix is set to an 8-dimensional all-zero matrix;
(5) The three-dimensional coordinate points, the 8 keypoints, the camera intrinsic matrix and the camera distortion parameter matrix are input to OpenCV's SolvePnP to solve for the three-dimensional rotation matrix R and the three-dimensional translation matrix T, thereby obtaining the 6D pose of the target object.
The probabilistic expectation, the setting of the network loss functions of the model algorithm, the evaluation strategies used to measure the quality of the algorithm, and the method of creating the training and test datasets have been described in detail in the claims and are not repeated here.
After training on the LINEMOD dataset, the present invention is highly robust to cluttered backgrounds and object occlusion, is insensitive to illumination and color, and does not require objects to have rich texture features.
| CN114332217A (en)* | 2021-11-30 | 2022-04-12 | 浪潮(北京)电子信息产业有限公司 | Attitude estimation method, attitude estimation device, attitude estimation equipment and readable storage medium |
| CN114419412A (en)* | 2022-03-31 | 2022-04-29 | 江西财经大学 | A multimodal feature fusion method and system for point cloud registration |
| CN114842260A (en)* | 2022-05-10 | 2022-08-02 | 杭州师范大学 | Point cloud classification method based on blueprint separation convolution |
| CN114842260B (en)* | 2022-05-10 | 2024-06-04 | 杭州师范大学 | A point cloud classification method based on blueprint separation convolution |
| CN116485892A (en)* | 2023-04-11 | 2023-07-25 | 中国科学院合肥物质科学研究院 | Six-degree-of-freedom pose estimation method for weak texture object |
| CN117078966A (en)* | 2023-08-21 | 2023-11-17 | 辽宁工程技术大学 | Method for 6D pose tracking of a synthetic-data-driven target |
| CN117292407A (en)* | 2023-11-27 | 2023-12-26 | 安徽炬视科技有限公司 | 3D human body posture estimation method and system |
| CN117292407B (en)* | 2023-11-27 | 2024-03-26 | 安徽炬视科技有限公司 | 3D human body posture estimation method and system |
| Publication number | Publication date |
|---|---|
| CN110533721B (en) | 2022-04-08 |
| Publication | Title |
|---|---|
| CN110533721B (en) | Indoor target object 6D attitude estimation method based on enhanced self-encoder |
| CN118298127B (en) | Three-dimensional model reconstruction and image generation method, device, storage medium and program product |
| US12340440B2 (en) | Adaptive convolutions in neural networks | |
| CN113807361A (en) | Neural network, target detection method, neural network training method and related products | |
| CN110473151B (en) | Two-stage image completion method and system based on partitioned convolution and association loss | |
| CN111046893B (en) | Image similarity determining method and device, image processing method and device | |
| CN115115860A (en) | A deep learning-based image feature point detection and matching network | |
| CN112036260A (en) | An expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
| CN118691742A (en) | A 3D point cloud reconstruction method based on self-training conditional diffusion model | |
| CN113436224B (en) | An intelligent image cropping method and device based on explicit composition rule modeling | |
| CN118314267A (en) | A method for removing shadows from texture remapping process for 3D map reconstruction | |
| CN117593187A (en) | Remote sensing image super-resolution reconstruction method based on meta-learning and transducer | |
| CN117576312A (en) | Hand model construction method and device and computer equipment | |
| CN118334241A (en) | Three-dimensional reconstruction and real-time rendering method for oblique photography scenes | |
| CN118350984A (en) | Image style migration method based on multi-level cascade structure | |
| CN113239771A (en) | Attitude estimation method, system and application thereof | |
| CN114119923B (en) | Three-dimensional face reconstruction method, device and electronic device | |
| CN120259101A (en) | Method and system for fusing inspection infrared images and visible light images of power transmission and transformation equipment | |
| Xu et al. | Depth map super-resolution via joint local gradient and nonlocal structural regularizations | |
| CN116453014A (en) | Multi-mode road scene target detection method based on images and events | |
| CN115018989A (en) | 3D dynamic reconstruction method, training device and electronic equipment based on RGB-D sequence | |
| CN119941576A (en) | A method for defogging images for autonomous driving based on CTFormer | |
| CN117808930A (en) | Image harmonious image editing method and system based on diffusion model | |
| CN117197283A (en) | Face artistic image generation method, device, computer equipment and storage medium | |
| CN115496910A (en) | Point Cloud Semantic Segmentation Method Based on Fully Connected Graph Coding and Double Dilated Residual |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| TR01 | Transfer of patent right | ||
| TR01 | Transfer of patent right | Effective date of registration: 2023-04-12; Address after: Rooms E901 and E903, Building 1, No. 1378 Wenyi West Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100; Patentee after: HANGZHOU BLACKBOX 3D TECHNOLOGY CO.,LTD.; Address before: No. 58 Haishu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121; Patentee before: Hangzhou Normal University |