CN112561979A

Movatterモバイル変換

Info

Publication number: CN112561979A
Application number: CN202011562061.9A
Authority: CN
Inventors: 雷建军; 孙琳; 彭勃; 张哲�; 刘秉正
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-03-26
Anticipated expiration: 2040-12-25
Also published as: CN112561979B

Abstract

Translated fromChinese

本发明公开了一种基于深度学习的自监督单目深度估计方法，所述方法包括：分别提取原始的右视图I^r和合成的左视图

的金字塔特征，将金字塔特征进行水平相关操作以获得多尺度的相关特征F^c，并获取完善后的多尺度相关特征F^m；将F^m送入双目线索预测模块中的视觉线索预测网络，生成辅助的视觉线索D^r，并从合成的左视图

再重建出右视图

利用重建的右视图

和真实的右视图I^r之间的图像重建损失

来优化双目线索预测模块；将双目线索预测模块生成的视觉线索D^r用于约束单目深度估计网络预测的视差图D^l，使用一致性损失增强二者之间的一致性；构建遮挡引导的约束来为遮挡区域像素和非遮挡区域像素的重建误差分配不同的权重。

The invention discloses a self-supervised monocular depth estimation method based on deep learning, the method comprises: extracting the original right view I^r and the synthesized left view respectively

Pyramid feature, perform horizontal correlation operation on the pyramid feature to obtain multi-scale correlation feature F^c , and obtain the perfect multi-scale correlation feature F^m ; send F^m to the visual cue prediction network in the binocular cue prediction module, Generate auxiliary visual cues^Dr , and from the synthesized left view

Rebuild the right view

Utilize reconstructed right view

and the real right view I^r between image reconstruction loss

to optimize the binocular cue prediction module; use the visual cues D^r generated by the binocular cue prediction module to constrain the disparity map D^l predicted by the monocular depth estimation network, and use consistency loss to enhance the consistency between the two; construct occlusion Guided constraints to assign different weights to the reconstruction errors of pixels in occluded and non-occluded regions.

Description

Translated fromChinese

一种基于深度学习的自监督单目深度估计方法A self-supervised monocular depth estimation method based on deep learning

技术领域technical field

本发明涉及计算机视觉、深度估计领域，尤其涉及一种基于深度学习的自监督单目深度估计方法。The invention relates to the fields of computer vision and depth estimation, in particular to a self-supervised monocular depth estimation method based on deep learning.

背景技术Background technique

深度感知作为计算机视觉的基本任务之一，可以广泛应用于自动驾驶、增强现实、机器人导航和三维重建等领域。尽管有源传感器(例如：激光雷达、结构光和飞行时间)已被广泛利用以直接获取场景深度，但是有源传感器设备通常体积大、价格昂贵且具有较高的能耗。相比之下，基于RGB(彩色)图像预测深度的方法具有价格低廉、易于实现等优点。在现有基于图像的深度估计方法中，单目深度估计不依赖于感知环境的多次采集，受到了研究人员的广泛关注。As one of the basic tasks of computer vision, depth perception can be widely used in the fields of autonomous driving, augmented reality, robot navigation, and 3D reconstruction. Although active sensors (eg, lidar, structured light, and time-of-flight) have been widely utilized to directly acquire scene depth, active sensor devices are usually bulky, expensive, and have high energy consumption. In contrast, the depth prediction method based on RGB (color) images has the advantages of low price and easy implementation. Among existing image-based depth estimation methods, monocular depth estimation does not rely on multiple acquisitions of the perceptual environment, and has received extensive attention from researchers.

近年来，基于深度学习的单目深度估计方法已取得显著进展。其中，基于监督学习的方法通常需要具有真实深度标注的大型数据集来训练深度估计模型。在实际应用中，对大量图像进行高质量的像素级标注是一项具有挑战性的任务，这极大地限制了基于监督学习的单目深度估计方法的应用。与直接使用深度标签作为监督的方法不同，自监督方法旨在利用单目视频或双目图像为网络训练提供间接的监督信号。因此，研究无需深度标注信息的自监督单目深度估计方法具有重要意义和应用价值。In recent years, deep learning-based monocular depth estimation methods have made significant progress. Among them, supervised learning-based methods usually require large datasets with real depth annotations to train depth estimation models. In practical applications, high-quality pixel-level annotation of a large number of images is a challenging task, which greatly limits the application of supervised learning-based monocular depth estimation methods. Unlike methods that directly use deep labels as supervision, self-supervised methods aim to utilize monocular videos or stereo images to provide indirect supervision signals for network training. Therefore, it is of great significance and application value to study the self-supervised monocular depth estimation method without depth annotation information.

在自监督的单目深度估计方法中，一种基本的技术手段是利用单目深度估计模型从源视图中预测视差图，基于预测的视差图和源视图合成目标视图，并采用合成的目标视图和真实的目标视图间的重建误差约束深度估计模型的训练。最后，可以利用相机参数，基于预测的视差图计算出深度图。然而，现有方法通常仅关注于利用合成的目标视图来构造监督信号，没有充分地探索并利用源视图与合成的目标视图间的几何相关性。此外，由于源视图和目标视图间存在遮挡，现有的方法在视差学习中，直接最小化合成的目标视图与真实的目标视图之间的外观差异，将导致在遮挡区域附近预测的视差不准确。因此，在自监督的单目深度估计方法中，研究如何充分地探索并利用源视图与合成的目标视图之间的几何相关性以及如何解决源视图和目标视图间的遮挡问题是至关重要的。In the self-supervised monocular depth estimation method, a basic technical means is to use the monocular depth estimation model to predict the disparity map from the source view, synthesize the target view based on the predicted disparity map and the source view, and use the synthesized target view The reconstruction error between the real object view and the real object view constrains the training of the depth estimation model. Finally, a depth map can be computed based on the predicted disparity map using the camera parameters. However, existing methods usually only focus on exploiting synthesized target views to construct supervisory signals, and do not sufficiently explore and exploit the geometric correlation between source views and synthesized target views. In addition, due to the occlusion between the source view and the target view, existing methods directly minimize the appearance difference between the synthesized target view and the real target view in disparity learning, which will lead to inaccurate disparity prediction near the occluded area . Therefore, in self-supervised monocular depth estimation methods, it is crucial to study how to fully explore and exploit the geometric correlation between the source view and the synthesized target view and how to solve the occlusion problem between the source view and the target view. .

发明内容SUMMARY OF THE INVENTION

当前的自监督单目深度估计方法通常仅关注于利用合成的目标视图来构造监督信号，没有充分地利用源视图与合成的目标视图间的几何相关性，且没有分析及处理源视图和目标视图间存在的遮挡问题。本发明针对这些问题，提出一种基于深度学习的自监督单目深度估计方法，通过探索源视图与合成的目标视图间的相关性来生成辅助的视觉线索，并利用生成的视觉线索推理遮挡区域来构建遮挡引导的约束，提高自监督单目深度估计的性能，详见下文描述：Current self-supervised monocular depth estimation methods usually only focus on using synthesized target views to construct supervised signals, do not fully exploit the geometric correlation between source views and synthesized target views, and do not analyze and process source and target views. occlusion problems. Aiming at these problems, the present invention proposes a self-supervised monocular depth estimation method based on deep learning, which generates auxiliary visual cues by exploring the correlation between the source view and the synthesized target view, and uses the generated visual cues to infer the occlusion area. to build occlusion-guided constraints and improve the performance of self-supervised monocular depth estimation, as described below:

一种基于深度学习的自监督单目深度估计方法，所述方法包括：A self-supervised monocular depth estimation method based on deep learning, the method comprises:

1)分别提取原始的右视图和合成的左视图的金字塔特征，将金字塔特征进行水平相关操作以获得多尺度的相关特征F^c，并获取完善后的多尺度相关特征F^m；1) Extract the pyramid features of the original right view and the synthesized left view respectively, perform a horizontal correlation operation on the pyramid features to obtain a multi-scale correlation feature F^c , and obtain a perfected multi-scale correlation feature F^m ;

2)将F^m送入双目线索预测模块中的视觉线索预测网络，生成辅助的视觉线索D^r，并从合成的左视图

再重建出右视图

利用重建的右视图

和真实的右视图I^r之间的图像重建损失

来优化双目线索预测模块；2) Feed F^m into the visual cue prediction network in the binocular cue prediction module to generate auxiliary visual cues D^r , and from the synthesized left view

Rebuild the right view

Utilize reconstructed right view

and the real right view I^r between image reconstruction loss

to optimize the binocular cue prediction module;

3)将双目线索预测模块生成的视觉线索D^r用于约束单目深度估计网络预测的视差图D^l，使用一致性损失增强二者之间的一致性；3) The visual cue D^r generated by the binocular cue prediction module is used to constrain the disparity map D^l predicted by the monocular depth estimation network, and the consistency between the two is enhanced by the consistency loss;

4)构建遮挡引导的约束来为遮挡区域像素和非遮挡区域像素的重建误差分配不同的权重。4) Construct occlusion-guided constraints to assign different weights to the reconstruction errors of occluded area pixels and non-occluded area pixels.

其中，所述获得多尺度的相关特征F^c具体为：Wherein, the obtained multi-scale related feature F^c is specifically:

F^c＝F^r(x,y)e F^l(x+d,y)F^c =F^r (x,y)e F^l (x+d,y)

其中，F^r(x,y)和F^l(x,y)分别表示特征图F^r和F^l中位置(x,y)处的值，e表示点积，d表示可能的视差值。Among them, F^r (x, y) and^Fl (x, y) represent the value at position (x, y) in the feature maps F^r and^Fl , respectively, e represents the dot product, and d represents the possible disparity value.

其中，所述完善后的多尺度相关特征F^m具体为：Wherein, the perfected multi-scale correlation feature F^m is specifically:

F^m＝Concat[F^c,Conv(F^r)]F^m =Concat[F^c ,Conv(F^r )]

其中，Conv(·)表示卷积运算，Concat[·,·]表示在相同尺度上的级联操作。Among them, Conv(·) represents the convolution operation, and Concat[·,·] represents the cascade operation on the same scale.

进一步地，所述使用一致性损失增强二者之间的一致性具体为：Further, the use of consistency loss to enhance the consistency between the two is specifically:

其中，w(·)表示变形操作，用来逐像素地对齐D^r和D^l。where w(·) represents the warping operation used to align D^r and D^l pixel by pixel.

其中，所述遮挡引导的约束具体为：Wherein, the constraints of the occlusion guidance are specifically:

其中，e表示点积，p表示像素索引，N表示像素总数，γ表示偏置，SSIM(I^l(p),

)为真实的左视图与合成的左视图间像素p处的结构相似性，I^l(p)为真实的左视图中像素p的像素值，

为合成的左视图中像素p的像素值，M^l(p)为左遮挡掩模中像素p的像素值，M^r(p)为右遮挡掩模中像素p的像素值，SSIM(I^r(p),

)为真实的右视图与合成的右视图间像素p处的结构相似性，I^r(p)为真实的右视图中像素p的像素值，

为合成的右视图中像素p的像素值；where e represents the dot product, p represents the pixel index, N represents the total number of pixels, γ represents the bias, SSIM(I^l (p),

) is the structural similarity at the pixel p between the real left view and the synthesized left view, I^l (p) is the pixel value of the pixel p in the real left view,

is the pixel value of pixel p in the synthesized left view, M^l (p) is the pixel value of pixel p in the left occlusion mask, M^r (p) is the pixel value of pixel p in the right occlusion mask, SSIM(I^r (p),

) is the structural similarity at pixel p between the real right view and the synthesized right view, I^r (p) is the pixel value of pixel p in the real right view,

is the pixel value of pixel p in the synthesized right view;

最终训练整个网络使用的损失函数公式表达如下：The loss function formula used in the final training of the entire network is expressed as follows:

其中，λ_M，λ_con和λ_es表示不同损失函数的权重。where λ_M , λ_con and λ_es represent the weights of different loss functions.

本发明提供的技术方案的有益效果是：The beneficial effects of the technical scheme provided by the present invention are:

1、本发明提出了一种双目线索预测模块，通过探索源视图与合成视图间的相关性，生成辅助的视觉线索，从而实现自监督单目深度估计；1. The present invention proposes a binocular cue prediction module, which generates auxiliary visual cues by exploring the correlation between the source view and the synthesized view, thereby realizing self-supervised monocular depth estimation;

2、本发明提出了一种遮挡引导的约束，为深度估计网络的监督提供正确的指导，提升了遮挡区域附近的深度估计准确性。2. The present invention proposes a constraint of occlusion guidance, which provides correct guidance for the supervision of the depth estimation network, and improves the accuracy of depth estimation near the occlusion area.

附图说明Description of drawings

图1为一种基于深度学习的自监督单目深度估计方法的流程图；1 is a flowchart of a deep learning-based self-supervised monocular depth estimation method;

图2为本发明方法与其他方法的对比结果示意图。Figure 2 is a schematic diagram of the comparison results between the method of the present invention and other methods.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面对本发明实施方式作进一步地详细描述。In order to make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are further described in detail below.

本发明实施例提供了一种基于深度学习的自监督单目深度估计方法，参见图1，该方法包括以下步骤：An embodiment of the present invention provides a deep learning-based self-supervised monocular depth estimation method. Referring to FIG. 1 , the method includes the following steps:

一、构建单目深度估计网络1. Building a monocular depth estimation network

对原始的右视图I^r，使用单目深度估计网络从右视图I^r中学习从右到左的视差图D^l。单目深度估计网络采用具有跳跃连接的编码器-解码器的网络结构，编码器网络使用ResNet50来对右视图进行特征提取，解码器网络由连续的反卷积和跳跃式连接构成来逐步地将特征图的分辨率恢复到输入图像的分辨率。从单目深度估计网络得到视差图D^l之后，利用右视图I^r和视差图D^l来合成左视图

将合成的左视图

和真实的左视图I^l之间的图像重建损失

作为目标函数来优化单目深度估计网络。For the original right view I^r , a right-to-left disparity map D^l is learned from the right view I^r using a monocular depth estimation network. The monocular depth estimation network adopts an encoder-decoder network structure with skip connections. The encoder network uses ResNet50 to extract features from the right view, and the decoder network consists of continuous deconvolution and skip connections to gradually The resolution of the feature map is restored to that of the input image. After the disparity map^Dl is obtained from the monocular depth estimation network, the left view is synthesized using the right view^Ir and the disparity map^Dl

Left view to be composited

and the real left view^Il between image reconstruction loss

as the objective function to optimize the monocular depth estimation network.

二、构建双目线索预测模块2. Build a binocular cue prediction module

对原始的右视图I^r和合成的左视图

使用双目线索预测模块学习它们的几何对应关系，并生成辅助的视觉线索。首先分别提取原始的右视图和合成的左视图的金字塔特征F^r和F^l。For the original right view I^r and the synthesized left view

Use a binocular cue prediction module to learn their geometric correspondence and generate auxiliary visual cues. First, the original right-view and synthesized left-view pyramid features^Fr and^Fl are extracted, respectively.

为了学习这两个视图之间的几何对应关系，将具有相同尺度的金字塔特征F^r和F^l进行水平相关操作以获得多尺度的相关特征F^c，公式如下：F^c＝F^r(x,y)e F^l(x+d,y) (1)In order to learn the geometric correspondence between these two views, the pyramid features Fr and Fl with the same scale are^subjected to horizontal correlation operation to obtain the multi-scale^correlation feature F^c , the formula is as follows: F^c =^Fr (x, y)e F^l (x+d,y) (1)

其中，F^r(x,y)和F^l(x,y)分别表示特征图F^r和F^l中位置(x,y)处的值，e表示点积，d表示可能的视差值。F^c的通道数为可能的视差值的数量。Among them, F^r (x, y) and^Fl (x, y) represent the value at position (x, y) in the feature maps F^r and^Fl , respectively, e represents the dot product, and d represents the possible disparity value. The number of channels of F^c is the number of possible disparity values.

上述水平相关操作将原始的右视图和合成的左视图之间的几何对应关系在不同的尺度上进行了编码。The above horizontal correlation operation encodes the geometric correspondence between the original right view and the synthesized left view at different scales.

为了在双目线索预测模块中预测更准确的视差图，保留右特征F^r的详细信息，以进一步完善多尺度相关特征F^c，公式如下：F^m＝Concat[F^c,Conv(F^r)] (2)In order to predict a more accurate disparity map in the binocular cue prediction module, the detailed information of the right feature Fr is retained to further improve the multi-scale correlation feature F^c , the formula is as follows: F^m =Concat[F^c ,Conv(F^r⁾ ] (2)

其中，F^m表示完善后的多尺度相关特征，Conv(·)表示卷积运算，Concat[·,·]表示在相同尺度上的级联操作。Among them, F^m represents the multi-scale correlation feature after improvement, Conv( ) represents the convolution operation, and Concat[ , ] represents the cascade operation on the same scale.

之后，将F^m送入双目线索预测模块中的视觉线索预测网络，生成辅助的视觉线索D^r，该视觉线索表示原始的右视图和合成的左视图之间相应像素的水平偏移。根据视觉线索D^r，从合成的左视图

再重建出右视图

利用重建的右视图

和真实的右视图I^r之间的图像重建损失

来优化双目线索预测模块。Afterwards,^Fm is fed into the visual cue prediction network in the binocular cue prediction module to generate auxiliary visual cues^Dr , which represent the horizontal offset of corresponding pixels between the original right view and the synthesized left view. From the synthesized left view, according to visual^cues Dr

Rebuild the right view

Utilize reconstructed right view

and the real right view I^r between image reconstruction loss

to optimize the binocular cue prediction module.

其中，视觉线索预测网络的结构为：包含了13个残差块的编码器以及包含了6个反卷积块的解码器组成的编解码网络，残差块采用的是ResNet50中的残差块，反卷积块由一个卷积层和一个上采样层组成。Among them, the structure of the visual cue prediction network is: an encoder including 13 residual blocks and an encoder and decoder network including 6 deconvolution blocks. The residual block is the residual block in ResNet50. , the deconvolution block consists of a convolutional layer and an upsampling layer.

为了利用双目线索预测模块来帮助单目深度估计网络，将双目线索预测模块生成的D^r用于约束单目深度估计网络预测的视差图D^l。考虑到D^r和D^l分别表征了从左视图到右视图以及从右视图到左视图的几何对应关系，因此本发明实施例使用一致性损失来增强D^r和D^l之间的一致性。一致性损失函数的公式表达如下：In order to utilize the stereo cue prediction module to help the monocular depth estimation network, the D^r generated by the stereo cue prediction module is used to constrain the disparity map D^l predicted by the monocular depth estimation network. Considering that^Dr and^Dl respectively represent the geometric correspondence from left view to right view and from right view to left view, the embodiment of the present invention uses consistency loss to enhance the consistency between^Dr and^D1 . The formula of the consistency loss function is expressed as follows:

其中，w(·)表示变形操作，用来逐像素地对齐D^r和D^l，以使得D^r和D^l之间的一致性可以直接用L1损失来测量。where w(·) represents the warping operation used to align^Dr and^Dl pixel by pixel so that the consistency between^Dr and^Dl can be directly measured with the L1 loss.

此外，为了提高视差图的局部平滑度，使用边缘感知平滑损失

来对D^r和D^l进行正则化。边缘感知平滑损失的公式表达如下：Furthermore, to improve the local smoothness of the disparity map, an edge-aware smoothing loss is used

to regularize D^r and D^l . The formula for edge-aware smoothing loss is as follows:

其中，

表示水平方向上的一阶微分算子，

表示竖直方向上的一阶微分算子。in,

represents the first-order differential operator in the horizontal direction,

represents a first-order differential operator in the vertical direction.

三、构建遮挡引导的约束3. Constraints for building occlusion guidance

为了解决遮挡问题，本发明实施例构建遮挡引导的约束来为遮挡区域像素和非遮挡区域像素的重建误差分配不同的权重，从而为视差估计的监督提供正确的指导。In order to solve the occlusion problem, the embodiment of the present invention constructs an occlusion guidance constraint to assign different weights to the reconstruction errors of the pixels in the occlusion area and the pixels in the non-occlusion area, so as to provide correct guidance for the supervision of disparity estimation.

首先，利用D^r和D^l识别左视图和右视图中属于遮挡区域的像素。考虑到除了遮挡区域外，D^r和D^l中的视差值应是一致的，因此计算差异图Diff^l和Diff^r，具体公式如下：First, use^Dr and^Dl to identify pixels belonging to occlusion areas in the left and right views. Considering that the disparity values in D^r and D^l should be consistent except for the occlusion area, the difference maps Diff^l and Diff^r are calculated, and the specific formula is as follows:

其中，w(·)表示变形操作，|·|表示取绝对值操作。Among them, w(·) represents the deformation operation, and |·| represents the absolute value operation.

差异图中的值在遮挡区域要远大于非遮挡区域，因此通过检测差异图Diff^l和Diff^r中的离群点来获得二进制的遮挡掩模M^l和M^r，具体公式如下：The value in the difference map is much larger in the occluded area than in the non-occluded area, so the binary occlusion masks M^l and M^r are obtained by detecting the outliers in the difference maps Diff^l and Diff^r . The specific formula is as follows:

其中，W和H表示不一致图的宽度和高度，[·]表示艾弗森括号，当括号内的条件满足时将值设置为1，否则设置为0，λ表示平衡常数。在遮挡掩模M^l和M^r中，值为0的位置表示遮挡区域像素，而值为1的位置表示非遮挡区域中的像素。Among them, W and H represent the width and height of the inconsistent graph, [ ] represents the Iverson bracket, the value is set to 1 when the conditions in the bracket are satisfied, and 0 otherwise, and λ represents the equilibrium constant. In occlusion masks^Ml and^Mr , locations with a value of 0 represent pixels in the occluded area, while locations with a value of 1 represent pixels in the non-occluded area.

具体地，左遮挡掩模M^l表示在左视图中可见但在右视图中不可见的像素，右遮挡掩模M^r表示在右视图中可见但在左视图中不可见的像素。Specifically, the left^occlusion mask^M1 represents pixels that are visible in the left view but not in the right view, and the right occlusion mask Mr represents pixels that are visible in the right view but not visible in the left view.

利用遮挡掩模M^l和M^r，本发明实施例使用遮挡引导的约束来为视差估计的监督提供更准确的指导，具体公式如下：Using the occlusion masks M^l and M^r , the embodiment of the present invention uses the constraints of occlusion guidance to provide more accurate guidance for the supervision of disparity estimation, and the specific formula is as follows:

为合成的右视图中像素p的像素值，α＝0.85。where e represents the dot product, p represents the pixel index, N represents the total number of pixels, γ represents the bias, SSIM(I^l (p),

is the pixel value of pixel p in the synthesized right view, α=0.85.

四、训练基于深度学习的自监督单目深度估计网络4. Training a deep learning-based self-supervised monocular depth estimation network

该训练过程中，基于深度学习的自监督单目深度估计网络包括单目深度估计网络、双目线索预测模块和遮挡引导约束(公式(9)和(10))，训练分为四个阶段。In this training process, the deep learning-based self-supervised monocular depth estimation network includes a monocular depth estimation network, a stereo cue prediction module, and occlusion guidance constraints (formulas (9) and (10)). The training is divided into four stages.

在第一阶段使用左视图的图像重建损失

和边缘感知平滑损失

训练单目深度估计网络。在第二阶段，不更新单目深度估计网络的权重，使用右视图的图像重建损失

和边缘感知平滑损失

训练双目线索预测模块。在第三阶段，使用

和

共同优化单目深度估计网络和双目线索预测模块。最后，在第四阶段，将遮挡引导约束嵌入整个网络，使用

联合训练整个网络，各个损失的权重{λ_M,λ_con,λ_es}分别为{1.0,1.0,0.1}。Image reconstruction loss using the left view in the first stage

and edge-aware smoothing loss

Train a monocular depth estimation network. In the second stage, the weights of the monocular depth estimation network are not updated, and the image reconstruction loss from the right view is used

and edge-aware smoothing loss

Train the binocular cue prediction module. In the third stage, use

and

Co-optimize the monocular depth estimation network and the binocular cue prediction module. Finally, in the fourth stage, occlusion-guided constraints are embedded throughout the network, using

The entire network is jointly trained, and the weights of each loss {λ_M , λ_con , λ_es } are {1.0, 1.0, 0.1}, respectively.

图2给出了预测深度图的均方根误差对比结果，对比算法包括：3Net方法和Monodepth2方法，这两种方法均是自监督单目深度估计算法。均方根误差越小，所预测的深度图越准确。如图所示，3Net方法和Monodepth2方法均得到了较大的均方根误差，原因在于这两种方法均是仅利用合成的目标视图来构造监督信号，没有进一步利用源视图与合成目标视图间的几何相关性，且没有处理源视图和目标视图间存在的遮挡问题。从图2中可以看出，通过探索源视图与合成目标视图间的相关性来生成辅助的视觉线索以及构建遮挡引导的约束，本发明方法可以获得更准确的深度图。Figure 2 shows the comparison results of the root mean square error of the predicted depth map. The comparison algorithms include: the 3Net method and the Monodepth2 method, both of which are self-supervised monocular depth estimation algorithms. The smaller the root mean square error, the more accurate the predicted depth map. As shown in the figure, both the 3Net method and the Monodepth2 method obtain large RMSE, because both methods only use the synthesized target view to construct the supervision signal, and do not further utilize the difference between the source view and the synthesized target view. The geometric dependencies of the source view and the target view are not dealt with. As can be seen from Fig. 2, the method of the present invention can obtain a more accurate depth map by exploring the correlation between the source view and the synthetic target view to generate auxiliary visual cues and construct constraints for occlusion guidance.

本发明实施例对各器件的型号除做特殊说明的以外，其他器件的型号不做限制，只要能完成上述功能的器件均可。In the embodiment of the present invention, the models of each device are not limited unless otherwise specified, as long as the device can perform the above functions.

本领域技术人员可以理解附图只是一个优选实施例的示意图，上述本发明实施例序号仅仅为了描述，不代表实施例的优劣。Those skilled in the art can understand that the accompanying drawing is only a schematic diagram of a preferred embodiment, and the above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages or disadvantages of the embodiments.

以上所述仅为本发明的较佳实施例，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection of the present invention. within the range.

Claims

Translated fromChinese

1.一种基于深度学习的自监督单目深度估计方法，其特征在于，所述方法包括：1. A self-supervised monocular depth estimation method based on deep learning, wherein the method comprises:

再重建出右视图

利用重建的右视图

和真实的右视图I^r之间的图像重建损失

Rebuild the right view

Utilize reconstructed right view

and the real right view I^r between image reconstruction loss

to optimize the binocular cue prediction module;3)将双目线索预测模块生成的视觉线索D^r用于约束单目深度估计网络预测的视差图D^l，使用一致性损失增强二者之间的一致性；3) The visual cue D^r generated by the binocular cue prediction module is used to constrain the disparity map D^l predicted by the monocular depth estimation network, and the consistency between the two is enhanced by the consistency loss;

2.根据权利要求1所述的一种基于深度学习的自监督单目深度估计方法，其特征在于，所述获得多尺度的相关特征F^c具体为：2. a kind of self-supervised monocular depth estimation method based on deep learning according to claim 1, is characterized in that, described obtaining multi-scale correlation feature F^c is specifically:

F^c＝F^r(x,y)e F^l(x+d,y)F^c =F^r (x,y)e F^l (x+d,y)

其中，F^r(x,y)和F^l(x,y)分别表示特征图F^r和F^l中位置(x,y)处的值，e表示点积，d表示视差值。Among them, F^r (x, y) and F^l (x, y) represent the value at position (x, y) in the feature maps F^r and F^l , respectively, e represents the dot product, and d represents the disparity value.

3.根据权利要求2所述的一种基于深度学习的自监督单目深度估计方法，其特征在于，所述完善后的多尺度相关特征F^m具体为：3. a kind of self-supervised monocular depth estimation method based on deep learning according to claim 2, is characterized in that, described perfected multi-scale correlation feature F^m is specifically:

F^m＝Concat[F^c,Conv(F^r)]F^m =Concat[F^c ,Conv(F^r )]

4.根据权利要求1所述的一种基于深度学习的自监督单目深度估计方法，其特征在于，所述使用一致性损失增强二者之间的一致性具体为：4. a kind of self-supervised monocular depth estimation method based on deep learning according to claim 1, is characterized in that, described using consistency loss to enhance the consistency between the two is specifically:

5.根据权利要求4所述的一种基于深度学习的自监督单目深度估计方法，其特征在于，所述遮挡引导的约束具体为：5. a kind of self-supervised monocular depth estimation method based on deep learning according to claim 4, is characterized in that, the constraint of described occlusion guidance is specifically:

其中，e表示点积，p表示像素索引，N表示像素总数，γ表示偏置，

为真实的左视图与合成的左视图间像素p处的结构相似性，I^l(p)为真实的左视图中像素p的像素值，

为合成的左视图中像素p的像素值，M^l(p)为左遮挡掩模中像素p的像素值，M^r(p)为右遮挡掩模中像素p的像素值，

为真实的右视图与合成的右视图间像素p处的结构相似性，I^r(p)为真实的右视图中像素p的像素值，

为合成的右视图中像素p的像素值；where e is the dot product, p is the pixel index, N is the total number of pixels, γ is the bias,

is the structural similarity at pixel p between the real left view and the synthesized left view, I^l (p) is the pixel value of pixel p in the real left view,

is the pixel value of pixel p in the synthesized left view, M^l (p) is the pixel value of pixel p in the left occlusion mask, M^r (p) is the pixel value of pixel p in the right occlusion mask,

is the structural similarity at pixel p between the real right view and the synthesized right view, I^r (p) is the pixel value of pixel p in the real right view,

is the pixel value of pixel p in the synthesized right view;