





Technical Field
The invention belongs to the field of depth estimation in computer vision, and in particular relates to a sparse depth densification method based on a multi-scale convolutional neural network.
Background Art
In autonomous driving, the perception system based on computer vision is the most fundamental component. At present, visible-light cameras are the sensors most commonly used in autonomous driving perception systems; they have the advantages of low cost and mature technology. However, visible-light cameras also have obvious drawbacks. First, the RGB images they capture contain only color information, so when the target texture is complex the perception system is prone to misjudgment. Second, visible-light cameras fail in certain environments; for example, at night with insufficient illumination it is difficult for a camera to work normally. Lidar is another sensor frequently used in autonomous driving perception systems. Lidar is not easily affected by lighting conditions, and the point cloud data it collects is three-dimensional. A depth image can be obtained directly from the point cloud: it is formed by projecting the point cloud onto a two-dimensional plane, and the value of each pixel indicates the distance from that point to the sensor. Compared with RGB images, the distance information contained in depth images is more helpful for tasks such as object recognition and segmentation. However, lidar is expensive, the collected point cloud is very sparse, and the resulting depth map is likewise sparse, which limits its usefulness to a certain extent.
Summary of the Invention
In view of the above problems, the object of the present invention is to provide a method for densifying sparse depth using a multi-scale network.
The sparse depth densification method based on a multi-scale network of the present invention comprises the following steps:
Constructing a multi-scale network model:
The multi-scale network model comprises L (L≥2) input branches; the outputs of the L branches are added point-wise and fed into an information fusion layer, which is followed by an upsampling processing layer serving as the output layer of the multi-scale network model;
Among the L input branches, one branch takes the original image as its input, and the remaining L-1 branches take as input the down-sampled images obtained from the original image at different down-sampling factors; the output image of the output layer of the multi-scale network model has the same size as the original image;
The input data of each of the L input branches consists of an RGB image and a sparse depth map. The sparse depth map of the original image is down-sampled as follows: given a preset down-sampling factor K, the sparse depth map is divided into grids of pixels, each grid containing K×K original input pixels; a flag value si is set for each original input pixel according to its depth value, with si=0 if the depth value of the current original input pixel is 0 and si=1 otherwise, where i indexes the K×K original input pixels of each grid; the depth value pnew of each grid is then obtained from the formula pnew = Σi(si·pi) / Σi(si), where pi denotes the depth value of original input pixel i;
The network structure of the branch whose input is the original image is the first network structure;
The network structure of a branch whose input is a down-sampled image of the original image is as follows: K/2 upsampling convolution blocks D with 16 channels are appended after the first network structure, where K denotes the down-sampling factor of the original image for that branch;
The first network structure comprises fourteen layers, namely:
The first layer consists of an input layer and a pooling layer; the input layer has a 7*7 convolution kernel, 64 channels, and a convolution stride of 2; the pooling layer uses max pooling with a 3*3 kernel and a pooling stride of 2;
The second and third layers have the same structure, each being a 64-channel R1 residual convolution block;
The fourth layer is a 128-channel R2 residual convolution block;
The fifth layer is a 128-channel R1 residual convolution block;
The sixth layer is a 256-channel R2 residual convolution block;
The seventh layer is a 256-channel R1 residual convolution block;
The eighth layer is a 512-channel R2 residual convolution block;
The ninth layer is a 512-channel R1 residual convolution block;
The tenth layer is a convolution layer with a 3*3 kernel, 256 channels, and a convolution stride of 1;
The eleventh layer is a 128-channel upsampling convolution block D; the output of the eleventh layer is concatenated channel-wise with the output of the seventh layer and then fed into the twelfth layer;
The twelfth layer is a 64-channel upsampling convolution block D; the output of the twelfth layer is concatenated channel-wise with the output of the fifth layer and then fed into the thirteenth layer;
The thirteenth layer is a 32-channel upsampling convolution block D; the output of the thirteenth layer is concatenated channel-wise with the output of the third layer and then fed into the fourteenth layer;
The fourteenth layer is a 16-channel upsampling convolution block D;
The R1 residual convolution block comprises two convolution layers of identical structure, each with a 3*3 kernel, a convolution stride of 1, and an adjustable number of channels; the input data of the R1 residual convolution block is added point-wise to the output of the second layer and passed through a ReLU activation function, which serves as the output layer of the R1 residual convolution block;
The R2 residual convolution block comprises a first, a second, and a third convolution layer; the input data of the R2 residual convolution block enters two parallel branches, and the outputs of the two branches are added point-wise and passed through a ReLU activation function, which serves as the output layer of the R2 residual convolution block; one branch consists of the first and second convolution layers connected in sequence, and the other branch consists of the third convolution layer;
The first convolution layer has a 3*3 kernel and a convolution stride of 2, the second convolution layer has a 3*3 kernel and a convolution stride of 1, and the third convolution layer has a 1*1 kernel and a convolution stride of 2; the number of channels of each layer is adjustable;
The upsampling convolution block D comprises two magnification modules and one convolution layer; the input data of the upsampling convolution block D enters two parallel branches, and the outputs of the two branches are added point-wise and passed through a ReLU activation function, which serves as the output layer of the upsampling convolution block D; one branch consists of the first magnification module and the convolution layer connected in sequence, and the other branch consists of the second magnification module;
The convolution layer of the upsampling convolution block D has a 3*3 kernel, a convolution stride of 1, and an adjustable number of channels;
The magnification module of the upsampling convolution block D comprises four parallel convolution layers with the same number of channels, whose kernel sizes are 3*3, 3*2, 2*3, and 2*2 respectively, all with a convolution stride of 1; the input data of the magnification module passes through the four convolution layers and the outputs are stitched together as the output of the magnification module;
The information fusion module is a convolution layer with a 3*3 kernel, 1 channel, and a convolution stride of 1;
Deep-learning training is performed on the constructed multi-scale network model, and the densified result of the image to be processed is obtained with the trained multi-scale network model.
In summary, by adopting the above technical solution, the beneficial effects of the present invention are as follows: the present invention estimates depth by combining a sparse point cloud with an image, where the sparse depth guides the RGB image and the RGB image supplements the sparse depth; by combining the advantages of the two data forms and performing depth estimation with the multi-scale network model of the present invention, the accuracy of depth estimation is improved.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the down-sampling of the present invention in a specific embodiment;
Fig. 2 is a schematic diagram of the residual convolution blocks in a specific embodiment, where Fig. 2-a shows the type-one residual convolution block and Fig. 2-b shows the type-two residual convolution block;
Fig. 3 is a schematic diagram of the upsampling convolution block in a specific embodiment, where Fig. 3-a shows the magnification module and Fig. 3-b shows the complete upsampling convolution block;
Fig. 4 is a schematic diagram of the multi-scale network structure adopted in a specific embodiment;
Fig. 5 compares the results of the present invention with those of an existing method in a specific embodiment, where Fig. 5-a is the input RGB image, Fig. 5-b is the sparse depth map, Fig. 5-c is the depth estimate of Fig. 5-b by the existing method, and Fig. 5-d is the depth estimate of Fig. 5-b by the present invention.
Detailed Description of Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the embodiments and the accompanying drawings.
To meet the demand of specific scenarios (such as autonomous driving) for high-quality depth images, the present invention proposes a method for densifying sparse depth using a multi-scale network. Existing depth estimation methods mainly obtain dense depth directly from RGB images, but estimating a depth image directly from a two-dimensional image suffers from inherent ambiguity. To solve this problem, the present invention estimates depth by combining a sparse point cloud with an image: the sparse depth guides the RGB image, the RGB image supplements the sparse depth, the advantages of the two data forms are combined, and depth estimation is performed at multiple scales simultaneously, which improves the accuracy of depth estimation.
The present invention uses a multi-scale convolutional neural network to effectively fuse RGB image data and sparse point cloud data and finally produce a dense depth image. The sparse point cloud is projected onto a two-dimensional plane to generate a sparse depth map aligned with the RGB image; the sparse depth map and the RGB image are then concatenated into an RGBD (RGB + Depth Map) image, which is fed into the multi-scale convolutional neural network for training and testing, finally yielding a dense depth map. Estimating depth from the combination of an RGB image and a sparse point cloud allows the distance information contained in the point cloud to guide the conversion of the RGB image into a depth map; the multi-scale network exploits information from the original data at different resolutions, which on the one hand enlarges the receptive field and on the other hand makes the input depth map at the small resolution denser, yielding higher accuracy.
The multi-scale sparse depth densification method proposed by the present invention is implemented as follows:
(1) Down-sampling of the input data:
The feasible down-sampling factor depends strongly on the size of the input data. For an input image of size M*N, the feasible range of down-sampling factors is [2, min(M,N)*2^-5].
The sampling procedure is as follows. Let K denote the selected down-sampling factor. The input sparse depth map is divided into grids of pixels, each grid containing K*K original input pixels, so the input image is divided into (M/K)*(N/K) grids. Fig. 1 is a schematic diagram for a down-sampling factor of 2. The K*K pixels of a grid are denoted as the pixel set P={p1, p2, ..., pK*K}.
Since the sparse depth map contains values whose depth is zero, which are called invalid values, a flag value s is constructed to mark them: if the depth value of a pixel is not 0, the pixel is considered valid and s is set to 1; otherwise it is invalid and s is set to 0. The flag value set corresponding to the pixel set P is thus S={s1, s2, ..., sK*K}.
The new depth value after the above down-sampling is pnew = Σn(sn·pn) / Σn(sn), where pn denotes the depth value of an original pixel and sn denotes its flag value.
The above operation is performed on every divided grid, yielding a new depth map with smaller resolution that is denser (referred to as the small-resolution depth map). Compared with traditional down-sampling methods, the small-resolution depth map obtained in this way is denser, and because the influence of invalid values is removed, its depth values are also more accurate. The RGB image is down-sampled with the traditional bilinear interpolation method. The result is a small-resolution image and a small-resolution sparse depth map.
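The following is a minimal Python sketch of this validity-masked down-sampling, assuming the sparse depth map is stored as a two-dimensional array whose invalid pixels hold 0; the function name downsample_sparse_depth and the choice of leaving cells without any valid pixel at 0 are illustrative assumptions rather than part of the method as described above.

```python
import numpy as np

def downsample_sparse_depth(depth: np.ndarray, k: int) -> np.ndarray:
    """Average only the valid (non-zero) depths inside each k x k grid cell."""
    h, w = depth.shape
    assert h % k == 0 and w % k == 0, "assumes the image size is divisible by k"
    blocks = depth.reshape(h // k, k, w // k, k)       # each k x k grid cell becomes one block
    valid = (blocks > 0).astype(depth.dtype)           # flag s: 1 for valid pixels, 0 for invalid
    num_valid = valid.sum(axis=(1, 3))                 # sum of s over each cell
    depth_sum = (blocks * valid).sum(axis=(1, 3))      # sum of s*p over each cell
    # p_new = sum(s*p) / sum(s); cells with no valid pixel are left at 0 (assumption).
    return np.divide(depth_sum, num_valid,
                     out=np.zeros_like(depth_sum), where=num_valid > 0)

# Example: factor-2 down-sampling of a 4x4 sparse depth map.
sparse = np.array([[0., 2., 0., 0.],
                   [4., 0., 0., 6.],
                   [1., 1., 0., 0.],
                   [1., 1., 0., 0.]])
print(downsample_sparse_depth(sparse, 2))   # [[3. 6.] [1. 0.]]
```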
(2) Constructing the residual convolution blocks:
The residual convolution block is an important component of the multi-scale network of the present invention and is used to extract features from the input data; it comes in two types.
Type one: the residual convolution block R1 is constructed as follows. As shown in Fig. 2-a, the first layer of the block is a convolution layer with a 3*3 kernel, n channels, and a convolution stride of 1. The second layer has the same structure as the first. The input data is then added point-wise to the output of the second layer, and finally a ReLU activation function is applied. The structure of the residual convolution block is fixed, but the number of channels of its convolution layers is adjustable, and different residual convolution blocks are obtained by adjusting the channel number; the type-one residual convolution block is therefore named n-channel R1. The input and output of R1 have the same size, and no down-sampling is performed.
Type two: the residual convolution block R2 is constructed as follows. As shown in Fig. 2-b, the first layer of the block is a convolution layer with a 3*3 kernel, n channels, and a convolution stride of 2. The second layer is also a convolution layer, with a 3*3 kernel, n channels, and a convolution stride of 1. In parallel, the input data is passed through a convolution layer with a 1*1 kernel, n channels, and a convolution stride of 2, and this output is added point-wise to the output of the second layer; finally, a ReLU activation function is applied. Following the naming convention of R1, the type-two residual convolution block is named n-channel R2. The input of R2 is twice the size of its output; the purpose of this operation is to enlarge the receptive field of the convolution kernels and better extract global features.
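The two residual convolution blocks can be sketched in PyTorch as follows. The class names are illustrative, the padding values are chosen so that the stated strides produce the stated size changes, and, since no activation between the two convolutions is specified above, none is inserted.

```python
import torch
import torch.nn as nn

class R1(nn.Module):
    """Type one: two 3x3 stride-1 convolutions plus an identity skip (no down-sampling)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.conv1(x))
        return self.relu(out + x)                      # point-wise addition, then ReLU

class R2(nn.Module):
    """Type two: 3x3 stride-2 then 3x3 stride-1 convolutions, with a 1x1 stride-2 skip."""
    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.skip = nn.Conv2d(in_channels, channels, kernel_size=1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.conv1(x))
        return self.relu(out + self.skip(x))           # halves the spatial size

x = torch.randn(1, 64, 56, 56)
print(R1(64)(x).shape)        # torch.Size([1, 64, 56, 56])
print(R2(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28])
```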
(3) Constructing the upsampling convolution block:
The upsampling convolution block is also an important part of the multi-scale network; its role is to magnify the input, and each upsampling convolution block doubles the size of its input. It is constructed as follows. The basic module of the upsampling convolution block is the magnification module; as shown in Fig. 3-a, the magnification module consists of four parallel convolution layers, all with n channels, whose kernel sizes are 3*3, 3*2, 2*3, and 2*2 respectively. The input passes through these four convolution layers and the outputs are stitched together, so the output is twice as large as the input. As shown in Fig. 3-b, the upsampling convolution block consists of two branches. The first layer of branch one is an n-channel magnification module followed by a ReLU activation function, and its second layer is a convolution layer with a 3*3 kernel and n channels. Branch two has only one layer, an n-channel magnification module. The output of branch one is added point-wise to the output of branch two, and finally a ReLU activation function is applied. Following the naming convention of R1 and R2, the upsampling convolution block is named n-channel D.
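A minimal PyTorch sketch of the magnification module and of the upsampling convolution block D follows. The text above specifies the four kernel sizes and that the four outputs are stitched into a map of twice the input size; the asymmetric padding and the pixel-shuffle interleaving used here to realize that stitching are assumptions, as are the class names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Magnify(nn.Module):
    """Four parallel convolutions (3x3, 3x2, 2x3, 2x2) interleaved into a 2x larger map."""
    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.kernels = [(3, 3), (3, 2), (2, 3), (2, 2)]
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, channels, kernel_size=k) for k in self.kernels])

    def forward(self, x):
        outs = []
        for (kh, kw), conv in zip(self.kernels, self.convs):
            # Pad so every branch keeps the input's spatial size (asymmetric for even kernels).
            pad = (kw - 2, 1, kh - 2, 1)                # (left, right, top, bottom)
            outs.append(conv(F.pad(x, pad)))
        # Interleave the four maps into a 2x2 grid per input pixel -> doubled resolution.
        stacked = torch.stack(outs, dim=2)              # (N, n, 4, H, W)
        n, c, _, h, w = stacked.shape
        return F.pixel_shuffle(stacked.reshape(n, c * 4, h, w), 2)

class UpBlockD(nn.Module):
    """Upsampling block D: (magnify -> ReLU -> 3x3 conv) + (magnify), then ReLU."""
    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.mag1 = Magnify(in_channels, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.mag2 = Magnify(in_channels, channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(self.relu(self.mag1(x))) + self.mag2(x))

x = torch.randn(1, 256, 8, 10)
print(UpBlockD(256, 128)(x).shape)   # torch.Size([1, 128, 16, 20])
```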
(4) Constructing the multi-scale convolutional network:
The multi-scale network can be built with multiple scales, i.e., multiple branches. Like the down-sampling factor, the number of branches that can be built is limited by the size of the input image: for an image of size M*N, the upper limit on the number of branches is log2(min(M,N)*2^-5)+1. The construction is illustrated here with two branches: one branch takes the original resolution as input, and the other takes 1/K of the original resolution as input, where K is the down-sampling factor of the input image. Finally, the information of the two branches is fused.
The first branch, i.e., the branch whose input is at the original resolution, is constructed as follows:
The first layer consists of an input layer and a pooling layer. The input layer has a 7*7 convolution kernel, 64 channels, and a convolution stride of 2. The pooling layer uses max pooling with a 3*3 kernel and a pooling stride of 2. The original input has size M*N*4; after the first layer its size becomes (M/4)*(N/4)*64, i.e., the spatial size becomes 1/4 of the original and the number of channels becomes 64.
The second layer is a 64-channel R1 residual convolution block, denoted R11.
The third layer has the same structure as the second and is denoted R12.
The fourth layer is a 128-channel R2 residual convolution block, denoted R21.
The fifth layer is a 128-channel R1 residual convolution block, denoted R13.
The sixth layer is a 256-channel R2 residual convolution block, denoted R22.
The seventh layer is a 256-channel R1 residual convolution block, denoted R14.
The eighth layer is a 512-channel R2 residual convolution block, denoted R23.
The ninth layer is a 512-channel R1 residual convolution block, denoted R15.
The tenth layer is a convolution layer with a 3*3 kernel, 256 channels, and a convolution stride of 1.
The eleventh layer is a 128-channel upsampling convolution block D, denoted D1.
The output of D1 is then concatenated channel-wise with the output of the seventh layer R14, where the output size of R14 is (M/16)*(N/16)*256 and the output size of D1 is (M/16)*(N/16)*128, so the concatenated size becomes (M/16)*(N/16)*384. The significance of this concatenation is that some of the original information lost during convolution can be recovered, making the result more accurate.
The twelfth layer is a 64-channel upsampling convolution block D, denoted D2; the output of D2 is then concatenated channel-wise with the output of R13.
The thirteenth layer is a 32-channel upsampling convolution block D, denoted D3; the output of D3 is then concatenated channel-wise with the output of R12.
The fourteenth layer is a 16-channel upsampling convolution block D, denoted D4.
At this point, the network structure of the branch whose input is at the original resolution is complete.
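A minimal PyTorch sketch of this fourteen-layer branch is given below, reusing the R1, R2, and UpBlockD classes from the sketches above. The max-pooling padding, the 4-channel RGBD input, the channel counts after each concatenation, and the use of an input size divisible by 32 (so that the decoder outputs and the encoder skips line up exactly) are inferred from the sizes stated above and should be read as assumptions.

```python
import torch
import torch.nn as nn

class FirstBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                       # layer 1: 7x7/64/stride-2 conv + 3x3 max pool
            nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        self.r11, self.r12 = R1(64), R1(64)              # layers 2-3
        self.r21, self.r13 = R2(64, 128), R1(128)        # layers 4-5
        self.r22, self.r14 = R2(128, 256), R1(256)       # layers 6-7
        self.r23, self.r15 = R2(256, 512), R1(512)       # layers 8-9
        self.conv10 = nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1)   # layer 10
        self.d1 = UpBlockD(256, 128)                     # layer 11
        self.d2 = UpBlockD(128 + 256, 64)                # layer 12 (D1 output concat R14 output)
        self.d3 = UpBlockD(64 + 128, 32)                 # layer 13 (D2 output concat R13 output)
        self.d4 = UpBlockD(32 + 64, 16)                  # layer 14 (D3 output concat R12 output)

    def forward(self, rgbd):                             # rgbd: (batch, 4, H, W), H and W divisible by 32
        x = self.stem(rgbd)                              # 64 ch,  H/4
        skip3 = self.r12(self.r11(x))                    # 64 ch,  H/4
        skip5 = self.r13(self.r21(skip3))                # 128 ch, H/8
        skip7 = self.r14(self.r22(skip5))                # 256 ch, H/16
        x = self.conv10(self.r15(self.r23(skip7)))       # 256 ch, H/32
        x = torch.cat([self.d1(x), skip7], dim=1)        # channel-wise concatenation
        x = torch.cat([self.d2(x), skip5], dim=1)
        x = torch.cat([self.d3(x), skip3], dim=1)
        return self.d4(x)                                # 16 ch, H/2

print(FirstBranch()(torch.randn(1, 4, 320, 256)).shape)  # torch.Size([1, 16, 160, 128])
```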
The second branch, i.e., the branch whose input is at 1/K of the original resolution, is constructed as follows:
Its first fourteen layers are identical to those of the branch whose input is at the original resolution; after them, a number of 16-channel upsampling convolution blocks D corresponding to the input size of the branch must be appended. For a branch whose input is at 1/K of the original resolution (down-sampling factor K), K/2 upsampling convolution blocks are added. Fig. 4 shows a two-branch case in which the input of the second branch is at 1/2 of the original resolution (down-sampling factor 2), so the number of upsampling convolution blocks D to be added to the second branch is 1. The multi-resolution case is analogous: if the input is at 1/4 of the original resolution, two 16-channel upsampling convolution blocks are added, and so on.
After the branches are constructed, the information of the two branches must be fused. The information fusion structure is as follows: the output of the first branch is added point-wise to the output of the second branch and used as the input of the information fusion module. The network structure of the information fusion module is a convolution layer with a 3*3 kernel and 1 channel; finally, the output of this layer is linearly up-sampled to obtain the final result with the same size as the original input.
For information fusion in the case of more than two branches, the outputs of all branches are likewise added point-wise and fed into the information fusion module.
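A minimal PyTorch sketch of the two-branch model and of the information fusion step is given below, reusing FirstBranch and UpBlockD from the sketches above; the use of bilinear interpolation for the final linear up-sampling and the class name MultiScaleNet are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    def __init__(self, k: int = 2):
        super().__init__()
        self.branch_full = FirstBranch()                  # input at the original resolution
        self.branch_small = FirstBranch()                 # input at 1/k of the original resolution
        # k/2 extra 16-channel upsampling blocks bring the small branch to the same output size.
        self.extra_up = nn.Sequential(*[UpBlockD(16, 16) for _ in range(k // 2)])
        self.fuse = nn.Conv2d(16, 1, kernel_size=3, stride=1, padding=1)    # information fusion module

    def forward(self, rgbd_full, rgbd_small):
        out_full = self.branch_full(rgbd_full)
        out_small = self.extra_up(self.branch_small(rgbd_small))
        fused = self.fuse(out_full + out_small)           # point-wise addition, then 3x3 conv
        # Linear up-sampling back to the original input resolution.
        return F.interpolate(fused, size=rgbd_full.shape[-2:], mode='bilinear', align_corners=False)

net = MultiScaleNet(k=2)
pred = net(torch.randn(1, 4, 320, 256), torch.randn(1, 4, 160, 128))
print(pred.shape)   # torch.Size([1, 1, 320, 256])
```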
(5) Setting of the loss function:
In this embodiment, the Smooth L1 loss function is adopted, i.e., loss = (1/N)·Σ smoothL1(d − dg), with smoothL1(x) = 0.5x^2 for |x| < 1 and |x| − 0.5 otherwise, where d denotes the depth value estimated by the convolutional neural network, dg denotes the standard (ground-truth) depth value, and N denotes the total number of pixels in a depth map.
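PyTorch provides this loss directly as nn.SmoothL1Loss, which averages the element-wise Smooth L1 term over all pixels; the short sketch below uses random tensors purely to illustrate the call.

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()              # mean of the element-wise Smooth L1 term over the N pixels
pred = torch.rand(1, 1, 304, 228)          # depth estimated by the network (illustrative values)
gt = torch.rand(1, 1, 304, 228)            # standard (ground-truth) depth
print(criterion(pred, gt))
```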
(6) Training and testing of the model:
In this embodiment, the training data comes from the public NYU-Depth-v2 dataset, which contains RGB images and dense depth maps of size 640*480. For training, 48,000 RGB images and their corresponding dense depth maps are used; for testing, 654 RGB images and their corresponding dense depth maps are used. The input of the network is an RGB image and a sparse depth map; since the dataset does not contain sparse depth maps, a sparse depth map is obtained by randomly sampling 1000 points from the dense depth map and is combined with the RGB image into an RGBD image as the input.
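The sparse input can be generated from the dense ground truth as in the following sketch; sampling only among pixels that already have a valid depth, and the function names, are assumptions.

```python
import numpy as np

def make_sparse_depth(dense_depth: np.ndarray, num_samples: int = 1000) -> np.ndarray:
    """Keep a random subset of valid depth pixels and zero out the rest."""
    sparse = np.zeros_like(dense_depth)
    ys, xs = np.nonzero(dense_depth)                        # candidate pixels with a valid depth
    idx = np.random.choice(len(ys), size=num_samples, replace=False)
    sparse[ys[idx], xs[idx]] = dense_depth[ys[idx], xs[idx]]
    return sparse

def make_rgbd(rgb: np.ndarray, dense_depth: np.ndarray) -> np.ndarray:
    """Concatenate the RGB image (H x W x 3) with a sparse depth channel into an H x W x 4 RGBD image."""
    sparse = make_sparse_depth(dense_depth)
    return np.concatenate([rgb, sparse[..., None]], axis=-1)
```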
During training, the RGBD image is down-sampled to 320*240 and then center-cropped to 304*228 (this is the original image fed to the multi-scale network model); this image is used as the input of the first branch, and it is further down-sampled by a factor of two using the method described in step (1) to obtain a 152*114 RGBD image as the input of the second branch. With 8 images per training step, one pass over the dataset takes 6,000 steps, and training over the whole dataset 15 times requires 90,000 steps in total. A varying learning rate is used: the initial learning rate is set to 0.01, and after every 5 passes over the dataset the learning rate is divided by 10, so the final learning rate is 0.0001. After training, the parameters of the model are saved.
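The training schedule described above can be sketched as follows, reusing MultiScaleNet from the earlier sketch; the SGD optimizer, its momentum, the data loader train_loader (batch size 8), and the checkpoint file name are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn

model = MultiScaleNet(k=2)
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)           # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)   # divide the LR by 10 every 5 epochs

for epoch in range(15):                              # 48000 images / 8 per step = 6000 steps per epoch
    for rgbd_full, rgbd_small, gt in train_loader:   # train_loader is assumed to yield both branch inputs
        optimizer.zero_grad()
        loss = criterion(model(rgbd_full, rgbd_small), gt)
        loss.backward()
        optimizer.step()
    scheduler.step()                                 # 0.01 -> 0.001 -> 0.0001 over the 15 epochs

torch.save(model.state_dict(), 'multiscale_depth.pth')   # save the trained parameters
```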
During testing, the parameters of the model are loaded and the data is processed in the same way as during training; the processed data is fed into the model, which outputs the final result. Fig. 5 shows some comparisons between the output of the present invention and an existing deep-learning method. Overall, the results of the present invention are sharper, and the comparison within the black boxes shows that the details are reproduced better by the present invention.
The above is only a specific embodiment of the present invention. Unless otherwise stated, any feature disclosed in this specification may be replaced by other equivalent features or alternative features serving a similar purpose; all the disclosed features, or all the steps of the methods or processes, may be combined in any way, except for mutually exclusive features and/or steps.