Monocular depth estimation method based on deep learning

Technical Field
The invention belongs to the technical field of image depth estimation, and particularly relates to a monocular depth estimation method based on deep learning.
Background
With the rapid development of artificial intelligence, computer vision has found rapid application in daily life. Three-dimensional information of a scene is important for scene understanding, and how to acquire depth information from a scene has become a hot research direction in recent years. Ordinary cameras capture only two-dimensional plane images, which lack depth information. Therefore, one of the important tasks of computer vision is to reconstruct a three-dimensional model of a scene by acquiring its depth information, in a manner that simulates how the human eye perceives the world.
At present, two technical schemes are mainly used for acquiring the depth information of an image: monocular depth estimation and binocular/multi-view depth estimation. Binocular/multi-view depth estimation achieves good accuracy, but it places high demands on the number of cameras and their parameters, is inconvenient to use, and is costly, which hinders practical popularization. Monocular depth estimation markedly reduces the requirements on the number of cameras and their parameters, is convenient to use, lower in cost, and easy to popularize, but its accuracy is difficult to guarantee.
Therefore, how to achieve convenience of use and lower cost while ensuring accuracy has become a problem to be solved urgently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a monocular depth estimation method based on deep learning, which achieves convenience of use and lower cost while ensuring accuracy.
In order to solve the technical problems, the invention adopts the following technical scheme:
a monocular depth estimation method based on deep learning comprises the following steps:
S1, acquiring a training data set, wherein the training data set comprises training images and corresponding depth maps;
S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises an encoding unit, a fusion unit and a decoding unit; the encoding unit comprises a local encoder and a global encoder, the local encoder being used for extracting local information of the image and generating a local feature map, and the global encoder being used for extracting global information of the image and generating a global feature map; the fusion unit is used for splicing the local feature map and the global feature map to obtain a fused feature map combining the local information and the global information; the decoding unit is used for up-sampling the fused feature map to obtain the recognized depth map;
S3, training the monocular depth recognition network through the training data set, and updating the parameters of the monocular depth recognition network through a preset loss function;
and S4, acquiring the depth information of the image to be processed by using the trained monocular depth recognition network.
Preferably, in S3, when training the monocular depth recognition network, the difference between the depth map recognized by the monocular depth recognition network and the actual depth map is calculated via the loss function, the loss function is then differentiated, and the network parameter weights are updated through back propagation.
Preferably, in S3, the loss function is:
L(y, y') = L_{MS-SSIM}(y, y') + λ·L_{pixel}(y, y')
wherein L_{pixel} represents the scale-invariant loss and L_{MS-SSIM} represents the multi-scale structural similarity loss; y represents the true depth map, y' represents the predicted depth map, and λ is a hyper-parameter with a preset value.
Preferably, the scale-invariant loss is:
$L_{pixel}(y, y') = \alpha \sqrt{\frac{1}{T}\sum_i g_i^2 - \frac{\beta}{T^2}\left(\sum_i g_i\right)^2}$
wherein g_i = log y'_i − log y_i, T represents the number of pixel points with valid depth, and β and α are training parameters with preset values.
Preferably, β is 0.85 and α is 10.
Preferably, the multi-scale structural similarity loss is:
$L_{MS-SSIM}(y, y') = 1 - \left[l_M(y, y')\right]^{\alpha_M} \prod_{j=1}^{M} \left[c_j(y, y')\right]^{\beta_j} \left[s_j(y, y')\right]^{\gamma_j}$
with
$l(y, y') = \frac{2\mu_y \mu_{y'} + C_1}{\mu_y^2 + \mu_{y'}^2 + C_1}$, $c(y, y') = \frac{2\sigma_y \sigma_{y'} + C_2}{\sigma_y^2 + \sigma_{y'}^2 + C_2}$, $s(y, y') = \frac{\sigma_{yy'} + C_3}{\sigma_y \sigma_{y'} + C_3}$
wherein μ_y represents the mean of y, σ_y² represents the variance of y, and σ_{yy'} represents the covariance of y and y'; l(y, y') represents the luminance estimate of the real image and the predicted depth map; c(y, y') represents their contrast estimate; s(y, y') represents the trend of change (structure) of the real image and the predicted depth map; and M represents the largest scale in the multi-scale structural similarity loss.
Preferably, in S2, the local encoder is pre-trained EfficientNetB5, and the global encoder is pre-trained Vision Transformer.
Preferably, in S2, the decoding unit performs up-sampling by bilinear interpolation.
Compared with the prior art, the invention has the following beneficial effects:
1. Compared with the prior art, the invention constructs a monocular depth recognition network with an encoder-decoder architecture. The encoding unit of the network comprises both a local encoder and a global encoder, so that the local information of the image is extracted and the global information of the image is also obtained; the fused feature map is up-sampled after the local and global information are fused, which ensures both the comprehensiveness of the image depth information and the detail information of the depth map, so that the depth information of the image is acquired fully and completely. The monocular depth recognition network is then trained on a training data set, and its parameters are updated through a preset loss function. The network obtained in this way retains the advantages of monocular recognition (low requirements on the number of cameras and their parameters, convenience of use, lower cost, and ease of popularization) while achieving higher accuracy and more comprehensive recognition of depth information.
In conclusion, the method achieves convenience of use and lower cost while ensuring accuracy.
2. The perceptibility of image details depends on the sampling density of the image signal, the distance of the image plane from the camera, and the perceptual capability of the camera system. In the invention, the loss function of the monocular depth recognition network takes both the scale-invariant loss and the multi-scale structural similarity loss into account, which ensures the effectiveness and accuracy of the trained monocular depth recognition network in recognizing image depth information.
3. The invention provides specific values for each parameter in the loss function, which ensures the effectiveness of the trained monocular depth recognition network.
4. The encoding unit provided by the invention captures image features using transfer learning, which enables the monocular depth recognition network to converge quickly and saves network training time.
5. The decoding unit in the monocular depth recognition network performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without incurring excessive computational cost.
6. The network model used by the invention is easy to migrate and popularize to other image processing prediction tasks, and has wide application range.
Drawings
For a better understanding of the objects, solutions and advantages of the present invention, reference will now be made in detail to the present invention, which is illustrated in the accompanying drawings, in which:
FIG. 1 is a flow chart in the embodiment;
FIG. 2 is a schematic diagram of the monocular depth recognition network in the embodiment;
FIG. 3 is a schematic diagram of the training process of the monocular depth recognition network in the embodiment.
Detailed Description
The invention is described in further detail below through a specific embodiment.
Embodiment:
as shown in fig. 1, the present embodiment discloses a monocular depth estimation method based on deep learning, which includes the following steps:
s1, a training data set is obtained, and the training data set comprises training images and corresponding depth maps.
In specific implementation, the training data set can be obtained from NYU Depth v2. NYU Depth v2 is a data set providing images and depth maps of different indoor scenes at a resolution of 640 × 480. The data set contains 120,000 training samples and 654 test samples. In this embodiment, a subset of 50,000 samples is used as the training data set.
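For illustration, a minimal PyTorch dataset wrapper for such a subset might look as follows; the CSV index file and its two-column layout (image path, depth-map path) are assumptions made for this sketch, not a format prescribed by the invention.

```python
import csv

from PIL import Image
from torch.utils.data import Dataset

class NYUDepthSubset(Dataset):
    """Pairs of RGB images and depth maps listed in a two-column CSV file."""

    def __init__(self, csv_path, transform=None):
        with open(csv_path) as f:
            self.pairs = list(csv.reader(f))  # each row: [image_path, depth_path]
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        img_path, depth_path = self.pairs[idx]
        image = Image.open(img_path).convert("RGB")  # 640 x 480 RGB image
        depth = Image.open(depth_path)               # corresponding depth map
        if self.transform is not None:
            image, depth = self.transform(image, depth)
        return image, depth
```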
S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises a coding unit, a fusion unit and a decoding unit. The architecture of the monocular depth recognition network is shown in fig. 2.
The encoding unit comprises a local encoder and a global encoder, the local encoder being used for extracting local information of the image and generating a local feature map, and the global encoder being used for extracting global information of the image and generating a global feature map. In specific implementation, the convolutional neural network EfficientNet B5 is used as the local encoder, and a Vision Transformer (ViT) is used as the global encoder. Both EfficientNet B5 and ViT are models pre-trained on ImageNet: the convolutional neural network extracts features of different levels from the input image, while ViT performs a global attention operation over the input image to obtain richer semantic and contextual information. In this embodiment, ViT extracts the global information of the image to generate a global feature map with a resolution of 30 × 40, and EfficientNet B5 extracts the local information of the image to generate a local feature map with a resolution of 30 × 40.
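As a hedged sketch (not the inventors' exact code), both pre-trained backbones can be obtained from the timm library; the specific model names, the img_size override, and the choice of the 1/16-resolution CNN stage are assumptions made for illustration.

```python
import timm
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Local (EfficientNet-B5) and global (ViT) encoders over a 480 x 640 input."""

    def __init__(self):
        super().__init__()
        # Local encoder: CNN feature pyramid; on a 480 x 640 input, the 1/16
        # stage has spatial size 30 x 40, matching the resolution in the text.
        self.local = timm.create_model(
            "tf_efficientnet_b5", pretrained=True, features_only=True)
        # Global encoder: ViT with 16 x 16 patches; a 480 x 640 input yields a
        # 30 x 40 grid of patch tokens (timm resizes the position embeddings).
        self.vit = timm.create_model(
            "vit_base_patch16_224", pretrained=True, img_size=(480, 640))

    def forward(self, x):
        b = x.shape[0]
        local_feats = self.local(x)[-2]        # 1/16 stage: (B, 176, 30, 40)
        # In recent timm, forward_features returns the full token sequence
        # with the class token prepended: (B, 1 + 30*40, 768).
        tokens = self.vit.forward_features(x)
        global_feats = tokens[:, 1:].transpose(1, 2).reshape(b, -1, 30, 40)
        return local_feats, global_feats
```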
The fusion unit is used for splicing the local feature map and the global feature map to obtain a fused feature map combining the local information and the global information. In specific implementation, the global feature map and the local feature map are concatenated (concat) along the channel dimension to obtain the feature map fusing the local and global information.
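Continuing the sketch above, this fusion step is a single concatenation along the channel dimension; the channel counts in the comment follow the assumed encoder variants.

```python
# Both feature maps share the 30 x 40 spatial resolution, so they can be
# stacked directly along the channel axis.
fused = torch.cat([local_feats, global_feats], dim=1)  # (B, 176 + 768, 30, 40)
```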
The decoding unit is used for up-sampling the fused feature map to obtain the recognized depth map. In specific implementation, the decoding unit performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without incurring excessive computational cost.
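A minimal decoder sketch under the same assumptions: three rounds of bilinear up-sampling interleaved with convolutions take the 30 × 40 fused map to the half-input resolution of 240 × 320 stated later in this embodiment; the channel widths are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Bilinear up-sampling decoder producing a one-channel depth map."""

    def __init__(self, in_ch, mid_ch=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch // 2, 3, padding=1)
        self.head = nn.Conv2d(mid_ch // 2, 1, 3, padding=1)

    def forward(self, fused):  # fused: (B, in_ch, 30, 40)
        x = F.interpolate(fused, scale_factor=2, mode="bilinear", align_corners=False)
        x = F.relu(self.conv1(x))   # 60 x 80
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = F.relu(self.conv2(x))   # 120 x 160
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.head(x)         # (B, 1, 240, 320) depth map
```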
And S3, as shown in FIG. 3, training the monocular depth recognition network through the training data set, and updating the parameters of the monocular depth recognition network through a preset loss function. When the monocular depth recognition network is trained, the difference between the depth map recognized by the network and the actual depth map is calculated via the loss function, the loss function is differentiated, and the network parameter weights are then updated through back propagation.
In specific implementation, the loss function is:
L(y, y') = L_{MS-SSIM}(y, y') + λ·L_{pixel}(y, y')
wherein L_{pixel} represents the scale-invariant loss and L_{MS-SSIM} represents the multi-scale structural similarity loss; y denotes the true depth map, y' denotes the predicted depth map, and λ is a hyper-parameter. In this embodiment, λ is 0.5.
In specific implementation, the scale-invariant loss is:
$L_{pixel}(y, y') = \alpha \sqrt{\frac{1}{T}\sum_i g_i^2 - \frac{\beta}{T^2}\left(\sum_i g_i\right)^2}$
wherein g_i = log y'_i − log y_i and T represents the number of pixel points with valid depth; β and α are training parameters with preset values. In this embodiment, β is 0.85 and α is 10.
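A direct PyTorch rendering of this term might look as follows; the validity mask (gt > 0), the clamp on the prediction, and the eps guard are implementation assumptions.

```python
import torch

def silog_loss(pred, gt, alpha=10.0, beta=0.85, eps=1e-6):
    """Scale-invariant loss over pixels with valid ground-truth depth."""
    mask = gt > eps                                   # T = number of valid pixels
    # Clamp guards against non-positive predictions before taking the log.
    g = torch.log(pred[mask].clamp(min=eps)) - torch.log(gt[mask])
    t = g.numel()
    # By Cauchy-Schwarz, the term under the root is non-negative for beta <= 1.
    return alpha * torch.sqrt((g ** 2).sum() / t - beta * g.sum() ** 2 / t ** 2)
```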
The multi-scale structural similarity loss is:
$L_{MS-SSIM}(y, y') = 1 - \left[l_M(y, y')\right]^{\alpha_M} \prod_{j=1}^{M} \left[c_j(y, y')\right]^{\beta_j} \left[s_j(y, y')\right]^{\gamma_j}$
with
$l(y, y') = \frac{2\mu_y \mu_{y'} + C_1}{\mu_y^2 + \mu_{y'}^2 + C_1}$, $c(y, y') = \frac{2\sigma_y \sigma_{y'} + C_2}{\sigma_y^2 + \sigma_{y'}^2 + C_2}$, $s(y, y') = \frac{\sigma_{yy'} + C_3}{\sigma_y \sigma_{y'} + C_3}$
wherein μ_y represents the mean of y, σ_y² represents the variance of y, and σ_{yy'} represents the covariance of y and y'; l(y, y') represents the luminance estimate of the real image and the predicted depth map; c(y, y') represents their contrast estimate; s(y, y') represents the trend of change (structure) of the real image and the predicted depth map; and M represents the largest scale in the multi-scale structural similarity loss. That is, the luminance term is compared only at scale M, while the contrast and structure terms are compared at all scales.
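A sketch of the combined loss follows, using the third-party pytorch-msssim package for the MS-SSIM term; the patent does not name a library, so this choice and the data_range value are assumptions.

```python
from pytorch_msssim import ms_ssim  # pip install pytorch-msssim

def total_loss(pred, gt, lam=0.5, data_range=10.0):
    """L = L_MS-SSIM + lambda * L_pixel; data_range must match the depth units."""
    l_msssim = 1.0 - ms_ssim(pred, gt, data_range=data_range)  # default: 5 scales
    return l_msssim + lam * silog_loss(pred, gt)
```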
After the training is completed, the original image is used as an input, and the generated depth map is half the resolution of the input image, that is, the resolution of the depth map is 320 × 240.
And S4, acquiring the depth information of the image to be processed by using the trained monocular depth recognition network.
To facilitate implementation by those skilled in the art, the hardware platform used in the invention comprises an i7-10700 CPU and an NVIDIA GeForce RTX 3090 GPU, and the software platform is the PyTorch deep learning framework. The local encoder is an EfficientNet B5 pre-trained on ImageNet; the EfficientNet family improves network performance by scaling three dimensions of the model (depth, width, and image resolution) and achieves state-of-the-art results on image classification tasks. During training, the initial learning rate is set to 0.00005, the optimizer is Adam, and the batch size is 4.
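Assembled from the sketches above, a minimal training loop with these stated hyper-parameters might look as follows; MonoDepthNet, the CSV filename, the epoch count, the ToTensor conversion, and the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms

class MonoDepthNet(nn.Module):
    """Encoder -> channel-wise fusion -> bilinear decoder, as sketched above."""

    def __init__(self):
        super().__init__()
        self.encoder = DualEncoder()
        self.decoder = Decoder(in_ch=176 + 768)  # assumed channel counts

    def forward(self, x):
        local_feats, global_feats = self.encoder(x)
        fused = torch.cat([local_feats, global_feats], dim=1)
        return self.decoder(fused)

def to_tensors(image, depth):
    t = transforms.ToTensor()
    return t(image), t(depth)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MonoDepthNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005)  # stated settings
loader = DataLoader(NYUDepthSubset("train_pairs.csv", transform=to_tensors),
                    batch_size=4, shuffle=True)

model.train()
for epoch in range(20):  # epoch count is an assumption
    for image, depth in loader:
        image, depth = image.to(device), depth.to(device)
        pred = model(image)
        # Match the ground truth to the half-resolution output (assumption).
        depth = F.interpolate(depth, size=pred.shape[-2:],
                              mode="bilinear", align_corners=False)
        loss = total_loss(pred, depth)  # MS-SSIM + lambda * scale-invariant term
        optimizer.zero_grad()
        loss.backward()   # differentiate the loss
        optimizer.step()  # update the network weights via back propagation
```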
Compared with the prior art, the invention constructs a monocular depth recognition network with an encoder-decoder architecture. The encoding unit of the network comprises both a local encoder and a global encoder, so that the local information of the image is extracted and the global information is also obtained; the fused feature map is up-sampled after the local and global information are fused, which ensures both the comprehensiveness of the image depth information and the detail information of the depth map, so that the depth information of the image is acquired fully and completely. The monocular depth recognition network is then trained on the training data set, and its parameters are updated through the preset loss function. The network obtained in this way retains the advantages of monocular recognition (low requirements on the number of cameras and their parameters, convenience of use, lower cost, and ease of popularization) while achieving higher accuracy and more comprehensive recognition of depth information.
On the other hand, the perceptibility of image details depends on the sampling density of the image signal, the distance of the image plane from the camera, and the perceptual capability of the camera system. In the invention, the loss function of the monocular depth recognition network takes both the scale-invariant loss and the multi-scale structural similarity loss into account, which ensures the effectiveness and accuracy of the trained network in recognizing image depth information. In addition, the invention provides specific values for each parameter in the loss function, which ensures the effectiveness of the trained network. The encoding unit captures image features using transfer learning, which enables the network to converge quickly and saves training time. Furthermore, the decoding unit performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without incurring excessive computational cost. Finally, the network model used by the invention is easy to migrate to other image processing and prediction tasks and has a wide application range.
It should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the technical solutions, and those skilled in the art should understand that the technical solutions of the present invention can be modified or substituted with equivalent solutions without departing from the spirit and scope of the technical solutions, and all should be covered in the claims of the present invention.