CN115731278A - Monocular depth estimation method based on deep learning - Google Patents

Monocular depth estimation method based on deep learning

Info

Publication number
CN115731278A
CN115731278A
Authority
CN
China
Prior art keywords
depth
monocular
monocular depth
recognition network
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211600042.XA
Other languages
Chinese (zh)
Inventor
巩书凯
蓝玲玲
梁先黎
江虹锋
陈磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Humi Network Technology Co Ltd
Original Assignee
Chongqing Humi Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Humi Network Technology Co Ltd
Priority to CN202211600042.XA
Publication of CN115731278A
Pending

Abstract

The invention belongs to the technical field of image depth estimation, and particularly relates to a monocular depth estimation method based on deep learning, which comprises the following steps: S1, acquiring a training data set; S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises an encoding unit, a fusion unit and a decoding unit; the encoding unit comprises a local encoder and a global encoder; the fusion unit is used for splicing the local feature map and the global feature map to obtain a fusion feature map that fuses the local information and the global information; the decoding unit is used for up-sampling the fusion feature map to obtain a recognized depth map; S3, training the monocular depth recognition network on the training data set, and updating the parameters of the monocular depth recognition network through a preset loss function; and S4, acquiring the depth information of an image by using the trained monocular depth recognition network. The invention achieves convenience of use and lower cost while ensuring accuracy.

Description

Monocular depth estimation method based on deep learning
Technical Field
The invention belongs to the technical field of image depth estimation, and particularly relates to a monocular depth estimation method based on deep learning.
Background
With the rapid development of artificial intelligence, computer vision has been rapidly adopted in daily life. Three-dimensional information of a scene is important for scene understanding, and how to acquire depth information from a scene has become a popular research direction in recent years. Cameras capture two-dimensional planar images that lack depth information. Therefore, one of the important tasks of computer vision is to reconstruct a three-dimensional model of a scene by acquiring its depth information, in a manner that simulates how the human eye perceives the world.
At present, two technical schemes are mainly used for acquiring the depth information of an image: monocular depth estimation and binocular/multi-view depth estimation. Binocular/multi-view depth estimation achieves good accuracy, but it places high demands on the number of cameras and their parameters, is inconvenient to use, and is costly, which hinders practical adoption. Monocular depth estimation significantly reduces the requirements on the number of cameras and their parameters, is convenient to use, is lower in cost, and is easier to popularize, but its accuracy is difficult to guarantee.
Therefore, how to combine convenience of use and lower cost while ensuring accuracy has become a problem that urgently needs to be solved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a monocular depth estimation method based on deep learning, which combines convenience of use and lower cost while ensuring accuracy.
In order to solve the technical problems, the invention adopts the following technical scheme:
a monocular depth estimation method based on deep learning comprises the following steps:
s1, acquiring a training data set, wherein the training data set comprises training images and corresponding depth maps;
S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises an encoding unit, a fusion unit and a decoding unit; the encoding unit comprises a local encoder and a global encoder, the local encoder is used for extracting local information of the image and generating a local feature map, and the global encoder is used for extracting global information of the image and generating a global feature map; the fusion unit is used for splicing the local feature map and the global feature map to obtain a fusion feature map that fuses the local information and the global information; the decoding unit is used for up-sampling the fusion feature map to obtain a recognized depth map;
s3, training the monocular depth recognition network through a training data set, and updating parameters of the monocular depth recognition network through a preset loss function;
and S4, acquiring the depth information of the image to be processed by using the trained monocular depth recognition network.
Preferably, in S3, when training the monocular depth recognition network, the difference between the depth map recognized by the monocular depth recognition network and the actual depth map is calculated, the gradient of the loss function is then obtained, and the network parameter weights are updated through back-propagation.
Preferably, in S3, the loss function is:
L(y, y′) = L_MS-SSIM(y, y′) + λ·L_pixel(y, y′)
wherein L_pixel represents the scale-invariance loss and L_MS-SSIM represents the multi-scale structural similarity loss; y represents the true depth map, y′ represents the predicted depth map, and λ is a hyper-parameter with a preset value.
Preferably, the scale-invariance loss is:
L_pixel(y, y′) = α·√( (1/T)·∑_i g_i² − (β/T²)·(∑_i g_i)² )
where g_i = log y′_i − log y_i, T represents the number of pixels with valid depth, and β and α are training parameters with preset values.
Preferably, β is 0.85 and α is 10.
Preferably, the multi-scale structural similarity loss is:
L_MS-SSIM(y, y′) = 1 − MS-SSIM(y, y′)
wherein,
MS-SSIM(y, y′) = [l_M(y, y′)]^(α_M) · ∏_(j=1..M) [c_j(y, y′)]^(β_j) · [s_j(y, y′)]^(γ_j)
l(y, y′) = (2·μ_y·μ_y′ + C_1) / (μ_y² + μ_y′² + C_1)
c(y, y′) = (2·σ_y·σ_y′ + C_2) / (σ_y² + σ_y′² + C_2)
s(y, y′) = (σ_yy′ + C_3) / (σ_y·σ_y′ + C_3)
in the formulas, μ_y represents the mean of y, σ_y² represents the variance of y, and σ_yy′ represents the covariance of y and y′; C_1, C_2 and C_3 are small constants that stabilize the division; α_M, β_j and γ_j are the weights of the luminance, contrast and structure terms at each scale; l(y, y′) represents the luminance comparison between the real depth map and the predicted depth map; c(y, y′) represents the contrast comparison between the real depth map and the predicted depth map; s(y, y′) represents the structure (trend of variation) comparison between the real depth map and the predicted depth map; and M represents the largest scale in the multi-scale structural similarity loss.
Preferably, in S2, the local encoder is pre-trained EfficientNetB5, and the global encoder is pre-trained Vision Transformer.
Preferably, in S2, the decoding unit performs up-sampling by bilinear interpolation.
Compared with the prior art, the invention has the following beneficial effects:
1. Compared with the prior art, the invention constructs a monocular depth recognition network with an encoder-decoder architecture. The encoding unit in the monocular depth recognition network comprises a local encoder and a global encoder, so that both the local information and the global information of the image can be extracted, and the fusion feature map is up-sampled after the local information and the global information are fused. This ensures the comprehensiveness of the image depth information as well as the detail information of the depth map, so that the depth information of the image is obtained fully and completely. The monocular depth recognition network is then trained on the training data set, and its parameters are updated through a preset loss function. The monocular depth recognition network obtained in this way retains the advantages of monocular recognition, namely low requirements on the number of cameras and their parameters, convenient use, lower cost and ease of popularization, while also achieving higher accuracy and comprehensiveness in depth information recognition.
In conclusion, the method combines convenience of use and lower cost while ensuring accuracy.
2. The perceptibility of image details depends on the sampling density of the image signal, the distance from the image plane to the camera, and the perceptual capability of the camera system. In the invention, the loss function of the monocular depth recognition network takes both the scale-invariance loss and the multi-scale structural similarity loss into account, which ensures the effectiveness and accuracy of the trained monocular depth recognition network in recognizing image depth information.
3. The invention provides specific values for each parameter in the loss function, which ensures the effectiveness of the trained monocular depth recognition network.
4. The encoding unit provided by the invention captures image features using transfer learning, which allows the monocular depth recognition network to converge quickly and saves network training time.
5. The decoding unit in the monocular depth recognition network performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without adding excessive computational cost.
6. The network model used by the invention is easy to migrate and popularize to other image processing prediction tasks, and has wide application range.
Drawings
For a better understanding of the objects, solutions and advantages of the present invention, reference will now be made in detail to the present invention, which is illustrated in the accompanying drawings, in which:
FIG. 1 is a flow chart of the method in the embodiment;
FIG. 2 is a schematic diagram of the monocular depth recognition network in the embodiment;
FIG. 3 is a schematic diagram of the training process of the monocular depth recognition network in the embodiment.
Detailed Description
The following is further detailed by way of a specific embodiment:
Embodiment:
as shown in fig. 1, the present embodiment discloses a monocular depth estimation method based on deep learning, which includes the following steps:
s1, a training data set is obtained, and the training data set comprises training images and corresponding depth maps.
In specific implementation, the training data set can be obtained from NYU Depth v2. NYU Depth v2 is a data set providing images and depth maps of different indoor scenes at a resolution of 640 × 480. The data set contains 120,000 training samples and 654 test samples. In this embodiment, a subset of 50,000 samples is used as the training data set.
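For readers who wish to reproduce the data preparation, a minimal PyTorch sketch of such a training data set is given below. The directory layout (paired rgb/ and depth/ PNG files) and the random selection of the 50,000-sample subset are illustrative assumptions and are not specified by this embodiment.

import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class NYUDepthSubset(Dataset):
    """640 x 480 RGB/depth pairs from NYU Depth v2 (illustrative layout assumed)."""

    def __init__(self, root, subset_size=50000, seed=0):
        root = Path(root)
        # Assumes matching file names under root/rgb and root/depth.
        pairs = list(zip(sorted((root / "rgb").glob("*.png")),
                         sorted((root / "depth").glob("*.png"))))
        random.Random(seed).shuffle(pairs)
        self.pairs = pairs[:subset_size]        # 50,000-sample training subset
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        rgb_path, depth_path = self.pairs[idx]
        rgb = self.to_tensor(Image.open(rgb_path).convert("RGB"))   # 3 x 480 x 640
        depth = self.to_tensor(Image.open(depth_path))              # 1 x 480 x 640
        return rgb, depth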
S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises a coding unit, a fusion unit and a decoding unit. The architecture of the monocular depth recognition network is shown in fig. 2.
The encoding unit comprises a local encoder and a global encoder: the local encoder is used for extracting local information of the image and generating a local feature map, and the global encoder is used for extracting global information of the image and generating a global feature map. In specific implementation, the convolutional neural network EfficientNet B5 is used as the local encoder, and a Vision Transformer (ViT) is used as the global encoder. EfficientNet B5 and ViT are models pre-trained on ImageNet; the convolutional neural network extracts features of different levels from the input image, and ViT performs a global attention operation on the input image to obtain richer semantic and contextual information. In this embodiment, ViT extracts the global information of the image to generate a global feature map with a resolution of 30 × 40, and EfficientNet B5 extracts the local information of the image to generate a local feature map with a resolution of 30 × 40.
The fusion unit is used for splicing the local feature map and the global feature map to obtain a fusion feature map that fuses the local information and the global information. In specific implementation, the global feature map and the local feature map are concatenated (concat) along the channel dimension to obtain the feature map fusing the local information and the global information.
The decoding unit is used for up-sampling the fusion feature map to obtain the recognized depth map. In specific implementation, the decoding unit performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without adding excessive computational cost.
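For ease of understanding, a minimal PyTorch sketch of the monocular depth recognition network described above follows. The timm model names, the choice of the stride-16 EfficientNet stage, the fusion and decoder channel widths, and the handling of the ViT class token are illustrative assumptions; the embodiment itself only specifies a pre-trained EfficientNet B5 local encoder, a pre-trained Vision Transformer global encoder, channel-wise concatenation for fusion, and bilinear up-sampling in the decoder.

import timm
import torch
import torch.nn as nn
import torch.nn.functional as F


class MonocularDepthNet(nn.Module):
    def __init__(self, fused_channels=128):
        super().__init__()
        # Local encoder: pre-trained EfficientNet B5; on a 480 x 640 input the
        # stride-16 stage yields a 30 x 40 local feature map.
        self.local_enc = timm.create_model(
            "efficientnet_b5", pretrained=True, features_only=True, out_indices=(3,))
        local_ch = self.local_enc.feature_info.channels()[-1]
        # Global encoder: pre-trained ViT with 16 x 16 patches; on a 480 x 640
        # input the patch tokens form a 30 x 40 grid of global features.
        self.global_enc = timm.create_model(
            "vit_base_patch16_224", pretrained=True, img_size=(480, 640), num_classes=0)
        global_ch = self.global_enc.embed_dim
        # Fusion unit: concatenate along the channel dimension, then reduce channels.
        self.fuse = nn.Conv2d(local_ch + global_ch, fused_channels, kernel_size=1)
        # Decoding unit: convolutions followed by bilinear up-sampling to a
        # one-channel depth map at half of the input resolution (320 x 240).
        self.decode = nn.Sequential(
            nn.Conv2d(fused_channels, fused_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(fused_channels, 1, 3, padding=1))

    def forward(self, x):                                  # x: B x 3 x 480 x 640
        local_map = self.local_enc(x)[-1]                  # B x C_l x 30 x 40
        tokens = self.global_enc.forward_features(x)       # B x (1 + 30*40) x C_g
        tokens = tokens[:, 1:]                             # drop the class token
        b, n, c = tokens.shape
        h, w = x.shape[-2] // 16, x.shape[-1] // 16        # 30 x 40 patch grid
        global_map = tokens.transpose(1, 2).reshape(b, c, h, w)
        fused = self.fuse(torch.cat([local_map, global_map], dim=1))
        depth = self.decode(fused)
        return F.interpolate(depth, size=(240, 320), mode="bilinear", align_corners=False)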
And S3, as shown in FIG. 3, the monocular depth recognition network is trained on the training data set, and its parameters are updated through a preset loss function. When the monocular depth recognition network is trained, the difference between the depth map recognized by the monocular depth recognition network and the actual depth map is calculated, the gradient of the loss function is then obtained, and the network parameter weights are updated through back-propagation.
In specific implementation, the loss function is:
L(y, y′) = L_MS-SSIM(y, y′) + λ·L_pixel(y, y′)
wherein L_pixel represents the scale-invariance loss and L_MS-SSIM represents the multi-scale structural similarity loss; y denotes the true depth map, y′ denotes the predicted depth map, and λ is a hyper-parameter. In this embodiment, λ is 0.5.
In practice, the scale-invariance loss is:
L_pixel(y, y′) = α·√( (1/T)·∑_i g_i² − (β/T²)·(∑_i g_i)² )
where g_i = log y′_i − log y_i, T represents the number of pixels with valid depth, and β and α are training parameters with preset values. In this embodiment, β is 0.85 and α is 10.
The multi-scale structural similarity loss is:
L_MS-SSIM(y, y′) = 1 − MS-SSIM(y, y′)
wherein,
MS-SSIM(y, y′) = [l_M(y, y′)]^(α_M) · ∏_(j=1..M) [c_j(y, y′)]^(β_j) · [s_j(y, y′)]^(γ_j)
l(y, y′) = (2·μ_y·μ_y′ + C_1) / (μ_y² + μ_y′² + C_1)
c(y, y′) = (2·σ_y·σ_y′ + C_2) / (σ_y² + σ_y′² + C_2)
s(y, y′) = (σ_yy′ + C_3) / (σ_y·σ_y′ + C_3)
in the formulas, μ_y represents the mean of y, σ_y² represents the variance of y, and σ_yy′ represents the covariance of y and y′; C_1, C_2 and C_3 are small constants that stabilize the division; α_M, β_j and γ_j are the weights of the luminance, contrast and structure terms at each scale; l(y, y′) represents the luminance comparison between the real depth map and the predicted depth map; c(y, y′) represents the contrast comparison between the real depth map and the predicted depth map; s(y, y′) represents the structure (trend of variation) comparison between the real depth map and the predicted depth map; and M represents the largest scale in the multi-scale structural similarity loss. That is, the luminance term is compared only at scale M, while the contrast and structure terms are compared at all scales.
After the training is completed, the original image is used as an input, and the generated depth map is half the resolution of the input image, that is, the resolution of the depth map is 320 × 240.
And S4, acquiring the depth information of the image to be processed by using the trained monocular depth recognition network.
To facilitate implementation by those skilled in the art, the hardware platform used in the present invention is an i7-10700 CPU with an NVIDIA GeForce RTX 3090, and the software platform is the PyTorch deep learning framework. The encoder is an EfficientNet B5 pre-trained on ImageNet; the EfficientNet family improves network performance by jointly scaling three dimensions of a model (depth, width and image resolution) and achieves state-of-the-art results on image classification tasks. The initial learning rate during training is set to 0.00005, the optimizer is Adam, and the batch size is 4.
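A sketch of the corresponding training loop is given below; it reuses the illustrative NYUDepthSubset, MonocularDepthNet and depth_loss sketches above, and the epoch count and the down-sampling of the ground truth to the half-resolution output are assumptions.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MonocularDepthNet().to(device)
loader = DataLoader(NYUDepthSubset("nyu_depth_v2"),   # hypothetical data directory
                    batch_size=4, shuffle=True, num_workers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

for epoch in range(20):                               # epoch count is an assumption
    for rgb, depth in loader:
        rgb, depth = rgb.to(device), depth.to(device)
        # The network predicts at half resolution (320 x 240), so the ground
        # truth is resized to match before computing the loss.
        gt = F.interpolate(depth, size=(240, 320), mode="bilinear", align_corners=False)
        loss = depth_loss(model(rgb), gt)
        optimizer.zero_grad()
        loss.backward()        # differentiate the loss and back-propagate
        optimizer.step()       # update the network parameter weights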
Compared with the prior art, the invention constructs a monocular depth recognition network with an encoder-decoder architecture. The encoding unit in the monocular depth recognition network comprises a local encoder and a global encoder, so that both the local information and the global information of the image can be extracted, and the fusion feature map is up-sampled after the local information and the global information are fused. This ensures the comprehensiveness of the image depth information as well as the detail information of the depth map, so that the depth information of the image is obtained fully and completely. The monocular depth recognition network is then trained on the training data set, and its parameters are updated through a preset loss function. The monocular depth recognition network obtained in this way retains the advantages of monocular recognition, namely low requirements on the number of cameras and their parameters, convenient use, lower cost and ease of popularization, while also achieving higher accuracy and comprehensiveness in depth information recognition.
On the other hand, the perceptibility of image details depends on the sampling density of the image signal, the distance from the image plane to the camera, and the perceptual capability of the camera system. In the invention, the loss function of the monocular depth recognition network takes both the scale-invariance loss and the multi-scale structural similarity loss into account, which ensures the effectiveness and accuracy of the trained monocular depth recognition network in recognizing image depth information. In addition, the invention provides specific values for each parameter in the loss function, which ensures the effectiveness of the trained monocular depth recognition network. The encoding unit provided by the invention captures image features using transfer learning, which allows the monocular depth recognition network to converge quickly and saves network training time. In addition, the decoding unit in the monocular depth recognition network performs up-sampling by bilinear interpolation, which ensures the quality of the up-sampled image without adding excessive computational cost. The network model used by the invention is easy to migrate and popularize to other image processing and prediction tasks, and has a wide application range.
It should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the technical solutions, and those skilled in the art should understand that the technical solutions of the present invention can be modified or substituted with equivalent solutions without departing from the spirit and scope of the technical solutions, and all should be covered in the claims of the present invention.

Claims (8)

1. A monocular depth estimation method based on deep learning is characterized by comprising the following steps:
s1, acquiring a training data set, wherein the training data set comprises training images and corresponding depth maps;
S2, constructing a monocular depth recognition network, wherein the monocular depth recognition network comprises an encoding unit, a fusion unit and a decoding unit; the encoding unit comprises a local encoder and a global encoder, the local encoder is used for extracting local information of the image and generating a local feature map, and the global encoder is used for extracting global information of the image and generating a global feature map; the fusion unit is used for splicing the local feature map and the global feature map to obtain a fusion feature map that fuses the local information and the global information; the decoding unit is used for up-sampling the fusion feature map to obtain a recognized depth map;
s3, training the monocular depth recognition network through a training data set, and updating parameters of the monocular depth recognition network through a preset loss function;
and S4, acquiring the depth information of the image to be processed by using the trained monocular depth recognition network.
2. The deep learning-based monocular depth estimation method of claim 1, wherein: in S3, when training the monocular depth recognition network, the difference between the depth map recognized by the monocular depth recognition network and the actual depth map is calculated, the gradient of the loss function is then obtained, and the network parameter weights are updated through back-propagation.
3. The monocular depth estimation method based on deep learning of claim 2, wherein: in S3, the loss function is:
L(y, y′) = L_MS-SSIM(y, y′) + λ·L_pixel(y, y′);
wherein L_pixel represents the scale-invariance loss and L_MS-SSIM represents the multi-scale structural similarity loss; y represents the true depth map, y′ represents the predicted depth map, and λ is a hyper-parameter with a preset value.
4. The deep learning-based monocular depth estimation method of claim 3, wherein: the scale-invariance loss is:
L_pixel(y, y′) = α·√( (1/T)·∑_i g_i² − (β/T²)·(∑_i g_i)² )
where g_i = log y′_i − log y_i, T represents the number of pixels with valid depth, and β and α are training parameters with preset values.
5. The deep learning-based monocular depth estimation method of claim 4, wherein: β is 0.85 and α is 10.
6. The deep learning-based monocular depth estimation method of claim 5, wherein: the multi-scale structural similarity loss is:
L_MS-SSIM(y, y′) = 1 − MS-SSIM(y, y′)
wherein,
MS-SSIM(y, y′) = [l_M(y, y′)]^(α_M) · ∏_(j=1..M) [c_j(y, y′)]^(β_j) · [s_j(y, y′)]^(γ_j)
l(y, y′) = (2·μ_y·μ_y′ + C_1) / (μ_y² + μ_y′² + C_1)
c(y, y′) = (2·σ_y·σ_y′ + C_2) / (σ_y² + σ_y′² + C_2)
s(y, y′) = (σ_yy′ + C_3) / (σ_y·σ_y′ + C_3)
in the formulas, μ_y represents the mean of y, σ_y² represents the variance of y, and σ_yy′ represents the covariance of y and y′; C_1, C_2 and C_3 are small constants that stabilize the division; α_M, β_j and γ_j are the weights of the luminance, contrast and structure terms at each scale; l(y, y′) represents the luminance comparison between the real depth map and the predicted depth map; c(y, y′) represents the contrast comparison between the real depth map and the predicted depth map; s(y, y′) represents the structure (trend of variation) comparison between the real depth map and the predicted depth map; and M represents the largest scale in the multi-scale structural similarity loss.
7. The monocular depth estimation method based on deep learning of claim 1, wherein: in S2, the local encoder is a pre-trained EfficientNet B5, and the global encoder is a pre-trained Vision Transformer.
8. The monocular depth estimation method based on deep learning of claim 1, wherein: in S2, the decoding unit performs up-sampling by bilinear interpolation.
CN202211600042.XA | 2022-12-12 | 2022-12-12 | Monocular depth estimation method based on deep learning | Pending | CN115731278A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211600042.XA (CN115731278A) | 2022-12-12 | 2022-12-12 | Monocular depth estimation method based on deep learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211600042.XA (CN115731278A) | 2022-12-12 | 2022-12-12 | Monocular depth estimation method based on deep learning

Publications (1)

Publication Number | Publication Date
CN115731278A (en) | 2023-03-03

Family

ID=85301268

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211600042.XA (CN115731278A, Pending) | Monocular depth estimation method based on deep learning | 2022-12-12 | 2022-12-12

Country Status (1)

Country | Link
CN (1) | CN115731278A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116563349A (en) * | 2023-04-28 | 2023-08-08 | 珠海欧比特宇航科技股份有限公司 | Hyperspectral image processing method, device, electronic equipment and medium
CN117746177A (en) * | 2023-11-28 | 2024-03-22 | 智平方(深圳)科技有限公司 | Image recognition model training method and image recognition model application method
CN118447288A (en) * | 2023-12-13 | 2024-08-06 | 荣耀终端有限公司 | A depth estimation network training method, depth estimation method and electronic device


Similar Documents

Publication | Publication Date | Title
CN112767554B (en) | Point cloud completion method, device, equipment and storage medium
CN115731278A (en) | Monocular depth estimation method based on deep learning
CN114120432B (en) | Online learning attention tracking method based on gaze estimation and its application
CN113435269A (en) | Improved water surface floating object detection and identification method and system based on YOLOv3
CN112507990A (en) | Video time-space feature learning and extracting method, device, equipment and storage medium
CN112001960A (en) | Monocular image depth estimation method based on multi-scale residual error pyramid attention network model
CN114936605A (en) | A neural network training method, equipment and storage medium based on knowledge distillation
CN111968217A (en) | SMPL parameter prediction and human body model generation method based on picture
CN113240722B (en) | A self-supervised depth estimation method based on multi-frame attention
CN113837290A (en) | Unsupervised unpaired image translation method based on attention generator network
CN118470219B (en) | Multi-view three-dimensional reconstruction method and system based on calibration-free image
CN112634331B (en) | Optical flow prediction method and device
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium
CN118821047A (en) | A first-person perspective gaze point prediction method based on multimodal deep learning
CN117217997A (en) | Remote sensing image super-resolution method based on context perception edge enhancement
CN119599967B (en) | Stereo matching method and system based on context geometry cube and distortion parallax optimization
CN120031758A (en) | A method for image shadow removal based on contrastive learning
CN110969109A (en) | An eye-blink detection model under unrestricted conditions and its construction method and application
CN114943837A (en) | Salt dome identification method based on improved U-net
CN114529890A (en) | State detection method and device, electronic equipment and storage medium
CN116612495B (en) | Image processing method and device
CN118134809A (en) | Self-adaptive face restoration method and device based on face attribute information prediction
CN116778187A (en) | Salient target detection method based on light field refocusing data enhancement
CN117292421A (en) | GRU-based continuous vision estimation deep learning method
CN115564959A (en) | A real-time semantic segmentation method based on asymmetric spatial feature convolution

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
