Disclosure of Invention
In view of the technical problems in the prior art, the invention provides an edge-preserving multi-view depth estimation and ranging method for an unmanned aerial vehicle platform, aiming to solve the problems that existing methods have difficulty recovering the depth of thin structures and object edge regions and have difficulty achieving a good balance between performance and efficiency.
According to a first aspect of the present invention, there is provided an edge-preserving multi-view depth estimation method for a drone platform, comprising:
step 1, given a reference image $I_0$ and its N-1 neighborhood images $\{I_i\}_{i=1}^{N-1}$, extracting the multi-scale depth features $\{F_i^s\}$ of each image with a weight-sharing multi-scale depth feature extraction network, wherein $s$ denotes the s-th scale, the s-th scale feature has size $C_s \times H_s \times W_s$, $C_s$ is the number of channels of the s-th scale feature, $H_s \times W_s$ is the spatial resolution at the s-th scale, and $H \times W$ is the size of the original input image;
step 2, determining the depth map $D_1$ estimated at the 1st stage of the multi-scale depth estimation network;
step 3, based on the depth map $D_1$, determining the depth map $D_2$ estimated at the 2nd stage of the multi-scale depth estimation network;
step 4, optimizing and upsampling the depth map $D_2$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_2$;
step 5, based on the depth map $\tilde D_2$ and the image depth features $\{F_i^2\}$ at the 2nd scale, sequentially performing the depth estimation of the 3rd stage and the 4th stage to obtain the depth map $D_4$ estimated at the 4th stage;
step 6, optimizing and upsampling the depth map $D_4$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_4$;
step 7, based on the optimized depth map $\tilde D_4$ and the image depth features $\{F_i^3\}$ at the 3rd scale, performing the depth estimation of the 5th stage to obtain the final depth map $D_5$.
On the basis of the technical scheme, the invention can be improved as follows.
Optionally, the multi-scale feature extraction network is a two-dimensional U-shaped network composed of an encoder and a decoder with skip connections; the encoder and the decoder are composed of a plurality of residual blocks.
Optionally, step 2 includes:
step 201, uniformly sampling $M_1$ depth hypothesis values within the whole scene depth range $[d_{\min}, d_{\max}]$;
step 202, through the differentiable homography transformation, under each depth hypothesis, projecting the depth features $F_i^1$ of the i-th neighborhood view onto the reference view, and constructing the two-view cost volume $V_i$ using the group-wise correlation metric;
step 203, for the i-th two-view cost volume $V_i$, estimating a visibility map $W_i$ with a shallow 3D CNN, and performing a weighted summation of all two-view cost volumes based on the visibility map of each neighborhood view to obtain the final aggregated cost volume $V$;
step 204, regularizing the cost volume $V$ with a three-dimensional convolutional neural network, obtaining a depth probability volume through a Softmax operation, and obtaining the depth map $D_1$ from the depth probability volume using soft-argmax.
Optionally, step 3 includes:
step 301, determining the depth hypothesis sampling range of the second stage according to the depth map $D_1$, and uniformly sampling $M_2$ depth hypothesis values within this depth range;
step 302, performing two-view cost volume construction and aggregation according to the method of steps 201 to 203, and obtaining the aggregated cost volume $V$ based on the image depth features $\{F_i^1\}$ at the 1st scale and the $M_2$ depth hypothesis values;
step 303, performing cost volume regularization and depth map prediction according to the method of step 204, and obtaining the depth map $D_2$ based on the cost volume $V$.
Optionally, step 4 includes:
step 401, extracting the multi-scale context features $\{F_c^s\}$ of the reference image with a context encoding network, wherein $s$ denotes the s-th scale and the s-th scale context feature has size $C'_s \times H_s \times W_s$;
step 402, normalizing the depth map $D_2$, and performing feature extraction on the normalized depth map $\bar D_2$ with a shallow 2D CNN;
step 403, concatenating the extracted depth map features with the context features of the reference image, and inputting them into an edge-preserving residual learning network for residual learning to obtain a residual map $R_2$;
step 404, adding the normalized and upsampled depth map to the residual map $R_2$, and de-normalizing the result of the addition to obtain the optimized depth map $\tilde D_2$.
Optionally, the context encoding network in step 401 is a two-dimensional U-shaped network comprising an encoder and a decoder with skip connections;
in step 402, the depth map $D_2$ is normalized as $\bar D_2 = \frac{D_2 - \mu(D_2)}{\sigma(D_2)}$, wherein $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and variance calculations, respectively;
the edge-preserving residual learning network in step 403 is a two-dimensional U-shaped network consisting of an encoder and a decoder with skip connections; the encoder and the decoder are composed of a plurality of residual blocks;
in step 404, the normalized depth map $\bar D_2$ is upsampled using bilinear interpolation and added to the residual map $R_2$ to obtain the optimized normalized depth map $\bar D_2^{up}$, i.e. $\bar D_2^{up} = \mathrm{Up}_{\times 2}(\bar D_2) + R_2$, wherein $\mathrm{Up}_{\times 2}(\cdot)$ denotes upsampling to twice the original size using bilinear interpolation; the mean and variance of the depth map $D_2$ are then used to de-normalize $\bar D_2^{up}$, obtaining the optimized depth map $\tilde D_2 = \sigma(D_2)\,\bar D_2^{up} + \mu(D_2)$.
Optionally, in the process of performing the depth estimation of the 3rd stage, the 4th stage and the 5th stage in step 5 and step 7: the depth range is determined according to the method of step 301; the two-view cost volume is constructed and aggregated according to the method of steps 201 to 203; and cost volume regularization and depth map prediction are performed according to the method of step 204.
Optionally, step 6 includes:
step 601, extracting the multi-scale context features $\{F_c^s\}$ of the reference image with the context encoding network;
step 602, normalizing the depth map $D_4$, and performing feature extraction on the normalized depth map $\bar D_4$ with a shallow 2D CNN;
step 603, concatenating the extracted depth map features with the context features of the reference image, and inputting them into the edge-preserving residual learning network for residual learning to obtain a residual map $R_4$;
step 604, adding the normalized and upsampled depth map to the residual map $R_4$, and de-normalizing the result of the addition to obtain the optimized depth map $\tilde D_4$.
Optionally, the training process of the multi-scale depth feature extraction network includes:
step 801, supervising the multi-scale depth estimation network jointly with the cross-view photometric consistency loss and the L1 loss; for a pixel $p$ with depth value d in the reference image $I_0$, the corresponding pixel $p_i$ in the source view is $p_i \sim K_i\,(R_i\,(d\,K_0^{-1}\,p) + t_i)$, wherein $K_0$ and $K_i$ are the camera intrinsic parameters of the reference view and the i-th neighborhood view respectively, and $R_i$, $t_i$ are the relative rotation and translation between the reference view and the i-th neighborhood view; based on the depth map D, the image $\hat I_i$ synthesized from the i-th neighborhood view onto the reference view is obtained through differentiable bilinear interpolation, i.e. $\hat I_i(p) = I_i(p_i)$; a binary mask $M_i$ generated in the transformation process is used for marking the invalid pixels in the synthesized image $\hat I_i$;
the computational disclosure of cross-view photometric consistency loss is:
wherein, respectively, views synthesized on the basis of the i-th neighborhood view according to the true depth and the estimated depth are represented, N represents the number of views,
representing the effective pixels in the composite image and the generated GT depth map
So as to obtain the compound with the characteristics of,
representing valid pixels in the GT depth map;
step 802, combining the cross-view photometric consistency loss and the L1 loss to obtain the loss of the multi-scale depth estimation branch as a weighted sum over the five stages, wherein $\lambda_s$ is the weight coefficient of the loss function at the s-th stage;
step 803, supervising the hierarchical edge-preserving residual learning branch with the L1 loss; the total loss of the whole network combines the multi-scale depth estimation branch loss with the weighted L1 losses of the refinement stages, wherein $\omega_s$ is the weight coefficient of the loss function at the s-th stage.
According to a second aspect of the present invention, there is provided a ranging method for an unmanned aerial vehicle platform, comprising: performing distance measurement based on the depth map obtained by the edge-preserving multi-view depth estimation method for the unmanned aerial vehicle platform described above.
The invention provides an edge-preserving multi-view depth estimation and ranging method for an unmanned aerial vehicle platform. To achieve accurate estimation of detail regions, a hierarchical edge-preserving residual learning module is proposed to correct the errors introduced by bilinear upsampling and to help improve the accuracy of the depth estimated by the multi-scale depth estimation network. In addition, to strengthen the gradient flow of detail regions during network training, a cross-view photometric consistency loss is proposed, which can further improve the accuracy of the estimated depth. To achieve a better balance between performance and efficiency, a lightweight multi-view depth estimation cascade network framework is designed and combined with the above two strategies, so that accurate depth estimation can be achieved efficiently, which facilitates practical application on an unmanned aerial vehicle platform.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order to overcome the defects and problems in the background art, a hierarchical edge-preserving residual learning module is proposed to optimize the depth maps estimated by the multi-scale depth estimation network, so that the network can perform edge-aware depth map upsampling. In addition, a cross-view photometric consistency loss is proposed to strengthen the gradient flow of detail regions during training, thereby achieving more refined depth estimation. Meanwhile, on this basis, a lightweight multi-view depth estimation cascade network framework is designed so that depth estimation can be carried out efficiently.
Therefore, the invention provides an efficient edge-preserving multi-view depth estimation and ranging method for an unmanned aerial vehicle platform. Fig. 1 is a schematic diagram of the overall architecture of the edge-preserving multi-view depth estimation and ranging method for the unmanned aerial vehicle platform. As shown in Fig. 1, the edge-preserving multi-view depth estimation method includes:
Step 1, given a reference image $I_0$ and its N-1 neighborhood images $\{I_i\}_{i=1}^{N-1}$, extracting the multi-scale depth features $\{F_i^s\}$ of each image with a weight-sharing multi-scale depth feature extraction network, wherein $s$ denotes the s-th scale, the s-th scale feature has size $C_s \times H_s \times W_s$, $C_s$ is the number of channels of the s-th scale feature, $H_s \times W_s$ is the spatial resolution at the s-th scale, and $H \times W$ is the size of the original input image.
Step 2, determining the depth map $D_1$ estimated at the 1st stage of the multi-scale depth estimation network.
Step 3, based on the depth map $D_1$, determining the depth map $D_2$ estimated at the 2nd stage of the multi-scale depth estimation network.
Step 4, in order to carry out edge-preserving upsampling, optimizing and upsampling the depth map $D_2$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_2$.
Step 5, based on the depth map $\tilde D_2$ and the image depth features $\{F_i^2\}$ at the 2nd scale, sequentially performing the depth estimation of the 3rd stage and the 4th stage to obtain the depth map $D_4$ estimated at the 4th stage.
Step 6, optimizing and upsampling the depth map $D_4$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_4$.
Step 7, based on the optimized depth map $\tilde D_4$ and the image depth features $\{F_i^3\}$ at the 3rd scale, performing the depth estimation of the 5th stage to obtain the final depth map $D_5$.
In summary, the whole multi-scale depth estimation network branch has five stages in total; the numbers of depth hypothesis samples of the stages are 32, 16, 8 and 8, respectively; the depth sampling range of the 2nd stage is attenuated to half of that of the previous stage, and the ranges of the remaining stages are attenuated to one quarter of that of the previous stage.
The invention provides an efficient edge-preserving multi-view depth estimation method for an unmanned aerial vehicle platform, which aims to solve the technical problems that the depth of thin structures and object edge regions is difficult to recover and that a good balance between performance and efficiency is difficult to achieve with existing methods.
Example 1
Embodiment 1 provided by the present invention is an embodiment of an edge-preserving multi-view depth estimation method for an unmanned aerial vehicle platform, and as can be seen from fig. 1, the embodiment of the edge-preserving multi-view depth estimation method includes:
Step 1, given a reference image $I_0$ and its N-1 neighborhood images $\{I_i\}_{i=1}^{N-1}$, extracting the multi-scale depth features $\{F_i^s\}$ of each image with a weight-sharing multi-scale depth feature extraction network, wherein $s$ denotes the s-th scale, the s-th scale feature has size $C_s \times H_s \times W_s$, $C_s$ is the number of channels of the s-th scale feature, $H_s \times W_s$ is the spatial resolution at the s-th scale, and $H \times W$ is the size of the original input image.
In one possible embodiment, the multi-scale feature extraction network is a two-dimensional U-shaped network consisting of an encoder and a decoder with skip connections. Furthermore, to enhance the feature representation capability, the encoder and the decoder are composed of a plurality of residual blocks.
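The patent does not fix the exact layer configuration of this network. As an illustration only, a minimal PyTorch-style sketch of a weight-sharing, residual-block U-shaped extractor that returns three feature scales could look as follows; the channel widths, the number of downsampling steps and the choice of output scales are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return F.relu(x + self.body(x))

class MultiScaleFeatureNet(nn.Module):
    """2D U-shaped encoder-decoder with skip connections, returning 1/4, 1/2 and full-resolution features."""
    def __init__(self, base=8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, base, 3, padding=1), ResBlock(base))
        self.down1 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.enc2 = ResBlock(base * 2)
        self.down2 = nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1)
        self.enc3 = ResBlock(base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)
        self.dec2 = ResBlock(base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.dec1 = ResBlock(base)
        # per-scale output heads
        self.out_s1 = nn.Conv2d(base * 4, base * 4, 1)   # scale 1: 1/4 resolution
        self.out_s2 = nn.Conv2d(base * 2, base * 2, 1)   # scale 2: 1/2 resolution
        self.out_s3 = nn.Conv2d(base, base, 1)           # scale 3: full resolution
    def forward(self, img):
        e1 = self.enc1(img)
        e2 = self.enc2(self.down1(e1))
        e3 = self.enc3(self.down2(e2))
        d2 = self.dec2(self.up2(e3) + e2)    # skip connection
        d1 = self.dec1(self.up1(d2) + e1)    # skip connection
        return {1: self.out_s1(e3), 2: self.out_s2(d2), 3: self.out_s3(d1)}

# weight sharing: the same network instance processes the reference and all neighborhood images
net = MultiScaleFeatureNet()
feats = [net(torch.rand(1, 3, 64, 80)) for _ in range(3)]   # N = 3 views
```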
Step 2, determining the depth map $D_1$ estimated at the 1st stage of the multi-scale depth estimation network.
In a possible embodiment, for the 1 st stage, step 2 includes:
Step 201, uniformly sampling $M_1$ depth hypothesis values within the whole scene depth range $[d_{\min}, d_{\max}]$.
It will be appreciated that, for a depth hypothesis d, the depth features of all neighborhood views are projected onto the reference view by the differentiable homography transformation, yielding the transformed features $\tilde F_i(d)$. The calculation process of the differentiable homography is shown in formula (1), wherein $K_0$ and $[R_0|t_0]$ respectively represent the camera intrinsic and extrinsic parameters of the reference view, and $K_i$ and $[R_i|t_i]$ respectively represent the camera intrinsic and extrinsic parameters of the i-th neighborhood view.
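Formula (1) is not reproduced above. For reference, the sketch below implements the standard differentiable plane-sweep warping commonly used in learning-based multi-view stereo, which back-projects each reference pixel at every depth hypothesis and samples the neighborhood-view features with bilinear interpolation; world-to-camera extrinsics and the exact tensor layout are assumptions, not the patent's formula (1).

```python
import torch
import torch.nn.functional as F

def homo_warp(feat_i, K0, K_i, R0, t0, R_i, t_i, depth_hypos):
    """Warp neighborhood-view features to the reference view under each depth hypothesis.
    feat_i:      (B, C, H, W) features of the i-th neighborhood view
    K0, K_i:     (B, 3, 3) intrinsics of the reference / neighborhood view
    R*, t*:      (B, 3, 3), (B, 3, 1) world-to-camera extrinsics (assumption)
    depth_hypos: (B, M, H, W) per-pixel depth hypotheses in the reference view
    returns:     (B, C, M, H, W) warped features
    """
    B, C, H, W = feat_i.shape
    M = depth_hypos.shape[1]
    # relative transform from the reference camera to the neighborhood camera
    R = R_i @ R0.transpose(1, 2)                       # (B, 3, 3)
    t = t_i - R @ t0                                   # (B, 3, 1)
    # homogeneous pixel grid of the reference view
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32, device=feat_i.device),
                          torch.arange(W, dtype=torch.float32, device=feat_i.device),
                          indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).reshape(1, 3, -1).expand(B, -1, -1)
    cam_rays = torch.inverse(K0) @ pix                                           # (B, 3, H*W)
    # back-project with every depth hypothesis, then project into the neighborhood view
    pts = cam_rays.unsqueeze(1) * depth_hypos.reshape(B, M, 1, H * W)            # (B, M, 3, H*W)
    pts = R.unsqueeze(1) @ pts + t.unsqueeze(1)
    uv = K_i.unsqueeze(1) @ pts
    uv = uv[:, :, :2] / uv[:, :, 2:3].clamp(min=1e-6)                            # (B, M, 2, H*W)
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid_x = 2.0 * uv[:, :, 0] / (W - 1) - 1.0
    grid_y = 2.0 * uv[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(B, M * H, W, 2)
    warped = F.grid_sample(feat_i, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.reshape(B, C, M, H, W)
```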
Step 202, through the differentiable homography transformation, under each depth hypothesis, the depth features $F_i^1$ of the i-th neighborhood view are projectively transformed onto the reference view, and the two-view cost volume $V_i$ is then constructed using the group-wise correlation metric.
It will be appreciated that the similarity between the projectively transformed depth features of each neighborhood view and the depth features of the reference view is calculated based on the group-wise correlation metric. Specifically, for the depth features $F_0$ of the reference image and the projectively transformed features $\tilde F_i(d)$ of the i-th neighborhood view under the depth value d, the features are evenly divided into G groups along the feature channel dimension. Then, the feature similarity between the g-th groups of $F_0$ and $\tilde F_i(d)$ is calculated as $S_i^g(d) = \langle F_0^{(g)}, \tilde F_i^{(g)}(d) \rangle$, wherein $F_0^{(g)}$ and $\tilde F_i^{(g)}(d)$ are respectively the g-th group of features of $F_0$ and $\tilde F_i(d)$, and $\langle\cdot,\cdot\rangle$ is the inner product operation. After the feature similarities of all G groups between $F_0$ and $\tilde F_i(d)$ have been calculated, they form a feature similarity map with G channels. Since there are $M_1$ depth hypotheses, the $M_1$ feature similarity maps between the reference image and the i-th neighborhood view further form the two-view cost volume $V_i$ of size $G \times M_1 \times H_1 \times W_1$.
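A sketch of this group-wise correlation, building the i-th two-view cost volume from the reference features and the warped neighborhood features; averaging over the channels of each group (i.e. a 1/(C/G)-normalized inner product) is an assumption, as the patent only specifies a group-wise inner product.

```python
import torch

def group_correlation(ref_feat, warped_feat, num_groups=8):
    """Group-wise correlation between reference features and warped neighborhood features.
    ref_feat:    (B, C, H, W)     reference-view depth features
    warped_feat: (B, C, M, H, W)  neighborhood features warped under M depth hypotheses
    returns:     (B, G, M, H, W)  two-view cost volume V_i
    """
    B, C, M, H, W = warped_feat.shape
    assert C % num_groups == 0
    ref = ref_feat.reshape(B, num_groups, C // num_groups, 1, H, W)
    src = warped_feat.reshape(B, num_groups, C // num_groups, M, H, W)
    # inner product over the channels of each group; the mean (1/(C/G) factor) is an assumption
    return (ref * src).mean(dim=2)
```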
Step 203, for the i-th two-view cost volume $V_i$, a visibility map $W_i$ is estimated with a shallow 3D CNN, and all two-view cost volumes are weighted and summed based on the visibility map of each neighborhood view to obtain the final aggregated cost volume $V$.
It can be understood that, in order to obtain the visibility map $W_i$ of the i-th neighborhood view under the reference view, each two-view cost volume is processed by a shallow 3D CNN consisting of one layer of 3D convolution, batch normalization, a ReLU activation function, another layer of 3D convolution and a Sigmoid activation function. On this basis, the visibility map of each neighborhood view is used to perform a weighted summation of the two-view cost volumes, obtaining the final aggregated cost volume $V = \sum_{i=1}^{N-1} W_i \odot V_i$.
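A sketch of this visibility-weighted aggregation; collapsing the depth dimension of the Sigmoid output with a max to obtain a per-pixel visibility map, and normalizing the weighted sum by the sum of the weights, are assumptions added for illustration rather than details stated in the patent.

```python
import torch
import torch.nn as nn

class VisibilityNet(nn.Module):
    """Shallow 3D CNN: Conv3d - BatchNorm - ReLU - Conv3d - Sigmoid."""
    def __init__(self, in_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 8, 3, padding=1), nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, cost_i):                 # (B, G, M, H, W)
        v = self.net(cost_i)                   # (B, 1, M, H, W)
        # collapse the depth dimension into a per-pixel visibility map (assumption)
        return v.max(dim=2, keepdim=True)[0]   # (B, 1, 1, H, W)

def aggregate(cost_volumes, vis_net):
    """Visibility-weighted summation of all two-view cost volumes."""
    weights = [vis_net(c) for c in cost_volumes]
    num = sum(w * c for w, c in zip(weights, cost_volumes))
    den = sum(weights) + 1e-6                  # normalization by the weight sum is an assumption
    return num / den
```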
Step 204, the cost volume $V$ is regularized with a three-dimensional convolutional neural network, a depth probability volume is obtained through a Softmax operation, and the depth map $D_1$ is obtained through soft-argmax based on the depth probability volume.
It can be understood that the cost volume $V$ is regularized with a three-dimensional convolutional neural network, which takes the form of a three-dimensional U-shaped network. Then, a depth probability volume is obtained by a Softmax operation, and the depth map is regressed with soft-argmax, i.e. the final depth map $D_1$ is obtained as the expectation of the depth hypotheses under the depth probability volume, $D_1(p) = \sum_{j=1}^{M_1} d_j\,P_j(p)$, where $P_j(p)$ is the probability of the j-th depth hypothesis $d_j$ at pixel p.
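The 3D regularization network itself (a 3D U-shaped network) is not sketched here; the snippet below only shows the Softmax and soft-argmax depth regression applied to its single-channel output, following the expectation described above.

```python
import torch
import torch.nn.functional as F

def regress_depth(cost_logits, depth_hypos):
    """Soft-argmax depth regression.
    cost_logits: (B, 1, M, H, W) single-channel output of the 3D regularization network
    depth_hypos: (B, M, H, W)    per-pixel depth hypothesis values
    returns:     (B, H, W) depth map, (B, M, H, W) depth probability volume
    """
    prob = F.softmax(cost_logits.squeeze(1), dim=1)   # depth probability volume
    depth = torch.sum(prob * depth_hypos, dim=1)      # expectation over the hypotheses
    return depth, prob
```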
Step 3, based on the depth map $D_1$, determining the depth map $D_2$ estimated at the 2nd stage of the multi-scale depth estimation network.
In a possible embodiment, for the 2 nd stage, the step 3 includes:
Step 301, determining the depth hypothesis sampling range of the second stage according to the depth map $D_1$, and uniformly sampling $M_2$ depth hypothesis values within this depth range.
As will be appreciated, the depth hypothesis sampling range of the current stage is determined from the depth map $D_1$ estimated at the previous stage, and $M_2$ depth hypothesis values are uniformly sampled within this range, the sampling range of each pixel being an interval centered on the depth value estimated for that pixel at the previous stage.
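A sketch of this per-pixel uniform sampling around the previous-stage depth; the half-width of the interval is left as a parameter, since the patent only states the ratios by which the range shrinks from stage to stage.

```python
import torch

def sample_depth_hypotheses(prev_depth, num_hypos, half_range):
    """Uniformly sample depth hypotheses around the previous-stage depth map.
    prev_depth: (B, H, W) depth estimated at the previous stage (upsampled to this stage's resolution)
    num_hypos:  number of hypotheses M for this stage
    half_range: scalar half-width of the sampling interval (assumed parameter)
    returns:    (B, M, H, W) per-pixel depth hypotheses
    """
    steps = torch.linspace(0.0, 1.0, num_hypos, device=prev_depth.device)   # (M,)
    low = prev_depth - half_range
    high = prev_depth + half_range
    hypos = low.unsqueeze(1) + steps.view(1, -1, 1, 1) * (high - low).unsqueeze(1)
    return hypos.clamp(min=1e-3)   # keep the hypotheses positive
```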
Step 302, performing two-view cost volume construction and aggregation according to the method of steps 201 to 203, and obtaining the aggregated cost volume $V$ based on the image depth features $\{F_i^1\}$ at the 1st scale and the $M_2$ depth hypothesis values.
It can be understood that, according to the two-view cost volume construction and aggregation method in step 2, the aggregated cost volume $V$ is obtained based on the image depth features $\{F_i^1\}$ at the 1st scale and the $M_2$ depth hypothesis values.
Step 303, performing cost volume regularization and depth map prediction according to the method of step 204, and obtaining the depth map $D_2$ based on the cost volume $V$.
It can be understood that, according to the cost volume regularization and depth map prediction method in step 2, the depth map $D_2$ is obtained based on the cost volume $V$.
Step 4, optimizing and upsampling the depth map $D_2$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_2$.
In one possible embodiment, step 4 includes:
Step 401, extracting the multi-scale context features $\{F_c^s\}$ of the reference image with a context encoding network, wherein $s$ denotes the s-th scale and the s-th scale context feature has size $C'_s \times H_s \times W_s$.
It is understood that the context encoding network in step 401 is similar in structure to the multi-scale feature extraction network in step 1, and is also a two-dimensional U-shaped network composed of an encoder and a decoder with skip connections.
Step 402, normalizing the depth map $D_2$, and performing feature extraction on the normalized depth map $\bar D_2$ with a shallow 2D CNN.
It is to be understood that, in step 402, the depth map $D_2$ is normalized as $\bar D_2 = \frac{D_2 - \mu(D_2)}{\sigma(D_2)}$, wherein $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and variance calculations, respectively.
Step 403, concatenating the extracted depth map features with the context features of the reference image, and inputting them into an edge-preserving residual learning network for residual learning to obtain a residual map $R_2$.
It will be appreciated that the edge-preserving residual learning network in step 403 is a two-dimensional U-shaped network consisting of an encoder and a decoder with skip connections; the encoder and the decoder are composed of a plurality of residual blocks to enhance the feature representation capability.
Step 404, adding the normalized and upsampled depth map to the residual map $R_2$, and de-normalizing the result of the addition to obtain the optimized depth map $\tilde D_2$.
It will be appreciated that, in step 404, the normalized depth map $\bar D_2$ is upsampled using bilinear interpolation and added to the residual map $R_2$, obtaining the optimized normalized depth map $\bar D_2^{up} = \mathrm{Up}_{\times 2}(\bar D_2) + R_2$, wherein $\mathrm{Up}_{\times 2}(\cdot)$ denotes upsampling to twice the original size using bilinear interpolation; on this basis, the mean and variance of the depth map $D_2$ are used to de-normalize $\bar D_2^{up}$, obtaining the optimized depth map $\tilde D_2 = \sigma(D_2)\,\bar D_2^{up} + \mu(D_2)$.
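Putting steps 401 to 404 together, a sketch of one refinement pass of the hierarchical edge-preserving residual learning module; the shallow 2D CNN and the residual U-Net are passed in as placeholders, the interpolation of the depth features to the context resolution before concatenation is an assumption, and the standard deviation is used for the (de-)normalization in line with the mean/variance description above.

```python
import torch
import torch.nn.functional as F

def refine_depth(depth, ctx_feat, depth_encoder, residual_unet):
    """Hierarchical edge-preserving residual learning for one stage.
    depth:         (B, 1, H, W)   depth map estimated by the current stage
    ctx_feat:      (B, C, 2H, 2W) context features of the reference image at the target scale
    depth_encoder: shallow 2D CNN extracting features from the normalized depth map
    residual_unet: 2D U-shaped residual learning network predicting a 1-channel residual map
    returns:       (B, 1, 2H, 2W) optimized, upsampled depth map
    """
    mean = depth.mean(dim=(2, 3), keepdim=True)
    std = depth.std(dim=(2, 3), keepdim=True) + 1e-6
    d_norm = (depth - mean) / std                                    # normalization
    d_feat = depth_encoder(d_norm)                                   # shallow 2D CNN features
    d_feat = F.interpolate(d_feat, scale_factor=2, mode="bilinear", align_corners=True)
    residual = residual_unet(torch.cat([d_feat, ctx_feat], dim=1))   # edge-preserving residual map
    d_up = F.interpolate(d_norm, scale_factor=2, mode="bilinear", align_corners=True)
    d_opt = d_up + residual                                          # add the residual in normalized space
    return d_opt * std + mean                                        # de-normalization
```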
Step 5, based on the depth map $\tilde D_2$ and the image depth features $\{F_i^2\}$ at the 2nd scale, sequentially performing the depth estimation of the 3rd stage and the 4th stage to obtain the depth map $D_4$ estimated at the 4th stage.
Step 6, optimizing and upsampling the depth map $D_4$ with a hierarchical edge-preserving residual learning module to obtain an optimized depth map $\tilde D_4$.
In a possible embodiment, the method of step 6 is similar to that of step 4, and may specifically include:
Step 601, extracting the multi-scale context features $\{F_c^s\}$ of the reference image with the context encoding network.
Step 602, normalizing the depth map $D_4$, and performing feature extraction on the normalized depth map $\bar D_4$ with a shallow 2D CNN.
Step 603, concatenating the extracted depth map features with the context features of the reference image, and inputting them into the edge-preserving residual learning network for residual learning to obtain a residual map $R_4$.
Step 604, adding the normalized and upsampled depth map to the residual map $R_4$, and de-normalizing the result of the addition to obtain the optimized depth map $\tilde D_4$.
Step 7, based on the optimized depth map $\tilde D_4$ and the image depth features $\{F_i^3\}$ at the 3rd scale, performing the depth estimation of the 5th stage to obtain the final depth map $D_5$.
In a possible embodiment, in the process of performing the depth estimation of the 3 rd stage, the 4 th stage and the 5 th stage in the steps 5 and 7: the depth range is determined in accordance with the method of step 301.
Constructing and aggregating the two-view cost body according to the method from step 201 to step 203; the cost body regularization and depth map prediction are performed according to the method of step 204.
In a possible embodiment, the training process of the multi-scale depth feature extraction network comprises the following steps:
Step 801, supervising the multi-scale depth estimation network jointly with the cross-view photometric consistency loss and the L1 loss. The core idea of cross-view photometric consistency is to amplify the gradient flow of detail regions by translating, through depth-based view synthesis, the difference between the true depth value and the predicted depth value into a difference between the image synthesized from the true depth and the image synthesized from the predicted depth. For a pixel $p$ with depth value d in the reference image $I_0$, the corresponding pixel $p_i$ in the source view is $p_i \sim K_i\,(R_i\,(d\,K_0^{-1}\,p) + t_i)$, wherein $K_0$ and $K_i$ are the camera intrinsic parameters of the reference view and the i-th neighborhood view respectively, and $R_i$, $t_i$ are the relative rotation and translation between the reference view and the i-th neighborhood view. Through this transformation, the image $\hat I_i$ synthesized from the i-th neighborhood view onto the reference view based on the depth map D is obtained through differentiable bilinear interpolation, i.e. $\hat I_i(p) = I_i(p_i)$. During the transformation, a binary mask $M_i$ is generated for identifying the invalid pixels in the synthesized image $\hat I_i$, i.e. the pixels projected to the area outside the image.
The cross-view photometric consistency loss is computed as the masked L1 difference between $\hat I_i^{gt}$ and $\hat I_i$, averaged over all neighborhood views, wherein $\hat I_i^{gt}$ and $\hat I_i$ respectively represent the views synthesized from the i-th neighborhood view according to the true depth and the estimated depth, N represents the number of views, and $M_i$ represents the valid pixels shared by the synthesized image and the ground-truth (GT) depth map; the L1 depth loss is evaluated on the valid pixels of the GT depth map.
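The exact loss formula is not reproduced above. The sketch below follows the description: the reference-view image is synthesized from the i-th neighborhood view with both the ground-truth and the predicted depth, invalid pixels are masked out, and an L1 photometric difference is averaged; normalizing by the number of valid pixels is an assumption, and in practice the mask would also be intersected with the valid pixels of the GT depth map.

```python
import torch
import torch.nn.functional as F

def synthesize_from_view(img_i, depth, K0, K_i, R_rel, t_rel):
    """Warp the i-th neighborhood image onto the reference view using depth map `depth`.
    img_i: (B, 3, H, W); depth: (B, H, W); R_rel, t_rel: relative pose from reference to view i.
    Returns the synthesized image (B, 3, H, W) and a binary validity mask (B, 1, H, W).
    """
    B, _, H, W = img_i.shape
    y, x = torch.meshgrid(torch.arange(H, device=depth.device, dtype=torch.float32),
                          torch.arange(W, device=depth.device, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], 0).reshape(1, 3, -1).expand(B, -1, -1)
    pts = torch.inverse(K0) @ pix * depth.reshape(B, 1, -1)          # back-project with the depth map
    uv = K_i @ (R_rel @ pts + t_rel)                                 # project into view i
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    grid_x = 2.0 * uv[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], -1).reshape(B, H, W, 2)
    synth = F.grid_sample(img_i, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
    valid = ((grid_x.abs() <= 1) & (grid_y.abs() <= 1)).reshape(B, 1, H, W).float()
    return synth, valid

def photometric_consistency_loss(synth_gt, synth_pred, mask):
    """Masked L1 difference between the two synthesized images for one neighborhood view."""
    return (mask * (synth_gt - synth_pred).abs()).sum() / (mask.sum() * 3 + 1e-6)
```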
Step 802, combining the cross-view photometric consistency loss and the L1 loss to obtain the loss of the multi-scale depth estimation branch as a weighted sum over the five stages, wherein $\lambda_s$ is the weight coefficient of the loss function at the s-th stage; the weight coefficients of the loss functions of the 1st to 5th stages may be set to 0.5, 1 and 2, respectively.
Step 803, supervising the hierarchical edge-preserving residual learning branch with the L1 loss; the total loss of the whole network is the sum of the multi-scale depth estimation branch loss and the weighted L1 losses of the two refinement stages, wherein $\omega_s$ is the weight coefficient of the loss function at the s-th stage; the weight coefficients of the loss functions at the 2nd and 4th stages may be set to 1 and 2, respectively.
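A sketch of how the two supervision branches described in steps 802 and 803 could be combined; the per-stage loss values are assumed to be precomputed, and the exact way the photometric term enters each stage loss is an assumption.

```python
def total_loss(stage_l1, stage_pc, refine_l1, depth_weights, refine_weights):
    """Combine the multi-scale depth-estimation branch loss (L1 plus cross-view photometric
    consistency per stage) with the L1 losses of the hierarchical edge-preserving refinement
    branch; `depth_weights` and `refine_weights` hold the per-stage weight coefficients."""
    depth_branch = sum(w * (l1 + pc) for w, l1, pc in zip(depth_weights, stage_l1, stage_pc))
    refine_branch = sum(w * l1 for w, l1 in zip(refine_weights, refine_l1))
    return depth_branch + refine_branch
```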
Example 2
Embodiment 2 provided by the present invention is an embodiment of a ranging method for an unmanned aerial vehicle platform. Referring to Fig. 1, the embodiment of the ranging method includes: performing distance measurement based on the depth map obtained by the edge-preserving multi-view depth estimation method for the unmanned aerial vehicle platform described above.
It can be understood that the ranging method for the unmanned aerial vehicle platform provided by the present invention corresponds to the edge preservation multiview depth estimation method for the unmanned aerial vehicle platform provided by the foregoing embodiments, and the relevant technical features of the ranging method for the unmanned aerial vehicle platform may refer to the relevant technical features of the edge preservation multiview depth estimation method for the unmanned aerial vehicle platform, which are not described herein again.
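The patent leaves the ranging step at this level of generality. Assuming the estimated depth map stores z-depth along the camera axis in metres and the target is selected by a pixel coordinate, one minimal sketch of recovering the straight-line distance to the target is:

```python
import numpy as np

def range_to_pixel(depth_map, K, u, v):
    """Straight-line distance from the camera center to the 3D point seen at pixel (u, v),
    assuming `depth_map` stores z-depth in metres and K is the 3x3 camera intrinsic matrix."""
    z = float(depth_map[v, u])
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-projected ray (unnormalized)
    point = z * ray                                  # 3D point in camera coordinates
    return float(np.linalg.norm(point))
```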
The edge-preserving multi-view depth estimation and ranging method for the unmanned aerial vehicle platform brings clear gains in both depth estimation quality and efficiency, and these gains mainly come from three aspects. First, the hierarchical edge-preserving residual learning module corrects the errors introduced by bilinear upsampling and optimizes the depth maps estimated by the multi-scale depth estimation network, yielding depth maps with preserved edge details. Second, the cross-view photometric consistency loss is introduced to strengthen the gradient flow of detail regions during training, which further improves the accuracy of the depth estimation. Third, on this basis, a lightweight multi-view depth estimation cascade network framework is designed; by stacking stages at the same resolution, as many depth hypotheses as possible can be sampled without adding much extra GPU memory or time consumption, so accurate depth estimation can be achieved efficiently and the multi-view depth estimation network can be practically applied on an unmanned aerial vehicle platform.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.