Neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation
Technical Field
The invention relates to the field of computer vision, and in particular to large-scale three-dimensional reconstruction using Neural Radiance Fields (NeRF). More specifically, the invention relates to a neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation.
Background
At present, image-based three-dimensional reconstruction techniques fall into two main categories: traditional geometry-based methods and neural-network-based methods. The traditional approach mainly comprises two steps, Structure from Motion (SfM) and Multi-View Stereo (MVS). SfM extracts and matches feature points across an image sequence and recovers the three-dimensional structure and camera trajectory through bundle adjustment. MVS then estimates a depth map for each view and generates a per-view point cloud. The other category is represented by Neural Radiance Fields (NeRF), a significant breakthrough in three-dimensional reconstruction in recent years. NeRF implicitly encodes three-dimensional scene information in trained network parameters and enables synthesis of images from novel viewpoints. However, urban-level large-scale reconstruction faces two major challenges. First, large-scale reconstruction generally requires processing and storing huge amounts of data, which can rapidly exhaust the memory of a single GPU, leading to slow processing, out-of-memory failures and other performance problems; this is particularly unfriendly to users or researchers with limited video memory resources. Second, as demand grows for real-time or near-real-time applications such as navigation or disaster management, research into faster and more efficient large-scale three-dimensional reconstruction becomes particularly urgent.
In order to create more accurate, detailed and useful large-scale 3D models and to address the practical challenges of large datasets and limited computational resources, the present invention proposes an efficient method that enables fast, robust large-scale three-dimensional reconstruction under limited video memory (GPU memory) resources.
Disclosure of Invention
Addressing the problems of existing large-scale reconstruction techniques, namely long training times and high video memory requirements, the invention provides a fast and efficient large-scale three-dimensional reconstruction method based on a neural radiance field that can effectively process large amounts of input data under limited video memory resources. The method comprises two major parts: an uncertainty-based view planning strategy and an information-gain-based view selection strategy.
In order to achieve the above purpose, the present application adopts the following technical solution:
A neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation comprises the following steps:
inputting a group of N RGB images of uniform resolution captured by an unmanned aerial vehicle;
processing each input RGB image, estimating its pose, recovering the structure in the RGB image, and outputting the 3D position x and viewing direction vector d corresponding to each image in the scene, wherein the structure in the RGB image refers to the scene geometry recovered from the RGB image, including but not limited to the position, shape and spatial relationship of objects;
initializing NeRF, namely randomly selecting a preset proportion of images from the RGB image set for initialization training of the NeRF model, wherein the preset proportion is not lower than 15%;
calculating the hybrid uncertainty, namely, based on the 5D coordinates (x, d) of the remaining image set, performing threshold sampling through the modified NeRF network, computing the color and opacity of each RGB image, integrating them into a Beta distribution, and calculating the rendering uncertainty;
selecting the image with the highest hybrid uncertainty, adding it to the training set, and repeating the process until a desired reconstruction quality or a preset limit on the number of images is reached.
The training set is the image dataset used to train the NeRF model. Like traditional triangle meshes, point clouds and voxels, the NeRF model is a representation for three-dimensional reconstruction; however, triangle meshes, point clouds and voxels are explicit representations, whereas NeRF is an implicit representation that stores scene information in a neural network. To reconstruct a scene, a dedicated network must be trained, and multi-view reconstruction is performed using the acquired images of the scene; these images form the training set. Given sufficient video memory, the more training images there are, the more information is available and the better the reconstruction.
As a preferred aspect of the invention, in the step of calculating the hybrid uncertainty,
based on the 5D coordinates (x, y, z, θ, φ) of the remaining image set, threshold sampling is performed through the modified NeRF network; the color and opacity of the sampling points on each ray of each RGB image are computed via multi-resolution hash storage combining an explicit voxel grid with an implicit neural network, and the rendering uncertainty of each picture is obtained by integrating them into a Beta distribution; at the same time,
the 3D coordinates (x, y, z) of the remaining image set are input, and a classifier judges whether the flight trajectory of the unmanned aerial vehicle is planar or non-planar; for a planar trajectory, a Voronoi information-gain radiation field performs top-level global planning and bottom-level local planning to compress and constrain the position information, while for a non-planar trajectory a Voronoi clustering algorithm quantifies the position uncertainty via the distance to the cluster center point.
Here (x, d) and (x, y, z, θ, φ) are two notations for the same 5D coordinates: in the former, x and d denote the position and viewing direction in three-dimensional space, respectively; in the latter, x, y, z are the values in the three-dimensional xyz coordinate system, and θ, φ are the horizontal and vertical components of the viewing direction.
As a preferred aspect of the present invention, the step of calculating the hybrid uncertainty is specifically performed as:
the color and opacity on each ray are obtained through the NeRF model, and the volume density is calculated according to the following formula:
αi = exp(−Σ_{j<i} σj δj) · (1 − exp(−σi δi))      (1)
where αi represents the volume density (rendering weight) of the i-th sample point on the ray, σj represents the opacity of the j-th sample point, and δj represents the distance from the j-th sample point to the previous sample point.
The color variance is calculated for each ray through the NeRF model as follows:
β²(r(ti)) = −P(αi) log P(αi)      (2)
where β²(r(ti)) represents the variance of the i-th sampling point on the ray, αi represents the volume density of the i-th sampling point (equation (1) above), and P(αi) represents the proportion of αi in the total volume density Σαi of the ray;
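By way of illustration, the following Python (NumPy) sketch computes the volume-rendering weights of equation (1) and the entropy-based variance of equation (2) for the sample points of a single ray; the function and array names are illustrative, not part of the original disclosure:

import numpy as np

def ray_weights_and_variance(sigma, delta):
    # sigma: (n,) opacities of the n sample points on one ray
    # delta: (n,) distances between adjacent sample points
    tau = sigma * delta
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)
    transmittance = np.exp(-np.concatenate(([0.0], np.cumsum(tau)[:-1])))
    # Equation (1): alpha_i = T_i * (1 - exp(-sigma_i * delta_i))
    alpha = transmittance * (1.0 - np.exp(-tau))
    # P(alpha_i): share of alpha_i in the ray's total weight
    p = alpha / (alpha.sum() + 1e-10)
    # Equation (2): beta^2(r(t_i)) = -P(alpha_i) * log P(alpha_i)
    beta_sq = -p * np.log(p + 1e-10)
    return alpha, beta_sq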
A Beta distribution is built from the estimated mean and variance, Ĉ(r) ~ Beta(C̄(r), β̄²(r)), where Ĉ(r) represents the predicted color of a ray, C̄(r) represents the mean color of the ray, and β̄²(r) represents the mean variance of the ray;
The continuous function is optimized by minimizing the squared reconstruction error between the true RGB image and the rendered pixel colors, while the rendering uncertainty of each picture is calculated as follows:
ψr²(I) = (1/Nr) Σ_{i=1}^{Nr} [ ‖C(ri) − Ĉ(ri)‖² / (2 β̄²(ri)) + ½ log β̄²(ri) ]      (3)
where ψr²(I) represents the rendering uncertainty of the I-th picture, Nr represents the total number of rays per picture, ‖C(ri) − Ĉ(ri)‖² represents the squared error between the true RGB image and the rendered pixel color, and β̄²(ri) represents the mean variance of ray i;
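Assuming the negative-log-likelihood form of equation (3), the per-image rendering uncertainty can be sketched as follows (a minimal NumPy sketch; names are illustrative):

import numpy as np

def rendering_uncertainty(c_true, c_pred, beta_sq_mean):
    # c_true, c_pred: (Nr, 3) ground-truth and rendered colors of the Nr rays of one picture
    # beta_sq_mean:   (Nr,)   mean variance of each ray
    sq_err = np.sum((c_true - c_pred) ** 2, axis=-1)  # ||C(ri) - C_hat(ri)||^2
    per_ray = sq_err / (2.0 * beta_sq_mean + 1e-10) + 0.5 * np.log(beta_sq_mean + 1e-10)
    return per_ray.mean()  # average over the Nr rays of the picture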
The Hausdorff distance between the two trajectory point sets is calculated to determine whether the image trajectory is planar or non-planar:
H(A, B) = max( h(A, B), h(B, A) ),  h(A, B) = max_{ai∈A} min_{bj∈B} ‖ai − bj‖      (4)
where H(A, B) is the Hausdorff distance between the two trajectories A and B, and ‖ai − bj‖ is the distance between the i-th point in trajectory A and the j-th point in trajectory B.
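A possible realization of this test is sketched below, under the assumption that the trajectory is compared, via the symmetric Hausdorff distance of equation (4), with its own projection onto a best-fit plane; the projection step and the threshold value are illustrative assumptions, not taken from the original disclosure:

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def is_planar_trajectory(points, threshold=1.0):
    # points: (N, 3) camera positions along the flight trajectory
    centered = points - points.mean(axis=0)
    # Normal of the best-fit plane = right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    projected = points - np.outer(centered @ normal, normal)
    # Symmetric Hausdorff distance of equation (4)
    h = max(directed_hausdorff(points, projected)[0],
            directed_hausdorff(projected, points)[0])
    return h < threshold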
As a preferred aspect of the invention, in the step of calculating the hybrid uncertainty, the position uncertainty of a planar trajectory is computed from the weighted distances between pose points and their Voronoi cell areas, where Fp(I) is the planar position uncertainty value of photo I, Nv represents the total number of three-dimensional pose points, λi is the weight of point i applied to the distance value between points i and j, and Ai represents the Voronoi cell area of point i.
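The Voronoi cell areas Ai appearing in this term can be computed, for example, with SciPy; the following sketch only illustrates how Ai may be obtained (unbounded boundary cells are assigned infinite area), not the full weighting of Fp(I):

import numpy as np
from scipy.spatial import Voronoi

def voronoi_cell_areas(points_2d):
    # points_2d: (N, 2) planar pose points
    vor = Voronoi(points_2d)
    areas = np.full(len(points_2d), np.inf)
    for i, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if -1 in region or len(region) == 0:
            continue  # unbounded cell on the boundary
        poly = vor.vertices[region]
        x, y = poly[:, 0], poly[:, 1]
        # Shoelace formula for the polygon area
        areas[i] = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    return areas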
As a preferred aspect of the invention, for a non-planar trajectory, the position uncertainty is estimated from the Voronoi-cell selection probability and the relative local density, where Fnp(I) is the non-planar position uncertainty value of photo I, Gi is the probability of selecting picture I based on its Voronoi polygon, and ρi represents the relative local density of the evaluated point, i.e. the density change before and after selection.
In a preferred aspect of the invention, in the optimal view selection step, the rendering uncertainty and the position uncertainty are normalized and summed to form the hybrid uncertainty of each image; the image with the highest hybrid uncertainty in the candidate set is selected and added to the training set, and the process is repeated until a desired reconstruction quality or a preset limit on the number of images is reached, thereby optimizing the view selection strategy. The hybrid uncertainty is calculated as follows:
ψ²(I) = norm(F(I)) + norm(ψr²(I))      (5)
where F(I) represents the position uncertainty of picture I (Fp or Fnp depending on the trajectory type), ψr²(I) represents the rendering uncertainty of picture I, and the total uncertainty ψ²(I) of picture I is obtained by normalizing the two terms and adding them.
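A minimal sketch of this selection step, assuming min-max normalization over the candidate set (the disclosure does not specify the normalization), is:

import numpy as np

def select_best_view(position_unc, rendering_unc):
    # position_unc, rendering_unc: per-candidate uncertainty values
    def norm(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    psi_sq = norm(position_unc) + norm(rendering_unc)  # equation (5)
    return int(np.argmax(psi_sq))  # index of the candidate with the highest hybrid uncertainty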
As a preferred aspect of the present invention, the method further comprises:
in each iteration, calculating the uncertainty of each unused candidate view based on the current training set, selecting the view with the largest information gain, and adding it to the training set to gradually improve the quality of novel view synthesis.
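The overall greedy acquisition loop can be sketched as follows; train_nerf and hybrid_uncertainty are hypothetical helpers standing in for the Instant-NGP training and the uncertainty computation described above:

def incremental_view_selection(train_set, candidates, budget):
    model = train_nerf(train_set)  # initialization training (hypothetical helper)
    while candidates and len(train_set) < budget:
        # Score every unused candidate view against the current model
        scores = [hybrid_uncertainty(model, view) for view in candidates]
        best = candidates.pop(scores.index(max(scores)))  # maximum information gain
        train_set.append(best)
        model = train_nerf(train_set, init=model)  # iterative training
    return model, train_set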
In the input RGB image acquisition step, a rotary-wing unmanned aerial vehicle flies a specific route over the target area; the flight route is either planar or non-planar, and fixed-point shooting is performed during the flight, thereby obtaining a group of N orthographic or oblique aerial RGB images of identical resolution captured by the unmanned aerial vehicle.
In the step of calculating the 3D positions and viewing directions, COLMAP software is used to estimate the pose of all input RGB images, obtaining the camera pose corresponding to each image; the default camera model is the pinhole camera model, and all cameras are set to share the same intrinsic parameters for incremental SfM reconstruction, thereby obtaining the intrinsic and extrinsic parameters of all cameras.
In a preferred aspect of the present invention, in the NeRF initialization training step, at least 15% of the images are randomly selected from the RGB image set for initialization training, so that the NeRF model learns the basic structure and appearance characteristics of the scene and stores them in the multi-layer perceptron network.
Randomly selecting 15% of the images from the training set for initialization training yields a coarse scene model, and iterative training then proceeds on the basis of this initialized model. If the coarse model is already reasonably good, subsequent iterations reach a satisfactory result in fewer rounds; otherwise, more rounds of iteration are needed.
For example, given 100 images, 10 images (10%) are selected as the test set for evaluating model quality and the remaining 90 images serve as the training pool; of these, 15 (15%) form the initial training set and the rest form the training candidate set. Initialization training then yields a coarse scene model; the hybrid uncertainty of all images in the training candidate set is computed, the candidate image with the largest information content is found and added to the training set (now 16 images: 15 initial images + 1 newly added candidate), and iterative training produces a new, improved scene model. These steps are repeated until the preset condition is met, for example until 15 candidate images have been selected.
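A minimal sketch of the split in this example (the function name and seed are illustrative):

import random

def split_dataset(images, test_frac=0.10, init_count=15, seed=0):
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test_set = shuffled[:n_test]                       # 10 of 100 images
    init_train = shuffled[n_test:n_test + init_count]  # 15 initialization images
    candidates = shuffled[n_test + init_count:]        # 75 training candidates
    return test_set, init_train, candidates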
The specific advantages of the invention are as follows:
The invention introduces an information gain based on a hybrid uncertainty that combines position uncertainty and rendering uncertainty, and, through maximum-gain view selection, helps accomplish fast large-scale three-dimensional reconstruction under limited video memory resources. Compared with the prior art, the method has notable advantages: it handles larger input data, completes large-scale three-dimensional reconstruction faster, and renders higher-quality images under the same conditions.
The neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation can reduce the artifacts that appear under multiple viewpoints and lower the computational cost while maintaining high rendering fidelity. The method focuses in particular on how, under the constraint of limited video memory resources, to effectively select from the candidate views those that bring the maximum information gain through an incremental optimal view selection strategy, thereby improving rendering quality and efficiency.
In addition, the invention combines multi-resolution hash storage mixing an explicit voxel grid with an implicit neural network, a Voronoi-diagram information-gain radiation field and clustering algorithm, threshold sampling, a flight-trajectory classifier and other techniques to further improve processing efficiency while preserving the performance of the original NeRF network. The technique can serve as a plug-in tool to help existing systems achieve more efficient three-dimensional scene reconstruction and rendering.
Specific embodiments of the invention are disclosed in detail below with reference to the following description and drawings, indicating the manner in which the principles of the invention may be employed. It should be understood that the embodiments of the invention are not limited in scope thereby. The embodiments of the invention include many variations, modifications and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained from these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a schematic illustration of the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
FIG. 2 is a flow chart of the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
FIG. 3 is a block diagram of the construction of the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
FIG. 4 is a schematic diagram of the Voronoi-diagram information-gain radiation field employed for planar flight trajectories in the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
FIG. 5 is a schematic diagram of the Voronoi-diagram clustering algorithm employed for non-planar flight trajectories in the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
FIG. 6 shows partial result effectiveness of the neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.
As shown in figs. 1 to 6, an embodiment of the present invention provides a neural radiance field incremental optimal view selection method based on hybrid uncertainty estimation, mainly used to solve the problems of large-scale three-dimensional reconstruction and novel view rendering; the algorithm schematic is shown in fig. 1.
The algorithm mainly comprises two major parts: calculating the rendering uncertainty and the position uncertainty, and using the resulting hybrid uncertainty for maximum-information-gain view selection planning. The algorithm flow is shown in fig. 2 and mainly comprises the following steps:
s1, firstly inputting an RGB real photo set. A set of high resolution RGB image data photographed using the unmanned aerial vehicle is input.
A rotary-wing unmanned aerial vehicle flies a specific route over the target area. The flight route is either planar or non-planar; non-planar routes are generally flown manually, while planar routes can be set in the flight-control app, typically as a grid-shaped flight pattern. Fixed-point shooting is performed during the flight, thereby obtaining a group of N orthographic or oblique aerial survey images of identical resolution captured by the unmanned aerial vehicle.
S2, performing pose estimation (the position and orientation of the camera in space) and sparse point cloud reconstruction on all input images using COLMAP, obtaining the camera pose and sparse point cloud information corresponding to each image.
The default camera model is the pinhole camera model (PINHOLE), and all cameras are set to share the same intrinsic parameters for incremental SfM reconstruction, thereby obtaining the intrinsic and extrinsic parameters of all cameras.
S3, initializing NeRF training. At least 15% of the images are randomly selected from the input dataset for initialization training of the NeRF model, ensuring that the model can learn the basic structure and appearance features of the scene.
S4, calculating the hybrid uncertainty. Based on the 5D coordinates (x, d) of the remaining image set, the color and opacity of the image are calculated by threshold sampling through the modified NeRF network.
The NeRF network is the part of the NeRF model responsible for learning and encoding scene information; the NeRF model relies on the NeRF network to learn the three-dimensional representation of the scene and requires other components (e.g., the volume rendering equation) to achieve novel view synthesis and three-dimensional reconstruction.
The complete image set is divided into a test set, a training set and a training candidate set; the remaining image set refers to the training candidate set, and the test set does not participate in the training process and is only used to evaluate training results. For example, of 100 images, 10 (10%) are selected as the test set for evaluating model quality, the remaining 90 serve as the training pool, 15 of them (15%) are selected to initialize the training set, and the images with the largest information content are then selected from the remaining training candidate set and added to the training set.
The color and opacity are then integrated into a Beta distribution and the rendering uncertainty is calculated. For the position uncertainty, the 3D coordinates x of the remaining image set are input, a classifier judges whether the trajectory is planar or non-planar, and a Voronoi-diagram algorithm then evaluates the position information to obtain the position uncertainty. The specific algorithm of this step is as follows:
S41, calculating the color and opacity on each ray through the MLP of the NeRF model; the volume density, which accounts for the distance between adjacent samples and their opacity, is calculated by the following formula:
αi = exp(−Σ_{j<i} σj δj) · (1 − exp(−σi δi))      (1)
where αi represents the volume density (rendering weight) of the i-th sample point on the ray, σj represents the opacity of the j-th sample point, and δj represents the distance from the j-th sample point to the previous sample point.
S42, calculating the color variance for each ray through the NeRF model as follows:
β²(r(ti)) = −P(αi) log P(αi)      (2)
where β²(r(ti)) represents the variance of the i-th sample point on the ray, αi represents the volume density of the i-th sample point (equation (1) above), and P(αi) represents the proportion of αi in the total volume density Σαi of the ray.
S43, since the volume density at a specific location is affected only by its own 3D coordinates and not by the viewing direction, the distributions at different locations are independent of one another; and since volume rendering can be approximated as a linear combination along the ray sampling points, a Beta distribution Ĉ(r) ~ Beta(C̄(r), β̄²(r)) can be constructed from the estimated mean and variance, where Ĉ(r) represents the predicted color of a ray (computed according to the constructed model), C̄(r) represents the mean color of the ray, and β̄²(r) represents the mean variance of the ray.
S44, optimizing the continuous function by minimizing the squared reconstruction error between the true RGB image and the rendered pixel colors, while calculating the rendering uncertainty of each picture:
ψr²(I) = (1/Nr) Σ_{i=1}^{Nr} [ ‖C(ri) − Ĉ(ri)‖² / (2 β̄²(ri)) + ½ log β̄²(ri) ]      (3)
where ψr²(I) represents the rendering uncertainty of the I-th picture, Nr represents the total number of rays per picture, ‖C(ri) − Ĉ(ri)‖² represents the squared error between the true RGB image and the rendered pixel color, and β̄²(ri) represents the mean variance of ray i.
S45, calculating the Hausdorff distance between the two trajectory point sets to determine whether the image trajectory is planar or non-planar:
H(A, B) = max( h(A, B), h(B, A) ),  h(A, B) = max_{ai∈A} min_{bj∈B} ‖ai − bj‖      (4)
where H(A, B) is the Hausdorff distance between the two trajectories A and B, and ‖ai − bj‖ is the distance between the i-th point in trajectory A and the j-th point in trajectory B.
S46, for a planar trajectory, calculating the position uncertainty using the Voronoi diagram, taking into account the importance of each position point and its planar Voronoi cell area. The position uncertainty is computed from the weighted distances between pose points and their Voronoi cell areas, where Fp(I) is the planar position uncertainty value of photo I, Nv represents the total number of three-dimensional pose points, λi is the weight of point i applied to the distance value between points i and j, and Ai represents the Voronoi cell area of point i.
For a non-planar trajectory, a Voronoi-diagram clustering algorithm is used to estimate the position uncertainty, taking into account three-dimensional spatial importance and local correlation density. The position uncertainty is computed from the Voronoi-cell selection probability and the relative local density, where Fnp(I) is the non-planar position uncertainty value of photo I, Gi is the probability of selecting picture I based on its Voronoi polygon, and ρi represents the relative local density of the evaluated point, i.e. the density change before and after selection.
S5, selecting the optimal view. The rendering uncertainty and the position uncertainty are normalized and summed to compute the hybrid uncertainty of each image. The image with the highest hybrid uncertainty is selected and added to the training set, and the process is repeated until a desired reconstruction quality or a preset limit on the number of images is reached. The hybrid uncertainty is calculated as follows:
ψ²(I) = norm(F(I)) + norm(ψr²(I))      (5)
where F(I) represents the position uncertainty of picture I, ψr²(I) represents the rendering uncertainty of picture I, and the total uncertainty ψ²(I) of picture I is obtained by normalizing the two terms and adding them.
S6, iterative optimization. In each iteration, the uncertainty of each unused candidate view is calculated based on the current training set; the view with the greatest information gain is selected and added to the training set to gradually improve the quality of novel view synthesis.
It should be noted that NeRF introduced a new scene representation and view synthesis method that allows highly realistic 3D scenes to be synthesized from 2D images. The training process captures a set of 2D images of the scene from different viewpoints. For each pixel in an image, NeRF computes the corresponding ray in 3D space, estimates the scene color and opacity at each point along these rays, compares the predicted color values with the observed ones, and adjusts the network parameters to minimize the difference. After NeRF is trained, new scene views can be synthesized by casting rays from a virtual camera position and accumulating color and opacity along each ray using the learned neural radiance field. This process allows photographs to be generated from previously unseen viewpoints, providing realistic scene renderings. Instant-NGP is to date the fastest neural radiance field model and is of great engineering value. That work proposed multi-resolution hash encoding to accelerate model training; multi-resolution hash tables of trainable feature vectors further reduce the model size, and the entire system is implemented with fully fused CUDA kernels to maximize parallelism.
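One level of the multi-resolution hash encoding can be sketched as follows; the spatial-hash primes are those of the Instant-NGP paper (Mueller et al., 2022), while the function and parameter names are illustrative:

import numpy as np

# Spatial-hash primes from the Instant-NGP paper (Mueller et al., 2022)
PRIMES = np.array([1, 2_654_435_761, 805_459_861], dtype=np.uint64)

def hash_grid_index(voxel_coords, table_size):
    # voxel_coords: (..., 3) integer corner coordinates at one resolution level
    # table_size:   length T of the hash table of trainable feature vectors
    c = voxel_coords.astype(np.uint64)
    h = c[..., 0] * PRIMES[0]
    h ^= c[..., 1] * PRIMES[1]
    h ^= c[..., 2] * PRIMES[2]
    return h % np.uint64(table_size)  # slot holding the level's feature vector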
Each round of iterative training uses a three-dimensional reconstruction method based on a neural radiance field with view planning that accounts for position uncertainty and rendering uncertainty; the specific model is Instant-NGP.
As can be seen from fig. 6, which shows partial results of the fast large-scale three-dimensional reconstruction method based on the neural radiance field, six groups of public large-scene datasets were selected and better reconstruction results were obtained compared with previous baselines.
In the present application, a plurality of elements, components, parts or steps can be provided by a single integrated element, component, part or step. Alternatively, a single integrated element, component, part or step may be divided into separate plural elements, components, parts or steps. The disclosure of "a" or "an" to describe an element, component, section or step is not intended to exclude other elements, components, sections or steps.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the present teachings should, therefore, be determined not with reference to the above description, but instead with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. The disclosures of all articles and references, including patent applications and publications, are incorporated herein by reference for the purpose of completeness. The omission from the preceding claims of any aspect of the subject matter disclosed herein is not a disclaimer of such subject matter, nor should it be taken to mean that the inventors do not consider such subject matter to be part of the disclosed subject matter.