Disclosure of Invention
The invention aims to provide a sparse new view angle synthesis method based on multi-scale feature fusion, which can greatly improve the processing efficiency and synthesis quality of sparse multi-view angle data, remarkably reduce the dependence of a model on the number of input views, simultaneously preserve important details and global structures in an image, and improve the fidelity and accuracy of new view angle generation results.
A sparse new view image synthesis method based on multi-scale feature fusion comprises the following steps:
S1, multi-scale reference point generation and feature sampling:
generating reference points of different scales according to the resolution of an input image, sampling on a corresponding multi-scale feature map by using the reference points, extracting features of each scale and splicing to obtain initial feature information;
S2, multi-receptive-field residual feature extraction:
performing depthwise convolution with different kernel sizes on the sampled initial feature information by means of multi-receptive-field convolution and a residual connection mechanism; the multiple receptive fields acquire information at different scales, while the residual connections preserve the original features and fuse the features at each scale, finally producing the multi-receptive-field residual features;
S3, feature aggregation and image generation based on an attention network:
taking the fused multi-scale depth features (i.e. the residual features) as input, processing them layer by layer through the self-attention modules of a GPNR model, and gradually aggregating and aligning the multi-view information by capturing the global dependencies among different views, thereby completing the deep fusion and expression of the features and generating a new view image for the sparse scene;
S4, pre-training the new view angle synthesis model:
pre-training the model on a large-scale general dataset containing a large number of dense images of different scenes, iterating repeatedly over different objects from multiple datasets so as to cover a wide variety of visual scenes and object types, and optimizing the model weights through loss computation and back propagation, so that the model learns general visual features and high-level semantic features suitable for the sparse new view synthesis task;
S5, fine-tuning of the pre-trained model:
further optimizing the pre-trained model through transfer learning: transferring its parameters to a new scene, fine-tuning on the new scene data under sparse view angles, and updating the model parameters through back propagation, which accelerates training and rendering of the new scene and makes the model better suited to the specific sparse scene;
The new view angle image can be generated by using the finally formed model.
In step S1, reference points of several scales are first generated according to the resolution of the input image. The reference points are key points used to mark pixel positions in the feature maps. An automatic selection method is adopted: the image is divided into grids of fixed size, the grid centre points are taken as reference points, and after normalizing their coordinates, the reference point sets of the different scales are merged. Feature sampling is then performed on the corresponding multi-scale feature maps (C1, C2, C3): the normalized reference point coordinates are used to locate index positions in the flattened feature matrix, and the feature values extracted at these positions are mapped back to the multi-scale structure. The sampled multi-scale features are finally spliced and fused into a unified feature representation C. This process is expressed as:
C=[Sample(C1),Sample(C2),Sample(C3)]
Wherein Sample (·) represents multi-scale feature sampling based on a reference point, and C contains the fused multi-scale features, providing abundant multi-scale information for subsequent processing.
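For illustration only, a minimal JAX sketch of this reference point generation and sampling procedure is given below; the helper names (generate_reference_points, sample_features), the toy resolutions and the nearest-index gather used for sampling are assumptions made for the example, not the exact implementation of the invention.

```python
import jax.numpy as jnp

def generate_reference_points(H, W):
    # Grid-centre reference points for one scale, normalized to [0, 1]
    # (matches "grid center points are used as the reference points").
    xs = (jnp.arange(W) + 0.5) / W
    ys = (jnp.arange(H) + 0.5) / H
    gx, gy = jnp.meshgrid(xs, ys)                         # (H, W) each
    return jnp.stack([gx.ravel(), gy.ravel()], axis=-1)   # (H*W, 2)

def sample_features(feat_flat, ref_pts, H, W):
    # feat_flat: (H*W, C) flattened feature map of this scale.
    # Map normalized coordinates to flat indices and gather feature values.
    col = jnp.clip((ref_pts[:, 0] * W).astype(jnp.int32), 0, W - 1)
    row = jnp.clip((ref_pts[:, 1] * H).astype(jnp.int32), 0, H - 1)
    return feat_flat[row * W + col]                       # (num_points, C)

# Example with hypothetical shapes for the three scales C1, C2, C3.
H, W, C = 32, 32, 64
shapes = [(2 * H, 2 * W), (H, W), (H // 2, W // 2)]
feats = [jnp.ones((h * w, C)) for (h, w) in shapes]       # stand-ins for C1, C2, C3
sampled = [sample_features(f, generate_reference_points(h, w), h, w)
           for f, (h, w) in zip(feats, shapes)]
C_fused = jnp.concatenate(sampled, axis=0)   # C = [Sample(C1), Sample(C2), Sample(C3)]
```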
In step S2, the specific steps of extracting the residual features of the multiple receptive fields are as follows:
First, the multi-scale features C are linearly transformed through a fully connected layer: different weights and biases are applied to each channel, the input features are re-weighted and combined by matrix operations, and the feature relations among the channels are adjusted and recombined, producing a linearly transformed feature representation (i.e. the feature information):
FC1(C)=W·C+b
where C is the multi-scale feature generated in step S1, W and b are parameters obtained after training to help the model fit the data better, and FC1 represents the first fully connected layer (Fully Connected Layer).
Depthwise convolution is then applied to the linearly transformed features using multi-receptive-field convolution: the features are divided along the channel dimension, and convolution kernels of different sizes, corresponding to different receptive fields, perform independent depthwise convolution on each part. The specific formula is:
Conv(x)=[Conv3×3(x1),Conv5×5(x2),Convk×k(x3)]
where x1, x2, x3 are the three parts of the input feature x divided along the channel dimension (the channel dimension corresponds to the number of features extracted by the network); x1, x2, x3 are convolved with kernels of sizes 3×3, 5×5 and k×k respectively to generate features at different scales. Conv denotes the convolution operation and k is a user-defined value, set to 7 in the invention;
After the convolution at each scale, a residual connection is added: the convolved features are fused with the original input features, whose channels remain unchanged, and the two feature representations are combined through the residual connection operation:
x=[Conv(x),identity(x)]
Where identity (x) indicates that the input feature x is passed directly, without any manipulation, i.e. the input feature is preserved.
The fused multi-scale features are then projected back to the original channel number: a fully connected layer applies a linear transformation with a weight matrix and bias, producing a feature representation with the same number of channels as the input features:
F=FC2([x1,x2,x3])
Where FC2 represents a second fully connected layer that maps the spliced multi-scale features back to the target channel number.
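The following minimal JAX sketch illustrates one possible form of the multi-receptive-field residual block of step S2 (FC1, three-way channel split, 3×3/5×5/7×7 depthwise convolutions, residual concatenation and FC2); the parameter dictionary, initialisation and tensor shapes are illustrative assumptions, not the exact implementation.

```python
import jax
import jax.numpy as jnp

def depthwise_conv(x, kernel):
    # x: (B, H, W, C); kernel: (k, k, 1, C) -> one filter per channel.
    return jax.lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding='SAME',
        dimension_numbers=('NHWC', 'HWIO', 'NHWC'),
        feature_group_count=x.shape[-1])

def multi_rf_residual_block(x, params):
    # params is a hypothetical dict of weights; W1/b1 and W2/b2 are the two
    # fully connected layers FC1 and FC2, k3/k5/k7 the depthwise kernels.
    h = x @ params['W1'] + params['b1']                       # FC1
    x1, x2, x3 = jnp.split(h, 3, axis=-1)                     # channel split
    y = jnp.concatenate([depthwise_conv(x1, params['k3']),    # 3x3 receptive field
                         depthwise_conv(x2, params['k5']),    # 5x5 receptive field
                         depthwise_conv(x3, params['k7'])],   # 7x7 receptive field (k = 7)
                        axis=-1)
    y = jnp.concatenate([y, h], axis=-1)                      # residual: [Conv(x), identity(x)]
    return y @ params['W2'] + params['b2']                    # FC2 maps back to target channels

# Hypothetical shapes: B=2, 16x16 patches, C=48 channels (divisible by 3).
B, H, W, C = 2, 16, 16, 48
key = jax.random.PRNGKey(0)
params = {
    'W1': jax.random.normal(key, (C, C)) * 0.02, 'b1': jnp.zeros(C),
    'k3': jax.random.normal(key, (3, 3, 1, C // 3)) * 0.02,
    'k5': jax.random.normal(key, (5, 5, 1, C // 3)) * 0.02,
    'k7': jax.random.normal(key, (7, 7, 1, C // 3)) * 0.02,
    'W2': jax.random.normal(key, (2 * C, C)) * 0.02, 'b2': jnp.zeros(C),
}
out = multi_rf_residual_block(jnp.ones((B, H, W, C)), params)   # (B, H, W, C)
```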
In step S3, the fused multi-scale features are input into GPNR (Generalizable Patch-based Neural Rendering) for training. The GPNR model receives the fused multi-view features as input and generates the feature expression of the target view using neural rendering. By processing the input features, GPNR can effectively reconstruct and complete the information missing in the sparse scene and finally output a high-quality new view image under sparse conditions, providing complete and consistent feature support for subsequent synthesis and presentation. Reference: Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022a. Generalizable patch-based neural rendering, in: European Conference on Computer Vision, Springer, pp. 156–174.
In step S5, the weights and biases of the pre-trained model are loaded to initialize the new model, i.e. they serve as the initial parameters for the new scene, and the model is then quickly fine-tuned on the new scene data under sparse view angles. During fine-tuning, the parameters θ0 of the pre-trained model are used as initial values and the model parameters θ are updated by minimizing a weighted loss function expressed as:
L(θ)=Lpre(θ)+λ·Lnew(θ)
where Lpre(θ) represents the loss of the pre-trained model, Lnew(θ) represents the loss term on the new scene data, and λ weights the two terms. Under a small number of camera view angles, the parameter θ is updated by gradient descent and the model quickly adapts to the characteristics of the new scene, which accelerates training and rendering while improving the quality of new view synthesis.
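A minimal sketch of this fine-tuning update is shown below, assuming a toy parameter tree and placeholder loss terms loss_pre and loss_new standing in for the two components of the weighted loss; it is a plain gradient-descent step, not the full training pipeline.

```python
import jax
import jax.numpy as jnp

# Placeholder loss terms: loss_pre stands in for the pre-trained model's loss
# and loss_new for the penalty on the new sparse-scene data (both are toy
# assumptions, not the actual terms of the invention).
def loss_pre(theta, batch):
    pred = batch['x'] @ theta['W'] + theta['b']
    return jnp.mean((pred - batch['y']) ** 2)

def loss_new(theta, batch):
    return sum(jnp.sum(p ** 2) for p in jax.tree_util.tree_leaves(theta))

def finetune_step(theta, batch, lr=2e-5, lam=0.01):
    # theta is initialised from the pre-trained parameters theta_0 and updated
    # by gradient descent on the weighted loss L(theta) = Lpre + lam * Lnew.
    grads = jax.grad(lambda p: loss_pre(p, batch) + lam * loss_new(p, batch))(theta)
    return jax.tree_util.tree_map(lambda w, g: w - lr * g, theta, grads)

theta0 = {'W': jnp.zeros((8, 3)), 'b': jnp.zeros(3)}      # pre-trained weights (toy)
batch = {'x': jnp.ones((4, 8)), 'y': jnp.ones((4, 3))}    # new-scene data (toy)
theta = finetune_step(theta0, batch)
```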
The beneficial effects are that:
Compared with the prior art, the invention has the following advantages:
The invention provides a sparse new view synthesis method based on multi-scale feature fusion, which combines multi-scale feature processing with a residual mechanism to extract depth features. Compared with the prior art, the method makes full use of the multi-scale features and effectively combines them with the residual structure. Introducing multi-receptive-field convolution and residual connections into the multi-scale feature fusion strengthens the feature expression capability, allows the model to capture depth information more accurately, and improves the rendering quality and global consistency in sparse view scenes. In addition, key layers of the pre-trained model are selectively unfrozen for fine-tuning, so that the model quickly adapts to sparse view scenes, markedly improving training efficiency and adaptation capability.
Detailed Description
The invention is further described below with reference to specific examples and figures:
Example 1
Task definition
Assume that a set of sparsely sampled multi-view images is given as input, where each image corresponds to a different camera view angle, and the camera pose, including the intrinsic and extrinsic camera matrices, is accurately computed with the COLMAP tool. The camera pose provides the position and orientation of the camera in three-dimensional space and determines the view angle of the input image; by adjusting the camera pose, images observed from different angles and directions can be generated. The invention aims to achieve new view reconstruction from sparse input images by means of neural rendering and to markedly improve the reconstruction quality.
The invention discloses a sparse new view angle synthesis method based on multi-scale feature fusion, which comprises the following steps:
(1) Multi-scale reference point generation and feature sampling
First, multi-scale spatial shapes are generated according to the resolution (H, W) of the input image, for the subsequent segmentation and reshaping of the multi-scale features.
Defining the spatial shape of the multi-scale feature map as:
(Hi,Wi)=[(H×2,W×2),(H,W),(H/2,W/2)]
These spatial shapes (Hi,Wi) describe the size of each scale feature map, high resolution, medium resolution, low resolution, respectively.
For each feature map scale, reference points are generated by a uniform grid for spatial localization of the multi-scale features. The reference point coordinates xi and yi are defined as:
xi=[0.5,1.5,2.5,…,Wi-0.5],yi=[0.5,1.5,2.5,…,Hi-0.5]
A two-dimensional grid is generated from xi and yi and flattened to one dimension, giving, for scale i, the x and y coordinates of the reference points:
(x,y), x∈{0.5,1.5,…,Wi−0.5}, y∈{0.5,1.5,…,Hi−0.5}
All reference point coordinates are normalized to [0,1] to obtain the normalized coordinates:
x̂=x/Wi, ŷ=y/Hi
The reference points of all scales are then combined to form a unified reference point set. Let Nscales=3 correspond to the three scales; the final set of reference points is:
P={P1,P2,P3}
where Pi (i=1, 2, 3) denotes the normalized reference points of the i-th scale.
Next, the starting index of each scale is calculated to locate the position of each scale's features within the flattened feature matrix. The starting indices are computed as:
L0=0, Li=Li-1+Hi×Wi
The flattened features are then mapped back to the multi-scale feature maps using these reference points and the starting indices. Given the flattened feature matrix C∈R^(B×N×C), B is the batch size, N is the total number of reference points over all scales, and C is the number of channels.
For each scale i, extracting a corresponding feature segment:
starti=Li-1
endi=Li
Ci=C[:,starti:endi,:]
Then, multi-scale feature fusion is carried out on the feature maps of all scales to generate a multi-scale feature pyramid:
Pyramid={C1,C2,C3}
and then input into a feature extraction network for subsequent multi-receptive field feature extraction and fusion.
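As a concrete illustration of the starting-index computation and per-scale slicing above, a short JAX sketch follows; the resolutions, batch size and channel count are hypothetical.

```python
import jax.numpy as jnp

# Hypothetical resolutions for the three scales (2H x 2W, H x W, H/2 x W/2).
H, W, C, B = 16, 16, 32, 1
spatial_shapes = [(2 * H, 2 * W), (H, W), (H // 2, W // 2)]

# Starting index of each scale inside the flattened feature matrix:
# L_0 = 0, L_i = L_{i-1} + H_i * W_i.
lengths = [h * w for (h, w) in spatial_shapes]
starts = [0]
for n in lengths:
    starts.append(starts[-1] + n)

feat = jnp.ones((B, sum(lengths), C))        # flattened features, shape (B, N, C)

# Extract the feature segment of each scale and reshape it back to (B, H_i, W_i, C).
pyramid = [feat[:, starts[i]:starts[i + 1], :].reshape(B, h, w, C)
           for i, (h, w) in enumerate(spatial_shapes)]
```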
(2) Multi-receptive field residual feature extraction
A multi-receptive-field residual feature extraction module is applied to the multi-scale feature pyramid to further extract and fuse depth features. First, the input features are linearly transformed using a fully connected layer:
Xi=FC1(Xi)
For each scale feature Xi (i=1, 2, 3), Xi is first divided evenly into two parts along the channel dimension:
Xi=[Xi1,Xi2]
Here, Xi1 and Xi2 denote the two sub-tensors after the split, each containing half of the channels of Xi, i.e. the split is made evenly along the channel dimension.
A depth separable convolution of different kernel sizes is then applied to each sub-tensor:
Yi1=Conv3×3(Xi1)
Yi2=Conv5×5(Xi2)
where Convk×k represents a depthwise separable convolution with a kernel size of k×k.
The convolved features are then stitched in the channel dimension:
Yi=[Yi1,Yi2]
note that the stitching operation is performed here along the channel dimension.
An activation function (GELU) and a batch normalization operation are then applied: the GELU activation function is first applied to the spliced features to enhance the nonlinear expression capability, followed by batch normalization:
Zi=BN(GELU(Yi))
where the activation function GELU is defined as:
GELU(x)=x·Φ(x)
with Φ(·) the cumulative distribution function of the standard normal distribution. The batch normalization is mathematically expressed as:
BN(Yi)=γi·(Yi−μi)/√(σi²+ε)+βi
where μi denotes the mean of the i-th scale feature, σi² denotes its variance, ε is a small constant for numerical stability, and γi and βi are trainable parameters with initial values γi=1, βi=0.
The processed features are then stitched with the initial input features in the channel dimension (residual connection):
Z′i=[Zi,Xi]
finally, the features of all dimensions are spliced along the channel dimension to form a final feature representation:
Z=[Z′1,Z′2]
At this time, Z∈R^(B×H×W×2C), where B is the batch size, H and W are the height and width of the feature map, respectively, and C is the number of channels. The last layer is the second fully connected layer, which maps the features back to the output dimension:
Zout=FC2(Z)
the second fully-connected layer functions to map the high-dimensional features of the network to the desired output dimensions for subsequent feature aggregation tasks.
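The sketch below mirrors the Example 1 variant of this module: a two-way channel split with 3×3 and 5×5 depthwise convolutions, GELU activation, batch normalization with initial values γ=1 and β=0, and the residual concatenation Z′i=[Zi,Xi]; the kernel values and tensor shapes are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def depthwise_conv(x, kernel):
    # Depthwise separable convolution; kernel shape (k, k, 1, channels).
    return jax.lax.conv_general_dilated(
        x, kernel, (1, 1), 'SAME',
        dimension_numbers=('NHWC', 'HWIO', 'NHWC'),
        feature_group_count=x.shape[-1])

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Batch normalization with the initial values gamma=1, beta=0 from the text.
    mu = jnp.mean(x, axis=(0, 1, 2), keepdims=True)
    var = jnp.var(x, axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / jnp.sqrt(var + eps) + beta

def example1_block(x, k3, k5):
    # Split X_i into two halves along the channel dimension, apply 3x3 / 5x5
    # depthwise convolutions, GELU and BN, then concatenate with the input
    # as the residual connection Z'_i = [Z_i, X_i] (channels double to 2C).
    x1, x2 = jnp.split(x, 2, axis=-1)
    y = jnp.concatenate([depthwise_conv(x1, k3), depthwise_conv(x2, k5)], axis=-1)
    z = batch_norm(jax.nn.gelu(y))
    return jnp.concatenate([z, x], axis=-1)

B, H, W, C = 1, 16, 16, 32
key = jax.random.PRNGKey(0)
z_prime = example1_block(jnp.ones((B, H, W, C)),
                         jax.random.normal(key, (3, 3, 1, C // 2)) * 0.02,
                         jax.random.normal(key, (5, 5, 1, C // 2)) * 0.02)
```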
(3) Transformer-based multi-view feature aggregation model
In new view synthesis, the GPNR model achieves new view image synthesis under sparse view conditions through Transformer-based multi-view feature aggregation. Its core idea is to use a multi-head self-attention mechanism to capture the global relations among different view angles and to gradually aggregate depth and feature information in combination with geometric consistency modeling. Reference: Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022a. Generalizable patch-based neural rendering, in: European Conference on Computer Vision, Springer, pp. 156–174.
The method inputs the obtained depth features into the GPNR network, which performs multi-view feature aggregation and new view image generation:
Foutput=GPNR(Zout)
Where Foutput is a feature representation used to predict the target view, which can be used to further generate an image of the new view.
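GPNR itself is described in the cited paper; purely as an illustration of the kind of attention-based multi-view aggregation relied on here, a single-head self-attention sketch over per-view feature vectors is given below. The projection matrices, mean pooling step and dimensions are assumptions and do not reproduce the GPNR architecture.

```python
import jax
import jax.numpy as jnp

def attention_aggregate(view_feats, Wq, Wk, Wv):
    # view_feats: (num_views, D) fused features, one row per reference view.
    # Single-head self-attention capturing global dependencies between views,
    # followed by mean pooling into one target-view feature.
    q, k, v = view_feats @ Wq, view_feats @ Wk, view_feats @ Wv
    attn = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    aggregated = attn @ v                      # (num_views, D)
    return aggregated.mean(axis=0)             # pooled target-view feature

D, V = 64, 6                                   # hypothetical: 6 input views
key = jax.random.PRNGKey(0)
Wq, Wk, Wv = (jax.random.normal(key, (D, D)) * 0.02 for _ in range(3))
f_target = attention_aggregate(jnp.ones((V, D)), Wq, Wk, Wv)
```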
(4) Generating a target image
The target feature representation output by GPNR is input into a multi-layer perceptron (MLP) to predict the color of the target ray:
ĉ=MLP(Foutput)
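A minimal sketch of such an MLP color head is shown below; the two-layer structure, hidden width and sigmoid output range are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def mlp_color_head(f, params):
    # Two-layer MLP predicting an RGB color in [0, 1] for one target ray.
    h = jax.nn.relu(f @ params['W1'] + params['b1'])
    return jax.nn.sigmoid(h @ params['W2'] + params['b2'])

D = 64
key = jax.random.PRNGKey(0)
params = {'W1': jax.random.normal(key, (D, 128)) * 0.02, 'b1': jnp.zeros(128),
          'W2': jax.random.normal(key, (128, 3)) * 0.02, 'b2': jnp.zeros(3)}
rgb = mlp_color_head(jnp.ones(D), params)      # predicted ray color, shape (3,)
```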
(5) Loss function design
In order to improve model performance, GPNR designs fine loss and regularization loss based on color supervision, enhances the fitting capability to the target scene color and texture, and simultaneously suppresses the risk of over-fitting:
The fine loss Lfine measures the color difference between the image predicted by the model and the real image and is defined as:
Lfine=(1/N)Σi‖p̂i−pi‖²
where p̂i is the predicted pixel color value, pi is the true pixel color value, and N is the total number of pixels in the image.
The regularization loss Lreg reduces the risk of overfitting by limiting the magnitude of the model weights and is defined as:
Lreg=λΣ‖w‖²
where λ is the regularization coefficient, set here to 0.01, and w denotes a trainable weight of the model.
The final total loss function is:
L=α·Lfine+Lreg
where α is a weight factor, set to 100 in the invention, to balance the relative importance of the fine loss and the regularization loss.
The loss design effectively improves the new view angle image synthesis quality of the model under the sparse view angle condition, and simultaneously ensures the training stability and generalization capability of the model.
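The loss terms above can be sketched as follows; the tensor shapes are toy values, while λ=0.01 and α=100 follow the text.

```python
import jax
import jax.numpy as jnp

def fine_loss(pred, target):
    # Mean squared color error over the N pixels of the rendered image.
    return jnp.mean(jnp.sum((pred - target) ** 2, axis=-1))

def reg_loss(params, lam=0.01):
    # L2 penalty on all trainable weights, lambda = 0.01 as in the text.
    return lam * sum(jnp.sum(w ** 2) for w in jax.tree_util.tree_leaves(params))

def total_loss(pred, target, params, alpha=100.0):
    # Weighted combination with alpha = 100 as in the text.
    return alpha * fine_loss(pred, target) + reg_loss(params)

pred = jnp.full((1024, 3), 0.5)                # predicted pixel colors (toy)
target = jnp.full((1024, 3), 0.6)              # ground-truth pixel colors (toy)
params = {'w': jnp.ones((8, 8))}               # toy parameter tree
loss = total_loss(pred, target, params)
```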
(6) Fine-tuning of the pre-trained model
The pre-trained GPNR model, a general model obtained by pre-training on a large-scale dataset, is taken as the starting point, and the invention then fine-tunes it in the target scene using a small number of images with known view angles and the corresponding geometric information. During fine-tuning, the weights of part of the high-level feature aggregation modules and of the geometric consistency modeling in the pre-trained model are frozen, and only the parameters of the low-level feature extraction module are trained to adapt to the color and texture characteristics of the target scene, by minimizing the weighted loss:
L(θ)=Lpre(θ)+λ·Lnew(θ)
where Lpre(θ) represents the loss of the pre-trained model and Lnew(θ) represents the loss term on the new scene data. Under a small number of camera view angles, the parameter θ is updated by gradient descent and the model quickly adapts to the characteristics of the new scene, which accelerates training and rendering while improving the quality of new view synthesis for the scene.
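A minimal sketch of such partial freezing is given below; grouping parameters by a name prefix ('low_level' vs. 'high_level') is an illustrative assumption, and in practice the split would follow the actual module structure of the pre-trained model.

```python
import jax.numpy as jnp

def finetune_frozen(params, grads, lr=2e-5, trainable_prefix='low_level'):
    # Update only the low-level feature-extraction parameters; parameters of the
    # high-level aggregation / geometry modules keep their pre-trained values.
    return {name: (w - lr * g if name.startswith(trainable_prefix) else w)
            for (name, w), g in zip(params.items(), grads.values())}

params = {'low_level/W': jnp.ones((4, 4)), 'high_level/W': jnp.ones((4, 4))}
grads = {'low_level/W': jnp.ones((4, 4)), 'high_level/W': jnp.ones((4, 4))}
new_params = finetune_frozen(params, grads)    # only 'low_level/W' changes
```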
Example 2
As shown in fig. 1-2, the invention provides a sparse new view angle synthesis method based on multi-scale feature fusion, which comprises the following steps:
Step 1, pre-training of the new view angle synthesis model
Preparing diversified image data sets, preprocessing, training the model on a large-scale general data set, measuring errors by using a loss function, updating weights by back propagation, repeating training until convergence, combining iterative training of multiple data sets to improve generalization capability, and finally storing optimized model weights to provide general features and initial parameters for a sparse new view angle synthesis task.
Step 2, multi-scale reference point generation and feature sampling
Firstly, a multi-scale reference point set is generated according to the resolution of the input sparse view images, and feature sampling is performed. In this process, multi-scale spatial shapes, e.g. (2H, 2W), (H, W), (H/2, W/2), provide a resolution basis for the feature maps of the different scales. A two-dimensional reference point grid is generated with a grid generation method, and the reference point coordinates are normalized into the range [0,1] to form the spatial information that guides sampling. Then, based on the reference points, feature sampling is carried out on the feature maps of different resolutions, extracting multi-scale features at high, medium and low resolution. These features are fused into a unified feature representation through a splicing operation, providing a foundation for subsequent processing.
Step 3, extracting residual characteristics of multiple receptive fields
In a multi-scale feature fusion stage, the invention designs a feature processing module based on a multi-receptive field convolution and residual error mechanism. The multi-receptive field convolution module extracts local and global context information by using depth convolution operations of different convolution kernels such as 3×3, 5×5, k×k and the like aiming at high, medium and low resolution features. After each convolution operation, the information integrity of the input features is maintained by adding residual connections and fused with the convolved features. The fused multi-scale features are further processed through an activation function and a dimension reduction operation, and finally remodeled into unified feature representation, so that the method has more abundant expression capability and lays a solid foundation for generating new view angle images.
Step 4, training a Transformer model based on multi-view feature aggregation
In the training stage based on multi-view feature aggregation, the GPNR model is introduced, with a Transformer as its core module, to progressively achieve efficient aggregation and fusion of multi-view information. The fused multi-scale features are first fed into a self-attention mechanism to extract the global correlations among features from different view angles. Then, through geometric modeling, the multi-view features are further integrated along the epipolar dimension, capturing depth consistency information along the ray directions. In the stage of fusing the reference view features, the invention adopts a weighted summation mechanism that generates the final target ray features according to the attention weight distribution, which are used to reconstruct the features of the target view.
Step 5, fine-tuning of the pre-trained model:
In the fine-tuning stage of the pre-trained model, the method uses an optimization approach based on transfer learning to accelerate training and rendering of new scenes. In the pre-training stage, the GPNR model is trained on large-scale multi-view image data and learns to fuse and reconstruct multi-view features. For a specific sparse view scene, the model parameters are further updated by fine-tuning so that they better fit the specific scene data. During fine-tuning, two loss functions are used, namely the fine prediction loss Lfine and the regularization loss Lreg, so that model overfitting is avoided. In practice, only a small number of parameters are fine-tuned, which greatly increases training speed while markedly improving the quality of the generated results.
Step 6, outputting the rendered new view angle image:
And finally, combining the target ray characteristics generated by the processing, and inputting the target ray characteristics into a rendering module to generate a new view angle image. Specifically, the model reconstructs features under sparse viewing angle conditions based on color prediction of the target rays, forms pixel values under the target viewing angle, and combines all rays to generate a complete new viewing angle image. According to the invention, through the combination of multi-scale feature fusion and multi-view information modeling, high-quality generation of new view images is realized under the sparse view condition, the generalization capability and the generation efficiency of the model are remarkably improved, the density requirement of view sampling is reduced, and the wide practical application value is realized.
In this embodiment, step 1 only needs to be executed once, steps 2 to 4 form an iterative process, and the new view angle image is finally obtained through step 5. The stopping condition of the iteration is either 20 hours of training on an NVIDIA GeForce RTX 2070 GPU, or the overall loss function reaching a threshold of 3×10−3 within 1000 iterations.
Experimental results
1. Data set
The experiments employ the Co3D dataset, which contains a rich set of categories and objects and is therefore widely used for multi-view image generation and rendering tasks. In the experimental design, we selected a variety of scenes, each using six images taken at different angles for training, rendering and evaluation. Each rendering generates three images at different view angles, and the rendering results are then compared with the real images. See: Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C., 2017. Transformation-grounded image generation network for novel 3D view synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3500–3509.
To ensure the accuracy of image generation, we employ COLMAP tools to generate the camera pose of the image. The camera pose determines from what angle the scene is viewed and directly affects the viewing angle and quality of the rendered image. By adjusting the camera pose, we have realized generating images from multiple perspectives, thereby testing the generalization ability and rendering performance of the model on multi-perspective generation.
2. Experimental setup
The experiments are developed on the JAX framework and trained with the Adam optimizer, with the initial learning rate set to η0=2×10−5. The training time for each scene is approximately 20 hours to ensure optimal performance under limited resource conditions. We run the experiments on an NVIDIA GeForce RTX 2070 GPU with the batch size set to 8. In each scene, six images at uniformly distributed view angles are selected as input, and three images at new view angles are generated.
The experiments select Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) as the main evaluation metrics for measuring the quality of the generated images. These metrics comprehensively reflect the accuracy, structural similarity and visual quality of the generated images relative to the real images.
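For reference, PSNR can be computed as in the short sketch below (images assumed to be scaled to [0, 1]); SSIM and LPIPS are normally taken from standard library implementations.

```python
import jax.numpy as jnp

def psnr(pred, target, max_val=1.0):
    # Peak Signal-to-Noise Ratio in dB for images scaled to [0, max_val].
    mse = jnp.mean((pred - target) ** 2)
    return 10.0 * jnp.log10(max_val ** 2 / mse)

print(psnr(jnp.full((64, 64, 3), 0.5), jnp.full((64, 64, 3), 0.55)))  # ~26 dB
```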
3. Performance comparison
We compare the baseline model, the complete model of the invention, and recent state-of-the-art methods, as follows:
GPNR [Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022a. Generalizable patch-based neural rendering, in: European Conference on Computer Vision, Springer, pp. 156–174.]: a neural radiance field model based on local patch features, which adopts a patch-based neural rendering method to decompose a complex three-dimensional scene into many small patches for rendering.
LFNR [Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022b. Light field neural rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8269–8279.]: a neural rendering model based on a light field representation, aimed at sparse view angles; by introducing the light field formulation it directly models the relation between view angles and rays, avoids the complex volume rendering computation of conventional NeRF, and achieves faster rendering together with better new view synthesis.
WAH [Bao, Y., Li, Y., Huo, J., Ding, T., Liang, X., Li, W., Gao, Y., 2023. Where and how: Mitigating confusion in neural radiance fields from sparse inputs, in: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2180–2188.]: a window-aware hashing acceleration method for neural radiance fields, focused on optimizing training efficiency and inference performance in sparse sampling scenarios. By introducing a window-aware hashing mechanism, WAH effectively captures local features while avoiding the resource waste of conventional global hashing methods under sparse view conditions, thereby achieving efficient scene modeling and fast new view rendering.
(1) Comparison between the method of the invention and other methods in different scenes
Table 1 Comparison of the performance of different methods in different scenes
Comparing the characteristics and targets of all methods, every metric of the proposed method is superior to those of the other comparison methods in all scenes, and the rendered images are more realistic and sharper. As shown in Table 1, the method of the invention achieves a significant performance improvement in the various scenes, performing especially well on the PSNR and SSIM metrics. Compared with the other methods (LFNR, WAH, GPNR), the method of the invention improves the PSNR by 4-14 dB on average and reaches higher SSIM values, while the LPIPS value is markedly reduced, which shows clear advantages in detail preservation, structure restoration and perceptual quality of the rendered images. These results demonstrate the excellent performance of the method in image rendering tasks, achieving higher image quality and consistency in a variety of scenes.
As shown in FIG. 3, in the three scenes of plants, books and bears, the method of the invention is compared with GPNR, LFNR and WAH in terms of the rendered images. In the plant scene, the proposed method clearly shows the texture and edges of the leaves, while the other methods are blurrier; with LFNR and WAH in particular, the leaf details and the true texture of the background can hardly be seen. In the book scene, the proposed method accurately restores the details of the cover text and pages, while the other methods fall short in detail and color saturation: GPNR renders detail and color poorly and the text is not clear enough, LFNR is severely blurred and distorted, and the WAH rendering loses almost all details and is hard to recognize. In the bear scene, the proposed method reproduces the real texture of the fur and accurately restores the complete photo frame in the background, whose edges and content are clearly visible. These experimental results show that the method has clear advantages in improving image rendering quality; in particular, with sparse inputs it better preserves global information while improving local detail quality.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.