CN119762358B - Sparse new view image synthesis method based on multi-scale feature fusion - Google Patents

Sparse new view image synthesis method based on multi-scale feature fusion

Info

Publication number
CN119762358B
Authority
CN
China
Prior art keywords
features
feature
scale
model
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411833524.9A
Other languages
Chinese (zh)
Other versions
CN119762358A (en)
Inventor
刘婧雯
蔡敏捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202411833524.9A
Publication of CN119762358A
Application granted
Publication of CN119762358B
Legal status: Active (Current)
Anticipated expiration


Abstract

Translated from Chinese


The present invention discloses a sparse new-view image synthesis method based on multi-scale feature fusion, belonging to the field of computer vision and image generation. It specifically includes the following steps: S1, multi-scale reference point generation and feature sampling; S2, multi-receptive-field residual feature extraction; S3, attention-network-based feature aggregation and image generation; S4, pre-training of the novel-view synthesis model; S5, fine-tuning of the pre-trained model. Unlike existing techniques, the invention proposes a new feature fusion strategy that combines multi-scale features with residual feature extraction, and introduces attention-based feature aggregation for efficient generation of new-view images. In addition, it proposes an adaptive optimization mechanism based on pre-training and transfer learning, which accelerates training in sparse scenes and improves the quality and efficiency of the synthesized results. The method significantly improves new-view image synthesis in sparse scenes.

Description

Sparse new view image synthesis method based on multi-scale feature fusion
Technical Field
The invention relates to the fields of computer vision and computer graphics, in particular to a sparse-view neural rendering method based on deep convolutional features and a Vision Transformer (ViT). The method is mainly used for generating high-quality images when only limited view data is available, and is widely applicable to Virtual Reality (VR), Augmented Reality (AR), film and television special effects, digital twins, the metaverse and other fields.
Background
Novel view synthesis is an important problem in computer vision and graphics, aiming to generate high-quality images of new viewpoints from a limited set of input views. Conventional methods typically rely on geometric modeling and explicit scene representations, such as multi-view geometry or voxel modeling, which require explicit depth information and scene structure. However, relying on explicit geometry is severely limited in real scenes; in particular, when views are sparse or geometric information is missing, the generated views often lack realism.
In recent years, implicit representations such as neural radiance fields (NeRF) have provided a new paradigm for novel view synthesis. NeRF fits the color and density distribution of a scene with a neural network and learns an implicit representation of the radiance field from two-dimensional images, achieving high-quality three-dimensional scene rendering without explicit geometric information. It can generate novel-view images with strong realism and greatly improves the accuracy of novel view synthesis. However, NeRF still faces two main problems: it depends on a large number of input views and performs poorly under sparse-view conditions, and its computational cost is extremely high, particularly at inference time, where many samples along each ray are needed to accumulate color and density information, making real-time rendering difficult. Reference: Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis [J]. Communications of the ACM, 2021, 65(1): 99-106.
To further improve novel view synthesis, many researchers have introduced advanced network architectures and optimization strategies. For example, some methods enhance rendering quality by aggregating the features of neighboring views, while others model large-scale scenes with a Transformer's self-attention mechanism. However, these methods are still limited on sparse-view data, mainly because they cannot adequately capture local and global spatial information, so the rendered results lack accuracy in the details. Novel view synthesis therefore still faces the challenge of generating high-quality, highly realistic images from sparse view data, especially in terms of local detail, global consistency and computational efficiency. Research in this area is important for virtual reality, film production, autonomous driving and other applications.
Disclosure of Invention
The invention aims to provide a sparse novel-view synthesis method based on multi-scale feature fusion that greatly improves the processing efficiency and synthesis quality for sparse multi-view data, markedly reduces the model's dependence on the number of input views, preserves important details and global structure in the image, and improves the fidelity and accuracy of the generated novel views.
A sparse new view image synthesis method based on multi-scale feature fusion comprises the following steps:
S1, multi-scale reference point generation and feature sampling:
Reference points at different scales are generated according to the resolution of the input image; these points are used to sample the corresponding multi-scale feature maps, and the features extracted at each scale are concatenated to obtain the initial feature information.
S2, multi-receptive-field residual feature extraction:
Using multi-receptive-field convolution and a residual connection mechanism, depthwise convolutions with different kernel sizes are applied to the sampled initial feature information; the multiple receptive fields capture information at different scales, while residual connections preserve the original features and fuse the features at each scale, finally producing the multi-receptive-field residual features.
S3, attention-network-based feature aggregation and image generation:
The fused multi-scale depth features (i.e. the residual features) are taken as input and processed layer by layer by the self-attention modules of a GPNR model; by capturing the global dependencies between different views, multi-view information is progressively aggregated and aligned, completing the deep fusion and expression of the features and generating a new-view image of the sparse scene.
S4, pre-training of the novel-view synthesis model:
The model is pre-trained on a large-scale general dataset: it is trained on a large number of dense images of different scenes, iterating repeatedly over different objects from multiple datasets so as to cover a wide range of visual scenes and object types, while the model weights are optimized with loss computation and back-propagation, so that the model learns general visual features and high-level semantic features suited to the sparse novel-view synthesis task.
S5, fine-tuning of the pre-trained model:
Further optimization is performed through transfer learning based on the pre-trained model: its parameters are transferred to the new scene, fine-tuned on the sparse-view data of that scene, and updated by back-propagation, which accelerates training and rendering of the new scene and makes the model better suited to the specific sparse scene.
The finally obtained model can be used to generate new-view images.
In step S1, reference points at several scales are first generated according to the resolution of the input image. The reference points are key points that mark pixel positions in the feature map; using an automatic selection method, the image is divided into grids of fixed size and the grid centre points serve as the reference points. After the coordinates are normalized, the reference point sets of the different scales are merged. Feature sampling is then performed on the corresponding multi-scale feature maps (C1, C2, C3): the index positions in the flattened feature matrix are located through the normalized coordinates of the reference points, and the feature values are extracted from those positions and mapped back to the multi-scale structure. These sampled multi-scale features are concatenated and fused into a unified feature representation C. This process is expressed as:
C=[Sample(C1),Sample(C2),Sample(C3)]
Wherein Sample (·) represents multi-scale feature sampling based on a reference point, and C contains the fused multi-scale features, providing abundant multi-scale information for subsequent processing.
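As a concrete illustration of step S1, the sketch below generates grid-centre reference points for one scale, normalizes them, gathers features from a flattened per-scale feature map, and concatenates the per-scale samples into the fused representation C. It is a minimal JAX sketch under assumed shapes with a nearest-neighbour gather; the function names (make_reference_points, sample_scale, fuse_multiscale) are illustrative and not taken from the patent.

```python
import jax.numpy as jnp

def make_reference_points(h, w):
    # grid-centre coordinates 0.5, 1.5, ..., normalised to [0, 1]
    ys = (jnp.arange(h) + 0.5) / h
    xs = (jnp.arange(w) + 0.5) / w
    gy, gx = jnp.meshgrid(ys, xs, indexing="ij")
    return jnp.stack([gx.ravel(), gy.ravel()], axis=-1)        # (h*w, 2)

def sample_scale(feat_flat, ref_pts, h, w):
    # feat_flat: (Hi*Wi, C) flattened feature map of one scale; nearest-neighbour gather
    ix = jnp.clip((ref_pts[:, 0] * w).astype(jnp.int32), 0, w - 1)
    iy = jnp.clip((ref_pts[:, 1] * h).astype(jnp.int32), 0, h - 1)
    return feat_flat[iy * w + ix]                              # (h*w, C)

def fuse_multiscale(feats, shapes):
    # feats: flattened per-scale maps [(Hi*Wi, C), ...]; shapes: [(Hi, Wi), ...]
    sampled = [sample_scale(f, make_reference_points(h, w), h, w)
               for f, (h, w) in zip(feats, shapes)]
    return jnp.concatenate(sampled, axis=0)    # C = [Sample(C1), Sample(C2), Sample(C3)]
```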
In step S2, the specific steps of extracting the residual features of the multiple receptive fields are as follows:
Firstly, the features C from the different scales are linearly transformed through a fully connected layer: different weights and biases are applied to each channel, the input features are re-weighted and recombined through matrix operations, the feature relations between channels are adjusted and reorganized, and a linearly transformed feature representation (i.e. the feature information) is generated:
FC1(C)=W·C+b
where C is the multi-scale feature generated in step S1, W and b are parameters obtained after training to help the model fit the data better, and FC1 represents the first fully connected layer (Fully Connected Layer).
Next, depthwise convolutions with multi-scale receptive fields are applied to the dimension-reduced features: convolution kernels of different sizes are used, the features are split along the channel dimension, and each part is convolved independently by a kernel with its own receptive field. The specific formula is:
Conv(x)=[Conv3×3(x1),Conv5×5(x2),Convk×k(x3)]
where x1, x2, x3 are the three parts of the input feature x split along the channel dimension (the channel dimension corresponds to the number of features extracted by the network); (x1, x2, x3) are then convolved with kernel sizes (3×3), (5×5) and (k×k) respectively to generate features at different scales. Conv denotes the convolution operation, and k is a user-defined value, set to 7 in this invention.
After the convolution at each scale, a residual connection is added: the convolved features are combined with the original input features while keeping the channels of the original input unchanged, fusing the convolved and input features; the two feature representations are merged through the residual connection:
x=[Conv(x),identity(x)]
Where identity (x) indicates that the input feature x is passed directly, without any manipulation, i.e. the input feature is preserved.
The fused multi-scale features are expanded back to the original channel number through a linear projection: a fully connected layer applies a weight matrix and bias to the fused features, producing a feature representation with the same channel number as the input features:
F=FC2([x1,x2,x3])
Where FC2 represents a second fully connected layer that maps the spliced multi-scale features back to the target channel number.
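The following is a hedged JAX sketch of the multi-receptive-field residual block described above (FC1, a three-way channel split with 3×3 / 5×5 / 7×7 depthwise convolutions, residual concatenation with the untouched features, and FC2). The parameter shapes, SAME padding, and the requirement that the channel count be divisible by three are assumptions made for illustration, not details taken from the patent.

```python
import jax.numpy as jnp
from jax import lax

def depthwise_conv(x, kernel):
    # x: (B, H, W, C); kernel: (k, k, 1, C) -> per-channel (depthwise) convolution
    return lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"),
        feature_group_count=x.shape[-1])

def multi_rf_residual(x, params):
    # FC1: linear transform, FC1(C) = W*C + b; params["w1"]: (Cin, C)
    h = x @ params["w1"] + params["b1"]
    # split along the channel dimension (C must be divisible by 3 in this sketch)
    x1, x2, x3 = jnp.split(h, 3, axis=-1)
    # multi-receptive-field depthwise convolutions: 3x3, 5x5, kxk with k = 7
    conv = jnp.concatenate([depthwise_conv(x1, params["k3"]),
                            depthwise_conv(x2, params["k5"]),
                            depthwise_conv(x3, params["k7"])], axis=-1)
    # residual connection: keep the untouched features alongside the convolved ones
    fused = jnp.concatenate([conv, h], axis=-1)                # 2C channels
    # FC2 maps the concatenated features back to the target channel count
    return fused @ params["w2"] + params["b2"]                 # params["w2"]: (2C, Cout)
```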
In step S3, the fused multi-scale features are input GPNR (Generalizable patch-based neural rendering) for training. The GPNR model is used for receiving multi-view fusion characteristics as input and generating characteristic expression of a target view by utilizing a nerve rendering technology. By processing the input features, GPNR can effectively reconstruct and complement the missing information in the sparse scene, and finally output a high-quality new view image under the sparse condition, thereby providing complete and consistent feature support for subsequent synthesis and presentation. Reference paper :Suhail,M.,Esteves,C.,Sigal,L.,Makadia,A.,2022a.Generalizable patch-based neural rendering,in:European Conference on Computer Vision,Springer.pp.156–174.
In step S5, the pre-trained model is used to initialize the new model: its weight and bias parameters are loaded and set as the initial parameters for the new scene, after which fast fine-tuning is performed on the sparse-view data of the new scene. During fine-tuning, the parameters θ0 of the pre-trained model serve as initial values, and the model parameters are updated by minimizing a weighted loss function that combines the loss of the pre-trained model with a loss term for the new-scene data. Under a small number of camera views, the parameters θ are updated by gradient descent, allowing the model to adapt quickly to the new scene, which accelerates training and rendering while improving the quality of novel-view synthesis for the new scene.
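A minimal sketch of the fine-tuning update in step S5 is given below: the pre-trained parameters θ0 are the starting point and a weighted combination of the pre-training loss and the new-scene loss is minimized by gradient descent. The convex weighting with coefficient beta and the helpers pretrain_loss and new_scene_loss are illustrative placeholders, since the description only states that the two terms enter a weighted loss.

```python
import jax

def weighted_loss(theta, batch, beta=0.5):
    # L(theta) = (1 - beta) * L_pre(theta) + beta * L_new(theta); beta is an assumed weight
    return (1.0 - beta) * pretrain_loss(theta, batch) + beta * new_scene_loss(theta, batch)

def finetune_step(theta, batch, lr=2e-5):
    # one gradient-descent step starting from the pre-trained parameters theta0
    loss, grads = jax.value_and_grad(weighted_loss)(theta, batch)
    theta = jax.tree_util.tree_map(lambda p, g: p - lr * g, theta, grads)
    return theta, loss
```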
The beneficial effects are that:
compared with the prior art, the invention has the following advantages:
The invention provides a sparse novel-view synthesis method based on multi-scale feature fusion that combines multi-scale feature processing with a residual mechanism to extract depth features. Compared with the prior art, the method makes full use of multi-scale characteristics and effectively combines them with a residual structure. Multi-receptive-field convolution and residual connections are introduced into the multi-scale feature fusion, which strengthens the feature expression capability, lets the model capture depth information more accurately, and improves rendering quality and global consistency in sparse-view scenes. In addition, selected key layers of the pre-trained model are unfrozen for fine-tuning, so the model adapts quickly to sparse-view scenes, significantly improving training efficiency and adaptability.
Drawings
FIG. 1 is the network framework diagram of the sparse novel-view synthesis method based on multi-scale feature fusion of the present invention.
Fig. 2 is a flow chart of the method of the present invention.
FIG. 3 shows the comparison results of the present invention with the variant model and the most advanced method under different scenarios.
Detailed Description
The invention is further described below with reference to specific examples and figures:
Example 1
Task definition
Assume that a sparsely sampled set of multi-view images is given as input, each image corresponding to a different camera view; the camera pose, including the intrinsic and extrinsic camera matrices, is accurately computed with the COLMAP tool. The camera pose provides the position and orientation of the camera in three-dimensional space and determines the viewing angle of the input image; by adjusting the camera pose, images seen from different angles and directions can be generated. The invention aims to achieve novel-view reconstruction from sparse input images through a neural rendering method and to significantly improve the reconstruction quality.
The invention discloses a sparse new view angle synthesis method based on multi-scale feature fusion, which comprises the following steps:
(1) Multi-scale reference point generation and feature sampling
Firstly, according to the resolution H and W of the input image, the multi-scale spatial shapes are generated for the subsequent splitting and reshaping of the multi-scale features.
The spatial shapes of the multi-scale feature maps are defined as:
(Hi,Wi)=[(H×2,W×2),(H,W),(H/2,W/2)]
These spatial shapes (Hi, Wi) give the size of the feature map at each scale: high, medium and low resolution, respectively.
For each feature map scale, reference points are generated by a uniform grid for spatial localization of the multi-scale features. The reference point coordinates xi and yi are defined as:
xi=[0.5,1.5,2.5,…,Wi-0.5],yi=[0.5,1.5,2.5,…,Hi-0.5]
A two-dimensional grid is generated from these coordinates and flattened to one dimension, giving the x and y coordinates of every reference point at that scale.
All reference point coordinates are normalized by the corresponding width Wi and height Hi, yielding normalized coordinates in [0, 1].
The reference points of all scales are then combined into a unified reference point set; with Nscales = 3 scales, the final set gathers the normalized reference points of the three scales,
where i = 1, 2, 3 indexes the three scales.
Next, a starting index is computed for each scale to locate that scale's features within the flattened feature matrix. The starting indices are given by
L0 = 0, Li = Li-1 + Hi×Wi
(the recurrence follows from the per-scale slicing below, since scale i contributes Hi×Wi reference points). The flattened features are then mapped back to the multi-scale feature maps using these reference points and starting indices. Given a flattened feature matrix C ∈ R^(B×N×C), where B is the batch size, N is the total number of reference points, and C is the number of channels.
For each scale i, extracting a corresponding feature segment:
starti=Li-1
endi=Li
Ci=C[:,starti:endi,:]
Then, multi-scale feature fusion is performed on the feature maps of all scales to generate a multi-scale feature pyramid:
Pyramid={C0,C1,C2}
The pyramid is then fed into the feature extraction network for the subsequent multi-receptive-field feature extraction and fusion.
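For clarity, the per-scale slicing described above can be written as the following small sketch: the starting indices are the cumulative lengths Hi×Wi, each segment Ci = C[:, starti:endi, :] is cut out of the flattened matrix and reshaped back to its spatial shape. The integer spatial shapes and the reshape layout are assumptions of this illustration.

```python
import jax.numpy as jnp

def split_flattened(C_flat, spatial_shapes):
    # C_flat: (B, N, C); spatial_shapes: [(2H, 2W), (H, W), (H//2, W//2)] as integers
    lengths = [h * w for h, w in spatial_shapes]
    starts = [0]
    for n in lengths[:-1]:
        starts.append(starts[-1] + n)            # L_i = L_{i-1} + H_i * W_i
    pyramid = []
    for (h, w), s, n in zip(spatial_shapes, starts, lengths):
        seg = C_flat[:, s:s + n, :]              # C_i = C[:, start_i:end_i, :]
        pyramid.append(seg.reshape(C_flat.shape[0], h, w, -1))
    return pyramid                               # {C0, C1, C2}
```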
(2) Multi-receptive field residual feature extraction
A multi-receptive-field residual feature extraction module is applied to the multi-scale feature pyramid to further extract and fuse the depth features. First, the input features are linearly transformed by the fully connected layer:
Xi=FC1(Xi)
For each scale feature Xi (i = 1, 2, 3), Xi is first split evenly into two parts along the channel dimension:
Here, Xi1 and Xi2 denote the two sub-tensors after the split; each sub-tensor holds half of the channels, i.e. the split is even along the channel dimension.
A depth separable convolution of different kernel sizes is then applied to each sub-tensor:
Yi1=Conv3×3(Xi1)
Yi2=Conv5×5(Xi2)
Wherein Convk×k represents a depth separable convolution with a kernel size of k x k.
The convolved features are then stitched in the channel dimension:
Yi=[Yi1,Yi2]
Note that the concatenation here is performed along the channel dimension.
An activation function (GELU) and a batch normalization operation are then applied: the GELU activation is first applied to the concatenated features to enhance the nonlinear expressive capacity, followed by batch normalization:
Zi = BN(GELU(Yi))
where the activation function GELU is defined as GELU(x) = x·Φ(x), with Φ the cumulative distribution function of the standard normal distribution.
The batch normalization is computed as BN(Yi) = γi·(Yi − μi)/√(σi + ε) + βi,
where μi is the mean of the i-th scale feature, σi is its variance, ε is a small constant for numerical stability, and γi and βi are trainable parameters with initial values γi = 1, βi = 0.
The processed features are then stitched with the initial input features in the channel dimension (residual connection):
Z′i=[Zi,Xi]
finally, the features of all dimensions are spliced along the channel dimension to form a final feature representation:
Z=[Z′1,Z′2]
At this time, Z ∈ R^(B×H×W×2C), where B is the batch size, H and W are the height and width of the feature map, respectively, and C is the number of channels. The last layer is the second fully connected layer, which maps the features back to the output dimension:
Zout=FC2(Z)
the second fully-connected layer functions to map the high-dimensional features of the network to the desired output dimensions for subsequent feature aggregation tasks.
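The per-scale block of this example (two-way channel split, 3×3 and 5×5 depthwise convolutions, GELU, batch normalization, residual concatenation) can be sketched as follows. Inference-style batch normalization with precomputed statistics, an even channel count, and the parameter layout are simplifying assumptions of this illustration.

```python
import jax.numpy as jnp
from jax import lax, nn

def depthwise(x, kernel):
    # depthwise convolution with SAME padding; kernel: (k, k, 1, channels of x)
    return lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"),
        feature_group_count=x.shape[-1])

def per_scale_block(x, k3, k5, bn):
    # x: (B, H, W, C) with even C; k3: (3, 3, 1, C/2); k5: (5, 5, 1, C/2)
    x1, x2 = jnp.split(x, 2, axis=-1)                          # even split along channels
    y = jnp.concatenate([depthwise(x1, k3), depthwise(x2, k5)], axis=-1)
    z = nn.gelu(y)                                             # GELU activation
    z = bn["gamma"] * (z - bn["mean"]) / jnp.sqrt(bn["var"] + 1e-5) + bn["beta"]
    return jnp.concatenate([z, x], axis=-1)                    # Z'_i = [Z_i, X_i]
```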
(3) Transformer-based multi-view feature aggregation model
In novel view synthesis, the GPNR model achieves new-view image synthesis under sparse-view conditions through Transformer-based multi-view feature aggregation. Its core idea is to use a multi-head self-attention mechanism to capture the global relations among different views and to progressively aggregate depth and feature information in combination with geometric-consistency modeling. Reference: Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022. Generalizable patch-based neural rendering, in: European Conference on Computer Vision, Springer, pp. 156–174.
The method feeds the obtained depth features into the GPNR network, which performs multi-view feature aggregation and new-view image generation:
Foutput=GPNR(Zout)
Where Foutput is a feature representation used to predict the target view, which can be used to further generate an image of the new view.
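Since the description treats GPNR as an external module, the snippet below is only a generic single-head self-attention aggregation over per-view features, intended to illustrate the Transformer-style fusion step; it is not the actual GPNR architecture, and the shapes and projection matrices are assumptions.

```python
import jax.numpy as jnp
from jax.nn import softmax

def aggregate_views(view_feats, w_q, w_k, w_v):
    # view_feats: (V, D), one feature vector per reference view for a target ray
    q = view_feats @ w_q                                        # queries (V, Dk)
    k = view_feats @ w_k                                        # keys    (V, Dk)
    v = view_feats @ w_v                                        # values  (V, Dv)
    attn = softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)    # (V, V) attention weights
    fused = attn @ v                                            # attention-weighted fusion
    return fused.mean(axis=0)                                   # aggregated target-ray feature
```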
(4) Generating a target image
The target feature representation output by GPNR is fed into a multi-layer perceptron (MLP) to predict the color of the target ray, i.e. ĉ = MLP(Foutput).
(5) Loss function design
In order to improve model performance, GPNR designs fine loss and regularization loss based on color supervision, enhances the fitting capability to the target scene color and texture, and simultaneously suppresses the risk of over-fitting:
The fine loss, Lfine, measures the color difference between the image predicted by the model and the real image and is defined as
Lfine = (1/N) Σi ‖p̂i − pi‖²
where p̂i is the predicted pixel color value, pi is the true pixel color value, and N is the total number of pixels in the image.
The regularization loss, Lreg, reduces the risk of overfitting by limiting the magnitude of the model weights and is defined as
Lreg = λ Σw ‖w‖²
where λ is the regularization coefficient, set to 0.01 here, and w ranges over the trainable weights of the model.
The final total loss function is
Ltotal = α·Lfine + Lreg
where α is a weight factor, set to 100 in the present invention, to balance the relative importance of the fine loss and the regularization loss.
The loss design effectively improves the new view angle image synthesis quality of the model under the sparse view angle condition, and simultaneously ensures the training stability and generalization capability of the model.
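A compact sketch of these loss terms, as reconstructed from the description above, is shown below. The mean-squared form of the fine loss is an assumption consistent with the text; λ = 0.01 and α = 100 follow the stated values.

```python
import jax.numpy as jnp
from jax import tree_util

def fine_loss(pred_rgb, gt_rgb):
    # mean squared color error over the N pixels of the image
    return jnp.mean((pred_rgb - gt_rgb) ** 2)

def reg_loss(params, lam=0.01):
    # lambda times the sum of squared trainable weights
    return lam * sum(jnp.sum(w ** 2) for w in tree_util.tree_leaves(params))

def total_loss(params, pred_rgb, gt_rgb, alpha=100.0):
    # L = alpha * L_fine + L_reg
    return alpha * fine_loss(pred_rgb, gt_rgb) + reg_loss(params)
```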
(7) Fine-tuning of the pre-trained model
The pre-trained GPNR model serves as the basis; it is a general model obtained by pre-training on large-scale datasets. The invention then fine-tunes it in the target scene using a small number of images with known viewpoints and the corresponding geometric information. During fine-tuning, the weights of the high-level feature aggregation modules and of the geometric-consistency modeling in the pre-trained model are frozen, and only the parameters of the low-level feature extraction module are trained to adapt to the color and texture characteristics of the target scene. The objective again combines the loss of the pre-trained model with a loss term for the new-scene data; under a small number of camera views, the parameters θ are updated by gradient descent, so the model adapts quickly to the new scene, accelerating training and rendering while improving the quality of novel-view synthesis for the new scene.
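The selective freezing during fine-tuning can be sketched as a masked parameter update: gradients of the high-level aggregation and geometry modules are discarded, and only the low-level feature-extraction parameters are updated. The flat-dictionary parameter layout and the "low_level" name prefix are assumptions made purely for illustration.

```python
def masked_finetune_update(params, grads, lr=2e-5):
    # params / grads: flat dicts of arrays keyed by module name (assumed layout)
    return {name: p - lr * grads[name] if name.startswith("low_level") else p
            for name, p in params.items()}
```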
Example 2
As shown in fig. 1-2, the invention provides a sparse new view angle synthesis method based on multi-scale feature fusion, which comprises the following steps:
Step 1, pre-training of the novel-view synthesis model
A diversified image dataset is prepared and preprocessed, and the model is trained on a large-scale general dataset: errors are measured with the loss function, the weights are updated by back-propagation, and training is repeated until convergence. Iterative training over multiple datasets improves the generalization capability, and the optimized model weights are finally stored to provide general features and initial parameters for the sparse novel-view synthesis task.
Step 2, multi-scale reference point generation and feature sampling
Firstly, multi-scale reference points are generated according to the resolution of the input sparse-view images, and feature sampling is performed. In this process, the multi-scale spatial shapes, e.g. (H×2, W×2), (H, W), (H/2, W/2), provide the resolution basis for feature maps at different scales. A two-dimensional reference point grid is generated with a grid-generation method, and the reference point coordinates are normalized into the range [0, 1] to form the spatial information that guides sampling. Feature sampling is then carried out on the feature maps of different resolutions based on the reference points, extracting multi-scale features at high, medium and low resolution. The features are fused into a unified feature representation through a concatenation operation, providing the basis for subsequent processing.
Step 3, multi-receptive-field residual feature extraction
In the multi-scale feature fusion stage, the invention designs a feature processing module based on multi-receptive-field convolution and a residual mechanism. The multi-receptive-field convolution module extracts local and global context information using depthwise convolutions with different kernels, such as 3×3, 5×5 and k×k, for the high-, medium- and low-resolution features. After each convolution, residual connections preserve the information of the input features, which are fused with the convolved features. The fused multi-scale features are further processed through an activation function and a dimensionality-reduction operation and finally reshaped into a unified feature representation, giving richer expressive power and laying a solid foundation for generating new-view images.
Step 4, training the Transformer model based on multi-view feature aggregation
In the training stage based on multi-view feature aggregation, the GPNR model is introduced with a Transformer as its core module, progressively achieving efficient aggregation and fusion of multi-view information. The fused multi-scale features are first fed into the self-attention mechanism to extract the global correlations among the features of different views. Then, through geometric modeling, the multi-view features are further integrated along the epipolar-line dimension to capture depth-consistency information along the ray directions. In the stage of fusing the reference-view features, the invention adopts a weighted-summation mechanism that generates the final target-ray features according to the attention weight distribution, used to reconstruct the features of the target view.
Step 5, fine-tuning of the pre-trained model:
In the fine-tuning stage of the pre-trained model, the method uses a transfer-learning-based optimization to accelerate training and rendering of new scenes. In the pre-training stage, the GPNR model is trained on large-scale multi-view image data and learns the fusion and reconstruction of multi-view features. For a specific sparse-view scene, the model parameters are further updated by fine-tuning so that they better fit the specific scene data. During fine-tuning, two loss terms are used, namely the fine prediction loss and the regularization loss, which avoids model overfitting. In practice, only a small number of parameters are fine-tuned, so training speed is greatly improved and the quality of the generated results is noticeably better.
Step 6, outputting the rendered new view angle image:
Finally, the target-ray features generated above are fed into the rendering module to generate the new-view image. Specifically, the model reconstructs features under sparse-view conditions on the basis of the color prediction of the target rays, forms the pixel values at the target view, and combines all rays to produce a complete new-view image. Through the combination of multi-scale feature fusion and multi-view information modeling, the invention achieves high-quality generation of new-view images under sparse-view conditions, significantly improves the generalization ability and generation efficiency of the model, reduces the density requirement on view sampling, and has broad practical application value.
In this embodiment, step 1 only needs to be executed once, steps 2 to 4 form an iterative process, and the new-view image is finally obtained through step 5; the iteration stops after 20 hours on an NVIDIA GeForce RTX 2070 GPU, or when the overall loss function reaches a threshold of 3×10⁻³ within 1000 iterations.
Experimental results
1. Data set
The experiments use the Co3D dataset, which contains rich categories and objects and is therefore widely used for multi-view image generation and rendering tasks. In the experimental design we selected a variety of scenes, each using six images taken from different angles for training, rendering and evaluation. Each rendering generates three images at different viewpoints, and the rendered results are then compared with the real images. See: Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C., 2017. Transformation-grounded image generation network for novel 3d view synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3500–3509.
To ensure the accuracy of image generation, we employ COLMAP tools to generate the camera pose of the image. The camera pose determines from what angle the scene is viewed and directly affects the viewing angle and quality of the rendered image. By adjusting the camera pose, we have realized generating images from multiple perspectives, thereby testing the generalization ability and rendering performance of the model on multi-perspective generation.
2. Experimental setup
The experiments are developed on the JAX framework and trained with the Adam optimizer, with the initial learning rate set to η0 = 2×10⁻⁵. The training time for each scene is approximately 20 hours to ensure optimal performance under limited resources. We run the experiments on an NVIDIA GeForce RTX 2070 GPU with the batch size set to 8. In each scene, six images at uniformly distributed viewpoints are selected as input, and three images at new viewpoints are generated.
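The training setup can be reproduced with a short JAX sketch along the following lines (Adam, initial learning rate 2×10⁻⁵, batches of 8). The optax optimizer library and the model_loss / init_params placeholders are assumptions for illustration; the text does not name the optimizer implementation.

```python
import jax
import optax

optimizer = optax.adam(learning_rate=2e-5)      # eta_0 = 2e-5

def train_step(params, opt_state, batch):
    # model_loss is a hypothetical placeholder for the total loss of the method
    loss, grads = jax.value_and_grad(model_loss)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

# opt_state = optimizer.init(init_params)       # then iterate over batches of size 8
```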
The experiments use the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM) and the Learned Perceptual Image Patch Similarity (LPIPS) as the main evaluation metrics for measuring the quality of the generated images. Together these metrics reflect the accuracy, structural similarity and visual quality of the generated images with respect to the real images.
3. Performance comparison
We compare the baseline models, the complete model of the invention, and recent state-of-the-art methods, as follows:
GPNR [Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022a. Generalizable patch-based neural rendering, in: European Conference on Computer Vision, Springer, pp. 156–174.]: a neural radiance field model based on local patch features, which adopts a patch-based neural rendering method that decomposes a complex three-dimensional scene into many small patches for rendering.
LFNR [Suhail, M., Esteves, C., Sigal, L., Makadia, A., 2022b. Light field neural rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8269–8279.]: a neural rendering model based on a light-field representation for sparse views; by introducing light-field techniques it directly models the relation between views and rays, avoids the complex volume-rendering computation of conventional NeRF, and achieves faster rendering and better novel-view synthesis.
WAH [Bao, Y., Li, Y., Huo, J., Ding, T., Liang, X., Li, W., Gao, Y., 2023. Where and how: Mitigating confusion in neural radiance fields from sparse inputs, in: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2180–2188.]: a window-aware hash acceleration method for neural radiance fields, focused on optimizing training efficiency and inference performance in sparse-sampling scenarios. By introducing a window-aware hash mechanism, WAH effectively captures local features while avoiding the resource waste of conventional global hashing under sparse-view conditions, achieving efficient scene modeling and fast novel-view rendering.
(1) Comparison of the proposed method with other methods in different scenes
Table 1 comparison of the performance of different methods in different scenarios
Comparing the characteristics and targets of all methods, every metric of the proposed method is superior to those of the other comparison methods in all scenes, and the rendered images are more realistic and sharper. As shown in Table 1, the proposed method achieves significant performance gains in all scenarios, and is especially strong on the PSNR and SSIM metrics. Compared with the other methods (such as LFNR, WAH and GPNR), the proposed method improves PSNR by 4–14 dB on average and reaches higher SSIM values, while the LPIPS value drops markedly, showing clear advantages in detail preservation, structure restoration and perceptual quality of the rendered images. These results demonstrate the excellent performance of the method in image rendering tasks, achieving higher image quality and consistency in a variety of scenes.
As shown in FIG. 3, in the three scenes of a plant, a book and a bear, the rendered images of the proposed method are compared with those of GPNR, LFNR and WAH. In the plant scene, our method clearly shows the texture and edges of the leaves, while the other methods are blurrier; in particular, LFNR and WAH can hardly render the leaf details and the real texture of the background. In the book scene, our method accurately restores the details of the cover text and pages, whereas the other methods fall short in detail and color saturation: GPNR is weak in detail and color rendering with the text not clear enough, LFNR suffers from severe blurring and distortion, and WAH loses almost all details and is hard to recognize. In the bear scene, our method renders the real texture of the fur and accurately restores the complete photo frame in the background, whose edges and contents can be clearly seen. These experimental results show that the method has clear advantages in improving rendering quality, and especially with sparse inputs it better preserves global information while improving local detail quality.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (5)

Translated from Chinese
1. A sparse novel-view image synthesis method based on multi-scale feature fusion, characterized by comprising the following steps:
S1. Multi-scale reference point generation and feature sampling: generate reference points of different scales according to the resolution of the input image, use these reference points to sample the corresponding multi-scale feature maps, and extract and concatenate the features at each scale to obtain the initial feature information;
S2. Multi-receptive-field residual feature extraction: using multi-receptive-field convolution and a residual connection mechanism, apply depthwise convolutions with different kernel sizes to the sampled initial feature information, obtain information at different scales through the multiple receptive fields, add residual connections to preserve the original features and fuse the features at each scale, finally generating the multi-receptive-field residual features;
S3. Attention-network-based feature aggregation and image generation: take the fused multi-scale depth features as input, process them layer by layer through the self-attention modules of a GPNR model, and progressively aggregate and align the multi-view information by capturing the global dependencies between different views, completing the deep fusion and expression of features and thereby generating a novel-view image of the sparse scene;
S4. Pre-training of the novel-view synthesis model: pre-train the model on a large-scale, general dataset by training on a large number of dense images of different scenes and iterating over different objects from multiple datasets to cover a wide range of possible visual scenes and object types, while optimizing the model weights with loss computation and back-propagation, so that the model learns general visual features and high-level semantic features suited to the sparse novel-view synthesis task;
S5. Fine-tuning of the pre-trained model: further optimize through transfer learning based on the pre-trained model, transfer its parameters to the new scene, fine-tune on the sparse-view data of the new scene, and update the model parameters by back-propagation, thereby accelerating training and rendering of the new scene and making the model better suited to the specific sparse scene;
the finally obtained model can be used to generate novel-view images.
2. The sparse novel-view synthesis method based on multi-scale feature fusion according to claim 1, characterized in that in step S1, reference points of multiple scales are first generated according to the resolution of the input image; the reference points are key points in the feature map used to mark pixel positions; using an automatic selection method, the image is divided into grids of fixed size and the grid centre points serve as the reference points; after the coordinates are normalized, the reference point sets of different scales are merged; feature sampling is then performed on the corresponding multi-scale feature maps (C1, C2, C3), the index positions in the flattened feature matrix are located via the normalized coordinates of the reference points, and the feature values are extracted and mapped back to the multi-scale structure; the sampled multi-scale features are concatenated and fused into a unified feature representation C, expressed as:
C=[Sample(C1),Sample(C2),Sample(C3)]
where Sample(·) denotes reference-point-based multi-scale feature sampling, and C contains the fused multi-scale features, providing rich multi-scale information for subsequent processing.
3. The sparse novel-view synthesis method based on multi-scale feature fusion according to claim 1, characterized in that in step S2 the multi-receptive-field residual feature extraction comprises the following steps:
first, the features C from different scales are linearly transformed through a fully connected layer, applying different weights and biases to each channel, re-weighting and recombining the input features through matrix operations, and adjusting and reorganizing the feature relations between channels to generate a linearly transformed feature representation:
FC1(C)=W·C+b
where C is the multi-scale feature generated in step S1, W and b are parameters obtained by training that help the model fit the data, and FC1 denotes the first fully connected layer;
depthwise convolution with multi-scale receptive fields is applied to the dimension-reduced features, using kernels of different sizes: the features are split along the channel dimension and each part is convolved independently, with the specific formula:
Conv(x)=[Conv3×3(x1),Conv5×5(x2),Convk×k(x3)]
where x1, x2, x3 are the three parts of the input feature x split along the channel dimension (the channel dimension represents the number of features extracted by the network); (x1, x2, x3) are convolved with kernel sizes (3×3), (5×5) and (k×k) respectively to generate features at different scales, and Conv denotes the convolution operation;
after the convolution at each scale, a residual connection is added: the convolved features are combined with the original input features while keeping the channels of the original input unchanged, fusing the convolved and input features and merging the two representations through the residual connection:
x=[Conv(x),identity(x)]
where identity(x) passes the input feature x through directly without any operation, i.e. the input feature is preserved;
the fused multi-scale features are expanded back to the original channel number by linear projection: a fully connected layer applies a weight matrix and bias to the fused features, producing a feature representation with the same channel number as the input:
F=FC2([x1,x2,x3])
where FC2 denotes the second fully connected layer, which maps the concatenated multi-scale features back to the target channel number.
4. The sparse novel-view synthesis method according to claim 1, characterized in that in step S3 the fused multi-scale features are input into a GPNR (Generalizable Patch-based Neural Rendering) model for training; the GPNR model receives the fused multi-view features as input and uses neural rendering to generate the feature representation of the target view; by processing the input features, GPNR can effectively reconstruct and complete the missing information in the sparse scene and finally output a high-quality novel-view image under sparse conditions, providing complete and consistent feature support for subsequent synthesis and presentation.
5. The sparse novel-view synthesis method according to claim 1, characterized in that in step S5 the pre-trained model is used as the initialization of the new model: its weight and bias parameters are loaded and set as the initial parameters of the new scene, and fast fine-tuning is then performed on the sparse-view data of the new scene; during fine-tuning, the parameters θ0 of the pre-trained model serve as initial values, and the model parameters are updated by minimizing a weighted loss function that combines the loss of the pre-trained model with a loss term for the new-scene data; under a small number of camera views, the parameters θ are updated by gradient descent so the model quickly adapts to the new scene, accelerating training and rendering while improving the quality of novel-view synthesis for the new scene.
CN202411833524.9A | 2024-12-13 | 2024-12-13 | Sparse new view image synthesis method based on multi-scale feature fusion | Active | CN119762358B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411833524.9A (CN119762358B (en)) | 2024-12-13 | 2024-12-13 | Sparse new view image synthesis method based on multi-scale feature fusion

Publications (2)

Publication Number | Publication Date
CN119762358A | 2025-04-04
CN119762358B (granted) | 2025-10-03

Family

ID=95193954

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411833524.9A (Active; CN119762358B (en)) | Sparse new view image synthesis method based on multi-scale feature fusion | 2024-12-13 | 2024-12-13

Country Status (1)

Country | Link
CN | CN119762358B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114240811A (en) * | 2021-11-29 | 2022-03-25 | 浙江大学 | A method for generating new images based on multiple images
CN118379466A (en) * | 2024-04-30 | 2024-07-23 | 长春理工大学 | A new perspective synthesis method based on prior residual and position reference information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2022099613A1 (en) * | 2020-11-13 | 2022-05-19 | 华为技术有限公司 | Training method for image generation model, and new view angle image generation method and apparatus
CN118587340A (en) * | 2024-01-08 | 2024-09-03 | 中国传媒大学 | Neural rendering method and system based on global semantic information and local geometric perception
CN118657903A (en) * | 2024-07-04 | 2024-09-17 | 安徽省农业科学院农业经济与信息研究所 | A 3D reconstruction method for Pelteobagrus fulvidraco based on instance segmentation and improved neural radiation field

Also Published As

Publication number | Publication date
CN119762358A | 2025-04-04

Similar Documents

Publication | Publication Date | Title
Chen et al.Lara: Efficient large-baseline radiance fields
CN114255238A (en)Three-dimensional point cloud scene segmentation method and system fusing image features
CN109035267B (en) A deep learning-based image target extraction method
CN118229889B (en)Video scene previewing auxiliary method and device
CN117237623B (en) A method and system for semantic segmentation of UAV remote sensing images
Xu et al.Underwater image enhancement method based on a cross attention mechanism
CN119180898B (en)Neural radiation field rendering method and device based on nerve basis and tensor decomposition
CN116524070A (en)Scene picture editing method and system based on text
CN118761936B (en) Image restoration method and electronic device based on texture and structure fusion prior
CN118967536B (en) A color dithering method for images based on color transfer model
CN119599967A (en)Stereo matching method and system based on context geometry cube and distortion parallax optimization
CN118096978B (en) A method for rapid generation of 3D art content based on arbitrary stylization
CN114119916A (en)Multi-view stereoscopic vision reconstruction method based on deep learning
CN119762358B (en)Sparse new view image synthesis method based on multi-scale feature fusion
Luo et al.A fast denoising fusion network using internal and external priors
CN118379466A (en) A new perspective synthesis method based on prior residual and position reference information
Lin et al.Enhancing underwater imaging with 4-D light fields: Dataset and method
Lai et al.Immovable Cultural Relics Preservation Through 3D Reconstruction Using NeRF
Wang et al.Light field angular super resolution based on residual channel attention and classification up-sampling
CN119919563B (en)NeRF multi-model scene construction method, equipment and medium based on distorted light rays
Zamani et al.A High-Performance Learning-Based Framework for Monocular 3D Point Cloud Reconstruction
Wen et al.RetouchFormer: semi-supervised high-quality face retouching transformer with prior-based selective self-attention
CN119151773A (en)Complex landscape multi-state image generation method, device and program product
Li et al.Multi-scale perceptual image super-resolution reconstruction based on coordinate attention mechanism
CN118864275A (en) View virtual viewpoint synthesis method and terminal based on multi-plane images

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
