技术领域Technical Field
本发明涉及计算机视觉技术领域,特别涉及一种基于多平面图像的视图虚拟视点合成方法及终端。The present invention relates to the field of computer vision technology, and in particular to a method and a terminal for synthesizing virtual viewpoints of views based on multi-plane images.
背景技术Background Art
理解真实场景的三维结构是各类高级计算机视觉任务的关键步骤,神经渲染方法显著推进了这一任务领域的进展。虽然有关工作已经在辐射场模型构建与推理方面有巨大的提升,但大多数方法都限制在NeRF先记忆后推理的框架中。然而在大多数实际应用场景中,通常只能提供少数视点信息作为基础先验。Understanding the 3D structure of real scenes is a key step in various advanced computer vision tasks, and neural rendering methods have significantly advanced progress in this field. Although related work has made great strides in the construction of and inference over radiance field models, most methods are confined to NeRF's memorize-then-infer framework. However, in most practical application scenarios, only a few viewpoints can usually be provided as basic priors.
NeRF模型获取的已知场景信息随着视图减少而逐步减少,传统NeRF中仅仅将参考图像与渲染图像构建的像素颜色损失作为监督项,模型会偏向于学习当前输入视图图片从而导致新视图外观和几何结构出现较大偏差。并且当视图输入极为稀疏时,目标视图中可能会出现已知参考视点处未观察到的区域信息,从而导致新视点图像中出现空洞等问题。The known scene information available to the NeRF model gradually decreases as the number of views decreases. Traditional NeRF uses only the pixel color loss between the reference images and the rendered images as the supervision term, so the model tends to overfit the current input views, resulting in large deviations in the appearance and geometric structure of new views. Moreover, when the view input is extremely sparse, the target view may contain regions that were never observed from the known reference viewpoints, resulting in problems such as holes in the new-viewpoint image.
发明内容Summary of the invention
本发明所要解决的技术问题是:提供一种基于多平面图像的视图虚拟视点合成方法及终端,结合NeRF的隐式连续表示思想,将多平面图像MPI的深度平面数扩展为可学习迭代的参数,在少量视图可用条件下能够有效地避免视图过拟合问题,实现更好的几何结构重建能力。The technical problem to be solved by the present invention is: to provide a view virtual viewpoint synthesis method and terminal based on multi-plane images, combine the implicit continuous representation idea of NeRF, expand the number of depth planes of the multi-plane image MPI into a learnable iterative parameter, and effectively avoid the view overfitting problem when a small number of views are available, so as to achieve better geometric structure reconstruction capability.
为了解决上述技术问题,本发明采用的技术方案为:In order to solve the above technical problems, the technical solution adopted by the present invention is:
一种基于多平面图像的视图虚拟视点合成方法,包括步骤:A method for synthesizing virtual viewpoints of views based on multi-plane images comprises the following steps:
S1、采集场景中多视点观测数据,包括多视点图像、相机内参与外参和相机位姿数据,利用预处理算法得到三维场景重建范围,确定多平面图像MPI的深度预设值;S1, collecting multi-view observation data in the scene, including multi-view images, camera internal and external parameters and camera pose data, using a preprocessing algorithm to obtain a 3D scene reconstruction range, and determining a depth preset value of a multi-plane image MPI;
S2、将采集的多视点图像送入VPF特征编码器模块,提取每幅图像的多尺度特征图;S2, sending the collected multi-view images to the VPF feature encoder module to extract the multi-scale feature map of each image;
S3、使用神经代价体NCV编码技术对视图间局部外观特征进行建模得到局部外观模型;S3, using neural cost volume NCV coding technology to model the local appearance features between views to obtain a local appearance model;
S4、将所述多尺度特征图和所述局部外观模型用于引导解码端建立每个输入视点的所述多平面图像MPI;S4, using the multi-scale feature map and the local appearance model to guide a decoding end to establish the multi-plane image MPI for each input viewpoint;
S5、输出各所述多平面图像MPI的颜色与透明度后,通过单应变换和Alpha渲染得到各所述多平面图像MPI投影至目标视点的初步合成结果,并采用以视点距离为权重的加权融合方式得到最终的新视点图像。S5. After outputting the color and transparency of each of the multi-plane images MPI, a preliminary synthesis result of projecting each of the multi-plane images MPI to a target viewpoint is obtained through homography transformation and Alpha rendering, and a weighted fusion method with viewpoint distance as weight is used to obtain a final new viewpoint image.
为了解决上述技术问题,本发明采用的另一技术方案为:In order to solve the above technical problems, another technical solution adopted by the present invention is:
一种基于多平面图像的视图虚拟视点合成终端,包括存储器、处理器以及存储在所述存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如上所述的一种基于多平面图像的视图虚拟视点合成方法中的步骤。A terminal for synthesizing virtual viewpoints of views based on multi-plane images comprises a memory, a processor and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps in the method for synthesizing virtual viewpoints of views based on multi-plane images as described above are implemented.
本发明的有益效果在于:提供一种基于多平面图像的视图虚拟视点合成方法及终端,针对NeRF在少量视图可用条件下的视点合成质量退化问题,通过引入金字塔结构编码器VPF来提取图像的全局多尺度特征,同时使用神经代价体NCV编码技术对视图间局部外观特征进行建模,并将二者用于引导解码端建立每个输入视点的多平面图像MPI,并再输出各深度平面的颜色与透明度后,通过单应变换和Alpha渲染得到各MPI投影至目标视点的合成结果,另外还采用以视点距离为权重的加权融合方式得到最终的新视点图像,实现在具有复杂纹理的数据集上进行了评估实验,在少量视图可用条件下也能够有效地避免视图过拟合问题,相较于其他对比方法具有更好的几何结构重建能力。The beneficial effects of the present invention are: providing a virtual viewpoint synthesis method and terminal based on multi-plane images, aiming at the viewpoint synthesis quality degradation problem of NeRF under the condition of a small number of views available, by introducing a pyramid structure encoder VPF to extract the global multi-scale features of the image, and using the neural cost volume NCV coding technology to model the local appearance features between views, and the two are used to guide the decoding end to establish a multi-plane image MPI for each input viewpoint, and then output the color and transparency of each depth plane, and then obtain the synthesis result of each MPI projected to the target viewpoint through homography transformation and Alpha rendering, and also adopt a weighted fusion method with viewpoint distance as weight to obtain the final new viewpoint image, and carry out evaluation experiments on a data set with complex textures, and can effectively avoid the view overfitting problem under the condition of a small number of views available, and has better geometric structure reconstruction capability than other comparison methods.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1为本发明实施例的一种基于多平面图像的视图虚拟视点合成方法的流程图;FIG1 is a flow chart of a method for synthesizing virtual viewpoints based on multi-plane images according to an embodiment of the present invention;
图2为本发明实施例的一种基于多平面图像的视图虚拟视点合成方法的整体框架图;FIG2 is an overall framework diagram of a method for synthesizing virtual viewpoints based on multi-plane images according to an embodiment of the present invention;
图3为本发明实施例中VPF编码器网络结构图;FIG3 is a diagram of a VPF encoder network structure according to an embodiment of the present invention;
图4为本发明实施例中解码器网络结构图;FIG4 is a diagram of a decoder network structure according to an embodiment of the present invention;
图5为本发明实施例中多平面图像渲染目标视点的示意图;FIG5 is a schematic diagram of a multi-plane image rendering target viewpoint according to an embodiment of the present invention;
图6为本发明实施例的一种基于多平面图像的视图虚拟视点合成终端的结构示意图。FIG6 is a schematic diagram of the structure of a virtual viewpoint synthesis terminal based on multi-plane images according to an embodiment of the present invention.
标号说明:Description of labels:
1、一种基于多平面图像的视图虚拟视点合成终端;2、存储器;3、处理器。1. A virtual viewpoint synthesis terminal based on multi-plane images; 2. A memory; 3. A processor.
具体实施方式DETAILED DESCRIPTION
为详细说明本发明的技术内容、所实现目的及效果,以下结合实施方式并配合附图予以说明。In order to explain the technical content, achieved objectives and effects of the present invention in detail, the following is an explanation in combination with the implementation modes and the accompanying drawings.
在此之前,对本文中出现的英文缩写作如下释义说明:Prior to this, the English abbreviations appearing in this article are explained as follows:
1、MPI:Multiplane Images,多平面图像。1. MPI: Multiplane Images, multi-plane images.
2、NeRF:Neural Radiance Field,神经辐射场,是一种使用神经网络来隐式表达3D场景的技术。2. NeRF: Neural Radiance Field, a technology that uses neural networks to implicitly express 3D scenes.
3、MLP:Multilayer Perceptron,多层感知器,是一种前馈神经网络,由多个层次组成,包括输入层、隐藏层和输出层,通过其多层次的结构和反向传播算法,能够有效地进行复杂的数据处理和分类任务。3. MLP: Multilayer Perceptron, a feedforward neural network consisting of multiple levels, including input layer, hidden layer and output layer. Through its multi-level structure and back propagation algorithm, it can effectively perform complex data processing and classification tasks.
4、COLMAP:一款开源的计算机视觉软件,主要用于从一系列二维图像中进行三维重建,全称为"COLLISION-MAPping",是一个高度集成的Structure-from-Motion(SfM)和Multi-View Stereo(MVS)管道工具,主要功能包括特征检测与匹配、增量式SfM、多视图立体匹配及优化,能够从无序或有序的二维照片集合中恢复出三维场景的几何结构(点云)以及每张图像对应的相机姿态。4. COLMAP: An open source computer vision software, mainly used for 3D reconstruction from a series of 2D images. Its full name is "COLLISION-MAPping". It is a highly integrated Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline tool. Its main functions include feature detection and matching, incremental SfM, multi-view stereo matching and optimization. It can recover the geometric structure (point cloud) of the 3D scene and the camera pose corresponding to each image from an unordered or ordered collection of 2D photos.
5、VPF:Vision Pyramid Fusion,视觉金字塔融合,是一种基于图像金字塔的图像融合技术,图像金字塔通过多次降采样和上采样操作,生成不同分辨率的图像层次结构,从而实现多尺度的图像处理和分析。5. VPF: Vision Pyramid Fusion, visual pyramid fusion, is an image fusion technology based on image pyramids. The image pyramid generates an image hierarchy of different resolutions through multiple downsampling and upsampling operations, thereby realizing multi-scale image processing and analysis.
6、NCV:Neural Cost Volume,神经代价体,是一种用于多视图立体匹配和深度估计的神经网络结构,其核心思想是通过构建一个三维或四维的成本体积,来度量不同视角图像块之间的相似度,并进行匹配代价的聚合,从而实现高精度的深度推断。6. NCV: Neural Cost Volume, a neural network structure for multi-view stereo matching and depth estimation. Its core idea is to measure the similarity between image blocks of different perspectives by constructing a three-dimensional or four-dimensional cost volume, and aggregate the matching costs to achieve high-precision depth inference.
7、Alpha渲染:是一种在图形渲染中常见的技术,主要用于处理图像的透明度信息。7. Alpha rendering: It is a common technology in graphics rendering, mainly used to process the transparency information of the image.
8、ViT:Vision Transformer,是一种基于Transformer架构的深度学习模型,用于图像识别和计算机视觉任务。8. ViT: Vision Transformer, a Transformer-based deep learning model for image recognition and computer vision tasks.
请参照图1至图5,一种基于多平面图像的视图虚拟视点合成方法,包括步骤:Referring to FIGS. 1 to 5 , a method for synthesizing virtual viewpoints of views based on multi-plane images includes the following steps:
S1、采集场景中多视点观测数据,包括多视点图像、相机内参与外参和相机位姿数据,利用预处理算法得到三维场景重建范围,确定多平面图像MPI的深度预设值;S1, collecting multi-view observation data in the scene, including multi-view images, camera internal and external parameters and camera pose data, using a preprocessing algorithm to obtain a 3D scene reconstruction range, and determining a depth preset value of a multi-plane image MPI;
S2、将采集的多视点图像送入VPF特征编码器模块,提取每幅图像的多尺度特征图;S2, sending the collected multi-view images to the VPF feature encoder module to extract the multi-scale feature map of each image;
S3、使用神经代价体NCV编码技术对视图间局部外观特征进行建模得到局部外观模型;S3, using neural cost volume NCV coding technology to model the local appearance features between views to obtain a local appearance model;
S4、将所述多尺度特征图和所述局部外观模型用于引导解码端建立每个输入视点的所述多平面图像MPI;S4, using the multi-scale feature map and the local appearance model to guide a decoding end to establish the multi-plane image MPI for each input viewpoint;
S5、输出各所述多平面图像MPI的颜色与透明度后,通过单应变换和Alpha渲染得到各所述多平面图像MPI投影至目标视点的初步合成结果,并采用以视点距离为权重的加权融合方式得到最终的新视点图像。S5. After outputting the color and transparency of each of the multi-plane images MPI, a preliminary synthesis result of projecting each of the multi-plane images MPI to a target viewpoint is obtained through homography transformation and Alpha rendering, and a weighted fusion method with viewpoint distance as weight is used to obtain a final new viewpoint image.
从上述描述可知,本发明的有益效果在于:提供一种基于多平面图像的视图虚拟视点合成方法,针对NeRF在少量视图可用条件下的视点合成质量退化问题,通过引入金字塔结构编码器VPF来提取图像的全局多尺度特征,同时使用神经代价体NCV编码技术对视图间局部外观特征进行建模,并将二者用于引导解码端建立每个输入视点的多平面图像MPI,并再输出各深度平面的颜色与透明度后,通过单应变换和Alpha渲染得到各MPI投影至目标视点的合成结果,另外还采用以视点距离为权重的加权融合方式得到最终的新视点图像,实现在具有复杂纹理的数据集上进行了评估实验,在少量视图可用条件下也能够有效地避免视图过拟合问题,相较于其他对比方法具有更好的几何结构重建能力。From the above description, it can be seen that the beneficial effects of the present invention are: providing a virtual viewpoint synthesis method based on multi-plane images, aiming at the viewpoint synthesis quality degradation problem of NeRF under the condition of a small number of views available, by introducing a pyramid structure encoder VPF to extract the global multi-scale features of the image, and using the neural cost volume NCV coding technology to model the local appearance features between views, and the two are used to guide the decoding end to establish a multi-plane image MPI for each input viewpoint, and then output the color and transparency of each depth plane, and then obtain the synthesis result of each MPI projected to the target viewpoint through homography transformation and Alpha rendering, and also adopt a weighted fusion method with viewpoint distance as weight to obtain the final new viewpoint image, and carry out evaluation experiments on a data set with complex textures, which can effectively avoid the problem of view overfitting under the condition of a small number of views available, and has better geometric structure reconstruction capability than other comparison methods.
进一步地,所述步骤S2具体为:Furthermore, the step S2 is specifically as follows:
S21、将采集的多视点图像送入VPF特征编码器模块,所述VPF特征编码器模块包括四个下采样阶段,且每个下采样阶段中包括一次Patch Embedding操作与一个空间压缩多头注意力模块HWMHT,所述Patch Embedding操作为ViT中对所述多视点图像的原始特征图进行的分块操作;S21, sending the collected multi-view images to the VPF feature encoder module, the VPF feature encoder module includes four downsampling stages, and each downsampling stage includes a Patch Embedding operation and a spatial compression multi-head attention module HWMHT, the Patch Embedding operation being the patch partitioning operation performed in ViT on the original feature map of the multi-view images;
S22、所述Patch Embedding操作中采用卷积操作将所述原始特征图划分成大小为(P×P)的(HW/P²)个Patch,其中H和W分别为所述原始特征图的长和宽;S22, in the Patch Embedding operation, a convolution operation is used to divide the original feature map into (HW/P²) patches of size (P×P), where H and W are the length and width of the original feature map respectively;
S23、再将每个Patch进行Flatten展平以及线性层映射得到维度为[C_S×(HW/P²)]的特征向量,将每阶段中的特征记为F_S∈R^(H_S×W_S×C_S),将位置编码后的向量与展平后的特征向量进行逐元素相加,完成相机位置信息的嵌入,F_S表示第S阶段的特征向量,R表示特征向量空间,H_S、W_S和C_S分别表示第S阶段每层输出的特征的长、宽和通道数;S23, each patch is then flattened and mapped through a linear layer to obtain a feature vector of dimension [C_S×(HW/P²)], and the features at each stage are denoted as F_S∈R^(H_S×W_S×C_S). The position-encoded vector is added element-wise to the flattened feature vector to complete the embedding of the camera position information, where F_S denotes the feature vector of stage S, R denotes the feature vector space, and H_S, W_S and C_S denote the length, width and number of channels of the features output at each layer of stage S, respectively;
S24、所述空间压缩多头注意力模块HWMHT包括依次连接的一个Layer Normalization归一化层、一个深度可分离卷积模块、一个多头自注意力计算模块、一个Layer Normalization归一化层和一个前馈网络层,在每个所述下采样阶段结束后都输出此时的特征图F_S。S24, the spatial compression multi-head attention module HWMHT includes, connected in sequence, a Layer Normalization layer, a depthwise separable convolution module, a multi-head self-attention calculation module, a Layer Normalization layer and a feed-forward network layer, and outputs the feature map F_S at that stage after each downsampling stage.
进一步地,所述步骤S2中还包括:Furthermore, the step S2 also includes:
四个所述下采样阶段中所述VPF特征编码器模块的Patch尺度大小均设置为2,并定义整体网络中四层的输出特征通道数C_S={256,512,1024,2048},每阶段中叠加多个所述空间压缩多头注意力模块HWMHT,每个阶段的所述空间压缩多头注意力模块HWMHT数量设置为L_S={3,3,6,3};The patch scale size of the VPF feature encoder module in the four downsampling stages is set to 2, and the number of output feature channels of the four layers in the overall network is defined as C_S={256,512,1024,2048}. Multiple spatial compression multi-head attention modules HWMHT are stacked in each stage, and the number of HWMHT modules in each stage is set to L_S={3,3,6,3};
将每个所述下采样阶段后输出的特征图的大小系数分别定义为{1/2,1/4,1/8,1/16},并将每一幅特征图连接至解码端作为解码输入。The size coefficients of the feature maps output after each downsampling stage are defined as {1/2, 1/4, 1/8, 1/16}, respectively, and each feature map is connected to the decoding end as a decoding input.
由上述描述可知,通过在金字塔结构编码器中引入Vision Transformer来提取图像的全局多尺度特征,并通过多次下采样操作,生成不同分辨率的图像层次结构,从而实现多尺度的图像处理和分析。From the above description, it can be seen that by introducing Vision Transformer in the pyramid structure encoder to extract the global multi-scale features of the image, and through multiple downsampling operations, image hierarchical structures of different resolutions are generated, thereby realizing multi-scale image processing and analysis.
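The sketch below illustrates one possible realisation of a single downsampling stage's Patch Embedding step described above: a strided convolution splits the feature map into patches, the patches are flattened, and a position embedding is added element-wise. It is only a minimal PyTorch sketch; the module name, the use of a learned position embedding, and the input size are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class PatchEmbedStage(nn.Module):
    """One downsampling stage: Patch Embedding (P=2) + position embedding (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, patch=2, h=64, w=64):
        super().__init__()
        # Strided convolution splits the feature map into (H*W/P^2) patches of size P x P
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=patch, stride=patch)
        # Learned position embedding, added element-wise to the flattened patch features (assumption)
        self.pos = nn.Parameter(torch.zeros(1, (h // patch) * (w // patch), out_ch))

    def forward(self, x):                      # x: [B, C_in, H, W]
        x = self.proj(x)                       # [B, C_out, H/P, W/P]
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)       # flatten to tokens: [B, H*W/P^2, C_out]
        x = x + self.pos                       # embed position information
        return x, (h, w)

# Example: stage 1 with C_S = 256 on a 64x64, 3-channel input
stage = PatchEmbedStage(3, 256, patch=2, h=64, w=64)
tokens, hw = stage(torch.randn(1, 3, 64, 64))
print(tokens.shape)                            # torch.Size([1, 1024, 256])
```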
进一步地,所述步骤S3具体为:Furthermore, the step S3 is specifically as follows:
S31、对于N幅所述多视点图像,依次确定一张视图为参考视点图像并记作I_r,其余视图作为辅助构造视图,利用CNN网络分别对N幅所述多视点图像提取二维尺度不变的特征图F_i∈R^(H×W×C),H、W和C分别为特征的高height、宽width和通道数channel,计算过程如下式所示:S31. For the N multi-view images, one view at a time is designated as the reference viewpoint image, denoted I_r, and the remaining views serve as auxiliary construction views. A CNN network is used to extract a two-dimensional scale-invariant feature map F_i∈R^(H×W×C) from each of the N multi-view images, where H, W and C are the feature height, width and number of channels, respectively. The calculation process is shown in the following formula:
F_i(u,v)=LWC(I_i) (1);
其中,u和v分别表示特征图水平与垂直方向的索引,LWC()表示轻量级CNN网络所代表的函数,i表示输入视图索引,即1-N幅输入图像中的第i幅,I_i表示输入的第i幅视图图像,公式(1)表示输入的第i幅图像经过神经网络LWC得到二维尺度不变的特征图F_i,特征图F_i以u和v为坐标系;Wherein, u and v denote the horizontal and vertical indices of the feature map, LWC() denotes the function represented by the lightweight CNN network, i denotes the input view index, i.e., the i-th of the 1-N input images, and I_i denotes the i-th input view image. Formula (1) means that the i-th input image is passed through the neural network LWC to obtain the two-dimensional scale-invariant feature map F_i, with u and v as its coordinate system;
S32、根据COLMAP预处理算法从N幅所述多视点图像中得出神经代价体NCV和所述多平面图像MPI的构建范围约束如下式所示:S32, according to the COLMAP preprocessing algorithm, the construction range constraints of the neural cost volume NCV and the multi-plane image MPI are obtained from the N multi-viewpoint images as shown in the following formula:
z_j ∼ U(z_n, z_f), j=1,2,...,D (2);
其中,z_n为近边界平面,z_f为远边界平面,j表示深度平面图像索引,即从1-D个深度平面图像中的第j个,z_j表示第j个深度平面图像的构建范围,U表示z_j的取值区间,从中给出一组随机预设深度样本值{z_j|j=1,2,...,D},D表示所述参考视点图像的深度平面个数;Wherein, z_n is the near boundary plane, z_f is the far boundary plane, j denotes the depth plane image index, i.e., the j-th of the 1-D depth plane images, z_j denotes the construction range of the j-th depth plane image, and U denotes the value interval of z_j, from which a set of random preset depth sample values {z_j|j=1,2,...,D} is drawn; D denotes the number of depth planes of the reference viewpoint image;
S33、将N-1幅辅助视图通过单应变换W(·)投影至所述参考视点图像的D个深度平面,其中对于每个输入特征图I_i投影至所述参考视点图像I_r的z_j深度平面下的单应变换矩阵表示如下式所示:S33, projecting the N-1 auxiliary views to the D depth planes of the reference viewpoint image through the homography transformation W(·), wherein the homography transformation matrix for projecting each input feature map I_i to the depth plane z_j of the reference viewpoint image I_r is expressed as shown in the following formula:
其中,Φ_r=[K_r,R_r,t_r]为所述参考视点图像对应的相机参数,Φ_i=[K_i,R_i,t_i]为输入相机参数,n_r表示所述参考视点图像的平面朝向视点方向的法向量。Wherein, Φ_r=[K_r,R_r,t_r] are the camera parameters corresponding to the reference viewpoint image, Φ_i=[K_i,R_i,t_i] are the input camera parameters, and n_r denotes the normal vector of the plane of the reference viewpoint image pointing toward the viewpoint direction.
进一步地,所述步骤S33中还包括:Furthermore, the step S33 also includes:
将通过单应变换得到所述参考视点图像I_r的视图前向锥体空间内由D个深度平面下的N个特征体构成的五维代价空间表示为C∈R^(D×N×H×W×C),完成通道维度数据在像素位置处的堆叠后,在输入视图个数的维度上计算每个深度下特征体的方差来进行代价度量,将五维代价空间降为四维代价体,计算过程如下式所示:The five-dimensional cost space composed of the N feature volumes under the D depth planes in the forward view frustum of the reference viewpoint image I_r obtained by homography transformation is denoted as C∈R^(D×N×H×W×C). After stacking the channel-dimension data at each pixel position, the variance of the feature volumes at each depth is computed along the dimension of the number of input views as the cost metric, reducing the five-dimensional cost space to a four-dimensional cost volume, as shown in the following formula:
C_j = (1/N)·Σ_{i=1}^{N}(F_z^i − F̄_z)² (4);
其中,C_j为第j个深度平面下的特征代价体,F_z^i表示在z深度平面下的第i幅特征图,F_z^N表示在z深度平面下的第N幅特征图,F̄_z表示在z深度平面下的特征图平均值。Wherein, C_j is the feature cost volume at the j-th depth plane, F_z^i denotes the i-th feature map at depth plane z, F_z^N denotes the N-th feature map at depth plane z, and F̄_z denotes the mean of the feature maps at depth plane z.
由上述描述可知,由上述公式得到D个基于方差的代价体,其能够编码不同输入视图图像的边缘纹理外观阴影变化,量化了视图间特征相似性与预设深度准确性之间的关系,能够很好的为解码端提供场景中的遮挡外观变化。From the above description, it can be seen that D variance-based cost volumes are obtained by the above formula, which can encode the edge texture appearance shadow changes of different input view images, quantify the relationship between the feature similarity between views and the preset depth accuracy, and can well provide the decoding end with the occlusion appearance changes in the scene.
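To make the variance-based cost aggregation described above concrete, the sketch below reduces N already-warped view feature maps into one cost volume per depth plane. The homography warping itself is assumed to have been done beforehand, so the input is simply a tensor of warped features; the function name is illustrative.

```python
import torch

def variance_cost_volume(warped_feats: torch.Tensor) -> torch.Tensor:
    """Variance-based cost aggregation over the view dimension.

    warped_feats: [D, N, C, H, W] -- N view feature maps already warped onto
                  the D depth planes of the reference view (assumed precomputed).
    returns:      [D, C, H, W]    -- per-pixel variance of the features across
                  the N input views, one cost volume per depth plane.
    """
    mean = warped_feats.mean(dim=1, keepdim=True)        # mean feature over views
    var = ((warped_feats - mean) ** 2).mean(dim=1)       # (1/N) * sum_i (F_i - mean)^2
    return var

# Example with D=8 depth planes, N=3 views, C=32 channels, 30x40 features
cost = variance_cost_volume(torch.randn(8, 3, 32, 30, 40))
print(cost.shape)   # torch.Size([8, 32, 30, 40])
```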
进一步地,所述步骤S4具体为:Furthermore, the step S4 is specifically as follows:
S41、所述解码端通过多层解码器为每个所述多视点图像构建对应的所述多平面图像MPI;S41, the decoding end constructs the corresponding multi-plane image MPI for each of the multi-viewpoint images through a multi-layer decoder;
S42、根据所述深度预设值,利用联合分层采样策略得到D个深度平面,将所述多平面图像MPI扩展为连续可迭代训练的场景表示,具体为:S42, according to the preset depth value, using a joint layered sampling strategy to obtain D depth planes, and expanding the multi-plane image MPI into a scene representation that can be continuously iteratively trained, specifically:
将所述深度预设值通过NeRF中的编码函数编码为高维向量γ(zj),编码后如下式所示:The depth preset value is encoded into a high-dimensional vector γ(zj ) through the encoding function in NeRF, and the encoded value is shown in the following formula:
γ(z_j)=(sin(2^0·π·z_j), cos(2^0·π·z_j), ..., sin(2^(L−1)·π·z_j), cos(2^(L−1)·π·z_j)) (5);
其中,L表示编码过程中的最大特征频率。Wherein, L represents the maximum characteristic frequency in the encoding process.
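A minimal sketch of the depth encoding in formula (5), following the standard NeRF frequency encoding; the function name and the example value of L are placeholders.

```python
import torch

def encode_depth(z: torch.Tensor, L: int = 10) -> torch.Tensor:
    """Frequency-encode depth values z_j into a 2L-dimensional vector per depth:
    [sin(2^0 pi z), cos(2^0 pi z), ..., sin(2^(L-1) pi z), cos(2^(L-1) pi z)]."""
    freqs = 2.0 ** torch.arange(L, dtype=z.dtype) * torch.pi   # 2^0 pi ... 2^(L-1) pi
    angles = z.unsqueeze(-1) * freqs                            # [..., L]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [..., 2L]

gamma = encode_depth(torch.linspace(0.0, 1.0, steps=32))        # 32 preset depth values
print(gamma.shape)   # torch.Size([32, 20])
```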
进一步地,所述步骤S4中还包括:Furthermore, the step S4 also includes:
所述多层解码器包括四个上采样块,每个所述上采样块都由一个3×3卷积层(Pad=1,stride=1)、一个批次归一化层、一个ELU激活函数层以及一个两倍近邻上采样层组成。The multi-layer decoder includes four upsampling blocks, each of which is composed of a 3×3 convolutional layer (Pad=1, stride=1), a batch normalization layer, an ELU activation function layer, and a 2× nearest-neighbor upsampling layer.
由上述描述可知,先根据预设深度进行第一次平面采样并且通过解码器得到各平面的合成权重,再根据合成权重回传生成采样概率,从而对空间进行第二次深度采样得到最终的平面,以此将MPI扩展为连续可迭代训练的场景表示。From the above description, it can be seen that the first plane sampling is performed according to the preset depth and the synthesis weight of each plane is obtained through the decoder, and then the sampling probability is generated based on the synthesis weight feedback, so that the space is sampled for the second time to obtain the final plane, thereby expanding MPI into a scene representation that can be continuously iteratively trained.
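A sketch of one decoder upsampling block as listed above (3×3 convolution with padding 1 and stride 1, batch normalization, ELU, 2× nearest-neighbour upsampling); the layer ordering follows the description, while the channel counts in the usage example are only plausible values.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder upsampling block: Conv3x3 (pad=1, stride=1) -> BatchNorm -> ELU -> 2x nearest upsample."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ELU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),
        )

    def forward(self, x):
        return self.block(x)

# Four cascaded blocks bring a 1/16-resolution feature map back to full resolution
decoder = nn.Sequential(UpBlock(2048, 1024), UpBlock(1024, 512), UpBlock(512, 256), UpBlock(256, 64))
print(decoder(torch.randn(1, 2048, 4, 4)).shape)   # torch.Size([1, 64, 64, 64])
```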
进一步地,所述步骤S5具体为:Furthermore, the step S5 is specifically as follows:
S51、使用投影操作,采用逆单应变换函数H(.)建立所述多平面图像MPI中位于不同深度下的平面像素点与目标视点MPI像素的对应关系如下式所示:S51, using a projection operation, adopting an inverse homography transformation function H(.) to establish a correspondence between plane pixels at different depths in the multi-plane image MPI and pixels of the target viewpoint MPI as shown in the following formula:
其中,对应关系建立在所述输入图像MPI每个平面上的像素与所述目标图像MPI每个平面上的像素之间,t表示上述相机参数中的平移向量,n=[0,0,1]^T为第j个平面的法向量,R'与t'为目标视图到输入视图的相对摄像机参数;Wherein, the correspondence is established between the pixels on each plane of the input-image MPI and the pixels on each plane of the target-image MPI, t denotes the translation vector in the above camera parameters, n=[0,0,1]^T is the normal vector of the j-th plane, and R' and t' are the relative camera parameters from the target view to the input view;
S52、通过查询所述输入图像MPI逐像素对应的(c,a)向量从而得到目标视点方向上的四通道多平面体,其中c为颜色,a为颜色透明度;S52, the four-channel multi-plane volume in the direction of the target viewpoint is obtained by querying the (c, a) vector corresponding to each pixel of the input-image MPI, where c is the color and a is the color transparency;
S53、采用Alpha合成方式渲染目标视点的渲染图像,根据NeRF的体渲染公式得出累积透光率与每个深度位置透明度之间的关系如下式所示:S53, the rendered image of the target viewpoint is rendered by Alpha compositing. According to NeRF's volume rendering formula, the relationship between the cumulative transmittance and the transparency at each depth position is given by the following formula:
T_j=(1−α_1)(1−α_2)…(1−α_{j−1}) (7);
其中,α表示深度平面的合成权重,得出Î_t^i的计算公式如下式所示:Wherein, α denotes the compositing weight of a depth plane, and Î_t^i is computed as shown in the following formula:
Î_t^i = Σ_{j=1}^{D} c_j^t·α_j^t·∏_{k=1}^{j−1}(1−α_k^t) (8);
其中,Î_t^i表示第i个所述输入图像MPI变换至所述目标图像MPI得到的目标渲染图像,c_j^t和α_j^t分别表示一束由相机发出的射线与所述目标图像MPI在每个深度平面下的交点的RGB值和第j个深度平面的合成权重,α_k^t表示一束由相机发出的射线与所述目标图像MPI在第k个深度平面的合成权重,k的取值范围为1到(j−1);Wherein, Î_t^i denotes the target rendered image obtained by transforming the i-th input-image MPI to the target-image MPI, c_j^t and α_j^t denote, respectively, the RGB value at the intersection of a camera ray with the target-image MPI at each depth plane and the compositing weight of the j-th depth plane, and α_k^t denotes the compositing weight of a camera ray with the target-image MPI at the k-th depth plane, with k ranging from 1 to (j−1);
S54、用同样的方式对深度值进行聚合输出,得到目标视点的第i个渲染深度图D̂_t^i,计算过程如下式所示:S54, the depth values are aggregated and output in the same way to obtain the i-th rendered depth map D̂_t^i of the target viewpoint, calculated as shown in the following formula:
D̂_t^i = Σ_{j=1}^{D} z_j^t·α_j^t·∏_{k=1}^{j−1}(1−α_k^t) (9);
其中,z_j^t表示第j个深度平面距目标相机的距离;Wherein, z_j^t denotes the distance of the j-th depth plane from the target camera;
S55、通过上述渲染合成得到目标视图集合与深度图集合分别如下公式(10)和公式(11)所示:S55, the target view set and the depth map set obtained by the above rendering and compositing are shown in formula (10) and formula (11), respectively:
{Î_t^i | i=1,2,...,N} (10);
{D̂_t^i | i=1,2,...,N} (11);
S56、按照不同输入视点与目标视点间的距离划分MPI聚合权重,对上述得到的N幅目标视图与深度图进行加权求和得到最终的目标新视图,计算如下式所示:S56, the MPI aggregation weights are assigned according to the distances between the different input viewpoints and the target viewpoint, and the N target views and depth maps obtained above are weighted and summed to obtain the final new target view, calculated as shown in the following formula:
Î_t = Σ_{i=1}^{N} w_i·Î_t^i, D̂_t = Σ_{i=1}^{N} w_i·D̂_t^i;
其中,w_i表示MPI的聚合权重。Wherein, w_i denotes the aggregation weight of the MPI.
由上述描述可知,采用类似于NeRF中体渲染的Alpha合成方式渲染目标视点的渲染图像,确保透明度信息的处理效果;同时由于视场遮挡和视锥范围受限的问题,单个MPI不一定包含从目标相机位姿可见的所有内容,因此提出按照不同输入视点与目标视点间的距离划分MPI聚合权重,对上述得到的N幅目标视图与深度图进行加权求和得到最终的目标新视图。From the above description, it can be seen that the rendered image of the target viewpoint is rendered using an Alpha synthesis method similar to volume rendering in NeRF to ensure the processing effect of transparency information; at the same time, due to the problems of field of view occlusion and limited cone range, a single MPI does not necessarily contain all the content visible from the target camera pose. Therefore, it is proposed to divide the MPI aggregation weights according to the distance between different input viewpoints and the target viewpoint, and perform weighted summation of the N target views and depth maps obtained above to obtain the final new target view.
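To make the per-plane alpha compositing and the distance-weighted fusion described above concrete, the sketch below over-composites a target-view MPI front-to-back and then fuses N per-view renderings. Inverse-distance weights are used as one plausible choice, since the patent only states that viewpoint distance serves as the weight; all function names are illustrative.

```python
import torch

def alpha_composite(colors, alphas, plane_depths):
    """Over-composite an MPI at the target view.

    colors:       [D, 3, H, W]  RGB of each depth plane (c in the text)
    alphas:       [D, 1, H, W]  transparency of each plane (a / alpha)
    plane_depths: [D]           distance of each plane from the target camera
    returns: rendered image [3, H, W] and rendered depth [1, H, W]
    """
    # Cumulative transmittance T_j = (1 - a_1)(1 - a_2)...(1 - a_{j-1})
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]], dim=0), dim=0)
    weights = alphas * trans                                         # per-plane compositing weights
    image = (weights * colors).sum(dim=0)                            # colour compositing
    depth = (weights * plane_depths.view(-1, 1, 1, 1)).sum(dim=0)    # depth compositing
    return image, depth

def fuse_views(images, depths, view_dists):
    """Distance-weighted fusion of N per-view renderings (inverse-distance weights, an assumption)."""
    w = 1.0 / (view_dists + 1e-6)
    w = w / w.sum()
    return (w.view(-1, 1, 1, 1) * images).sum(dim=0), (w.view(-1, 1, 1, 1) * depths).sum(dim=0)

D, H, W = 32, 60, 80
img, dep = alpha_composite(torch.rand(D, 3, H, W), torch.rand(D, 1, H, W), torch.linspace(1.0, 10.0, D))
fused_img, fused_dep = fuse_views(torch.stack([img, img]), torch.stack([dep, dep]), torch.tensor([0.5, 2.0]))
print(img.shape, fused_img.shape)   # torch.Size([3, 60, 80]) torch.Size([3, 60, 80])
```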
进一步地,所述步骤S5之后还包括步骤:Furthermore, the step S5 further includes the following steps:
S6、使用损失函数对整个神经网络训练过程进行监督正则化,包括光度损失和边缘感知深度平滑度损失,所述光度损失为所述目标视图的每个像素颜色与真实值之间的损失,记为L_C,计算如下式所示:S6, a loss function is used to supervise and regularize the entire neural network training process, including a photometric loss and an edge-aware depth smoothness loss. The photometric loss is the loss between each pixel color of the target view and the ground truth, denoted as L_C, and is calculated as shown in the following formula:
L_C = Σ_{r∈R} ‖Ĉ(r) − C_gt(r)‖²;
其中,R表示一组于多平面上采样的射线,Ĉ(r)表示射线r穿过像素点的颜色值,C_gt(r)表示像素颜色真实值;Wherein, R denotes a set of rays sampled over the multiple planes, Ĉ(r) denotes the color value of the pixel that ray r passes through, and C_gt(r) denotes the ground-truth pixel color;
所述边缘感知深度平滑度损失记为L_smooth,计算如下式所示:The edge-aware depth smoothness loss is denoted as L_smooth, and is calculated as shown in the following formula:
其中,∂_xx、∂_xy和∂_yy表示图像在不同方向上的二阶梯度,∇²Î_t(x_t)表示渲染图像在x_t处像素值的拉普拉斯算子,D̂_t^i表示得到的目标视点的第i个渲染深度图,下标x和y分别表示关于x和y方向上的导数,其中∂_xx代表求两次关于x的导数,∂_xy代表求x导之后再求关于y的导,∂_yy代表求两次关于y的导数;Wherein, ∂_xx, ∂_xy and ∂_yy denote the second-order gradients of the image in different directions, ∇²Î_t(x_t) denotes the Laplacian of the pixel value of the rendered image at x_t, D̂_t^i denotes the i-th rendered depth map of the target viewpoint, and the subscripts x and y denote derivatives with respect to the x and y directions respectively, where ∂_xx denotes taking the derivative with respect to x twice, ∂_xy denotes taking the derivative with respect to x and then with respect to y, and ∂_yy denotes taking the derivative with respect to y twice;
则整体损失函数如下式所示:The overall loss function is as follows:
L=λ_C·L_C+λ_S·L_smooth (14);
其中,λ_C和λ_S分别为所述光度损失和所述边缘感知深度平滑度损失的损失项超参数。Wherein, λ_C and λ_S are the loss-term hyperparameters of the photometric loss and the edge-aware depth smoothness loss, respectively.
由上述描述可知,采用两个部分损失项组成的损失函数完成对模型的监督训练,对合成的深度信息进行正则化,从而引导模型学习正确的场景几何。From the above description, it can be seen that a loss function consisting of two partial loss terms is used to complete the supervised training of the model, and the synthesized depth information is regularized to guide the model to learn the correct scene geometry.
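A sketch of the two supervision terms: the photometric loss over sampled rays and an edge-aware second-order depth smoothness term. The exact smoothness formulation shown (second-order depth gradients attenuated by second-order image intensity changes) is an assumption consistent with the symbols defined above, not necessarily the patent's exact formula, and the λ values are placeholders.

```python
import torch

def photometric_loss(pred_rgb, gt_rgb):
    """L_C: squared error between rendered and ground-truth colours of sampled rays."""
    return ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()

def second_derivs(x):
    """Second-order finite differences d_xx, d_yy of a [B, 1, H, W] map."""
    dxx = x[..., :, 2:] - 2 * x[..., :, 1:-1] + x[..., :, :-2]
    dyy = x[..., 2:, :] - 2 * x[..., 1:-1, :] + x[..., :-2, :]
    return dxx, dyy

def edge_aware_smoothness(depth, image):
    """L_smooth (assumed form): penalise second-order depth gradients, down-weighted
    where the rendered image itself has strong second-order intensity changes."""
    gray = image.mean(dim=1, keepdim=True)               # [B, 1, H, W]
    d_xx, d_yy = second_derivs(depth)
    i_xx, i_yy = second_derivs(gray)
    return (d_xx.abs() * torch.exp(-i_xx.abs())).mean() + (d_yy.abs() * torch.exp(-i_yy.abs())).mean()

def total_loss(pred_rgb, gt_rgb, depth, image, lam_c=1.0, lam_s=0.1):
    """L = lambda_C * L_C + lambda_S * L_smooth; lambda values are placeholders."""
    return lam_c * photometric_loss(pred_rgb, gt_rgb) + lam_s * edge_aware_smoothness(depth, image)

loss = total_loss(torch.rand(1024, 3), torch.rand(1024, 3), torch.rand(1, 1, 60, 80), torch.rand(1, 3, 60, 80))
print(float(loss))
```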
请参照图6,一种基于多平面图像的视图虚拟视点合成终端,包括存储器、处理器以及存储在所述存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如上所述的一种基于多平面图像的视图虚拟视点合成方法中的步骤。Please refer to Figure 6, a view virtual viewpoint synthesis terminal based on multi-plane images includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps in the view virtual viewpoint synthesis method based on multi-plane images as described above when executing the computer program.
由上述描述可知,本发明的有益效果在于:基于同一技术构思,配合上述的一种基于多平面图像的视图虚拟视点合成方法,提供一种基于多平面图像的视图虚拟视点合成终端,针对NeRF在少量视图可用条件下的视点合成质量退化问题,通过引入金字塔结构编码器VPF来提取图像的全局多尺度特征,同时使用神经代价体NCV编码技术对视图间局部外观特征进行建模,并将二者用于引导解码端建立每个输入视点的多平面图像MPI,并再输出各深度平面的颜色与透明度后,通过单应变换和Alpha渲染得到各MPI投影至目标视点的合成结果,另外还采用以视点距离为权重的加权融合方式得到最终的新视点图像,实现在具有复杂纹理的数据集上进行了评估实验,在少量视图可用条件下也能够有效地避免视图过拟合问题,相较于其他对比方法具有更好的几何结构重建能力。From the above description, it can be seen that the beneficial effects of the present invention are: based on the same technical concept, in conjunction with the above-mentioned method for virtual viewpoint synthesis of views based on multi-plane images, a virtual viewpoint synthesis terminal for views based on multi-plane images is provided, and the viewpoint synthesis quality degradation problem of NeRF under the condition that a small number of views are available is solved by introducing a pyramid structure encoder VPF to extract the global multi-scale features of the image, and at the same time, the neural cost volume NCV coding technology is used to model the local appearance features between views, and the two are used to guide the decoding end to establish a multi-plane image MPI for each input viewpoint, and then output the color and transparency of each depth plane, and then obtain the synthesis result of each MPI projected to the target viewpoint through homography transformation and Alpha rendering, and also adopt a weighted fusion method with viewpoint distance as weight to obtain the final new viewpoint image, and the evaluation experiment is carried out on a data set with complex textures. Under the condition that a small number of views are available, the problem of view overfitting can be effectively avoided, and compared with other comparison methods, it has better geometric structure reconstruction capability.
本发明提供的一种基于多平面图像的视图虚拟视点合成方法及终端,主要应用于少量视图可用条件下的几何结构重建场景中,下面结合具体实施例进行具体说明:The present invention provides a method and terminal for synthesizing virtual viewpoints based on multi-plane images, which are mainly used in geometric structure reconstruction scenes under the condition that a small number of views are available. The following is a specific description in conjunction with specific embodiments:
请参照图1及图2,本发明的实施例一为:Please refer to Figures 1 and 2, the first embodiment of the present invention is:
一种基于多平面图像的视图虚拟视点合成方法,如图1及图2所示,包括步骤:A method for synthesizing virtual viewpoints of views based on multi-plane images, as shown in FIG1 and FIG2, comprises the following steps:
S1、采集场景中多视点观测数据,包括多视点图像、相机内参与外参和相机位姿数据,利用预处理算法得到三维场景重建范围,确定多平面图像MPI的深度预设值。在本实施例中,预处理算法可以采用COLMAP预处理算法。S1. Collect multi-view observation data in the scene, including multi-view images, camera intrinsics and extrinsic parameters, and camera pose data, use a preprocessing algorithm to obtain a 3D scene reconstruction range, and determine a depth preset value of a multi-plane image MPI. In this embodiment, the preprocessing algorithm may use a COLMAP preprocessing algorithm.
S2、将采集的多视点图像送入VPF特征编码器模块,提取每幅图像的多尺度特征图。S2. Send the collected multi-view images to the VPF feature encoder module to extract the multi-scale feature map of each image.
S3、使用神经代价体NCV编码技术对视图间局部外观特征进行建模得到局部外观模型。S3. Use neural cost volume (NCV) coding technology to model the local appearance features between views to obtain a local appearance model.
S4、将多尺度特征图和局部外观模型用于引导解码端建立每个输入视点的多平面图像MPI。S4. The multi-scale feature map and the local appearance model are used to guide the decoder to establish a multi-plane image MPI for each input viewpoint.
S5、输出各多平面图像MPI的颜色与透明度后,通过单应变换和Alpha渲染得到各多平面图像MPI投影至目标视点的初步合成结果,并采用以视点距离为权重的加权融合方式得到最终的新视点图像。S5. After outputting the color and transparency of each multi-plane image MPI, a preliminary synthesis result of projecting each multi-plane image MPI to the target viewpoint is obtained through homography transformation and Alpha rendering, and a weighted fusion method with viewpoint distance as weight is used to obtain the final new viewpoint image.
即在本实施例中,针对NeRF在少量视图可用条件下的视点合成质量退化问题,通过引入金字塔结构编码器VPF来提取图像的全局多尺度特征,同时使用神经代价体NCV编码技术对视图间局部外观特征进行建模,并将二者用于引导解码端建立每个输入视点的多平面图像MPI,并再输出各深度平面的颜色与透明度后,通过单应变换和Alpha渲染得到各MPI投影至目标视点的合成结果,另外还采用以视点距离为权重的加权融合方式得到最终的新视点图像,实现在具有复杂纹理的数据集上进行了评估实验,在少量视图可用条件下也能够有效地避免视图过拟合问题,相较于其他对比方法具有更好的几何结构重建能力。That is, in this embodiment, in order to address the viewpoint synthesis quality degradation problem of NeRF when a small number of views are available, a pyramid structure encoder VPF is introduced to extract the global multi-scale features of the image, and the neural cost volume NCV coding technology is used to model the local appearance features between views. The two are used to guide the decoding end to establish a multi-plane image MPI for each input viewpoint, and then output the color and transparency of each depth plane. The synthesis result of each MPI projected to the target viewpoint is obtained through homography transformation and Alpha rendering. In addition, a weighted fusion method with viewpoint distance as weight is used to obtain the final new viewpoint image. Evaluation experiments are carried out on a data set with complex textures. Even when a small number of views are available, the problem of view overfitting can be effectively avoided, and compared with other comparison methods, it has better geometric structure reconstruction capability.
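The overall flow of steps S2-S5 can be summarised by the following sketch, which only shows how the pieces connect. Every callable it takes (encoder, cost-volume construction, decoder, rendering) is a placeholder standing in for the components described in this embodiment, not an existing API; the toy wiring at the end exists solely to show the data flow.

```python
from typing import Callable, Sequence
import torch

def synthesize_novel_view(
    images: Sequence[torch.Tensor],          # N input views, each [3, H, W]
    encode: Callable,                        # VPF encoder: image -> features (S2)
    build_cost_volume: Callable,             # NCV construction per reference view (S3)
    decode_mpi: Callable,                    # decoder producing per-view MPI colours/alphas (S4)
    render_to_target: Callable,              # homography warp + alpha compositing (S5)
    view_dists: torch.Tensor,                # distance of each input viewpoint to the target
):
    """High-level data flow of S2-S5; all callables are placeholders."""
    feats = [encode(img) for img in images]
    renders = []
    for i in range(len(images)):
        ncv = build_cost_volume(i, feats)                 # variance-based cost volume for view i
        colors, alphas = decode_mpi(feats[i], ncv)        # per-plane colour and transparency
        renders.append(render_to_target(colors, alphas))  # warp + alpha-composite to target view
    w = 1.0 / (view_dists + 1e-6)
    w = w / w.sum()                                       # distance-based fusion weights
    return torch.stack([wi * r for wi, r in zip(w, renders)]).sum(dim=0)

# Toy wiring with stand-in lambdas, only to show that the pieces connect:
out = synthesize_novel_view(
    [torch.rand(3, 8, 8) for _ in range(3)],
    encode=lambda x: x,
    build_cost_volume=lambda i, feats: None,
    decode_mpi=lambda f, ncv: (f, torch.ones_like(f[:1])),
    render_to_target=lambda c, a: c,
    view_dists=torch.tensor([1.0, 2.0, 3.0]),
)
print(out.shape)   # torch.Size([3, 8, 8])
```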
请参照图3至图5,本发明的实施例二为:Please refer to Figures 3 to 5, the second embodiment of the present invention is:
一种基于多平面图像的视图虚拟视点合成方法,在上述实施例一的基础上,在本实施例中,步骤S2具体为:A method for synthesizing virtual viewpoints of views based on multi-plane images. Based on the above-mentioned embodiment 1, in this embodiment, step S2 is specifically as follows:
S21、将采集的多视点图像送入VPF特征编码器模块,如图3所示,在本实施例中,VPF特征编码器模块包括四个下采样阶段,且每个下采样阶段中包括一次Patch Embedding操作与一个空间压缩多头注意力模块HWMHT。其中Patch Embedding操作为ViT中对多视点图像的原始特征图进行的分块操作。S21, sending the collected multi-view images to the VPF feature encoder module, as shown in Figure 3. In this embodiment, the VPF feature encoder module includes four downsampling stages, and each downsampling stage includes a Patch Embedding operation and a spatial compression multi-head attention module HWMHT. The Patch Embedding operation is the patch partitioning operation performed in ViT on the original feature map of the multi-view images.
S22、Patch Embedding操作中采用卷积操作将原始特征图划分成大小为(P×P)的(HW/P²)个Patch,其中H和W分别为原始特征图的长和宽。In the S22 Patch Embedding operation, a convolution operation is used to divide the original feature map into (HW/P²) patches of size (P×P), where H and W are the length and width of the original feature map, respectively.
S23、再将每个Patch进行Flatten展平以及线性层映射得到维度为[C_S×(HW/P²)]的特征向量,将每阶段中的特征记为F_S∈R^(H_S×W_S×C_S),将位置编码后的向量与展平后的特征向量进行逐元素相加,完成相机位置信息的嵌入,F_S表示第S阶段的特征向量,R表示特征向量空间,H_S、W_S和C_S分别表示第S阶段每层输出的特征的长、宽和通道数。S23, each patch is then flattened and mapped through a linear layer to obtain a feature vector of dimension [C_S×(HW/P²)], and the features at each stage are denoted as F_S∈R^(H_S×W_S×C_S). The position-encoded vector is added element-wise to the flattened feature vector to complete the embedding of the camera position information, where F_S denotes the feature vector of stage S, R denotes the feature vector space, and H_S, W_S and C_S denote the length, width and number of channels of the features output at each layer of stage S, respectively.
S24、空间压缩多头注意力模块HWMHT包括依次连接的一个Layer Normalization归一化层、一个深度可分离卷积模块、一个多头自注意力计算模块、一个Layer Normalization归一化层和一个前馈网络层,在本实施例中,深度可分离卷积模块计算过程可如下式所示:S24, the spatial compression multi-head attention module HWMHT includes, connected in sequence, a Layer Normalization layer, a depthwise separable convolution module, a multi-head self-attention calculation module, a Layer Normalization layer and a feed-forward network layer. In this embodiment, the computation of the depthwise separable convolution module can be expressed as follows:
其中,DS代表深度可分离卷积,C_S为卷积核个数,t为卷积核的大小和步长,N_S为第S阶段特征尺度之积,即H_S×W_S。Wherein, DS stands for depthwise separable convolution, C_S is the number of convolution kernels, t is the size and stride of the convolution kernel, and N_S is the product of the feature scales of stage S, i.e., H_S×W_S.
多头自注意力计算模块计算过程如下式所示:The calculation process of the multi-head self-attention calculation module is shown in the following formula:
MSA(Q,K,V)=Concat(H_0,...,H_h)W_0;
其中,Q、K和V均为特征向量,H_h表示Q、K和V分成h个头后的集合,W_0表示权重矩阵。Wherein, Q, K and V are all feature vectors, H_h denotes the set obtained after splitting Q, K and V into h heads, and W_0 denotes the weight matrix.
在每个下采样阶段结束后都输出此时的特征图F_S。则在本实施例中,步骤S2中还包括:After each downsampling stage, the feature map F_S at that stage is output. Then, in this embodiment, step S2 also includes:
四个下采样阶段中VPF特征编码器模块的Patch尺度大小均设置为2,并定义整体网络中四层的输出特征通道数C_S={256,512,1024,2048},每阶段中叠加多个空间压缩多头注意力模块HWMHT,每个阶段的空间压缩多头注意力模块HWMHT数量设置为L_S={3,3,6,3}。The patch scale size of the VPF feature encoder module in the four downsampling stages is set to 2, and the number of output feature channels of the four layers in the overall network is defined as C_S={256,512,1024,2048}. Multiple spatial compression multi-head attention modules HWMHT are stacked in each stage, and the number of HWMHT modules in each stage is set to L_S={3,3,6,3}.
然后再将每个下采样阶段后输出的特征图的大小系数分别定义为{1/2,1/4,1/8,1/16},并将每一幅特征图连接至解码端作为解码输入。Then, the size coefficients of the feature maps output after each downsampling stage are defined as {1/2, 1/4, 1/8, 1/16}, and each feature map is connected to the decoding end as the decoding input.
即在本实施例中,通过在金字塔结构编码器中引入Vision Transformer来提取图像的全局多尺度特征,并通过多次下采样操作,生成不同分辨率的图像层次结构,从而实现多尺度的图像处理和分析。That is, in this embodiment, the Vision Transformer is introduced into the pyramid structure encoder to extract the global multi-scale features of the image, and through multiple downsampling operations, image hierarchical structures of different resolutions are generated, thereby realizing multi-scale image processing and analysis.
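The sketch below shows one way to realise the HWMHT block described above: LayerNorm, a depthwise-separable convolution that spatially compresses the keys and values, multi-head self-attention, a second LayerNorm, and a feed-forward layer. The spatial-reduction interpretation is an assumption inferred from the module name and the depthwise-separable convolution step; the class name and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

class HWMHT(nn.Module):
    """Spatially-compressed multi-head attention block (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 8, reduce: int = 2, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Depthwise-separable conv compresses the H x W token grid before attention
        self.reduce = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=reduce, stride=reduce, groups=dim),  # depthwise
            nn.Conv2d(dim, dim, kernel_size=1),                                  # pointwise
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, h, w):                                 # x: [B, H*W, C]
        q = self.norm1(x)
        kv = q.transpose(1, 2).reshape(-1, q.shape[-1], h, w)   # back to [B, C, H, W]
        kv = self.reduce(kv).flatten(2).transpose(1, 2)         # compressed tokens [B, HW/r^2, C]
        x = x + self.attn(q, kv, kv, need_weights=False)[0]     # multi-head self-attention + residual
        x = x + self.ffn(self.norm2(x))                         # feed-forward + residual
        return x

block = HWMHT(dim=256)
out = block(torch.randn(1, 32 * 32, 256), h=32, w=32)
print(out.shape)   # torch.Size([1, 1024, 256])
```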
其中在本实施例中,如图4所示,对输入图像都建立各自的神经代价体,In this embodiment, as shown in FIG4 , a neural cost volume is established for each input image.
即步骤S3具体为:That is, step S3 is specifically:
S31、对于N幅多视点图像,依次确定一张视图为参考视点图像并记作I_r,其余视图作为辅助构造视图,利用CNN网络分别对N幅多视点图像提取二维尺度不变的特征图F_i∈R^(H×W×C),H、W和C分别为特征的高height、宽width和通道数channel,计算过程如下式所示:S31. For the N multi-view images, one view at a time is designated as the reference viewpoint image, denoted I_r, and the remaining views serve as auxiliary construction views. A CNN network is used to extract a two-dimensional scale-invariant feature map F_i∈R^(H×W×C) from each of the N multi-view images, where H, W and C are the feature height, width and number of channels, respectively. The calculation process is shown in the following formula:
F_i(u,v)=LWC(I_i) (1);
其中,u和v分别表示特征图水平与垂直方向的索引,LWC()表示轻量级CNN网络所代表的函数,i表示输入视图索引,即1-N幅输入图像中的第i幅,I_i表示输入的第i幅视图图像,公式(1)表示输入的第i幅图像经过神经网络LWC得到二维尺度不变的特征图F_i,特征图F_i以u和v为坐标系。Wherein, u and v denote the horizontal and vertical indices of the feature map, LWC() denotes the function represented by the lightweight CNN network, i denotes the input view index, i.e., the i-th of the 1-N input images, and I_i denotes the i-th input view image. Formula (1) means that the i-th input image is passed through the neural network LWC to obtain the two-dimensional scale-invariant feature map F_i, with u and v as its coordinate system.
S32、根据COLMAP预处理算法从N幅多视点图像中得出神经代价体NCV和多平面图像MPI的构建范围约束如下式所示:S32. According to the COLMAP preprocessing algorithm, the construction range constraints of the neural cost volume NCV and the multi-plane image MPI are derived from the N multi-viewpoint images, as shown in the following formula:
z_j ∼ U(z_n, z_f), j=1,2,...,D (2);
其中,z_n为近边界平面,z_f为远边界平面,j表示深度平面图像索引,即从1-D个深度平面图像中的第j个,z_j表示第j个深度平面图像的构建范围,U表示z_j的取值区间,从中给出一组随机预设深度样本值{z_j|j=1,2,...,D},D表示参考视点图像的深度平面个数。Wherein, z_n is the near boundary plane, z_f is the far boundary plane, j denotes the depth plane image index, i.e., the j-th of the 1-D depth plane images, z_j denotes the construction range of the j-th depth plane image, and U denotes the value interval of z_j, from which a set of random preset depth sample values {z_j|j=1,2,...,D} is drawn; D denotes the number of depth planes of the reference viewpoint image.
S33、将N-1幅辅助视图通过单应变换W(·)投影至参考视点图像的D个深度平面,其中对于每个输入特征图I_i投影至参考视点图像I_r的z_j深度平面下的单应变换矩阵表示如下式所示:S33, projecting the N-1 auxiliary views to the D depth planes of the reference viewpoint image through the homography transformation W(·), wherein the homography transformation matrix for projecting each input feature map I_i to the depth plane z_j of the reference viewpoint image I_r is expressed as shown in the following formula:
其中,Φ_r=[K_r,R_r,t_r]为参考视点图像对应的相机参数,Φ_i=[K_i,R_i,t_i]为输入相机参数,n_r表示参考视点图像的平面朝向视点方向的法向量。Wherein, Φ_r=[K_r,R_r,t_r] are the camera parameters corresponding to the reference viewpoint image, Φ_i=[K_i,R_i,t_i] are the input camera parameters, and n_r denotes the normal vector of the plane of the reference viewpoint image pointing toward the viewpoint direction.
其中步骤S33中还包括:Wherein step S33 also includes:
将通过单应变换得到参考视点图像I_r的视图前向锥体空间内由D个深度平面下的N个特征体构成的五维代价空间表示为C∈R^(D×N×H×W×C),完成通道维度数据在像素位置处的堆叠后,在输入视图个数的维度上计算每个深度下特征体的方差来进行代价度量,将五维代价空间降为四维代价体,计算过程如下式所示:The five-dimensional cost space composed of the N feature volumes under the D depth planes in the forward view frustum of the reference viewpoint image I_r obtained by homography transformation is denoted as C∈R^(D×N×H×W×C). After stacking the channel-dimension data at each pixel position, the variance of the feature volumes at each depth is computed along the dimension of the number of input views as the cost metric, reducing the five-dimensional cost space to a four-dimensional cost volume, as shown in the following formula:
C_j = (1/N)·Σ_{i=1}^{N}(F_z^i − F̄_z)² (4);
其中,C_j为第j个深度平面下的特征代价体,F_z^i表示在z深度平面下的第i幅特征图,F_z^N表示在z深度平面下的第N幅特征图,F̄_z表示在z深度平面下的特征图平均值。Wherein, C_j is the feature cost volume at the j-th depth plane, F_z^i denotes the i-th feature map at depth plane z, F_z^N denotes the N-th feature map at depth plane z, and F̄_z denotes the mean of the feature maps at depth plane z.
也即由上述公式得到D个基于方差的代价体,其能够编码不同输入视图图像的边缘纹理外观阴影变化,量化了视图间特征相似性与预设深度准确性之间的关系,能够很好的为解码端提供场景中的遮挡外观变化。That is, D variance-based cost volumes are obtained from the above formula, which can encode the edge texture appearance shadow changes of different input view images, quantify the relationship between the feature similarity between views and the preset depth accuracy, and can well provide the decoding end with the occlusion appearance changes in the scene.
同时,在本实施例中,步骤S4具体为:Meanwhile, in this embodiment, step S4 is specifically as follows:
S41、解码端通过多层解码器为每个多视点图像构建对应的多平面图像MPI。S41. The decoding end constructs a corresponding multi-plane image MPI for each multi-viewpoint image through a multi-layer decoder.
S42、根据深度预设值,利用联合分层采样策略得到D个深度平面,将多平面图像MPI扩展为连续可迭代训练的场景表示,具体为:S42, according to the preset depth value, a joint layered sampling strategy is used to obtain D depth planes, and the multi-plane image MPI is expanded into a scene representation that can be continuously iteratively trained, specifically:
将深度预设值通过NeRF中的编码函数编码为高维向量γ(zj),编码后如下式所示:The depth preset value is encoded into a high-dimensional vector γ(zj ) through the encoding function in NeRF, which is shown in the following formula after encoding:
γ(z_j)=(sin(2^0·π·z_j), cos(2^0·π·z_j), ..., sin(2^(L−1)·π·z_j), cos(2^(L−1)·π·z_j)) (5);
其中,L表示编码过程中的最大特征频率。Wherein, L represents the maximum characteristic frequency in the encoding process.
其中在本实施例中,步骤S4中还包括:In this embodiment, step S4 also includes:
多层解码器包括四个上采样块,每个上采样块都由一个3×3卷积层(Pad=1,stride=1)、一个批次归一化层、一个ELU激活函数层以及一个两倍近邻上采样层组成。The multi-layer decoder consists of four upsampling blocks, each of which consists of a 3×3 convolutional layer (Pad=1, stride=1), a batch normalization layer, an ELU activation function layer, and a 2× nearest-neighbor upsampling layer.
即先根据预设深度进行第一次平面采样并且通过解码器得到各平面的合成权重,再根据合成权重回传生成采样概率,从而对空间进行第二次深度采样得到最终的平面,以此将MPI扩展为连续可迭代训练的场景表示。That is, the first plane sampling is performed according to the preset depth and the synthesis weights of each plane are obtained through the decoder. Then, the sampling probability is generated based on the synthesis weights, and the space is sampled for the second time to obtain the final plane, thereby expanding MPI into a scene representation that can be continuously iteratively trained.
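The two-pass plane sampling described above can be sketched as follows: a first uniform set of planes yields per-plane compositing weights, which are turned into a sampling probability used to place a second, finer set of planes. The inverse-transform resampling shown here is one plausible realisation of "generating sampling probabilities from the synthesis weights"; the function name and parameters are illustrative.

```python
import torch

def resample_depth_planes(z_coarse: torch.Tensor, weights: torch.Tensor, d_fine: int) -> torch.Tensor:
    """Second-pass depth sampling guided by first-pass compositing weights.

    z_coarse: [D] initial preset plane depths (sorted, from the COLMAP near/far range)
    weights:  [D] per-plane synthesis weights from the first decoding pass
    d_fine:   number of planes to draw in the second pass
    """
    pdf = weights / (weights.sum() + 1e-8)            # sampling probability per coarse plane
    cdf = torch.cumsum(pdf, dim=0)
    u = torch.rand(d_fine)                            # uniform samples for inverse-transform sampling
    idx = torch.searchsorted(cdf, u).clamp(max=len(z_coarse) - 1)
    z_fine = z_coarse[idx]                            # depths concentrated where weights are large
    return torch.sort(z_fine).values

z0 = torch.linspace(2.0, 10.0, 32)                    # first-pass preset depths
w = torch.softmax(torch.randn(32), dim=0)             # stand-in for decoder synthesis weights
print(resample_depth_planes(z0, w, d_fine=32))
```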
在本实施例中,如图5所示,步骤S5具体为:In this embodiment, as shown in FIG5 , step S5 is specifically as follows:
S51、使用投影Warping操作,采用逆单应变换函数H(.)建立多平面图像MPI中位于不同深度下的平面像素点与目标视点MPI像素的对应关系如下式所示:S51, using the projection Warping operation, adopting the inverse homography transformation function H(.) to establish the correspondence between the plane pixel points at different depths in the multi-plane image MPI and the target viewpoint MPI pixels as shown in the following formula:
其中,对应关系建立在输入图像MPI每个平面上的像素与目标图像MPI每个平面上的像素之间,t表示上述相机参数中的平移向量,n=[0,0,1]^T为第j个平面的法向量,R'与t'为目标视图到输入视图的相对摄像机参数。Wherein, the correspondence is established between the pixels on each plane of the input-image MPI and the pixels on each plane of the target-image MPI, t denotes the translation vector in the above camera parameters, n=[0,0,1]^T is the normal vector of the j-th plane, and R' and t' are the relative camera parameters from the target view to the input view.
S52、通过查询输入图像MPI逐像素对应的(c,a)向量从而得到目标视点方向上的四通道多平面体,其中c为颜色,a为颜色透明度。S52, the four-channel multi-plane volume in the direction of the target viewpoint is obtained by querying the (c, a) vector corresponding to each pixel of the input-image MPI, where c is the color and a is the color transparency.
S53、采用类似于NeRF中体渲染的Alpha合成方式渲染目标视点的渲染图像,根据NeRF的体渲染公式得出累积透光率与每个深度位置透明度之间的关系如下式所示:S53, the rendered image of the target viewpoint is rendered by an Alpha compositing method similar to volume rendering in NeRF. According to NeRF's volume rendering formula, the relationship between the cumulative transmittance and the transparency at each depth position is given by the following formula:
T_j=(1−α_1)(1−α_2)…(1−α_{j−1}) (7);
其中,α表示深度平面的合成权重,得出Î_t^i的计算公式如下式所示:Wherein, α denotes the compositing weight of a depth plane, and Î_t^i is computed as shown in the following formula:
Î_t^i = Σ_{j=1}^{D} c_j^t·α_j^t·∏_{k=1}^{j−1}(1−α_k^t) (8);
其中,Î_t^i表示第i个输入图像MPI变换至目标图像MPI得到的目标渲染图像,c_j^t和α_j^t分别表示一束由相机发出的射线与目标图像MPI在每个深度平面下的交点的RGB值和第j个深度平面的合成权重,α_k^t表示一束由相机发出的射线与所述目标图像MPI在第k个深度平面的合成权重,k的取值范围为1到(j−1)。Wherein, Î_t^i denotes the target rendered image obtained by transforming the i-th input-image MPI to the target-image MPI, c_j^t and α_j^t denote, respectively, the RGB value at the intersection of a camera ray with the target-image MPI at each depth plane and the compositing weight of the j-th depth plane, and α_k^t denotes the compositing weight of a camera ray with the target-image MPI at the k-th depth plane, with k ranging from 1 to (j−1).
S54、用同样的方式对深度值进行聚合输出,得到目标视点的第i个渲染深度图D̂_t^i,计算过程如下式所示:S54, the depth values are aggregated and output in the same way to obtain the i-th rendered depth map D̂_t^i of the target viewpoint, calculated as shown in the following formula:
D̂_t^i = Σ_{j=1}^{D} z_j^t·α_j^t·∏_{k=1}^{j−1}(1−α_k^t) (9);
其中,z_j^t表示第j个深度平面距目标相机的距离。Wherein, z_j^t denotes the distance of the j-th depth plane from the target camera.
S55、通过上述渲染合成得到目标视图集合与深度图集合分别如下公式(10)和公式(11)所示:S55, the target view set and the depth map set obtained by the above rendering and compositing are shown in formula (10) and formula (11), respectively:
{Î_t^i | i=1,2,...,N} (10);
{D̂_t^i | i=1,2,...,N} (11);
S56、按照不同输入视点与目标视点间的距离划分MPI聚合权重,对上述得到的N幅目标视图与深度图进行加权求和得到最终的目标新视图,计算如下式所示:S56, the MPI aggregation weights are assigned according to the distances between the different input viewpoints and the target viewpoint, and the N target views and depth maps obtained above are weighted and summed to obtain the final new target view, calculated as shown in the following formula:
Î_t = Σ_{i=1}^{N} w_i·Î_t^i, D̂_t = Σ_{i=1}^{N} w_i·D̂_t^i;
其中,w_i表示MPI的聚合权重。Wherein, w_i denotes the aggregation weight of the MPI.
即采用类似于NeRF中体渲染的Alpha合成方式渲染目标视点的渲染图像,确保透明度信息的处理效果;同时由于视场遮挡和视锥范围受限的问题,单个MPI不一定包含从目标相机位姿可见的所有内容,因此提出按照不同输入视点与目标视点间的距离划分MPI聚合权重,对上述得到的N幅目标视图与深度图进行加权求和得到最终的目标新视图。That is, the rendered image of the target viewpoint is rendered using an Alpha synthesis method similar to volume rendering in NeRF to ensure the processing effect of transparency information; at the same time, due to the problems of field of view occlusion and limited viewing cone range, a single MPI does not necessarily contain all the content visible from the target camera pose. Therefore, it is proposed to divide the MPI aggregation weights according to the distance between different input viewpoints and the target viewpoint, and perform weighted summation of the N target views and depth maps obtained above to obtain the final target new view.
本发明的实施例三为:Embodiment 3 of the present invention is:
一种基于多平面图像的视图虚拟视点合成方法,在上述实施例一或实施例二的基础上,在本实施例中,步骤S5之后还包括步骤:A method for synthesizing virtual viewpoints of views based on multi-plane images, based on the above-mentioned embodiment 1 or embodiment 2, in this embodiment, after step S5, further comprising the following steps:
S6、使用损失函数对整个神经网络训练过程进行监督正则化,包括光度损失和边缘感知深度平滑度损失,光度损失为目标视图的每个像素颜色与真实值之间的损失,记为L_C,计算如下式所示:S6, a loss function is used to supervise and regularize the entire neural network training process, including a photometric loss and an edge-aware depth smoothness loss. The photometric loss is the loss between each pixel color of the target view and the ground truth, denoted as L_C, and is calculated as shown in the following formula:
L_C = Σ_{r∈R} ‖Ĉ(r) − C_gt(r)‖²;
其中,R表示一组于多平面上采样的射线,Ĉ(r)表示射线r穿过像素点的颜色值,C_gt(r)表示像素颜色真实值;Wherein, R denotes a set of rays sampled over the multiple planes, Ĉ(r) denotes the color value of the pixel that ray r passes through, and C_gt(r) denotes the ground-truth pixel color;
边缘感知深度平滑度损失记为L_smooth,计算如下式所示:The edge-aware depth smoothness loss is denoted as L_smooth, and is calculated as shown in the following formula:
其中,∂_xx、∂_xy和∂_yy表示图像在不同方向上的二阶梯度,∇²Î_t(x_t)表示渲染图像在x_t处像素值的拉普拉斯算子,D̂_t^i表示得到的目标视点的第i个渲染深度图,下标x和y分别表示关于x和y方向上的导数,其中∂_xx代表求两次关于x的导数,∂_xy代表求x导之后再求关于y的导,∂_yy代表求两次关于y的导数;Wherein, ∂_xx, ∂_xy and ∂_yy denote the second-order gradients of the image in different directions, ∇²Î_t(x_t) denotes the Laplacian of the pixel value of the rendered image at x_t, D̂_t^i denotes the i-th rendered depth map of the target viewpoint, and the subscripts x and y denote derivatives with respect to the x and y directions respectively, where ∂_xx denotes taking the derivative with respect to x twice, ∂_xy denotes taking the derivative with respect to x and then with respect to y, and ∂_yy denotes taking the derivative with respect to y twice;
则整体损失函数如下式所示:The overall loss function is as follows:
L=λ_C·L_C+λ_S·L_smooth (14);
其中,λ_C和λ_S分别为光度损失和边缘感知深度平滑度损失的损失项超参数。Wherein, λ_C and λ_S are the loss-term hyperparameters of the photometric loss and the edge-aware depth smoothness loss, respectively.
即在本实施例中,采用两个部分损失项组成的损失函数完成对模型的监督训练,对合成的深度信息进行正则化,从而引导模型学习正确的场景几何。That is, in this embodiment, a loss function consisting of two partial loss terms is used to complete the supervised training of the model, and the synthesized depth information is regularized, thereby guiding the model to learn the correct scene geometry.
请参照图6,本发明的实施例四为:Please refer to FIG. 6 , the fourth embodiment of the present invention is:
一种基于多平面图像的视图虚拟视点合成终端1,包括存储器2、处理器3以及存储在存储器2上并可在处理器3上运行的计算机程序,处理器3执行计算机程序时完成上述实施例一或实施例三中任一实施例的一种基于多平面图像的视图虚拟视点合成方法中的步骤。A terminal 1 for synthesizing a view virtual viewpoint based on a multi-plane image comprises a memory 2, a processor 3 and a computer program stored in the memory 2 and executable on the processor 3. When the processor 3 executes the computer program, the steps of a method for synthesizing a view virtual viewpoint based on a multi-plane image in any one of the first embodiment or the third embodiment are completed.
综上所述,本发明提供的一种基于多平面图像的视图虚拟视点合成方法及终端,针对NeRF在少量视图可用条件下的视点合成质量退化问题,通过引入金字塔结构编码器VPF来提取图像的全局多尺度特征,同时使用神经代价体NCV编码技术对视图间局部外观特征进行建模,并将二者用于引导解码端建立每个输入视点的多平面图像MPI,并再输出各深度平面的颜色与透明度后,通过单应变换和Alpha渲染得到各MPI投影至目标视点的合成结果,另外还采用以视点距离为权重的加权融合方式得到最终的新视点图像,最后还采用两个部分损失项组成的损失函数完成对模型的监督训练,对合成的深度信息进行正则化,从而引导模型学习正确的场景几何,实现在具有复杂纹理的数据集上进行了评估实验,在少量视图可用条件下也能够有效地避免视图过拟合问题,相较于其他对比方法具有更好的几何结构重建能力。In summary, the present invention provides a method and terminal for virtual viewpoint synthesis based on multi-plane images. To address the problem of viewpoint synthesis quality degradation of NeRF under the condition of a small number of available views, a pyramid structure encoder VPF is introduced to extract global multi-scale features of the image, and a neural cost volume NCV coding technology is used to model local appearance features between views. The two are used to guide the decoding end to establish a multi-plane image MPI for each input viewpoint, and then output the color and transparency of each depth plane. The synthesis result of each MPI projected to the target viewpoint is obtained through homography transformation and Alpha rendering. In addition, a weighted fusion method with viewpoint distance as weight is used to obtain the final new viewpoint image. Finally, a loss function composed of two partial loss terms is used to complete the supervised training of the model, and the synthesized depth information is regularized, thereby guiding the model to learn the correct scene geometry. Evaluation experiments are carried out on data sets with complex textures. The problem of view overfitting can be effectively avoided under the condition of a small number of available views, and it has better geometric structure reconstruction capability than other comparison methods.
以上所述仅为本发明的实施例,并非因此限制本发明的专利范围,凡是利用本发明说明书及附图内容所作的等同变换,或直接或间接运用在相关的技术领域,均同理包括在本发明的专利保护范围内。The above descriptions are merely embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent transformations made using the contents of the present invention's specification and drawings, or directly or indirectly applied in related technical fields, are also included in the patent protection scope of the present invention.