CN118096978B - A method for rapid generation of 3D art content based on arbitrary stylization - Google Patents

A method for rapid generation of 3D art content based on arbitrary stylization

Info

Publication number
CN118096978B
CN118096978B (Application CN202410503092.9A)
Authority
CN
China
Prior art keywords
content
style
feature
features
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410503092.9A
Other languages
Chinese (zh)
Other versions
CN118096978A (en)
Inventor
邢树军
于迅博
汲鲁育
高鑫
许世鑫
刘博阳
高超
黄辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhenxiang Technology Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Shenzhen Zhenxiang Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhenxiang Technology Co ltd and Beijing University of Posts and Telecommunications
Priority to CN202410503092.9A
Publication of CN118096978A
Application granted
Publication of CN118096978B
Legal status: Active (current)
Anticipated expiration


Abstract

Translated from Chinese


The present invention discloses a method for rapidly generating 3D art content based on arbitrary stylization. The method constructs a feature-grid radiance field rich in high-level semantic information from multiple input content images and optimizes the storage structure of this radiance field through tensor decomposition. Adaptive feature enhancement is then applied to the content features along the channel and spatial dimensions, complementary multi-level style information is learned from an input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid. Finally, the decoder and the above components are jointly trained and optimized with a global quality loss function and a local detail loss function, enabling rapid generation of stylized art images of the content scene from any viewing angle. Compared with the prior art, the present invention achieves rapid generation of high-quality personalized 3D art content, is suitable for complex scenes, effectively avoids artifacts that degrade the visual effect, and better meets application requirements.

Description

Translated from Chinese

A method for rapid generation of 3D art content based on arbitrary stylization

Technical Field

The present invention relates to the technical field of 3D content generation and non-photorealistic stylized rendering, and in particular to a method for rapidly generating 3D art content based on arbitrary stylization.

Background

With the development of 3D visual devices and display technology, the demand for 3D content is growing. However, creating and generating 3D content currently requires high labor and time costs, and the creation and personalized generation of 3D art content is a particularly difficult problem at present. In practice, stylization can produce content with different visual effects by combining the content structure of an original image with another style, providing a new approach to artistic creation and personalized generation of 3D content.

In the prior art, most traditional stylization is performed on 2D images, and directly applying such methods to 3D stylization leads to multi-view inconsistency and artifacts. Current 3D stylization thus faces multiple difficulties, including multi-view inconsistency, poor stylization quality, inability to generalize to arbitrary styles, and long training times. Point-cloud-based 3D stylization methods are limited by the accuracy of the depth estimation step; stylizing complex scenes with them often produces many artifacts and requires a great deal of time for training and optimization. Single-style and multi-style transfer methods can only perform transfer for one or several specific styles and cannot be applied to arbitrary styles, which is inconvenient for users performing personalized artistic creation of 3D content. Furthermore, optimization-based learning methods built on neural radiance fields preserve the color and texture of the reference style poorly; the stylized results differ visibly from the original style, and much of the original content detail is lost.

Summary of the Invention

The technical problem to be solved by the present invention is, in view of the shortcomings of the prior art, to provide a method for rapidly generating 3D art content based on arbitrary stylization that supports personalized creation by users, is applicable to complex scenes, and avoids artifacts that would degrade the visual effect.

To solve the above technical problem, the present invention adopts the following technical solution.

A method for rapidly generating 3D art content based on arbitrary stylization comprises: step S1, constructing a feature-grid neural radiance field containing high-level semantic information from multiple input content images, the high-level semantic information being abstract features produced by a deep learning network; step S2, optimizing the storage structure of the feature-grid neural radiance field through tensor decomposition; step S3, obtaining a feature grid after performing adaptive feature enhancement on the content features along the channel and spatial dimensions, the content features being feature maps of the high-level semantic information; step S4, learning complementary multi-level style information from an input artistic style image; step S5, performing style transfer on the content feature map obtained by volume rendering of the feature grid; step S6, jointly training and optimizing the decoder, the adaptive feature enhancement components, and the component that extracts multi-level style information according to a global quality loss function and a local detail loss function, thereby generating stylized art images of the content scene from any viewing angle.

Preferably, step S1 includes: step S10, constructing an original neural radiance field based on a voxel grid from the input content images captured at multiple different viewpoints, the features stored in each voxel including volume density and original scene features representing color; step S11, extracting the high-level semantic information of the content images through a pre-trained convolutional neural network and reconstructing the feature grid, yielding a neural radiance field in which each voxel contains the volume density and the high-level semantic information; step S12, performing volume-adaptive instance normalization on the features of each sampling point of the feature grid and continuously updating the mean and variance during the training stage, thereby eliminating multi-view inconsistency caused by batch differences.

Preferably, in step S2, for the voxel-grid neural radiance field, feature information is stored as vectors and matrices along the X, Y, and Z directions, and tensor decomposition is used to reduce memory complexity.

Preferably, step S3 includes: step S30, applying pyramid pooling to the content features, feeding the features of different levels into convolutional layers to obtain adaptive channel attention maps at different scales, summing and fusing these maps, and multiplying the result with the original content features to achieve multi-level structural representation enhancement in the channel dimension; step S31, compressing the adaptively enhanced content features along the channel dimension, computing a fused adaptive spatial attention map through convolution kernels of different sizes and up- and down-sampling operations, and multiplying it with the content features to achieve multi-scale region-aware enhancement.

Preferably, step S30 includes: step S300, performing multi-scale pooling on the input content features to obtain regional features at different scales; step S301, feeding the multiple regional features into different convolutional layers to obtain multiple attention-weighted regional features; step S302, summing the multiple attention-weighted regional features, applying nonlinear activation, and outputting an adaptive channel attention map.

Preferably, in step S31: the channel dimension represents the number of channels of the feature map, each channel corresponding to one filter or convolution kernel; the spatial dimension is the spatial position of a point in the feature map along the height and width dimensions.

Preferably, step S31 includes: step S310, channel compression: performing max pooling and average pooling on the input features along the channel dimension and concatenating the two along the channel dimension; step S311, feature refinement: comprising a global average pooling branch and a feature-fusion perception branch over different pyramid scales, the global average pooling branch globally averaging the output features of step S310 over the spatial dimensions, passing them through one convolutional layer and then upsampling, and the feature-fusion perception branch being a U-shaped network structure; step S312, attention output: adding the outputs of the two branches of step S311, applying nonlinear activation, and outputting an adaptive spatial attention map.

Preferably, step S4 includes: step S40, passing the artistic style image through a pre-trained convolutional neural network to obtain a style feature map and then computing the mean and variance of the style feature map; step S41, serializing the style feature map, passing it through convolutional layers, computing the feature covariance, and applying a linear transformation to obtain a style transfer matrix; the mean, variance, and style transfer matrix of the style feature map computed in these steps serve as complementary multi-level information that characterizes the style of the artistic style image.

Preferably, step S5 includes: step S50, performing volume rendering of the adaptively enhanced feature grid from an arbitrary viewing angle to obtain a content feature map; step S51, applying mathematical operations between the style information extracted from the artistic style image and the content feature map to obtain a stylized feature map.

Preferably, step S6 includes: step S60, feeding the stylized feature map into a decoder based on a convolutional neural network to obtain a stylized image of the corresponding viewing angle in RGB space; step S61, optimizing and training the decoder, the adaptive feature enhancement components, and the component extracting multi-level style information with a designed joint global and local loss function; the joint loss function includes a global style loss term, a global content loss term, and a local detail preservation loss term based on the Laplacian matrix, each term adjusted by a corresponding weight.

In the method for rapidly generating 3D art content based on arbitrary stylization disclosed by the present invention, a feature-grid radiance field rich in high-level semantic information is first constructed from the multiple input content images, and the storage structure of this radiance field is optimized through tensor decomposition. Adaptive feature enhancement is then applied to the content features along the channel and spatial dimensions, complementary multi-level style information is learned from the input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid. Finally, the decoder and the above components are jointly trained and optimized with the global quality loss function and the local detail loss function, ultimately achieving rapid generation of stylized art images of the content scene from any viewing angle. Compared with the prior art, the present invention achieves rapid generation of high-quality personalized 3D art content and is suitable for complex scenes; the stylized art images generated by the present invention not only inherit the style of the art image with high quality but also preserve the detailed content structure of the complex scene, effectively avoiding artifacts that would degrade the visual effect. In addition, the present invention supports personalized creation by users and better meets application requirements.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method for rapidly generating 3D art content based on arbitrary stylization according to the present invention.

Detailed Description

The present invention is described in more detail below with reference to the accompanying drawings and embodiments.

The present invention discloses a method for rapidly generating 3D art content based on arbitrary stylization, as shown in FIG. 1, which includes:

Step S1, constructing a feature-grid neural radiance field containing high-level semantic information from multiple input content images, the high-level semantic information being abstract features produced by a deep learning network;

Step S2, optimizing the storage structure of the feature-grid neural radiance field through tensor decomposition;

Step S3, obtaining a feature grid after performing adaptive feature enhancement on the content features along the channel and spatial dimensions, the content features being feature maps of the high-level semantic information;

Step S4, learning complementary multi-level style information from the input artistic style image;

Step S5, performing style transfer on the content feature map obtained by volume rendering of the feature grid;

Step S6, jointly training and optimizing the decoder, the adaptive feature enhancement components, and the component extracting multi-level style information according to a global quality loss function and a local detail loss function, thereby generating stylized art images of the content scene from any viewing angle.

In the above method, a feature-grid radiance field rich in high-level semantic information is first constructed from the multiple input content images, and the storage structure of this radiance field is optimized through tensor decomposition. Adaptive feature enhancement is then applied to the content features along the channel and spatial dimensions, complementary multi-level style information is learned from the input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid. Finally, the decoder and the above components are jointly trained and optimized with the global quality loss function and the local detail loss function, ultimately achieving rapid generation of stylized art images of the content scene from any viewing angle. Compared with the prior art, the present invention achieves rapid generation of high-quality personalized 3D art content and is suitable for complex scenes; the stylized art images generated by the present invention not only inherit the style of the art image with high quality but also preserve the detailed content structure of the complex scene, effectively avoiding artifacts that would degrade the visual effect. In addition, the present invention supports personalized creation by users and better meets application requirements.

Furthermore, step S1 includes:

Step S10, constructing an original neural radiance field based on a voxel grid from the input content images captured at multiple different viewpoints, the features stored in each voxel including volume density and original scene features representing color;

Step S11, extracting the high-level semantic information of the content images through a pre-trained convolutional neural network, reconstructing the feature grid, and constructing a neural radiance field in which each voxel contains the volume density and the high-level semantic information;

Step S12, performing volume-adaptive instance normalization on the features of each sampling point of the feature grid and continuously updating the mean and variance during the training stage, thereby eliminating multi-view inconsistency caused by batch differences.

In step S1, a feature-grid radiance field rich in high-level semantic information is constructed from the multiple input content images. The details are as follows:

Given multi-view images of a scene and a reference style image, the goal of 3D style transfer is to generate novel views of the scene in the reference style. Style transfer allows designers or ordinary users to create artworks in an innovative way: by applying paintings and artworks of different styles to images, novel and unique artistic effects can be produced, expanding the means of creation and personalized customization.

The multiple input content images are RGB images of the target scene to be stylized, captured from different viewpoints. These images capture information about the target from different directions and angles in space; using multi-view images, the three-dimensional structure of the scene or object can be recovered through triangulation and similar methods. For the style transfer task, the content images provide the original content and structure information; the new image after style transfer keeps the same content and structure as the content images while the style is changed.

The number of input content images is at least two, and 20-40 images are optimal.

The high-level semantic information refers to the deeper feature representations in a deep learning network, usually abstract features obtained by repeatedly stacking convolution and pooling layers, including more abstract and semantic descriptions of the objects, textures, and shapes in an image. For an image classification network, the extracted high-level semantic information helps classify the input image, enabling the network to learn the semantic structure of the image rather than only surface pixel information.

The radiance field here refers to the neural radiance field (NeRF), a three-dimensional scene representation that performs well in 3D reconstruction and novel-view synthesis. The core idea of the original neural radiance field is to encode the 3D coordinates and camera viewing direction as 3D scene information into a multilayer perceptron (MLP) that directly predicts RGB values and volume density, and then to obtain the final RGB color by sampling along the ray corresponding to each pixel and applying volume rendering. However, the large multilayer perceptron and dense sampling used in the original radiance field make this approach computationally expensive, with very slow training and inference.

The feature-grid radiance field is an improvement on the original neural radiance field. The main idea is to represent the scene with a hybrid explicit-implicit method, for example using explicit data structures such as discrete voxel grids or hash maps to store features, thereby achieving fast convergence and inference. In this method, a discrete voxel grid is used to store scene information; each voxel contains the volume density and scene features of its sampling point, and interpolation makes the volume density and feature values continuous in space. Candidate interpolation methods include trilinear interpolation, nearest-neighbor interpolation, and bilinear interpolation, with trilinear interpolation preferred.
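As a minimal sketch of the grid lookup with trilinear interpolation (assuming the dense voxel grid is stored as a PyTorch tensor and the query coordinates are normalized to [-1, 1]; the actual grid layout used by the patent is not specified), `grid_sample` already implements the interpolation:

```python
import torch
import torch.nn.functional as F

def query_feature_grid(grid, points):
    """Trilinearly interpolate per-voxel features at continuous 3D points.

    grid:   (C, D, H, W) per-voxel features (density + semantic channels).
    points: (N, 3) query coordinates normalized to [-1, 1].
    Returns an (N, C) tensor of interpolated features.
    """
    # grid_sample expects (B, C, D, H, W) inputs and (B, D_out, H_out, W_out, 3) coords
    coords = points.view(1, -1, 1, 1, 3)                       # (1, N, 1, 1, 3)
    sampled = F.grid_sample(grid.unsqueeze(0), coords,
                            mode="bilinear",                    # trilinear for 5-D inputs
                            align_corners=True)                 # (1, C, N, 1, 1)
    return sampled.squeeze(-1).squeeze(-1).squeeze(0).t()       # (N, C)
```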

Constructing the feature-grid radiance field rich in high-level semantic information involves two stages. (1) In the first stage, an original feature-grid radiance field is constructed from the multiple input content images; candidate representations include Plenoxels and TensoRF, with Plenoxels preferred. At this point the radiance field is represented by a discrete voxel grid, and each voxel stores the volume density and spherical harmonic coefficients of the corresponding scene sampling point. The spherical harmonic coefficients represent the original color information of the scene and contain little semantic information, so transferring features on them directly is not conducive to high-quality style transfer. (2) The second stage therefore reconstructs a semantic feature grid. First, a common pre-trained network is used to obtain feature maps of the content images that contain semantic information; candidate networks include VGGNet16, VGGNet19, AlexNet, and ResNet, with VGGNet19 preferred. Then, through a process similar to volume rendering, the reconstructed semantic feature components representing color at each sampling point are integrated along a ray, yielding the multi-channel feature of any ray $\mathbf{r}$ passing through the feature grid:

$$F(\mathbf{r})=\sum_{i=1}^{N} w_i f_i,\qquad w_i=T_i\bigl(1-\exp(-\sigma_i\delta_i)\bigr),\qquad T_i=\exp\Bigl(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Bigr)\tag{1}$$

where $\sigma_i$ and $f_i\in\mathbb{R}^{C}$ are the volume density and reconstructed features computed by the model at sampling position $x_i$, $C$ denotes the number of feature channels, $N$ is the total number of sampling points along the ray, $\delta_i$ is the ray step size, $T_i$ is the transmittance, and $w_i$ is the weight of the $i$-th sample of ray $\mathbf{r}$. The present invention trains with optimization loss functions that are common in novel-view-synthesis tasks so as to align the semantic features with the radiance-field voxel features at arbitrary viewing angles. Candidate losses include the mean squared error (MSE) between the predicted and ground-truth RGB images and perceptual losses; preferably, the MSE between the predicted and ground-truth RGB images and feature maps, together with a perceptual loss that enhances the quality of the generated images, is used as the training loss for this stage:

$$\mathcal{L}_{rec}=\bigl\|\hat{I}_R-I_R\bigr\|_2^2+\bigl\|\hat{F}_R-F_R\bigr\|_2^2+\sum_{l\in L}\bigl\|\phi_l(\hat{I}_R)-\phi_l(I_R)\bigr\|_2^2\tag{2}$$

where $\hat{I}_R$ denotes the predicted scene image under the rays $R$ contained in the current batch ($\hat{F}_R$ the corresponding predicted feature map, $I_R$ and $F_R$ the ground truths), $L$ denotes the layers of the pre-trained network involved in computing the perceptual loss, and $\phi_l(\cdot)$ denotes the feature map output by the $l$-th layer of the pre-trained network.

This completes the construction of the feature-grid radiance field containing high-level semantic information, in which each voxel stores the volume density and semantic features. Performing the style conversion of the radiance field in a high-level feature space rich in semantic information produces stylization closer to the reference style, requires little computation, adapts well, and can be generalized to arbitrary stylization.
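A minimal sketch of the per-ray feature accumulation of formula (1) (assuming the densities and semantic features of the samples along one ray have already been queried from the grid; names are illustrative, not taken from the patent):

```python
import torch

def render_ray_features(sigma, feats, deltas):
    """NeRF-style accumulation of per-sample features along one ray.

    sigma:  (N,)   volume densities of the N samples.
    feats:  (N, C) semantic features of the samples.
    deltas: (N,)   step sizes between consecutive samples.
    Returns the (C,) multi-channel feature F(r) of the ray.
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-segment opacity
    ones = torch.ones(1, device=sigma.device)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10])[:-1], dim=0)  # T_i
    weights = trans * alpha                                  # w_i = T_i * (1 - exp(-sigma_i * delta_i))
    return (weights.unsqueeze(-1) * feats).sum(dim=0)        # sum_i w_i * f_i
```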

As a preferred approach, in step S2, for the voxel-grid neural radiance field, feature information is stored as vectors and matrices along the X, Y, and Z directions, and tensor decomposition is used to reduce memory complexity.

Regarding step S2, the storage structure of the above radiance field is optimized through tensor decomposition. The details are as follows:

Tensor decomposition is a mathematical technique for representing a multidimensional array (tensor) as a set of low-rank components; it helps extract the latent structure and features in the data, thereby reducing its dimensionality and simplifying analysis.

To optimize the storage structure of the above radiance field, the low-rank constraints on two modes of the tensor are relaxed through vector-matrix (VM) decomposition, decomposing the tensor into compact vector and matrix factors:

$$\mathcal{T}=\sum_{r=1}^{R_1}\mathbf{v}_r^{X}\circ\mathbf{M}_r^{YZ}+\sum_{r=1}^{R_2}\mathbf{v}_r^{Y}\circ\mathbf{M}_r^{XZ}+\sum_{r=1}^{R_3}\mathbf{v}_r^{Z}\circ\mathbf{M}_r^{XY}\tag{3}$$

where $R_1$, $R_2$, and $R_3$ are three hyperparameters whose size depends on the complexity of the corresponding basis and which can be set to a fixed value for most scenes. The tensor here refers to the information stored in each voxel of the previously constructed radiance field, i.e., the volume density and the semantic features. The present invention performs tensor decomposition separately on the volume density and the semantic features and stores the feature-grid information as vectors and matrices along the X, Y, and Z directions. Tensor decomposition reduces the memory complexity from $O(n^3)$ to $O(n^2)$, greatly improving storage efficiency.
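A rough, single-channel illustration of the vector-matrix factorization idea (a TensoRF-style sketch; the rank and the way the patent splits density and semantic channels are assumptions):

```python
import torch

class VMGrid(torch.nn.Module):
    """Store a (D, H, W) scalar field as vector/matrix factors instead of a dense grid."""

    def __init__(self, D, H, W, rank=16):
        super().__init__()
        # one vector per axis and one matrix over the complementary plane
        self.vec_x = torch.nn.Parameter(torch.randn(rank, D))
        self.mat_yz = torch.nn.Parameter(torch.randn(rank, H, W))
        self.vec_y = torch.nn.Parameter(torch.randn(rank, H))
        self.mat_xz = torch.nn.Parameter(torch.randn(rank, D, W))
        self.vec_z = torch.nn.Parameter(torch.randn(rank, W))
        self.mat_xy = torch.nn.Parameter(torch.randn(rank, D, H))

    def dense(self):
        """Reconstruct the dense grid as a sum of vector-matrix outer products."""
        g = torch.einsum("rd,rhw->dhw", self.vec_x, self.mat_yz)
        g = g + torch.einsum("rh,rdw->dhw", self.vec_y, self.mat_xz)
        g = g + torch.einsum("rw,rdh->dhw", self.vec_z, self.mat_xy)
        return g  # the factors cost O(n^2) memory instead of O(n^3) for the dense grid
```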

Step S3 of the present invention includes:

Step S30, pyramid pooling is applied to the content features, the features of different levels are fed into convolutional layers to obtain adaptive channel attention maps at different scales, the maps are summed and fused, and the result is multiplied with the original content features to achieve multi-level structural representation enhancement in the channel dimension;

Step S31, the adaptively enhanced content features are compressed along the channel dimension, a fused adaptive spatial attention map is then computed through convolution kernels of different sizes and up- and down-sampling operations, and the map is multiplied with the content features to achieve multi-scale region-aware enhancement.

Specifically, step S30 includes:

Step S300, performing multi-scale pooling on the input content features to obtain regional features at different scales;

Step S301, feeding the multiple regional features into different convolutional layers to obtain multiple attention-weighted regional features;

Step S302, summing the multiple attention-weighted regional features, applying nonlinear activation, and outputting an adaptive channel attention map.

Specifically, in step S31:

The channel dimension represents the number of channels of the feature map, each channel corresponding to one filter or convolution kernel;

The spatial dimension is the spatial position of a point in the feature map along the height and width dimensions.

Furthermore, step S31 includes:

Step S310, channel compression: performing max pooling and average pooling on the input features along the channel dimension and concatenating the two along the channel dimension;

Step S311, feature refinement: comprising a global average pooling branch and a feature-fusion perception branch over different pyramid scales, where the global average pooling branch globally averages the output features of step S310 over the spatial dimensions, passes them through one convolutional layer, and then upsamples, and the feature-fusion perception branch is a U-shaped network structure;

Step S312, attention output: adding the outputs of the two branches of step S311, applying nonlinear activation, and outputting an adaptive spatial attention map.

In step S3 above, adaptive feature enhancement is performed on the content features along the channel and spatial dimensions. The details are as follows:

The content features are the feature maps rich in high-level semantic information generated in the previous steps; they have four dimensions, BCHW: batch, channel, height, and width. The batch is the number of samples used in one training step, also called the batch size; when training a deep neural network, multiple samples are usually processed together for each parameter update. The B dimension is the number of samples processed by the network at a time, the C dimension is the depth of the feature map, i.e., the number of feature channels, the H dimension is the size of the feature map in the vertical direction, and the W dimension is its size in the horizontal direction.

The channel dimension refers to the channel dimension of the feature map, i.e., the number of channels, also called the depth; each channel corresponds to one filter or convolution kernel and is responsible for extracting a specific type of feature. The spatial dimension refers to the spatial position of a point in the feature map along the height and width dimensions.

The adaptive feature enhancement in the channel and spatial dimensions uses a serial structure: adaptive feature enhancement in the channel dimension is performed first, followed by adaptive feature enhancement in the spatial dimension.

Specifically, for the adaptive feature enhancement in the channel dimension, the content features are first fed into the constructed multi-level structural representation enhancement network, which computes an adaptive channel attention map; this map is then multiplied with the original input content features to achieve multi-level structural representation enhancement in the channel dimension. The attention map contains the different weights assigned to different parts of the features; weighting the original features with it makes the network focus more on information meaningful for the task. The multi-level structural representation enhancement network consists of three serially connected parts. (1) First, multi-scale pooling is applied to the input content features to obtain regional features at different scales. Pooling is a common operation in deep learning used to reduce the spatial size of feature maps, lower computational complexity, and to some extent extract features; candidate pooling types include max pooling and average pooling, with average pooling preferred, and "multi-scale" refers to selecting different pooling window sizes. (2) The multiple regional features are fed into different convolutional layers, yielding multiple attention-weighted regional features. (3) The multiple attention-weighted regional features are summed and passed through a nonlinear activation, and the adaptive channel attention map is output. Candidate nonlinear activation functions include ReLU and Sigmoid, with Sigmoid preferred. Adaptive feature enhancement in the channel dimension makes the model focus on "what" is meaningful in a given content image; that is, by optimizing the channel weights, the stylization model can concentrate on valuable structural levels, improving the style transfer effect in complex scenes while preserving the necessary scene structure. In addition, multi-scale feature extraction through pyramid pooling collects several main cues about the scene features, enabling the fusion of global and local context information.
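A minimal PyTorch sketch of this channel branch (the pooling scales, 1×1 convolutions, and Sigmoid gate follow the embodiment described later; summing the per-scale maps after upsampling them to the input resolution is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelChannelAttention(nn.Module):
    """Pyramid-pooled channel attention: channel weights from multi-scale region statistics."""

    def __init__(self, channels, scales=(1, 2, 3)):
        super().__init__()
        self.scales = scales
        # one small 1x1-conv branch per pooling scale
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 1))
            for _ in scales])

    def forward(self, x):                                    # x: (B, C, H, W)
        h, w = x.shape[-2:]
        attn = 0
        for scale, branch in zip(self.scales, self.branches):
            pooled = F.adaptive_avg_pool2d(x, scale)         # regional features at this scale
            attn = attn + F.interpolate(branch(pooled), size=(h, w), mode="nearest")
        attn = torch.sigmoid(attn)                           # adaptive channel attention map
        return x * attn                                      # channel-wise enhancement
```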

For the adaptive feature enhancement in the spatial dimension, the content features that have undergone channel-adaptive enhancement are first fed into the constructed multi-scale region-aware enhancement network, which computes an adaptive spatial attention map; this map is then multiplied with the input content features to achieve multi-scale region-aware enhancement in the spatial dimension. The multi-scale region-aware enhancement network comprises three sequential stages. (1) Channel compression: max pooling and average pooling are performed on the input features along the channel dimension, and the two results are concatenated along the channel dimension. (2) Feature refinement: this stage contains two parallel branches, a global average pooling branch and a feature-fusion perception branch over different pyramid scales. The global average pooling branch first globally averages the output features of stage (1) over the spatial dimensions, then passes them through one convolutional layer, and finally upsamples. The feature-fusion perception branch is a U-shaped network structure: the downward path feeds the output features of stage (1) into different convolutional layers in order of convolution kernel size from large to small with downsampling, and the upward path upsamples the convolved feature maps and adds them to the outputs of the other convolutional layers for fusion. (3) Attention output: the outputs of the two branches of stage (2) are added and passed through a nonlinear activation, and the adaptive spatial attention map is output. Adaptive feature enhancement in the spatial dimension makes the model focus on the "where" part of the information. By assigning different degrees of attention to different regions of feature maps at different scales, the present invention strengthens the model's perception and representation of specific sensitive regions and enhances key detail textures.
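A corresponding sketch of the spatial branch (channel compression by per-pixel max/mean, a global-average branch, and a simplified multi-kernel U-shaped branch; the exact layer configuration is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpatialAttention(nn.Module):
    """Spatial attention over a channel-compressed map, fusing global context and multi-kernel cues."""

    def __init__(self):
        super().__init__()
        self.conv_global = nn.Conv2d(2, 1, kernel_size=1)
        # simplified U-shaped branch: large-to-small kernels at decreasing resolution
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.conv5 = nn.Conv2d(1, 1, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):                                     # x: (B, C, H, W)
        h, w = x.shape[-2:]
        # (1) channel compression: per-pixel max and mean over the channels
        squeezed = torch.cat([x.max(dim=1, keepdim=True).values,
                              x.mean(dim=1, keepdim=True)], dim=1)       # (B, 2, H, W)
        # (2a) global average pooling branch
        g = self.conv_global(F.adaptive_avg_pool2d(squeezed, 1))         # (B, 1, 1, 1)
        g = F.interpolate(g, size=(h, w), mode="nearest")
        # (2b) multi-kernel fusion branch: downsample with large-to-small kernels, then fuse upward
        d1 = self.conv7(squeezed)
        d2 = self.conv5(F.avg_pool2d(d1, 2))
        d3 = self.conv3(F.avg_pool2d(d2, 2))
        u = F.interpolate(d3, size=d2.shape[-2:], mode="nearest") + d2
        u = F.interpolate(u, size=(h, w), mode="nearest") + d1
        # (3) attention output
        attn = torch.sigmoid(g + u)                                      # (B, 1, H, W)
        return x * attn
```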

As a preferred approach, step S4 includes:

Step S40, passing the artistic style image through a pre-trained convolutional neural network to obtain a style feature map, and then computing the mean and variance of the style feature map;

Step S41, serializing the style feature map, passing it through convolutional layers, computing the feature covariance, and applying a linear transformation to obtain a style transfer matrix;

The mean, variance, and style transfer matrix of the style feature map computed in these steps serve as complementary multi-level information that characterizes the style of the artistic style image.

Regarding step S4, complementary multi-level style information is learned from the input artistic style image. The details are as follows:

The artistic style image is the reference style image of the stylization task; one of the key steps of the task is to extract the style features of the style image and fuse them with the content features. Style information refers to information extracted from the style image that can characterize its style; common style information includes the mean and variance. Relying only on such simple statistics, however, cannot achieve high-quality style extraction and conversion. The complementary multi-level style information refers to the mean, variance, and style transfer matrix extracted from the style image feature maps by the constructed style extraction network. These three kinds of style information are extracted from different levels of the style image and contain different statistical characteristics; together they provide a comprehensive and complementary representation of the brightness, color distribution, texture, and other attributes of the style image.

Learning the complementary multi-level style information is achieved through the constructed style extraction network. The artistic style image is first fed into a pre-trained convolutional neural network to obtain a style feature map; candidate networks include VGGNet16, VGGNet19, AlexNet, and ResNet, with VGGNet19 preferred. The mean and variance of the style feature map are then computed. Next, the style feature map is fed into convolutional layers, the feature covariance is computed, and finally a linear layer is applied to learn the style transfer matrix $T$. The mean, variance, and style transfer matrix computed in this process serve as complementary multi-level information characterizing the style of the artistic style image.
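A hedged sketch of this style path (per-channel statistics of a VGG feature map plus a transfer matrix learned from the feature covariance; the module sizes and the covariance handling are assumptions loosely following linear-transformation style transfer designs):

```python
import torch
import torch.nn as nn

class StyleExtractor(nn.Module):
    """Extract complementary style information: mean, variance, and a style transfer matrix."""

    def __init__(self, channels):
        super().__init__()
        self.compress = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.ReLU(inplace=True),
                                      nn.Conv1d(channels, channels, 1), nn.ReLU(inplace=True),
                                      nn.Conv1d(channels, channels, 1), nn.ReLU(inplace=True))
        self.to_matrix = nn.Linear(channels, channels)

    def forward(self, style_feat):                        # style_feat: (B, C, H, W) from a VGG layer
        b, c, h, w = style_feat.shape
        mu = style_feat.mean(dim=(2, 3), keepdim=True)    # per-channel mean
        var = style_feat.var(dim=(2, 3), keepdim=True)    # per-channel variance
        seq = style_feat.view(b, c, h * w)                # serialize the spatial dimensions
        seq = self.compress(seq - mu.view(b, c, 1))
        cov = torch.bmm(seq, seq.transpose(1, 2)) / (h * w - 1)  # (B, C, C) feature covariance
        transfer = self.to_matrix(cov)                    # (B, C, C) learned style transfer matrix
        return mu, var, transfer
```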

Step S5 of the present invention includes:

Step S50, performing volume rendering of the adaptively enhanced feature grid from an arbitrary viewing angle to obtain a content feature map;

Step S51, applying mathematical operations between the style information extracted from the artistic style image and the content feature map to obtain a stylized feature map.

Regarding the specific procedure of step S5, style transfer is performed on the feature map obtained by volume rendering of the feature grid, which includes the following:

The feature grid is the refined feature grid after adaptive feature enhancement; different regions of its content receive different degrees of attention, which strengthens the model's perception and representation of specific sensitive regions.

Regarding volume rendering: traditional volume rendering is a computer graphics and visualization technique for presenting and visualizing three-dimensional volume data, adjusting the transparency and color of the volume data to simulate the propagation of light through the volume and finally generating an image. The volume rendering referred to here is the volume rendering used in neural radiance fields, whose purpose is to generate the RGB pixel value of the corresponding ray. Specifically, a ray is generated for each pixel of the image at a given viewing angle, the radiance field is sampled along the ray direction, and the color and volume density at each sampling point are integrated to produce the corresponding pixel color.

The feature map here refers to the content feature map obtained by volume rendering the feature grid. Following the volume rendering idea above, the feature grid is sampled along the ray direction, and the volume density and refined features at the sampling points are integrated, finally yielding the feature of the corresponding ray. The calculation is as follows:

$$F_c(\mathbf{r})=\sum_{i=1}^{N} w_i f_i,\qquad w_i=T_i\bigl(1-\exp(-\sigma_i\delta_i)\bigr)\tag{4}$$

where $\sigma_i$ and $f_i\in\mathbb{R}^{C}$ are the volume density and refined features computed by the model at sampling position $x_i$, $C$ denotes the number of feature channels, $N$ is the total number of sampling points along the ray, $\delta_i$ is the ray step size, $T_i$ is the transmittance, and $w_i$ is the weight of the $i$-th sample of ray $\mathbf{r}$.

Performing style transfer on the feature map means transferring the multi-level style information learned in step S4 (the mean, variance, and style transfer matrix) onto the above content feature map, thereby realizing style transfer at the feature-map level; through multiple optimization iterations, a stylized radiance field is ultimately produced. The specific transfer steps are: (1) multiply the style transfer matrix $T$ with the content feature map $F_c$; (2) multiply the output of (1) with the variance $\sigma_s$ and then add the product of the mean $\mu_s$ and the weight map $W$, finally obtaining the stylized feature map $F_{cs}$. The weight map refers to the two-dimensional array formed by the weight corresponding to each pixel. The conversion process can be described by the following formula:

$$F_{cs}=\sigma_s\cdot\bigl(T\otimes F_c\bigr)+\mu_s\cdot W\tag{5}$$

where $\otimes$ denotes matrix multiplication and $F_c$ is the content feature map. Style transfer on the feature map is thus achieved with simple addition and multiplication operations, which avoids the direct style transfer on the radiance field used by traditional methods, reducing computation and increasing speed.
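Formula (5) amounts to a couple of tensor operations on the rendered feature map; a short sketch (shapes and the broadcasting of the weight map are assumptions):

```python
import torch

def stylize_feature_map(content_feat, transfer, var, mu, weight_map):
    """Apply the multi-level style information to a rendered content feature map.

    content_feat: (B, C, H, W) volume-rendered content feature map F_c.
    transfer:     (B, C, C)    style transfer matrix T.
    var, mu:      (B, C, 1, 1) per-channel style variance and mean.
    weight_map:   (B, 1, H, W) per-pixel weight map W.
    """
    b, c, h, w = content_feat.shape
    flat = content_feat.view(b, c, h * w)
    transferred = torch.bmm(transfer, flat).view(b, c, h, w)  # T (x) F_c
    return transferred * var + mu * weight_map                # sigma_s * (T (x) F_c) + mu_s * W
```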

Step S6 of the present invention includes:

Step S60, feeding the stylized feature map into a decoder based on a convolutional neural network to obtain a stylized image of the corresponding viewing angle in RGB space;

Step S61, optimizing and training the decoder, the adaptive feature enhancement components, and the component extracting multi-level style information with the designed joint global and local loss function;

The joint loss function includes a global style loss term, a global content loss term, and a local detail preservation loss term based on the Laplacian matrix, each term adjusted by a corresponding weight.

Regarding step S6, the decoder and the above components are jointly trained and optimized according to the global quality loss function and the local detail loss function, finally achieving rapid generation of stylized art images of the content scene from any viewing angle. The details are as follows:

The decoder refers to the constructed convolutional neural network used to convert the stylized feature map into an image in RGB space; it consists of convolutional layers and nonlinear activation functions. The global quality loss function is used to regulate the global spatial structure and overall style quality and consists of two parts: a content loss and a style loss. The content loss is the mean squared error (MSE) between the features of the final stylized output image and the features of the content image:

$$\mathcal{L}_{c}=\bigl\|\phi(I_{cs})-\phi(I_{c})\bigr\|_2^2\tag{6}$$

The style loss $\mathcal{L}_{s}$ is the sum of the mean squared errors between the means and between the variances of the features output by each layer of a pre-trained VGG19 for the stylized image and for the style image:

$$\mathcal{L}_{s}=\sum_{l\in L}\Bigl(\bigl\|\mu\bigl(\phi_l(I_{cs})\bigr)-\mu\bigl(\phi_l(I_{s})\bigr)\bigr\|_2^2+\bigl\|\sigma\bigl(\phi_l(I_{cs})\bigr)-\sigma\bigl(\phi_l(I_{s})\bigr)\bigr\|_2^2\Bigr)\tag{7}$$

where $I_{cs}$ denotes the stylized image, $I_{c}$ the input original content image, $I_{s}$ the style image, and $\phi_l(I)$ the feature map output by the $l$-th layer of the pre-trained convolutional neural network for image $I$.

For the local detail loss function, the difference between the Laplacian matrices of the stylized image and the content image is used as a Laplacian loss term to measure differences in local detail structure, thereby optimizing the detail content of the stylized image:

$$\mathcal{L}_{lap}=\bigl\|\Delta(I_{cs})-\Delta(I_{c})\bigr\|_2^2\tag{8}$$

where the Laplacian matrix $\Delta(I)$ is obtained by convolving the input image $I$ with a Laplacian filter; Laplacian filters are commonly used for edge and contour detection.

Joint training refers to combining the content loss, style loss, and Laplacian loss through their respective weights to jointly optimize the training of the stylization model, so that the stylized image reflects the reference style while retaining the structural details of the original content. The total loss function of the stylization training stage is:

$$\mathcal{L}_{total}=\lambda_{c}\mathcal{L}_{c}+\lambda_{s}\mathcal{L}_{s}+\lambda_{lap}\mathcal{L}_{lap}\tag{9}$$
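A compact sketch of this joint objective (the VGG feature extraction, layer choice, and reductions are assumptions; the weights follow the values given in the embodiment below):

```python
import torch
import torch.nn.functional as F

# a 3x3 Laplacian filter, commonly used for edge and contour detection
LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian(img):
    """Per-channel Laplacian response of an RGB image (B, 3, H, W)."""
    kernel = LAPLACIAN.to(img.device).repeat(3, 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=3)

def joint_loss(stylized, content_img, feats_cs, feats_c, feats_s,
               w_content=1.0, w_style=20.0, w_lap=100.0):
    """Global content loss + global style loss + local Laplacian detail loss.

    feats_cs / feats_c / feats_s are lists of VGG feature maps of the stylized,
    content, and style images at the chosen layers.
    """
    l_content = F.mse_loss(feats_cs[-1], feats_c[-1])           # content: single deep layer assumed
    l_style = sum(F.mse_loss(fc.mean(dim=(2, 3)), fs.mean(dim=(2, 3))) +
                  F.mse_loss(fc.var(dim=(2, 3)), fs.var(dim=(2, 3)))
                  for fc, fs in zip(feats_cs, feats_s))          # per-layer mean/variance MSE
    l_lap = F.mse_loss(laplacian(stylized), laplacian(content_img))
    return w_content * l_content + w_style * l_style + w_lap * l_lap
```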

Rapid generation means that, once the training stage is complete, high-quality 3D stylized art content of the scene can be inferred and generated from any viewing angle under any style, and the training and inference speeds exceed those of the vast majority of existing 3D style transfer methods.

In practical applications of the method for rapidly generating 3D art content based on arbitrary stylization disclosed by the present invention, reference may be made to the following embodiment.

Embodiment 1

This embodiment specifically includes the following steps:

Step 1: 30 multi-view images of the target scene captured from multiple viewpoints are selected as the input content images; the WikiArt dataset, which contains more than 80,000 art images, is selected as the source of reference style images, with 140 of them used as the test set and the rest as the training set.

The content images are taken as input to construct a Plenoxels original scene radiance field; each voxel contains the volume density and scene features of its sampling point, and trilinear interpolation makes the values continuous in space.

Using the pre-trained VGGNet19 as the semantic-information extraction network, the output of the relu3_1 layer of VGGNet19 for the content images is extracted as the semantic information. Volume rendering with formula (1) above yields the multi-channel features of any ray passing through the feature grid, and formula (2) above is then used as the training loss function for this stage; when computing the perceptual loss, the relu3_1 and relu4_1 layers of the pre-trained VGGNet19 are selected. To better improve multi-view consistency, this embodiment disables the influence of the viewing direction on the output, finally obtaining a feature-grid radiance field rich in high-level semantic information.
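As a hedged sketch of this extraction step (using torchvision's pre-trained VGG19 and slicing its `features` module up to relu3_1; the layer indexing is a common convention, not specified by the patent, and a recent torchvision is assumed):

```python
import torch
import torchvision

# indices 0..11 of torchvision's VGG19 `features` end at relu3_1
vgg19 = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1)
relu3_1 = torch.nn.Sequential(*list(vgg19.features.children())[:12]).eval()

@torch.no_grad()
def extract_semantic_features(images):
    """images: (B, 3, H, W) ImageNet-normalized content images. Returns relu3_1 feature maps."""
    return relu3_1(images)
```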

步骤2,利用向量-矩阵(VM)分解的方式优化存储结构,如上述公式(3),本实施例设置X、Y、Z方向上的张量分量数量一致且即张量分量的总数量为192,利用张量分解实现了内存复杂度的降低,大大提高了存储效率,将内存复杂度从Step 2, optimize the storage structure by using vector-matrix (VM) decomposition, as shown in the above formula (3). In this embodiment, the number of tensor components in the X, Y, and Z directions is set to be the same and That is, the total number of tensor components is 192. The use of tensor decomposition reduces memory complexity and greatly improves storage efficiency, reducing memory complexity from .

步骤3,本实施例构建的多层次结构表征增强网络中使用三种不同的池化尺度对内容特征进行平均池化,多尺度池化输出特征的空间分辨率分别为1×1、2×2、3×3,然后输入到顺序连接的两个1×1卷积层中,并使用Sigmoid函数进行非线性激活,最终输出自适应通道注意力图,将得到的自适应通道注意力图与原始输入的内容特征相乘实现了通道维度上的多层次结构表征增强。Step 3: In the multi-level structure representation enhancement network constructed in this embodiment, three different pooling scales are used to average pool the content features. The spatial resolutions of the multi-scale pooling output features are 1×1, 2×2, and 3×3, respectively. They are then input into two sequentially connected 1×1 convolutional layers, and nonlinear activation is performed using the Sigmoid function. Finally, an adaptive channel attention map is output, and the obtained adaptive channel attention map is multiplied by the content features of the original input to achieve multi-level structure representation enhancement in the channel dimension.

本实施例构建的多尺度区域感知增强网络,首先是通过最大池化和平均池化对通道维度进行压缩,然后使用了三种不同大小的卷积核conv7*7、conv5*5、conv3*3进行特征融合感知,然后与全局平均池化结果相加并经过Sigmoid函数的非线性激活,最终输出自适应空间注意力图,将得到的自适应空间注意力图与输入的内容特征相乘实现了空间维度上的多尺度区域感知增强。The multi-scale area perception enhancement network constructed in this embodiment first compresses the channel dimension through maximum pooling and average pooling, and then uses three different sizes of convolution kernels conv7*7, conv5*5, and conv3*3 for feature fusion perception. The result is then added to the global average pooling result and activated nonlinearly by the Sigmoid function. Finally, an adaptive spatial attention map is output. The obtained adaptive spatial attention map is multiplied by the input content features to achieve multi-scale area perception enhancement in the spatial dimension.

Step 4: The input style image is first passed through VGGNet19 to obtain a style feature map, and the mean and variance of this style feature map are computed. The style feature map is then fed into a convolutional block consisting of three sequentially connected conv1d+ReLU modules, the feature covariance is computed, and a final linear layer learns the style transfer matrix. The mean, the variance and the style transfer matrix obtained above serve as complementary multi-level information that characterizes the style of the artistic style image.
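
A sketch of this style-description branch in PyTorch is given below; the channel widths, the mean-centering before the covariance, and the exact shape of the learned transfer matrix are assumptions of the example:

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Compute complementary style statistics: per-channel mean, variance and a transfer matrix."""

    def __init__(self, channels: int = 256, hidden: int = 128):
        super().__init__()
        self.compress = nn.Sequential(                 # three conv1d+ReLU modules in sequence
            nn.Conv1d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, 1), nn.ReLU(inplace=True),
        )
        self.to_matrix = nn.Linear(hidden * hidden, channels * channels)
        self.channels = channels

    def forward(self, style_feat: torch.Tensor):
        b, c, h, w = style_feat.shape                  # VGG feature map of the style image
        flat = style_feat.view(b, c, h * w)
        mean = flat.mean(dim=2, keepdim=True)          # per-channel mean
        var = flat.var(dim=2, keepdim=True)            # per-channel variance
        z = self.compress(flat - mean)
        cov = torch.bmm(z, z.transpose(1, 2)) / (h * w)        # feature covariance
        t = self.to_matrix(cov.flatten(1)).view(b, self.channels, self.channels)
        return mean, var, t                            # complementary multi-level style info
```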

Step 5: The volume rendering process of the above formula (4) is applied to the refined feature grid obtained after adaptive feature enhancement, yielding the content feature map. As in formula (5), the style transfer matrix is multiplied with the content feature map, the result is multiplied by the variance, and the product of the mean and the weight map is then added, finally giving the stylized feature map.
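
Following the wording of formula (5), these quantities combine roughly as sketched below; whether the variance or its square root enters the product is left to the formula itself, and the tensor shapes are assumptions:

```python
import torch

def stylize_feature_map(content_feat: torch.Tensor,
                        transfer: torch.Tensor,
                        var: torch.Tensor,
                        mean: torch.Tensor,
                        weight_map: torch.Tensor) -> torch.Tensor:
    """Stylized = (T x content) * var + mean * weight_map, in the spirit of formula (5).

    content_feat: (B, C, H, W) volume-rendered content feature map
    transfer:     (B, C, C)    learned style transfer matrix T
    var, mean:    (B, C, 1)    style statistics from the style image
    weight_map:   (B, 1, H, W) per-pixel weight map
    """
    b, c, h, w = content_feat.shape
    flat = content_feat.view(b, c, h * w)
    transferred = torch.bmm(transfer, flat)                            # T x content
    stylized = transferred * var + mean * weight_map.view(b, 1, h * w)
    return stylized.view(b, c, h, w)                                   # stylized feature map
```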

Step 6: A convolutional neural network consisting of eight sequentially connected conv2d+ReLU modules is constructed as the decoder, and the stylization model is trained with the global-plus-local joint optimization loss function of the above formula (9). The content loss weight may be set to 1, the style loss weight to 20 and the Laplacian loss weight to 100; the relu1_1, relu2_1, relu3_1 and relu4_1 layers of the pre-trained VGGNet19 participate in the loss computation. The number of training iterations may be set to 25k, and training takes about 4 h on an RTX 3090.
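
For orientation only, the decoder and the joint objective can be sketched as follows; the channel widths of the decoder are illustrative (the embodiment only fixes the number of conv2d+ReLU modules), whereas the loss weights 1 / 20 / 100 are the values quoted above:

```python
import torch.nn as nn

def make_decoder(in_ch: int = 256, hidden: int = 64, out_ch: int = 3) -> nn.Sequential:
    """Eight sequentially connected conv2d+ReLU modules decoding features to an image."""
    layers, ch = [], in_ch
    for i in range(8):
        nxt = out_ch if i == 7 else hidden
        layers += [nn.Conv2d(ch, nxt, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = nxt
    return nn.Sequential(*layers)

def joint_loss(l_content, l_style, l_laplacian):
    """Global + local joint optimization of formula (9) as a weighted sum."""
    return 1.0 * l_content + 20.0 * l_style + 100.0 * l_laplacian
```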

After training, this embodiment takes the intrinsic and extrinsic camera matrices of the desired viewpoint as input and infers high-quality 3D stylized art content for any viewpoint of the scene. The generated stylized images not only inherit the style of the art image with high quality but also preserve the detailed content structure of complex scenes, avoiding the artifacts that such scenes tend to produce. Generating a single 720p stylized image at an arbitrary viewpoint takes about 4 s of inference time, and both training and inference are faster than most existing 3D style transfer methods.

The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution or improvement made within the technical scope of the present invention shall be included in the scope of protection of the present invention.

Claims (5)

CN202410503092.9A | Priority date 2024-04-25 | Filing date 2024-04-25 | A method for rapid generation of 3D art content based on arbitrary stylization | Active | CN118096978B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410503092.9A | 2024-04-25 | 2024-04-25 | A method for rapid generation of 3D art content based on arbitrary stylization

Publications (2)

Publication Number | Publication Date
CN118096978A (en) | 2024-05-28
CN118096978B (en) | 2024-07-12

Family

ID=91163294

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status | Granted Publication
CN202410503092.9A | A method for rapid generation of 3D art content based on arbitrary stylization | 2024-04-25 | 2024-04-25 | Active | CN118096978B (en)

Country Status (1)

Country | Link
CN (1) | CN118096978B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN119540431B * | 2025-01-21 | 2025-08-22 | 苏州元脑智能科技有限公司 | Image stylization editing method, device, equipment and medium for three-dimensional scene

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN114240735A (en) * | 2021-11-17 | 2022-03-25 | 西安电子科技大学 | Arbitrary style transfer method, system, storage medium, computer equipment and terminal
CN116934936A (en) * | 2023-09-19 | 2023-10-24 | 成都索贝数码科技股份有限公司 | Three-dimensional scene style migration method, device, equipment and storage medium

Family Cites Families (4)

Publication number | Priority date | Publication date | Assignee | Title
CN112819692B (en) * | 2021-02-21 | 2023-10-31 | 北京工业大学 | Real-time arbitrary style migration method based on dual-attention module
US11961266B2 (en) * | 2021-03-31 | 2024-04-16 | Sony Group Corporation | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN114926553B (en) * | 2022-05-12 | 2024-11-01 | 中国科学院计算技术研究所 | Three-dimensional scene consistency stylization method and system based on nerve radiation field
CN117237501A (en) * | 2023-08-25 | 2023-12-15 | 三江学院 | Hidden stylized new view angle synthesis method

Also Published As

Publication number | Publication date
CN118096978A (en) | 2024-05-28

Similar Documents

Publication | Title
CN116739899B (en) | Image super-resolution reconstruction method based on SAUGAN network
CN111784582B (en) | A low-light image super-resolution reconstruction method based on DEC_SE
CN113744136A (en) | Image super-resolution reconstruction method and system based on channel constraint multi-feature fusion
CN118691742B (en) | A 3D point cloud reconstruction method based on self-training conditional diffusion model
CN114049420B (en) | Model training method, image rendering method, device and electronic equipment
CN109447897B (en) | Real scene image synthesis method and system
CN113344773B (en) | Single picture reconstruction HDR method based on multi-level dual feedback
CN115239861A (en) | Facial data enhancement method, device, computer equipment and storage medium
CN118096978B (en) | A method for rapid generation of 3D art content based on arbitrary stylization
CN119180898B (en) | Neural radiation field rendering method and device based on nerve basis and tensor decomposition
CN119000565B (en) | A spectral reflectance image acquisition method and system based on intrinsic decomposition
Zhao et al. | A method of degradation mechanism-based unsupervised remote sensing image super-resolution
CN118762394B (en) | Line of sight estimation method
Zhou et al. | RDAGAN: Residual Dense Module and Attention-Guided Generative Adversarial Network for infrared image generation
US20250225713A1 (en) | Electronic device and method for restoring scene image of target view
CN110322548B (en) | Three-dimensional grid model generation method based on geometric image parameterization
CN118172499A (en) | Building height inversion method based on resource third-order remote sensing image
CN118365537A (en) | Image enhancement method, image enhancement device, storage medium and electronic device
CN117911650A (en) | Bilinear parameterization method of graph neural network and three-dimensional shape analysis system
Liu et al. | GloNeRF: Boosting NeRF capabilities and multi-view consistency in low-light environments
CN115983352A (en) | Data generation method and device based on radiation field and generation countermeasure network
Li et al. | Ultra-high-definition underwater image enhancement via dual-domain interactive transformer network
CN119762358B (en) | Sparse new view image synthesis method based on multi-scale feature fusion
CN120355632B (en) | Image recovery method integrating linear attention mechanism
CN120047450B (en) | Remote sensing image change detection method based on double-current time-phase characteristic adapter

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
