





Technical Field
The present invention relates to the field of image processing, and in particular to a hand-drawn three-dimensional (3D) reconstruction method based on multi-feature fusion.
Background Art
As one of the most fundamental yet challenging tasks in computer graphics and computer vision, image-based 3D reconstruction has attracted extensive attention from researchers worldwide. Early traditional methods computed surface orientation or depth information from texture, shading, or illumination cues in a 2D image and then recovered or reconstructed the 3D model. However, recovering a 3D model from a single image is an ill-posed problem: the projection of a 3D model onto a 2D plane varies enormously with viewpoint and suffers from self-occlusion, so multiple plausible 3D models can correspond to the same picture. How to overcome these difficulties and achieve accurate, efficient, and reliable 3D reconstruction remains an open problem.
In recent years, deep neural networks have opened up new possibilities for image-based 3D reconstruction. Emerging data-driven approaches can infer the corresponding 3D model directly from 2D images. Thanks to the release of large-scale 3D model datasets, which remedied the earlier shortage of data, advanced deep learning models were able to play to their strengths, and a series of remarkable 3D reconstruction networks have been proposed. For example, Fan et al. proposed a point cloud generation network based on a single image: by estimating the depth of the visible part and guessing the occluded remainder, it generates several similar 3D point cloud models to cope with the uncertainty of a single view. The 3D-VAE-GAN proposed by Wu et al. takes a single-view image as input, outputs a representation vector through a Variational AutoEncoder, and then uses a Generator to build a mapping from a low-dimensional probability space to the space of 3D objects; with a discriminator added, adversarial training allows the generator to implicitly capture the structure of the object and reconstruct high-quality 3D models. 3D-LMNet reconstructs 3D point clouds from a single image and implicitly embeds a matching scheme in the network; however, because the connections between points are loose, the point cloud representation has a large degree of freedom and the reconstruction accuracy is not high. Choy et al. proposed a network architecture called 3D-R2N2, which innovatively embeds a long short-term memory (LSTM) module. The LSTM module extracts feature maps from images of different viewpoints in sequence and selectively updates the states of the feature units, finally fusing the multi-view images and effectively alleviating the self-occlusion problem of a single view. As images from more viewpoints are fed in, more explicit object information becomes available and the reconstruction approaches the true 3D model; however, the network requires the multi-view images to be input in a fixed order, otherwise the reconstruction results differ greatly. Xie et al. proposed a 3D reconstruction network for single-view or multi-view images named Pix2Vox, which introduces a context-aware fusion module that selects the most credible reconstructed parts from several coarse 3D models and fuses them into a high-quality 3D model.
Different from the natural-image-based methods above, another line of research attempts to reconstruct 3D models from hand-drawn sketches. A pioneering study divided traditional sketch-based 3D modeling techniques into heuristic and reconstruction-based approaches. Heuristic methods create 3D models from template primitives, or retrieve the best-matching object from a model collection and deform it further, and are therefore limited to generating models of certain shapes or types. Reconstruction-based methods, on the other hand, map sketches directly to models and can generate 3D models with rich shapes; correspondingly, the lack of prior knowledge usually means that reconstruction-based methods require more elaborate work to achieve good results.
Many recently proposed methods focus on acquiring prior knowledge from 3D models by means of deep learning and then reconstructing the 3D model from accurate input line drawings, but only a few focus on freehand sketches. Among them, Wang et al. proposed a method that combines retrieval and reconstruction: rendered images and hand-drawn sketches are first embedded into the same shared latent vector space, and then, for each sketch, the nearest model among all training 3D data is retrieved as prior knowledge for 3D reconstruction. To capture the characteristics of a sketch at a specific viewpoint, Han et al. used a CGAN to convert sketches into attenuation images with richer geometric detail in order to interpret the sketch, and reconstructed 3D models from multiple sketches by direct shape optimization; this scheme minimizes a loss that matches the predicted attenuation images rendered from the viewpoints of all input sketches, and they further proposed a progressive updating method to handle inconsistencies among several hand-drawn sketches of the same 3D model. Zhang et al. proposed a view-aware 3D reconstruction network that explicitly constrains the viewpoint of a given sketch to resolve its ambiguity, which improves the reconstruction results and makes the final output controllable. Wang et al. proposed a framework that converts a single-view sketch into a 3D model, in which a point cloud is generated from the hand-drawn sketch and rotated to a canonical view according to the predicted viewpoint so that it can be matched with the ground-truth model; to eliminate differences in hand-drawn sketching style, a sketch standardization module converts the network input into standardized sketches, and to cope with insufficient data they innovatively trained on sketches synthesized by a generative adversarial network (GAN), the collected synthetic sketch dataset effectively compensating for the earlier lack of training data. Delanoy et al. proposed an updater CNN architecture that iteratively fuses information from an arbitrary number of viewpoints to correct the current predicted 3D model; although this innovation is effective, it is limited by the need for sketches drawn from several different viewpoints and performs poorly on sketches with slight jitter.
However, directly converting 2D sketches into 3D models remains a challenging task because of the abstractness, diversity, and ambiguity of sketches.
Unlike natural images, sketches usually consist of sparse lines that outline the projected contour of a 3D object from a particular viewpoint. As binary images, sketches lack visual cues such as texture or color, and this lack of detail makes semantic understanding difficult for neural networks. Existing convolution-based networks capture only the scarce pixel-level features of a sketch and ignore the geometric structure implied by the lines as a whole.
Moreover, drawing a sketch involves choosing among diverse viewpoints; the projected contour differs enormously between, say, a top view and a side view, so the corresponding sketches also differ greatly. This is one of the key issues that current 3D reconstruction work urgently needs to solve.
Finally, the style of a hand-drawn sketch depends on the user, and technical imperfections such as shape distortion or disproportion are unavoidable. Therefore, even if users are required to draw from a fixed viewpoint, different sketches of the same object differ markedly, making it difficult for the network parameters to converge during training.
Summary of the Invention
The technical problem to be solved by the embodiments of the present invention is to provide a hand-drawn 3D reconstruction method based on multi-feature fusion, which uses a novel dual-channel 3D reconstruction framework with an added point cloud processing module to promote the understanding of sketches and ultimately improve the quality of 3D reconstruction.
To solve the above technical problem, an embodiment of the present invention provides a hand-drawn 3D reconstruction method based on multi-feature fusion, comprising the following steps:
S1: use the VGG-16 network to build a multi-level convolution module as an encoder, which captures the pixel features of the binary sketch and outputs a pixel-level feature map;
S2: add a 2D point cloud processing module that first samples a 2D point cloud from the sketch and then extracts a point-level feature map;
S3: build an SAFusion module on the pixel-level and point-level feature maps, which carry the semantic and geometric information implicit in the sketch, and, according to the dependencies between the feature maps of the two modalities, capture complementary semantic cues so as to integrate the two into a unified feature form;
S4: build a decoding module based on 3D deconvolution to convert the information in the 2D feature maps into a 3D volumetric form;
S5: use a 3D variant of the U-Net network to preserve the accurate local regions of the 3D model and refine the details of the reconstruction result.
Wherein, said S1 further comprises a method for preprocessing the sketch, comprising the following steps:
S11: crop along the bounding box of the sketched object, resize the remaining region to a resolution of 224x224 pixels, and convert the sketch into a binary image;
S12: feed the binary image into the SktConv module for feature extraction, and at the same time treat the foreground pixels of the binary sketch as a 2D point cloud and sample a subset from it.
Wherein, in the SktConv module, the first four groups of convolution modules adopt predefined models, immediately followed by two groups of custom modules. Each of the first four predefined groups contains two small-scale 3x3 convolution layers together with the corresponding batch normalization layers, ReLU activation functions, and a max pooling layer, and each pass through a pooling layer halves the size of the feature map. A 28x28x512 intermediate feature map is first extracted from the 224x224x3 hand-drawn sketch and then fed into the two groups of custom modules. Each custom module contains a 2D convolution layer with a 3x3 kernel, a batch normalization layer, and a ReLU activation layer, the output channels of the two convolution layers being 512 and 256 respectively; after the final max pooling layer the feature map is reduced to one third of its size, and the pixel-level feature map is output.
Wherein, said S2 further comprises the step of:
denoting the sampled 2D point set as $P=\{p_i \mid i=1,2,\dots,n\}$, where $n$ is the number of points in the set; the 2D point cloud processing module captures the global context of the sampled point set and combines it with the point-level features one by one to produce the point-level feature map $F_{pnt}$.
Wherein, said S3 further comprises the steps of:
before fusing the multimodal data, converting the extracted feature maps into the same style: each 8x8 feature map of the pixel-level feature map $F_{pix}$ is regarded as a 64-dimensional vector, so that the converted pixel-level feature map $F'_{pix}$ has the same style as the point-level feature map; the pixel-level feature map $F'_{pix}$ and the point-level feature map $F_{pnt}$ are then combined to form a feature matrix $F$ consisting of 512 64-dimensional vectors, which serves as the input of the SAFusion module.
Wherein, said S4 further comprises the step of:
mapping the fused feature matrix into a representation in three-dimensional space and reshaping it accordingly; the decoder module consists of five 3D deconvolution layers with 512, 128, 32, 8, and 1 output channels respectively; the first four deconvolution layers use a stride of 2 and a padding of 1, and each is followed by a batch normalization layer and a ReLU activation layer; the last deconvolution layer is followed by a sigmoid activation layer and generates the voxel representation of the three-dimensional model.
Implementing the embodiments of the present invention has the following beneficial effects: the present invention proposes a novel dual-channel framework that promotes sketch understanding and 3D reconstruction by extracting point-level and pixel-level features as complementary semantic cues, and introduces the SAFusion module based on a multi-head self-attention mechanism, which integrates pixel features and geometric features into a unified feature representation for 3D reconstruction, effectively promoting the understanding of sketches and ultimately improving the quality of 3D reconstruction.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the model framework of the present invention;
Fig. 2 is a schematic diagram of the structure of the SAFusion module of the present invention;
Fig. 3 shows the detailed structure of the SktConv, SktPoint, and 3D-Decoder-Refiner modules;
Fig. 4 is a comparison table of various single-view 3D reconstruction models on the sketch dataset;
Fig. 5 is a schematic comparison of the visual results of different 3D reconstruction methods;
Fig. 6 is a schematic comparison of various single-view 3D reconstruction networks.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
A hand-drawn 3D reconstruction method based on multi-feature fusion according to an embodiment of the present invention, as shown in Fig. 1, proceeds in six main parts. First, based on the characteristics of hand-drawn sketches, the input sketch is preprocessed to facilitate subsequent operations and shorten the computation time. Then the SktConv module and the SktPoint module are introduced: the former extracts a pixel-level feature map directly from the sketch image, while the latter first samples a 2D point cloud from the sketch and then extracts a point-level feature map. These two feature maps are jointly fed into the SAFusion module, which computes the interdependencies among the data to capture key information and enhances the contained contextual information by fusing the differences between the two feature maps. The 3D-Decoder module then generates a preliminary 3D model from the fused features. Finally, the 3D-Refiner module further refines the details and corrects erroneously reconstructed parts, yielding a refined 3D model.
The implementation steps of each part are described in detail below.
S1: use the VGG-16 network to build a multi-level convolution module as an encoder, which captures the pixel features of the binary sketch and outputs a pixel-level feature map.
To facilitate network training and inference, the sketch is first preprocessed; the resulting binary image and 2D point cloud are later used for feature extraction.
Considering that a sketch may appear at an arbitrary position on the canvas and at different sizes, in order to eliminate the interference of the meaningless default background, the present invention crops along the bounding box of the sketched object and resizes the remaining region to a resolution of 224x224 pixels. Blurred lines affect the network's understanding of the sketch, so a fixed threshold is preset: pixels of the sketch with a gray value below 50 are treated as foreground and those above 50 as background, and the gray values of the foreground and background are adjusted to 255 and 0 respectively, converting the sketch into a binary image. The binary image is then fed into the SktConv module for feature extraction.
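As a concrete illustration of this preprocessing step, the following sketch (assuming OpenCV and NumPy, a single-channel input image, and illustrative function names and defaults not taken from the patent) crops to the stroke bounding box, resizes to 224x224, and binarizes with the fixed threshold of 50:

```python
import cv2
import numpy as np

def preprocess_sketch(path, out_size=224, threshold=50):
    """Crop a sketch to its stroke bounding box, resize it, and binarize it.

    Strokes (gray value below `threshold`) become foreground (255),
    everything else becomes background (0), as described in the text.
    """
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ys, xs = np.nonzero(gray < threshold)            # stroke pixels
    if len(xs) == 0:                                 # blank canvas: nothing to crop
        return np.zeros((out_size, out_size), dtype=np.uint8)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    crop = gray[y0:y1 + 1, x0:x1 + 1]                # crop along the bounding box
    crop = cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_AREA)
    return np.where(crop < threshold, 255, 0).astype(np.uint8)
```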
Meanwhile, the SktPoint module in the framework requires a representative point cloud of the input sketch. For this purpose, the present invention treats the foreground pixels of the binary sketch as a 2D point cloud and samples a subset from it. Compared with random sampling, the farthest point sampling adopted by the present invention distributes the samples more evenly over the sketch and preserves the implied shape features to the greatest extent.
The multi-level convolution module, namely the SktConv module, serves as the encoder and is responsible for capturing the pixel features of the binary sketch. Fig. 3 shows the detailed configuration of the proposed SktConv module.
The structure of the SktConv module is extended from the VGG-16 network: its first four groups of convolution modules adopt predefined models and are immediately followed by two groups of custom modules, which learn and extract hierarchical features by increasing the depth of the network.
When initializing the predefined models, the present invention keeps the parameters of VGG-16 pretrained on the ImageNet dataset, so as to accelerate the convergence of the other parameters and reduce the training time. For the first four groups of convolution modules, each predefined group contains two small-scale 3x3 convolution layers together with the corresponding batch normalization layers, ReLU activation functions, and a max pooling layer, and each pass through a pooling layer halves the size of the feature map. The present invention first extracts a 28x28x512 intermediate feature map from the 224x224x3 hand-drawn sketch.
Next, the intermediate feature map is fed into two groups of custom modules. Each custom module contains a 2D convolution layer with a 3x3 kernel, a batch normalization layer, and a ReLU activation layer, the output channels of the two convolution layers being 512 and 256 respectively. The amount of convolution computation is proportional to the square of the side length of the input, so, considering the burden that high-resolution feature maps place on computing power and memory, a max pooling layer with a stride of 3 is added at the end of the whole module to condense the key features while reducing the subsequent computation. After the max pooling layer the feature map shrinks to one third of its size, and the pixel-level feature map $F_{pix}$, consisting of 256 feature maps of size 8x8, is finally output.
In the decoding stage, these pixel-level feature maps are used for fusion and for reconstructing the 3D model.
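A possible PyTorch realization of the SktConv encoder described above is sketched below. The exact backbone slice, the use of vgg16_bn, and the unpadded custom convolutions (which shrink 28x28 to 24x24 before the stride-3 pooling so that the 8x8x256 output referred to later falls out of the arithmetic) are assumptions chosen to make the stated shapes line up, not details confirmed by the text.

```python
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 API assumed for the weights argument

class SktConv(nn.Module):
    """Pixel-level encoder: VGG-16 backbone plus two custom convolution blocks.

    Input:  (B, 3, 224, 224) binary sketch replicated to 3 channels.
    Output: (B, 256, 8, 8) pixel-level feature map.
    """
    def __init__(self, pretrained=True):
        super().__init__()
        vgg = models.vgg16_bn(weights="IMAGENET1K_V1" if pretrained else None)
        # Keep the backbone up to (but not including) the fourth pooling layer,
        # so a 224x224x3 sketch yields a 28x28x512 intermediate feature map.
        self.backbone = vgg.features[:33]
        self.custom = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3),    # 28 -> 26 (no padding: assumption)
            nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, kernel_size=3),    # 26 -> 24
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=3)  # 24 -> 8 (kernel assumed equal to stride)
        )

    def forward(self, x):
        return self.custom(self.backbone(x))
```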
S2: add a 2D point cloud processing module that first samples a 2D point cloud from the sketch and then extracts a point-level feature map.
The 2D point cloud processing module of the present invention, namely the SktPoint module, captures the geometric features implicit in the sketch from the uniformly sampled 2D point set, which better reflects the 3D structure of the target model.
The lines of a sketch are composed of contiguous pixels in the image; according to the pixel positions, a grid is used to convert the sketch pixels into a sketch point set $S$ in coordinate form. The points of $S$ are densely spread over continuous regions, and such redundant items slow down the model. To simplify the point set, it is sampled to obtain evenly distributed points. The present invention adopts farthest point sampling (FPS), selecting points one by one from the sketch point set $S$ to form the sampled point set $P$. At the start, one point is chosen arbitrarily to produce the initial sampled subset. If the sampled subset already contains $k$ points, the next sample is the point among the remaining points of $S$ whose Euclidean distance to the already selected subset is the largest, and this point is then added to the subset. When the number of samples is small, farthest point sampling distributes the points more evenly along the contour of the sketch than random sampling and presents the geometry of the sketch more concisely.
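The following is a minimal NumPy sketch of the standard max-min formulation of farthest point sampling applied to the 2D stroke coordinates; the function name, the default of 256 samples, and the fixed seed are illustrative choices, not taken from the patent.

```python
import numpy as np

def farthest_point_sampling(points, n_samples=256, seed=0):
    """Farthest point sampling (FPS) over 2D sketch points.

    points: (N, 2) array of foreground-pixel coordinates.
    Returns an (n_samples, 2) subset spread evenly over the strokes.
    """
    points = np.asarray(points, dtype=np.float64)
    n = len(points)
    n_samples = min(n_samples, n)
    rng = np.random.default_rng(seed)
    chosen = np.zeros(n_samples, dtype=np.int64)
    chosen[0] = rng.integers(n)                       # arbitrary initial point
    # distance from every point to its nearest already-selected point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(dist))              # farthest remaining point
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[i]], axis=1))
    return points[chosen]
```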
The sampled 2D point set can be expressed as $P=\{p_i \mid i=1,2,\dots,n\}$, where $n$ is the number of points sampled from each sketch and is set to 256 in the scheme of the present invention. The SktPoint module captures the global context of the sampled point set and combines it with the point-level features one by one to generate the feature map. As shown in Fig. 1, the sampled point set first passes through four 1D convolution layers, all with 1x1 kernels and with 64, 64, 128, and 1024 output channels respectively. The convolution operation extracts the feature of each sampled point separately, gradually expanding the coordinate vectors of the $n$ points to 1024 dimensions and outputting an (n x 1024) feature map. Each dimension of the feature vectors is then max-pooled, aggregating the information of all points into a single 1024-dimensional vector that represents the global context of the point set. This vector is replicated $n$ times and concatenated with the $n$ 64-dimensional feature vectors output by the second convolution layer, forming an (n x 1088) feature map that represents the fusion of local geometric information and global semantic information. The concatenated feature map then passes through three further convolution layers, again with 1x1 kernels and with 512, 256, and 64 output channels respectively, producing the point-level feature map $F_{pnt}$, which consists of $n$ 64-dimensional vectors and carries the implicit geometric features.
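A hedged PyTorch sketch of this PointNet-style branch is given below; the channel sequence and the replicate-and-concatenate step follow the text, while the batch normalization and ReLU layers inside each block are assumptions added for a trainable, idiomatic implementation.

```python
import torch
import torch.nn as nn

class SktPoint(nn.Module):
    """Point-level encoder for the sampled 2D sketch points.

    Input:  (B, n, 2) sampled point coordinates.
    Output: (B, n, 64) point-level feature map.
    """
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv1d(cin, cout, 1),
                                 nn.BatchNorm1d(cout), nn.ReLU(inplace=True))
        self.enc1, self.enc2 = block(2, 64), block(64, 64)
        self.enc3, self.enc4 = block(64, 128), block(128, 1024)
        self.dec = nn.Sequential(block(1088, 512), block(512, 256), block(256, 64))

    def forward(self, pts):
        x = pts.transpose(1, 2)                      # (B, 2, n)
        f1 = self.enc2(self.enc1(x))                 # (B, 64, n) per-point features
        f2 = self.enc4(self.enc3(f1))                # (B, 1024, n)
        glob = f2.max(dim=2, keepdim=True).values    # (B, 1024, 1) global context
        glob = glob.expand(-1, -1, pts.size(1))      # replicate the global vector n times
        fused = torch.cat([f1, glob], dim=1)         # (B, 1088, n)
        return self.dec(fused).transpose(1, 2)       # (B, n, 64) point-level feature map
```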
S3: build an SAFusion module on the pixel-level and point-level feature maps, which carry the semantic and geometric information implicit in the sketch, and, according to the dependencies between the feature maps of the two modalities, capture complementary semantic cues so as to integrate the two into a unified feature form.
Using the pixel-level and point-level feature maps that carry the semantic and geometric information implicit in the sketch, this embodiment proposes a feature fusion module, SAFusion.
To this end, the present invention directly adopts the self-attention mechanism of the Transformer network. Before the multimodal data are fused, the extracted feature maps must first be converted into the same form. Each 8x8 feature map of the pixel-level feature map $F_{pix}$ is regarded as a 64-dimensional vector, so that the converted pixel-level feature map $F'_{pix}$ has the same form as the point-level feature map. The pixel-level feature map $F'_{pix}$ and the point-level feature map $F_{pnt}$ are then combined to form a feature matrix $F$ consisting of 512 64-dimensional vectors, which serves as the input of the SAFusion module.
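In tensor terms, this alignment step amounts to flattening each 8x8 map into a 64-dimensional vector and concatenating the two modalities along the vector axis; a minimal PyTorch illustration, with random tensors standing in for real features, is:

```python
import torch

# f_pix: (B, 256, 8, 8) pixel-level feature map; f_pnt: (B, 256, 64) point-level feature map
f_pix = torch.randn(1, 256, 8, 8)
f_pnt = torch.randn(1, 256, 64)

f_pix_vec = f_pix.flatten(start_dim=2)               # (B, 256, 64): each 8x8 map as a 64-d vector
fusion_input = torch.cat([f_pix_vec, f_pnt], dim=1)  # (B, 512, 64) feature matrix F
```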
The structure of the SAFusion module is shown in Fig. 2: it consists of six multi-head attention modules (four of which are omitted in the figure), each of which uses three fully connected layers and a Scaled Dot-Product Attention layer to compute the correlations between the input vectors.
Given the feature matrix of input vectors, the self-attention module captures the interdependencies among them, which serve as the weight coefficients for extracting and fusing the key information. All input vectors are superimposed according to their importance, and the resulting output matrix serves as the input to the next layer.
Specifically, the feature matrix is first fed into three fully connected layers and transformed into a Query matrix, a Key matrix, and a Value matrix respectively. To speed up the computation, each vector in the Query, Key, and Value matrices is split into segments of dimensions $d_q$, $d_k$, and $d_v$. Each output vector $b_i$ corresponds to three row vectors $q_i$, $k_i$, and $v_i$, which are the $i$-th rows of the Query, Key, and Value matrices.
First, this embodiment uses the row vectors to compute the correlation coefficient $e_{ij}$ between each pair of input vectors: $e_{ij}=\dfrac{q_i \cdot k_j^{\top}}{\sqrt{d_k}}$.
The softmax function is then used for normalization to obtain the weight coefficients $\alpha_{ij}$: $\alpha_{ij}=\dfrac{\exp(e_{ij})}{\sum_{l}\exp(e_{il})}$.
Finally, all the Value vectors are weighted and summed to generate the corresponding output vector: $b_i=\sum_{j}\alpha_{ij}\,v_j$.
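The three equations above correspond directly to standard scaled dot-product attention; a minimal single-head PyTorch sketch (the function name is illustrative) is:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (num_vectors, d_k). Returns the weighted sums b_i described above."""
    d_k = q.size(-1)
    e = q @ k.transpose(-2, -1) / d_k ** 0.5     # correlation coefficients e_ij
    alpha = torch.softmax(e, dim=-1)             # weights alpha_ij
    return alpha @ v                             # outputs b_i = sum_j alpha_ij * v_j
```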
In the application of this embodiment, the SAFusion module consists of six identical multi-head attention modules arranged in sequence, as shown in Fig. 2. Each multi-head attention module consists of three fully connected layers that linearly map the input into the Query, Key, and Value matrices. Each matrix is split and evenly divided into groups, with each group containing the corresponding Query, Key, and Value matrices, so that several Scaled Dot-Product Attention layers, or heads, can run in parallel on each group of matrices; the dimensionality of the matrices in each head is the vector dimension divided by the number of parallel heads used in this embodiment. The Scaled Dot-Product Attention layer computes the dot product between each vector in the Query matrix and all vectors in the Key matrix, scales it by $1/\sqrt{d_k}$, and applies the softmax function to obtain the correlations between the input vectors, which serve as the weights when accumulating the vectors of the Value matrix. After the outputs of all parallel heads are concatenated, the result is linearly mapped back to the initial dimension. This embodiment uses a residual connection around each multi-head attention module, followed by layer normalization. Finally, the SAFusion module outputs a fused feature matrix of the same size as its input.
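A hedged PyTorch sketch of such a fusion stack is shown below; nn.MultiheadAttention already bundles the three linear projections, the parallel heads, and the output projection described above. The number of heads is left unspecified in the text, so num_heads=8 is purely an assumption, and the class is an illustrative reading rather than the authors' reference implementation.

```python
import torch.nn as nn

class SAFusion(nn.Module):
    """Stack of multi-head self-attention blocks with residual connections and layer norm."""
    def __init__(self, dim=64, num_layers=6, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, f):                      # f: (B, 512, 64) feature matrix
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(f, f, f)             # Q, K, V all come from the same matrix
            f = norm(f + out)                  # residual connection + layer normalization
        return f                               # (B, 512, 64) fused feature matrix
```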
S4: build a decoding module based on 3D deconvolution to convert the information in the 2D feature maps into a 3D volumetric form.
The 3D Transposed Convolutional Decoder module (3D-Decoder) is responsible for converting the information in the 2D feature maps into a 3D volumetric form and progressively raising the voxel resolution of the 3D model to its final value.
First, the fused feature matrix is mapped into a representation in three-dimensional space and reshaped accordingly. As shown in Fig. 3, the decoder module consists mainly of five 3D deconvolution layers with 512, 128, 32, 8, and 1 output channels respectively. The first four deconvolution layers use a stride of 2 and a padding of 1, and each is followed by a batch normalization layer and a ReLU activation layer. The last deconvolution layer is followed by a sigmoid activation layer and generates the voxel representation of the three-dimensional model.
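One way this decoder could be realized in PyTorch is sketched below. Only the channel sequence 512-128-32-8-1, the stride of 2, and the padding of 1 are stated in the text; the kernel sizes (4 for the upsampling layers, 1 for the last layer), the initial reshape to a 2x2x2 volume, and the 32^3 output resolution are assumptions chosen so that each stride-2, padding-1 deconvolution exactly doubles the resolution.

```python
import torch.nn as nn

class Decoder3D(nn.Module):
    """3D transposed-convolution decoder turning the fused features into a voxel grid."""
    def __init__(self, in_dim=512 * 64):
        super().__init__()
        self.fc = nn.Linear(in_dim, 512 * 2 * 2 * 2)   # map the feature matrix into 3D space (assumed)
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
        self.deconv = nn.Sequential(
            up(512, 512), up(512, 128), up(128, 32), up(32, 8),   # 2 -> 4 -> 8 -> 16 -> 32
            nn.ConvTranspose3d(8, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, fused):                          # fused: (B, 512, 64)
        x = self.fc(fused.flatten(1)).view(-1, 512, 2, 2, 2)
        return self.deconv(x).squeeze(1)               # (B, 32, 32, 32) occupancy probabilities
```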
S5: use a 3D variant of the U-Net network to preserve the accurate local regions of the 3D model and refine the details of the reconstruction result.
The 3D-Decoder module usually generates a coarse-grained 3D voxel representation. Inspired by U-Net, which was originally designed for medical image segmentation, the present invention uses 3D-Refiner, a refinement module, to improve the quality of the 3D reconstruction.
As shown in Fig. 3, it follows the design of a 3D encoder-decoder with U-Net connections, which preserves the local structure of the generated 3D model. The encoder of the 3D-Refiner module consists of three 3D convolution modules with a padding of 2 and 32, 64, and 128 output channels respectively. Each 3D convolution layer is followed by a batch normalization layer and a leaky ReLU activation function, and at the end of each module a max pooling layer compresses the spatial size. The encoder outputs a feature map that is flattened into a high-dimensional vector; after two fully connected layers with output dimensions of 2048 and 8192, an 8192-dimensional vector is generated to represent the global semantic information.
The decoder of the 3D-Refiner module adopts a symmetric structure and consists of three 3D deconvolution modules, but after the last deconvolution layer only a sigmoid activation function is connected to normalize the voxel probabilities. The semantic information extracted by the encoder is combined with the local information preserved by the residual shortcut structure; then, in the decoding stage, the 3D deconvolution layers progressively enlarge the 3D model, so that the reconstruction result retains accurate local regions.
The coarse-grained 3D model is fed into the 3D-Refiner module to refine details such as dents on the object surface or deviating outliers. Finally, the target 3D model is reconstructed.
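A hedged PyTorch sketch of this refiner follows. The kernel size of 4, the pooling size of 2, the addition-style skip connections, and the 32^3 input resolution are assumptions chosen so that the encoder output flattens to the 8192 dimensions matching the fully connected layers in the text; the channel sequence (32, 64, 128) and the 2048/8192 fully connected layers are as stated.

```python
import torch.nn as nn

class Refiner3D(nn.Module):
    """U-Net style 3D encoder-decoder that refines a coarse voxel grid."""
    def __init__(self):
        super().__init__()
        def enc(cin, cout):
            return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=4, padding=2),
                                 nn.BatchNorm3d(cout), nn.LeakyReLU(0.2, inplace=True),
                                 nn.MaxPool3d(kernel_size=2))
        def dec(cin, cout):
            return nn.Sequential(nn.ConvTranspose3d(cin, cout, kernel_size=4, stride=2, padding=1),
                                 nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
        self.enc1, self.enc2, self.enc3 = enc(1, 32), enc(32, 64), enc(64, 128)
        self.fc = nn.Sequential(nn.Linear(8192, 2048), nn.ReLU(inplace=True),
                                nn.Linear(2048, 8192), nn.ReLU(inplace=True))
        self.dec1, self.dec2 = dec(128, 64), dec(64, 32)
        self.dec3 = nn.Sequential(nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),
                                  nn.Sigmoid())

    def forward(self, coarse):                       # coarse: (B, 32, 32, 32) voxel probabilities
        x = coarse.unsqueeze(1)                      # add channel dimension
        e1 = self.enc1(x)                            # (B, 32, 16, 16, 16)
        e2 = self.enc2(e1)                           # (B, 64, 8, 8, 8)
        e3 = self.enc3(e2)                           # (B, 128, 4, 4, 4) -> flattens to 8192
        g = self.fc(e3.flatten(1)).view_as(e3)       # global semantic vector, reshaped back
        d1 = self.dec1(g + e3)                       # U-Net style additive skip connections
        d2 = self.dec2(d1 + e2)
        return self.dec3(d2 + e1).squeeze(1)         # (B, 32, 32, 32) refined voxel grid
```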
Previous reconstruction networks customarily use the mean binary cross-entropy of the voxels as the loss function to measure the difference between the reconstructed 3D model and the ground-truth 3D model. However, the voxels of the reconstructed object occupy only a small proportion of the whole space, and most of them lie in the interior of the object; as a result, the binary cross-entropy tends to underestimate the reconstruction probability.
To cope with the sparsity of the 3D volumetric representation, this embodiment adopts the Mean Squared False Cross-Entropy Loss (MSFCEL), which is better suited to optimizing the Intersection-over-Union (IoU). The loss is computed as follows: the space occupied by the target 3D model is divided into filled voxels $V_p$ and unfilled voxels $V_n$, whose numbers are $N_p$ and $N_n$ respectively. Let $p_i$ denote the occupancy of the $i$-th voxel in $V_p$ and $p_j$ the occupancy of the $j$-th voxel in $V_n$, which means that in the target 3D model $p_i=1$ and $p_j=0$; $\hat{p}_i$ and $\hat{p}_j$ are the predicted probabilities corresponding to $p_i$ and $p_j$.
FPCE is the false-positive cross-entropy defined on the unfilled voxels, while FNCE is the false-negative cross-entropy defined on the filled voxels.
By construction, MSFCEL combines the sum and the difference of FPCE and FNCE, so that minimizing it simultaneously minimizes the loss on the filled voxels and the loss on the unfilled voxels, thereby balancing their prediction accuracy.
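A heavily hedged sketch of such a loss is given below. FPCE and FNCE follow directly from their definitions above, but the exact way the patent combines them (described via their sum and difference) is not reproduced in the text; the squared combination used here, which equals ((FPCE+FNCE)^2 + (FPCE-FNCE)^2) / 2, is one consistent reading of the name "mean squared false cross-entropy" and is purely an assumption.

```python
import torch

def msfcel(pred, target, eps=1e-7):
    """Hedged sketch of a mean-squared false cross-entropy style loss.

    pred, target: (B, D, H, W) tensors; target is a binary occupancy grid.
    """
    pred = pred.clamp(eps, 1 - eps)
    filled, unfilled = target > 0.5, target <= 0.5
    fnce = -(torch.log(pred[filled])).mean()          # false negatives on filled voxels
    fpce = -(torch.log(1 - pred[unfilled])).mean()    # false positives on unfilled voxels
    return fpce ** 2 + fnce ** 2                      # assumed combination of FPCE and FNCE
```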
The embodiments of the present invention are evaluated with the following metric:
First, a threshold is set to classify the voxels of the predicted 3D model as filled or unfilled.
The Intersection-over-Union (IoU) is then used as the evaluation metric, defined as
$$\mathrm{IoU}=\frac{\sum_{i,j,k}\mathrm{I}\big(\hat{p}_{(i,j,k)}>t\big)\,\mathrm{I}\big(p_{(i,j,k)}\big)}{\sum_{i,j,k}\mathrm{I}\Big[\mathrm{I}\big(\hat{p}_{(i,j,k)}>t\big)+\mathrm{I}\big(p_{(i,j,k)}\big)\Big]}$$
where $p_{(i,j,k)}$ and $\hat{p}_{(i,j,k)}$ denote the occupancy of the voxel at coordinate $(i,j,k)$ in the target 3D model and in the predicted 3D model respectively, the summation runs over the length, width, and height of the space containing the 3D model, $t$ is the probability threshold, set to 0.5 in the experiments, and $\mathrm{I}(\cdot)$ is the indicator function.
The IoU metric evaluates the similarity between two 3D models; the higher the IoU value, the better the reconstruction result.
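The IoU above reduces to counting voxels in the intersection and union of the two binarized grids; a minimal PyTorch sketch (the function name is illustrative) is:

```python
import torch

def voxel_iou(pred, target, threshold=0.5):
    """IoU between a predicted probability grid and a binary target occupancy grid."""
    pred_bin = pred > threshold
    target_bin = target > 0.5
    intersection = (pred_bin & target_bin).sum().float()
    union = (pred_bin | target_bin).sum().float()
    return (intersection / union).item()
```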
To evaluate the reconstruction performance of the proposed Sketch2Vox, the present invention is compared with several state-of-the-art 3D reconstruction networks, including 3D-R2N2, 3D-VAE-GAN, and Pix2Vox. The first, 3D-R2N2, embeds an LSTM module and sequentially updates the states of the feature units before decoding. The second, 3D-VAE-GAN, adopts an adversarial architecture to produce reliable results. The third, Pix2Vox, adds a fusion module that selects a high-quality reconstruction for each part from different coarse 3D models. Every network is trained with the same dataset and experimental settings, and the 3D reconstruction performance of these methods is compared on two datasets.
As shown in the comparison table of various single-view 3D reconstruction models on the sketch dataset in Fig. 4, the proposed Sketch2Vox method achieves the best reconstruction performance and substantially outperforms all comparable methods in every test category. Specifically, the IoU of the comparable 3D reconstruction methods lies only between 32.3% and 33.05%, inferior to the method of the present invention. As shown in the table, for object categories containing slender structures, such as tables, chairs, or floor lamps, the results reconstructed by the proposed method improve markedly, by up to 13.5%.
To visualize the 3D models reconstructed by the different methods, several examples are shown in Fig. 5. As illustrated, the reconstruction results of the comparable methods are inferior to those of the proposed method in terms of detail. For example, for the airplane in the first row, the chair in the fourth row, or the table in the last row, the results of the other methods exhibit obvious missing parts, whereas the method of the present invention overcomes these defects. In addition, compared with comparable methods, the proposed model reconstructs the flat regions of objects better.
Fig. 6 shows the average IoU on the ModelNet dataset. Overall, the proposed Sketch2Vox method outperforms the other state-of-the-art methods and achieves an improvement in IoU. Specifically, the proposed method achieves the best performance among all methods on the bathtub, bed, chair, monitor, night_stand, and sofa categories, and is only slightly inferior to 3D-R2N2 on the desk, dresser, table, and toilet categories. These comparison results show that, even for contour images of 3D models, the geometric information added by the 2D point cloud processing module in Sketch2Vox can still improve the final reconstruction quality.
The above disclosure is merely a preferred embodiment of the present invention and certainly cannot be used to limit the scope of the claims of the present invention; therefore, equivalent changes made according to the claims of the present invention still fall within the scope of the present invention.