CN116932803A - Data set generation method and training method based on multi-mode pre-training model - Google Patents

Data set generation method and training method based on multi-mode pre-training model

Info

Publication number
CN116932803A
Authority
CN
China
Prior art keywords
dimensional content
training
attribute
dimensional
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311177091.1A
Other languages
Chinese (zh)
Other versions
CN116932803B (en)
Inventor
杜国光
范宝余
王丽
郭振华
赵雅倩
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202311177091.1A
Publication of CN116932803A
Application granted
Publication of CN116932803B
Legal status: Active
Anticipated expiration

Abstract

Translated from Chinese

The invention discloses a data set generation method and a training method based on a multi-modal pre-training model, applied in the technical field of three-dimensional content generation. The method includes: rendering each three-dimensional content in a three-dimensional content set into a two-dimensional image; constructing a question set, where the question set includes questions corresponding to multiple attributes; for each two-dimensional image, querying an image-text question-answering pre-trained model based on the question set to obtain the answer to each question, and determining a text description of each attribute based on the answers corresponding to that attribute; and determining, based on the text descriptions, the description information of each attribute of each three-dimensional content to obtain a three-dimensional content description of each three-dimensional content and generate a three-dimensional content description data set, where each three-dimensional content description contains the description information of multiple attributes. In this way, the quality of the data set can be improved, thereby ensuring the performance of the three-dimensional content generation model and improving the accuracy of the generated three-dimensional content.

Description

Translated from Chinese
Data set generation method and training method based on multi-modal pre-training model

Technical field

The present invention relates to the field of three-dimensional content generation, and in particular to a data set generation method and device based on a multi-modal pre-training model, a training method, a three-dimensional content generation method, an electronic device, and a computer-readable storage medium.

Background

AIGC (Artificial Intelligence Generated Content) refers to the use of artificial intelligence technology to automatically produce digital content in modalities including text, audio, and images. AIGC is also used to generate 3D (three-dimensional) content: 3D content generation technology produces high-quality, diverse 3D content that is widely used as 3D digital assets in industries such as virtual reality and augmented reality.

In current text-based three-dimensional content generation schemes, the text descriptions in three-dimensional data sets are of poor quality, so a well-performing three-dimensional content generation model cannot be obtained, resulting in low accuracy of the generated three-dimensional content.

Summary of the invention

In view of this, the purpose of the present invention is to provide a data set generation method and a training method based on a multi-modal pre-training model, which can improve the quality of the data set, thereby ensuring the performance of the three-dimensional content generation model and improving the accuracy of the generated three-dimensional content. The specific scheme is as follows:

In a first aspect, the present invention discloses a data set generation method based on a multi-modal pre-training model, including:

rendering each three-dimensional content in a three-dimensional content set into a two-dimensional image;

constructing a question set, where the question set includes questions corresponding to multiple attributes;

for each two-dimensional image, querying an image-text question-answering pre-trained model based on the question set to obtain the answer to each question, and determining a text description of each attribute based on the answers corresponding to that attribute;

determining, based on the text descriptions, the description information of each attribute of each three-dimensional content, and obtaining a three-dimensional content description of each three-dimensional content to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of multiple attributes.

Optionally, rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image includes:

rendering each three-dimensional content in the three-dimensional content set into two-dimensional images from multiple viewing angles.

Optionally, rendering each three-dimensional content in the three-dimensional content set into two-dimensional images from multiple viewing angles includes:

calculating multiple virtual camera positions based on a spherical coordinate system;

rendering each three-dimensional content in the three-dimensional content set based on the multiple virtual camera positions to obtain two-dimensional images from multiple viewing angles.

Optionally, rendering each three-dimensional content in the three-dimensional content set into a two-dimensional image includes:

converting each three-dimensional content in the three-dimensional content set to a world coordinate system;

multiplying each point of the three-dimensional content in the world coordinate system by a scaling factor to complete scale scaling;

rendering the scaled three-dimensional content into a two-dimensional image.

Optionally, before multiplying each point of the three-dimensional content in the world coordinate system by the scaling factor, the method further includes:

calculating, for each coordinate axis, the difference between the maximum and minimum values of the three-dimensional content in the world coordinate system;

taking the reciprocal of the largest of the differences over the coordinate axes to obtain the scaling factor.

Optionally, determining the description information of each attribute of each three-dimensional content based on the text descriptions includes:

for each attribute, using an evaluation network model to determine scores for the text descriptions from multiple viewing angles, and fusing the text descriptions whose scores are greater than a preset threshold to obtain the description information of that attribute, where the text descriptions from the multiple viewing angles all correspond to the same three-dimensional content.

Optionally, for each attribute, using the evaluation network model to determine the scores of the text descriptions from multiple viewing angles includes:

for each attribute, inputting the text descriptions from multiple viewing angles into the evaluation network model, extracting local features and global features, constructing joint features based on the local features and the global features, and outputting the scores of the text descriptions from the multiple viewing angles based on the joint features.

Optionally, inputting the text descriptions from multiple viewing angles into the evaluation network model and extracting local and global features includes:

for each attribute, inputting the text descriptions from multiple viewing angles into the evaluation network model, passing them through a bidirectional encoder representation structure and one multi-layer perceptron layer to obtain local features, and then through another multi-layer perceptron layer and a pooling layer to obtain global features.

Optionally, outputting the scores of the text descriptions from multiple viewing angles based on the joint features includes:

outputting the scores of the text descriptions from the multiple viewing angles based on the joint features using a preset number of multi-layer perceptron layers.
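The local/global feature extraction and joint scoring described above can be sketched schematically as follows. This is a minimal NumPy illustration, not the patented implementation: the bidirectional encoder (a BERT-style structure) is replaced by precomputed per-view embeddings, the pooling is mean pooling, and all layer widths are assumed.

```python
import numpy as np

def mlp(x, w, b):
    # One multi-layer-perceptron layer with a ReLU activation.
    return np.maximum(0.0, x @ w + b)

def score_descriptions(view_embeddings, rng):
    """Score the per-view text descriptions of one attribute.

    view_embeddings: (n_views, d) array standing in for the output of a
    bidirectional encoder applied to each view's description. Returns one
    score per view. All weights are random stand-ins for trained ones.
    """
    n_views, d = view_embeddings.shape
    h = 16  # assumed hidden width
    w1, b1 = rng.standard_normal((d, h)), np.zeros(h)
    w2, b2 = rng.standard_normal((h, h)), np.zeros(h)
    w3, b3 = rng.standard_normal((2 * h, 1)), np.zeros(1)

    local = mlp(view_embeddings, w1, b1)              # (n_views, h) local features
    glob = mlp(local, w2, b2).mean(axis=0)            # MLP + pooling -> (h,) global feature
    joint = np.concatenate(                           # joint feature per view
        [local, np.tile(glob, (n_views, 1))], axis=1)
    return mlp(joint, w3, b3).ravel()                 # one score per view description

rng = np.random.default_rng(0)
scores = score_descriptions(rng.standard_normal((4, 8)), rng)
```

In a real system the scores would then be thresholded and the surviving descriptions fused, as the claims describe.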

Optionally, the training process of the evaluation network model includes:

constructing a training data set, where the training data set includes training samples and label information corresponding to the training samples, and the training samples are text descriptions from multiple viewing angles corresponding to different attributes;

inputting the training samples into an initial model to obtain scores for the text descriptions from multiple viewing angles;

calculating a training loss based on the scores and the label information;

updating the parameters of the initial model based on the training loss to obtain a parameter-updated model;

iterating the training of the parameter-updated model until a training stop condition is met, and determining the current parameter-updated model as the evaluation network model.
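The training loop above (score, compute loss against the labels, update parameters, iterate until a stop condition) is a standard supervised loop; a toy sketch follows. The linear scorer, squared-error loss, and gradient-descent update are assumed stand-ins, since the patent does not fix the model, loss, or optimizer.

```python
import numpy as np

def train_evaluation_model(samples, labels, lr=0.05, max_iters=500, tol=1e-8):
    """Toy version of the described loop: score -> loss -> update -> iterate.

    samples: (n, d) feature vectors standing in for encoded descriptions;
    labels: (n,) target scores. Training stops when the loss drops below
    tol (the 'training stop condition') or after max_iters iterations.
    """
    n, d = samples.shape
    w = np.zeros(d)
    loss = np.inf
    for _ in range(max_iters):
        scores = samples @ w                         # scores for the descriptions
        loss = np.mean((scores - labels) ** 2)       # training loss vs. labels
        if loss < tol:                               # training stop condition
            break
        grad = 2.0 / n * samples.T @ (scores - labels)
        w -= lr * grad                               # parameter update
    return w, loss

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 3))
y = X @ np.array([0.5, -1.0, 2.0])
w, final_loss = train_evaluation_model(X, y)
```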

Optionally, the method further includes:

for any attribute, if none of the scores of the text descriptions from multiple viewing angles output by the evaluation network model is greater than the preset threshold, re-querying the image-text question-answering pre-trained model based on the question corresponding to that attribute to obtain new text descriptions of the attribute, and then performing again the step of using the evaluation network model to determine the scores of the text descriptions from multiple viewing angles and fusing the text descriptions whose scores are greater than the preset threshold to obtain the description information of that attribute.

Optionally, constructing the question set includes:

setting questions for multiple attributes respectively to obtain the question set;

where the multiple attributes include at least two of a concept attribute, a geometric attribute, a color attribute, and a material attribute.

Optionally, querying the image-text question-answering pre-trained model based on the question set to obtain the answer to each question, and determining the text description of each attribute based on the answers corresponding to that attribute, includes:

when each attribute in the question set corresponds to one question, querying the image-text question-answering pre-trained model multiple times with the question of each attribute to obtain multiple answers, and determining the answer most similar to the two-dimensional image among the multiple answers as the text description of that attribute.

Optionally, determining the answer most similar to the two-dimensional image among the multiple answers as the text description of the attribute includes:

using an image-text contrastive pre-trained model to calculate the similarity between each of the multiple answers and the two-dimensional image, and determining the answer with the highest similarity as the text description of that attribute.
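The answer-selection step can be sketched as an argmax over image-text similarities. In practice the embeddings would come from an image-text contrastive pre-trained model (a CLIP-style encoder, which conventionally uses cosine similarity); here they are passed in precomputed, so the embedding step itself is assumed, and the toy vectors below are invented for illustration.

```python
import numpy as np

def pick_most_similar_answer(answers, answer_embs, image_emb):
    """Select the candidate answer whose embedding is most similar to the image.

    answer_embs: (n_answers, d) text embeddings; image_emb: (d,) image
    embedding. Cosine similarity is used, as in contrastive image-text models.
    """
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb)
    sims = a @ v                           # cosine similarity per answer
    return answers[int(np.argmax(sims))], sims

# Toy example: three candidate answers for one attribute question.
answers = ["a red chair", "a blue table", "a wooden chair"]
embs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])
best, sims = pick_most_similar_answer(answers, embs, np.array([1.0, 0.0]))
```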

Optionally, querying the image-text question-answering pre-trained model based on the question set to obtain the answer to each question, and determining the text description of each attribute based on the answers corresponding to that attribute, includes:

when each attribute in the question set corresponds to multiple questions, querying the image-text question-answering pre-trained model with the multiple questions of each attribute to obtain multiple answers, and determining the answer most similar to the two-dimensional image among the multiple answers as the text description of that attribute.

In a second aspect, the present invention discloses a three-dimensional content generation model training method, including:

training a generative adversarial network based on the three-dimensional content description data set, where the three-dimensional content description data set is generated according to the aforementioned data set generation method based on a multi-modal pre-training model, and the generative adversarial network includes a discriminator, a generator, and a conditional controller;

when the discriminator cannot distinguish the content generated by the generator, determining the network structure composed of the generator and the conditional controller in the current generative adversarial network as the three-dimensional content generation model.

Optionally, training the generative adversarial network based on the three-dimensional content description data set includes:

alternately training the generator and the discriminator based on the three-dimensional content description data set: first freezing the generator and training the discriminator, then freezing the discriminator and training the generator, and repeating this alternation multiple times.
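The freeze/train alternation above is a schedule over two update procedures. The sketch below demonstrates only that schedule; the step functions are stubs that record the order of updates, since the actual generator and discriminator losses are specified in the claims that follow.

```python
def alternate_training(n_rounds, d_steps=1, g_steps=1):
    """Record the alternation: freeze G, train D; then freeze D, train G.

    Stub 'steps' stand in for the real discriminator/generator updates;
    only the alternation schedule itself is demonstrated.
    """
    log = []
    for _ in range(n_rounds):
        for _ in range(d_steps):   # generator frozen, discriminator trained
            log.append("D")
        for _ in range(g_steps):   # discriminator frozen, generator trained
            log.append("G")
    return log

schedule = alternate_training(n_rounds=3)
```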

Optionally, the training process of the generator includes:

obtaining a three-dimensional content description from the three-dimensional content description data set, and using the conditional controller to generate a multi-attribute encoding descriptor;

transforming initial noise to obtain a noise encoding descriptor, and constructing a joint descriptor based on the noise encoding descriptor and the multi-attribute encoding descriptor;

inputting the joint descriptor into the generator to obtain the three-dimensional content generated by the generator;

inputting the generated three-dimensional content into the discriminator to obtain a predicted value corresponding to that three-dimensional content;

calculating a first training loss based on the predicted value and a first loss function, and updating the parameters of the generator based on the first training loss.

Optionally, the training process of the discriminator includes:

inputting ground-truth data and the three-dimensional content generated by the generator into the discriminator to obtain a first predicted value corresponding to the ground-truth data and a second predicted value corresponding to the generated three-dimensional content;

calculating a second training loss based on the first predicted value, the second predicted value, and a second loss function, and updating the parameters of the discriminator based on the second training loss.

In a third aspect, the present invention discloses a three-dimensional content generation method, including:

obtaining a target description and target noise, and inputting them into a three-dimensional content generation model, where the target description is the description information of multiple attributes, and the three-dimensional content generation model is trained according to the aforementioned three-dimensional content generation model training method;

using the conditional controller to generate a multi-attribute encoding descriptor, and transforming the target noise to obtain a noise encoding descriptor;

constructing a joint descriptor based on the multi-attribute encoding descriptor and the noise encoding descriptor, and using the generator to generate the target three-dimensional content based on the joint descriptor.
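The inference path above (conditional controller encodes the attribute descriptions, the noise is transformed, the two descriptors are concatenated into a joint descriptor, and the generator maps it to 3D content) can be sketched with stand-in mappings. Every network below is an assumed random projection, not the patented networks; only the data flow matches the description.

```python
import numpy as np

def generate_target_content(target_description, target_noise, rng):
    """Sketch of the inference path: controller -> joint descriptor -> generator.

    target_description: mapping from attribute name to its description text.
    All mappings are random stand-ins; the descriptor widths are assumed.
    """
    d = 8  # assumed per-part descriptor width
    # Conditional controller: encode each attribute's description.
    attr_embs = [rng.standard_normal(d) for _ in target_description.values()]
    multi_attr_desc = np.concatenate(attr_embs)        # multi-attribute encoding descriptor
    # Transform the target noise into a noise encoding descriptor.
    noise_desc = rng.standard_normal((target_noise.size, d)).T @ target_noise
    joint = np.concatenate([multi_attr_desc, noise_desc])  # joint descriptor
    # Generator: map the joint descriptor to a point cloud of shape (64, 3).
    W = rng.standard_normal((joint.size, 3 * 64))
    return (joint @ W).reshape(64, 3)

rng = np.random.default_rng(2)
desc = {"concept": "chair", "geometry": "four legs",
        "color": "red", "material": "wood"}
content = generate_target_content(desc, rng.standard_normal(16), rng)
```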

In a fourth aspect, the present invention discloses a data set generation device, including:

a two-dimensional image rendering module, configured to render each three-dimensional content in a three-dimensional content set into a two-dimensional image;

a question set construction module, configured to construct a question set, where the question set includes questions corresponding to multiple attributes;

an attribute description determination module, configured to, for each two-dimensional image, query an image-text question-answering pre-trained model based on the question set to obtain the answer to each question, and determine the text description of each attribute based on the answers corresponding to that attribute;

a data set determination module, configured to determine, based on the text descriptions, the description information of each attribute of each three-dimensional content, and obtain a three-dimensional content description of each three-dimensional content to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of multiple attributes.

In a fifth aspect, the present invention discloses an electronic device, including a memory and a processor, where:

the memory is configured to store a computer program;

the processor is configured to execute the computer program to implement the aforementioned data set generation method based on a multi-modal pre-training model, and/or the aforementioned three-dimensional content generation model training method, and/or the aforementioned three-dimensional content generation method.

In a sixth aspect, the present invention discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the aforementioned data set generation method based on a multi-modal pre-training model, and/or the aforementioned three-dimensional content generation model training method, and/or the aforementioned three-dimensional content generation method.

It can be seen that the present invention first renders each three-dimensional content in a three-dimensional content set into a two-dimensional image and constructs a question set that includes questions corresponding to multiple attributes; then, for each two-dimensional image, it queries the image-text question-answering pre-trained model based on the question set to obtain the answer to each question, determines the text description of each attribute based on the answers corresponding to that attribute, and then determines, based on the text descriptions, the description information of each attribute of each three-dimensional content, obtaining the three-dimensional content description of each three-dimensional content to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of multiple attributes. That is, the present invention constructs a question set containing questions for multiple attributes, queries the image-text question-answering pre-trained model with the two-dimensional images corresponding to the three-dimensional content to obtain the answer to each question, determines the text description of each attribute based on the answers, and then determines the description information of the multiple attributes corresponding to each three-dimensional content, obtaining the three-dimensional content description data set.

The beneficial effect of the present invention is that each three-dimensional content description contains description information for each of multiple attributes, so the resulting three-dimensional content description data set is richer and more accurate. This improves the quality of the data set, thereby ensuring the performance of the three-dimensional content generation model and improving the accuracy of the generated three-dimensional content.

Description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.

Figure 1 is a flow chart of a data set generation method based on a multi-modal pre-training model provided by an embodiment of the present invention;

Figure 2 is a schematic diagram of a specific data set generation process provided by an embodiment of the present invention;

Figure 3 is a structural diagram of an evaluation network model provided by an embodiment of the present invention;

Figure 4 illustrates a three-dimensional content generation model training method provided by an embodiment of the present invention;

Figure 5 is a schematic diagram of a generative adversarial network provided by an embodiment of the present invention;

Figure 6 is a schematic diagram of conditional controller training provided by an embodiment of the present invention;

Figure 7 is a schematic structural diagram of a data set generation device provided by an embodiment of the present invention;

Figure 8 is a structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

In current text-based three-dimensional content generation schemes, the text descriptions in three-dimensional data sets are of poor quality, so a well-performing three-dimensional content generation model cannot be obtained, resulting in low accuracy of the generated three-dimensional content. To this end, the present invention provides a three-dimensional content generation solution that can improve the quality of the data set, thereby ensuring the performance of the three-dimensional content generation model and improving the accuracy of the generated three-dimensional content.

As shown in Figure 1, an embodiment of the present invention discloses a data set generation method based on a multi-modal pre-training model, including:

Step S11: Render each three-dimensional content in a three-dimensional content set into a two-dimensional image.

In a specific implementation, each three-dimensional content in the three-dimensional content set can be converted to a world coordinate system; each point of the three-dimensional content in the world coordinate system is multiplied by a scaling factor to complete scale scaling; and the scaled three-dimensional content is rendered into a two-dimensional image. Specifically, the difference between the maximum and minimum values of the three-dimensional content in the world coordinate system can be calculated for each coordinate axis, and the reciprocal of the largest of these differences is taken as the scaling factor.

It should be pointed out that the embodiment of the present invention first preprocesses the three-dimensional content. Since the position in space of each three-dimensional content's point cloud is not fixed, two preprocessing steps, coordinate system alignment and scale normalization, are performed first to keep the rendering process controllable. Coordinate system alignment means aligning all points of the three-dimensional content with the world coordinate system: first calculate the center o_center of the current object in the world coordinate system, then compute the new coordinates of each point as p_new = p_ori - o_center, where p_ori is the original coordinates of the point. Scale normalization means normalizing the model to a standard scale, i.e. scaling the object into a cube with side length 1: first calculate the differences between the maximum and minimum values of the three-dimensional content on the x, y, and z axes, take the largest of these differences, and take its reciprocal as the scaling factor, i.e. s = 1 / max((max_x - min_x), (max_y - min_y), (max_z - min_z)); then multiply the coordinates of every point of the three-dimensional content by the scaling factor to complete the scale scaling.
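The coordinate-system alignment and scale normalization above translate directly into code. A minimal NumPy sketch follows; the object center is taken here as the mean of the points, an assumed choice that the description leaves open (a bounding-box center would also fit).

```python
import numpy as np

def preprocess(points):
    """Align a point cloud with the world coordinate system and normalize
    it into a unit cube, as described above.

    points: (n, 3) array of point coordinates. The center is the mean of
    the points (assumed choice).
    """
    o_center = points.mean(axis=0)
    aligned = points - o_center                   # p_new = p_ori - o_center
    extents = aligned.max(axis=0) - aligned.min(axis=0)
    s = 1.0 / extents.max()                       # s = 1 / max axis extent
    return aligned * s                            # scale every point

pts = np.array([[0.0, 0.0, 0.0], [4.0, 2.0, 1.0]])
norm = preprocess(pts)
```

After preprocessing, the largest axis extent of the point cloud is exactly 1, so the object fits in a unit cube.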

并且,在具体的实施方式中,可以将三维内容集中的每个三维内容渲染为多个视角下的二维图像。具体的,可以基于球坐标系计算多个虚拟相机位置;基于所述多个虚拟相机位置对三维内容集中的每个三维内容进行渲染,得到多个视角下的二维图像。Moreover, in a specific implementation, each three-dimensional content in the three-dimensional content set can be rendered into two-dimensional images from multiple viewing angles. Specifically, multiple virtual camera positions can be calculated based on the spherical coordinate system; each three-dimensional content in the three-dimensional content set is rendered based on the multiple virtual camera positions to obtain two-dimensional images from multiple perspectives.

An embodiment of the present invention computes the pose of each virtual camera, where a pose consists of a position and an orientation. To comprehensively capture the information of the 3D content from every viewing angle, a large number of virtual cameras are placed around the upper half of the 3D content; the camera positions differ, while each camera's orientation is fixed to point from the camera position toward the center of the object. The virtual camera position (p_camera-x, p_camera-y, p_camera-z) is set in the spherical coordinate system according to the formulas p_camera-x = r*sinθ*cosφ, p_camera-y = r*sinθ*sinφ, p_camera-z = r*cosθ, where r is the sphere radius, θ is the vertical polar angle, and φ is the horizontal azimuth angle. Different values can be chosen for the radius, polar angle, and azimuth angle: if the radius r takes n_r values, θ takes n_θ values, and φ takes n_φ values, then images from n = n_r*n_θ*n_φ viewing angles can ultimately be rendered. For example, fixing the radius r at 3, letting the polar angle θ take the two values 60° and 90°, and letting the azimuth angle φ take 12 values starting from 0° in 30° steps, yields 1*2*12 = 24 virtual camera positions and 24 rendered images in total. Finally, the multi-view 2D (two-dimensional) images are rendered: based on the computed virtual camera positions, a rendering engine such as Blender renders the 3D content into n 2D images. The same steps are applied to every 3D content in the 3D content data set, yielding n rendered 2D images per 3D content.
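The spherical-coordinate sampling above can be sketched as follows (function and variable names are illustrative, not from the patent):

```python
import math

def camera_positions(radii, polar_deg, azimuth_deg):
    """Enumerate virtual camera positions on a sphere around the object center,
    using p_x = r*sin(theta)*cos(phi), p_y = r*sin(theta)*sin(phi), p_z = r*cos(theta)."""
    positions = []
    for r in radii:
        for t in polar_deg:
            for p in azimuth_deg:
                theta, phi = math.radians(t), math.radians(p)
                positions.append((r * math.sin(theta) * math.cos(phi),
                                  r * math.sin(theta) * math.sin(phi),
                                  r * math.cos(theta)))
    return positions

# The example from the text: r = 3, theta in {60, 90} degrees,
# phi in {0, 30, ..., 330} degrees (12 values) -> 1 * 2 * 12 = 24 views
views = camera_positions([3], [60, 90], range(0, 360, 30))
print(len(views))  # 24
```

Each returned tuple would then be handed to the rendering engine as a camera location, with the camera oriented toward the object center.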

Step S12: Construct a question set, where the question set includes questions corresponding to multiple attributes.

In a specific implementation, questions can be set separately for each of multiple attributes to obtain the question set, where the multiple attributes include, but are not limited to, at least two of a concept attribute, a geometry attribute, a color attribute, and a material attribute. One or more questions can be set for each attribute. It should be understood that an attribute is a description of a characteristic of the three-dimensional content.

It should be pointed out that 3D content has multiple attributes such as concept, geometry, color, and material, which together characterize the final form of the 3D content at a fine-grained level. For example, the concept attribute indicates what category of object it is, the geometry attribute indicates what geometric shape it has, the color attribute indicates what color style it has, and the material attribute indicates what material it has. For each attribute, one or more questions can be set to obtain as detailed a description of that attribute as possible. For example, for the concept attribute one can ask "What object is in the picture?"; for the geometry attribute, "What is the geometric structure of the object in the picture?"; for the color attribute, "What is the color style of the object in the picture?"; and so on. The number of attributes is not fixed and can be chosen as the situation requires.
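A minimal sketch of such a question set, with illustrative attribute names and question wordings (the patent does not fix the exact set):

```python
# Attribute name -> one or more questions; the number of attributes is not fixed.
question_set = {
    "concept":  ["What object is in the picture?"],
    "geometry": ["What is the geometric structure of the object in the picture?"],
    "color":    ["What is the color style of the object in the picture?"],
    "material": ["What material is the object in the picture made of?"],
}

m = len(question_set)  # m attributes, each queried for every rendered image
print(m)  # 4
```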

Step S13: For each two-dimensional image, query the image-text question-answering pre-trained model based on the question set to obtain an answer corresponding to each question, and determine a text description of each attribute based on the answers corresponding to that attribute.

The image-text question-answering pre-trained model can be a large-scale pre-trained model.

In one implementation, when each attribute in the question set corresponds to one question, the question for each attribute is used to query the image-text question-answering pre-trained model multiple times to obtain multiple answers, and among these answers the one most similar to the two-dimensional image is determined to be the text description of that attribute. Specifically, an image-text contrastive pre-trained model can be used to compute the similarity between each answer and the two-dimensional image, and the answer with the highest similarity is determined to be the text description of the attribute. The image-text contrastive pre-trained model can be a large-scale pre-trained model.

In another implementation, when each attribute in the question set corresponds to multiple questions, the multiple questions for each attribute are used to query the image-text question-answering pre-trained model to obtain multiple answers, and among these answers the one most similar to the two-dimensional image is determined to be the text description of that attribute. Specifically, the image-text contrastive pre-trained model computes the similarity between each answer and the two-dimensional image, and the answer with the highest similarity is determined to be the text description of the attribute.
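The best-answer selection in both implementations can be sketched as below, where `similarity` stands in for an image-text contrastive model such as CLIP; the toy word-overlap scorer is only a placeholder to make the sketch runnable:

```python
def pick_best_answer(image, answers, similarity):
    """Return the candidate answer with the highest image-text similarity."""
    return max(answers, key=lambda a: similarity(image, a))

# Toy stand-in scorer: counts words shared with a caption of the image.
def toy_similarity(image_caption, answer):
    return len(set(image_caption.lower().split()) & set(answer.lower().split()))

best = pick_best_answer("a red wooden chair",
                        ["a blue sofa", "a red chair", "a table"],
                        toy_similarity)
print(best)  # a red chair
```

In the actual pipeline, `similarity` would embed the rendered image and each candidate answer with the contrastive model and return their cosine similarity.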

Step S14: Determine description information for each attribute of each three-dimensional content based on the text descriptions, and obtain a three-dimensional content description of each three-dimensional content to generate a three-dimensional content description data set.

That is, the three-dimensional content description of a three-dimensional content contains the description information corresponding to each of the multiple attributes.

In one implementation, each three-dimensional content in the three-dimensional content set is rendered as a single two-dimensional image. Correspondingly, determining the description information of each attribute of each three-dimensional content based on the text descriptions specifically means: determining the text description corresponding to each attribute as the description information of that attribute of the three-dimensional content.

In another implementation, each three-dimensional content in the three-dimensional content set is rendered into multiple two-dimensional images; then, for each attribute, the description information of that attribute of the three-dimensional content is determined based on the text descriptions of that attribute corresponding to the multiple two-dimensional images. In a specific implementation, the multiple two-dimensional images are two-dimensional images from multiple perspectives. For each attribute, an evaluation network model is used to determine scores for the text descriptions from the multiple perspectives, and the text descriptions whose scores are greater than a preset threshold are fused to obtain the description information of that attribute, where the text descriptions from the multiple perspectives all correspond to the same three-dimensional content.

Moreover, for any attribute, if none of the scores output by the evaluation network model for the text descriptions from the multiple perspectives exceeds the preset threshold, the image-text question-answering pre-trained model is queried again based on the question corresponding to that attribute to obtain new text descriptions of the attribute, and the step of using the evaluation network model to score the text descriptions from the multiple perspectives and fusing those whose scores exceed the preset threshold is performed again to obtain the description information of the attribute.

For each attribute, the text descriptions from the multiple perspectives can be input into the evaluation network model, local features and global features are extracted, a joint feature is constructed based on the local features and the global features, and the scores of the text descriptions from the multiple perspectives are output based on the joint feature.

Moreover, in a specific implementation, for each attribute, the text descriptions from the multiple perspectives can be input into the evaluation network model and passed through a bidirectional encoder representation structure followed by one multi-layer perceptron layer to obtain the local features, and then through another multi-layer perceptron layer and a pooling layer to obtain the global features. A joint feature is constructed from the local features and the global features, and based on the joint feature, a preset number of multi-layer perceptron layers output the scores of the text descriptions from the multiple perspectives. The preset number can be four.

Further, the training process of the evaluation network model includes: constructing a training data set, where the training data set includes training samples and label information corresponding to the training samples, and the training samples are text descriptions from multiple perspectives corresponding to different attributes; inputting the training samples into an initial model to obtain scores for the text descriptions from the multiple perspectives; computing a training loss based on the scores and the label information; updating the parameters of the initial model based on the training loss to obtain a parameter-updated model; and iterating the training of the parameter-updated model until a training stop condition is met, at which point the current parameter-updated model is determined to be the evaluation network model. That is, each training sample consists of text descriptions from multiple perspectives corresponding to a certain attribute, and a text description from a semantically clear and complementary perspective has a label value of 1, otherwise 0.

Referring to Figure 2, Figure 2 is a schematic diagram of a specific data set generation process disclosed in an embodiment of the present invention.

First, each three-dimensional content in the three-dimensional content set is rendered from multiple perspectives to obtain two-dimensional images from multiple viewing angles; assume each three-dimensional content is rendered into two-dimensional images from n perspectives. In addition, a question set is constructed; assume there are m attributes, where each attribute can have one or more questions.

Second, a large-scale pre-trained model is used for visual question answering. For example, a large-scale image-text pre-trained model such as GPT-4 (GPT: Generative Pre-trained Transformer) can be used, which has strong cognitive-reasoning and image question-answering capabilities. Therefore, for each rendered image of a 3D content, the constructed question set can be used to query the large-scale pre-trained model, and the answer obtained for each question serves as the text description of the attribute corresponding to that question. To improve the accuracy of each attribute description, the present invention adopts a verification mechanism: for each attribute, the large-scale pre-trained model is queried w times to obtain w answers; then a large-scale image-text pre-trained model focused on similarity comparison, such as CLIP (Contrastive Language-Image Pre-training), determines which of the w answers is most similar to the image, and that answer serves as the text description of the image for that attribute. The same operation is performed for all m attributes, yielding m fine-grained text descriptions for each rendered image. It should be pointed out that large-scale image-text pre-trained models have strong cognitive capabilities and can produce reasonable results from an input image and a question in text form; these capabilities can therefore be leveraged to generate fine-grained attribute descriptions of 3D content.

Further, to obtain a fine-grained description of the 3D content, the fine-grained descriptions of the n rendered images need to be fused. However, the quality of the fine-grained descriptions from these n perspectives varies, and some descriptions are poor. Therefore, the present invention designs a multi-view fine-grained description fusion method with a feedback mechanism, based on a multi-view fine-grained description evaluation network, which specifically includes the following steps:

First, for a certain granularity, i.e., a certain attribute, the multi-view fine-grained description evaluation network is used to obtain scores for the fine-grained descriptions from the n perspectives. To evaluate the quality of the fine-grained descriptions from the n perspectives and obtain deduplicated high-quality fine-grained descriptions, a multi-view fine-grained description evaluation network model is designed; see Figure 3, which is a structure diagram of an evaluation network model disclosed in an embodiment of the present invention. Each time, the evaluation network takes as input the fine-grained descriptions from the n perspectives at a certain granularity, extracts local features and global features, and then jointly predicts the scores of the fine-grained descriptions from the n perspectives. Through the extraction of local and global features, the network model has strong discriminative ability, can judge which descriptions are of poor quality, and achieves the goal of removing low-quality descriptions. The specific steps are as follows:

a. Construct a data set: for a large number of unscored text descriptions from n perspectives at different granularities, manual screening is performed in combination with the corresponding 3D content; one or more descriptions that are semantically clear and complementary are selected and assigned a score of 1, all other descriptions are assigned 0, and all fine-grained descriptions with a score of 1 together serve as the accurate and complete description of the 3D content at that granularity;

b. Design the network structure: first, a large pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers) is used to extract high-dimensional features of the text descriptions; each perspective yields a 768-dimensional feature, so the n perspectives form an n×768-dimensional feature. Second, two MLP (Multi-Layer Perceptron) layers raise the feature dimension to n×2048, and pooling yields a 1×2048-dimensional vector; this 2048-dimensional vector can be regarded as a high-dimensional abstraction of the text-description features of the n perspectives and is therefore the global feature. Finally, the global feature is combined with the n×1024-dimensional local features from the preceding step to form an n×3072-dimensional joint feature, which passes through 4 MLP layers to finally predict the scores of the fine-grained descriptions from the n perspectives. This network structure fuses the local and global features of the n perspectives; it can not only screen out low-quality descriptions but also predict which descriptions complement one another and together constitute the final description of the 3D content at that granularity;
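The feature dimensions in step b can be traced with a small bookkeeping sketch (it assumes the n×1024 local features are taken after the first of the two MLP layers, which the text implies but does not state explicitly):

```python
def evaluation_network_shapes(n):
    """Trace tensor shapes through the multi-view description evaluation network."""
    bert_out = (n, 768)        # one 768-d BERT feature per view
    local = (n, 1024)          # after the first MLP layer (local features)
    lifted = (n, 2048)         # after the second MLP layer
    global_feat = (1, 2048)    # pooled over the n views (global feature)
    joint = (n, local[1] + global_feat[1])  # global concatenated to each local row
    scores = (n, 1)            # after 4 MLP layers: one score per view in [0, 1]
    return {"bert": bert_out, "local": local, "lifted": lifted,
            "global": global_feat, "joint": joint, "scores": scores}

print(evaluation_network_shapes(24)["joint"])  # (24, 3072)
```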

c. Training and inference: training uses the mean square error (MSE) loss, with the following formula:

loss_MSE = (1/n) * Σ_{i=1}^{n} (f(x_i) - y_i)^2;

where f(x_i) is the score predicted by the network for the i-th perspective, and y_i is the ground truth (GT) for the i-th perspective. During inference, the text descriptions of the n perspectives are input, and scores in the range [0, 1] for the fine-grained descriptions from the n perspectives are obtained.
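The training loss in step c, written out in plain Python:

```python
def mse_loss(predicted, labels):
    """Mean square error between predicted per-view scores f(x_i) and 0/1 labels y_i."""
    n = len(predicted)
    return sum((f - y) ** 2 for f, y in zip(predicted, labels)) / n

print(mse_loss([0.9, 0.2, 0.8], [1, 0, 1]))  # ~0.03
```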

Second, fine-grained description fusion with a feedback mechanism is performed. For a certain granularity i, if among the n fine-grained descriptions there exist descriptions with scores greater than a certain threshold δ, then qualified fine-grained descriptions exist, and all qualified fine-grained descriptions are fused to serve as the fine-grained description of the 3D content at granularity i. If no fine-grained description has a score greater than the threshold δ, then no qualified fine-grained description exists; in that case the fine-grained description generation based on the large-scale pre-trained model is executed again, followed by the multi-view fine-grained description evaluation, until a fine-grained description with a score greater than the threshold δ exists, completing the fine-grained description generation for granularity i. This feedback mechanism is applied to all m granularities until the text descriptions of the 3D content at all m granularities are completed.
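The fusion-with-feedback loop can be sketched as follows; how "fusion" combines the kept descriptions is an assumption (simple concatenation here), and `regenerate` stands in for re-querying the pre-trained model:

```python
def fuse_with_feedback(descriptions, score_fn, regenerate, threshold=0.5, max_rounds=10):
    """Keep per-view descriptions whose score exceeds the threshold and fuse them;
    if none qualifies, regenerate the n per-view descriptions and score again."""
    for _ in range(max_rounds):
        scores = score_fn(descriptions)
        kept = [d for d, s in zip(descriptions, scores) if s > threshold]
        if kept:
            return ", ".join(kept)  # fused description for this granularity
        descriptions = regenerate()
    return None  # no qualified description within the round budget

# Toy demo: the scorer marks descriptions mentioning "chair" as qualified.
demo = fuse_with_feedback(
    ["a wooden chair", "unclear blob"],
    score_fn=lambda ds: [0.9 if "chair" in d else 0.1 for d in ds],
    regenerate=lambda: ["a wooden chair"],
)
print(demo)  # a wooden chair
```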

The embodiment of the present invention performs the above steps for each 3D model in the 3D content data set, completing the construction of the fine-grained-description 3D content database.

That is, the embodiment of the present invention first performs multi-view rendering for any 3D content in the 3D content data set by setting different virtual camera positions, obtaining a large number of 2D images of the content from different viewing angles. Second, question sets targeting different attributes are created, where the number of attributes is not limited; attributes such as concept, geometry, color, and material can typically be used. Third, for each image rendered from each 3D content, a large-scale image-text pre-trained model is queried multiple times about the image content, and a verification mechanism selects the best answer as the text description of the attribute. Fourth, the text descriptions of the different attributes of the many rendered 2D images are fused and deduplicated to obtain the text descriptions of the 3D content for the different attributes. Finally, the preceding steps are performed for each 3D model in the 3D content data set, yielding a 3D content data set with fine-grained attribute text descriptions.

It can be seen that the embodiment of the present invention first renders each three-dimensional content in the three-dimensional content set into two-dimensional images and constructs a question set that includes questions corresponding to multiple attributes; then, for each two-dimensional image, the image-text question-answering pre-trained model is queried based on the question set to obtain an answer corresponding to each question, and a text description of each attribute is determined based on the answers corresponding to that attribute; and then the description information of each attribute of each three-dimensional content is determined based on the text descriptions, yielding a three-dimensional content description of each three-dimensional content so as to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of the multiple attributes. That is, the present invention constructs a question set comprising questions for multiple attributes, queries the image-text question-answering pre-trained model about the two-dimensional images corresponding to the three-dimensional content to obtain an answer to each question, determines the text description of each attribute based on the answers, then determines the description information of the multiple attributes corresponding to each three-dimensional content, and obtains the three-dimensional content description data set. Since the three-dimensional content description contains description information corresponding to each of the multiple attributes, the resulting three-dimensional content description data set is richer and more accurate, which improves the quality of the data set, thereby guaranteeing the performance of the three-dimensional content generation model and improving the accuracy of the generated three-dimensional content.

Referring to Figure 4, an embodiment of the present invention discloses a three-dimensional content generation model training method, which includes:

Step S21: Train a generative adversarial network based on the three-dimensional content description data set, where the three-dimensional content description data set is generated according to the data set generation method based on a multi-modal pre-trained model disclosed in the preceding embodiments, and the generative adversarial network includes a discriminator, a generator, and a conditional controller.

Step S22: When the discriminator cannot distinguish the content generated by the generator, determine the network structure composed of the generator and the conditional controller in the current generative adversarial network as the three-dimensional content generation model.

That is, when the discriminator cannot tell whether its input is content generated by the generator, the current generative adversarial network is determined to be the three-dimensional content generation model.

In a specific implementation, the generator and the discriminator can be trained alternately based on the three-dimensional content description data set: first the generator is frozen and the discriminator is trained, then the discriminator is frozen and the generator is trained, and this alternation is executed multiple times.
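The alternating schedule can be sketched abstractly (the two step callables stand in for real optimizer steps with the other network's parameters frozen):

```python
def train_alternately(train_discriminator_step, train_generator_step, rounds):
    """Alternating GAN schedule: with G frozen train D, then with D frozen train G."""
    for _ in range(rounds):
        train_discriminator_step()  # generator parameters frozen
        train_generator_step()      # discriminator parameters frozen

# Toy demo recording the order of the steps:
log = []
train_alternately(lambda: log.append("D"), lambda: log.append("G"), rounds=2)
print(log)  # ['D', 'G', 'D', 'G']
```

In practice the loop would stop once the discriminator's outputs stay near 0.5, i.e., it can no longer tell generated content from real content.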

The training process of the generator includes: obtaining a three-dimensional content description from the three-dimensional content description data set and using the conditional controller to generate multi-attribute encoding descriptors; transforming initial noise to obtain a noise encoding descriptor and constructing a joint descriptor based on the noise encoding descriptor and the multi-attribute encoding descriptors; inputting the joint descriptor into the generator to obtain three-dimensional content generated by the generator; inputting that three-dimensional content into the discriminator to obtain a predicted value corresponding to the three-dimensional content; computing a first training loss based on the predicted value and a first loss function; and updating the parameters of the generator based on the first training loss.

The training process of the discriminator includes: inputting ground-truth data and the three-dimensional content generated by the generator into the discriminator to obtain a first predicted value corresponding to the ground-truth data and a second predicted value corresponding to the generated three-dimensional content; computing a second training loss based on the first predicted value, the second predicted value, and a second loss function; and updating the parameters of the discriminator based on the second training loss.

For example, see Figure 5, which is a schematic diagram of a generative adversarial network disclosed in an embodiment of the present invention. Generative Adversarial Networks (GANs) control content generation through a generator and a discriminator and have achieved high-quality generation results on multiple tasks such as image generation. For the 3D content generation task with multi-channel fine-grained conditional control, the GAN network contains three parts: a conditional controller (Controller), a generator (Generator), and a discriminator (Discriminator). Based on the constructed 3D content data set containing fine-grained descriptions, the present invention proposes a GAN-based network structure for 3D content generation with multi-channel fine-grained conditional control, including:

1) A conditional controller module, used for multi-granularity encoding descriptor generation:

For the fine-grained conditional text descriptions of the m channels, first, a 256×d' initial descriptor is extracted based on the CLIP large-scale pre-trained model used for semantic alignment of text and images; second, a multi-layer perceptron (MLP) network is designed to convert the initial descriptor from 256×d' to a 256×d-dimensional 3D content control descriptor; finally, another MLP network is used to obtain a 1×256-dimensional encoding descriptor for each of the m channels.

2) A generator module, used for target point cloud generation:

The generator part starts from Gaussian noise z and generates the target content through the generator network. However, the original noise may not be suitable for the 3D content generation task; therefore, the present invention transforms the initial noise z with an MLP network to obtain a k-dimensional noise encoding descriptor. Together with the m channels' 1×256-dimensional encoding descriptors, this forms the w-dimensional input of the generator, i.e., a joint descriptor of dimension w = (m*256 + k). The generator network consists of multiple MLP layers and transforms the w-dimensional joint descriptor into a 6*g-dimensional descriptor; a reshape operation then yields a g×6-dimensional output, i.e., the target colored 3D point cloud.
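The w-dimensional joint descriptor can be sketched as a simple concatenation (list-based, for illustration only):

```python
def build_joint_descriptor(channel_codes, noise_code):
    """Concatenate m 1x256 channel encoding descriptors with a k-d noise descriptor,
    giving the generator input of dimension w = m*256 + k."""
    assert all(len(c) == 256 for c in channel_codes)
    return [v for code in channel_codes for v in code] + list(noise_code)

# e.g. m = 4 condition channels and a k = 128-d noise code -> w = 4*256 + 128 = 1152
w = len(build_joint_descriptor([[0.0] * 256 for _ in range(4)], [0.0] * 128))
print(w)  # 1152
```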

3) A discriminator module, used to judge the plausibility of the target point cloud:

The discriminator module predicts 0 for a generated point cloud and 1 for a ground-truth point cloud. Its network structure consists of multiple MLP layers that output a g×2048-dimensional descriptor; a pooling operation then yields a 1×2048-dimensional global descriptor, and one more MLP layer produces a 1-dimensional predicted value.

During network training, the GAN adopts an alternating training strategy: first the generator is frozen and the discriminator is trained; then the discriminator's parameters are frozen and the generator is trained; then the generator is frozen again and the discriminator is trained; and so on, alternating until the discriminator cannot tell whether the content produced by the generator is real or fake.

For the discriminator, the following loss function is adopted: loss_D = log(D(x_gt)) + log(1 - D(x_generate)), where D(x_gt) is the predicted value for the ground-truth 3D point cloud; when D(x_gt) = 1, the first term is 0. D(x_generate) is the predicted value for the generated point cloud; when D(x_generate) = 0, the second term is 0. Since the log function is negative for any other predicted value in [0, 1], maximizing this loss function trains the discriminator to correctly distinguish real point clouds from generated ones. An output of 0.5 indicates that the discriminator cannot tell them apart.

For the generator, the following loss function is adopted: loss_G = log(1 - D(x_generate)). That is, with respect to the already-trained discriminator, as the probability it assigns to the generated content approaches 1, the generator's output fools the discriminator and the loss function approaches negative infinity. Therefore, minimizing this loss function trains the generator to produce sufficiently realistic and plausible 3D content.
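Both loss functions, written out in plain Python (d_real and d_fake denote the discriminator's outputs D(x_gt) and D(x_generate)):

```python
import math

def loss_d(d_real, d_fake):
    """Discriminator objective log(D(x_gt)) + log(1 - D(x_generate)), to be maximized."""
    return math.log(d_real) + math.log(1 - d_fake)

def loss_g(d_fake):
    """Generator objective log(1 - D(x_generate)), to be minimized
    (it decreases toward -inf as D(x_generate) approaches 1)."""
    return math.log(1 - d_fake)

print(loss_d(1.0, 0.0))             # 0.0: a perfect discriminator
print(loss_g(0.999) < loss_g(0.5))  # True: fooling D more lowers the generator loss
```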

For the conditional controller part, in order to speed up convergence, a multi-channel fine-grained adaptation training strategy can be adopted: the single-channel encoding descriptor extraction modules are trained one after another, and overall fine-tuning is performed at the end. See Figure 6, which is a schematic diagram of conditional controller training disclosed in an embodiment of the present invention. In the specific training process, the generator and the conditional controller can be trained together, with the parameters of both updated based on the first training loss. Moreover, for the conditional controller, training proceeds with one channel activated and the other channels deactivated in turn.

During inference, only the conditional controller and the generator are used; the discriminator is not. First, noise z is sampled from a Gaussian distribution; second, the multi-channel fine-grained conditions are encoded separately into fine-grained encoded descriptors; third, a joint descriptor is constructed and fed to the generator, which produces the final colored 3D point cloud, completing the generation task.
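A minimal sketch of this inference flow, with hypothetical stand-ins for the patent's modules (`encode_condition` for the per-channel condition encoder and `generator` for the trained generator; both are random placeholders, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_condition(text: str, dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for a per-channel fine-grained condition encoder.
    rs = np.random.default_rng(abs(hash(text)) % (2**32))
    return rs.standard_normal(dim)

def generator(joint: np.ndarray, n_points: int = 16) -> np.ndarray:
    # Hypothetical generator: maps the joint descriptor to an (N, 6) array
    # of XYZ coordinates plus an RGB color per point.
    rs = np.random.default_rng(int(np.abs(joint).sum() * 1e6) % (2**32))
    return rs.standard_normal((n_points, 6))

# Step 1: sample noise z from a Gaussian distribution.
z = rng.standard_normal(8)
# Step 2: extract a fine-grained encoded descriptor per condition channel.
channels = [encode_condition(t) for t in ("a red chair", "wooden", "cube-like")]
# Step 3: build the joint descriptor and run the generator (no discriminator).
joint = np.concatenate(channels + [z])
cloud = generator(joint)
```

Note that the discriminator never appears in this path, mirroring the statement above that it is used only during training.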

Further, an embodiment of the present invention discloses a three-dimensional content generation method, including:

obtaining a target description and target noise and inputting them into a three-dimensional content generation model, the target description being description information of multiple attributes, where the three-dimensional content generation model is trained by the three-dimensional content generation model training method disclosed in the foregoing embodiments, and the target description may be a text description of multiple attributes determined from user input;

generating a multi-attribute encoded descriptor with the conditional controller, and transforming the target noise to obtain a noise encoded descriptor;

constructing a joint descriptor from the multi-attribute encoded descriptor and the noise encoded descriptor, and generating the target three-dimensional content with the generator based on the joint descriptor.

Referring to Figure 7, an embodiment of the present invention discloses a data set generation apparatus, including:

a two-dimensional image rendering module 11, configured to render each three-dimensional content in a three-dimensional content set into a two-dimensional image;

a question set construction module 12, configured to construct a question set, the question set including questions corresponding to multiple attributes;

an attribute description determination module 13, configured to, for each two-dimensional image, query an image-text question-answering pre-trained model with the question set to obtain an answer to each question, and determine a text description of each attribute based on the answer corresponding to that attribute;

a data set determination module 14, configured to determine, based on the text descriptions, the description information of each attribute of each three-dimensional content, obtaining a three-dimensional content description of each three-dimensional content so as to generate a three-dimensional content description data set, where the three-dimensional content description contains the description information of the multiple attributes.

It can be seen that the embodiment of the present invention first renders each three-dimensional content in the three-dimensional content set into a two-dimensional image and constructs a question set containing questions corresponding to multiple attributes; then, for each two-dimensional image, it queries the image-text question-answering pre-trained model with the question set to obtain an answer to each question and determines a text description of each attribute from the corresponding answer; finally, it determines the description information of each attribute of each three-dimensional content from those text descriptions, obtaining a three-dimensional content description for each three-dimensional content and thereby generating a three-dimensional content description data set, where each three-dimensional content description contains the description information of the multiple attributes. In other words, the present invention constructs a question set covering multiple attributes, queries the image-text question-answering pre-trained model about the two-dimensional images corresponding to the three-dimensional content, determines the text description of each attribute from the answers, and then determines the description information of the multiple attributes of each three-dimensional content to obtain the data set. Because each three-dimensional content description contains description information for every one of the multiple attributes, the resulting data set is richer and more accurate, which improves data set quality, safeguards the performance of the three-dimensional content generation model, and in turn improves the accuracy of the generated three-dimensional content.

The two-dimensional image rendering module 11 specifically includes:

the two-dimensional image rendering module 11 being specifically configured to render each three-dimensional content in the three-dimensional content set into two-dimensional images from multiple viewpoints.

In one implementation, the two-dimensional image rendering module 11 specifically includes:

a virtual camera position calculation submodule, configured to calculate multiple virtual camera positions based on a spherical coordinate system;

a two-dimensional image rendering submodule, configured to render each three-dimensional content in the three-dimensional content set based on the multiple virtual camera positions, obtaining two-dimensional images from multiple viewpoints.
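One way the spherical-coordinate camera placement could look — a sketch only; the radius, view count, and elevation are illustrative choices, not values from the disclosure:

```python
import math

def camera_positions(n_views: int = 8, radius: float = 2.0,
                     elevation_deg: float = 30.0) -> list[tuple[float, float, float]]:
    """Place virtual cameras on a sphere around the origin: a fixed elevation
    angle, with azimuth angles evenly spaced over 360 degrees."""
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(n_views):
        azim = 2.0 * math.pi * i / n_views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions

cams = camera_positions()  # 8 viewpoints on a sphere of radius 2
```

Every position lies at the same distance from the object's centered origin, so each render sees the (rescaled) content at the same scale.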

In a specific implementation, the two-dimensional image rendering module 11 specifically includes:

a coordinate system conversion submodule, configured to convert each three-dimensional content in the three-dimensional content set to a world coordinate system;

a scale scaling submodule, configured to multiply each point of the three-dimensional content in the world coordinate system by a scaling factor to complete the scale scaling;

a two-dimensional image rendering submodule, configured to render the scaled three-dimensional content into a two-dimensional image.

Further, the two-dimensional image rendering module 11 also includes:

a scaling factor calculation module, configured to, before each point of the three-dimensional content in the world coordinate system is multiplied by the scaling factor, compute the difference between the maximum and minimum values of the three-dimensional content along each coordinate axis in the world coordinate system, and take the reciprocal of the largest of these per-axis differences to obtain the scaling factor.
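The scaling-factor rule above (reciprocal of the largest per-axis extent) can be sketched directly; the point array and its values here are made up for illustration:

```python
import numpy as np

def scaling_factor(points: np.ndarray) -> float:
    # points: (N, 3) array of 3D content in world coordinates.
    # Per-axis extent = max - min; the factor is the reciprocal of the largest.
    extents = points.max(axis=0) - points.min(axis=0)
    return 1.0 / extents.max()

def rescale(points: np.ndarray) -> np.ndarray:
    # Multiply every point by the scaling factor to complete scale scaling.
    return points * scaling_factor(points)

pts = np.array([[0.0, 0.0, 0.0], [4.0, 1.0, 2.0]])
scaled = rescale(pts)  # the largest extent (4.0 along x) maps to length 1
```

After rescaling, the longest axis of any content has unit length, which keeps differently sized models comparable across renders.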

The data set determination module 14 specifically includes:

a description fusion submodule, configured to, for each attribute, use an evaluation network model to determine scores for the text descriptions from multiple viewpoints, and fuse the text descriptions whose scores exceed a preset threshold to obtain the description information of that attribute, where the text descriptions from the multiple viewpoints all correspond to the same three-dimensional content.

The description fusion submodule is specifically configured to: for each attribute, input the text descriptions from the multiple viewpoints into the evaluation network model, extract local features and global features, construct joint features from the local features and the global features, and output the scores of the multi-viewpoint text descriptions based on the joint features. Specifically, for each attribute, the text descriptions from the multiple viewpoints are input into the evaluation network model, passing first through a bidirectional encoder representation structure and one multilayer perceptron layer to obtain local features, then through another multilayer perceptron layer and a pooling layer to obtain global features. Based on the joint features, a preset number of multilayer perceptron layers output the scores of the multi-viewpoint text descriptions.
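A toy sketch of the scorer's data flow, using NumPy stand-ins: a precomputed `token_embeddings` tensor replaces the bidirectional-encoder representation structure, each "MLP" is a single random linear map with ReLU, and the pooling is a max over tokens. Only the local/global/joint feature flow mirrors the description; every dimension and weight is invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x: np.ndarray, out_dim: int, seed: int) -> np.ndarray:
    # Hypothetical one-layer MLP: fixed random linear map + ReLU.
    w = np.random.default_rng(seed).standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

def score_descriptions(token_embeddings: np.ndarray) -> np.ndarray:
    """token_embeddings: (n_views, seq_len, d) encoder output for the
    multi-viewpoint text descriptions of one attribute; returns one score per view."""
    local = mlp(token_embeddings, 16, seed=1)        # per-token local features
    pooled = mlp(local, 16, seed=2).max(axis=1)      # MLP + pooling -> global features
    glob = np.broadcast_to(pooled[:, None, :], local.shape)
    joint = np.concatenate([local, glob], axis=-1)   # joint features
    return mlp(joint, 1, seed=3).mean(axis=(1, 2))   # MLP head -> one score per view

emb = rng.standard_normal((4, 10, 32))  # 4 viewpoints, 10 tokens, dim 32
s = score_descriptions(emb)
```

The key structural point is that the global feature is broadcast back to every token position before concatenation, so the joint feature carries both levels of context.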

The training process of the evaluation network model includes:

constructing a training data set, the training data set including training samples and label information corresponding to the training samples, where the training samples are text descriptions from multiple viewpoints corresponding to different attributes;

inputting the training samples into an initial model to obtain scores for the multi-viewpoint text descriptions;

computing a training loss based on the scores and the label information;

updating the parameters of the initial model based on the training loss to obtain a parameter-updated model;

performing training iterations on the parameter-updated model until a training stop condition is met, and determining the current parameter-updated model as the evaluation network model.
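The five steps above, reduced to a schematic loop. The "model" here is a hypothetical linear scorer, the labels are random, and the loss is plain squared error; only the control flow (build data, score, compute loss, update, iterate until a stop condition) follows the description:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a toy training set of (sample, label) pairs; each "sample" is a
# feature vector standing in for multi-viewpoint text descriptions.
data = [(rng.standard_normal(4), float(rng.uniform())) for _ in range(32)]

w = np.zeros(4)                      # initial model: score = w . x
lr, max_iters = 0.01, 200

for step in range(max_iters):        # iterate until the stop condition is met
    total = 0.0
    for x, label in data:
        score = w @ x                # step 2: predict scores
        total += (score - label) ** 2          # step 3: loss against labels
        w -= lr * 2 * (score - label) * x      # step 4: update parameters
    if total / len(data) < 0.05:     # step 5: stop condition
        break

evaluation_model = w                 # the final parameter-updated model
```

In practice the stop condition might instead be a fixed iteration budget or a validation-loss plateau; the patent text leaves it unspecified.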

Further, the apparatus is also configured to:

for any attribute, if none of the scores output by the evaluation network model for the multi-viewpoint text descriptions exceeds the preset threshold, re-query the image-text question-answering pre-trained model with the question corresponding to that attribute to obtain a new text description of the attribute, and again perform the step of using the evaluation network model to determine the scores of the multi-viewpoint text descriptions and fusing the text descriptions whose scores exceed the preset threshold to obtain the description information of the attribute.

The question set construction module 12 is specifically configured to set questions for each of multiple attributes to obtain the question set, where the multiple attributes include at least two of a concept attribute, a geometry attribute, a color attribute, and a material attribute.

In one implementation, the attribute description determination module 13 is configured to, when each attribute in the question set corresponds to one question, query the image-text question-answering pre-trained model multiple times with that question to obtain multiple answers, and determine the answer most similar to the two-dimensional image as the text description of the attribute. Specifically, an image-text contrastive pre-trained model is used to compute the similarity between the multiple answers and the two-dimensional image, and the answer with the highest similarity is determined as the text description of the attribute.
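A sketch of this selection step, with cosine similarity over hypothetical precomputed embeddings standing in for the image-text contrastive pre-trained model (the embedding vectors and candidate answers are invented):

```python
import numpy as np

def pick_best_answer(image_emb: np.ndarray, answer_embs: dict) -> str:
    # Choose the candidate answer whose embedding is most similar to the image.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(answer_embs, key=lambda ans: cos(image_emb, answer_embs[ans]))

image = np.array([1.0, 0.0, 0.0])          # hypothetical image embedding
candidates = {
    "a red chair": np.array([0.9, 0.1, 0.0]),   # hypothetical text embeddings
    "a blue table": np.array([0.1, 0.9, 0.2]),
}
best = pick_best_answer(image, candidates)
```

Whether the multiple answers come from repeated queries with one question or from several questions per attribute (the two implementations described here), the selection rule is the same: keep the answer closest to the rendered image.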

In another implementation, the attribute description determination module 13 is configured to, when each attribute in the question set corresponds to multiple questions, query the image-text question-answering pre-trained model with the multiple questions of each attribute to obtain multiple answers, and determine the answer most similar to the two-dimensional image as the text description of the attribute.

Referring to Figure 8, an embodiment of the present invention discloses an electronic device 20, including a processor 21 and a memory 22, where the memory 22 is configured to store a computer program, and the processor 21 is configured to execute the computer program to implement the data set generation method based on a multi-modal pre-trained model disclosed in the foregoing embodiments, and/or the three-dimensional content generation model training method, and/or the three-dimensional content generation method.

For the specific processes of the above data set generation method based on the multi-modal pre-trained model, and/or the three-dimensional content generation model training method, and/or the three-dimensional content generation method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.

Moreover, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like, and the storage may be transient or persistent.

In addition, the electronic device 20 further includes a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The power supply 23 provides operating voltage for the hardware devices on the electronic device 20; the communication interface 24 can create data transmission channels between the electronic device 20 and external devices, following any communication protocol applicable to the technical solution of the present invention, which is not specifically limited here; the input/output interface 25 is used to obtain input data from, or output data to, the outside world, and its specific interface type may be selected according to application needs and is likewise not specifically limited here.

Further, an embodiment of the present invention also discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the data set generation method based on a multi-modal pre-trained model disclosed in the foregoing embodiments, and/or the three-dimensional content generation model training method, and/or the three-dimensional content generation method.

For the specific processes of the above data set generation method based on the multi-modal pre-trained model, and/or the three-dimensional content generation model training method, and/or the three-dimensional content generation method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. As the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The present invention has been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (23)

CN202311177091.1A — priority date 2023-09-13, filing date 2023-09-13 — Data set generation method and training method based on multi-modal pre-training model — Active, granted as CN116932803B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311177091.1A (granted as CN116932803B (en)) | 2023-09-13 | 2023-09-13 | Data set generation method and training method based on multi-modal pre-training model


Publications (2)

Publication Number | Publication Date
CN116932803A | 2023-10-24
CN116932803B | 2024-01-26

Family

ID=88377353

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311177091.1A — Active — granted as CN116932803B (en): Data set generation method and training method based on multi-modal pre-training model

Country Status (1)

Country | Link
CN | CN116932803B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110069656A (en)*2019-03-282019-07-30天津大学A method of threedimensional model is retrieved based on the two-dimension picture for generating confrontation network
WO2021217935A1 (en)*2020-04-292021-11-04深圳壹账通智能科技有限公司Method for training question generation model, question generation method, and related device
CN113806587A (en)*2021-08-242021-12-17西安理工大学 A video description text generation method based on multimodal feature fusion
CN114996502A (en)*2022-06-232022-09-02天津理工大学 A multi-task learning model combining image-text matching and visual reasoning, visual common sense reasoning method and computer equipment
CN116682140A (en)*2023-05-292023-09-01北京新清泰克科技有限公司Three-dimensional human body posture estimation algorithm based on attention mechanism multi-mode fusion
CN116721221A (en)*2023-08-082023-09-08浪潮电子信息产业股份有限公司Multi-mode-based three-dimensional content generation method, device, equipment and storage medium


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117152370A (en)*2023-10-302023-12-01碳丝路文化传播(成都)有限公司AIGC-based 3D terrain model generation method, system, equipment and storage medium
CN117152370B (en)*2023-10-302024-02-02碳丝路文化传播(成都)有限公司AIGC-based 3D terrain model generation method, system, equipment and storage medium
CN117235534A (en)*2023-11-132023-12-15支付宝(杭州)信息技术有限公司 Methods and devices for training content understanding models and content generation models
CN117235534B (en)*2023-11-132024-02-20支付宝(杭州)信息技术有限公司Method and device for training content understanding model and content generating model
CN117475089A (en)*2023-12-272024-01-30浪潮电子信息产业股份有限公司Three-dimensional scene generation method based on pre-training language model and related components
CN117475089B (en)*2023-12-272024-03-29浪潮电子信息产业股份有限公司 Three-dimensional scene generation method and related components based on pre-trained language model
CN117473105A (en)*2023-12-282024-01-30浪潮电子信息产业股份有限公司Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117473105B (en)*2023-12-282024-04-05浪潮电子信息产业股份有限公司 Three-dimensional content generation method and related components based on multimodal pre-training model
CN118096979A (en)*2024-01-242024-05-28上海人工智能创新中心Three-dimensional cartoon image texture generation method and device based on text prompt words
CN118762132A (en)*2024-09-052024-10-11中国科学院自动化研究所 Three-dimensional data generation method and system based on perspective consistency enhancement
CN118762132B (en)*2024-09-052024-12-10中国科学院自动化研究所 Three-dimensional data generation method and system based on perspective consistency enhancement

Also Published As

Publication number | Publication date
CN116932803B (en) | 2024-01-26

Similar Documents

Publication | Publication Date | Title
CN116932803B (en) Data set generation method and training method based on multi-modal pre-training model
CN115223020B (en)Image processing method, apparatus, device, storage medium, and computer program product
CN112580694B (en)Small sample image target recognition method and system based on joint attention mechanism
CN112819080B (en)High-precision universal three-dimensional point cloud identification method
CN113591566A (en)Training method and device of image recognition model, electronic equipment and storage medium
CN116910572B (en)Training method and device for three-dimensional content generation model based on pre-training language model
CN114638866A (en)Point cloud registration method and system based on local feature learning
CN115131849A (en) Image generation method and related equipment
CN115424013B (en) Model training methods, image processing methods and equipment, and media
CN115082885A (en) Point cloud target detection method, device, equipment and storage medium
Huang et al.Applications of large scale foundation models for autonomous driving
CN115375877A (en)Three-dimensional point cloud classification method and device based on channel attention mechanism
CN117011569A (en)Image processing method and related device
CN117710255A (en)Point cloud completion method based on teacher-student network and course learning
WO2025055514A1 (en)Three-dimensional model generation method and apparatus, computer device, and storage medium
CN118133114A (en)Track prediction method, medium and system based on graph neural network
CN119296104B (en) Multimodal 3D instance segmentation method based on 3D Gaussian splatting
Shi et al.DCPoint: Global-Local Dual Contrast for Self-Supervised Representation Learning of 3-D Point Clouds
CN117634459B (en)Target content generation and model training method, device, system, equipment and medium
CN115280364A (en)Generating network 3D hand gesture synthesis based on multi-modal guidance
CN114490922A (en)Natural language understanding model training method and device
CN116645438A (en)Text image generation method, electronic equipment and medium
Nolte et al.Is Single-View Mesh Reconstruction Ready for Robotics?
CN116630758A (en)Space-time action detection method based on double-branch multi-stage feature fusion
CN115713789A (en)Face recognition method and device, electronic equipment and storage medium

Legal Events

Date | Code | Title | Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
