Disclosure of Invention
In view of the defects of the prior art, the present application aims to provide a layer-focused illustration single-view 3D reconstruction method which is suitable for generating a 3D model from an illustration single view and can improve the quality of the 3D reconstruction.
The technical solution provided by the invention is as follows:
In a first aspect, the present application provides a layer-focused illustration single-view 3D reconstruction method, including the following steps:
Acquiring an illustration single-view data set T, wherein the illustration single-view data set T comprises a plurality of original illustration single views;
Performing a first segmentation on each original illustration single view in the data set T to obtain the characteristic region layers of the original illustration single view, performing a second segmentation on each characteristic region layer to obtain the effect region layers within each characteristic region layer, and generating region-text pairs containing the original illustration single view and its layers, the weight used when the corresponding layer is input into the image encoder, the total number of layers corresponding to each characteristic region layer of the original illustration single view, and the two-dimensional position information of each effect region layer obtained by the second segmentation within the characteristic region layer obtained by the first segmentation, wherein the layers of the original illustration single view comprise the characteristic region layers and the effect region layers, each region-text pair comprises one layer obtained by segmenting the original illustration single view and a corresponding text description, and the text content is a brief description of the layer;
Constructing and training a layer-CLIP model, wherein the layer-CLIP model comprises an image encoder and a text encoder, the image encoder comprises an RGB image convolution layer and a layer convolution layer which respectively receive the original illustration single view and its layers as input, the text encoder receives the text description of the corresponding layer, an auxiliary channel is provided in the text encoder for inputting the two-dimensional position information of the effect region layer obtained by the second segmentation within the characteristic region layer obtained by the first segmentation, and the trained layer-CLIP model outputs tokens c1 containing the semantic information and layer information corresponding to the original illustration single view according to the input original illustration single view, its layers and the corresponding text descriptions;
Obtaining, for the original illustration single view, the tokens c1 containing the corresponding semantic information and layer information through the layer-CLIP model;
Inputting the original illustration single view into a single-view-based 3D diffusion model and, in combination with the tokens c1, diffusing the corresponding point cloud with a Latent Diffusion Model (LDM) a number of times determined by the number of layers corresponding to each characteristic region, to obtain the final 3D representation corresponding to the original illustration single view.
In one possible implementation, the single-view-based 3D diffusion model includes an unnatural line extraction module, an initial point cloud generation module, a point cloud segmentation module, a point cloud-layer diffusion module, and a latent-triplane module (Latent-Triplane module);
The step of inputting the original illustration single view into the single-view-based 3D diffusion model and, in combination with the tokens c1, diffusing the corresponding point cloud a corresponding number of times with the latent diffusion model according to the number of layers corresponding to each characteristic region to obtain the final 3D representation corresponding to the original illustration single view comprises the following steps:
The unnatural line extraction module extracts a first illustration single view from the original illustration single view, wherein the first illustration single view is the original illustration single view with the unnatural contour lines removed;
The initial point cloud generation module selects a corresponding basic prime model according to the character body shape in the first illustration single view to generate an initial point cloud P;
The point cloud segmentation module segments the initial point cloud P into different point cloud parts and performs basic coloring of each point cloud part according to the color of the region with the minimum weight value in the tokens c1 corresponding to that part, to obtain a basically colored point cloud, wherein the point cloud parts segmented from the initial point cloud P correspond one-to-one to the characteristic region layers obtained by the first segmentation; for example, if four parts, namely head, body, hands and feet, are obtained by the first segmentation, the subsequent point cloud segmentation is also performed according to these four parts.
The point cloud-layer diffusion module maps the basically colored point cloud into a latent representation z (Latent z), applies the latent diffusion model according to the latent representation z, the tokens c1 and the total number of layers corresponding to each characteristic region layer, diffuses a corresponding number of times over time t, and outputs the final latent representation zm;
The latent-triplane module obtains the final 3D (Triplane) representation, i.e., a triplane representation, based on the final latent representation zm.
In one possible implementation, extracting the first illustration single view from the original illustration single view includes:
S101, extracting the unnatural contour lines x of the original illustration single view through a difference-of-Gaussians sketch, wherein during extraction an existing facial landmark detector is used to create a convex hull surrounding the key structures, preventing the lines of the key structures from being extracted as unnatural contour lines x and avoiding filling in the lines of the key structures when the subsequent linear interpolation regenerates the pixel colors near the lines in the drawing;
S102, extracting preliminary features of the original illustration single view using a shallow convolutional neural network;
S103, on the basis of the preliminary features, performing linear interpolation (Lerp) several times on the regions where the unnatural contour lines x are located, and regenerating the pixel colors of the unnatural contour lines x in the original illustration single view, thereby obtaining the first illustration single view.
In one possible implementation, selecting a corresponding basic prime model according to the character body shape in the first illustration single view to generate the initial point cloud P includes:
S201, selecting a basic prime model according to the character body shape in the first illustration single view as the initial 3D model;
S202, constructing a triangular mesh M based on the initial 3D model, predicting the signed distance function (SDF) value and texture color of each vertex of the triangular mesh M using multi-layer perceptrons (MLPs), and converting the vertex SDF values and colors of M into a point cloud, denoted ptM(pM,cM), wherein pM∈R3 is the position of the point cloud, the point coordinates being equal to the vertex coordinates of M, and cM∈R3 is the color of the point cloud, which is the same as the color of the vertices of M;
S203, performing noise point cloud growth and color perturbation on the point cloud around ptM, including:
first computing a bounding box (BBox) on the surface of ptM, and then, according to the mapped region of the unnatural contour lines x on the point cloud, uniformly growing a noise point cloud ptr(pr,cr) inside the BBox covered by the mapped region, wherein pr and cr represent the position and color of the noise point cloud, respectively;
S204, screening the noise point cloud by constructing a K-dimensional tree from the positions pM and keeping the points whose distance between pr and the nearest point found in the K-dimensional tree is within a set distance threshold;
S205, merging the positions and colors of ptM and ptr to obtain the final initial point cloud P.
The initialization quality can be improved through noise point cloud growth, color perturbation and point cloud screening, and in the point cloud screening process, fast search can be achieved by constructing a K-dimensional tree.
In one possible implementation, applying the latent diffusion model according to the latent representation z, the tokens c1 and the total number of layers corresponding to each characteristic region layer, diffusing each time over time t, and outputting the final latent representation zm includes:
Using a cross-attention layer at each noise diffusion step to promote interaction between the corresponding layer information in the tokens c1 and the final latent representation zm, wherein the noise diffusion proceeds in order of the weight values in the tokens c1 from small to large, the diffusion time t is appropriately increased when the weight has only one decimal place, and m noise diffusion steps are performed on the part of the latent representation z whose corresponding characteristic region layer has a total of m layers.
In one possible implementation, obtaining the final 3D (Triplane) representation based on the final latent representation zm includes:
After the final latent representation zm is obtained, it is reshaped into a triplane representation zreshape and the three planes are concatenated vertically along the height dimension to obtain zconcat; zconcat is then upsampled into a high-resolution triplane feature map, wherein the upsampling process gradually upsamples the explicit final latent representation using a convolutional decoder and obtains the final 3D representation.
In a second aspect, the present application provides an electronic device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to invoke the computer program to perform the method as described above.
In a third aspect, the present application provides a computer readable storage medium having stored therein a computer program which, when run on an electronic device, causes the electronic device to implement a method as described above.
In a fourth aspect, the application provides a computer program product comprising a computer program which, when run on an electronic device, causes the electronic device to implement a method as described above.
Specific implementation manners of the second to fourth aspects of the present application may refer to implementation manners of the first aspect, and are not described herein.
The technical solution provided by the invention has the following beneficial technical effects:
1. The layer-focused illustration single-view 3D reconstruction method adopted by the invention fully simulates the drawing process of a 2D illustration, i.e., the detailed information in the illustration is understood by dividing it into layers, which greatly alleviates the problems of extreme color-block stacking, interference from highlight and shadow layers, and loss of complex accessory information; at the same time, a mature 2D diffusion model is used to optimize the 3D generation, reducing the requirement for high-quality 3D assets, accurately locating the 2D details when mapping to 3D, and solving the problem of feature offset;
2. By extracting the unnatural contour lines x, the method reduces the artifact interference caused by illustration line work, and the regions where the unnatural contour lines x are located conveniently serve as the main targets of the noise point cloud growth and color perturbation applied to the initial 3D point cloud, further improving the quality of the initial point cloud;
3. By segmenting the point cloud, the complex human body point cloud is divided into several parts, and each part is operated on independently in a low-dimensional latent space rather than directly in the high-dimensional 3D space, which greatly improves the scalability and computational efficiency of the model, maintains the quality of the generated 3D content, and avoids discontinuities in novel view synthesis.
4. The construction of the weight data set expands the range of what can be generated while also reducing the risk of hallucination when generating the desired 3D asset.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
For a better understanding of the embodiments of the present application, the related models or techniques involved in the embodiments of the present application are described as follows:
The Segment Anything Model (SAM), proposed by Meta and used here as a foundation model, is a general model for image segmentation tasks. SAM consists of three parts: a powerful image encoder computes the image embedding, a prompt encoder embeds the prompt, and the two information sources are then combined in a lightweight mask decoder to predict the segmentation mask. Prompts can be used in a plug-and-play manner for various tasks involving object and image distributions, including edge detection, object proposal generation, instance segmentation and text-to-mask prediction, and segmentation masks can be output in real time as prompted for interactive use.
The full English name of CLIP is Contrastive Language-Image Pre-training, a pre-trained model based on contrastive text-image pairs. CLIP is a multi-modal model based on contrastive learning; its training data are text-image pairs, each consisting of an image and its corresponding text description, and through contrastive learning the model learns the matching relationship between text and image.
The K-dimensional tree (KD-Tree) is a tree data structure for storing instance points in a k-dimensional space for fast retrieval, mainly applied to searches over multi-dimensional key data (such as range searches and nearest-neighbor searches).
PointNet is a deep learning network architecture specially designed for processing point cloud data. It can learn feature representations directly from the point cloud without complex voxelization or shape assumptions, and can segment the point cloud into different objects or object parts.
The application discloses a layer-focused illustration single-view 3D reconstruction method, which comprises: collecting an illustration single-view data set T suitable for reconstruction; processing the original illustration single views in T with SAM to generate region-text pairs containing the original illustration single views and their layer channels, and using them to train and fine-tune the layer-CLIP model; extracting and removing the unnatural contour lines from each original illustration single view in T through the unnatural line extraction module; selecting a basic prime model (a basic prime model refers to a basic, universal, free and easy-to-obtain human body model) according to the character body shape of each original illustration single view in T, and performing noise point cloud growth and color perturbation at the positions corresponding to the unnatural contour lines to generate an initial point cloud P; segmenting and coloring the point cloud to obtain a basically colored point cloud N; mapping the basically colored point cloud N, input to the point cloud-layer diffusion module, into the latent space and performing diffusion according to the tokens c1 and the total number of layers to obtain the final latent representation zm; and finally obtaining the final triplane-based 3D representation from zm through the latent-triplane module.
The present application is specifically described below with reference to fig. 1 to 7.
As shown in fig. 1, the embodiment of the application discloses a layer-focused illustration single-view 3D reconstruction method, which comprises the following steps:
S1, acquiring an illustration single-view data set T, wherein the illustration single-view data set T comprises a plurality of original illustration single views;
The images in the data set T are required to be high-resolution, front-view, neutral-expression and uncropped. In some embodiments, the data set T may be a Hololive character dataset, a Honkai Impact 3rd character dataset, or a Honkai: Star Rail character dataset. The Hololive character dataset comes from the virtual idol company Hololive or the Virtual Youtuber Fandom Wiki, and the Honkai Impact 3rd character dataset comes from the official website database of miHoYo.
Preferably, the data set T is the Vroid 3D dataset, which comes from the 11.2k 3D cartoon characters of VroidHub compiled by the University of Maryland, College Park. The high-resolution, neutral-expression, uncropped images from the Vroid 3D dataset may be used as the primary image dataset for 3D generation. For the selected dataset, the front-view images may be filtered out using a front face detection method.
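As an illustrative, non-limiting sketch of the front-view filtering step, the following assumes OpenCV's stock frontal-face Haar cascade as a stand-in for the front face detection method (the disclosure does not name a specific detector, and a detector trained on photographs may need replacement for anime-style faces):

```python
# Sketch only: keep images in which exactly one frontal face is detected.
# The cascade file and the "one face" criterion are assumptions, not part of the disclosure.
import os
import cv2

def filter_front_views(image_dir: str) -> list[str]:
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    kept = []
    for name in os.listdir(image_dir):
        path = os.path.join(image_dir, name)
        img = cv2.imread(path)
        if img is None:
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 1:          # exactly one frontal face -> treat as a front view
            kept.append(path)
    return kept
```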
S2, performing a first segmentation on each original illustration single view in the data set T to obtain the characteristic region layers of the original illustration single view, performing a second segmentation on each characteristic region layer to obtain the effect region layers within each characteristic region layer, and generating the region-text pairs containing the original illustration single view and its layers, the weight used when the corresponding layer is input into the image encoder, and the total number of layers corresponding to each characteristic region layer of the original illustration single view;
The layers of the original illustration single view comprise the characteristic region layers and the effect region layers;
The characteristic region layers correspond to the various features of the character, such as the head, hands and feet, body, facial features, hair, and accessories;
The effect region layers correspond to the effect regions on the various features of the character, such as highlights and shadows;
After the segmentation is completed, the weight used when the corresponding layer is input into the image encoder is set according to the segmentation result, wherein the weight values lie between 0 and 1; for the characteristic region layers obtained by the first segmentation, the weight of the foreground region is set to 1 according to the coverage relationship and the weights of the subsequent regions decrease in steps of 0.1, while the weights of the effect region layers obtained by the second segmentation decrease in steps of 0.01 on the basis of the weight of the characteristic region layer obtained by the first segmentation.
The total number of layers corresponding to each characteristic region layer is recorded;
For example, if a characteristic region layer obtained by the first segmentation is segmented a second time into 4 effect region layers, the total number of layers corresponding to that characteristic region layer is 5;
The two-dimensional position information of the effect region layer obtained by the second segmentation within the characteristic region layer obtained by the first segmentation is obtained proportionally;
In the above step, the original illustration single views in the data set T are segmented twice, and the picture is layered so that layer information such as highlights and shadows is recorded in the form of multiple layers and passed to the downstream tasks, which greatly reduces the problems of color-block mixing, artifact interference and loss of accessory information. This is particularly suitable for generating 3D models from illustration single views, such as generating head models of game artwork characters.
In some embodiments, the original illustration single views in the data set T may be segmented using the Segment Anything Model (SAM), as shown in the sketch below.
Each region-text pair comprises one layer obtained by segmenting the original illustration single view and a corresponding text description, wherein the region is a layer of the original illustration single view and the text content is a brief description of the layer (in English, for example);
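The following is a minimal sketch of the two-stage segmentation and gradient weight assignment, assuming Meta's segment-anything package and using mask area as a simple proxy for the foreground/coverage ordering (the 0.1 and 0.01 weight steps follow the disclosure; the area heuristic, checkpoint path and weight floor are illustrative assumptions):

```python
# Sketch: first/second segmentation with SAM and gradient weight assignment.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed checkpoint path
mask_gen = SamAutomaticMaskGenerator(sam)

def segment_with_weights(image_path: str):
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    # First segmentation: characteristic region layers (head, body, hands, ...).
    feature_masks = sorted(mask_gen.generate(image), key=lambda m: m["area"], reverse=True)
    layers = []
    for i, fm in enumerate(feature_masks):
        w_feature = max(1.0 - 0.1 * i, 0.1)          # foreground 1.0, then -0.1 steps
        layers.append({"mask": fm["segmentation"], "weight": w_feature, "level": 1})
        # Second segmentation: effect region layers (highlights, shadows) inside the crop.
        x, y, w, h = fm["bbox"]
        effect_masks = mask_gen.generate(image[y:y + h, x:x + w])
        for j, em in enumerate(effect_masks):
            layers.append({"mask": em["segmentation"],
                           "weight": w_feature - 0.01 * (j + 1),  # -0.01 steps below parent
                           "level": 2,
                           "parent": i})
    return layers
```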
S3, constructing and training a layer-CLIP model, wherein the layer-CLIP model comprises an image encoder and a text encoder, the image encoder comprises an RGB image convolution layer and a layer convolution layer which respectively receive the original illustration single view and its layers as input, the text encoder receives the text description of the corresponding layer, and the trained layer-CLIP model outputs tokens c1 containing the semantic information and layer information corresponding to the original illustration single view according to the input original illustration single view, its layers and the corresponding text descriptions;
Compared with the existing CLIP model, the layer-CLIP model adds an adaptive layer convolution layer to the Vision Transformer (ViT) structure of the image encoder, which is equivalent to adding a layer channel to the image encoder. This convolution layer, parallel to the RGB convolution layer, processes the input data of the layer channel and allows the CLIP image encoder to accept an additional layer channel as input. The accepted weight values lie in the range [0,1], where 1 represents the foreground and 0 represents the background; an input precision of 0.1 indicates a layer obtained by the first segmentation and a precision of 0.01 indicates a layer obtained by the second segmentation, and a 3*3 convolution kernel is used for the layers obtained by the first segmentation while a 1*1 convolution kernel is used for the layers obtained by the second segmentation.
An auxiliary channel is added to the text encoder of the CLIP model for inputting the two-dimensional position information of the effect region layer obtained by the second segmentation within the characteristic region layer obtained by the first segmentation; this channel is opened to accept input only when a layer image obtained by the second segmentation is input into the image encoder, and in all other cases the channel input may be set to (-2, -2) and treated as invalid.
The auxiliary channel added to the text encoder of the CLIP model inputs the two-dimensional position of the twice-segmented layer within the once-segmented layer image, with the range x in (-1, 1) and y in (-1, 1). For example, for a hand, taking the center of the back of the hand as the origin, the position information is (0, 0), while for a shadow above the middle finger of the hand the position information may be (0.1, -0.9), the position information being chosen by scaling down proportionally in each specific case.
In use, this embodiment trains and fine-tunes the layer-CLIP model by accepting data from the original channels, i.e., the RGB channels, and the layer channel at the same time, that is, the original illustration single view and the layers obtained by segmenting it are input into the image encoder simultaneously.
The fine-tuned layer-CLIP may be trained on RGBL region-text pairs, with image data from GRIT-20M.
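A minimal PyTorch sketch of the idea of a layer channel running parallel to the RGB patch-embedding convolution of a ViT-style CLIP image encoder is given below; the module and parameter names, the fusion by addition, and the pooling to the patch grid are illustrative assumptions, and the real layer-CLIP would be fine-tuned from pretrained CLIP weights rather than built from scratch:

```python
# Sketch: RGB patch embedding plus a parallel layer-channel embedding.
# 3x3 vs 1x1 kernels follow the first/second-segmentation rule described above;
# the dimensions and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class LayerAwarePatchEmbed(nn.Module):
    def __init__(self, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        # Layer-channel branches: 3x3 kernel for first-segmentation layers (0.1 weight steps),
        # 1x1 kernel for second-segmentation layers (0.01 weight steps).
        self.layer_proj_coarse = nn.Conv2d(1, embed_dim, kernel_size=3, padding=1)
        self.layer_proj_fine = nn.Conv2d(1, embed_dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(14)          # match a 224/16 patch grid (assumed)

    def forward(self, rgb: torch.Tensor, layer: torch.Tensor, second_seg: bool):
        tokens = self.rgb_proj(rgb)                   # (B, D, 14, 14)
        branch = self.layer_proj_fine if second_seg else self.layer_proj_coarse
        tokens = tokens + self.pool(branch(layer))    # fuse the layer channel with RGB tokens
        return tokens.flatten(2).transpose(1, 2)      # (B, 196, D) patch tokens

# usage: emb = LayerAwarePatchEmbed()(rgb_batch, layer_mask_batch, second_seg=False)
```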
S4, for the original illustration single view, tokens c1 containing the corresponding semantic information and layer information are obtained through the layer-CLIP model;
S5, inputting the original illustration single view into the single-view-based 3D diffusion model and, in combination with the tokens c1, diffusing the corresponding point cloud with a Latent Diffusion Model (LDM) a number of times determined by the number of layers corresponding to each characteristic region, to obtain the final 3D representation corresponding to the original illustration single view.
In some embodiments, the single-view-based 3D diffusion model includes an unnatural line extraction module, an initial point cloud generation module, a point cloud segmentation module, a point cloud-layer diffusion module, and a latent-triplane module (Latent-Triplane module);
The step of inputting the original illustration single view into the single-view-based 3D diffusion model and, in combination with the tokens c1, diffusing the corresponding point cloud a corresponding number of times with the latent diffusion model according to the number of layers corresponding to each characteristic region to obtain the final 3D representation corresponding to the original illustration single view comprises the following steps:
The unnatural line extraction module extracts a first illustration single view from the original illustration single view, wherein the first illustration single view is the original illustration single view with the unnatural contour lines removed;
The initial point cloud generation module selects a corresponding basic prime model according to the character body shape in the first illustration single view to generate an initial point cloud P;
As shown in fig. 4, the point cloud segmentation module segments the initial point cloud P into different point cloud parts and performs basic coloring of each point cloud part according to the color of the region with the minimum weight value in the tokens c1 corresponding to that part, to obtain the basically colored point cloud, wherein the point cloud parts segmented from the initial point cloud P correspond one-to-one to the characteristic region layers obtained after the first segmentation; for example, if four parts, namely head, body, hands and feet, are obtained by the first segmentation, the subsequent point cloud segmentation is also performed according to these four parts, as shown in fig. 7.
In some embodiments, the initial point cloud P may be segmented into the different point cloud parts N using the machine learning model PointNet.
The point cloud-layer diffusion module maps the basically colored point cloud into a latent representation z (Latent z), applies the latent diffusion model according to the latent representation z, the tokens c1 and the total number of layers corresponding to each characteristic region layer, diffuses a corresponding number of times over time t, and outputs the final latent representation zm;
The latent-triplane module obtains the final 3D (Triplane) representation, i.e., a triplane representation, based on the final latent representation zm.
In some embodiments, as shown in fig. 2, the unnatural line extraction module extracts the first illustration single view from the original illustration single view as follows:
S101, extracting the unnatural contour lines x of the original illustration single view through a difference-of-Gaussians sketch, wherein during extraction an existing facial landmark detector is used to create a convex hull surrounding the key structures (such as the facial features: eyes, nose, mouth, eyebrows and ears), preventing the lines of the key structures from being extracted as unnatural contour lines x and avoiding filling in the lines of the key structures when the subsequent linear interpolation regenerates the pixel colors near the lines in the drawing;
S102, extracting preliminary features of the original illustration single view using a shallow convolutional neural network;
S103, on the basis of the preliminary features, performing linear interpolation (Lerp) several times on the regions where the unnatural contour lines x are located, and regenerating the pixel colors of the unnatural contour lines x in the original illustration single view, thereby obtaining the first illustration single view.
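A minimal sketch of S101 to S103 follows, assuming OpenCV for the difference-of-Gaussians and using its inpainting routine as a stand-in for the repeated linear interpolation of S103; the facial landmarks are taken as given (the disclosure only requires that some existing detector provides them), so face_landmarks below is a hypothetical input, and the DoG threshold is an assumption:

```python
# Sketch: DoG-based unnatural contour extraction with a facial convex-hull exclusion,
# then regeneration of the masked pixels (cv2.inpaint stands in for repeated Lerp).
import cv2
import numpy as np

def remove_unnatural_lines(img_bgr: np.ndarray,
                           face_landmarks: np.ndarray | None = None) -> np.ndarray:
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    dog = cv2.GaussianBlur(gray, (0, 0), 1.0) - cv2.GaussianBlur(gray, (0, 0), 2.0)
    line_mask = (dog < -2.0).astype(np.uint8) * 255   # dark contour lines (threshold assumed)

    if face_landmarks is not None:                     # protect the key facial structures
        hull = cv2.convexHull(face_landmarks.astype(np.int32))
        cv2.fillConvexPoly(line_mask, hull, 0)         # lines inside the hull are left untouched

    # Regenerate pixel colors where the unnatural contour lines x were located.
    return cv2.inpaint(img_bgr, line_mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```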
In some embodiments, as shown in fig. 3, the initial point cloud generation module selects a corresponding basic prime model according to the character body shape in the first illustration single view to generate the initial point cloud P as follows:
S201, selecting a basic prime model according to the character body shape in the first illustration single view as the initial 3D model;
S202, constructing a triangular mesh M based on the initial 3D model (asset), predicting the signed distance function (SDF) value and texture color of each vertex of the triangular mesh M using multi-layer perceptrons (MLPs), and converting the vertex SDF values and colors of M into a point cloud, denoted ptM(pM,cM), wherein pM∈R3 is the position of the point cloud, equal to the vertex coordinates of M, and cM∈R3 is the color of the point cloud, which is the same as the color of the vertices of M;
Illustratively, a mesh size of 128^3 means 128 points along each coordinate axis, for a total of 128 × 128 × 128 vertices. The signed distance function SDF(x, y, z) represents the signed distance from each point to the nearest surface of the initial 3D model, where a negative value indicates that the point is inside the surface and a positive value indicates that it is outside. In the MLPs, an SDF value is computed for each vertex (x, y, z) to indicate its positional relationship to the surface of the initial 3D model; each vertex (x, y, z) carries not only position information but also associated texture color information, usually expressed as (r, g, b), representing the red, green and blue color components.
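A minimal PyTorch sketch of S202 is given below, under the assumption of a simple shared MLP that maps each vertex position to one SDF value and an RGB color; the architecture, the toy vertex grid and the packing into ptM(pM, cM) are illustrative, not the disclosed network:

```python
# Sketch: predict SDF and texture color for mesh vertices and pack them as a point cloud.
import torch
import torch.nn as nn

class VertexMLP(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                 # 1 SDF value + 3 color channels

    def forward(self, xyz: torch.Tensor):
        out = self.net(xyz)
        sdf = out[:, :1]                          # signed distance to the base-model surface
        color = torch.sigmoid(out[:, 1:])         # (r, g, b) in [0, 1]
        return sdf, color

# Toy grid; the disclosure uses 128^3 vertices, which would be processed in batches.
vertices = torch.rand(32 ** 3, 3) * 2 - 1         # stand-in for the mesh vertices of M
sdf, color = VertexMLP()(vertices)
pt_M = torch.cat([vertices, color], dim=-1)       # ptM(pM, cM): positions plus colors
```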
S203, performing noise point cloud growth and color perturbation on the point cloud around ptM, including:
first computing a bounding box (BBox) on the surface of ptM, and then, according to the mapped region of the unnatural contour lines x on the point cloud, uniformly growing a noise point cloud ptr(pr,cr) inside the BBox covered by the mapped region, wherein pr and cr represent the position and color of the noise point cloud, respectively;
The BBox is the minimum axis-aligned bounding box of the point cloud in three-dimensional space: all points in the point cloud are traversed and the minimum and maximum values of the points along the X, Y and Z coordinate axes are recorded; the bounding box can be represented by any two diagonal corners of its 8 corner points, usually the minimum point and the maximum point;
Because the selected illustration images are all front views, the 2D image coordinates are mapped into the 3D space while the variation along the Z axis is ignored; the 3D point cloud coordinates corresponding to each pixel (u, v) in the region of the unnatural contour lines x of the 2D image are (X, Y, Z) = (u·D/f, v·D/f, D), where D is the distance of the 3D point along the depth direction (Z axis) and f is the focal length of the camera;
S204, screening the noise point cloud by constructing a K-dimensional tree (KD-Tree) from the positions pM and keeping the points whose distance between pr and the nearest point found in the KD-Tree is within a set distance threshold, i.e., traversing each generated point, finding its nearest neighbor and checking whether the distance condition is met.
By constructing a K-dimensional tree, a fast search can be achieved.
For example, the distance threshold may be 0.01 after normalization.
For the color of the noise point cloud, cr is made similar to cM with a perturbation added:
cr=cM+a
Wherein the value of a is randomly sampled between 0 and 0.2.
S205, merging the positions and colors of ptM and ptr to obtain the final initial point cloud P.
The initialization quality can be improved through noise point cloud growth, color perturbation and point cloud screening, and in the point cloud screening process, fast search can be achieved by constructing a K-dimensional tree; a sketch of this procedure follows.
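The following is a minimal NumPy/SciPy sketch of S203 to S205, assuming the BBox mapped from the unnatural contour lines is already available as min/max corners; the sampling count and SciPy's cKDTree are implementation choices, not part of the disclosure:

```python
# Sketch: grow noise points in the BBox mapped from the unnatural contour lines,
# filter them with a KD-tree against ptM, perturb colors, and merge into P.
import numpy as np
from scipy.spatial import cKDTree

def grow_and_merge(p_M, c_M, bbox_min, bbox_max,
                   n_noise=20000, dist_thresh=0.01, rng=np.random.default_rng(0)):
    # S203: uniform noise point growth inside the mapped bounding box.
    p_r = rng.uniform(bbox_min, bbox_max, size=(n_noise, 3))

    # S204: keep only noise points whose nearest ptM neighbor is within the threshold.
    tree = cKDTree(p_M)
    dist, idx = tree.query(p_r, k=1)
    keep = dist <= dist_thresh
    p_r, idx = p_r[keep], idx[keep]

    # Color perturbation: c_r = c_M + a, with a sampled in [0, 0.2].
    c_r = np.clip(c_M[idx] + rng.uniform(0.0, 0.2, size=(len(p_r), 3)), 0.0, 1.0)

    # S205: merge positions and colors into the final initial point cloud P.
    return np.concatenate([p_M, p_r]), np.concatenate([c_M, c_r])
```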
In some embodiments, as shown in fig. 5, the point cloud-layer diffusion module maps the basically colored point cloud into the latent representation z (Latent z) as follows:
Fourier features are used to represent the positional structure of the basically colored point cloud; a series of learnable tokens e∈Rr×de is introduced, the point cloud features are queried with cross-attention layers so that the 3D information in the point cloud is injected into the latent tokens, and the representations of these tokens are then enhanced with multiple self-attention layers, yielding the latent representation z∈Rr×dz of the basically colored point cloud, where r denotes the resolution of the latent representation, de denotes the channel dimension of e, and dz denotes the channel dimension of z.
Fourier features are used to represent the position information of the point cloud. Each point position pi can be mapped to a high-dimensional space through the Fourier feature mapping:
γ(pi) = [cos(2πB pi), sin(2πB pi)]
where B∈Rr×3 is the Fourier basis matrix, usually randomly sampled.
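A minimal sketch of this mapping follows; the resolution r and the random normal initialization of B are assumptions consistent with common practice:

```python
# Sketch: map point positions p_i to gamma(p_i) = [cos(2*pi*B p_i), sin(2*pi*B p_i)].
import numpy as np

def fourier_features(points: np.ndarray, r: int = 64, rng=np.random.default_rng(0)):
    B = rng.normal(size=(r, 3))               # Fourier basis matrix B in R^{r x 3}, randomly sampled
    proj = 2.0 * np.pi * points @ B.T         # (N, r)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)   # (N, 2r) features

# usage: gamma = fourier_features(P_positions)   # fed to the cross-attention token queries
```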
In some embodiments, applying the latent diffusion model according to the latent representation z, the tokens c1 and the total number of layers corresponding to each characteristic region layer, diffusing each time over time t, and outputting the final latent representation zm includes:
The cross-attention layer is used at each noise diffusion step to facilitate interaction between the corresponding layer information in the tokens c1 and the final latent representation zm, wherein the noise diffusion proceeds in order of the weight values in the tokens c1 from small to large, the diffusion time t may be appropriately increased when the weight has only one decimal place, and m noise diffusion steps are performed on the part of the latent representation z whose corresponding characteristic region layer has a total of m layers.
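A schematic sketch of this weight-ordered, per-layer diffusion schedule is given below; denoise_step is a dummy placeholder for one LDM denoising step conditioned via cross-attention on the c1 tokens, and the time scaling factor is an illustrative assumption:

```python
# Sketch: diffuse each layer part of z in order of increasing c1 weight,
# running m steps for a characteristic region whose layer total is m.
import torch

def denoise_step(z_part, c1_tokens, t):
    """Placeholder for one conditioned LDM denoising step (cross-attention on c1)."""
    return z_part - 0.01 * torch.randn_like(z_part) * t   # dummy update, not the real model

def layered_diffusion(z_parts, c1_tokens, weights, layer_totals, base_t=1.0):
    order = sorted(range(len(z_parts)), key=lambda i: weights[i])   # small weight first
    for i in order:
        # Weights with only one decimal place (first-segmentation layers) get a longer t.
        t = base_t * (1.5 if round(weights[i], 1) == round(weights[i], 2) else 1.0)
        for _ in range(layer_totals[i]):                            # m diffusion passes
            z_parts[i] = denoise_step(z_parts[i], c1_tokens, t)
    return z_parts
```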
In some embodiments, in each latent diffusion process, LoRA is used to fine-tune the model weights εi of the latent diffusion model, the semantic information in c1 is added to the data εi as interpolation, all LoRA matrices are concatenated to obtain weight data representing one layer part, and training based on N different examples yields the final model weight data set D = {ε1, ε2, …, εN};
The model weights of the latent diffusion model are the parameters of the latent diffusion model, which are optimized during training to minimize the loss function. The model weights encode the knowledge the model has learned from the input data and determine how the model maps inputs to outputs;
Because certain layer attributes often appear together during layer training, these attributes may become entangled during subsequent retrieval (for example, reddish hair and warm highlights), which can limit the accuracy of the model when editing or generating particular attributes; when attribute entanglement is encountered, based on the information of the tokens c1, the fine-tuned result of the party whose layer model weight is smaller covers the result of the other party.
Principal component analysis is performed on the model weight data set, keeping its principal components and reducing their dimensionality, so that the data points are compressed into fewer parameters and only the most important information is retained; after dimensionality reduction, the model weight data set is used as data to train a semantic retriever, and a hypergraph is constructed based on the semantic information in c1 as labels;
The model weight data set D may be sampled according to a piece of semantic information S to obtain a new model that generates a new layer part; a hypergraph is defined over the layer parts, retrieval is performed via the semantic information of the tokens c1 embedded in the model weights, and the relevant model weights in the model are edited. For example, for "the right hand of a woman with a red bracelet and a black tattoo in sunlight", a hypergraph conforming to the semantics is constructed, the model weights for warm highlight, red bracelet, female right hand and black tattoo are retrieved from the model weight data set via the hyperedge connections according to the combination relations of the hypergraph, and diffusion then continues with these model weights.
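A highly simplified sketch of building the model-weight data set D from flattened LoRA matrices, compressing it with PCA, and performing a nearest-neighbor semantic lookup follows; the use of text-embedding keys and every helper name below are assumptions standing in for the hypergraph-based retriever described above:

```python
# Sketch: flatten each layer part's LoRA matrices into one weight vector eps_i,
# PCA-compress the data set D, and retrieve weights by a semantic embedding key.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def build_weight_dataset(lora_weight_lists):
    """Each entry: list of LoRA matrices for one trained layer part -> one flat vector eps_i."""
    return np.stack([np.concatenate([m.ravel() for m in mats]) for mats in lora_weight_lists])

def fit_retriever(D, semantic_keys, n_components=64):
    pca = PCA(n_components=n_components).fit(D)              # keep only the principal components
    knn = NearestNeighbors(n_neighbors=4).fit(semantic_keys)  # keys: c1-derived text embeddings
    return pca, knn

def retrieve_weights(query_key, D, pca, knn):
    _, idx = knn.kneighbors(query_key[None, :])               # layer parts matching semantics S
    return pca.inverse_transform(pca.transform(D[idx[0]]))    # decompressed weights for diffusion
```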
According to the method, semantic information and layer information from the image are injected into the latent space, so that the mapping of the image features onto the point cloud is accurate and the high-frequency details of the 3D assets generated by the diffusion model can be aligned with the conditional image (the original illustration single view); the model weights of each diffusion are used as new data to construct a model weight data set corresponding to the 3D asset features, which expands the range of objects that can be generated and reduces the risk of hallucination.
In some embodiments, as shown in fig. 6, the latent-triplane module obtains the final 3D (Triplane) representation based on the final latent representation zm as follows:
After the final latent representation zm is obtained, it is reshaped into a triplane representation zreshape and the three planes are concatenated vertically along the height dimension to obtain zconcat, which prevents the planes from being confused in the channel dimension; the latent-triplane module then upsamples zconcat into a high-resolution triplane feature map with an upsampling factor of f.
The upsampling process gradually upsamples the explicit final latent representation with a convolutional decoder (convolutional network) and obtains the final 3D representation.
Wherein the reshaped triplane is expressed as:
zreshape = reshape(zm, (3, r, r, dz))
and concatenating the three planes vertically along the height dimension gives zconcat.
Gradually upsampling the explicit final latent representation with a convolutional network, compared with a Transformer decoder, allows effective upsampling through transposed convolutions while providing a feature extraction mode different from that of the encoder, achieving feature complementarity; a sketch of this module follows.
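The following is a minimal PyTorch sketch of the reshape, height-wise concatenation and convolutional upsampling of zm; the channel sizes, the number of decoder stages and the upsampling factor f = 4 are illustrative assumptions:

```python
# Sketch: reshape z_m into three planes, stack them along the height dimension,
# and upsample with transposed convolutions into a high-resolution triplane feature map.
import torch
import torch.nn as nn

class LatentTriplaneDecoder(nn.Module):
    def __init__(self, d_z: int = 32, d_out: int = 32):
        super().__init__()
        self.up = nn.Sequential(                      # two x2 stages -> upsampling factor f = 4
            nn.ConvTranspose2d(d_z, d_z, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(d_z, d_out, kernel_size=4, stride=2, padding=1))

    def forward(self, z_m: torch.Tensor, r: int):
        z_reshape = z_m.reshape(3, r, r, -1)          # z_reshape = reshape(z_m, (3, r, r, d_z))
        z_concat = z_reshape.reshape(3 * r, r, -1)    # stack the 3 planes along the height dim
        z_concat = z_concat.permute(2, 0, 1).unsqueeze(0)   # (1, d_z, 3r, r) for the decoder
        return self.up(z_concat)                      # (1, d_out, 3r*f, r*f) triplane feature map

# usage: feats = LatentTriplaneDecoder()(torch.randn(3 * 32 * 32, 32), r=32)
```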
The method can be used for generating head models of game artwork characters and for generating 3D head or hand models of virtual hosts or artwork, and is preferably used for generating head models of game artwork characters.
The following describes an example of 3D reconstruction from an anime-style illustration of a character wearing an animal-ear hair accessory.
1. Inputting a two-dimensional illustration
2. Layer segmentation and weight setting
Characteristic region layer segmentation: the anime illustration is first segmented through the SAM model to obtain the basic characteristic layers, i.e., regions such as the head, face and animal-ear hair accessory in the illustration are segmented into independent layers. The light-and-shadow effect layers are then further segmented within the animal-ear hair accessory characteristic region.
Layer description and text generation: region-text pairs are generated based on each segmented layer and its description (such as head region, animal-ear hair accessory, highlight effect), and tokens containing the corresponding semantic information and layer information are obtained through the layer-CLIP model for layer control during generation.
Weight setting: after segmentation is completed, the weight used when each layer is input into the image encoder is set according to the segmentation result.
3. Initial point cloud generation
Basic prime model selection: a basic prime model with head and ear contours is selected as the initial 3D model according to the facial and body shape characteristics of the character.
Triangular mesh and point cloud generation: a triangular mesh is constructed based on the selected prime model, and the signed distance function (SDF) values and texture colors of the mesh vertices are predicted through a multi-layer perceptron (MLP). The vertex SDF values and colors of the mesh are converted into the initial point cloud so as to ensure that the contour and color information of features such as the animal-ear hair accessory is fully preserved.
4. Point cloud noise growth and color perturbation
Unnatural line extraction: the unnatural contour lines of the illustration (such as the black edge lines of the animal-ear hair accessory) are extracted through the difference-of-Gaussians filtering method, and the positions of the contour lines in the illustration layers are recorded.
Noise point cloud growth: a bounding box (BBox) is generated in the mapped region of the animal-ear hair accessory, and a noise point cloud is grown uniformly inside the bounding box so that the noise point cloud density matches the hair-accessory region.
KD-Tree screening and color perturbation: the noise point cloud is screened using the KD-Tree structure, the noise points within the distance threshold of their nearest neighbors are kept, and a slight color perturbation is added in the ear region so that the color harmonizes with the surrounding hair accessory and a natural color transition is produced.
5. Point cloud-layer diffusion processing
Latent diffusion model application: the point cloud is mapped into the latent representation space z, and each region is diffused a corresponding number of times according to its layer weight (e.g., an ear hair accessory weight of 0.9). Cross-attention layers are used in the diffusion process so that the layer information of the ear hair accessory interacts with the point cloud features.
Diffusion order: diffusion proceeds layer by layer in order of the layer weight values from small to large, ensuring that the structures of the background layers, the hair accessory and the other foreground layers are completed in turn. If the weight precision of the ear hair accessory is 0.1, the diffusion time t is appropriately increased to enhance the effect.
6. Generating a triplane representation through the latent-triplane module
Triplane construction: the final latent representation after diffusion is reshaped into a triplane representation, and the three planes are stacked vertically along the height dimension to ensure that the hair-accessory layers are not confused during generation.
Upsampling to generate a high-resolution 3D feature map: the triplane feature map is gradually upsampled through the convolutional decoder to generate a high-resolution three-dimensional representation, so that details such as the animal-ear hair accessory are clearly visible in the 3D model.
7. Final 3D model output
Model output: the final generated 3D model contains the complete structure and texture of the head and the animal-ear hair accessory. The highlight, shadow and edge-line details of the hair-accessory part are rendered naturally in the 3D model, achieving a visual effect consistent with the original illustration.
This process makes full use of the layer information, noise growth and multi-layer diffusion, ensuring that the animal-ear hair accessory and other details are accurately rendered during generation and conform to the characteristics of the anime illustration.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to invoke the computer program to perform the method as described above.
Embodiments of the present application also provide a computer readable storage medium having stored therein a computer program which, when run on an electronic device, causes the electronic device to implement a method as described above.
Embodiments of the present application also provide a computer program product comprising a computer program which, when run on an electronic device, causes the electronic device to implement a method as described above.
For specific implementations of the electronic device, the computer-readable storage medium and the computer program product provided by the embodiments of the present application, reference may be made to the specific embodiments of the foregoing method, which are not repeated here.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.