Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments are not representative of all implementations consistent with one or more embodiments of the application. Rather, they are merely examples consistent with aspects of one or more embodiments of the present application.
It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described. In some other embodiments, the method may include more or fewer steps than described herein. Furthermore, individual steps described in this disclosure may be broken down into multiple steps in other embodiments; while various steps described in this application may be combined into a single step in other embodiments.
In practical applications, an application program facing the user is usually implemented by a foreground client working together with a background server. The foreground client can output a user interface in which a user executes interactive operations such as clicking, long pressing, inputting text, drawing images, and inserting files; data generated by these interactive operations is sent to the background server, and the background server performs corresponding computation based on the data sent by the foreground client, so as to realize the functions that the application program provides externally.
For an application that externally provides an AI drawing function, the application needs to output a corresponding image based on descriptive text entered by the user. If the user inputs descriptive text describing a human hand (for example, "holding a cup") in the user interface output by the foreground client, the background server needs to generate a corresponding human hand image and send it to the foreground client, and the foreground client outputs the human hand image through the user interface for the user to view.
In addition, in order to reduce the user's workload when drawing, an application program providing a drawing function usually offers certain conveniences: it typically outputs images of human body parts, such as hands and feet, so that the user can draw on the basis of these body-part images or splice them with images the user has drawn, sparing the user the trouble of drawing the body parts. In this case, the background server also needs to generate human hand images with different hand shapes and different hand postures, send them to the foreground client, and have the foreground client output them through the user interface for the user to select and use.
However, the generated human hand images often contain unreasonable human hands, for example: a hand with too many or too few fingers, or a hand in a posture inconsistent with a normal hand. This typically makes the generated human hand image unusable, affecting the user experience.
One or more embodiments of the present application provide a technical solution for generating a human hand image, which can ensure that the human hand in the generated human hand image is reasonable, thereby ensuring that the generated human hand image is usable by the user and improving the user experience.
In the above technical solution, a two-dimensional hand image corresponding to a human hand sample may be obtained first; a three-dimensional hand figure is then reconstructed from the two-dimensional hand image by a three-dimensional reconstruction model; the three-dimensional hand figure may then be projected onto a two-dimensional plane to convert it into a two-dimensional image; finally, a generation model may use the two-dimensional image obtained by the conversion as an additional control condition and generate, based on a noise image, a human hand image matching the two-dimensional image.
In a specific implementation, a large number of different human hands may be used as human hand samples, and a two-dimensional hand image corresponding to each of the human hand samples may be acquired.
When the two-dimensional hand images corresponding to the human hand samples are obtained, the two-dimensional hand image corresponding to each human hand sample may be input into the three-dimensional reconstruction model described above, and the three-dimensional reconstruction model reconstructs a three-dimensional hand figure based on each two-dimensional hand image, that is, the three-dimensional hand figure corresponding to each human hand sample.
When the three-dimensional hand figure has been reconstructed by the three-dimensional reconstruction model, the three-dimensional hand figure is projected onto a preset two-dimensional plane so as to convert it into a two-dimensional image. Information such as the position and angle of the two-dimensional plane relative to the three-dimensional hand figure can be set by a user according to actual requirements, or can be a system default value; the application is not limited in this respect.
As described above, the three-dimensional hand figures reconstructed by the three-dimensional reconstruction model may be the three-dimensional hand figures corresponding to the respective human hand samples. Accordingly, for any one of the human hand samples, the three-dimensional hand figure corresponding to that sample may be projected onto the two-dimensional plane to convert it into a two-dimensional image corresponding to that sample.
Further, a noise image for generating a human hand image may be acquired; when the two-dimensional image is obtained, the two-dimensional image and the noise image may be input into the generation model described above, and the generation model generates, based on the noise image, a human hand image matching the two-dimensional image. Specifically, one two-dimensional image corresponding to one human hand sample (or a set of two-dimensional images consisting of one two-dimensional contour image, one two-dimensional key point image, and one two-dimensional depth image corresponding to one human hand sample) and one noise image may be input into the generation model, which generates a human hand image matching the two-dimensional image based on this noise image. Taking a set of two-dimensional images consisting of a two-dimensional contour image, a two-dimensional key point image, and a two-dimensional depth image corresponding to one human hand sample as an example, for the human hand in the human hand image matching this set of two-dimensional images: the contour of the human hand is the contour in the two-dimensional contour image; the relative positional relationships between the human hand key points on the hand are the same as those between the human hand key points in the two-dimensional key point image; and the geometry of the visible surface of the human hand is determined by the pixel values of the individual pixels in the two-dimensional depth image.
In this way, a large number of different human hand samples yield three-dimensional hand figures that are large in data volume and diverse, so that the two-dimensional images obtained by converting the three-dimensional hand figures can serve as prior information that guides the generation of matching human hand images. Provided that the shapes, postures, and the like of the human hand samples are reasonable, the human hands in the generated human hand images can be ensured to be reasonable, so that the generated images are usable by the user and the user experience is improved. The generated human hand images likewise have a large data volume and diversity.
The following describes a technical solution for generating a human hand image according to one or more embodiments of the present application.
Referring to fig. 1, fig. 1 is a schematic diagram of a human hand image generating system according to an exemplary embodiment of the application.
As shown in fig. 1, the system for generating the human hand image may include a three-dimensional reconstruction model and a generation model. Wherein:
the three-dimensional reconstruction model may be used to reconstruct a stereoscopic three-dimensional figure (also referred to as a three-dimensional model) based on planar two-dimensional images. For example, the three-dimensional reconstruction model may first acquire data for constructing a three-dimensional figure; the data may be collected by various sensors such as a camera, a laser scanner, or a depth camera, and may specifically be a directly acquired two-dimensional image, or a two-dimensional image formed from acquired point clouds, depth information, and the like. Useful feature information can then be extracted from these data, typically using feature detection algorithms for two-dimensional images (for example, SIFT, SURF, or ORB). The collected data can then be matched and aligned to obtain a consistent coordinate system, which can be realized by matching the features, calculating the pose and view angle of the camera, and registering the data in the same coordinate system. After matching and alignment, the data can be converted into a three-dimensional point cloud or mesh model using a three-dimensional reconstruction algorithm to generate the three-dimensional figure; common three-dimensional reconstruction algorithms include stereoscopic vision, structured light, voxel filling, and the like. The generated three-dimensional figure can then be optimized, for example by removing noise, smoothing the surface, and filling holes, which can be realized through techniques such as filtering, mesh editing, and surface fitting. Finally, texture information in the two-dimensional images can be mapped onto the generated three-dimensional figure, giving it a more realistic appearance and detail.
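By way of illustration, one core step of such a pipeline, recovering a three-dimensional point from two matched two-dimensional observations in registered views, can be sketched with linear (DLT) triangulation. The camera matrices and test point below are hypothetical examples, not part of the method described above:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: 2D observations (u, v) of the same point in each view.
    Returns the 3D point in non-homogeneous coordinates.
    """
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The point is the null vector of A (smallest singular vector).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with camera matrix P (perspective divide)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two simple cameras: identity pose, and a 1-unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

In the noiseless case the estimate recovers the original point exactly; with noisy observations the SVD gives a least-squares solution.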
The generation model may be a conditional generation model. A conditional generation model is a class of deep learning models that can generate data given some additional conditions. These conditions are typically vectors or tensors input into the conditional generation model, and may include information such as images, text, labels, or audio. Depending on the condition type, examples of conditional generation models include: an image-to-image model, which receives an input image and progressively generates a corresponding output image by modifying the one-step transition probabilities of a Markov chain, for example converting a black-and-white image into a color image, or a low-resolution image into a high-resolution image; a text-to-image model, which receives an input text description and generates a corresponding image, for example generating a scene image or a character avatar from the description; and an image-to-text model, which receives an input image and generates a corresponding text description, for example generating labels or descriptions from the image. The generation model here may in particular be an image-to-image model.
First, the two-dimensional hand images corresponding to the human hand samples obtained in advance may be input into the three-dimensional reconstruction model, and the three-dimensional reconstruction model reconstructs three-dimensional hand figures based on these two-dimensional hand images; the reconstructed three-dimensional hand figures are the three-dimensional hand figures corresponding to the human hand samples. That is, a large number of different human hand samples are utilized to obtain three-dimensional hand figures that are large in data volume and diverse; and provided that the shapes, postures, and the like of the human hand samples are reasonable, the three-dimensional hand figures are also reasonable.
Then, the three-dimensional hand figures reconstructed by the three-dimensional reconstruction model can be converted, through projection, into multiple types of two-dimensional images, such as contour images, key point images, and depth images. Since the three-dimensional hand figures are large in data volume and diverse and their reasonableness can be ensured, the two-dimensional images obtained by converting them are likewise large in data volume and diverse, and their reasonableness can also be ensured.
Finally, the multiple types of two-dimensional images described above, together with a noise image for generating a human hand image, may be input into the generation model; the generation model generates human hand images based on the noise image, with the two-dimensional images as the condition, so that each generated human hand image matches a two-dimensional image (for example, one human hand image may match one two-dimensional image). In this case, the shape, posture, and the like of the human hand in each generated human hand image depend on the two-dimensional image matching that human hand image, so that human hand images with a large data volume and diversity are finally obtained, and the human hands in these images are all reasonable.
Referring to fig. 2 in conjunction with fig. 1, fig. 2 is a flowchart illustrating a method for generating a human hand image according to an exemplary embodiment of the present application.
In this embodiment, the method for generating a human hand image described above may be applied to a server. The server may be a single independent physical host, or a server cluster formed by a plurality of independent physical hosts; alternatively, the server may be a virtual server, a cloud server, or the like, carried by a host cluster.
Alternatively, the method for generating the human hand image can be applied to an electronic device with a certain computing capability, such as a desktop computer, a notebook computer, a palmtop computer (PDA, Personal Digital Assistant), or a tablet device.
The method for generating the human hand image can comprise the following steps:
step 202: acquiring a two-dimensional hand image corresponding to a human hand sample, and inputting the two-dimensional hand image into a three-dimensional reconstruction model so that the three-dimensional reconstruction model reconstructs a three-dimensional hand figure based on the two-dimensional hand image.
In this embodiment, a large number of different human hands may first be used as human hand samples, and the two-dimensional hand image corresponding to each of the human hand samples may be acquired.
In practical applications, a large number of two-dimensional images containing different human hands may be acquired. These two-dimensional images may be acquired from a real human body or human hand, for example: two-dimensional images obtained by shooting, or by motion capture of a human body or human hand; or they may be two-dimensional images containing a human hand randomly generated under simulated conditions; the application is not limited in this respect. In order to ensure the reasonableness of the human hand samples, after the two-dimensional images containing human hands are acquired, whether the human hand contained in each two-dimensional image is reasonable can be checked; if so, the human hand contained in the two-dimensional image can be used as a human hand sample, and a two-dimensional hand image can be extracted from the two-dimensional image as the two-dimensional hand image corresponding to that human hand sample.
In some embodiments, when acquiring the two-dimensional hand image corresponding to the human hand sample, a two-dimensional image including the human hand sample may be specifically acquired, and human hand detection is performed on the two-dimensional image, so as to extract the two-dimensional hand image corresponding to the human hand sample from the two-dimensional image.
Specifically, human hand detection may be performed on the two-dimensional image based on a preset human hand detection algorithm, so as to extract a two-dimensional hand image corresponding to the human hand sample from the two-dimensional image. The human hand detection algorithm may be a human hand detection algorithm based on color, texture, shape or machine learning.
The skin color of a human body usually has certain characteristics in an image, and this color information can be utilized for human hand detection. A color-based human hand detection algorithm builds a skin color model and judges, according to color thresholds, whether the pixel points in the image belong to the human hand region.
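A minimal sketch of such a color-based check is shown below. The specific thresholds are a commonly cited rule-of-thumb skin heuristic and should be treated as assumptions; a practical system would learn its skin color model from data:

```python
import numpy as np

def skin_mask(rgb):
    """Classify pixels as skin using a classic rule-based RGB threshold.

    rgb: H x W x 3 uint8 image. Returns a boolean H x W mask.
    """
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    # Skin tends to be red-dominant with moderate green/blue; each
    # condition below is one clause of the heuristic threshold rule.
    return (
        (r > 95) & (g > 40) & (b > 20)
        & (r - np.minimum(g, b) > 15)
        & (np.abs(r - g) > 15) & (r > g) & (r > b)
    )

# A 1x2 test image: one skin-like pixel, one blue pixel.
img = np.array([[[200, 120, 90], [30, 60, 200]]], dtype=np.uint8)
mask = skin_mask(img)
```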
Human hands often have certain texture features, such as wrinkles, skin texture, and fingerprints. A texture-based human hand detection algorithm distinguishes human hand regions from other regions in the image by extracting and analyzing the texture features of the image. Common texture features include local binary patterns (LBP), histograms of oriented gradients (HOG), and the like.
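As an illustration of one such texture feature, the basic 8-neighbour local binary pattern (LBP) code can be computed as follows; this is a simplified sketch, and practical systems typically also aggregate these codes into histograms over image regions:

```python
import numpy as np

def lbp(image):
    """Compute the basic 8-neighbour Local Binary Pattern code.

    image: 2D integer array. Returns one code per interior pixel:
    each neighbour >= centre contributes one bit, clockwise from
    the top-left neighbour.
    """
    c = image[1:-1, 1:-1]  # centre pixels (interior only)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the centre block.
        n = image[1 + dy:image.shape[0] - 1 + dy,
                  1 + dx:image.shape[1] - 1 + dx]
        code = code + ((n >= c).astype(int) << bit)
    return code

codes = lbp(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

For the 3x3 example, the four neighbours below and to the right of the centre (6, 9, 8, 7) are greater than 5, setting bits 3 through 6.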
The shape of a human hand typically has certain characteristics such as the degree of bending of the fingers and the position of the joints. Shape-based human hand detection algorithms identify human hand regions and other regions in an image by extracting and analyzing contour or edge information of each region in the image. Common methods include techniques based on edge detection, contour extraction, shape matching, and the like.
With machine learning algorithms, feature representations and classification models for human hand regions and other regions in images can be learned from training samples. Common methods include support vector machines (SVMs), random forests, convolutional neural networks, and the like. Further, a deep neural network can learn the feature representation of the image and the spatial structure information of the human hand, enabling efficient and accurate human hand detection. Common deep learning models include convolutional neural networks (CNNs), region-based convolutional neural networks (R-CNNs), one-stage detectors, two-stage detectors, and the like.
When the two-dimensional hand images corresponding to the human hand samples are obtained, the two-dimensional hand image corresponding to each human hand sample may be input into the three-dimensional reconstruction model described above, and the three-dimensional reconstruction model reconstructs a three-dimensional hand figure based on each two-dimensional hand image, that is, the three-dimensional hand figure corresponding to each human hand sample.
In some embodiments, the three-dimensional reconstruction model may include a parameterized model constructed based on a neural network. Correspondingly, when the two-dimensional hand image is input into the three-dimensional reconstruction model so that it reconstructs the three-dimensional hand figure based on the two-dimensional hand image, the two-dimensional hand image can be input into the parameterized model constructed based on the neural network, so that this parameterized model converts the two-dimensional hand image into three-dimensional hand parameters and generates the three-dimensional hand figure based on the three-dimensional hand parameters.
Further, in some embodiments, the parameterized model may comprise a MANO (hand Model with Articulated and Non-rigid defOrmations) model.
Referring to fig. 3, fig. 3 is a schematic diagram of a parameterized model constructed based on a neural network according to an exemplary embodiment of the present application.
The parameterized model constructed based on the neural network may include a neural network and a MANO model. In this case, the above two-dimensional hand image may be input into the neural network, which converts the two-dimensional hand image into three-dimensional hand parameters corresponding to the MANO model; the MANO model may generate the three-dimensional hand pattern based on the three-dimensional hand parameters.
In practical applications, the three-dimensional hand parameters corresponding to the MANO model may include shape parameters and pose parameters. Wherein, the shape parameters can be used for defining the overall shape characteristics of the human hand and describing the change of the geometrical shape of the human hand, and the shape parameters can comprise the length of fingers, the width of palms and the like; the gesture parameters may be used to define gesture information such as joint angles of the hand and bending degrees of the fingers, and may represent rotation and translation information of the hand. In addition, the three-dimensional hand parameters corresponding to the MANO model may further include global rotation parameters for describing a rotation posture of the entire human hand, representing a rotation angle and an axial direction of the entire human hand with respect to a global coordinate system.
On the one hand, the MANO model can obtain the joint positions of the three-dimensional hand figure from the input shape parameters and gesture parameters through mathematical operations and interpolation, with the rotation and translation information of the hand represented in forms such as Euler angles and quaternions; on the other hand, the joint positions of the three-dimensional hand figure can be mapped onto the skin mesh of the hand through skinning techniques, and through weight adjustment and vertex deformation the mesh is made to match the joint positions more realistically, realizing stretching, shrinking, and twisting of the skin. In this way, the MANO model can generate the three-dimensional hand figure based on the three-dimensional hand parameters.
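The skinning step described above can be illustrated with minimal linear blend skinning. This is a simplified stand-in for MANO's actual skinning, which additionally applies learned blend-shape corrections; the vertices, weights, and joint transforms below are hypothetical:

```python
import numpy as np

def blend_skin(vertices, weights, transforms):
    """Minimal linear blend skinning: each vertex moves by a
    weight-blended combination of per-joint rigid transforms.

    vertices:   V x 3 rest-pose positions.
    weights:    V x J skinning weights, each row summing to 1.
    transforms: list of J (R, t) pairs; R is a 3x3 rotation,
                t a translation vector.
    """
    out = np.zeros_like(vertices, dtype=float)
    for j, (R, t) in enumerate(transforms):
        # Apply joint j's rigid transform, scaled by its weights.
        out += weights[:, j:j + 1] * (vertices @ R.T + t)
    return out

# One vertex influenced equally by an identity joint and a joint
# rotated 90 degrees about z; the result is the halfway blend.
verts = np.array([[1.0, 0.0, 0.0]])
w = np.array([[0.5, 0.5]])
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
posed = blend_skin(verts, w, [(np.eye(3), np.zeros(3)), (Rz, np.zeros(3))])
```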
Step 204: Projecting the three-dimensional hand figure onto a preset two-dimensional plane so as to convert the three-dimensional hand figure into a two-dimensional image.
In this embodiment, when the three-dimensional hand figure is reconstructed by the three-dimensional reconstruction model, the three-dimensional hand figure is projected onto a preset two-dimensional plane so as to convert it into a two-dimensional image. Information such as the position and angle of the two-dimensional plane relative to the three-dimensional hand figure can be set by a user according to actual requirements, or can be a system default value; the application is not limited in this respect.
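A minimal sketch of such a projection, assuming a pinhole camera whose rotation R, translation t, and focal length f stand in for the configurable position and angle of the two-dimensional plane:

```python
import numpy as np

def project_points(points, R, t, f=1.0):
    """Project 3D points onto a 2D image plane with a pinhole camera.

    points: N x 3 world-space points.
    R, t:   world-to-camera rotation (3x3) and translation (3,).
    f:      focal length of the simulated camera.
    """
    cam = points @ R.T + t               # world -> camera coordinates
    return f * cam[:, :2] / cam[:, 2:3]  # perspective divide onto the plane

pts = np.array([[0.0, 0.0, 2.0],
                [0.5, 0.5, 2.0]])
uv = project_points(pts, np.eye(3), np.zeros(3))
```

Changing R and t moves the plane relative to the three-dimensional figure, which is exactly the configurable position and angle mentioned above.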
As described above, the three-dimensional hand figures reconstructed by the three-dimensional reconstruction model may be the three-dimensional hand figures corresponding to the respective human hand samples. Accordingly, for any one of the human hand samples, the three-dimensional hand figure corresponding to that sample may be projected onto the two-dimensional plane to convert it into a two-dimensional image corresponding to that sample.
In some embodiments, the three-dimensional hand figure may include preset human hand key points, which may be joints, finger connection points, and the like on the human hand, and may specifically depend on the gesture parameters corresponding to the MANO model. Accordingly, the two-dimensional image may include: a two-dimensional contour image for describing the outer contour of the human hand, a two-dimensional key point image for describing the relative positional relationships between the human hand key points, and a two-dimensional depth image for describing the geometry of the visible surface of the human hand. The relative positional relationships may include parameters such as relative direction and distance; the pixel value of a pixel point in the two-dimensional depth image is the depth value of that pixel point (for example, the distance from the photographing device to the corresponding point).
It should be noted that, by performing projection processing on the three-dimensional hand figure corresponding to a human hand sample, a two-dimensional contour image, a two-dimensional key point image, and a two-dimensional depth image can be obtained; together they form a set of two-dimensional images corresponding to that human hand sample.
The corresponding human hand can be determined from the two-dimensional contour image, the two-dimensional key point image, and the two-dimensional depth image; because these images are obtained by performing projection processing on the three-dimensional hand figure corresponding to a reasonable human hand sample, the reasonableness of the determined human hand can be ensured.
In some embodiments, when the three-dimensional hand figure is projected onto the two-dimensional plane to convert it into the two-dimensional image, any one three-dimensional hand figure may specifically be photographed by a photographing device simulated at a preset position in the three-dimensional space corresponding to that figure. For example, the three-dimensional hand figure may be photographed by a photographing device simulated at a position directly in front of it, obliquely above it at 45 degrees, and the like. The three-dimensional hand figure is thereby projected onto the imaging plane of the photographing device at each position. Then, the two-dimensional image captured by the photographing device at each position, that is, the two-dimensional projection image in the imaging plane of the photographing device, can be acquired; this two-dimensional projection image is the two-dimensional image obtained by the conversion.
By performing projection processing on the three-dimensional hand figure corresponding to a human hand sample to obtain the two-dimensional image corresponding to that sample, the diversity and reasonableness of the obtained two-dimensional images can be ensured. In addition, by setting different virtual camera positions for shooting the two-dimensional images corresponding to the human hand samples, the data volume of the two-dimensional images can be further enlarged and their diversity increased.
In some embodiments, as previously described, the two-dimensional image may include the two-dimensional contour image, the two-dimensional key point image, and the two-dimensional depth image.
Therefore, in the first aspect, when the two-dimensional image obtained by photographing is acquired, the two-dimensional contour image may be extracted from the two-dimensional projection image obtained by photographing.
In the second aspect, when the two-dimensional image obtained by shooting is obtained, the plurality of human hand key points may be specifically determined from the two-dimensional projection image obtained by shooting, and the plurality of human hand key points may be connected to obtain the two-dimensional key point image.
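The connection of key points into a two-dimensional key point image can be sketched as follows; the key point positions, the list of connections, and the sampling-based line rasterization are illustrative assumptions:

```python
import numpy as np

def draw_keypoint_image(shape, keypoints, bones, samples=200):
    """Rasterize a two-dimensional key point image: mark the line
    segments ("bones") connecting pairs of key points in a binary image.

    shape:     (H, W) of the output image.
    keypoints: K x 2 array of (row, col) positions.
    bones:     list of (i, j) index pairs of key points to connect.
    """
    img = np.zeros(shape, dtype=np.uint8)
    for i, j in bones:
        # Sample points uniformly along the segment and mark each pixel.
        ts = np.linspace(0.0, 1.0, samples)[:, None]
        seg = (1 - ts) * keypoints[i] + ts * keypoints[j]
        rows = np.clip(np.round(seg[:, 0]).astype(int), 0, shape[0] - 1)
        cols = np.clip(np.round(seg[:, 1]).astype(int), 0, shape[1] - 1)
        img[rows, cols] = 1
    return img

# Two key points connected by one bone: a diagonal line in a 5x5 image.
kps = np.array([[0.0, 0.0], [4.0, 4.0]])
kp_img = draw_keypoint_image((5, 5), kps, [(0, 1)])
```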
In the third aspect, when the two-dimensional image obtained by photographing is acquired, the two-dimensional depth image may specifically be generated according to the distance between the photographing device and each point of the three-dimensional hand figure. In practical applications, for any pixel point in the two-dimensional projection image obtained by shooting, the point corresponding to that pixel point on the three-dimensional hand figure can be determined, and the distance between the photographing device and that point on the three-dimensional hand figure is used as the pixel value of the pixel point; in this way, the pixel values of all pixel points in the two-dimensional projection image can be obtained, forming the two-dimensional depth image.
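The per-pixel distance computation described above can be sketched as follows, assuming a simulated photographing device at the origin and a simple pinhole projection; where several points project to the same pixel, the nearest one is kept:

```python
import numpy as np

def depth_image(points, shape, f=20.0):
    """Build a two-dimensional depth image: each pixel stores the
    distance from the simulated photographing device (at the origin)
    to the corresponding 3D point; pixels no point projects to stay
    at infinity.
    """
    h, w = shape
    cx, cy = w // 2, h // 2              # principal point at the centre
    depth = np.full(shape, np.inf)
    for p in points:
        u = int(round(f * p[0] / p[2])) + cx   # pinhole projection
        v = int(round(f * p[1] / p[2])) + cy
        if 0 <= v < h and 0 <= u < w:
            d = np.linalg.norm(p)        # distance to the device
            depth[v, u] = min(depth[v, u], d)  # nearest point wins
    return depth

# Two points on the same viewing ray: the nearer one (distance 2) wins.
pts = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 5.0]])
d = depth_image(pts, (9, 9))
```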
Step 206: acquiring a noise image for generating a human hand image, and inputting the two-dimensional image and the noise image into a generation model so that the generation model generates the human hand image matched with the two-dimensional image based on the noise image.
In the present embodiment, a noise image for generating a human hand image may be acquired; when the two-dimensional image is obtained, the two-dimensional image and the noise image are input into the generation model described above, and the generation model generates, based on the noise image, a human hand image matching the two-dimensional image. Specifically, one two-dimensional image corresponding to one human hand sample (or a set of two-dimensional images consisting of one two-dimensional contour image, one two-dimensional key point image, and one two-dimensional depth image corresponding to one human hand sample) and one noise image may be input into the generation model, which generates a human hand image matching the two-dimensional image based on this noise image. Taking a set of two-dimensional images consisting of a two-dimensional contour image, a two-dimensional key point image, and a two-dimensional depth image corresponding to one human hand sample as an example, for the human hand in the human hand image matching this set of two-dimensional images: the contour of the human hand is the contour in the two-dimensional contour image; the relative positional relationships between the human hand key points on the hand are the same as those between the human hand key points in the two-dimensional key point image; and the geometry of the visible surface of the human hand is determined by the pixel values of the individual pixels in the two-dimensional depth image.
In some embodiments, the generation model may include a diffusion model built based on ControlNet. ControlNet is a neural network structure through which the diffusion model can be controlled by adding additional conditions; for example, image generation by the diffusion model may be guided by the two-dimensional contour image, the two-dimensional key point image, and the two-dimensional depth image.
Specifically, the diffusion model constructed based on ControlNet may be a structure in which a ControlNet is added to the diffusion model. For example, the diffusion model can be replicated, with the copy serving as the trainable part (i.e., the ControlNet), while the original diffusion model remains the non-trainable part, preserving the original model parameters. The noise image may be input into the diffusion model alone, while the additional control conditions and the noise image are input into the ControlNet; the result obtained by fusing (for example, adding) the output of the diffusion model and the output of the ControlNet is used as the output of the diffusion model constructed based on ControlNet.
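The wiring described above, a frozen model plus a trainable identically initialized copy whose outputs are fused by addition, can be sketched with stand-in linear "networks". Real diffusion models are large U-Nets, so this shows only the structure of the fusion, not its capability:

```python
import numpy as np

def make_net(seed):
    """A stand-in 'network': one fixed linear layer, seeded so that
    two calls with the same seed produce identical weights (modelling
    the trainable copy's initialization from the frozen model)."""
    W = np.random.default_rng(seed).normal(size=(4, 4))
    return lambda x: x @ W

frozen_diffusion = make_net(1)   # original weights, kept locked
control_net = make_net(1)        # identical copy; the trainable branch

def controlled_output(noise, condition):
    """The condition enters only through the ControlNet branch; the
    two branch outputs are fused by addition, as described above."""
    return frozen_diffusion(noise) + control_net(noise + condition)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))      # stand-in noise image
y = controlled_output(x, np.zeros((1, 4)))
```

With a zero condition, the copy behaves exactly like the frozen branch, so the fused output is simply twice the frozen output; training then moves only the copy's weights as conditions are applied.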
Further, in some embodiments, the diffusion model may include a Stable Diffusion model. The Stable Diffusion model may be used as an image-to-image model for generating certain images based on noise images.
The noise image may be a randomly generated noise image. For example, a random vector may be sampled from a distribution of data and a noise image may be generated based on the sampled random vector.
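As an illustrative sketch (not part of the claimed method), such random generation of a noise image can be realized by sampling each pixel from a standard normal distribution, which is the kind of latent noise a diffusion model typically starts from; the function name and default dimensions below are assumptions for illustration:

```python
import numpy as np

def generate_noise_image(height, width, channels=3, seed=None):
    """Sample a noise image from a standard normal (Gaussian) distribution.

    Each pixel value is drawn independently; diffusion-based generation
    typically begins from exactly this kind of random latent image.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((height, width, channels)).astype(np.float32)

# A 512x512 RGB noise image, reproducible via the seed.
noise = generate_noise_image(512, 512, seed=0)
```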
A noise image refers to an image to which a random disturbance or a random signal is added. Noise causes the image to exhibit some random, undesirable visual variation. The noise image may be obtained in the form of additive noise, multiplicative noise, uniform noise, or the like. In some cases, the noise image may be generated by an actual image acquisition device, sensor, or other device; in other cases, a computer program or software may be used to generate and add a particular type of noise to the image.
Additive noise refers to generating a noise image by adding randomly generated noise to an original image. Additive noise may simulate noise introduced by the image sensor or during transmission. Common types of additive noise include gaussian noise, impulse noise (black and white pixels randomly appearing in the image), and the like.
Multiplicative noise is obtained by multiplying an original image with a randomly generated noise image. Multiplicative noise is often used to simulate lighting conditions or image corruption. For example, an image taken under low light conditions may be affected by multiplicative noise.
Uniform noise refers to the addition of uniformly distributed random disturbances in the image. The uniform noise may simulate interference signals introduced during certain image acquisition devices or transmissions.
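The three noise forms described above can be sketched as follows; the function names, the Gaussian and uniform parameters, and the `1 + n` formulation of multiplicative noise are illustrative assumptions rather than details prescribed by the text:

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.1, rng=None):
    # Additive noise: noisy = image + n, with n ~ N(0, sigma^2).
    rng = rng or np.random.default_rng()
    return image + rng.normal(0.0, sigma, image.shape)

def add_multiplicative_noise(image, sigma=0.1, rng=None):
    # Multiplicative noise: noisy = image * (1 + n); scales with pixel
    # intensity, as with illumination effects or speckle.
    rng = rng or np.random.default_rng()
    return image * (1.0 + rng.normal(0.0, sigma, image.shape))

def add_uniform_noise(image, low=-0.1, high=0.1, rng=None):
    # Uniform noise: noisy = image + u, with u ~ U(low, high).
    rng = rng or np.random.default_rng()
    return image + rng.uniform(low, high, image.shape)
```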
The diffusion model may perform denoising processing on the noise image, conditioned on the description of a human hand, until the noise image is restored to a human hand image. However, it is generally not required that the original image corresponding to the noise image itself contain a human hand; that is, the human hand image is, in essence, generated by the diffusion model.
In some embodiments, in order to enable the diffusion model constructed based on the Control Net to use the input two-dimensional image as an additional control condition and to generate, based on the input noise image, a human hand image matching the two-dimensional image, a noise image may be used as a training sample, and the two-dimensional image corresponding to the noise image may be used as a label of the noise image, so as to train the diffusion model constructed based on the Control Net.
Specifically, for the diffusion model constructed based on the Control Net described above, a loss function may be set for the Control Net and for the diffusion model, respectively. The loss function of the Control Net measures the difference between the human hand in the human hand image generated by the diffusion model constructed based on the Control Net and the human hand described by the two-dimensional image serving as the label. The loss function of the diffusion model measures the difference between the human hand image generated by the diffusion model constructed based on the Control Net and a real human hand image. In this case, the loss function of the Control Net and the loss function of the diffusion model can be considered jointly, and the entire model can be optimized by minimizing a total loss function. The total loss function is typically a weighted sum of the loss function of the Control Net and the loss function of the diffusion model, where the weighting coefficients balance the importance of the two. During training, the model parameters of the diffusion model constructed based on the Control Net are continuously updated (generally, only the model parameters of the Control Net are updated) so as to gradually improve the performance of the model and minimize the loss. Finally, the trained diffusion model constructed based on the Control Net can be used to generate a human hand image matching the two-dimensional image based on the noise image.
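The weighted total loss described above can be expressed as a one-line sketch; the coefficient names `alpha` and `beta` and their default values are placeholders that would be tuned in practice, not values prescribed by the text:

```python
def total_loss(controlnet_loss, diffusion_loss, alpha=0.5, beta=0.5):
    """Total loss as a weighted sum of the Control Net loss and the
    diffusion model loss; alpha and beta balance their importance."""
    return alpha * controlnet_loss + beta * diffusion_loss
```

Minimizing this quantity with respect to the trainable parameters (generally only those of the Control Net) drives the optimization described above.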
Further, in some embodiments, the number of input channels supported by the Control Net is extended based on the number of two-dimensional images obtained by converting the three-dimensional hand graphic. For example, assuming that the size of each two-dimensional image is 512×512 (i.e., the image has 512 pixels in each direction) and that each two-dimensional image is an RGB image: when the number of two-dimensional images obtained by converting the three-dimensional hand graphic is 1, the input supported by the Control Net has dimensions 512×512×3 (where 3 is the number of channels, corresponding to RGB); when the number of two-dimensional images is 2, the input has dimensions 512×512×6 (6 = 3×2 channels); when the number of two-dimensional images is 3, the input has dimensions 512×512×9 (9 = 3×3 channels); and so on.
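The channel expansion can be illustrated by concatenating the conditioning images along the channel axis; `stack_conditions` is a hypothetical helper name introduced here for illustration only:

```python
import numpy as np

def stack_conditions(images):
    """Concatenate conditioning images along the channel axis: k RGB
    images of size H x W yield an H x W x (3*k) input, i.e. 3*k input
    channels for the Control Net."""
    return np.concatenate(images, axis=-1)

# Three 512x512 RGB images -> a 512x512x9 input (9 = 3 * 3 channels).
stacked = stack_conditions([np.zeros((512, 512, 3), dtype=np.float32)
                            for _ in range(3)])
```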
In the above case, when the two-dimensional image and the noise image are input into the generation model so that the generation model generates a human hand image matching the two-dimensional image based on the noise image, specifically, a stitching process may first be performed on the two-dimensional images obtained by converting the three-dimensional hand graphic, and then the stitched two-dimensional image and the noise image may be input into the diffusion model constructed based on the Control Net, so that the diffusion model constructed based on the Control Net generates a human hand image matching the two-dimensional image based on the noise image.
As previously described, the set of two-dimensional images converted from the three-dimensional hand graphic may include a two-dimensional contour image, a two-dimensional key point image, and a two-dimensional depth image. In this case, the stitching process performed on the set of two-dimensional images connects the three images together in a certain arrangement (e.g., horizontal stitching, vertical stitching, grid stitching, etc.) to form a larger image. For horizontal stitching, the images may be arranged in sequence in the horizontal direction; for vertical stitching, the images may be arranged in sequence in the vertical direction; for grid stitching, the images may be arranged in a particular row-and-column order.
The image stitching can be implemented according to the following steps: first, determine the stitching mode to be used; if the images to be stitched are not uniform in size, adjust them to the same size, which can be achieved by adjusting the width and height of the images; create a blank target image according to the stitching mode and the adjusted image size, where the blank target image is large enough to accommodate all the images to be stitched; and paste the adjusted images onto the target image in the corresponding order according to the selected stitching mode.
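The steps above can be sketched as follows for the horizontal stitching mode, using nearest-neighbour resizing to unify image sizes; the helper names and the choice of resizing method are assumptions for illustration:

```python
import numpy as np

def resize_nearest(img, h, w):
    """Nearest-neighbour resize so all images share the same dimensions."""
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[rows][:, cols]

def stitch_horizontal(images, h, w):
    """Resize each image to h x w, create a blank target image wide enough
    to hold them all, and paste them in sequence left to right."""
    resized = [resize_nearest(im, h, w) for im in images]
    canvas = np.zeros((h, w * len(images), images[0].shape[2]),
                      dtype=images[0].dtype)
    for i, im in enumerate(resized):
        canvas[:, i * w:(i + 1) * w] = im  # paste at the i-th slot
    return canvas
```

Vertical or grid stitching would differ only in the shape of the blank target image and the paste positions.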
Referring to fig. 4, fig. 4 is a schematic diagram of an architecture of a diffusion model constructed based on Control Net according to an exemplary embodiment of the present application.
The diffusion model constructed based on the Control Net may include a Control Net and a diffusion model. The Control Net can be built based on a copy of the diffusion model; the diffusion model itself is an untrainable part, and the copy of the diffusion model contained in the Control Net is a trainable part. The Control Net also includes a number of zero convolution (Zero Convolution) layers. In this case, a noise image may be input into the diffusion model to obtain the output of the diffusion model. Because the two-dimensional image can include the two-dimensional contour image, the two-dimensional key point image, and the two-dimensional depth image, the two-dimensional contour image, two-dimensional key point image, and two-dimensional depth image corresponding to a human hand sample can be stitched to obtain a stitched two-dimensional image corresponding to the human hand sample. The stitched two-dimensional image is first input into the Control Net; after zero convolution processing is performed on the stitched two-dimensional image, the noise image is input into the Control Net and fused with the zero convolution result; calculation is then performed on the fusion result based on the copy of the diffusion model and a zero convolution layer, so as to obtain the output of the Control Net. The result obtained by fusing the output of the diffusion model and the output of the Control Net is a human hand image that is generated based on the noise image and matches the two-dimensional contour image, two-dimensional key point image, and two-dimensional depth image contained in the stitched two-dimensional image.
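A highly simplified numerical sketch of the fusion described above (and depicted in fig. 4) is given below; it is an assumption-laden simplification, replacing the real diffusion blocks with placeholder callables and modeling each zero convolution as a 1×1 linear map initialized to zero, so that at initialization the Control Net branch contributes nothing and the output equals that of the frozen diffusion model:

```python
import numpy as np

def zero_convolution(x, weight, bias):
    """1x1 convolution over the channel axis; in Control Net these weights
    and biases start at zero, so the branch is silent at initialization."""
    # x: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,)
    return x @ weight + bias

def controlnet_forward(noise, condition, frozen_block, trainable_block,
                       w_in, b_in, w_out, b_out):
    """Sketch of the fusion: the frozen diffusion block processes the noise
    alone; the trainable copy processes the noise fused with the
    zero-convolved condition; the two outputs are added."""
    locked = frozen_block(noise)                          # untrainable part
    fused_in = noise + zero_convolution(condition, w_in, b_in)
    control = zero_convolution(trainable_block(fused_in), w_out, b_out)
    return locked + control                               # fused output
```

With zero-initialized `w_in`, `b_in`, `w_out`, `b_out`, the result reduces to `frozen_block(noise)`, which is the design rationale of zero convolutions: training starts from the unmodified diffusion model's behavior.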
In the above technical solution, a two-dimensional hand image corresponding to a human hand sample may first be obtained; a three-dimensional hand graphic may then be reconstructed from the two-dimensional hand image by a three-dimensional reconstruction model; the three-dimensional hand graphic may then be projected onto a two-dimensional plane so as to be converted into a two-dimensional image; and finally, the two-dimensional image obtained by converting the three-dimensional hand graphic may be used by a generation model as an additional control condition, and a human hand image matching the two-dimensional image may be generated based on a noise image.
By adopting this approach, three-dimensional hand graphics of large data volume and high diversity can be obtained from a large number of different human hand samples, so that the two-dimensional images obtained by converting the three-dimensional hand graphics can serve as prior information, and this prior information can guide the generation of human hand images matching the two-dimensional images. Provided that the shape, posture, and the like of the human hand samples are reasonable, the human hand in each generated human hand image can be ensured to be reasonable, so that the generated human hand images are usable by the user, improving the user experience. The generated human hand images are likewise large in volume and diverse.
The application also provides an embodiment of a human hand image generating device corresponding to the embodiment of the human hand image generating method.
Referring to fig. 5, fig. 5 is a schematic view illustrating a structure of an apparatus according to an exemplary embodiment of the present application. At the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, and may of course also include other hardware as required. One or more embodiments of the application may be implemented in a software-based manner, for example by the processor 502 reading a corresponding computer program from the non-volatile storage 510 into the memory 508 and then running it. Of course, in addition to software implementations, one or more embodiments of the present application do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following process flows is not limited to the individual logic modules, but may also be hardware or a logic device.
Referring to fig. 6, fig. 6 is a block diagram illustrating a human hand image generating apparatus according to an exemplary embodiment of the present application.
The above-mentioned generating device of human hand image can be applied to the apparatus shown in fig. 5 to implement the technical scheme of the present application.
The device for generating the human hand image may include:
the reconstruction module 602 acquires a two-dimensional hand image corresponding to a human hand sample, and inputs the two-dimensional hand image into a three-dimensional reconstruction model, so that the three-dimensional reconstruction model reconstructs a three-dimensional hand pattern based on the two-dimensional hand image;
the projection module 604 projects the three-dimensional hand pattern to a preset two-dimensional plane so as to convert the three-dimensional hand pattern into a two-dimensional image;
the generating module 606 acquires a noise image for generating a human hand image, and inputs the two-dimensional image and the noise image into a generating model, so that the generating model generates a human hand image matched with the two-dimensional image based on the noise image.
Optionally, the three-dimensional reconstruction model comprises a parameterized model constructed based on a neural network;
inputting the two-dimensional hand image into a three-dimensional reconstruction model so that the three-dimensional reconstruction model reconstructs a three-dimensional hand figure based on the two-dimensional hand image, comprising:
inputting the two-dimensional hand image into the parameterized model constructed based on the neural network, so that the parameterized model constructed based on the neural network converts the two-dimensional hand image into three-dimensional hand parameters, and generating a three-dimensional hand graph based on the three-dimensional hand parameters.
Optionally, the parameterized model comprises a MANO model.
Optionally, the generation model comprises a diffusion model constructed based on Control Net.
Optionally, the Diffusion model comprises a Stable Diffusion model.
Optionally, the number of input channels supported by the Control Net is expanded based on the number of two-dimensional images obtained by the three-dimensional hand graphic conversion;
the inputting the two-dimensional image and the noise image into a generation model, so that the generation model generates a human hand image matched with the two-dimensional image based on the noise image, comprising:
performing stitching processing on the two-dimensional image obtained by converting the three-dimensional hand graph;
and inputting the spliced two-dimensional image and the noise image into the diffusion model constructed based on the Control Net so that the diffusion model constructed based on the Control Net generates a human hand image matched with the two-dimensional image based on the noise image.
Optionally, the three-dimensional hand pattern comprises a plurality of preset human hand key points; the two-dimensional image includes: a two-dimensional contour image for describing the outer contour of a human hand, a two-dimensional keypoint image for describing the relative positional relationship between the human hand keypoints, and a two-dimensional depth image for describing the geometry of the visible surface of a human hand.
Optionally, the projecting the three-dimensional hand graphic onto a preset two-dimensional plane to convert the three-dimensional hand graphic into a two-dimensional image includes:
performing simulation shooting on the three-dimensional hand pattern through shooting equipment simulated at a preset position in a three-dimensional space corresponding to the three-dimensional hand pattern so as to project the three-dimensional hand pattern to an imaging plane of the shooting equipment;
and acquiring a two-dimensional image obtained by shooting.
Optionally, the acquiring the two-dimensional image obtained by shooting includes:
extracting the two-dimensional contour image from the two-dimensional projection image obtained by shooting;
determining the key points of the human hands from the two-dimensional projection image obtained by shooting, and connecting the key points of the human hands to obtain the two-dimensional key point image;
and generating the two-dimensional depth image according to the distance between the shooting equipment and each pixel point corresponding to the three-dimensional hand graph.
Optionally, the acquiring a two-dimensional hand image corresponding to a human hand sample includes:
and acquiring a two-dimensional image containing a human hand sample, and detecting the human hand aiming at the two-dimensional image so as to extract a two-dimensional hand image corresponding to the human hand sample from the two-dimensional image.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The device embodiments described above are merely illustrative: the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, i.e., they may be located in one place or distributed over a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the technical scheme of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the application. In some cases, the acts or steps recited in the present application may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the application. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" refers to and encompasses any or all possible combinations of one or more of the associated listed items.
The description of the terms "one embodiment," "some embodiments," "example," "specific example," "one implementation," and the like as used in connection with one or more embodiments of the present application means that a particular feature or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. The schematic descriptions of these terms are not necessarily directed to the same embodiment. Furthermore, the particular features or characteristics described may be combined in any suitable manner in one or more embodiments of the application. Furthermore, different embodiments, as well as specific features or characteristics of different embodiments, may be combined without contradiction.
It should be understood that while the terms first, second, third, etc. may be used in one or more embodiments of the application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The foregoing description of the preferred embodiment(s) of the application is not intended to limit the embodiment(s) of the application, but is to be accorded the widest scope consistent with the principles and spirit of the embodiment(s) of the application.
The user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of related data is required to comply with the relevant laws and regulations and standards of the relevant country and region, and is provided with corresponding operation entries for the user to select authorization or rejection.