CN120259461A - Image generation model training, image generation method, device, equipment and medium - Google Patents

Image generation model training, image generation method, device, equipment and medium

Info

Publication number
CN120259461A
Authority
CN
China
Prior art keywords
vector
rotation matrix
training
position rotation
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510294423.7A
Other languages
Chinese (zh)
Inventor
李永波
胡一江
杨跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd
Priority to CN202510294423.7A
Publication of CN120259461A
Legal status: Pending

Abstract


The disclosed embodiments relate to training of an image generation model and to an image generation method, device, equipment and medium. The method comprises: obtaining a training panoramic image sample and a description text corresponding to the training panoramic image sample and inputting them into an image generation model to be trained; performing position encoding on the input vector corresponding to the training panoramic image sample to obtain a position rotation matrix corresponding to the input vector; calculating an output vector based on the position rotation matrix corresponding to the input vector; and adjusting the model parameters of the image generation model to be trained based on the loss value between the training vector of the training panoramic image sample and the output vector to obtain the image generation model. With this technical solution, the pixel positions of the panoramic image are represented by position rotation matrices in order to train the image generation model, so that during panoramic image generation the distance information of each pixel in the generated panoramic image can be determined based on the position rotation matrix, thereby improving the generation effect of the panoramic image.

Description

Training of image generation model, image generation method, device, equipment and medium
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to training of an image generation model, an image generation method, an image generation device, image generation equipment and a medium.
Background
At present, image generation algorithms are rapidly developed and widely used in various fields.
Existing image generation algorithms apply only to the generation of perspective views and cannot handle 360-degree panoramic images. Specifically, the texture distribution of a panoramic image differs considerably from that of a perspective view, and the position encoding used for panoramic images in existing image generation algorithms cannot represent this texture distribution well, so the panoramic image generation effect is relatively poor.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present disclosure provides a training method of an image generation model, an image generation method, an image generation device and a medium.
The embodiment of the disclosure provides a training method of an image generation model, which comprises the following steps:
acquiring training data pairs, wherein the training data pairs comprise training panoramic image samples and descriptive texts corresponding to the training panoramic image samples;
inputting the training panoramic image sample and the description text into a pre-constructed image generation model to be trained, performing position encoding on an input vector corresponding to the training panoramic image sample through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the input vector, and calculating an output vector based on the position rotation matrix corresponding to the input vector;
and adjusting model parameters of the image generation model to be trained based on the loss value between the training vector of the training panoramic image sample and the output vector to obtain an image generation model.
The embodiment of the disclosure also provides an image generation method, which comprises the following steps:
acquiring an image generation description text;
Inputting the image generation description text into an image generation model, carrying out position coding on a target vector corresponding to a preset noise vector through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the target vector, calculating a generation vector based on the position rotation matrix corresponding to the target vector, and decoding the generation vector to obtain a target panoramic image;
wherein the image generation model is obtained according to the training method of the image generation model according to any one of the preceding embodiments.
The embodiment of the disclosure also provides a training device for the image generation model, which comprises:
The first acquisition module is used for acquiring training data pairs, wherein the training data pairs comprise training panoramic image samples and description texts corresponding to the training panoramic image samples;
The input module is used for inputting the training panoramic image sample and the description text into a pre-constructed image generation model to be trained, so as to perform position encoding on an input vector corresponding to the training panoramic image sample through a preset position rotation matrix formula, obtain a position rotation matrix corresponding to the input vector, and calculate an output vector based on the position rotation matrix corresponding to the input vector;
And the training module is used for adjusting the model parameters of the image generation model to be trained based on the loss value between the training vector of the training panoramic image sample and the output vector to obtain an image generation model.
The embodiment of the disclosure also provides an image generating device, which comprises:
the second acquisition module is used for acquiring the image generation description text;
The generation module is used for inputting the image generation description text into an image generation model, carrying out position coding on a target vector corresponding to a preset noise vector through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the target vector, calculating a generation vector based on the position rotation matrix corresponding to the target vector, and decoding the generation vector to obtain a target panoramic image;
wherein the image generation model is obtained according to the training method of the image generation model according to any one of the preceding embodiments.
The embodiment of the disclosure also provides an electronic device comprising a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to read the executable instructions from the memory and execute them to implement the image generation model training and image generation methods provided by the embodiments of the present disclosure.
The embodiment of the present disclosure also provides a computer readable storage medium storing a computer program for executing the training of the image generation model, the image generation method as provided by the embodiment of the present disclosure.
The embodiment of the disclosure also provides a computer program product, comprising a computer program, wherein the computer program is used for executing the training and image generation method of the image generation model provided by the embodiment of the disclosure by a processor.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has the following advantages: a training data pair is obtained, where the training data pair comprises a training panoramic image sample and a description text corresponding to the training panoramic image sample; the training panoramic image sample and the description text are input into a pre-constructed image generation model to be trained; the input vector corresponding to the training panoramic image sample is position-encoded through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the input vector; an output vector is calculated based on the position rotation matrix corresponding to the input vector; and model parameters of the image generation model to be trained are adjusted based on the loss value between the training vector of the training panoramic image sample and the output vector to obtain the image generation model. With this technical solution, the pixel positions of the panoramic image are represented by the position rotation matrix in order to train the image generation model, so that during panoramic image generation the position rotation matrix obtained from the image generation model can be used to determine the distance information of each pixel in the generated panoramic image, improving the panoramic image generation effect.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flowchart of a training method of an image generation model according to an embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a panoramic spherical coordinate system provided by an embodiment of the present disclosure;
FIG. 2B is a view of a panorama expanded coordinate system provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of image generation model training provided by an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a Self-attention module provided in an embodiment of the present disclosure;
FIG. 5 is a flowchart of another training method for an image generation model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an image generation provided by an embodiment of the present disclosure;
fig. 7 is a flowchart of an image generating method according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of another image generation provided by an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a training device for an image generation model according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "and" some embodiments "means" at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will appreciate that they are to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In general, the texture distribution of a panoramic image differs greatly from that of a perspective view. Compared with a perspective view, the texture structure of a panoramic image has the following characteristics: the aspect ratio of the image is 2:1; the horizontal field of view is 360 degrees and the vertical field of view is 180 degrees; the left and right sides of the image are continuous because the 360-degree horizontal field of view wraps around, while the upper and lower edges are discontinuous because the vertical field of view spans only 180 degrees; and, since the imaged environment is projected onto a unit sphere, structures that are straight in the horizontal direction are distorted into arcs. The position encoding used for panoramic images in existing image generation algorithms cannot represent this texture distribution well, so the panoramic image generation effect is poor.
To address these problems, the present disclosure provides a training scheme for an image generation model: a training data pair is obtained, where the training data pair comprises a training panoramic image sample and a description text corresponding to the training panoramic image sample; the training panoramic image sample and the description text are input into a pre-constructed image generation model to be trained; the input vector corresponding to the training panoramic image sample is position-encoded through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the input vector; an output vector is calculated based on the position rotation matrix corresponding to the input vector; and model parameters of the image generation model to be trained are adjusted based on the loss value between the training vector and the output vector to obtain the image generation model. With this technical solution, the pixel positions of the panoramic image are represented by the position rotation matrix in order to train the image generation model, so that during panoramic image generation the position rotation matrix obtained from the image generation model can be used to determine the distance information of each pixel in the generated panoramic image, improving the panoramic image generation effect.
Fig. 1 is a flow chart of a training method of an image generation model according to an embodiment of the present disclosure, where the method may be performed by a training apparatus of the image generation model, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 1, the method includes:
And 101, acquiring a training data pair, wherein the training data pair comprises a training panoramic image sample and a description text corresponding to the training panoramic image sample.
The training panoramic image sample may be any panoramic image; the embodiments of the present disclosure do not limit it. For example, the training panoramic image sample may be a stitched panoramic image of multiple rooms of a show flat in a building, or a stitched panoramic image of a room in a house for sale.
The description text refers to text information describing the training panoramic image sample. The image content recognition result of the training panoramic image sample can be used as the description text: for example, recognizing a stitched room panorama may yield the result "a living room comprising a sofa and a television", which then serves as the description text of that sample. Alternatively, text information input by a user for the training panoramic image sample can be received and used as the description text.
Specifically, in the process of training the image generation model, a plurality of training data pairs can be obtained, and each training data pair comprises a training panoramic image sample and a description text corresponding to the training panoramic image sample.
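As an illustrative sketch only (the patent does not prescribe any data format), a training data pair might be represented in Python as follows; the field names are hypothetical.

```python
# A minimal sketch of one training data pair; names are illustrative,
# not from the patent.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    panorama_path: str   # equirectangular panorama, aspect ratio 2:1
    description: str     # text describing the image content

pair = TrainingPair(
    panorama_path="living_room_pano.jpg",
    description="A living room with TV and sofa",
)
```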
Step 102, inputting training panoramic image samples and descriptive text into a pre-constructed image generation model to be trained, carrying out position coding on input vectors corresponding to the training panoramic image samples through a preset position rotation matrix formula, obtaining a position rotation matrix corresponding to the input vectors, and calculating output vectors based on the position rotation matrix corresponding to the input vectors.
The image generation model to be trained is a pre-constructed network architecture combining a backbone such as a Diffusion Transformer (DiT, a diffusion-based deep learning framework) or SDXL (Stable Diffusion XL, an open-source text-to-image framework) with one or more VAEs (Variational Autoencoders); the specific choice and configuration depend on the actual application scenario.
Specifically, after the training panoramic image sample and the description text are input into the pre-constructed image generation model to be trained, the training panoramic image sample is first encoded by the VAE to obtain a training vector. The training vector is the training panoramic image sample converted into a processable numerical form, i.e., features are extracted from the sample and encoded into a vector representation. The training vector and the description text are then processed by DiT or SDXL, which contains a Self-attention module; the vector input to the Self-attention module can be understood as the training vector after network encoding and other processing, and the vector processed by the Self-attention module can better learn the distance relationship between pixels.
It can be understood that, before the Self-attention module is input, the training panoramic image sample is subjected to a series of network coding to obtain an input vector proportional to the scale of the training panoramic image sample (original image), the input vector is subjected to position coding through a preset position rotation matrix formula to obtain a position rotation matrix corresponding to the input vector, and the output vector is calculated based on the position rotation matrix corresponding to the input vector.
The input vector refers to the vector input to the Self-attention module for processing, and it is proportional to the scale of the training panoramic image sample. For example, the input vector X of the Self-attention module has size (c, h, w), where c is the number of channels and h and w give the vector size, proportional to the size of the original image (the training panoramic image sample). The position rotation matrix formula refers to the calculation formula for the position rotation matrix corresponding to the input vector, and the position rotation matrix corresponding to the input vector refers to the rotation matrices of the sub-vectors of the input vector; continuing the example, these are the rotation matrices of the sub-vectors at each position (h_i, w_i) of the input vector (c, h, w).
The position rotation matrix formula is preset; specifically, each position rotation matrix parameter in the formula, and the calculation formula for the parameter value corresponding to each such parameter, are determined in advance.
In the embodiments of the present disclosure, after the input vector is acquired, it is position-encoded through the position rotation matrix formula to obtain the position rotation matrix corresponding to the input vector. As one example: the input vector corresponding to the training panoramic image sample is acquired; each sub-vector corresponding to the input vector is acquired; the pixel position information corresponding to each sub-vector is encoded according to the position rotation matrix formula; and the resulting sub-position rotation matrices are combined to obtain the position rotation matrix corresponding to the input vector. As another example: the parameter values corresponding to all position rotation matrix parameters in the formula are determined according to the pixel position coordinates of each sub-vector of the input vector under a target coordinate system, yielding a plurality of sub-position rotation matrices that are combined to obtain the position rotation matrix corresponding to the input vector. These two ways are merely examples; the present disclosure does not limit the specific implementation of position-encoding the input vector through the position rotation matrix formula.
After the input vector is position-encoded, it is input into the Self-attention module. For example, the input vector passes through linear layers to obtain a first vector, a second vector, and a third vector; a correlation calculation is performed between the first and second vectors, and a new vector is output after the third vector is applied to the correlation result. When the correlation between the first and second vectors is calculated, the position rotation matrix corresponding to the input vector allows the distance between the two vectors to be learned better, so the pixel distances between pixels in the training panoramic image sample can be learned. This improves the training effect of the image generation model and thus the generation effect of panoramic images produced by it.
It can be understood that the vector output by the Self-attention module is further processed in the DiT or SDXL network architecture to output the vector.
In the embodiments of the present disclosure, after the training panoramic image sample and its corresponding description text are obtained, they can be input into the pre-constructed image generation model to be trained, so that the input vector corresponding to the training panoramic image sample is position-encoded through the preset position rotation matrix formula, the position rotation matrix corresponding to the input vector is obtained, and the output vector is calculated based on the position rotation matrix corresponding to the input vector.
And step 103, adjusting model parameters of the image generation model to be trained based on the loss value between the training vector and the output vector of the training panoramic image sample to obtain the image generation model.
The training vector is used for converting the training panoramic image sample into a numerical form which can be processed, namely extracting features from the training panoramic image sample and encoding the features into a vector representation.
In the embodiments of the present disclosure, a training vector corresponding to the training panoramic image sample is obtained; specifically, the training panoramic image sample is encoded to obtain the training vector, and more specifically, it is encoded through a VAE (Variational Autoencoder).
In the embodiments of the present disclosure, the similarity between the training vector and the output vector can be calculated, and a loss value computed from the similarity through a preset loss function (such as a cross-entropy loss function). When the loss value is greater than or equal to a preset loss threshold, the model parameters of the image generation model to be trained are adjusted, the loss value is recomputed for a new training data pair, and the new loss value is again compared with the threshold, until the loss value is smaller than the threshold and the image generation model is obtained.
Specifically, after the output vector is obtained, model parameters of an image generation model to be trained are adjusted based on a loss value between a training vector of a training panoramic image sample and the output vector, and an image generation model is obtained.
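A minimal Python sketch of this parameter-adjustment step is given below, assuming a VAE wrapper exposing an encode() method and an MSE-style loss standing in for the unspecified loss function; all names and signatures are illustrative, not the patent's API.

```python
# A sketch of one training step under the stated assumptions; `model`
# and `vae` are hypothetical wrappers, not a specific library API.
import torch
import torch.nn.functional as F

def train_step(model, vae, image, text, optimizer):
    with torch.no_grad():
        train_vec = vae.encode(image)     # training vector (latent)
    out_vec = model(train_vec, text)      # output vector from the network
    # Loss between training vector and output vector, as described above;
    # MSE is an assumed stand-in for the preset loss function.
    loss = F.mse_loss(out_vec, train_vec)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # adjust model parameters
    return loss.item()
```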
The training scheme of the image generation model provided by the embodiments of the present disclosure acquires a training data pair comprising a training panoramic image sample and a description text corresponding to the training panoramic image sample; inputs the sample and text into a pre-constructed image generation model to be trained; position-encodes the input vector corresponding to the sample through a preset position rotation matrix formula to obtain the position rotation matrix corresponding to the input vector; calculates the output vector based on that matrix; and adjusts the model parameters of the image generation model to be trained based on the loss value between the training vector of the sample and the output vector to obtain the image generation model. With this technical solution, the pixel positions of the panoramic image are represented by the position rotation matrix in order to train the image generation model, so that during panoramic image generation the position rotation matrix obtained from the image generation model can be used to determine the distance information of each pixel in the generated panoramic image, improving the panoramic image generation effect.
In some embodiments, performing position encoding on the input vector corresponding to the training panoramic image sample through the preset position rotation matrix formula to obtain the position rotation matrix corresponding to the input vector comprises: obtaining the input vector corresponding to the training panoramic image sample; obtaining each sub-vector corresponding to the input vector; encoding the pixel position information corresponding to each sub-vector according to the preset position rotation matrix formula; and combining the resulting sub-position rotation matrices to obtain the position rotation matrix corresponding to the input vector.
In the embodiments of the present disclosure, the training vector is obtained after the training panoramic image sample is encoded; the input vector is obtained after a series of network encoding steps in DiT or SDXL; the input vector is position-encoded and then input to the Self-attention module to learn the pixel position relationships between pixels in the training panoramic image sample; and the output vector is produced by further network encoding.
Specifically, each sub-vector corresponding to the input vector is obtained, pixel position information corresponding to each sub-vector is encoded according to a preset position rotation matrix formula, and a plurality of sub-position rotation matrices are obtained and combined to obtain a position rotation matrix corresponding to the input vector.
In the embodiments of the present disclosure, there are various ways to encode the pixel position information corresponding to each sub-vector according to the position rotation matrix formula, obtain a plurality of sub-position rotation matrices, and combine them into the position rotation matrix corresponding to the input vector. As one example: each position rotation matrix parameter is determined based on the position rotation matrix formula; the parameter value corresponding to each parameter is determined based on the pixel position information corresponding to each sub-vector; the parameter values are input into the position rotation matrix formula to obtain the sub-position rotation matrix corresponding to each sub-vector; and the sub-position rotation matrices are combined to obtain the position rotation matrix corresponding to the input vector.
As another example: the target pixel position coordinates of each sub-vector in a target coordinate system are determined based on the pixel position information corresponding to each sub-vector; a first position rotation matrix parameter value is determined based on preset pose parameters, the target pixel position coordinates of each sub-vector, and a preset first matrix parameter calculation formula; a second position rotation matrix parameter value is determined based on the target pixel position coordinates of each sub-vector and a preset second matrix parameter calculation formula; the first and second position rotation matrix parameter values are input into the position rotation matrix formula to obtain the sub-position rotation matrix of each sub-vector; and the sub-position rotation matrices are combined to obtain the position rotation matrix corresponding to the input vector.
In the embodiments of the present disclosure, the target coordinate system refers to a polar coordinate system. It is understood that the image pixel coordinates in the training panoramic image sample are usually given in a Cartesian coordinate system, i.e., the initial coordinate system is Cartesian, so a coordinate system conversion is required.
The method comprises the steps of obtaining initial pixel position coordinates of each sub-vector corresponding to an initial coordinate system based on pixel position information corresponding to each sub-vector, and converting the initial pixel position coordinates of each sub-vector according to a preset coordinate system conversion formula to obtain target pixel position coordinates of each sub-vector in a target coordinate system.
Specifically, the coordinate system of the training panoramic image sample is shown in FIG. 2A. Each pixel corresponds to a point on the unit sphere, which can be expressed in Cartesian coordinates as (x, y, z) with x, y, z ∈ [−1, 1], or in polar coordinates as (θ, φ), where θ ∈ [−π/2, π/2] and φ ∈ [−π, π].
Specifically, the transformation relationship between the two coordinate systems is shown in formulas (1) and (2).
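Since the bodies of formulas (1) and (2) are not reproduced in this text, the following Python sketch shows a standard equirectangular pixel-to-polar and polar-to-Cartesian conversion consistent with the stated ranges; the exact axis convention is an assumption.

```python
# A minimal sketch of the Cartesian <-> polar conversion on the unit
# sphere implied by formulas (1) and (2); axis order and signs are
# assumed, since the formula bodies are elided in the source text.
import numpy as np

def pixel_to_polar(u, v, width, height):
    """Map an equirectangular pixel (u, v) to polar angles (theta, phi)."""
    phi = (u / width) * 2.0 * np.pi - np.pi        # phi in [-pi, pi]
    theta = np.pi / 2.0 - (v / height) * np.pi     # theta in [-pi/2, pi/2]
    return theta, phi

def polar_to_cartesian(theta, phi):
    """Point on the unit sphere at latitude theta and longitude phi."""
    x = np.cos(theta) * np.cos(phi)
    y = np.cos(theta) * np.sin(phi)
    z = np.sin(theta)
    return np.array([x, y, z])
```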
In particular, the pose of an object in space can be represented by a rotation (R) and a translation (t), i.e., a transform T = [R | t], where R satisfies R R^T = I and det(R) = 1, and t is a three-dimensional translation vector.
Meanwhile, the rotation matrix can be calculated using preset pose parameters (yaw, pitch, roll). Setting yaw = φ, pitch = θ, and roll = 0 (with the same definitions as elsewhere), the rotation matrix of the position of any pixel in the training panoramic image sample can be expressed as shown in equation (3).
Further, since the panoramic image is continuous in the horizontal direction but discontinuous in the vertical direction, absolute position information in the θ direction is introduced through the translation, as shown in formula (4).
t = [θ/(π/2), θ/(π/2), θ/(π/2)]^T (4)
Therefore, the preset rotational position encoding formula is shown as formula (5).
Wherein the first position rotation matrix parameter R is from equation (3) and the second position rotation matrix parameter t is from equation (4).
Therefore, the first position rotation matrix parameter value and the second position rotation matrix parameter value can be determined based on the pixel position information corresponding to each sub-vector, so as to obtain a sub-position rotation matrix of each sub-vector, and finally, a plurality of sub-position rotation matrices are combined to obtain a position rotation matrix corresponding to the input vector.
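The following Python sketch assembles a sub-position rotation matrix in the spirit of equations (3)-(5), assuming yaw = φ, pitch = θ, roll = 0 with a ZYX rotation order, and assuming R and t are combined into a single 4×4 matrix (consistent with the later remark that the channel count must be a multiple of 4 so that R_pano can act on 4-dimensional sub-vectors). These conventions are assumptions where the patent elides the formula bodies.

```python
# A sketch of equations (3)-(5) under the stated assumptions; the ZYX
# rotation order and the 4x4 assembly are assumed conventions.
import numpy as np

def rotation_from_angles(theta, phi):
    """Rotation matrix for yaw=phi, pitch=theta, roll=0 (ZYX order assumed)."""
    cy, sy = np.cos(phi), np.sin(phi)      # yaw about z
    cp, sp = np.cos(theta), np.sin(theta)  # pitch about y
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    return Rz @ Ry                         # roll = 0, so no third factor

def position_rotation_matrix(theta, phi):
    """4x4 R_pano combining the rotation (3) and the translation (4)."""
    R = rotation_from_angles(theta, phi)
    t = np.full(3, theta / (np.pi / 2.0))  # absolute position along theta
    R_pano = np.eye(4)
    R_pano[:3, :3] = R
    R_pano[:3, 3] = t
    return R_pano
```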
It can be understood that, after the position rotation matrix corresponding to the input vector is obtained, subsequent correlation calculations through this matrix allow the pixel distance between pixels in the training panoramic image sample to be learned. Because all pixels of the panoramic image are distributed on one unit sphere, a complete representation of the panoramic pixel space is achieved (each point has exactly one corresponding rotation matrix), so the distance between two pixel points can be represented by the included angle between the two view directions; the angle calculation formula is shown as formula (6).
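Since the body of formula (6) is not reproduced here, the sketch below uses the standard geodesic angle between two rotation matrices, arccos((tr(R1^T R2) − 1)/2), as a plausible reading; this is an assumption.

```python
# A sketch of the included-angle computation referenced as formula (6);
# the trace form is the standard angle between two rotations and is an
# assumption, since the source elides the exact expression.
import numpy as np

def angle_between(R1, R2):
    """Geodesic angle between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))
```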
In the scheme, the position rotation matrix of each sub-vector corresponding to the input vector can be obtained, so that the pixel distance between each pixel point in the training panoramic image sample can be obtained in subsequent processing, the correlation between each pixel point in the training panoramic image sample can be quickly and effectively learned, the trained image generation model can determine the distance information of each pixel point in the generated panoramic image based on the position rotation matrix in the panoramic image generation process, and the panoramic image generation effect is improved.
Based on the description of the embodiment, the pre-constructed image generation model to be trained can be trained through training data, and the input vector corresponding to the training panoramic image sample is subjected to position coding through a preset position rotation matrix formula in the training process, so that the image generation model is obtained, and the generation effect of generating the panoramic image is improved.
Specifically, as shown in fig. 3, the pre-constructed image generation model to be trained is built on a Diffusion Transformer (DiT) framework. In the training process, the input image is first encoded by the VAE to obtain an image latent vector; the latent vector and the description text, such as "A living room with TV and sofa", are then input into DiT for the noise-adding diffusion process shown in fig. 3; finally an output latent vector is obtained, which can be decoded by the VAE decoder into an image similar to the original. However, such pipelines are designed for perspective views and cannot learn the texture distortion characteristics of panoramic images, resulting in poor panoramic image generation.
Specifically, the Self-attention module is widely used in image and language models, is one of the key modules of a diffusion network, and is applied in text-to-image frameworks (such as DiT, SDXL, etc.). As shown in fig. 4, the input vector X passes through linear layers to obtain three vectors {Q, K, V}; Q and K undergo a correlation calculation consisting of the matmul (matrix multiplication), scale, and softmax (normalized exponential) operations shown in fig. 4, and the correlation result is applied to V by a further matmul to obtain the output. The Self-attention module thus realizes autocorrelation between elements of the input sequence; its formula is shown as formula (7):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V (7)
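A minimal single-head Python sketch of formula (7), the standard scaled dot-product attention; shapes and names are illustrative.

```python
# A sketch of formula (7) with single-head shapes (L, d) for brevity.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (L, c); w_q/w_k/w_v: (c, d) linear-layer weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # linear layers
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # matmul + scale
    return F.softmax(scores, dim=-1) @ v      # softmax + matmul
```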
In the embodiment of the disclosure, the pixel point positions of the panoramic image are represented through the position rotation matrix to train the image generation model, so that the distance information of each pixel point in the panoramic image can be determined and generated based on the position rotation matrix in the panoramic image generation process, and the panoramic image generation effect is improved.
Specifically, rotary position embedding (Rotary Position Embedding, RoPE) can be applied in an image generation model, e.g., as position encoding for a one-dimensional sequence. It extrapolates well and converges faster with respect to sequence length, i.e., it achieves the effect of relative position encoding through an absolute position encoding scheme.
As shown in equation (8), the correlation between Q and K is obtained by an inner product calculation, which RoPE improves by designing a function f such that the final result is equivalent to relative position encoding.
⟨f(Q_m, m), f(K_n, n)⟩ = g(Q_m, K_n, m − n) (8)
where Q and K correspond to Q and K in the Self-attention module, m and n denote positions, and m, n ∈ [0, L−1] assuming Q and K have length L. That is, by designing a suitable function f, the same inner product calculation equivalently realizes g(Q_m, K_n, m − n), which depends on the relative position (m − n).
In the related RoPE coding mode, f is designed as shown in formula (9).
where α is a preset fixed parameter. The encoding f is realized by splitting Q and K into channel pairs and applying a 2×2 rotation matrix to each pair; by derivation, the inner product calculation of formula (9) can be converted into formula (10).
(R_m Q_m)^T (R_n K_n) = Q_m^T R_m^T R_n K_n = Q_m^T R_(m−n) K_n (10)
i.e., g(Q_m, K_n, m − n) = Q_m^T R_(m−n) K_n.
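For reference, a Python sketch of the standard 1D RoPE rotation described by formulas (9) and (10); the frequency schedule base^(−2j/d) is the usual choice and is assumed here.

```python
# A sketch of standard 1D RoPE: channels are paired and each pair is
# rotated by an angle that grows with the token position m.
import torch

def rope_1d(x, positions, base=10000.0):
    """x: (L, d) with d even; positions: (L,) integer token positions."""
    L, d = x.shape
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * freqs[None, :]   # (L, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2x2 rotation per channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```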
When RoPE is used for an image task, the horizontal and vertical dimensions of the image are decoupled and each is encoded one-dimensionally. This works for ordinary perspective images, but the horizontal and vertical dimensions of a panoramic image are not continuous in real space in this way, so they cannot be well expressed by this method.
In the training process of the image generation model in the embodiments of the present disclosure, the texture structure of the panoramic image can be effectively expressed through three-dimensional rotary position encoding, which realizes the effects of both relative and absolute position encoding in the form of absolute position encoding. It can be conveniently applied to generation models (such as DiT, SDXL, etc.) to realize panoramic image generation, as described in detail with reference to FIG. 5.
Fig. 5 is a flowchart of another training method for an image generation model according to an embodiment of the present disclosure, where the training method for an image generation model is further optimized based on the foregoing embodiment. As shown in fig. 5, the method includes:
Step 201, acquiring a training data pair, wherein the training data pair comprises a training panoramic image sample and a description text corresponding to the training panoramic image sample.
It should be noted that, step 201 is the same as step 101, and detailed description of step 101 is specifically referred to, and will not be described in detail herein.
Step 202, inputting training panoramic image samples and description texts into a pre-constructed image generation model to be trained so as to obtain input vectors corresponding to the training panoramic image samples and obtain each sub-vector corresponding to the input vectors.
Step 203, coding pixel position information corresponding to each sub-vector according to a position rotation matrix formula, obtaining a plurality of sub-position rotation matrices, combining to obtain a position rotation matrix corresponding to the input vector, and calculating an output vector based on the position rotation matrix corresponding to the input vector.
In the embodiments of the present disclosure, encoding the pixel position information corresponding to each sub-vector according to the position rotation matrix formula and combining the resulting sub-position rotation matrices into the position rotation matrix corresponding to the input vector comprises: determining the target pixel position coordinates of each sub-vector in a target coordinate system based on the pixel position information corresponding to each sub-vector; determining a first position rotation matrix parameter value based on preset pose parameters, the target pixel position coordinates of each sub-vector, and a preset first matrix parameter calculation formula; determining a second position rotation matrix parameter value based on the target pixel position coordinates of each sub-vector and a preset second matrix parameter calculation formula; inputting the first and second position rotation matrix parameter values into the position rotation matrix formula to obtain the sub-position rotation matrix of each sub-vector; and combining the sub-position rotation matrices to obtain the position rotation matrix corresponding to the input vector.
In the embodiments of the present disclosure, determining the target pixel position coordinates of each sub-vector in the target coordinate system based on the pixel position information corresponding to each sub-vector comprises: acquiring the initial pixel position coordinates of each sub-vector in the initial coordinate system based on the pixel position information corresponding to each sub-vector; and converting the initial pixel position coordinates of each sub-vector according to a preset coordinate system conversion formula to obtain the target pixel position coordinates of each sub-vector in the target coordinate system.
Specifically, before the Self-attention module, the input image undergoes a series of network encodings to obtain an input vector X proportional to the original image scale. Suppose X has size (c, h, w), with (h, w) = (H/N, W/N) for an original image of size (H, W) and N a rational number; X is then aligned with the original image in the width and height dimensions, so it can be encoded per pixel of the panoramic image. That is, for the sub-vector X_i ∈ R^(c×1) at any position (h_i, w_i) of X, the position is encoded as the rotation matrix R_pano_i given by formula (5). Referring to the Self-attention structure of FIG. 4, the corresponding {Q_i, K_i} are obtained and regrouped into blocks of four channels; since the channel number c of the input vector X is a multiple of 4, R_pano_i can be applied to {Q_i, K_i}, as shown in equation (11).
f(Q_i, i) = R_pano_i Q_i (11)
Thus, the pixel positions of the panoramic image are represented in the form of rotation matrices, and the image generation model is subsequently trained on this basis, realizing an improved panoramic image generation effect. A sketch of how equation (11) might be applied in practice is shown below.
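The following Python sketch assumes Q is flattened to (h·w, c) and the channels are regrouped into consecutive blocks of 4; the regrouping convention is not specified in this text and is an assumption.

```python
# A sketch of equation (11): each 4-channel block of Q_i is multiplied
# by the per-pixel 4x4 R_pano matrix. The block grouping is assumed.
import numpy as np

def apply_pano_rope(q, r_pano):
    """q: (h*w, c) with c % 4 == 0; r_pano: (h*w, 4, 4) per-pixel matrices."""
    n, c = q.shape
    blocks = q.reshape(n, c // 4, 4)                     # (n, c/4, 4)
    rotated = np.einsum("nij,nkj->nki", r_pano, blocks)  # R_pano @ block
    return rotated.reshape(n, c)
```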
And 204, adjusting model parameters of the image generation model to be trained based on the loss value between the training vector of the training panoramic image sample and the output vector to obtain the image generation model.
It should be noted that, step 204 is the same as step 103, and specific reference is made to the detailed description of step 103, which is not described in detail herein.
Specifically, an image generation model to be trained is built according to the network parameter design; the input vector of each Self-attention module is obtained; the rotation matrix R_pano of each position of the input vector is calculated based on formula (5); during training and inference, R_pano is applied to {Q, K} of the Self-attention module in the RoPE encoding manner; and the network is trained and run according to the original model architecture to obtain the image generation model.
The training scheme of the image generation model provided by the embodiments of the present disclosure obtains a training data pair comprising a training panoramic image sample and a corresponding description text; inputs the sample and text into a pre-constructed image generation model to be trained to obtain the input vector corresponding to the sample and each of its sub-vectors; encodes the pixel position information corresponding to each sub-vector according to the position rotation matrix formula, obtaining sub-position rotation matrices that are combined into the position rotation matrix corresponding to the input vector; calculates the output vector based on that matrix; and adjusts the model parameters of the image generation model to be trained based on the loss value between the training vector of the sample and the output vector, obtaining the image generation model. With this technical solution, the pixel position of every pixel of the panoramic image is represented by a position rotation matrix, so the pixel distances between pixels in the panoramic image can be accurately determined to train the image generation model; during panoramic image generation, the position rotation matrix obtained from the image generation model determines the distance information of each pixel, improving the panoramic image generation effect.
Based on the description of the foregoing embodiment, after the image generation model is obtained, image generation may be performed with it. As illustrated in fig. 6, the image latent vector is replaced by a noise vector (the noise shown in fig. 6), and an image conforming to the text description "A living room with TV and sofa" is obtained through the DiT network and the VAE decoder, as described in detail below in connection with fig. 7.
Fig. 7 is a flowchart of an image generating method according to an embodiment of the present disclosure, where the method may be performed by an image generating apparatus, and the apparatus may be implemented by using software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 7, the method includes:
Step 301, acquiring an image to generate descriptive text.
Step 302, inputting the image generation description text into an image generation model, so as to perform position coding on a target vector corresponding to a preset noise vector through a preset position rotation matrix formula, obtain a position rotation matrix corresponding to the target vector, calculate a generated vector based on the position rotation matrix corresponding to the target vector, and decode the generated vector to obtain the target panoramic image.
In the embodiments of the present disclosure, the image generation model is acquired by the training method of the image generation model described in the foregoing embodiments.
In the embodiments of the present disclosure, the image generation description text, such as "A living room with TV and sofa", is input according to the actual generation requirement, and the relevant noise vector is preset.
Specifically, performing position encoding on the target vector corresponding to the preset noise vector through the preset position rotation matrix formula to obtain the position rotation matrix corresponding to the target vector comprises: obtaining the target vector corresponding to the noise vector; obtaining each sub-target vector corresponding to the target vector; encoding the pixel position information corresponding to each sub-target vector according to the position rotation matrix formula; and combining the resulting sub-target position rotation matrices to obtain the position rotation matrix corresponding to the target vector.
More specifically: the target pixel position coordinates of each sub-target vector in the target coordinate system are determined based on the pixel position information corresponding to each sub-target vector; a first position rotation matrix parameter value is determined based on the preset pose parameters, the target pixel position coordinates of each sub-target vector, and the preset first matrix parameter calculation formula; a second position rotation matrix parameter value is determined based on the target pixel position coordinates of each sub-target vector and the preset second matrix parameter calculation formula; the first and second position rotation matrix parameter values are input into the position rotation matrix formula to obtain the sub-target position rotation matrix of each sub-target vector; and the sub-target position rotation matrices are combined to obtain the position rotation matrix corresponding to the target vector.
In some embodiments, the method further comprises: calculating a target angle based on the sub-target position rotation matrices corresponding to any two sub-target vectors and a preset angle calculation formula; determining pixel distance information between the two sub-target vectors based on the target angle; and determining the pixel position relationship of the target panoramic image based on the pixel distance information.
The angle calculation formula is formula (6): inputting the sub-target position rotation matrices corresponding to any two sub-target vectors into formula (6) yields a target angle, which represents the distance between the two pixel points. The pixel distance information between any two sub-target vectors can therefore be determined, and the pixel position relationship of the target panoramic image determined from it. It can be understood that a panoramic image can be rotated arbitrarily in the left-right direction but has absolute positions at the south and north poles; absolute position information is therefore added into the position rotation matrix through the second position rotation matrix parameter, finally realizing a relative position effect horizontally and an absolute position effect vertically.
Specifically, in the image generation process, the pixel distance between each pixel point in the generated target panoramic image can be accurately determined through the encoding mode of the position rotation matrix, so that the pixel position relationship between each pixel point in the target panoramic image can be determined, and the panoramic image generation effect is improved.
As an example, as shown in fig. 8, the image generation description text "A living room with TV and sofa" is input; the noise vector is processed by DiT; the target vector corresponding to the noise vector is position-encoded by the position encoding manner of the present disclosure to obtain the position rotation matrix corresponding to the target vector; the generation vector is calculated based on that matrix; and the generation vector is decoded to obtain the target panoramic image.
According to the image generation scheme provided by the embodiment of the disclosure, the image generation description text is acquired, the image generation description text is input into the image generation model, the target vector corresponding to the preset noise vector is subjected to position coding through the preset position rotation matrix formula, the position rotation matrix corresponding to the target vector is obtained, the generated vector is calculated based on the position rotation matrix corresponding to the target vector, and the generated vector is decoded, so that the target panoramic image is obtained. In this way, in the panoramic image generation process, the target panoramic image is generated by encoding according to the position rotation matrix encoding method, and the panoramic image generation effect is improved.
Fig. 9 is a schematic structural diagram of an image generation model training apparatus according to an embodiment of the present disclosure, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 9, the apparatus includes:
a first obtaining module 401, configured to obtain a training data pair, where the training data pair includes a training panoramic image sample and a description text corresponding to the training panoramic image sample;
An input module 402, configured to input the training panoramic image sample and the description text into a pre-constructed image generation model to be trained, so as to perform position encoding on an input vector corresponding to the training panoramic image sample through a preset position rotation matrix formula, obtain a position rotation matrix corresponding to the input vector, and calculate an output vector based on the position rotation matrix corresponding to the input vector;
And the training module 403 is configured to adjust model parameters of the image generation model to be trained based on a loss value between the training vector of the training panoramic image sample and the output vector, so as to obtain an image generation model.
Optionally, the input module 402 includes:
The first acquisition unit is used for inputting the training panoramic image sample and the description text into a pre-constructed image generation model to be trained so as to acquire an input vector corresponding to the training panoramic image sample;
The second acquisition unit is used for acquiring each sub-vector corresponding to the input vector;
The encoding unit is used for encoding the pixel position information corresponding to each sub-vector according to the position rotation matrix formula, obtaining a plurality of sub-position rotation matrices, and combining them to obtain the position rotation matrix corresponding to the input vector.
Optionally, the coding unit is specifically configured to:
Determining target pixel position coordinates of each sub-vector in a target coordinate system based on the pixel position information corresponding to each sub-vector;
Determining a first position rotation matrix parameter value based on a preset attitude parameter, the target pixel position coordinates of each sub-vector, and a preset first matrix parameter calculation formula;
Determining a second position rotation matrix parameter value based on the target pixel position coordinates of each sub-vector and a preset second matrix parameter calculation formula;
Inputting the first position rotation matrix parameter value and the second position rotation matrix parameter value into the position rotation matrix formula to obtain a sub-position rotation matrix of each sub-vector, and combining the plurality of sub-position rotation matrices to obtain a position rotation matrix corresponding to the input vector.
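The first and second matrix parameter calculation formulas are not reproduced here. As a hedged illustration of the combination step just listed, the sketch below builds one 2x2 rotation block per sub-vector from two assumed parameter values and assembles the blocks into a block-diagonal position rotation matrix; the additive combination of the two parameters is likewise an assumption:

```python
import numpy as np

def sub_rotation_block(alpha: float) -> np.ndarray:
    """2x2 rotation block for one sub-vector; alpha stands in for the combined
    first and second position rotation matrix parameter values."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s], [s, c]])

def combine_position_rotation_matrix(first_params, second_params) -> np.ndarray:
    """Combine per-sub-vector rotation blocks into one block-diagonal matrix.

    first_params / second_params are hypothetical per-sub-vector values
    produced by the first and second matrix parameter calculation formulas.
    """
    blocks = [sub_rotation_block(a + b) for a, b in zip(first_params, second_params)]
    n = 2 * len(blocks)
    R = np.zeros((n, n))
    for i, block in enumerate(blocks):  # place each 2x2 block on the diagonal
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = block
    return R

# Hypothetical usage with three sub-vectors
R = combine_position_rotation_matrix([0.1, 0.2, 0.3], [0.0, 0.5, 1.0])
print(R.shape)  # (6, 6)
```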
Optionally, the determining, based on the pixel position information corresponding to each sub-vector, the target pixel position coordinate of each sub-vector in the target coordinate system includes:
Acquiring initial pixel position coordinates of each sub-vector in an initial coordinate system based on the pixel position information corresponding to each sub-vector;
And converting the initial pixel position coordinates of each sub-vector according to a preset coordinate system conversion formula to obtain the target pixel position coordinates of each sub-vector in the target coordinate system.
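The preset coordinate system conversion formula itself is not reproduced in this section; the sketch below assumes the standard equirectangular convention for panoramas, mapping a pixel (u, v) to longitude and latitude on the sphere, which matches the horizontally relative, vertically absolute behaviour described above:

```python
import numpy as np

def pixel_to_sphere(u: int, v: int, width: int, height: int):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Assumes the standard equirectangular layout: longitude spans [-pi, pi]
    from left to right and wraps around, latitude spans [pi/2, -pi/2] from
    top (north pole) to bottom (south pole). This stands in for the preset
    coordinate system conversion formula.
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi   # horizontal: relative, periodic
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi  # vertical: absolute, poles fixed
    return lon, lat

# The center pixel of a 1024x512 panorama maps approximately to (0, 0):
print(pixel_to_sphere(511, 255, 1024, 512))
```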
The training device for the image generation model provided by the embodiment of the disclosure can execute the training method for the image generation model provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
Fig. 10 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present disclosure, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 10, the apparatus includes:
A second obtaining module 501, configured to obtain an image generation description text;
The generating module 502 is configured to input the image generation description text into an image generation model, so as to perform position encoding on the target vector corresponding to a preset noise vector through a preset position rotation matrix formula, obtain the position rotation matrix corresponding to the target vector, calculate a generated vector based on that position rotation matrix, and decode the generated vector to obtain a target panoramic image;
The image generation model is obtained according to the training method of the image generation model in the previous embodiment.
Optionally, the generating module 502 is specifically configured to:
Inputting the image generation description text into an image generation model to obtain a target vector corresponding to the noise vector;
each sub-target vector corresponding to the target vector is obtained;
And encoding the pixel position information corresponding to each sub-target vector according to the position rotation matrix formula, obtaining a plurality of sub-target position rotation matrices, and combining them to obtain the position rotation matrix corresponding to the target vector.
Optionally, the apparatus further includes:
The calculation module is used for calculating based on the sub-target position rotation matrices corresponding to any two sub-target vectors and a preset angle calculation formula, to obtain a target angle;
The first determining module is used for determining pixel distance information between any two sub-target vectors based on the target angle;
and the second determining module is used for determining the pixel position relation of the target panoramic image based on the pixel distance information.
The image generating device provided by the embodiment of the disclosure can execute the image generating method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the executing method.
Embodiments of the present disclosure also provide a computer program product comprising a computer program/instructions which, when executed by a processor, implement the training method of the image generation model provided by any embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now in particular to fig. 11, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device 600 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 11, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, a magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While fig. 11 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided; more or fewer means may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program embodied on a non-transitory computer readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. When executed by the processing means 601, the computer program performs the functions defined above in the training method of the image generation model of the embodiment of the present disclosure.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a training panoramic image sample and the description text corresponding to the training panoramic image sample and input them into an image generation model to be trained; perform position encoding on the input vector corresponding to the training panoramic image sample to obtain the position rotation matrix corresponding to the input vector; calculate an output vector based on the position rotation matrix corresponding to the input vector; and adjust model parameters of the image generation model to be trained based on the loss value between the training vector of the training panoramic image sample and the output vector, to obtain the image generation model.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The names of the units do not, in some cases, constitute a limitation on the units themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the training method of the image generation model provided by any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium storing a computer program for performing the training method of the image generation model provided by any embodiment of the present disclosure.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, solutions formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A training method for an image generation model, characterized by comprising:
acquiring a training data pair, wherein the training data pair includes a training panoramic image sample and a description text corresponding to the training panoramic image sample;
inputting the training panoramic image sample and the description text into a pre-constructed image generation model to be trained, so as to perform position encoding on an input vector corresponding to the training panoramic image sample through a preset position rotation matrix formula, obtain a position rotation matrix corresponding to the input vector, and calculate an output vector based on the position rotation matrix corresponding to the input vector; and
adjusting model parameters of the image generation model to be trained based on a loss value between a training vector of the training panoramic image sample and the output vector, to obtain an image generation model.
2. The method according to claim 1, characterized in that performing position encoding on the input vector corresponding to the training panoramic image sample through the preset position rotation matrix formula to obtain the position rotation matrix corresponding to the input vector comprises:
acquiring the input vector corresponding to the training panoramic image sample;
acquiring each sub-vector corresponding to the input vector; and
encoding pixel position information corresponding to each sub-vector according to the position rotation matrix formula, obtaining a plurality of sub-position rotation matrices, and combining them to obtain the position rotation matrix corresponding to the input vector.
3. The method according to claim 2, characterized in that encoding the pixel position information corresponding to each sub-vector according to the position rotation matrix formula, obtaining the plurality of sub-position rotation matrices, and combining them to obtain the position rotation matrix corresponding to the input vector comprises:
determining target pixel position coordinates of each sub-vector in a target coordinate system based on the pixel position information corresponding to each sub-vector;
determining a first position rotation matrix parameter value based on a preset attitude parameter, the target pixel position coordinates of each sub-vector, and a preset first matrix parameter calculation formula;
determining a second position rotation matrix parameter value based on the target pixel position coordinates of each sub-vector and a preset second matrix parameter calculation formula; and
inputting the first position rotation matrix parameter value and the second position rotation matrix parameter value into the position rotation matrix formula to obtain a sub-position rotation matrix of each sub-vector, and combining the plurality of sub-position rotation matrices to obtain the position rotation matrix corresponding to the input vector.
4. The method according to claim 3, characterized in that determining the target pixel position coordinates of each sub-vector in the target coordinate system based on the pixel position information corresponding to each sub-vector comprises:
acquiring initial pixel position coordinates of each sub-vector in an initial coordinate system based on the pixel position information corresponding to each sub-vector; and
converting the initial pixel position coordinates of each sub-vector according to a preset coordinate system conversion formula to obtain the target pixel position coordinates of each sub-vector in the target coordinate system.
5. An image generation method, characterized by comprising:
acquiring an image generation description text;
inputting the image generation description text into an image generation model, so as to perform position encoding on a target vector corresponding to a preset noise vector through a preset position rotation matrix formula, obtain a position rotation matrix corresponding to the target vector, calculate a generated vector based on the position rotation matrix corresponding to the target vector, and decode the generated vector to obtain a target panoramic image;
wherein the image generation model is obtained according to the training method for an image generation model of any one of claims 1-4.
6. The method according to claim 5, characterized in that performing position encoding on the target vector corresponding to the preset noise vector through the preset position rotation matrix formula to obtain the position rotation matrix corresponding to the target vector comprises:
acquiring the target vector corresponding to the noise vector;
acquiring each sub-target vector corresponding to the target vector; and
encoding pixel position information corresponding to each sub-target vector according to the position rotation matrix formula, obtaining a plurality of sub-target position rotation matrices, and combining them to obtain the position rotation matrix corresponding to the target vector.
7. The method according to claim 6, characterized in that the method further comprises:
calculating based on the sub-target position rotation matrices corresponding to any two of the sub-target vectors and a preset angle calculation formula, to obtain a target angle;
determining pixel distance information between the any two sub-target vectors based on the target angle; and
determining a pixel position relationship of the target panoramic image based on the pixel distance information.
8. An electronic device, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the training method for an image generation model of any one of claims 1-4 or the image generation method of any one of claims 5-7.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program for performing the training method for an image generation model of any one of claims 1-4 or the image generation method of any one of claims 5-7.
10. A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the training method for an image generation model of any one of claims 1-4 or the image generation method of any one of claims 5-7.
Priority Applications (1)

Application Number: CN202510294423.7A; Priority Date: 2025-03-12; Filing Date: 2025-03-12; Title: Image generation model training, image generation method, device, equipment and medium; Status: Pending

Publications (1)

Publication Number: CN120259461A; Publication Date: 2025-07-04

Family ID: 96191359


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
