Disclosure of Invention
Therefore, it is necessary to provide a method and a system for synthesizing a real scene image, so as to quickly and effectively synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios.
To achieve the above purpose, the invention provides the following scheme:
a method of synthesizing an image of a real scene, comprising:
acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph;
establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer;
establishing a loss function of the real scene image synthetic network model by utilizing a pre-trained VGG-19 convolutional neural network model;
taking the image training set as the input of the real scene image synthesis network model, and training the real scene image synthesis network model according to the loss function to obtain a trained real scene image synthesis network model;
acquiring a plurality of semantic graphs to be synthesized;
and inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
Optionally, the establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block specifically includes:
establishing a U-net convolution neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
establishing an excitation residual block;
and embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthesis network model.
Optionally, taking the image training set as the input of the real scene image synthesis network model and training the real scene image synthesis network model according to the loss function to obtain a trained real scene image synthesis network model specifically includes:
inputting the i-th semantic graph in the training set into a current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
judging whether j is smaller than a preset maximum number of training iterations;
if yes, inputting the real scene synthesis graph and a real scene reference graph corresponding to the ith semantic graph into the loss function to obtain a loss value;
inputting the loss value into an Adam optimizer, and updating the real scene image synthesis network model by adopting the Adam optimization algorithm; then letting i = i + 1 and j = j + 1, and returning to the step of inputting the i-th semantic graph in the training set into the current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
if not, the current real scene image synthesis network model is used as the trained real scene image synthesis network model.
Optionally, the excitation residual block specifically includes:
f(x)=x·sigmoid(β(x))
wherein x represents the input semantic graph; sigmoid is the activation function, whose expression is sigmoid(x) = 1/(1 + exp(−x)); β represents the convolution layer in the excitation residual block; and β(x) represents the image obtained by performing the convolution operation on the input semantic graph.
Optionally, the loss function specifically includes:
Lf = Σ (l = 0, 1, …, 5) λl · ‖φl(F) − φl(G)‖²

wherein Lf represents the loss value; F represents the real scene synthesis graph output by the real scene image synthesis network model, and G represents the real scene reference graph; φ denotes the pre-trained VGG-19 convolutional neural network model, and φl denotes the l-th layer in the pre-trained VGG-19 convolutional neural network model; φl(F) denotes the feature map output by the l-th convolutional layer after F is input into the pre-trained VGG-19 convolutional neural network, and φl(G) denotes the feature map output by the l-th convolutional layer after G is input into the pre-trained VGG-19 convolutional neural network; l takes values in {0, 1, 2, 3, 4, 5}; φ0 denotes the input graph of the pre-trained VGG-19 network, and φ1 to φ5 denote the feature maps output by the five corresponding convolutional layers of the pre-trained VGG-19; λl denotes the weight coefficient of the loss term of the l-th layer and takes the values (1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8).
Optionally, before taking the image training set as the input of the real scene image synthesis network model and training the real scene image synthesis network model according to the loss function to obtain the trained real scene image synthesis network model, the method further includes:
determining initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
The invention also provides a real scene image synthesis system, which comprises:
the first acquisition module is used for acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair consists of a semantic graph and a real scene reference graph corresponding to the semantic graph;
the synthetic model establishing module is used for establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer;
the loss function establishing module is used for establishing a loss function of the real scene image synthetic network model by utilizing a pre-trained VGG-19 convolutional neural network model;
the training module is used for taking the image training set as the input of the real scene image synthesis network model, training the real scene image synthesis network model according to the loss function and obtaining the trained real scene image synthesis network model;
the second acquisition module is used for acquiring a plurality of semantic graphs to be synthesized;
and the synthesis module is used for inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
Optionally, the synthetic model establishing module specifically includes:
the first establishing unit is used for establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
a second establishing unit for establishing an excitation residual block;
and the synthetic model establishing unit is used for embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthetic network model.
Optionally, the training module specifically includes:
a synthetic image obtaining unit, configured to input the i-th semantic graph in the training set into a current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
a judging unit, configured to judge whether j is smaller than a preset maximum number of training iterations;
an updating unit, configured to, if j is smaller than the preset maximum number of training iterations, input the real scene synthesis graph and the real scene reference graph corresponding to the i-th semantic graph into the loss function to obtain a loss value; input the loss value into an Adam optimizer, and update the real scene image synthesis network model by adopting the Adam optimization algorithm; and then let i = i + 1 and j = j + 1 and return to inputting the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
and a synthetic model determining unit, configured to take the current real scene image synthesis network model as the trained real scene image synthesis network model if j is greater than or equal to the preset maximum number of training iterations.
Optionally, the system further includes:
the parameter determining module is used for determining the initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a real scene image synthesis method and a system, wherein the method comprises the steps of establishing a real scene image synthesis network model according to a U-net convolutional neural network model and an excitation residual block, establishing a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model, and training the real scene image synthesis network model by using the loss function to obtain a final real scene image synthesis network model. The method or the system can quickly, effectively and reliably synthesize the photographic-level scene image with stronger reality sense, improve the reality sense and the visual quality of the synthesized image, and enlarge the application range and the application scene.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for synthesizing a real scene image according to an embodiment of the present invention.
Referring to fig. 1, the method for synthesizing an image of a real scene according to the embodiment includes:
step S1: acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph.
In this embodiment, the images in the image training set may be obtained from the Cityscapes dataset, which depicts real scenes of the real world, or from the GTA5 dataset, which depicts game scenes.
Step S2: establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer.
Fig. 2 is a schematic structural diagram of an excitation residual block according to an embodiment of the present invention, where fig. 2 (a) is a structural diagram of an activation layer (Swish Layer), and fig. 2 (b) is a structural diagram of a single excitation residual block (SRB).
Referring to fig. 2, each rectangular box represents a data operation in the network, the arrows represent data flows, the splicing (concatenation) operation on feature maps is marked with its own symbol, and "·" denotes the element-by-element multiplication of matrices. "x" denotes the feature map input to the network, i.e. the semantic graph; H denotes a convolution layer with 3 × 3 convolution kernels whose activation function is sigmoid; R(x) denotes the feature map finally output by two convolution layers, the activation function of the convolution layers in the R module being LeakyReLU; G(x) denotes the output of the activation layer; and "64-d" indicates that the number of channels of the feature map x is 64.
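For illustration only, the excitation residual block described above can be sketched in PyTorch as follows. PyTorch itself, the class and parameter names, the 64-channel default, and the residual addition around the R module are assumptions of this sketch; the exact wiring, including where the splicing operation sits, is given by fig. 2, which is not reproduced here.

    import torch
    import torch.nn as nn

    class SwishLayer(nn.Module):
        """Activation layer of fig. 2 (a): f(x) = x * sigmoid(H(x)),
        where H is a 3 x 3 convolution layer gated by a sigmoid."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.H = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * torch.sigmoid(self.H(x))  # element-by-element multiplication

    class ExcitationResidualBlock(nn.Module):
        """Single SRB of fig. 2 (b): the activation layer G(x) followed by the
        R module (two 3 x 3 convolution layers activated by LeakyReLU); the
        residual addition around R is an assumption of this sketch."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.gate = SwishLayer(channels)
            self.R = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            g = self.gate(x)      # G(x), output of the activation layer
            return x + self.R(g)  # R module output plus the skip path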
The step S2 specifically includes:
establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
establishing an excitation residual block, wherein the excitation residual block is specifically expressed as:
f(x)=x·sigmoid(β(x))
wherein x represents the input semantic graph, sigmoid is the activation function whose expression is sigmoid(x) = 1/(1 + exp(−x)), β represents the convolution layer in the excitation residual block, and β(x) represents the image obtained by performing the convolution operation on the input semantic graph;
and embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthesis network model.
The U-net convolutional neural network model in this embodiment is formed of two symmetric branches. In each branch, the convolution layers that operate on feature maps of one resolution form one U-net level; the U-net convolutional neural network model in this embodiment comprises 6 levels, with 6 different resolutions.
Fig. 3 is a schematic structural diagram of a real scene image synthesis network model according to an embodiment of the present invention. Referring to fig. 3, the real scene image synthesis network model includes 6 levels, numbered 1 to 6 from top to bottom, with the resolution of the feature maps halved at each successive level. Each rectangular box represents a multi-channel feature map, and the number at the top of each box gives the number of channels, e.g. "20, 96, 192, 384, 512, 1536"; "s" denotes an excitation residual block; the arrows denote different operations: "↓" denotes a downsampling operation, i.e. a maximum pooling operation, "↑" denotes an upsampling operation, which in this embodiment is the resize-convolution method, "→" denotes the convolution operation of a convolution layer, and a dotted arrow denotes a copy-and-paste operation on a feature map. An upsampling operation enlarges a low-resolution image to a higher resolution. The resize-convolution upsampling in this embodiment proceeds as follows: the input low-resolution image is first enlarged to twice its resolution by a bicubic interpolation algorithm, the enlarged image is then passed through one convolution layer, and the feature map output by that convolution layer is the output of the upsampling method.
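As an illustration of this upsampling step, a minimal PyTorch sketch is given below; the module name, the 3 x 3 kernel size, and the use of torch.nn.functional.interpolate are assumptions of the sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResizeConvUpsample(nn.Module):
        """Resize-convolution upsampling: bicubic interpolation doubles the
        resolution, then one convolution layer produces the output feature map."""
        def __init__(self, in_channels: int, out_channels: int):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
            return self.conv(x)

Compared with transposed convolution, this resize-then-convolve design is commonly credited with reducing checkerboard artifacts, which matches the motivation stated later in this embodiment.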
Step S3: establishing a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model. The loss function is specifically:
Lf = Σ (l = 0, 1, …, 5) λl · ‖φl(F) − φl(G)‖²

wherein Lf represents the loss value; F represents the real scene synthesis graph output by the real scene image synthesis network model, and G represents the real scene reference graph; φ denotes the pre-trained VGG-19 convolutional neural network model, and φl denotes the l-th layer in the pre-trained VGG-19 convolutional neural network model; φl(F) denotes the feature map output by the l-th convolutional layer after F is input into the pre-trained VGG-19 convolutional neural network, and φl(G) denotes the feature map output by the l-th convolutional layer after G is input into the pre-trained VGG-19 convolutional neural network; l takes values in {0, 1, 2, 3, 4, 5}; φ0 denotes the input graph of the pre-trained VGG-19 network, and φ1 to φ5 denote the feature maps output by the five corresponding convolutional layers of the pre-trained VGG-19; λl denotes the weight coefficient of the loss term of the l-th layer and takes the values (1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8).
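A sketch of this loss in PyTorch is given below. The use of torchvision's pre-trained VGG-19, the indices used to tap the conv1_2 to conv5_2 layers, the expectation of ImageNet-normalized inputs, and the weight 1.0 assumed for the φ0 term (the text lists five λ values) are all assumptions of the sketch, not details fixed by this embodiment.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class VGG19PerceptualLoss(nn.Module):
        """Weighted sum of squared absolute errors between features of the
        synthesis graph F and the reference graph G at phi_0 .. phi_5."""
        def __init__(self, weights=(1.0, 1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8)):
            super().__init__()
            vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
            for p in vgg.parameters():
                p.requires_grad_(False)       # the VGG-19 model stays frozen
            self.vgg = vgg
            # Assumed indices of conv1_2, conv2_2, conv3_2, conv4_2, conv5_2
            self.taps = (2, 7, 12, 21, 30)
            self.weights = weights

        def _features(self, x: torch.Tensor):
            feats = [x]                       # phi_0: the input graph itself
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                if i in self.taps:
                    feats.append(x)           # phi_1 .. phi_5
                if i == self.taps[-1]:
                    break
            return feats

        def forward(self, f_img: torch.Tensor, g_img: torch.Tensor) -> torch.Tensor:
            loss = torch.zeros((), device=f_img.device)
            for w, f, g in zip(self.weights, self._features(f_img), self._features(g_img)):
                loss = loss + w * torch.mean((f - g).abs() ** 2)  # squared absolute error
            return loss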
Step S4: determining initialization parameters of the real scene image synthesis network model.
Specifically, the initialization parameters include a learning rate, a maximum training frequency, the number of semantic graphs, the width of the semantic graphs, and the height of the semantic graphs.
In this embodiment, the learning rate learning_rate = 0.0001, the maximum number of training iterations epoch = 100, the width of the semantic graph = 384, and the height of the semantic graph = 192.
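For concreteness, these initialization parameters can be gathered in a small configuration object; the dictionary below only restates the values of this embodiment, and its layout is illustrative.

    # Initialization parameters of this embodiment; the dictionary layout is illustrative.
    config = {
        "learning_rate": 0.0001,  # learning rate
        "epoch": 100,             # maximum number of training iterations
        "width": 384,             # width of each semantic graph
        "height": 192,            # height of each semantic graph
    }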
Step S5: and taking the image training set as the input of the real scene image synthesis network model, and training the real scene image synthesis network model according to the loss function to obtain the trained real scene image synthesis network model. The step S5 specifically includes:
inputting the i-th semantic graph in the training set into a current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
judging whether j is smaller than the preset maximum number of training iterations;
if yes, inputting the real scene synthesis graph and a real scene reference graph corresponding to the ith semantic graph into the loss function to obtain a loss value;
inputting the loss value into an Adam optimizer, and updating the real scene image synthesis network model by adopting the Adam optimization algorithm; then letting i = i + 1 and j = j + 1, and returning to the step of inputting the i-th semantic graph in the training set into the current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
if not, the current real scene image synthesis network model is used as the trained real scene image synthesis network model.
In this embodiment, the calculation of the loss value may be described specifically as follows: the real scene synthesis graph and the corresponding real scene reference graph are each input into the pre-trained VGG-19 convolutional neural network model, and the feature sub-graphs output by 5 of its convolution layers (conv1_2, conv2_2, conv3_2, conv4_2 and conv5_2, respectively) are obtained for each input; the squared absolute error is then computed for each of the 5 pairs of feature sub-graphs, giving 5 squared-absolute-error terms; the squared absolute error between the real scene synthesis graph and the real scene reference graph corresponding to the semantic graph is computed as well, giving 6 error terms in total; and the 6 terms are weighted by λl and summed to obtain the loss value.
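Steps S4 and S5 can then be sketched as the training loop below; model, loader and loss_fn stand for the network, image-pair loader and VGG-19 loss assumed in the earlier sketches, the config dictionary is the one sketched above, and counting j once per image pair follows the text of this embodiment.

    import torch

    def train(model, loader, loss_fn, config):
        """Training-loop sketch for this embodiment (steps S4 and S5)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
        j = 0                                         # completed training iterations
        while j < config["epoch"]:                    # judge whether j < maximum
            for semantic_graph, reference in loader:  # i-th image pair
                synthesis = model(semantic_graph)     # current synthesis network
                loss = loss_fn(synthesis, reference)  # loss value of the 6 terms
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                      # Adam update; j = j + 1
                j += 1
                if j >= config["epoch"]:              # preset maximum reached
                    break
        return model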
Step S6: acquiring a plurality of semantic graphs to be synthesized.
Step S7: inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
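These two steps amount to a forward pass through the trained network; in the sketch below, trained_model and semantic_to_synthesize are assumed to hold the trained synthesis network and one semantic graph tensor to be synthesized.

    import torch

    trained_model.eval()                                   # trained synthesis network
    with torch.no_grad():                                  # no gradients at inference
        synthesis = trained_model(semantic_to_synthesize)  # real scene synthesis graph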
In this embodiment, the images in the Cityscapes dataset, which depicts real scenes of the real world, are used as the training set and the test set, so as to implement the real scene image synthesis method. Fig. 4 is a synthesis result diagram using an image in the Cityscapes dataset as the semantic graph to be synthesized, where fig. 4 (a) is a semantic graph selected from the Cityscapes dataset for synthesis, and fig. 4 (b) is the real scene synthesis graph corresponding to fig. 4 (a). Fig. 5 is a synthesis result diagram using an image in the game scene GTA5 dataset as the semantic graph to be synthesized, where fig. 5 (a) is a semantic graph selected from the GTA5 dataset for synthesis, and fig. 5 (b) is the real scene synthesis graph corresponding to fig. 5 (a).
The method for synthesizing real scene images can quickly, effectively and reliably synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios; moreover, because the upsampling method in the U-net convolutional neural network model is the resize-convolution method, checkerboard artifacts in the synthesized image can be reduced, further improving the realism of the synthesized image.
The invention also provides a real scene image synthesis system, and fig. 6 is a schematic structural diagram of a real scene image synthesis system according to an embodiment of the invention.
Referring to fig. 6, the real scene image synthesis system of the embodiment includes:
a first obtainingmodule 601, configured to obtain an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph.
A synthetic model establishing module 602, configured to establish a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer.
The synthetic model establishing module 602 specifically includes:
the first establishing unit is used for establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
a second establishing unit for establishing an excitation residual block;
and the synthetic model establishing unit is used for embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthetic network model.
A loss function establishing module 603, configured to establish a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model.
A parameter determining module 604, configured to determine the initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
A training module 605, configured to take the image training set as the input of the real scene image synthesis network model and train the real scene image synthesis network model according to the loss function, so as to obtain a trained real scene image synthesis network model.
The training module 605 specifically includes:
a synthetic image obtaining unit, configured to input the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
a judging unit, configured to judge whether j is smaller than the preset maximum number of training iterations;
an updating unit, configured to, if j is smaller than the preset maximum number of training iterations, input the real scene synthesis graph and the real scene reference graph corresponding to the i-th semantic graph into the loss function to obtain a loss value; input the loss value into an Adam optimizer, and update the real scene image synthesis network model by adopting the Adam optimization algorithm; and then let i = i + 1 and j = j + 1 and return to inputting the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
and a synthetic model determining unit, configured to take the current real scene image synthesis network model as the trained real scene image synthesis network model if j is greater than or equal to the preset maximum number of training iterations.
A second obtaining module 606, configured to obtain a plurality of semantic graphs to be synthesized.
A synthesizing module 607, configured to input the semantic graph to be synthesized into the trained real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
The real scene image synthesis system in this embodiment can quickly, effectively and reliably synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios.
Since the system disclosed in this embodiment corresponds to the method disclosed above, its description is relatively brief; for relevant details, reference may be made to the corresponding part of the method description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this specification should not be construed as limiting the invention.