Disclosure of Invention
Therefore, it is necessary to provide a method and a system for synthesizing a real scene image, so as to quickly and effectively synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios.
To achieve the above purpose, the invention provides the following scheme:
a method of synthesizing an image of a real scene, comprising:
acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph;
establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer;
establishing a loss function of the real scene image synthetic network model by utilizing a pre-trained VGG-19 convolutional neural network model;
taking the image training set as the input of the real scene image synthesis network model, and training the real scene image synthesis network model according to the loss function to obtain a trained real scene image synthesis network model;
acquiring a plurality of semantic graphs to be synthesized;
and inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
Optionally, the establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block specifically includes:
establishing a U-net convolution neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
establishing an excitation residual block;
and embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthesis network model.
Optionally, taking the image training set as the input of the real scene image synthesis network model and training the real scene image synthesis network model according to the loss function to obtain a trained real scene image synthesis network model specifically includes:
inputting the i-th semantic graph in the training set into a current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
judging whether j is smaller than a preset maximum number of training iterations;
if yes, inputting the real scene synthesis graph and a real scene reference graph corresponding to the ith semantic graph into the loss function to obtain a loss value;
inputting the loss value into an Adam optimizer, and updating the real scene image synthesis network model by adopting the Adam optimization algorithm; then letting i = i + 1 and j = j + 1, and returning to the step of inputting the i-th semantic graph in the training set into the current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
if not, the current real scene image synthesis network model is used as the trained real scene image synthesis network model.
Optionally, the excitation residual block specifically includes:
f(x)=x·sigmoid(β(x))
wherein x represents the input semantic graph; sigmoid is the activation function, whose expression is sigmoid(x) = 1/(1 + exp(−x)); β represents the convolution layer in the excitation residual block; and β(x) represents the image obtained by performing the convolution operation on the input semantic graph.
Optionally, the loss function specifically includes:
Lf = Σ (l = 0, 1, …, 5) λl · ‖φl(F) − φl(G)‖²

wherein Lf represents the loss value; F represents the real scene synthesis graph output by the real scene image synthesis network model, and G represents the real scene reference graph; φ denotes the pre-trained VGG-19 convolutional neural network model, and φl denotes the l-th layer in the pre-trained VGG-19 convolutional neural network model; φl(F) denotes the feature map output by the l-th convolutional layer after F is input into the pre-trained VGG-19 convolutional neural network, and φl(G) denotes the feature map output by the l-th convolutional layer after G is input into the pre-trained VGG-19 convolutional neural network; l takes values in {0, 1, 2, 3, 4, 5}; φ0 denotes the input graph of the pre-trained VGG-19 network, and φ1 to φ5 denote the feature maps output by the five corresponding convolutional layers of the pre-trained VGG-19; λl denotes the weight coefficient of the loss term of the l-th layer and takes the values (1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8).
Optionally, before taking the image training set as the input of the real scene image synthesis network model and training the real scene image synthesis network model according to the loss function to obtain the trained real scene image synthesis network model, the method further includes:
determining initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
The invention also provides a real scene image synthesis system, which comprises:
the first acquisition module is used for acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair consists of a semantic graph and a real scene reference graph corresponding to the semantic graph;
the synthetic model establishing module is used for establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer;
the loss function establishing module is used for establishing a loss function of the real scene image synthetic network model by utilizing a pre-trained VGG-19 convolutional neural network model;
the training module is used for taking the image training set as the input of the real scene image synthesis network model, training the real scene image synthesis network model according to the loss function and obtaining the trained real scene image synthesis network model;
the second acquisition module is used for acquiring a plurality of semantic graphs to be synthesized;
and the synthesis module is used for inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
Optionally, the synthetic model establishing module specifically includes:
the first establishing unit is used for establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
a second establishing unit for establishing an excitation residual block;
and the synthetic model establishing unit is used for embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthetic network model.
Optionally, the training module specifically includes:
a synthetic image obtaining unit, configured to input the i-th semantic graph in the training set into a current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
a judging unit, configured to judge whether j is smaller than a preset maximum number of training iterations;
an updating unit, configured to, if j is smaller than the preset maximum number of training iterations, input the real scene synthesis graph and the real scene reference graph corresponding to the i-th semantic graph into the loss function to obtain a loss value; input the loss value into an Adam optimizer, and update the real scene image synthesis network model by adopting the Adam optimization algorithm; and then let i = i + 1 and j = j + 1 and return to inputting the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
and a synthetic model determining unit, configured to take the current real scene image synthesis network model as the trained real scene image synthesis network model if j is greater than or equal to the preset maximum number of training iterations.
Optionally, the system further includes:
the parameter determining module is used for determining the initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a real scene image synthesis method and a system, wherein the method comprises the steps of establishing a real scene image synthesis network model according to a U-net convolutional neural network model and an excitation residual block, establishing a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model, and training the real scene image synthesis network model by using the loss function to obtain a final real scene image synthesis network model. The method or the system can quickly, effectively and reliably synthesize the photographic-level scene image with stronger reality sense, improve the reality sense and the visual quality of the synthesized image, and enlarge the application range and the application scene.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for synthesizing a real scene image according to an embodiment of the present invention.
Referring to fig. 1, the method for synthesizing an image of a real scene according to the embodiment includes:
step S1: acquiring an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph.
In this embodiment, the images in the image training set may be obtained from the Cityscapes dataset, which depicts real scenes of the real world, or from the GTA5 dataset, which depicts game scenes.
Step S2: establishing a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer.
Fig. 2 is a schematic structural diagram of an excitation residual block according to an embodiment of the present invention, where fig. 2 (a) is a structural diagram of an activation layer (Swish Layer), and fig. 2 (b) is a structural diagram of a single excitation residual block (SRB).
Referring to fig. 2, each rectangular box represents a data operation in the network, the arrows represent data flows, the splicing (concatenation) operation on feature maps is marked with its own symbol, and "·" denotes the element-by-element multiplication of matrices. "x" denotes the feature map input to the network, i.e. the semantic graph; H denotes a convolution layer with 3 × 3 convolution kernels whose activation function is sigmoid; R(x) denotes the feature map finally output by two convolution layers, the activation function of the convolution layers in the R module being LeakyReLU; G(x) denotes the output of the activation layer; and "64-d" indicates that the number of channels of the feature map x is 64.
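For illustration only, the excitation residual block described above can be sketched in PyTorch as follows. PyTorch itself, the class and parameter names, the 64-channel default, and the residual addition around the R module are assumptions of this sketch; the exact wiring, including where the splicing operation sits, is given by fig. 2, which is not reproduced here.

    import torch
    import torch.nn as nn

    class SwishLayer(nn.Module):
        """Activation layer of fig. 2 (a): f(x) = x * sigmoid(H(x)),
        where H is a 3 x 3 convolution layer gated by a sigmoid."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.H = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * torch.sigmoid(self.H(x))  # element-by-element multiplication

    class ExcitationResidualBlock(nn.Module):
        """Single SRB of fig. 2 (b): the activation layer G(x) followed by the
        R module (two 3 x 3 convolution layers activated by LeakyReLU); the
        residual addition around R is an assumption of this sketch."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.gate = SwishLayer(channels)
            self.R = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            g = self.gate(x)      # G(x), output of the activation layer
            return x + self.R(g)  # R module output plus the skip path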
The step S2 specifically includes:
establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
establishing an excitation residual block, wherein the excitation residual block is specifically expressed as:
f(x)=x·sigmoid(β(x))
wherein x represents the input semantic graph, sigmoid is the activation function whose expression is sigmoid(x) = 1/(1 + exp(−x)), β represents the convolution layer in the excitation residual block, and β(x) represents the image obtained by performing the convolution operation on the input semantic graph;
and embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthesis network model.
The U-net convolutional neural network model in this embodiment is formed of two symmetric branches. In each branch, the convolution layers that operate on feature maps of one resolution form one U-net level; the U-net convolutional neural network model in this embodiment comprises 6 levels, with 6 different resolutions.
Fig. 3 is a schematic structural diagram of a real scene image synthesis network model according to an embodiment of the present invention. Referring to fig. 3, the real scene image synthesis network model includes 6 levels, numbered 1 to 6 from top to bottom, with the resolution of the feature maps halved at each successive level. Each rectangular box represents a multi-channel feature map, and the number at the top of each box gives the number of channels, e.g. "20, 96, 192, 384, 512, 1536"; "s" denotes an excitation residual block; the arrows denote different operations: "↓" denotes a downsampling operation, i.e. a maximum pooling operation, "↑" denotes an upsampling operation, which in this embodiment is the resize-convolution method, "→" denotes the convolution operation of a convolution layer, and a dotted arrow denotes a copy-and-paste operation on a feature map. An upsampling operation enlarges a low-resolution image to a higher resolution. The resize-convolution upsampling in this embodiment proceeds as follows: the input low-resolution image is first enlarged to twice its resolution by a bicubic interpolation algorithm, the enlarged image is then passed through one convolution layer, and the feature map output by that convolution layer is the output of the upsampling method.
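As an illustration of this upsampling step, a minimal PyTorch sketch is given below; the module name, the 3 x 3 kernel size, and the use of torch.nn.functional.interpolate are assumptions of the sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResizeConvUpsample(nn.Module):
        """Resize-convolution upsampling: bicubic interpolation doubles the
        resolution, then one convolution layer produces the output feature map."""
        def __init__(self, in_channels: int, out_channels: int):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
            return self.conv(x)

Compared with transposed convolution, this resize-then-convolve design is commonly credited with reducing checkerboard artifacts, which matches the motivation stated later in this embodiment.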
Step S3: establishing a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model. The loss function is specifically:
Lf = Σ (l = 0, 1, …, 5) λl · ‖φl(F) − φl(G)‖²

wherein Lf represents the loss value; F represents the real scene synthesis graph output by the real scene image synthesis network model, and G represents the real scene reference graph; φ denotes the pre-trained VGG-19 convolutional neural network model, and φl denotes the l-th layer in the pre-trained VGG-19 convolutional neural network model; φl(F) denotes the feature map output by the l-th convolutional layer after F is input into the pre-trained VGG-19 convolutional neural network, and φl(G) denotes the feature map output by the l-th convolutional layer after G is input into the pre-trained VGG-19 convolutional neural network; l takes values in {0, 1, 2, 3, 4, 5}; φ0 denotes the input graph of the pre-trained VGG-19 network, and φ1 to φ5 denote the feature maps output by the five corresponding convolutional layers of the pre-trained VGG-19; λl denotes the weight coefficient of the loss term of the l-th layer and takes the values (1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8).
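A sketch of this loss in PyTorch is given below. The use of torchvision's pre-trained VGG-19, the indices used to tap the conv1_2 to conv5_2 layers, the expectation of ImageNet-normalized inputs, and the weight 1.0 assumed for the φ0 term (the text lists five λ values) are all assumptions of the sketch, not details fixed by this embodiment.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class VGG19PerceptualLoss(nn.Module):
        """Weighted sum of squared absolute errors between features of the
        synthesis graph F and the reference graph G at phi_0 .. phi_5."""
        def __init__(self, weights=(1.0, 1/1.6, 1/2.3, 1/1.8, 1/2.8, 10/0.8)):
            super().__init__()
            vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
            for p in vgg.parameters():
                p.requires_grad_(False)       # the VGG-19 model stays frozen
            self.vgg = vgg
            # Assumed indices of conv1_2, conv2_2, conv3_2, conv4_2, conv5_2
            self.taps = (2, 7, 12, 21, 30)
            self.weights = weights

        def _features(self, x: torch.Tensor):
            feats = [x]                       # phi_0: the input graph itself
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                if i in self.taps:
                    feats.append(x)           # phi_1 .. phi_5
                if i == self.taps[-1]:
                    break
            return feats

        def forward(self, f_img: torch.Tensor, g_img: torch.Tensor) -> torch.Tensor:
            loss = torch.zeros((), device=f_img.device)
            for w, f, g in zip(self.weights, self._features(f_img), self._features(g_img)):
                loss = loss + w * torch.mean((f - g).abs() ** 2)  # squared absolute error
            return loss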
Step S4: determining initialization parameters of the real scene image synthesis network model.
Specifically, the initialization parameters include a learning rate, a maximum training frequency, the number of semantic graphs, the width of the semantic graphs, and the height of the semantic graphs.
In this embodiment, the learning rate learning_rate = 0.0001, the maximum number of training iterations epoch = 100, the width of the semantic graph = 384, and the height of the semantic graph = 192.
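For concreteness, these initialization parameters can be gathered in a small configuration object; the dictionary below only restates the values of this embodiment, and its layout is illustrative.

    # Initialization parameters of this embodiment; the dictionary layout is illustrative.
    config = {
        "learning_rate": 0.0001,  # learning rate
        "epoch": 100,             # maximum number of training iterations
        "width": 384,             # width of each semantic graph
        "height": 192,            # height of each semantic graph
    }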
Step S5: and taking the image training set as the input of the real scene image synthesis network model, and training the real scene image synthesis network model according to the loss function to obtain the trained real scene image synthesis network model. The step S5 specifically includes:
inputting the i-th semantic graph in the training set into a current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
judging whether j is smaller than the preset maximum number of training iterations;
if yes, inputting the real scene synthesis graph and a real scene reference graph corresponding to the ith semantic graph into the loss function to obtain a loss value;
inputting the loss value into an Adam optimizer, and updating the real scene image synthesis network model by adopting the Adam optimization algorithm; then letting i = i + 1 and j = j + 1, and returning to the step of inputting the i-th semantic graph in the training set into the current real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
if not, the current real scene image synthesis network model is used as the trained real scene image synthesis network model.
In this embodiment, the calculation of the loss value may be described specifically as follows: the real scene synthesis graph and the corresponding real scene reference graph are each input into the pre-trained VGG-19 convolutional neural network model, and the feature sub-graphs output by 5 of its convolution layers (conv1_2, conv2_2, conv3_2, conv4_2 and conv5_2, respectively) are obtained for each input; the squared absolute error is then computed for each of the 5 pairs of feature sub-graphs, giving 5 squared-absolute-error terms; the squared absolute error between the real scene synthesis graph and the real scene reference graph corresponding to the semantic graph is computed as well, giving 6 error terms in total; and the 6 terms are weighted by λl and summed to obtain the loss value.
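Steps S4 and S5 can then be sketched as the training loop below; model, loader and loss_fn stand for the network, image-pair loader and VGG-19 loss assumed in the earlier sketches, the config dictionary is the one sketched above, and counting j once per image pair follows the text of this embodiment.

    import torch

    def train(model, loader, loss_fn, config):
        """Training-loop sketch for this embodiment (steps S4 and S5)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
        j = 0                                         # completed training iterations
        while j < config["epoch"]:                    # judge whether j < maximum
            for semantic_graph, reference in loader:  # i-th image pair
                synthesis = model(semantic_graph)     # current synthesis network
                loss = loss_fn(synthesis, reference)  # loss value of the 6 terms
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                      # Adam update; j = j + 1
                j += 1
                if j >= config["epoch"]:              # preset maximum reached
                    break
        return model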
Step S6: acquiring a plurality of semantic graphs to be synthesized.
Step S7: inputting the semantic graph to be synthesized into the trained real scene image synthesis network model to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
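These two steps amount to a forward pass through the trained network; in the sketch below, trained_model and semantic_to_synthesize are assumed to hold the trained synthesis network and one semantic graph tensor to be synthesized.

    import torch

    trained_model.eval()                                   # trained synthesis network
    with torch.no_grad():                                  # no gradients at inference
        synthesis = trained_model(semantic_to_synthesize)  # real scene synthesis graph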
In this embodiment, the images in the Cityscapes dataset, which depicts real scenes of the real world, are used as the training set and the test set, so as to implement the real scene image synthesis method. Fig. 4 is a synthesis result diagram using an image in the Cityscapes dataset as the semantic graph to be synthesized, where fig. 4 (a) is a semantic graph selected from the Cityscapes dataset for synthesis, and fig. 4 (b) is the real scene synthesis graph corresponding to fig. 4 (a). Fig. 5 is a synthesis result diagram using an image in the game scene GTA5 dataset as the semantic graph to be synthesized, where fig. 5 (a) is a semantic graph selected from the GTA5 dataset for synthesis, and fig. 5 (b) is the real scene synthesis graph corresponding to fig. 5 (a).
The method for synthesizing real scene images can quickly, effectively and reliably synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios; moreover, because the upsampling method in the U-net convolutional neural network model is the resize-convolution method, checkerboard artifacts in the synthesized image can be reduced, further improving the realism of the synthesized image.
The invention also provides a real scene image synthesis system, and fig. 6 is a schematic structural diagram of a real scene image synthesis system according to an embodiment of the invention.
Referring to fig. 6, the real scene image synthesis system of the embodiment includes:
a first obtainingmodule 601, configured to obtain an image training set; the image training set is composed of a plurality of image pairs; each image pair is composed of a semantic graph and a real scene reference graph corresponding to the semantic graph.
A synthetic model establishing module 602, configured to establish a real scene image synthesis network model according to the U-net convolutional neural network model and the excitation residual block; the excitation residual block is composed of a convolution layer and an activation layer.
The synthetic model establishing module 602 specifically includes:
the first establishing unit is used for establishing a U-net convolutional neural network model; the U-net convolutional neural network model comprises a plurality of hierarchical levels;
a second establishing unit for establishing an excitation residual block;
and the synthetic model establishing unit is used for embedding the excitation residual block between every two adjacent layers of the U-net convolutional neural network model to form a real scene image synthetic network model.
A loss function establishing module 603, configured to establish a loss function of the real scene image synthesis network model by using a pre-trained VGG-19 convolutional neural network model.
A parameter determining module 604, configured to determine the initialization parameters of the real scene image synthesis network model; the initialization parameters comprise a learning rate, a maximum number of training iterations, the number of semantic graphs, the width of the semantic graphs and the height of the semantic graphs.
A training module 605, configured to take the image training set as the input of the real scene image synthesis network model and train the real scene image synthesis network model according to the loss function, so as to obtain a trained real scene image synthesis network model.
The training module 605 specifically includes:
a synthetic image obtaining unit, configured to input the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph; wherein i is an integer greater than or equal to 1; the current real scene image synthesis network model is the real scene image synthesis network model updated after the j-th training iteration; wherein j is an integer greater than or equal to 0;
a judging unit, configured to judge whether j is smaller than the preset maximum number of training iterations;
an updating unit, configured to, if j is smaller than the preset maximum number of training iterations, input the real scene synthesis graph and the real scene reference graph corresponding to the i-th semantic graph into the loss function to obtain a loss value; input the loss value into an Adam optimizer, and update the real scene image synthesis network model by adopting the Adam optimization algorithm; and then let i = i + 1 and j = j + 1 and return to inputting the i-th semantic graph in the training set into the current real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the i-th semantic graph;
and a synthetic model determining unit, configured to take the current real scene image synthesis network model as the trained real scene image synthesis network model if j is greater than or equal to the preset maximum number of training iterations.
A second obtaining module 606, configured to obtain a plurality of semantic graphs to be synthesized.
A synthesizing module 607, configured to input the semantic graph to be synthesized into the trained real scene image synthesis network model, so as to obtain a real scene synthesis graph corresponding to the semantic graph to be synthesized.
The real scene image synthesis system in this embodiment can quickly, effectively and reliably synthesize photographic-level scene images with a stronger sense of reality, improve the realism and visual quality of the synthesized images, and broaden the range of applications and application scenarios.
Since the system disclosed in this embodiment corresponds to the method disclosed above, its description is relatively brief; for relevant details, reference may be made to the corresponding part of the method description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this specification should not be construed as limiting the invention.