Disclosure of Invention
The invention provides a facial thermal infrared-to-visible light image conversion method based on prior information, comprising the steps of: obtaining a facial thermal infrared image to be converted and the corresponding paired visible light image, and inputting the facial thermal infrared image into a trained prior-information-based facial thermal infrared-to-visible light generation network model to obtain a synthesized facial visible light image. The prior-information-based facial thermal infrared-to-visible light generation network model comprises a face parsing map condition network module, a spatial feature transformation mapping layer, an attention module, a generator network module, and a discriminator;
the process for training the prior-information-based facial thermal infrared-to-visible light generation network model comprises the following steps:
S1, acquiring a thermal infrared-visible light image dataset, and classifying and labeling the images in the dataset by skin color information;
S2, preprocessing the paired images in the dataset, and inputting the preprocessed images into the prior-information-based facial thermal infrared-to-visible light generation network model;
S3, extracting features from the preprocessed face parsing map with the face parsing map condition network module to obtain facial prior information features;
S4, performing feature extraction and encoding on the preprocessed facial thermal infrared image with the generator network module to obtain encoded facial feature information;
S5, inputting the facial prior information features and the encoded facial feature information into the spatial feature transformation mapping layer to generate a pair of modulation parameters;
S6, passing the facial feature information through multiple residual transformation layers and a decoder to obtain the corresponding synthesized facial visible light image, and inputting the synthesized image into the discriminator for adversarial training;
S7, inputting the synthesized facial visible light image and the facial thermal infrared image into the encoder and two MLP mapping layers to obtain the corresponding facial thermal infrared image features and synthesized facial visible light image features;
S8, optimizing the model parameters with an Adam optimizer, and outputting the optimal parameters when the model's loss function reaches its minimum, thereby obtaining the optimal prior-information-based facial thermal infrared-to-visible light generation network model.
Preferably, the generator network module comprises an encoder Genc, a converter, and a decoder Gdec. The encoder Genc consists mainly of 3 CIR layers, each composed of a convolution, InstanceNorm normalization, and a ReLU activation function; the input image passes through the encoder for feature extraction. The converter comprises 9 residual blocks, each consisting of a spatial feature transformation mapping layer (STL) followed by CIR layers; the residual blocks mainly enhance the feature maps extracted by the encoder. The decoder Gdec comprises two CTIR layers, one Reflect (boundary padding) layer, and one convolution layer, where each CTIR layer consists of a deconvolution, InstanceNorm normalization, and a ReLU activation function; the decoder performs upsampling, gradually reconstructing the learned facial features back to the original image size.
Preferably, the spatial feature transformation mapping layer (STL) processes two input facial features: the facial prior information features generated by the face parsing map condition module, and the feature output GF of each layer of the generator network. The facial prior information features pass through two separate convolution branches to obtain a pair of modulation parameters α and β; the facial feature output GF of the generator network is first element-wise multiplied by α and then added to β, yielding the output of the STL. In this way the STL spatially maps and transforms the facial feature information inside the generator network, adaptively optimizing the quality of the generated face image.
Preferably, the attention module performs contrastive learning between the facial thermal infrared input image and the synthesized facial visible light image; the process comprises the following steps:
S71, extracting multi-layer features of the facial thermal infrared image and of the synthesized facial visible light image: the two images pass through the encoder Genc and a two-layer MLP network H_l to obtain the facial feature maps F_H ∈ R^(C×H×W) and F_V ∈ R^(C×H×W), respectively;
S72, applying reshape and transpose operations to the facial thermal infrared feature map to obtain the two-dimensional matrices Q_H ∈ R^(HW×C) and V_H ∈ R^(HW×C);
S73, constructing the global attention contrastive loss from the two-dimensional matrices Q_H ∈ R^(HW×C) and V_H ∈ R^(HW×C).
Furthermore, the model provides two attention methods, global attention and local attention, each used to build a contrastive learning loss; in this embodiment, global attention is used. The process of constructing the global attention contrastive loss comprises: multiplying Q_H by its transpose K_H ∈ R^(C×HW) to obtain a matrix, and applying a Softmax normalization to each row of the matrix to obtain the global attention matrix A_global ∈ R^(HW×HW); computing the entropy value Hs of each row of the global attention matrix according to the entropy formula; sorting the rows of the global attention matrix in ascending order of entropy; routing the value features V_H ∈ R^(HW×C) of the source-domain facial thermal infrared image and V_V ∈ R^(HW×C) of the target-domain synthesized facial visible light image according to the sorted matrix; and finally constructing the global contrastive loss from the routed value features V_H and V_V.
Further, the process of constructing the local attention contrastive loss comprises: local attention uses a k×k window of fixed size that slides over the source-domain facial thermal infrared image with a stride of 1, which strengthens spatial information interaction within a local region. Q_H is multiplied by its local transpose K_local ∈ R^(C×k^2) (taken per spatial position) to obtain a matrix, and a Softmax normalization is applied to each row to obtain the local attention matrix A_local ∈ R^(HW×k^2). The entropy value Hs of each row of A_local is computed, the rows are sorted in ascending order of entropy, and the values V_H-local of the source-domain facial thermal infrared image and V_V-local of the target-domain synthesized facial visible light image are routed according to the sorted matrix, thereby constructing the multi-layer local contrastive loss.
Preferably, the loss function of the model is:
L = λ1·L_ConH + λ2·L_ConG(H) + λ3·L_Pcl + λ4·L_Gm + λ5·L_gan
where λ1, λ2, λ3, λ4, and λ5 are the weighting hyperparameters of the contrastive learning loss, the identity-preserving contrastive learning loss, the pixel-level consistency loss, the gradient enhancement loss, and the generative adversarial loss, respectively.
The invention has the beneficial effects that:
The invention uses the face parsing map as prior information to guide the generator network in learning the local texture information of the face image: the face parsing map features serve as a prior condition from which the spatial feature transformation mapping layer (STL) generates a pair of modulation parameters, and the facial features of the generator network are affine-transformed according to these parameters, adaptively optimizing the quality of the generated face image, which helps suppress artifacts in the generated face image and improves local texture detail. The invention also designs a facial gradient enhancement loss for model training: gradient maps are extracted from the synthesized facial visible light image and the ground-truth facial visible light image (GT), and the gradient enhancement loss sharpens the facial details of the synthesized image while preserving better facial contour information.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
A facial thermal infrared-to-visible light image conversion method based on prior information, as shown in figs. 1 and 2, comprises: first preparing a paired facial thermal infrared-visible light dataset and the corresponding facial prior dataset (face parsing maps), while classifying the images by skin color and extracting the corresponding labels; preprocessing the data; constructing the prior-information-based facial thermal infrared-to-visible light generation network model, comprising a face parsing map condition network (FPCN), a spatial feature transformation mapping layer (STL), a generator network module G, a discriminator D, and so on; training the face generation network model, applying the attention operations to the source-domain and target-domain images to select salient anchors and positive and negative samples; training and optimizing with the combined loss functions and an Adam optimizer, updating the network parameters until training is complete and the optimal model is obtained; and inputting a facial thermal infrared image into the optimal generation model to obtain the synthesized facial visible light image.
The facial thermal infrared-to-visible light image conversion method based on prior information comprises: obtaining a facial thermal infrared image to be converted and the corresponding paired visible light image, and inputting the facial thermal infrared image into the trained prior-information-based facial thermal infrared-to-visible light generation network model to obtain the synthesized facial visible light image. The model comprises a face parsing map condition network module, a spatial feature transformation mapping layer, an attention module, a generator network module, and a discriminator.
The process for training the prior-information-based facial thermal infrared-to-visible light generation network model comprises the following steps:
S1, acquiring a thermal infrared-visible light image dataset, and classifying and labeling the images in the dataset by skin color information;
S2, preprocessing the paired images in the dataset, and inputting the preprocessed images into the prior-information-based facial thermal infrared-to-visible light generation network model;
S3, extracting features from the preprocessed face parsing map with the face parsing map condition network module to obtain facial prior information features;
S4, performing feature extraction and encoding on the preprocessed facial thermal infrared image with the generator network module to obtain encoded facial feature information;
S5, inputting the facial prior information features and the encoded facial feature information into the spatial feature transformation mapping layer to generate a pair of modulation parameters;
S6, passing the facial feature information through multiple residual transformation layers and a decoder to obtain the corresponding synthesized facial visible light image, and inputting the synthesized image into the discriminator for adversarial training;
S7, inputting the synthesized facial visible light image and the facial thermal infrared image into the encoder and two MLP mapping layers to obtain the corresponding facial thermal infrared image features and synthesized facial visible light image features;
S8, optimizing the model parameters with an Adam optimizer, and outputting the optimal parameters when the model's loss function reaches its minimum, thereby obtaining the optimal prior-information-based facial thermal infrared-to-visible light generation network model.
Another embodiment for training the prior-information-based facial thermal infrared-to-visible light generation network model, as shown in fig. 1, comprises:
S1, acquiring a thermal infrared-visible light image dataset, setting the number of iterations, and classifying and labeling the images in the dataset by skin color information;
S2, preprocessing the paired images in the dataset, and inputting the preprocessed images into the prior-information-based facial thermal infrared-to-visible light generation network model;
S3, extracting features from the preprocessed face parsing map with the face parsing map condition network module to obtain facial prior information features;
S4, performing feature extraction and encoding on the preprocessed facial thermal infrared image with the generator network module to obtain encoded facial feature information;
S5, inputting the facial prior information features and the encoded facial feature information into the spatial feature transformation mapping layer to generate a pair of modulation parameters;
S6, passing the facial feature information through multiple residual transformation layers and a decoder to obtain the corresponding synthesized facial visible light image, and inputting the synthesized image into the discriminator for adversarial training;
S7, inputting the synthesized facial visible light image and the facial thermal infrared image into the encoder and two MLP mapping layers to obtain the corresponding facial thermal infrared image features and synthesized facial visible light image features;
S8, optimizing the model parameters with an Adam optimizer, backpropagating, and incrementing the iteration count by 1; comparing the current iteration count with the preset iteration count: if they are equal, outputting the optimal parameters to obtain the optimal prior-information-based facial thermal infrared-to-visible light generation network model; otherwise, returning to step S3.
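The training procedure above can be summarized in a short PyTorch-style sketch. All module and loss names (fpcn, generator, discriminator, g_loss_fn, d_loss_fn) are hypothetical stand-ins for the components described in this embodiment, and the Adam settings are common GAN defaults rather than values given in the source:

```python
# Minimal training-loop sketch for steps S1-S8 (illustrative only).
import torch

def train(fpcn, generator, discriminator, g_loss_fn, d_loss_fn,
          dataloader, num_iterations, lr=2e-4, betas=(0.5, 0.999)):
    opt_g = torch.optim.Adam(
        list(generator.parameters()) + list(fpcn.parameters()), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=betas)
    it = 0
    while True:
        for ir_img, vis_img, parsing_map, skin_label in dataloader:   # S2
            prior_feat = fpcn(parsing_map)            # S3: facial prior features
            fake_vis = generator(ir_img, prior_feat)  # S4-S6: encode, STL, decode
            # S6: discriminator update on real vs. synthesized images
            loss_d = d_loss_fn(discriminator, vis_img, fake_vis.detach(), skin_label)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # S7-S8: generator update with the combined loss, then backprop
            loss_g = g_loss_fn(discriminator, ir_img, vis_img, fake_vis, skin_label)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
            it += 1                                   # S8: iteration counter
            if it == num_iterations:                  # stop at the preset count
                return generator
```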
Acquiring the paired infrared-visible light image dataset comprises capturing corresponding facial thermal infrared and visible light datasets with a dual-modality thermal infrared/visible light camera, at a size of 256×256. The faces are then aligned: the RetinaFace face detection algorithm locates 5 facial key points, and the faces are cropped to obtain the paired facial thermal infrared-visible light dataset. The cropped and uncropped facial thermal-visible light data are classified by skin color and the corresponding labels are extracted for subsequent model training, and the corresponding facial prior inputs (the face parsing map dataset) are generated with a recent face parsing model.
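A minimal sketch of this preparation pipeline follows. detect_five_landmarks, align_crop, and parse_model are hypothetical stand-ins for the RetinaFace detector, the five-point alignment step, and the face parsing model, since the concrete tooling is not specified here:

```python
# Data-preparation sketch (illustrative); helper functions are hypothetical.
from pathlib import Path
import numpy as np
from PIL import Image

def prepare_pair(ir_path: Path, vis_path: Path,
                 parse_model, detect_five_landmarks, align_crop):
    ir = np.array(Image.open(ir_path).resize((256, 256)))   # 256x256 capture size
    vis = np.array(Image.open(vis_path).resize((256, 256)))
    pts = detect_five_landmarks(vis)     # 5 key points (eyes, nose, mouth corners)
    ir_face = align_crop(ir, pts)        # crop both modalities with the same points
    vis_face = align_crop(vis, pts)
    parsing_map = parse_model(vis_face)  # face parsing map = prior information
    return ir_face, vis_face, parsing_map
```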
In this embodiment, as shown in fig. 4, the face parsing map condition network module (FPCN) takes the face parsing map as input and processes it with 3 convolution layers, using 1×1 and 3×3 convolution kernels to extract facial features.
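A minimal sketch of such a 3-layer FPCN; the number of parsing classes and the channel widths are assumptions, not values from the source:

```python
# FPCN sketch: 3 convolution layers over the face parsing map,
# mixing 1x1 and 3x3 kernels as described.
import torch.nn as nn

class FPCN(nn.Module):
    def __init__(self, in_ch=19, mid_ch=64, out_ch=128):  # 19 parsing classes assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),
        )

    def forward(self, parsing_map):
        return self.body(parsing_map)   # facial prior information features
```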
The spatial feature transformation mapping layer (STL), shown in fig. 3, has two inputs: the facial prior information features generated by the FPCN module, and the feature output GF of each layer of the generator network. The facial prior information features pass through multi-layer convolution operations and a sigmoid activation function to obtain a pair of modulation parameters α and β. The facial feature output GF of the generator network is element-wise multiplied by α and then added to β, yielding the output of the STL. In this way the STL spatially maps and transforms the facial feature information inside the generator network, adaptively optimizing the quality of the generated face image.
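A minimal STL sketch; the channel widths, and whether the sigmoid ends both branches, are assumptions left open by the text:

```python
# STL sketch: two small convolution branches map the FPCN prior features
# to modulation parameters alpha and beta; the generator feature GF is
# modulated as GF * alpha + beta.
import torch.nn as nn

class STL(nn.Module):
    def __init__(self, prior_ch=128, feat_ch=256):
        super().__init__()
        self.alpha = nn.Sequential(nn.Conv2d(prior_ch, feat_ch, 3, padding=1),
                                   nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                                   nn.Sigmoid())
        self.beta = nn.Sequential(nn.Conv2d(prior_ch, feat_ch, 3, padding=1),
                                  nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, gf, prior_feat):
        a = self.alpha(prior_feat)    # modulation parameter alpha
        b = self.beta(prior_feat)     # modulation parameter beta
        return gf * a + b             # point-multiply first, then add
```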
The generator network module G mainly comprises the encoder Genc, a converter consisting of 9 residual blocks, and the decoder Gdec.

In this embodiment, the encoder Genc consists essentially of 3 CIR layers, each composed of a convolution, InstanceNorm normalization, and a ReLU activation function. A Reflect (boundary padding) layer is inserted before the 3 CIR layers; it pads the image symmetrically up, down, left, and right along its edges so that the subsequent convolutions preserve information at the image borders. The main function of the encoder Genc is to extract facial features.

In this embodiment, the converter is composed of 9 residual blocks. Each residual block consists of a spatial feature transformation mapping layer followed by a CIR layer, and mainly performs feature enhancement. With the face parsing map as prior information, the generator features are mapped and transformed by the modulation parameters produced by the spatial feature transformation mapping layer, which greatly improves the texture details of the generated face image.

In this embodiment, the decoder Gdec consists, in order, of two CTIR layers, one Reflect layer, and one convolution layer. Each CTIR layer consists of a deconvolution, InstanceNorm normalization, and a ReLU activation function. The decoder gradually reconstructs the features into a facial visible light image of the original size.
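Putting the three parts together, a condensed generator sketch under the same assumptions; it reuses the STL class from the sketch above, and the channel widths, the final Tanh, and the spatial size of prior_feat are assumptions:

```python
# Generator sketch: Reflect + 3 CIR layers (Genc), 9 residual blocks with
# STL (converter), 2 CTIR layers + Reflect + conv (Gdec).
import torch.nn as nn

def CIR(in_ch, out_ch, k=3, s=1, p=1):   # Conv + InstanceNorm + ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, s, p),
                         nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True))

def CTIR(in_ch, out_ch):                 # Deconv + InstanceNorm + ReLU
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True))

class ResidualSTLBlock(nn.Module):       # STL followed by a CIR layer
    def __init__(self, ch, prior_ch):
        super().__init__()
        self.stl = STL(prior_ch, ch)     # STL class from the sketch above
        self.cir = CIR(ch, ch)
    def forward(self, x, prior_feat):
        return x + self.cir(self.stl(x, prior_feat))

class Generator(nn.Module):
    def __init__(self, prior_ch=128):
        super().__init__()
        self.pad = nn.ReflectionPad2d(3)                 # boundary padding
        self.enc = nn.Sequential(CIR(3, 64, 7, 1, 0),    # Genc: 3 CIR layers
                                 CIR(64, 128, 3, 2, 1), CIR(128, 256, 3, 2, 1))
        self.blocks = nn.ModuleList(ResidualSTLBlock(256, prior_ch) for _ in range(9))
        self.dec = nn.Sequential(CTIR(256, 128), CTIR(128, 64),   # Gdec
                                 nn.ReflectionPad2d(3), nn.Conv2d(64, 3, 7), nn.Tanh())

    def forward(self, ir_img, prior_feat):
        # prior_feat is assumed to match the converter's spatial resolution
        h = self.enc(self.pad(ir_img))
        for blk in self.blocks:          # converter: feature enhancement
            h = blk(h, prior_feat)
        return self.dec(h)               # reconstructed to the original size
```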
In this embodiment, the discriminator uses the PatchGAN structure, which outputs an N×N matrix in which each element judges one image patch; it mainly captures differences in local receptive field information and retains high resolution and high detail in assessing image definition. Each time, a 70×70 image block of the original picture is judged for authenticity; the final output is a 30×30 matrix, and the mean of this matrix is taken as the True/False output.
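A sketch of such a discriminator, following the common 70×70 PatchGAN layout (the layer widths are conventional assumptions); for a 256×256 input, the patch map comes out 30×30 as described:

```python
# 70x70 PatchGAN discriminator sketch: outputs a patch decision map
# whose mean serves as the real/fake score.
import torch.nn as nn

class PatchGAN(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        def block(i, o, s):              # Conv + InstanceNorm + LeakyReLU
            return nn.Sequential(nn.Conv2d(i, o, 4, s, 1),
                                 nn.InstanceNorm2d(o), nn.LeakyReLU(0.2, True))
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            block(64, 128, 2), block(128, 256, 2), block(256, 512, 1),
            nn.Conv2d(512, 1, 4, 1, 1))  # N x N patch decisions

    def forward(self, x):
        patch_map = self.net(x)          # 30x30 for a 256x256 input
        return patch_map.mean()          # average -> True/False score
```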
As shown in fig. 5, the process of contrastive learning between the target image and the input image using the attention module comprises:
Step 1, extracting multi-layer features of the source-domain and target-domain images: the source-domain image (the facial thermal infrared input image) and the target-domain image (the synthesized facial visible light image) each pass through the encoder Genc and a two-layer MLP network H_l. The facial thermal infrared input image is encoded by Genc into multi-layer feature maps; L of the encoded feature layers are selected, each with S spatial positions, and the features of each layer are projected into a shared embedding space by the 2-layer MLP, where different spatial positions in different layers represent different image patches. The mapped patch features are {y_l}_L = {H_l(Genc^l(x))}_L, where y_l denotes the output features of the l-th selected layer, l ∈ {1, 2, ..., L}. The target-domain image is handled correspondingly: among the encoded and mapped patches of the target-domain synthesized facial visible light image, one patch is taken as the anchor, the patch at the corresponding position of the source-domain facial thermal infrared input image is taken as the positive sample, and patches at other positions of the same facial thermal infrared input image are taken as negative samples.
Step 2, the attention module in contrastive learning. The attention module mainly addresses how to choose the positions of the positive and negative samples for contrastive learning: constraining image patches at random positions may be unsuitable, because some positions of the image contain little source-domain saliency information. Only patches carrying the salient information of the domain should be selected, and a contrastive loss built in this way better guarantees cross-domain consistency. The model provides two attention methods, global attention and local attention, for constructing the contrastive learning loss; in this embodiment, global attention is used.
Step 2.1, the source-domain facial thermal infrared input image and the target-domain synthesized facial visible light image pass through the encoder Genc and the two-layer MLP network H_l to obtain the feature maps F_H ∈ R^(C×H×W) and F_V ∈ R^(C×H×W), respectively. In this embodiment the global attention method is used. First, reshape and transpose operations are applied to the facial thermal infrared features to obtain the two-dimensional matrices Q_H ∈ R^(HW×C) and V_H ∈ R^(HW×C); Q_H is multiplied by its transpose K_H ∈ R^(C×HW) to obtain a matrix, and a Softmax normalization is applied to each row to obtain the global attention matrix A_global ∈ R^(HW×HW). Entropy can serve as an index of feature importance, so the importance of features is measured by the entropy value Hs of each row of A_global. With i and j indexing the rows and columns of A_global, the entropy Hs is computed as:

Hs(i) = -Σ_j A_global(i, j) · log A_global(i, j)

After computing the entropy of each row of A_global, the rows are sorted in ascending order of entropy and the N rows with the smallest entropy are selected as the reduced global attention matrix A_global-s ∈ R^(N×HW). This matrix routes the value features V_H ∈ R^(HW×C) of the source-domain facial thermal infrared image and V_V ∈ R^(HW×C) of the target-domain synthesized facial visible light image.

As shown in fig. 5, A_global-s is applied to the features of the facial thermal infrared input image and the synthesized facial visible light image, and the corresponding value features V_H and V_V are routed to form the anchors, positive samples, and negative samples. The positive and negative samples lie in the source-domain facial thermal infrared input image and the anchors lie in the target-domain synthesized facial visible light image, from which the corresponding global contrastive loss is established.
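A sketch of this entropy-based routing for a single feature layer; the small epsilon inside the logarithm is a numerical-stability assumption:

```python
# Global attention routing sketch (step 2.1): the entropy of each
# attention row selects the N most salient query positions, whose value
# features are routed from both domains to build the contrastive pairs.
import torch
import torch.nn.functional as F

def route_global(feat_h, feat_v, n_select):
    # feat_h, feat_v: (C, H, W) features of the thermal-IR / synthesized images
    C, H, W = feat_h.shape
    q_h = feat_h.reshape(C, H * W).t()          # Q_H in R^(HW x C)
    v_h = q_h                                   # V_H shares the reshaped features
    v_v = feat_v.reshape(C, H * W).t()          # V_V in R^(HW x C)
    attn = F.softmax(q_h @ q_h.t(), dim=1)      # A_global in R^(HW x HW)
    ent = -(attn * (attn + 1e-8).log()).sum(dim=1)   # entropy Hs per row
    idx = ent.argsort()[:n_select]              # N smallest-entropy rows
    return v_h[idx], v_v[idx]                   # routed value features
```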
Step 2.2, the local attention method. The difference is that local attention uses a k×k window of fixed size that slides over the source-domain facial thermal infrared input image with a stride of 1, which strengthens spatial information interaction and connection within a local region. The facial thermal infrared feature map first passes through reshape and transpose to obtain the two-dimensional matrices Q_H ∈ R^(HW×C) and V_H ∈ R^(HW×C). Unlike global attention, Q_H is multiplied by its local transpose K_local ∈ R^(C×k^2) (taken per spatial position) to obtain a matrix, and a Softmax normalization is applied to each row to obtain the local attention matrix A_local ∈ R^(HW×k^2). The importance of features is again measured by the entropy value Hs of each row of A_local; as with global attention, the N rows with the smallest entropy are selected in ascending order to form the reduced local attention matrix A_local-s, which then routes the values V_H-local of the source-domain facial thermal infrared image and V_V-local of the target-domain synthesized facial visible light image, finally constructing the corresponding multi-layer local contrastive loss.
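A corresponding sketch of the local variant, using a stride-1 unfold to gather each position's k×k neighborhood (zero padding at the borders is an assumption):

```python
# Local attention sketch (step 2.2): each query attends only to its
# k x k neighborhood, giving A_local in R^(HW x k^2); entropy-based
# row selection proceeds as in the global case.
import torch
import torch.nn.functional as F

def route_local(feat_h, feat_v, k, n_select):
    C, H, W = feat_h.shape
    q = feat_h.reshape(C, H * W).t()                         # Q_H: (HW, C)
    # Local keys: the k*k neighbors of every position (stride-1 window).
    keys = F.unfold(feat_h.unsqueeze(0), k, padding=k // 2)  # (1, C*k*k, HW)
    keys = keys.reshape(C, k * k, H * W).permute(2, 0, 1)    # (HW, C, k*k)
    attn = F.softmax(torch.bmm(q.unsqueeze(1), keys).squeeze(1), dim=1)  # (HW, k*k)
    ent = -(attn * (attn + 1e-8).log()).sum(dim=1)           # entropy per row
    idx = ent.argsort()[:n_select]
    v_h = q[idx]                                             # routed V_H-local
    v_v = feat_v.reshape(C, H * W).t()[idx]                  # routed V_V-local
    return v_h, v_v
```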
The contrastive loss function L_Con is established from the source-domain facial thermal infrared image and the target-domain synthesized facial visible light image:

L_Con(V, H+, H-) = -log( exp(V·H+/τ) / ( exp(V·H+/τ) + Σ_{n=1}^{N-1} exp(V·H-_n/τ) ) )

where τ = 0.07 is a temperature hyperparameter, V is the anchor from the target-domain synthesized facial visible light image, and H+ and H- are the positive sample and the N-1 negative samples from the source-domain facial thermal infrared image, respectively. This contrastive loss is denoted L_ConH.
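A sketch of this loss for one anchor; the L2 normalization of the embeddings is an assumption borrowed from common contrastive-learning setups:

```python
# Contrastive (InfoNCE-style) loss sketch matching the formula above:
# v is the anchor, h_pos the positive, h_negs the N-1 negatives, tau = 0.07.
import torch
import torch.nn.functional as F

def contrastive_loss(v, h_pos, h_negs, tau=0.07):
    # v: (C,), h_pos: (C,), h_negs: (N-1, C)
    v, h_pos = F.normalize(v, dim=0), F.normalize(h_pos, dim=0)
    h_negs = F.normalize(h_negs, dim=1)
    logits = torch.cat([(v * h_pos).sum().view(1),   # positive similarity
                        h_negs @ v]) / tau           # negative similarities
    # cross-entropy with target 0 == -log softmax(logits)[positive]
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```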
A positive sample and N-1 negative samples are likewise selected within the target-domain synthesized facial visible light image itself. Similar to an identity preservation loss, this keeps the features and structure of G(H) close to those of H and prevents the generator from altering the synthesized face image excessively. This loss is denoted L_ConG(H).
The corresponding facial gradient maps ∇I_G(H) and ∇I_V are extracted from the target-domain synthesized facial visible light image I_G(H) and the ground-truth facial visible light image I_V (GT), and the gradient enhancement loss L_Gm is constructed from them. The gradient enhancement loss reduces facial artifacts and preserves better facial contour information.
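A sketch of one plausible form of L_Gm, assuming Sobel gradient maps compared under an L1 distance (the source names the loss but not the operator):

```python
# Gradient enhancement loss sketch: depthwise Sobel gradients of the
# synthesized and ground-truth images compared with an L1 distance.
import torch
import torch.nn.functional as F

def gradient_map(img):                   # img: (B, C, H, W)
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    sobel_y = sobel_x.t()
    kx = sobel_x.view(1, 1, 3, 3).repeat(img.size(1), 1, 1, 1)
    ky = sobel_y.view(1, 1, 3, 3).repeat(img.size(1), 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=img.size(1))
    gy = F.conv2d(img, ky, padding=1, groups=img.size(1))
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)   # gradient magnitude

def gradient_loss(fake_vis, real_vis):   # L_Gm
    return F.l1_loss(gradient_map(fake_vis), gradient_map(real_vis))
```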
Considering that many thermal infrared-visible light datasets provide corresponding paired images in both domains, the unsupervised algorithm can be extended by enforcing an additional constraint that minimizes the L1 distance between the synthesized image and the real visible light image. This supervised loss is complementary to the unsupervised losses of CUT, and the additional regularization supplements the unsupervised image synthesis algorithm. This loss is called the pixel-level consistency loss and is denoted L_Pcl, where I_V is the corresponding ground-truth facial visible light image (GT):

L_Pcl = ||I_G(H) - I_V||_1
For the adversarial loss between the generator G and the discriminator D, the facial thermal infrared input image is denoted I_H, and the pseudo visible light image produced by the face generation network model is denoted I_G(H). The skin color label condition is denoted Z: the skin color information is divided into one-hot encoded representations, and during training the class labels and the image data are concatenated (cat operation) as input. The face parsing map (the facial prior information condition) is denoted P, and ξ denotes the minimum L1 loss between the facial visible light image synthesized by the generator network and the real facial visible light image. The generator network learns the mapping conditioned on the facial prior information and the face label condition, G: {I_H, Z, P} → I_G(H), so its generative adversarial loss is:

L_gan = E[log D(I_V | Z)] + E[log(1 - D(G(I_H | Z, P) | Z))]

and the generator objective is G* = arg min_G max_D L_gan + ξ.
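A sketch of this conditional adversarial training in its non-saturating BCE form (an assumption consistent with the log terms above); z_map is the one-hot skin label broadcast to image resolution before the cat operation:

```python
# Conditional adversarial loss sketch: the skin-color condition Z is
# concatenated with the image before the discriminator. D must be built
# with the extra condition channels in its input.
import torch
import torch.nn.functional as F

def d_loss(D, real_vis, fake_vis, z_map):
    real_score = D(torch.cat([real_vis, z_map], dim=1))
    fake_score = D(torch.cat([fake_vis.detach(), z_map], dim=1))
    return (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) +
            F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))

def g_adv_loss(D, fake_vis, real_vis, z_map):
    fake_score = D(torch.cat([fake_vis, z_map], dim=1))
    adv = F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))
    return adv + F.l1_loss(fake_vis, real_vis)   # xi: L1 to the real image
```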
The total loss function of the model is:
L = λ1·L_ConH + λ2·L_ConG(H) + λ3·L_Pcl + λ4·L_Gm + λ5·L_gan
where λ1, λ2, λ3, λ4, and λ5 are the weighting hyperparameters of the contrastive learning loss, the identity-preserving contrastive learning loss, the pixel-level consistency loss, the gradient enhancement loss, and the generative adversarial loss, respectively.
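Assembled as a weighted sum; the λ values below are placeholders, since the source does not specify them:

```python
# Total loss sketch: weighted sum of the five terms (placeholder lambdas).
def total_loss(l_con_h, l_con_gh, l_pcl, l_gm, l_gan,
               lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    l1, l2, l3, l4, l5 = lambdas
    return l1 * l_con_h + l2 * l_con_gh + l3 * l_pcl + l4 * l_gm + l5 * l_gan
```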
Within the contrastive learning framework, the invention uses the face parsing map as prior knowledge to guide the generator network in learning the local texture information of the face image. In particular, a spatial feature transformation mapping layer (STL) is introduced that generates a pair of modulation parameters from the face parsing map features as the prior condition; the facial features of the generator network are affine-transformed according to these modulation parameters, adaptively optimizing the generation of the face image. Meanwhile, the invention designs a facial gradient enhancement loss that reduces facial artifacts. In addition, the invention adds skin color label condition information to the input facial thermal infrared image and the paired visible light image, so that the generated image restores the corresponding skin color information as faithfully as possible.
While the foregoing describes embodiments, aspects, and advantages of the present invention, it will be understood that the above embodiments are merely exemplary; any changes, substitutions, and alterations made without departing from the spirit and principles of the invention fall within its scope.