Disclosure of Invention
In view of the problems in the prior art, the invention provides a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which can effectively suppress domain gaps and enhance discriminative pedestrian features without using target domain data, thereby enhancing the generalization capability of the model.
The method is realized by the following technical scheme:
The cross-domain pedestrian re-identification method based on normalization and feature enhancement is characterized by comprising the steps of establishing a recognition network model, image feature normalization, image feature recovery and image feature output;
establishing a recognition network model, wherein the establishing of the recognition network model comprises establishing a normalization enhancement module NEM with an instance normalization unit IN (Instance Normalization), a residual weight training unit CMS and an attention unit CAB, and, taking a ResNet50 model as the backbone network, inserting the normalization enhancement module NEM into the ResNet50 model to form the recognition network model;
the image feature normalization comprises the following steps:
S11, extracting pedestrian image features x ∈ R^(c×h×w) based on the ResNet50 model (note: x, x1 and x2 in this embodiment are all image features), where x is the input feature of the normalization enhancement module NEM, c is the number of channels of the image features, h is the height of the image features, w is the width of the image features, R^(c×h×w) represents a real number domain space of dimensions c×h×w, and x ∈ R^(c×h×w) indicates that the input feature x is a vector in the c×h×w-dimensional real number domain space;
S12, obtaining the mean μ(x) and the variance σ(x) of the input feature x ∈ R^(c×h×w) in each channel by using the instance normalization unit IN, and calculating the normalized feature x1 based on the obtained mean μ(x) and variance σ(x), the calculation formula being as follows:
wherein γ and β are learnable parameter vectors, and γ ∈ R^c and β ∈ R^c indicate that γ and β are both vectors in the c-dimensional real number domain space; the elements of γ and β are initially set to 1 and 0 respectively, and are then updated automatically during training;
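For illustration only, a minimal sketch of the instance normalization unit IN could look as follows, assuming the standard affine instance-normalization form x1 = γ·(x − μ(x))/σ(x) + β computed per channel over the h×w spatial positions (the exact formula is not reproduced above, and the small constant eps is an added assumption for numerical stability); PyTorch is assumed and all names are illustrative:

```python
import torch
import torch.nn as nn

class InstanceNormUnit(nn.Module):
    """Sketch of the IN unit: per-channel normalization with learnable gamma/beta."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(channels))   # learnable, initialized to 1
        self.beta = nn.Parameter(torch.zeros(channels))   # learnable, initialized to 0
        self.eps = eps                                    # stability constant (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, c, h, w); mean and variance are taken over the spatial
        # dimensions of each channel, as described in step S12.
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x1 = (x - mu) / torch.sqrt(var + self.eps)
        return self.gamma.view(1, -1, 1, 1) * x1 + self.beta.view(1, -1, 1, 1)
```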
the image feature recovery comprises the following steps:
S21, learning a residual weight Wr from the normalized feature x1 by means of the residual weight training unit CMS, namely:
Wr = sigmoid(mean(conv(x1)))
wherein conv(·) represents a convolution, mean(·) represents the global mean, and sigmoid(·) represents the sigmoid activation function;
S22, fusing the input feature x and the normalized feature x1 based on the residual weight Wr, and recovering the discrimination information lost by the image features due to style normalization, the fusion formula being as follows:
x2 = Wr × x1 + (1 - Wr) × x
wherein x2 is the restored image feature, named the recovered feature, and x2 ∈ R^(c×h×w) indicates that the recovered feature x2 is a vector in the c×h×w-dimensional real number domain space;
the image feature output comprises the steps of:
S31, exploring the correlation between the different channels of the recovered feature x2 using the attention unit CAB, and adaptively extracting the channel attention weight Wc, namely:
Wc = ca(x2)
wherein ca(·) is the attention unit CAB, and the channel attention weight Wc measures the importance of the information of each channel of the recovered feature x2;
S32, filtering the recovered feature x2 with the channel attention weight Wc to enhance the characterization capability of the pedestrian features, namely:
f = (Wc + 1) × x2
wherein f is the output feature of the normalization enhancement module NEM.
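Reading steps S11 to S32 together, the forward pass of the normalization enhancement module NEM can be sketched as below; the three sub-units are passed in as generic modules implementing the formulas of steps S12, S21 and S31, and the class name and interface are assumptions rather than part of the original scheme:

```python
import torch
import torch.nn as nn

class NEM(nn.Module):
    """Sketch of the normalization enhancement module: IN -> CMS fusion -> CAB filtering."""

    def __init__(self, in_unit: nn.Module, cms_unit: nn.Module, cab_unit: nn.Module):
        super().__init__()
        self.in_unit = in_unit    # instance normalization unit IN (step S12)
        self.cms_unit = cms_unit  # residual weight training unit CMS (step S21)
        self.cab_unit = cab_unit  # attention unit CAB (step S31)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.in_unit(x)             # style-normalized feature
        w_r = self.cms_unit(x1)          # residual weight Wr in (0, 1)
        x2 = w_r * x1 + (1 - w_r) * x    # step S22: recover lost discrimination information
        w_c = self.cab_unit(x2)          # channel attention weight Wc
        return (w_c + 1) * x2            # step S32: enhanced output feature f
```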
Further, the ResNet50 model comprises a Res1 unit, a Res2 unit, a Res3 unit, a Res4 unit, a Res5 unit and a Head unit which are sequentially connected, and a normalization enhancement module NEM is inserted at the output end of each of the Res2 unit, the Res3 unit, the Res4 unit and the Res5 unit.
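One plausible way to realize this insertion scheme with a torchvision ResNet-50 is sketched below, treating the stem (conv1/bn1/relu/maxpool) as the Res1 unit and layer1-layer4 as the Res2-Res5 units; the Head unit is reduced to a global-average-pooling placeholder, and the mapping and all names are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class NEMResNet50(nn.Module):
    """Sketch: ResNet-50 backbone with a NEM inserted at the output of Res2-Res5."""

    def __init__(self, nem_modules):
        super().__init__()
        backbone = resnet50()  # weights left at the library default
        # Res1 unit: the stem; Res2-Res5 units: layer1-layer4
        self.res1 = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.nems = nn.ModuleList(nem_modules)  # NEM2, NEM3, NEM4, NEM5
        self.head = nn.AdaptiveAvgPool2d(1)     # placeholder for the Head unit

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.res1(img)
        for stage, nem in zip(self.stages, self.nems):
            x = nem(stage(x))  # NEM inserted at the output end of each Res unit
        return self.head(x).flatten(1)
```

In such a sketch, the four NEM instances would be built with channel counts matching the stage outputs of ResNet-50 (256, 512, 1024 and 2048).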
Further, the method also includes introducing an NEM loss function into the normalization enhancement module NEM at the output end of the Res5 unit, that is, the image feature output of this normalization enhancement module NEM further includes the following steps:
S33, respectively calculating the center loss Cx of the input feature x and the center loss Cf of the output feature f, in order to measure the intra-class dispersion of the input feature x and the output feature f in the feature space, the calculation formula being as follows:
wherein cxj ∈ R^d represents the class center of the j-th pedestrian in the input feature x; cfj ∈ R^d represents the class center of the j-th pedestrian in the output feature f; n represents the total number of pedestrians in the data set, m represents the total number of features of the j-th pedestrian, xji represents the i-th feature of the j-th pedestrian in the input feature x, d represents the dimension of each feature, and R^d represents the d-dimensional real number domain space, i.e. cxj and cfj are both vectors in the d-dimensional real number domain space;
S34, establishing the NEM loss function based on the center losses Cf and Cx, the NEM loss function being as follows:
wherein LNEM is the loss value calculated from the input feature x and the output feature f of NEM5.
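Since the exact expressions for Cx, Cf and LNEM are not reproduced above, the sketch below only computes one plausible reading of the two center losses (mean squared distance of each pedestrian's d-dimensional features to their class center) and leaves their combination into LNEM to the caller; the averaging scheme and all names are assumptions:

```python
import torch

def center_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Intra-class dispersion of (N, d) features grouped by (N,) pedestrian identities."""
    pids = labels.unique()
    loss = features.new_zeros(())
    for pid in pids:
        feats = features[labels == pid]            # the m features of pedestrian j
        center = feats.mean(dim=0, keepdim=True)   # class center cxj / cfj in R^d
        loss = loss + ((feats - center) ** 2).sum(dim=1).mean()
    return loss / pids.numel()                     # average over the n pedestrians

def nem_loss(x_feats, f_feats, labels):
    c_x = center_loss(x_feats, labels)  # dispersion of the NEM5 input features x
    c_f = center_loss(f_feats, labels)  # dispersion of the NEM5 output features f
    # The original combination of Cf and Cx into LNEM is not given here;
    # both terms are returned so the intended formula can be applied by the caller.
    return c_x, c_f
```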
Further, in the step S11, the feature information carried by x ∈ R^(c×h×w) comprises a style and a shape; the style comprises the imaging style of the image and the clothing style of the pedestrian, and the shape is the contour shape of the pedestrian in the image.
Further, in the step S31, obtaining the channel attention weight Wc comprises the following steps:
S311, performing maximum pooling and average pooling along the channel dimension of the recovered feature x2 to obtain two 1×h×w two-dimensional matrices, and multiplying the recovered feature x2 element by element with each of the two 1×h×w two-dimensional matrices, so as to introduce the spatial information corresponding to each of the two 1×h×w two-dimensional matrices into the channels of the recovered feature x2;
S312, respectively performing maximum pooling and average pooling along the spatial dimension on the features into which the spatial information has been introduced, to generate two spatial aggregation masks F1 and F2, with F1 ∈ R^(c×1×1) and F2 ∈ R^(c×1×1); wherein R^(c×1×1) represents a real number domain space of dimensions c×1×1, and F1 ∈ R^(c×1×1) and F2 ∈ R^(c×1×1) indicate that F1 and F2 are both vectors in the c×1×1-dimensional real number domain space;
S313, performing a concat operation on the two spatial aggregation masks, and sequentially applying convolution and sigmoid operations to the result of the concat operation to fuse it into the final channel attention weight Wc.
Further, the spatial information includes the global information and the saliency information of the space corresponding to each 1×h×w two-dimensional matrix.
The beneficial effects brought by this technical scheme are as follows:
1) According to the technical scheme, domain gaps can be effectively suppressed and discriminative pedestrian features enhanced without using target domain data, thereby enhancing the generalization capability of the recognition network model; by means of the residual connection idea, the instance normalization can suppress style differences while preventing information loss, so that the extracted features have domain invariance and retain their discriminability; spatial information is fused into the channels through the attention unit CAB, and the feature weight of each channel is adaptively adjusted by constructing the dependency relationship among the channels, so that the pedestrian features are effectively enhanced.
2) According to the technical scheme, the NEM loss function constraint is introduced so that the recognition network model learns domain-invariant features, the intra-class distance of the features is reduced, and the feature distribution is optimized.
Detailed Description
The technical solutions for achieving the objects of the present invention are further illustrated by the following specific examples, and it should be noted that the technical solutions claimed in the present invention include, but are not limited to, the following examples.
Example 1
The embodiment discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement; as a basic implementation scheme of the invention, the method comprises the steps of establishing a recognition network model, image feature normalization, image feature recovery and image feature output.
Establishing a recognition network model includes establishing a normalization enhancement module NEM with an instance normalization unit IN (namely IN in FIG. 2), a residual weight training unit CMS (namely CMS in FIG. 2) and an attention unit CAB (namely CA in FIG. 2) as shown in FIG. 2, and, taking a ResNet50 model as the backbone network, inserting the normalization enhancement module NEM into the ResNet50 model to form the recognition network model.
Normalizing the image features, namely normalizing the style of the features by calculating the mean and variance in each channel of the image features, so that the style differences between different domains can be suppressed, specifically comprises the following steps:
S11, extracting the feature information carried by the pedestrian image features x ∈ R^(c×h×w) based on the ResNet50 model, wherein x is the input feature of the normalization enhancement module NEM, c is the number of channels of the image features, h is the height of the image features, w is the width of the image features, R^(c×h×w) represents a real number domain space of dimensions c×h×w, and x ∈ R^(c×h×w) indicates that the input feature x is a vector in the c×h×w-dimensional real number domain space; the feature information carried by x ∈ R^(c×h×w) comprises a style and a shape; the style comprises the imaging style of the image and the clothing style of the pedestrian, and the shape is the contour shape of the pedestrian in the image;
S12, obtaining the mean μ(x) and the variance σ(x) of the input feature x ∈ R^(c×h×w) in each channel by using the instance normalization unit IN, and calculating the normalized feature x1 based on the obtained mean μ(x) and variance σ(x), the calculation formula being as follows:
wherein μ(x) and σ(x) represent the mean value and the variance value calculated over the spatial dimensions (h×w) of the image feature, respectively; γ and β are both learnable parameter vectors, and γ ∈ R^c and β ∈ R^c indicate that γ and β are both vectors in the c-dimensional real number domain space; the elements of γ and β are initially set to 1 and 0 respectively, and are then updated automatically during training. Specifically, γ is initialized as an all-ones vector and β as an all-zeros vector, and their values change automatically according to the back-propagated gradients during training; their function is to ensure that the originally learned characteristics are retained after each piece of data is normalized, while still completing the normalization operation and accelerating training.
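For reference, PyTorch's built-in per-channel instance normalization with affine parameters follows the same initialization (the weight, playing the role of γ, starts at 1 and the bias, playing the role of β, starts at 0, and both are updated by back-propagation), so a minimal equivalent of the IN unit could simply be:

```python
import torch.nn as nn

# affine=True creates the learnable per-channel gamma (weight, initialized to 1)
# and beta (bias, initialized to 0); the channel count 256 is only illustrative.
in_unit = nn.InstanceNorm2d(num_features=256, affine=True)
```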
Although image feature normalization helps reduce the style differences that cause inter-domain gaps, if the style itself contains discriminative information for pedestrian re-identification, eliminating the style differences may also cause significant information loss. For example, the clothing of pedestrians is important re-identification discrimination information, and the texture of the clothing fabric clearly belongs to the style; when the style is suppressed, the discriminability of the feature is weakened. Therefore, by means of the residual connection idea, the image feature normalization can suppress style differences while preventing information loss, so that the extracted features have domain invariance and retain their discriminability. This is specifically realized by the image feature recovery, which comprises the following steps:
S21, learning a residual weight Wr from the normalized feature x1 by means of the residual weight training unit CMS, namely:
Wr = sigmoid(mean(conv(x1)))
wherein conv(·) represents a convolution, mean(·) represents the global mean, and sigmoid(·) represents the sigmoid activation function; that is, the normalized feature x1 first passes through a convolution layer with a kernel size of 3×3×c, a stride of 2 and a single output channel, which compresses the information contained in the normalized feature x1 in the spatial and channel dimensions; the mean value within each channel is then calculated, further compressing the spatial information; finally, after sigmoid mapping, the residual weight Wr between 0 and 1 is obtained, i.e. the residual weight Wr ∈ R^1;
S22, fusing the input feature x and the normalized feature x1 based on the residual weight Wr, and recovering the discrimination information lost by the image features due to style normalization, the fusion formula being as follows:
x2 = Wr × x1 + (1 - Wr) × x
wherein x2 is the restored image feature, named the recovered feature, and x2 ∈ R^(c×h×w) indicates that the recovered feature x2 is a vector in the c×h×w-dimensional real number domain space.
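A sketch of the residual weight training unit CMS following this description (a 3×3×c convolution with stride 2 and a single output channel, a per-channel mean, then a sigmoid) might look like the following; the padding and any hyper-parameters beyond those stated are assumptions:

```python
import torch
import torch.nn as nn

class ResidualWeightUnit(nn.Module):
    """Sketch of CMS: learns a scalar residual weight Wr in (0, 1) from x1."""

    def __init__(self, channels: int):
        super().__init__()
        # 3 x 3 x c kernel, stride 2, one output channel (padding is an assumption)
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, stride=2, padding=1)

    def forward(self, x1: torch.Tensor) -> torch.Tensor:
        y = self.conv(x1)                     # compress spatial and channel information
        y = y.mean(dim=(2, 3), keepdim=True)  # mean over the remaining spatial map
        return torch.sigmoid(y)               # Wr per sample, shape (batch, 1, 1, 1)

# The fusion of step S22 then broadcasts Wr over x and x1:
#   x2 = w_r * x1 + (1 - w_r) * x
```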
Since, in the feature extraction process of the ResNet50 model (referring to the overall feature extraction process, not a single stage), spatial information is gradually compressed and pedestrian-related information gradually shifts into the channel dimension, it is necessary to enhance the pedestrian features by means of channel attention; that is, the image feature output includes the following steps:
S31, exploring the correlation between the different channels of the recovered feature x2 using the attention unit CAB, so that attention is focused on the most meaningful parts of the pedestrian image, and adaptively extracting the channel attention weight Wc, namely:
Wc = ca(x2)
wherein ca(·) is the attention unit CAB, and the channel attention weight Wc measures the importance of the information of each channel of x2;
S32, filtering the recovered feature x2 with the channel attention weight Wc to enhance the characterization capability of the pedestrian features, namely:
f = (Wc + 1) × x2
wherein f is the output feature of the normalization enhancement module NEM.
According to the technical scheme, domain gaps can be effectively suppressed and discriminative pedestrian features enhanced without using target domain data, thereby enhancing the generalization capability of the recognition network model; by means of the residual connection idea, the instance normalization can suppress style differences while preventing information loss, so that the extracted features have domain invariance and retain their discriminability; spatial information is fused into the channels through the attention unit CAB, and the feature weight of each channel is adaptively adjusted by constructing the dependency relationship among the channels, so that the pedestrian features are effectively enhanced.
Example 2
The embodiment discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred implementation scheme of the invention; that is, on the basis of Example 1, the ResNet50 model comprises a Res1 unit, a Res2 unit, a Res3 unit, a Res4 unit, a Res5 unit and a Head unit which are sequentially connected, a normalization enhancement module NEM is inserted after each Res unit or after part of the Res units of the ResNet50 model, and each normalization enhancement module NEM enhances the features at its corresponding stage, so that the overall effect is good. In the ResNet50 model, the features obtained by the Res1 unit are too shallow and basically contain no semantic information such as style; inserting a normalization enhancement module NEM after the Res1 unit would hardly play a role in feature enhancement and would further increase the complexity of the model, and therefore no normalization enhancement module NEM is inserted after the Res1 unit when designing the recognition network model.
Specifically, after which Res units the NEM gives the best effect can be verified experimentally. As shown in FIG. 4: NEM23 indicates that normalization enhancement modules NEM are inserted at the output ends of the Res2 and Res3 units, NEM234 indicates insertion at the output ends of the Res2, Res3 and Res4 units, NEM2345 indicates insertion at the output ends of the Res2, Res3, Res4 and Res5 units, NEM345 indicates insertion at the output ends of the Res3, Res4 and Res5 units, and NEM45 indicates insertion at the output ends of the Res4 and Res5 units; in addition, M, D and MS on the abscissa represent the three common pedestrian re-identification data sets Market1501, DukeMTMC-reID and MSMT17, respectively; M-D represents training the model on Market1501 and then testing the trained model for pedestrian re-identification on DukeMTMC-reID, and D-M, MS-M and MS-D follow the same principle; the ordinate represents the mAP accuracy. As can be seen from FIG. 4, NEM2345 gives the best effect and can effectively enhance the cross-domain pedestrian re-identification performance of the model, so normalization enhancement modules NEM are inserted at the output ends of the Res2, Res3, Res4 and Res5 units respectively, namely NEM2, NEM3, NEM4 and NEM5 as shown in FIG. 1. Thus, the working procedure of the recognition network model in this technical scheme is as follows: the Res2 unit of the ResNet50 model extracts the image features of the original image, the image features undergo image feature normalization, image feature recovery and image feature output through NEM2, and are then sent to the Res3 unit of the ResNet50 model to extract deeper pedestrian features; NEM3 then continues the image feature normalization, image feature recovery and image feature output processing on the features of the Res3 unit, and so on, with the Res4 unit and NEM4 and the Res5 unit and NEM5 following the same principle, until the Head unit of the ResNet50 model finally produces the output.
Example 3
This example discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred implementation of the present invention; that is, on the basis of Example 2, in order to promote better clustering characteristics of the features, the method further includes introducing an NEM loss function into the normalization enhancement module NEM at the output end of the Res5 unit to constrain that normalization enhancement module NEM (i.e., NEM5), since the features extracted by NEM5 are expected to have better domain invariance and discriminability. Therefore, the image feature output of this normalization enhancement module NEM further includes the following steps:
S33, respectively calculating the center loss Cx of the input feature x and the center loss Cf of the output feature f, in order to measure the intra-class dispersion of the input feature x and the output feature f in the feature space, the calculation formula being as follows:
wherein cxj ∈ R^d represents the class center of the j-th pedestrian in the input feature x; cfj ∈ R^d represents the class center of the j-th pedestrian in the output feature f; n represents the total number of pedestrians in the data set, m represents the total number of features of the j-th pedestrian, xji represents the i-th feature of the j-th pedestrian in the input feature x, d represents the dimension of each feature, and R^d represents the d-dimensional real number domain space, i.e. cxj and cfj are both vectors in the d-dimensional real number domain space;
S34, establishing the NEM loss function based on the center losses Cf and Cx, the NEM loss function being as follows:
wherein LNEM is the loss value calculated from the input feature x and the output feature f of NEM5.
According to the technical scheme, the NEM loss function constraint is introduced so that the recognition network model learns domain-invariant features, the intra-class distance of the features is reduced, and the feature distribution is optimized.
Example 4
This example discloses a cross-domain pedestrian re-identification method based on normalization and feature enhancement, which is a preferred embodiment of the present invention; that is, in step S31 of Example 1, as shown in FIG. 3, obtaining the channel attention weight Wc comprises the following steps:
S311, performing maximum pooling and average pooling along the channel dimension of the recovered feature x2 to obtain two 1×h×w two-dimensional matrices, and multiplying the recovered feature x2 element by element with each of the two 1×h×w two-dimensional matrices, so as to introduce the spatial information corresponding to each of the two 1×h×w two-dimensional matrices into the channels of the recovered feature x2;
S312, in order to compute the channel attention effectively, the spatial dimension of the relevant features needs to be compressed; average pooling is generally used to aggregate the spatial information so as to focus more on global information, but maximum pooling can also capture the distinctive features of a pedestrian and thereby infer finer channel information; therefore, maximum pooling and average pooling are respectively performed along the spatial dimension on the features into which the spatial information has been introduced, to generate two spatial aggregation masks F1 and F2, with F1 ∈ R^(c×1×1) and F2 ∈ R^(c×1×1); these masks respectively focus on the global information and the distinctive pedestrian information in the feature map; wherein R^(c×1×1) represents a real number domain space of dimensions c×1×1, and F1 ∈ R^(c×1×1) and F2 ∈ R^(c×1×1) indicate that F1 and F2 are both vectors in the c×1×1-dimensional real number domain space;
S313, performing a concat (vector concatenation) operation on the two spatial aggregation masks, and sequentially applying convolution and sigmoid operations to the result of the concat operation to fuse it into the final channel attention weight Wc.
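A sketch of the attention unit CAB following steps S311-S313 is given below; which pooling operation pairs with which branch, the 1×1 fusion convolution, and all names are assumptions of this illustration rather than details fixed by the text:

```python
import torch
import torch.nn as nn

class ChannelAttentionUnit(nn.Module):
    """Sketch of CAB: injects spatial information into the channels of x2 and
    produces a channel attention weight Wc of shape (batch, c, 1, 1)."""

    def __init__(self, channels: int):
        super().__init__()
        # Fuse the two concatenated aggregation masks back to c channels (1x1 conv assumed).
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x2: torch.Tensor) -> torch.Tensor:
        # S311: max/avg pooling along the channel dimension -> two 1 x h x w maps,
        # multiplied element by element into x2 to introduce spatial information.
        max_map, _ = x2.max(dim=1, keepdim=True)   # salient spatial information
        avg_map = x2.mean(dim=1, keepdim=True)     # global spatial information
        feat_max = x2 * max_map
        feat_avg = x2 * avg_map
        # S312: spatial max/avg pooling -> two aggregation masks F1, F2 in R^(c x 1 x 1).
        f1 = torch.amax(feat_max, dim=(2, 3), keepdim=True)
        f2 = feat_avg.mean(dim=(2, 3), keepdim=True)
        # S313: concat, convolve and apply sigmoid to obtain the channel attention weight Wc.
        return torch.sigmoid(self.fuse(torch.cat([f1, f2], dim=1)))

# Usage in step S32: f = (w_c + 1) * x2
```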