Background
The purpose of pedestrian retrieval is to find a target person of interest in a given set of scene images. With the spread of surveillance equipment and advances in technology, the volume of surveillance data keeps growing, so analyzing and mining this data by computer has become a key means of retrieving target people efficiently. Cross-modal pedestrian retrieval queries a target person image from an image library using a descriptive text. Early pedestrian retrieval efforts focused mainly on searching for images using an existing pedestrian image as the query. In practical applications, however, a template image of the target pedestrian is usually not available in advance, whereas a text description provides a comparatively comprehensive way to describe a person's attributes. Pedestrian retrieval based on text descriptions is therefore more flexible and general, and has wide application in public security, social management, personal photo-album search, and similar fields. As a task that fuses pedestrian retrieval with image-text retrieval, text-based cross-modal pedestrian retrieval mines fine-grained attribute information from images and texts, and establishes the identity correspondence between images and texts by constructing highly discriminative local features.
As researchers have studied image-text retrieval and pedestrian re-identification in ever greater depth, many text-based cross-modal pedestrian retrieval methods have been proposed.
Northwestern Polytechnical University, in its patent application for a cross-modal retrieval method from text to pedestrian images (application number 2021104547243, application publication number CN 113221680 A), discloses a text-based pedestrian retrieval method in which visual feature extraction is dynamically guided by the text. The method first divides each level of visual features into several horizontal stripe regions; a text-based filter generator then produces description-specific filters that indicate how important each image region is to the input text; a text-guided visual feature extractor dynamically fuses the partial visual features for each text description; finally, cross-modal feature matching is performed between the text feature vectors and the final visual features. Through this interaction of cross-modal information, the method retrieves pedestrian images from text and further improves the accuracy of the pedestrian retrieval task. However, the method still has two shortcomings. First, it only cuts the image features horizontally, and because this cutting operates on pixels, local regions such as bags, clothes, and trousers may be split apart, losing the integrity of the identified regions. Second, because every person dresses and walks differently, the position and manner of the horizontal cuts vary from person to person, so some horizontal stripes are dominated by background information, which makes fine-grained feature extraction inaccurate.
Chen et al., in "A simple but effective part-based convolutional baseline for text-based person search" (Neurocomputing, 2022), disclose a text-based cross-modal pedestrian retrieval method. The method provides a simple and effective end-to-end learning framework for text-based person search and extracts visual and textual local representations through a dual-path local comparison network. The visual local representations are obtained by horizontal partitioning, in which the pedestrian image is split horizontally into several stripes. In the text representation-learning branch, word embeddings are learned by a pre-trained BERT model with fixed parameters and further processed by multi-branch residual networks; in each branch, the model learns text representations that adaptively match the corresponding visual local representations, so as to extract aligned textual local representations. A multi-stage cross-modal matching strategy is also provided, which eliminates the modality gap between low-level, local, and global features and thereby gradually reduces the feature discrepancy between the image and text domains. However, the method does not take the correlation of information within a modality into account. Specifically, each pedestrian image can be divided into several regions, each containing fine-grained characteristics to varying degrees, such as shoes, trousers, and jackets; these fine-grained features should interact with the global features and complement one another. Moreover, pedestrian images often contain noise from different viewing angles, illumination, and backgrounds, so the global representation is easily disturbed by such noise and acquires semantic errors, its direction drifts away from that of the local representations, and retrieval accuracy drops.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a text-based pedestrian retrieval method constrained by bounding-box extraction and semantic consistency. It addresses two problems of existing text-to-pedestrian-image cross-modal retrieval methods: the key fine-grained information that truly identifies identity in the image is not used effectively, so part of the fine-grained semantics is lost; and cross-modal matching between image and text features receives excessive attention, while the consistency of the semantic representations of information at different granularities within a modality is ignored.
The invention introduces a more accurate and complete fine-grained information extraction method. Using a predefined vocabulary of interest as a text prompt, it obtains relevant bounding boxes in a targeted manner, such as body parts, clothing details, accessories, and other local regions that are critical for identifying a pedestrian, and thus obtains more accurate and richer fine-grained features. This solves the problem that current text-to-pedestrian-image cross-modal retrieval methods fail to use effectively the key fine-grained information that truly identifies identity in the image, which causes part of the fine-grained semantics to be lost. The invention also designs a constraint method that keeps the semantics of image features at different granularities consistent: by maximizing the mutual information between global and local features, the model's image representations at different granularities remain centered on identity-representative features. This avoids the ambiguity between the global and local semantic representations of the same identity caused by noise such as different viewing angles, illumination, and backgrounds, and thereby addresses the prior-art problem of semantic consistency of information at different granularities within a modality.
The technical scheme adopted by the invention comprises the following steps:
Step 1, extracting fine-grained bounding boxes for each image in the image-text pairs of the dataset:
inputting the text prompt describing pedestrian attributes, together with each image, into the phrase grounding model GLIP, and extracting the bounding boxes of each image in the dataset;
Step 2, extracting fine-grained noun phrases from each text in the image-text pairs of the dataset;
Step 3, generating a training set:
forming a sample from each image with its corresponding fine-grained bounding boxes and each text with its corresponding fine-grained noun phrases, and forming the training set from all samples in the dataset;
Step 4, constructing a fine-grained aggregation network:
Step 4.1, constructing a sub-network consisting of the CLIP image encoder and text encoder, where the image encoder is CLIP ViT-B/16, the text encoder is the CLIP text Transformer, both encoders consist of 12 Transformer blocks connected in series with a fully connected layer, and the output vectors are 512-dimensional;
Step 4.2, constructing the fine-grained aggregation network from two parallel branches, where the first branch connects the image encoder in series with a bidirectional GRU and the second branch connects the text encoder in series with a bidirectional GRU;
Step 5, training the fine-grained aggregation network:
inputting the training set into the fine-grained aggregation network, where the image encoder outputs the global feature of each image by forward propagation and the text encoder outputs the global feature of each text by forward propagation;
calculating the inter-modal semantic alignment loss and the identity classification loss from the global features of the images and texts, calculating the semantic consistency constraint loss from the global and local features of the images, taking the sum of the three losses as the total objective loss, and iteratively updating the network parameters until the total objective loss converges, to obtain the trained fine-grained aggregation network;
Step 6, retrieving pedestrians using text:
Step 6.1, obtaining the noun phrases of the text to be searched and the bounding boxes of the pedestrian images to be searched by the methods of Step 1 and Step 2, and inputting them into the trained fine-grained aggregation network to obtain the text global and local features and the image global and local features;
Step 6.2, computing the global similarity and the local similarity between the text to be searched and each pedestrian image to be searched, combining them by weighting into a total similarity, sorting the pedestrian images in descending order of total similarity, and taking the first 10 images of the ranking as the retrieval result.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention uses a predefined vocabulary of interest as a text prompt to obtain relevant bounding boxes in a targeted manner. This overcomes the problem that existing text-to-pedestrian-image cross-modal retrieval methods do not effectively use the key fine-grained information that truly identifies identity, which causes part of the fine-grained semantics to be missing. As a result, the invention can extract fine-grained regions with stronger identity discrimination, adapts better to different pedestrian poses and image scales, enhances the recognition and discrimination of pedestrian identities, is unaffected by occlusion while extracting the fine-grained regions, and improves retrieval accuracy.
Secondly, the invention designs a consistency constraint method that keeps the mutual information between image features of different granularities high, so that global features and local details complement each other better. This solves the prior-art problem of focusing excessively on cross-modal matching between image and text features while ignoring the semantic consistency of information at different granularities within a modality. The model can therefore learn identity-consistent representations across different image views, avoiding interference from background noise and the loss of key information, which improves the robustness and stability of the model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The implementation steps of the embodiment of the present invention will be further described with reference to fig. 1.
Step 1, extracting fine-grained bounding boxes for each image in the image-text pairs of the dataset.
The text prompt describing pedestrian attributes and each image are input simultaneously into the phrase grounding model GLIP, and the bounding boxes of each image in the dataset are extracted.
Referring to FIG. 2, the specific process is described in further detail. First, the dataset is traversed to collect vocabulary related to body characteristics, clothing styles, and personal belongings, such as, but not limited to, "male," "jacket," "short sleeve," "skirt," "pants," "high-heeled shoes," "backpack," and so on. Because many words describe synonymous or similar concepts, the similar words are generalized into broader categories to reduce redundancy of the vocabulary and obtain a generalized vocabulary; for example, "jeans" and "sports pants" are abstracted to "pants," and "woman" and "man" are abstracted to "person." The generalized concept words used by the invention are: person, coat, pants, bag, glasses, luggage, head, collar, cap, shoes, bike, headphone, shirt, pocket, legs, car, dress, hair, toy, hands, box, book, mask, cup, and cellphone. This vocabulary guides the model to attend to the key regions of the image related to pedestrian identity features and guides the extraction of fine-grained bounding boxes. The large phrase grounding model GLIP is used to extract bounding boxes as accurate fine-grained information of the pedestrian images. GLIP is a model for locating targets of interest in images; it can identify and locate target regions in an image according to important phrases and phrase combinations in a text. Here GLIP is adapted from a target-grounding model into a bounding-box extraction model: each word in the text prompt is treated as a text query, the corresponding word features are matched against the image features by computing text-image similarity, and the region coordinates covered by word features with high confidence are selected as the bounding boxes of targets of interest, capturing the fine-grained regions of some vocabulary words in the pedestrian image. For each image I, the 8 bounding boxes with the highest confidence are kept as the image's fine-grained sequence I_bbox = (b1, b2, …, b8).
The text prompt describing pedestrian attributes is ['person.coat.pants.bag.glasses.luggage.head.collar.cap.shoes.bike.headphone.shirt.pocket.legs.car.dress.hair.toy.hands.box.book.mask.cup.cellphone.'].
Extracting the bounding boxes of each image in the dataset means inputting the text prompt describing pedestrian attributes, together with each image, into the phrase grounding model GLIP, and selecting the 8 bounding boxes with the highest confidence from the detection layers output by GLIP as the fine-grained information of the image, which is stored in a list sequence.
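The extraction in Step 1 can be sketched as follows. The sketch assumes a hypothetical wrapper glip_predict(image, prompt) around a pretrained GLIP phrase-grounding model that returns candidate boxes with confidence scores; this is not the real GLIP API, and only the prompt string and the top-8 selection follow the description above.

```python
# Sketch of step 1, assuming a hypothetical `glip_predict` wrapper around GLIP.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

PEDESTRIAN_PROMPT = ("person.coat.pants.bag.glasses.luggage.head.collar.cap.shoes."
                     "bike.headphone.shirt.pocket.legs.car.dress.hair.toy.hands."
                     "box.book.mask.cup.cellphone.")

def extract_fine_grained_boxes(
        image,
        glip_predict: Callable[[object, str], Tuple[Sequence[Box], Sequence[float]]],
        top_k: int = 8) -> List[Box]:
    """Keep the top_k highest-confidence boxes grounded by the prompt (I_bbox)."""
    boxes, scores = glip_predict(image, PEDESTRIAN_PROMPT)  # hypothetical interface
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:top_k]]               # I_bbox = (b1, ..., b8)
```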
Step 2, extracting fine-grained noun phrases from each text in the image-text pairs of the dataset.
Referring to FIG. 2, the detailed process is further described. Noun phrases appearing in the text description are extracted with the spaCy toolkit. First, the spaCy library is installed and the corresponding English model "en_core_web_sm" is loaded; the text is then passed to the language model with the nlp() function to obtain a document object. Each word in the document is traversed to determine whether its part-of-speech tag is a noun and, if so, whether it is the head of a noun phrase; the noun_chunks attribute is used to obtain each noun phrase p. The noun phrases in the document are collected by iteration and stored in a list sequence of 15 phrases per text: if fewer than 15 noun phrases are extracted, the sequence is filled with the special padding field ['pad']; if more than 15 are extracted, the trailing noun phrases are cut off. The extracted fine-grained sequence of noun phrases is T_phrase = (p1, p2, …, p15).
Extracting fine-grained noun phrases from each text in the image-text pairs of the dataset means generating, for each text description, a list sequence that stores the noun phrases appearing in that description. When the length of a list sequence is smaller than its rated length, the 'pad' field is appended repeatedly at the end of the sequence until the sequence reaches the rated length, and the part of a list sequence longer than the rated length is truncated; the rated length of the list sequences is the average length of all list sequences.
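A minimal sketch of the noun-phrase extraction of Step 2 is given below. It uses spaCy's noun_chunks attribute and pads or truncates to 15 phrases with the 'pad' token as described above; the fixed length and token follow the text, the rest is illustrative.

```python
# Sketch of step 2: spaCy noun-phrase extraction with padding/truncation to 15 phrases.
import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline named above

def extract_noun_phrases(text: str, rated_length: int = 15, pad_token: str = "pad") -> list:
    doc = nlp(text)                                        # process the description
    phrases = [chunk.text for chunk in doc.noun_chunks]    # noun phrases p1, p2, ...
    phrases = phrases[:rated_length]                       # truncate trailing phrases
    phrases += [pad_token] * (rated_length - len(phrases)) # pad short sequences
    return phrases                                         # T_phrase = (p1, ..., p15)

# Example: extract_noun_phrases("a woman wearing a red coat and carrying a black backpack")
```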
Step 3, generating a training set.
Each image with its corresponding fine-grained bounding boxes and each text with its corresponding fine-grained noun phrases form a sample, and all samples in the dataset form the training set.
Each image-text pair and its extracted fine-grained information form a training sample X = (I, T, I_bbox, T_phrase), and the N training samples together with the corresponding identity label set form the training set S = {(X_n, L_n) | 1 ≤ n ≤ N}.
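An illustrative way to bundle one training sample X = (I, T, I_bbox, T_phrase) with its identity label is sketched below; the container and field names are placeholders, not part of the invention.

```python
# Illustrative container for one training sample X = (I, T, I_bbox, T_phrase) with label L_n.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrainingSample:
    image_path: str                                   # pedestrian image I
    text: str                                         # text description T
    bboxes: List[Tuple[float, float, float, float]]   # I_bbox, 8 fine-grained boxes
    phrases: List[str]                                # T_phrase, 15 noun phrases
    identity: int                                     # identity label L_n

# training_set = [TrainingSample(...), ...]  # S = {(X_n, L_n) | 1 <= n <= N}
```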
Step 4, constructing a fine-grained aggregation network.
Step 4.1, constructing a sub-network consisting of the CLIP image encoder and text encoder, where the image encoder is CLIP ViT-B/16, the text encoder is the CLIP text Transformer, both encoders consist of 12 Transformer blocks connected in series with a fully connected layer, and the output vectors are 512-dimensional;
Step 4.2, constructing the fine-grained aggregation network from two parallel branches, where the first branch connects the image encoder in series with a bidirectional GRU and the second branch connects the text encoder in series with a bidirectional GRU.
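A minimal sketch of the two-branch network of Step 4 is shown below. It assumes the openai clip package for the ViT-B/16 image encoder and the CLIP text encoder, and a simplified GPO-style pooling head in which a bidirectional GRU predicts one weight per sequence position; the invention's exact architecture and hidden sizes may differ.

```python
# Minimal sketch of the fine-grained aggregation network, assuming the openai `clip`
# package (512-d ViT-B/16 features) and a simplified GPO-style pooling head.
import clip
import torch
import torch.nn as nn

class GPOPool(nn.Module):
    """Bidirectional GRU predicts a weight per position; features are weight-summed."""
    def __init__(self, dim: int = 512, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:     # seq: (B, K, dim)
        h, _ = self.gru(seq)                                   # (B, K, 2*hidden)
        theta = torch.softmax(self.fc(h).squeeze(-1), dim=1)   # (B, K) position weights
        return (theta.unsqueeze(-1) * seq).sum(dim=1)          # aggregated local feature

class FineGrainedAggregationNet(nn.Module):
    def __init__(self, device: str = "cpu"):
        super().__init__()
        self.clip_model, self.preprocess = clip.load("ViT-B/16", device=device)
        self.image_pool = GPOPool()   # first branch: image encoder + bidirectional GRU
        self.text_pool = GPOPool()    # second branch: text encoder + bidirectional GRU

    def encode_image_global(self, images):    # images: preprocessed image tensor batch
        return self.clip_model.encode_image(images)

    def encode_text_global(self, token_ids):  # token_ids: output of clip.tokenize
        return self.clip_model.encode_text(token_ids)
```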
Step 5, training the fine-grained aggregation network.
The training process is described in detail with reference to fig. 3.
The training set is input into the fine-grained aggregation network. The image encoder outputs the global feature of each image by forward propagation, and the text encoder outputs the global feature of each text by forward propagation; all bounding boxes of each image are forward-propagated through the fine-grained aggregation network to obtain an image local feature, and the noun phrase sequence of each text is forward-propagated through the fine-grained aggregation network to obtain a text local feature.
Specifically, a pedestrian image I is input into CLIP-ViT to obtain the image global feature I_global, and the region covered by each bounding box in the corresponding bounding-box sequence I_bbox is cropped out and input into CLIP-ViT as an image to obtain the region feature sequence [I_l1, I_l2, …, I_l8] of the image. The pedestrian text description T and the corresponding noun phrase sequence T_phrase are tokenized with the CLIP tokenizer to generate word-level vector representations, and the global feature T_global of the text description and the phrase feature sequence [T_l1, T_l2, …, T_l15] are extracted with the CLIP text encoder. After the fine-grained local image-text features are obtained, the local features usually need to be aggregated into an overall fine-grained feature representation. The common practice of aggregating with average pooling or max pooling often considers only the statistics of the local features, ignores their correlations and interactions, and may lose feature information and create an information bottleneck. The invention instead uses GPO, a flexible pooling operator, to aggregate the image-text local feature sequences: each feature sequence is input into the bidirectional GRU of the fine-grained aggregation network, the pooling-operator weight coefficients θ of the feature vectors in the sequence are learned, and the feature sequence is weighted by these coefficients to obtain the local feature. The local feature of image I is I_local = GPO([I_l1, I_l2, …, I_l8]), and the local feature of text T is T_local = GPO([T_l1, T_l2, …, T_l15]).
The inter-modal semantic alignment loss and the identity classification loss are calculated from the global features of the images and texts, the semantic consistency constraint loss is calculated from the global and local features of the images, and the sum of the three losses is taken as the total objective loss; the network parameters are updated iteratively until the total objective loss converges, yielding the trained fine-grained aggregation network.
Forward-propagating all bounding boxes of each image through the fine-grained aggregation network to obtain an image local feature means that all bounding boxes of each image are forward-propagated through the image encoder to obtain a region feature sequence ν. The bidirectional GRU learns a trainable parameter vector θ over this sequence as the feature weighting coefficients of the different positions, and the GPO aggregation operator weights the feature vector at each position to obtain the aggregated image local feature I_local = GPO(ν) = Σ_{k=1}^{8} θ_k ν_k, where ν_k denotes the k-th region feature of the region feature sequence, θ_k denotes the weight of the k-th region feature, and GPO(·) denotes the aggregation operator.
Forward-propagating the noun phrase sequence of each text through the fine-grained aggregation network to obtain a text local feature means that the noun phrase sequence of each text is forward-propagated through the text encoder to obtain a phrase feature sequence ω. The bidirectional GRU learns a trainable parameter vector θ over this sequence as the feature weighting coefficients of the different positions, and the GPO aggregation operator weights the feature vector at each position to obtain the aggregated text local feature T_local = GPO(ω) = Σ_{k=1}^{15} θ_k ω_k, where ω_k denotes the k-th phrase feature of the phrase feature sequence, θ_k denotes the weight of the k-th phrase feature, and GPO(·) denotes the aggregation operator.
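The local-feature computation just described can be sketched as below, reusing the FineGrainedAggregationNet and GPOPool classes from the earlier sketch (themselves assumptions, not the invention's exact implementation): the 8 box regions are cropped and encoded by the CLIP image encoder, the 15 noun phrases are encoded by the CLIP text encoder, and each sequence is aggregated by the GPO-style pooling head.

```python
# Sketch of GPO aggregation of the region and phrase feature sequences; `net` is an
# instance of the FineGrainedAggregationNet sketched earlier (an assumption).
import clip
import torch

def image_local_feature(net, pil_image, boxes, device="cpu"):
    crops = torch.cat([net.preprocess(pil_image.crop(tuple(int(c) for c in box))).unsqueeze(0)
                       for box in boxes])                                   # 8 cropped regions
    region_feats = net.clip_model.encode_image(crops.to(device)).float()    # nu: (8, 512)
    return net.image_pool(region_feats.unsqueeze(0))        # I_local = GPO(nu), shape (1, 512)

def text_local_feature(net, phrases, device="cpu"):
    tokens = clip.tokenize(phrases).to(device)                              # 15 noun phrases
    phrase_feats = net.clip_model.encode_text(tokens).float()               # omega: (15, 512)
    return net.text_pool(phrase_feats.unsqueeze(0))          # T_local = GPO(omega), shape (1, 512)
```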
The semantic consistency constraint loss is calculated by the following formula:
L_info = -E_p[ log( exp((I_global)^T I_local^+ / τ) / ( exp((I_global)^T I_local^+ / τ) + Σ_{n=1}^{N} exp((I_global)^T I_local^{n-} / τ) ) ) ]
where L_info denotes the semantic consistency constraint loss, E_p denotes the expectation, log denotes the logarithm with natural base e, exp denotes the exponential with natural base e, I_global denotes the image global feature, I_local^+ denotes the positive local feature, τ denotes a learnable temperature parameter, N denotes the number of unmatched image-text pairs in a batch, I_local^{n-} denotes the n-th negative local feature, and the superscript T denotes the transpose operation.
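A minimal PyTorch sketch of this InfoNCE-style consistency constraint is given below, using in-batch negatives: each image's global feature should score higher with its own aggregated local feature than with the local features of the other samples in the batch. The batch layout and the default temperature are assumptions.

```python
# Sketch of the semantic consistency (InfoNCE-style) loss with in-batch negatives.
import torch
import torch.nn.functional as F

def semantic_consistency_loss(img_global, img_local, tau: float = 0.07):
    # img_global, img_local: (B, 512); row i of each belongs to the same image
    img_global = F.normalize(img_global, dim=-1)
    img_local = F.normalize(img_local, dim=-1)
    logits = img_global @ img_local.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(img_global.size(0), device=img_global.device)
    return F.cross_entropy(logits, targets)            # -E[log softmax of the positive pair]
```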
The inter-modal semantic alignment loss is calculated by the following formulas:
S_global = (I_global)^T (T_global)
S_local = (I_local)^T (T_local)
L_align^global = (1/N) Σ_{i=1}^{N} [ log(1 + exp(-τ_p (S_i^+ - α))) + log(1 + exp(τ_n (S_i^- - β))) ]
L_align^local is computed in the same way from S_local, and L_align = L_align^global + L_align^local,
where S_global denotes the global feature similarity matrix, T_global denotes the text global feature, L_align^global denotes the global inter-modal semantic alignment loss, τ_p denotes the temperature parameter that adjusts the gradient slope for positive pairs, S_i^+ denotes the positive-pair similarity score of the i-th row of the similarity matrix, α denotes the lower similarity bound for positive pairs, τ_n denotes the temperature parameter that adjusts the gradient slope for negative pairs, S_i^- denotes the negative-pair similarity score of the i-th row of the similarity matrix, β denotes the upper similarity bound for negative pairs, S_local denotes the local feature similarity matrix, T_local denotes the text local feature, L_align^local denotes the local inter-modal semantic alignment loss, and L_align denotes the total inter-modal semantic alignment loss.
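The sketch below implements one plausible realization of this bounded alignment loss, consistent with the symbol descriptions above: log(1 + exp(·)) terms push positive-pair similarities above α and negative-pair similarities below β, with slopes controlled by τ_p and τ_n. The invention's exact formula is not confirmed by the text, and all default values here are placeholders.

```python
# One plausible form of the bounded inter-modal alignment loss (defaults are placeholders).
import torch
import torch.nn.functional as F

def bounded_alignment_loss(sim, labels, tau_p=10.0, tau_n=40.0, alpha=0.6, beta=0.4):
    # sim: (B, B) image-text similarity matrix; labels: (B,) identity labels
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)        # matched identities
    pos_term = F.softplus(-tau_p * (sim - alpha)) * pos_mask     # push S+ above alpha
    neg_term = F.softplus(tau_n * (sim - beta)) * (~pos_mask)    # push S- below beta
    return (pos_term.sum() + neg_term.sum()) / sim.size(0)

def inter_modal_alignment_loss(s_global, s_local, labels):
    # L_align = L_align_global + L_align_local
    return bounded_alignment_loss(s_global, labels) + bounded_alignment_loss(s_local, labels)
```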
The identity classification loss is calculated by the following formula:
p_v = softmax(W^T I_global)
p_t = softmax(W^T T_global)
L_id = ( -log(p_v(c)) - log(p_t(c)) ) / 2
where p_v denotes the identity prediction of the image, softmax(·) denotes the normalized exponential function, W denotes the shared identity projection matrix, p_t denotes the identity prediction of the text, L_id denotes the identity classification loss, p_v(c) denotes the probability that the image global feature correctly predicts identity class c, and p_t(c) denotes the probability that the text global feature correctly predicts identity class c.
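A sketch of this identity classification loss: a shared projection W maps image and text global features to identity logits, and the cross entropy against the true identity label is averaged over the two modalities. The number of identity classes is a placeholder.

```python
# Sketch of L_id with a shared identity projection matrix W (num_identities is a placeholder).
import torch.nn as nn
import torch.nn.functional as F

class IdentityClassificationLoss(nn.Module):
    def __init__(self, feat_dim: int = 512, num_identities: int = 1000):
        super().__init__()
        self.W = nn.Linear(feat_dim, num_identities, bias=False)  # shared projection W

    def forward(self, img_global, txt_global, identity_labels):
        logits_v = self.W(img_global)                          # identity logits from image
        logits_t = self.W(txt_global)                          # identity logits from text
        loss_img = F.cross_entropy(logits_v, identity_labels)  # -log p_v(c)
        loss_txt = F.cross_entropy(logits_t, identity_labels)  # -log p_t(c)
        return (loss_img + loss_txt) / 2                       # L_id
```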
The semantic consistency loss maximizes the mutual information between the image global features and the image region features and learns representations that are consistent across different image views of a pedestrian identity. Because the texts corresponding to different pedestrian images of the same identity may describe the person from different angles, no semantic consistency constraint is imposed on the text modality. The inter-modal semantic alignment loss uses cosine similarity to measure the correlation between visual and textual features. Encouraging the cosine similarity S^+ of positive pairs to be as large as possible while forcing the cosine similarity S^- of negative pairs to be as small as possible is a comparatively strong constraint, because different negative pairs carry different information and have different similarity scores. Two margins are therefore introduced: β is the upper similarity bound for negative pairs and α is the lower similarity bound for positive pairs, so the positive-pair similarity score S^+ is expected to be no less than α and the negative-pair similarity score S^- no greater than β. The identity classification loss computes the cross entropy between the identity prediction and the true label of an image or text, ensuring that feature representations of the same identity cluster tightly together in the joint embedding space.
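Putting the three losses together, one training iteration of Step 5 might look like the sketch below. The function and class names come from the earlier loss sketches (assumptions), and the plain unweighted sum follows the description that the three losses are added into the total objective.

```python
# Sketch of one training iteration: total loss = L_info + L_align + L_id (plain sum);
# `id_loss` is an IdentityClassificationLoss instance, the other functions are from
# the earlier sketches in this section.
def training_step(optimizer, id_loss, img_global, img_local, txt_global, txt_local, labels):
    s_global = img_global @ txt_global.t()     # S_global similarity matrix
    s_local = img_local @ txt_local.t()        # S_local similarity matrix
    total = (semantic_consistency_loss(img_global, img_local)
             + inter_modal_alignment_loss(s_global, s_local, labels)
             + id_loss(img_global, txt_global, labels))
    optimizer.zero_grad()
    total.backward()                           # iterate until the total objective converges
    optimizer.step()
    return total.item()
```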
Step 6, retrieving pedestrians using text.
Step 6.1, obtain the noun phrases of the text to be searched and the bounding boxes of the pedestrian images to be searched by the methods of Step 1 and Step 2, and input them into the trained fine-grained aggregation network to obtain the text global and local features and the image global and local features;
Step 6.2, compute the global similarity and the local similarity between the text to be searched and each pedestrian image to be searched, combine them by weighting into a total similarity, sort the pedestrian images in descending order of total similarity, and take the first 10 images of the ranking as the retrieval result.
The weighted total similarity is calculated by the following formula:
S = α·S_global + β·S_local
where α and β denote weight coefficients with α + β = 1.
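The retrieval of Step 6.2 can be sketched as follows: global and local cosine similarities are fused with weights satisfying α + β = 1 (the values used here are placeholders), and the 10 highest-scoring gallery images are returned.

```python
# Sketch of step 6.2: weighted similarity fusion and top-10 ranking (weights are placeholders).
import torch
import torch.nn.functional as F

def retrieve_top10(txt_global, txt_local, img_globals, img_locals, alpha=0.7, beta=0.3):
    # txt_*: (512,) query features; img_*: (N, 512) gallery features; alpha + beta = 1
    s_global = F.normalize(img_globals, dim=-1) @ F.normalize(txt_global, dim=0)
    s_local = F.normalize(img_locals, dim=-1) @ F.normalize(txt_local, dim=0)
    total = alpha * s_global + beta * s_local             # S = alpha*S_global + beta*S_local
    return torch.argsort(total, descending=True)[:10]     # indices of the top-10 images
```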