Background
The purpose of pedestrian retrieval is to find a target person of interest in a given set of scene images. With the spread of surveillance equipment and advances in technology, the volume of surveillance data keeps growing, so analyzing and mining this data by computer has become a key means of retrieving target people efficiently. Cross-modal pedestrian retrieval queries a target person image from an image library using a descriptive text. Early pedestrian retrieval efforts focused mainly on searching for images using an existing pedestrian image as the query. In practical applications, however, a template image of the target pedestrian is usually not available in advance, whereas a text description provides a comparatively comprehensive way to describe a person's attributes. Pedestrian retrieval based on text descriptions is therefore more flexible and general, and has wide application in public security, social management, personal photo-album search, and similar fields. As a task that fuses pedestrian retrieval with image-text retrieval, text-based cross-modal pedestrian retrieval mines fine-grained attribute information from images and texts, and establishes the identity correspondence between images and texts by constructing highly discriminative local features.
As researchers have studied image-text retrieval and pedestrian re-identification in ever greater depth, many text-based cross-modal pedestrian retrieval methods have been proposed.
Northwestern Polytechnical University, in its patent application for a cross-modal retrieval method from text to pedestrian images (application number 2021104547243, application publication number CN 113221680 A), discloses a text-based pedestrian retrieval method in which visual feature extraction is dynamically guided by the text. The method first divides each level of visual features into several horizontal stripe regions; a text-based filter generator then produces description-specific filters that indicate how important each image region is to the input text; a text-guided visual feature extractor dynamically fuses the partial visual features for each text description; finally, cross-modal feature matching is performed between the text feature vectors and the final visual features. Through this interaction of cross-modal information, the method retrieves pedestrian images from text and further improves the accuracy of the pedestrian retrieval task. However, the method still has two shortcomings. First, it only cuts the image features horizontally, and because this cutting operates on pixels, local regions such as bags, clothes, and trousers may be split apart, losing the integrity of the identified regions. Second, because every person dresses and walks differently, the position and manner of the horizontal cuts vary from person to person, so some horizontal stripes are dominated by background information, which makes fine-grained feature extraction inaccurate.
Chen et al., in "A simple but effective part-based convolutional baseline for text-based person search" (Neurocomputing, 2022), disclose a text-based cross-modal pedestrian retrieval method. The method provides a simple and effective end-to-end learning framework for text-based person search and extracts visual and textual local representations through a dual-path local comparison network. The visual local representations are obtained by horizontal partitioning, in which the pedestrian image is split horizontally into several stripes. In the text representation-learning branch, word embeddings are learned by a pre-trained BERT model with fixed parameters and further processed by multi-branch residual networks; in each branch, the model learns text representations that adaptively match the corresponding visual local representations, so as to extract aligned textual local representations. A multi-stage cross-modal matching strategy is also provided, which eliminates the modality gap between low-level, local, and global features and thereby gradually reduces the feature discrepancy between the image and text domains. However, the method does not take the correlation of information within a modality into account. Specifically, each pedestrian image can be divided into several regions, each containing fine-grained characteristics to varying degrees, such as shoes, trousers, and jackets; these fine-grained features should interact with the global features and complement one another. Moreover, pedestrian images often contain noise from different viewing angles, illumination, and backgrounds, so the global representation is easily disturbed by such noise and acquires semantic errors, its direction drifts away from that of the local representations, and retrieval accuracy drops.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a text-based pedestrian retrieval method constrained by bounding-box extraction and semantic consistency. It addresses two problems of existing text-to-pedestrian-image cross-modal retrieval methods: the key fine-grained information that truly identifies identity in the image is not used effectively, so part of the fine-grained semantics is lost; and cross-modal matching between image and text features receives excessive attention, while the consistency of the semantic representations of information at different granularities within a modality is ignored.
The invention introduces a more accurate and complete fine-grained information extraction method. Using a predefined vocabulary of interest as a text prompt, it obtains relevant bounding boxes in a targeted manner, such as body parts, clothing details, accessories, and other local regions that are critical for identifying a pedestrian, and thus obtains more accurate and richer fine-grained features. This solves the problem that current text-to-pedestrian-image cross-modal retrieval methods fail to use effectively the key fine-grained information that truly identifies identity in the image, which causes part of the fine-grained semantics to be lost. The invention also designs a constraint method that keeps the semantics of image features at different granularities consistent: by maximizing the mutual information between global and local features, the model's image representations at different granularities remain centered on identity-representative features. This avoids the ambiguity between the global and local semantic representations of the same identity caused by noise such as different viewing angles, illumination, and backgrounds, and thereby addresses the prior-art problem of semantic consistency of information at different granularities within a modality.
The technical scheme adopted by the invention comprises the following steps:
Step 1, extracting fine-grained bounding boxes for each image in the image-text pairs of the dataset:
inputting the text prompt describing pedestrian attributes, together with each image, into the phrase grounding model GLIP, and extracting the bounding boxes of each image in the dataset;
Step 2, extracting fine-grained noun phrases from each text in the image-text pairs of the dataset;
Step 3, generating a training set:
forming a sample from each image with its corresponding fine-grained bounding boxes and each text with its corresponding fine-grained noun phrases, and forming the training set from all samples in the dataset;
Step 4, constructing a fine-grained aggregation network:
Step 4.1, constructing a sub-network consisting of the CLIP image encoder and text encoder, where the image encoder is CLIP ViT-B/16, the text encoder is the CLIP text Transformer, both encoders consist of 12 Transformer blocks connected in series with a fully connected layer, and the output vectors are 512-dimensional;
Step 4.2, constructing the fine-grained aggregation network from two parallel branches, where the first branch connects the image encoder in series with a bidirectional GRU and the second branch connects the text encoder in series with a bidirectional GRU;
Step 5, training the fine-grained aggregation network:
inputting the training set into the fine-grained aggregation network, where the image encoder outputs the global feature of each image by forward propagation and the text encoder outputs the global feature of each text by forward propagation;
calculating the inter-modal semantic alignment loss and the identity classification loss from the global features of the images and texts, calculating the semantic consistency constraint loss from the global and local features of the images, taking the sum of the three losses as the total objective loss, and iteratively updating the network parameters until the total objective loss converges, to obtain the trained fine-grained aggregation network;
Step 6, retrieving pedestrians using text:
Step 6.1, obtaining the noun phrases of the text to be searched and the bounding boxes of the pedestrian images to be searched by the methods of Step 1 and Step 2, and inputting them into the trained fine-grained aggregation network to obtain the text global and local features and the image global and local features;
Step 6.2, computing the global similarity and the local similarity between the text to be searched and each pedestrian image to be searched, combining them by weighting into a total similarity, sorting the pedestrian images in descending order of total similarity, and taking the first 10 images of the ranking as the retrieval result.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention uses a predefined vocabulary of interest as a text prompt to obtain relevant bounding boxes in a targeted manner. This overcomes the problem that existing text-to-pedestrian-image cross-modal retrieval methods do not effectively use the key fine-grained information that truly identifies identity, which causes part of the fine-grained semantics to be missing. As a result, the invention can extract fine-grained regions with stronger identity discrimination, adapts better to different pedestrian poses and image scales, enhances the recognition and discrimination of pedestrian identities, is unaffected by occlusion while extracting the fine-grained regions, and improves retrieval accuracy.
Secondly, the invention designs a consistency constraint method that keeps the mutual information between image features of different granularities high, so that global features and local details complement each other better. This solves the prior-art problem of focusing excessively on cross-modal matching between image and text features while ignoring the semantic consistency of information at different granularities within a modality. The model can therefore learn identity-consistent representations across different image views, avoiding interference from background noise and the loss of key information, which improves the robustness and stability of the model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The implementation steps of the embodiment of the present invention will be further described with reference to fig. 1.
Step 1, extracting fine-grained bounding boxes for each image in the image-text pairs of the dataset.
The text prompt describing pedestrian attributes and each image are input simultaneously into the phrase grounding model GLIP, and the bounding boxes of each image in the dataset are extracted.
Referring to FIG. 2, the specific process is described in further detail. First, the dataset is traversed to collect vocabulary related to body characteristics, clothing styles, and personal belongings, such as, but not limited to, "male," "jacket," "short sleeve," "skirt," "pants," "high-heeled shoes," "backpack," and so on. Because many words describe synonymous or similar concepts, the similar words are generalized into broader categories to reduce redundancy of the vocabulary and obtain a generalized vocabulary; for example, "jeans" and "sports pants" are abstracted to "pants," and "woman" and "man" are abstracted to "person." The generalized concept words used by the invention are: person, coat, pants, bag, glasses, luggage, head, collar, cap, shoes, bike, headphone, shirt, pocket, legs, car, dress, hair, toy, hands, box, book, mask, cup, and cellphone. This vocabulary guides the model to attend to the key regions of the image related to pedestrian identity features and guides the extraction of fine-grained bounding boxes. The large phrase grounding model GLIP is used to extract bounding boxes as accurate fine-grained information of the pedestrian images. GLIP is a model for locating targets of interest in images; it can identify and locate target regions in an image according to important phrases and phrase combinations in a text. Here GLIP is adapted from a target-grounding model into a bounding-box extraction model: each word in the text prompt is treated as a text query, the corresponding word features are matched against the image features by computing text-image similarity, and the region coordinates covered by word features with high confidence are selected as the bounding boxes of targets of interest, capturing the fine-grained regions of some vocabulary words in the pedestrian image. For each image I, the 8 bounding boxes with the highest confidence are kept as the image's fine-grained sequence I_bbox = (b1, b2, …, b8).
The text prompt describing pedestrian attributes is ['person.coat.pants.bag.glasses.luggage.head.collar.cap.shoes.bike.headphone.shirt.pocket.legs.car.dress.hair.toy.hands.box.book.mask.cup.cellphone.'].
Extracting the bounding boxes of each image in the dataset means inputting the text prompt describing pedestrian attributes, together with each image, into the phrase grounding model GLIP, and selecting the 8 bounding boxes with the highest confidence from the detection layers output by GLIP as the fine-grained information of the image, which is stored in a list sequence.
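The extraction in Step 1 can be sketched as follows. The sketch assumes a hypothetical wrapper glip_predict(image, prompt) around a pretrained GLIP phrase-grounding model that returns candidate boxes with confidence scores; this is not the real GLIP API, and only the prompt string and the top-8 selection follow the description above.

```python
# Sketch of step 1, assuming a hypothetical `glip_predict` wrapper around GLIP.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

PEDESTRIAN_PROMPT = ("person.coat.pants.bag.glasses.luggage.head.collar.cap.shoes."
                     "bike.headphone.shirt.pocket.legs.car.dress.hair.toy.hands."
                     "box.book.mask.cup.cellphone.")

def extract_fine_grained_boxes(
        image,
        glip_predict: Callable[[object, str], Tuple[Sequence[Box], Sequence[float]]],
        top_k: int = 8) -> List[Box]:
    """Keep the top_k highest-confidence boxes grounded by the prompt (I_bbox)."""
    boxes, scores = glip_predict(image, PEDESTRIAN_PROMPT)  # hypothetical interface
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:top_k]]               # I_bbox = (b1, ..., b8)
```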
Step 2, extracting fine-grained noun phrases from each text in the image-text pairs of the dataset.
Referring to FIG. 2, the detailed process is further described. Noun phrases appearing in the text description are extracted with the spaCy toolkit. First, the spaCy library is installed and the corresponding English model "en_core_web_sm" is loaded; the text is then passed to the language model with the nlp() function to obtain a document object. Each word in the document is traversed to determine whether its part-of-speech tag is a noun and, if so, whether it is the head of a noun phrase; the noun_chunks attribute is used to obtain each noun phrase p. The noun phrases in the document are collected by iteration and stored in a list sequence of 15 phrases per text: if fewer than 15 noun phrases are extracted, the sequence is filled with the special padding field ['pad']; if more than 15 are extracted, the trailing noun phrases are cut off. The extracted fine-grained sequence of noun phrases is T_phrase = (p1, p2, …, p15).
Extracting fine-grained noun phrases from each text in the image-text pairs of the dataset means generating, for each text description, a list sequence that stores the noun phrases appearing in that description. When the length of a list sequence is smaller than its rated length, the 'pad' field is appended repeatedly at the end of the sequence until the sequence reaches the rated length, and the part of a list sequence longer than the rated length is truncated; the rated length of the list sequences is the average length of all list sequences.
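A minimal sketch of the noun-phrase extraction of Step 2 is given below. It uses spaCy's noun_chunks attribute and pads or truncates to 15 phrases with the 'pad' token as described above; the fixed length and token follow the text, the rest is illustrative.

```python
# Sketch of step 2: spaCy noun-phrase extraction with padding/truncation to 15 phrases.
import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline named above

def extract_noun_phrases(text: str, rated_length: int = 15, pad_token: str = "pad") -> list:
    doc = nlp(text)                                        # process the description
    phrases = [chunk.text for chunk in doc.noun_chunks]    # noun phrases p1, p2, ...
    phrases = phrases[:rated_length]                       # truncate trailing phrases
    phrases += [pad_token] * (rated_length - len(phrases)) # pad short sequences
    return phrases                                         # T_phrase = (p1, ..., p15)

# Example: extract_noun_phrases("a woman wearing a red coat and carrying a black backpack")
```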
Step 3, generating a training set.
Each image with its corresponding fine-grained bounding boxes and each text with its corresponding fine-grained noun phrases form a sample, and all samples in the dataset form the training set.
Each image-text pair and its extracted fine-grained information form a training sample X = (I, T, I_bbox, T_phrase), and the N training samples together with the corresponding identity label set form the training set S = {(X_n, L_n) | 1 ≤ n ≤ N}.
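An illustrative way to bundle one training sample X = (I, T, I_bbox, T_phrase) with its identity label is sketched below; the container and field names are placeholders, not part of the invention.

```python
# Illustrative container for one training sample X = (I, T, I_bbox, T_phrase) with label L_n.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrainingSample:
    image_path: str                                   # pedestrian image I
    text: str                                         # text description T
    bboxes: List[Tuple[float, float, float, float]]   # I_bbox, 8 fine-grained boxes
    phrases: List[str]                                # T_phrase, 15 noun phrases
    identity: int                                     # identity label L_n

# training_set = [TrainingSample(...), ...]  # S = {(X_n, L_n) | 1 <= n <= N}
```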
Step 4, constructing a fine-grained aggregation network.
Step 4.1, constructing a sub-network consisting of the CLIP image encoder and text encoder, where the image encoder is CLIP ViT-B/16, the text encoder is the CLIP text Transformer, both encoders consist of 12 Transformer blocks connected in series with a fully connected layer, and the output vectors are 512-dimensional;
Step 4.2, constructing the fine-grained aggregation network from two parallel branches, where the first branch connects the image encoder in series with a bidirectional GRU and the second branch connects the text encoder in series with a bidirectional GRU.
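A minimal sketch of the two-branch network of Step 4 is shown below. It assumes the openai clip package for the ViT-B/16 image encoder and the CLIP text encoder, and a simplified GPO-style pooling head in which a bidirectional GRU predicts one weight per sequence position; the invention's exact architecture and hidden sizes may differ.

```python
# Minimal sketch of the fine-grained aggregation network, assuming the openai `clip`
# package (512-d ViT-B/16 features) and a simplified GPO-style pooling head.
import clip
import torch
import torch.nn as nn

class GPOPool(nn.Module):
    """Bidirectional GRU predicts a weight per position; features are weight-summed."""
    def __init__(self, dim: int = 512, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:     # seq: (B, K, dim)
        h, _ = self.gru(seq)                                   # (B, K, 2*hidden)
        theta = torch.softmax(self.fc(h).squeeze(-1), dim=1)   # (B, K) position weights
        return (theta.unsqueeze(-1) * seq).sum(dim=1)          # aggregated local feature

class FineGrainedAggregationNet(nn.Module):
    def __init__(self, device: str = "cpu"):
        super().__init__()
        self.clip_model, self.preprocess = clip.load("ViT-B/16", device=device)
        self.image_pool = GPOPool()   # first branch: image encoder + bidirectional GRU
        self.text_pool = GPOPool()    # second branch: text encoder + bidirectional GRU

    def encode_image_global(self, images):    # images: preprocessed image tensor batch
        return self.clip_model.encode_image(images)

    def encode_text_global(self, token_ids):  # token_ids: output of clip.tokenize
        return self.clip_model.encode_text(token_ids)
```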
Step 5, training the fine-grained aggregation network.
The training process is described in detail with reference to fig. 3.
The training set is input into the fine-grained aggregation network. The image encoder outputs the global feature of each image by forward propagation, and the text encoder outputs the global feature of each text by forward propagation; all bounding boxes of each image are forward-propagated through the fine-grained aggregation network to obtain an image local feature, and the noun phrase sequence of each text is forward-propagated through the fine-grained aggregation network to obtain a text local feature.
Specifically, a pedestrian image I is input into CLIP-ViT to obtain the image global feature I_global, and the region covered by each bounding box in the corresponding bounding-box sequence I_bbox is cropped out and input into CLIP-ViT as an image to obtain the region feature sequence [I_l1, I_l2, …, I_l8] of the image. The pedestrian text description T and the corresponding noun phrase sequence T_phrase are tokenized with the CLIP tokenizer to generate word-level vector representations, and the global feature T_global of the text description and the phrase feature sequence [T_l1, T_l2, …, T_l15] are extracted with the CLIP text encoder. After the fine-grained local image-text features are obtained, the local features usually need to be aggregated into an overall fine-grained feature representation. The common practice of aggregating with average pooling or max pooling often considers only the statistics of the local features, ignores their correlations and interactions, and may lose feature information and create an information bottleneck. The invention instead uses GPO, a flexible pooling operator, to aggregate the image-text local feature sequences: each feature sequence is input into the bidirectional GRU of the fine-grained aggregation network, the pooling-operator weight coefficients θ of the feature vectors in the sequence are learned, and the feature sequence is weighted by these coefficients to obtain the local feature. The local feature of image I is I_local = GPO([I_l1, I_l2, …, I_l8]), and the local feature of text T is T_local = GPO([T_l1, T_l2, …, T_l15]).
The inter-modal semantic alignment loss and the identity classification loss are calculated from the global features of the images and texts, the semantic consistency constraint loss is calculated from the global and local features of the images, and the sum of the three losses is taken as the total objective loss; the network parameters are updated iteratively until the total objective loss converges, yielding the trained fine-grained aggregation network.
Forward-propagating all bounding boxes of each image through the fine-grained aggregation network to obtain an image local feature means that all bounding boxes of each image are forward-propagated through the image encoder to obtain a region feature sequence ν. The bidirectional GRU learns a trainable parameter vector θ over this sequence as the feature weighting coefficients of the different positions, and the GPO aggregation operator weights the feature vector at each position to obtain the aggregated image local feature I_local = GPO(ν) = Σ_{k=1}^{8} θ_k ν_k, where ν_k denotes the k-th region feature of the region feature sequence, θ_k denotes the weight of the k-th region feature, and GPO(·) denotes the aggregation operator.
Forward-propagating the noun phrase sequence of each text through the fine-grained aggregation network to obtain a text local feature means that the noun phrase sequence of each text is forward-propagated through the text encoder to obtain a phrase feature sequence ω. The bidirectional GRU learns a trainable parameter vector θ over this sequence as the feature weighting coefficients of the different positions, and the GPO aggregation operator weights the feature vector at each position to obtain the aggregated text local feature T_local = GPO(ω) = Σ_{k=1}^{15} θ_k ω_k, where ω_k denotes the k-th phrase feature of the phrase feature sequence, θ_k denotes the weight of the k-th phrase feature, and GPO(·) denotes the aggregation operator.
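The local-feature computation just described can be sketched as below, reusing the FineGrainedAggregationNet and GPOPool classes from the earlier sketch (themselves assumptions, not the invention's exact implementation): the 8 box regions are cropped and encoded by the CLIP image encoder, the 15 noun phrases are encoded by the CLIP text encoder, and each sequence is aggregated by the GPO-style pooling head.

```python
# Sketch of GPO aggregation of the region and phrase feature sequences; `net` is an
# instance of the FineGrainedAggregationNet sketched earlier (an assumption).
import clip
import torch

def image_local_feature(net, pil_image, boxes, device="cpu"):
    crops = torch.cat([net.preprocess(pil_image.crop(tuple(int(c) for c in box))).unsqueeze(0)
                       for box in boxes])                                   # 8 cropped regions
    region_feats = net.clip_model.encode_image(crops.to(device)).float()    # nu: (8, 512)
    return net.image_pool(region_feats.unsqueeze(0))        # I_local = GPO(nu), shape (1, 512)

def text_local_feature(net, phrases, device="cpu"):
    tokens = clip.tokenize(phrases).to(device)                              # 15 noun phrases
    phrase_feats = net.clip_model.encode_text(tokens).float()               # omega: (15, 512)
    return net.text_pool(phrase_feats.unsqueeze(0))          # T_local = GPO(omega), shape (1, 512)
```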
The semantic consistency constraint loss is calculated by the following formula:
L_info = -E_p[ log( exp((I_global)^T I_local^+ / τ) / ( exp((I_global)^T I_local^+ / τ) + Σ_{n=1}^{N} exp((I_global)^T I_local^{n-} / τ) ) ) ]
where L_info denotes the semantic consistency constraint loss, E_p denotes the expectation, log denotes the logarithm with natural base e, exp denotes the exponential with natural base e, I_global denotes the image global feature, I_local^+ denotes the positive local feature, τ denotes a learnable temperature parameter, N denotes the number of unmatched image-text pairs in a batch, I_local^{n-} denotes the n-th negative local feature, and the superscript T denotes the transpose operation.
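A minimal PyTorch sketch of this InfoNCE-style consistency constraint is given below, using in-batch negatives: each image's global feature should score higher with its own aggregated local feature than with the local features of the other samples in the batch. The batch layout and the default temperature are assumptions.

```python
# Sketch of the semantic consistency (InfoNCE-style) loss with in-batch negatives.
import torch
import torch.nn.functional as F

def semantic_consistency_loss(img_global, img_local, tau: float = 0.07):
    # img_global, img_local: (B, 512); row i of each belongs to the same image
    img_global = F.normalize(img_global, dim=-1)
    img_local = F.normalize(img_local, dim=-1)
    logits = img_global @ img_local.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(img_global.size(0), device=img_global.device)
    return F.cross_entropy(logits, targets)            # -E[log softmax of the positive pair]
```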
The inter-modal semantic alignment loss is calculated by the following formulas:
S_global = (I_global)^T (T_global)
S_local = (I_local)^T (T_local)
L_align^global = (1/N) Σ_{i=1}^{N} [ log(1 + exp(-τ_p (S_i^+ - α))) + log(1 + exp(τ_n (S_i^- - β))) ]
L_align^local is computed in the same way from S_local, and L_align = L_align^global + L_align^local,
where S_global denotes the global feature similarity matrix, T_global denotes the text global feature, L_align^global denotes the global inter-modal semantic alignment loss, τ_p denotes the temperature parameter that adjusts the gradient slope for positive pairs, S_i^+ denotes the positive-pair similarity score of the i-th row of the similarity matrix, α denotes the lower similarity bound for positive pairs, τ_n denotes the temperature parameter that adjusts the gradient slope for negative pairs, S_i^- denotes the negative-pair similarity score of the i-th row of the similarity matrix, β denotes the upper similarity bound for negative pairs, S_local denotes the local feature similarity matrix, T_local denotes the text local feature, L_align^local denotes the local inter-modal semantic alignment loss, and L_align denotes the total inter-modal semantic alignment loss.
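The sketch below implements one plausible realization of this bounded alignment loss, consistent with the symbol descriptions above: log(1 + exp(·)) terms push positive-pair similarities above α and negative-pair similarities below β, with slopes controlled by τ_p and τ_n. The invention's exact formula is not confirmed by the text, and all default values here are placeholders.

```python
# One plausible form of the bounded inter-modal alignment loss (defaults are placeholders).
import torch
import torch.nn.functional as F

def bounded_alignment_loss(sim, labels, tau_p=10.0, tau_n=40.0, alpha=0.6, beta=0.4):
    # sim: (B, B) image-text similarity matrix; labels: (B,) identity labels
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)        # matched identities
    pos_term = F.softplus(-tau_p * (sim - alpha)) * pos_mask     # push S+ above alpha
    neg_term = F.softplus(tau_n * (sim - beta)) * (~pos_mask)    # push S- below beta
    return (pos_term.sum() + neg_term.sum()) / sim.size(0)

def inter_modal_alignment_loss(s_global, s_local, labels):
    # L_align = L_align_global + L_align_local
    return bounded_alignment_loss(s_global, labels) + bounded_alignment_loss(s_local, labels)
```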
The identity classification loss is calculated by the following formula:
p_v = softmax(W^T I_global)
p_t = softmax(W^T T_global)
L_id = ( -log(p_v(c)) - log(p_t(c)) ) / 2
where p_v denotes the identity prediction of the image, softmax(·) denotes the normalized exponential function, W denotes the shared identity projection matrix, p_t denotes the identity prediction of the text, L_id denotes the identity classification loss, p_v(c) denotes the probability that the image global feature correctly predicts identity class c, and p_t(c) denotes the probability that the text global feature correctly predicts identity class c.
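A sketch of this identity classification loss: a shared projection W maps image and text global features to identity logits, and the cross entropy against the true identity label is averaged over the two modalities. The number of identity classes is a placeholder.

```python
# Sketch of L_id with a shared identity projection matrix W (num_identities is a placeholder).
import torch.nn as nn
import torch.nn.functional as F

class IdentityClassificationLoss(nn.Module):
    def __init__(self, feat_dim: int = 512, num_identities: int = 1000):
        super().__init__()
        self.W = nn.Linear(feat_dim, num_identities, bias=False)  # shared projection W

    def forward(self, img_global, txt_global, identity_labels):
        logits_v = self.W(img_global)                          # identity logits from image
        logits_t = self.W(txt_global)                          # identity logits from text
        loss_img = F.cross_entropy(logits_v, identity_labels)  # -log p_v(c)
        loss_txt = F.cross_entropy(logits_t, identity_labels)  # -log p_t(c)
        return (loss_img + loss_txt) / 2                       # L_id
```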
The semantic consistency loss maximizes the mutual information between the image global features and the image region features and learns representations that are consistent across different image views of a pedestrian identity. Because the texts corresponding to different pedestrian images of the same identity may describe the person from different angles, no semantic consistency constraint is imposed on the text modality. The inter-modal semantic alignment loss uses cosine similarity to measure the correlation between visual and textual features. Encouraging the cosine similarity S^+ of positive pairs to be as large as possible while forcing the cosine similarity S^- of negative pairs to be as small as possible is a comparatively strong constraint, because different negative pairs carry different information and have different similarity scores. Two margins are therefore introduced: β is the upper similarity bound for negative pairs and α is the lower similarity bound for positive pairs, so the positive-pair similarity score S^+ is expected to be no less than α and the negative-pair similarity score S^- no greater than β. The identity classification loss computes the cross entropy between the identity prediction and the true label of an image or text, ensuring that feature representations of the same identity cluster tightly together in the joint embedding space.
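Putting the three losses together, one training iteration of Step 5 might look like the sketch below. The function and class names come from the earlier loss sketches (assumptions), and the plain unweighted sum follows the description that the three losses are added into the total objective.

```python
# Sketch of one training iteration: total loss = L_info + L_align + L_id (plain sum);
# `id_loss` is an IdentityClassificationLoss instance, the other functions are from
# the earlier sketches in this section.
def training_step(optimizer, id_loss, img_global, img_local, txt_global, txt_local, labels):
    s_global = img_global @ txt_global.t()     # S_global similarity matrix
    s_local = img_local @ txt_local.t()        # S_local similarity matrix
    total = (semantic_consistency_loss(img_global, img_local)
             + inter_modal_alignment_loss(s_global, s_local, labels)
             + id_loss(img_global, txt_global, labels))
    optimizer.zero_grad()
    total.backward()                           # iterate until the total objective converges
    optimizer.step()
    return total.item()
```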
Step 6, retrieving pedestrians using text.
Step 6.1, obtain the noun phrases of the text to be searched and the bounding boxes of the pedestrian images to be searched by the methods of Step 1 and Step 2, and input them into the trained fine-grained aggregation network to obtain the text global and local features and the image global and local features;
Step 6.2, compute the global similarity and the local similarity between the text to be searched and each pedestrian image to be searched, combine them by weighting into a total similarity, sort the pedestrian images in descending order of total similarity, and take the first 10 images of the ranking as the retrieval result.
The weighted total similarity is calculated by the following formula:
S = α·S_global + β·S_local
where α and β denote weight coefficients with α + β = 1.
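The retrieval of Step 6.2 can be sketched as follows: global and local cosine similarities are fused with weights satisfying α + β = 1 (the values used here are placeholders), and the 10 highest-scoring gallery images are returned.

```python
# Sketch of step 6.2: weighted similarity fusion and top-10 ranking (weights are placeholders).
import torch
import torch.nn.functional as F

def retrieve_top10(txt_global, txt_local, img_globals, img_locals, alpha=0.7, beta=0.3):
    # txt_*: (512,) query features; img_*: (N, 512) gallery features; alpha + beta = 1
    s_global = F.normalize(img_globals, dim=-1) @ F.normalize(txt_global, dim=0)
    s_local = F.normalize(img_locals, dim=-1) @ F.normalize(txt_local, dim=0)
    total = alpha * s_global + beta * s_local             # S = alpha*S_global + beta*S_local
    return torch.argsort(total, descending=True)[:10]     # indices of the top-10 images
```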