Technical Field

The present invention belongs to the field of computer vision, and in particular relates to a cross-modal person re-identification method that is robust to corrupted scenes.
Background Art

Intelligent surveillance equipment collects real-time, information-rich, and intuitive video data. Person re-identification is a technique that uses camera footage to determine pedestrian trajectories; it enables cross-camera person retrieval when biometric cues such as faces are unavailable, and has become a key technology in intelligent video surveillance systems. However, single-modality person re-identification has limited applicability at night. To achieve all-weather video surveillance, visible-infrared cross-modal person re-identification has emerged: it matches visible and infrared body images captured by visible-infrared dual-mode cameras for cross-modal image retrieval. Owing to variations in pose, viewpoint, and illumination, as well as the large gap between the visible and infrared modalities, intra-class differences between images of the same identity are large, while inter-class differences between images of different identities are small.

Current cross-modal person re-identification methods fall roughly into two types: methods based on modality-invariant feature learning and methods based on image generation. Both aim to alleviate the gap between modalities and to learn features shared across modalities. The former narrows the modality gap and mines modality-shared features through feature representation learning and metric learning; the latter alleviates the modality gap and learns modality-shared features by generating target-modality or intermediate-modality images.

However, current visible-infrared person re-identification research pays little attention to model robustness when processing corrupted images. In real-world scenarios, captured images are often affected by weather, signal transmission, and image compression, producing corruptions such as rain, snow, Gaussian blur, and low resolution; most cross-modal person re-identification methods suffer severe performance degradation in such corrupted scenes.
Summary of the Invention

To overcome the problems in the prior art, the present invention proposes a cross-modal person re-identification method that is robust to corrupted scenes, with the aim of reducing the model's sensitivity to corruption noise and extracting cross-modal shared features from corrupted image data while retaining recognition ability in clean scenes, thereby improving the robustness of visible and near-infrared cross-modal person re-identification to corrupted scenes.

To solve the technical problem, the present invention adopts the following technical solution:

The cross-modal person re-identification method robust to corrupted scenes of the present invention is characterized by the following steps:
Step 1: construct a visible and near-infrared pedestrian dataset.

Step 1.1: collect a number of visible images with an optical camera and a number of near-infrared images with a near-infrared camera, run a pedestrian detection algorithm on both sets, generate pedestrian bounding boxes, and crop them to obtain the pedestrian images contained in the visible and near-infrared images.

Step 1.2: label each pedestrian image according to the capturing camera, the shooting scene, and the pedestrian identity, obtaining a visible image set and an infrared image set, where the t-th visible and infrared images carry the identity label Y_t, and the s-th visible and infrared images carry the identity label Y_s.

Step 1.3: apply grayscale conversion to the t-th visible image to obtain the t-th visible grayscale image.

Step 1.4: apply hue-jitter augmentation to the t-th visible image to obtain the t-th hue-enhanced visible image.

Step 1.5: corrupt the visible image, the grayscale image, and the hue-enhanced image according to a preset corruption probability to obtain their corrupted versions (a preprocessing sketch follows step 1.6).

Denote any one of these images as the t-th image, with C channels, height H, and width W.

Step 1.6: follow the procedure of steps 1.3-1.5 to obtain the s-th image.
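The preprocessing of steps 1.3-1.5 can be pictured with a short sketch. The following is a minimal illustration, not the patented implementation: it uses torchvision-style transforms, and the `random_patch_corruption` helper is a hypothetical stand-in (a single Gaussian-noise corruption) for the noise/blur/weather/digital corruption families described later.

```python
import random
import torch
from torchvision import transforms

def random_patch_corruption(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Corrupt a random rectangular region of a CxHxW tensor with probability p.

    Illustrative only: Gaussian noise stands in for the various corruption types
    and severities mentioned in the text.
    """
    if random.random() > p:
        return img
    _, h, w = img.shape
    ph, pw = random.randint(h // 8, h // 2), random.randint(w // 8, w // 2)
    top, left = random.randint(0, h - ph), random.randint(0, w - pw)
    severity = random.choice([0.05, 0.1, 0.2, 0.3, 0.4])  # random corruption level
    out = img.clone()
    out[:, top:top + ph, left:left + pw] += severity * torch.randn(img.shape[0], ph, pw)
    return out.clamp(0.0, 1.0)

# Three views of a visible image: original, grayscale, hue-jittered (steps 1.3-1.5).
to_gray = transforms.Grayscale(num_output_channels=3)
hue_jitter = transforms.ColorJitter(hue=0.5)

def build_views(rgb: torch.Tensor, corrupt_prob: float = 0.5):
    views = [rgb, to_gray(rgb), hue_jitter(rgb)]
    return [random_patch_corruption(v, corrupt_prob) for v in views]
```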
Step 2: construct an L-layer Transformer module to extract the global feature F_t^global of the t-th image and the global feature F_s^global of the s-th image.

Step 3: construct a local dynamic attention graph module to extract the local feature F_t^local of the t-th image and the local feature F_s^local of the s-th image.

Step 4: concatenate the global feature F_t^global and the local feature F_t^local into the output feature F_t of the t-th image, and concatenate the global feature F_s^global and the local feature F_s^local into the output feature F_s of the s-th image.
Step 5: use a hard boundary loss to optimize the distribution of hard samples in corrupted scenes (a sketch is given after step 5.5).

Step 5.1: compute, by Eq. (7), the center of all features whose label is the same as the label Y_t of F_t.

In Eq. (7), mean denotes averaging, dis denotes the feature-distance computation, F_n denotes any n-th feature with the same label as Y_t of F_t, and N_t denotes the number of features with the same label as F_t.

Step 5.2: take the features with the same label Y_t as F_t as positive samples, and compute by Eq. (8) the average positive-sample distance from all positive-sample features to the feature center.

Take the features whose label differs from Y_t of F_t as negative samples, and compute by Eq. (9) the average negative-sample distance from all negative-sample features to the feature center.

In Eq. (9), ||·||_2 denotes the Euclidean distance, F_n′ denotes the n′-th feature whose label differs from Y_t of F_t, and N′_t denotes the number of features whose label differs from that of F_t.

Step 5.3: compute by Eq. (10) the intra-class hard boundary loss of the t-th image.

In Eq. (10), F_q denotes the q-th feature that lies within the average negative-sample distance and whose label differs from Y_t of F_t, and Q_t denotes the number of such features.

Step 5.4: compute by Eq. (7) the center of all features with the same label as the label Y_s of F_s, then compute by Eq. (8) its average positive-sample boundary, and thereby compute by Eq. (11) the inter-class hard boundary loss between F_t and F_s.

In Eq. (11), F_u denotes the u-th feature that lies outside the average positive-sample boundary of F_t and has the same label Y_t as F_t, U_t denotes the number of such features, F_w denotes the w-th feature that lies outside the average positive-sample boundary of F_s and has the same label Y_s as F_s, and W_t denotes the number of such features.

Step 5.5: construct by Eq. (12) the total hard boundary loss L_HBL from the intra-class and inter-class hard boundary losses.

In Eq. (12), α and β are hyperparameters that balance the weights of the intra-class and inter-class hard boundary losses.
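Since the formula images for Eqs. (7)-(12) are not reproduced above, the following sketch only illustrates one plausible reading of steps 5.1-5.5: per-class centers, average positive/negative distances to the center, penalties on negatives inside the negative boundary and positives outside the positive boundary, and an α/β-weighted sum. The hinge-style penalties, function name, and variable names are our assumptions, not the patent's exact equations.

```python
import torch
import torch.nn.functional as F

def hard_boundary_loss(feats: torch.Tensor, labels: torch.Tensor,
                       alpha: float = 0.6, beta: float = 0.3) -> torch.Tensor:
    """One plausible reading of the hard boundary loss in steps 5.1-5.5.

    feats: (B, D) batch features, labels: (B,) identity labels.
    Intra term: pull positives toward the center and penalize negatives that fall
    inside the average negative distance. Inter term: penalize positives that fall
    outside the average positive distance. The exact forms of Eqs. (10)-(11) are
    not given in the surviving text, so hinge-style penalties are assumed.
    """
    loss_intra, loss_inter = feats.new_zeros(()), feats.new_zeros(())
    for y in labels.unique():
        pos_mask = labels == y
        center = feats[pos_mask].mean(dim=0)                    # Eq. (7): class center
        d_pos = (feats[pos_mask] - center).norm(dim=1)          # positive distances
        d_neg = (feats[~pos_mask] - center).norm(dim=1)         # negative distances
        pos_bound, neg_bound = d_pos.mean(), d_neg.mean()       # Eqs. (8)-(9): boundaries
        # Eq. (10)-style: pull positives in, penalize negatives inside the negative boundary
        hard_neg = d_neg[d_neg < neg_bound]
        if hard_neg.numel() > 0:
            loss_intra = loss_intra + F.relu(neg_bound - hard_neg).mean() + d_pos.mean()
        # Eq. (11)-style: push back positives that lie outside the positive boundary
        hard_pos = d_pos[d_pos > pos_bound]
        if hard_pos.numel() > 0:
            loss_inter = loss_inter + F.relu(hard_pos - pos_bound).mean()
    n = labels.unique().numel()
    return (alpha * loss_intra + beta * loss_inter) / n         # Eq. (12): weighted sum
```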
Step 6: train the person re-identification network composed of the Transformer module and the local dynamic attention graph module by gradient descent, updating the network parameters by minimizing the hard boundary loss L_HBL until L_HBL converges, thereby obtaining the trained optimal person re-identification model.
Step 7: use the trained optimal person re-identification model to perform recognition on the corrupted-scene test set.

Step 7.1: take the uncorrupted query image and the gallery set, feed them into the optimal person re-identification model, and output the uncorrupted query feature and the gallery features; then compute the similarity between the query feature and all gallery features, sort in descending order, and take the top α ranked images as the cross-modal person re-identification result for the clean scene.

Step 7.2: randomly corrupt the query image, feed it into the optimal person re-identification model, and output the corrupted query feature; then compute the similarity between the corrupted query feature and all gallery features, sort in descending order, and take the top α ranked images as the cross-modal person re-identification result for the corrupted scene (a retrieval sketch follows).
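A retrieval pass as described in steps 7.1-7.2 amounts to ranking gallery features by similarity to the query feature. A minimal sketch, assuming cosine similarity and an already-trained `model` callable (both are assumptions; the text only says "similarity"):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_gallery(model, query_img: torch.Tensor, gallery_imgs: torch.Tensor, top_k: int = 20):
    """Return indices of the top_k gallery images most similar to the query.

    model: callable mapping a (N, C, H, W) batch to (N, D) features.
    """
    q = F.normalize(model(query_img.unsqueeze(0)), dim=1)   # (1, D) query feature
    g = F.normalize(model(gallery_imgs), dim=1)             # (N, D) gallery features
    sims = (q @ g.t()).squeeze(0)                           # (N,) cosine similarities
    return sims.argsort(descending=True)[:top_k]            # ranked gallery indices
```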
The cross-modal person re-identification method robust to corrupted scenes of the present invention is also characterized in that step 2 comprises:
Step 2.1: split the t-th image with an overlapping-stride strategy to obtain the t-th patch sequence, whose k-th element is the k-th patch of the t-th image; K denotes the total number of patches.

Step 2.2: initialize l = 1; feed the t-th patch sequence into the Transformer module and obtain by Eq. (1) the embedding Z_{t,l-1} output by the (l-1)-th Transformer layer.

In Eq. (1), E_pos denotes the learnable position embedding, X_{cls,l-1} denotes the learnable initial class token of the (l-1)-th Transformer layer, and W_p denotes the linear projection.

Step 2.3: obtain by Eq. (2) the embedding Z_{t,l} output by the l-th Transformer layer:

Z′_{t,l} = MSA(LN(Z_{t,l-1})) + Z_{t,l-1},   Z_{t,l} = MLP(LN(Z′_{t,l})) + Z′_{t,l}   (2)

In Eq. (2), MSA denotes the multi-head self-attention operation, MLP denotes the multi-layer perceptron operation, and LN denotes residual connection and layer normalization.

Step 2.4: take the class token X_{cls,L} in the embedding Z_{t,L} output by the last (L-th) Transformer layer as the global feature representation F_t^global of the t-th image.

Step 2.5: obtain the global feature representation F_s^global of the s-th image by following steps 2.1-2.4 (see the sketch below).
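Steps 2.1-2.4 follow the usual ViT recipe with an overlapping patch stride. A minimal sketch of the overlapping split and the class-token readout, assuming a standard PyTorch TransformerEncoder stands in for the exact block of Eq. (2), and that a strided convolution realizes the overlapping patch embedding:

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Overlapping-stride patch embedding + Transformer; class token = global feature."""

    def __init__(self, in_ch=3, dim=768, patch=16, stride=12, img_hw=(256, 128), layers=12):
        super().__init__()
        # Conv2d with stride < kernel size realizes the overlapping split of step 2.1
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=stride)
        n_patches = ((img_hw[0] - patch) // stride + 1) * ((img_hw[1] - patch) // stride + 1)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                  # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))      # learnable position embedding
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)   # stands in for Eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)                 # (B, K, dim) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos                   # Eq. (1)-style input embedding
        z = self.encoder(z)
        return z[:, 0]                                                   # class token as F^global
```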
Step 3 comprises:
Step 3.1: sample the k-th patch of the t-th patch sequence into M local features of size S×S, flatten each local feature, and map each flattened local feature into a d-dimensional embedding vector with a linear projection E_p, obtaining the k-th local feature sequence, whose m-th element is the m-th local feature of the k-th patch; M denotes the number of local features.

Step 3.2: from the k-th local feature sequence, construct the k-th local graph by Eq. (3).

In Eq. (3), V_{t,k,j} denotes the j-th embedded node of the k-th local graph, an edge exists between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}, and the graph is described by its set of embedded nodes and its set of undirected edges.

Step 3.3: compute by Eq. (4) the attention score S(V_{t,k,i}, V_{t,k,j}) of the edge between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}, which characterizes the importance of node V_{t,k,j} to node V_{t,k,i} (see the sketch after step 3.7):

S(V_{t,k,i}, V_{t,k,j}) = a^T LeakyReLU(W·[V_{t,k,i} || V_{t,k,j}])   (4)

In Eq. (4), || denotes the embedding concatenation operation, W denotes the shared weight matrix, a denotes the mapping to a real number, LeakyReLU denotes the activation function, and T denotes transposition.

Step 3.4: obtain by Eq. (5) the dynamic attention weight α_{t,i,j} of the edge between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}:

α_{t,i,j} = exp(S(V_{t,k,i}, V_{t,k,j})) / Σ_{j′∈N_{t,i}} exp(S(V_{t,k,i}, V_{t,k,j′}))   (5)

In Eq. (5), N_{t,i} denotes the set of embedded nodes adjacent to the i-th embedded node V_{t,k,i}, and V_{t,k,j′} denotes the j′-th embedded node adjacent to V_{t,k,i}.

Step 3.5: obtain by Eq. (6) the aggregated feature V′_{t,k,i} of the i-th embedded node.

Step 3.6: obtain all embedded node features of the k-th local embedding sequence by following steps 3.3-3.5, forming the k-th updated local embedding sequence; thereby obtain the updated patch sequence and concatenate it into the local feature representation F_t^local of the t-th image.

Step 3.7: obtain the local feature representation F_s^local of the s-th image by following steps 3.1-3.6.
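The edge scoring and aggregation of steps 3.3-3.5 match the form of a graph attention layer with score a^T LeakyReLU(W[V_i || V_j]). A minimal dense sketch over one fully connected local graph; since Eq. (6) is not reproduced above, a plain attention-weighted sum of the node embeddings is assumed for the aggregation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphAttentionLayer(nn.Module):
    """Graph attention over a fully connected local graph of M node embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim, bias=False)  # shared weight matrix W acting on [Vi || Vj]
        self.a = nn.Linear(dim, 1, bias=False)        # real-valued mapping a of Eq. (4)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, M, dim) local-feature embeddings of one patch
        b, m, d = nodes.shape
        pairs = torch.cat([nodes.unsqueeze(2).expand(b, m, m, d),
                           nodes.unsqueeze(1).expand(b, m, m, d)], dim=-1)  # all [Vi || Vj] pairs
        scores = self.a(F.leaky_relu(self.W(pairs))).squeeze(-1)  # Eq. (4): a^T LeakyReLU(W[Vi||Vj])
        attn = scores.softmax(dim=-1)                             # Eq. (5): softmax over neighbors
        return attn @ nodes                                       # assumed Eq. (6)-style aggregation
```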
An electronic device of the present invention comprises a memory and a processor, and is characterized in that the memory stores a program that supports the processor in executing the cross-modal person re-identification method, and the processor is configured to execute the program stored in the memory.

A computer-readable storage medium of the present invention stores a computer program, and is characterized in that the computer program, when run by a processor, executes the steps of the cross-modal person re-identification method.
Compared with the prior art, the beneficial effects of the present invention are as follows:

1. The present invention designs a dual-modality three-stage learning strategy. By feeding three kinds of image pairs into the model, the model gradually focuses on color-related modality-invariant features, avoiding complex additional networks for generating intermediate modalities and alleviating the gap between the visible and infrared modalities.

2. The present invention designs a random patch corruption augmentation. Before the network extracts features, each image is randomly augmented with corrupted patches of various types and severities, forcing the model to adapt to corrupted images during training and thereby improving its robustness to corrupted samples.

3. The present invention designs a local dynamic attention graph Transformer that combines a Transformer with a graph attention network, extracting global features that model long-range structural information and local features that filter out corruption noise and background noise, thereby improving the discriminative power of the extracted features.

4. The present invention designs a hard boundary loss to align features across modalities. Using the relationships between instance boundaries, it focuses on hard samples in corrupted scenes, or hard samples in clean scenes caused by illumination, viewpoint, occlusion, and the like, pushing apart hard samples with different labels and pulling together hard samples with the same label, thereby improving the accuracy of visible and near-infrared person re-identification and its robustness to corrupted scenes.
Brief Description of the Drawings

FIG. 1 is a flow chart of the cross-modal person re-identification method of the present invention;

FIG. 2 is an example of the random patch corruption data augmentation of the present invention;

FIG. 3 shows the local dynamic attention graph module of the present invention.
Detailed Description of the Embodiments

In this embodiment, a cross-modal person re-identification method robust to corrupted scenes extracts corruption-invariant global and local cross-modal features through three-stage training, random patch corruption augmentation, a local dynamic attention graph Transformer network, and a hard boundary loss, thereby improving visible and near-infrared person re-identification accuracy in both clean and corrupted scenes. Specifically, as shown in FIG. 1, the method proceeds as follows:
Step 1: construct a visible and near-infrared pedestrian dataset.

Step 1.1: collect a number of visible images with an optical camera and a number of near-infrared images with a near-infrared camera, generate pedestrian bounding boxes with a pedestrian detection algorithm, and crop out the pedestrian images from the visible and near-infrared images.

Step 1.2: label each pedestrian image according to the capturing camera, the shooting scene, and the pedestrian identity, obtaining a visible image set and an infrared image set, where the t-th visible and infrared images carry the identity label Y_t, and the s-th visible and infrared images carry the identity label Y_s.

In this embodiment, the authoritative visible and near-infrared pedestrian dataset SYSU-MM01 is used to train and evaluate the model. SYSU-MM01 is a large-scale dataset captured by 4 visible cameras and 2 near-infrared cameras, containing 491 different pedestrians with a total of 287,628 visible images and 15,792 infrared images. Its training set contains 22,258 visible images and 11,909 near-infrared images of 395 pedestrians, and the test set contains images of another 96 pedestrians. The dataset has two search settings, All-Search and Indoor-Search: All-Search tests with both indoor and outdoor images, while Indoor-Search tests with indoor images only.

Step 1.3: apply grayscale conversion to the t-th visible image to obtain the t-th visible grayscale image. In this embodiment, all input images are resized to 256×128.

Step 1.4: apply hue-jitter augmentation to the t-th visible image to obtain the t-th hue-enhanced visible image.

Step 1.5: apply random patch corruption augmentation to the input images. As shown in FIG. 2, the visible image, the grayscale image, and the hue-enhanced image are locally corrupted according to a preset corruption probability: a rectangular region is randomly selected in the input image, and a randomly chosen corruption of a random severity is added to it, yielding the corrupted versions. In this embodiment, the corruption probability is set to 0.5.

Denote any one of these images as the t-th image, with C channels, height H, width W, and pedestrian label Y_t.

Step 1.6: follow the procedure of steps 1.3-1.5 to obtain the s-th image.
Step 2: construct an L-layer Transformer module to extract the global feature F_t^global of the t-th image and the global feature F_s^global of the s-th image.

Step 2.1: split the t-th image with an overlapping-stride strategy to obtain the patch sequence, whose k-th element is the k-th patch of the t-th image; K denotes the total number of patches. In this embodiment, the patch size is 16×16 and the overlapping stride is 12, preventing the model from ignoring information at patch edges.

Step 2.2: initialize l = 1; feed the t-th patch sequence into the Transformer module and obtain by Eq. (1) the embedding Z_{t,l-1} output by the (l-1)-th Transformer layer.

In Eq. (1), E_pos denotes the learnable position embedding, X_{cls,l-1} denotes the learnable initial class token, and W_p denotes the linear projection.

Step 2.3: feed the input embedding Z_0 into the Transformer blocks to extract features, and obtain by Eq. (2) the embedding Z_{t,l} output by the l-th Transformer layer:

Z′_{t,l} = MSA(LN(Z_{t,l-1})) + Z_{t,l-1},   Z_{t,l} = MLP(LN(Z′_{t,l})) + Z′_{t,l}   (2)

In Eq. (2), MSA denotes the multi-head self-attention operation, MLP denotes the multi-layer perceptron operation, and LN denotes residual connection and layer normalization.

Step 2.4: take the class token X_{cls,L} in the embedding Z_{t,L} output by the last (L-th) Transformer layer as the global feature representation F_t^global of the t-th image. In this embodiment, L = 12, which keeps the model compact while achieving a high recognition rate.

Step 2.5: obtain the global feature representation F_s^global of the s-th image by following steps 2.1-2.4.
Step 3: construct the local dynamic attention graph module to extract fine-grained features that are robust to corruption noise. Specifically, as shown in FIG. 3, the procedure is as follows:

Step 3.1: sample each patch of the t-th patch sequence into M local features of size S×S, flatten each local feature, and map each flattened local feature into a d-dimensional embedding vector with a linear projection E_p, obtaining the k-th local feature sequence, whose m-th element is the m-th local feature of the k-th patch; M denotes the number of local features. In this embodiment, balancing model performance against computational cost, the sampling size S is set to 4, producing 16 local features per patch.
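Step 3.1 can be realized with a tensor unfold. A minimal sketch, assuming the patch is still in pixel space (16×16, C channels) and using a hypothetical linear projection E_p to d dimensions:

```python
import torch
import torch.nn as nn

def patch_to_local_embeddings(patch: torch.Tensor, proj: nn.Linear, s: int = 4) -> torch.Tensor:
    """Split one (C, 16, 16) patch into (16/s)^2 local features of size s x s,
    flatten each, and project to d dimensions with the linear map proj (E_p)."""
    c, h, w = patch.shape
    locals_ = patch.unfold(1, s, s).unfold(2, s, s)                   # (C, H/s, W/s, s, s)
    locals_ = locals_.permute(1, 2, 0, 3, 4).reshape(-1, c * s * s)   # (M, C*s*s) flattened
    return proj(locals_)                                              # (M, d) embedded local features

# Usage: for a 3-channel 16x16 patch with s=4, M = 16 local features.
E_p = nn.Linear(3 * 4 * 4, 768)
embeddings = patch_to_local_embeddings(torch.rand(3, 16, 16), E_p)    # shape (16, 768)
```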
Step 3.2: from the k-th local feature sequence, construct the k-th local graph by Eq. (3).

In Eq. (3), V_{t,k,j} denotes the j-th embedded node of the k-th local graph, an edge exists between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}, and the graph is described by its set of embedded nodes and its set of undirected edges.

Step 3.3: stack graph attention layers to build the network, updating each embedded node by aggregating the features of its first-order neighbor nodes. First, compute by Eq. (4) the attention score S(V_{t,k,i}, V_{t,k,j}) of the edge between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}, which characterizes the importance of node V_{t,k,j} to node V_{t,k,i}:

S(V_{t,k,i}, V_{t,k,j}) = a^T LeakyReLU(W·[V_{t,k,i} || V_{t,k,j}])   (4)

In Eq. (4), || denotes the embedding concatenation operation, W denotes the shared weight matrix, a denotes the mapping to a real number, LeakyReLU denotes the activation function, and T denotes transposition.

Step 3.4: obtain by Eq. (5) the dynamic attention weight α_{t,i,j} of the edge between the i-th embedded node V_{t,k,i} and the j-th embedded node V_{t,k,j}:

α_{t,i,j} = exp(S(V_{t,k,i}, V_{t,k,j})) / Σ_{j′∈N_{t,i}} exp(S(V_{t,k,i}, V_{t,k,j′}))   (5)

In Eq. (5), N_{t,i} denotes the set of embedded nodes adjacent to the i-th embedded node V_{t,k,i}, and V_{t,k,j′} denotes the j′-th embedded node adjacent to V_{t,k,i}.

Step 3.5: obtain by Eq. (6) the aggregated feature V′_{t,k,i} of the i-th embedded node.

Step 3.6: obtain all embedded node features of the k-th local embedding sequence by following steps 3.3-3.5, forming the k-th updated local embedding sequence; thereby obtain the updated patch sequence and concatenate it into the local feature representation F_t^local of the t-th image. In this embodiment, 4 graph attention layers are used.

Step 3.7: obtain the local feature representation F_s^local of the s-th image by following steps 3.1-3.6.
Step 4: use the hard boundary loss to optimize the distribution of hard samples in corrupted scenes.

Step 4.1: compute, by Eq. (7), the center of all features whose label is the same as the label Y_t of F_t.

In Eq. (7), mean denotes averaging, dis denotes the feature-distance computation, F_n denotes any n-th feature with the same label as Y_t of F_t, and N_t denotes the number of features with the same label as F_t.

Step 4.2: take the features with the same label Y_t as F_t as positive samples, and compute by Eq. (8) the average positive-sample distance from all positive-sample features to the feature center.

Take the features whose label differs from Y_t of F_t as negative samples, and compute by Eq. (9) the average negative-sample distance from all negative-sample features to the feature center.

In Eq. (9), ||·||_2 denotes the Euclidean distance, F_n′ denotes the n′-th feature whose label differs from Y_t of F_t, and N′_t denotes the number of features whose label differs from that of F_t.

Step 4.3: some harder negative samples still lie within the average negative-sample boundary; compute by Eq. (10) the intra-class hard boundary loss, which pulls positive samples closer and penalizes these harder negatives.

In Eq. (10), F_q denotes the q-th feature that lies within the average negative-sample distance and whose label differs from Y_t of F_t, and Q_t denotes the number of such features.

Step 4.4: compute by Eq. (7) the center of all features with the same label as the label Y_s of F_s, then compute by Eq. (8) its average positive-sample boundary, and thereby compute by Eq. (11) the inter-class hard boundary loss, which pushes harder positive-sample boundaries away from the feature centers of different labels.

In Eq. (11), F_u denotes the u-th feature that lies outside the average positive-sample boundary of F_t and has the same label Y_t as F_t, U_t denotes the number of such features, F_w denotes the w-th feature that lies outside the average positive-sample boundary of F_s and has the same label Y_s as F_s, and W_t denotes the number of such features.

Step 4.5: construct by Eq. (12) the total hard boundary loss from the intra-class and inter-class hard boundary losses.

In Eq. (12), α and β are hyperparameters that balance the weights of the intra-class and inter-class hard boundary losses; minimizing the total loss during training makes the model converge. In this example, the global features output by the Transformer and the local features output by the local dynamic attention graph module are fused by concatenation into a 768-dimensional feature, which serves as the final output on which the loss is computed; δ is set to 0.3, α to 0.6, and β to 0.3.
Step 5: construct corrupted-scene test sets to evaluate the trained model.

Step 5.1: save the optimal model obtained from training, use it to extract the uncorrupted query features and the gallery features, then compute the similarity between each query feature and all gallery features, sort in descending order, and take the top 20 ranked images as the cross-modal person re-identification result for the clean scene.

Step 5.2: randomly corrupt the image to be queried, use the optimal model to extract the corrupted query features and the gallery features, then compute the similarity between the corrupted query features and all gallery features, sort in descending order, and take the top 20 ranked images as the cross-modal person re-identification result for the corrupted scene.

In this embodiment, training uses the AdamW optimizer with a cosine-annealing learning-rate scheduler; the initial learning rate is set to 3×10^-4 and the weight decay to 1×10^-4. The grayscale-infrared stage is trained for 6 epochs, the hue-jitter-infrared stage for 6 epochs, and the visible-infrared stage for 18 epochs, for a total of 30 epochs on the SYSU-MM01 dataset to reach the model's best performance. For corrupted query inputs, the same four corruption families used in the random patch corruption augmentation are applied (noise, blur, weather, and digital); the difference is that the corruption is applied to the whole query image to simulate more realistic corrupted scenes. The evaluation metrics are the Cumulative Matching Characteristics (CMC) curve, mean Average Precision (mAP), and mean Inverse Negative Penalty (mINP); the final test performance is the average over 10 repeated tests. A minimal optimizer and scheduler configuration is sketched below.
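The optimizer settings above translate directly into a few lines of configuration. A minimal sketch, assuming a PyTorch `model` and standard AdamW with cosine annealing (the three-stage data schedule itself is not shown):

```python
import torch

def build_optimizer(model: torch.nn.Module, total_epochs: int = 30):
    """AdamW with a cosine-annealing schedule, matching the hyperparameters in the text."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs)
    return optimizer, scheduler

# Epoch budget of the three-stage schedule: 6 (grayscale-infrared) + 6 (hue-jitter-infrared)
# + 18 (visible-infrared) = 30 epochs on SYSU-MM01.
```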
In this embodiment, an electronic device comprises a memory and a processor; the memory stores a program that supports the processor in executing the above method, and the processor is configured to execute the program stored in the memory.

In this embodiment, a computer-readable storage medium stores a computer program that, when run by a processor, executes the steps of the above method.
Patent Information

| Application Number | Priority Date | Filing Date | Publication Number | Publication Date | Status | Title |
|---|---|---|---|---|---|---|
| CN202410241752.0A | 2024-03-04 | 2024-03-04 | CN118038494A | 2024-05-14 | Pending | A cross-modal person re-identification method robust to corrupted scenes |

Cited By

| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN118411739A | 2024-07-02 | 2024-07-30 | 江西财经大学 | Visual language person re-identification network method and system based on dynamic attention |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |