CN112101122A

Info

Publication number: CN112101122A
Application number: CN202010845336.3A
Authority: CN
Inventors: 李国荣; 杨一帆; 黄庆明; 苏荔
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences
Priority date: 2020-08-20
Filing date: 2020-08-20
Publication date: 2020-12-18
Anticipated expiration: 2040-08-20
Also published as: CN112101122B

Abstract

The invention relates to the technical field of computer vision, and in particular to a weakly supervised object-count estimation method based on a ranking network. The method trains its model without relying on object location annotations, which saves human resources and improves the generality of the model. The method comprises the following steps: extracting image features with a deep neural network and obtaining pyramid feature vectors with adaptive pooling layers; regressing the number of objects with a fully connected layer; and training the model with a multi-branch ranking network, in which a Sinkhorn layer converts the ranking result into a ranking matrix and the loss is computed against a soft-label transport matrix used as the ground truth.

Description

Translated from Chinese

A Weakly Supervised Object-Count Estimation Method Based on a Ranking Network

Technical Field

The invention relates to the technical field of computer vision, and in particular to a weakly supervised method for estimating the number of objects based on a ranking network.

Background Art

Counting key objects such as people and vehicles with cameras in public places has significant research value. For example, crowd counts in a waiting hall and vehicle-count estimates at traffic intersections can be used to optimize public transport scheduling, and a sharp change in the number of people in an area may either cause an incident or result from one. Estimating the number of objects in images and video is therefore valuable for intelligent security and is an important research topic in computer vision and intelligent video surveillance.

Current object-count estimation methods fall roughly into three categories. 1) Object detection: this is the most direct approach; in scenes with sparse objects, the objects in the image are detected and then counted. It works poorly when objects are crowded. 2) Clustering of visual feature trajectories: for video surveillance, a KLT tracker is generally combined with clustering, and the number of trajectory clusters is taken as the number of objects. 3) Feature-based regression: a regression model between image features and the number of objects in the image is built, and the number of objects in a scene is estimated by measuring image features. In crowded conditions the direct methods are easily affected by difficulties such as occlusion, whereas the indirect method works from the overall characteristics of the object group and can count objects at large scale.

Existing feature-regression algorithms have the following shortcomings. First, annotating object locations is usually expensive. Existing object-count datasets provide the location of every object to train count regression networks, yet at evaluation time these location labels are ignored and only the accuracy of the estimated count is assessed. When locations are not needed, it is in fact possible to annotate only the number of objects in each image and to train the estimation model with a more efficient weakly supervised method.

Summary of the Invention

To solve the above technical problem, the present invention provides a weakly supervised object-count estimation method based on a ranking network that requires no object location annotations, saves human resources, and improves the accuracy of object-count estimation.

The weakly supervised object-count estimation method based on a ranking network of the present invention comprises the following steps:

S1. Extract image features with a pre-trained deep neural network such as VGG-16, then regress a density map with convolution operations. Use adaptive pooling layers to extract multi-scale features from the density map, capturing global and local information in the image, and feed these features into a fully connected layer that regresses the number of objects. The adaptive pooling layers are of two types: global sub-cluster layers and local sub-cluster layers.

S2. Train the multi-scale features with an image object-count ranking network so that they become sensitive to the number of objects. The ranking network is a multi-branch network whose input is the multi-scale features of several images and whose output is the ordering of those images by the number of objects they contain.

S3. In the ranking network, use a Sinkhorn layer to convert the ranking features into an ordinal matrix, construct a soft-label transport matrix from the true number of objects in each image, and train the ranking network with a cross-entropy loss to obtain features sensitive to the number of objects. Then train the regression network to obtain the final object-count regression model.

In the weakly supervised object-count estimation method based on a ranking network of the present invention, step S1 specifically comprises: extracting image features with a deep network model pre-trained on an image analysis task and regressing a pseudo probability density map; then constructing global sub-cluster layers from pooling layers with larger strides to extract global features from the density map, and constructing local sub-cluster layers from pooling layers with smaller strides to extract local features from the density map.

In the weakly supervised object-count estimation method based on a ranking network of the present invention, step S2 specifically comprises: fine-tuning the feature extraction model with a multi-branch ranking network to obtain global and local features that are sensitive to the number of objects in the image.

In the weakly supervised object-count estimation method based on a ranking network of the present invention, step S3 specifically comprises: using a differentiable Sinkhorn layer to convert the ranking features into an ordinal matrix; constructing a more effective soft-label transport matrix to train the ranking network; and training the ranking network with a cross-entropy loss and the regression network with a mean squared error loss.

The beneficial effects of the invention are as follows. The ranking network learns multi-scale features that are sensitive to the number of objects from the relative relationship between the object counts of different images. These features serve as the input of the regression network, so object location information is not used and no large annotation effort is needed to label object locations. The differentiable Sinkhorn layer allows the network to be trained end to end. The soft-label transport matrix is built from the relative relationship between the object counts of the images, which effectively reflects the complexity of the ranking task and improves the accuracy of object-count estimation.

Description of Drawings

Figure 1 is a schematic diagram of the present invention.

Detailed Description

Specific embodiments of the present invention are described in further detail below with reference to an example. The following example illustrates the present invention and does not limit its scope.

Embodiment

S1. Extract image features with a pre-trained deep neural network such as VGG-16, then regress a density map with convolution operations. Use multiple pooling layers to extract multi-scale features from the density map, capturing global and local information in the image, and feed these features into a fully connected layer that regresses the number of objects. The adaptive pooling layers are of two types: global sub-cluster layers and local sub-cluster layers. The global sub-cluster layers use three max pooling layers with strides of 8, 16, and 32; the local sub-cluster layers use two average pooling layers with strides of 1 and 2.
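The pooling stage of S1 can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: only the pooling types and strides come from the text, while the window sizes (equal to the stride for max pooling, 2x2 for average pooling), the absence of padding, and the names `pool2d` and `multi_scale_features` are assumptions made for the example.

```python
def pool2d(grid, kernel, stride, mode):
    """Pool a 2-D list of floats with a square window.

    Windows that would run past the border are skipped (no padding).
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows - kernel + 1, stride):
        out_row = []
        for c in range(0, cols - kernel + 1, stride):
            window = [grid[r + dr][c + dc]
                      for dr in range(kernel) for dc in range(kernel)]
            if mode == "max":
                out_row.append(max(window))
            else:  # "avg"
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def multi_scale_features(density_map):
    """Concatenate global (max-pooled) and local (average-pooled) features."""
    features = []
    # Global sub-cluster layers: max pooling with strides 8, 16, 32.
    # Assumption: the window size equals the stride.
    for stride in (8, 16, 32):
        pooled = pool2d(density_map, stride, stride, "max")
        features.extend(v for row in pooled for v in row)
    # Local sub-cluster layers: average pooling with strides 1, 2.
    # Assumption: a 2x2 window in both cases.
    for stride in (1, 2):
        pooled = pool2d(density_map, 2, stride, "avg")
        features.extend(v for row in pooled for v in row)
    return features


# Usage: a 32x32 density map with a single non-zero response.
density_map = [[0.0] * 32 for _ in range(32)]
density_map[5][7] = 1.0
features = multi_scale_features(density_map)
```

In a real model the density map would come from the convolutional head and the concatenated vector would be passed to the fully connected layer.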

S2. Train the multi-scale features with an image object-count ranking network so that they become sensitive to the number of objects. The ranking network is a multi-branch network whose input is the multi-scale features of several images and whose output is the ordering of those images by the number of objects they contain. Specifically, a K-branch network can be used: extract the multi-scale features f₁, f₂, f₃, …, f_K of K images, compute the differences f₁-f₂, f₁-f₃, …, f₁-f_K, f₂-f₃, …, f₂-f_K, …, f_{K-1}-f_K, and feed them into the ranking network to obtain a K(K-1)-dimensional ranking vector f_d.
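The pairwise differences that feed the ranking network can be enumerated as in the sketch below. The function name `pairwise_differences` is hypothetical, and the text does not say how the differences are combined before they enter the network, so the sketch simply returns them as a list. K images give K(K-1)/2 such pairs with i < j.

```python
from itertools import combinations


def pairwise_differences(features):
    """Return f_i - f_j for every pair i < j.

    Pairs are produced in the order (1,2), (1,3), ..., (1,K), (2,3), ..., (K-1,K).
    """
    diffs = []
    for i, j in combinations(range(len(features)), 2):
        diffs.append([a - b for a, b in zip(features[i], features[j])])
    return diffs


# Usage: three images, each with a two-dimensional feature vector.
features = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
diffs = pairwise_differences(features)
```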

S3. In the ranking network, a Sinkhorn layer converts the ranking feature f_d into an ordinal matrix P, whose element P_{i,j} in row i and column j is the probability that the i-th image is ranked in the j-th place. A soft-label transport matrix is constructed from the true number of objects in each image.
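A Sinkhorn layer is commonly implemented as repeated row and column normalization of a positive matrix, which drives it toward a doubly stochastic matrix. The sketch below shows that standard procedure under some assumptions: the scores are already arranged as a K×K matrix (the text does not describe how f_d is reshaped), and the function name `sinkhorn` and the iteration count are chosen only for the example.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Turn a square score matrix into an approximately doubly stochastic one.

    Rows and columns are normalised alternately; after enough iterations
    every row and every column sums to roughly 1.
    """
    k = len(scores)
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    p = [[math.exp(v - top) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row normalisation: each row sums to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Column normalisation: each column sums to 1.
        col_sums = [sum(p[i][j] for i in range(k)) for j in range(k)]
        p = [[p[i][j] / col_sums[j] for j in range(k)] for i in range(k)]
    return p


# Usage: hypothetical scores for K = 3 images.
scores = [[2.0, 0.5, 0.1],
          [0.3, 1.5, 0.2],
          [0.1, 0.4, 1.0]]
P = sinkhorn(scores)
```

Every step is a differentiable operation, which is why such a layer can sit inside a network trained end to end.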

Let σ denote the true ranking of the images, where the i-th element σ(i) means that the i-th image is ranked at position σ(i). The elements of the soft-label matrix are then calculated as follows:

where

Δ_thr is a predefined threshold. The ranking network is then trained with the following cross-entropy loss to obtain features that are sensitive to the number of objects.
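The soft-label formula and the loss expression are not reproduced in this text, so the sketch below does not attempt to reconstruct them. It only illustrates the surrounding mechanics with stand-ins: deriving σ from the true counts (ascending order is an assumption), building a hard one-hot target in place of the soft labels, and evaluating a standard cross-entropy between a target matrix and P (averaging over rows is also an assumption). All function names are hypothetical.

```python
import math


def true_ranking(counts):
    """sigma[i] is the 1-based position of image i when sorted by object count.

    Assumption: ascending order, so the image with the fewest objects ranks first.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    sigma = [0] * len(counts)
    for position, image in enumerate(order, start=1):
        sigma[image] = position
    return sigma


def hard_label_matrix(sigma):
    """One-hot stand-in for the soft labels: row i has a 1 in column sigma[i]."""
    k = len(sigma)
    return [[1.0 if sigma[i] == j + 1 else 0.0 for j in range(k)]
            for i in range(k)]


def cross_entropy(target, p, eps=1e-12):
    """Standard cross-entropy -sum_ij target[i][j] * log(p[i][j]), averaged over rows."""
    k = len(target)
    total = 0.0
    for i in range(k):
        for j in range(k):
            total -= target[i][j] * math.log(p[i][j] + eps)
    return total / k


# Usage: three images with 30, 5 and 120 objects and a uniform prediction.
sigma = true_ranking([30, 5, 120])
target = hard_label_matrix(sigma)
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
loss = cross_entropy(target, uniform)
```

A soft-label target would replace `hard_label_matrix` while leaving the loss computation unchanged.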

The regression network is then trained with a mean squared error loss, which yields the final object-count regression model.
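For reference, the mean squared error between predicted and annotated counts over a batch can be computed as in this short sketch; the function name `mse_loss` and the example numbers are illustrative only.

```python
def mse_loss(predicted, actual):
    """Mean of squared differences between predicted and true object counts."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Usage: two images with predicted counts 10 and 20, true counts 12 and 17.
loss = mse_loss([10.0, 20.0], [12.0, 17.0])
```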

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the technical principle of the present invention, and such improvements and modifications shall also fall within the scope of protection of the present invention.

Claims

1. A weakly supervised object-count estimation method based on a ranking network, characterized by comprising the following steps:

S1, extracting image features with a pre-trained deep neural network such as VGG-16, and then regressing a density map with convolution operations; extracting multi-scale features from the density map with adaptive pooling layers to capture global and local information in the image, and feeding the features into a fully connected layer that regresses the number of objects, wherein the adaptive pooling layers comprise global sub-cluster layers and local sub-cluster layers;

S2, training the multi-scale features with an image object-count ranking network so that the multi-scale features are sensitive to the number of objects, wherein the ranking network is a multi-branch network whose input is the multi-scale features of a plurality of images and whose output is the ordering of the images by the number of objects they contain;

S3, using a Sinkhorn layer in the ranking network to convert the ranking features into an ordinal matrix, constructing a soft-label transport matrix from the true number of objects in each image, and training the ranking network with a cross-entropy loss to obtain features sensitive to the number of objects; and then training a regression network to obtain the final object-count regression model.

2. The weakly supervised object-count estimation method based on a ranking network according to claim 1, wherein step S1 specifically comprises: extracting image features with a deep network model pre-trained on an image analysis task, and regressing a pseudo probability density map; then constructing global sub-cluster layers from pooling layers with larger strides to extract global features from the density map; and constructing local sub-cluster layers from pooling layers with smaller strides to extract local features from the density map.

3. The weakly supervised object-count estimation method based on a ranking network according to claim 1, wherein step S2 specifically comprises: fine-tuning the feature extraction model with a multi-branch ranking network to obtain global and local features that are sensitive to the number of objects in the image.

4. The weakly supervised object-count estimation method based on a ranking network according to claim 1, wherein step S3 specifically comprises: using a differentiable Sinkhorn layer to convert the ranking features into an ordinal matrix; constructing a more effective soft-label transport matrix and training the ranking network with a cross-entropy loss; and training the regression network with a mean squared error loss.