CN108460403A

Movatterモバイル変換

Info

Publication number: CN108460403A
Application number: CN201810065807.1A
Authority: CN
Inventors: 张重阳; 程浩; 刘泽祥
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2018-08-28

Abstract

The invention discloses the object detection method and system of multi-scale feature fusion in a kind of image, the method includes：The first step carries out the scaling of different scale using picture to be detected, constructs image pyramid；Second step obtains the multiple scale detecting template of one group of most of sample size of covering using statistics clustering method；Third walks, and is based on multiple scale detecting template, carries out the target context structure of dimension self-adaption；4th step, multiple dimensioned depth characteristic fusion；5th step, the non-maxima suppression based on soft-decision.The present invention is by constructing the serial of methods such as the sparse pyramid of image multiresolution, multiple scale detecting masterplate, masterplate dimension self-adaption context, the fusion of multiple dimensioned depth characteristic, the abundant excavation and fusion for realizing depth characteristic utilize, and can promote target detection performance.

Description

Translated fromChinese

一种图像中多尺度特征融合的目标检测方法与系统A target detection method and system for multi-scale feature fusion in an image

技术领域technical field

本发明涉及的是一种图像中目标检测领域的方法，具体是一种图像中多特征融合的目标检测方法与系统。The invention relates to a method in the field of target detection in images, in particular to a target detection method and system for multi-feature fusion in images.

背景技术Background technique

图像中的目标检测识别，在诸如智能视频监控等应用场合，具有广泛的实用需求，也是计算机视觉领域较为热门的研究方向。现有的图像目标检测方法，因为尚存如下困难和挑战，检测结果还有待提升：(1)同类目标之间，颜色纹理形状等表观特征存在较大的多样性、差异性。(2)同类目标存在姿态的多样性，导致类内样本的结构特征存在较大变化。如现实中目标具有直立、卧倒、倾斜等姿态，不同的姿态的同类目标会呈现不同的轮廓、形状等结构特征；(3)同类目标高度、宽度等尺寸大小和比例变化区间大。一方面目标物理高度则会有较大的分布区间，另一方面由于拍摄距离的不同目标在图像中也会呈现不同的大小、比例等尺度变化。(4)目标的遮挡会影响检测结果。目标被遮挡后其部分信息是缺失的，增加了检测难度。(5)目标所处环境背景和光照的多样性导致误检增加。目标出现在为室外时，如城市道路、出入口等，其背景往往较为复杂，且一些复杂的背景如树木、路灯会与目标产生混淆、导致误检。Object detection and recognition in images has a wide range of practical needs in applications such as intelligent video surveillance, and it is also a popular research direction in the field of computer vision. Due to the following difficulties and challenges in the existing image target detection methods, the detection results still need to be improved: (1) There are great diversity and differences in the appearance features such as color, texture and shape among similar targets. (2) There is a diversity of poses in the same type of target, resulting in a large change in the structural characteristics of the samples within the class. For example, in reality, the target has postures such as upright, lying down, and tilting, and similar targets with different postures will present different structural features such as outlines and shapes; (3) The size and ratio of the height and width of similar targets vary widely. On the one hand, the physical height of the target will have a large distribution range. On the other hand, due to the different shooting distances, the target will also show different scale changes such as size and proportion in the image. (4) The occlusion of the target will affect the detection results. After the target is occluded, some of its information is missing, which increases the difficulty of detection. (5) The diversity of the environment background and illumination of the target leads to the increase of false detection. When the target appears outdoors, such as urban roads, entrances and exits, etc., its background is often complex, and some complex backgrounds such as trees and street lights will confuse the target and cause false detection.

目前，较为成熟的目标检测方法基本可以分为两类：(1)基于背景建模。该方法主要用于视频中检测运动目标：即将输入的静态图像进行场景分割，利用混合高斯模型(GMM)或运动检测等方法，分割出其前景与背景，再在前景中提取特定运动目标。这类方法需要连续的图像序列来实现建模，不适合于单幅图像中的目标检测。(2)基于统计学习。即将所有已知属于某一类目标的图像收集起来形成训练集，基于一个人工设计的方法(如HOG、Harr等)对训练集图像提取特征。提取的特征一般为目标的灰度、纹理、梯度直方图、边缘等信息。继而根据大量的训练样本的特征库来构建行人检测分类器。分类器一般可用SVM，Adaboost及神经网络等模型。At present, more mature object detection methods can be basically divided into two categories: (1) Based on background modeling. This method is mainly used to detect moving objects in video: the input static image is about to be segmented, and the foreground and background are segmented by using Gaussian mixture model (GMM) or motion detection methods, and then specific moving objects are extracted in the foreground. Such methods require continuous image sequences for modeling and are not suitable for object detection in a single image. (2) Based on statistical learning. That is to collect all images known to belong to a certain type of target to form a training set, and extract features from the training set images based on an artificially designed method (such as HOG, Harr, etc.). The extracted features are generally the target's grayscale, texture, gradient histogram, edge and other information. Then build a pedestrian detection classifier based on the feature library of a large number of training samples. Classifiers generally use models such as SVM, Adaboost, and neural networks.

综合而言近年来基于统计学习的目标检测方法表现较优，基于统计学习的目标检测方法可以分为传统人工特征目标检测方法以及深度特征机器学习目标检测方法。In general, the target detection methods based on statistical learning have performed better in recent years. The target detection methods based on statistical learning can be divided into traditional artificial feature target detection methods and deep feature machine learning target detection methods.

传统人工特征目标检测方法主要是指其利用人工设计的特征，来进行目标目标的建模。近年来表现优秀的人工设计的特征方法主要包括：2010年Pedro F.Felzenszwalb等提出的DPM(Deformable Part Model)方法(Object detection with discriminativelytrained part-based models)。Piotr Dollár等2009年提出的ICF(Integral ChannelFeatures)、2014年提出的ACF方法(Fast Feature Pyramids for Object Detection)。2014年Shanshan Zhang等提出的Informed Harr方法(Informed Haar-like FeaturesImprove Pedestrian Detection)，致力于提取更加具有表征信息的Harr特征来进行训练。这些人工设计的特征虽然取得了一定的效果，但因为人工特征表征能力不足，仍存在检测精度不高问题。由于深度卷积神经网络模型所具有的更强大的特征学习与表达能力，在图像目标分类检测方面得到越来越广泛和成功的应用。基础的目标检测算子是R-CNN(Region-Convolutional Neural Network)模型。2014年Girshick等人提出RCNN用于通用目标的检测，之后又是提出了fast-rcnn和faster-rcnn，提高了基于深度学习目标检测方法的精度和速度。2016年提出的Yolo和SSD等方法，则通过Anchor等思想实现单一阶段的快速目标检测。这些基于深度学习技术的目标检测，大都是基于单一尺度、固定大小上下文的深度特征，仍存在深度特征利用不充分的问题，检测性能有待进一步提高。The traditional artificial feature target detection method mainly refers to the use of artificially designed features to model the target. The feature methods of artificial design that have performed well in recent years mainly include: the DPM (Deformable Part Model) method (Object detection with discriminatively trained part-based models) proposed by Pedro F. Felzenszwalb in 2010. The ICF (Integral Channel Features) proposed by Piotr Dollár et al. in 2009, and the ACF method (Fast Feature Pyramids for Object Detection) proposed in 2014. The Informed Harr method (Informed Haar-like Features Improve Pedestrian Detection) proposed by Shanshan Zhang et al. in 2014 is dedicated to extracting Harr features with more representational information for training. Although these artificially designed features have achieved certain results, there is still a problem of low detection accuracy due to insufficient characterization capabilities of artificial features. Due to the more powerful feature learning and expression capabilities of the deep convolutional neural network model, it has been more and more widely and successfully applied in image target classification and detection. The basic target detection operator is the R-CNN (Region-Convolutional Neural Network) model. In 2014, Girshick et al. proposed RCNN for the detection of general targets, and then proposed fast-rcnn and faster-rcnn, which improved the accuracy and speed of the deep learning-based target detection method. Methods such as Yolo and SSD proposed in 2016 achieve single-stage rapid target detection through ideas such as Anchor. Most of these target detections based on deep learning technology are based on single-scale, fixed-size context depth features, and there is still the problem of insufficient utilization of deep features, and the detection performance needs to be further improved.

发明内容Contents of the invention

针对基于深度模型的目标检测方法存在的不足，本发明提出一种图像中多尺度特征融合的目标检测方法与系统，通过构造图像多分辨率稀疏金字塔、多尺度检测模版、模版尺度自适应上下文、多尺度深度特征融合等一系列创造性方法，实现深度特征的充分挖掘和融合利用，提升目标检测性能。Aiming at the shortcomings of the target detection method based on the depth model, the present invention proposes a target detection method and system for multi-scale feature fusion in an image. A series of creative methods such as multi-scale deep feature fusion realize the full mining and fusion of deep features and improve the performance of target detection.

根据本发明的第一方面，提供一种图像中多尺度特征融合的目标检测方法，包括：According to a first aspect of the present invention, a target detection method for multi-scale feature fusion in an image is provided, including:

S1：利用待检测图片进行不同尺度的缩放，构造一个图像金字塔；S1: Use the image to be detected to zoom in different scales to construct an image pyramid;

S2：基于所述图像金字塔得到的训练图像，利用统计学聚类方法获取一组覆盖大多数样本尺度的多尺度检测模板；S2: Obtain a set of multi-scale detection templates covering most sample scales by using a statistical clustering method based on the training images obtained from the image pyramid;

S3：在上述多尺度检测模板的基础上，进行尺度自适应的目标上下文构建；S3: On the basis of the above multi-scale detection template, construct a scale-adaptive target context;

S4：根据目标上下文构建的结果，进行多尺度深度特征融合，得到多尺度特征图；S4: According to the result of the target context construction, multi-scale deep feature fusion is performed to obtain a multi-scale feature map;

S5：根据上述多尺度特征图，进行基于软判决的非极大值抑制，实现图像中多尺度特征融合的目标检测。S5: According to the above multi-scale feature map, non-maximum suppression based on soft decision is performed to realize the target detection of multi-scale feature fusion in the image.

优选地，所述S1中：为了使得检测网络利用一个或几个有限大小的检测框，能够对图像中不同大小的目标都能完整紧凑地进行框选采样，需要对原始图像进行多尺度缩放、使得原始目标经过多次缩放增加其被检测框完整紧凑框选的概率，将待训练图片通过按比例缩放成L个不同分辨率大小的图片，从而构造一个分辨率由高到低的图像金字塔。具体的，在训练的时候，对于每个原始的训练图像，进行多个尺度的缩放，得到不同尺度下的L个图像用于训练。在测试的时候，对于每个待检测的图像，同样进行多个尺度的缩放，得到不同尺度下的L个图像用于检测，并对这L个图像的检测结果，进行融合判决，得到最终的检测结果。Preferably, in said S1: in order to enable the detection network to use one or several detection frames of limited size, and to be able to perform frame selection and sampling of targets of different sizes in the image in a complete and compact manner, it is necessary to perform multi-scale scaling on the original image, The original target is scaled multiple times to increase the probability of being selected by the detection frame in a complete and compact frame, and the image to be trained is scaled into L images of different resolutions to construct an image pyramid with resolutions ranging from high to low. Specifically, during training, each original training image is scaled in multiple scales to obtain L images of different scales for training. During the test, for each image to be detected, multiple scales are also scaled to obtain L images of different scales for detection, and the detection results of these L images are fused and judged to obtain the final Test results.

优选地，所述利用统计学聚类方法获取一组覆盖大多数样本尺度的多尺度检测模板，是指：基于K-medoids聚类方法，并利用杰卡德距离(Jaccard distance)作为聚类评价指标，对训练数据集中目标按不同的宽高值及宽高比进行聚类，形成一组K个聚类中心的宽高比，作为覆盖绝大部分宽高比例的目标模板。Preferably, said use of statistical clustering method to obtain a set of multi-scale detection templates covering most sample scales refers to: based on K-medoids clustering method, and using Jaccard distance (Jaccard distance) as clustering evaluation The index clusters the targets in the training data set according to different width-height values and aspect ratios to form a set of K cluster center aspect ratios, which serve as target templates covering most of the aspect ratios.

优选地，所述进行尺度自适应的目标上下文构建，是指：将CNN网络卷积层输出的特征图上每个点的感受野，作为候选目标框；感受野相对于模板框多出的部分，即作为目标框的上下文，用来辅助目标的检测识别。Preferably, the scale-adaptive target context construction refers to: using the receptive field of each point on the feature map output by the CNN network convolution layer as a candidate target frame; the extra part of the receptive field relative to the template frame , that is, as the context of the target frame, it is used to assist the detection and recognition of the target.

更优选地，所述进行尺度自适应的目标上下文构建，最终得到一个上下文信息随目标尺度变化而变化的检测模型，即：小尺度目标将获得更大的上下文信息，而大尺度目标的上下文信息较少，从而满足不同尺度的目标对上下文信息的不同需求。More preferably, the scale-adaptive target context construction finally obtains a detection model in which context information changes with target scale changes, that is, small-scale targets will obtain greater context information, while large-scale targets will obtain more context information. Less, so as to meet the different needs of different scale targets for context information.

优选地，所述进行多尺度深度特征融合，是指：将CNN不同卷积层输出的特征图，选出M层进行融合，用于构造多尺度特征金字塔。Preferably, the multi-scale deep feature fusion refers to: selecting M layers for fusion of feature maps output by different convolutional layers of CNN to construct a multi-scale feature pyramid.

更优选地，所述进行多尺度深度特征融合，具体为：对于M个选出的卷积层中处于CNN网络最后一层，将其输出的特征图利用反卷积对其做上采样，使其扩大到与上一层特征图同样分辨率大小后，与上一层特征图做逐像素相加，得到融合相邻两层的多尺度特征图；再以此类推，反卷积扩大、与更上一层特征图融合，直到完成选出的所有M层特征图的融合。More preferably, the multi-scale deep feature fusion is specifically: for the M selected convolutional layers that are in the last layer of the CNN network, the feature map output by it is upsampled by deconvolution, so that After it is expanded to the same resolution as the feature map of the previous layer, it is added pixel by pixel with the feature map of the previous layer to obtain a multi-scale feature map that fuses two adjacent layers; and so on, deconvolution expansion, and The next level of feature map fusion is performed until the fusion of all selected M-level feature maps is completed.

优选地，所述基于软判决的非极大值抑制，是指：Preferably, the soft-decision-based non-maximum suppression refers to:

先选出来置信概率最大的检测框，通过其他的检测框与置信概率最大的检测框计算IOU(intersection over union)，超过某一阈值则将其置信概率降低；First select the detection frame with the highest confidence probability, calculate the IOU (intersection over union) through other detection frames and the detection frame with the highest confidence probability, and reduce the confidence probability if it exceeds a certain threshold;

将此置信概率最大的检测框去掉后，再选出剩下检测框中置信概率最大的检测框，并将其余的检测框与置信概率最大的检测框计算IOU，超过某一阈值则将其置信概率降低；After removing the detection frame with the highest confidence probability, select the detection frame with the highest confidence probability among the remaining detection frames, and calculate the IOU between the remaining detection frames and the detection frame with the highest confidence probability. If it exceeds a certain threshold, it will be trusted. The probability is reduced;

通过以上不断迭代，得到最后的筛选后的检测框。Through the above continuous iterations, the final filtered detection frame is obtained.

根据本发明的第二方面，提供一种图像中多尺度特征融合的目标检测系统，包括：According to a second aspect of the present invention, a target detection system for multi-scale feature fusion in an image is provided, including:

图像金字塔构建模块：利用待检测图片进行不同尺度的缩放，构造一个图像金字塔；Image pyramid construction module: use the image to be detected to zoom in different scales to construct an image pyramid;

多尺度检测模板构建模块：基于图像金字塔构建模块的所述图像金字塔得到的训练图像，利用统计学聚类方法获取一组覆盖大多数样本尺度的多尺度检测模板；Multi-scale detection template construction module: based on the training images obtained by the image pyramid of the image pyramid construction module, a group of multi-scale detection templates covering most sample scales are obtained by using statistical clustering methods;

目标上下文构建模块：在上述多尺度检测模板构建模块得到的多尺度检测模板的基础上，进行尺度自适应的目标上下文构建；Target context building module: on the basis of the multi-scale detection template obtained by the above-mentioned multi-scale detection template building module, perform scale-adaptive target context construction;

多尺度深度特征融模块：根据目标上下文构建模块的结果，进行多尺度深度特征融合，得到多尺度特征图；Multi-scale deep feature fusion module: According to the results of the target context building module, perform multi-scale deep feature fusion to obtain multi-scale feature maps;

目标检测模块：根据上述多尺度深度特征融模块的多尺度特征图，进行基于软判决的非极大值抑制，实现图像中多尺度特征融合的目标检测。Target detection module: According to the multi-scale feature map of the above-mentioned multi-scale deep feature fusion module, non-maximum value suppression based on soft judgment is performed to realize the target detection of multi-scale feature fusion in the image.

优选地，所述图像金字塔构建模块，在训练的时候，对于每个原始的训练图像，进行多个尺度的缩放，得到不同尺度下的L个图像用于训练。在测试的时候，对于每个待检测的图像，同样进行多个尺度的缩放，得到不同尺度下的L个图像用于检测，并对这L个图像的检测结果，进行融合判决，得到最终的检测结果。Preferably, the image pyramid construction module performs multiple scale scaling for each original training image during training to obtain L images of different scales for training. During the test, for each image to be detected, multiple scales are also scaled to obtain L images of different scales for detection, and the detection results of these L images are fused and judged to obtain the final Test results.

优选地，所述多尺度检测模板构建模块，该模块基于K-medoids聚类方法，并利用杰卡德距离(Jaccard distance)作为聚类评价指标，对训练数据集中目标按不同的宽高值及宽高比进行聚类，形成一组K个聚类中心的宽高比，作为覆盖绝大部分宽高比例的目标模板。Preferably, the multi-scale detection template construction module is based on the K-medoids clustering method, and uses Jaccard distance (Jaccard distance) as a clustering evaluation index, and the targets in the training data set are classified according to different width and height values and Aspect ratios are clustered to form a set of aspect ratios of K cluster centers, which serve as target templates covering most of the aspect ratios.

优选地，所述目标上下文构建模块，该模块将CNN网络卷积层输出的特征图上每个点的感受野，作为候选目标框；感受野相对于模板框多出的部分，即作为目标框的上下文，用来辅助目标的检测识别。Preferably, the target context building module uses the receptive field of each point on the feature map output by the convolutional layer of the CNN network as a candidate target frame; the extra part of the receptive field relative to the template frame is used as the target frame The context of the object is used to assist the detection and recognition of the target.

优选地，所述多尺度深度特征融模块，该模块将CNN不同卷积层输出的特征图，选出M层进行融合，用于构造多尺度特征金字塔。Preferably, the multi-scale depth feature fusion module selects M layers of feature maps output by different convolutional layers of CNN for fusion to construct a multi-scale feature pyramid.

更优选地，所述多尺度深度特征融模块，该模块对于M个选出的卷积层中处于CNN网络最后一层，将其输出的特征图利用反卷积对其做上采样，使其扩大到与上一层特征图同样分辨率大小后，与上一层特征图做逐像素相加，得到融合相邻两层的多尺度特征图；再以此类推，反卷积扩大、与更上一层特征图融合，直到完成选出的所有M层特征图的融合。More preferably, the multi-scale depth feature fusion module, which is in the last layer of the CNN network for the M selected convolutional layers, uses deconvolution to upsample the feature map output by it, so that it After expanding to the same resolution as the feature map of the previous layer, it is added pixel by pixel with the feature map of the previous layer to obtain a multi-scale feature map that fuses two adjacent layers; and so on, deconvolution is expanded, and more The feature map of the previous layer is fused until the fusion of all selected M layer feature maps is completed.

优选地，所述目标检测模块，其先选出来置信概率最大的检测框，通过其他的检测框与置信概率最大的检测框计算IOU(intersection over union)，超过某一阈值则将其置信概率降低；Preferably, the target detection module first selects the detection frame with the largest confidence probability, calculates the IOU (intersection over union) through other detection frames and the detection frame with the largest confidence probability, and reduces its confidence probability if it exceeds a certain threshold ;

与现有技术相比，本发明具有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

本发明中根据不同尺度的目标识别对上下文信息需求的不同，结合CNN的结构特点和感受野的概念，采用尺度自适应的上下文信息建模，用于辅助目标识别。In the present invention, according to the different needs of context information for target recognition of different scales, combined with the structural characteristics of CNN and the concept of receptive field, a scale-adaptive context information modeling is used to assist target recognition.

本发明中利用K-medoids聚类方法优化模板的选取方式，满足不同模板对目标检测模型检测效果。In the present invention, the K-medoids clustering method is used to optimize the selection mode of the template to meet the detection effect of different templates on the target detection model.

本发明中图像金字塔解决训练图片需要固定尺度的问题，设计利用crop的方法得到固定大小的图片用于训练。In the present invention, the image pyramid solves the problem that the training pictures need a fixed scale, and the design uses the crop method to obtain pictures of a fixed size for training.

本发明中充分的利用CNN不同层的特征，通过深层特征和浅层特征融合,能够同时利用深层特征的表征能力和浅层特征的细节信息，提升小尺度目标的检测精度。In the present invention, the features of different layers of CNN are fully utilized, and through the fusion of deep features and shallow features, the representation ability of deep features and detailed information of shallow features can be used simultaneously to improve the detection accuracy of small-scale targets.

本发明中建立多尺度特征检测机制，将特征组合之后再进行融合，然后再进行检测，最后将检测的结果融合，实现多尺度特征检测。In the present invention, a multi-scale feature detection mechanism is established, the features are combined and then fused, and then detected, and finally the detection results are fused to realize multi-scale feature detection.

本发明中利用了基于软判决的非极大值抑制，提升了融合效果。In the present invention, non-maximum value suppression based on soft decision is used to improve the fusion effect.

综上，本发明通过综合利用多分辨率金字塔、多尺度模板聚类、尺度自适应上下文信息、多尺度深度特征融合、软判决非极大值抑制等技术，增强图像目标的特征学习与表示能力，有效提升行人等目标检测的精度，同时较好地解决了现有技术中目标在小尺度、远距离时，密集目标检测等问题。In summary, the present invention enhances the feature learning and representation capabilities of image objects by comprehensively utilizing technologies such as multi-resolution pyramids, multi-scale template clustering, scale-adaptive context information, multi-scale deep feature fusion, and soft-decision non-maximum suppression. , effectively improving the accuracy of target detection such as pedestrians, and at the same time better solving the problems of dense target detection when the target is small-scale and long-distance in the prior art.

附图说明Description of drawings

通过阅读参照以下附图对非限制性实施例所作的详细描述，本发明的其它特征、目的和优点将会变得更明显：Other characteristics, objects and advantages of the present invention will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:

图1a、图1b为本发明一实施例中图像金字塔构建流程图；Fig. 1a, Fig. 1b are the flow charts of image pyramid construction in an embodiment of the present invention;

图2为本发明一实施例中多尺度模板获取流程图；Fig. 2 is a flowchart of multi-scale template acquisition in an embodiment of the present invention;

图3为本发明一实施例中多尺度特征融合实现流程图；FIG. 3 is a flow chart for realizing multi-scale feature fusion in an embodiment of the present invention;

图4为本发明一实施例中软判决的非极大值抑制实现流程图。Fig. 4 is a flow chart of implementing non-maximum value suppression of soft decisions in an embodiment of the present invention.

具体实施方式Detailed ways

下面结合具体实施例对本发明进行详细说明。以下实施例将有助于本领域的技术人员进一步理解本发明，但不以任何形式限制本发明。应当指出的是，对本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进。这些都属于本发明的保护范围。The present invention will be described in detail below in conjunction with specific embodiments. The following examples will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention. These all belong to the protection scope of the present invention.

现有的目标检测方法，对于某些尺寸较大的目标可以很好地识别，但是尺寸较大的目标在现实生活中只占很小的一部分，对于距离较远的目标，检测结果并不是很好。目标检测有以下特点，以行人目标为例：Existing target detection methods can identify some larger targets well, but the larger targets only account for a small part in real life, and the detection results are not very good for distant targets. it is good. Target detection has the following characteristics, taking pedestrian targets as an example:

特点一、尺度的多样性。一方面行人中有老人、中年人、孩童，其物理身高则会有较大的分布区间。另一方面由于拍摄距离的不同，对于同一个行人而言，越高、距离越远的拍摄点拍摄到行人其像素越低，高度越低；而越低、越近的拍摄点拍摄到的行人像素越高，高度越接近其真实身高。现有的方法对于高度大于100像素的行人检测效果较好，但是对于远距离，高度低的行人检测效果不好。行人检测应用在车辆辅助驾驶系统中时，往往都是需要系统检测远距离处出现的行人来提醒驾驶员，因此解决检测远距离的行人问题也是一个迫切的需求。Features 1. Diversity of scales. On the one hand, there are old people, middle-aged people, and children among pedestrians, and their physical heights will have a large distribution range. On the other hand, due to the different shooting distances, for the same pedestrian, the higher and the farther the shooting point, the lower the pixel and the lower the height of the pedestrian; The higher the pixels, the closer the height is to its real height. Existing methods have a good detection effect for pedestrians whose height is greater than 100 pixels, but are not effective for long-distance and low-height pedestrians. When pedestrian detection is applied in the vehicle assisted driving system, it is often necessary for the system to detect pedestrians appearing at a long distance to remind the driver. Therefore, it is also an urgent need to solve the problem of detecting long-distance pedestrians.

特点二、遮挡。现实中拍摄到的行人会存在遮挡情况，一方面会存在人群簇拥的情况。当多个人走在一起时，从各个角度看过去总会有人被挡住身体的一部分；另一方面行人可能会被环境中的物体，如数目、车辆、房屋遮挡住身体的一部分。此时行人被遮挡后其身体部分信息是缺失的，对于基于人体完整轮廓特征的检测器则会导致漏检的结果。Feature two, occlusion. The pedestrians photographed in reality will be occluded, and on the one hand, there will be crowds. When multiple people walk together, there will always be a part of someone's body blocked from all angles; on the other hand, pedestrians may be blocked by objects in the environment, such as numbers, vehicles, and houses. At this time, after the pedestrian is occluded, the body part information is missing, and for the detector based on the complete contour feature of the human body, it will lead to missed detection results.

本发明提出图像中基于多特征融合的目标检测方法，从上述问题出发，较好地解决了目标在小尺度、远距离时，密集目标检测等问题。本发明基于现实中存在的目标检测困难，提出了基于金字塔对于多尺度目标自适应检测，以及利用多层感受野不同进行融合，并采用软判决非极大值抑制针对密集目标检测进而提升目标检测的效果。其中：The present invention proposes a target detection method based on multi-feature fusion in an image. Starting from the above problems, it better solves the problems of dense target detection when the target is small-scale and long-distance. Based on the difficulty of target detection in reality, the present invention proposes self-adaptive detection of multi-scale targets based on pyramids, and uses different multi-layer receptive fields for fusion, and uses soft decision non-maximum value suppression to detect dense targets to improve target detection. Effect. in:

第一步，利用待检测图片进行不同尺度的缩放，构造包含从大小多个分辨率的图像金字塔；The first step is to use the image to be detected to zoom in different scales, and construct an image pyramid containing multiple resolutions from size to size;

第二步，利用统计学聚类方法，按照训练样本宽高大小对样本进行聚类，通过聚类形成K个聚类中心、每个聚类中心形成一个以该中心宽高均值为尺度的检测模板；K个聚类即形成一组K个不同尺度的模板组，作为检测器目标模板组；The second step is to use the statistical clustering method to cluster the samples according to the width and height of the training samples, form K cluster centers through clustering, and each cluster center forms a detection center with the mean width and height of the center as the scale. Template; K clusters form a set of K template groups of different scales as the detector target template group;

第三步，尺度自适应的目标上下文构建。The third step is the construction of scale-adaptive target context.

对各个尺度的检测模板框，都沿上下左右方向均匀扩展，扩展到同卷积神经网络(CNN，Convolutional Neural Network)最后输出的特征图每个点的感受野大小相同，扩展部分即形成包含目标的上下文信息，且包含的上下文大小同模板尺寸自适应。For the detection template frames of each scale, they are evenly expanded along the up, down, left, and right directions, and the receptive field size of each point of the feature map finally output by the same convolutional neural network (CNN, Convolutional Neural Network) is the same, and the expanded part is formed to contain the target. context information, and the included context size is adaptive to the size of the template.

第四步，多尺度特征融合。The fourth step is multi-scale feature fusion.

通过将CNN多个卷积层特征图在同一分辨率上进行逐像素相加融合，形成目标检测识别所需的多尺度深度特征。Multi-scale depth features required for target detection and recognition are formed by pixel-by-pixel addition and fusion of CNN multiple convolutional layer feature maps at the same resolution.

第五步，基于软判决的非极大值抑制。The fifth step is non-maximum suppression based on soft decision.

通过降低检测框的置信度代替直接删除检测框，再通过不断迭代对检测框进行筛选。Instead of directly deleting the detection frame, the confidence of the detection frame is reduced, and the detection frame is screened through continuous iteration.

同时，本发明通过集成上述方法步骤构造一个图像目标检测系统，该系统综合利用多分辨率金字塔、多尺度模板聚类、尺度自适应上下文信息、多尺度深度特征融合、软判决非极大值抑制等方法，增强图像目标的特征学习与表示能力，有效提升行人等目标检测的精度。At the same time, the present invention constructs an image target detection system by integrating the above method steps, and the system comprehensively utilizes multi-resolution pyramids, multi-scale template clustering, scale-adaptive context information, multi-scale depth feature fusion, and soft-decision non-maximum value suppression and other methods to enhance the feature learning and representation capabilities of image targets, and effectively improve the accuracy of target detection such as pedestrians.

具体的，一种图像中基于多特征融合的目标检测系统，包括：Specifically, a target detection system based on multi-feature fusion in an image, including:

以下以行人检测为例对本发明上述方法和系统进行详细的说明，尤其是上述的五个步骤/模块中涉及的实施。The above-mentioned method and system of the present invention will be described in detail below by taking pedestrian detection as an example, especially the implementation involved in the above-mentioned five steps/modules.

一、利用待检测图片进行不同尺度的缩放构造图像金字塔。1. Use the image to be detected to zoom in different scales to construct an image pyramid.

为了使得检测网络利用一个或几个有限大小的检测框，能够对图像中不同大小的目标都能完整紧凑地进行框选采样，需要对原始图像进行多尺度缩放、使得原始目标经过多次缩放增加其被检测框完整紧凑框选的概率。将待训练图片通过按比例缩放成L个不同分辨率大小的图片，从而构造一个分辨率由高到低的图像金字塔。In order to enable the detection network to use one or several detection frames of limited size to completely and compactly select and sample objects of different sizes in the image, it is necessary to perform multi-scale scaling on the original image, so that the original object can be increased after multiple scalings. The probability that it is selected by the detection frame is complete and compact. The image to be trained is scaled into L images of different resolutions to construct an image pyramid with resolutions from high to low.

在行人检测模型训练和测试的时候，利用图像金字塔的原理对原始数据进行缩放处理，如附图1a、1b所示。During the training and testing of the pedestrian detection model, the principle of image pyramid is used to scale the original data, as shown in Figures 1a and 1b.

具体来说就是在训练的时候，对于待训练的每一个图像，进行0.5X、1X和2X的缩放，得到不同尺度下的图像用于训练，如附图1a所示。在测试的时候，对于每个待检测的图像，进行0.5X、1X和2X的缩放，得到不同尺度下的图像用于检测，并对三个尺度图像的检测结果，进行融合判决，得到最终的检测结果，如附图1b所示。Specifically, during training, each image to be trained is scaled by 0.5X, 1X, and 2X to obtain images of different scales for training, as shown in Figure 1a. During the test, each image to be detected is scaled by 0.5X, 1X and 2X to obtain images of different scales for detection, and the detection results of the three scale images are fused and judged to obtain the final The test results are shown in Figure 1b.

在本实施例中，为了方便计算反向传播的梯度，所以输入的图片需要固定的尺度，例如Caltech中所有图片的大小都是640*480像素。在经随机缩放后，训练图片尺度发生了变化，640*480大小的图片经过0.5X缩放之后变成了320*240，而经过2X缩放之后尺度则变为1280*960。为了获得统一尺度的训练图片，在训练的时候，从经过缩放的图片中裁剪出尺度为640*480的图片，具体操作流程如图1a所示。即对于大小为320*240的图片，用图像填充至640*480，而对于大小为1280*960的图片，则从中随机裁剪出640*480的小图，然后将这三种图片同时用于训练，如附图1a所示。这种做法能有效增加训练样本数量，提升深度学习等数据驱动方法的性能。In this embodiment, in order to facilitate the calculation of the backpropagation gradient, the input picture needs to have a fixed scale, for example, the size of all pictures in Caltech is 640*480 pixels. After random zooming, the scale of the training picture has changed. The size of the 640*480 picture becomes 320*240 after 0.5X scaling, and the scale becomes 1280*960 after 2X scaling. In order to obtain training pictures of a uniform scale, during training, a picture with a scale of 640*480 is cut out from the scaled pictures. The specific operation process is shown in Figure 1a. That is, for a picture with a size of 320*240, fill it with an image to 640*480, and for a picture with a size of 1280*960, randomly cut out a small picture of 640*480, and then use these three kinds of pictures for training at the same time , as shown in Figure 1a. This approach can effectively increase the number of training samples and improve the performance of data-driven methods such as deep learning.

二、利用统计学聚类方法获取最佳的检测器模板组。2. Obtain the best detector template group by using the statistical clustering method.

本发明利用统计学聚类方法获取最佳的检测器模板组，提取训练图像中已标定的目标矩形框，将框进行K-medoids聚类方法得到不同一组(K个)多尺度模板。这里的尺度是指目标框的大小(长和宽)和比例(宽高比)。利用统计学的方法获取K个多尺度模板，使其既能覆盖绝大多数训练样本尺度又兼顾类内样本尺度的差异化。在选取模板的时候即要考虑样本尺度的差异化，避免仅用单一或少数尺度模板造成匹配模板过少、难以满足多尺度目标与模板精准匹配识别问题，而是统计数据集中目标样本宽高及比例的分布情况，通过聚类形成多个聚类中心、每个聚类中心形成一个尺度(平均宽高比但不限于此)模板，从而根据样本尺度分布来形成一组多尺度模板；同时，通过限定聚类中心的数量，来避免中心过多导致某一尺度的模板训练样本太少、检测器得不到充分训练问题。聚类方法可采用K-medoids等，实现对数据集中的样本宽高及宽高比进行聚类和多尺度模板选取。The invention uses a statistical clustering method to obtain the best detector template group, extracts the calibrated target rectangular frame in the training image, and performs K-medoids clustering method on the frame to obtain different groups (K) of multi-scale templates. The scale here refers to the size (length and width) and ratio (aspect ratio) of the target box. Using statistical methods to obtain K multi-scale templates, it can not only cover most of the training sample scales but also take into account the difference of sample scales within the class. When selecting templates, it is necessary to consider the difference of sample scales, to avoid too few matching templates caused by only one or a few scale templates, and it is difficult to meet the problem of multi-scale targets and accurate matching and recognition of templates. According to the distribution of the ratio, multiple cluster centers are formed by clustering, and each cluster center forms a scale (average aspect ratio but not limited to this) template, thereby forming a set of multi-scale templates according to the sample scale distribution; at the same time, By limiting the number of cluster centers, it is possible to avoid the problem that too many centers lead to too few template training samples of a certain scale, and the detector is not fully trained. The clustering method can use K-medoids, etc. to realize clustering and multi-scale template selection for the width, height and aspect ratio of samples in the data set.

具体操作时，可以先定义模板的尺度，覆盖不同尺度的行人目标。通常直立行人的宽高比一般为1:3，根据这个经验并结合Caltech数据集中行人目标的大致高度分布，可以手动的去选择典型的模板尺度，如30*90、50*150等，但是这种手工选取的方式不仅不具备通用性，并且很大概率不能选取最合适的模板。这是因为：In the specific operation, the scale of the template can be defined first to cover pedestrian targets of different scales. Usually the aspect ratio of upright pedestrians is generally 1:3. Based on this experience and combined with the approximate height distribution of pedestrian targets in the Caltech dataset, typical template scales can be manually selected, such as 30*90, 50*150, etc., but this This manual selection method is not only not universal, but also has a high probability of not being able to select the most suitable template. This is because:

首先，在不同的数据集中，由于图片分辨率和监控视角的差异，行人高度的分布是不一样的，这种手工选取模板的方法不具备通用性；First of all, in different data sets, due to the difference in image resolution and monitoring perspective, the distribution of pedestrian heights is different. This method of manually selecting templates is not universal;

其次，在真实的监控场景中，由于行人姿态、摄像头角度、遮挡信息等情况的出现，导致行人的宽高比发生较大变化，导致选取的模板不具备典型性。同时，在选取模板的时候要考虑数据集中目标高度的分布情况，如果某一尺度的模板对应的训练样本太少，会导致这个模板对应的检测器得不到充分的训练。Secondly, in the real monitoring scene, due to the emergence of pedestrian posture, camera angle, occlusion information, etc., the aspect ratio of pedestrians changes greatly, resulting in the selected template not being typical. At the same time, when selecting a template, the distribution of the target height in the data set should be considered. If there are too few training samples corresponding to a template of a certain scale, the detector corresponding to this template will not be fully trained.

因此，本实施例中提出通过K-medoids聚类的方法对行人数据集中的样本的宽高进行聚类，利用统计学的方法获取K个多尺度模板(本实施例K＝32，当然，在其他实施例中也可以是其他数目)。Therefore, in this embodiment, it is proposed to cluster the width and height of the samples in the pedestrian data set by K-medoids clustering method, and use statistical methods to obtain K multi-scale templates (K=32 in this embodiment, of course, in Other numbers are also possible in other embodiments).

如图2所示，基于K-medoids聚类方法，对训练数据集中行人高度和宽度进行聚类。利用杰卡德距离(Jaccard distance)作为K-medoids的聚类评价指标，即：As shown in Figure 2, based on the K-medoids clustering method, the height and width of pedestrians in the training data set are clustered. Use Jaccard distance as the clustering evaluation index of K-medoids, namely:

d(s_i,s_j)＝1-J(s_i,s_j)d(s_i ,s_j )=1-J(s_i ,s_j )

其中，s_i＝(h_i,w_i)；s_j＝(h_j,w_j)表示数据集中两个不同的行人框，h,w分别表示行人框的高度与宽度，J表示标准的杰卡德相似度(Jaccard similarity coefficient)，Among them, s_i ＝(h_i ,w_i ); s_j ＝(h_j ,w_j ) represent two different pedestrian frames in the data set, h and w represent the height and width of the pedestrian frame respectively, and J represents the standard Jie Jaccard similarity coefficient,

三、尺度自适应的目标上下文构建。3. Scale-adaptive target context construction.

本发明采用尺度自适应的目标上下文构建，一方面，小尺度目标因为有用信息较少、往往需要更多的上下文信息来辅助识别；而大尺度目标则往往不需要大量的上下文。另一方面，由于CNN全连接层每个特征点对应原图中一个尺度固定的感受野，可以基于感受野来构造不同模板的上下文。The present invention adopts scale-adaptive object context construction. On the one hand, small-scale objects often need more context information to assist recognition because of less useful information; while large-scale objects often do not need a large amount of context. On the other hand, since each feature point of the CNN fully connected layer corresponds to a fixed-scale receptive field in the original image, the context of different templates can be constructed based on the receptive field.

具体的，对K-medoids聚类法回归出的K个不同尺度的模板框，沿上下左右进行扩展、并均扩展到与CNN全连接层每个点的感受野大小一致。因为感受野大小固定，则小尺度的模板框需做较大尺度的扩展，因而可以获得较大尺度的上下文；而大尺度模板则相反，获得的上下文较小。基于该方法，可获得与模板框尺度自适应的上下文信息，用于辅助目标识别。Specifically, the K template boxes of different scales returned by the K-medoids clustering method are expanded along the up, down, left, and right sides, and are all extended to the same size as the receptive field of each point of the CNN fully connected layer. Because the size of the receptive field is fixed, the small-scale template frame needs to be expanded to a larger scale, so a larger-scale context can be obtained; while the large-scale template is the opposite, and the obtained context is smaller. Based on this method, the context information adaptive to the scale of the template box can be obtained to assist target recognition.

四、多尺度深度特征融合。4. Multi-scale deep feature fusion.

考虑到CNN每一个卷积层输出的特征图，都含有不同尺度上的有用特征，浅层的输出往往包含更多的局部细节特征，而高层的特征往往包含更多的全局性和语义性信息。对这些不同层得到的不同尺度的特征，进行融合利用即可获得更丰富的特征表达。基于该思想本发明设计了一种多尺度融合方法：Considering that the feature maps output by each convolutional layer of CNN contain useful features on different scales, the output of the shallow layer often contains more local detail features, while the high-level features often contain more global and semantic information. . By fusing and utilizing the features of different scales obtained by these different layers, a richer feature expression can be obtained. Based on this idea, the present invention designs a multi-scale fusion method:

如图3所示，对于CNN的M个主要卷积层(往往是CNN网络中最后面的M个层，本实施例为3)输出的特征图，从深至浅，先各自利用反卷积对最深层的特征图进行反卷积上采样，使其变换到同上一层输出特征图同一个分辨率尺度上，然后将这两层同一尺度的特征图做逐像素相加，得到融合后的多尺度特征图；再据此类推，实现所有M层的融合，得到融合M层特征的多尺度特征图。例如输入图片经过Resnet网络之后，将res5层的输出结果反卷积一次得到和res4相同的分辨率，并和res4的结果相加，再将其反卷积一次得到和res3相同的分辨率并和res3的结果相加进行特征融合得到最终结果进行测试。As shown in Figure 3, for the feature maps output by the M main convolutional layers of CNN (usually the last M layers in the CNN network, this embodiment is 3), from deep to shallow, first use deconvolution Deconvolution and upsampling is performed on the deepest feature map to transform it to the same resolution scale as the output feature map of the previous layer, and then add the feature maps of the same scale of the two layers pixel by pixel to obtain the fused Multi-scale feature map; and so on, realize the fusion of all M layers, and obtain a multi-scale feature map that fuses M layer features. For example, after the input image passes through the Resnet network, the output result of the res5 layer is deconvolved once to obtain the same resolution as res4, and added to the result of res4, and then deconvolved once to obtain the same resolution as res3 and combined The results of res3 are added for feature fusion to get the final result for testing.

五、软判决的非极大值抑制。5. Non-maximum suppression of soft decisions.

在以前的方法中，用非极大值抑制方法融合检测框得到最终的检测结果。传统的非极大值抑制方法是一种基于贪心的硬判卷方法，在融合的过程中可能会导致正确的检测框被抑制，尤其是当IOU(Intersection of Units，两个矩形框之间的重叠区域与合并区域的比值)阈值选择不恰当的时候。In previous methods, non-maximum suppression is used to fuse the detection boxes to obtain the final detection results. The traditional non-maximum value suppression method is a greedy hard judgment method, which may cause the correct detection frame to be suppressed during the fusion process, especially when the IOU (Intersection of Units, the distance between two rectangular frames When the ratio of the overlapping area to the merged area) threshold is not selected properly.

如图4所示，本发明中软判决的非极大值抑制整体思路就是：通过降低检测框的置信度代替直接删除检测框。As shown in FIG. 4 , the overall idea of non-maximum value suppression in the soft decision in the present invention is to reduce the confidence of the detection frame instead of directly deleting the detection frame.

本实施例中，具体操作为：先选出来置信概率最大的检测框，然后通过其他的检测框与最大置信概率的框计算IOU，超过某一阈值则将其置信概率降低。然后将此最大框去掉后在选出剩下检测框中最大置信概率框，并将其余的检测框与最大置信概率的框计算IOU，超过某一阈值则将其置信概率降低，通过不断迭代，得到最后的筛选后的检测框。In this embodiment, the specific operation is: first select the detection frame with the highest confidence probability, then calculate the IOU through other detection frames and the frame with the highest confidence probability, and reduce the confidence probability if it exceeds a certain threshold. Then remove the largest frame and select the maximum confidence probability frame in the remaining detection frames, and calculate the IOU of the remaining detection frames and the maximum confidence probability frame. If it exceeds a certain threshold, its confidence probability will be reduced. Through continuous iteration, Get the final filtered detection box.

本实施例中可以采用线性加权的方式，根据iou的值来降低检测框的得分：In this embodiment, linear weighting can be used to reduce the score of the detection frame according to the value of iou:

该公式中，M表示置信概率最大的检测框，b_i表示第i个检测框，N_i表示IOU阈值(可为经验值或预设值)，iou(M，b_i)表示M与b_i的IOU值。In this formula, M represents the detection frame with the highest confidence probability, b_i represents the i-th detection frame, N_i represents the IOU threshold (which can be an empirical value or a preset value), and iou(M, b_i ) represents M and b_i The IOU value.

因此，本发明软判决对于密集人群检测效果更好。Therefore, the soft decision of the present invention has a better effect on dense crowd detection.

以上通过行人检测为例对本发明具体实现的一些细节和优选技术特征进行了详细描述。本发明也可以应用于其他图像中目标检测，并不限于行人检测，其他目标检测的操作与上述实施例类似，在此不再另行举例。Above, some details and preferred technical features of the specific implementation of the present invention are described in detail by taking pedestrian detection as an example. The present invention can also be applied to target detection in other images, and is not limited to pedestrian detection. The operation of other target detection is similar to the above embodiment, and no further examples are given here.

综上，本发明通过构造图像多分辨率稀疏金字塔、基于聚类优化的多尺度检测模版、模版尺度自适应上下文、多尺度深度特征融合等一系列创造性方法，实现深度特征的充分挖掘和融合利用，取得目标检测性能的提升。In summary, the present invention achieves full mining and fusion utilization of deep features by constructing image multi-resolution sparse pyramids, multi-scale detection templates based on clustering optimization, template scale adaptive context, and multi-scale deep feature fusion. , to achieve an improvement in object detection performance.

以上对本发明的具体实施例进行了描述。需要理解的是，本发明并不局限于上述特定实施方式，本领域技术人员可以在权利要求的范围内做出各种变形或修改，这并不影响本发明的实质内容。Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art may make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention.