CN108399362B

Info

Publication number: CN108399362B
Application number: CN201810069322.XA
Authority: CN
Inventors: 林倞; 尹森堂; 张冬雨; 王青
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-01-24
Filing date: 2018-01-24
Publication date: 2022-01-07
Anticipated expiration: 2038-01-24
Also published as: CN108399362A; WO2019144575A1

Abstract

The invention discloses a rapid pedestrian detection method and device. The method comprises the following steps: step S1, constructing a configurable deep model based on a convolutional neural network and learning the parameters of the constructed network from training samples to obtain a model for the test process; step S2, inputting a test sample and, through the trained model, using different intermediate layers to detect target objects in different scale ranges according to how the receptive field of the neural network changes with depth, thereby predicting the bounding boxes of the target objects in the image. By using different intermediate layers to detect target objects within specific scale ranges, the invention better matches the receptive field to the object size and effectively improves detection results.

Description

Rapid pedestrian detection method and device

Technical Field

The invention relates to the technical field of pedestrian detection, and in particular to a deep-learning-based rapid pedestrian detection method and device for embedded systems.

Background

As part of target detection in computer vision, pedestrian detection is of great importance in real-world applications. As image acquisition technology matures and storage costs fall, more and more cameras are deployed in public places; at the same time, with the rollout of automatic driving and intelligent transportation, vehicle-mounted cameras also generate massive amounts of video. Traditional manual screening and processing is inefficient, consumes large amounts of manpower and material resources, and may introduce human bias. In recent years deep learning has achieved unprecedented breakthroughs in computer vision: it is far more efficient than manual work and, in many fields, more accurate than humans. How to use deep learning effectively for pedestrian detection is therefore receiving growing attention.

People are among the most important targets in video surveillance and automatic driving, and the primary task of pedestrian detection is to identify the presence of human bodies and provide the corresponding annotation information. The quality of images captured in the real world is uneven, and detecting small objects and occluded objects has always been difficult in pedestrian detection. In addition, vehicle-mounted cameras often capture blurred images, and the images contain many objects that resemble pedestrians but are not. In embedded systems in particular, large neural network models with strong recognition capability are usually difficult to run efficiently on devices with limited computing resources, while embedded applications require real-time operation. Rapid pedestrian detection for embedded systems must therefore take both detection accuracy and efficiency into account.

Disclosure of Invention

In order to overcome the defects of the prior art, an object of the present invention is to provide a rapid pedestrian detection method and device which exploit the way the receptive field of a neural network changes with depth and use different intermediate layers to detect target objects within specific scale ranges, so as to better match the receptive field to the object size and effectively improve detection results.

Another object of the invention is to provide a rapid pedestrian detection method and device which, by adapting and training the VGG-16 network, obtain a Squeeze VGG-16 network that meets the requirements of an embedded system, effectively reducing the number of parameters of the network model and speeding up computation.

Another object of the present invention is to provide a rapid pedestrian detection method and device which enlarge the feature map of a specific network layer by deconvolution so as to enhance the detection of small objects, while adding almost no video memory or computation compared with the conventional approach of enlarging the image.

Still another object of the present invention is to provide a rapid pedestrian detection method and device which add a region 1.5 times the size of the target object to the network as a background semantic feature, and thereby perform very well on blurred objects and small distant objects.

In order to achieve the above and other objects, the present invention provides a rapid pedestrian detection method, comprising the steps of:

step S1, constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network from training samples to obtain a model for the test process;

and step S2, inputting a test sample and, through the trained model, using different intermediate layers to detect target objects in different scale ranges according to how the receptive field of the neural network changes with depth, and predicting the bounding boxes of the target objects in the image.

Preferably, the step S1 further includes:

constructing a configurable deep model based on a convolutional neural network;

inputting training samples;

initializing the convolutional neural network and its parameters, including the weights and biases of the connections in each layer of the network;

and learning the parameters of the constructed network from the training samples using a forward propagation algorithm and a backward propagation algorithm, thereby obtaining the model for the test process.

Preferably, the deep model includes a multi-scale target candidate network and a target detection network. Based on the differences between the features produced by different layers of the convolutional neural network, the target candidate network generates candidate boxes for target objects of different scales at intermediate layers; the target detection network performs refined classification and detection on the basis of the candidate boxes output by the target candidate network.

Preferably, the convolutional neural network is formed by stacking convolutional layers, down-sampling layers and up-sampling layers. A convolutional layer performs a convolution operation on an input image or feature map in two-dimensional space and extracts hierarchical features; a down-sampling layer uses a non-overlapping max-pooling operation, which extracts features that are invariant to shape and offset, reduces the size of the feature map and improves computational efficiency; an up-sampling layer performs a deconvolution operation on the input feature map in two-dimensional space so as to increase the number of pixels of the feature map.
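The non-overlapping max-pooling used by the down-sampling layer can be illustrated with a minimal pure-Python sketch. The 2 × 2 window and the toy feature map are assumptions made for illustration; the text only states that the windows do not overlap.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2-D list (stride equals window size).

    The 2x2 window is an assumption; the source only says "non-overlapping".
    Trailing rows or columns that do not fill a window are dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - rows % 2, 2):
        pooled_row = []
        for c in range(0, cols - cols % 2, 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 feature map (hypothetical values).
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
print(max_pool_2x2(fmap))  # [[4, 5], [3, 7]]
```

Each output value summarizes one 2 × 2 window, so the feature map is halved in both dimensions, and moving a maximum within its window leaves the output unchanged.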

Preferably, the deep model adopts a Squeeze VGG-16 convolutional neural network as the backbone network, and the Squeeze VGG-16 convolutional neural network uses a conv1-1 layer immediately followed by 12 Fire module layers as its feature extraction layers.

Preferably, on the basis of the Squeeze VGG-16 convolutional neural network, the target candidate network generates network branches at the Fire9, Fire12 and conv6 layers and at an added pooling layer, according to the characteristics of the convolutional layers, so as to regress candidate boxes for objects of different scales.

Preferably, on the basis of the target candidate region, the target detection network takes an image region of a preset multiple of the size of the target candidate region as background semantic information of the target, up-samples the feature map of the Fire9 layer once as information for enhancing the perception of small objects, applies region-of-interest pooling to the background semantic information and the up-sampled information to obtain features of a fixed size, and then adds a fully connected layer to regress the category and the final candidate box.

Preferably, the training samples include RGB image data and annotation information for the pedestrian regions in the images, and the image data actually used for training are small patches cropped according to the regions where pedestrians are located.

Preferably, the backward propagation algorithm first computes the loss function $L(W)$ between the target bounding boxes predicted by forward propagation and the actual target bounding boxes of the image, then computes the gradient with respect to the parameters $W$ and updates $W$ by gradient descent so as to minimize the loss function.

Assuming that the intermediate layers have $M$ branches that output target candidate regions, let $l^m$ denote the loss function of branch $m$, $\alpha_m$ the weight of $l^m$, and $S = \{S^1, S^2, \ldots, S^M\}$ the target objects of the corresponding scales, with $(X_i, Y_i)$ a training sample and its annotation. The loss function $L(W)$ can then be defined as:

$$L(W) = \sum_{m=1}^{M} \sum_{i \in S^m} \alpha_m \, l^m(X_i, Y_i \mid W)$$

To achieve the above objects, the present invention further provides a rapid pedestrian detection system, including:

a training unit for constructing a configurable deep model based on a convolutional neural network and learning the parameters of the constructed network from training samples to obtain a model for the test process;

and a detection unit for inputting a test sample and, through the trained model, using different intermediate layers to detect target objects in different scale ranges according to how the receptive field of the neural network changes with depth, and predicting the bounding boxes of the target objects in the image.

Compared with the prior art, the rapid pedestrian detection method and device of the invention draw on network compression techniques, adapting and training the VGG-16 network to obtain a Squeeze VGG-16 network that meets the requirements of an embedded system, which effectively reduces the number of parameters of the network model and speeds up computation. Furthermore, to address the mismatch between receptive field and object size in traditional detection methods, the invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger its receptive field and the larger the objects it is suited to detect) and uses different intermediate layers to detect objects within specific scale ranges, which better matches the receptive field to the object size and effectively improves detection results. In addition, to enhance the detection of small objects, the feature map of a specific network layer is enlarged by deconvolution, which adds almost no video memory or computation compared with the traditional approach of enlarging the image. Finally, to enhance the detection of blurred objects, a region 1.5 times the size of the target object on the feature map of that layer is added to the network as a background semantic feature, which gives excellent performance on blurred objects and small distant objects.

Drawings

FIG. 1 is a flow chart of the steps of a rapid pedestrian detection method of the present invention;

FIG. 2 is a schematic diagram of the structure of the Squeeze VGG-16 neural network according to the embodiment of the present invention;

FIG. 3 is a diagram of a Fire module in accordance with an embodiment of the present invention;

FIG. 4 is a schematic diagram of a target candidate network according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a target detection network according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a rapid pedestrian detection process according to an embodiment of the present invention;

FIG. 7 is a system architecture diagram of a rapid pedestrian detection device in accordance with the present invention;

FIG. 8 is a detailed block diagram of a training unit in accordance with an embodiment of the present invention;

FIG. 9 is a detailed block diagram of a detection unit in accordance with an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below through specific examples in conjunction with the accompanying drawings, and other advantages and effects of the invention will be readily apparent to those skilled in the art from this disclosure. The invention can also be implemented or applied through other, different embodiments, and the details herein can be modified in various respects without departing from the spirit and scope of the present invention.

Fig. 1 is a flow chart illustrating steps of a rapid pedestrian detection method according to the present invention. As shown in fig. 1, the present invention provides a rapid pedestrian detection method, which comprises the following steps:

Step S1, constructing a configurable deep model based on a convolutional neural network, and learning the parameters of the constructed network from training samples to obtain a model for the test process. In a specific embodiment of the present invention, the deep model is composed of two sub-networks. The first sub-network is a multi-scale target candidate network, used for extracting pedestrian features and producing candidate regions; specifically, based on the differences between the features produced by different layers of the convolutional neural network, it generates candidate boxes for pedestrians of different scales at intermediate layers. The second sub-network is a target detection network, which improves the detection results; it shares parameters with the target candidate network and performs refined classification and detection on the basis of the candidate boxes. Specifically, step S1 further includes:

Step S100, constructing a configurable deep model based on a convolutional neural network.

The convolutional neural network is formed by stacking convolutional layers, down-sampling layers and up-sampling layers. A convolutional layer performs a convolution operation on an input image or feature map in two-dimensional space and extracts hierarchical features; a down-sampling layer uses a non-overlapping max-pooling operation, which extracts features that are invariant to shape and offset, reduces the size of the feature map and improves computational efficiency; an up-sampling layer performs a deconvolution operation on the input feature map in two-dimensional space to increase the number of pixels of the feature map, and is mainly used in the target detection network to improve detection results.

In the specific embodiment of the invention, a Squeeze VGG-16 convolutional neural network is used as the backbone network. As shown in FIG. 2, the Squeeze VGG-16 convolutional neural network uses a conv1-1 layer immediately followed by 12 Fire modules as the convolutional layers for feature extraction, and pool1 to pool5 are down-sampling layers. A model pre-trained on the ImageNet dataset is used for initialization; that is, the invention first pre-trains Squeeze VGG-16 on the ImageNet dataset and uses the result to initialize the network.
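To make the effect of the deconvolution (transposed convolution) up-sampling layer concrete, the following sketch computes its output size with the standard transposed-convolution size formula. The kernel size 4, stride 2 and padding 1 are hypothetical values chosen so that the map is doubled; the description does not state the actual hyperparameters. The 60 × 80 input is the Fire9 feature map size given later in the description.

```python
def deconv_output_size(in_size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical hyperparameters (kernel=4, stride=2, padding=1) that double the map.
fire9_h, fire9_w = 60, 80
up_h = deconv_output_size(fire9_h, kernel=4, stride=2, padding=1)
up_w = deconv_output_size(fire9_w, kernel=4, stride=2, padding=1)
print(up_h, up_w)  # 120 160
```

Because only the feature map is enlarged, the layers below it still run at the original resolution, which is why this costs far less than enlarging the input image.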

Fig. 3 is a schematic structural diagram of a Fire module in an embodiment of the present invention. As shown in fig. 3, the Fire module consists of two convolutional layers with 1 × 1 kernels and one convolutional layer with 3 × 3 kernels. The aim is to replace 3 × 3 kernels with 1 × 1 kernels, each of which has one ninth of the parameters. So as not to impair the representational capability of the network, however, the 3 × 3 kernels are not all replaced: some 1 × 1 kernels and some 3 × 3 kernels are used. A further benefit is that the number of input channels to the 3 × 3 kernels is reduced, which also reduces the parameter count. Specifically, the Fire module first uses 1 × 1 kernels to reduce the dimensionality of its input, then uses 1 × 1 kernels and 3 × 3 kernels to extract features, and finally concatenates the two sets of features. In this way the computation and the number of model parameters are greatly reduced.
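The parameter saving of a Fire module can be checked with simple arithmetic. The sketch below counts convolution weights (biases ignored) for a plain 3 × 3 layer and for a Fire module with the same input and output width. All channel counts are hypothetical and chosen only for illustration; the description does not give the actual widths.

```python
def conv_params(in_ch, out_ch, kernel):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return in_ch * out_ch * kernel * kernel


def fire_params(in_ch, squeeze_ch, expand1_ch, expand3_ch):
    """Weights in a Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
    squeeze = conv_params(in_ch, squeeze_ch, 1)
    expand1 = conv_params(squeeze_ch, expand1_ch, 1)
    expand3 = conv_params(squeeze_ch, expand3_ch, 3)
    return squeeze + expand1 + expand3


# Hypothetical widths: 256 channels in, 256 out (128 + 128 after concatenation).
plain = conv_params(256, 256, 3)
fire = fire_params(256, squeeze_ch=32, expand1_ch=128, expand3_ch=128)
print(plain, fire, plain / fire)  # 589824 49152 12.0
```

With these assumed widths the Fire module needs one twelfth of the weights of the plain 3 × 3 layer, mostly because the 3 × 3 kernels see only the 32 squeezed input channels.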

Fig. 4 is a schematic diagram of a target candidate network according to an embodiment of the present invention. In the embodiment of the invention, on the basis of the Squeeze VGG-16 convolutional neural network and according to the characteristics of the convolutional layers, the target candidate network generates network branches at four layers in total, namely Fire9, Fire12, conv6 and an added pooling layer, and these branches regress candidate boxes for objects of different scales. The Fire9 layer, however, is closer to the lower layers of the backbone network, so it affects the gradients more than the other layers do and makes the learning process unstable. A buffer layer is therefore added, shown as the det-conv layer in fig. 4, which prevents the gradient of the detection branch from being back-propagated directly into the backbone layers.

The invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger its receptive field and the larger the target objects it is suited to detect) and uses different intermediate layers to detect target objects within specific scale ranges, thereby better matching the receptive field to the object size and effectively improving detection results.
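The rule that deeper layers have larger receptive fields can be verified with the usual receptive-field recurrence. The sketch below is a minimal illustration on a toy stack of layers; the layer list is hypothetical and is not the actual Squeeze VGG-16 configuration.

```python
def receptive_fields(layers):
    """Receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel, stride) pairs. `jump` is the distance in
    input pixels between two neighbouring units of the current feature map.
    """
    rf, jump = 1, 1
    fields = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        fields.append(rf)
    return fields


# Toy stack (hypothetical): conv3x3, pool2x2, conv3x3, pool2x2, conv3x3.
toy_layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_fields(toy_layers))  # [3, 4, 8, 10, 18]
```

The receptive field never shrinks with depth and grows faster after each down-sampling layer, which is the reason shallow branches suit small pedestrians and deep branches suit large ones.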

Fig. 5 is a schematic diagram of the architecture of a target detection network according to an embodiment of the present invention. The target detection network shares parameters with the target candidate network and aggregates the candidate boxes of the target candidate network, so as to strengthen the detection network's ability to distinguish objects from background. In the specific embodiment of the present invention, on the basis of the target candidate region, the target detection network takes an image region 1.5 times the size of the target candidate region as background semantic information of the target, up-samples the feature map of the Fire9 layer once as information for enhancing the perception of small objects, applies region-of-interest pooling (ROI pooling) to the background semantic information and the up-sampled information to obtain features of a fixed size, and then adds a fully connected layer to regress the category and the final candidate box.

Specifically, the trunk CNN layers are connected to a proposals node, which aggregates the candidate box information obtained by the target candidate network. For the feature map of the Fire9 layer, W and H are the width and height of the input picture, cube 1 represents the mapping of the object region onto the feature map, and cube 2 represents the mapping of the context region onto the feature map; the context region is about 1.5 times the object region. To enhance the detection of small objects, the Fire9 layer is up-sampled once, and region-of-interest pooling, similar to that of the Fast R-CNN algorithm, is then used to obtain features of a fixed size. The processed Fire9 features are concatenated (concat) with the features aggregated at the proposals node, and a fully connected layer is then added to regress the category and the final candidate box.
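The context region used as background semantic information can be sketched as a box enlargement. The sketch assumes that "1.5 times" means each side is scaled by 1.5 about the box centre and that the result is clipped to the image; both points are interpretations, since the description does not say whether the factor applies to side length or area. The function name `context_box` is hypothetical.

```python
def context_box(box, scale=1.5, image_w=None, image_h=None):
    """Enlarge an (x, y, w, h) box about its centre and optionally clip it.

    Assumption: the factor scales width and height, not area.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    x0, y0 = cx - new_w / 2.0, cy - new_h / 2.0
    x1, y1 = cx + new_w / 2.0, cy + new_h / 2.0
    if image_w is not None:
        x0, x1 = max(0.0, x0), min(float(image_w), x1)
    if image_h is not None:
        y0, y1 = max(0.0, y0), min(float(image_h), y1)
    return (x0, y0, x1 - x0, y1 - y0)


# A 40x80 candidate box well inside a 640x480 image.
print(context_box((100, 100, 40, 80), image_w=640, image_h=480))
# (90.0, 80.0, 60.0, 120.0)

# A box at the image corner is clipped instead of leaving the image.
print(context_box((0, 0, 40, 80), image_w=640, image_h=480))
# (0.0, 0.0, 50.0, 100.0)
```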

Step S101, inputting training samples.

The training process requires the annotated bounding boxes of the persons in each image. To speed up training, regions containing the annotated persons are cropped from the original image to form patches (image blocks); because the patches are smaller than the original image, training on them is considerably faster. Specifically, in the present invention the input training samples include RGB image data and annotation information for the pedestrian regions in the images, and the image data actually used for training are small patches (image blocks) cropped according to the regions where pedestrians are located. Expressed mathematically, the training samples are

$$S = \{(X_i, Y_i)\}_{i=1}^{N}$$

where $N$ is the number of samples and $X_i$ denotes a patch of a training picture. In practical applications there are categories other than pedestrian, such as background, cyclist and seated person, so the annotation $Y_i = (y_i, b_i)$ consists of a category label $y_i \in \{0, 1, 2, \ldots, K\}$ and bounding box coordinates $b_i = (b_i^x, b_i^y, b_i^w, b_i^h)$, where $(b_i^x, b_i^y)$ is the starting coordinate point at the upper left corner of the box and $b_i^w$ and $b_i^h$ are the width and height of the box.
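A training sample of this form, a patch with a category label and an (x, y, w, h) box, can be sketched as a small data structure together with a cropping helper. The names `Sample` and `crop_patch`, the single-channel toy image and the centre-then-clamp cropping policy are all assumptions for illustration; the description only states that patches are cropped around pedestrian regions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    patch: List[List[int]]           # X_i (one channel here for brevity)
    label: int                       # y_i in {0, ..., K}
    box: Tuple[int, int, int, int]   # b_i = (x, y, w, h) in patch coordinates


def crop_patch(image, label, box, patch_w, patch_h):
    """Crop a fixed-size patch around `box` and re-express the box in the patch.

    Assumes the patch is no larger than the image and the box fits in the patch.
    """
    img_h, img_w = len(image), len(image[0])
    x, y, w, h = box
    # Centre the crop window on the box, then clamp it inside the image.
    left = min(max(x + w // 2 - patch_w // 2, 0), img_w - patch_w)
    top = min(max(y + h // 2 - patch_h // 2, 0), img_h - patch_h)
    patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
    return Sample(patch, label, (x - left, y - top, w, h))


# Toy 6x8 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(8)] for r in range(6)]
sample = crop_patch(image, label=1, box=(5, 1, 2, 3), patch_w=4, patch_h=4)
print(sample.box)       # (1, 1, 2, 3)
print(sample.patch[0])  # [4, 5, 6, 7]
```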

Step S102, initializing the convolutional neural network and its parameters, including the weights and biases of the connections in each layer of the network. Specifically, the present invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset and uses the result as the network initialization.

Step S103, learning the parameters of the constructed network from the training samples using a forward propagation algorithm and a backward propagation algorithm, thereby obtaining the model for the test process.

In the invention, the forward propagation algorithm first normalizes the input image to a size of 3 × 480 × 640 and crops patches of size 3 × 448 × 448, which, together with the corresponding annotation information, serve as the input of the convolutional neural network. After passing through the convolutional layers, down-sampling layers and rectified linear unit layers (ReLU nonlinear layers), the feature map has a size of 512 × 60 × 80 at the Fire9 layer and 512 × 30 × 40 at the Fire12 layer, and the feature maps of the two subsequent branches have sizes of 512 × 15 × 20 and 512 × 8 × 10, in that order. On each feature map, the four coordinates and the category information of the target bounding box are obtained by convolution. Taking the Fire9 layer as an example, and assuming that only pedestrians and background are detected, the output has a size of 6 × 60 × 80, where the 6 channels comprise the two categories (background and pedestrian) and the four coordinates of the candidate box. In the target detection network, the candidate boxes obtained by each branch are gathered at the proposals node and combined with the features obtained by region-of-interest pooling of the Fire9 background semantic information and up-sampled information, in order to perform the final bounding box regression and category prediction.
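The feature map sizes quoted above are consistent with a 480 × 640 input and total strides of 8, 16, 32 and 64 at the four branches. The sketch below reproduces them; the strides and the round-up rule are inferred from the quoted sizes and are assumptions, as is the mapping of the last two sizes to conv6 and the added pooling layer.

```python
import math


def feature_map_size(height, width, stride):
    """Spatial size after down-sampling by `stride`, rounding up (assumed)."""
    return math.ceil(height / stride), math.ceil(width / stride)


def head_channels(num_classes):
    """Output channels of a branch: one score per class plus 4 box coordinates."""
    return num_classes + 4


# Strides inferred from the sizes quoted in the text (an assumption).
BRANCH_STRIDES = {"Fire9": 8, "Fire12": 16, "conv6": 32, "pool": 64}

sizes = {name: feature_map_size(480, 640, s) for name, s in BRANCH_STRIDES.items()}
print(sizes)
# {'Fire9': (60, 80), 'Fire12': (30, 40), 'conv6': (15, 20), 'pool': (8, 10)}
print(head_channels(2))  # 6 (background and pedestrian, plus 4 coordinates)
```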

In the present invention, the backward propagation algorithm first computes the loss function $L(W)$ between the target bounding boxes predicted by forward propagation and the actual target bounding boxes of the image.

Assuming that the intermediate layers have $M$ branches that output target candidate regions (all target objects in the image can be approximately covered by receptive fields of $M$ scales), let $l^m$ denote the loss function of branch $m$, $\alpha_m$ the weight of $l^m$, and $S = \{S^1, S^2, \ldots, S^M\}$ the target objects of the corresponding scales. The loss function $L(W)$ can then be defined as:

$$L(W) = \sum_{m=1}^{M} \sum_{i \in S^m} \alpha_m \, l^m(X_i, Y_i \mid W) \qquad (1)$$

For a specific detection layer $m$, a sample contributes to the loss only if the scale of its target lies within the range that layer $m$ can detect, so the loss function of a sample is defined as

$$l(X, Y \mid W) = L_{cls}(p(X), y) + \lambda \, [y \geq 1] \, L_{loc}(b, \hat{b}) \qquad (2)$$

where $p(X) = (p_0(X), \ldots, p_K(X))$ is the probability distribution over the target classes, $\lambda$ is the balance coefficient, $b$ denotes the 4 coordinates of the bounding box, and $\hat{b}$ denotes the coordinates obtained by forward propagation. In the loss function, a cross-entropy loss is used for the classification term, i.e.

$$L_{cls}(p(X), y) = -\log p_y(X) \qquad (3)$$

Regression of the target bounding box uses the smoothed Manhattan distance criterion (smooth L1 criterion), applied to the difference between each coordinate of $b$ and the corresponding coordinate of $\hat{b}$ and defined as follows:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
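The loss described here can be sketched in plain Python. The sketch assumes that the per-sample loss is the cross-entropy term plus a balance coefficient times a smooth L1 box term, that the box term is skipped for background samples (label 0), and that the four coordinate terms are simply summed; these combination details are assumptions, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Classification loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def loc_loss(box, pred_box):
    """Box loss: smooth L1 over the four coordinates (plain sum, assumed)."""
    return sum(smooth_l1(b - p) for b, p in zip(box, pred_box))


def sample_loss(probs, label, box, pred_box, lam=1.0):
    """Per-sample loss; the box term is skipped for background (assumption)."""
    loss = cross_entropy(probs, label)
    if label >= 1:
        loss += lam * loc_loss(box, pred_box)
    return loss


def total_loss(branches, alphas):
    """Weighted sum over branches; branches[m] holds the samples whose
    scale is assigned to branch m."""
    return sum(
        alpha * sum(sample_loss(*s) for s in samples)
        for alpha, samples in zip(alphas, branches)
    )


# Two hypothetical samples: a pedestrian on branch 0, background on branch 1.
pedestrian = ([0.2, 0.8], 1, (10, 20, 30, 60), (10.5, 20, 30, 62))
background = ([0.9, 0.1], 0, (0, 0, 0, 0), (3, 3, 3, 3))
print(total_loss([[pedestrian], [background]], alphas=[1.0, 0.5]))
```

Because each sample appears only in the branch matching its scale, a branch is never penalized for objects outside the range its receptive field can cover.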

Step S2, through the trained model, using different intermediate layers to detect target objects in different scale ranges according to how the receptive field of the neural network changes with depth, and predicting the bounding boxes of the target objects (such as pedestrians) in the image.

Specifically, step S2 further includes:

step S200, loading the trained model;

step S201, inputting a test sample;

and step S202, using the trained model to detect pedestrians in different scale ranges with different intermediate layers, according to how the receptive field of the neural network changes with depth, and predicting the bounding boxes of the pedestrians in the image. Fig. 6 is a schematic diagram of the rapid pedestrian detection process in an embodiment of the present invention. The target candidate network in the model, built on the Squeeze VGG-16 convolutional neural network, generates network branches at four layers in total (Fire9, Fire12, conv6 and an added pooling layer) according to the characteristics of the convolutional layers, and detects target candidate regions of objects at different scales (intermediate layer a, intermediate layer b, intermediate layer c). Then, on the basis of the target candidate regions, the target detection network takes an image region 1.5 times the size of the target candidate region as background semantic information of the target, up-samples the feature map of the Fire9 layer once as information for enhancing the perception of small objects, applies region-of-interest pooling to the background semantic information and the up-sampled information to obtain features of a fixed size, and then adds a fully connected layer to regress the category and the final candidate box. Preferably, in step S202, the feature map of the specific network layer is further enlarged by deconvolution.
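The region-of-interest pooling step, which turns candidate regions of any size into features of a fixed size, can be sketched as follows. This is a simplified single-channel max-pooling version written for illustration; the function name `roi_max_pool`, the bin boundaries and the toy feature map are assumptions rather than details taken from the description.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) of a 2-D feature map into
    an out_h x out_w grid. Coordinates are feature-map cells, end-exclusive."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        r_start = y0 + math.floor(i * roi_h / out_h)
        r_end = y0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + math.floor(j * roi_w / out_w)
            c_end = x0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map with values 1..24 in row-major order.
fmap = [[r * 6 + c + 1 for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, (1, 0, 5, 4), 2, 2))  # [[9, 11], [21, 23]]
print(roi_max_pool(fmap, (0, 0, 6, 3), 2, 2))  # [[9, 12], [15, 18]]
```

Both regions, although of different shapes, produce a 2 × 2 output, which is what allows a fully connected layer to follow.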

The pedestrian detection method of the invention is evaluated with two kinds of index: mean average precision (mAP) and frames per second (FPS). mAP is based on the intersection-over-union (IoU) between the final detection region and the ground-truth person region, and is the average of the precision obtained at different IoU values; FPS is mainly an efficiency index and refers to the number of pictures that can be processed per second.
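The two quantities underlying these indices, the intersection-over-union of two boxes and the frame rate, can be computed as in the sketch below. Boxes use the (x, y, w, h) convention from the training-sample definition; the function names are hypothetical, and a full mAP computation (ranking detections and averaging precision) is not shown.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def frames_per_second(num_frames, elapsed_seconds):
    """Average number of frames processed per second."""
    return num_frames / elapsed_seconds


# Two 10x10 boxes overlapping in a 5x5 square: IoU = 25 / 175.
print(iou((0, 0, 10, 10), (5, 5, 10, 10)))
print(frames_per_second(120, 4.0))  # 30.0
```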

Fig. 7 is a system architecture diagram of a rapid pedestrian detection device according to the present invention. As shown in fig. 7, the present invention provides a rapid pedestrian detection apparatus, including:

a training unit 70 for constructing a configurable deep model based on a convolutional neural network and learning the parameters of the constructed network from training samples to obtain a model for the test process. In an embodiment of the present invention, the deep model constructed by the training unit 70 is composed of two sub-networks. The first sub-network is a multi-scale target candidate network, used for extracting pedestrian features and producing candidate regions; specifically, based on the differences between the features produced by different layers of the convolutional neural network, it generates candidate boxes for pedestrians of different scales at intermediate layers. The second sub-network is a target detection network, which improves the detection results; it shares parameters with the target candidate network and performs refined classification and detection on the basis of the candidate boxes. Specifically, as shown in fig. 8, the training unit 70 further includes:

a model building unit 701 for constructing a configurable deep model based on a convolutional neural network.

The convolutional neural network is formed by stacking convolutional layers, down-sampling layers and up-sampling layers. A convolutional layer performs a convolution operation on an input image or feature map in two-dimensional space and extracts hierarchical features; a down-sampling layer uses a non-overlapping max-pooling operation, which extracts features that are invariant to shape and offset, reduces the size of the feature map and improves computational efficiency; an up-sampling layer performs a deconvolution operation on the input feature map in two-dimensional space to increase the number of pixels of the feature map. In a specific embodiment of the invention, a Squeeze VGG-16 convolutional neural network is adopted as the backbone network.

In the embodiment of the invention, on the basis of the Squeeze VGG-16 convolutional neural network and according to the characteristics of the convolutional layers, the target candidate network generates network branches at four layers in total, namely Fire9, Fire12, conv6 and an added pooling layer, and these branches regress candidate boxes for objects of different scales. The Fire9 layer, however, is closer to the lower layers of the backbone network, so it affects the gradients more than the other layers do and makes the learning process unstable. A buffer layer is therefore added, which prevents the gradient of the detection branch from being back-propagated directly into the backbone layers.

The target detection network shares parameters with the target candidate network and aggregates the candidate boxes of the target candidate network, so as to strengthen the detection network's ability to distinguish objects from background. In the specific embodiment of the present invention, on the basis of the target candidate region, the target detection network takes an image region 1.5 times the size of the target candidate region as background semantic information of the target, up-samples the feature map of the Fire9 layer once as information for enhancing the perception of small objects, applies region-of-interest pooling to the background semantic information and the up-sampled information to obtain features of a fixed size, and then adds a fully connected layer to regress the category and the final candidate box. Specifically, the trunk CNN layers are connected to a proposals sub-network; W and H are the width and height of the input picture, cube 1 represents the mapping of the object region onto the feature map, and cube 2 represents the mapping of the context region onto the feature map, the context region being about 1.5 times the object region. To enhance the detection of small objects, the Fire9 layer is up-sampled once, region-of-interest pooling similar to that of the Fast R-CNN algorithm is then used to obtain features of a fixed size, and a fully connected layer is then added to regress the category and the final candidate box.

A training sample input unit 702, configured to input training samples.

Specifically, in each training sample, $X_i$ denotes one patch of a training picture, and its annotation $Y_i = (y_i, b_i)$ consists of the class label $y_i$ and the bounding-box coordinates $b_i$.

The initialization unit 703 is configured to initialize the convolutional neural network and its parameters, including the weights and biases of the connections in each network layer. Specifically, the invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset and uses the result as the network initialization.

The sample training unit 704 is configured to use the training samples, with a forward-propagation algorithm and a back-propagation algorithm, to learn the parameters of the constructed network, thereby obtaining the model used in the test process.

The back-propagation algorithm first requires computing the loss between the target bounding box predicted by forward propagation and the ground-truth target bounding box of the image.

In this loss, $p(X) = (p_0(X), \ldots, p_K(X))$ is the probability distribution over the target classes. The classification term is defined by the cross-entropy loss, i.e.

$L_{cls}(p(X), y) = -\log p_y(X)$

Regression of the target bounding box is performed with the smooth L1 criterion.
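The formula for the smooth L1 criterion is not reproduced in this text. For reference, the form commonly used in the detection literature (for example in Fast R-CNN) is quadratic for small residuals and linear for large ones: 0.5·x² when |x| < 1 and |x| − 0.5 otherwise. The Python sketch below shows that standard form next to the cross-entropy term. It is an assumption that the patent uses exactly this variant, and the function names and the four-number box encoding are hypothetical.

```python
import math


def cross_entropy(probs, y):
    """Classification loss: negative log of the probability of the true class y."""
    return -math.log(probs[y])


def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_regression_loss(pred, target):
    """Sum of smooth L1 over the box coordinates (assumed 4 numbers per box)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


# Example: two classes (background, pedestrian) and one predicted box.
print(cross_entropy([0.25, 0.75], 1))
print(box_regression_loss([0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

Compared with a plain squared error, the linear tail keeps a single badly localized box from dominating the gradient.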

The detection unit 71 is configured to take a test sample as input and, using the trained model, to exploit the way the receptive field of the neural network changes with depth so that different intermediate layers detect target objects (e.g., pedestrians) in different scale ranges, and to predict the bounding boxes of the target objects (e.g., pedestrians) in the image.

Specifically, as shown in fig. 9, the detection unit 71 further includes:

a model loading unit 710 for loading the trained model;

a test sample input unit 711 for inputting a test sample;

and an image prediction unit 712, configured to use the trained model and the way the receptive field of the neural network changes with depth to detect pedestrians in different scale ranges with different intermediate layers, and to predict the bounding boxes of the pedestrians in the image. Specifically, the image prediction unit 712 first uses the target candidate network in the model, which builds network branches on top of the Squeeze VGG-16 convolutional neural network at four layers in total (Fire9, Fire12, conv6 and an added pooling layer, chosen according to the characteristics of the convolutional layers), to detect target candidate regions of objects at different scales. It then uses the target detection network which, starting from a target candidate region, takes a picture region 1.5 times the size of the candidate region as background semantic information for the target, up-samples the feature map of the Fire9 layer once to enhance the perception of small objects, applies region-of-interest pooling to the background semantic information and the up-sampled features to obtain features of a fixed size, and then adds a fully connected layer to regress the class and the final candidate box.
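The step of pooling a region of interest to a fixed size can be sketched in a few lines of Python. This is a simplified single-channel illustration of region-of-interest max pooling in general, not the patented network: the bin-splitting rule, the list-of-rows feature map, and the function name are all assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region `roi` of a 2-D feature map into an out_h x out_w grid.

    `feature_map` is a list of rows; `roi` is (x1, y1, x2, y2) in feature-map
    cells, with x1/y1 inclusive and x2/y2 exclusive. The region must contain
    at least one cell and lie inside the feature map.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceiling for the end,
        # and at least one row per bin.
        r0 = y1 + (i * roi_h) // out_h
        r1 = max(r0 + 1, y1 + -(-((i + 1) * roi_h) // out_h))
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = max(c0 + 1, x1 + -(-((j + 1) * roi_w) // out_w))
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
# Regions of different sizes are both reduced to the same 2 x 2 output.
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(fmap, (1, 1, 4, 3), 2, 2))
```

Because every region is reduced to the same output grid, candidate boxes of any size can feed the same fully connected layer.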

In conclusion, the rapid pedestrian detection method and device of the invention draw on network-compression techniques to adjust and train the VGG-16 network, obtaining a Squeeze VGG-16 network that meets the requirements of an embedded system, which effectively reduces the number of model parameters and speeds up computation. In addition, to address the mismatch between receptive field and object size in traditional detection methods, the invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger the receptive field, and the larger the objects it is suited to detect) and uses different intermediate layers to detect objects within specific scale ranges, which better matches the receptive field to the object size and effectively improves the detection results. Furthermore, to enhance the detection of small objects, the feature map of a specific network layer is enlarged by deconvolution, which, compared with the traditional approach of enlarging the input image, adds almost no GPU memory or computation. Finally, to enhance the detection of blurred objects, a region 1.5 times the size of the target object on the feature map of that layer is added to the network as a background semantic feature, giving excellent performance on blurred objects and on small, distant objects.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims

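1. A rapid pedestrian detection method, comprising the following steps:

step S1, constructing a configurable deep model based on a convolutional neural network, learning the parameters of the constructed network from training samples, and obtaining a model for the test process;

step S2, inputting a test sample, using the trained model and the way the receptive field of the neural network changes with depth to detect target objects in different scale ranges with different intermediate layers, and predicting the bounding boxes of the target objects in the image;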

step S1 further includes:

inputting a training sample;

using the training samples, with a forward-propagation algorithm and a back-propagation algorithm, to learn the parameters of the constructed network, thereby obtaining the model for the test process;

the deep model comprises a multi-scale target candidate network and a target detection network, wherein the target candidate network generates candidate bounding boxes for target objects of different scales at the intermediate layers, based on the differences between the features extracted by different layers of the convolutional neural network; the target detection network performs refined classification and detection on the candidate bounding boxes output by the target candidate network;

the convolutional neural network is formed by stacking convolutional layers, down-sampling layers and up-sampling layers, wherein a convolutional layer performs a convolution operation on an input image or feature map in two-dimensional space and extracts hierarchical features; a down-sampling layer uses a non-overlapping max-pooling operation, which extracts features that are invariant to shape and offset, reduces the size of the feature map and improves computational efficiency; and an up-sampling layer performs a deconvolution operation on the input feature map in two-dimensional space so as to increase the resolution of the feature map.

2. The rapid pedestrian detection method of claim 1, wherein: the deep model uses a Squeeze VGG-16 convolutional neural network as the backbone network, and the Squeeze VGG-16 convolutional neural network uses a network structure consisting of a conv1-1 layer immediately followed by 12 Fire module layers for feature extraction.

3. A rapid pedestrian detection method according to claim 2, characterized in that: the target candidate network builds network branches on top of the Squeeze VGG-16 convolutional neural network, at the Fire9, Fire12 and conv6 layers and an added pooling layer, according to the characteristics of the convolutional layers, so as to regress candidate boxes for detected objects at different scales.

4. A rapid pedestrian detection method according to claim 2, characterized in that: the target detection network, starting from a target candidate region, takes a picture region whose size is a preset multiple of the target candidate region as background semantic information for the target, up-samples the feature map of the Fire9 layer once to enhance the perception of small objects, applies region-of-interest pooling to the background semantic information and the up-sampled features to obtain features of a fixed size, and then adds a fully connected layer to regress the class and the final candidate box.