Target detection method and system for multi-scale feature parallel processing

Technical Field
The invention relates to the technical field of image processing, in particular to a target detection method and system for multi-scale feature parallel processing.
Background
Real-time target detection is an important task in the field of computer vision, aimed at detecting and locating targets in video streams or live image feeds. Unlike conventional target detection, real-time target detection must run in real time while maintaining high accuracy, and generally requires processing multiple image frames per second.
A typical real-time target detection procedure is as follows: obtain an image from a camera, video stream or image sequence; pre-process the image, possibly including resizing, normalization and other image enhancement operations; locate and identify the targets in the image with a target detection algorithm; post-process the detection results, e.g., applying non-maximum suppression (NMS) to remove redundant boxes and filtering out low-confidence detections; and finally visualize the results by marking the detected targets on the image.
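By way of illustration only, the following minimal PyTorch sketch shows such a pre-processing step; the 640-pixel input size and the [0, 1] normalization are assumed values, not requirements of this disclosure.

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size: int = 640) -> torch.Tensor:
    """Resize a CHW image to the network input size and scale pixels to [0, 1]."""
    x = image.float().unsqueeze(0) / 255.0          # add batch dimension, normalize
    return F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
```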
In recent years, deep learning-based methods have taken a dominant role in the field of real-time target detection, producing a series of excellent models such as the R-CNN series and the YOLO series. However, these target detection algorithms are deep networks with many layers: each layer needs a dedicated partitioning method to improve the utilization of the hardware operation units, and the computation of each layer must wait for the computation of the previous layer to finish. Network depth therefore limits real-time inference, the inference speed is highly dependent on hardware performance and on the partitioning method, and the inference delay is high in fields such as automatic driving and military reconnaissance. How to effectively reduce the delay of real-time target detection algorithms is therefore a direction worth studying.
Disclosure of Invention
Based on the technical problems in the background art, the invention provides a target detection method and a target detection system for multi-scale feature parallel processing, which improve the utilization of the operation units on the inference platform through the high parallelism of the algorithm, reduce inter-layer synchronization time through fewer network layers, and effectively reduce the delay of the real-time target detection algorithm.
According to the target detection method for multi-scale feature parallel processing, a target image is input into a trained target detection model, and a detection result of the target image is output, wherein the detection result comprises the category and the position of the target;
The target detection model comprises at least two subnets, wherein the subnets communicate with each other, and each subnet comprises a sampling module, a summarizing and fusing module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each sub-network carries out downsampling processing on RGB features of an input image based on a sampling module to obtain a scale feature map;
S3, through mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarizing and fusing module and fuses all the scale feature maps after feature alignment, so that each subnet obtains its corresponding fused feature map;
S4, each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is then gathered to output a prediction result, which comprises the category and the position of the target.
Further, in the training process of the target detection model, an independent auxiliary training head is introduced in front of the prediction head of each sub-network, and the auxiliary training head supervises the intermediate-layer feature representation of the target detection model based on convolution layers formed by convolution kernels of different sizes.
Further, the auxiliary training head supervises the intermediate-layer feature representation of the target detection model based on convolution layers formed by convolution kernels of different sizes:
For large-scale features, a convolution layer stacked from 1×1 convolution kernels is adopted to relate contextual features across different channels;
For small and medium-scale features, a convolution layer stacked from 3×3 convolution kernels is employed to capture the spatial position features of small targets, thereby supervising the intermediate-layer feature representation of the target detection model.
Further, the loss function $L$ of the target detection model is as follows:

$$L = \alpha L_{conf} + \beta L_{cls} + \gamma L_{box} + \delta L_{aux}$$

wherein $L_{conf}$ denotes the confidence loss, $L_{cls}$ denotes the classification loss, $L_{box}$ denotes the target bounding box loss, $L_{aux}$ denotes the auxiliary training head loss, and $\alpha$, $\beta$, $\gamma$ and $\delta$ denote the weights of the confidence loss, classification loss, target bounding box loss and auxiliary training head loss, respectively.
Further, in step S2, each sub-network obtains a scale feature map of a different scale by controlling the number of downsampling operations in its sampling module, and a different number of convolution operations is designed into each sub-network so that the computation times of all sub-networks remain close.
Further, in step S3, the summarizing and fusing of all scale feature maps proceeds specifically as follows:
each sub-network obtains the scale feature map of all other sub-networks based on the summarizing and fusing module;
taking the scale of the current subnet's scale feature map as the target, the scale feature maps of all other subnets are downsampled/upsampled to obtain scale feature maps with aligned features;
each sub-network then applies a point convolution operation to fuse its own scale feature map with the aligned scale feature maps, obtaining a fused feature map.
The target detection system for multi-scale feature parallel processing inputs a target image into a trained target detection model, and outputs a detection result of the target image, wherein the detection result comprises the category and the position of the target;
The target detection model comprises at least two subnets, wherein the subnets communicate with each other, and each subnet comprises a sampling module, a summarizing and fusing module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each sub-network carries out downsampling processing on RGB features of an input image based on a sampling module to obtain a scale feature map;
S3, through mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarizing and fusing module and fuses all the scale feature maps after feature alignment, so that each subnet obtains its corresponding fused feature map;
S4, each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is then gathered to output a prediction result, which comprises the category and the position of the target.
A computer-readable storage medium having stored thereon a program that is invoked by a processor to perform the target detection method described above.
The target detection method and system for multi-scale feature parallel processing have the following advantages: the target detection model has a parallel computing structure in which different hardware computing nodes process the input image simultaneously, and each subnet has a small computation load and few layers; parallel inference of the algorithm reduces the communication time between subnets, accelerates network inference and greatly reduces inference delay. The target detection model therefore reduces its dependence on a partitioning algorithm, makes maximum use of the performance of the hardware platform, and effectively reduces the delay of the real-time target detection algorithm. The model can also simultaneously output the category, position and distance information of targets, which is of great significance for application scenarios such as automatic driving.
Drawings
FIG. 1 is a schematic diagram of a target detection model according to the present invention;
FIG. 2 is a structural flow chart of the summarizing and fusing of the scale feature maps of different scales in step S3;
FIG. 3 is a training flow chart of the object detection model.
Detailed Description
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.
As shown in fig. 1 to 3, in the target detection method for multi-scale feature parallel processing, a target image is input into a trained target detection model, and a detection result of the target image is output, wherein the detection result comprises a category and a position of the target.
Through the parallelized design of the model structure, the target detection model is divided into a plurality of subnets, which reduces network depth; different subnets of the model use different hardware computing units to process the input image simultaneously, and ring communication between the subnets reduces the communication time and accelerates network inference. The target detection model is described in detail below.
The target detection model comprises at least two subnets, wherein the subnets communicate with each other, and each subnet comprises a sampling module, a summarizing and fusing module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each sub-network carries out downsampling processing on RGB features of an input image based on a sampling module to obtain a scale feature map;
Each sub-network is composed of convolutional neural networks (CNNs) that extract features from the input image independently. The sub-networks obtain scale feature maps of different scales by controlling the number of downsampling operations in their sampling modules, and a different number of convolution operations is designed into each sub-network so that the computation times of all sub-networks remain close.
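By way of illustration, a minimal PyTorch sketch of such a sampling module follows; the module name, channel width and activation are assumptions for illustration, not part of this disclosure.

```python
import torch
import torch.nn as nn

class SamplingModule(nn.Module):
    """One subnet's sampling module: `num_down` stride-2 convolutions
    downsample the RGB input; `num_conv` stride-1 convolutions then refine
    the features, with `num_conv` tuned per subnet to balance latency."""

    def __init__(self, num_down: int, num_conv: int, width: int = 64):
        super().__init__()
        layers, in_ch = [], 3                      # RGB input
        for _ in range(num_down):                  # each step halves H and W
            layers += [nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.SiLU()]
            in_ch = width
        for _ in range(num_conv):                  # extra convs balance compute time
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.SiLU()]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)
```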
S3, through mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarizing and fusing module and fuses all the scale feature maps after feature alignment, so that each subnet obtains its corresponding fused feature map;
For convolutional neural networks, low-level features generally have high resolution and rich detail information, while high-level features generally have low resolution and rich semantic information. As shown in fig. 2, the features corresponding to the feature maps of other scales are downsampled or upsampled to the same scale as the current subnet's features, and then each subnet fuses all the features with a point convolution operation (Fusion layer) to realize information fusion and extraction.
The summarizing and fusing of the scale feature maps of different scales proceeds as follows:
each sub-network obtains the scale feature map of all other sub-networks based on the summarizing and fusing module;
taking the scale of the current subnet's scale feature map as the target, the scale feature maps of all other subnets are downsampled/upsampled to obtain scale feature maps with aligned features;
each sub-network then applies a point convolution operation to fuse its own scale feature map with the aligned scale feature maps, obtaining a fused feature map; a sketch of this alignment-and-fusion step follows below.
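A minimal sketch of the alignment-and-fusion step is given below, assuming all subnets produce feature maps with the same channel count; bilinear interpolation stands in here for the up/downsampling, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    """Align peer feature maps to this subnet's resolution, concatenate,
    and fuse with a point (1x1) convolution."""

    def __init__(self, channels: int, num_subnets: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * num_subnets, channels, kernel_size=1)

    def forward(self, own, peers):
        h, w = own.shape[-2:]
        aligned = [F.interpolate(p, size=(h, w), mode="bilinear",
                                 align_corners=False) for p in peers]
        return self.fuse(torch.cat([own, *aligned], dim=1))
```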
S4, each subnet transmits its fused feature map to a prediction head (prediction) and outputs prediction information; the prediction information of all subnets is summarized and a prediction result is output, wherein the prediction result comprises the category and the position of the target;
the prediction heads classify the targets in the image and regress the bounding boxes; after decoding, non-maximum suppression and detection box drawing, the target detection model finally outputs a prediction result containing the category and the position of each target.
The target detection model has a distributed parallel structure that uses different hardware computing units to process the input image simultaneously. The subnets have a small computation load and few layers; through parallel inference of the algorithm, ring communication reduces the communication time between subnets, accelerating network inference and greatly reducing inference delay. The model thus reduces the dependence on hardware performance, effectively reduces the delay of the real-time target detection algorithm, and can simultaneously output the category, position and distance information of targets, which is of great significance for application scenarios such as automatic driving.
The training process of the target detection model is illustrated below with a model comprising three subnets (subnet one, subnet two and subnet three):
(1) Constructing a training set and transmitting the training set to each subnet;
(2) Each sub-network carries out downsampling processing on RGB features of an input image based on a sampling module to obtain a scale feature map;
The input image is downsampled 3 times in subnet one, 4 times in subnet two and 5 times in subnet three, in each case followed by several convolution operations to extract features. The numbers of these convolution operations differ between the subnets, ensuring that the processing times of the three subnets of the target detection model for a given input image remain similar; an illustrative configuration follows below.
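Continuing the SamplingModule sketch above, the following configuration reproduces the 3/4/5 downsampling pattern of this embodiment; the per-subnet convolution counts (8, 6 and 4 here) are assumed values chosen only to suggest the latency balancing.

```python
import torch

subnet1 = SamplingModule(num_down=3, num_conv=8)   # 1/8 scale, most extra convs
subnet2 = SamplingModule(num_down=4, num_conv=6)   # 1/16 scale
subnet3 = SamplingModule(num_down=5, num_conv=4)   # 1/32 scale, fewest extra convs

x = torch.randn(1, 3, 640, 640)                    # dummy RGB input
print([m(x).shape[-1] for m in (subnet1, subnet2, subnet3)])   # [80, 40, 20]
```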
(3) Based on mutual communication among the subnets, each subnet acquires the scale feature images of all other subnets based on a summarizing and fusing module, and fuses all the scale feature images after feature alignment, and each subnet acquires the fused feature images after corresponding fusion;
As shown in fig. 2, the exchange of the scale feature maps of different scales between the subnets adopts ring communication, for example, subnet one to subnet two, subnet two to subnet three, and subnet three to subnet one. The features of different scales are aligned to the current subnet's scale by upsampling or downsampling to obtain aligned scale feature maps, and then each subnet fuses all the features with a point convolution operation (Fusion layer) to obtain a fused feature map, realizing the fusion and extraction of information.
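The following single-process sketch simulates this ring exchange; on a real inference platform each subnet would run on its own computing unit and each hand-off would be a device-to-device transfer.

```python
def ring_exchange(features):
    """features[i] is subnet i's scale feature map; returns gathered[i],
    the list of all subnets' maps as received at subnet i."""
    n = len(features)
    gathered = [[f] for f in features]              # each subnet keeps its own map
    buf = list(features)
    for _ in range(n - 1):
        buf = [buf[(i - 1) % n] for i in range(n)]  # subnet i receives from subnet i-1
        for i in range(n):
            gathered[i].append(buf[i])
    return gathered
```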
(4) Each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is gathered and a prediction result is output, which comprises the category and the position of the target;
The three groups of prediction heads respectively carry out category classification and bounding box regression on the targets in the image; the post-processing comprises decoding, non-maximum suppression and detection box drawing, and the target detection model finally outputs a prediction result containing the categories and positions of the targets.
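A hedged sketch of this post-processing follows; the (cx, cy, w, h) box encoding and the thresholds are assumptions, as the disclosure does not fix the decoding scheme.

```python
import torch
import torchvision.ops as ops

def decode_and_nms(preds, conf_thr=0.25, iou_thr=0.45):
    """preds: one (boxes_cxcywh[N,4], scores[N], labels[N]) tuple per head."""
    boxes, scores, labels = (torch.cat(t) for t in zip(*preds))
    xy, wh = boxes[:, :2], boxes[:, 2:]
    corners = torch.cat([xy - wh / 2, xy + wh / 2], dim=1)   # to (x1, y1, x2, y2)
    keep = scores > conf_thr                                 # drop low-confidence boxes
    corners, scores, labels = corners[keep], scores[keep], labels[keep]
    idx = ops.batched_nms(corners, scores, labels, iou_thr)  # class-aware NMS
    return corners[idx], scores[idx], labels[idx]
```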
In the training process of the target detection model, in order to accelerate and improve the convergence of the model, an independent auxiliary training head is introduced in front of the prediction head of each subnet during the training stage, applied to the features corresponding to the scale feature maps of different scales; convolution layers formed by convolution kernels of different sizes are used to supervise the intermediate-layer feature representation of the model. Specifically, for large-scale features, a convolution layer stacked from 1×1 convolution kernels is adopted to better relate contextual features across different channels, and for small-scale features, a convolution layer stacked from 3×3 convolution kernels is adopted to better capture the spatial position features of small targets.
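A minimal sketch of such an auxiliary head is shown below; the stack depth and activation are assumptions, and only the 1×1-versus-3×3 kernel choice follows from the description above.

```python
import torch
import torch.nn as nn

class AuxiliaryHead(nn.Module):
    """Training-only auxiliary head: 1x1 kernels for large-scale features
    (cross-channel context), 3x3 kernels for small-scale features (spatial
    position of small targets). Stack depth is an assumed value."""

    def __init__(self, channels: int, large_scale: bool, depth: int = 2):
        super().__init__()
        k = 1 if large_scale else 3
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(channels, channels, k, padding=k // 2), nn.SiLU()]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)      # discarded at inference time
```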
The prediction head part of each subnet is responsible for category classification, bounding box regression and distance regression. In the actual use stage, the predictions of the target detection model are directly output as the detection result after post-processing (the post-processing includes decoding, non-maximum suppression (NMS), detection box drawing and other steps). In the training stage, the predictions of the target detection model are compared with the Label (tag), the loss function is calculated, and the weights of the model are adjusted by back propagation. The loss function $L$ of the target detection model in this embodiment is expressed as follows:

$$L = \alpha L_{conf} + \beta L_{cls} + \gamma L_{box} + \delta L_{aux}$$

wherein $L_{conf}$ represents the confidence loss, reflecting whether the target detection model predicts a true target; $L_{cls}$ represents the classification loss, reflecting the difference between the predicted class and the true class; $L_{box}$ represents the target bounding box loss, reflecting the gap between the predicted bounding box and the real bounding box; $L_{aux}$ represents the auxiliary training head loss, reflecting the difference between the self-distillation teacher model and the student model; and $\alpha$, $\beta$, $\gamma$ and $\delta$ represent the weights of the confidence loss, classification loss, target bounding box loss and auxiliary training head loss, respectively.
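As a sketch, the weighted combination can be written directly from the formula above; the individual loss terms and the weight values here are placeholders, since the disclosure does not fix them.

```python
def total_loss(l_conf, l_cls, l_box, l_aux,
               alpha=1.0, beta=0.5, gamma=0.05, delta=0.5):
    # alpha..delta mirror the weights in the formula; the values are illustrative
    return alpha * l_conf + beta * l_cls + gamma * l_box + delta * l_aux
```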
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the technical scope disclosed by the present invention, shall be covered by the protection scope of the present invention.