CN117876661B - A target detection method and system using multi-scale feature parallel processing - Google Patents

A target detection method and system using multi-scale feature parallel processing

Info

Publication number
CN117876661B
Authority
CN
China
Prior art keywords
subnet
target
scale feature
scale
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311788491.6A
Other languages
Chinese (zh)
Other versions
CN117876661A (en)
Inventor
葛梦柯
李元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202311788491.6A
Publication of CN117876661A
Application granted
Publication of CN117876661B
Status: Active
Anticipated expiration

Abstract

The invention discloses a target detection method and system with multi-scale feature parallel processing: a target image is input into a trained target detection model, and the detection result of the target image is output. The training process of the target detection model is as follows. S1, a training set is constructed and delivered to each subnet. S2, each subnet downsamples the RGB features of the input image with its sampling module to obtain a scale feature map. S3, each subnet obtains the scale feature maps of all other subnets through its summarize-and-fuse module, performs feature alignment on all the scale feature maps, and obtains its corresponding fused feature map. S4, each subnet passes its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is aggregated and the prediction result is output. The method and system reduce the inter-layer synchronization overhead of the algorithm on hardware, improve the utilization of hardware computing units through the subnet architecture, and effectively reduce the latency of real-time target detection.

Description

Target detection method and system for multi-scale feature parallel processing
Technical Field
The invention relates to the technical field of image processing, in particular to a target detection method and system for multi-scale feature parallel processing.
Background
Real-time object detection is an important task in the field of computer vision, aimed at detecting and locating objects in video streams or live images. Unlike conventional target detection, real-time target detection must run in real time while maintaining high accuracy, and generally requires processing multiple frames per second.
A typical real-time object detection procedure is as follows: obtain an image from a camera, video stream or image sequence; pre-process the image, possibly including resizing, normalization and other image enhancement operations; locate and identify the objects in the image using an object detection algorithm; post-process the detection results, e.g., applying non-maximum suppression (NMS) to remove redundant boxes and filtering out low-confidence detections; and finally visualize the results by marking the detected objects on the image.
In recent years, deep-learning-based methods have come to dominate the field of real-time target detection, producing a series of excellent models such as the R-CNN series and the YOLO series. However, these target detection algorithms are deep networks with many layers: each layer needs a dedicated partitioning method to raise the utilization of the hardware computing units, and the computation of each layer must wait for the previous layer to finish. Network depth therefore limits real-time inference; inference speed depends heavily on hardware performance and the partitioning method, and inference latency is high in fields such as autonomous driving and military reconnaissance. How to effectively reduce the latency of real-time target detection algorithms is thus a direction worth studying.
Disclosure of Invention
To address the technical problems in the background art, the invention provides a target detection method and system with multi-scale feature parallel processing, which improve the utilization of the computing units on the inference platform through the high parallelism of the algorithm, reduce inter-layer synchronization time through fewer network layers, and effectively reduce the latency of the real-time target detection algorithm.
According to the target detection method for multi-scale feature parallel processing, a target image is input into a trained target detection model, and a detection result of the target image is output, wherein the detection result comprises the category and the position of the target;
The target detection model comprises at least two subnets that communicate with each other, and each subnet comprises a sampling module, a summarize-and-fuse module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each subnet downsamples the RGB features of the input image with its sampling module to obtain a scale feature map;
S3, through the mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarize-and-fuse module, fuses all the scale feature maps after feature alignment, and obtains its corresponding fused feature map;
S4, each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is aggregated and the prediction result is output, wherein the prediction result comprises the category and the position of the target.
Further, in the training process of the target detection model, an independent auxiliary training head is introduced before the prediction head of each subnet; the auxiliary training head supervises the intermediate-layer feature representation of the target detection model through convolution layers built from convolution kernels of different sizes.
Further, the auxiliary training head supervises the intermediate-layer feature representation of the target detection model through convolution layers built from convolution kernels of different sizes:
For large-scale features, a convolution layer stacking 1×1 convolution kernels is adopted to relate contextual features across different channels;
For small- and medium-scale features, a convolution layer stacking 3×3 convolution kernels is adopted to capture the spatial position features of small targets, thereby supervising the intermediate-layer feature representation of the target detection model (see the sketch below).
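By way of illustration, a minimal PyTorch sketch of such an auxiliary head follows. The two-convolution stack depth, channel width, normalization and activation choices are assumptions not specified in the patent; only the 1×1 versus 3×3 kernel rule comes from the text.

```python
import torch
import torch.nn as nn

class AuxiliaryHead(nn.Module):
    """Hypothetical auxiliary training head: stacks 1x1 kernels for
    large-scale features (cross-channel context) or 3x3 kernels for
    small/medium-scale features (spatial detail of small targets)."""
    def __init__(self, channels: int, large_scale: bool):
        super().__init__()
        k = 1 if large_scale else 3           # kernel size per the patent's rule
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is compared against a supervision target during training;
        # the patent introduces this head only in the training stage.
        return self.convs(x)
```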
Further, the loss function $L$ of the object detection model is as follows:

$$L = \alpha L_{\mathrm{conf}} + \beta L_{\mathrm{cls}} + \gamma L_{\mathrm{box}} + \delta L_{\mathrm{aux}}$$

where $L_{\mathrm{conf}}$ denotes the confidence loss, $L_{\mathrm{cls}}$ the classification loss, $L_{\mathrm{box}}$ the target bounding-box loss, and $L_{\mathrm{aux}}$ the auxiliary-training-head loss; α, β, γ and δ denote the weights of the confidence loss, classification loss, target bounding-box loss and auxiliary-training-head loss respectively.
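This weighted sum can be written in one line; the default weight values below are placeholders, since the patent does not fix α, β, γ, δ:

```python
def total_loss(l_conf, l_cls, l_box, l_aux,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    # Weighted sum of confidence, classification, bounding-box and
    # auxiliary-head losses; the default weights are placeholders.
    return alpha * l_conf + beta * l_cls + gamma * l_box + delta * l_aux
```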
Further, in step S2, each subnet obtains a scale feature map of a different scale by controlling the number of downsampling operations in its sampling module, and a different number of convolution operations is designed into each subnet so that the computation times of all subnets stay close.
Further, in step S3, the summary fusion of all scale feature maps is specifically:
each subnet obtains the scale feature maps of all other subnets through its summarize-and-fuse module;
taking the scale of the current subnet's scale feature map as the target, the scale feature maps of all other subnets are downsampled or upsampled to obtain feature-aligned scale feature maps;
each subnet then applies a point convolution operation to fuse the current subnet's scale feature map with all the feature-aligned scale feature maps, obtaining the fused feature map (a sketch follows).
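A minimal PyTorch sketch of this summarize-and-fuse step is given below. Nearest-neighbour interpolation stands in for the unspecified downsampling/upsampling operator, and the channel bookkeeping (the caller passes the concatenated channel count) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    """Aligns peer feature maps to the current subnet's scale, concatenates
    them with the subnet's own map, and fuses with a point (1x1) convolution."""
    def __init__(self, total_in_channels: int, out_channels: int):
        super().__init__()
        self.point_conv = nn.Conv2d(total_in_channels, out_channels, kernel_size=1)

    def forward(self, own, peers):
        h, w = own.shape[-2:]
        # Down/upsample every peer map to the current subnet's resolution.
        aligned = [F.interpolate(p, size=(h, w), mode="nearest") for p in peers]
        return self.point_conv(torch.cat([own, *aligned], dim=1))
```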
The target detection system for multi-scale feature parallel processing inputs a target image into a trained target detection model, and outputs a detection result of the target image, wherein the detection result comprises the category and the position of the target;
The target detection model comprises at least two subnets that communicate with each other, and each subnet comprises a sampling module, a summarize-and-fuse module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each subnet downsamples the RGB features of the input image with its sampling module to obtain a scale feature map;
S3, through the mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarize-and-fuse module, fuses all the scale feature maps after feature alignment, and obtains its corresponding fused feature map;
S4, each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is aggregated and the prediction result is output, wherein the prediction result comprises the category and the position of the target.
A computer-readable storage medium has stored thereon a program which, when invoked by a processor, performs the object detection method described above.
The target detection method and system for multi-scale feature parallel processing have the following advantages. The target detection model has a parallel computing structure in which different hardware computing nodes process the input image simultaneously; each subnet has a small computation load and few layers, and the parallel inference of the algorithm reduces the subnets' communication time, speeds up network inference and greatly reduces inference latency. The target detection model therefore reduces its dependence on the partitioning algorithm, makes maximal use of the hardware platform's performance, and effectively reduces the latency of the real-time target detection algorithm. The model can also output the category, position and distance information of targets simultaneously, which is significant for application scenarios such as autonomous driving.
Drawings
FIG. 1 is a schematic diagram of a target detection model according to the present invention;
FIG. 2 is a flow chart of the summary fusion of the scale feature maps of different scales in step S3;
FIG. 3 is a training flow chart of the object detection model.
Detailed Description
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
As shown in fig. 1 to 3, in the target detection method for multi-scale feature parallel processing, a target image is input into a trained target detection model, and a detection result of the target image is output, wherein the detection result comprises a category and a position of the target.
Through the parallelized design of the model structure, the target detection model is divided into several subnets, reducing network depth; different subnets of the model use different hardware computing units to process the input image simultaneously, ring communication between the subnets reduces their communication time, and network inference is accelerated. The model is described in detail below.
The target detection model comprises at least two subnets that communicate with each other, and each subnet comprises a sampling module, a summarize-and-fuse module and a prediction head;
The training process of the target detection model is as follows:
S1, constructing a training set and delivering it to each subnet;
S2, each subnet downsamples the RGB features of the input image with its sampling module to obtain a scale feature map;
Each subnet is a convolutional neural network (CNN) that extracts features from the input image independently. The subnets obtain scale feature maps of different scales by controlling the number of downsampling operations in their sampling modules, and a different number of convolution operations is designed into each subnet so that the computation times of all subnets stay close (a sketch of such a subnet follows).
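A sketch of one such subnet backbone is shown below. The stride-2 convolutions, channel width and activation are illustrative assumptions; the extra 3×3 convolutions serve as the compute-balancing knob the paragraph describes.

```python
import torch.nn as nn

def make_subnet(in_ch: int, width: int, n_down: int, n_convs: int) -> nn.Sequential:
    """Builds a toy subnet: n_down stride-2 convolutions downsample the input
    2**n_down times; n_convs extra 3x3 convolutions pad out the computation so
    that all subnets finish in roughly the same time."""
    layers, ch = [], in_ch
    for _ in range(n_down):                 # downsampling stage
        layers += [nn.Conv2d(ch, width, 3, stride=2, padding=1), nn.SiLU()]
        ch = width
    for _ in range(n_convs):                # compute-balancing stage
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU()]
    return nn.Sequential(*layers)
```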
S3, through the mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarize-and-fuse module, fuses all the scale feature maps after feature alignment, and obtains its corresponding fused feature map;
For convolutional neural networks, low-level features generally have high resolution and rich detail information, while high-level features generally have low resolution and rich semantic information. As shown in fig. 2, the features corresponding to the scale feature maps of other scales are downsampled or upsampled to the same scale as the subnet's own features, and then each subnet fuses all the features with a point convolution operation (Fusion layer) to realize information fusion and extraction.
The summary fusion of the scale feature maps of different scales is as follows:
each subnet obtains the scale feature maps of all other subnets through its summarize-and-fuse module;
taking the scale of the current subnet's scale feature map as the target, the scale feature maps of all other subnets are downsampled or upsampled to obtain feature-aligned scale feature maps;
each subnet then applies a point convolution operation to fuse the current subnet's scale feature map with all the feature-aligned scale feature maps, obtaining the fused feature map.
S4, each subnet transmits its fused feature map to a prediction head, outputs prediction information, and the prediction information of all subnets is aggregated to output the prediction result, wherein the prediction result comprises the category and the position of the target;
the prediction head classifies the targets in the image and regresses their bounding boxes; after decoding, non-maximum suppression and detection-box drawing, the target detection model finally outputs a prediction result containing the category and position of each target (a sketch of this post-processing follows).
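The confidence-filtering and NMS part of this chain can be sketched as below; the score and IoU thresholds are common defaults rather than values from the patent, and decoding is assumed to have already produced boxes in (x1, y1, x2, y2) form.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                score_thresh: float = 0.25, iou_thresh: float = 0.45):
    """Filters low-confidence predictions, then applies non-maximum
    suppression to remove redundant boxes."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thresh)    # torchvision NMS on xyxy boxes
    return boxes[idx], scores[idx]
```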
The target detection model has a distributed parallel structure and uses different hardware computing units to process the input image simultaneously. The subnets have small computation loads and few layers; through parallel inference of the algorithm, ring communication reduces the subnets' communication time, accelerates network inference and greatly reduces inference latency. The model thus reduces its dependence on hardware performance and effectively reduces the latency of the real-time target detection algorithm, and it can output the category, position and distance information of targets simultaneously, which is significant for application scenarios such as autonomous driving.
The following example, in which the object detection model includes three subnets (subnet one, subnet two and subnet three), illustrates the training process of the object detection model:
(1) Constructing a training set and transmitting the training set to each subnet;
(2) Each subnet downsamples the RGB features of the input image with its sampling module to obtain a scale feature map;
The input image is downsampled 3 times in subnet one, 4 times in subnet two, and 5 times in subnet three, in each case followed by several convolutions to extract features; the number of convolutions differs between the subnets, ensuring that the three subnets of the target detection model take approximately the same time to process a given input image, as in the sketch below.
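Reusing the make_subnet sketch above, the three-subnet example might be instantiated as follows; the convolution counts (6, 4, 2) and the width are placeholders chosen only to suggest how per-subnet compute could be balanced.

```python
# 3, 4 and 5 downsamplings give 1/8-, 1/16- and 1/32-scale feature maps.
subnet_one   = make_subnet(in_ch=3, width=128, n_down=3, n_convs=6)
subnet_two   = make_subnet(in_ch=3, width=128, n_down=4, n_convs=4)
subnet_three = make_subnet(in_ch=3, width=128, n_down=5, n_convs=2)
```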
(3) Through the mutual communication among the subnets, each subnet acquires the scale feature maps of all other subnets via its summarize-and-fuse module, fuses all the scale feature maps after feature alignment, and obtains its corresponding fused feature map;
As shown in fig. 2, the information exchange of the scale feature maps of different scales between the subnets adopts ring communication: subnet one sends to subnet two, subnet two to subnet three, and subnet three to subnet one (a sketch of this pattern follows). The features of other scales are aligned with the subnet's own features by upsampling or downsampling to obtain aligned scale feature maps, and then each subnet fuses all the features with a point convolution operation (Fusion layer) to obtain the fused feature map, realizing information fusion and extraction.
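The ring pattern can be modelled in a few lines. This single-process toy shifts each map one hop per round, so after n-1 rounds every subnet holds all n maps; a real deployment would use point-to-point transfers between hardware nodes (e.g. torch.distributed send/recv), which this sketch does not attempt.

```python
def ring_exchange(features: list) -> list:
    """Simulates ring communication: subnet i forwards its scale feature map
    to subnet (i + 1) % n each round."""
    n = len(features)
    gathered = [[f] for f in features]      # each subnet starts with its own map
    buffers = list(features)
    for _ in range(n - 1):
        # One ring hop: subnet i now holds what subnet i-1 held last round.
        buffers = [buffers[(i - 1) % n] for i in range(n)]
        for i in range(n):
            gathered[i].append(buffers[i])
    return gathered                         # gathered[i]: all n maps seen by subnet i
```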
(4) Each subnet transmits its fused feature map to a prediction head and outputs prediction information; the prediction information of all subnets is aggregated and the prediction result is output, wherein the prediction result comprises the category and the position of the target;
The three groups of prediction heads respectively classify the targets in the image and regress their bounding boxes; post-processing comprises decoding, non-maximum suppression and detection-box drawing, and the target detection model finally outputs a prediction result containing the categories and positions of the targets.
In the training process of the target detection model, to accelerate its convergence, an independent auxiliary training head is introduced before the prediction head of each subnet during the training stage, targeting the features of the different scale feature maps; convolution layers built from convolution kernels of different sizes supervise the model's intermediate-layer feature representation. Specifically, for large-scale features a convolution layer stacking 1×1 convolution kernels is adopted to better relate contextual features across different channels, and for small-scale features a convolution layer stacking 3×3 convolution kernels is adopted to better capture the spatial position features of small targets.
The prediction head in each subnet is responsible for class classification, bounding-box regression and distance regression. In the actual use stage, the predictions of the target detection model are output directly as the detection result after post-processing (decoding, non-maximum suppression (NMS), detection-box drawing, and so on). In the training stage, the predictions of the target detection model must be compared with the labels, the loss function calculated, and the weights of the model adjusted by back propagation. The loss function $L$ of the target detection model in this embodiment is expressed as follows:

$$L = \alpha L_{\mathrm{conf}} + \beta L_{\mathrm{cls}} + \gamma L_{\mathrm{box}} + \delta L_{\mathrm{aux}}$$

where $L_{\mathrm{conf}}$ is the confidence loss, reflecting whether the target detection model predicts a true target; $L_{\mathrm{cls}}$ is the classification loss, reflecting the gap between the model's predicted class and the true class; $L_{\mathrm{box}}$ is the target bounding-box loss, reflecting the gap between the predicted bounding box and the real bounding box; $L_{\mathrm{aux}}$ is the auxiliary-training-head loss, reflecting the gap between the self-distillation teacher model and the student model; and α, β, γ and δ are the weights of the confidence loss, classification loss, target bounding-box loss and auxiliary-training-head loss respectively.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of the present invention.

Claims (7)

CN202311788491.6A (priority date 2023-12-22, filing date 2023-12-22): A target detection method and system using multi-scale feature parallel processing. Status: Active. Grant publication: CN117876661B (en).

Priority Applications (1)

Application Number: CN202311788491.6A
Priority Date: 2023-12-22
Filing Date: 2023-12-22
Title: A target detection method and system using multi-scale feature parallel processing (CN117876661B, en)

Applications Claiming Priority (1)

Application Number: CN202311788491.6A
Priority Date: 2023-12-22
Filing Date: 2023-12-22
Title: A target detection method and system using multi-scale feature parallel processing (CN117876661B, en)

Publications (2)

CN117876661A (en), published 2024-04-12
CN117876661B (en), published 2025-03-18

Family

Family ID: 90592752

Family Applications (1)

Application Number: CN202311788491.6A
Priority Date: 2023-12-22
Filing Date: 2023-12-22
Title: A target detection method and system using multi-scale feature parallel processing
Status: Active (CN117876661B, en)

Country Status (1)

Country: CN
Publication: CN117876661B (en)

Citations (2)

* Cited by examiner, † Cited by third party
CN113177511A* (priority 2021-05-20, published 2021-07-27), 中国人民解放军国防科技大学: Rotating frame intelligent perception target detection method based on multiple data streams
CN117152414A* (priority 2023-08-31, published 2023-12-01), 西安交通大学: A target detection method and system based on scale attention-assisted learning method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
CN111476306B* (priority 2020-04-10, published 2023-07-28), 腾讯科技(深圳)有限公司: Object detection method, device, equipment and storage medium based on artificial intelligence
CN113065575B* (priority 2021-02-27, published 2025-03-04), 华为技术有限公司: Image processing method and related device
CN114202672B* (priority 2021-12-09, published 2025-06-13), 南京理工大学: A small object detection method based on attention mechanism
CN115187786A* (priority 2022-07-21, published 2022-10-14), 北京工业大学: A rotation-based object detection method for CenterNet2
CN116630802A* (priority 2023-05-24, published 2023-08-22), 中国科学院合肥物质科学研究院: SwinT and size-adaptive convolution-based power equipment rust defect image detection method


Also Published As

CN117876661A (en), published 2024-04-12


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
