
A method and system for aerial image target detection based on attention mechanism and related devices

Info

Publication number
CN119107486A
Authority
CN
China
Prior art keywords
network
target detection
module
attention mechanism
aerial image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411086772.1A
Other languages
Chinese (zh)
Inventor
冯增喜
姬秀明
陆金聪
安建虎
闫俊豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology
Priority to CN202411086772.1A
Publication of CN119107486A
Legal status: Pending

Abstract


The present invention provides an aerial image target detection method and system based on an attention mechanism and related devices, wherein aerial images acquired by a drone are input into an aerial image target detection network model to obtain target detection results of the aerial images; wherein the backbone network in the aerial image target detection network model is a YOLOv7 backbone network including a background suppression module BSAM containing an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of a generalized efficient layer aggregation network GELAN and a visual Swin Transformer model. The detection method of the present invention simplifies the detection process, improves the detection speed, and maintains a relatively high detection accuracy.

Description

Aerial image target detection method and system based on attention mechanism and related device
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an aerial image target detection method and system based on an attention mechanism and a related device.
Background
With the development of computer vision technology, object detection has become an important research direction in the field of computer vision. The main task of the object detection algorithm is to automatically identify objects present in an image and to label their location and class. In practical application, the target detection algorithm has been widely applied to the fields of intelligent agriculture, traffic detection, disaster response and the like. Meanwhile, application of some popular algorithms for target detection to unmanned aerial vehicles is becoming a popular research field.
In the current general target detection research field, target detection techniques developed for natural images cannot be applied directly to images captured by unmanned aerial vehicles, because aerial images pose distinct challenges. In drone-captured images, the unique characteristics of small targets still make detection difficult: (1) the small size of targets makes it hard to accurately detect and describe their features; (2) background interference, since small targets usually appear in complex backgrounds where background noise and interference negatively affect detection; (3) large scale variation, since targets are observed from different viewing angles, and changes in viewing angle alter the appearance features of the target and increase detection difficulty; and (4) occlusion, since a small target occupies relatively few pixels, its boundaries may be blurred or inconspicuous in the image, making it susceptible to occlusion. These factors lead to the low detection accuracy of existing methods on aerial images.
The development of object detection algorithms in the past decades has been largely divided into a conventional object detection algorithm stage and a deep learning-based object detection algorithm stage.
The traditional target detection framework mainly comprises three steps: target localization, feature extraction and feature classification. In the first step, sliding windows of different sizes frame regions of the image as candidate areas. In the second step, visual features of the candidate areas are extracted, such as the HOG features commonly used for pedestrian detection and general object detection. In the third step, a classifier, such as the commonly used SVM model, performs recognition. The essence of the method is to combine a sliding window with a traditional machine learning classifier: windows of different sizes and aspect ratios slide over the whole image with a certain stride, and image classification is performed on the region corresponding to each window, thereby achieving detection over the whole image.
However, conventional object detection algorithms have a number of problems. First, the detection speed is low: the sliding-window region selection strategy is not targeted, every position of the image must be examined, and the approach suffers from high time complexity and window redundancy. This makes the computation heavy and the detection slow. Second, because of the diversity of target shapes, illumination changes and backgrounds, extracting features of the targets is not easy, yet the quality of the extracted features directly affects classification accuracy. In addition, conventional object detection algorithms handle occlusion and complex backgrounds poorly; when a target is blocked by other objects or by the background, the conventional algorithm is prone to missed detections.
Compared with the traditional target detection method, the target detection method based on deep learning has become the dominant method in the field, and the target detection algorithm based on Convolutional Neural Network (CNN) architecture has become the mainstream.
In CNN-based algorithms, network architectures are divided into two types: one-stage and two-stage. The first is the two-stage detection method based on region proposals. An early example is R-CNN, which extracts region proposals and uses convolutional neural networks (e.g., AlexNet and VGG) for object classification and bounding-box regression. Fast R-CNN introduced feature sharing and RoI pooling, significantly improving processing speed and performance over the original R-CNN. Faster R-CNN introduced a Region Proposal Network to generate candidate regions, which are then used for target classification and localization through RoI pooling, achieving a good balance between accuracy and speed.
However, the two-stage target detection architecture also has several problems. First, detection is slow: the two-stage method requires two independent steps and is therefore generally slower than one-stage methods. Second, the model is complex, which makes training and deployment difficult. Third, it is poorly suited to real-time applications, because the complex model and large computational load cannot meet real-time requirements.
The single-stage method, by contrast, is another class of methods that uses a single network to regress object size, position and class directly from the input image. One of the most widely adopted single-stage detection methods is the YOLO series framework. Its core idea is to treat the target detection task as a regression problem, predicting bounding boxes and categories simultaneously in a single neural network to achieve efficient detection. Compared with traditional target detection methods, the YOLO series framework has a higher detection speed and better real-time performance, and is suitable for fields such as autonomous driving and video surveillance that require rapid target detection.
The single-stage object detection algorithm has many advantages over the two-stage algorithm. First, the computation speed is greatly improved and the computational cost is reduced, which suits the localization and detection of small targets, although accuracy may be slightly lower than that of two-stage detectors. In the YOLO series, however, background interference in the image affects feature extraction of the detected object when the backbone extracts features at its final stages. Moreover, the PAN used for neck feature fusion must balance receptive field and feature fusion and may not be sensitive enough to small targets; since aerial images mostly contain small targets, good fusion is difficult to obtain. Finally, the YOLO-series detection head consists only of convolution layers, so it can capture only local features; global context information is hard to obtain, which results in poor global perception of the model. These are some of the problems currently present in the YOLO series algorithms.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an aerial image target detection method and system based on an attention mechanism and a related device, which fully exploit the complementary advantages of the attention mechanism and the visual Swin Transformer model. First, a background suppression module BSAM containing an attention mechanism is adopted at the end of the backbone network to extract features and suppress background noise, addressing the weak feature extraction caused by background interference. Second, an improved neck structure, the bidirectional weighted fusion pyramid module TWFP, replaces the original PAN structure, performing multi-scale fusion and improving the perception of small targets through weighted feature fusion and skip connections. Third, the original detection head containing only convolution layers is replaced by a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, so that both local features and global context information are acquired and the global perception capability is improved. These improvements allow the detection performance of the method to be substantially enhanced.
In order to achieve the aim, the invention provides the technical scheme that the aerial image target detection method based on the attention mechanism inputs the aerial image acquired by the unmanned aerial vehicle into an aerial image target detection network model to obtain a target detection result of the aerial image;
The backbone network in the aerial image target detection network model is a YOLOv7 backbone network comprising a background suppression module BSAM with an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model.
Further, the YOLOv7 backbone network includes 12 layers; the background suppression module BSAM with attention mechanism is located at the eleventh layer of the YOLOv7 backbone network, and the fast spatial pyramid pooling module SPPF is located at the twelfth layer.
Further, the background suppression module BSAM with the attention mechanism is obtained by fusing a C3 module with a CBAMbottleneck containing the CBAM convolutional attention mechanism. The CBAMbottleneck is connected after one of the first two convolutions of the C3 module and comprises a 1x1 convolution layer, a 3x3 convolution layer and a CBAM convolutional block attention mechanism. The features produced by the 1x1 convolution layer, the 3x3 convolution layer and the CBAM convolutional block attention mechanism are residually connected with the input feature map, and the resulting feature map undergoes a convolution operation together with the other branch of the C3 module to obtain the output feature map.
Further, the neck structure of the bidirectional weighted fusion pyramid module TWFP includes the path aggregation network structure PAN, with bidirectional skip connections added in the middle layers of the PAN and weight calculation of the input features added in the weighted fusion part of the PAN, so that high-level semantic information and low-level detail information are fully fused.
Further, the calculation of the two-way weighted fusion pyramid module TWFP in the neck structure is as follows:
F4' = Conv1(I4)
F3' = Conv1{ELAN[I3 + up(F4')]}
F2' = Conv1{ELAN[I2 + up(F3')]}
F1' = Conv1[I1 + up(F2')]
F1 = GELAN-SW(F1')
F2 = GELAN-SW[F2' + I2 + Conv2(F1)]
F3 = GELAN-SW[F3' + I3 + Conv2(F2)]
F4 = GELAN-SW[F4' + Conv2(F3)]
wherein F1'~F4' denote the feature maps obtained along the bottom-up path, corresponding to levels I1~I4; F1~F4 denote the feature maps output along the top-down path; + denotes concat weighted feature fusion; up denotes upsampling; Conv1 denotes a 1×1 convolution operation; Conv2 denotes a 3×3 convolution operation; ELAN indicates that the feature map has passed through an ELAN module; and GELAN-SW indicates that the feature map has passed through a detection head.
Further, the detection head structure comprises four layers: the first layer contains one detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, and the second to fourth layers each contain two such detection heads.
The invention also provides an aerial image target detection system based on the attention mechanism, which comprises:
The data acquisition module is used for acquiring aerial images shot by the unmanned aerial vehicle;
The network module construction module is used for building an aerial image target detection network model, wherein the backbone network in the aerial image target detection network model is a YOLOv7 network comprising a background suppression module BSAM with an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model;
The target detection module is used for inputting the aerial image acquired by the unmanned aerial vehicle into an aerial image target detection network model to obtain a target detection result of the aerial image.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 7 when executing the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 7.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 6.
Compared with the prior art, the invention has at least the following beneficial effects:
The invention provides an aerial image target detection method based on an attention mechanism, which constructs an aerial image target detection network model. A background suppression module BSAM containing the attention mechanism is introduced into the network backbone structure to extract image features, effectively suppressing the interference of complex backgrounds in aerial images, making target features more prominent and improving target detection accuracy. Second, the neck structure adopts the bidirectional weighted fusion pyramid module TWFP instead of the traditional PAN structure, and achieves full fusion of high-level semantic information and low-level detail information through bidirectional skip connections and a weighted fusion strategy, improving the detection capability of the model for targets of different scales. Finally, the detection head structure combines the generalized high-efficiency layer aggregation network GELAN with the visual Swin Transformer model, retaining the local feature extraction capability of the convolutional neural network while introducing the global context perception of the Transformer, so that the model can locate and classify targets more accurately during detection.
Furthermore, the backbone network of the aerial image target detection network model adopts the YOLOv7 framework combined with the BSAM module, improving feature extraction; the neck structure achieves effective multi-scale feature fusion through the TWFP module; and the detection head improves detection accuracy through the combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model. Each module in the model can be adjusted and optimized according to specific task requirements to adapt to different application scenarios and datasets. By introducing the attention mechanism and an advanced network structure, the model keeps a high detection speed while maintaining high accuracy, and is suitable for application scenarios with high real-time requirements.
The improved network structure of the invention can bring about improvement of accuracy and stability, which means that accurate detection results can be obtained more reliably. After the algorithm can obtain an accurate detection result, more accurate decision making and planning can be performed in the fields of public safety, urban planning, traffic monitoring and the like. In the public safety field, the detection method can monitor the road surface condition in real time, discover abnormal events and potential threats in time, and provide powerful guarantee for the safety and stability of cities. In city planning, by detecting targets such as buildings, roads, greenbelts and the like in a city, accurate data support can be provided for city planning, and city layout optimization and city management level improvement are facilitated. In the traffic monitoring field, the detection method can rapidly and accurately identify targets such as vehicles, pedestrians and the like, is beneficial to realizing real-time monitoring and intelligent scheduling of traffic flow and improves traffic running efficiency. Therefore, our improved method is of great importance for improving the decision-making effect and social management level in practical applications.
Drawings
FIG. 1 is a schematic illustration of an aerial image target detection method based on an attention mechanism of the present invention.
Fig. 2 is a background suppression module BSAM containing an attention mechanism in an aerial image object detection network model.
FIG. 3 is a CBAM convolution attention mechanism in an aerial image target detection network model.
Fig. 4 is a neck structure-bi-directional weighted fusion pyramid module TWFP in an aerial image target detection network model.
Fig. 5 is a detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model in the aerial image target detection network model.
Fig. 6 is a block diagram of a generalized high-efficiency layer aggregation network GELAN in an aerial image target detection network model.
FIG. 7 is a visualization of the detection results of different algorithms of the aerial image target detection network model on the VisDrone dataset.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
The invention provides an aerial image target detection method based on an attention mechanism, which, by fusing the attention mechanism with the visual Swin Transformer model, can more accurately detect targets in complex aerial images. The method specifically comprises the following steps:
step 1, training by using two public data sets as input images;
Images from the two public datasets are taken as inputs. These datasets are VisDrone and UAVDT. VisDrone2019 consists of 7,019 high-resolution aerial images (2,000×1,500) covering 10 categories; the dataset is characterized by small object sizes, dense distribution and possible partial occlusion. The UAVDT dataset (1,024×540) is designed for vehicle detection tasks from the unmanned aerial vehicle (UAV) perspective and contains images taken by drones at different heights and angles over urban roads. Each dataset offers different scenes, densities and image qualities, ensuring that the model generalizes well in various environments.
The specific implementation process of the embodiment is described as follows:
The dataset first needs to be converted to YOLO format to meet the experimental requirements. Training was performed using 6,471 images of the VisDrone dataset and validation using 548 images, with the dataset image input set to 1536×1536. The UAVDT dataset consists of 50 videos containing 40,376 images in total; 24,778 images were used for training and 15,598 images for testing, all with a resolution of 1024×540. According to the requirements of the dataset, all images from the same video are assigned either to the training set or to the test set: of the 50 videos, the images of 31 videos were placed in the training set and those of the remaining 19 videos in the test set. By using these two datasets, the invention comprehensively covers different scenes and target categories and can evaluate and compare the performance of the target detection algorithm across datasets.
Step 2, constructing an aerial image target detection network ASA-YOLO
The object detection network adopts the YOLOv7 network. Most objects in aerial images are small, so when the backbone network performs deep feature extraction, small objects often lose a large amount of semantic information and are more easily affected by the image background. Therefore, the background suppression module BSAM incorporating the attention mechanism is added at the end of the backbone of the YOLOv7 network to extract the feature information of objects in the image, providing rich information for subsequent stages and reducing the interference of the complex background on small-object features at the end of the backbone. In the neck architecture of the YOLOv7 network, the bidirectional weighted fusion pyramid module TWFP replaces the original PAN structure, better ensuring that features of different levels communicate with each other and enriching the feature representation. In YOLOv7, the original detection head containing only convolution layers is changed to a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model: the GELAN extracts local information of the target, and the visual Swin Transformer model then captures the global context information of the image; combining the two effectively improves detection accuracy.
Overall, the target detection network of the invention improves accuracy and stability, which means that accurate detection results can be obtained more reliably. With accurate detection results, more precise decision making and planning can be carried out in fields such as public safety, urban planning and traffic monitoring, which is of great significance for improving decision effects and the level of social management in practical applications.
Step 2.1 building a backbone network
As shown in fig. 1, the backbone network is the part of the whole target detection network used for feature extraction and adopts the backbone of the YOLOv7 network. The eleventh layer is the background suppression module BSAM incorporating the attention mechanism, and the twelfth layer is the fast spatial pyramid pooling module SPPF. By adjusting the width and depth of the backbone network, the whole target detection network becomes more efficient.
Specifically, the first ten layers adopt the backbone of the YOLOv7 network. This backbone is an improved version based on CSPDarknet; by introducing the CSPNet structure and further optimizing it, the feature extraction capability and computational efficiency of the network are improved. This makes YOLOv7 excel in the object detection task and an important version of modern object detection models.
In the eleventh layer of the backbone network, the invention adopts the background suppression module BSAM incorporating the attention mechanism. This module integrates the CBAM convolutional attention mechanism, so that the feature information of small targets is preserved during down-sampling while irrelevant background noise in the image is suppressed; it takes both channel and position information into account, is flexible and lightweight, and can improve detection accuracy at a low computational cost. The fast spatial pyramid pooling module SPPF is then connected after the eleventh layer; it extracts and fuses high-level features, applying 3x3 max-pooling operations several times during fusion to extract as much high-level semantic information as possible, improving computational efficiency while maintaining the effect of multi-scale feature fusion. The feature maps of layers 5, 7, 9 and 12 of the backbone network are passed into the neck architecture at different levels, providing strong support for the subsequent multi-layer feature fusion.
The specific implementation process of the embodiment is described as follows:
In the invention, the backbone network mainly consists of 12 layers. The input is a given aerial image I ∈ R^(W×H×3), where W represents the width of the image, H its height, and the channel size 3 indicates that the image has three channels, typically in RGB color mode. The first layer is a convolution layer with a 3×3 kernel and stride 1, the second layer a convolution layer with a 3×3 kernel and stride 2, the third layer a convolution layer with a 3×3 kernel and stride 1, and the fourth layer a convolution layer with a 3×3 kernel and stride 2. The fifth layer is an ELAN module composed of convolution and residual modules, and the sixth layer is a convolution layer with a 3×3 kernel and stride 2. Layers 7 to 10 repeat the structure of the fifth and sixth layers.
The ELAN module passes the input feature map through two parallel 1x1 convolution layers to reduce the channel number and improve computational efficiency, then processes and extracts features layer by layer through several 3x3 convolution layers; finally, specific feature maps are selected through an index list, concatenated together, and fused by a 1x1 convolution layer to obtain the final output. This design enhances feature representation capability while maintaining computational efficiency, enabling the model to better extract and fuse features when handling complex tasks.
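For illustration, a minimal PyTorch sketch of an ELAN-style block of this kind is given below; the channel widths, the number of 3x3 stages and the indices kept for concatenation are assumptions chosen for the example, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class ELANSketch(nn.Module):
    """ELAN-style block: two parallel 1x1 branches, a chain of 3x3 convolutions,
    concatenation of selected intermediate maps, then a 1x1 fusion convolution."""
    def __init__(self, c_in, c_out, n_stages=4):
        super().__init__()
        c_mid = c_out // 2                                   # hidden width (assumption)
        self.branch1 = nn.Conv2d(c_in, c_mid, 1)             # first 1x1 branch
        self.branch2 = nn.Conv2d(c_in, c_mid, 1)             # second 1x1 branch
        self.stages = nn.ModuleList(
            [nn.Conv2d(c_mid, c_mid, 3, padding=1) for _ in range(n_stages)])
        # indices of the feature maps kept for concatenation (illustrative)
        self.keep = [0, 1, n_stages // 2 + 1, n_stages + 1]
        self.fuse = nn.Conv2d(c_mid * len(self.keep), c_out, 1)

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x)]
        y = feats[-1]
        for conv in self.stages:                             # layer-by-layer 3x3 processing
            y = conv(y)
            feats.append(y)
        selected = [feats[i] for i in self.keep]             # pick maps via the index list
        return self.fuse(torch.cat(selected, dim=1))         # concat, then 1x1 fusion
```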
The eleventh layer is the BSAM module that incorporates CBAM convolution attention mechanisms, as shown in fig. 2. The background suppression module BSAM with the attention mechanism is formed by fusing a C3 module and a CBAMbottleneck module with a CBAM convolution attention mechanism, wherein the C3 module comprises three convolution operations, the first two convolution operations are respectively and independently used for processing input features, the third convolution operation is used for combining the outputs of the first two paths, CBAMbottleneck is connected at the back of one of the first two convolutions of the C3 module, and CBAMbottleneck comprises two convolution layers, a CBAM convolution block attention module and a residual connection shortcut.
In BSAM module, the input feature map reduces the channel number through two 1x1 convolution layers of the C3 module, CBAMbottleneck performs feature processing on one of the first two convolutions, the processed feature map and the other convolutions perform convolution operation to achieve splicing, and finally the spliced feature map is converted into the required output channel number through one 1x1 convolution layer. The design can reduce the calculated amount while guaranteeing the characteristic expression capability.
CBAMbottleneck first reduces the number of channels with a 1x1 convolution layer, then restores the channel number with a 3x3 convolution layer, and feeds the result into the CBAM convolutional block attention mechanism. As shown in fig. 3, the CBAM attention mechanism consists of two parts: a channel attention module and a spatial attention module. After the feature map enters the CBAM attention mechanism, it first passes through the channel attention module, which applies global pooling and a multi-layer perceptron; the two outputs are added to produce a channel attention map, which is then multiplied element-wise, channel by channel, with the original feature map to obtain the weighted feature map F'. The weighted feature map then enters the spatial attention module, which applies pooling and convolution operations, and the resulting spatial attention map is multiplied element-wise, position by position, with F' to obtain the final weighted feature map F''. The output F'' is connected to the input feature map of CBAMbottleneck via a residual connection. By combining the CBAM attention mechanism, CBAMbottleneck reduces the interference of the background on the target, and the residual connection more effectively promotes gradient flow and convergence speed.
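The following compact PyTorch sketch illustrates the CBAM attention and the CBAMbottleneck/BSAM arrangement described above; the channel reduction ratio, the 7x7 spatial-attention kernel and the channel widths are illustrative assumptions rather than values mandated by the invention.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):                              # reduction ratio r is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(), nn.Conv2d(c // r, c, 1))
    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)                         # channel attention map

class SpatialAttention(nn.Module):
    def __init__(self, k=7):                                   # 7x7 kernel, the common CBAM choice
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAMBottleneck(nn.Module):
    """1x1 conv -> 3x3 conv -> channel then spatial attention -> residual add."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c // 2, 1)
        self.conv2 = nn.Conv2d(c // 2, c, 3, padding=1)
        self.ca, self.sa = ChannelAttention(c), SpatialAttention()
    def forward(self, x):
        y = self.conv2(self.conv1(x))
        y = y * self.ca(y)                                     # weighted feature map F'
        y = y * self.sa(y)                                     # weighted feature map F''
        return y + x                                           # residual connection with the input

class BSAMSketch(nn.Module):
    """C3-style block with CBAMBottleneck on one branch, concat, then 1x1 fusion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.cv1, self.cv2 = nn.Conv2d(c_in, c_mid, 1), nn.Conv2d(c_in, c_mid, 1)
        self.bottleneck = CBAMBottleneck(c_mid)
        self.cv3 = nn.Conv2d(2 * c_mid, c_out, 1)
    def forward(self, x):
        return self.cv3(torch.cat([self.bottleneck(self.cv1(x)), self.cv2(x)], dim=1))
```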
The twelfth layer is the fast spatial pyramid pooling module SPPF.
The fast spatial pyramid pooling module SPPF first applies a 1x1 convolution to the input, then performs max-pooling three times in succession, concatenates the original feature map with the three max-pooling results, and finally reduces the channel number of the concatenated feature map to the desired output channel number through another 1x1 convolution layer. This design enhances feature representation capability while maintaining computational efficiency, so that the model performs better when handling targets of different scales.
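A minimal sketch of such an SPPF block is shown below; the 3x3 pooling kernel follows the description above, while the hidden channel width is an assumption.

```python
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Fast spatial pyramid pooling: 1x1 conv, three successive max-poolings,
    concat of the four maps, then a 1x1 conv back to the desired channels."""
    def __init__(self, c_in, c_out, k=3):        # 3x3 pooling per the description above
        super().__init__()
        c_mid = c_in // 2                         # hidden width (assumption)
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```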
The input image is 1536×1536×3, and after passing through the twelve layers of the backbone network the output is 48×48×1024. The backbone network feature extraction process is expressed as:
Fc=Czgwl(I) (1)
Fc denotes features derived from CNN, Czgwl denotes feature extraction by the backbone network, and I denotes an input image.
Step 2.2, constructing a neck structure of the bidirectional weighted fusion pyramid module TWFP:
As shown in fig. 4, the bidirectional weighted fusion pyramid module TWFP improves the whole neck by adopting a bidirectional weighted fusion and jump connection mode, can better realize multi-scale fusion of feature graphs in a feature fusion stage, and provides more accurate prediction detection for a subsequent target detection task;
the concrete improvement is as follows:
The path aggregation network PAN structure adopts bidirectional connections, including top-down and bottom-up feature propagation, which ensures that features at different levels can communicate with each other and enriches the feature representation. The weighted bidirectional feature pyramid network BiFPN introduces a learnable weighting mechanism and skip connections, which allow features of different scales to be weighted, averaged and fused, making more effective use of the feature information. Drawing on the original path aggregation network PAN structure and the weighted bidirectional feature pyramid network BiFPN structure, parts of both are retained and fused with improvements to obtain the bidirectional weighted fusion pyramid module TWFP.
The specific implementation process of the embodiment is described as follows:
As shown in fig. 4, which depicts the TWFP structure, the module takes as input the four feature maps of layers 5, 7, 9 and 12 extracted from the backbone network, performs bottom-up feature propagation from F4' to F1', and then performs top-down feature propagation from F1 to F4. Bidirectional skip connections are added to the two middle layers of the original PAN structure; they allow information to be passed directly from an earlier layer to a later layer, bypassing the intermediate layer for feature map fusion. In this way, the high-level semantic information and the low-level detail information can be fully fused. This multi-scale feature fusion enhances the network's ability to handle targets of different sizes and improves the accuracy of target detection.
After the bidirectional skip connections, the feature maps are fused with weighting: the importance of the different input features is learned and the features are then fused. Conventional feature fusion tends to be a simple superposition or addition of feature maps, which are resized to the same resolution and then added, treating all inputs without distinction. However, different input feature maps have different resolutions and contribute differently to the fused output, so simply adding or stacking them is not optimal. An additional weight is therefore attached to each input so that the network can learn the importance of each input feature.
Taking F2' and F2 as examples, the weighted fusion is calculated as follows:
where I2, F2' and F2 denote the backbone network output, the bottom-up feature map and the top-down feature map respectively; ω denotes the weights; ε is a small constant; Resize denotes an up-sampling or down-sampling operation for resolution matching; and Conv denotes a convolution operation for feature processing.
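As a concrete illustration of this learnable weighting, the sketch below assumes the fast normalized fusion form popularized by BiFPN (non-negative learnable weights normalized by their sum plus ε), which the TWFP design is described as drawing on; the exact formula used by the invention may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuses several feature maps with learnable non-negative weights,
    normalized by their sum plus a small epsilon (BiFPN-style assumption)."""
    def __init__(self, n_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # feature processing

    def forward(self, feats):
        size = feats[0].shape[-2:]                                 # resize to the first input
        feats = [f if f.shape[-2:] == size
                 else F.interpolate(f, size=size, mode="nearest") for f in feats]
        w = F.relu(self.w)                                         # keep weights non-negative
        w = w / (w.sum() + self.eps)                               # normalize
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(fused)

# Example for the F2 node: fuse the backbone output I2, the bottom-up map F2'
# and the resized top-down map F1 (names follow the notation above).
# fuse_f2 = WeightedFusion(n_inputs=3, channels=512)
# F2 = fuse_f2([I2, F2_prime, F1])
```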
The feature maps of the input image are 384×384×256, 192×192×512, 96×96×1024 and 48×48×1024, where the feature map of I1 has the highest resolution and each subsequent feature map's resolution is halved. In order to enhance the framework's ability to fuse multi-scale features while limiting its complexity, the invention chooses in the actual experiments to add a bidirectional skip connection between the two middle layers of the module. This allows feature information of different scales to be captured efficiently. The final outputs are F1, F2, F3, F4. The whole fusion process is as follows:
The feature map I4 undergoes a convolution operation with a 1×1 kernel and stride 1 to obtain the feature map F4'.
The feature map F4' is up-sampled by nearest-neighbor interpolation to increase its resolution, then fused with the feature map I3 by concat weighted feature fusion. The fused feature map enters an ELAN convolution module (which reduces the channel number) for feature processing, followed by a 1×1 convolution with stride 1, yielding the feature map F3'.
The feature map F3' is up-sampled by nearest-neighbor interpolation to increase its resolution, then fused with the feature map I2 by concat weighted feature fusion. The fused feature map enters an ELAN convolution module (which reduces the channel number) for feature processing, followed by a 1×1 convolution with stride 1, yielding the feature map F2'.
The feature map F2' is up-sampled by nearest-neighbor interpolation to increase its resolution, then fused with the feature map I1 by concat weighted feature fusion. The fused feature map undergoes a 1×1 convolution with stride 1, yielding the feature map F1'.
The feature map F1' is passed through a detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model for local and global feature extraction, producing the feature map F1.
The feature map F1 undergoes a 3×3 convolution with stride 2; then I2, F2' and F1 are fused by concat weighted feature fusion (where I2 skips the middle layer and is directly weighted-fused with F2' and F1), followed by a 1×1 convolution with stride 1; the result is fed into the GELAN + visual Swin Transformer model detection head for local and global feature extraction, producing the feature map F2.
The feature map F2 undergoes a 3×3 convolution with stride 2; then F3', I3 and F2 are fused by concat weighted feature fusion (where I3 skips the middle layer and is directly weighted-fused with F3' and F2), followed by a 1×1 convolution with stride 1; the result is fed into the GELAN + visual Swin Transformer model detection head for local and global feature extraction, producing the feature map F3.
The feature map F3 undergoes a 3×3 convolution with stride 2; then F3 and F4' are fused by concat weighted feature fusion, and the result is fed into the GELAN + visual Swin Transformer model detection head for local and global feature extraction, producing the feature map F4.
F4' = Conv1(I4) (4)
F3' = Conv1{ELAN[I3 + up(F4')]} (5)
F2' = Conv1{ELAN[I2 + up(F3')]} (6)
F1' = Conv1[I1 + up(F2')] (7)
F1 = GELAN-SW(F1') (8)
F2 = GELAN-SW[F2' + I2 + Conv2(F1)] (9)
F3 = GELAN-SW[F3' + I3 + Conv2(F2)] (10)
F4 = GELAN-SW[F4' + Conv2(F3)] (11)
where F1'~F4' denote the feature maps obtained along the bottom-up path, corresponding to levels I1~I4, and F1~F4 denote the feature maps output along the top-down path. + denotes concat weighted feature fusion, up denotes up-sampling, Conv1 denotes a 1×1 convolution operation, Conv2 denotes a 3×3 convolution operation, ELAN indicates that the feature map passes through an ELAN module, and GELAN-SW denotes the detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model.
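The data flow of equations (4)-(11) can be summarized by the following structural sketch; the ELAN blocks and the GELAN + Swin Transformer heads are represented by placeholder convolutions and the concat weighted fusion is reduced to element-wise addition, so this captures only the connectivity of TWFP, not the full modules of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class TWFPSketch(nn.Module):
    """Structural sketch of equations (4)-(11) with placeholder sub-modules."""
    def __init__(self, chs=(256, 512, 1024, 1024)):
        super().__init__()
        c = chs[0]                                   # single internal width (simplifying assumption)
        self.inp = nn.ModuleList([nn.Conv2d(ci, c, 1) for ci in chs])            # project I1..I4
        self.conv1 = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(4)])       # Conv1: 1x1
        self.conv2 = nn.ModuleList([nn.Conv2d(c, c, 3, 2, 1) for _ in range(3)]) # Conv2: 3x3, stride 2
        self.elan = nn.ModuleList([nn.Conv2d(c, c, 3, 1, 1) for _ in range(2)])  # ELAN stand-ins
        self.head = nn.ModuleList([nn.Conv2d(c, c, 3, 1, 1) for _ in range(4)])  # GELAN-SW stand-ins

    def up(self, x, ref):
        return Fn.interpolate(x, size=ref.shape[-2:], mode="nearest")

    def forward(self, I1, I2, I3, I4):
        I1, I2, I3, I4 = (m(x) for m, x in zip(self.inp, (I1, I2, I3, I4)))
        F4p = self.conv1[3](I4)                                    # eq. (4)
        F3p = self.conv1[2](self.elan[1](I3 + self.up(F4p, I3)))   # eq. (5)
        F2p = self.conv1[1](self.elan[0](I2 + self.up(F3p, I2)))   # eq. (6)
        F1p = self.conv1[0](I1 + self.up(F2p, I1))                 # eq. (7)
        F1 = self.head[0](F1p)                                     # eq. (8)
        F2 = self.head[1](F2p + I2 + self.conv2[0](F1))            # eq. (9), I2 skip connection
        F3 = self.head[2](F3p + I3 + self.conv2[1](F2))            # eq. (10), I3 skip connection
        F4 = self.head[3](F4p + self.conv2[2](F3))                 # eq. (11)
        return F1, F2, F3, F4
```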
Step 2.3, constructing a detection head structure
The structure of the new detection head designed by the invention is shown in fig. 5. The detection head in the model has four layers: the first layer contains one detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, and the second to fourth layers each contain two such GELAN + visual Swin Transformer model detection heads, stacked twice to produce the output. The multi-scale fused feature map from the neck is input into the generalized high-efficiency layer aggregation network GELAN to extract local information of the features. The GELAN structure is shown in fig. 6; by combining CSPNet with the characteristics of the ELAN convolution module, it improves the efficiency and effectiveness of the model in processing complex data patterns. The extracted feature map is then passed into the visual Swin Transformer model to extract global features. The globally extracted feature map is passed to the detection module to generate a tensor containing bounding-box coordinate information, confidence and class predictions. The generated tensor of each detection layer is then decoded, and the decoded bounding boxes, confidences and class predictions of the four detection layers are integrated to obtain the final detection result.
The specific implementation process of the embodiment is described as follows:
In the invention, after the neck performs multi-scale feature fusion, the four feature maps of different scales are used as the input of the detection head and enter the generalized high-efficiency layer aggregation network GELAN for processing. The GELAN extracts features through multiple convolution layers and RepNCSP modules: the input feature map first reduces the number of channels through a 1x1 convolution layer, then chunk is used to split the output into two tensors along the channel dimension, which pass through a RepNCSP module and a 3x3 convolution layer respectively. Finally all feature maps are concatenated and output. This module design enhances feature extraction capability while maintaining computational efficiency. The feature map output by this computation is then passed into the visual Swin Transformer model.
The feature map fed into the visual Swin Transformer model is expressed as W×H×C, where W represents the width of the image, H its height, and C the number of channels. The input feature map is partitioned by Patch Partition into a series of fixed-size patches, each of size P×P×C, where P is the side length of a patch. This process converts the input image from a two-dimensional structure into a one-dimensional sequence of patch tokens. The patch tokens are then mapped to a high-dimensional space through the fully connected layer Linear Embedding by a linear transformation, resulting in the Patch Embedding.
Patch Embedding=Linear(Flatten(Patch)) (12)
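Equation (12) can be realized, for example, by a strided convolution that performs the patch partition and linear embedding in one step; the patch size P = 4 and the embedding dimension 96 below are illustrative assumptions.

```python
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Patch Partition + Linear Embedding in one step: a PxP convolution with
    stride P cuts the W x H x C map into non-overlapping patches and projects
    each patch to an embed_dim-dimensional token."""
    def __init__(self, in_ch, embed_dim=96, patch=4):    # P=4, dim=96 are assumptions
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                  # x: (B, C, H, W)
        x = self.proj(x)                   # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim) token sequence
        return self.norm(x)
```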
The generated Patch Embedding enters Layer Normalization, which is generally applied to the input of each sub-layer in the visual Swin Transformer model and mainly serves to improve convergence speed and robustness. The ability to capture context information and spatial information in the image is then improved by introducing a window multi-head self-attention (W-MHSA) module and a shifted-window multi-head self-attention (SW-MHSA) module. The W-MHSA module adopts a window attention mechanism: the input Patch Embedding is divided into several windows, and local information is effectively modeled by computing attention weights within each window. The SW-MHSA module uses a shifted-window attention mechanism that considers the relationship between adjacent windows to effectively capture local information. The MLP multi-layer perceptron is used in the last part of each visual Swin Transformer layer to perform a nonlinear transformation; its nonlinear transformation capability enables the network to learn a richer feature representation. The calculation process is as follows:
F'=W-MHSA(LN(F))+F (13)
F'=MLP(LN(F'))+F' (14)
F"=SW-MHSA(LN(F'))+F' (15)
F"=MLP(LN(F"))+F" (16)
The feature map output after processing by the visual Swin Transformer model then enters the Detect module to generate tensors containing the bounding-box coordinates, confidence and class-classification information. For the prediction of each detection layer, the bounding boxes must be decoded: the predictions in the invention are given as center coordinates plus width and height, and this relative position and scale information generally needs to be converted to actual image coordinates. The confidence and class predictions are decoded with a sigmoid function. The decoded bounding boxes, confidences and class predictions of the four detection layers are integrated to obtain the final detection result.
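The decoding step can be illustrated as follows; the box layout (center coordinates and width/height scaled by the layer stride) and the confidence threshold are assumptions modeled on common YOLO-style heads rather than a definitive reproduction of the invention's decoder.

```python
import torch

def decode_predictions(pred, stride, conf_thresh=0.25):
    """pred: (B, N, 5 + num_classes) raw output of one detection layer, where the
    first 4 channels are the box (cx, cy, w, h) in grid units (assumed layout).
    Returns boxes in image coordinates plus sigmoid-decoded confidences and classes."""
    boxes = pred[..., :4] * stride               # relative position/scale -> image coordinates
    obj_conf = torch.sigmoid(pred[..., 4:5])     # objectness confidence
    cls_prob = torch.sigmoid(pred[..., 5:])      # per-class probabilities
    scores, labels = (obj_conf * cls_prob).max(dim=-1)
    keep = scores > conf_thresh                  # drop low-confidence predictions
    return boxes[keep], scores[keep], labels[keep]

# The outputs of the four detection layers would then be concatenated and passed
# through non-maximum suppression to obtain the final detections.
```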
Step 3, training the output result
The experiments were carried out on an NVIDIA A800-SXM4 graphics card with 80 GB of video memory and a 16-core CPU. The experimental software was configured as PyTorch 2.21.21 and CUDA 12.0. The training phase comprised 200 epochs, with the first 2 epochs used as warm-up. The initial learning rate was set to 3.2×10^-4 and decayed to 0.12 times its value in the final epoch. Parameter optimization uses the Adam optimization algorithm, and the performance of the target detection network ASA-YOLO is optimized by minimizing the loss function. The target detection network ASA-YOLO is trained iteratively; according to the error computed in each forward propagation, the model is continuously optimized using back-propagation and gradient descent, finally yielding the trained network model. The trained target detection network model is then used to test the test set and output the test results.
The specific implementation process of the embodiment is described as follows:
The invention concerns the experimental configuration and optimization strategy of the ASA-YOLO model on the PyTorch platform. The invention uses a hardware environment equipped with an 80 GB NVIDIA A800-SXM4 GPU to train and evaluate the model, and uses publicly accessible datasets to perform the target detection task. The evaluation indices mainly include the average precision AP and the AP50 computed at an IoU threshold of 0.5.
The ASA-YOLO model is composed of three blocks: a backbone network for feature extraction, a neck for multi-scale fusion, and a detection head for outputting results. The backbone network extracts the feature information of the targets in the image and introduces the background suppression module BSAM containing an attention mechanism, reducing the interference of complex backgrounds on small-target features at the end of the backbone. The improved weighted bidirectional feature fusion architecture TWFP is adopted in the neck structure for multi-scale fusion, better ensuring that features of different levels communicate with each other and enriching the feature representation. The detection head combines the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model: the GELAN extracts local information of the target, and the visual Swin Transformer model then captures the global context information of the image. The effective combination of the three improves detection accuracy.
During training, the invention takes 1536×1536 inputs for the VisDrone dataset and 1024×1024 inputs for the UAVDT dataset, and sets the batch size to 4 according to the computing power of the GPU. The Adam optimizer is selected for model training, with the learning rate set to 3.2×10^-4 and decaying to 0.12 times that value, in order to minimize the loss function and optimize network performance. These experimental configuration and optimization strategies aim to ensure good performance and reliability of the model in the target detection task. Fig. 7 shows the visualized comparison of our results with the YOLOv7 baseline.
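A minimal configuration matching the reported hyper-parameters might look as follows; the model is a stand-in module, the training-loop body is only sketched in comments, and the exact shape of the decay schedule (linear here, with a short warm-up) is an assumption.

```python
import torch
import torch.nn as nn

# 'model' stands in for the ASA-YOLO network; any nn.Module suffices to
# illustrate the optimizer and learning-rate schedule described above.
model = nn.Conv2d(3, 16, 3)

epochs, warmup_epochs, lr0, final_factor = 200, 2, 3.2e-4, 0.12
optimizer = torch.optim.Adam(model.parameters(), lr=lr0)

def lr_lambda(epoch):
    # Short warm-up, then linear decay to 0.12x the initial rate (assumed schedule shape).
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(epochs - warmup_epochs, 1)
    return 1.0 - (1.0 - final_factor) * progress

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    # for images, targets in train_loader:   # batch size 4, 1536x1536 inputs (VisDrone)
    #     loss = detection_loss(model(images), targets)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```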
We carried out a number of experiments on the 2 published datasets, VisDrone and UAVDT, confirming that the adopted method gives excellent results.
TABLE 1
TABLE 2
Experiments on the VisDrone dataset: in order to evaluate the performance of ASA-YOLO, the invention compares its experimental results with those of eleven other state-of-the-art methods on the VisDrone dataset, as shown in Table 1. The average precision (AP50) of the ASA-YOLO of the invention at IoU = 0.5 is 64.3%, the highest value for this index, while its average precision (AP) reaches 41.8%, only 0.3% below the best SOTA. Compared with the baseline method, ASA-YOLO shows a significant improvement in AP50 (from 61.5% to 64.3%) and AP (from 39.6% to 41.8%). The components markedly improve the feature extraction capability for detecting targets of various scales in drone-captured images while reducing false alarms. In order to capture comprehensive feature information of small targets in complex scenes, the depth and width of the network are increased; in terms of model complexity, the computational complexity of the model is 551.0 GFLOPs. The computation of TPH-YOLOv reaches 556.6 GFLOPs, so the computation of the method is lower than that of TPH-YOLOv. When evaluated with the AP50 index, the method is 1.2% higher than TPH-YOLOv, while the AP is 0.2% lower, but the AP50 of the invention achieves SOTA performance, highlighting its superior accuracy. The experimental comparison shows that a slight increase in complexity can significantly improve performance.
Experiments on the UAVDT dataset: to further verify the method of the invention, experiments were performed on the UAVDT dataset, as shown in Table 2. Compared with the prior art, the ASA-YOLO of the invention achieves the best SOTA effect on AP50, while its AP is only 0.2% below SMFF-YOLO. In conclusion, the method has good detection accuracy for small targets and dense scenes, and to a certain extent overcomes the challenge of target detection in complex scenes.
The invention also provides an aerial image target detection system based on the attention mechanism, which can be used for realizing the aerial image target detection method based on the attention mechanism, and specifically comprises a data acquisition module, a network module construction module and a target detection module, wherein:
The data acquisition module is used for acquiring aerial images shot by the unmanned aerial vehicle;
The network module construction module is used for building an aerial image target detection network model, wherein a backbone network in the aerial image target detection network model is a YOLOv network comprising a background suppression module BSAM with an attention mechanism, a neck structure is a bidirectional weighted fusion pyramid module TWFP, and a detection head structure is formed by combining a generalized high-efficiency layer aggregation network GELAN and a visual Swin transform model;
The target detection module is used for inputting the aerial image acquired by the unmanned aerial vehicle into an aerial image target detection network model to obtain a target detection result of the aerial image.
An embodiment of the invention provides a terminal device comprising a processor, a memory and a computer program stored in the memory and executable on the processor. The processor performs the steps in the method for detecting the aerial image target based on the attention mechanism when executing the computer program, or performs the functions of each module in the aerial image target detection system based on the attention mechanism when executing the computer program.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention.
The terminal equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The terminal device may include, but is not limited to, a processor, a memory.
The processor may be a central processing unit (Central Processing Unit, CPU), but may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory may be used to store the computer program and/or module, and the processor may implement various functions of the terminal device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory.
The modules/units integrated in the terminal device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products.
Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by instructing related hardware by a computer program, where the computer program may be stored in a computer readable storage medium, and the computer program may implement the steps of the method for detecting an aerial image target based on an attention mechanism when executed by a processor. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc.
The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium can be adjusted appropriately according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
In summary, the attention-mechanism-based aerial image target detection method is built on an end-to-end single-stage target detection algorithm. The aerial image target detection network model addresses the problems encountered at the tail of the backbone, in the neck structure, and in the detection head. Compared with two-stage detection methods it offers high detection speed and high efficiency while retaining high accuracy, and its accuracy exceeds that of traditional target detection methods and two-stage methods such as ClusDet and DMNet.

Claims (10)

1. An aerial image target detection method based on an attention mechanism, characterized in that an aerial image acquired by an unmanned aerial vehicle is input into an aerial image target detection network model to obtain a target detection result of the aerial image; wherein the backbone network in the aerial image target detection network model is a YOLOv7 backbone network comprising a background suppression module BSAM containing an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of a generalized efficient layer aggregation network GELAN and a visual Swin Transformer model.
2. The aerial image target detection method based on an attention mechanism according to claim 1, characterized in that the YOLOv7 backbone network comprises 12 layers, the background suppression module BSAM containing the attention mechanism is located at the eleventh layer of the YOLOv7 backbone network, and the fast spatial pyramid pooling module SPPF is located at the twelfth layer of the YOLOv7 backbone network.
3. The aerial image target detection method based on an attention mechanism according to claim 1 or 2, characterized in that the background suppression module BSAM containing the attention mechanism is obtained by fusing a C3 module with a CBAMbottleneck containing the CBAM convolutional attention mechanism; the CBAMbottleneck is connected after one of the first two convolutions of the C3 module; the CBAMbottleneck comprises a 1x1 convolution layer, a 3x3 convolution layer and a CBAM convolutional block attention mechanism; the features obtained after the 1x1 convolution layer, the 3x3 convolution layer and the CBAM convolutional block attention mechanism are residually connected with the input feature map, and the resulting feature map is convolved with the other branch of the C3 module to obtain the output feature map.
4. The aerial image target detection method based on an attention mechanism according to claim 1, characterized in that the neck structure of the bidirectional weighted fusion pyramid module TWFP comprises an aggregation network structure PAN, bidirectional skip connections are added to the intermediate layers of the PAN, and weight calculation of the input features is added to the weighted fusion part of the PAN, so that high-level semantic information and low-level detail information are fully fused.
5. The aerial image target detection method based on an attention mechanism according to claim 4, characterized in that the calculation of the bidirectional weighted fusion pyramid module TWFP in the neck structure is as follows:
F4' = Conv1(I4)
F3' = Conv1{ELAN[I3 + up(F4')]}
F2' = Conv1{ELAN[I2 + up(F3')]}
F1' = Conv1[I1 + up(F2')]
F1 = GELAN-SW(F1')
F2 = GELAN-SW[F2' + I2 + Conv2(F1)]
F3 = GELAN-SW[F3' + I3 + Conv2(F2)]
F4 = GELAN-SW[F4' + Conv2(F3)]
wherein F1'~F4' denote the feature maps obtained on the bottom-up path, corresponding to levels I1~I4, and F1~F4 denote the feature maps output by the top-down path; + denotes concat weighted feature fusion; up denotes upsampling; Conv1 denotes a 1×1 convolution operation; Conv2 denotes a 3×3 convolution operation; ELAN indicates that the feature map has passed through the ELAN module; GELAN-SW indicates that the feature map has passed through the detection head.
6. The aerial image target detection method based on an attention mechanism according to claim 1, characterized in that the detection head comprises four layers; the first layer contains one detection head combining the generalized efficient layer aggregation network GELAN and the visual Swin Transformer model, and each of the second to fourth layers contains two detection heads combining the generalized efficient layer aggregation network GELAN and the visual Swin Transformer model.
7. An aerial image target detection system based on an attention mechanism, characterized by comprising:
a data acquisition module, configured to acquire aerial images captured by an unmanned aerial vehicle;
a network module construction module, configured to build an aerial image target detection network model, wherein the backbone network in the aerial image target detection network model is a YOLOv7 network comprising a background suppression module BSAM containing an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of a generalized efficient layer aggregation network GELAN and a visual Swin Transformer model;
a target detection module, configured to input the aerial image acquired by the unmanned aerial vehicle into the aerial image target detection network model to obtain the target detection result of the aerial image.
8. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the aerial image target detection method based on an attention mechanism according to any one of claims 1 to 6 are implemented.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the aerial image target detection method based on an attention mechanism according to any one of claims 1 to 6 are implemented.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the steps of the aerial image target detection method based on an attention mechanism according to any one of claims 1 to 6 are implemented.
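For illustration only, the following PyTorch sketches give one possible reading of claim 3 and claim 5; the module names, the channel handling, and the simplifications noted in the comments are assumptions made for this sketch, not the patent's implementation. The first sketch builds a BSAM-style block: a C3-style split into two branches, with a 1x1 convolution, a 3x3 convolution and a CBAM block plus a residual connection on one branch, followed by fusion with the other branch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    # Reduced CBAM stand-in: channel attention followed by spatial attention.
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.ReLU(),
                                 nn.Conv2d(hidden, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)))
        return x * sa

class CBAMBottleneck(nn.Module):
    # 1x1 conv -> 3x3 conv -> CBAM, with a residual connection to the input.
    def __init__(self, channels):
        super().__init__()
        self.cv1 = nn.Conv2d(channels, channels, 1)
        self.cv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.cbam = CBAM(channels)

    def forward(self, x):
        return x + self.cbam(self.cv2(self.cv1(x)))

class BSAMSketch(nn.Module):
    # C3-style block: the CBAM bottleneck follows one of the two input
    # convolutions; the branches are concatenated and fused by a final conv.
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch1 = nn.Conv2d(channels, half, 1)
        self.branch2 = nn.Conv2d(channels, half, 1)
        self.bottleneck = CBAMBottleneck(half)
        self.out = nn.Conv2d(2 * half, channels, 1)

    def forward(self, x):
        y1 = self.bottleneck(self.branch1(x))
        y2 = self.branch2(x)
        return self.out(torch.cat([y1, y2], dim=1))

The second sketch (reusing the imports above) follows the claim 5 equations, assuming equal channel counts at all pyramid levels, nearest-neighbour upsampling for up, a stride-2 3x3 convolution for Conv2, identity placeholders for ELAN and GELAN-SW, and a fast-normalized weighted sum standing in for the concat weighted fusion. For example, TWFPSketch(256) accepts four feature maps of sizes 80x80, 40x40, 20x20 and 10x10 with 256 channels each.

class WeightedFuse(nn.Module):
    # Fast-normalized weighted sum of same-shaped feature maps; stands in for
    # the '+' (concat weighted feature fusion) of claim 5.
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * fi for wi, fi in zip(w, feats))

class TWFPSketch(nn.Module):
    # Upsampling pass producing F4'..F1', then a downsampling pass producing
    # F1..F4, with skip connections from I2 and I3 into the second pass.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(4)])
        self.conv2 = nn.ModuleList([nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                                    for _ in range(3)])
        self.elan = nn.Identity()   # placeholder for the ELAN module
        self.head = nn.Identity()   # placeholder for the GELAN-SW detection head
        self.fuse_td = nn.ModuleList([WeightedFuse(2) for _ in range(3)])
        self.fuse_bu = nn.ModuleList([WeightedFuse(3), WeightedFuse(3), WeightedFuse(2)])

    def forward(self, i1, i2, i3, i4):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
        f4p = self.conv1[3](i4)
        f3p = self.conv1[2](self.elan(self.fuse_td[2]([i3, up(f4p)])))
        f2p = self.conv1[1](self.elan(self.fuse_td[1]([i2, up(f3p)])))
        f1p = self.conv1[0](self.fuse_td[0]([i1, up(f2p)]))
        f1 = self.head(f1p)
        f2 = self.head(self.fuse_bu[0]([f2p, i2, self.conv2[0](f1)]))
        f3 = self.head(self.fuse_bu[1]([f3p, i3, self.conv2[1](f2)]))
        f4 = self.head(self.fuse_bu[2]([f4p, self.conv2[2](f3)]))
        return f1, f2, f3, f4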
CN202411086772.1A | 2024-08-08 | 2024-08-08 | A method and system for aerial image target detection based on attention mechanism and related devices | Pending | CN119107486A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411086772.1A / CN119107486A (en) | 2024-08-08 | 2024-08-08 | A method and system for aerial image target detection based on attention mechanism and related devices

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411086772.1A / CN119107486A (en) | 2024-08-08 | 2024-08-08 | A method and system for aerial image target detection based on attention mechanism and related devices

Publications (1)

Publication Number | Publication Date
CN119107486A | 2024-12-10

Family

ID=93715952

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411086772.1A (Pending, CN119107486A (en)) | A method and system for aerial image target detection based on attention mechanism and related devices | 2024-08-08 | 2024-08-08

Country Status (1)

Country | Link
CN (1) | CN119107486A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119672550A (en) * | 2025-02-21 | 2025-03-21 | 上海理工大学 | Smart city target detection method, device and storage medium based on improved YOLOv5


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
