Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an aviation image target detection method and system based on an attention mechanism, and a related device, which make full use of the complementary advantages of the attention mechanism and the visual Swin Transformer model. Firstly, a background suppression module BSAM containing an attention mechanism is adopted at the tail end of the backbone network to extract features and suppress background noise, so that the problem of weak feature extraction caused by background interference is alleviated. Secondly, an improved neck structure, the bidirectional weighted fusion pyramid module TWFP, replaces the original PAN structure, and multi-scale fusion and improved perception of small targets are achieved through weighted feature fusion and jump connections. Finally, the original detection head containing only convolution layers is replaced by a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, so that local features and global context information are both captured and the global perception capability is improved. These improvements together raise the detection performance of the method.
In order to achieve the above aim, the invention provides the following technical scheme: an aerial image target detection method based on an attention mechanism, in which an aerial image acquired by an unmanned aerial vehicle is input into an aerial image target detection network model to obtain a target detection result of the aerial image;
The backbone network in the aerial image target detection network model is a YOLOv backbone network comprising a background suppression module BSAM with an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model.
Further, the YOLOv backbone network includes 12 layers; the background suppression module BSAM with the attention mechanism is located at the eleventh layer of the backbone network, and the fast spatial pyramid pooling module SPPF is located at the twelfth layer.
Further, the background suppression module BSAM with the attention mechanism is obtained by fusing a C3 module with a CBAMbottleneck containing the CBAM convolution attention mechanism. The CBAMbottleneck is connected to one of the two front convolutions of the C3 module and comprises a 1x1 convolution layer, a 3x3 convolution layer and a CBAM convolution block attention module; the features produced by these three components are combined with the input feature map through a residual connection, and the resulting feature map is convolved with the output of the other front convolution of the C3 module to obtain the output feature map.
Further, the neck structure of the bidirectional weighted fusion pyramid module TWFP is built on the path aggregation network PAN structure: bidirectional jump connections are added in the middle layers of the PAN structure, and weight calculation of the input features is added in the weighted fusion part, so that high-level semantic information and low-level detail information are fully fused.
Further, the calculation of the two-way weighted fusion pyramid module TWFP in the neck structure is as follows:
F'4 = Conv1(I4)
F'3 = Conv1{ELAN[I3 + up(F'4)]}
F'2 = Conv1{ELAN[I2 + up(F'3)]}
F'1 = Conv1[I1 + up(F'2)]
F1 = GELAN-SW(F'1)
F2 = GELAN-SW[F'2 + I2 + Conv2(F1)]
F3 = GELAN-SW[F'3 + I3 + Conv2(F2)]
F4 = GELAN-SW[F'4 + Conv2(F3)]
Wherein F'1~F'4 represent the feature maps obtained from the bottom-up path, corresponding to levels I1~I4, and F1~F4 represent the feature maps output from the top-down path; + represents concat weighted feature fusion, up represents upsampling, Conv1 represents a 1×1 convolution operation, Conv2 represents a 3×3 convolution operation, ELAN indicates that the feature map has passed through the ELAN module, and GELAN-SW indicates that the feature map has passed through the detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model.
Further, the detection head structure comprises four layers, wherein the first layer comprises one detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, and the second to fourth layers each comprise two such detection heads.
The invention also provides an aerial image target detection system based on the attention mechanism, which comprises:
The data acquisition module is used for acquiring aerial images shot by the unmanned aerial vehicle;
The network module construction module is used for building an aerial image target detection network model, wherein the backbone network in the aerial image target detection network model is a YOLOv network comprising a background suppression module BSAM with an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is formed by combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model;
The target detection module is used for inputting the aerial image acquired by the unmanned aerial vehicle into an aerial image target detection network model to obtain a target detection result of the aerial image.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 7 when executing the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 7.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of an aerial image target detection method based on an attention mechanism as claimed in any one of claims 1 to 6.
Compared with the prior art, the invention has at least the following beneficial effects:
The invention provides an aerial image target detection method based on an attention mechanism, which constructs an aerial image target detection network model. A background suppression module BSAM containing an attention mechanism is introduced into the network backbone structure to extract image features, effectively suppressing the interference of complex backgrounds in aerial images, making target features more prominent and improving target detection accuracy. Secondly, the neck structure adopts a bidirectional weighted fusion pyramid module TWFP in place of the traditional PAN structure, and full fusion of high-level semantic information and low-level detail information is realized through bidirectional jump connections and a weighted fusion strategy, improving the detection capability of the model for targets of different scales. Finally, the detection head structure combines the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model, so that the local feature extraction capability of the convolutional neural network is retained while the global context perception capability of the Transformer is introduced, enabling the model to locate and classify targets more accurately during detection.
Furthermore, the backbone network of the aerial image target detection network model adopts the YOLOv framework combined with the BSAM module, improving the feature extraction capability; the neck structure realizes effective fusion of multi-scale features through the TWFP module; and the detection head improves detection precision through the combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model. Each module in the model can be adjusted and optimized according to specific task requirements to adapt to different application scenes and data sets. By introducing the attention mechanism and an advanced network structure, the model achieves higher detection speed while maintaining high precision, and is suitable for application scenes with high real-time requirements.
The improved network structure of the invention can bring about improvement of accuracy and stability, which means that accurate detection results can be obtained more reliably. After the algorithm can obtain an accurate detection result, more accurate decision making and planning can be performed in the fields of public safety, urban planning, traffic monitoring and the like. In the public safety field, the detection method can monitor the road surface condition in real time, discover abnormal events and potential threats in time, and provide powerful guarantee for the safety and stability of cities. In city planning, by detecting targets such as buildings, roads, greenbelts and the like in a city, accurate data support can be provided for city planning, and city layout optimization and city management level improvement are facilitated. In the traffic monitoring field, the detection method can rapidly and accurately identify targets such as vehicles, pedestrians and the like, is beneficial to realizing real-time monitoring and intelligent scheduling of traffic flow and improves traffic running efficiency. Therefore, our improved method is of great importance for improving the decision-making effect and social management level in practical applications.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
The invention provides an aerial image target detection method based on an attention mechanism, which can more accurately detect targets in complex aerial images by fusing the attention mechanism and the visual Swin Transformer model. The method specifically comprises the following steps:
Step 1, training with two public data sets as input images;
The images in two public datasets are taken as inputs. These datasets are VisDrone and UAVDT. VisDrone2019 consists of 7,019 high-resolution aerial images (2,000×1,500) covering 10 categories; this dataset is characterized by small object sizes, dense distribution, and possible partial occlusion. The UAVDT dataset (1,024×540) is designed for vehicle detection tasks from the unmanned aerial vehicle (UAV) perspective and contains images taken by unmanned aerial vehicles at different heights and angles on urban roads. Each dataset has different scenes, densities and image qualities, ensuring that the model has good generalization performance in various environments.
The specific implementation process of the embodiment is described as follows:
The datasets first need to be converted to YOLO format to meet the experimental requirements. Training was performed using 6,471 images of the VisDrone dataset and validation using 548 images, with the dataset image input set to 1536×1536. The UAVDT dataset consists of 50 videos containing 40,376 images in total; 24,778 images were used for training and 15,598 images for testing, all with a resolution of 1024×540. All images from the same video are grouped into either the training set or the test set according to the requirements of the dataset: of the 50 videos, the images of 31 videos were placed in the training set and those of the remaining 19 videos in the test set. By using these two datasets, the invention can comprehensively cover different scenes and target categories, and the performance of the target detection algorithm can be evaluated and compared across datasets.
Step 2, constructing an aerial image target detection network ASA-YOLO
The object detection network adopts the YOLOv network. Most objects in aerial images are small, so when the backbone network performs deep feature extraction, small objects often lose a large amount of semantic information and are more easily affected by the image background. Therefore, the background suppression module BSAM integrating an attention mechanism is added at the tail end of the backbone of the YOLOv network to extract the feature information of objects in the image, provide rich information for subsequent stages, and reduce the interference of complex backgrounds on small-object features at the tail end of the backbone. In the neck architecture of the YOLOv network, the bidirectional weighted fusion pyramid module TWFP replaces the original PAN structure, better ensuring that features of different levels communicate with each other and enriching the feature representation. In the YOLOv detection head, the original head containing only convolution layers is replaced by a combination of the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer module: the GELAN extracts local information of the target, and the visual Swin Transformer module then captures the global context information of the image, so that the combination of the two effectively improves detection precision.
Overall, the target detection network of the invention improves accuracy and stability, which means that accurate detection results can be obtained more reliably. The method can therefore support more accurate decisions and plans in fields such as public safety, urban planning and traffic monitoring, and is of great significance for improving decision-making effectiveness and social management in practical applications.
Step 2.1 building a backbone network
As shown in fig. 1, the backbone network is the part of the whole target detection network responsible for feature extraction and adopts the backbone of the YOLOv network, wherein the eleventh layer is the background suppression module BSAM with the fused attention mechanism and the twelfth layer is the fast spatial pyramid pooling module SPPF; the width and depth of the backbone network are adjusted to make the whole target detection network more efficient.
Specifically, the first ten layers adopt the backbone of the YOLOv network. This backbone is an improved version based on CSPDarknet, in which the CSPNet structure and further optimizations improve the feature extraction capability and computational efficiency of the network. This makes YOLOv perform excellently in object detection tasks and an important version among modern object detection models.
At the eleventh layer of the backbone network, the invention adopts the background suppression module BSAM integrating the attention mechanism. This module incorporates the CBAM convolution attention mechanism, so that the feature information of small targets is preserved during downsampling while irrelevant background noise in the image is suppressed; channel and position information are both taken into account, making the module flexible and lightweight enough to improve detection precision at low computational cost. The fast spatial pyramid pooling module SPPF is then connected after the eleventh layer; it extracts and fuses high-level features, applying 3x3 max-pooling operations multiple times in the fusion process to extract as much high-level semantic information as possible, improving computational efficiency while maintaining the effect of multi-scale feature fusion. The feature maps of layers 5, 7, 9 and 12 of the backbone network are passed into the neck architecture, providing strong support for the subsequent multi-layer feature fusion.
The specific implementation process of the embodiment is described as follows:
In the invention, the backbone network consists of 12 layers. The input image is a given aerial image I ∈ R^(W×H×3), where W represents the width of the image, H represents the height of the image, and the channel size 3 indicates that the image has three channels, typically in RGB color mode. The first layer is a convolution layer with a kernel size of 3 and a stride of 1; the second layer is a convolution layer with a kernel size of 3 and a stride of 2; the third layer is a convolution layer with a kernel size of 3 and a stride of 1; the fourth layer is a convolution layer with a kernel size of 3 and a stride of 2. The fifth layer is an ELAN module formed by convolution and residual modules, and the sixth layer is a convolution layer with a kernel size of 3 and a stride of 2. Layers 7 to 10 repeat the fifth and sixth layers.
The ELAN module passes the input feature map through two separate 1x1 convolution layers to reduce the channel number and improve computational efficiency, then processes and extracts features layer by layer through several 3x3 convolution layers, finally selects specific feature maps through an index list, concatenates them together, and fuses them through one 1x1 convolution layer to obtain the final output. This design enhances the feature representation capability while maintaining computational efficiency, so that the model can better extract and fuse features when handling complex tasks.
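A minimal PyTorch sketch of an ELAN-style block following the description above is given below; the channel widths, the number of 3x3 layers and the choice of branches kept for concatenation are illustrative assumptions, not the patented implementation.

import torch
import torch.nn as nn

class ELANSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, c_mid: int = 64):
        super().__init__()
        # Two parallel 1x1 convolutions reduce the channel count.
        self.cv1 = nn.Conv2d(c_in, c_mid, 1, 1)
        self.cv2 = nn.Conv2d(c_in, c_mid, 1, 1)
        # Stacked 3x3 convolutions process the features layer by layer.
        self.cv3 = nn.Conv2d(c_mid, c_mid, 3, 1, 1)
        self.cv4 = nn.Conv2d(c_mid, c_mid, 3, 1, 1)
        self.cv5 = nn.Conv2d(c_mid, c_mid, 3, 1, 1)
        self.cv6 = nn.Conv2d(c_mid, c_mid, 3, 1, 1)
        # Final 1x1 convolution fuses the selected (concatenated) branches.
        self.fuse = nn.Conv2d(4 * c_mid, c_out, 1, 1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.act(self.cv1(x))
        y2 = self.act(self.cv2(x))
        y3 = self.act(self.cv4(self.act(self.cv3(y2))))
        y4 = self.act(self.cv6(self.act(self.cv5(y3))))
        # Select specific branch outputs and concatenate along the channel dimension.
        out = torch.cat([y1, y2, y3, y4], dim=1)
        return self.act(self.fuse(out))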
The eleventh layer is the BSAM module incorporating the CBAM convolution attention mechanism, as shown in fig. 2. The background suppression module BSAM with the attention mechanism is formed by fusing a C3 module with a CBAMbottleneck module containing the CBAM convolution attention mechanism. The C3 module comprises three convolution operations: the first two independently process the input features, and the third combines the outputs of the two paths. The CBAMbottleneck is connected behind one of the first two convolutions of the C3 module and comprises two convolution layers, a CBAM convolution block attention module and a residual connection shortcut.
In the BSAM module, the input feature map passes through the two 1x1 convolution layers of the C3 module to reduce the channel number; the CBAMbottleneck further processes the features of one of these two branches, the processed feature map is concatenated with the output of the other branch, and finally the concatenated feature map is converted to the required number of output channels through one 1x1 convolution layer. This design reduces the amount of computation while preserving the feature expression capability.
The CBAMbottleneck first reduces the number of channels through a 1x1 convolution layer and then increases the number of channels through a 3x3 convolution layer before entering the CBAM convolution block attention mechanism. As shown in fig. 3, the CBAM attention mechanism consists of two parts, a channel attention module and a spatial attention module. After the feature map is input into the CBAM attention mechanism, it first enters the channel attention module, where global pooling and a multi-layer perceptron are applied; the two outputs are added to generate a channel attention map, which is then multiplied element-wise, channel by channel, with the original feature map to obtain a weighted feature map F'. The weighted feature map then enters the spatial attention module, where pooling and convolution operations are performed and the spatial attention map is multiplied element-wise, position by position, with the weighted feature map F' to obtain the final weighted feature map F''. The final output F'' is connected with the input feature map of the CBAMbottleneck by a residual connection. The CBAMbottleneck combined with the CBAM attention mechanism reduces the interference of the background on the target and, through the residual connection, more effectively promotes gradient flow and convergence speed.
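The following PyTorch sketch illustrates the CBAM attention and the CBAMbottleneck residual block described above; the reduction ratio, spatial kernel size and class names are assumptions made for clarity, not the exact patented module.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1, bias=False),
        )

    def forward(self, x):
        # Global average and max pooling, shared MLP, then sum -> channel attention map.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel-wise average and max maps, concatenated and convolved -> spatial attention map.
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAMBottleneck(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c // 2, 1)              # 1x1 conv reduces channels
        self.cv2 = nn.Conv2d(c // 2, c, 3, padding=1)   # 3x3 conv restores channels
        self.ca = ChannelAttention(c)
        self.sa = SpatialAttention()

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        y = y * self.ca(y)   # weighted feature map F'
        y = y * self.sa(y)   # weighted feature map F''
        return y + x         # residual connection with the block input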
The twelfth layer is the fast spatial pyramid pooling module SPPF.
The fast spatial pyramid pooling module SPPF first applies a 1x1 convolution to the input, then performs three successive max-pooling operations, concatenates the original feature map with the three pooled results, and finally reduces the number of channels of the concatenated feature map to the desired output channel number through another 1x1 convolution layer. This design enhances the feature representation capability while ensuring computational efficiency, so that the model performs better when handling targets of different scales.
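A minimal PyTorch sketch of such an SPPF module is shown below; the hidden channel width and the pooling kernel size are assumptions, since the text above does not fix them.

import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)           # 1x1 conv before pooling
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(4 * c_hid, c_out, 1)      # 1x1 conv after concatenation

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)          # three successive max-pooling operations
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        # Concatenate the original map with the three pooled results, then fuse.
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))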
The input image is 1536×1536×3 and becomes a 48×48×1024 output after passing through the twelve layers of the backbone network. The backbone network feature extraction process is expressed as:
Fc=Czgwl(I) (1)
Wherein Fc denotes the features derived from the CNN, Czgwl denotes feature extraction by the backbone network, and I denotes the input image.
Step 2.2, constructing a neck structure of the bidirectional weighted fusion pyramid module TWFP:
As shown in fig. 4, the bidirectional weighted fusion pyramid module TWFP improves the whole neck by adopting bidirectional weighted fusion and jump connections, better realizes multi-scale fusion of feature maps in the feature fusion stage, and provides more accurate predictions for the subsequent target detection task;
The concrete improvements are as follows:
The path aggregation network PAN structure adopts bidirectional connection, comprising top-down and bottom-up feature propagation, which ensures that features at different levels can communicate with each other and enriches the feature representation. The weighted bidirectional feature pyramid network BiFPN introduces a learnable weighting mechanism and jump connections, and can perform weighted averaging and fusion on features of different scales, thereby utilizing feature information more effectively. Influenced by both the original path aggregation network PAN structure and the weighted bidirectional feature pyramid network BiFPN structure, some structures of the two are retained and fused with improvements to obtain the bidirectional weighted fusion pyramid module TWFP.
The specific implementation process of the embodiment is described as follows:
As shown in fig. 4, which depicts the TWFP structure, the model takes as input the four feature maps extracted from layers 5, 7, 9 and 12 of the backbone network, performs bottom-up feature propagation from F'4 to F'1, and then performs top-down feature propagation from F1 to F4. Bidirectional jump connections are added to the two middle layers of the original PAN structure; these jump connections allow information to be transferred directly from earlier layers to later layers, bypassing the intermediate layers for feature map fusion. In this way, the semantic information of the upper layers and the detail information of the lower layers can be fully fused. This multi-scale feature fusion enhances the network's ability to handle targets of different sizes and improves the accuracy of target detection.
The feature maps are weighted and fused after the bidirectional jump connections, learning the importance of different input features and fusing them. Conventional feature fusion tends to simply superpose or add feature maps after adjusting them to the same resolution, treating all inputs equally without distinction. However, different input feature maps have different resolutions and contribute differently to the fused output, so simply adding or superposing them is not optimal. An additional weight is therefore added to each input so that the network learns the importance of each input feature.
Taking F'2 and F2 as examples, the weighted fusion is calculated as follows:
Where I2 represents the backbone network output, F'2 the bottom-up feature map and F2 the top-down feature map, ω represents the weights, ε represents a small constant, resize represents a resolution-matching up-sampling or down-sampling operation, and Conv represents a convolution operation for feature processing.
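A sketch of this weighted fusion, written in the style of the BiFPN fast normalized fusion and using the symbols just defined, is given below; the exact weighting form is an assumption rather than the patented formula.

F'2 = Conv[(ω1·I2 + ω2·resize(F'3)) / (ω1 + ω2 + ε)]
F2 = Conv[(ω'1·I2 + ω'2·F'2 + ω'3·resize(F1)) / (ω'1 + ω'2 + ω'3 + ε)]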
The feature maps fed from the backbone are 384×384×256, 192×192×512, 96×96×1024 and 48×48×1024, where the feature map of I1 has the highest resolution and each subsequent resolution is halved. In order to enhance the framework's ability to fuse multi-scale features while limiting its complexity, the invention chooses in the actual experiments to add the bidirectional jump connections between the two middle layers of the module, which enables feature information of different scales to be captured efficiently. The final outputs are F1, F2, F3 and F4. The whole fusion process is as follows:
The I4 feature map undergoes a convolution operation with a 1×1 kernel and a stride of 1 to obtain the F'4 feature map.
The F'4 feature map is upsampled by nearest-neighbor interpolation to increase its resolution, then concat weighted feature fusion is performed with the I3 feature map; the fused feature map enters an ELAN convolution module (reducing the channel number) for feature processing and then undergoes a convolution operation with a 1×1 kernel and a stride of 1 to obtain the F'3 feature map.
The F'3 feature map is upsampled by nearest-neighbor interpolation to increase its resolution, then concat weighted feature fusion is performed with the I2 feature map; the fused feature map enters an ELAN convolution module (reducing the channel number) for feature processing and then undergoes a convolution operation with a 1×1 kernel and a stride of 1 to obtain the F'2 feature map.
The F'2 feature map is upsampled by nearest-neighbor interpolation to increase its resolution, then concat weighted feature fusion is performed with the I1 feature map; the fused feature map undergoes a convolution operation with a 1×1 kernel and a stride of 1 to obtain the F'1 feature map.
Local and global feature extraction is performed on the F'1 feature map through the generalized high-efficiency layer aggregation network GELAN + visual Swin Transformer model detection head to obtain the F1 feature map.
After the F1 feature map undergoes a convolution operation with a 3×3 kernel and a stride of 2, concat weighted feature fusion is performed on I2, F'2 and F1 (where I2 skips the middle layer and is fused directly with F'2 and F1); the result then undergoes a convolution operation with a 1×1 kernel and a stride of 1 and is finally input into the generalized high-efficiency layer aggregation network GELAN + visual Swin Transformer model detection head for local and global feature extraction, obtaining the F2 feature map.
After the F2 feature map undergoes a convolution operation with a 3×3 kernel and a stride of 2, concat weighted feature fusion is performed on F'3, I3 and F2 (where I3 skips the middle layer and is fused directly with F'3 and F2); the result then undergoes a convolution operation with a 1×1 kernel and a stride of 1 and is finally input into the generalized high-efficiency layer aggregation network GELAN + visual Swin Transformer model detection head for local and global feature extraction, obtaining the F3 feature map.
After the F3 feature map undergoes a convolution operation with a 3×3 kernel and a stride of 2, concat weighted feature fusion is performed on F3 and F'4; the result is finally input into the generalized high-efficiency layer aggregation network GELAN + visual Swin Transformer model detection head for local and global feature extraction, obtaining the F4 feature map.
F'4 = Conv1(I4) (4)
F'3 = Conv1{ELAN[I3 + up(F'4)]} (5)
F'2 = Conv1{ELAN[I2 + up(F'3)]} (6)
F'1 = Conv1[I1 + up(F'2)] (7)
F1 = GELAN-SW(F'1) (8)
F2 = GELAN-SW[F'2 + I2 + Conv2(F1)] (9)
F3 = GELAN-SW[F'3 + I3 + Conv2(F2)] (10)
F4 = GELAN-SW[F'4 + Conv2(F3)] (11)
Where F'1~F'4 represent the feature maps obtained from the bottom-up path, corresponding to stages I1~I4, and F1~F4 represent the feature maps output from the top-down path. + means that concat weighted feature fusion is performed, up means up-sampling, Conv1 means a 1×1 convolution operation and Conv2 a 3×3 convolution operation. ELAN indicates that the feature map has passed through the ELAN module, and GELAN-SW indicates that the feature map has passed through the detection head combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model.
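The overall data flow of equations (4)-(11) can be sketched as a small Python function; elan, gelan_sw, conv1, conv2, up and wfuse stand for the ELAN block, the GELAN + Swin Transformer head, the 1×1 and 3×3 convolutions, nearest-neighbor upsampling and weighted concat fusion, and are assumed callables rather than the patented implementations.

def twfp_forward(i1, i2, i3, i4, conv1, conv2, up, elan, gelan_sw, wfuse):
    # Bottom-up path (equations 4-7).
    f4p = conv1(i4)
    f3p = conv1(elan(wfuse(i3, up(f4p))))
    f2p = conv1(elan(wfuse(i2, up(f3p))))
    f1p = conv1(wfuse(i1, up(f2p)))
    # Top-down path with bidirectional skip connections (equations 8-11).
    f1 = gelan_sw(f1p)
    f2 = gelan_sw(wfuse(f2p, i2, conv2(f1)))   # I2 skips the middle layer
    f3 = gelan_sw(wfuse(f3p, i3, conv2(f2)))   # I3 skips the middle layer
    f4 = gelan_sw(wfuse(f4p, conv2(f3)))
    return f1, f2, f3, f4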
Step 2.3, constructing a detection head structure
The structure of the new detection head designed by the invention is shown in fig. 5. The detection head in the model has four layers: the first layer contains one generalized high-efficiency layer aggregation network GELAN + visual Swin Transformer model detection head, and the second to fourth layers each contain two such detection heads, stacked twice in order to produce the output. The multi-scale fused feature maps from the neck are input into the generalized high-efficiency layer aggregation network GELAN, which extracts the local information of the features in the feature map. The GELAN structure is shown in fig. 6; by combining CSPNet with the characteristics of the ELAN convolution module, it improves the efficiency and effectiveness of the model when processing complex data patterns. The extracted feature map is then passed into the visual Swin Transformer model to extract global features. The globally extracted feature map is passed to the detection module to generate a tensor containing bounding-box coordinate information, confidence and category classification. The generated tensor of each detection layer is then decoded, and the decoded bounding boxes, confidences and category predictions of the four detection layers are integrated to obtain the final detection result.
The specific implementation process of the embodiment is described as follows:
In the invention, after the neck performs multi-scale feature fusion, feature maps of 4 different scales are used as the input of the detection head and enter the generalized high-efficiency layer aggregation network GELAN for processing. The GELAN extracts features through multiple convolution layers and RepNCSP modules. The input feature map first passes through a 1x1 convolution layer that reduces the number of input channels, then chunk is used to split the output into two tensors along the channel dimension, which pass through a RepNCSP module and a 3x3 convolution layer respectively. Finally, all the feature maps are concatenated and output. This module design enhances the feature extraction capability while ensuring computational efficiency. The feature map output after this computation is passed into the visual Swin Transformer model.
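The sketch below illustrates this split-and-fuse structure in PyTorch; the RepNCSP branch is replaced by a plain convolutional stand-in and the channel widths are assumptions, so this is only a sketch of the described pattern, not the patented block.

import torch
import torch.nn as nn

class GELANSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int, c_hid: int = 128):
        super().__init__()
        self.cv_in = nn.Conv2d(c_in, c_hid, 1)              # 1x1 conv reduces channels
        branch_c = c_hid // 2
        # Stand-in for the RepNCSP module followed by a 3x3 convolution.
        self.branch = nn.Sequential(
            nn.Conv2d(branch_c, branch_c, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(branch_c, branch_c, 3, padding=1),
            nn.SiLU(),
        )
        self.cv_out = nn.Conv2d(c_hid + branch_c, c_out, 1)  # fuse concatenated maps

    def forward(self, x):
        y = self.cv_in(x)
        y1, y2 = torch.chunk(y, 2, dim=1)   # split along the channel dimension
        y3 = self.branch(y2)                # processed branch
        # Concatenate all feature maps and fuse with a 1x1 convolution.
        return self.cv_out(torch.cat([y1, y2, y3], dim=1))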
The feature map entering the visual Swin Transformer model is expressed as H × W × C, where W represents the width of the image, H represents the height of the image, and C is the number of channels. The input feature map is partitioned by Patch Partition into a series of fixed-size patches, each patch having dimensions P × P × C, where P is the side length of the patch. This process converts the input image from a two-dimensional structure into a one-dimensional sequence of patch tokens. The patch tokens are then mapped to a high-dimensional space through the fully connected layer Linear Embedding by a linear transformation, resulting in the Patch Embedding.
Patch Embedding=Linear(Flatten(Patch)) (12)
The generated Patch Embedding enters Layer Normalization, which is applied to the input of each sub-layer in the visual Swin Transformer model and mainly serves to improve convergence speed and robustness. The capture of context information and spatial information in the image is then improved by introducing the window multi-head self-attention (W-MHSA) module and the shifted-window multi-head self-attention (SW-MHSA) module. The W-MHSA module adopts a window attention mechanism that divides the input Patch Embedding into several windows and models local information effectively by computing attention weights within each window. The SW-MHSA module uses a shifted-window attention mechanism that also considers the relations between adjacent windows, capturing information across window boundaries. The MLP multi-layer perceptron is used in the last part of each visual Swin Transformer layer to perform a non-linear transformation; its non-linear transformation capability enables the network to learn a richer feature representation. The calculation process is as follows:
F'=W-MHSA(LN(F))+F (13)
F'=MLP(LN(F'))+F' (14)
F"=SW-MHSA(LN(F'))+F' (15)
F"=MLP(LN(F"))+F" (16)
The feature map output after processing by the visual Swin Transformer model then enters the Detect module to generate tensors containing the bounding-box coordinates, confidence and category classification information. For the prediction of each detection layer, the bounding boxes need to be decoded. The predictions in the present invention are given in terms of center coordinates, width and height, and this relative position and scale information typically needs to be converted into actual image coordinates; the confidence and class predictions are decoded through a sigmoid function. The decoded bounding boxes, confidences and category predictions of the four detection layers are integrated to obtain the final detection result.
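A minimal sketch of decoding one detection layer in the manner described above is given below; the grid and stride handling and the exponential width/height transform are assumptions, not the exact patented decoder.

import torch

def decode_layer(pred: torch.Tensor, stride: int) -> torch.Tensor:
    # pred: (B, H, W, 5 + num_classes) laid out as [cx, cy, w, h, obj, classes...]
    B, H, W, _ = pred.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()           # cell offsets
    cxcy = (pred[..., 0:2].sigmoid() + grid) * stride      # center in image pixels
    wh = pred[..., 2:4].exp() * stride                     # width/height in image pixels
    conf = pred[..., 4:5].sigmoid()                        # objectness confidence
    cls = pred[..., 5:].sigmoid()                          # per-class probabilities
    return torch.cat([cxcy, wh, conf, cls], dim=-1)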
Step 3, training the output result
The experiments were carried out on an NVIDIA A800-SXM4 graphics card with 80GB of video memory and a 16-core CPU. The experimental software was configured with PyTorch 2.2.1 and CUDA 12.0. The training phase comprised 200 epochs, with the first 2 epochs used as warm-up. The initial learning rate was set to 3.2×10-4 and decayed to 0.12 times its value by the final epoch. Parameter optimization adopted the Adam optimization algorithm. The performance of the target detection network ASA-YOLO was optimized by minimizing the loss function. The target detection network ASA-YOLO was trained iteratively, and the model was continuously optimized by back-propagation and gradient descent according to the error computed in each forward pass, finally obtaining a trained network model. The test set was then evaluated with the trained target detection network model to output the test results.
The specific implementation process of the embodiment is described as follows:
The invention concerns the experimental configuration and optimization strategy of the ASA-YOLO model on the PyTorch platform. The invention leverages a hardware environment equipped with an 80GB NVIDIA A800-SXM4 GPU to train and evaluate the model, and uses publicly accessible datasets for the target detection task. The evaluation indexes of the invention mainly comprise the average precision AP and the AP50, i.e. the average precision computed at an IoU threshold of 0.5.
The ASA-YOLO model is composed of three blocks: the backbone network extracts features, the neck performs multi-scale fusion, and the detection head outputs the results. The backbone network extracts the feature information of targets in the image and introduces the background suppression module BSAM containing an attention mechanism, reducing the interference of complex backgrounds on small-target features at the tail end of the backbone. The improved weighted bidirectional feature fusion architecture TWFP is adopted in the neck structure to perform multi-scale fusion, better ensuring that features of different levels communicate with each other and enriching the feature representation. The detection head combines the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model: the GELAN extracts the local information of the target, and the visual Swin Transformer model then captures the global context information of the image. The effective combination of the three improves detection precision.
During training, the present invention uses 1536×1536 inputs for the VisDrone dataset and 1024×1024 inputs for the UAVDT dataset, and sets the batch size to 4 according to the computational power of the GPU. The invention selects the Adam optimizer for model training, sets the learning rate to 3.2×10-4 and decays it to 0.12 times its initial value, so as to minimize the loss function and optimize network performance. These experimental configuration and optimization strategies aim to ensure good performance and reliability of the model in the target detection task. Fig. 7 shows a visual comparison of our results with YOLOv.
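The optimizer and learning-rate schedule described above can be sketched as follows; the linear decay profile and the helper names are assumptions, since only the initial rate, the final 0.12x factor and the 200-epoch budget are stated.

import torch

def build_optimizer_and_scheduler(model: torch.nn.Module, epochs: int = 200):
    optimizer = torch.optim.Adam(model.parameters(), lr=3.2e-4)
    # Linearly decay the learning-rate multiplier from 1.0 to 0.12 by the final epoch.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda e: 1.0 - (1.0 - 0.12) * e / max(epochs - 1, 1)
    )
    return optimizer, scheduler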
We carried out a number of experiments on 2 published datasets, VisDrone and UAVDT, confirming that the adopted method gives excellent results.
TABLE 1
TABLE 2
Experiments were performed on the VisDrone dataset. In order to evaluate the performance of ASA-YOLO, the invention compared its experimental results with those of eleven other state-of-the-art methods on the VisDrone dataset, as shown in table 1. The average precision at IoU = 0.5 (AP50) of the ASA-YOLO of the present invention is 64.3%, the highest value for this index, while its average precision (AP) reaches 41.8%, only 0.3% less than the best SOTA. Compared with the baseline approach, ASA-YOLO shows a significant improvement in AP50 (from 61.5% to 64.3%) and AP (from 39.6% to 41.8%). The components significantly improve the feature extraction capability for detecting targets of various scales in images captured by the unmanned aerial vehicle, while reducing false alarms. In order to capture comprehensive feature information of small targets in complex scenes, the depth and width of the network are increased; in terms of model complexity, the computational complexity of the model is 551.0 GFLOPs. The computational cost of TPH-YOLOv reaches 556.6 GFLOPs, so the computation of the present method is lower than that of TPH-YOLOv. When evaluated with the AP50 index, the method is 1.2% higher than TPH-YOLOv, while the AP is 0.2% lower, but the AP50 of the present invention achieves SOTA performance, highlighting its superior accuracy. The experimental comparison shows that slightly increasing the complexity can significantly improve performance.
To further verify the method of the invention, experiments were performed on the UAVDT dataset, as shown in table 2. Compared with the prior art, the ASA-YOLO of the invention achieves the best SOTA effect on AP50, while its AP is only 0.2% less than that of SMFF-YOLO. In conclusion, the method shows good detection precision for small targets and dense scenes, and overcomes to a certain extent the challenge of target detection in complex scenes.
The invention also provides an aerial image target detection system based on the attention mechanism, which can be used for realizing the aerial image target detection method based on the attention mechanism, and specifically comprises a data acquisition module, a network module construction module and a target detection module, wherein:
The data acquisition module is used for acquiring aerial images shot by the unmanned aerial vehicle;
The network module construction module is used for building an aerial image target detection network model, wherein the backbone network in the aerial image target detection network model is a YOLOv network comprising a background suppression module BSAM with an attention mechanism, the neck structure is a bidirectional weighted fusion pyramid module TWFP, and the detection head structure is formed by combining the generalized high-efficiency layer aggregation network GELAN and the visual Swin Transformer model;
The target detection module is used for inputting the aerial image acquired by the unmanned aerial vehicle into an aerial image target detection network model to obtain a target detection result of the aerial image.
An embodiment of the invention provides a terminal device comprising a processor, a memory and a computer program stored in the memory and executable on the processor. The processor performs the steps in the method for detecting the aerial image target based on the attention mechanism when executing the computer program, or performs the functions of each module in the aerial image target detection system based on the attention mechanism when executing the computer program.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention.
The terminal equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The terminal device may include, but is not limited to, a processor, a memory.
The processor may be a central processing unit (Central Processing Unit, CPU), but may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory may be used to store the computer program and/or module, and the processor may implement various functions of the terminal device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory.
The modules/units integrated in the terminal device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products.
Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by instructing related hardware by a computer program, where the computer program may be stored in a computer readable storage medium, and the computer program may implement the steps of the method for detecting an aerial image target based on an attention mechanism when executed by a processor. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc.
The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium can be appropriately adjusted according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
In summary, the aerial image target detection method based on the attention mechanism is performed with an end-to-end single-stage target detection algorithm. The aerial image target detection network model addresses the problems encountered at the tail end of the backbone, in the neck structure and in the detection head. Compared with two-stage detection methods, the method has the advantages of high detection speed, high precision and high efficiency, and its precision is higher than that of traditional target detection methods and traditional two-stage detection methods such as ClusDet and DMNet.