Disclosure of Invention
The invention provides a target detection method for large-size aerial remote sensing images, aimed at the problem that context information is lost when a large-size aerial remote sensing image is segmented into sub-images.
The invention relates to a target detection method of a large-size aerial remote sensing image, which comprises the following steps,
original remote sensing images form an original remote sensing image set, and each original remote sensing image is configured with an original image name;
segmenting each original remote sensing image into sub-images with preset sizes according to a preset overlapping rate, configuring a sub-image name for each sub-image, wherein the sub-image name comprises an original image name, a sub-image serial number and initial position information of the sub-image on the original remote sensing image;
searching in an original remote sensing image set according to a sub-image name of a current sub-image to be detected to obtain a corresponding current original remote sensing image;
performing feature extraction on the current sub-image to be detected with a local feature extractor to obtain a group of multi-scale sub-image feature maps;
carrying out down-sampling on a current original remote sensing image, and then carrying out feature extraction by adopting a global feature extractor to obtain a group of multi-scale original image feature maps;
performing feature fusion on the multi-scale sub-image feature map and the multi-scale original image feature map by adopting a feature pyramid network of a global local coupling mechanism to obtain a fused feature map;
generating candidate frames from the fused feature maps through a first RPN, aligning the features with a first RoIAlign, changing feature dimensions through an FC fully connected layer to obtain a one-dimensional fused feature map, and performing classification and regression operations on the one-dimensional fused feature map to obtain a fused target prediction result;
performing multi-scale transformation on the group of multi-scale original image feature maps, generating candidate frames through a second RPN and filtering them, performing feature alignment through a second RoIAlign, changing feature dimensions through an FC fully connected layer to obtain a one-dimensional global feature map, classifying and regressing the one-dimensional global feature map respectively, and performing NLS filtering to obtain a filtered global prediction result;
and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detection target.
According to the target detection method of the large-size aerial remote sensing image, the group of multi-scale sub-image feature maps output by the local feature extractor comprises four sub-image feature maps with different scales;
the group of multi-scale original image feature maps output by the global feature extractor comprises four original image feature maps with different scales;
the feature pyramid network of the global local coupling mechanism performs interactive fusion in a top-down manner, transforming the four different-scale sub-image feature maps and the four different-scale original image feature maps into five transformed multi-scale sub-image feature maps and five transformed multi-scale original image feature maps, respectively; and the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps are fused according to corresponding scales to obtain a multi-scale fused feature map.
According to the target detection method of the large-size aerial remote sensing image, the method for fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales comprises the steps of splicing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales, and then reducing the dimensions.
According to the target detection method of the large-size aerial remote sensing image, the process of obtaining the fused feature map comprises the following steps:
selecting the middle level of the five transformed multi-scale sub-image feature maps to obtain a selected sub-image feature map;
selecting the middle level of the five transformed multi-scale original image feature maps to obtain a selected original image feature map;
and cutting the designated area corresponding to the selected sub-image feature map in the selected original image feature map, up-sampling to the size of the original image feature map, and then performing feature fusion with the selected sub-image feature map to obtain a fused feature map.
According to the target detection method of the large-size aerial remote sensing image, both the local feature extractor and the global feature extractor adopt ResNet-50 as feature extractors.
According to the target detection method of the large-size aerial remote sensing image, in the feature pyramid network of the global local coupling mechanism, a mapping function is adopted to select the feature map level according to the size of a target candidate frame; the mapping function is:
in the formula, k is the level number corresponding to the target candidate frame, and w and h are the width and height of the target candidate frame.
According to the target detection method of the large-size aerial remote sensing image, the local feature extractor and the global feature extractor perform feature extraction in the same manner;
the local feature extractor first performs feature extraction on the current sub-image to be detected at 5 depths, with feature dimensions increasing step by step, using a convolutional neural network, and then applies 1 × 1 convolution to the sub-image features extracted at the last four depths to obtain four sub-image feature maps with different scales.
According to the target detection method of the large-size aerial remote sensing image, in the feature pyramid network of the global local coupling mechanism, the four different-scale sub-image feature maps are transformed into five transformed multi-scale sub-image feature maps in the same manner as the four different-scale original image feature maps are transformed into five transformed multi-scale original image feature maps;
3 × 3 convolution and maximum pooling are applied to the four different-scale sub-image feature maps to obtain the five transformed multi-scale sub-image feature maps.
According to the target detection method of the large-size aerial remote sensing image, the preset overlapping rate comprises 200 pixels of overlapping.
According to the target detection method of the large-size aerial remote sensing image, the sub-image name format is as follows:
original image name _ subgraph sequence number _ subgraph start abscissa _ subgraph start ordinate.
The invention has the beneficial effects that: the method can be used for target detection of aerial images, solves, through the fusion model, the problem that oversized targets are lost and cannot be detected, and can effectively improve the target detection precision of large-size aerial remote sensing images.
The invention adopts a PGL parallel feature extraction module, consisting of a local feature extractor and a global feature extractor running in parallel, to obtain global-image and sub-image information. A feature pyramid module GL-FPN with a global local coupling mechanism is designed on the basis of the feature pyramid FPN structure; by synchronously acquiring information from the sub-image and the original remote sensing image, it significantly alleviates the loss of context information caused by segmenting a large-size image. In addition, the method designs an additional global detector for detecting extra-large targets on the thumbnail of the global image, further ensuring the reliability of target detection.
The invention can be integrated as a module into common target detection models and therefore has a wide range of application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
In a first embodiment, referring to fig. 1 to 3, the present invention provides a target detection method for large-size aerial remote sensing images, which is characterized by comprising,
original remote sensing images form an original remote sensing image set, and each original remote sensing image is configured with an original image name;
segmenting each original remote sensing image into sub-images with preset sizes according to a preset overlapping rate, configuring a sub-image name for each sub-image, wherein the sub-image name comprises an original image name, a sub-image serial number and initial position information of the sub-image on the original remote sensing image;
searching in an original remote sensing image set according to a sub-image name of a current sub-image to be detected to obtain a corresponding current original remote sensing image;
performing feature extraction on the current sub-image to be detected with a local feature extractor to obtain a group of multi-scale sub-image feature maps;
carrying out down-sampling on a current original remote sensing image, and then carrying out feature extraction by adopting a global feature extractor to obtain a group of multi-scale original image feature maps;
performing feature fusion on the multi-scale sub-image feature map and the multi-scale original image feature map by adopting a feature pyramid network of a global local coupling mechanism to obtain a fused feature map;
generating candidate frames from the fused feature maps through a first RPN, aligning the features with a first RoIAlign, changing feature dimensions through an FC fully connected layer to obtain a one-dimensional fused feature map, and performing classification and regression operations on the one-dimensional fused feature map to obtain a fused target prediction result;
performing multi-scale transformation on the group of multi-scale original image feature maps, generating candidate frames through a second RPN and filtering them, performing feature alignment through a second RoIAlign, changing feature dimensions through an FC fully connected layer to obtain a one-dimensional global feature map, classifying and regressing the one-dimensional global feature map respectively, and performing NLS filtering to obtain a filtered global prediction result;
and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detection target.
In this embodiment, the local feature extractor and the global feature extractor form a PGL (Parallel Global and Local feature extraction) module, which extracts information from a sub-image and the corresponding global image respectively; the two sets of multi-scale features are then deeply fused through the coupling of the feature pyramid network GL-FPN (Global and Local Feature Pyramid Network) of the global local coupling mechanism. In addition, the subsequent RPN, RoIAlign and FC fully connected layers form a detection branch with a small amount of computation, which can detect ultra-large targets that are otherwise difficult to detect. The present invention is easily integrated into most mainstream detectors as an integral module.
In this embodiment, the original remote sensing images are all global images of a larger size, for example, 10000 × 10000 pixels. Such an image size cannot be directly input into a deep learning network, so preprocessing is required and the image is cut into small images of a fixed size. The original remote sensing image can be divided, for example, into a plurality of 1024 × 1024 sub-images, and a certain overlap rate is maintained between different sub-images during division.
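As an illustration of this preprocessing, the following is a minimal sketch, not the exact implementation of the invention, of cutting an image into 1024 × 1024 sub-images with a 200-pixel overlap and encoding the start position in the sub-image name; all function and variable names are hypothetical.

```python
import numpy as np

def split_into_subimages(image: np.ndarray, image_name: str,
                         tile: int = 1024, overlap: int = 200):
    """Cut an H x W x C image into overlapping tiles of size `tile`.

    Each tile is named "<original name>_<index>_<start x>_<start y>"
    so that its position on the original image can be recovered later.
    """
    stride = tile - overlap                      # e.g. 1024 - 200 = 824
    h, w = image.shape[:2]
    subimages = {}
    index = 0
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            # Clamp the window so the last tile keeps the full size when possible.
            x0, y0 = min(x, max(w - tile, 0)), min(y, max(h - tile, 0))
            crop = image[y0:y0 + tile, x0:x0 + tile]
            subimages[f"{image_name}_{index}_{x0}_{y0}"] = crop
            index += 1
    return subimages

# Example: a 4000 x 4000 image yields tiles whose names encode the
# original image name and the tile's start coordinates.
tiles = split_into_subimages(np.zeros((4000, 4000, 3), np.uint8), "P0001")
```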
Further, as shown in fig. 1 and fig. 2, the group of multi-scale sub-image feature maps output by the local feature extractor includes four sub-image feature maps with different scales;
the group of multi-scale original image feature maps output by the global feature extractor comprises four original image feature maps with different scales;
the feature pyramid network of the global local coupling mechanism performs interactive fusion in a top-down manner and transforms the four different-scale sub-image feature maps and the four different-scale original image feature maps into five transformed multi-scale sub-image feature maps and five transformed multi-scale original image feature maps, respectively, corresponding to P2-P6 on the two sides of FIG. 2; and the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps are fused according to corresponding scales to obtain a multi-scale fused feature map.
And further, the method for fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales comprises the steps of splicing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales, and then carrying out dimensionality reduction treatment.
Still further, with reference to fig. 3, the process of obtaining the fused feature map includes:
selecting the middle level of the five transformed multi-scale sub-image feature maps to obtain a selected sub-image feature map;
selecting the middle level of the five transformed multi-scale original image feature maps to obtain a selected original image feature map;
and cutting the designated area corresponding to the selected sub-image feature map in the selected original image feature map, up-sampling to the size of the original image feature map, and then performing feature fusion with the selected sub-image feature map to obtain a fused feature map.
C1-C5 and M2-M5 in FIG. 2 correspond to the operation of the local feature extractor and the global feature extractor. The NLS filtering in fig. 1 transforms the confidence score of each detection result.
C1-C5 refer to the features of the image at different depths as it passes through the convolutional neural network; they are all feature maps, i.e., feature tensors inside the neural network, and the feature dimension gradually increases from C1 to C5. M2-M5 are calculated by 1 × 1 convolution as shown in fig. 2, and P2-P6 are calculated by 3 × 3 convolution and pooling layers.
As an example, both the local feature extractor and the global feature extractor employ ResNet-50 as the feature extractor.
The currently common target detection methods segment the training data in a preprocessing step, and the corresponding original images do not appear in the model training process. In this embodiment, as shown in fig. 1, in order to extract global context information, an index structure between each sub-image and its corresponding original image is first established; the corresponding original image can easily be found according to the name of the sub-image and scaled to the same size as the sub-image, and the two images are then used as the inputs of the convolutional model. The parallel backbone network formed by the local feature extractor and the global feature extractor has a structure similar to a Siamese network, but the method does not include a Siamese-style weight-sharing mechanism. ResNet-50 is used as the feature extractor because it effectively alleviates the vanishing-gradient problem and has strong feature extraction capability. The parallel feature extractor finally outputs two sets of features of the same size.
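As a rough illustration, not the exact network of the invention, the following PyTorch sketch shows how two independent ResNet-50 backbones could extract the C2-C5 features of a sub-image and of the original image rescaled to the sub-image size; torchvision's resnet50 and all class names here are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

class ResNet50Stages(torch.nn.Module):
    """Return the C2-C5 feature maps of a ResNet-50 backbone."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.layer3, self.layer4 = r.layer3, r.layer4

    def forward(self, x):
        c2 = self.layer1(self.stem(x))
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return [c2, c3, c4, c5]

class PGLExtractor(torch.nn.Module):
    """Parallel global/local extraction: two ResNet-50s, no weight sharing."""
    def __init__(self):
        super().__init__()
        self.local_backbone = ResNet50Stages()
        self.global_backbone = ResNet50Stages()

    def forward(self, sub_image, original_image):
        # The original image is rescaled to the sub-image size before extraction.
        global_input = F.interpolate(original_image, size=sub_image.shape[-2:],
                                     mode="bilinear", align_corners=False)
        return self.local_backbone(sub_image), self.global_backbone(global_input)

with torch.no_grad():
    sub = torch.randn(1, 3, 512, 512)
    orig = torch.randn(1, 3, 1024, 1024)   # stand-in for the down-sampled original
    local_feats, global_feats = PGLExtractor()(sub, orig)
```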
The classical feature pyramid network FPN is a top-down pyramid architecture that generates multi-scale features through interconnection, and it can effectively improve the target detection performance of a model on aerial images. In order to provide multi-scale information more effectively and capture global context, this embodiment designs a globally and locally coupled pyramid model based on the pyramid network. Meanwhile, this embodiment observes that the coupling mechanism is scale-sensitive and that its effect differs considerably across scales, so the rule is summarized and the optimal scale selected after extensive experiments. The feature pyramid FPN is a module with multiple inputs and outputs that uses a mapping function to assign a feature level to each candidate box based on its size. The mapping function is
k = ⌊k0 + log2(√(w·h) / 224)⌋,
where k0 is the initial level, defined as 4 in the FPN structure of Faster R-CNN, w and h are the width and height of each candidate box, and k is the level number assigned to each candidate frame, one of {P2, P3, P4, P5} as shown in FIG. 2. This allocation strategy is further refined in the ResNet network architecture, resulting in five levels of feature maps, with the mapping function described below:
furthermore, in the feature pyramid network of the global local coupling mechanism, a mapping function is adopted to select the feature map level according to the size of the target candidate box; the mapping function is:
in the formula, k is the level number corresponding to the target candidate frame, and w and h are the width and height of the target candidate frame.
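For illustration, here is a small sketch of the classical FPN level-assignment rule referred to above; clamping the result to the five levels P2-P6 is an assumption made for this sketch, since the refined mapping function is not reproduced in this text.

```python
import math

def assign_fpn_level(w: float, h: float, k0: int = 4,
                     k_min: int = 2, k_max: int = 6) -> int:
    """Map a candidate box of width w and height h to a pyramid level.

    Classical FPN rule: k = floor(k0 + log2(sqrt(w * h) / 224)).
    The clamp to [k_min, k_max] (P2-P6) is an assumption of this sketch.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(assign_fpn_level(224, 224))   # -> 4 (P4)
print(assign_fpn_level(800, 600))   # larger boxes map to higher levels
```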
As shown in fig. 2, the feature pyramid module of the global local coupling mechanism fuses the features from the sub-images with the features from the original image and then outputs the fused features to the RCNN portion of the model; the features from the original image, besides being fused with the features from the sub-images, are also output directly to the additional global detector portion of the model. The two sets of outputs of the parallel feature extractor are the local (sub-image) feature set and the global (original-image) feature set; each set contains four feature maps of different scales with Ci channels and spatial size Hi × Wi, where Ci denotes the number of channels and Hi and Wi denote the height and width of the feature map at scale i. After the FPN module is calculated, two sets of pyramid features are obtained; each set now comprises five feature maps at different levels, each feature map incorporates information from the other scales, and the levels have the same number of channels but different spatial sizes. FIG. 3 illustrates the fusion process of the present invention: from the global feature set, the feature map of a specific level is taken out; according to the position of the current sub-image on the global image, the region of the global feature map at the corresponding position is cropped out; the cropped global feature is then up-sampled to the same size as the sub-image feature map of the corresponding level. The up-sampled global feature and the sub-image feature map are spliced together along the channel dimension, denoted LG^(2C×H×W), and finally a dimension-reduction convolution transforms LG^(2C×H×W) into LG^(C×H×W).
The main function of the GL-FPN is to perform multi-scale transformation and fusion on the two groups of feature maps obtained from the preceding part, so that the sub-images can contain information from the original global image. As shown in fig. 2, for the group of sub-image features and the group of global-image features, each group is interactively fused from top to bottom, and the four feature maps in each group are transformed into five more representative feature maps, yielding two groups of multi-scale feature maps. Feature fusion then combines the corresponding scales of the two groups; during fusion, the position of the sub-image within the global image must be taken into account, the corresponding region is cropped out, and the crop is up-sampled to the same size. The two feature maps are then fused by splicing followed by dimensionality reduction. The two groups of feature maps that have undergone fusion are output with the same size and dimensions as the input.
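For illustration only, the following sketch shows one way such a coupling step could be written, assuming the global image was rescaled to the sub-image size so the two level feature maps share the same resolution; all tensor and function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def couple_global_local(p_local: torch.Tensor, p_global: torch.Tensor,
                        sub_box_norm: tuple, reduce_conv: torch.nn.Conv2d):
    """Fuse one level of the sub-image pyramid with the global pyramid.

    p_local, p_global: (N, C, H, W) feature maps at the same level.
    sub_box_norm: (x0, y0, x1, y1) position of the sub-image on the
                  original image, normalised to [0, 1].
    reduce_conv:  a 2C -> C convolution used for dimension reduction
                  (3 x 3 in the ablation described later).
    """
    n, c, h, w = p_local.shape
    x0, y0, x1, y1 = sub_box_norm
    # Crop the region of the global feature map corresponding to the sub-image.
    gx0, gx1 = int(x0 * w), max(int(x1 * w), int(x0 * w) + 1)
    gy0, gy1 = int(y0 * h), max(int(y1 * h), int(y0 * h) + 1)
    crop = p_global[:, :, gy0:gy1, gx0:gx1]
    # Up-sample the crop back to the feature-map size and concatenate.
    crop = F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)
    lg = torch.cat([p_local, crop], dim=1)           # LG^(2C x H x W)
    return reduce_conv(lg)                           # LG^(C x H x W)

reduce = torch.nn.Conv2d(512, 256, kernel_size=3, padding=1)
fused = couple_global_local(torch.randn(1, 256, 64, 64),
                            torch.randn(1, 256, 64, 64),
                            (0.25, 0.25, 0.35, 0.35), reduce)
```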
After fusion, the features from the global image and the sub-image are sent to different detection modules: an RPN generates candidate boxes, RoIAlign performs feature alignment, an FC fully connected layer changes the feature dimensions, and a classification layer and a regression layer then identify the type and precisely locate the targets. In this way the target detection result of the sub-image and the target detection result of the global image are obtained respectively.
Finally, the target detection result from the sub-image and the target detection result of the global image are fused, and the detection results from the global image are screened with a filter and the NLS confidence-suppression algorithm to obtain the final, optimal detection result.
The invention does not fuse features at all levels, because YOLOF and AugFPN point out that feature maps at different levels have different characteristics. YOLOF [40] uses only one level of input and output and achieves almost the same performance as using all inputs and outputs after expansion. Owing to the difference in scale, different FPN feature levels place different emphasis on deep semantic information and shallow detail information. This leads to the assumption that different levels in the FPN respond differently to the coupling mechanism of the present invention. To further verify the effect of different feature levels on the results, a set of ablation experiments can be designed.
Still further, as shown in fig. 2, the local feature extractor and the global feature extractor perform feature extraction in the same manner;
the local feature extractor first performs feature extraction on the current sub-image to be detected at 5 depths, with feature dimensions increasing step by step, using a convolutional neural network, and then applies 1 × 1 convolution to the sub-image features extracted at the last four depths to obtain four sub-image feature maps with different scales.
Further, with reference to fig. 2, the feature pyramid network of the global local coupling mechanism transforms the four different-scale sub-image feature maps into five transformed multi-scale sub-image feature maps in the same manner as it transforms the four different-scale original image feature maps into five transformed multi-scale original image feature maps;
3 × 3 convolution and maximum pooling are applied to the four different-scale sub-image feature maps to obtain the five transformed multi-scale sub-image feature maps.
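As a rough illustration of this transformation, here is a compact sketch following the standard FPN recipe (1 × 1 lateral convolutions, top-down addition, 3 × 3 smoothing, and a max-pool for the fifth level); channel numbers, module names and the exact pooling choice are assumptions of the sketch, not taken from the original filing.

```python
import torch
import torch.nn.functional as F

class SimpleFPN(torch.nn.Module):
    """Turn [C2, C3, C4, C5] into [P2, P3, P4, P5, P6]."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = torch.nn.ModuleList(
            [torch.nn.Conv2d(c, out_channels, 1) for c in in_channels])   # 1x1
        self.smooth = torch.nn.ModuleList(
            [torch.nn.Conv2d(out_channels, out_channels, 3, padding=1)
             for _ in in_channels])                                       # 3x3

    def forward(self, feats):
        # Lateral 1x1 convolutions produce M2-M5.
        m = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: up-sample and add, coarsest level first.
        for i in range(len(m) - 2, -1, -1):
            m[i] = m[i] + F.interpolate(m[i + 1], size=m[i].shape[-2:],
                                        mode="nearest")
        # 3x3 convolutions give P2-P5; an extra max-pool gives P6.
        p = [s(x) for s, x in zip(self.smooth, m)]
        p.append(F.max_pool2d(p[-1], kernel_size=1, stride=2))
        return p

feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (256, 128, 64, 32))]
p2_to_p6 = SimpleFPN()(feats)
print([tuple(x.shape[-2:]) for x in p2_to_p6])   # sizes halve from P2 to P6
```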
Models based on convolutional neural networks usually reduce dimensionality with a 1 × 1 convolution, and the FPN model adds a 3 × 3 convolution in the feature fusion process to alleviate the aliasing effect caused by up-sampling. In the model of the invention, the global feature map is also produced by up-sampling and the correlation between adjacent pixels is strong, but the up-sampling ratio of the method of the invention is related to the size of the original image, whereas the standard feature pyramid FPN up-samples by a fixed factor of two. On this basis, a set of ablation experiments is designed to explore how the size of the dimension-reduction convolution kernel influences the coupling mechanism. The kernel sizes {1 × 1, 3 × 3, 5 × 5, N × N} are discussed separately; the experimental results are shown in table 8, and the 3 × 3 convolution kernel stably achieves the optimal effect, because a 3 × 3 kernel effectively alleviates the aliasing effect while avoiding excessive interference. N denotes a set of convolution kernel sizes determined by the up-sampling ratio, and the specific allocation criterion is as follows:
when segmenting large-size aerial images, some objects with large width or height, such as airports and large ships, may be segmented into several parts, resulting in a reduction in detection accuracy. Although the commonly used preprocessing methods maintain a certain overlap ratio, an excessively large overlap ratio can cause excessive repeated detection, and an excessively small overlap ratio has a relatively limited effect. Therefore, the embodiment designs an additional detector, aims to detect the oversize target from the thumbnail of the original image, and can effectively reduce the calculation amount and avoid influencing the precision of the main detector through some control algorithms.
The structure of the global detector is included in fig. 1; it is similar to a two-stage object detection model and comprises an RPN (candidate box generation network), a classifier and a regressor. In order to reduce the amount of computation and to be dedicated to detecting very large targets, the global detector discards the smaller candidate boxes through a threshold function, which is described as follows:
where height and width are the height and width of the candidate frame. The threshold is 200 because the overlap of the conventional segmentation preprocessing step is 200 pixels, and an object longer than this is considered likely to have been cut apart; when flag is 0, the candidate frame is discarded. However, if an object appears with the same confidence in the outputs of both the main detector and the global detector, the detection result from the global detector should be discarded, because the main detector has higher localization accuracy. Therefore, when the prediction boxes of the two detectors overlap to a high degree, the result from the global detector should be suppressed appropriately.
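Because the threshold function itself is not reproduced above, the snippet below is only a hedged sketch of the described behaviour; whether the comparison is strict and whether one or both sides must exceed the threshold are assumptions.

```python
def keep_large_candidate(width: float, height: float, threshold: float = 200.0) -> int:
    """Return flag = 1 for candidate boxes the global detector should keep.

    Sketch of the described rule: objects longer than the segmentation
    overlap (200 px) are likely to be cut apart, so only boxes whose width
    or height exceeds the threshold are kept; flag = 0 means discard.
    """
    return 1 if (width > threshold or height > threshold) else 0

candidates = [(120, 80), (950, 140), (60, 300)]
kept = [box for box in candidates if keep_large_candidate(*box)]
print(kept)   # [(950, 140), (60, 300)]
```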
The invention designs an algorithm named non-local suppression (NLS) based on the idea of soft-NMS, which effectively suppresses the confidence of results from the global detector according to the overlap rate.
In the NLS algorithm, gi and si denote a prediction box from the global detector and its confidence, respectively, and L denotes the set of all prediction boxes from the main detector. NLS linearly reduces the confidence of repeated detection results: the larger the repeated region, the stronger the suppression. Since the NLS algorithm does not require abrupt suppression in extreme cases, a Gaussian-form penalty function is not needed. x is the abscissa of the image and y is the ordinate of the image.
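Below is a minimal sketch of the described suppression, under the assumption that the linear penalty scales with the maximum IoU between a global-detector box and the main-detector boxes; the box format and helper names are hypothetical.

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two boxes in (x0, y0, x1, y1) image coordinates."""
    x0, y0 = np.maximum(box_a[:2], box_b[:2])
    x1, y1 = np.minimum(box_a[2:], box_b[2:])
    inter = max(x1 - x0, 0.0) * max(y1 - y0, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nls(global_boxes, global_scores, main_boxes):
    """Linearly suppress global-detector confidences by overlap with main boxes."""
    suppressed = []
    for g_i, s_i in zip(global_boxes, global_scores):
        overlap = max((iou(g_i, l) for l in main_boxes), default=0.0)
        suppressed.append(s_i * (1.0 - overlap))   # more overlap, more suppression
    return suppressed

g = [np.array([0, 0, 400, 300]), np.array([900, 900, 1500, 1400])]
l = [np.array([10, 10, 390, 290])]
print(nls(g, [0.9, 0.8], l))   # the first score is strongly reduced
```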
Finally, in the training process, the loss function of the model can be defined as follows:
Lcls and Lreg represent the classification and regression losses from the main detector, and L'cls and L'reg represent the classification and regression losses from the global detector. tc and tl represent the predictions from the main detector, and t'c and t'l represent the predictions from the global detector. λ represents the balance hyperparameter between the classification and regression losses and can be set to the same value as in Faster R-CNN.
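Since the loss formula itself is not reproduced here, the following sum is offered only as an assumption consistent with these definitions, not as the exact formula of the original filing:

```latex
\mathrm{Loss} \;=\; L_{cls}(t_c) \;+\; \lambda\, L_{reg}(t_l)
             \;+\; L'_{cls}(t'_c) \;+\; \lambda\, L'_{reg}(t'_l)
```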
As an example, the predetermined overlap ratio includes overlapping 200 pixels.
By way of example, the sub-graph name format includes:
original image name _ subgraph sequence number _ subgraph start abscissa _ subgraph start ordinate.
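As a short illustration, the name format can be parsed and used to shift a box detected on a sub-image back into original-image coordinates; the helper names below are hypothetical and the exact merging procedure of the invention is not reproduced.

```python
def parse_subimage_name(name: str):
    """Split '<original>_<index>_<start x>_<start y>' into its parts."""
    original, index, x0, y0 = name.rsplit("_", 3)
    return original, int(index), int(x0), int(y0)

def box_to_original_coords(box, name):
    """Shift a (x0, y0, x1, y1) box from sub-image to original-image coordinates."""
    _, _, off_x, off_y = parse_subimage_name(name)
    return (box[0] + off_x, box[1] + off_y, box[2] + off_x, box[3] + off_y)

name = "P0001_7_4944_1648"
print(parse_subimage_name(name))              # ('P0001', 7, 4944, 1648)
print(box_to_original_coords((10, 20, 110, 220), name))
```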
Experimental verification:
to evaluate the method proposed by the present invention, extensive experiments were performed on the DOTA-v1.0 dataset, the DOTA-v1.5 dataset and the DOTA-v2.0 dataset. Experimental results show that the global local coupling detection mechanism (CGLNet) of the method can effectively improve the performance of the model in the three data sets. The results are shown in tables 1 to 3, all results are obtained from the official evaluation server. The mAP accuracy index of the Faster R-CNN added with the CGLNet is improved by 2.45 percent on DOTA-v1.0, the average accuracy of DOTA-v1.5 is improved by 1.56 percent, and the average accuracy of DOTA-v2 is improved by 1.67 percent. Cascade R-CNN added with CGLNet is improved by 1.96 percent on DOTA-v1.0, 1.76 percent on DOTA-v1.5 and 1.07 percent on DOTA-v 2.0. In addition, the integrated module adopting the method of the invention has improved recognition precision of most of the image categories, which indicates the importance of the global context information on object detection. The detection precision of the ground-track-field is improved by 9.15%, the detection precision of the sodium-ball-field is improved by 16.20%, and the detection precision of the container-chord is improved by 8.65%.
In addition, performance is compared with two other classes of target detection models: one class consists of reference detection models widely used in the natural image domain, the other of dedicated detection models for aerial remote sensing images; the comparison results are shown in tables 4 to 5. The first class comprises SSD, YOLOv2, RetinaNet, YOLOv3, YOLOv4, R-FCN, Mask R-CNN, PANet, Faster R-CNN, Cascade R-CNN and Faster R-CNN H-OBB; the second class comprises RICNN, ORConv, USB-BBR, OFIC, MS-VANs, GLNet, ICN, IAD R-CNN, CDD-Net, CAD-Net and SCANet. Compared with these existing methods, the proposed method achieves better performance on top of the reference models, and its improvement is not limited to a specific class: it raises the detection accuracy of most categories. Categories with a larger average target size gain more accuracy under the method of the present invention, because the method focuses more on large-scale global context information. In order to show the detection performance improvement more clearly, the detection results of the invention are shown from the perspective of the sub-images and of the original images, respectively, as shown in fig. 4 to fig. 31.
In order to explore more rigorously the influence of each module and structure of the invention on the detection results, this embodiment designs and carries out detailed ablation experiments to analyze the influence of individual modules and of the structural hyper-parameters of the coupling mechanism. To analyze the importance of each sub-module in CGLNet, GL-FPN and the global detector were applied separately to the reference model to verify their validity on the three datasets. The reference model is Faster R-CNN with ResNet50-FPN. Table 6 shows the ablation study results for each component. Each module improves the performance of the model on the three datasets, and their combination also achieves superior results, proving that there is no conflict between these sub-modules. Among the detection results on the three datasets, GL-FPN considerably improves performance on DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0, by 2.04%, 1.18% and 1.47% respectively. DOTA-v1.0 shows the largest improvement because it contains the fewest objects of the three datasets. If GL-FPN is removed and the global detector alone is added, detection performance still improves by more than 0.2%. The performance improvement of the global-head module remains stable over the three datasets, demonstrating that the slicing of oversized objects is widespread across multiple datasets.
Since the response to global context information is scale-sensitive, the global context does not yield better results for objects of all scales. To further explore this issue, additional ablation experiments may be performed, as shown in table 7. The {P2, P3} features have a lower down-sampling ratio and are rich in detail information, and context features that focus on global information would interfere with their feature expression, so these two levels are not considered in the feature-level combinations. Different combinations nevertheless lead to slight differences in performance, and only {P4} achieves the best results on all three datasets. This is consistent with other work observing that the intermediate level of the FPN pyramid contains more balanced semantic information and detail, so this level is more conducive to fusing global information. Using {P2} and {P3} can still produce better-than-baseline results only on DOTA-v1.0, but detection performance on DOTA-v1.5 is reduced instead, because global features affect the expression of detailed features. The accuracy trends are similar across the three datasets, which also demonstrates the stability of the method of the invention.
The specific embodiment is as follows:
the mAP was used as an evaluation index.
To verify the performance of the invention on large aerial images, the DOTA [17] dataset was used; it is a large-scale aerial-image dataset for target detection that contains many larger-size images, mainly from Google Earth, satellites and aerial photography. There are three versions of the DOTA dataset: DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0. The HBB (horizontal bounding box) annotations are used.
DOTA-v1.0 has 2806 images in total, ranging in size from 800 to 4000 pixels, and includes 15 classes and 188282 objects. DOTA-v1.5 annotates many small targets that are difficult to detect and, in addition, adds a new category called "container crane"; the DOTA-v1.5 version contains 402089 instances. DOTA-v2.0 differs greatly from the previous versions, with 18 categories (adding "airport" and "helipad"), 11286 pictures and 1793658 targets. All results reported in the experiments are test-set accuracies.
For convenience, and to test the performance of the module against the official baseline, the present invention was implemented and evaluated using MMDetection [43]. Following the official baseline setting, each original image is cut into 1024 × 1024 local images with a stride of 824. In the inference phase, the inference results from the local images are merged into the results of the global image with the NMS threshold set to 0.3. All experiments were performed on the Ubuntu operating system and trained on a single GPU (NVIDIA GeForce GTX 1080 Ti) with a total batch size of 1. The initial learning rate was set to 0.0025, and the learning-rate schedule was the same as the "1x" schedule. The remaining hyper-parameters were set consistently with those of the official baseline; for example, the maximum number of candidate boxes and of objects per image was set to 2000.
TABLE 1 comparison of Performance to the baseline model at DOTA-v1.0
FR-H = Faster R-CNN model, CR-H = Cascade Mask R-CNN model, P = airplane, BD = baseball field, B = bridge, GTF = ground track field, SV = small vehicle, LV = large vehicle, S = ship, TC = tennis court, BC = basketball court, ST = oil tank, SBF = football field, RA = roundabout, H = port, SP = swimming pool, HC = helicopter.
TABLE 2 comparison of Performance to the benchmark model at DOTA-v1.5
FR-H = Faster R-CNN model, CR-H = Cascade Mask R-CNN model, P = airplane, BD = baseball field, B = bridge, GTF = ground track field, SV = small vehicle, LV = large vehicle, S = ship, TC = tennis court, BC = basketball court, ST = oil tank, SBF = football field, RA = roundabout, H = port, SP = swimming pool, HC = helicopter, CC = container crane.
TABLE 3 comparison of Performance to the baseline model at DOTA-v2.0
FR-H = Faster R-CNN model, CR-H = Cascade Mask R-CNN model, P = airplane, BD = baseball field, B = bridge, GTF = ground track field, SV = small vehicle, LV = large vehicle, S = ship, TC = tennis court, BC = basketball court, ST = oil tank, SBF = football field, RA = roundabout, H = port, SP = swimming pool, HC = helicopter, CC = container crane, A = airport, He = helipad.
TABLE 4 comparison of Performance at DOTA-v1.0 with mainstream model
FR-H = Faster R-CNN model, CR-H = Cascade Mask R-CNN model; a variant of FR-H with a different feature extraction structure is also listed.
TABLE 5 comparison of Performance to mainstream model at DOTA-v1.5
FR-H = Faster R-CNN model, CR-H = Cascade Mask R-CNN model.
TABLE 6 Ablation experiments on the sub-modules of the invention
TABLE 7 Ablation experiments on the GL-FPN fusion level
TABLE 8 Ablation experiments on the GL-FPN dimension-reduction convolution kernel size
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.