CN113989645A - Object detection method for large-scale aerial remote sensing images - Google Patents

Object detection method for large-scale aerial remote sensing images

Info

Publication number
CN113989645A
CN113989645A
Authority
CN
China
Prior art keywords
feature
image
sub
scale
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111313879.1A
Other languages
Chinese (zh)
Other versions
CN113989645B (en)
Inventor
王超杰
陈曦
李治洪
刘敏
郑来文
方涛
李庆利
刘小平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202111313879.1A
Publication of CN113989645A
Application granted
Publication of CN113989645B
Active
Anticipated expiration

Abstract

A target detection method for large-size aerial remote sensing images belongs to the technical field of target detection. The method addresses the loss of context information caused by splitting large-size aerial remote sensing images into sub-images. The method comprises: determining the original image name; configuring a sub-image name for each sub-image; searching the original remote sensing image set according to the sub-image name to obtain the corresponding current original remote sensing image; performing feature extraction on the current sub-image to be detected with a local feature extractor to obtain a group of multi-scale sub-image feature maps; down-sampling the current original remote sensing image and then performing feature extraction with a global feature extractor to obtain a group of multi-scale original-image feature maps; fusing the two groups of features with a feature pyramid network using a global-local coupling mechanism to obtain fused feature maps; obtaining a fused target prediction result and a filtered global prediction result respectively; and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detected targets. The invention is used for target detection.

Description

Target detection method for large-size aerial remote sensing image
Technical Field
The invention relates to a target detection method of a large-size aerial remote sensing image, and belongs to the technical field of target detection.
Background
With the development of image sensors and aviation technology, the resolution of aerial remote sensing optical images has gradually improved, and the amount of data contained in the images has also increased significantly. Object detection is one of the common techniques of computer vision, used to infer the location and class of each target object in an image. In the application field of high-resolution aerial remote sensing images, object detection can play an important role in building identification, natural disaster management, change detection, traffic planning, agricultural surveys, military applications, and the like.
Compared with natural images in computer vision, aerial remote sensing images have much larger resolution and size. Many aerial image datasets provide large-size images that cannot be read directly by the GPU; for example, the DOTA dataset contains large-size images with a resolution of 10000 × 10000. If the picture is simply scaled down directly to the extent that memory can accommodate it, a significant accuracy degradation results. Therefore, the widely adopted approach is to divide all original images, during data preprocessing, into fixed-size sub-images that the graphics card can accommodate, with a certain overlap between sub-images. However, this preprocessing method often leads to two easily overlooked problems: first, global context information is cut apart along with the image; second, even with some overlap between adjacent sub-images, some over-sized targets may be incomplete on every sub-image and therefore cannot be completely detected. In aerial images, many objects are associated with large scenes: for example, airplanes usually appear at airports, while ports appear only at the edges of large bodies of water. In the fields of semantic segmentation and medical imaging, these global contexts from large-size images have been shown to contribute to the accuracy of neural network models.
Today many excellent object detection models have emerged for applications in the natural image domain, such as Faster R-CNN, Cascade R-CNN and YOLO. However, these methods only focus on the features of the object's own region and ignore the object's multi-scale and contextual information. The human visual system can leverage context and additional information to support perceptual inference and judgment. In the field of computer vision, the role of contextual information is also important, and many approaches have shown that multi-scale features and contextual information are beneficial for improving target detection accuracy. For example, the Feature Pyramid Network (FPN) creates a top-down pyramid structure so that the model can use multi-scale joint features. The Inside-Outside Net captures multi-scale context information with a recurrent neural network and significantly improves detection accuracy, illustrating the importance of multi-scale context information for target detection. The AC-CNN extracts multi-scale context information with an attention mechanism and a recurrent neural network. The MS-CNN model is used to perceive the context information of an object. The FPN structure can also extract peripheral context information by using feature maps with a larger up-sampling ratio. The FA-SSD extracts more salient context information with an attention mechanism and improves small-target detection performance.
The role of contextual information is also important for object detection in aerial images. For example, a bridge and a road are very similar in texture, but a bridge usually appears over a water area, so the context around a bridge provides a basis for judgment. Many researchers design special context-perception models according to the characteristics of remote sensing images. The context-aware detection network (CAD-Net) uses GC-Net and PLC-Net to perceive context information at the global scene level and the local object level, respectively. The CA-CNN extracts context information in multi-scale features through several context anchor boxes of different sizes. GLNet uses a long short-term memory (LSTM) network structure to extract global context information. Although these network structures for extracting context information have achieved good results, they are not aimed at large-size images, so the problem of missing context information caused by image segmentation cannot be solved.
Disclosure of Invention
The invention provides a target detection method of a large-size aerial remote sensing image, aiming at the problem that context information is lost due to image segmentation of the large-size aerial remote sensing image.
The invention relates to a target detection method of a large-size aerial remote sensing image, which comprises the following steps,
original remote sensing images form an original remote sensing image set, and each original remote sensing image is configured with an original image name;
segmenting each original remote sensing image into sub-images with preset sizes according to a preset overlapping rate, configuring a sub-image name for each sub-image, wherein the sub-image name comprises an original image name, a sub-image serial number and initial position information of the sub-image on the original remote sensing image;
searching in an original remote sensing image set according to a sub-image name of a current sub-image to be detected to obtain a corresponding current original remote sensing image;
performing feature extraction on the current sub-graph to be detected by adopting a local feature extractor to obtain a group of multi-scale sub-graph feature graphs;
carrying out down-sampling on a current original remote sensing image, and then carrying out feature extraction by adopting a global feature extractor to obtain a group of multi-scale original image feature maps;
performing feature fusion on the multi-scale sub-image feature map and the multi-scale original image feature map by adopting a feature pyramid network of a global local coupling mechanism to obtain a fused feature map;
generating candidate boxes from the fused feature maps through a first RPN, aligning the features with a first RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain one-dimensional fused feature maps, and performing classification and regression operations on the one-dimensional fused feature maps to obtain the fused target prediction result;
performing multi-scale transformation on the group of multi-scale original image feature maps, generating candidate boxes through a second RPN and filtering them, performing feature alignment with a second RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain a one-dimensional global feature map, performing classification and regression on the one-dimensional global feature map respectively, and performing NLS filtering to obtain the filtered global prediction result;
and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detection target.
According to the target detection method of the large-size aerial remote sensing image, a group of multi-scale sub-image feature graphs output by the local feature extractor comprise four sub-image feature graphs with different scales;
the group of multi-scale original image feature maps output by the global feature extractor comprise four original image feature maps with different scales;
the feature pyramid network of the global local coupling mechanism adopts a top-down mode for interactive fusion, and four different-scale sub-image feature graphs and four different-scale original image feature graphs are respectively transformed into five transformed multi-scale sub-image feature graphs and five transformed multi-scale original image feature graphs; and fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to corresponding scales to obtain a multi-scale fused feature map.
According to the target detection method of the large-size aerial remote sensing image, the method for fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales comprises the steps of splicing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales, and then reducing the dimensions.
According to the target detection method of the large-size aerial remote sensing image, the process of obtaining the fused feature map comprises the following steps:
selecting the middle level of the five transformed multi-scale sub-image feature graphs to obtain a selected sub-image feature graph;
selecting the middle level of the five transformed multi-scale original image feature maps to obtain a selected original image feature map;
and cutting the designated area corresponding to the selected sub-image feature map in the selected original image feature map, up-sampling to the size of the original image feature map, and then performing feature fusion with the selected sub-image feature map to obtain a fused feature map.
According to the target detection method of the large-size aerial remote sensing image, both the local feature extractor and the global feature extractor adopt ResNet-50 as feature extractors.
According to the target detection method of the large-size aerial remote sensing image, a mapping function is adopted to select the grade of a feature map according to the size of a target candidate frame in a feature pyramid network of a global local coupling mechanism; the mapping function is:
k = ⌊4 + log₂(√(wh) / 224)⌋
in the formula, k is the level number corresponding to the target candidate frame, w is the length of the target candidate frame, and h is the width of the target candidate frame.
According to the target detection method of the large-size aerial remote sensing image, the local feature extractor and the global feature extractor perform feature extraction in the same mode;
and the local feature extractor firstly performs feature extraction on the current sub-graph to be detected at 5 depths according to the sequence of gradually increasing feature dimensions by adopting a convolutional neural network, and then performs 1 × 1 convolution calculation on the sub-graph features extracted at the last four depths to obtain four sub-graph feature graphs with different scales.
According to the target detection method of the large-size aerial remote sensing image, a characteristic pyramid network of a global local coupling mechanism has the same mode of converting four different-scale sub-image characteristic graphs into five converted multi-scale sub-image characteristic graphs as that of converting four different-scale original image characteristic graphs into five converted multi-scale original image characteristic graphs;
and respectively carrying out 3-by-3 convolution calculation and maximum pooling on the four different scale sub-graph feature graphs to obtain five transformed multi-scale sub-graph feature graphs.
According to the target detection method of the large-size aerial remote sensing image, the preset overlapping rate comprises 200 pixels of overlapping.
According to the target detection method of the large-size aerial remote sensing image, the sub-graph name format comprises the following steps:
original image name _ subgraph sequence number _ subgraph start abscissa _ subgraph start ordinate.
The invention has the following beneficial effects: the method can be used for target detection in aerial images, solves the problem that over-sized targets split across sub-images cannot be detected by means of the fusion model, and can effectively improve the target detection accuracy of large-size aerial remote sensing images.
The invention adopts a PGL parallel feature extraction module, consisting of a local feature extractor and a global feature extractor working in parallel, to obtain global-image and sub-image information. A feature pyramid module GL-FPN with a global-local coupling mechanism is designed based on the feature pyramid (FPN) structure; by synchronously acquiring information from the sub-image and the original remote sensing image, it significantly relieves the loss of context information caused by segmenting large-size images. In addition, the method also designs an additional global detector for detecting over-sized targets on the thumbnail of the global image, further ensuring the reliability of target detection.
The invention can be integrated as a module into common target detection models and has a wide range of applications.
Drawings
FIG. 1 is a flow chart of a target detection method of a large-size aerial remote sensing image according to the present invention;
FIG. 2 is a schematic diagram of a data processing network for PGL and GL-FPN;
FIG. 3 is a feature map fusion diagram of a feature pyramid network of a global local coupling mechanism;
FIG. 4 is an example of a truth label presented from a sub-graph perspective;
FIG. 5 is an example of a reference model illustrated from a sub-graph perspective;
FIG. 6 is a diagram illustrating the detection results of the present invention from a sub-graph perspective;
FIG. 7 is a truth label presented from a subgraph perspective for example two;
FIG. 8 is an example two reference model illustrated from a sub-graph perspective;
FIG. 9 is a diagram illustrating the detection results of the present invention from a sub-graph perspective;
FIG. 10 is an example three truth labels presented from a subgraph perspective;
FIG. 11 is an example three reference model illustrated from a sub-graph perspective;
FIG. 12 is a diagram illustrating the detection results of the present invention from a sub-graph perspective;
FIG. 13 is an example four truth labels presented from a subgraph perspective;
FIG. 14 is an example four reference model illustrated from a sub-graph perspective;
FIG. 15 is a diagram illustrating the detection results of the present invention from a sub-graph perspective;
FIG. 16 is an example five reference model shown from an original image perspective;
FIG. 17 is a diagram illustrating the detection results of the present invention from the perspective of an original image;
FIG. 18 is an example six reference model shown from an original image perspective;
FIG. 19 is a diagram illustrating the detection results of example six of the present invention from the perspective of an original image;
FIG. 20 is an example seven reference model shown from an original image perspective;
FIG. 21 is a diagram illustrating the detection results of example seven of the present invention from the perspective of an original image;
FIG. 22 is an example eight reference model shown from an original image perspective;
FIG. 23 is a diagram illustrating the detection results of the present invention from the perspective of an original image;
FIG. 24 is an example nine reference model shown from an original image perspective;
FIG. 25 is a diagram illustrating the detection results of the present invention from an original image perspective;
FIG. 26 is an example ten reference models shown from the perspective of an original image;
FIG. 27 is a diagram illustrating the detection results of the present invention from the perspective of an original image;
FIG. 28 is an example eleven reference model presented from an original image perspective;
fig. 29 is a schematic diagram illustrating the detection results of the present invention from the perspective of an original image;
FIG. 30 is an example twelve reference model presented from an original image perspective;
fig. 31 is a diagram illustrating the detection results of the present invention from the perspective of an original image.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
In a first embodiment, referring to fig. 1 to 3, the present invention provides a target detection method for large-size aerial remote sensing images, which is characterized by comprising,
original remote sensing images form an original remote sensing image set, and each original remote sensing image is configured with an original image name;
segmenting each original remote sensing image into sub-images with preset sizes according to a preset overlapping rate, configuring a sub-image name for each sub-image, wherein the sub-image name comprises an original image name, a sub-image serial number and initial position information of the sub-image on the original remote sensing image;
searching in an original remote sensing image set according to a sub-image name of a current sub-image to be detected to obtain a corresponding current original remote sensing image;
performing feature extraction on the current sub-graph to be detected by adopting a local feature extractor to obtain a group of multi-scale sub-graph feature graphs;
carrying out down-sampling on a current original remote sensing image, and then carrying out feature extraction by adopting a global feature extractor to obtain a group of multi-scale original image feature maps;
performing feature fusion on the multi-scale sub-image feature map and the multi-scale original image feature map by adopting a feature pyramid network of a global local coupling mechanism to obtain a fused feature map;
generating candidate boxes from the fused feature maps through a first RPN, aligning the features with a first RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain one-dimensional fused feature maps, and performing classification and regression operations on the one-dimensional fused feature maps to obtain the fused target prediction result;
performing multi-scale transformation on the group of multi-scale original image feature maps, generating candidate boxes through a second RPN and filtering them, performing feature alignment with a second RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain a one-dimensional global feature map, performing classification and regression on the one-dimensional global feature map respectively, and performing NLS filtering to obtain the filtered global prediction result;
and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detection target.
In this embodiment, the local feature extractor and the global feature extractor form the PGL (Parallel Global and Local feature extraction) module, which extracts information from a sub-image and from the corresponding global image, respectively; the multi-scale features of the two parts are then deeply fused through the coupling of the feature pyramid network GL-FPN (Global and Local Feature Pyramid Network) of the global-local coupling mechanism. In addition, the subsequent RPN, RoIAlign and FC fully-connected layer form a detection branch with a small computational cost, which can detect over-sized targets that are otherwise difficult to detect. The present invention is easily integrated as a complete module into most mainstream detectors.
In this embodiment, the original remote sensing images are all global images of larger size, for example 10000 × 10000 pixels. Such an image cannot be fed directly into a deep learning network, so preprocessing is required to cut it into small images of a fixed size. The original remote sensing image can, for example, be divided into several sub-images of size 1024 × 1024, and a certain overlap is ensured between different sub-images during division.
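A minimal tiling sketch in Python/NumPy is given below; it is illustrative only (the function name is an assumption), but the 1024-pixel tile size, the 200-pixel overlap and the resulting stride of 824 follow the values mentioned in this description, and the key format follows the sub-image naming rule described later.

```python
import numpy as np

def split_into_subimages(image: np.ndarray, image_name: str,
                         tile: int = 1024, overlap: int = 200) -> dict:
    """Cut a large image into fixed-size, partially overlapping sub-images.

    Each sub-image is keyed "<original name>_<index>_<x0>_<y0>" so that the
    original image and the tile position can be recovered later.
    """
    stride = tile - overlap                       # 1024 - 200 = 824
    h, w = image.shape[:2]
    subimages, index = {}, 0
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            patch = image[y0:y0 + tile, x0:x0 + tile]   # may be smaller at the border
            subimages[f"{image_name}_{index}_{x0}_{y0}"] = patch
            index += 1
    return subimages
```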
Further, as shown in fig. 1 and fig. 2, the group of multi-scale sub-graph feature maps output by the local feature extractor includes four sub-graph feature maps with different scales;
the group of multi-scale original image feature maps output by the global feature extractor comprise four original image feature maps with different scales;
the feature pyramid network of the global local coupling mechanism adopts a top-down mode for interactive fusion, and respectively converts four different-scale sub-image feature graphs and four different-scale original image feature graphs into five converted multi-scale sub-image feature graphs and five converted multi-scale original image feature graphs, which respectively correspond to P2-P6 on two sides in FIG. 2; and fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to corresponding scales to obtain a multi-scale fused feature map.
And further, the method for fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales comprises the steps of splicing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales, and then carrying out dimensionality reduction treatment.
Still further, with reference to fig. 3, the process of obtaining the fused feature map includes:
selecting the middle level of the five transformed multi-scale sub-image feature graphs to obtain a selected sub-image feature graph;
selecting the middle level of the five transformed multi-scale original image feature maps to obtain a selected original image feature map;
and cutting the designated area corresponding to the selected sub-image feature map in the selected original image feature map, up-sampling to the size of the original image feature map, and then performing feature fusion with the selected sub-image feature map to obtain a fused feature map.
C1-C5 and M2-M5 in FIG. 2 correspond to the operation of the local feature extractor and the global feature extractor. The NLS filtering in fig. 1 transforms the confidence score of each detection result.
C1-C5 refer to features of the image at different depths when the image passes through the convolutional neural network, and the features are all feature maps, namely feature vectors in the neural network, and feature dimensions gradually increase from C1 to C5. M2-M5 were calculated from 1 × 1 convolution as shown in fig. 2, and P2-P6 were calculated from 3 × 3 convolution and pooling layers.
As an example, both the local feature extractor and the global feature extractor employ ResNet-50 as the feature extractor.
The currently common target detection methods segment the training data in a preprocessing step, and the corresponding original images do not appear during model training. In this embodiment, as shown in fig. 1, in order to extract global context information, an index structure between each sub-image and its corresponding original image is first established; the corresponding original image can easily be found from the sub-image's name and scaled to the same size as the sub-image, and the two images are then used as inputs to the convolutional model. The parallel backbone network formed by the local feature extractor and the global feature extractor has a structure similar to a Siamese network, but the method does not include a Siamese-style weight-sharing mechanism; it uses ResNet-50 as the feature extractor, which effectively alleviates the vanishing-gradient problem and has strong feature extraction capability. The parallel feature extractor finally outputs two sets of features of the same size.
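A minimal PyTorch sketch of the parallel backbone is shown below, assuming torchvision's ResNet-50 for both extractors with no weight sharing, as stated above; the class and method names are illustrative and not the patent's own implementation (older torchvision versions would use pretrained=False instead of weights=None).

```python
import torch.nn as nn
from torchvision.models import resnet50

class ParallelGlobalLocalBackbone(nn.Module):
    """Two independent ResNet-50 backbones: one for the sub-image (local),
    one for the down-sampled original image (global). No weight sharing."""

    def __init__(self):
        super().__init__()
        self.local_net = resnet50(weights=None)
        self.global_net = resnet50(weights=None)

    @staticmethod
    def _stages(net, x):
        # C1..C5 are features at five depths; the last four (C2-C5) are returned
        # and would then pass through the 1x1 lateral convolutions (M2-M5).
        x = net.maxpool(net.relu(net.bn1(net.conv1(x))))   # C1
        c2 = net.layer1(x)
        c3 = net.layer2(c2)
        c4 = net.layer3(c3)
        c5 = net.layer4(c4)
        return [c2, c3, c4, c5]

    def forward(self, sub_image, global_image):
        local_feats = self._stages(self.local_net, sub_image)
        global_feats = self._stages(self.global_net, global_image)
        return local_feats, global_feats
```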
The classical feature pyramid network (FPN) is a top-down pyramid architecture that generates multi-scale features through lateral connections, and it can effectively improve the target detection performance of a model on aerial images. To provide multi-scale information more effectively and to capture the global context, this embodiment designs a global-local coupling pyramid model based on the pyramid network. Meanwhile, this embodiment observes that the coupling mechanism is scale-sensitive and that its effect differs considerably across scales, so the rule is summarized and the optimal scale is selected after extensive experiments. The FPN is a module with multiple inputs and outputs that uses a mapping function to assign a feature level to each candidate box based on its size. The mapping function is described as follows:
k = ⌊k₀ + log₂(√(wh) / 224)⌋
where k₀ is the initial level, defined as 4 in the FPN structure of Faster R-CNN, and w and h are the length and width of each candidate box. k corresponds to the level number assigned to each candidate box and is assigned to one of {P2, P3, P4, P5}, as shown in FIG. 2. This allocation strategy is further refined in the ResNet network architecture, resulting in five levels of feature maps, with the mapping function described below:
furthermore, in a feature pyramid network of a global local coupling mechanism, a mapping function is adopted to select the feature map level according to the size of a target candidate box; the mapping function is:
k = ⌊4 + log₂(√(wh) / 224)⌋
in the formula, k is the level number corresponding to the target candidate frame, w is the length of the target candidate frame, and h is the width of the target candidate frame.
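The level-assignment rule above can be written as a short function. The sketch below is illustrative: the function name is hypothetical, and clipping k to the five available levels P2..P6 is an assumption (the text states that five levels result but does not spell out the bounds).

```python
import math

def assign_fpn_level(w: float, h: float, k0: int = 4,
                     k_min: int = 2, k_max: int = 6) -> int:
    """Map a candidate box of size w x h to a pyramid level P_k.

    k = floor(k0 + log2(sqrt(w * h) / 224)); the clamp to [k_min, k_max]
    keeps k inside the five levels P2..P6 and is an assumption here.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))
```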
As shown in fig. 2, the feature pyramid module of the global-local coupling mechanism fuses the features from the sub-image with the features from the original image and outputs the fused features to the RCNN portion of the model; the features from the original image, in addition to being fused with the sub-image features, are also output directly to the additional global detector portion of the model. Denote the two sets of outputs of the parallel feature extractor as {L_i} and {G_i}, with L_i, G_i ∈ R^(C_i × H_i × W_i); each set contains four feature maps of different scales, where C_i is the number of channels and H_i and W_i are the height and width of the feature map at scale i. After the FPN module is calculated, the method obtains {P_k^L} and {P_k^G} (k = 2, ..., 6); at this point each group comprises five feature maps at different levels, each feature map contains information from the other scales, and the different levels have the same number of channels but different sizes. FIG. 3 illustrates the fusion process of the present invention: from the feature set {P_k^G} of the global image, the feature map P_k^G of a specific level is taken out; then, according to the position of the sub-image on the global image at that moment, the feature map of the corresponding region is cropped out of P_k^G and denoted P_k^(G,crop). To fuse it with the corresponding P_k^L, this embodiment up-samples P_k^(G,crop) to the same size as P_k^L. Then P_k^L and the up-sampled P_k^(G,crop) are spliced together along the channel dimension, denoted LG^(2C×H×W); finally, LG^(2C×H×W) is converted into LG^(C×H×W) by a dimension-reduction convolution.
The main function of GL-FPN is to perform multi-scale transformation and fusion on the two groups of feature maps obtained from the previous part, so that the sub-image features contain information from the original global image. As shown in fig. 2, for a group of sub-image feature maps and a group of global feature maps, each group is interactively fused from top to bottom and its four feature maps are converted into five more representative feature maps, yielding two groups of multi-scale feature maps. Feature fusion then combines the two groups at corresponding scales: the position of the sub-image within the global image is taken into account, the corresponding region is cropped out and up-sampled to the same size, and the two feature maps are fused by splicing and dimension reduction. The fused feature maps are output with the same size and dimensions as the input.
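A minimal sketch of this coupling step is given below, assuming the sub-image's location is already expressed as a box (x0, y0, x1, y1) in the coordinate system of the global feature map; the class name and the choice of bilinear up-sampling are assumptions, while the crop, splice-along-channel and dimension-reduction structure follows the description (the 3 × 3 reduction kernel follows the ablation discussed below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Crop the sub-image's region from a global feature map, up-sample it to
    the local feature map size, concatenate along channels and reduce back
    to C channels (the splice-then-reduce fusion described above)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, local_feat, global_feat, box):
        # box = (x0, y0, x1, y1): sub-image location in global feature-map coordinates.
        x0, y0, x1, y1 = box
        crop = global_feat[..., y0:y1, x0:x1]
        crop = F.interpolate(crop, size=local_feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        fused = torch.cat([local_feat, crop], dim=1)   # 2C x H x W
        return self.reduce(fused)                      # back to C x H x W
```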
After fusion, the features from the global image and from the sub-image are sent to different detection modules: an RPN generates candidate boxes, RoIAlign performs feature alignment, an FC fully-connected layer changes the feature dimensions, and then a classification layer and a regression layer perform class identification and accurate localization of the targets, yielding the target detection result of the sub-image and the target detection result of the global image, respectively.
Finally, the target detection results from the sub-image and from the global image are fused, and the detection results from the global image are screened using a filter and the NLS confidence suppression algorithm, finally obtaining the optimal detection results.
In the invention, not all levels of features are fused, because YOLOF and AugFPN point out that feature maps of different levels have different characteristics. YOLOF [40] uses only one level of input and output and achieves almost the same performance as using all inputs and outputs. Due to the difference in scale, different FPN feature levels place their own emphasis on deep semantic information and shallow detail information. This leads to the hypothesis that different levels in the FPN respond differently to the coupling mechanism of the present invention. To further verify the effect of different levels of features on the results, a set of ablation experiments can be designed.
Still further, as shown in fig. 2, the local feature extractor and the global feature extractor perform feature extraction in the same manner;
and the local feature extractor firstly performs feature extraction on the current sub-graph to be detected at 5 depths according to the sequence of gradually increasing feature dimensions by adopting a convolutional neural network, and then performs 1 × 1 convolution calculation on the sub-graph features extracted at the last four depths to obtain four sub-graph feature graphs with different scales.
Further, with reference to fig. 2, the feature pyramid network of the global local coupling mechanism converts four different-scale sub-image feature maps into five transformed multi-scale sub-image feature maps in the same manner as the four different-scale original image feature maps into five transformed multi-scale original image feature maps;
and respectively carrying out 3-by-3 convolution calculation and maximum pooling on the four different scale sub-graph feature graphs to obtain five transformed multi-scale sub-graph feature graphs.
Models based on convolutional neural networks usually reduce dimensions with a 1 × 1 convolution, and the FPN adds a 3 × 3 convolution in the feature fusion process to alleviate the superposition effect caused by up-sampling. In the model of the invention, the global feature map is also formed by up-sampling and the correlation between adjacent pixels is strong, but the up-sampling magnification of the method of the invention depends on the size of the original image, whereas the standard FPN up-samples by a fixed factor of two. On this basis, a set of ablation experiments is designed to explore how the size of the dimension-reduction convolution kernel influences the coupling mechanism. The kernel sizes {1 × 1, 3 × 3, 5 × 5, N × N} are discussed separately; the experimental results are shown in table 8, and the 3 × 3 convolution kernel stably achieves the best effect, because this size can effectively alleviate the stacking effect while avoiding excessive interference. N denotes a convolution kernel size assigned according to the up-sampling rate.
when segmenting large-size aerial images, some objects with large width or height, such as airports and large ships, may be segmented into several parts, resulting in a reduction in detection accuracy. Although the commonly used preprocessing methods maintain a certain overlap ratio, an excessively large overlap ratio can cause excessive repeated detection, and an excessively small overlap ratio has a relatively limited effect. Therefore, the embodiment designs an additional detector, aims to detect the oversize target from the thumbnail of the original image, and can effectively reduce the calculation amount and avoid influencing the precision of the main detector through some control algorithms.
The structure of the global detector is shown in fig. 1; it is similar to a two-stage object detection model and includes an RPN (candidate box generation network), a classifier and a regressor. In order to reduce the amount of computation and to focus on detecting very large targets, the global detector discards the smaller candidate boxes with a threshold function, which can be written as:

flag = 1 if max(width, height) ≥ 200, and flag = 0 otherwise,

where width and height are the width and height of the candidate box. The threshold is 200 because the overlap of the conventional segmentation preprocessing step is 200 pixels, and an object longer than this is considered likely to have been split; when flag = 0, the candidate box is discarded. However, if an object appears with the same confidence in the outputs of both the main detector and the global detector, the detection result from the global detector should be discarded, because the main detector has higher localization accuracy. Therefore, when the prediction boxes of the two detectors overlap strongly, the results from the global detector should be suppressed appropriately.
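A sketch of this filtering rule might look as follows; the (x1, y1, x2, y2) box format and the function name are assumptions, while the 200-pixel threshold follows the description.

```python
import torch

def keep_large_candidates(boxes: torch.Tensor, threshold: float = 200.0) -> torch.Tensor:
    """Drop global-detector candidate boxes whose width and height are both below
    the threshold (flag = 0); only potentially split, over-sized objects remain.
    `boxes` is an (N, 4) tensor in (x1, y1, x2, y2) form."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    flag = (widths >= threshold) | (heights >= threshold)
    return boxes[flag]
```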
The invention designs an algorithm named as non-local suppression (NLS) based on the idea of soft-NMS, and can effectively suppress the confidence coefficient from a global detector according to the overlapping rate.
IoU(g_i, l) = area(g_i ∩ l) / area(g_i ∪ l)

s_i ← s_i · (1 - max_{l ∈ L} IoU(g_i, l))

In the NLS algorithm, g_i and s_i are a prediction box from the global detector and its confidence, respectively, and L refers to all prediction boxes from the main detector. NLS linearly reduces the confidence of repeated detection results: the larger the repeated region, the greater the degree of suppression. Since the NLS algorithm does not require abrupt suppression in extreme cases, a Gaussian-form penalty function is not needed. The overlap is computed in image coordinates, where x is the abscissa and y is the ordinate of the image.
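A sketch of the NLS rule as reconstructed above is given below; using the standard axis-aligned IoU as the overlap measure is an assumption, and the function names are illustrative.

```python
import torch

def iou_matrix(boxes_a: torch.Tensor, boxes_b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) form."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def nls_suppress(global_boxes, global_scores, main_boxes):
    """Non-local suppression: linearly down-weight the confidence s_i of each
    global-detector box g_i by its largest overlap with any main-detector box."""
    if main_boxes.numel() == 0:
        return global_scores
    overlap = iou_matrix(global_boxes, main_boxes).max(dim=1).values
    return global_scores * (1.0 - overlap)
```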
Finally, in the training process, the loss function of the model can be defined as follows:
L = L_cls(t_c) + λ · L_reg(t_l) + L'_cls(t'_c) + λ · L'_reg(t'_l)

where L_cls and L_reg represent the classification and regression losses from the main detector, and L'_cls and L'_reg represent the classification and regression losses from the global detector. t_c and t_l represent the classification and localization predictions from the main detector, and t'_c and t'_l represent the predictions from the global detector. λ is the balance hyperparameter between the classification and regression losses and can be set to the same value as in Faster R-CNN.
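A sketch of this combined objective is given below; using cross-entropy for classification and smooth-L1 for regression mirrors Faster R-CNN and is an assumption, as are the function and variable names.

```python
import torch.nn.functional as F

def total_loss(cls_main, reg_main, cls_global, reg_global,
               targets_main, targets_global, lam: float = 1.0):
    """L = L_cls^m + lam * L_reg^m + L_cls^g + lam * L_reg^g  (sketch).

    cls_* are class logits, reg_* are box deltas; targets_* are (labels, deltas).
    The concrete loss types (cross-entropy / smooth L1) are assumptions here.
    """
    labels_m, deltas_m = targets_main
    labels_g, deltas_g = targets_global
    loss_cls_m = F.cross_entropy(cls_main, labels_m)
    loss_reg_m = F.smooth_l1_loss(reg_main, deltas_m)
    loss_cls_g = F.cross_entropy(cls_global, labels_g)
    loss_reg_g = F.smooth_l1_loss(reg_global, deltas_g)
    return loss_cls_m + lam * loss_reg_m + loss_cls_g + lam * loss_reg_g
```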
As an example, the predetermined overlap ratio includes overlapping 200 pixels.
By way of example, the sub-graph name format includes:
original image name _ subgraph sequence number _ subgraph start abscissa _ subgraph start ordinate.
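Parsing a sub-image name back into its components might look like the following sketch; the example name is hypothetical, and splitting from the right tolerates underscores inside the original image name.

```python
def parse_subimage_name(name: str):
    """Split "<original name>_<index>_<x0>_<y0>" into its four components."""
    original, index, x0, y0 = name.rsplit("_", 3)
    return original, int(index), int(x0), int(y0)

# Hypothetical example: parse_subimage_name("P0007_3_824_1648") -> ("P0007", 3, 824, 1648)
```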
Experimental verification:
to evaluate the method proposed by the present invention, extensive experiments were performed on the DOTA-v1.0 dataset, the DOTA-v1.5 dataset and the DOTA-v2.0 dataset. Experimental results show that the global local coupling detection mechanism (CGLNet) of the method can effectively improve the performance of the model in the three data sets. The results are shown in tables 1 to 3, all results are obtained from the official evaluation server. The mAP accuracy index of the Faster R-CNN added with the CGLNet is improved by 2.45 percent on DOTA-v1.0, the average accuracy of DOTA-v1.5 is improved by 1.56 percent, and the average accuracy of DOTA-v2 is improved by 1.67 percent. Cascade R-CNN added with CGLNet is improved by 1.96 percent on DOTA-v1.0, 1.76 percent on DOTA-v1.5 and 1.07 percent on DOTA-v 2.0. In addition, the integrated module adopting the method of the invention has improved recognition precision of most of the image categories, which indicates the importance of the global context information on object detection. The detection precision of the ground-track-field is improved by 9.15%, the detection precision of the sodium-ball-field is improved by 16.20%, and the detection precision of the container-chord is improved by 8.65%.
In addition, performance comparison is carried out with other two types of target detection models, one type is a reference detection model widely used in the field of natural images, the other type is a special detection model used in the field of aerial remote sensing images, and comparison results are shown in tables 4 to 5. The first class comprises SSD, YOLOv2, RetinaNet, YOLOv3, YOLOv4, R-FCN, Mask R-CNN, PANET, Faster R-CNN, Cascade R-CNN and Faster R-CNN H-OBB; the second category includes RICNN, ORConv, USB-BBR, OFIC, MS-VANs, GLNet, ICN, IAD R-CNN, CDD-Net, CAD-Net, SCANet. Compared with the existing methods, the method can achieve better performance on the basis of the reference model, and the improvement of the method is not limited to a specific class, but can improve the detection accuracy of most classes. The category with larger average target size has greater accuracy gain under the method of the present invention because the method of the present invention focuses more on large-scale global context information. In order to more clearly show the detection performance improvement achieved by the method of the present invention, the detection results of the present invention are shown from the perspective of the sub-image and the original image, respectively, as shown in fig. 4 to fig. 31.
In order to explore more rigorously the influence of each module and structure of the invention on the detection results, this embodiment designs and carries out detailed ablation experiments to analyze the influence of the individual modules and of the structural hyperparameters of the coupling mechanism. To analyze the importance of each sub-module in CGLNet, GL-FPN and the global detector were applied separately to the reference model to verify their validity on the three datasets. The reference model is Faster R-CNN with ResNet50-FPN. Table 6 shows the ablation study results for each component. Each module improves the model's performance on the three datasets, and their combination achieves even better results, proving that there is no conflict between these sub-modules. Among the three datasets, GL-FPN greatly improves the performance on DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0, by 2.04%, 1.18% and 1.47%, respectively. DOTA-v1.0 improves the most because it contains the fewest objects of the three datasets. If GL-FPN is removed and the global detector alone is added, detection performance still improves by more than 0.2%. The performance improvement of the global head module remains stable over the three datasets, demonstrating that the splitting of over-sized objects is widespread across multiple datasets.
Since the effect of global context information is scale-sensitive, the global context does not yield better results for objects of all scales. To explore this further, additional ablation experiments can be performed, as shown in table 7. The {P2, P3} features have a lower down-sampling rate and are rich in detail information, and context features that attend to global information would interfere with their feature expression, so these two levels are not considered in the feature-level combination. Different combinations nevertheless produce slightly different performance, and {P4} alone achieves the best results on all three datasets. This is consistent with other work in that the intermediate level of the FPN pyramid contains the most balanced mix of semantic information and detail, so this level is the most suitable for fusing global information. Using {P2} or {P3} can still produce better-than-baseline results only on DOTA-v1.0, while detection performance on DOTA-v1.5 decreases instead, because the global features affect the expression of detailed features. The accuracy trends are similar across the three datasets, which also demonstrates the stability of the method of the invention.
The specific embodiment is as follows:
the mAP was used as an evaluation index.
To verify the performance of the invention on large aerial images, the DOTA [17] dataset was used. DOTA is a large-scale aerial image dataset for object detection that contains many larger-size images, mainly from Google Earth, satellites and aerial photography. There are three versions of the DOTA dataset: DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0. The HBB (horizontal bounding box) annotations are used.
DOTA-v1.0 has 2806 images in total, ranging in size from 800 to 4000 pixels, and includes 15 classes and 188282 objects. DOTA-v1.5 annotates many small targets that are difficult to detect and, in addition, adds a new category called "container crane"; this version contains 402089 instances. DOTA-v2.0 is very different from the previous versions, with 18 categories (adding "airport" and "helipad"), 11286 pictures and 1793658 targets. The results in the experiments are all test-set accuracies.
To test the performance of the module against the official baseline, the invention was implemented and evaluated with MMDetection [43]. Following the official baseline settings, each original image is split into 1024 × 1024 local images with a stride of 824. In the inference phase, the inference results from the local images are merged into the result for the global image with the NMS threshold set to 0.3. All experiments were performed on the Ubuntu operating system and trained on a GPU graphics card (NVIDIA GeForce GTX 1080 Ti) with a total batch size of 1. The initial learning rate was set to 0.0025 and the learning-rate schedule was the same as the "1x" schedule. The remaining hyperparameters are consistent with those of the official baseline; for example, the maximum number of candidate boxes and the maximum number of objects per image are set to 2000.
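A minimal MMDetection-2.x-style training-schedule sketch matching these settings is shown below; the field names follow MMDetection's configuration conventions, but the momentum, weight-decay and warm-up values are assumptions, and the dataset and model definitions are omitted.

```python
# Sketch of an MMDetection 2.x config fragment (assumed values are noted in comments).
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)  # lr from the text
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', warmup='linear', warmup_iters=500,
                 warmup_ratio=0.001, step=[8, 11])        # the "1x" schedule
runner = dict(type='EpochBasedRunner', max_epochs=12)
data = dict(samples_per_gpu=1, workers_per_gpu=2)         # total batch size of 1
```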
TABLE 1 comparison of Performance to the baseline model at DOTA-v1.0
FR-H is the Faster R-CNN model, CR-H is the Cascade Mask R-CNN model; P is plane, BD is baseball diamond, B is bridge, GTF is ground track field, SV is small vehicle, LV is large vehicle, S is ship, TC is tennis court, BC is basketball court, ST is storage tank, SBF is soccer-ball field, RA is roundabout, H is harbor, SP is swimming pool, HC is helicopter.
TABLE 2 comparison of Performance to the benchmark model at DOTA-v1.5
FR-H is the Faster R-CNN model, CR-H is the Cascade Mask R-CNN model; P is plane, BD is baseball diamond, B is bridge, GTF is ground track field, SV is small vehicle, LV is large vehicle, S is ship, TC is tennis court, BC is basketball court, ST is storage tank, SBF is soccer-ball field, RA is roundabout, H is harbor, SP is swimming pool, HC is helicopter, CC is container crane.
TABLE 3 comparison of Performance to the baseline model at DOTA-v2.0
FR-H is the Faster R-CNN model, CR-H is the Cascade Mask R-CNN model; P is plane, BD is baseball diamond, B is bridge, GTF is ground track field, SV is small vehicle, LV is large vehicle, S is ship, TC is tennis court, BC is basketball court, ST is storage tank, SBF is soccer-ball field, RA is roundabout, H is harbor, SP is swimming pool, HC is helicopter, CC is container crane, A is airport, He is helipad.
TABLE 4 comparison of Performance at DOTA-v1.0 with mainstream model
FR-H is the Faster R-CNN model, CR-H is the Cascade Mask R-CNN model, and FR-H* denotes a model using a different feature extraction structure from FR-H.
TABLE 5 comparison of Performance to mainstream model at DOTA-v1.5
FR-H is the Faster R-CNN model, CR-H is the Cascade Mask R-CNN model.
TABLE 6 ablation experiments in sub-modules of the invention
TABLE 7 Ablation experiment on the GL-FPN fusion level
TABLE 8 GL-FPN ablation experiments on the dimension-reduction convolution kernel size
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (10)

1. A target detection method of a large-size aerial remote sensing image is characterized by comprising the following steps,
original remote sensing images form an original remote sensing image set, and each original remote sensing image is configured with an original image name;
segmenting each original remote sensing image into sub-images with preset sizes according to a preset overlapping rate, configuring a sub-image name for each sub-image, wherein the sub-image name comprises an original image name, a sub-image serial number and initial position information of the sub-image on the original remote sensing image;
searching in an original remote sensing image set according to a sub-image name of a current sub-image to be detected to obtain a corresponding current original remote sensing image;
performing feature extraction on the current sub-graph to be detected by adopting a local feature extractor to obtain a group of multi-scale sub-graph feature graphs;
carrying out down-sampling on a current original remote sensing image, and then carrying out feature extraction by adopting a global feature extractor to obtain a group of multi-scale original image feature maps;
performing feature fusion on the multi-scale sub-image feature map and the multi-scale original image feature map by adopting a feature pyramid network of a global local coupling mechanism to obtain a fused feature map;
generating candidate boxes from the fused feature maps through a first RPN, aligning the features with a first RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain one-dimensional fused feature maps, and performing classification and regression operations on the one-dimensional fused feature maps to obtain the fused target prediction result;
performing multi-scale transformation on the group of multi-scale original image feature maps, generating candidate boxes through a second RPN and filtering them, performing feature alignment with a second RoIAlign, changing the feature dimensions through an FC fully-connected layer to obtain a one-dimensional global feature map, performing classification and regression on the one-dimensional global feature map respectively, and performing NLS filtering to obtain the filtered global prediction result;
and re-fusing the fused target prediction result and the filtered global prediction result to obtain the detection target.
2. The method for detecting the target of the aerial remote sensing image with large size as claimed in claim 1,
the group of multi-scale sub-graph feature graphs output by the local feature extractor comprise four different-scale sub-graph feature graphs;
the group of multi-scale original image feature maps output by the global feature extractor comprise four original image feature maps with different scales;
the feature pyramid network of the global local coupling mechanism adopts a top-down mode for interactive fusion, and four different-scale sub-image feature graphs and four different-scale original image feature graphs are respectively transformed into five transformed multi-scale sub-image feature graphs and five transformed multi-scale original image feature graphs; and fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to corresponding scales to obtain a multi-scale fused feature map.
3. The method for detecting the object in the large-size aerial remote sensing image according to claim 2,
the method for fusing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales comprises the steps of splicing the five transformed multi-scale sub-image feature maps and the five transformed multi-scale original image feature maps according to the corresponding scales, and then carrying out dimensionality reduction processing.
4. The method for detecting the object in the large-size aerial remote sensing image according to claim 3,
the process of obtaining the fused feature map comprises the following steps:
selecting the middle level of the five transformed multi-scale sub-image feature graphs to obtain a selected sub-image feature graph;
selecting the middle level of the five transformed multi-scale original image feature maps to obtain a selected original image feature map;
and cutting the designated area corresponding to the selected sub-image feature map in the selected original image feature map, up-sampling to the size of the original image feature map, and then performing feature fusion with the selected sub-image feature map to obtain a fused feature map.
5. The method for detecting the target of the aerial remote sensing image with the large size according to claim 4,
both the local feature extractor and the global feature extractor employ ResNet-50 as the feature extractor.
6. The method for detecting the object in the large-size aerial remote sensing image according to claim 5,
in a feature pyramid network of a global local coupling mechanism, selecting feature map levels by adopting a mapping function according to the size of a target candidate box; the mapping function is:
k = ⌊4 + log₂(√(wh) / 224)⌋
in the formula, k is the level number corresponding to the target candidate frame, w is the length of the target candidate frame, and h is the width of the target candidate frame.
7. The method for detecting the object in the large-size aerial remote sensing image according to claim 6,
the local feature extractor and the global feature extractor perform feature extraction in the same way;
and the local feature extractor firstly performs feature extraction on the current sub-graph to be detected at 5 depths according to the sequence of gradually increasing feature dimensions by adopting a convolutional neural network, and then performs 1 × 1 convolution calculation on the sub-graph features extracted at the last four depths to obtain four sub-graph feature graphs with different scales.
8. The method for detecting the object in the large-size aerial remote sensing image according to claim 7,
the method for converting the four different-scale sub-image feature graphs into the five converted multi-scale sub-image feature graphs by the feature pyramid network of the global local coupling mechanism is the same as the method for converting the four different-scale original image feature graphs into the five converted multi-scale original image feature graphs;
and respectively carrying out 3-by-3 convolution calculation and maximum pooling on the four different scale sub-graph feature graphs to obtain five transformed multi-scale sub-graph feature graphs.
9. The method for detecting the target of the aerial remote sensing image with large size as claimed in claim 1,
the predetermined overlap ratio includes overlapping 200 pixels.
10. The method for detecting the target of the aerial remote sensing image with large size as claimed in claim 1,
the subgraph name format comprises:
original image name _ subgraph sequence number _ subgraph start abscissa _ subgraph start ordinate.
CN202111313879.1A | 2021-11-08 | Object detection method for large-scale aerial remote sensing images | Active | CN113989645B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111313879.1A (CN113989645B) | 2021-11-08 | 2021-11-08 | Object detection method for large-scale aerial remote sensing images

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111313879.1A (CN113989645B) | 2021-11-08 | 2021-11-08 | Object detection method for large-scale aerial remote sensing images

Publications (2)

Publication Number | Publication Date
CN113989645A | 2022-01-28
CN113989645B (en) | 2025-03-21

Family

ID=79747123

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111313879.1A (Active; CN113989645B) | Object detection method for large-scale aerial remote sensing images | 2021-11-08 | 2021-11-08

Country Status (1)

Country | Link
CN (1) | CN113989645B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114549958A (en)* | 2022-02-24 | 2022-05-27 | Sichuan University | A nighttime and camouflaged target detection method based on contextual information perception mechanism
CN114782714A (en)* | 2022-02-22 | 2022-07-22 | Beijing Shenrui Bolian Technology Co., Ltd. | An image matching method and device based on context information fusion
CN116206142A (en)* | 2022-09-09 | 2023-06-02 | Harbin Institute of Technology | Compressed image target detection method based on degraded network feature learning
CN117893922A (en)* | 2024-01-25 | 2024-04-16 | China Aero Geophysical Survey and Remote Sensing Center for Natural Resources | Large-format remote sensing image semantic segmentation method and system


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110909642A (en)* | 2019-11-13 | 2020-03-24 | Nanjing University of Science and Technology | Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111179217A (en)* | 2019-12-04 | 2020-05-19 | Tianjin University | A multi-scale target detection method in remote sensing images based on attention mechanism
CN113505792A (en)* | 2021-06-30 | 2021-10-15 | Ocean University of China | Multi-scale semantic segmentation method and model for unbalanced remote sensing image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XI CHEN et al.: "Coupled Global–Local object detection for large VHR aerial images", Knowledge-Based Systems, 17 November 2022 (2022-11-17), pages 1-15 *
WANG Chaojie (王超杰): "基于全局局部耦合机制的大场景遥感图像目标检测方法研究" (Research on target detection methods for large-scene remote sensing images based on a global-local coupling mechanism), China Master's Theses Full-text Database, Information Science and Technology, no. 2024, 15 February 2024 (2024-02-15), pages 028-200 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114782714A (en)* | 2022-02-22 | 2022-07-22 | Beijing Shenrui Bolian Technology Co., Ltd. | An image matching method and device based on context information fusion
CN114549958A (en)* | 2022-02-24 | 2022-05-27 | Sichuan University | A nighttime and camouflaged target detection method based on contextual information perception mechanism
CN114549958B (en)* | 2022-02-24 | 2023-08-04 | Sichuan University | Night and camouflage target detection method based on context information perception mechanism
CN116206142A (en)* | 2022-09-09 | 2023-06-02 | Harbin Institute of Technology | Compressed image target detection method based on degraded network feature learning
CN116206142B (en)* | 2022-09-09 | 2025-07-25 | Harbin Institute of Technology | Compressed image target detection method based on degraded network feature learning
CN117893922A (en)* | 2024-01-25 | 2024-04-16 | China Aero Geophysical Survey and Remote Sensing Center for Natural Resources | Large-format remote sensing image semantic segmentation method and system

Also Published As

Publication number | Publication date
CN113989645B (en) | 2025-03-21

Similar Documents

Publication | Title
CN113989645B (en) | Object detection method for large-scale aerial remote sensing images
Wang et al. | DDU-Net: Dual-decoder-U-Net for road extraction using high-resolution remote sensing images
CN115331087B (en) | Remote sensing image change detection method and system fusing regional semantics and pixel characteristics
Xu et al. | Scale-aware feature pyramid architecture for marine object detection
CN114519819B (en) | Remote sensing image target detection method based on global context awareness
Sun et al. | IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111783523A (en) | A method for detecting rotating objects in remote sensing images
CN113837941A (en) | Training method and device for image hyper-resolution model and computer readable storage medium
CN113111740B (en) | A feature weaving method for remote sensing image target detection
CN115346071A (en) | Image classification method and system for high-confidence local feature and global feature learning
Liu et al. | CT-UNet: Context-transfer-UNet for building segmentation in remote sensing images
CN113838064A (en) | A Cloud Removal Method Using Multitemporal Remote Sensing Data Based on Branch GAN
Hu et al. | Supervised multi-scale attention-guided ship detection in optical remote sensing images
CN118230180A (en) | A remote sensing image target detection method based on multi-scale feature extraction
Zhu et al. | SIRS: Multitask joint learning for remote sensing foreground-entity image–text retrieval
Zhou et al. | Class-aware edge-assisted lightweight semantic segmentation network for power transmission line inspection
Ma et al. | Multiscale sparse cross-attention network for remote sensing scene classification
Hu et al. | FSAU-Net: a network for extracting buildings from remote sensing imagery using feature self-attention
CN116681930A (en) | Remote sensing image change detection and model training method, device and storage medium thereof
CN114463205B (en) | A vehicle target segmentation method based on dual-branch Unet noise suppression
Yang et al. | Small object detection model for remote sensing images combining super-resolution assisted reasoning and dynamic feature fusion
CN119360220A (en) | Image processing method, device, equipment and medium based on user intention
Jun et al. | Fusion of near-infrared and visible images based on saliency-map-guided multi-scale transformation decomposition
Le et al. | Acmfnet: Asymmetric convolutional feature enhancement and multiscale fusion network for change detection
Wang et al. | Automatic building extraction based on boundary detection network in satellite images

Legal Events

Date | Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
