Disclosure of Invention
In view of the above, the invention innovatively provides a multi-class pest identification and detection method and system based on a Mask-RCNN-CBAM fused attention mechanism, which aim to solve the problems of complex backgrounds and multi-scale feature extraction and to realize high-precision identification of small-target pests by combining the attention mechanism with multi-level semantic features. In order to achieve the above object, the present invention specifically adopts the following technical scheme:
a method for identifying and detecting multiple types of pests based on Mask-RCNN-CBAM fusion attention mechanism comprises the following steps:
Collecting multiple types of pest image data and constructing an experimental data set;
Introducing a CBAM attention mechanism module in the feature processing stage of the Mask-RCNN model and connecting it with a multi-scale feature enhancement module to construct a Mask-RCNN-CBAM deep learning network;
Carrying out model training on the Mask-RCNN-CBAM deep learning network through the experimental data set;
Acquiring real-time data and inputting the real-time data into the trained Mask-RCNN-CBAM deep learning network to obtain multi-class pest identification and detection results.
Optionally, the CBAM attention mechanism module includes a channel attention module and a spatial attention module that are sequentially connected. After the backbone features of the input image are extracted by the Backbone, they are sent to the CBAM attention mechanism module. Within the module, the channel attention module first acts on the input features and fuses the generated channel attention features with the original features to produce the final channel attention features; these are then fed into the spatial attention module, and the final attention features are obtained through pooling and convolution operations.
Optionally, the channel attention module aggregates the spatial information of the channel feature map by means of an average pooling operation and a max pooling operation to generate two different spatial context descriptors, representing the average-pooled feature and the max-pooled feature respectively. The two descriptors are then passed to a shared network consisting of a multi-layer perceptron (MLP) with one hidden layer whose activation size is set to C/r, and a 1×1×C channel attention map Mc(F) is finally generated, where C is the number of channels, r is the reduction ratio, and the activation function is ReLU. After the shared network is applied to each descriptor, the resulting feature vectors are summed element-wise to produce the output feature vector.
Optionally, the calculation formula of the channel attention is:
Mc(F)=σ(MLP(AvgPool(F))+MLP(MaxPool(F)))=σ(W1(W0(FCavg))+W1(W0(FCmax)));
wherein FCmax=MaxPool(F) is the global max-pooled feature, FCavg=AvgPool(F) is the global average-pooled feature, σ is the sigmoid function, and W0 and W1 are the MLP weights shared by the two input features.
Optionally, the spatial attention module generates a feature map by exploiting the spatial relationships of the features. Two pooling operations are used to aggregate the channel information of the feature map and generate two two-dimensional feature maps, representing the average-pooled feature and the max-pooled feature respectively. The two maps are concatenated and convolved by a standard convolution layer, and the result is passed through a sigmoid function to generate the spatial attention feature map.
Optionally, the calculation formula of the spatial attention is:
Ms(F)=σ(f7×7([AvgPool(F); MaxPool(F)]))=σ(f7×7([Favg; Fmax]));
wherein Ms(F) is the spatial attention feature map, σ is the sigmoid function, and f7×7 denotes a convolution operation with a filter size of 7×7.
Optionally, the multi-scale feature enhancement module comprises two main parts: top-down downsampling feature extraction and bottom-up upsampling feature extraction. The downsampling part aims to extract deeper semantic information and stronger image features, while the upsampling feature extraction focuses on the spatial information of shallow features and transfers it to deeper layers for fusion, so that the features transferred from the upper layer are finally integrated with the shallow spatial information and the deep semantic information. In the feature transfer process, a dual-channel downsampling convolution operation is adopted, which effectively reduces the loss of detail features during downsampling and ensures that image detail features are retained.
Optionally, the dual-channel downsampling convolution operation fuses two transition modules: the left branch performs a 2×2 max pooling operation followed by a 1×1 convolution layer, the right branch performs a 1×1 convolution layer followed by a 3×3 convolution layer with a stride of 2×2, and the outputs of the two branches are superimposed to form the output.
Optionally, the Mask-RCNN-CBAM deep learning network loss function includes:
L=Lcls+Lbox+Lmask;
where Lcls denotes a classification loss for judging the category to which each ROI belongs, Lbox denotes a bounding box offset loss for regressing the bounding box of each ROI, and Lmask denotes a pixel segmentation mask generation loss for generating a mask for each ROI and each category.
Optionally, a Mask-RCNN-CBAM fusion attention mechanism-based multi-type pest identification and detection system comprises:
The acquisition module is used for acquiring multiple types of pest image data and constructing an experimental data set;
The model construction module is used for introducing CBAM attention mechanism module in the feature processing stage of the Mask-RCNN model and connecting with the multi-scale feature enhancement module to construct a Mask-RCNN-CBAM deep learning network;
the training module is used for carrying out model training on the Mask-RCNN-CBAM deep learning network through the experimental data set;
The recognition detection module is used for acquiring real-time data and inputting the real-time data into the trained Mask-RCNN-CBAM deep learning network to obtain multiple types of pest recognition detection results.
Compared with the prior art, the invention discloses a method and a system for identifying and detecting various pests based on Mask-RCNN-CBAM fusion attention mechanism, which have the following beneficial effects:
The invention provides a multi-class pest identification and detection method based on a Mask-RCNN-CBAM fused attention mechanism. The method collects multi-class pest image data and constructs an experimental data set; introduces a CBAM attention mechanism module in the feature processing stage of the Mask-RCNN model and connects it with a multi-scale feature enhancement module to construct a Mask-RCNN-CBAM deep learning network; trains the Mask-RCNN-CBAM deep learning network on the experimental data set; and finally collects real-time data and inputs it into the trained Mask-RCNN-CBAM deep learning network to obtain multi-class pest identification and detection results.
In the input stage, the image passes through the ResNet backbone feature extraction layer, and the extracted features are sent to the CBAM attention mechanism module, where the attention mechanism helps the model focus on important regions. The channel attention module and the spatial attention module are sequentially connected, which improves the network's ability to learn the features of complex objects and avoids false positives and false negatives under complex backgrounds. The network also introduces a multi-scale feature fusion pyramid module that transfers features across different scales, alleviating the impact of insufficient multi-scale features on target detection. To further address the information loss caused by conventional downsampling, a dual-channel downsampling module is introduced, which uses two branches to retain information and extract features, improving model accuracy.

The method provided by the invention performs better in detecting pest-dense areas and extracting pests. The Mask-RCNN-CBAM network effectively filters complex background information, improves the recognition rate of pest features, and is less affected by background interference. In various pest detection tasks, the Mask-RCNN-CBAM network shows good recognition performance, higher accuracy, and smoother masks. This is because the network integrates a dual attention mechanism that gives greater weight to pest features, reduces the influence of background features, and effectively distinguishes pests from the background. In addition, the feature-enhanced FPN (feature pyramid network) fuses the shallow and deep features of the image, further enriching detail information and remarkably improving the detection precision and efficiency of the network.

Aiming at the false alarms and missed detections in pest extraction caused by complex backgrounds and densely overlapping pests, the proposed Mask-RCNN-CBAM pest extraction network with an attention mechanism integrates the CBAM attention mechanism into the feature processing process and enhances multi-scale feature extraction and fusion through the feature pyramid module, strengthening the network's ability to extract contextual information and reducing the loss of image detail features. Experimental results show that the Mask-RCNN-CBAM network achieves an excellent target extraction effect on the pest dataset, obtains higher accuracy and better performance especially under complex backgrounds and dense pest conditions, achieves the highest performance in terms of Precision, F1 score, Recall, and related metrics, and has lower false alarm and missed detection rates, indicating that the network model is reliable and applicable. Compared with other pest extraction methods, the Mask-RCNN-CBAM network better extracts pest feature information and optimizes detail information.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a method for identifying and detecting multiple types of pests based on Mask-RCNN-CBAM fusion attention mechanism, which comprises the following steps:
Collecting multiple types of pest image data and constructing an experimental data set;
Introducing a CBAM attention mechanism module in the feature processing stage of the Mask-RCNN model and connecting it with a multi-scale feature enhancement module to construct a Mask-RCNN-CBAM deep learning network;
Carrying out model training on the Mask-RCNN-CBAM deep learning network through the experimental data set;
Acquiring real-time data and inputting the real-time data into the trained Mask-RCNN-CBAM deep learning network to obtain multi-class pest identification and detection results.
In a specific embodiment, collecting multiple classes of pest image data, constructing an experimental dataset includes:
Firstly, the training data set is filtered to remove low-quality image samples. To ensure high recognition accuracy of the deep learning algorithm, the image resolution must meet a high standard, so images with a resolution lower than 4096×2160 and images containing incomplete leaves or pests are removed. Images from different seasons are then sampled according to a certain proportion so that each pest class has enough images to meet the deep learning training requirements.
In view of the diversity of pest pictures, a dataset containing multiple types of pests was constructed. All pictures are processed uniformly before the experiment: each picture is cut into nine equal parts, and the resolution of each cropped picture is set to 1824 pixels × 1216 pixels. To ensure the diversity of the experimental samples, data augmentation is used to increase the size of the dataset, including horizontal flipping, vertical flipping, 90-degree rotation, 180-degree rotation, 270-degree rotation, and noise addition; through these operations the dataset size is increased 7-fold, to 7000 images in total. Annotation is performed with the LabelMe tool, and the labels of each picture are stored alongside it as a JSON file. For Mask-RCNN training, the dataset follows the COCO format specification, including picture information, annotation details, class definitions, and so on. Finally, the dataset is split into a training set, a validation set, and a test set at a ratio of 6:1:3; a minimal sketch of the augmentation and split steps is given below. This example only annotates pests that caused severe damage to the crop; pests that caused mild damage, beneficial insects, and non-crop pests are not annotated.
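For illustration only, the following Python sketch mirrors the augmentation and 6:1:3 split steps described above. The directory layout, file naming, and Gaussian noise level are assumptions introduced for this example; LabelMe annotation and COCO conversion are not shown.

```python
import random
from pathlib import Path

import numpy as np
from PIL import Image


def _noisy(a: np.ndarray) -> np.ndarray:
    """Add mild Gaussian noise (std=10 is an assumed value)."""
    noise = np.random.normal(0, 10, a.shape)
    return np.clip(a.astype(np.int16) + noise, 0, 255).astype(np.uint8)


AUGMENTATIONS = {
    "hflip":  lambda a: np.fliplr(a),
    "vflip":  lambda a: np.flipud(a),
    "rot90":  lambda a: np.rot90(a, 1),
    "rot180": lambda a: np.rot90(a, 2),
    "rot270": lambda a: np.rot90(a, 3),
    "noise":  _noisy,
}


def augment_dataset(src_dir: str, dst_dir: str) -> None:
    """Write the original tile plus six augmented copies (7x the dataset size)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.jpg")):
        arr = np.asarray(Image.open(img_path).convert("RGB"))
        Image.fromarray(arr).save(dst / img_path.name)
        for name, fn in AUGMENTATIONS.items():
            out = np.ascontiguousarray(fn(arr))
            Image.fromarray(out).save(dst / f"{img_path.stem}_{name}.jpg")


def split_dataset(items, ratios=(6, 1, 3), seed=0):
    """Split samples into train/val/test at the 6:1:3 ratio used in the text."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```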
In a specific embodiment, a method for identifying and detecting multiple types of pests based on Mask-RCNN-CBAM fusion attention mechanism comprises the following specific implementation steps:
The network architecture employed in this embodiment is based on the classical Mask-RCNN architecture. Mask-RCNN has a simple structure and can be used for various tasks such as object detection, semantic segmentation, instance segmentation, and human pose recognition. The network structure is shown in fig. 1. The Mask-RCNN model follows the idea of Faster-RCNN, adopts a ResNet-FPN architecture for feature extraction, and adds a mask prediction branch for semantic segmentation, so it has both detection and extraction capabilities. Unlike Faster-RCNN, which uses VGG as the backbone feature extraction network, Mask-RCNN uses ResNet-50 or ResNet-101 as the backbone and adds a Feature Pyramid Network (FPN) to the backbone structure, with different backbone combinations generating feature layers of different sizes. A Region Proposal Network (RPN) then performs rough screening by generating anchor boxes at each point of the effective feature layers, producing local feature layers. ROI Align is applied to the local feature layers, classification, regression, and mask models are introduced for classification and mask generation, and the classification and segmentation results are finally output.
As shown in fig. 2, the Mask-RCNN-CBAM deep learning network proposed in this embodiment introduces a CBAM (Convolutional Block Attention Module) attention mechanism in the feature processing stage on top of the original Mask-RCNN structure. The CBAM module weights the feature information of different regions by enhancing the channel and spatial attention of the image, so that the model can effectively learn and highlight target features. Compared with the original network, the enhanced feature pyramid module adds a bottom-up feature transmission path so that shallow features can be smoothly transferred to deeper layers and effectively fused; at the same time, the features transferred from the upper layer are fused with the shallow spatial information and the deep semantic information. In the feature transfer process, a dual-channel downsampling convolution operation is used, which reduces the loss of detail features during downsampling, retains image detail features, and improves the feature extraction and detail optimization capability of the network.
Specifically, in the input stage, the image passes through the ResNet backbone feature extraction layer, and the extracted features are sent to the CBAM attention mechanism module, where the attention mechanism helps the model focus on important regions. The channel attention module and the spatial attention module are sequentially connected, which improves the network's ability to learn the features of complex objects and avoids false positives and false negatives under complex backgrounds. The network also introduces a multi-scale feature fusion pyramid module that transfers features across different scales, alleviating the impact of insufficient multi-scale features on target detection. To further address the information loss caused by conventional downsampling, a dual-channel downsampling module is introduced, which uses two branches to retain information and extract features, improving model accuracy.
In a specific embodiment, the CBAM attention mechanism module includes:
CBAM is an optimization module that combines a channel attention module and a spatial attention module. The backbone features of the image are extracted by the Backbone and then sent to the CBAM module, where the channel attention module aggregates the features generated by channel attention with the input features to produce the final channel attention features. The resulting features are passed to the spatial attention module, and the final attention features are obtained through pooling and convolution operations. As an attention mechanism, the CBAM module can sequentially infer attention maps along the channel and spatial dimensions for adaptive feature refinement. This mechanism is very effective for multi-scale feature extraction tasks: it highlights the effective features of the target and reduces redundant information. The present embodiment uses the inter-channel relationships of the features to generate a channel attention map; each channel of the feature map acts as a specific feature detector, and the channels of different features are assigned corresponding weighting coefficients. The channel attention mechanism computes channel attention efficiently by focusing on meaningful features and compressing the spatial dimension of the input feature map, which allows the model to provide a more accurate feature representation for feature-rich regions in the image.
(1) As shown in fig. 3, the channel attention module utilizes the max-pooled output and the average-pooled output of a shared network. The specific flow is as follows: first, the spatial information of the channel feature map is aggregated by an average pooling operation and a max pooling operation to generate two different spatial context descriptors, representing the average-pooled feature and the max-pooled feature respectively; second, the generated features are passed to a shared network, which produces a 1×1×C channel attention map Mc(F). The shared network consists of a multi-layer perceptron (MLP) with one hidden layer. To reduce the parameter cost, the activation size of the hidden layer is set to C/r, where C is the number of channels, r is the reduction ratio, and the activation function is ReLU. After the shared network is applied to each descriptor, the resulting feature vectors are summed element-wise and the output feature vector is produced. In short, the channel attention is calculated as follows:
Mc(F)=σ(MLP(AvgPool(F))+MLP(MaxPool(F)))=σ(W1(W0(FCavg))+W1(W0(FCmax)));
wherein FCmax=MaxPool(F) is the global max-pooled feature, FCavg=AvgPool(F) is the global average-pooled feature, σ is the sigmoid function, and W0 and W1 are the MLP weights shared by the two input features.
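As a minimal illustration of the channel attention step above, the following Keras sketch applies the shared MLP to the average-pooled and max-pooled descriptors and re-weights the input. The reduction ratio r = 16 and the functional style are assumptions for this example, not the patent's implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers


def channel_attention(x, reduction=16):
    """Channel attention: Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    channels = x.shape[-1]
    # Shared MLP: one hidden layer of size C/r with ReLU, output size C.
    hidden = layers.Dense(channels // reduction, activation="relu")
    out = layers.Dense(channels)

    avg_desc = layers.GlobalAveragePooling2D()(x)   # FCavg, shape (B, C)
    max_desc = layers.GlobalMaxPooling2D()(x)       # FCmax, shape (B, C)

    attn = tf.nn.sigmoid(out(hidden(avg_desc)) + out(hidden(max_desc)))
    attn = layers.Reshape((1, 1, channels))(attn)   # 1 x 1 x C attention map
    return x * attn                                 # re-weight the input channels
```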
(2) As shown in fig. 4, unlike the channel attention mechanism, the spatial attention mechanism focuses on where informative features are located and mainly uses the spatial relationships of the features to generate a spatial attention feature map; spatial attention and channel attention are complementary. When computing spatial attention, average pooling and max pooling are first applied along the channel axis, and the results are concatenated to generate an efficient feature descriptor. A convolution layer is then applied to this descriptor to generate the spatial attention map Ms(F). The specific flow is as follows: first, two pooling operations aggregate the channel information of the feature map to generate two two-dimensional feature maps, Favg and Fmax, representing the average-pooled feature and the max-pooled feature respectively; second, the two maps are concatenated and convolved by a standard convolution layer and passed through a sigmoid function to generate the spatial attention feature map. In short, the spatial attention is calculated as follows:
Ms(F)=σ(f7×7([AvgPool(F); MaxPool(F)]))=σ(f7×7([Favg; Fmax]));
wherein σ is the sigmoid function and f7×7 denotes a convolution operation with a filter size of 7×7.
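A matching sketch of the spatial attention step, under the same assumptions as the channel attention example (Keras functional style; the 7×7 kernel follows the formula above):

```python
import tensorflow as tf
from tensorflow.keras import layers


def spatial_attention(x, kernel_size=7):
    """Spatial attention: Ms(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)]))."""
    # Pool along the channel axis to obtain two H x W x 1 descriptors.
    avg_desc = tf.reduce_mean(x, axis=-1, keepdims=True)   # Favg
    max_desc = tf.reduce_max(x, axis=-1, keepdims=True)    # Fmax
    concat = tf.concat([avg_desc, max_desc], axis=-1)      # H x W x 2
    attn = layers.Conv2D(1, kernel_size, padding="same",
                         activation="sigmoid")(concat)     # H x W x 1 attention map
    return x * attn                                         # re-weight spatial positions
```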
In a specific embodiment, the multi-scale feature enhancement module, as shown in fig. 5, includes two parts: top-down downsampling feature extraction and bottom-up upsampling feature extraction. Downsampling extracts deeper image features with stronger semantic information, yielding higher-level features; the shallow features handled by upsampling often carry rich spatial information about objects, which is crucial for locating objects in the image. For example, when the input image feature is 1024×1024×3, dimensionality reduction is performed using a convolution layer with a stride of 2. Since convolutional downsampling causes some feature loss, IdentityBlock modules are used to strengthen the network during sampling, allowing deeper layers to learn more complex features. After this processing, the features are labeled Ci, i ∈ [2,3,4,5], and the C5 features are fused with the C4 features of the previous level by upsampling. The fused, upsampled features are convolved to generate the new feature P4. After the bottom-up upsampling, a pooling operation is applied to the C5 features to reduce information redundancy and prevent overfitting, and new feature layers Pj, j ∈ [2,3,4,5,6], containing richer semantic and spatial information, are obtained. A simplified sketch of the added feature transmission path is given below.
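The sketch below illustrates one plausible form of the extra feature transmission path over the FPN outputs, written as a bottom-up fusion. The level ordering, equal channel counts, and the use of a plain stride-2 convolution as a stand-in for the dual-channel downsampling module are assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow.keras import layers


def feature_enhancement_path(p_features):
    """Fuse shallow spatial detail into deeper levels over FPN outputs.

    `p_features` is assumed to be [P2, P3, P4, P5], finest resolution first,
    all with the same channel count. Each enhanced level is downsampled and
    added to the next deeper pyramid level.
    """
    enhanced = [p_features[0]]                              # start from the finest level
    for p_next in p_features[1:]:
        down = layers.Conv2D(p_next.shape[-1], 3, strides=2,
                             padding="same")(enhanced[-1])  # halve spatial size
        enhanced.append(layers.Add()([down, p_next]))       # fuse shallow detail into deeper level
    return enhanced
```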
In a specific embodiment, the two-channel downsampling convolution operation specifically includes:
The downsampling module is a method for reducing the resolution of an image or feature map; however, common downsampling methods often lose detail information. To effectively reduce the feature loss that may occur during downsampling, the downsampling module is specifically optimized and improved. The improved dual-channel downsampling module fuses two transition modules: the left branch applies a 2×2 max pooling operation followed by a 1×1 convolution layer, and the right branch applies a 1×1 convolution layer followed by a 3×3 convolution layer with a stride of 2×2; the outputs of the two branches are superimposed to form the module output (see the sketch below). Compared with conventional methods, this module better captures and processes the features of the input image, improves target detection performance, reduces the size of the feature map without changing its depth, effectively reduces information loss, and ensures that the network retains important detail information. The structure is shown in fig. 6.
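A minimal Keras sketch of this dual-branch block follows; element-wise addition is assumed for the "superimposed" outputs (channel-wise concatenation would be an equally plausible reading), and the filter count is a free parameter.

```python
import tensorflow as tf
from tensorflow.keras import layers


def dual_channel_downsample(x, filters):
    """Dual-branch downsampling: max-pool + 1x1 conv on the left, 1x1 + strided 3x3 conv on the right."""
    left = layers.MaxPooling2D(pool_size=2, strides=2, padding="same")(x)
    left = layers.Conv2D(filters, 1, padding="same")(left)

    right = layers.Conv2D(filters, 1, padding="same")(x)
    right = layers.Conv2D(filters, 3, strides=2, padding="same")(right)

    return layers.Add()([left, right])   # halved spatial size, detail preserved from both paths
```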
In a specific embodiment, the Mask-RCNN-CBAM deep learning network loss function is a multi-task loss that combines the classification, localization, and segmentation mask losses, as shown in the following formula:
L=Lcls+Lbox+Lmask;
Where Lcls denotes a classification loss for determining to which category each ROI belongs, Lbox denotes a bounding box offset loss for regressing the bounding box of each ROI, and Lmask denotes a pixel segmentation mask generation loss for generating a mask for each ROI and each category.
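For illustration, a sketch of how the three terms could be combined is given below. The concrete loss choices (sparse categorical cross-entropy for the class head, Huber/smooth-L1 for the box offsets, per-pixel binary cross-entropy for the mask head) are conventional Mask R-CNN choices assumed here, not taken from the patent text.

```python
import tensorflow as tf


def mask_rcnn_total_loss(cls_true, cls_pred, box_true, box_pred, mask_true, mask_pred):
    """Illustrative multi-task loss: L = Lcls + Lbox + Lmask."""
    l_cls = tf.keras.losses.sparse_categorical_crossentropy(cls_true, cls_pred)   # class of each ROI
    l_box = tf.keras.losses.huber(box_true, box_pred)                             # bounding-box offsets
    l_mask = tf.keras.losses.binary_crossentropy(mask_true, mask_pred)            # per-pixel mask
    return tf.reduce_mean(l_cls) + tf.reduce_mean(l_box) + tf.reduce_mean(l_mask)
```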
In a specific embodiment, experiments and result analysis are carried out on the multiple pest identification and detection method based on Mask-RCNN-CBAM fusion attention mechanism:
Experimental environment
The programming language used in the experimental environment is Python 3.8.10 and the deep learning framework is TensorFlow 2.4.0. The hardware configuration is an Intel(R) Core(TM) i7-12700K × 20, the operating system is Windows, the graphics card is an NVIDIA GeForce RTX 3080, and the graphics driver stack is CUDA 11.6 with cuDNN 8.0.6. The initial learning rate is set to 0.001, the learning momentum is set to 0.9, the mini-batch size is set to 128, the weight decay is set to 0.0005, and the total number of iterations is set to 300. A sketch of these training hyper-parameters is given below.
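Expressed as code, the reported hyper-parameters might be configured as follows; note that the SGD optimizer in TF 2.4 has no built-in weight decay, so applying the 0.0005 decay as an L2 kernel regularizer is an assumption of this sketch.

```python
import tensorflow as tf

LEARNING_RATE = 0.001
MOMENTUM = 0.9
BATCH_SIZE = 128
WEIGHT_DECAY = 5e-4      # applied as L2 regularization on conv/dense kernels
ITERATIONS = 300

optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE, momentum=MOMENTUM)
l2_regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
```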
Evaluation index
In this embodiment, four precision evaluation indexes are adopted to measure the pest extraction performance of the model: the intersection over union IoU, the accuracy Precision, the recall Recall, and the F1 value. The intersection over union IoU represents the degree of overlap between the predicted region and the real region and is generally used to express the accuracy of the detected target position. Precision represents the proportion of samples predicted as positive by the model that are truly positive. Recall represents the proportion of actual pest targets that the model detects correctly. F1 is the harmonic mean of precision and recall, as shown in the following equations:
IoU=Intersection/Union;
Precision=TP/(TP+FP);
Recall=TP/(TP+FN);
F1=2×Precision×Recall/(Precision+Recall);
wherein Intersection represents the area where the model prediction region overlaps the real region, Union represents the combined area covered by the model prediction region and the real region, TP represents pixels correctly classified as pests, FP represents pixels incorrectly classified as pests, and FN represents pest pixels missed by the model.
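A small helper reflecting these definitions, for illustration (inputs are raw pixel counts or areas; the function name is hypothetical):

```python
def detection_metrics(tp, fp, fn, intersection, union):
    """Return (IoU, Precision, Recall, F1) from raw counts and areas."""
    iou = intersection / union if union else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return iou, precision, recall, f1
```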
Analysis of experimental results
In order to verify the accuracy and effectiveness of the Mask-RCNN-CBAM network in pest detection, this embodiment introduces three classical networks for comparative analysis: ResNet, Faster R-CNN, and Mask R-CNN. Mask R-CNN, the best-paper winner at ICCV 2017, was jointly developed by Kaiming He and Ross Girshick of the FAIR team; it adds an instance segmentation function to Faster R-CNN and realizes target detection and pixel-level segmentation through a parallel branch network. ResNet solves the vanishing-gradient problem in deep networks by introducing residual connections, so that information and gradients can propagate efficiently, making it possible to train very deep neural networks and remarkably improving model performance. Faster R-CNN achieves end-to-end training by introducing a Region Proposal Network (RPN) and sharing convolutional features, efficiently generating candidate regions and providing higher precision and efficiency in target detection. Mask-RCNN adds a fully convolutional segmentation head to Faster R-CNN for semantic segmentation and introduces an ROI Align module to precisely align pixels and handle the segmentation problem.

To verify the effectiveness of the proposed Mask-RCNN-CBAM network, it was compared against these classical networks. To demonstrate the performance of the improved model more intuitively, this study inputs the test set data into the four networks and outputs the pest detection results for each image in an end-to-end manner. By comparing performance indexes such as mAP and F1 score between the different models, the system can accurately detect and classify various pest targets on the crop surface. The comparison of the recognition results of the models is shown in fig. 7, which presents the pest identification results for three groups of images from the pest dataset.

All of the above methods generally perform well in pest identification. However, under complex backgrounds and dense pest populations, the recognition ability of the models differs, and in terms of overall performance the Mask-RCNN-CBAM model proposed in this embodiment achieves the best recognition effect. The yellow boxes in fig. 7 mark the performance evaluation of the different methods, highlighting the advantages of this embodiment's method in reducing misdetections and missed detections through the comparison of false positives and false negatives. In the first row, the ResNet network is somewhat less accurate when extracting the masks of individual pests in a dense population, resulting in less smooth mask edges; it is also less able to capture the detailed characteristics of similar pests, so similar pests are not clearly separated in the segmentation. Both the Faster R-CNN and Mask-RCNN networks show false negatives when detecting dense pests. In the second and third rows of fig. 7, the ResNet and Faster R-CNN models perform poorly when extracting structurally complex or densely distributed pests. The three models ResNet, Faster R-CNN, and Mask-RCNN all show false negatives and false positives in the detection process, with the false negative problem of the Mask-RCNN network being particularly notable.
While ResNet is superior to Faster R-CNN in some respects of mask smoothness and segmentation quality, Mask-RCNN, particularly the version incorporating the FPN architecture and the mask prediction branch, and the further optimized Mask-RCNN-CBAM network exhibit better performance.
The method proposed in this embodiment performs better in detecting pest-dense areas and extracting pests. The Mask-RCNN-CBAM network effectively filters complex background information, improves the recognition rate of pest features, and is less affected by background interference. In various pest detection tasks, the Mask-RCNN-CBAM network shows good recognition performance, higher accuracy, and smoother masks. This benefits from its integrated dual attention mechanism, which gives greater weight to pest features while weakening the influence of background features, effectively distinguishing pests from the background. In addition, the feature-enhanced FPN (feature pyramid network) fuses the shallow and deep features of the image, enriches detail information, and remarkably improves the detection precision and operating efficiency of the network.
To evaluate the effectiveness of the proposed method, the experimental results were analyzed quantitatively. The ResNet, Faster R-CNN, Mask-RCNN, and Mask-RCNN-CBAM models were tested on the pest dataset, and the test-set results are given in Table 1. As can be seen from Table 1, the proposed method is superior to the other three methods in Precision, MIoU, and F1 score. Compared with ResNet, Precision, MIoU, and F1 are improved by 0.73%, 1.38%, and 0.11% respectively; compared with Faster R-CNN, the three indexes are improved by 2.65%, 5.14%, and 3.35% respectively; and compared with Mask-RCNN, they are improved by 2.44%, 0.6%, and 1.21% respectively. For example, in research on improving the precision of robotic mixed palletizing and depalletizing, the Mask-RCNN algorithm likewise achieved a large improvement in efficiency and precision over traditional algorithms. In terms of the recall index, the method proposed in this embodiment is improved by 2.67% and 1.64% compared with Faster R-CNN and Mask-RCNN respectively, while the improvement over ResNet is not obvious, which may be because the dataset is not large enough for the learning effect to be evident.
Table 1 Comparison of evaluation indexes with the three classical network models
In addition, as shown in Table 1, the proposed network has a parameter size of 63.73 MB, which is 1.39 MB smaller than the Mask-RCNN network; this benefits from the improved dual-channel attention module and the feature-enhanced FPN module optimizing some parameters during convolution. Compared with the original network, the method has fewer parameters, extracts and segments pests more accurately, and has better detection performance, so the improved method achieves a better balance between segmentation accuracy and efficiency.
In a specific embodiment, the influence of the attention mechanism module ablation on the extraction result is evaluated, and the specific steps comprise:
In order to evaluate the influence of the attentiveness mechanism module on the pest extraction result, ablation experiments were performed. The experimental base network was Mask-RCNN-CBAM and the data set used was the constructed pest data set. The results of the ablation experiment for quantitatively verifying the pest extraction effect by the attention mechanism module are shown in table 2.
Table 2 Comparison of metrics before and after integrating the attention mechanism module

| Method | Training time (s/iteration) | F1 (%) | AP(50) (%) | AR(small) (%) |
| --- | --- | --- | --- | --- |
| With module | 25 | 58.8 | 76.3 | 35.0 |
| Without module | 24 | 53.3 | 70.0 | 32.0 |
As shown in Table 2, after the attention mechanism module is added, the F1 score of the network improves by 5.5% and the small-target extraction accuracy improves by 3%. With the attention mechanism module, the network's ability to extract key information from pest features and distribute weights is markedly improved, which enhances its sensitivity to pest features and improves the accuracy and efficiency of feature extraction. In addition, after the attention mechanism module is added, the running time of the network increases only slightly (1 second slower), so the impact on the overall running efficiency of the network is small. The experimental results demonstrate that the attention mechanism can improve target detection accuracy, improve feature extraction performance, help the model focus on important features, reduce sensitivity to noise or irrelevant information, reduce overfitting, and accelerate convergence.
In a specific embodiment, the influence of ablating the multi-scale feature enhancement module on the extraction results is evaluated. The FPN network enhances feature information and makes full use of multi-scale features. This embodiment improves the FPN network by adding an additional feature transmission branch to better mine the detail features of the image. To verify the effectiveness of the proposed feature-enhanced pyramid module in agricultural pest detection, a quantitative experiment was carried out, and the results are shown in Table 3. The experimental results show that the module remarkably improves the model's ability to recognize pest features: after adding the module, the F1 score of the network improves by 0.4%, the AP value improves by 3.3%, and the recall rate for small-target pests such as thrips and leafhoppers improves by 6%. During the experiments it was found that adding this module somewhat lengthens the network's running time, which may have some impact on running efficiency when processing large datasets, but the accuracy of the model is clearly improved. Through this ablation experiment it can be observed that the multi-scale feature fusion pyramid module enlarges the receptive field of the Mask-RCNN network, enriches the information in the feature layers, and helps the network better understand object instances in the image and adapt to targets of different scales, thereby improving detection and segmentation accuracy.
Table 3 Comparison of metrics before and after integrating the multi-scale feature fusion pyramid module

| Method | Run time (s) | F1 (%) | AP(50) (%) | AR(small) (%) |
| --- | --- | --- | --- | --- |
| With module | 50 | 55.2 | 74.1 | 41.3 |
| Without module | 46 | 54.8 | 70.8 | 35.0 |
In a specific embodiment, to verify the effectiveness of the proposed dual-channel downsampling module, this example compares the performance of max-pooling downsampling and the dual-channel downsampling module in the network. For example, in the field of text information processing, combining max pooling and average pooling into a mixed pooling can improve a model's ability to extract text features; in sonar image target detection, a dual-channel attention model can remarkably improve detection accuracy compared with traditional methods. The dual-channel downsampling module uses a 2×2 max pooling operation followed by a 1×1 convolution compression module, combined with a branch consisting of a 1×1 convolution followed by a 3×3 convolution kernel with a stride of 2×2. This approach reduces the downsampling steps in the feature transmission process and reduces the number of channels by stacking convolution layers. The experimental results are shown in Table 4: the dual-channel downsampling module improves the F1 score, AP value, and AR(small) of the overall network by 0.8%, 3.1%, and 0.8% respectively. The experiments show that although downsampling may lead to information loss, the dual-channel downsampling technique balances the sample distribution by reducing the number of samples, which lowers computational cost and increases training speed; moreover, with appropriate methods such as random downsampling or NearMiss, the loss of important information can be minimized, so the characteristic information is preserved to some extent.
Table 4 Comparison of metrics before and after integrating the dual-channel downsampling module

| Method | F1 (%) | AP(50) (%) | AR(small) (%) |
| --- | --- | --- | --- |
| With module | 54.5 | 73.2 | 35.1 |
| Without module | 53.7 | 70.1 | 34.3 |
In a specific embodiment, the model complexity and efficiency are analyzed, the steps comprising:
After the attention mechanism module, the feature-enhanced FPN module, and the improved dual-channel downsampling module are introduced, the number of model parameters does not increase; instead, the complexity is reduced compared with the original model. Ablation experiments show that adding the CBAM attention mechanism module enhances the network's ability to extract contextual information from the images. Adding the feature-enhanced pyramid module further promotes the fusion of deep and shallow feature information, and the design of the dual-channel downsampling module effectively reduces feature loss during transmission. These improved modules effectively improve the feature extraction and analysis capability of the network in the target detection task, which in turn positively affects target recognition efficiency and accuracy. In terms of model operating efficiency, introducing the attention mechanism module and the feature-enhanced FPN module has little effect on processing time. The implementation shows that the proposed Mask-RCNN-CBAM network performs very well in terms of the operating efficiency of the pest extraction task.
Aiming at the false alarms and missed detections in pest extraction caused by complex backgrounds and densely overlapping pests, this embodiment provides a Mask-RCNN-CBAM pest extraction network integrating an attention mechanism. The CBAM attention mechanism is integrated into the feature processing process, and the feature pyramid module enhances multi-scale feature extraction and fusion, strengthening the network's ability to extract contextual information and reducing the loss of image detail features. Experimental results show that the Mask-RCNN-CBAM network achieves an excellent target extraction effect on the pest dataset, obtains higher accuracy and better performance especially under complex backgrounds and dense pest conditions, achieves the highest performance in terms of Precision, F1 score, Recall, and related metrics, and has lower false alarm and missed detection rates, indicating that the network model is reliable and applicable. Compared with other pest extraction methods, the Mask-RCNN-CBAM network better extracts pest feature information and optimizes detail information.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simpler, and the relevant points refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.