Small sample target detection method based on coordinate attention group optimization

Technical Field
The invention belongs to the technical field of computers, in particular to the technical field of computer vision; it relates to target detection in small sample scenes, and particularly to a small sample target detection method based on coordinate attention group optimization.
Background
Object detection is a fundamental computer vision task that aims to locate and identify persons or objects of interest in an image. Traditional methods obtain target features through a sliding-window mechanism and hand-crafted feature extraction algorithms such as the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG), and then predict target categories with a support vector machine. Over the past decade, deep learning methods based on convolutional neural networks have been widely applied to object detection and have remarkably improved its performance. However, deep learning methods typically require a large amount of labeled data to train a model, which makes them difficult to apply directly in data-scarce scenes. Meanwhile, manually annotating the bounding box and category of each target in an image consumes a great deal of labor, which greatly limits the industrial adoption of deep learning methods. To address the problems of data scarcity and labor cost, researchers have proposed the task of small sample (few-shot) target detection, which aims to detect targets of the same classes in unlabeled images with the help of only a small number of labeled images. The method can be applied to scenes such as industrial product defect detection and rare animal detection, while reducing the cost of manual annotation. For example, industrial product defect detection generally requires a large number of defective product samples to train a model; however, the proportion of defective products in an actual production line is small, and it is difficult to collect enough qualifying samples, so the model is insufficiently trained and tends to overfit. A small sample target detection model can instead learn to detect surface defects from only a few defective samples, avoiding these problems.
Currently, most small sample target detection methods use the two-stage Faster R-CNN (Faster Region-based Convolutional Neural Network) model as the detection framework and are trained in a meta-learning manner. During data preprocessing, small sample methods divide the dataset into a training set and a test set according to the target categories in the images, such that the target categories of the training set and the test set have no intersection. The target classes in the training set are called base classes, and the target classes in the test set are called novel classes. During training, at each iterative optimization the model randomly selects some samples containing base-class targets from the training set as a support set, and randomly selects some of the remaining samples containing targets of the same classes as a query set. The support set is a sample set with known labels, and the model must detect the targets in the query samples using the small number of support samples; together, the support set and the query set form the dataset of one small sample object detection task. These methods typically employ a dual-branch architecture, in which a backbone network with parameters shared between the support-set and query-set inputs produces the support feature maps and query feature maps. The query feature map is then cropped by the candidate boxes output by a Region Proposal Network (RPN) to obtain query target feature maps, while the support feature map is reduced to support target class feature vectors by a global pooling operation. Finally, the support target class feature vectors and the query target feature maps are fused by an attention module, and the fused feature maps are input into a detector to obtain the bounding boxes and categories of the query targets. From the small sample detection task generated at each training iteration, the model learns task-agnostic prior knowledge, so that it can quickly adapt to new detection tasks in practical applications. After training, the model detects targets of the same classes in other unlabeled samples from a small number of labeled samples containing novel-class targets.
The above methods have the following shortcomings: 1) they adopt a two-stage detection framework and rely on generated candidate boxes to locate novel-class targets, yet the model is trained only on base-class samples, so if a novel class differs greatly from the base classes, the RPN struggles to output candidate boxes that localize the novel-class targets; 2) the global pooling operation easily discards the spatial position features of the targets, which is unfavorable for accurate localization; 3) each sample in the support set is treated as an independent individual, and the similarity relationships among different targets in the support set of a specific detection task are not fully exploited, so that when targets of different classes are similar, misclassification easily occurs. In summary, to alleviate the difficulty existing methods have in capturing the spatial position features of support targets and in separating similar targets of different categories, which leads to inaccurate classification, a method is urgently needed that locates query-set targets according to the spatial positions of the support-set targets without depending on candidate boxes, and that identifies query-set targets using the similarity relationships among the support-set targets.
Disclosure of Invention
The invention aims to provide a small sample target detection method based on coordinate attention group optimization that overcomes the defects of the prior art. The method does not generate candidate boxes in advance; instead, it captures the key spatial position features of the support-set targets along the horizontal and vertical directions of the feature map through coordinate attention, guiding the detector to focus on the regions of the query set that contain targets and thereby improving localization accuracy. At the same time, a group optimization module updates the support target class vectors, exploiting the similarity relationships among different targets in the support set to obtain more discriminative class features and thereby improving the accuracy of target identification.
The method first obtains an image dataset containing target bounding-box and category labels, and then performs the following operations:
step (1): sample the image dataset to obtain a support set and a query set, input the support set and the query set into a deep visual feature extraction module, and output a support feature map set and a query feature map set;
step (2): construct a coordinate attention guiding module, whose input is a randomly initialized target parameter matrix set and the support feature map set, and whose output is a target fusion matrix set;
step (3): construct a group optimization module, whose input is the support feature map set and the bounding-box labels of the support set, and whose output is a support target class vector set;
step (4): construct a query target prediction module, whose input is the query feature map set, the target fusion matrix set and the support target class vector set, and whose output is the predicted query target bounding boxes and class probabilities;
step (5): optimize the small sample target detection model composed of the coordinate attention guiding module, the group optimization module and the query target prediction module with a stochastic gradient descent algorithm, and for a new support set and query set obtain the target bounding boxes and categories of the images in the query set through steps (1)-(4).
Further, step (1) specifically comprises:
(1-1) First scale the images in the dataset to the same size, then perform random sampling without replacement on the image samples to obtain a support set $S=\{I_i^s\}_{i=1}^{N_s}$ and a query set $Q=\{I_j^q\}_{j=1}^{N_q}$, where $I_i^s, I_j^q \in \mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ is the real number domain, $N_s$ denotes the number of image samples in the support set, $I_i^s$ denotes the i-th support image sample, $N_q$ denotes the number of image samples in the query set, $I_j^q$ denotes the j-th query image sample, $H$ denotes the image height, $W$ denotes the image width, and 3 denotes the number of RGB channels;
each support image sample $I_i^s$ carries a label $L_i=\{(c_{i,\varphi}, b_{i,\varphi})\}_{\varphi=1}^{\Phi_i}$, where $\Phi_i$ denotes the number of targets in $I_i^s$, $c_{i,\varphi}\in\{1,\dots,C\}$ denotes the category of the $\varphi$-th target in $I_i^s$, and $b_{i,\varphi}\in\mathbb{R}^4$ denotes the four-dimensional vector composed of the top-left and bottom-right corner coordinates of the $\varphi$-th target bounding box in $I_i^s$; the support set $S$ contains $CK$ targets in total, $CK=C\times K$, i.e., $\sum_{i=1}^{N_s}\Phi_i=CK$, where $C$ denotes the number of target classes and $K$ denotes the number of targets of each class;
(1-2) Construct a deep visual feature extraction module consisting of a deep convolutional network and a two-dimensional convolution layer, wherein the deep convolutional network is a 50-layer residual network ResNet-50 pre-trained on the ImageNet dataset, and the convolution kernel size of the two-dimensional convolution layer is 1×1;
(1-3) Input the support set $S$ and the query set $Q$ into the deep visual feature extraction module to obtain a support feature map set $F^s=\{f_i^s\}_{i=1}^{N_s}$ and a query feature map set $F^q=\{f_j^q\}_{j=1}^{N_q}$, where $f_i^s$ denotes the i-th support feature map, $f_j^q$ denotes the j-th query feature map, $f_i^s, f_j^q\in\mathbb{R}^{H'\times W'\times 256}$, and $H'$, $W'$ and 256 denote the height, width and number of channels of a single feature map, respectively.
Still further, step (2) specifically comprises:
(2-1) Randomly initialize a target parameter matrix set $A=\{A_j\}_{j=1}^{N_q}$, $A_j\in\mathbb{R}^{M\times 256}$, where $M$ is the number of target parameter vectors per query sample; construct a coordinate attention guiding module consisting of a coordinate attention sub-module and a cross attention sub-module, wherein the coordinate attention sub-module computes a weight for each coordinate position along the horizontal and vertical directions of the feature map to obtain a spatial position attention feature map set, and the cross attention sub-module fuses the target parameter matrix set with the spatial position attention feature map set;
(2-2) For the i-th support feature map $f_i^s$, perform average pooling along the horizontal and vertical coordinate directions respectively to obtain a horizontal feature map $x_i^{\mathrm{hor}}\in\mathbb{R}^{H'\times 1\times 256}$ and a vertical feature map $x_i^{\mathrm{ver}}\in\mathbb{R}^{1\times W'\times 256}$: let $f_i^s(h',w',\mu)$ denote the value of tensor $f_i^s$ at coordinate position $(h',w')$ in the $\mu$-th channel, $1\le\mu\le 256$, $1\le h'\le H'$, $1\le w'\le W'$; the value of tensor $x_i^{\mathrm{hor}}$ at coordinate position $(h',1)$ in the $\mu$-th channel is $x_i^{\mathrm{hor}}(h',1,\mu)=\frac{1}{W'}\sum_{w'=1}^{W'} f_i^s(h',w',\mu)$, and the value of tensor $x_i^{\mathrm{ver}}$ at coordinate position $(1,w')$ in the $\mu$-th channel is $x_i^{\mathrm{ver}}(1,w',\mu)=\frac{1}{H'}\sum_{h'=1}^{H'} f_i^s(h',w',\mu)$;
(2-3) Input the horizontal feature map $x_i^{\mathrm{hor}}$ and the vertical feature map $x_i^{\mathrm{ver}}$ successively into a two-dimensional convolution layer and an activation function layer to obtain a horizontal weight feature map $g_i^{\mathrm{hor}}=\sigma(\mathrm{Conv1}(x_i^{\mathrm{hor}}))$ and a vertical weight feature map $g_i^{\mathrm{ver}}=\sigma(\mathrm{Conv1}(x_i^{\mathrm{ver}}))$, where $\mathrm{Conv1}(\cdot)$ denotes a two-dimensional convolution layer with a 1×1 kernel and $\sigma(\cdot)$ denotes the Sigmoid activation function;
(2-4) From the support feature map $f_i^s$, the horizontal weight feature map $g_i^{\mathrm{hor}}$ and the vertical weight feature map $g_i^{\mathrm{ver}}$, compute the spatial position attention feature map $\tilde f_i^s$, whose value at coordinate position $(h',w')$ in the $\mu$-th channel is $\tilde f_i^s(h',w',\mu)=f_i^s(h',w',\mu)\times g_i^{\mathrm{hor}}(h',1,\mu)\times g_i^{\mathrm{ver}}(1,w',\mu)$, where $g_i^{\mathrm{hor}}(h',1,\mu)$ denotes the value of tensor $g_i^{\mathrm{hor}}$ at coordinate position $(h',1)$ in the $\mu$-th channel and $g_i^{\mathrm{ver}}(1,w',\mu)$ denotes the value of tensor $g_i^{\mathrm{ver}}$ at coordinate position $(1,w')$ in the $\mu$-th channel;
(2-5) Execute steps (2-2)-(2-4) on all feature maps $f_i^s$ in the support feature map set $F^s$ to obtain a spatial position attention feature map set $\tilde F^s=\{\tilde f_i^s\}_{i=1}^{N_s}$; flatten all feature maps $\tilde f_i^s$ in the set $\tilde F^s$ along the spatial dimensions and concatenate them to obtain the spatial target position feature $X\in\mathbb{R}^{(N_s H'W')\times 256}$;
(2-6) Input each target parameter matrix $A_j$ in the set $A$ together with the spatial target position feature $X$ into the cross attention sub-module to obtain the target fusion matrix set $\hat A=\{\hat A_j\}_{j=1}^{N_q}$, where $\hat A_j=\mathrm{Softmax}(A_jX^{\mathrm{T}})X$, $\mathrm{Softmax}(\cdot)$ is the normalized exponential function, and the superscript $\mathrm{T}$ denotes the transpose operation.
Still further, step (3) specifically comprises:
(3-1) Construct a group optimization module, where a group is the set of support targets belonging to the same category; a support target probability matrix and a similarity matrix are computed from the support target class labels and the support target vector set obtained below, and a group propagation matrix is obtained from the support target probability matrix and the similarity matrix;
(3-2) Input the support feature map set $F^s$ obtained in step (1-3) and each bounding box $b_{i,\varphi}$ of the label $L_i$ corresponding to each support image sample $I_i^s$ obtained in step (1-1) into a region-of-interest pooling layer to obtain a support target feature map set $O=\{O_{c,k}\}$, $1\le c\le C$, $1\le k\le K$, where the region-of-interest pooling layer performs a maximum pooling operation on the feature map within the region of each target bounding box, and $O_{c,k}$ denotes the feature map of the k-th target of the c-th class in the set;
(3-3) Apply convolution and global average pooling successively to the support target feature map set $O$ to obtain a support target vector set $V=\{v_{c,k}\}$, where the feature vector of the k-th target of the c-th class is $v_{c,k}=\mathrm{GAP}(\mathrm{Conv2}(O_{c,k}))\in\mathbb{R}^{256}$, $\mathrm{GAP}(\cdot)$ denotes a global average pooling operation over the spatial dimensions of the feature map, and $\mathrm{Conv2}(\cdot)$ denotes a two-dimensional convolution layer with a 3×3 kernel;
(3-4) Compute the support target class vector set $\bar V=\{\bar v_c\}_{c=1}^{C}$, where the class vector of the c-th class is $\bar v_c=\frac{1}{K}\sum_{k=1}^{K}v_{c,k}$; obtain the support target probability matrix $P\in\mathbb{R}^{CK\times C}$ from the support target class vector set, where the value in the u-th row and v-th column of matrix $P$, representing the probability that the u-th support target vector belongs to the v-th class, is $P_{u,v}=\frac{\exp(-\mathrm{dist}(v_u,\bar v_v))}{\sum_{v'=1}^{C}\exp(-\mathrm{dist}(v_u,\bar v_{v'}))}$, $1\le u\le CK$, $1\le v\le C$, $u=(c-1)\times K+k$, $v_u$ denotes the u-th support target vector in the set $V$ (i.e., $v_u=v_{c,k}$), $\exp(\cdot)$ denotes the exponential function with the natural constant e as base, and $\mathrm{dist}(\cdot,\cdot)$ denotes the Euclidean distance function;
(3-5) Compute the similarity matrix $Z\in\mathbb{R}^{CK\times CK}$ between support target vectors with a Gaussian kernel function, and construct the group propagation matrix $\Lambda=ZP$ from the similarity matrix $Z$ and the support target probability matrix $P$; the value in the u-th row and w-th column of matrix $Z$, representing the similarity between support target vectors $v_u$ and $v_w$, is $Z_{u,w}=\exp\!\left(-\mathrm{dist}(v_u,v_w)^2/\gamma\right)$, $1\le w\le CK$, where the hyperparameter of the Gaussian kernel satisfies $\gamma>0$ and $v_w$ denotes the w-th support target vector in the set $V$; the value $\Lambda_{u,v}$ in the u-th row and v-th column of the group propagation matrix $\Lambda=ZP$ represents the probability that support target vector $v_u$ belongs to the v-th class, obtained by weighting and summing according to the similarities between samples;
(3-6) Initialize an iteration upper limit $\Psi$ and iteratively optimize the support target probability matrix $P$ through the group propagation matrix $\Lambda$, the similarity matrix $Z$ and a label propagation algorithm until the number of iterations reaches $\Psi$, obtaining the updated support target probability matrix $P^{(\Psi)}$; at initialization $P^{(0)}=P$ and $\Lambda^{(0)}=\Lambda$, and each label propagation iteration computes $\Lambda^{(\psi)}=ZP^{(\psi)}$ and $P^{(\psi+1)}_{u,v}=\Lambda^{(\psi)}_{u,v}\big/\sum_{v'=1}^{C}\Lambda^{(\psi)}_{u,v'}$, with $100\le\Psi\le 120$, where $P^{(\psi)}$ and $\Lambda^{(\psi)}$ denote the support target probability matrix and the group propagation matrix at the $\psi$-th iteration, and $P^{(\psi)}_{u,v}$ and $\Lambda^{(\psi)}_{u,v}$ denote the values in the u-th row and v-th column of matrices $P^{(\psi)}$ and $\Lambda^{(\psi)}$;
(3-7) Using the updated support target probability matrix $P^{(\Psi)}$ and the support target vector set $V$, compute a new support target class vector set $\bar V'=\{\bar v_c'\}_{c=1}^{C}$, where the class vector of the c-th class is $\bar v_c'=\sum_{u=1}^{CK}P^{(\Psi)}_{u,c}v_u\big/\sum_{u=1}^{CK}P^{(\Psi)}_{u,c}$, $P^{(\Psi)}$ denotes the target probability matrix after the $\Psi$-th iteration, and $P^{(\Psi)}_{u,c}$, the value in the u-th row and c-th column of matrix $P^{(\Psi)}$, represents the probability that the updated u-th support target vector $v_u$ belongs to the c-th class.
Further, step (4) specifically comprises:
(4-1) Construct a query target prediction module consisting of a Transformer sub-module, a target classification function and a bounding box prediction sub-module;
unfold each feature map in the query feature map set $F^q$ along the spatial dimensions to obtain a query feature matrix set $\tilde F^q=\{X_j\}_{j=1}^{N_q}$, $X_j\in\mathbb{R}^{(H'W')\times 256}$; compute the position encoding matrix of the query samples $E\in\mathbb{R}^{(H'W')\times 256}$, whose value in the $\kappa$-th row and $\omega$-th column is $E_{\kappa,\omega}=\sin\!\left(\kappa/10000^{(\omega-\omega\bmod 2)/256}\right)$ when $\omega\bmod 2=0$ and $E_{\kappa,\omega}=\cos\!\left(\kappa/10000^{(\omega-\omega\bmod 2)/256}\right)$ otherwise, $1\le\kappa\le H'\times W'$, $1\le\omega\le 256$, where $\bmod$ denotes the remainder operation;
(4-2) Input the query feature matrix set $\tilde F^q$ and the position encoding matrix $E$ into the encoder of the Transformer sub-module to obtain a query target encoding feature set $T=\{T_j\}_{j=1}^{N_q}$; the encoder consists of one attention layer $\mathrm{Att}(\cdot)$ and two fully connected layers, and the j-th query target encoding feature is $T_j=\mathrm{FFN}(\mathrm{Att}(X_j\oplus E))$, where $\oplus$ denotes element-wise addition and $\mathrm{FFN}(\cdot)$ denotes a feed-forward neural network consisting of the two fully connected layers;
(4-3) Input the query target encoding feature set $T$ and the target fusion matrix set $\hat A$ obtained in step (2-6) into the decoder of the Transformer sub-module to obtain a query target decoding feature set $D=\{D_j\}_{j=1}^{N_q}$; the decoder consists of two attention layers and two fully connected layers, and the j-th query target decoding feature is $D_j=\mathrm{FFN}(\mathrm{Att}(\bar A_j, T_j))\in\mathbb{R}^{M\times 256}$, where $\bar A_j$ denotes the intermediate result matrix obtained by inputting $\hat A_j$ into the first attention layer, and the second attention layer $\mathrm{Att}(\bar A_j, T_j)$ takes its queries from $\bar A_j$ and its keys and values from $T_j$;
(4-4) From the query target decoding feature set $D$ and the support target class vector set $\bar V'$ obtained in step (3-7), compute the target prediction class probabilities of the query set; the probability that the m-th target in the j-th query sample belongs to the c-th class is $p_{j,m,c}=\frac{\exp(-\mathrm{dist}(D_{j,m},\bar v_c'))}{\sum_{c'=1}^{C}\exp(-\mathrm{dist}(D_{j,m},\bar v_{c'}'))}$, $1\le m\le M$, where $D_{j,m}$ denotes the vector in the m-th row of matrix $D_j$;
(4-5) Input the query target decoding feature set $D$ into the bounding box prediction sub-module to obtain the set of predicted target bounding boxes of the query set $\hat B=\{\hat B_j\}_{j=1}^{N_q}$; the bounding box prediction sub-module is a multi-layer perceptron composed of three fully connected layers, the predicted bounding box matrix of the j-th query sample is $\hat B_j=\mathrm{MLP}(D_j)\in\mathbb{R}^{M\times 4}$, and its m-th row $\hat b_{j,m}=(\hat x_{j,m}^{1},\hat y_{j,m}^{1},\hat x_{j,m}^{2},\hat y_{j,m}^{2})$ is the predicted bounding box of the m-th target in the j-th query sample, where $(\hat x_{j,m}^{1},\hat y_{j,m}^{1})$ is the top-left corner coordinate of the bounding box and $(\hat x_{j,m}^{2},\hat y_{j,m}^{2})$ is the bottom-right corner coordinate.
Still further, step (5) specifically comprises:
(5-1) Compute the target classification loss from the query-set target prediction class probabilities and the cross-entropy loss function: $\mathcal{L}_{\mathrm{cls}}=-\frac{1}{N_qM}\sum_{j=1}^{N_q}\sum_{m=1}^{M}\sum_{c=1}^{C}y_{j,m,c}\log p_{j,m,c}$, where $y_{j,m,c}\in\{0,1\}$ denotes the true label value indicating whether the m-th target in the j-th query sample belongs to the c-th class;
(5-2) Compute the bounding box loss from the query-set predicted target bounding box set $\hat B$: $\mathcal{L}_{\mathrm{box}}=\frac{1}{N_qM}\sum_{j=1}^{N_q}\sum_{m=1}^{M}\left[\lVert b_{j,m}-\hat b_{j,m}\rVert_1+\left(1-\mathrm{IoU}(b_{j,m},\hat b_{j,m})\right)\right]$, where $b_{j,m}$ denotes the true bounding box of the m-th target in the j-th query sample and $\mathrm{IoU}(\cdot,\cdot)$ denotes the intersection-over-union of the true bounding box and the predicted bounding box;
(5-3) Obtain the total loss $\mathcal{L}=\mathcal{L}_{\mathrm{cls}}+\mathcal{L}_{\mathrm{box}}$ from the target classification loss and the bounding box loss; optimize the small sample target detection model composed of the coordinate attention guiding module, the group optimization module and the query target prediction module with a stochastic gradient descent algorithm, and iteratively train the model until convergence to obtain the optimized small sample target detection model;
(5-4) Sample the new image dataset to obtain a support set $S^{\mathrm{new}}$ and a query set $Q^{\mathrm{new}}$, input them into the optimized small sample target detection model, execute steps (1)-(4) in sequence, and output the target class probabilities $p_{j,m,c}$ and the bounding box set $\hat B$ of the image samples in the query set $Q^{\mathrm{new}}$; the target class index with the highest probability is selected as the predicted class.
The small sample target detection method based on coordinate attention group optimization has the following characteristics: 1) targets in the query set are detected directly with a Transformer module, without depending on generated candidate boxes; 2) coordinate attention captures the spatial position features of the support-set targets from the two spatial directions of the feature map and fuses these spatial features into the target parameter matrices, so that the detector can adaptively attend to the target-containing regions of the query set according to the support set; 3) the group optimization module updates the support target class vectors with a label propagation algorithm, fully exploiting the similarity relationships among different targets in the support set, so that the updated support target class vectors are more discriminative in the embedding space.
The invention is suitable for target detection tasks in small sample environments and has the following beneficial effects: 1) the bounding boxes and categories of query targets are predicted directly by the Transformer module, avoiding the inaccurate candidate boxes generated when a novel class differs greatly from the base classes; 2) coordinate attention captures the key spatial position features of the support targets and guides the detector to dynamically adjust its attention to the relevant regions of the query samples according to the current spatial position features of the targets, effectively improving the localization accuracy of query targets; 3) the group optimization module fully exploits the similarity relationships among different targets in the support set to update the support target class vectors, obtaining more discriminative support target class features that help the model distinguish targets of different categories, thereby improving the classification accuracy of query targets. The coordinate attention mechanism and the group optimization mechanism provided by the invention remarkably improve the performance of the small sample target detection model and can be applied in practical fields such as industrial product defect detection and rare animal detection.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description
In the small sample target detection method based on coordinate attention group optimization, as shown in Fig. 1, a support set and a query set are first sampled from the image dataset, and a support feature map set and a query feature map set are obtained with the deep visual feature extractor; the support feature map set and an initialized target parameter matrix set are then input into the coordinate attention guiding module to obtain a target fusion matrix set; next, the group optimization module is constructed, and the support target class vectors are updated through the true target bounding boxes and a label propagation algorithm to obtain a support target class vector set; finally, the query feature map set, the target fusion matrix set and the support target class vector set are input into the query target prediction module to obtain the bounding boxes and classes of the targets in the query-set samples. The method captures the spatial position features of the support-set targets with coordinate attention, so that the detector adaptively attends to the target-containing regions of the query set, improving the localization accuracy of query targets; meanwhile, discriminative support target class vectors are obtained through the group optimization module, improving the classification accuracy of query targets.
The method first obtains an image dataset containing target bounding-box and category labels, and then performs the following operations:
Step (1): sample the image dataset to obtain a support set and a query set, input the support set and the query set into the deep visual feature extraction module, and output a support feature map set and a query feature map set; this specifically comprises:
(1-1) First scale the images in the dataset to the same size, then perform random sampling without replacement on the image samples to obtain a support set $S=\{I_i^s\}_{i=1}^{N_s}$ and a query set $Q=\{I_j^q\}_{j=1}^{N_q}$, where $I_i^s, I_j^q\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ is the real number domain, the superscript s stands for "support", $N_s$ denotes the number of image samples in the support set, $I_i^s$ denotes the i-th support image sample, the superscript q stands for "query", $N_q$ denotes the number of image samples in the query set, $I_j^q$ denotes the j-th query image sample, $H$ denotes the image height, $W$ denotes the image width, and 3 denotes the number of RGB channels;
each support image sample $I_i^s$ carries a label $L_i=\{(c_{i,\varphi}, b_{i,\varphi})\}_{\varphi=1}^{\Phi_i}$, where $\Phi_i$ denotes the number of targets in $I_i^s$, $c_{i,\varphi}\in\{1,\dots,C\}$ denotes the category of the $\varphi$-th target in $I_i^s$, and $b_{i,\varphi}\in\mathbb{R}^4$ denotes the four-dimensional vector composed of the top-left and bottom-right corner coordinates of the $\varphi$-th target bounding box in $I_i^s$; the support set $S$ contains $CK$ targets in total, $CK=C\times K$, i.e., $\sum_{i=1}^{N_s}\Phi_i=CK$, where $C$ denotes the number of target classes and $K$ denotes the number of targets of each class;
(1-2) Construct a deep visual feature extraction module consisting of a deep convolutional network and a two-dimensional convolution layer, wherein the deep convolutional network is a 50-layer Residual Network (ResNet-50) pre-trained on the ImageNet dataset, and the convolution kernel size of the two-dimensional convolution layer is 1×1;
(1-3) Input the support set $S$ and the query set $Q$ into the deep visual feature extraction module to obtain a support feature map set $F^s=\{f_i^s\}_{i=1}^{N_s}$ and a query feature map set $F^q=\{f_j^q\}_{j=1}^{N_q}$, where $f_i^s$ denotes the i-th support feature map, $f_j^q$ denotes the j-th query feature map, $f_i^s, f_j^q\in\mathbb{R}^{H'\times W'\times 256}$, and $H'$, $W'$ and 256 denote the height, width and number of channels of a single feature map, respectively.
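As a concrete illustration of steps (1-2)-(1-3), the following PyTorch sketch builds the feature extractor from an ImageNet-pretrained ResNet-50 and a 1×1 convolution. Truncating the backbone after its last residual stage and the 512×512 input size are assumptions of the example; the description above fixes only the backbone choice and the 1×1 kernel.

```python
# A minimal sketch of the deep visual feature extraction module.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class DeepVisualFeatureExtractor(nn.Module):
    def __init__(self, out_channels: int = 256):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to the last residual stage; drop avgpool/fc.
        self.body = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution projecting the backbone output to 256 channels.
        self.proj = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, H, W) -> feature maps: (N, 256, H', W')
        return self.proj(self.body(images))

extractor = DeepVisualFeatureExtractor()
support = torch.randn(2, 3, 512, 512)   # toy support batch (assumed size)
print(extractor(support).shape)         # torch.Size([2, 256, 16, 16])
```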
Step (2): construct the coordinate attention guiding module, whose input is the randomly initialized target parameter matrix set and the support feature map set and whose output is the target fusion matrix set; the specific steps are as follows:
(2-1) Randomly initialize a target parameter matrix set $A=\{A_j\}_{j=1}^{N_q}$, $A_j\in\mathbb{R}^{M\times 256}$, where $M$ is the number of target parameter vectors per query sample; construct a coordinate attention guiding module consisting of a coordinate attention sub-module and a cross attention sub-module, wherein the coordinate attention sub-module computes a weight for each coordinate position along the horizontal and vertical directions of the feature map, so that position regions containing target features receive larger weights, thereby obtaining a spatial position attention feature map set, and the cross attention sub-module fuses the target parameter matrix set with the spatial position attention feature map set;
(2-2) For the i-th support feature map $f_i^s$, perform average pooling along the horizontal and vertical coordinate directions respectively to obtain a horizontal feature map $x_i^{\mathrm{hor}}\in\mathbb{R}^{H'\times 1\times 256}$ and a vertical feature map $x_i^{\mathrm{ver}}\in\mathbb{R}^{1\times W'\times 256}$, where hor stands for "horizontal" and ver stands for "vertical": let $f_i^s(h',w',\mu)$ denote the value of tensor $f_i^s$ at coordinate position $(h',w')$ in the $\mu$-th channel, $1\le\mu\le 256$, $1\le h'\le H'$, $1\le w'\le W'$; the value of tensor $x_i^{\mathrm{hor}}$ at coordinate position $(h',1)$ in the $\mu$-th channel is $x_i^{\mathrm{hor}}(h',1,\mu)=\frac{1}{W'}\sum_{w'=1}^{W'} f_i^s(h',w',\mu)$, and the value of tensor $x_i^{\mathrm{ver}}$ at coordinate position $(1,w')$ in the $\mu$-th channel is $x_i^{\mathrm{ver}}(1,w',\mu)=\frac{1}{H'}\sum_{h'=1}^{H'} f_i^s(h',w',\mu)$;
(2-3) Input the horizontal feature map $x_i^{\mathrm{hor}}$ and the vertical feature map $x_i^{\mathrm{ver}}$ successively into a two-dimensional convolution layer and an activation function layer to obtain a horizontal weight feature map $g_i^{\mathrm{hor}}=\sigma(\mathrm{Conv1}(x_i^{\mathrm{hor}}))$ and a vertical weight feature map $g_i^{\mathrm{ver}}=\sigma(\mathrm{Conv1}(x_i^{\mathrm{ver}}))$, where $\mathrm{Conv1}(\cdot)$ denotes a two-dimensional convolution layer with a 1×1 kernel and $\sigma(\cdot)$ denotes the Sigmoid activation function;
(2-4) From the support feature map $f_i^s$, the horizontal weight feature map $g_i^{\mathrm{hor}}$ and the vertical weight feature map $g_i^{\mathrm{ver}}$, compute the spatial position attention feature map $\tilde f_i^s$, whose value at coordinate position $(h',w')$ in the $\mu$-th channel is $\tilde f_i^s(h',w',\mu)=f_i^s(h',w',\mu)\times g_i^{\mathrm{hor}}(h',1,\mu)\times g_i^{\mathrm{ver}}(1,w',\mu)$, where $g_i^{\mathrm{hor}}(h',1,\mu)$ denotes the value of tensor $g_i^{\mathrm{hor}}$ at coordinate position $(h',1)$ in the $\mu$-th channel and $g_i^{\mathrm{ver}}(1,w',\mu)$ denotes the value of tensor $g_i^{\mathrm{ver}}$ at coordinate position $(1,w')$ in the $\mu$-th channel;
(2-5) Execute steps (2-2)-(2-4) on all feature maps $f_i^s$ in the support feature map set $F^s$ to obtain a spatial position attention feature map set $\tilde F^s=\{\tilde f_i^s\}_{i=1}^{N_s}$; flatten all feature maps $\tilde f_i^s$ in the set $\tilde F^s$ along the spatial dimensions and concatenate them to obtain the spatial target position feature $X\in\mathbb{R}^{(N_s H'W')\times 256}$;
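A minimal PyTorch sketch of the coordinate attention computation of steps (2-2)-(2-5) follows; it assumes channels-first (N, C, H', W') tensors and toy sizes, chosen only for the example.

```python
# A minimal sketch of the coordinate attention sub-module: average-pool
# along each spatial axis, pass each pooled map through a 1x1 convolution
# and a Sigmoid, and reweight the input feature map by both directions.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv_hor = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_ver = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, C, H', W')
        x_hor = f.mean(dim=3, keepdim=True)          # (N, C, H', 1)
        x_ver = f.mean(dim=2, keepdim=True)          # (N, C, 1, W')
        g_hor = self.sigmoid(self.conv_hor(x_hor))   # horizontal weights
        g_ver = self.sigmoid(self.conv_ver(x_ver))   # vertical weights
        return f * g_hor * g_ver                     # broadcast over H', W'

# Step (2-5): flatten the attended support maps into one spatial feature X.
ca = CoordinateAttention()
support_maps = torch.randn(4, 256, 16, 16)           # Ns=4 toy support maps
attended = ca(support_maps)                          # (Ns, 256, H', W')
X = attended.flatten(2).permute(0, 2, 1).reshape(-1, 256)  # (Ns*H'*W', 256)
```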
(2-6) Input each target parameter matrix $A_j$ in the target parameter matrix set $A$ together with the spatial target position feature $X$ into the cross attention sub-module to obtain the target fusion matrix set $\hat A=\{\hat A_j\}_{j=1}^{N_q}$, where $\hat A_j=\mathrm{Softmax}(A_jX^{\mathrm{T}})X$, $\mathrm{Softmax}(\cdot)$ is the normalized exponential function, and the superscript $\mathrm{T}$ denotes the transpose operation.
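The cross-attention fusion of step (2-6) reduces to two matrix products, as the sketch below shows for one query sample; all sizes (M = 100, Ns = 4, H' = W' = 16) are toy assumptions.

```python
# A minimal sketch of the cross attention sub-module of step (2-6).
import torch

def cross_attention_fuse(A_j: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    # A_j: (M, 256) target parameter matrix; X: (Ns*H'*W', 256) feature.
    scores = A_j @ X.T                       # (M, Ns*H'*W') attention scores
    # A scaled variant would divide scores by sqrt(256) before the Softmax.
    weights = torch.softmax(scores, dim=-1)  # normalize over spatial positions
    return weights @ X                       # target fusion matrix, (M, 256)

A_j = torch.randn(100, 256)                  # M=100 randomly initialized rows
X = torch.randn(4 * 16 * 16, 256)
A_hat = cross_attention_fuse(A_j, X)
```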
Step (3): construct the group optimization module, whose input is the support feature map set and the bounding-box labels of the support set and whose output is the support target class vector set; the specific steps are as follows:
(3-1) Construct a group optimization module, where a group is the set of support targets belonging to the same category; a support target probability matrix and a similarity matrix are computed from the support target class labels and the support target vector set obtained below, and a group propagation matrix is obtained from the support target probability matrix and the similarity matrix;
(3-2) Input the support feature map set $F^s$ obtained in step (1-3) and each bounding box $b_{i,\varphi}$ of the label $L_i$ corresponding to each support image sample $I_i^s$ obtained in step (1-1) into a region-of-interest pooling (Regions of Interest Pooling, ROI Pooling) layer to obtain a support target feature map set $O=\{O_{c,k}\}$, $1\le c\le C$, $1\le k\le K$, where the region-of-interest pooling layer performs a maximum pooling operation on the feature map within the region of each target bounding box, and $O_{c,k}$ denotes the feature map of the k-th target of the c-th class in the set;
(3-3) Apply convolution and global average pooling successively to the support target feature map set $O$ to obtain a support target vector set $V=\{v_{c,k}\}$, where the feature vector of the k-th target of the c-th class is $v_{c,k}=\mathrm{GAP}(\mathrm{Conv2}(O_{c,k}))\in\mathbb{R}^{256}$, $\mathrm{GAP}(\cdot)$ denotes a global average pooling operation over the spatial dimensions of the feature map, and $\mathrm{Conv2}(\cdot)$ denotes a two-dimensional convolution layer with a 3×3 kernel;
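Steps (3-2)-(3-3) can be sketched with torchvision's roi_pool, which performs max pooling inside each box region as described above; the 7×7 pooled size and the image size implied by spatial_scale are assumptions of the example.

```python
# A minimal sketch of ROI pooling followed by Conv2 and GAP.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)

def support_target_vectors(feature_maps, boxes_per_image, spatial_scale):
    # feature_maps: (Ns, 256, H', W'); boxes_per_image: list of (phi_i, 4)
    # tensors of (x1, y1, x2, y2) boxes in input-image coordinates.
    pooled = roi_pool(feature_maps, boxes_per_image,
                      output_size=(7, 7), spatial_scale=spatial_scale)
    reduced = conv2(pooled)            # (CK, 256, 7, 7)
    return reduced.mean(dim=(2, 3))    # global average pooling -> (CK, 256)

maps = torch.randn(2, 256, 16, 16)
boxes = [torch.tensor([[10., 10., 200., 200.]]),
         torch.tensor([[50., 40., 300., 260.]])]
vectors = support_target_vectors(maps, boxes, spatial_scale=16 / 512)
print(vectors.shape)                   # torch.Size([2, 256])
```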
(3-4) Compute the support target class vector set $\bar V=\{\bar v_c\}_{c=1}^{C}$, where the class vector of the c-th class is $\bar v_c=\frac{1}{K}\sum_{k=1}^{K}v_{c,k}$; obtain the support target probability matrix $P\in\mathbb{R}^{CK\times C}$ from the support target class vector set, where the value in the u-th row and v-th column of matrix $P$, representing the probability that the u-th support target vector belongs to the v-th class, is $P_{u,v}=\frac{\exp(-\mathrm{dist}(v_u,\bar v_v))}{\sum_{v'=1}^{C}\exp(-\mathrm{dist}(v_u,\bar v_{v'}))}$, $1\le u\le CK$, $1\le v\le C$, $u=(c-1)\times K+k$, $v_u$ denotes the u-th support target vector in the set $V$ (i.e., $v_u=v_{c,k}$), $\exp(\cdot)$ denotes the exponential function with the natural constant e as base, and $\mathrm{dist}(\cdot,\cdot)$ denotes the Euclidean distance function;
(3-5) Compute the similarity matrix $Z\in\mathbb{R}^{CK\times CK}$ between support target vectors with a Gaussian kernel function, and construct the group propagation matrix $\Lambda=ZP$ from the similarity matrix $Z$ and the support target probability matrix $P$; the value in the u-th row and w-th column of matrix $Z$, representing the similarity between support target vectors $v_u$ and $v_w$, is $Z_{u,w}=\exp\!\left(-\mathrm{dist}(v_u,v_w)^2/\gamma\right)$, $1\le w\le CK$, where the hyperparameter of the Gaussian kernel satisfies $\gamma>0$ and $v_w$ denotes the w-th support target vector in the set $V$; the value $\Lambda_{u,v}$ in the u-th row and v-th column of the group propagation matrix $\Lambda=ZP$ represents the probability that support target vector $v_u$ belongs to the v-th class, obtained by weighting and summing according to the similarities between samples;
(3-6) Initialize an iteration upper limit $\Psi$ and iteratively optimize the support target probability matrix $P$ through the group propagation matrix $\Lambda$, the similarity matrix $Z$ and a label propagation algorithm until the number of iterations reaches $\Psi$, obtaining the updated support target probability matrix $P^{(\Psi)}$; at initialization $P^{(0)}=P$ and $\Lambda^{(0)}=\Lambda$, and each label propagation iteration computes $\Lambda^{(\psi)}=ZP^{(\psi)}$ and $P^{(\psi+1)}_{u,v}=\Lambda^{(\psi)}_{u,v}\big/\sum_{v'=1}^{C}\Lambda^{(\psi)}_{u,v'}$, where the value range of $\Psi$ is $100\le\Psi\le 120$, $P^{(\psi)}$ and $\Lambda^{(\psi)}$ denote the support target probability matrix and the group propagation matrix at the $\psi$-th iteration, and $P^{(\psi)}_{u,v}$ and $\Lambda^{(\psi)}_{u,v}$ denote the values in the u-th row and v-th column of matrices $P^{(\psi)}$ and $\Lambda^{(\psi)}$;
(3-7) Using the updated support target probability matrix $P^{(\Psi)}$ and the support target vector set $V$, compute a new support target class vector set $\bar V'=\{\bar v_c'\}_{c=1}^{C}$, where the class vector of the c-th class is $\bar v_c'=\sum_{u=1}^{CK}P^{(\Psi)}_{u,c}v_u\big/\sum_{u=1}^{CK}P^{(\Psi)}_{u,c}$, $P^{(\Psi)}$ denotes the target probability matrix after the $\Psi$-th iteration, and $P^{(\Psi)}_{u,c}$, the value in the u-th row and c-th column of matrix $P^{(\Psi)}$, represents the probability that the updated u-th support target vector $v_u$ belongs to the c-th class.
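The whole group optimization of steps (3-4)-(3-7) fits in one short function, sketched below; the kernel form exp(-d²/γ), the row renormalization of each iteration, and the function name group_optimize follow the reconstruction above and should be read as assumptions.

```python
# A minimal sketch of the group optimization module: class prototypes, a
# distance-softmax probability matrix P, a Gaussian-kernel similarity
# matrix Z, label propagation via Lambda = Z P, and the final class-vector
# update weighted by the converged probabilities.
import torch

def group_optimize(v, labels, C, gamma=1.0, num_iters=100):
    # v: (CK, 256) support target vectors; labels: (CK,) class ids in [0, C)
    prototypes = torch.stack([v[labels == c].mean(0) for c in range(C)])
    d = torch.cdist(v, prototypes)                  # (CK, C) Euclidean dists
    P = torch.softmax(-d, dim=1)                    # support target prob. matrix
    Z = torch.exp(-torch.cdist(v, v) ** 2 / gamma)  # similarity matrix (CK, CK)
    for _ in range(num_iters):                      # label propagation
        Lam = Z @ P                                 # group propagation matrix
        P = Lam / Lam.sum(dim=1, keepdim=True)      # renormalize the rows
    # Step (3-7): probability-weighted class vectors.
    return (P.T @ v) / P.sum(dim=0, keepdim=True).T # (C, 256)

v = torch.randn(10, 256)                            # C=5 classes, K=2 shots
labels = torch.arange(5).repeat_interleave(2)
print(group_optimize(v, labels, C=5).shape)         # torch.Size([5, 256])
```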
Step (4): construct the query target prediction module, whose input is the query feature map set, the target fusion matrix set and the support target class vector set and whose output is the predicted query target bounding boxes and class probabilities; the specific steps are as follows:
(4-1) Construct a query target prediction module consisting of a Transformer sub-module, a target classification function and a bounding box prediction sub-module;
unfold each feature map in the query feature map set $F^q$ along the spatial dimensions to obtain a query feature matrix set $\tilde F^q=\{X_j\}_{j=1}^{N_q}$, $X_j\in\mathbb{R}^{(H'W')\times 256}$; compute the position encoding matrix of the query samples $E\in\mathbb{R}^{(H'W')\times 256}$, whose value in the $\kappa$-th row and $\omega$-th column is $E_{\kappa,\omega}=\sin\!\left(\kappa/10000^{(\omega-\omega\bmod 2)/256}\right)$ when $\omega\bmod 2=0$ and $E_{\kappa,\omega}=\cos\!\left(\kappa/10000^{(\omega-\omega\bmod 2)/256}\right)$ otherwise, $1\le\kappa\le H'\times W'$, $1\le\omega\le 256$, where $\bmod$ denotes the remainder operation;
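A sketch of the position encoding matrix E of step (4-1) follows, assuming the standard sinusoidal form in which the mod operation mentioned above selects sin for even columns and cos for odd columns.

```python
# A minimal sketch of the sinusoidal position-encoding matrix E.
import torch

def position_encoding(num_positions: int, dim: int = 256) -> torch.Tensor:
    kappa = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    omega = torch.arange(dim, dtype=torch.float32).unsqueeze(0)
    angle = kappa / torch.pow(10000.0, (omega - omega % 2) / dim)
    # Even columns get sin, odd columns get cos, chosen via omega mod 2.
    return torch.where(omega % 2 == 0, torch.sin(angle), torch.cos(angle))

E = position_encoding(16 * 16)   # one row per spatial position, (H'*W', 256)
```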
(4-2) Input the query feature matrix set $\tilde F^q$ and the position encoding matrix $E$ into the encoder of the Transformer sub-module to obtain a query target encoding feature set $T=\{T_j\}_{j=1}^{N_q}$; the encoder consists of one attention layer $\mathrm{Att}(\cdot)$ and two fully connected layers, and the j-th query target encoding feature is $T_j=\mathrm{FFN}(\mathrm{Att}(X_j\oplus E))$, where $\oplus$ denotes element-wise addition and $\mathrm{FFN}(\cdot)$ denotes a feed-forward neural network consisting of the two fully connected layers;
(4-3) Input the query target encoding feature set $T$ and the target fusion matrix set $\hat A$ obtained in step (2-6) into the decoder of the Transformer sub-module to obtain a query target decoding feature set $D=\{D_j\}_{j=1}^{N_q}$; the decoder consists of two attention layers and two fully connected layers, and the j-th query target decoding feature is $D_j=\mathrm{FFN}(\mathrm{Att}(\bar A_j, T_j))\in\mathbb{R}^{M\times 256}$, where $\bar A_j$ denotes the intermediate result matrix obtained by inputting $\hat A_j$ into the first attention layer, and the second attention layer $\mathrm{Att}(\bar A_j, T_j)$ takes its queries from $\bar A_j$ and its keys and values from $T_j$;
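The encoder-decoder of steps (4-2)-(4-3) can be sketched with nn.MultiheadAttention as below; the head count, the FFN width, and the omission of residual connections and layer normalization mirror the minimal description above rather than a full Transformer layer, and are assumptions of the example.

```python
# A minimal sketch of the Transformer (converter) sub-module: a one-layer
# encoder over the position-encoded query features, and a one-layer decoder
# whose first attention operates on the target fusion matrix and whose
# second attends to the encoder output.
import torch
import torch.nn as nn

class ConverterSubmodule(nn.Module):
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                     nn.Linear(ffn_dim, dim))
        self.dec_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                     nn.Linear(ffn_dim, dim))

    def forward(self, X_j, E, A_hat_j):
        # X_j: (1, H'*W', 256) query features; E: (H'*W', 256) encoding;
        # A_hat_j: (1, M, 256) target fusion matrix.
        src = X_j + E.unsqueeze(0)                           # element-wise add
        T_j = self.enc_ffn(self.enc_attn(src, src, src)[0])  # encoder output
        A_bar = self.dec_self(A_hat_j, A_hat_j, A_hat_j)[0]  # first attn layer
        D_j = self.dec_ffn(self.dec_cross(A_bar, T_j, T_j)[0])
        return D_j                                           # (1, M, 256)

module = ConverterSubmodule()
D = module(torch.randn(1, 256, 256), torch.randn(256, 256),
           torch.randn(1, 100, 256))
```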
(4-4) From the query target decoding feature set $D$ and the support target class vector set $\bar V'$ obtained in step (3-7), compute the target prediction class probabilities of the query set; the probability that the m-th target in the j-th query sample belongs to the c-th class is $p_{j,m,c}=\frac{\exp(-\mathrm{dist}(D_{j,m},\bar v_c'))}{\sum_{c'=1}^{C}\exp(-\mathrm{dist}(D_{j,m},\bar v_{c'}'))}$, $1\le m\le M$, where $D_{j,m}$ denotes the vector in the m-th row of matrix $D_j$;
(4-5) Input the query target decoding feature set $D$ into the bounding box prediction sub-module to obtain the set of predicted target bounding boxes of the query set $\hat B=\{\hat B_j\}_{j=1}^{N_q}$; the bounding box prediction sub-module is a multi-layer perceptron composed of three fully connected layers, the predicted bounding box matrix of the j-th query sample is $\hat B_j=\mathrm{MLP}(D_j)\in\mathbb{R}^{M\times 4}$, and its m-th row $\hat b_{j,m}=(\hat x_{j,m}^{1},\hat y_{j,m}^{1},\hat x_{j,m}^{2},\hat y_{j,m}^{2})$ is the predicted bounding box of the m-th target in the j-th query sample, where $(\hat x_{j,m}^{1},\hat y_{j,m}^{1})$ is the top-left corner coordinate of the bounding box and $(\hat x_{j,m}^{2},\hat y_{j,m}^{2})$ is the bottom-right corner coordinate.
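Steps (4-4)-(4-5) amount to a distance-softmax classifier plus a three-layer box regressor, as sketched below for one query sample; the Sigmoid squashing the box outputs into [0, 1] image-relative coordinates is an assumption of the example.

```python
# A minimal sketch of the classification function and bounding box head.
import torch
import torch.nn as nn

bbox_mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, 4), nn.Sigmoid())  # 3 FC layers

def predict(D_j: torch.Tensor, class_vectors: torch.Tensor):
    # D_j: (M, 256) decoded target features; class_vectors: (C, 256)
    probs = torch.softmax(-torch.cdist(D_j, class_vectors), dim=1)  # (M, C)
    boxes = bbox_mlp(D_j)    # (M, 4): (x1, y1, x2, y2), image-relative
    return probs, boxes

probs, boxes = predict(torch.randn(100, 256), torch.randn(5, 256))
```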
Step (5): optimize the small sample target detection model composed of the coordinate attention guiding module, the group optimization module and the query target prediction module with a stochastic gradient descent algorithm, and for a new support set and query set obtain the target bounding boxes and categories of the images in the query set through steps (1)-(4); this specifically comprises:
(5-1) Compute the target classification loss from the query-set target prediction class probabilities and the cross-entropy loss function: $\mathcal{L}_{\mathrm{cls}}=-\frac{1}{N_qM}\sum_{j=1}^{N_q}\sum_{m=1}^{M}\sum_{c=1}^{C}y_{j,m,c}\log p_{j,m,c}$, where $y_{j,m,c}\in\{0,1\}$ denotes the true label value indicating whether the m-th target in the j-th query sample belongs to the c-th class;
(5-2) Compute the bounding box loss from the query-set predicted target bounding box set $\hat B$: $\mathcal{L}_{\mathrm{box}}=\frac{1}{N_qM}\sum_{j=1}^{N_q}\sum_{m=1}^{M}\left[\lVert b_{j,m}-\hat b_{j,m}\rVert_1+\left(1-\mathrm{IoU}(b_{j,m},\hat b_{j,m})\right)\right]$, where $b_{j,m}$ denotes the true bounding box of the m-th target in the j-th query sample and $\mathrm{IoU}(\cdot,\cdot)$ denotes the intersection-over-union of the true bounding box and the predicted bounding box;
(5-3) Obtain the total loss $\mathcal{L}=\mathcal{L}_{\mathrm{cls}}+\mathcal{L}_{\mathrm{box}}$ from the target classification loss and the bounding box loss; optimize the small sample target detection model composed of the coordinate attention guiding module, the group optimization module and the query target prediction module with a stochastic gradient descent algorithm, and iteratively train the model until convergence to obtain the optimized small sample target detection model;
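For one query sample whose targets are already matched one-to-one to predictions, the losses of steps (5-1)-(5-3) can be sketched as follows; combining an L1 term with (1 - IoU) follows the reconstruction above and is an assumption.

```python
# A minimal sketch of the classification, bounding box, and total losses.
import torch
from torchvision.ops import box_iou

def detection_loss(probs, pred_boxes, gt_labels, gt_boxes):
    # probs: (M, C) class probabilities; pred_boxes, gt_boxes: (M, 4)
    # matched (x1, y1, x2, y2) boxes; gt_labels: (M,) true class indices.
    cls_loss = -torch.log(probs[torch.arange(len(gt_labels)), gt_labels]).mean()
    iou = box_iou(gt_boxes, pred_boxes).diagonal()   # IoU of matched pairs
    box_loss = (pred_boxes - gt_boxes).abs().sum(1).mean() + (1 - iou).mean()
    return cls_loss + box_loss                       # total loss

probs = torch.softmax(torch.randn(3, 5), dim=1)      # M=3 targets, C=5 classes
pred = torch.rand(3, 4).sort(dim=1).values           # crude but valid boxes
gt = torch.rand(3, 4).sort(dim=1).values
labels = torch.randint(0, 5, (3,))
print(float(detection_loss(probs, pred, labels, gt)))
```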
(5-4) Sample the new image dataset to obtain a support set $S^{\mathrm{new}}$ and a query set $Q^{\mathrm{new}}$, input them into the optimized small sample target detection model, execute steps (1)-(4) in sequence, and output the target class probabilities $p_{j,m,c}$ and the bounding box set $\hat B$ of the image samples in the query set $Q^{\mathrm{new}}$; the target class index with the highest probability is selected as the predicted class.
The description of the present embodiment is merely an enumeration of implementation forms of the inventive concept; the scope of protection of the present invention should not be construed as limited to the specific forms set forth in the embodiments, but also covers equivalent technical means conceivable by those skilled in the art according to the inventive concept.