Fine-grained small sample classification method based on a task-specific channel reconstruction network

Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a fine-grained small sample classification method based on a task-specific channel reconstruction network.
Background
Collecting and labeling fine-grained visual data is more costly and difficult than collecting image data with only coarse-grained categories, and often requires expert knowledge in a particular field. For fine-grained classification tasks, the data available for some fine-grained subcategories is limited; for example, when a new fine-grained subcategory of a species is discovered, the available samples are often very scarce. Humans can abstract new fine-grained concepts from a few supervised samples and rapidly distinguish subsequent new samples belonging to the same category, whereas current deep-learning fine-grained classification methods still require a large number of labeled samples for training, leaving a clear gap between the two. In light of this, small sample learning is an important means of addressing the data scarcity faced by fine-grained tasks. Accordingly, researchers have recently focused on fine-grained small sample visual classification methods, in which a model is expected to learn conceptual knowledge of visually similar fine-grained categories from a limited number of labeled fine-grained samples (e.g., using only 1 or 5 labeled samples per category).
Related technologies in the field of small sample learning are developing rapidly and offer a way to address the fine-grained problem under the small sample setting. However, fine-grained small sample visual classification faces the dual challenges of fine-grained classification and small sample learning; simply applying a generic small sample classification method cannot effectively solve the problem, and the method design must be fully combined with the characteristics of the fine-grained task.
Disclosure of Invention
In view of the above, the present invention aims to provide a fine-grained small sample classification method based on a task-specific channel reconstruction network, which can classify fine-grained small samples accurately and effectively.
Firstly, a fine-grained small sample classification dataset is obtained, data preprocessing is performed, and label extraction is completed. The dataset is divided into three parts, and N-way K-shot small sample meta-task sampling is performed in an episodic manner to obtain query set images and support set images of N categories. Then, a multi-dimensional dynamic feature extraction network is constructed from multi-dimensional dynamic convolutions; all support set images and query set images are fed into this network to obtain the support set features of each category and the query set features, and the prototype feature of each support set category is computed. Next, a weight generation block is constructed to compute the initial weights of the support set and the query set; these initial weights are adaptively aggregated to obtain task-specific channel attention weights, which are used to reconstruct the support set and query set features. Finally, similarity scoring with a distance metric yields the category to which each query set image belongs, efficiently completing the fine-grained small sample classification task.
The technical scheme adopted for solving the technical problems is as follows:
A fine-grained small sample classification method based on a task-specific channel reconstruction network comprises the following steps:
Step S1, acquiring a fine-grained small sample classification dataset, carrying out data preprocessing, and completing label extraction; dividing the dataset into three parts D_base, D_val and D_novel, and carrying out N-way K-shot small sample meta-task sampling in an episodic manner to obtain query set images and support set images of N categories;
Step S2, constructing a multi-dimensional dynamic feature extraction network from multi-dimensional dynamic convolutions, inputting all support set images and query set images into the network to obtain the support set features of each category and the query set features, and calculating the prototype feature of each support set category;
Step S3, constructing a weight generation block; inputting the support set feature of the ch_i-th class to obtain the initial support set weight of the ch_i-th class, and inputting the query set feature into the weight generation block to obtain the initial query set weight; adaptively aggregating the initial support set weight of the ch_i-th class and the initial query set weight to obtain the task-specific channel attention weight of the ch_i-th class; and reconstructing the support set feature and the query set feature with the task-specific channel attention weight to obtain the reconstructed support set feature and the reconstructed query set feature of the ch_i-th class;
Step S4, carrying out similarity scoring with a distance metric to obtain the category to which each query set image belongs, carrying out iterative training on a plurality of sampled meta-tasks according to specified training parameters, updating model parameters by optimizing the combined loss, continuously saving the optimal model according to the verification accuracy, and calculating the average classification accuracy of the optimal model over all randomly sampled episodes in the meta-test stage.
Further, the step S1 specifically includes the following steps:
Step S11, adopting a fine-grained small sample classification dataset, performing data preprocessing, and completing label extraction;
Step S12, for a given fine-grained image dataset D, dividing it into three parts, namely a base dataset D_base, a verification dataset D_val and a new-class dataset D_novel, denoted D_base={(x_base_i, y_base_i), y_base_i∈Y_base}, D_val={(x_val_i, y_val_i), y_val_i∈Y_val} and D_novel={(x_novel_i, y_novel_i), y_novel_i∈Y_novel} respectively, wherein Y_base∪Y_val∪Y_novel=Y; Y denotes the label space of the original dataset, Y_base denotes the label space of the base dataset, Y_val denotes the label space of the verification dataset, and Y_novel denotes the label space of the new-class dataset; the classes of the three parts are mutually exclusive, i.e. Y_base∩Y_val=Y_val∩Y_novel=Y_base∩Y_novel=∅; each category in the base dataset D_base contains more labeled samples than the verification dataset D_val and the new-class dataset D_novel;
Step S13, constructing the N-way K-shot fine-grained small sample classification setting and randomly sampling meta-tasks as episodes; each episode consists of a support set S and a query set Q, which together contain N fine-grained categories, each category containing K+M samples; for a given category, the corresponding support set contains only K labeled image samples, and the query set contains M unlabeled image samples; the support set S and the query set Q in each episode are defined as S={(x_i, y_i)}_(i=1)^(N×K) and Q={(x_j, y_j)}_(j=1)^(N×M); thus, query set images and support set images of N categories are obtained.
Further, the step S2 specifically includes the following steps:
Step S21, constructing a multi-dimensional dynamic convolution; for an input feature map fea_x, the output feature map extracted by the multi-dimensional dynamic convolution ODConv is denoted fea_x′ and is calculated as

fea_x′ = (a_f ⊙ a_c ⊙ a_s ⊙ CONV_W) * fea_x

wherein a_f is the output-channel attention weight of the convolution kernel, which gives different attention weights to the different convolution kernels that constitute the output feature map channels, a_f∈R^(C_out); a_c is the input-channel attention weight of the convolution kernel, which gives different weights to the different channels of the convolution kernel so as to dynamically extract the features of different channels of the input feature map, a_c∈R^(C_in); C_out denotes the number of output channels and C_in denotes the number of input channels; a_s is the convolution kernel spatial attention weight, which gives different attention weights to the different spatial positions of the convolution kernel, a_s∈R^(k×k), where k denotes the spatial size of the convolution kernel; CONV_W denotes the convolution kernel parameters, CONV_W_conv_n denotes the parameters of the conv_n-th convolution kernel, conv_n∈[1, C_out];
Step S22, using the dynamic feature extraction mode of step S21, improving the two feature extraction network structures Conv-4 and ResNet-12 with the multi-dimensional dynamic convolution ODConv to obtain two multi-dimensional dynamic feature extraction networks: replacing the second, third and fourth 3×3 two-dimensional convolutions in the Conv-4 network with single-kernel 3×3 ODConv, and replacing the 3×3 two-dimensional convolution in each BasicBlock of ResNet-12 with single-kernel 3×3 ODConv, so that the BasicBlocks of ResNet-12 are replaced by the resulting optimized ODBasicBlocks; the feature map size at each stage remains unchanged before and after the improvement;
Step S23, selecting a multi-dimensional dynamic feature extraction network according to the requirements of the fine-grained small sample classification task, and inputting all support set images and query set images into the network to obtain the support set features of each category and the query set features; the query image x_Q of the current meta-task and the im_j-th support set image x_S^(ch_i,im_j) of the ch_i-th class are respectively input into the multi-dimensional dynamic feature extraction network f_DEFN_θ(·) with parameters θ to generate the query feature map F_Q and the support feature maps F_S^(ch_i,im_j), specifically calculated as:

F_Q = f_DEFN_θ(x_Q)
F_S^(ch_i,im_j) = f_DEFN_θ(x_S^(ch_i,im_j))
Step S24, calculating the prototype feature of the support set of each class; the prototype feature of the ch_i-th support set class is denoted F_S^(ch_i) and is calculated as

F_S^(ch_i) = (1/N_ch_i) Σ_(im_j=1)^(N_ch_i) F_S^(ch_i,im_j)

where N_ch_i denotes the number of samples of the ch_i-th class.
Further, the step S3 specifically includes the following steps:
Step S31, aggregating the spatial information of the query set feature and the support set prototype features through a global average pooling operation GAP; the query set feature F_Q and the support set prototype feature F_S^(ch_i) of the ch_i-th class are input to obtain the initial query set weight w_Q and the initial support set weight w_S^(ch_i) of the ch_i-th class, specifically calculated as:

w_Q = GAP(F_Q)
w_S^(ch_i) = GAP(F_S^(ch_i))
Step S32, constructing a weight generation block FCB(·); the initial query set weight w_Q and the initial support set weight w_S^(ch_i) of the ch_i-th class are input into the weight generation block FCB(·) to obtain the query attention weight w′_Q and the support attention weight w′_S^(ch_i) of the ch_i-th class, specifically calculated as:

w′_Q = FCB(w_Q) = σ(W_2 δ(W_1 w_Q))
w′_S^(ch_i) = FCB(w_S^(ch_i)) = σ(W_2 δ(W_1 w_S^(ch_i)))

wherein W_1 and W_2 are fully connected layer weights, δ(·) denotes the ReLU function, and σ(·) denotes the 1+tanh function;
Step S33, further applying task-specific adaptive aggregation to the query attention weight w′_Q and the support attention weight w′_S^(ch_i) obtained in step S32, so as to maximally highlight the task-specific key semantics and obtain the task-specific channel attention weight tw_ch_i of the ch_i-th class, specifically calculated as

tw_ch_i = τ·w′_Q + (1−τ)·w′_S^(ch_i)

wherein τ denotes a learnable parameter, τ∈[0,1];
Step S34, using the task-specific channel attention weight tw_ch_i obtained in step S33 to reconstruct the feature channels of the support prototype feature map F_S^(ch_i) of the ch_i-th class and of the query feature map F_Q, obtaining the channel-reconstructed feature maps F̃_S^(ch_i) and F̃_Q^(ch_i) of the ch_i-th class, specifically calculated as:

F̃_Q^(ch_i,ch_j) = tw_ch_i^(ch_j) · F_Q^(ch_j)
F̃_S^(ch_i,ch_j) = tw_ch_i^(ch_j) · F_S^(ch_i,ch_j)

wherein tw_ch_i^(ch_j) denotes the value of tw_ch_i in the ch_j-th channel dimension, F_Q^(ch_j) and F_S^(ch_i,ch_j) respectively denote the ch_j-th channel feature of F_Q and F_S^(ch_i) in the channel dimension, ch_j∈[1, C], and C denotes the total number of channels of the feature map.
Further, the step S4 specifically includes the following steps:
Step S41, scoring similarity with a distance metric to obtain the category to which each query set image belongs; the probability p(y=ch_i|x) that a query image x belongs to the ch_i-th category is calculated as

p(y=ch_i|x) = exp(−γ·dis(F̃_Q^(ch_i), F̃_S^(ch_i))) / Σ_(ch_n=1)^(cls_n) exp(−γ·dis(F̃_Q^(ch_n), F̃_S^(ch_n)))

wherein γ denotes a learnable temperature parameter, dis(·,·) denotes the distance metric function, and cls_n denotes the number of categories;
Step S42, calculating the cross-entropy loss from the probability p(y=ch_i|x) obtained in step S41, setting the specified training parameters, and continuously updating gradients to iteratively train over all episodes of the meta-training stage;
Step S43, calculating the average classification accuracy Acc̄ of the meta-test stage model over all randomly sampled episodes, calculated as

Acc̄ = (1/epi_n) Σ_(epi_i=1)^(epi_n) acc_epi_i

where epi_n denotes the total number of test-stage episodes, and acc_epi_i denotes the classification accuracy over all queries in the epi_i-th episode;
Step S44, during training, performing model verification at fixed iteration intervals according to the verification interval setting, continuously saving the optimal model, and ending the training process when the iteration count reaches a preset maximum iteration threshold.
Compared with the prior art, the invention and its preferred schemes have the following beneficial effects:
1. Aiming at the data scarcity characteristic of fine-grained classification tasks, a fine-grained small sample classification method based on a task-specific channel reconstruction network is constructed, achieving high fine-grained prediction accuracy with only a small number of samples.
2. Aiming at the characteristics of the fine-grained small sample classification task, the feature extraction networks commonly used in small sample classification are improved with convolution-kernel-level multi-dimensional attention, so as to better adapt to the fine-grained small sample visual classification task.
3. The problem of task-specific category-level semantic activation differences in fine-grained small sample visual classification is analyzed, a task-specific channel attention method is proposed, and the feature channels of the query set and the support set are reconstructed with channel attention weights, achieving better metric matching among fine-grained subcategories.
4. At the cost of an extremely small increase in parameters and computation, a general fine-grained small sample classification method is provided, and the two component methods it contains can be effectively applied to different metric learning methods.
Drawings
The invention is described in further detail below with reference to the attached drawings and detailed description:
Fig. 1 is a schematic diagram of a principle and implementation flow of an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present patent more comprehensible, embodiments accompanied with figures are described in detail below:
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
As shown in fig. 1, the present embodiment provides a fine-grained small sample classification method based on a task-specific channel reconstruction network, which specifically includes the following steps:
Step S1, acquiring a fine-grained small sample classification dataset, performing data preprocessing, and completing label extraction; dividing the dataset into three parts D_base, D_val and D_novel, and performing N-way K-shot small sample meta-task sampling in an episodic manner to obtain query set images and support set images of N categories;
Step S2, constructing a multi-dimensional dynamic feature extraction network from multi-dimensional dynamic convolutions, inputting all support set images and query set images into the network to obtain the support set features of each category and the query set features, and calculating the prototype feature of each support set category;
Step S3, constructing a weight generation block; inputting the support set feature of the ch_i-th class to obtain the initial support set weight of the ch_i-th class, and inputting the query set feature into the weight generation block to obtain the initial query set weight; adaptively aggregating the initial support set weight of the ch_i-th class and the initial query set weight to obtain the task-specific channel attention weight of the ch_i-th class; and reconstructing the support set feature and the query set feature with the task-specific channel attention weight to obtain the reconstructed support set feature and the reconstructed query set feature of the ch_i-th class;
Step S4, scoring similarity with a distance metric to obtain the category to which each query set image belongs, performing iterative training on a plurality of sampled meta-tasks according to the specified training parameters, updating model parameters by optimizing the combined loss, continuously saving the optimal model according to the verification accuracy, and calculating the average classification accuracy of the optimal model over all randomly sampled episodes in the meta-test stage.
In this embodiment, the step S1 specifically includes the following steps:
Step S11, adopting a fine-grained small sample classification dataset, performing data preprocessing, and completing label extraction;
Step S12, for a given fine-grained image dataset D, dividing it into three parts, namely a base dataset D_base, a verification dataset D_val and a new-class dataset D_novel, denoted D_base={(x_base_i, y_base_i), y_base_i∈Y_base}, D_val={(x_val_i, y_val_i), y_val_i∈Y_val} and D_novel={(x_novel_i, y_novel_i), y_novel_i∈Y_novel} respectively, where Y_base∪Y_val∪Y_novel=Y; Y denotes the label space of the original dataset, Y_base denotes the label space of the base dataset, Y_val denotes the label space of the verification dataset, and Y_novel denotes the label space of the new-class dataset. Under this division, the categories of the individual parts do not intersect, i.e. Y_base∩Y_val=Y_val∩Y_novel=Y_base∩Y_novel=∅. Each category in the base dataset D_base contains enough labeled samples, while the verification dataset D_val and the new-class dataset D_novel contain only a small number of labeled samples;
Step S13, constructing the N-way K-shot fine-grained small sample classification setting and randomly sampling meta-tasks as episodes. Each episode consists of a support set S and a query set Q, which together contain N fine-grained categories, each category containing K+M samples. For a given category, its support set contains only K labeled image samples, and its query set contains M unlabeled image samples. The support set S and the query set Q in each episode are defined as S={(x_i, y_i)}_(i=1)^(N×K) and Q={(x_j, y_j)}_(j=1)^(N×M). Thus, query set images and support set images of N categories are obtained.
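By way of illustration only, the following minimal Python sketch shows one way the N-way K-shot episode sampling of step S13 could be realized; it assumes the dataset is a list of (image, label) pairs, and all names such as sample_episode are hypothetical rather than part of the claimed method:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, m_query=16, seed=None):
    """Sample one N-way K-shot episode: a support set with K labeled
    samples per class and a query set with M samples per class.
    Class labels are remapped to 0..n_way-1 within the episode."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)
    classes = rng.sample(sorted(by_class), n_way)   # N random classes
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = rng.sample(by_class[cls], k_shot + m_query)
        support += [(img, episode_label) for img in picks[:k_shot]]
        query += [(img, episode_label) for img in picks[k_shot:]]
    return support, query
```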
In this embodiment, the step S2 specifically includes the following steps:
Step S21, constructing a multi-dimensional dynamic convolution. For an input feature map fea_x, the output feature map extracted by the multi-dimensional dynamic convolution ODConv is denoted fea_x′ and is calculated as

fea_x′ = (a_f ⊙ a_c ⊙ a_s ⊙ CONV_W) * fea_x

where a_f is the output-channel attention weight of the convolution kernel, which gives different attention weights to the different convolution kernels that constitute the output feature map channels, a_f∈R^(C_out); a_c is the input-channel attention weight of the convolution kernel, which gives different weights to the different channels of the convolution kernel so as to dynamically extract the features of different channels of the input feature map, a_c∈R^(C_in). C_out denotes the number of output channels and C_in denotes the number of input channels. a_s is the convolution kernel spatial attention weight, which gives different attention weights to the different spatial positions of the convolution kernel, a_s∈R^(k×k), where k denotes the spatial size of the convolution kernel. CONV_W denotes the convolution kernel parameters, CONV_W_conv_n denotes the parameters of the conv_n-th convolution kernel, conv_n∈[1, C_out];
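For concreteness, the following PyTorch sketch illustrates how a single-kernel dynamic convolution of this form could be computed. It is an illustrative assumption rather than the claimed implementation: the class name ODConv2dSketch, the sigmoid attention activations and the squeeze bottleneck are hypothetical design choices, and the kernel-number attention of the full ODConv is omitted since only a single kernel is used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2dSketch(nn.Module):
    """Single-kernel multi-dimensional dynamic convolution: a static
    kernel CONV_W is modulated by input-conditioned attentions a_f
    (output channels), a_c (input channels) and a_s (spatial positions)
    before being applied, i.e. (a_f . a_c . a_s . CONV_W) * fea_x."""
    def __init__(self, c_in, c_out, k=3, reduction=4):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.02)
        hidden = max(c_in // reduction, 4)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(c_in, hidden), nn.ReLU())
        self.to_af = nn.Linear(hidden, c_out)   # -> a_f in R^{C_out}
        self.to_ac = nn.Linear(hidden, c_in)    # -> a_c in R^{C_in}
        self.to_as = nn.Linear(hidden, k * k)   # -> a_s in R^{k x k}

    def forward(self, x):
        b = x.size(0)
        h = self.squeeze(x)                               # (B, hidden)
        a_f = torch.sigmoid(self.to_af(h))                # (B, C_out)
        a_c = torch.sigmoid(self.to_ac(h))                # (B, C_in)
        a_s = torch.sigmoid(self.to_as(h))                # (B, k*k)
        # Per-sample modulated kernel (a_f . a_c . a_s . CONV_W).
        w = (self.weight.unsqueeze(0)
             * a_f.view(b, -1, 1, 1, 1)
             * a_c.view(b, 1, -1, 1, 1)
             * a_s.view(b, 1, 1, self.k, self.k))
        # Grouped convolution applies a different kernel to each sample.
        x = x.reshape(1, -1, *x.shape[2:])
        w = w.reshape(-1, *self.weight.shape[1:])
        out = F.conv2d(x, w, padding=self.k // 2, groups=b)
        return out.reshape(b, -1, *out.shape[2:])
```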
Step S22, improving the Conv-4 and ResNet-12 feature extraction network structures commonly used in small sample classification with the dynamic feature extraction mode ODConv of step S21, obtaining two multi-dimensional dynamic feature extraction networks. The second, third and fourth 3×3 two-dimensional convolutions in the Conv-4 network are replaced with single-kernel 3×3 ODConv, and the 3×3 two-dimensional convolution in each BasicBlock of ResNet-12 is replaced with single-kernel 3×3 ODConv, so that the BasicBlocks of ResNet-12 are replaced by the resulting optimized ODBasicBlocks. The feature map size at each stage remains unchanged before and after the improvement;
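A minimal sketch of the Conv-4 modification described above, reusing the hypothetical ODConv2dSketch module from the previous sketch and assuming the usual Conv-4 block layout (BatchNorm, ReLU, 2×2 max-pooling after each convolution); the ODBasicBlock substitution in ResNet-12 would follow the same pattern:

```python
import torch.nn as nn

def conv4_odconv_backbone(c_out=64):
    """Conv-4 style backbone in which the 2nd-4th 3x3 convolutions are
    replaced by single-kernel ODConv; the first convolution stays a
    plain static 3x3 convolution, as in step S22."""
    def block(conv):
        return nn.Sequential(conv, nn.BatchNorm2d(c_out),
                             nn.ReLU(inplace=True), nn.MaxPool2d(2))
    return nn.Sequential(
        block(nn.Conv2d(3, c_out, 3, padding=1)),   # static first conv
        block(ODConv2dSketch(c_out, c_out)),        # blocks 2-4: ODConv
        block(ODConv2dSketch(c_out, c_out)),
        block(ODConv2dSketch(c_out, c_out)),
    )
```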
Step S23, selecting a multi-dimensional dynamic feature extraction network according to the requirements of the fine-grained small sample classification task. All support set images and query set images are input into the network to obtain the support set features of each category and the query set features: the query image x_Q of the current meta-task and the im_j-th support set image x_S^(ch_i,im_j) of the ch_i-th class are respectively input into the multi-dimensional dynamic feature extraction network f_DEFN_θ(·) with parameters θ to generate the query feature map F_Q and the support feature maps F_S^(ch_i,im_j), specifically calculated as

F_Q = f_DEFN_θ(x_Q)
F_S^(ch_i,im_j) = f_DEFN_θ(x_S^(ch_i,im_j))
Step S24, calculating the prototype feature of the support set of each class. The prototype feature of the ch_i-th support set class is denoted F_S^(ch_i) and is calculated as

F_S^(ch_i) = (1/N_ch_i) Σ_(im_j=1)^(N_ch_i) F_S^(ch_i,im_j)

where N_ch_i denotes the number of samples of the ch_i-th class;
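A minimal sketch of this prototype computation, assuming the support feature maps have been stacked into one tensor (the function name class_prototypes is hypothetical):

```python
import torch

def class_prototypes(support_feats, support_labels, n_way):
    """Per-class prototype F_S^(ch_i): the mean of the N_ch_i support
    feature maps of class ch_i, keeping the spatial layout intact.
    support_feats: (N*K, C, H, W); support_labels: (N*K,) in 0..n_way-1."""
    protos = [support_feats[support_labels == ch_i].mean(dim=0)
              for ch_i in range(n_way)]
    return torch.stack(protos)   # (n_way, C, H, W)
```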
in this embodiment, the step S3 specifically includes the following steps:
Step S31, aggregating the spatial information of the query set feature and the support set prototype features through a Global Average Pooling (GAP) operation. The query set feature F_Q and the support set prototype feature F_S^(ch_i) of the ch_i-th class are input to obtain the initial query set weight w_Q and the initial support set weight w_S^(ch_i) of the ch_i-th class, specifically calculated as

w_Q = GAP(F_Q)
w_S^(ch_i) = GAP(F_S^(ch_i))
Step S32, constructing a weight generation block FCB(·). The initial query set weight w_Q and the initial support set weight w_S^(ch_i) of the ch_i-th class are input into the weight generation block FCB(·) to obtain the query attention weight w′_Q and the support attention weight w′_S^(ch_i) of the ch_i-th class, specifically calculated as

w′_Q = FCB(w_Q) = σ(W_2 δ(W_1 w_Q))
w′_S^(ch_i) = FCB(w_S^(ch_i)) = σ(W_2 δ(W_1 w_S^(ch_i)))

where W_1 and W_2 are fully connected layer weights, δ(·) denotes the ReLU function, and σ(·) denotes the 1+tanh function;
Step S33, further applying task-specific adaptive aggregation to the query attention weight w′_Q and the support attention weight w′_S^(ch_i) obtained in step S32, so as to maximally highlight the task-specific key semantics and obtain the task-specific channel attention weight tw_ch_i of the ch_i-th class, specifically calculated as

tw_ch_i = τ·w′_Q + (1−τ)·w′_S^(ch_i)

where τ denotes a learnable parameter, τ∈[0,1].
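The following sketch illustrates steps S31-S33 together under stated assumptions: the bottleneck ratio r and the sharing of the fully connected weights between the query and support branches are assumptions, and all names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenBlock(nn.Module):
    """Sketch of the weight generation block FCB: two fully connected
    layers with a ReLU in between and a 1 + tanh output activation, as
    described in step S32; the bottleneck ratio r is an assumption."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W_1
        self.fc2 = nn.Linear(channels // r, channels)   # W_2

    def forward(self, w):
        return 1.0 + torch.tanh(self.fc2(F.relu(self.fc1(w))))

def task_specific_weight(fcb, f_q, f_s_proto, tau):
    """tw_ch_i = tau * w'_Q + (1 - tau) * w'_S^(ch_i), with tau a
    learnable scalar tensor clamped to [0, 1]; f_q and f_s_proto are
    4-D feature maps of shape (1, C, H, W)."""
    w_q = F.adaptive_avg_pool2d(f_q, 1).flatten(1)        # GAP -> (1, C)
    w_s = F.adaptive_avg_pool2d(f_s_proto, 1).flatten(1)
    tau = tau.clamp(0.0, 1.0)
    return tau * fcb(w_q) + (1.0 - tau) * fcb(w_s)        # (1, C)
```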
Step S34, using the task-specific channel attention weight tw_ch_i obtained in step S33 to reconstruct the feature channels of the support prototype feature map F_S^(ch_i) of the ch_i-th class and of the query feature map F_Q, obtaining the channel-reconstructed feature maps F̃_S^(ch_i) and F̃_Q^(ch_i) of the ch_i-th class, specifically calculated as

F̃_Q^(ch_i,ch_j) = tw_ch_i^(ch_j) · F_Q^(ch_j)
F̃_S^(ch_i,ch_j) = tw_ch_i^(ch_j) · F_S^(ch_i,ch_j)

where tw_ch_i^(ch_j) denotes the value of tw_ch_i in the ch_j-th channel dimension, F_Q^(ch_j) and F_S^(ch_i,ch_j) respectively denote the ch_j-th channel feature of F_Q and F_S^(ch_i) in the channel dimension, ch_j∈[1, C], and C denotes the total number of channels of the feature map;
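With the weight in hand, the channel reconstruction of step S34 reduces to a per-channel rescaling, as in this minimal sketch (names hypothetical; inputs assumed to be torch tensors):

```python
def reconstruct_channels(feat, tw):
    """Channel reconstruction of step S34: rescale each channel ch_j of
    the feature map by the matching entry of the task-specific channel
    attention weight. feat: (B, C, H, W); tw: (B, C)."""
    return feat * tw.view(*tw.shape, 1, 1)
```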
In this embodiment, the step S4 specifically includes the following steps:
Step S41, scoring similarity with a distance metric to obtain the category to which each query set image belongs. The probability p(y=ch_i|x) that a query image x belongs to the ch_i-th category is calculated as

p(y=ch_i|x) = exp(−γ·dis(F̃_Q^(ch_i), F̃_S^(ch_i))) / Σ_(ch_n=1)^(cls_n) exp(−γ·dis(F̃_Q^(ch_n), F̃_S^(ch_n)))

where γ denotes a learnable temperature parameter, dis(·,·) denotes the distance metric function, and cls_n denotes the number of categories;
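A minimal sketch of this similarity scoring, assuming flattened reconstructed features and a squared Euclidean distance standing in for the generic metric dis(·,·):

```python
import torch.nn.functional as F

def query_class_probs(q_recon, s_recon, gamma):
    """Softmax over negative scaled distances, one distance per class:
    q_recon[i] is the query feature reconstructed with class i's weight
    and s_recon[i] is class i's reconstructed prototype, both flattened
    to shape (n_way, D); gamma is the temperature."""
    dists = ((q_recon - s_recon) ** 2).sum(dim=-1)   # (n_way,)
    return F.softmax(-gamma * dists, dim=-1)         # p(y = ch_i | x)
```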
Step S42, calculating the cross-entropy loss from the probability p(y=ch_i|x) obtained in step S41, setting the specified training parameters, and continuously updating gradients to iteratively train over all episodes of the meta-training stage.
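An illustrative episodic training loop for step S42, assuming a model callable that maps a support/query pair to per-class probabilities (all names hypothetical):

```python
import torch

def meta_train(model, episodes, optimizer):
    """One pass of episodic meta-training: cross-entropy on the query
    predictions of each sampled episode. `model(support, query)` is
    assumed to return p(y = ch_i | x) for every query image, with
    shape (num_queries, n_way)."""
    model.train()
    for support, query, query_labels in episodes:
        probs = model(support, query)
        loss = torch.nn.functional.nll_loss(
            torch.log(probs.clamp_min(1e-8)), query_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```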
Step S43, calculating the average classification accuracy Acc̄ of the meta-test stage model over all randomly sampled episodes, calculated as

Acc̄ = (1/epi_n) Σ_(epi_i=1)^(epi_n) acc_epi_i

where epi_n denotes the total number of test-stage episodes, and acc_epi_i denotes the classification accuracy over all queries in the epi_i-th episode;
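A matching sketch of the meta-test accuracy computation of step S43, under the same assumptions as the training loop above:

```python
import torch

def mean_episode_accuracy(model, episodes):
    """Average accuracy over all meta-test episodes, matching
    Acc = (1/epi_n) * sum_i acc_epi_i."""
    model.eval()
    accs = []
    with torch.no_grad():
        for support, query, query_labels in episodes:
            preds = model(support, query).argmax(dim=-1)
            accs.append((preds == query_labels).float().mean().item())
    return sum(accs) / len(accs)
```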
Step S44, during training, performing model verification at fixed iteration intervals according to the verification interval setting, continuously saving the optimal model, and ending the training process when the iteration count reaches a preset maximum iteration threshold.
In summary, the present invention proposes an effective solution to the task specificity exhibited by subtle discriminative semantics under the small sample setting. Learning subtle discriminative features is critical to fine-grained visual classification tasks, but most existing small sample learning methods ignore this and therefore cannot be effectively applied to fine-grained visual classification. The invention therefore optimizes the dense feature representation construction process of small sample learning and captures the subtle difference information of fine-grained images with a multi-dimensional dynamic feature extraction network. On this basis, the invention further provides a task-specific channel attention method, using task-specific channel attention weights to activate the key semantic information specific to each fine-grained small sample task. The proposed method can efficiently learn fine-grained conceptual knowledge from a small number of samples, thereby achieving higher fine-grained prediction accuracy and coping with the dual challenges of fine-grained classification and small sample learning.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present patent is not limited to the above-mentioned preferred embodiment; any person may, under the teaching of this patent, derive various other fine-grained small sample classification methods based on a task-specific channel reconstruction network, and all equivalent changes and modifications made according to the scope of the present patent shall be covered by the present patent.