CN114842343B

Info

Publication number: CN114842343B
Application number: CN202210541111.8A
Authority: CN
Inventors: 熊盛武; 赵怡晨; 陈亚雄; 路雄博
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2022-05-17
Filing date: 2022-05-17
Publication date: 2024-11-01
Anticipated expiration: 2042-05-17
Also published as: CN114842343A

Abstract

The invention discloses a ViT-based aerial image recognition method comprising the following steps: S1, acquiring an aerial image dataset and constructing a training set, a validation set and a test set; S2, expanding the data volume of the training set; S3, constructing a ViT-based aerial image recognition model; S4, inputting the expanded training set into the recognition model, applying discriminative label smoothing to the labels corresponding to the images, training the model with a cross entropy loss function and a discriminative contrastive loss function, updating the recognition model through the back propagation algorithm, and selecting the optimal aerial image recognition model; S5, testing the recognition performance of the model on the test set. By applying discriminative label smoothing to the image labels while supervising training with both the cross entropy loss function and the discriminative contrastive loss function, the invention obtains a ViT-based aerial image recognition model with stronger feature learning capability, and the method offers a high recognition rate and strong extensibility.

Description

Translated from Chinese

A ViT-based aerial image recognition method

Technical Field

The present invention relates to the technical field of machine learning algorithms and image processing, and in particular to a ViT-based aerial image recognition method.

Background Art

Aerial image recognition means identifying the category to which a given aerial image belongs. As aviation technology matures and the resolution of aerial images increases, aerial images play an increasingly important role in daily life. Tasks such as natural disaster detection, urban planning, resource exploration and thematic map production all rely on aerial image recognition, so recognizing aerial images accurately is of great value.

Although aerial image data is abundant, the datasets usable for model training are few and of limited quality, labeled datasets are rarer still, and noisy samples and difficult samples are common. In addition, aerial images are mostly captured from an overhead view and are characterized by a wide imaging range, large scale variation and sparsely varying targets within a scene. Compared with natural images, aerial image recognition therefore faces the difficulties of a small data volume and complex backgrounds.

At present, most solutions to these problems focus on building targeted lightweight deep learning algorithms; they have not been extended to more diverse aerial images and are therefore limited. Moreover, most of these methods supervise the model with a cross entropy loss that learns only label information, without considering the internal information of the aerial images themselves.

Summary of the Invention

In view of the shortcomings described in the background art, the present invention proposes a ViT-based aerial image recognition method. It exploits the strengths of ViT (Vision Transformer) in capturing long-range dependencies and in dynamic adaptive modeling, uses ViT as the image feature encoder to capture salient semantic features, and improves on ViT so that it can make full use of limited aerial image data for training while avoiding overfitting to noise in the images.

To achieve the above object, the ViT-based aerial image recognition method designed in the present invention is characterized in that it comprises the following steps:

S1) collect an aerial image dataset to obtain the required original aerial images x_i and their corresponding category labels y_i, and divide them proportionally into a training set, a validation set and a test set, used respectively for training, validating and evaluating the model, where B is the number of images in the training set;

S2) apply online data augmentation to the training set images so that each image in the training set generates M different augmented images; the expanded training set contains B*M images;

S3) construct a ViT-based aerial image recognition model;

S4) input the images of the expanded training set into the ViT-based aerial image recognition model, apply discriminative label smoothing to the labels corresponding to the images, train the model with both a cross entropy loss function and a discriminative contrastive loss function, update the recognition model through the back propagation algorithm, and use the validation set from step S1) to select the optimal aerial image recognition model;

S5) use the test set from step S1) to test the recognition performance of the aerial image recognition model and obtain the final recognition accuracy; when the recognition accuracy reaches a set threshold, input the images to be recognized into the aerial image recognition model for recognition; otherwise, return to step S3) until the recognition accuracy reaches the set threshold.

Preferably, in step S2) the input image is randomly cropped to 224*224 pixels and randomly flipped horizontally, and is then augmented using an image augmentation strategy, finally yielding the expanded training set.

Preferably, the image augmentation strategy in step S2) includes one or a combination of the following operations: normalizing the image; applying random color distortion and Gaussian blur in sequence; automatic augmentation; random augmentation; randomly selecting one image augmentation operation each time, randomly determining its magnitude and applying it to the image; and randomly erasing a rectangular region from the image without changing the image's original label.

Preferably, the ViT-based aerial image recognition model in step S3) consists of an encoder F(·), a classification head G(·) and a projection head P(·) used only in the training phase:

The encoder F(·) consists of a ViT pre-trained on a dataset and is used to learn and encode the global features of the image. Each training image is input into the feature encoder F(·), and the first token of the encoder output is used as the global feature representation h_i of that image;

The classification head G(·) consists of an MLP with the structure fully connected layer (FC), Tanh activation, fully connected layer (FC). The number of output neurons of the MLP equals the total number of aerial image categories in the current dataset;

The projection head P(·) is used only in the training phase of the model. It maps the encoded global feature representation h_i into the latent space in which the contrastive loss is applied. Its structure is fully connected layer (FC), ReLU activation, fully connected layer (FC).

Preferably, the discriminative label smoothing applied in step S4) to the labels corresponding to the images means that the labels are smoothed discriminatively according to the discrete probability values output by the model and the current training stage, and the smoothed labels are then used to calculate the cross entropy loss function value L_CE.

In the expression for L_CE, K is the total number of categories in the aerial image dataset; the initial label probability distribution of the i-th sample is 1 for the correct label category and 0 otherwise; the discrete probability distribution output by the model gives the model's predicted probability that the i-th sample belongs to the k-th class; and γ_·(s) is the smoothing variable.

Preferably, the smoothing variable γ_·(s) consists of two smoothing variables, γ_hard(s) and γ_simple(s), which control the smoothing weights of difficult samples and simple samples, respectively, at different training stages. The expression for γ_simple(s) is as follows:

γ_simple(s) = (γ_hard(s) + γ_bias) * 0.5^(1 + s/I)

where s∈{1…I} is the current training iteration and I is the total number of iterations; γ_max is the maximum smoothing weight for difficult samples and γ_min is the minimum; γ_bias is the offset between the smoothing weights of difficult samples and simple samples; and the expressions make use of a smooth interpolation function.

In the smooth interpolation function, Comb denotes the number of combinations, that is, the total number of ways to choose n elements from N+n elements, and N controls the smoothing rate.
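The expression for γ_simple(s) can be evaluated directly once γ_hard(s) is known. The following is a minimal Python sketch of that single formula. It takes γ_hard(s) as an input because its own expression is not reproduced in this text; the function and parameter names, and the numeric values in the usage note, are illustrative assumptions rather than values from the patent.

```python
def gamma_simple(gamma_hard: float, gamma_bias: float, s: int, total_iters: int) -> float:
    """Smoothing weight for simple samples at iteration s, following
    gamma_simple(s) = (gamma_hard(s) + gamma_bias) * 0.5 ** (1 + s / I).

    gamma_hard is passed in by the caller (its formula is not given here).
    """
    if not 1 <= s <= total_iters:
        raise ValueError("s must lie in 1..total_iters")
    return (gamma_hard + gamma_bias) * 0.5 ** (1 + s / total_iters)
```

For example, with hypothetical values γ_hard(s) = 0.2 and γ_bias = 0, the factor 0.5^(1 + s/I) shrinks from just under 0.5 at the start of training to 0.25 at s = I, so the simple-sample weight at the last iteration is 0.05.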

Preferably, whether the i-th sample is a difficult or a simple sample is decided from the K class probabilities output by the model: when the largest probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as difficult. The corresponding smoothing variable is then selected to calculate the cross entropy loss function value.

Preferably, when the cross entropy loss function and the discriminative contrastive loss function are used together to train the model in step S4), the total loss value L is calculated according to the following formula:

L = L_CE + β * L_DCL

where L_CE is the cross entropy loss function, L_DCL is the discriminative contrastive loss function, and β is a weight coefficient that adjusts the importance of the discriminative contrastive loss function.

The discriminative contrastive loss function is defined in terms of the following quantities:

B*M is the total number of samples in the training set; the indicator function equals 1 if and only if its input condition holds; among the samples that belong to the same class as a given sample, S_i denotes the set of samples augmented from the same image and C_i denotes the remaining cases; one term represents the dot-product proportion of samples of the same class that were augmented from different images, and another represents the dot-product proportion of samples of the same class that were augmented from the same image; τ>0 is the temperature parameter; and ε is the similarity threshold, 1≥ε>0.

The present invention also proposes a ViT-based aerial image recognition computer device comprising a memory, a processor, and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to implement the steps of the above method.

The present invention further proposes a computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the above ViT-based aerial image recognition method.

The beneficial effects of the present invention are:

1. High recognition rate: the amount of trainable data in aerial image recognition is small, which easily causes deep learning algorithms to overfit. To address this, the present invention adopts discriminative label smoothing so that the model learns sufficiently good feature information without overfitting the distribution of noisy data.

2. Strong extensibility: the principle of the ViT-based aerial image recognition method is general, and by selecting appropriate training data according to actual needs it can be applied to different types of aerial image recognition tasks.

3. Reasonable data structure: the present invention designs a discriminative label smoothing term and a discriminative supervised contrastive loss to learn a more compact and reasonable data structure. This yields a ViT-based aerial image recognition model with a stronger ability to capture salient features, making aerial image recognition more accurate.

Brief Description of the Drawings

FIG. 1 is an overall flow chart of the ViT-based aerial image recognition method of the present invention;

FIG. 2 is an illustration of the random augmentation module in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the model used for event recognition in aerial images in an embodiment of the present invention.

Detailed Description

To illustrate the purpose, technical solution, advantages and feasibility of the present invention, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific examples described here serve only to explain the present invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

This embodiment takes event recognition in aerial images as the scenario and describes the ViT-based aerial image recognition method provided by the present invention in detail.

As shown in FIG. 1, the ViT-based aerial image recognition method proposed in the present invention is applied to the task of event recognition in aerial images. The detailed steps of the method are as follows:

Step S1: Collect a dataset for event recognition in aerial images to obtain aerial images x_i and their corresponding event labels y_i. This embodiment uses the ERA dataset for event recognition in aerial images, which contains 2864 sample images in 25 event categories. Its predefined training set and test set are used directly, and the original training set is randomly divided into a training set and a validation set at a ratio of 9:1. B denotes the number of images in the resulting training set.
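The 9:1 random division of the original training set can be sketched as follows. This is a minimal illustration only: the function name, the fixed seed and the use of integer sample identifiers are assumptions made for the example, not details from the patent.

```python
import random


def split_train_val(samples, val_fraction=0.1, seed=0):
    """Randomly split samples into (training, validation) lists.

    val_fraction=0.1 corresponds to the 9:1 ratio described in step S1.
    """
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

The test set is left untouched by this split, so it remains available for the final evaluation in step S5.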

Step S2: Construct a random data augmentation module to expand the amount of training data, and input the training set images from step S1 into this module for online data augmentation. In the random augmentation module, the input image is first randomly cropped to 224*224 pixels and randomly flipped horizontally. Six image augmentation strategies commonly used in current vision tasks are then considered: (1) BaseAugment (only normalizes the image); (2) SimAugment (applies random color distortion and Gaussian blur in sequence, possibly followed by an additional sparse image distortion operation at the end of the sequence); (3) AutoAugment (automatic augmentation); (4) RandAugment (random augmentation); (5) TrivialAugment (randomly selects one image augmentation operation each time, randomly determines its magnitude and applies it to the image); (6) RandomErasing (randomly erases a rectangular region from the image without changing the image's original label). That is, for each image in the training set, M (6≥M≥0) of the six strategies are selected at random to augment the image, finally yielding the expanded training set. In this embodiment M is 4, as shown in FIG. 2.
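The random choice of M strategies per image can be sketched as below. The sketch covers only the selection logic and represents each augmented view as an (image id, strategy name) pair; it assumes one augmented view per selected strategy, which is one reading of the text, and it does not implement the image operations themselves. The function names are hypothetical.

```python
import random

# The six strategy names listed in step S2.
STRATEGIES = [
    "BaseAugment", "SimAugment", "AutoAugment",
    "RandAugment", "TrivialAugment", "RandomErasing",
]


def pick_strategies(m, rng):
    """Choose m distinct strategies at random, with 0 <= m <= 6."""
    if not 0 <= m <= len(STRATEGIES):
        raise ValueError("M must satisfy 0 <= M <= 6")
    return rng.sample(STRATEGIES, m)


def expand_training_set(image_ids, m=4, seed=0):
    """Return one (image_id, strategy) pair per augmented view: B*M pairs in total."""
    rng = random.Random(seed)
    return [(image_id, name)
            for image_id in image_ids
            for name in pick_strategies(m, rng)]
```

With M = 4 as in this embodiment, a training set of B images yields 4*B augmented views.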

Step S3: Construct the ViT-based aerial image recognition model, whose structure is shown in FIG. 3. The model consists of an encoder F(·), a classification head G(·) and a projection head P(·) used only in the training phase:

The encoder F(·) consists of a ViT pre-trained on the ImageNet dataset and is used to learn and encode the global features of the image. Specifically, the encoder F(·) has two parts, a linear layer and a transformer encoder. The linear layer produces the embedded representation of the image; the transformer encoder consists of multi-head self-attention layers and multi-layer perceptron blocks and learns the global features of the image. LayerNorm is applied before each block and a residual connection after each block. Each training image is input into the feature encoder F(·), and the first token of the last transformer encoder layer is used as the global feature representation h_i of that image. h_i is then input into the classifier and the projector to calculate the total loss value.

The classification head G(·) consists of an MLP with the structure "fully connected layer (FC), Tanh activation, fully connected layer (FC)". The number of output neurons of the MLP equals the total number of aerial image categories in the current dataset, which is 25 in this embodiment.

The projection head P(·) is used only in the training phase of the model. It maps the encoded representation h_i into the latent space in which the contrastive loss is applied. Its structure is "fully connected layer (FC), ReLU activation, fully connected layer (FC)", and the number of output neurons of the MLP is 128.
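The two heads can be illustrated with a small dependency-free Python sketch that shows only their layer order and output sizes (25 and 128). The feature width FEATURE_DIM, the hidden width, the random placeholder weights and all function names are assumptions for illustration; in practice h_i would come from the pre-trained ViT encoder and the heads would be built with a deep learning framework.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Create a fully connected layer as (weights, bias) with placeholder values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a fully connected layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_head(n_in, n_hidden, n_out, activation, rng):
    """Build an FC -> activation -> FC head and return it as a function."""
    fc1 = make_linear(n_in, n_hidden, rng)
    fc2 = make_linear(n_hidden, n_out, rng)

    def head(h):
        return linear([activation(v) for v in linear(h, fc1)], fc2)

    return head


rng = random.Random(0)
FEATURE_DIM = 16  # assumed width of h_i, kept small for the example

# G(.): FC -> Tanh -> FC with 25 outputs (one per event category).
classification_head = make_head(FEATURE_DIM, FEATURE_DIM, 25, math.tanh, rng)
# P(.): FC -> ReLU -> FC with 128 outputs, used during training only.
projection_head = make_head(FEATURE_DIM, FEATURE_DIM, 128, lambda v: max(0.0, v), rng)

h_i = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]  # stand-in for the encoder output
logits = classification_head(h_i)
z_i = projection_head(h_i)
```

At inference time only the encoder and classification_head would be needed, since the projection head serves the contrastive loss alone.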

Step S4: Input the images of the expanded training set from step S2 into the recognition model constructed in step S3, then apply discriminative label smoothing to the labels corresponding to the images, train the model with both the cross entropy loss function and the discriminative contrastive loss function, and update the recognition model through the back propagation algorithm. The model with the best recognition accuracy on the validation set from step S1 is selected as the final trained recognition model.

Here, applying discriminative label smoothing to the labels corresponding to the images means that the labels are smoothed discriminatively according to the discrete probability values output by the model and the current training stage, and the smoothed labels are then used to calculate the cross entropy loss function value.

In the expression for this loss, K is the total number of categories in the aerial image dataset; the initial label probability distribution of the i-th sample is 1 for the correct label category and 0 otherwise; and the discrete probability distribution output by the model gives the model's predicted probability that the i-th sample belongs to the k-th class.
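The exact smoothed cross entropy expression is not reproduced in this text. As a rough illustration of how a smoothing weight enters such a loss, the sketch below uses the conventional uniform label smoothing form, in which the one-hot label is mixed with a uniform distribution over the K classes. This form is an assumption and may differ from the patent's expression; the function name and the example values are hypothetical.

```python
import math


def smoothed_cross_entropy(probs, true_class, gamma):
    """Cross entropy for one sample against a smoothed label.

    Assumed target: (1 - gamma) * one_hot + gamma / K (conventional label smoothing).
    probs is the model's predicted distribution over the K classes.
    """
    k = len(probs)
    loss = 0.0
    for cls, p in enumerate(probs):
        one_hot = 1.0 if cls == true_class else 0.0
        target = (1.0 - gamma) * one_hot + gamma / k
        loss -= target * math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return loss
```

With gamma = 0 this reduces to the ordinary cross entropy, -log of the probability assigned to the correct class; a larger gamma shifts part of the target mass onto the other classes.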

Compared with natural image datasets, annotated aerial images are usually more costly to obtain, so aerial image datasets are generally small, which easily leads to the model overfitting the training data. Conventional label smoothing can alleviate overfitting to some extent, but it carries a risk of underfitting when the dataset is small. A smoothing variable γ_·(s) is therefore introduced to control the smoothing weight, assigning different smoothing weight values as the training stage changes. Specifically, γ_·(s) consists of two smoothing variables, γ_hard(s) and γ_simple(s), which control the smoothing weights of difficult samples and simple samples, respectively, at different training stages.

In their expressions, s∈{1…I} is the current training iteration and I is the total number of iterations; γ_max is the maximum smoothing weight for difficult samples and, similarly, γ_min is the minimum; γ_bias is the offset between the smoothing weights of difficult samples and simple samples; and the expressions make use of a smooth interpolation function.

In the smooth interpolation function, Comb denotes the number of combinations, for example the total number of ways to choose n elements from N+n elements without regard to order. N controls the smoothing rate and is set to 1 in this embodiment.
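The smooth interpolation formula itself is not reproduced in this text. One well-known function that is written with the count Comb(N+n, n) and an order N is the generalized smoothstep polynomial, so the sketch below assumes that form; this is a guess that is consistent with the description, not a confirmed reading of the patent. The function name is hypothetical.

```python
from math import comb


def smoothstep(x, order=1):
    """Generalized smoothstep of the given order, clamped to [0, 1].

    Assumed form:
        S_N(x) = x**(N+1) * sum_{n=0..N} Comb(N+n, n) * Comb(2N+1, N-n) * (-x)**n
    """
    x = min(max(x, 0.0), 1.0)
    total = 0.0
    for n in range(order + 1):
        total += comb(order + n, n) * comb(2 * order + 1, order - n) * (-x) ** n
    return x ** (order + 1) * total
```

Under this assumption, N = 1 gives the familiar polynomial 3x^2 - 2x^3, which rises smoothly from 0 at x = 0 to 1 at x = 1; a natural argument would be the training progress s/I, although the text does not confirm this.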

Whether the i-th sample is a difficult or a simple sample is decided from the K class probabilities output by the model: when the largest probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as difficult. The corresponding smoothing variable is then selected to calculate the cross entropy loss function value.
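The simple/difficult rule can be sketched in a few lines of Python. The thresholds 0.8 and 0.2 are those stated above; the function names are hypothetical, and the two smoothing weights are passed in as plain numbers.

```python
def is_simple_sample(probs, high=0.8, low=0.2):
    """Return True when the top probability exceeds `high` and the runner-up is below `low`."""
    top, second = sorted(probs, reverse=True)[:2]
    return top > high and second < low


def select_gamma(probs, gamma_hard, gamma_simple):
    """Pick the smoothing weight that matches the sample's difficulty."""
    return gamma_simple if is_simple_sample(probs) else gamma_hard
```

For a distribution that sums to 1, a top probability above 0.8 already forces every other probability below 0.2, so the second check is kept here only to mirror the wording of the rule.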

When the model is trained with both the cross entropy loss function and the discriminative contrastive loss function, the total loss value L is calculated according to the following formula:

L = L_CE + β * L_DCL

where L_CE is the cross entropy loss function described in claim 5, L_DCL is the discriminative contrastive loss function, and β is a weight coefficient that adjusts the importance of the discriminative contrastive loss function.

The discriminative contrastive loss function is motivated and defined as follows.

Compared with natural images, aerial images show greater intra-class variation and inter-class similarity. Even samples of the same class differ to some degree, and random augmentation amplifies these differences. The discriminative contrastive loss function therefore further distinguishes whether images of the same class were augmented from the same image. Specifically, B*M is the total number of samples in the training set, and the indicator function equals 1 if and only if its input condition holds. Among the samples that belong to the same class as a given sample, S_i denotes the set of samples augmented from the same image and C_i denotes the remaining cases. One term represents the dot-product proportion of samples of the same class that were augmented from different images, and another represents the dot-product proportion of samples of the same class that were augmented from the same image. τ>0 is the temperature parameter, and ε (1≥ε>0) is the similarity threshold.
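The distinction between S_i and C_i can be illustrated by partitioning the positives of one anchor sample. The sketch below covers only this grouping step, not the loss itself, whose expression is not reproduced in this text. It assumes each augmented sample carries a class label and the id of the image it was augmented from, and that the anchor is excluded from its own sets; the function name is hypothetical.

```python
def split_positives(anchor, labels, source_ids):
    """Split the same-class samples of `anchor` into two index lists.

    same_image       -> S_i: same class, augmented from the same source image
    other_same_class -> C_i: same class, augmented from a different source image
    """
    same_image, other_same_class = [], []
    for j, (label, source) in enumerate(zip(labels, source_ids)):
        if j == anchor or label != labels[anchor]:
            continue  # skip the anchor itself and samples of other classes
        if source == source_ids[anchor]:
            same_image.append(j)
        else:
            other_same_class.append(j)
    return same_image, other_same_class
```

For labels [0, 0, 0, 1, 0] and source image ids [10, 10, 11, 12, 10], anchor 0 has samples 1 and 4 in S_i and sample 2 in C_i, while sample 3 belongs to another class and is ignored.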

Step S5: Input the test set images from step S1 into the trained recognition model and compare the categories predicted by the model with the true categories to obtain the final recognition accuracy. When the recognition accuracy reaches the set threshold, input the images to be recognized into the aerial image recognition model for recognition; otherwise, return to step S3) until the recognition accuracy reaches the set threshold.
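The accuracy check in step S5 amounts to comparing predicted and true categories and testing the result against the threshold. A minimal sketch follows; the function names and the threshold values in the usage note are illustrative assumptions, since the text does not state a specific threshold.

```python
def accuracy(predicted, actual):
    """Fraction of test samples whose predicted category matches the true category."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def meets_threshold(predicted, actual, threshold):
    """True when the model may be used for recognition; False means return to step S3."""
    return accuracy(predicted, actual) >= threshold
```

For instance, three correct predictions out of four give an accuracy of 0.75, which passes a hypothetical threshold of 0.7 but not one of 0.8.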

Based on the above method, the present invention also proposes a ViT-based aerial image recognition computer device comprising a memory, a processor, and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to implement the steps of the above ViT-based aerial image recognition method.

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

It should be understood that those of ordinary skill in the art may make improvements or modifications based on the principles of the present invention and the above description, or apply the method provided by the present invention to similar aerial image recognition tasks, and all such improvements and modifications fall within the scope of protection of the appended claims.

Claims

1.一种基于ViT的航空图像识别方法，其特征在于：所述方法包括如下步骤：1. A ViT-based aerial image recognition method, characterized in that the method comprises the following steps:

2.根据权利要求1所述的一种基于ViT的航空图像识别方法，其特征在于：步骤S2)将输入的图像随机裁剪为224*224像素后进行随机地水平翻转，然后使用图像增强策略对图像进行增强，最终得到扩容后的训练集，记为2. The ViT-based aerial image recognition method according to claim 1, characterized in that: step S2) randomly crops the input image into 224*224 pixels and then randomly flips it horizontally, and then enhances the image using an image enhancement strategy, and finally obtains an expanded training set, which is recorded as

3.根据权利要求2所述的一种基于ViT的航空图像识别方法，其特征在于：步骤S2)中图像增强策略包括以下操作中的一种或多种组合：对图像进行归一化操作、按照顺序进行随机颜色失真和高斯模糊、自动增强、随机增强、每次随机选择一个图像增强操作，然后随机确定它的增强幅度，并对图像进行增强、随机从图像中擦除一个矩形区域而不改变图像的原始标签。3. A ViT-based aerial image recognition method according to claim 2, characterized in that: the image enhancement strategy in step S2) includes one or more combinations of the following operations: normalizing the image, performing random color distortion and Gaussian blur in sequence, automatic enhancement, random enhancement, randomly selecting an image enhancement operation each time, then randomly determining its enhancement amplitude, and enhancing the image, and randomly erasing a rectangular area from the image without changing the original label of the image.

4.根据权利要求1所述的一种基于ViT的航空图像识别方法，其特征在于：步骤S3)中所述基于ViT的航空图像识别模型由编码器F(·)，分类头G(·)和仅用于训练阶段的投影头P(·)构成：4. The ViT-based aerial image recognition method according to claim 1 is characterized in that: the ViT-based aerial image recognition model in step S3) is composed of an encoder F(·), a classification head G(·) and a projection head P(·) used only in the training phase:

5.根据权利要求1所述的一种基于ViT的航空图像识别方法，其特征在于：步骤S4)中对图像相对应的标签进行区分性标签平滑，指根据模型输出的离散概率值和当前的训练阶段，对图像进行区分性的标签平滑，然后将平滑后的标签用以计算交叉熵损失函数值，表达式为：5. The ViT-based aerial image recognition method according to claim 1, characterized in that: in step S4), the discriminative label smoothing is performed on the label corresponding to the image, which means that the discriminative label smoothing is performed on the image according to the discrete probability value output by the model and the current training stage, and then the smoothed label is used to calculate the cross entropy loss function value, and the expression is:

式中，L_CE是交叉熵损失函数值，K是航空图像数据集中的总类别数目；是第i个样本初始标签概率分布，即对于正确的标签类别为1，其他情况则为0；是由模型输出的离散概率分布，指模型对第i个样本在第k个类的预测概率，γ_·(s)是平滑变量。Where L_CE is the cross entropy loss function value, K is the total number of categories in the aerial image dataset; is the initial label probability distribution of the i-th sample, that is, for the correct label category is 1, otherwise it is 0; It is the discrete probability distribution output by the model, which refers to the model's predicted probability of the i-th sample in the k-th class, and γ_· (s) is a smooth variable.

6. The ViT-based aerial image recognition method according to claim 5, characterized in that the smoothing variable γ_·(s) consists of two smoothing variables, γ_hard(s) and γ_simple(s), which control the smoothing weights of difficult samples and simple samples, respectively, at different training stages, and are expressed as follows:

where s∈{1...I} is the current training iteration and I is the total number of iterations; γ_max is the maximum smoothing weight for difficult samples and γ_min is the minimum; γ_bias is the offset between the smoothing weights of difficult samples and simple samples; and the smooth interpolation function is expressed as follows:

where Comb is the number of combinations, that is, the total number of ways of choosing n elements from N+n elements, and N controls the smoothing rate.
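One well-known interpolation function built from the binomial coefficient C(N+n, n) is the generalized smoothstep polynomial. The sketch below assumes that form purely for illustration; the interpolation expression of claim 6 is not reproduced in this text, so the patent's function may differ. The name `smoothstep` and the clamping of the input to [0, 1] are assumptions.

```python
from math import comb

def smoothstep(x, N=1):
    """Generalized smoothstep of order N on [0, 1] (an assumed form).

    S_N(x) = x^(N+1) * sum_{n=0..N} C(N+n, n) * C(2N+1, N-n) * (-x)^n
    Larger N gives a flatter start and end, i.e. a different smoothing rate.
    """
    x = min(max(x, 0.0), 1.0)  # clamp the input to the unit interval
    total = 0.0
    for n in range(N + 1):
        total += comb(N + n, n) * comb(2 * N + 1, N - n) * (-x) ** n
    return x ** (N + 1) * total
```

For N = 1 this reduces to the familiar 3x² - 2x³. In a training schedule, x would typically be a normalized iteration count such as s / I, although that mapping is also an assumption here.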

7. The ViT-based aerial image recognition method according to claim 6, characterized in that whether the i-th sample is a difficult sample or a simple sample is determined from the probabilities of the K classes output by the model: when the maximum probability is greater than 0.8 and the second-largest probability is less than 0.2, the sample is regarded as a simple sample; otherwise, it is classified as a difficult sample. The corresponding smoothing variable is then selected to calculate the cross entropy loss function value.
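The division rule of claim 7 can be sketched directly from the thresholds stated in the claim (0.8 and 0.2). The function names and the idea of passing the two smoothing values in as plain numbers are illustrative assumptions.

```python
def is_simple_sample(probs, high=0.8, low=0.2):
    """Return True when the largest class probability exceeds `high`
    and the second largest is below `low`, per the rule in claim 7."""
    ranked = sorted(probs, reverse=True)
    return ranked[0] > high and ranked[1] < low

def select_gamma(probs, gamma_hard, gamma_simple):
    """Pick the smoothing variable that matches the sample's category."""
    return gamma_simple if is_simple_sample(probs) else gamma_hard
```

For example, a prediction of `[0.9, 0.05, 0.05]` is treated as a simple sample, while `[0.6, 0.3, 0.1]` is treated as a difficult one.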

8. The ViT-based aerial image recognition method according to claim 6, characterized in that, when the model is trained in step S4) with both the cross entropy loss function and the discriminative contrast loss function, the total loss value L is calculated according to the following formula:

L = L_CE + β*L_DCL

where L_CE is the cross entropy loss function, L_DCL is the discriminative contrast loss function, and β is a weight coefficient that adjusts the importance of the discriminative contrast loss function;

where B*M is the total number of samples in the training set; the indicator function equals 1 if and only if its input condition holds; among the samples belonging to the same class as a given sample, S_i denotes the set of samples enhanced from the same image and C_i denotes the remaining cases; one term denotes the dot-product ratio of samples of the same class obtained by enhancing different images, and the other denotes the dot-product ratio of samples of the same class obtained by enhancing the same image; τ > 0 is the temperature parameter; and ε is the similarity threshold, with 0 < ε ≤ 1.
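The weighted sum L = L_CE + β*L_DCL in claim 8 can be shown as a short worked example. The numeric loss values and the choice β = 0.1 below are made up for illustration; the claim does not state a value for β.

```python
def total_loss(l_ce, l_dcl, beta):
    """Total loss from claim 8: cross entropy plus weighted contrast loss."""
    return l_ce + beta * l_dcl

# Illustrative values only: L_CE = 1.2, L_DCL = 0.5, beta = 0.1.
example = total_loss(1.2, 0.5, beta=0.1)
```

Setting β to 0 recovers training with the cross entropy loss alone, which makes the role of β as an importance weight easy to see.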

9. A ViT-based aerial image recognition computer device, comprising a memory, a processor, and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to implement the steps of the method according to any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.