CN120014280B

Movatterモバイル変換

Info

Publication number: CN120014280B
Application number: CN202510470102.8A
Authority: CN
Inventors: 黄邦菊; 李俊辉; 黄龙杨
Original assignee: Civil Aviation Flight University of China
Current assignee: Civil Aviation Flight University of China
Priority date: 2025-04-15
Filing date: 2025-04-15
Publication date: 2025-07-04
Anticipated expiration: 2045-04-15
Also published as: CN120014280A

Abstract

Translated fromChinese

本发明公开了一种视觉语言融合的无人机航拍图像开放词汇语义分割方法，其涉及多模态人工智能技术领域。本方法基于多种注意力机制、多层次融合模块、动态调整机制，构建了视觉语言融合分割模型，确保了复杂场景下对已知与未知类别的航拍图像实现高精度、鲁棒性的分割效果；利用VIT、Mamba模型提取全局图像信息、局部图像细节，并采用自适应加权融合实现全局与局部特征的动态平衡，使用可变形卷积对局部结构进行强化，保证整体场景语义的准确表达；利用异构跨模态图融合模型整合更远距离的跨模态语义关系，不断融合来自视觉、文本以及领域知识的多维信息。

The present invention discloses a visual language fusion unmanned aerial image open vocabulary semantic segmentation method, which relates to the field of multimodal artificial intelligence technology. This method is based on multiple attention mechanisms, multi-level fusion modules, and dynamic adjustment mechanisms to construct a visual language fusion segmentation model, ensuring high-precision and robust segmentation effects for aerial images of known and unknown categories in complex scenes; using VIT and Mamba models to extract global image information and local image details, and using adaptive weighted fusion to achieve a dynamic balance between global and local features, using deformable convolution to strengthen local structures, and ensuring accurate expression of the overall scene semantics; using a heterogeneous cross-modal graph fusion model to integrate longer-distance cross-modal semantic relationships, and continuously integrating multidimensional information from vision, text, and domain knowledge.

Description

Translated fromChinese

视觉语言融合的无人机航拍图像开放词汇语义分割方法Open vocabulary semantic segmentation method for UAV aerial images based on vision and language fusion

技术领域Technical Field

本发明涉及多模态人工智能技术领域，特别涉及视觉语言融合的无人机航拍图像开放词汇语义分割方法。The present invention relates to the field of multimodal artificial intelligence technology, and in particular to a method for semantic segmentation of open vocabulary of drone aerial images by integrating vision and language.

背景技术Background Art

近年来，无人机航拍图像在灾害评估、精准农业、城市规划等遥感应用中发挥着越来越重要的作用。然而，传统的语义分割技术主要依赖于全监督深度学习方法，这类方法通常需要大量人工标注数据，并且仅针对预定义的闭集类别进行训练。这种闭集识别方法在面对实际应用中频繁出现的未知类别时，往往表现出泛化能力不足和识别不准确的问题。此外，无人机航拍图像通常呈现出独特的地理和结构特征，如目标物体分布稀疏、尺度变化大、背景复杂等，使得基于自然图像训练的传统模型在处理此类图像时存在明显局限，同时，获取覆盖全面的标注数据也面临高昂的成本和操作上的困难，从而进一步限制了传统方法在实际应用中的效能和实用性。In recent years, drone aerial images have played an increasingly important role in remote sensing applications such as disaster assessment, precision agriculture, and urban planning. However, traditional semantic segmentation techniques mainly rely on fully supervised deep learning methods, which usually require a large amount of manually annotated data and are only trained for predefined closed-set categories. This closed-set recognition method often exhibits problems of insufficient generalization and inaccurate recognition when faced with unknown categories that frequently appear in practical applications. In addition, drone aerial images usually present unique geographical and structural features, such as sparse distribution of target objects, large scale variations, and complex backgrounds, which makes traditional models trained based on natural images have obvious limitations in processing such images. At the same time, obtaining comprehensive annotated data also faces high costs and operational difficulties, which further limits the effectiveness and practicality of traditional methods in practical applications.

多模态融合技术在图像理解领域展现出了巨大的潜力。通过引入文本描述信息，模型能够获得图像中难以直接从视觉信号中提取的语义信息，进而弥补传统视觉模型在开放词汇识别中的不足。然而，当前的多模态方法主要针对自然图像，对于无人机航拍图像中复杂地理和结构特征的处理依然存在一定困难，尚未形成一套成熟而高效的解决方案。Multimodal fusion technology has shown great potential in the field of image understanding. By introducing text description information, the model can obtain semantic information in the image that is difficult to extract directly from the visual signal, thereby making up for the shortcomings of traditional visual models in open vocabulary recognition. However, current multimodal methods are mainly aimed at natural images, and there are still certain difficulties in processing complex geographical and structural features in drone aerial images, and a mature and efficient solution has not yet been formed.

综上所述，鉴于现有技术在处理复杂背景、计算资源消耗、新类别识别准确性以及恶劣环境下稳定性方面的不足，亟需一种新型的基于视觉与语言融合的无人机航拍图像开放词汇语义分割方法。In summary, given the shortcomings of existing technologies in dealing with complex backgrounds, computing resource consumption, accuracy in identifying new categories, and stability in harsh environments, there is an urgent need for a new open vocabulary semantic segmentation method for UAV aerial images based on the fusion of vision and language.

发明内容Summary of the invention

本发明的目的在于提供视觉语言融合的无人机航拍图像开放词汇语义分割方法，以改善上述技术问题。The purpose of the present invention is to provide a method for open vocabulary semantic segmentation of drone aerial images with visual language fusion to improve the above technical problems.

为了实现上述发明目的，本发明实施例提供了以下技术方案：In order to achieve the above-mentioned object of the invention, the embodiment of the present invention provides the following technical solutions:

一种视觉语言融合的无人机航拍图像开放词汇语义分割方法包括：A vision-language fusion open vocabulary semantic segmentation method for drone aerial images includes:

利用无人机采集不同环境条件下的无类别航拍混合图像，生成对应的语言描述数据；Use drones to collect unclassified aerial mixed images under different environmental conditions and generate corresponding language description data;

构建视觉语言融合分割模型；所述视觉语言融合分割模型包括视觉-语言特征提取模型、异构跨模态图融合模型和语义分割模型；Constructing a visual-language fusion segmentation model; the visual-language fusion segmentation model includes a visual-language feature extraction model, a heterogeneous cross-modal graph fusion model and a semantic segmentation model;

将无类别航拍混合图像、语言描述数据输入至视觉-语言特征提取模型，输出得到多尺度时空视觉特征、语言特征；The unclassified aerial mixed images and language description data are input into the visual-linguistic feature extraction model, and the multi-scale spatiotemporal visual features and language features are output;

将语言特征和多尺度时空视觉特征输入至异构跨模态图融合模型，输出得到视觉-语言匹配特征；The language features and multi-scale spatiotemporal visual features are input into the heterogeneous cross-modal graph fusion model, and the visual-language matching features are output;

将视觉-语言匹配特征输入至语义分割模型，完成对航拍图像的语义分割。The visual-language matching features are input into the semantic segmentation model to complete the semantic segmentation of the aerial images.

进一步地，所述利用无人机采集不同环境条件下的无类别航拍混合图像，生成对应的语言描述数据，包括：Furthermore, the method of using a drone to collect unclassified aerial mixed images under different environmental conditions and generating corresponding language description data includes:

利用无人机平台采集初始无类别航拍图像并进行预处理，得到对应的无类别航拍图像；The initial unclassified aerial images are collected by using the UAV platform and preprocessed to obtain the corresponding unclassified aerial images;

选择满足选择条件的无类别航拍图像并进行检查和拼接，得到无类别航拍混合图像；所述选择条件为拍摄角度或拍摄环境不同；Selecting unclassified aerial images that meet a selection condition, checking and stitching them, and obtaining an unclassified aerial mixed image; the selection condition is that the shooting angle or shooting environment is different;

利用GPT模型生成各无类别航拍混合图像的描述文本，得到语言描述数据。The GPT model is used to generate description text for each unclassified aerial mixed image to obtain language description data.

进一步地，所述视觉-语言特征提取模型包括并行的视觉特征提取子模型和语言特征提取子模型；视觉特征提取子模型包括串联的视觉特征提取模块、多尺度时空视觉特征融合模块；视觉特征提取模块包括并行的VIT全局特征提取模块、局部细节特征提取模块；VIT全局特征提取模块包括Patch-Embedding层和基于自注意力机制的transformer层；多尺度时空视觉特征融合模块包括串联的动态特征融合层和可变形卷积层；局部细节特征提取模块采用Mamba模型；语言特征提取子模型为BERT模型；Furthermore, the visual-language feature extraction model includes a parallel visual feature extraction sub-model and a language feature extraction sub-model; the visual feature extraction sub-model includes a serial visual feature extraction module and a multi-scale spatiotemporal visual feature fusion module; the visual feature extraction module includes a parallel VIT global feature extraction module and a local detail feature extraction module; the VIT global feature extraction module includes a Patch-Embedding layer and a transformer layer based on a self-attention mechanism; the multi-scale spatiotemporal visual feature fusion module includes a serial dynamic feature fusion layer and a deformable convolution layer; the local detail feature extraction module adopts a Mamba model; the language feature extraction sub-model is a BERT model;

所述异构跨模态图融合模型采用基于多头注意力机制的异构跨模态图网络；The heterogeneous cross-modal graph fusion model adopts a heterogeneous cross-modal graph network based on a multi-head attention mechanism;

所述语义分割模型采用轻量级U-Net++模型；轻量级U-Net++模型包括串联的编码器和基于通道-空间双注意力的解码器；编码器包括N层解码层；基于通道-空间双注意力的解码器包括N层解码模块，每两层解码模块通过多层次融合模块进行连接；各解码模块均包括串联的CSDA层和解码层。The semantic segmentation model adopts a lightweight U-Net++ model; the lightweight U-Net++ model includes a serially connected encoder and a decoder based on channel-space dual attention; the encoder includes N layers of decoding layers; the decoder based on channel-space dual attention includes N layers of decoding modules, and each two layers of decoding modules are connected through a multi-level fusion module; each decoding module includes a serially connected CSDA layer and a decoding layer.

进一步地，所述视觉-语言特征提取模型的训练过程包括：Furthermore, the training process of the visual-linguistic feature extraction model includes:

获取无类别航拍混合训练图像和对应的语言描述训练数据；Obtain unclassified aerial mixed training images and corresponding language description training data;

将无类别航拍混合训练图像输入至Patch-Embedding层，得到无类别航拍混合训练分割图像；Input the unclassified aerial mixed training image into the Patch-Embedding layer to obtain the unclassified aerial mixed training segmentation image;

将无类别航拍混合训练分割图像输入至基于自注意力机制的transformer层，得到无类别航拍混合训练全局特征图；The unclassified aerial photography mixed training segmentation image is input into the transformer layer based on the self-attention mechanism to obtain the unclassified aerial photography mixed training global feature map;

将无类别航拍混合训练图像输入至Mamba模型，得到无类别航拍混合训练局部特征图；Input the unclassified aerial mixed training image into the Mamba model to obtain the unclassified aerial mixed training local feature map;

将无类别航拍混合训练局部特征图、无类别航拍混合训练全局特征图输入至多尺度时空视觉特征融合模块，得到多尺度时空视觉训练特征；Inputting the unclassified aerial photography mixed training local feature map and the unclassified aerial photography mixed training global feature map into the multi-scale spatiotemporal visual feature fusion module to obtain the multi-scale spatiotemporal visual training features;

将语言描述训练数据输入至BERT模型，得到语言训练特征；Input the language description training data into the BERT model to obtain language training features;

基于多尺度时空视觉训练特征和语言训练特征，调整视觉特征提取子模型的权重参数。Based on the multi-scale spatiotemporal visual training features and language training features, the weight parameters of the visual feature extraction sub-model are adjusted.

进一步地，所述将语言描述训练数据输入至BERT模型，得到语言训练特征，包括：Furthermore, the language description training data is input into the BERT model to obtain language training features, including:

基于语言描述训练数据生成对应的连续向量文本表示；Generate corresponding continuous vector text representation based on language description training data;

基于无类别航拍混合训练局部特征图和无类别航拍混合训练全部特征图，设置一组视觉原型；A set of visual prototypes is set based on the local feature map of unclassified aerial photography mixed training and the full feature map of unclassified aerial photography mixed training;

基于连续向量文本表示的正/负相似度，计算CSA损失函数；Calculate the CSA loss function based on the positive/negative similarity of the continuous vector text representation;

基于CSA损失函数，对视觉原型和连续向量文本表示进行匹配对齐，得到匹配结果；Based on the CSA loss function, the visual prototype and the continuous vector text representation are matched and aligned to obtain the matching result;

基于匹配结果和连续向量文本表示，生成对应的语言训练特征。Based on the matching results and the continuous vector text representation, the corresponding language training features are generated.

进一步地，所述异构跨模态图融合模型的训练过程包括：Furthermore, the training process of the heterogeneous cross-modal graph fusion model includes:

构建初始异构跨模态图；获取语言训练特征、多尺度时空视觉训练特征并分别作为文本节点、视觉节点，并映射至初始异构跨模态图中；Construct an initial heterogeneous cross-modal graph; obtain language training features and multi-scale spatiotemporal visual training features as text nodes and visual nodes respectively, and map them to the initial heterogeneous cross-modal graph;

计算文本节点和视觉节点之间的节点相似度；基于各个节点相似度设置边的权重以及知识图谱，构建对应的边，得到异构跨模态图；Calculate the node similarity between text nodes and visual nodes; set the edge weights and knowledge graph based on the similarity of each node, construct the corresponding edges, and obtain a heterogeneous cross-modal graph;

根据多头图注意力机制，获取每个节点及其相邻节点的聚合信息；According to the multi-head graph attention mechanism, the aggregate information of each node and its adjacent nodes is obtained;

基于各个聚合信息，通过多头图注意力机制的多跳聚合更新每个节点的表示；所述节点的表示为视觉-语言匹配训练特征，其包括语言特征、多尺度时空视觉特征；Based on each aggregated information, the representation of each node is updated through multi-hop aggregation of a multi-head graph attention mechanism; the representation of the node is a visual-language matching training feature, which includes language features and multi-scale spatiotemporal visual features;

基于更新后的节点的表示，优化异构跨模态图，即优化异构跨模态图融合模型。Based on the updated node representation, the heterogeneous cross-modal graph is optimized, that is, the heterogeneous cross-modal graph fusion model is optimized.

进一步地，所述语义分割模型的训练过程包括：Furthermore, the training process of the semantic segmentation model includes:

获取视觉-语言匹配训练特征并输入至编码器，得到每个编码层对应的第i视觉-语言匹配训练编码特征；除第一层编码层的输入为视觉-语言匹配训练特征，第i层编码层的输入为第i-1层编码层的输出；Obtain visual-language matching training features and input them into the encoder to obtain the i-th visual-language matching training encoding features corresponding to each encoding layer; except that the input of the first encoding layer is the visual-language matching training features, the input of the i-th encoding layer is the output of the i-1-th encoding layer;

将第N视觉-语言匹配训练特征输入至第N层解码模块，输出得到第N视觉-语言匹配解码结果；Input the Nth visual-language matching training feature into the Nth layer decoding module, and output the Nth visual-language matching decoding result;

将第N视觉-语言匹配解码结果和第N-1视觉-语言匹配训练特征输入至多层次融合模块，输出得到第N-1视觉-语言匹配训练融合特征；Input the Nth visual-language matching decoding result and the N-1th visual-language matching training feature into a multi-level fusion module, and output the N-1th visual-language matching training fusion feature;

将第N-1视觉-语言匹配训练融合特征输入至第N-1层解码模块，输出得到第N-1视觉-语言匹配解码结果；Input the N-1th visual-language matching training fusion feature into the N-1th layer decoding module, and output the N-1th visual-language matching decoding result;

重复多层次融合模块和第N-1层解码模块的处理过程，直至得到第一层解码模块的输出，即得到语义分割训练结果；Repeat the processing of the multi-level fusion module and the N-1th layer decoding module until the output of the first layer decoding module is obtained, that is, the semantic segmentation training result is obtained;

基于语义分割训练结果，计算动态权重三元组损失函数和广义交并比损失函数；Based on the semantic segmentation training results, the dynamic weight triplet loss function and the generalized intersection-over-union loss function are calculated;

基于动态权重三元组损失函数和广义交并比损失函数，调整语义分割模型的权重参数。Based on the dynamic weight triplet loss function and the generalized intersection-over-union loss function, the weight parameters of the semantic segmentation model are adjusted.

进一步地，所述动态权重三元组损失函数对应的公式为：Furthermore, the formula corresponding to the dynamic weight triplet loss function is:

； ;

其中，、分别表示三元动态权重参数、预定超参数，表示期望操作，表示语义分割模型对视觉-语言匹配训练特征中各类别预测的概率分布，表示最大值函数。in, , Respectively represent the ternary dynamic weight parameters and the predetermined hyperparameters, Indicates the expected operation. Represents the semantic segmentation model for visual-language matching training features In each category The predicted probability distribution, Represents the maximum value function.

进一步地，所述视觉语言融合分割模型还包括优化过程，对应的步骤为：Furthermore, the visual language fusion segmentation model also includes an optimization process, and the corresponding steps are:

利用NAS算法对初始优化后的视觉语言融合分割模型进行处理，得到一次优化后的视觉语言融合分割模型；The NAS algorithm is used to process the initially optimized visual language fusion segmentation model to obtain an optimized visual language fusion segmentation model;

对一次优化后的视觉语言融合分割模型进行参数剪枝和精简，得到二次优化后的视觉语言融合分割模型；The parameters of the visual language fusion segmentation model after the primary optimization are pruned and simplified to obtain the visual language fusion segmentation model after the secondary optimization;

利用量化感知训练方法对二次优化后的视觉语言融合分割模型进行处理，得到优化后的视觉语言融合分割模型。The second-optimized visual-language fusion segmentation model is processed using a quantized-aware training method to obtain an optimized visual-language fusion segmentation model.

本发明的有益效果为：The beneficial effects of the present invention are:

本方法基于多种注意力机制、多层次融合模块、动态调整机制，构建了视觉语言融合分割模型，确保了复杂场景下对已知与未知类别的航拍图像实现高精度、鲁棒性的分割效果；利用VIT、Mamba模型提取全局图像信息、局部图像细节，并采用自适应加权融合实现全局与局部特征的动态平衡，使用可变形卷积对局部结构进行强化，保证整体场景语义的准确表达，增强对视觉-语言融合特征的效率与精度；利用异构跨模态图融合模型整合更远距离的跨模态语义关系，不断融合来自视觉、文本以及领域知识的多维信息。Based on multiple attention mechanisms, multi-level fusion modules, and dynamic adjustment mechanisms, this method constructs a visual-language fusion segmentation model to ensure high-precision and robust segmentation effects for aerial images of known and unknown categories in complex scenes; VIT and Mamba models are used to extract global image information and local image details, and adaptive weighted fusion is used to achieve a dynamic balance between global and local features. Deformable convolution is used to strengthen the local structure to ensure the accurate expression of the overall scene semantics and enhance the efficiency and accuracy of visual-language fusion features; a heterogeneous cross-modal graph fusion model is used to integrate longer-distance cross-modal semantic relationships, continuously integrating multi-dimensional information from vision, text, and domain knowledge.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例的技术方案，下面将对实施例中所需要使用的附图作简单地介绍，应当理解，以下附图仅示出了本发明的某些实施例，因此不应被看作是对范围的限定，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他相关的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for use in the embodiments are briefly introduced below. It should be understood that the following drawings only show certain embodiments of the present invention and therefore should not be regarded as limiting the scope. For ordinary technicians in this field, other related drawings can be obtained based on these drawings without creative work.

图1为本发明实施例中方法流程图；FIG1 is a flow chart of a method according to an embodiment of the present invention;

图2为本发明实施例中视觉-语言特征提取模型结构图；FIG2 is a structural diagram of a visual-linguistic feature extraction model in an embodiment of the present invention;

图3为本发明实施例中语义分割模型结构图。FIG3 is a diagram showing the structure of a semantic segmentation model in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

下面将结合本发明实施例中附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。通常在此处附图中描述和表示出的本发明实施例的组件可以以各种不同的配置来布置和设计。因此，以下对在附图中提供的本发明的实施例的详细描述并非旨在限制要求保护的本发明的范围，而是仅仅表示本发明的选定实施例。基于本发明的实施例，本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. The components of the embodiments of the present invention generally described and represented in the drawings here can be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents the selected embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without making creative work belong to the scope of protection of the present invention.

请参阅图1，本实施例提供的一种视觉语言融合的无人机航拍图像开放词汇语义分割方法，其包括：Please refer to FIG1 . This embodiment provides a method for semantic segmentation of open vocabulary of drone aerial images with visual language fusion, which includes:

S1、利用无人机采集不同环境条件下的无类别航拍混合图像，生成对应的语言描述数据；不同环境条件是指不同的天气、时间或/和场景的环境条件。S1. Use a drone to collect unclassified aerial mixed images under different environmental conditions and generate corresponding language description data; different environmental conditions refer to environmental conditions of different weather, time and/or scene.

所述S1包括：The S1 includes:

S1-1、利用无人机平台采集高分辨率的初始无类别航拍图像；高分辨率为1280×720或更高；初始无类别航拍图像通常为RGB图像。在采集过程中，需覆盖多种场景，包括不同光照条件（晴天、阴天、夜间）、多样化天气（雨天、雾天）以及复杂环境（城市、农村、工业区等），进一步保证初始无类别航拍图像的多样性和代表性，有利于视觉语言融合分割模型的训练过程，提高对各种环境条件下的航拍图像对应的语义分割的准确性。S1-1. Use the UAV platform to collect high-resolution initial unclassified aerial images; the high resolution is 1280×720 or higher; the initial unclassified aerial images are usually RGB images. During the collection process, it is necessary to cover a variety of scenes, including different lighting conditions (sunny, cloudy, night), diverse weather (rainy, foggy) and complex environments (urban, rural, industrial areas, etc.), to further ensure the diversity and representativeness of the initial unclassified aerial images, which is conducive to the training process of the visual language fusion segmentation model and improves the accuracy of semantic segmentation corresponding to aerial images under various environmental conditions.

S1-2、对初始无类别航拍图像进行预处理，得到无类别航拍图像；预处理包括滤波降噪、归一化、尺寸调整和格式转换。S1-2, preprocessing the initial unclassified aerial image to obtain an unclassified aerial image; the preprocessing includes filtering and denoising, normalization, size adjustment and format conversion.

滤波降噪采取高斯滤波或中值滤波。当采用高斯滤波进行处理时，设置卷积核尺寸，用于平滑图像噪声，同时保留对应的图像边缘信息；卷积核尺寸可设置为3×3或5×5。当采用中值滤波进行处理时，设置对应的窗口大小，确定滤波效果，避免过度平滑导致图像细节丢失。Filtering and noise reduction adopts Gaussian filtering or median filtering. When Gaussian filtering is used for processing, set the convolution kernel size to smooth the image noise while retaining the corresponding image edge information; the convolution kernel size can be set to 3×3 or 5×5. When median filtering is used for processing, set the corresponding window size to determine the filtering effect and avoid excessive smoothing resulting in loss of image details.

归一化对应的公式为：The corresponding formula for normalization is:

； ;

其中，表示归一化后的无类别航拍图像，表示初始无类别航拍图像的像素值，、分别表示初始无类别航拍图像的最小像素值和最大像素值。in, represents the normalized unclassified aerial image, represents the pixel value of the initial unclassified aerial image, , They represent the minimum pixel value and maximum pixel value of the initial unclassified aerial image respectively.

尺寸调整采用双线性插值或其他重采样技术，将归一化后的无类别航拍图像调整为统一尺寸，得到调整后的无类别航拍图像，确保所有图像满足视觉语言融合分割模型或GPT模型的输入要求。在进行尺寸调整时，需注意保持图像的宽高比例，必要时采用填充技术防止形变。Resizing uses bilinear interpolation or other resampling techniques to resize the normalized unclassified aerial images to a uniform size, obtaining adjusted unclassified aerial images to ensure that all images meet the input requirements of the visual language fusion segmentation model or the GPT model. When resizing, it is necessary to maintain the aspect ratio of the image, and use padding technology to prevent deformation when necessary.

格式转换为视觉语言融合分割模型或GPT模型所需的张量格式；输入张量形式对应的公式为：The format is converted to the tensor format required by the visual language fusion segmentation model or the GPT model; the input tensor form The corresponding formula is:

； ;

其中，表示无类别航拍图像，、分别表示无类别航拍图像的高度、宽度，表示通道数。通道数可设置为3。in, Represents an unclassified aerial image. , Respectively represent the height and width of the unclassified aerial image, Indicates the number of channels. Can be set to 3.

S1-3、选择满足选择条件的无类别航拍图像并进行检查和拼接，得到无类别航拍混合图像。S1-3. Select unclassified aerial images that meet the selection conditions, check and stitch them together to obtain unclassified aerial mixed images.

为了利用不同视角下图像的丰富信息，对来自不同拍摄角度的无类别航拍图像进行混合生成，并选取满足选择条件的混合生成后的无类别航拍图像。选择条件为具有代表性、拍摄角度和/或拍摄环境不同的图像，这些图像应覆盖同一场景或相似场景中的不同部分，确保每幅图像提供独特的空间细节。In order to utilize the rich information of images from different perspectives, unclassified aerial images from different shooting angles are mixed and generated, and the unclassified aerial images after mixing that meet the selection criteria are selected. The selection criteria are representative images with different shooting angles and/or shooting environments. These images should cover different parts of the same scene or similar scenes to ensure that each image provides unique spatial details.

对满足选择条件的混合生成后的无类别航拍图像进行非重叠拼接，确保保留每幅图像的独特特征，得到无类别航拍混合图像。非重叠拼接是将图像按预定规则进行排列和拼接，形成一幅复合图像，对应的公式为：The mixed and generated non-classified aerial images that meet the selection conditions are non-overlapped and stitched to ensure that the unique characteristics of each image are retained to obtain a non-classified aerial mixed image. Non-overlapping stitching is to arrange and stitch images according to predetermined rules to form a composite image. The corresponding formula is:

； ;

其中，表示无类别航拍混合图像，表示拼接操作，、、分别表示第一个、第二个和第个的满足选择条件的混合生成后的无类别航拍图像。in, represents an unclassified aerial mixed image. Represents a splicing operation, , , Represents the first, second and The unclassified aerial images are generated by mixing the selected conditions.

S1-4、利用GPT模型生成各无类别航拍混合图像的描述文本，得到语言描述数据，为跨模态特征融合提供准确的语义信息。将各无类别航拍混合图像或其局部区域按照GPT模型要求进行格式化处理。格式化处理包括调整图像尺寸、确保图像通道及格式与模型输入匹配，为模型输入做好充分准备。S1-4. Generate description text for each unclassified aerial mixed image using the GPT model to obtain language description data, providing accurate semantic information for cross-modal feature fusion. Format each unclassified aerial mixed image or its local area according to the requirements of the GPT model. Formatting includes adjusting the image size, ensuring that the image channel and format match the model input, and making full preparations for the model input.

S2、构建视觉语言融合分割模型；所述视觉语言融合分割模型包括视觉-语言特征提取模型、异构跨模态图融合模型和语义分割模型。S2. Construct a visual-language fusion segmentation model; the visual-language fusion segmentation model includes a visual-language feature extraction model, a heterogeneous cross-modal graph fusion model and a semantic segmentation model.

S3、将无类别航拍混合图像、语言描述数据输入至视觉-语言特征提取模型，输出得到多尺度时空视觉特征。S3. Input the unclassified aerial mixed image and language description data into the visual-linguistic feature extraction model, and output the multi-scale spatiotemporal visual features.

如图2所示，所述视觉-语言特征提取模型包括并行的视觉特征提取子模型和语言特征提取子模型；视觉特征提取子模型包括串联的视觉特征提取模块、多尺度时空视觉特征融合模块；视觉特征提取模块包括并行的VIT全局特征提取模块（Vision Transformer）、局部细节特征提取模块；VIT全局特征提取模块包括Patch-Embedding层和基于自注意力机制的transformer层；多尺度时空视觉特征融合模块包括串联的动态特征融合层和可变形卷积层；局部细节特征提取模块采用Mamba模型；语言特征提取子模型包括串联的BERT层、CSA-跨模态对齐层。As shown in Figure 2, the visual-language feature extraction model includes a parallel visual feature extraction sub-model and a language feature extraction sub-model; the visual feature extraction sub-model includes a serial visual feature extraction module and a multi-scale spatiotemporal visual feature fusion module; the visual feature extraction module includes a parallel VIT global feature extraction module (Vision Transformer) and a local detail feature extraction module; the VIT global feature extraction module includes a Patch-Embedding layer and a transformer layer based on a self-attention mechanism; the multi-scale spatiotemporal visual feature fusion module includes a serial dynamic feature fusion layer and a deformable convolution layer; the local detail feature extraction module adopts the Mamba model; the language feature extraction sub-model includes a serial BERT layer and a CSA-cross-modal alignment layer.

从而，所述S3包括：Thus, the S3 includes:

S3-1、获取无类别航拍混合训练图像和对应的语言描述训练数据；S3-1, obtaining unclassified aerial mixed training images and corresponding language description training data;

S3-2、将无类别航拍混合训练图像输入至Patch-Embedding层，得到无类别航拍混合训练分割图像；通过Patch Embedding技术将无类别航拍混合训练图像分割成若干小块，每个小块被视为一个独立的Token。S3-2. Input the unclassified aerial photography mixed training image into the Patch-Embedding layer to obtain the unclassified aerial photography mixed training segmentation image; use the Patch Embedding technology to divide the unclassified aerial photography mixed training image into several small blocks, and each small block is regarded as an independent Token.

S3-3、将无类别航拍混合训练分割图像输入至基于自注意力机制的transformer层，得到无类别航拍混合训练全局特征图；通过自注意力机制捕捉各图块之间的长距离依赖关系和全局语义联系，从而提取整体场景的上下文信息。例如，VIT全局特征提取模块能够识别出航拍图像中的建筑群、大面积道路以及开阔区域，为整体场景建模提供了坚实的基础。S3-3, input the unclassified aerial mixed training segmented image into the transformer layer based on the self-attention mechanism to obtain the unclassified aerial mixed training global feature map; the long-distance dependencies and global semantic connections between each block are captured through the self-attention mechanism, thereby extracting the context information of the overall scene. For example, the VIT global feature extraction module can identify buildings, large roads and open areas in aerial images, providing a solid foundation for overall scene modeling.

基于自注意力机制的transformer层包括对应的第一编码器和第一解码器，第一编码器采用基于自注意力机制的编码器。从而，所述S3-3包括：The transformer layer based on the self-attention mechanism includes a corresponding first encoder and a first decoder, and the first encoder adopts an encoder based on the self-attention mechanism. Thus, S3-3 includes:

S3-3-1、对各个Token进行线性转换并附加对应的位置信息，得到对应的Token向量表示；S3-3-1. Perform linear transformation on each Token and add corresponding position information to obtain the corresponding Token vector representation;

S3-3-2、将各个Token向量表示输入至基于自注意力机制的编码器，计算每个Token向量表示与其他Token向量表示之间的相关性，得到对应的注意力权重；基于注意力权重，确定对应的Token关键向量表示；S3-3-2. Input each Token vector representation into the encoder based on the self-attention mechanism, calculate the correlation between each Token vector representation and other Token vector representations, and obtain the corresponding attention weight; based on the attention weight, determine the corresponding Token key vector representation;

在编码器的每一层中，自注意力机制会计算每个Token向量表示与其它所有Token向量表示之间的相关性，每个小块都会“关注”图像中所有其他小块的信息。为每个Token向量表示生成查询（Query）、键（Key）和值（Value）三个向量。通过对查询和所有键向量计算点积，再经过Softmax归一化，得到每个Token向量表示对于其他Token向量表示的注意力权重，再利用这些权重对所有Token对应的值进行加权求和，从而生成一个新的表示。这种全局信息的融合能够捕捉到远距离区域之间的关系和上下文语义。比如，在航拍图像中，可以通过自注意力机制将分散在各处的建筑、道路等信息进行关联，构建出一个全面的场景理解。从而，经过多层自注意力处理后，编码器输出的全局关键训练特征就包含了丰富的上下文信息和长距离依赖，能够为后续的分割或其他视觉任务提供更为精准的特征支持。In each layer of the encoder, the self-attention mechanism calculates the correlation between each Token vector representation and all other Token vector representations, and each small block will "pay attention" to the information of all other small blocks in the image. Generate three vectors for each Token vector representation: query, key, and value. By calculating the dot product between the query and all key vectors, and then normalizing them with Softmax, we can get the attention weight of each Token vector representation for other Token vector representations, and then use these weights to weighted sum the values corresponding to all Tokens to generate a new representation. This fusion of global information can capture the relationship and contextual semantics between distant regions. For example, in aerial images, the self-attention mechanism can be used to associate information such as buildings and roads scattered everywhere to build a comprehensive scene understanding. Therefore, after multiple layers of self-attention processing, the global key training features output by the encoder contain rich contextual information and long-distance dependencies, which can provide more accurate feature support for subsequent segmentation or other visual tasks.

S3-3-3、对所有Token关键向量表示进行加权求和，生成全局关键训练特征；S3-3-3. Perform weighted summation on all Token key vector representations to generate global key training features.

S3-3-4、将全局关键训练特征输入至第一解码器，输出得到无类别航拍混合训练全局特征图。S3-3-4. Input the global key training features into the first decoder, and output the unclassified aerial photography mixed training global feature map.

S3-4、将无类别航拍混合训练图像输入至Mamba模型，得到无类别航拍混合训练局部特征图；Mamba模型是基于状态空间模型（SSM）构建的，用于对航拍图像的局部区域进行特征提取，能够有效捕捉长程空间依赖关系，并对图像中的边缘、纹理等细节进行高效建模，从而弥补VIT全局特征提取模块在细粒度目标提取上的弱化问题，确保了在复杂场景中对小目标和细节区域的精细刻画。S3-4. Input the unclassified aerial mixed training image into the Mamba model to obtain the unclassified aerial mixed training local feature map; the Mamba model is built based on the state-space model (SSM) and is used to extract features from local areas of aerial images. It can effectively capture long-range spatial dependencies and efficiently model details such as edges and textures in the image, thereby making up for the weakness of the VIT global feature extraction module in fine-grained target extraction and ensuring the fine characterization of small targets and detail areas in complex scenes.

S3-5、将无类别航拍混合训练局部特征图、无类别航拍混合训练全局特征图输入至多尺度时空视觉特征融合模块，得到多尺度时空视觉训练特征；S3-5, inputting the unclassified aerial photography mixed training local feature map and the unclassified aerial photography mixed training global feature map into the multi-scale spatiotemporal visual feature fusion module to obtain the multi-scale spatiotemporal visual training feature;

所述S3-5包括：The S3-5 includes:

S3-5-1、将无类别航拍混合训练局部特征图、无类别航拍混合训练全局特征图输入至动态特征融合层，得到多尺度时空视觉初始训练特征；S3-5-1, input the unclassified aerial photography mixed training local feature map and the unclassified aerial photography mixed training global feature map into the dynamic feature fusion layer to obtain the multi-scale spatiotemporal visual initial training features;

动态特征融合层采用自适应加权融合机制，通过动态计算权重平衡两者（无类别航拍混合训练局部特征图、无类别航拍混合训练全局特征图）的贡献，对应的公式为：The dynamic feature fusion layer adopts an adaptive weighted fusion mechanism to balance the contributions of the two (the local feature map of unclassified aerial photography mixed training and the global feature map of unclassified aerial photography mixed training) by dynamically calculating the weights. The corresponding formula is:

； ;

其中，、分别表示无类别航拍混合训练全局特征图、无类别航拍混合训练局部特征图，表示激活函数，表示自适应权重参数，表示自适应权重矩阵，表示多尺度时空视觉初始训练特征，表示无类别航拍混合训练全局特征图、无类别航拍混合训练局部特征图的拼接特征图。自适应权重参数的取值范围为（0，1）。in, , They represent the global feature map of unclassified aerial photography mixed training and the local feature map of unclassified aerial photography mixed training respectively. represents the activation function, represents the adaptive weight parameter, represents the adaptive weight matrix, represents the initial training features of multi-scale spatiotemporal vision, Represents the concatenated feature map of the global feature map of the unclassified aerial photography mixed training and the local feature map of the unclassified aerial photography mixed training. Adaptive weight parameter The value range is (0, 1).

通过自适应加权融合机制得到多尺度时空视觉特征，可根据不同图像区域的重要性动态调整权重，在局部细节更丰富的区域更依赖Mamba模型，而在全局结构明确的区域则侧重VIT全局特征提取模块，从而获得更为平衡和精准的视觉表示。Multi-scale spatiotemporal visual features are obtained through an adaptive weighted fusion mechanism. The weights can be dynamically adjusted according to the importance of different image regions. In areas with richer local details, the Mamba model is more relied on, while in areas with clear global structures, the VIT global feature extraction module is emphasized, thereby obtaining a more balanced and accurate visual representation.

S3-5-2、将多尺度时空视觉初始训练特征输入至可变形卷积层，得到多尺度时空视觉训练特征，可进一步提升局部结构的捕捉能力；可变形卷积允许卷积核根据图像内容自适应调整采样位置，从而更好地适应目标形变和复杂遮挡情况；能够针对图像中的边缘和局部形状进行精细建模，提高了分割边界的准确性和整体细粒度信息的表达能力。S3-5-2. Input the multi-scale spatiotemporal vision initial training features into the deformable convolution layer to obtain multi-scale spatiotemporal vision training features, which can further improve the ability to capture local structures; deformable convolution allows the convolution kernel to adaptively adjust the sampling position according to the image content, so as to better adapt to target deformation and complex occlusion situations; it can perform fine modeling on the edges and local shapes in the image, thereby improving the accuracy of the segmentation boundaries and the ability to express overall fine-grained information.

分别利用VIT全局特征提取模块、Mamba模型提取全局图像信息、局部图像细节，并采用自适应加权融合实现全局与局部特征的动态平衡，使用可变形卷积对局部结构进行强化，从而构成了一套多尺度时空视觉特征提取方案，既保证整体场景语义的准确表达，又能细致地捕捉到图像中的局部细节。The VIT global feature extraction module and the Mamba model are used to extract global image information and local image details respectively, and adaptive weighted fusion is used to achieve a dynamic balance between global and local features. Deformable convolution is used to enhance the local structure, thus forming a multi-scale spatiotemporal visual feature extraction solution, which not only ensures the accurate expression of the overall scene semantics, but also can capture the local details in the image in detail.

S3-6、将语言描述训练数据输入至BERT模型，得到语言训练特征；S3-6. Input the language description training data into the BERT model to obtain language training features;

所述S3-6包括：The S3-6 includes:

S3-6-1、基于语言描述训练数据生成对应的连续向量文本表示，对应的公式为：S3-6-1. Training data based on language description Generate the corresponding continuous vector text representation , the corresponding formula is:

； ;

其中，表示BERT模型，表示分词操作。in, represents the BERT model, Represents word segmentation operation.

S3-6-2、基于无类别航拍混合训练局部特征图和无类别航拍混合训练全部特征图，设置一组视觉原型（Prototype）；为了与多尺度时空视觉训练特征进行跨模态对齐，需预设一组视觉原型，这些原型通常来源于视觉特征提取模块提取的区域特征（无类别航拍混合训练局部/全局特征图），连续向量文本表示与视觉原型处于同一隐空间中，但直接对齐存在困难，因此需要进一步优化文本特征，使其能够与正样本视觉特征更紧密地对齐。S3-6-2. Based on the local feature map of unclassified aerial photography mixed training and the whole feature map of unclassified aerial photography mixed training, a set of visual prototypes is set; in order to perform cross-modal alignment with the multi-scale spatiotemporal visual training features, a set of visual prototypes needs to be preset. These prototypes are usually derived from the regional features extracted by the visual feature extraction module (local/global feature map of unclassified aerial photography mixed training), and the continuous vector text representation Although they are in the same latent space as the visual prototype, direct alignment is difficult, so the text features need to be further optimized so that they can be more closely aligned with the positive sample visual features.

视觉特征提取子模型提取区域特征时，实际上是结合了VIT全局特征提取模块和局部细节特征提取模块（Mamba模型）的优势：VIT全局特征提取模块通过自注意力机制提取全局信息，而Mamba模型则专注于捕捉局部细节，因此，对应的区域特征作为视觉原型，再与多尺度时空视觉训练特征进行对齐，从而辅助文本特征（连续向量文本表示）的进一步优化，使其与正样本视觉特征更加紧密地对应，最终通过对比语义对齐（CSA）损失函数实现跨模态对齐。When the visual feature extraction sub-model extracts regional features, it actually combines the advantages of the VIT global feature extraction module and the local detail feature extraction module (Mamba model): the VIT global feature extraction module extracts global information through the self-attention mechanism, while the Mamba model focuses on capturing local details. Therefore, the corresponding regional features are used as visual prototypes and then aligned with the multi-scale spatiotemporal visual training features, thereby assisting in the further optimization of text features (continuous vector text representation) to make them more closely correspond to the positive sample visual features, and finally achieving cross-modal alignment through the contrastive semantic alignment (CSA) loss function.

S3-6-3、基于连续向量文本表示的正/负相似度，计算CSA损失函数；S3-6-3. Calculate the CSA loss function based on the positive/negative similarity of the continuous vector text representation;

为了确保连续向量文本表示与视觉原型在同一隐空间内达到准确对齐，设计了对比语义对齐（Contrastive Semantic Alignment，CSA）预训练任务。该任务的目标是使得连续向量文本表示与之对应的正样本视觉特征之间的正相似度尽可能高，而与负样本视觉特征之间的负相似度尽可能低。正样本视觉特征和负样本视觉特征分别表示与连续向量文本表示匹配、不匹配的多尺度时空视觉训练特征。In order to ensure that the continuous vector text representation and the visual prototype are accurately aligned in the same latent space, a Contrastive Semantic Alignment (CSA) pre-training task is designed. The goal of this task is to make the continuous vector text representation and its corresponding positive sample visual features The positive similarity between them is as high as possible, while the visual features of negative samples The negative similarity between them should be as low as possible. Positive sample visual features and negative sample visual features Respectively represent the multi-scale spatiotemporal visual training features that match and do not match the continuous vector text representation.

从而，CSA损失函数对应的公式为：Thus, the CSA loss function The corresponding formula is:

； ;

其中，表示相似度度量函数，表示以常数e为底的指数函数，表示以常数为底的对数函数，表示求和函数，表示温度参数。相似度度量函数可采取余弦相似度。温度参数用于调节分布的平滑程度。in, represents the similarity measurement function, represents an exponential function with a constant e as base, represents a logarithmic function with a constant as base, represents the summation function, Represents the temperature parameter. The similarity measurement function can be cosine similarity. Temperature parameter Used to adjust the smoothness of the distribution.

CSA损失函数迫使文本编码器（BERT模型）学习到更具判别性的语义表示，使得生成的连续向量文本表示能够更好地反映无类别航拍图像中的关键信息，并与对应的多尺度时空视觉训练特征实现精准对齐。CSA Loss Function The text encoder (BERT model) is forced to learn more discriminative semantic representations, so that the generated continuous vector text representation can better reflect the key information in unclassified aerial images and achieve accurate alignment with the corresponding multi-scale spatiotemporal visual training features.

S3-6-4、基于CSA损失函数，对视觉原型和连续向量文本表示进行匹配对齐，得到匹配结果；S3-6-4. Visual prototype and continuous vector text representation based on CSA loss function Perform matching alignment to obtain matching results;

S3-6-5、基于匹配结果和连续向量文本表示，生成对应的语言训练特征。S3-6-5. Text representation based on matching results and continuous vectors , generate corresponding language training features.

在CSA预训练任务的驱动下，文本编码器不断更新其参数，从而在隐空间中拉近匹配对连续向量文本表示和正样本视觉特征之间的距离，同时拉远连续向量文本表示与所有负样本视觉特征之间的距离。通过这种对比学习机制，语言训练特征不仅保留了语言的自然表达能力，还嵌入了更丰富的跨模态语义信息，提升了与多尺度时空视觉训练特征的匹配精度。文本编码器输出的特征表示将具备更强的判别性，为后续跨模态图融合和分割任务提供了坚实的语义基础。Driven by the CSA pre-training task, the text encoder continuously updates its parameters to closely match the continuous vector text representation in the latent space. and positive sample visual features The distance between them, while pulling away the continuous vector text representation With all negative visual features Through this contrastive learning mechanism, the language training features not only retain the natural expressive power of the language, but also embed richer cross-modal semantic information, improving the matching accuracy with the multi-scale spatiotemporal visual training features. The feature representation output by the text encoder will have stronger discriminability, providing a solid semantic foundation for subsequent cross-modal graph fusion and segmentation tasks.

S3-7、基于多尺度时空视觉训练特征和语言训练特征，调整视觉特征提取子模型的权重参数。S3-7. Based on the multi-scale spatiotemporal visual training features and language training features, adjust the weight parameters of the visual feature extraction sub-model.

S4、将语言特征和多尺度时空视觉特征输入至异构跨模态图融合模型，输出得到视觉-语言匹配特征。S4. Input the language features and multi-scale spatiotemporal visual features into the heterogeneous cross-modal graph fusion model, and output the visual-language matching features.

所述异构跨模态图融合模型采用基于多头注意力机制的异构跨模态图网络。The heterogeneous cross-modal graph fusion model adopts a heterogeneous cross-modal graph network based on a multi-head attention mechanism.

所述异构跨模态图融合模型的训练过程包括：The training process of the heterogeneous cross-modal graph fusion model includes:

S4-1、构建初始异构跨模态图；获取语言训练特征、多尺度时空视觉训练特征并分别作为文本节点、视觉节点，并映射至初始异构跨模态图中；S4-1. Construct an initial heterogeneous cross-modal graph; obtain language training features and multi-scale spatiotemporal visual training features and use them as text nodes and visual nodes respectively, and map them into the initial heterogeneous cross-modal graph;

S4-2、计算文本节点和视觉节点之间的节点相似度；基于各个节点相似度设置边的权重以及知识图谱，构建对应的边，得到异构跨模态图；边包括视觉-文本边和文本-知识边；节点相似度可采用余弦相似度。余弦相似度用于衡量两个向量之间的角度相似性。余弦相似度的取值范围为[-1，1]，其值越大表示两个向量越相似。在文本节点和视觉节点之间，余弦相似度可以用来计算它们在特征空间中的相似性，通常可以通过对文本特征和视觉特征向量进行标准化后计算，对应的公式为：S4-2. Calculate the node similarity between text nodes and visual nodes; set the edge weights and knowledge graph based on the similarity of each node, construct the corresponding edges, and obtain a heterogeneous cross-modal graph; the edges include visual-text edges and text-knowledge edges; the node similarity can use cosine similarity. Cosine similarity is used to measure the angular similarity between two vectors. The value range of cosine similarity is [-1, 1], and the larger the value, the more similar the two vectors are. Between text nodes and visual nodes, cosine similarity can be used to calculate their similarity in the feature space, which can usually be calculated by normalizing the text feature and visual feature vectors. The corresponding formula is:

； ;

其中，、分别表示文本节点和视觉节点，表示文本节点和视觉节点之间的余弦相似性，表示欧几里得范数。in, , Represent text nodes and visual nodes respectively, Represents a text node and visual nodes The cosine similarity between represents the Euclidean norm.

如果语言特征和多尺度时空视觉特征中的“受损房屋”高度匹配，则在该视觉节点与对应文本节点之间建立一条权重较高的边。其次，构建文本-知识边，将文本节点与外部领域知识图谱中的相关节点相连接，例如“农田”节点与“作物类型”节点，通过这种连接引入外部专业知识，增强图中的语义表达。这一步确保了图中不仅包含直接的视觉与文本对应关系，还能融入领域知识以拓展跨模态信息的深度。If the language feature and the "damaged house" in the multi-scale spatiotemporal visual feature are highly matched, a high-weight edge is established between the visual node and the corresponding text node. Secondly, a text-knowledge edge is constructed to connect the text node with the relevant nodes in the external domain knowledge graph, such as the "farmland" node and the "crop type" node. This connection introduces external expertise and enhances the semantic expression in the graph. This step ensures that the graph not only contains direct visual and textual correspondences, but also incorporates domain knowledge to expand the depth of cross-modal information.

S4-3、根据多头图注意力机制（GAT），获取每个节点及其相邻节点的聚合信息；S4-3. According to the multi-head graph attention mechanism (GAT), the aggregate information of each node and its adjacent nodes is obtained;

S4-4、基于各个聚合信息，通过多头图注意力机制的多跳聚合更新每个节点的表示；所述节点的表示为视觉-语言匹配训练特征，其包括语言特征、多尺度时空视觉特征；更新对应的公式为：S4-4, based on each aggregated information, the representation of each node is updated through multi-hop aggregation of the multi-head graph attention mechanism; the representation of the node is a visual-language matching training feature, which includes language features and multi-scale spatiotemporal visual features; the corresponding update formula is:

； ;

其中，表示将所有图注意力头的输出进行拼接，表示多头图注意力数量，、、、分别表示第一个图注意力头、第二个图注意力头、第个图注意力头和最后一个图注意力头对应的输出数据，表示节点的邻居节点集合，表示第个图注意力头中节点对邻居节点的归一化注意力权重，表示第个图注意力头的可学习权重矩阵，、分别表示第层的节点的表示、第层的邻居节点的表示。in, It means concatenating the outputs of all the attention heads. represents the number of multi-head graph attentions, , , , They represent the first graph attention head, the second graph attention head, and the The output data corresponding to the first graph attention head and the last graph attention head, Representation Node The set of neighbor nodes of Indicates Nodes in the graph attention head For neighbor nodes The normalized attention weight of Indicates The learnable weight matrix of the graph attention head, , Respectively represent Layer Node The expression Neighbor nodes of the layer Representation.

通过这种多头注意力机制，每个节点能够捕捉到来自不同角度和尺度的跨模态语义信息，确保信息聚合的全面性和深度。通过多头图注意力的多跳聚合，保证每个节点在更新时不仅考虑直接邻居的信息，还能整合更远距离的跨模态语义关系。这种聚合机制使得图中各节点的表示在逐层传播过程中不断融合来自视觉、文本以及领域知识的多维信息，形成更加丰富和判别性的节点表示。通过图神经网络的多层堆叠，整个异构图能够充分捕获并传递跨模态的细粒度语义信息，为后续的分割任务提供了强有力的语义支撑。Through this multi-head attention mechanism, each node can capture cross-modal semantic information from different angles and scales, ensuring the comprehensiveness and depth of information aggregation. Through the multi-hop aggregation of multi-head graph attention, each node is guaranteed to not only consider the information of direct neighbors when updating, but also integrate cross-modal semantic relationships at a longer distance. This aggregation mechanism enables the representation of each node in the graph to continuously integrate multi-dimensional information from vision, text, and domain knowledge during the layer-by-layer propagation process, forming a richer and more discriminative node representation. Through the multi-layer stacking of graph neural networks, the entire heterogeneous graph can fully capture and transmit cross-modal fine-grained semantic information, providing strong semantic support for subsequent segmentation tasks.

S4-5、基于更新后的节点的表示，优化异构跨模态图，即优化异构跨模态图融合模型。S4-5. Based on the updated node representation, optimize the heterogeneous cross-modal graph, that is, optimize the heterogeneous cross-modal graph fusion model.

通过构建视觉节点和文本节点以及视觉-文本边关系、文本-知识边关系，引入多头图注意力机制实现节点信息的加权聚合，并通过多跳传播不断优化图结构，构建了一个高效的异构跨模态图，使得视觉特征和文本特征在同一图结构中实现细粒度交互，并为后续任务提供了丰富的跨模态语义表达。By constructing visual nodes and text nodes as well as visual-text edge relationships and text-knowledge edge relationships, a multi-head graph attention mechanism is introduced to realize weighted aggregation of node information. The graph structure is continuously optimized through multi-hop propagation, and an efficient heterogeneous cross-modal graph is constructed. This allows visual features and text features to interact in a fine-grained manner in the same graph structure, and provides rich cross-modal semantic expressions for subsequent tasks.

S5、将视觉-语言匹配特征输入至语义分割模型，完成对航拍图像的语义分割。S5. Input the visual-language matching features into the semantic segmentation model to complete the semantic segmentation of the aerial image.

如图3所示，所述语义分割模型采用轻量级U-Net++模型；轻量级U-Net++模型包括串联的编码器和基于通道-空间双注意力的解码器；在编码器与解码器之间构建了多级跳跃连接，从而确保在信息传递过程中能够充分保留细粒度的图像信息。编码器包括N层解码层；基于通道-空间双注意力的解码器包括N层解码模块，每两层解码模块通过多层次融合模块进行连接；各解码模块均包括串联的CSDA层（通道-空间双注意力层）和解码层。As shown in Figure 3, the semantic segmentation model adopts a lightweight U-Net++ model; the lightweight U-Net++ model includes a serial encoder and a decoder based on channel-space dual attention; a multi-level skip connection is constructed between the encoder and the decoder to ensure that fine-grained image information can be fully retained during information transmission. The encoder includes N layers of decoding layers; the decoder based on channel-space dual attention includes N layers of decoding modules, and every two layers of decoding modules are connected through a multi-level fusion module; each decoding module includes a serial CSDA layer (channel-space dual attention layer) and a decoding layer.

从而，所述S5包括：Thus, the S5 includes:

S5-1、获取视觉-语言匹配训练特征并输入至编码器，得到每个编码层对应的第i视觉-语言匹配训练编码特征；除第一层编码层的输入为视觉-语言匹配训练特征，其余编码层的输入为上一层编码层的输出，即第i层编码层的输入为第i-1层编码层的输出；S5-1, obtain visual-language matching training features and input them to the encoder to obtain the i-th visual-language matching training coding features corresponding to each coding layer; except that the input of the first coding layer is the visual-language matching training features, the input of the remaining coding layers is the output of the previous coding layer, that is, the input of the i-th coding layer is the output of the i-1-th coding layer;

S5-2、将第N视觉-语言匹配训练特征输入至第N层解码模块，输出得到第N视觉-语言匹配解码结果，对应的过程为：S5-2, input the Nth visual-language matching training feature into the Nth layer decoding module, and output the Nth visual-language matching decoding result. The corresponding process is:

S5-2-1、将第N视觉-语言匹配训练特征输入至第N层CSDA层，得到对应的第N视觉-语言匹配训练关键特征；轻量级U-Net++模型嵌入了通道-空间双注意力模块（CSDA）。CSDA模块通过在通道和空间两个维度上分别引入注意力机制，进一步强化了特征表达的判别能力。通道注意力模块能够自适应地为不同特征通道分配权重，从而强调对分割任务更为关键的语义信息；而空间注意力模块则在空间维度上对重要区域进行聚焦，确保细节丰富的区域（如小目标或模糊边缘）得到充分表达。通过这两种注意力机制的协同作用，CSDA模块显著提升了分割头对细粒度信息的敏感度和表达能力，为后续精细分割提供了更加鲁棒的特征基础。S5-2-1. Input the Nth visual-language matching training features into the Nth CSDA layer to obtain the corresponding Nth visual-language matching training key features; the lightweight U-Net++ model is embedded with a channel-space dual attention module (CSDA). The CSDA module further enhances the discriminative ability of feature expression by introducing attention mechanisms in the two dimensions of channel and space. The channel attention module can adaptively assign weights to different feature channels, thereby emphasizing semantic information that is more critical to the segmentation task; while the spatial attention module focuses on important areas in the spatial dimension to ensure that areas with rich details (such as small targets or blurred edges) are fully expressed. Through the synergy of these two attention mechanisms, the CSDA module significantly improves the sensitivity and expression ability of the segmentation head to fine-grained information, providing a more robust feature foundation for subsequent fine segmentation.

S5-2-2、将第N视觉-语言匹配训练关键特征输入至第N层解码层，得到第N视觉-语言匹配解码结果。S5-2-2, input the Nth visual-language matching training key features into the Nth decoding layer to obtain the Nth visual-language matching decoding result.

各解码模块的处理过程与第N层解码模块的处理过程相同。The processing process of each decoding module is the same as the processing process of the Nth layer decoding module.

S5-3、将第N视觉-语言匹配解码结果和第N-1视觉-语言匹配训练特征输入至多层次融合模块，输出得到第N-1视觉-语言匹配训练融合特征；轻量级U-Net++模型通过引入多层次融合模块，使得从低层到高层的特征能够逐步被整合，同时加强了对小目标和复杂边缘信息的捕捉，使得分割头在不同尺度上均能获得精准的特征表达，为最终的分割任务提供了稳固的基础。S5-3. Input the Nth visual-language matching decoding result and the N-1th visual-language matching training feature into the multi-level fusion module, and output the N-1th visual-language matching training fusion feature. The lightweight U-Net++ model introduces a multi-level fusion module, so that the features from low-level to high-level can be gradually integrated, while strengthening the capture of small targets and complex edge information, so that the segmentation head can obtain accurate feature expression at different scales, providing a solid foundation for the final segmentation task.

S5-4、将第N-1视觉-语言匹配训练融合特征输入至第N-1层解码模块，输出得到第N-1视觉-语言匹配解码结果；S5-4, inputting the N-1th visual-language matching training fusion feature into the N-1th layer decoding module, and outputting the N-1th visual-language matching decoding result;

S5-5、重复多层次融合模块和第N-1层解码模块的处理过程，直至得到第一层解码模块的输出，即得到语义分割训练结果；S5-5, repeat the processing of the multi-level fusion module and the N-1th layer decoding module until the output of the first layer decoding module is obtained, that is, the semantic segmentation training result is obtained;

S5-6、基于语义分割训练结果，计算动态权重三元组损失函数和广义交并比损失函数；S5-6. Based on the semantic segmentation training results, calculate the dynamic weight triplet loss function and the generalized intersection-over-union loss function;

动态权重三元组损失函数通过不确定性估计机制动态计算三元动态权重参数，从而，所述动态权重三元组损失函数对应的公式为：The dynamic weight triplet loss function dynamically calculates the triplet dynamic weight parameters through the uncertainty estimation mechanism. , thus, the formula corresponding to the dynamic weight triplet loss function is:

； ;

当语义分割模型对某个样本（视觉-语言匹配训练特征）预测不确定性较高时（置信度较低），通过动态权重三元组损失函数动态调整三元动态权重参数，使语义分割模型在训练过程中能给予未知类别更多关注，从而实现对已知与未知类别的均衡优化。When the semantic segmentation model predicts a certain sample (visual-language matching training features) with high uncertainty (low confidence), the dynamic weight triplet loss function is used to dynamically adjust the ternary dynamic weight parameters, so that the semantic segmentation model can pay more attention to the unknown categories during training, thereby achieving balanced optimization of known and unknown categories.

采用广义交并比（GIoU）损失函数来替代传统的交叉熵损失函数。传统交叉熵损失函数在处理分割边界时可能存在拟合不足的问题，而GIoU损失函数则通过综合考虑预测分割区域与真实区域的重叠、差异及外包络情况，进一步提升了分割结果的边界精度和整体鲁棒性，可以更精细地调控语义分割模型在各个像素级别上的预测表现，确保分割图能够更贴合真实边界。The generalized intersection-over-union (GIoU) loss function is used to replace the traditional cross-entropy loss function. The traditional cross-entropy loss function may have the problem of insufficient fitting when dealing with segmentation boundaries, while the GIoU loss function further improves the boundary accuracy and overall robustness of the segmentation result by comprehensively considering the overlap, difference and outer envelope between the predicted segmentation area and the real area. It can more finely regulate the prediction performance of the semantic segmentation model at each pixel level, ensuring that the segmentation map can better fit the real boundary.

S5-7、基于动态权重三元组损失函数和广义交并比损失函数，调整语义分割模型的权重参数。S5-7. Adjust the weight parameters of the semantic segmentation model based on the dynamic weight triplet loss function and the generalized intersection-over-union loss function.

基于通道-空间双注意力机制和多层次融合模块对U-Net++模型进行改进，确保融合后的多模态特征能够被充分利用；引入了动态权重三元组损失函数，通过不确定性估计动态调整损失权重，解决了已知与未知类别间的不平衡优化问题；采用GIoU损失进一步提升了分割边界的拟合精度和整体鲁棒性；通过高效的分割头设计和自适应损失函数的精细调控，实现了对多模态融合特征的高精度分割。The U-Net++ model is improved based on the channel-space dual attention mechanism and multi-level fusion module to ensure that the fused multimodal features can be fully utilized; the dynamic weight triplet loss function is introduced to dynamically adjust the loss weight through uncertainty estimation, which solves the unbalanced optimization problem between known and unknown categories; the GIoU loss is used to further improve the fitting accuracy and overall robustness of the segmentation boundary; through efficient segmentation head design and fine-tuning of the adaptive loss function, high-precision segmentation of multimodal fusion features is achieved.

所述视觉语言融合分割模型还包括优化过程，对应的步骤为：The visual language fusion segmentation model also includes an optimization process, and the corresponding steps are:

利用NAS算法对初始优化后的视觉语言融合分割模型进行处理，得到一次优化后的视觉语言融合分割模型；对异构跨模态图融合模型进行神经架构搜索（NAS），通过自动化搜索算法，在预定义的架构空间中寻找最优网络结构，剔除冗余的结构和参数，从而得到既满足准确性要求又具备更低计算复杂度的异构跨模态图融合模型。NAS算法能够评估不同网络配置在特定任务下的性能，最终选择那些在边缘设备上高效运行的结构，达到整体轻量化的目标。The NAS algorithm is used to process the initially optimized visual language fusion segmentation model to obtain an optimized visual language fusion segmentation model; the neural architecture search (NAS) is performed on the heterogeneous cross-modal graph fusion model. Through the automated search algorithm, the optimal network structure is found in the predefined architecture space, and redundant structures and parameters are eliminated, thereby obtaining a heterogeneous cross-modal graph fusion model that meets the accuracy requirements and has lower computational complexity. The NAS algorithm can evaluate the performance of different network configurations under specific tasks, and ultimately select those structures that run efficiently on edge devices to achieve the goal of overall lightweight.

对一次优化后的视觉语言融合分割模型进行参数剪枝和精简，得到二次优化后的视觉语言融合分割模型；对视觉语言融合分割模型进行参数剪枝。通过分析视觉语言融合分割模型中各模块的权重贡献，剔除那些对输出影响较小或冗余的参数。剪枝操作不仅可以减少模型参数数量，还能降低内存占用和计算需求。经过精细的剪枝和参数精简后，模型将更加紧凑，便于在资源受限的边缘设备上部署，同时保持较高的预测精度。The parameters of the visual language fusion segmentation model after the primary optimization are pruned and simplified to obtain the visual language fusion segmentation model after the secondary optimization; the parameters of the visual language fusion segmentation model are pruned. By analyzing the weight contribution of each module in the visual language fusion segmentation model, those parameters that have little impact on the output or are redundant are eliminated. The pruning operation can not only reduce the number of model parameters, but also reduce memory usage and computing requirements. After fine pruning and parameter simplification, the model will be more compact and easy to deploy on resource-constrained edge devices while maintaining high prediction accuracy.

利用量化感知训练方法对二次优化后的视觉语言融合分割模型进行处理，得到训练后的视觉语言融合分割模型。采用量化感知训练（QAT）方法，实现模型8-bit量化部署。在训练过程中，QAT方法会模拟低精度计算环境，使得模型能够逐步适应8-bit整数运算。经过量化训练后的模型，其权重和激活函数均以低精度表示，从而大幅降低了内存需求和计算资源消耗。这种量化不仅缩小了模型尺寸，而且在保持模型精度的前提下，显著提高了推理速度，预计可实现3倍左右的加速效果。The visual language fusion segmentation model after secondary optimization is processed using the quantization-aware training method to obtain the trained visual language fusion segmentation model. The quantization-aware training (QAT) method is used to implement the 8-bit quantization deployment of the model. During the training process, the QAT method simulates a low-precision computing environment so that the model can gradually adapt to 8-bit integer operations. After quantization training, the weights and activation functions of the model are represented in low precision, which greatly reduces memory requirements and computing resource consumption. This quantization not only reduces the model size, but also significantly improves the inference speed while maintaining the model accuracy, and is expected to achieve an acceleration effect of about 3 times.

以上所述，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应所述以权利要求的保护范围为准。The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed by the present invention, which should be included in the protection scope of the present invention. Therefore, the protection scope of the present invention should be based on the protection scope of the claims.