
A fundus image classification method

Info

Publication number
CN117636449A
Authority
CN
China
Prior art keywords
image
features
fundus
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311781657.1A
Other languages
Chinese (zh)
Inventor
黄晓阳
陈妍羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202311781657.1A
Publication of CN117636449A
Legal status: Pending (current)


Abstract

A fundus image classification method, relating to the fields of deep learning and image processing, comprising the following steps: 1) obtain an existing fundus image dataset and perform data conversion, image preprocessing and data augmentation to obtain a preprocessed dataset; 2) extract image features and fuse them; 3) construct a HiFuse network containing an attention mechanism and a feature fusion mechanism; extract features from the preprocessed dataset with the VGG-16 network and the HiFuse model respectively, fuse them with the features from step 2), and classify with a softmax layer to obtain the fundus lesion image classification result. In the fundus image preprocessing stage, an image preprocessing method that extracts fundus image features relatively clearly is explored, which provides important help for subsequent model training. Combining the attention mechanism with the feature fusion mechanism yields more accurate results in fundus image classification.

Description

A fundus image classification method

Technical Field

The present invention relates to the fields of deep learning and image processing, and in particular to a fundus image classification method.

Background Art

Fundus imaging is widely used in ophthalmology as an effective and economical tool for screening retinal diseases and monitoring disease progression. Compared with an in-person examination by an ophthalmologist, retinal photography offers high sensitivity, specificity, and inter-/intra-examination agreement, so retinal photographs can replace ophthalmoscopy in many clinical situations. Advances in optical fundus imaging have made it easier to obtain high-quality retinal images even without pupil dilation. Fundus cameras have several advantages: they are convenient for patients, since only a single flash exposure from a floodlight is required, and their image quality degrades less under difficult conditions, for example in cataract cases. Overall, digital retinal photography can facilitate telemedicine consultations, which broadens access to accurate and timely subspecialty care, especially in medically underserved areas. It is therefore worthwhile to use computer technology to assist fundus diagnosis and to perform fundus health screening.

Deep learning (DL) has become a mainstream technology in computer vision and plays an important role in developing new medical image processing algorithms to support disease detection and diagnosis. Deep learning approaches to fundus image classification can be broadly divided into two types: analysis based on global information and analysis based on local lesions. Among the former, several studies that extract features from the entire fundus image and then classify it have produced strong results. For example, researchers applied deep learning with the Inception-v3 architecture to the diagnosis of diabetic retinopathy and validated it on the EyePACS-1 and Messidor datasets, reaching an AUC of 0.991 on EyePACS-1. Muhammad et al. designed a deep learning method that combines an AlexNet network with a random forest for glaucoma diagnosis and obtained model accuracies ranging from 63.7% to 93.1%.

Besides global analysis, local lesion analysis is also common. Its main workflow is to first detect lesion regions in the fundus image with a localization algorithm, then classify according to information such as the type or number of lesions, and finally obtain the classification result. In 2015, Haloi et al. proposed a fundus microaneurysm detection method that uses a convolutional neural network to determine the class of each pixel, for detecting microaneurysms in early diabetic retinopathy. In 2017, Bogunovic et al. used OCT images to detect age-related macular lesions, obtaining the lesion regions with a localization algorithm and then identifying and classifying them with a convolutional neural network.

Other studies explore combining different deep learning methods to improve diagnostic accuracy. For example, Felix et al. designed a model that combines neural networks and random forests to grade age-related macular degeneration into 12 levels. The model is trained with several networks, including AlexNet, VGG and GoogLeNet, and a random forest makes the diagnosis from the combined outputs of these models. The algorithm ultimately detected definite early or late age-related macular degeneration in 84.2% of the images.

Chinese patent CN202110274127.2 discloses a deep-learning-based fundus image recognition and classification method that addresses the trade-off between accuracy and speed in existing deep learning methods for fundus image recognition, their low efficiency, and the strong impact that scarce original data has on training accuracy. The method includes: step 1, after data augmentation the fundus image dataset is divided into an initial training set a and a validation set c, and training set a is expanded into a training set b1 with relatively small property changes and a training set b2 with relatively large property changes; step 2, the three mapped training sets a, b1 and b2 are fed into different neural networks, the convolutional base trained on b1 is reused for b2, and classification labels are generated by a densely connected classifier; step 3, weights are assigned according to the magnitude of the image property changes and verified on validation set c, and an optimization algorithm is used to fuse them into the final result.

Purpose of the Invention

The purpose of the present invention is to address the problems of low model accuracy and poor generalization in the prior art, caused by the high cost of data collection and annotation and the limited number of images, by providing a fundus lesion image classification method that uses deep learning to classify the ODIR-2019 dataset.

The fundus lesion image classification method of the present invention comprises the following steps:

1) Obtain an existing fundus image dataset and perform data conversion, image preprocessing and data augmentation to obtain a preprocessed dataset.

In step 1), the dataset is the open-source training set of the 2019 Peking University International Competition on Ocular Disease Intelligent Recognition (ODIR-2019). The original data come from patients undergoing eye health examinations at cooperating hospitals and medical institutions. The training set comprises 3500 records, each containing information for the left eye and the right eye. This dataset, hereinafter referred to as the ODIR dataset, consists of two parts: label information and fundus retinal images.

The data conversion compiles statistics on the "Left-Diagnostic Keywords" and "Right-Diagnostic Keywords" columns of the ODIR label table to obtain the set of all keywords, and then extracts and generates the corresponding disease category labels for the left eye and the right eye separately, i.e., single-eye labels.

The image preprocessing addresses the inconsistent image sizes and uneven brightness in the ODIR dataset: each image is first resampled to 512×512 pixels and then passed through a series of preprocessing steps to improve image quality and enhance image features. The specific preprocessing steps are:

(1) Standardization: the range of pixel values in fundus images may vary with the acquisition device or parameter settings, which affects subsequent image analysis and processing. Standardization scales the pixel values of each image to a uniform range so that pixel values are comparable across images, which benefits later data analysis and machine learning.

(2) CLAHE histogram equalization: fundus images may contain many low-contrast regions and regions with blurred detail. CLAHE enhances the contrast and detail of the image, making it clearer and easier to analyze and to extract features from.

(3) Gamma correction: the brightness and contrast of fundus images may be affected by the shooting environment, the acquisition equipment and other factors, which also affects visualization and feature extraction. Gamma correction adjusts brightness and contrast so that the image is more vivid and clear, facilitating image analysis and feature extraction.

(4) Gaussian smoothing: noise in fundus images can interfere with subsequent analysis and processing. Gaussian smoothing removes some of this noise and makes the image smoother, which again facilitates image analysis and feature extraction:

Iweight = Iγ × α + Iblur × β + γ

Iblur = Iγ × kernelh×w

The data augmentation expands the training set, prevents the network from overfitting and improves the generalization ability of the network model. The following augmentation operations are applied to the data at random:

(1) Random rotation: the image is rotated by a random angle to produce different image transformations and increase data diversity, helping the model learn features of the image at different angles.

(2) Random resized crop: a sub-image is randomly cropped from the original image and scaled to a specified size.

(3) Random horizontal/vertical flip: the image is randomly flipped horizontally or vertically and used as a new training sample.

2) Extract image features and fuse them:

(1) Red co-occurrence matrix: the statistical gray-level co-occurrence matrix (GLCM) method was proposed by R. Haralick et al. in the early 1970s. It is a general texture analysis method based on the assumption that the spatial distribution of pixels in an image carries the image's texture information. The gray-level co-occurrence matrix is defined by the probability that, starting from a pixel with gray level i, a point at a fixed offset has gray level j; all of these estimated probabilities can be arranged as a matrix, hence the name gray-level co-occurrence matrix. Since fundus images are predominantly red, more image information can be obtained from a red co-occurrence matrix. The image is read and converted to RGB format, and the value of the red channel R is used as the relevant characteristic. Take any point (x, y) in the N×N image and another point (x+a, y+b) offset from it, and let the R values of this point pair be (r1, r2). The value of each element of the co-occurrence matrix can be understood as the probability that the value pair at points (x, y) and (x+a, y+b) equals (i, j). Over the whole image, the number of occurrences of each (r1, r2) value is counted and arranged into a square matrix, and the counts are normalized by the total number of occurrences of (r1, r2) into occurrence probabilities P(r1, r2). Red co-occurrence matrices computed along four directions (horizontal, vertical, diagonal, anti-diagonal) yield the texture feature statistics R of the fundus image:

R = (r1, r2, …, rm)

(2) Bag of visual words: the bag-of-words (BOW) model is a document representation commonly used in information retrieval. It assumes that a document can be regarded merely as a collection of words, ignoring word order, grammar and syntax; each word occurs independently of the others, i.e., any word at any position is chosen independently of the document's semantics. The bag of visual words instead uses keypoints in an image, the parts that carry rich local information about the image. Keypoints are grouped into a large number of clusters, with keypoints having similar descriptors assigned to the same cluster. Treating each cluster as a "visual word" that represents the specific local pattern shared by its keypoints gives a visual-word vocabulary describing the various local image patterns; by mapping the image's keypoints to visual words, the image can be represented as a bag of visual words. The scale-invariant feature transform (SIFT) is used for feature extraction, producing visual vocabulary vectors from images of different categories; these vectors represent locally invariant feature points of the images. All feature point vectors are pooled, and the K-Means algorithm merges visual words with similar meanings to construct a vocabulary of K words. In the keypoint feature dataset X = {x1, x2, …, xi, …, xN}, the cluster centers {c1, c2, …, cj, …, ck} of the k clusters are found so that the Euclidean distance from each sample vector in a cluster to its cluster center is minimized.

The SIFT keypoint features of an image are compared with the k cluster centers (the visual words) by distance, and each SIFT keypoint feature is assigned to the visual word with the smallest distance. The resulting set is the visual dictionary:

D = (d1, d2, …, dk)

Then the number of times each word of the vocabulary appears in the image is counted, so that the image is represented as a K-dimensional numerical vector, i.e., the bag of visual words of that image:

H = (h1, h2, …, hk)

(3) Intermediate feature representation of the diffusion model: the diffusion model first defines a forward noising process that iteratively adds Gaussian noise to an image x0 sampled from the data distribution q(x0), producing a fully noised image xT after T steps. This forward process is a Markov chain with values x1, x2, …, xt, …, xT-1, xT, representing images with increasing levels of noise.

Here βt sets the variance and N denotes the normal distribution. With

αt := 1 − βt

the noisy image xt at diffusion step t can be sampled directly from the real image x0.

The reverse diffusion process aims to invert the forward process and sample from the posterior distribution q(xt−1|xt), which depends on the entire data distribution. Doing this iteratively denoises a fully noised image xt so that samples from the data distribution q(x0) can be drawn. This is usually approximated with a neural network εθ.

When p and q are treated as a VAE, a simplified version of the variational lower-bound objective is simply a mean-squared-error loss. It can be used to train εθ, which learns to approximate the Gaussian noise ε added to the real image x0.

The implementation uses guided diffusion (GD) with a U-Net-style architecture containing residual blocks. Each residual block, residual-plus-attention block, and downsampling or upsampling residual block is treated as a separate block and numbered b ∈ {1, 2, …, 37} for the pretrained unconditional guided diffusion model. Parameterized by the diffusion step t and the block number b, the noisy image xt is obtained and the activations after block b are used as the feature vector f(x0, t, θ):

f(x0, t, θ) = (f1, f2, …, fn)

The three features are linearly concatenated to form the fused feature, each feature vector having equal weight in the fusion:

T = (r1, r2, …, rm, h1, h2, …, hk, f1, f2, …, fn)

3) Construct a HiFuse network containing an attention mechanism and a feature fusion mechanism; extract features from the preprocessed dataset with the VGG-16 network and the HiFuse model respectively, fuse them with the features from step 2), and classify with a softmax layer to obtain the fundus lesion image classification result.

(1) VGG-16 with the fully connected layers removed is used as the backbone, and the intermediate feature maps of the VGG-16 network (pool-3 and pool-4) are used to infer attention maps. When computing the attention maps, the output of pool-5 serves as a "global guidance" (denoted G), since the features of the last stage contain the most compressed and abstract information of the whole image. The intermediate-layer features are denoted F = (f1, f2, …, fn), where fi is the output of the i-th block.

F and G are fed together into the attention module, which computes a one-channel output R. Convolutions are used throughout: Wf and Wg each consist of 256 convolution kernels, the convolution kernel W produces a single-channel output, and up(·) is a bilinear interpolation function used to unify the spatial size of the outputs.

The final attention map A is obtained by normalizing R:

A = Sigmoid(R)

Each element ai ∈ A indicates how much attention the corresponding spatial feature vector receives, and the attention-weighted feature vector is obtained by multiplying A and F. Finally, the outputs of the 3rd and 4th blocks of VGG-16 and G are concatenated as the extracted feature v:

V = (v1, v2, …, vs)

(2) The HiFuse model is used to obtain local spatial information and global semantic representations at different scales. HiFuse uses a parallel structure to extract the global and local information of medical images from global and local feature blocks, and fuses features of different levels through an "H"-shaped structure. Features of different levels are fused by HFF blocks and then downsampled to obtain the extracted features Fi.

The features from step 2), from the VGG-16 network and from the HiFuse model are linearly concatenated:

Features = (t1, t2, …, tm, v1, v2, …, vs, Fi1, Fi2, …, Fik)

The Features vector is fed into the softmax classification layer to obtain the fundus lesion image classification result.

Compared with the prior art, the present invention has the following outstanding technical effects:

(1) In the fundus image preprocessing stage, the present invention explores an image preprocessing method that extracts fundus image features relatively clearly, providing important assistance for subsequent model training.

(2) The present invention implements a "transfer learning network model", an "attention mechanism network model" and an "attention mechanism + feature fusion mechanism network model"; combining the attention mechanism with the feature fusion mechanism yields more accurate fundus image classification results.

(3) The network model of the present invention can be deployed to mobile applications using the Flutter framework, so that the invention can be applied in practical settings and improve the efficiency of fundus disease diagnosis.

Brief Description of the Drawings

Figure 1 shows the image preprocessing results at each stage.

Figure 2 shows the image preprocessing and data augmentation workflow.

Figure 3 shows the DenseNet network structure.

Figure 4 shows the EfficientNet-B3 network structure.

Figure 5 shows the overall structure of the VGG-16 network with the attention mechanism.

Figure 6 shows the structure of the attention module in the VGG-16 network with the attention mechanism.

Figure 7 shows the attention visualization results (rows, from top to bottom: original image, attention module 1, attention module 2).

Figure 8 shows the overall HiFuse network structure.

Figure 9 shows the detailed structure of the hierarchical feature fusion block.

Figure 10 shows the classification flow chart.

Figure 11 shows the accuracy and loss curves of EfficientNet, DenseNet and VGG16-Attention.

Figure 12 shows the accuracy and loss curves of the HiFuse network.

Detailed Description

The following embodiments further illustrate the present invention in conjunction with the accompanying drawings.

The fundus lesion image classification method of this embodiment of the present invention comprises the following steps:

1) Obtain an existing fundus image dataset and perform data conversion, image preprocessing and data augmentation to obtain a preprocessed dataset.

In step 1), the dataset is the open-source training set of the 2019 Peking University International Competition on Ocular Disease Intelligent Recognition (ODIR-2019). The original data come from patients undergoing eye health examinations at cooperating hospitals and medical institutions. The training set comprises 3500 records, each containing information for the left eye and the right eye. This dataset, hereinafter referred to as the ODIR dataset, consists of two parts: label information and fundus retinal images.

The data conversion compiles statistics on the "Left-Diagnostic Keywords" and "Right-Diagnostic Keywords" columns of the ODIR label table to obtain the set of all keywords, and then extracts and generates the corresponding disease category labels for the left eye and the right eye separately, i.e., single-eye labels.

The image preprocessing addresses the inconsistent image sizes and uneven brightness in the ODIR dataset: each image is first resampled to 512×512 pixels and then passed through a series of preprocessing steps to improve image quality and enhance image features. The specific preprocessing steps are:

(1) Standardization: the range of pixel values in fundus images may vary with the acquisition device or parameter settings, which affects subsequent image analysis and processing. Standardization scales the pixel values of each image to a uniform range so that pixel values are comparable across images, which benefits later data analysis and machine learning.

(2) CLAHE histogram equalization: fundus images may contain many low-contrast regions and regions with blurred detail. CLAHE enhances the contrast and detail of the image, making it clearer and easier to analyze and to extract features from.

(3) Gamma correction: the brightness and contrast of fundus images may be affected by the shooting environment, the acquisition equipment and other factors, which also affects visualization and feature extraction. Gamma correction adjusts brightness and contrast so that the image is more vivid and clear, facilitating image analysis and feature extraction. In this embodiment γ is set to 1.2.

(4) Gaussian smoothing: noise in fundus images can interfere with subsequent analysis and processing; Gaussian smoothing removes some of this noise and makes the image smoother, which again facilitates image analysis and feature extraction. In this embodiment the standard deviation sigma of the Gaussian function is 10; h×w is the size of the convolution kernel and is set to 0, which makes the Gaussian function compute the kernel size from sigma; and α, β, γ are set to 4, −4 and 128, respectively:

Iweight = Iγ × α + Iblur × β + γ

Iblur = Iγ × kernelh×w
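
The preprocessing chain above (standardization, CLAHE, gamma correction and the Gaussian-blur weighted blend with γ = 1.2, sigma = 10, α = 4, β = −4 and offset 128) can be sketched with OpenCV roughly as follows. This is a minimal illustration of the described steps, not the patented implementation; the min–max reading of "standardization" and the CLAHE clip limit and tile size are assumptions.

```python
import cv2
import numpy as np

def preprocess_fundus(path, size=512, gamma=1.2, alpha=4, beta=-4, offset=128, sigma=10):
    """Resize, standardize, CLAHE-equalize, gamma-correct and Gaussian-blend a fundus image."""
    img = cv2.imread(path)                          # BGR uint8
    img = cv2.resize(img, (size, size))

    # (1) standardization: scale pixel values to a uniform [0, 255] range (assumed min-max)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)

    # (2) CLAHE on the luminance channel to enhance local contrast (clip/tile assumed)
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    lab[..., 0] = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(lab[..., 0])
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

    # (3) gamma correction: I_gamma = (I / 255) ** gamma * 255
    i_gamma = (np.power(img / 255.0, gamma) * 255).astype(np.uint8)

    # (4) Gaussian smoothing and weighted blend: I_weight = I_gamma*alpha + I_blur*beta + offset
    i_blur = cv2.GaussianBlur(i_gamma, (0, 0), sigmaX=sigma)   # kernel size derived from sigma
    i_weight = cv2.addWeighted(i_gamma, alpha, i_blur, beta, offset)
    return i_weight
```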

The effect of preprocessing the images at each stage, in the order of the operations, is shown in Figure 1.

The data augmentation expands the training set, prevents the network from overfitting and improves the generalization ability of the network model. The following augmentation operations are applied to the data at random:

(1) Random rotation: the image is rotated by a random angle to produce different image transformations and increase data diversity, helping the model learn features of the image at different angles. This embodiment uses rotations of −10° to 10° and of ±90°.

(2) Random resized crop: a sub-image is randomly cropped from the original image and scaled to a specified size. In this embodiment, the image is randomly cropped with probability 0.5 and scaled to 224×224.

(3) Random horizontal/vertical flip: the image is randomly flipped horizontally or vertically and used as a new training sample. In this embodiment, the probabilities of random rotation, random resized crop and random flip are all set to 0.5.
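
A sketch of how such an augmentation pipeline could be expressed with torchvision transforms, using the stated probabilities, rotation angles and 224×224 output size; the exact composition and ordering here are assumptions.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline: small random rotations (-10..10 degrees),
# an optional +/-90 degree rotation, random resized crop to 224x224 and random
# horizontal/vertical flips, each applied with probability 0.5.
train_transform = T.Compose([
    T.RandomApply([T.RandomRotation(degrees=10)], p=0.5),
    T.RandomApply([T.RandomChoice([T.RandomRotation((90, 90)),
                                   T.RandomRotation((-90, -90))])], p=0.5),
    T.RandomApply([T.RandomResizedCrop(224)], p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.Resize((224, 224)),   # ensure a fixed size when the crop is skipped
    T.ToTensor(),
])
```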

As shown in Figure 2, the overall image processing pipeline consists of the following steps: first, the original data are uniformly cropped to 512×512 pixels; then the image preprocessing operations are performed; and finally the images are augmented to obtain the preprocessed dataset.

2) Build transfer learning models: based on the two pretrained models DenseNet and EfficientNet-B3, transfer learning is applied. Two linear layers are added on top of the original DenseNet model structure, with a dropout regularization strategy and the ReLU activation function, and the EfficientNet-B3 model parameters are fine-tuned.

In step 2), DenseNet-121 was proposed by Gao Huang et al. in 2016 and is mainly used in computer vision tasks such as image classification and object detection. Compared with traditional convolutional neural network models, DenseNet-121 uses dense connections, i.e., the output of each layer is connected to the inputs of all subsequent layers, which makes the model easier to train and effectively alleviates the vanishing-gradient problem. In addition, DenseNet-121 uses techniques such as batch normalization and pre-activation to further improve performance and training efficiency.

The network structure of DenseNet-121 is relatively simple: it has 121 layers in total (including convolutional layers, batch normalization layers, activation functions and pooling layers), organized into 4 dense blocks and 3 transition layers. Each dense block contains several convolutional layers, batch normalization layers and ReLU activations, and each transition layer contains a convolutional layer, a batch normalization layer and an average pooling layer to reduce the size of the feature maps. The network structure is shown in Figure 3. In this embodiment, two linear layers are added on top of the original model structure, with a dropout regularization strategy and the ReLU activation function.
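
A sketch of the described DenseNet-121 modification, replacing the original classifier with two linear layers plus ReLU and dropout via torchvision; the hidden width (512), dropout rate (0.5) and the class count of 8 are assumed values not stated in the text.

```python
import torch.nn as nn
from torchvision import models

# DenseNet-121 backbone with the classifier replaced by two linear layers,
# ReLU and dropout, as described; 512 hidden units and p=0.5 are assumptions.
num_classes = 8  # assumed number of single-eye disease categories
densenet = models.densenet121(pretrained=True)
in_features = densenet.classifier.in_features     # 1024 for DenseNet-121
densenet.classifier = nn.Sequential(
    nn.Linear(in_features, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(512, num_classes),
)
```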

EfficientNet-B3 is an efficient convolutional neural network proposed by the Google team in 2019 and trained and tested on the ImageNet dataset. It belongs to the EfficientNet family, whose models are named according to network size and depth. EfficientNet-B3 is the third model in the series; compared with EfficientNet-B0 it is deeper and wider, with better performance and higher accuracy.

The architecture of the EfficientNet-B3 model is shown in Figure 4. EfficientNet-B3 is based on the structures of MobileNetV2 and SE-Net and introduces several innovative designs, such as compound scaling, depthwise convolution, and squeeze-and-excitation modules, to improve the efficiency and accuracy of the model. Compound scaling scales the depth, width and resolution of the network simultaneously for better performance and efficiency. Depthwise convolution is a lightweight convolution operation that reduces the number of parameters and the computation of the model. The squeeze-and-excitation module adaptively reweights each channel to improve the expressive power of the model. In this embodiment, the number of output channels of the model's final layer is modified.
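
Correspondingly, changing only the output dimension of EfficientNet-B3's final layer could look like the following torchvision sketch; the class count of 8 is again an assumption.

```python
import torch.nn as nn
from torchvision import models

# EfficientNet-B3 fine-tuning sketch: keep the backbone, resize only the last
# classification layer to the (assumed) number of fundus classes.
effnet = models.efficientnet_b3(pretrained=True)
effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, 8)
```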

3) Extract image features and fuse them

(1) Red co-occurrence matrix. The image is read and converted to RGB format, and the value of the red channel R is used as the relevant characteristic. Take any point (x, y) in the N×N image and another point (x+a, y+b) offset from it, and let the R values of this point pair be (r1, r2). The value of each element of the co-occurrence matrix can be understood as the probability that the value pair at points (x, y) and (x+a, y+b) equals (i, j). Over the whole image, the number of occurrences of each (r1, r2) value is counted and arranged into a square matrix, and the counts are normalized by the total number of occurrences of (r1, r2) into occurrence probabilities P(r1, r2). Red co-occurrence matrices computed along four directions (horizontal, vertical, diagonal, anti-diagonal) yield the texture feature statistics R of the fundus image:

R = (r1, r2, …, rm)
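
A sketch of computing such red-channel co-occurrence statistics with scikit-image; the offset distance, quantization level and the particular Haralick-style statistics collected into R are assumptions, since the text does not list them.

```python
import numpy as np
from PIL import Image
from skimage.feature import graycomatrix, graycoprops

def red_cooccurrence_features(path, distance=1, levels=256):
    """Texture statistics R from the red-channel co-occurrence matrix."""
    red = np.array(Image.open(path).convert("RGB"))[..., 0]      # red channel R
    # Four directions: horizontal, vertical, diagonal, anti-diagonal
    angles = [0, np.pi / 2, np.pi / 4, 3 * np.pi / 4]
    glcm = graycomatrix(red, distances=[distance], angles=angles,
                        levels=levels, symmetric=True, normed=True)
    # Haralick-style statistics per direction (choice of properties is assumed)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])
```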

(2) Bag of visual words. The scale-invariant feature transform (SIFT) is used for feature extraction, producing visual vocabulary vectors from images of different categories; these vectors represent locally invariant feature points of the images. All feature point vectors are pooled, and the K-Means algorithm merges visual words with similar meanings to construct a vocabulary of K words. In the keypoint feature dataset X = {x1, x2, …, xi, …, xN}, the cluster centers {c1, c2, …, cj, …, ck} of the k clusters are found so that the Euclidean distance from each sample vector in a cluster to its cluster center is minimized.

The SIFT keypoint features of an image are compared with the k cluster centers (the visual words) by distance, and each SIFT keypoint feature is assigned to the visual word with the smallest distance. The resulting set is the visual dictionary:

D = (d1, d2, …, dk)

Then the number of times each word of the vocabulary appears in the image is counted, so that the image is represented as a K-dimensional numerical vector, i.e., the bag of visual words of that image:

H = (h1, h2, …, hk)
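
A sketch of the SIFT + K-Means bag-of-visual-words pipeline using OpenCV and scikit-learn; the vocabulary size k = 100 and the normalization of the histogram are assumed choices.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, k=100):
    """Cluster SIFT descriptors from training images into k visual words."""
    sift = cv2.SIFT_create()
    descs = []
    for p in image_paths:
        gray = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(np.vstack(descs))
    return kmeans                      # cluster centers = visual dictionary D

def bow_histogram(path, kmeans):
    """Represent one image as a k-dimensional bag-of-visual-words vector H."""
    sift = cv2.SIFT_create()
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, d = sift.detectAndCompute(gray, None)
    hist = np.zeros(kmeans.n_clusters, dtype=np.float32)
    if d is not None:
        words = kmeans.predict(d)      # assign each keypoint to its nearest word
        for w in words:
            hist[w] += 1
    return hist / max(hist.sum(), 1)   # normalized word counts (normalization assumed)
```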

(3) Intermediate feature representation of the diffusion model. The diffusion model first defines a forward noising process that iteratively adds Gaussian noise to an image x0 sampled from the data distribution q(x0), producing a fully noised image xT after T steps. This forward process is a Markov chain with values x1, x2, …, xt, …, xT-1, xT, representing images with increasing levels of noise.

Here βt sets the variance and N denotes the normal distribution. With

αt := 1 − βt

the noisy image xt at diffusion step t can be sampled directly from the real image x0.

The reverse diffusion process aims to invert the forward process and sample from the posterior distribution q(xt−1|xt), which depends on the entire data distribution. Doing this iteratively denoises a fully noised image xt so that samples from the data distribution q(x0) can be drawn. This is usually approximated with a neural network εθ.

When p and q are treated as a VAE, a simplified version of the variational lower-bound objective is simply a mean-squared-error loss. It can be used to train εθ, which learns to approximate the Gaussian noise ε added to the real image x0.

The implementation uses guided diffusion (GD) with a U-Net-style architecture containing residual blocks. Each residual block, residual-plus-attention block, and downsampling or upsampling residual block is treated as a separate block and numbered b ∈ {1, 2, …, 37} for the pretrained unconditional guided diffusion model. Parameterized by the diffusion step t and the block number b, the noisy image xt is obtained and the activations after block b are used as the feature vector f(x0, t, θ):

f(x0, t, θ) = (f1, f2, …, fn)
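
One way to realize f(x0, t, θ) is to add the DDPM forward noise at step t and read out the activation of block b with a forward hook, as sketched below. This is only an illustration under assumptions: `unet` is a placeholder for a pretrained guided-diffusion U-Net whose blocks are assumed reachable as `unet.blocks`, and `alphas_cumprod` is assumed to hold the cumulative products of the noise schedule.

```python
import torch

def diffusion_features(unet, x0, t, block_idx, alphas_cumprod):
    """Noise x0 to step t, run the denoising U-Net once and return the
    activation of the chosen block as a feature vector f(x0, t, theta)."""
    # forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

    feats = {}
    handle = unet.blocks[block_idx].register_forward_hook(   # block numbering b is assumed
        lambda m, inp, out: feats.update(value=out.detach())
    )
    with torch.no_grad():
        unet(x_t, torch.tensor([t], device=x0.device))       # one denoising pass
    handle.remove()
    return feats["value"].flatten(1)                          # (batch, n) feature vector
```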

The three features are linearly concatenated to form the fused feature, each feature vector having equal weight in the fusion:

T = (r1, r2, …, rm, h1, h2, …, hk, f1, f2, …, fn)

4) Build the attention model:

VGG-16 model with attention mechanism:

The human visual system tends to focus on the objects in the field of view that are relevant to the task at hand. For example, when diagnosing fundus diseases, an ophthalmologist pays more attention to the retina, blood vessels, macular region, lens and similar structures rather than to irrelevant areas. To mimic this pattern of visual exploration, the model introduces an attention module into VGG-16 to estimate a spatial (pixel-wise) attention map.

The overall structure of the VGG-16 network with the attention mechanism is shown in Figure 5. It uses VGG-16 with the fully connected layers removed as the backbone and uses the intermediate feature maps of the VGG-16 network (pool-3 and pool-4) to infer attention maps. When computing the attention maps, the output of pool-5 serves as a "global guidance" (denoted G), since the features of the last stage contain the most compressed and abstract information of the whole image. The intermediate-layer features are denoted F = (f1, f2, …, fn), where fi is the output of the i-th block.

As shown in Figure 6, F and G are fed together into the attention module, which computes a one-channel output R. Convolutions are used throughout: Wf and Wg each consist of 256 convolution kernels, the convolution kernel W produces a single-channel output, and up(·) is a bilinear interpolation function used to unify the spatial size of the outputs.

The final attention map A is obtained by normalizing R:

A = Sigmoid(R)

Each element ai ∈ A indicates how much attention the corresponding spatial feature vector receives, and the attention-weighted feature vector is obtained by multiplying A and F. Finally, the outputs of the 3rd and 4th blocks of VGG-16 and G are concatenated as the extracted feature V:

V = (v1, v2, …, vs)
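
The attention computation described here (project F and G with 256-kernel convolutions, upsample the global branch, reduce to one channel, apply a sigmoid and reweight F) can be sketched in PyTorch as below; the 1×1 kernel sizes and the ReLU between the two stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_

class SpatialAttention(nn.Module):
    """Attention block over an intermediate VGG feature map F guided by the
    global feature G (pool-5 output); channel counts are assumptions."""
    def __init__(self, f_channels, g_channels, hidden=256):
        super().__init__()
        self.w_f = nn.Conv2d(f_channels, hidden, kernel_size=1)  # Wf: 256 kernels
        self.w_g = nn.Conv2d(g_channels, hidden, kernel_size=1)  # Wg: 256 kernels
        self.w = nn.Conv2d(hidden, 1, kernel_size=1)             # W: one-channel output R

    def forward(self, f, g):
        g_up = F_.interpolate(self.w_g(g), size=f.shape[-2:],
                              mode="bilinear", align_corners=False)  # up(.)
        r = self.w(torch.relu(self.w_f(f) + g_up))                   # R
        a = torch.sigmoid(r)                                         # A = Sigmoid(R)
        f_hat = a * f                         # attention-weighted features
        return f_hat.flatten(1), a            # flattened feature and attention map
```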

Figure 7 shows the attention maps generated by the attention modules for the original images.

Hierarchical multi-scale feature fusion network model:

The HiFuse model was proposed as a new medical image classification method that effectively obtains local spatial information and global semantic representations at different scales. HiFuse uses a parallel structure to extract the global and local information of medical images from global and local feature blocks, and fuses features of different levels through an "H"-shaped structure. Features of different levels are fused by HFF blocks and then downsampled, and the classification result is finally obtained. The HiFuse model has the following characteristics:

It combines the advantages of CNNs and Transformers in a parallel framework of local and global feature blocks, which effectively capture local spatial context features and the global semantic representations of features at different scales, respectively. Moreover, HiFuse achieves good results without building a very deep network, effectively avoiding the problems of vanishing gradients and loss of feature information.

It contains an adaptive hierarchical feature fusion block (HFF block), consisting of spatial attention, channel attention, a residual inverted MLP and shortcut connections, to adaptively fuse the semantic information between the features of different scales in each branch.

(1) Overall structure of the HiFuse network

The overall network structure of HiFuse is shown in Figure 8. The local branch extracts the local features of the image, while the global branch extracts its global semantic representation. Both branches consist of 4 stages that extract features at different scales. The stem of the local branch is a 4×4 convolution with stride 4 followed by layer normalization (LayerNorm). The stem of the global branch splits the image with a patch-partition module: each 4×4 group of adjacent pixels forms a patch, which is flattened along the channel dimension and passed through a linear embedding layer that makes the output twice the number of input channels, and a global feature block is applied for feature transformation.

The three-branch parallel structure means that local features and global representations can be preserved to the greatest extent without interfering with each other. Feature maps of different levels are built over four stages. An HFF block fuses the local features and global representations of each stage and connects the output of the previous stage. The local features of each level, passed through a spatial attention mechanism, are combined with the global features of each level, passed through channel attention.

Finally, the merged features are fed into a linear classifier with global average pooling and layer normalization (LayerNorm) for classification. Depending on the number of blocks per stage, HiFuse has different variants:

·HiFuse-Tiny: Block numbers = (2, 2, 2, 2)

·HiFuse-Small: Block numbers = (2, 2, 6, 2)

·HiFuse-Base: Block numbers = (2, 2, 18, 2)

(2) Global feature blocks in the HiFuse network

A windowed multi-head self-attention mechanism (Windows Multi-head Self-Attention, W-MSA) is introduced in the global feature extraction branch to capture global semantic information. At each stage, patches are merged into the global feature block; the feature map passes through layer normalization into W-MSA and then through a linear layer with the GELU activation function, as shown in Figure 8. There is a residual connection after each module, a relative position bias is used, and the result is passed to the shifted-window multi-head attention mechanism (SW-MSA) of the next module. The computation is as follows:

gi = f1×1(W-MSA(LN(Gi-1))) + Gi-1

Gi = f1×1(SW-MSA(gi)) + gi

where gi is the output of W-MSA, Gi is the output of SW-MSA, f1×1 is a convolution operation with kernel size 1×1, and LN(·) denotes layer normalization.

(3) Local feature blocks in the HiFuse network

A 3×3 depthwise convolution is used in the local feature block, which effectively reduces the computational complexity of the network. Cross-channel information is then exchanged through a linear layer and, borrowing the LN and GELU activation functions from the Transformer, good performance is obtained in different application scenarios. Finally, the extracted local features are fed into the HFF block. The process is described as follows:

Li = f1×1(LN(fd3×3(Li-1))) + Li-1

where Li denotes the output of the local feature block and fd3×3 denotes a depthwise separable convolution with a 3×3 kernel.
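
A sketch of this local feature block in PyTorch, following Li = f1×1(LN(fd3×3(Li−1))) + Li−1; the use of GroupNorm(1, C) as a channel-wise LayerNorm substitute and the placement of the GELU are assumptions.

```python
import torch.nn as nn

class LocalFeatureBlock(nn.Module):
    """Sketch of the local branch block: 3x3 depthwise convolution, layer
    normalization, 1x1 pointwise convolution and a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)          # f_d3x3
        self.norm = nn.GroupNorm(1, channels)                        # LayerNorm over channels
        self.pwconv = nn.Conv2d(channels, channels, kernel_size=1)   # f_1x1
        self.act = nn.GELU()                                         # GELU placement assumed

    def forward(self, x):
        return self.pwconv(self.act(self.norm(self.dwconv(x)))) + x  # residual connection
```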

(4) Hierarchical feature fusion block (HFF block) in the HiFuse network

Since the self-attention mechanism in the global feature block can, to some extent, capture global information in space and time, the HFF block feeds the incoming global features into a channel attention (CA) mechanism, which exploits the interdependencies between channel maps to improve the feature representation of specific semantics, as shown in Figure 9.

The local features are fed into a spatial attention (SA) mechanism to strengthen local details and suppress irrelevant regions. Finally, the outputs of each attention branch and of the feature fusion block of the previous layer are fused and passed through a residual inverted MLP (IRMLP). To a certain extent this prevents vanishing or exploding gradients and network degradation, so that the global and local feature information of each level is captured effectively. The processing is as follows:

CA(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x)))

SA(x) = σ(f7×7(Concat[AvgPool(x), MaxPool(x)]))

IRMLP(x) = f1×1(f1×1(f3×3(LN(x) + LN(x))))

where CA(·) denotes the output of the channel attention mechanism, SA(·) denotes the output of the spatial attention mechanism, and σ denotes the sigmoid function. The feature fusion operation uses the element-wise product: the channel-attention result on the global features, the spatial-attention result on the local features and the downsampled output of the previous stage's HFF block are combined into the fusion of the global-local features with the previous stage, and the residual inverted MLP finally connects them to generate the feature Fi.
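
The CA and SA formulas above follow the familiar channel/spatial attention pattern; a sketch of both is given below, with the attention map applied multiplicatively to the input and an assumed channel-reduction ratio of 16.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA(x) = sigmoid(MLP(AvgPool(x)) + MLP(MaxPool(x))); reduction ratio assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))          # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))           # global max pooling
        return torch.sigmoid(avg + mx)[..., None, None] * x

class SpatialAttentionHFF(nn.Module):
    """SA(x) = sigmoid(f_7x7(concat[AvgPool(x), MaxPool(x)])) along the channel axis."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled)) * x
```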

The features from step 3), from the VGG-16 network and from the HiFuse model are linearly concatenated:

Features = (t1, t2, …, tm, v1, v2, …, vs, Fi1, Fi2, …, Fik)

The Features vector is fed into the softmax classification layer to obtain the fundus lesion image classification result. The classification workflow is shown in Figure 10.
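
A minimal sketch of the final fusion and classification step, assuming the three feature groups arrive as flat vectors of known sizes; the class count of 8 is an assumption.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate the handcrafted/diffusion features T, the VGG-16 attention
    features V and the HiFuse features F_i, then apply a softmax classifier."""
    def __init__(self, dim_t, dim_v, dim_f, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(dim_t + dim_v + dim_f, num_classes)

    def forward(self, t, v, f):
        features = torch.cat([t, v, f], dim=1)      # equal-weight linear concatenation
        return torch.softmax(self.fc(features), dim=1)
```

In training, the softmax probabilities produced here would feed the Focal Loss described in the experiments section below.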

以下对本发明的技术效果进行实验验证:The following is an experimental verification of the technical effect of the present invention:

1、损失函数的选取1. Selection of loss function

由于ODIR数据集中存在样本数据不均衡的问题,所以在本实施例中采用FocalLoss作为损失函数。Focal Loss的公式为:Since there is an imbalance in sample data in the ODIR dataset, FocalLoss is used as the loss function in this embodiment. The formula of Focal Loss is:

FL(pt)=-αt(1-pt)γlog(pt)FL(pt )=-αt (1-pt )γ log(pt )

其中,αt是样本的类别权重,γ是聚焦参数,控制着难易分类样本的权重。pt是模型的预测概率,可以表示为:Among them, αt is the class weight of the sample, γ is the focusing parameter, which controls the weight of the difficult and easy classification samples.pt is the predicted probability of the model, which can be expressed as:

zt是模型对于样本属于类别t的logits值,K是总的类别数。αt是样本的类别权重,通常可以设置为样本数目比例的倒数,即:zt is the model's logits value for samples belonging to category t, and K is the total number of categories. αt is the category weight of the sample, which can usually be set to the inverse of the sample number ratio, that is:

yi是样本i的真实标签,N是总的样本数。在计算Focal Loss时,我们需要将每个样本的Focal Loss加权求和,即:yi is the true label of sample i, and N is the total number of samples. When calculating Focal Loss, we need to weight the Focal Loss of each sample, that is:

其中,pi,t是模型对于样本i属于类别t的预测概率。Where pi,t is the model’s predicted probability that sample i belongs to category t.

Regarding the focusing parameter γ: the larger γ is, the more strongly well-classified (easy) samples are down-weighted, so the loss is dominated by hard-to-classify samples and the model's attention to them is enhanced; the smaller γ is, the closer the loss is to the ordinary cross-entropy, and easy samples keep a relatively larger contribution. In this embodiment γ is set to 2, and αt is taken as the inverse of the proportion of class-t samples in the total number of samples.
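For reference, a minimal PyTorch implementation of this loss is sketched below, assuming single-label integer targets and logits as model output; the helper that derives αt from the class frequencies follows the inverse-proportion rule described above, and the batch-mean reduction is an assumption.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch.

    logits:  (N, K) raw model outputs
    targets: (N,)   integer class labels
    alpha:   (K,)   per-class weights
    """
    log_p = F.log_softmax(logits, dim=1)                       # log p_{i,t}
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-probability of the true class
    pt = log_pt.exp()
    return (-alpha[targets] * (1.0 - pt) ** gamma * log_pt).mean()


def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """alpha_t = N / (number of samples of class t), i.e. the inverse of each class's share."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    return labels.numel() / counts.clamp(min=1.0)
```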

2、学习率调整策略2. Learning rate adjustment strategy

在本实施例中,采用动态调整学习率的策略。具体而言,在训练前采用学习率线性预热的方式,在训练过程中采用余弦退火方式逐渐减小学习率。In this embodiment, a strategy of dynamically adjusting the learning rate is adopted. Specifically, a linear warm-up method of the learning rate is adopted before training, and a cosine annealing method is adopted during the training process to gradually reduce the learning rate.

Linear learning-rate warm-up gradually increases the learning rate over the first few training epochs, so that the model does not fall into a poor local optimum right at the start. Concretely, the learning rate is raised linearly from a small value to the value used for normal training, with the increase per step during the warm-up period set to a small constant. The cosine annealing strategy then adjusts the learning rate according to

ηt = ½ ηmax (1 + cos(Tcur π / Tmax))

where ηt is the learning rate at the t-th step, ηmax is the initial learning rate, Tcur is the current training epoch and Tmax is the total number of training epochs. The cosine term makes the learning rate decrease gradually during training, keeping it between ηmax and 0.

将线性预热和余弦退火相结合,既能够在训练初期逐渐增加学习率,又能够在训练过程中逐渐减小学习率,以获得更好的训练效果。Combining linear warm-up with cosine annealing can gradually increase the learning rate at the beginning of training and gradually decrease the learning rate during training to achieve better training results.
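A compact way to realize this combined schedule is a LambdaLR whose multiplier rises linearly during warm-up and then follows the cosine curve above; the warm-up length is an assumption, since the text does not specify it.

```python
import math
from torch.optim.lr_scheduler import LambdaLR


def warmup_cosine_scheduler(optimizer, warmup_epochs: int, total_epochs: int) -> LambdaLR:
    """Linear warm-up to the base learning rate, then cosine annealing towards zero."""

    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                   # linear ramp-up
        t_cur = epoch - warmup_epochs
        t_max = max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * t_cur / t_max))   # 0.5*(1+cos(pi*T_cur/T_max))

    return LambdaLR(optimizer, lr_lambda)
```

Calling scheduler.step() once per epoch makes the multiplier follow the epoch counter as in the formula.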

3、实现细节3. Implementation details

The experiments are implemented with the PyTorch deep-learning framework, version 1.13; PyTorch was chosen because it improves experimental efficiency and reliability and has good community and technical support. The Python version is 3.9.16. Training is performed on a cloud server with an NVIDIA RTX 3090 GPU with 24 GB of video memory. The base learning rate is 1e-4, the batch size is 32, and the number of training epochs is 500. When the different models are tested, the input image size is fixed at 224×224 and the hardware environment and hyper-parameters are kept identical.
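Putting the pieces together, a training-loop skeleton with the stated hyper-parameters might look as follows; the optimizer choice (AdamW), the warm-up length, the VGG-16 stand-in backbone and the random tensors that replace the preprocessed ODIR images are assumptions, and focal_loss, inverse_frequency_weights and warmup_cosine_scheduler are the helpers sketched above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import vgg16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = vgg16(num_classes=8).to(device)                     # stand-in backbone, 224x224 inputs

# dummy data in place of the preprocessed, augmented ODIR images
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 8, (64,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-4)              # base learning rate 1e-4
scheduler = warmup_cosine_scheduler(optimizer, warmup_epochs=10, total_epochs=500)
class_weights = inverse_frequency_weights(labels, num_classes=8).to(device)

for epoch in range(500):                                    # 500 training epochs
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = focal_loss(model(x), y, alpha=class_weights, gamma=2.0)
        loss.backward()
        optimizer.step()
    scheduler.step()
```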

结果分析和对比:Result analysis and comparison:

Figure 11 shows the training-accuracy, training-loss, evaluation-accuracy and evaluation-loss curves of each model during training; Figure 12 shows the accuracy and loss curves of the HiFuse network.

The metrics of each model at its best state are listed in Table 1:

表1Table 1

First, the training-accuracy, training-loss, evaluation-accuracy and evaluation-loss curves in Figure 11 show that the EfficientNet and DenseNet networks follow roughly the same trend: both converge quickly and both overfit. VGG16-Attention performs worse than these two networks on the training set, but its final performance on the evaluation set is comparable, and its overfitting is milder than that of the two transfer-learning networks.

Because the HiFuse network is relatively large, Figure 11 shows its results after 500 training epochs. As can be seen, the evaluation accuracy and evaluation loss of the HiFuse model oscillate strongly in the early stage and the model converges slowly, i.e. its learning process is not very stable. This is due to the high complexity of the HiFuse model and the limited amount of training data. The evaluation metrics in Table 1 show that the two transfer-learning models perform better on accuracy and the Kappa coefficient, reflecting a higher agreement between their predictions and the ground truth. The HiFuse model is affected more strongly by the uneven data distribution and, owing to its high complexity, suffers from more severe overfitting, so the agreement between its predictions and the ground truth is the worst; on all other evaluation metrics, however, HiFuse performs best.

HiFuse模型的最终得分最高,如果增加数据量的大小,减轻过拟合的现象,HiFuse的多级特征融合机制和注意力机制的优势会更加突出。The HiFuse model has the highest final score. If the amount of data is increased and the overfitting phenomenon is alleviated, the advantages of HiFuse's multi-level feature fusion mechanism and attention mechanism will be more prominent.

上述实施例仅为本发明的较佳实施例,不能被认为用于限定本发明的实施范围。凡依本发明申请范围所作的均等变化与改进等,均应仍归属于本发明的专利涵盖范围之内。The above embodiments are only preferred embodiments of the present invention and cannot be considered to limit the scope of the present invention. All equivalent changes and improvements made within the scope of the present invention should still fall within the scope of the present invention.

Claims (5)

Translated from Chinese
1.一种眼底图像分类方法,其特征在于包括以下步骤:1. A fundus image classification method, characterized by comprising the following steps:1)获取已有的眼底图像数据集并进行数据转换、图像预处理和数据的增广处理,得到预处理数据集;所述数据集采用2019年北京大学“智慧之眼”国际眼底图像智能识别竞赛开源的训练集;原始数据来自合作医院及医疗机构进行眼健康检查的患者,该训练集共包括3500组信息,每一组分别包含左眼和右眼的相关信息,以下简称该数据集为ODIR数据集,该数据集包含标签信息和眼底视网膜图像两部分;1) Obtain the existing fundus image data set and perform data conversion, image preprocessing and data augmentation processing to obtain a preprocessed data set; the data set adopts the 2019 Peking University "Smart Eyes" International Fundus Image Intelligent Recognition The open source training set of the competition; the original data comes from patients undergoing eye health examinations in cooperative hospitals and medical institutions. The training set includes a total of 3,500 sets of information, each set containing relevant information of the left eye and the right eye respectively. Hereinafter, the data set is referred to as ODIR data set, which contains label information and fundus retinal images;2)提取图像特征并进行融合:2) Extract image features and fuse them:(1)红色共生矩阵:灰度共生矩阵被定义为从灰度为i的像素点出发,离开某个固定位置的点上灰度值为的概率,即所有估计的值可以表示成一个矩阵的形式,以此被称为灰度共生矩阵;由于眼底图像主要包含红色,因此通过红色共生矩阵获得更多的图像信息;读取图片并转为RGB格式,将图像的红色通道R的值作为相关特性;取图像(N×N)中任意一点(x,y)及偏离它的另一点(x+a,y+b),设该点对的R值为(r1,r2);共生矩阵中每个元素的值可以理解为(x,y)点与(x+a,y+b)点的值对为(i,j)的概率;对于整个画面,统计出每一种(r1,r2)值出现的次数,然后排列成一个方阵,在用(r1,r2)出现的总次数将它们归一化为出现的概率P(r1,r2);通过基于四个方向的红色共生矩阵,从而得到眼底图像的纹理特征统计量R;四个方向为水平、垂直、对角线、反对角线;(1) Red co-occurrence matrix: The gray-level co-occurrence matrix is defined as the probability that starting from the pixel point with gray level i, the gray level value at a point leaving a fixed position is, that is, all estimated values can be expressed as a matrix form, which is called the grayscale co-occurrence matrix; since the fundus image mainly contains red, more image information is obtained through the red co-occurrence matrix; the image is read and converted to RGB format, and the value of the red channel R of the image is used as the correlation Characteristics; take any point (x, y) in the image (N×N) and another point deviating from it (x+a, y+b), and set the R value of the point pair to (r1, r2); in the co-occurrence matrix The value of each element can be understood as the probability that the value pair of (x, y) point and (x+a, y+b) point is (i, j); for the entire picture, count each type of (r1, r2 ) value, then arrange it into a square matrix, and use the total number of occurrences of (r1, r2) to normalize them into the probability of occurrence P(r1, r2); through the red co-occurrence matrix based on the four directions, Thus, the texture feature statistics R of the fundus image are obtained; the four directions are horizontal, vertical, diagonal, and anti-diagonal;R=(r1,r2,…,rm)R=(r1 ,r2 ,…,rm )(2)视觉词袋:利用在图像中的关键点,即包含有关图像的丰富局部信息的部分,将关键点分组到大量聚类中,并将具有相似描述符的关键点分配到同一聚类中;通过将每个聚类视为一个“视觉词”,代表该聚类中的关键点共享的特定局部模式,有一个描述各种此类局部图像模式的视觉词词汇表;通过将图像的关键点映射到视觉词中,可以将图像表示为视觉词袋;采用尺度无关特征变换方法SIFT进行特征提取,从不同类别的图像中提取视觉词汇向量,这些向量代表的是图像中局部不变的特征点;将所有特征点向量集合到一块,利用K-Means算法合并词义相近的视觉词汇,构造一个包含K个词汇的单词表;在关键点特征数据集X={x1,x2,..,xi,..,xN}中找到k个簇的聚类中心{c1,c2,..,cj,..,ck}使得各个簇中样本向量到对应簇聚类中心的欧式距离最小;公式如下:(2) Bag of visual words: Utilize key points in the image, that is, parts that contain rich local information about the image, group the key points into a large number of clusters, and assign key points with similar descriptors to the same cluster in; by treating each cluster as a "visual word" representing a specific local pattern shared by keypoints in that cluster, there is a visual word vocabulary 
that describes various such local image patterns; by treating the image's Key points are mapped to visual words, and the image can be represented as a bag of visual words; the scale-independent feature transformation method SIFT is used for feature extraction, and visual word vectors are extracted from images of different categories. These vectors represent local invariants in the image. Feature points; gather all feature point vectors together, use the K-Means algorithm to merge visual words with similar meanings, and construct a word list containing K words; in the key point feature data set X={x1,x2,..., Find the clustering center {c1,c2,..,cj,..,ck} of k clusters in xi,..,xN} so that the Euclidean distance between the sample vector in each cluster and the cluster center of the corresponding cluster is the smallest; the formula is as follows :图像中SIFT关键点特征分别与k个聚类中心进行距离计算,k个聚类中心即为视觉单词,哪一个视觉单词距离最小,就将SIFT关键点特征分配给该视觉单词;最终得到的集合就是视觉词典,表示如下:The distances between the SIFT key point features in the image and k cluster centers are calculated respectively. The k cluster centers are the visual words. Which visual word has the smallest distance, the SIFT key point features are assigned to the visual word; the final set is obtained It is a visual dictionary, expressed as follows:D=(d1,d2,…,dk)D=(d1 ,d2 ,…,dk )随后,统计单词表中每个单词在图像中出现的次数,从而将图像表示成为一个K维数值向量,即生成该图像的视觉单词包;Then, count the number of times each word in the word list appears in the image, thereby representing the image as a K-dimensional numerical vector, that is, generating the visual word bag of the image;H=(h1,h2,…,hk)H=(h1 ,h2 ,…,hk )(3)扩散模型的中间特征表示;扩散模型首先定义一个前向噪声过程,将逐步高斯噪声迭代添加到图像x0中,从数据分布q(x0)中采样,以T步得到一个完全噪声的图像xT;这个正向过程是一个具有值x1,x2,…,xt,…,xT-1,xT的马尔可夫链,它代表不同程度的噪声图像,定义如下:(3) Intermediate feature representation of the diffusion model; the diffusion model first defines a forward noise process, iteratively adds stepwise Gaussian noise to the image x0 , samples from the data distribution q(x0 ), and obtains a complete noise in T steps image xT ; this forward process is a Markov chain with values x1 ,x2 ,…,xt ,…,xT-1 ,xT , which represents different degrees of noise images and is defined as follows:其中,为方差设定,N为正态分布;根据以下公式,可在扩散步骤t时直接从真实图像x0中采样噪声图像xtin, is the variance setting, N is the normal distribution; according to the following formula, the noise image x t can be directly sampled from the real image x0 at the diffusion stept :αt∶=1-βtαt :=1-βt逆向扩散过程旨在从后验分布q(xt|xt-1)中反转正向过程和采样,该分布取决于整个数据分布;迭代地对一个完全有噪声的图像xt进行降噪,从数据分布q(x0)中采样;使用神经网络∈θ近似为如下表达:The backward diffusion process aims to invert the forward process and sample from the posterior distribution q(xt |xt-1 ), which depends on the entire data distribution; iteratively denoises a completely noisy image xt , sampled from the data distribution q(x0 ); using neural network ∈θ is approximated by the following expression:当p和q作为VAE时,变分下界目标的简化版本只是一个均方误差损失;用于训练∈θ,该∈θ学习将高斯噪声近似∈添加到真实图像x0中:When p and q are VAEs, a simplified version of the variational lower bound objective is just a mean square error loss; used to train ∈θ ,which learns to add a Gaussian noise approximation ∈ to the real image x0 :使用引导扩散(guided diffusion,GD)实现,使用U-Net式架构,其中具有残差块,将残差块、残差加注意力块和下采样或上采样残差块中的每一个都视为单独的块,并将它们编号为b∈{1,2,...,37},用于预训练好的无条件引导扩散模型;用扩散步骤t和模型块数b进行参数化,得到噪声图像xt和用块号b之后的激活作为特征向量的f(x0,t,θ);Implemented using guided diffusion (GD), using a U-Net-like architecture with a residual block, each of the residual block, the residual plus attention block, and the downsampling or upsampling residual block is considered as separate blocks 
and number them b∈{1, 2,...,37} for the pre-trained unconditional guided diffusion model; parameterized with the diffusion step t and the number of model blocks b, we get the noise Image xt and f(x0 ,t,θ) using activation after block number b as feature vector;f(x0,t,θ)=(f1,f1,…,fn)f(x0 ,t,θ)=(f1 ,f1 ,…,fn )将三个特征进行线性拼接,形成融合后的特征,各特征向量在融合时所占的权值相等,即:The three features are linearly spliced to form the fused features. Each feature vector has equal weight in the fusion, that is:T=(r1,r2,…,rm,h1,h2,…,hk,f1,f1,…,fn)T=(r1 ,r2 ,…,rm ,h1 ,h2 ,…,hk ,f1 ,f1 ,…,fn )3)构建包含注意力机制以及特征融合机制的HiFuse网络;将预处理数据集通过VGG-16网络和HiFuse模型分别进行特征提取,并与步骤2)中的特征进行融合,采用softmax层进行分类,得到眼底病变图像分类结果。3) Construct a HiFuse network that includes an attention mechanism and a feature fusion mechanism; extract features from the preprocessed data set through the VGG-16 network and the HiFuse model respectively, and fuse them with the features in step 2), and use the softmax layer for classification. Obtain fundus lesion image classification results.2.如权利要求1所述一种眼底图像分类方法,其特征在于在步骤1)中,所述数据转换针对ODIR数据集标签信息表中的“Left-Diagnostic Keywords”和“Right-DiagnosticKeywords”列进行统计,得出所有关键字集合;然后分别针对左眼和右眼提取并生成相应的疾病类别标签,即单眼级别的标签。2. A fundus image classification method as claimed in claim 1, characterized in that in step 1), the data conversion is directed to the "Left-Diagnostic Keywords" and "Right-Diagnostic Keywords" columns in the ODIR data set label information table. Statistics are performed to obtain a set of all keywords; then corresponding disease category labels are extracted and generated for the left eye and right eye respectively, that is, single-eye level labels.3.如权利要求1所述一种眼底图像分类方法,其特征在于在步骤1)中,所述图像预处理,针对ODIR数据集中图像的大小不一致、亮度不均匀等问题,先将每张图像采样至512×512像素尺寸大小,再进行一系列的预处理,以提高图像质量、增强图像特征,图像预处理的具体步骤包括:3. A fundus image classification method as claimed in claim 1, characterized in that in step 1), in the image preprocessing, in order to solve problems such as inconsistent size and uneven brightness of images in the ODIR data set, each image is first Sampling to 512×512 pixel size, and then performing a series of preprocessing to improve image quality and enhance image features. The specific steps of image preprocessing include:(1)标准化处理:眼底图像中像素值的范围可能会因为不同的采集设备或者参数设置而有所不同,这会对后续的图像分析和处理造成影响;标准化处理可以将图像中的像素值缩放到统一的范围内,使得不同图像之间的像素值具有可比性,从而更好地进行数据分析或者机器学习;(1) Standardization processing: The range of pixel values in fundus images may vary due to different acquisition equipment or parameter settings, which will affect subsequent image analysis and processing; standardization processing can scale the pixel values in the image To a unified range, making the pixel values between different images comparable, so as to better conduct data analysis or machine learning;(2)CLAHE直方图均衡:眼底图像中可能存在大量的低对比度区域和细节模糊的区域,CLAHE直方图均衡化可以增强图像的对比度和细节,使图像更加清晰,从而更容易进行图像分析和特征提取;(2) CLAHE histogram equalization: There may be a large number of low-contrast areas and areas with blurred details in fundus images. CLAHE histogram equalization can enhance the contrast and details of the image, making the image clearer, making it easier to perform image analysis and features. 
extract;(3)Gamma校正:眼底图像中的亮度和对比度可能会受到拍摄环境、采集设备等因素的影响,这也会影响图像的可视化和特征提取;Gamma校正可以调整图像的亮度和对比度,使得图像更加鲜明、清晰,从而更容易进行图像分析和特征提取;(3) Gamma correction: The brightness and contrast in the fundus image may be affected by factors such as the shooting environment and collection equipment, which will also affect the visualization and feature extraction of the image; Gamma correction can adjust the brightness and contrast of the image to make the image more beautiful. Bright and clear, making it easier to perform image analysis and feature extraction;(4)高斯平滑处理:眼底图像中可能存在的一些噪声信息会对后续的图像分析和处理造成影响;使用高斯平滑处理可以去除一些噪声,使得图像更加平滑,从而更容易进行图像分析和特征提取;(4) Gaussian smoothing: Some noise information that may exist in the fundus image will affect subsequent image analysis and processing; using Gaussian smoothing can remove some noise and make the image smoother, making it easier to perform image analysis and feature extraction. ;Iweight=Iγ×α+Iblur×β+γIweight =Iγ ×α+Iblur ×β+γIblur=Iγ×kernalh×wIblur =Iγ ×kernalh×w .4.如权利要求1所述一种眼底图像分类方法,其特征在于在步骤1)中,所述数据的增广处理,用于扩充训练集、防止网络过拟合以及提高网络模型的泛化能力,对数据随机进行如如下增广处理操作:4. A fundus image classification method as claimed in claim 1, characterized in that in step 1), the augmentation processing of the data is used to expand the training set, prevent network overfitting and improve the generalization of the network model. Ability to randomly perform augmentation processing operations on data as follows:(1)随机旋转:该方法将图像旋转一个随机角度,以产生不同的图像变换增加数据的多样性,从而帮助模型学习到图像不同角度的特征;(1) Random rotation: This method rotates the image by a random angle to generate different image transformations to increase the diversity of data, thereby helping the model learn the characteristics of the image at different angles;(2)随机缩放裁剪:该方法随机从原始图像中裁剪出一块子图像,并将其缩放到指定的大小;(2) Random scaling and cropping: This method randomly crops a sub-image from the original image and scales it to the specified size;(3)随机水平/垂直翻转:该方法会随机将图像水平或垂直翻转,并将其作为新的训练样本。(3) Random horizontal/vertical flipping: This method will randomly flip the image horizontally or vertically and use it as a new training sample.5.如权利要求1所述一种眼底图像分类方法,其特征在于在步骤3)中,得到眼底病变图像分类结果的具体步骤为:5. A fundus image classification method as claimed in claim 1, characterized in that in step 3), the specific steps for obtaining fundus lesion image classification results are:(1)使用去除全连接层的VGG-16作为主干网,利用VGG-16网络中的中间特征映射pool-3和pool-4推断注意力图;在计算注意力图时,pool-5的输出作为一种全局特征,标记为G,最后阶段的特征包含着整个图像中最压缩和抽象化的信息;使用F=(f1,f2,...,fn)表示中间层的特征,其中fi代表第i个块的输出;(1) Use VGG-16 with the fully connected layer removed as the backbone network, and use the intermediate feature maps pool-3 and pool-4 in the VGG-16 network to infer the attention map; when calculating the attention map, the output of pool-5 is used as a A global feature, labeled G, the final stage features contain the most compressed and abstract information in the entire image; use F = (f1 , f2 ,..., fn ) to represent the features of the middle layer, where fi represents the output of the i-th block;F和G同时作为注意力模块的输入,计算产生一个一通道的输出R,其中代表卷积运算,Wf和Wg由256个卷积核组成,卷积核W输出为一个通道大小,up(·)是双线性插值函数,用于统一输出空间的大小;F and G are simultaneously used as inputs to the attention module, and a one-channel output R is calculated, where Represents the convolution operation. Wf and Wg are composed of 256 convolution kernels. The output of the convolution kernel W is one channel size. 
up(·) is a bilinear interpolation function used to unify the size of the output space;最终的注意力图A是由R归一化得到;The final attention map A is obtained by normalizing R;A=Sigmoid(R)A=Sigmoid(R)A中的每个元素ai∈A表示对应空间特征向量的关注程度,添加注意力分数后的特征向量由A和F做点积得到;最后,将VGG-16第3、4块的输出以及G做拼接作为提取的特征V;Each element ai ∈A in A represents the degree of attention of the corresponding spatial feature vector. The feature vector after adding the attention score is obtained by doing the dot product of A and F; finally, the output of the 3rd and 4th blocks of VGG-16 and G is spliced as the extracted feature V;V=(v1,v2,…,vs)V=(v1 ,v2 ,…,vs )(2)HiFuse模型用于获得不同尺度的局部空间信息和全局语义表示;在HiFuse模型中使用一个并行的结构从全局和局部特征块中提取医学图像的全局和局部信息,并通过“H”型结构融合不同层次的特征;通过HFF块融合不同层次的特征,然后经过下采样步骤,得到提取的特征Fi(2) The HiFuse model is used to obtain local spatial information and global semantic representation at different scales; a parallel structure is used in the HiFuse model to extract global and local information of medical images from global and local feature blocks, and through the "H" shape The structure fuses features at different levels; features at different levels are fused through HFF blocks, and then through the downsampling step, the extracted featuresFi are obtained;将步骤2)、VGG-16网络以及HiFuse模型提取的特征进行线性拼接,即:Linearly splice the features extracted from step 2), VGG-16 network and HiFuse model, that is:Features=(t1,t2,…,tm,v1,v2,…,vs,Fi1,Fi2,…,Fik)Features=(t1 ,t2 ,…,tm ,v1 ,v2 ,…,vs ,Fi1 ,Fi2 ,…,Fik )将Features输入到softmax分类层,最后得到眼底病变图像分类结果。Input the Features into the softmax classification layer, and finally obtain the fundus lesion image classification results.
CN202311781657.1A2023-12-222023-12-22 A fundus image classification methodPendingCN117636449A (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202311781657.1ACN117636449A (en)2023-12-222023-12-22 A fundus image classification method

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202311781657.1ACN117636449A (en)2023-12-222023-12-22 A fundus image classification method

Publications (1)

Publication NumberPublication Date
CN117636449Atrue CN117636449A (en)2024-03-01

Family

ID=90019992

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202311781657.1APendingCN117636449A (en)2023-12-222023-12-22 A fundus image classification method

Country Status (1)

CountryLink
CN (1)CN117636449A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN118522005A (en)*2024-05-112024-08-20福建农林大学Classification method and device based on pleurotus eryngii phenotype parameters
CN118522005B (en)*2024-05-112025-10-03福建农林大学 A classification method and device based on phenotypic parameters of King Oyster Mushroom
CN120318613A (en)*2025-06-192025-07-15长春理工大学 A lightweight multi-feature fusion method for thyroid nodule classification

Similar Documents

PublicationPublication DateTitle
Wang et al.Patch-based output space adversarial learning for joint optic disc and cup segmentation
Al-Fahdawi et al.Fundus-deepnet: Multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images
US11270169B2 (en)Image recognition method, storage medium and computer device
CN113077471B (en)Medical image segmentation method based on U-shaped network
US12220102B2 (en)Endoscopic image processing
CN110399929B (en)Fundus image classification method, fundus image classification apparatus, and computer-readable storage medium
Bian et al.Optic disc and optic cup segmentation based on anatomy guided cascade network
Shi et al.MD-Net: A multi-scale dense network for retinal vessel segmentation
CN114287878A (en)Diabetic retinopathy focus image identification method based on attention model
CN117636449A (en) A fundus image classification method
Debnath et al.Rare and Common Types of Retinal Disease Recognition Using Ensemble Deep Learning
CN111461218B (en)Sample data labeling system for fundus image of diabetes mellitus
US12079985B2 (en)Diabetic retinopathy detection using machine learning
CN117912092B (en) Fundus image recognition method, device and storage medium based on binocular feature fusion
CN115035127B (en)Retina blood vessel segmentation method based on generation type countermeasure network
CN111833334A (en) A method of fundus image feature processing and analysis based on twin network architecture
CN108564570A (en)A kind of method and apparatus of intelligentized pathological tissues positioning
Maaliw et al.Cataract detection and grading using ensemble neural networks and transfer learning
Guo et al.CAFR-CNN: coarse-to-fine adaptive faster R-CNN for cross-domain joint optic disc and cup segmentation
Huang et al.CSAUNet: A cascade self-attention u-shaped network for precise fundus vessel segmentation
CN115760835A (en)Medical image classification method of graph convolution network
Sanghavi et al.An efficient framework for optic disk segmentation and classification of glaucoma on fundus images
Wu et al.Automatic cataract detection with multi-task learning
Sharma et al.Deep learning to diagnose Peripapillary Atrophy in retinal images along with statistical features
Bazi et al.Vision transformers for segmentation of disc and cup in retinal fundus images

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
