Attention model-based diabetic retinopathy focus image identification method
Technical Field
The invention relates to the multi-disciplinary technical field at the intersection of computer vision, digital image processing and ophthalmic clinical medicine, and in particular to a diabetic retinopathy focus image recognition method based on an attention model.
Background
Diabetic retinopathy is a retinal complication caused by diabetes mellitus; if not treated in time, it carries an irreversible risk of vision loss or blindness. In general, the early stage of the disease does not cause noticeable vision impairment, so the patient cannot easily perceive it, and the best opportunity for diagnosis may be missed, allowing the disease to worsen. Currently, retinopathy can be effectively screened by fundus photography, so it is necessary to normalize screening on a large scale. However, there is still a shortage of experienced image-reading doctors, which prevents large-scale screening. In recent years, deep learning has developed rapidly in computer vision fields such as image classification, object detection and semantic segmentation, and has achieved notable results in the medical field; it is expected to relieve the image-reading burden on doctors and alleviate the shortage of medical resources in some areas. Therefore, developing a deep learning-based image recognition method for diabetic retinopathy is very important, and has both practical value and academic research value.
In order to meet the needs of large-scale screening tasks, accurate classification and segmentation algorithms are a key part of achieving automatic disease screening and lesion segmentation. However, these tasks face three challenges. First, microaneurysms are the earliest lesions of diabetic retinopathy, but occupy very little space in the retina. Second, disturbances unrelated to diabetic retinopathy are easily amplified by convolution and non-linear operations, ultimately affecting the screening results. Third, the distribution of lesion mask labeling data is extremely unbalanced, which tends to greatly inhibit the generalization ability of the model. Therefore, it is significant and challenging to study a deep learning-based method for identifying diabetic retinopathy focus images.
For this reason, a method for identifying diabetic retinopathy lesion images based on an attention model is required, with the following purposes:
1. a dedicated feature extraction module is designed to acquire features that are sensitive to tiny lesion areas and resistant to irrelevant interference, so as to detect lesions in fundus color photographs and provide an interpretable basis for automatic diabetic retinopathy screening;
2. the model is guided to learn lesion characteristics gradually from data with an unbalanced distribution, improving its generalization capability, so that the diabetic retinopathy screening task is completed effectively and the development of artificial intelligence in the ophthalmic clinical field is promoted.
Therefore, a method that effectively and accurately detects diabetic retinopathy lesions and predicts the disease probability in fundus color photographs will greatly promote the development of both the computer vision field and the ophthalmic clinical field.
Disclosure of Invention
Aiming at the above challenges and purposes, the invention provides a method for identifying diabetic retinopathy focus images based on an attention model.
The invention is realized by the following technical scheme:
An attention model-based diabetic retinopathy focus image identification method comprises the following steps:
S1, acquiring a diabetic retinopathy fundus color photograph data set, and performing medical image preprocessing on the original images;
S2, constructing an attention network model based on semantic segmentation to extract features;
S3, constructing a focus sensing module, a feature retaining module, a feature fusion module and a head attention module, extracting focus related information from the extracted features, and generating a focus detection probability map;
S4, using the focus detection probability map obtained by the attention model from the input fundus color photograph to guide eye disease screening, and obtaining a fundus photograph identification result with diabetic retinopathy focus information.
Preferably, in step S1, the diabetic retinopathy fundus color photograph data set includes: normal fundus color photographs and non-proliferative diabetic retinopathy fundus color photographs, with image-level disease labels and pixel-level mask labels for four relevant lesions: microaneurysms, hemorrhages, hard exudates and soft exudates; wherein the non-proliferative diabetic retinopathy includes three grades of mild, moderate and severe.
Preferably, in step S1, the preprocessing of the original image includes:
A. cutting redundant black backgrounds around the image to obtain a square image;
B. converting a sample image from the RGB color space to the Lab color space, performing contrast-limited adaptive histogram equalization (CLAHE) on the luminance (L) channel, and converting back to the RGB color space so as to enhance image details;
C. scaling the image to a uniform size to facilitate input to the model;
D. performing min-max normalization so that pixel values fall in the [0,1] interval, in order to speed up model training;
E. augmenting the images in the training set with random-angle center rotation, random vertical flipping and random horizontal flipping, so as to prevent overfitting.
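For illustration, a minimal preprocessing sketch covering steps A-D is given below, assuming Python with OpenCV and NumPy; the function name, background threshold and CLAHE parameters are illustrative assumptions rather than fixed parts of the method, and step E (augmentation) is typically applied separately in the training data pipeline.

```python
# Minimal sketch of preprocessing steps A-D, assuming OpenCV and NumPy;
# the threshold and CLAHE settings below are illustrative assumptions.
import cv2
import numpy as np

def preprocess_fundus(img_bgr: np.ndarray, size: int = 512) -> np.ndarray:
    # A. crop the redundant black background to a square region of interest
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    ys, xs = np.where(gray > 10)                      # background threshold (assumption)
    img_bgr = img_bgr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # B. CLAHE on the L channel of the Lab color space to enhance details
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img_bgr = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # C. resize to a uniform input size
    img_bgr = cv2.resize(img_bgr, (size, size), interpolation=cv2.INTER_LINEAR)

    # D. min-max normalization into [0, 1]
    img = img_bgr.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)
```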
Preferably, in step S2, the semantic segmentation-based attention network model adopts an encoder-decoder structure: ResNet-50 is adopted as the encoder to extract low-level image features, and the focus sensing module, feature retaining module, feature fusion module and head attention module in step S3 form the decoder.
Preferably, the four layers of the encoder output four low-level features f1, f2, f3 and f4 of different scales and depths; correspondingly, the decoder outputs four focus advanced features xdec^0, xdec^1, xdec^2 and xdec^3 of different scales. The correspondence is as follows:
xdec^0 = fLPB^0(fHAM(f4))
xdec^i = fLPB^i(fFFM^i(xdec^(i-1), f(4-i), fFPB(f4))), i = 1, 2, 3
Wherein, f4 is the feature of the deepest depth, i.e. the deep semantic feature; fLPB^i(·), fFFM^i(·) and fFPB(·) respectively represent the i-th focus sensing module, feature fusion module and feature retaining module operations, with no weight sharing among the modules; fHAM(·) represents the head attention module operation. Four focus detection probability maps xout^0, xout^1, xout^2 and xout^3 of different scales are generated from the four focus advanced features through the operation fout(·), i.e. a convolution with kernel size 3×3 followed by a Sigmoid activation function layer:
xout^i = fout^i(xdec^i), i = 0, 1, 2, 3
Preferably, the focus sensing module obtains focus information through direction-aware feature extraction and a self-attention mechanism, and outputs the focus perception advanced feature xdec, whose final combination is:
xdec(i,j,c) = xort(i,j,c) × xatt(c)
Wherein, x1 is obtained from the input x of the focus sensing module through a convolution layer operation fconv(·) with kernel size 3×3; fh(·) and fv(·) respectively represent a horizontal convolution with kernel size 3×1 and a vertical convolution with kernel size 1×3 used to obtain direction information; fort(·) represents a convolution operation with kernel size 1×1 that yields the direction-aware feature xort; xatt is a global attention feature map of dimension 1×1×C obtained from x by global attention extraction; i, j and c respectively index the height, width and channel dimensions of the feature map; the Hadamard product of xort and xatt gives the focus advanced feature xdec.
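For illustration, a minimal PyTorch sketch of such a focus sensing module is given below. The way the horizontal and vertical direction features are combined before fort(·) is not fixed by the text, so element-wise summation is assumed; the GlobalAttention2d submodule it calls is sketched after the global attention paragraph below.

```python
# Minimal PyTorch sketch of the focus (lesion) sensing module described above.
# Assumption: the 3x1 and 1x3 direction features are summed before the 1x1 conv;
# GlobalAttention2d is sketched after the global-attention paragraph below.
import torch
import torch.nn as nn

class LesionPerceptionBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.f_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)          # fconv, 3x3
        self.f_h = nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))  # fh
        self.f_v = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))  # fv
        self.f_ort = nn.Conv2d(out_ch, out_ch, kernel_size=1)                     # fort, 1x1
        self.attn = GlobalAttention2d(in_ch, out_ch)                              # xatt, 1x1xC

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.f_conv(x)
        x_ort = self.f_ort(self.f_h(x1) + self.f_v(x1))   # direction-aware feature xort
        x_att = self.attn(x)                               # (N, C, 1, 1) attention map
        return x_ort * x_att                               # Hadamard product -> xdec
```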
Preferably, the global attention extraction is calculated as follows:
xg(c) = Σi Σj [ e^(x2(i,j,c)) / Σi′ Σj′ e^(x2(i′,j′,c)) ] × x2(i,j,c)
xatt = fg(xg)
Wherein, x2 is obtained from the input x of the focus sensing module through a convolution layer operation fconv(·) with kernel size 1×1; H and W respectively represent the height and width of x2, over which the sums run; i, j and c respectively index the different dimensions of the feature map, and e is the natural constant; fg(·) represents two convolution layer operations with kernel size 1×1; xg and xatt are the global feature of the focus sensing module and the corresponding global attention feature map, respectively.
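A minimal sketch of this global attention extraction, shared by the focus sensing and feature retaining modules, could look as follows; the spatial softmax weighting reflects the natural constant e mentioned above, while the ReLU between the two 1×1 convolutions and the final Sigmoid are additional assumptions.

```python
# Minimal sketch of global attention extraction: a spatial softmax weights x2 into
# a 1x1xC global feature xg, and two 1x1 convs (fg) produce the attention map xatt.
# The intermediate ReLU and final Sigmoid are assumptions not fixed by the text.
import torch
import torch.nn as nn

class GlobalAttention2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.f_conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)       # fconv, 1x1
        self.f_g = nn.Sequential(                                    # fg: two 1x1 convs
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x2 = self.f_conv(x)                                          # (N, C, H, W)
        n, c, h, w = x2.shape
        weights = torch.softmax(x2.view(n, c, h * w), dim=-1)        # spatial softmax
        xg = (weights * x2.view(n, c, h * w)).sum(-1).view(n, c, 1, 1)  # global feature xg
        return torch.sigmoid(self.f_g(xg))                           # xatt, (N, C, 1, 1)
```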
Preferably, the feature retaining module extracts and retains deep semantic features from the last layer output of the encoder using global attention, formulated as:
xg(c) = Σi Σj [ e^(xt(i,j,c)) / Σi′ Σj′ e^(xt(i′,j′,c)) ] × xt(i,j,c)
xFPB = fg(xg)
Wherein, xt is obtained from the deep semantic feature f4 output by the last layer of the encoder through a convolution layer operation fconv(·) with kernel size 1×1; H and W respectively represent the height and width of xt, over which the sums run; i, j and c respectively index the different dimensions of the feature map, and e is the natural constant; fg(·) represents two convolution layer operations with kernel size 1×1; xg and xFPB are the global feature of the feature retaining module and the corresponding global attention feature map, respectively.
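Under the same assumptions, the feature retaining module reduces to applying this global attention extraction to the deep semantic feature, for example:

```python
# Minimal sketch of the feature retaining module: it applies the global attention
# extraction sketched above to the deep semantic feature f4 from the encoder and
# keeps the resulting 1x1xC map xFPB for the fusion stages.
import torch.nn as nn

class FeaturePreservingBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.attn = GlobalAttention2d(in_ch, out_ch)    # sketched above

    def forward(self, f4):
        return self.attn(f4)                            # xFPB, shape (N, C, 1, 1)
```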
Preferably, the feature fusion module fuses the shallow low-level feature from the encoder, the deep semantic feature retained by the feature retaining module and the focus perception advanced feature output by the preceding focus sensing module of the decoder, and the formula is expressed as:
xfuse(i,j,c) = fconv3(fconc(z1(i,j,c)×xFPB(c), z2(i,j,c)×xFPB(c), z3(i,j,c)×xFPB(c)))
Where z1, z2 and z3 denote the three input features after the corresponding upsampling and convolution operations, fup1(·) and fup2(·) represent bilinear interpolation upsampling layer operations, fconc(·) represents channel dimension concatenation, fconv1(·), fconv2(·) and fconv3(·) each represent convolution layer operations with kernel size 3×3, xFPB is the global attention feature map output by the feature retaining module, and xfuse is the fusion feature output by the feature fusion module.
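A minimal sketch of such a feature fusion module is given below; the assignment of fconv1(·) and fconv2(·) to the upsampled decoder and deep-semantic branches, and the assumption that the encoder feature already has the decoder channel width, are illustrative choices not fixed by the text.

```python
# Minimal sketch of the feature fusion module. Assumptions: the previous decoder
# output and the deep semantic feature are bilinearly upsampled to the encoder
# feature's resolution and passed through 3x3 convs (fconv1, fconv2), every branch
# is reweighted by the 1x1xC map xFPB, and the concatenation goes through fconv3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionBlock(nn.Module):
    def __init__(self, dec_ch: int, sem_ch: int, out_ch: int):
        super().__init__()
        # the encoder low-level feature is assumed to already have out_ch channels
        self.f_conv1 = nn.Conv2d(dec_ch, out_ch, 3, padding=1)        # decoder branch
        self.f_conv2 = nn.Conv2d(sem_ch, out_ch, 3, padding=1)        # semantic branch
        self.f_conv3 = nn.Conv2d(3 * out_ch, out_ch, 3, padding=1)    # after concatenation

    def forward(self, x_dec_prev, f_enc, f_sem, x_fpb):
        size = f_enc.shape[-2:]
        z1 = self.f_conv1(F.interpolate(x_dec_prev, size=size, mode="bilinear",
                                        align_corners=False))         # fup1 + fconv1
        z2 = f_enc                                                     # encoder low-level feature
        z3 = self.f_conv2(F.interpolate(f_sem, size=size, mode="bilinear",
                                        align_corners=False))         # fup2 + fconv2
        fused = torch.cat([z1 * x_fpb, z2 * x_fpb, z3 * x_fpb], dim=1)  # fconc, reweighted by xFPB
        return self.f_conv3(fused)                                     # xfuse
```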
Preferably, the head attention module performs spatial attention and channel attention operations on deep semantic features output by the last layer of the encoder.
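The text does not fix the internal design of the head attention module, so the sketch below assumes a CBAM-style sequential channel-then-spatial attention as one plausible realization:

```python
# Minimal sketch of the head attention module: channel attention followed by
# spatial attention on the deep semantic feature f4. The CBAM-style design and
# the reduction ratio are assumptions.
import torch
import torch.nn as nn

class HeadAttentionModule(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(                     # channel attention MLP
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # spatial attention

    def forward(self, f4: torch.Tensor) -> torch.Tensor:
        ca = torch.sigmoid(self.channel_mlp(f4.mean((2, 3), keepdim=True))
                           + self.channel_mlp(f4.amax((2, 3), keepdim=True)))
        x = f4 * ca                                            # channel-refined feature
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                          # spatially-refined feature
```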
Preferably, in step S2, the semantic segmentation-based attention network model uses stochastic gradient descent (SGD) as the optimizer and a dynamic learning rate that first rises and then falls during training, where the minimum value of the dynamic learning rate is set to 0.001 and the maximum value is set to 0.1; the four focus detection probability maps of different scales are upsampled by bilinear interpolation to the size of the original input image, and a binary cross entropy function with positive sample weighting is used as the loss function for each of them, defined as follows:
L = -(1/(H×W)) Σi Σj [α·g(i,j)·log(xout(i,j)) + (1-g(i,j))·log(1-xout(i,j))]
Wherein, H and W are the height and width of the image respectively, g(i,j) represents the focus mask label, xout(i,j) represents the focus detection probability map, and α is the positive sample weight, set to 10 here;
The losses obtained from the four focus detection probability maps of different scales are accumulated to obtain the total loss value, and the weights are then adjusted and optimized through back propagation.
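A minimal sketch of this positive-sample-weighted binary cross entropy, with α = 10 as stated and a mean reduction assumed, could be:

```python
# Minimal sketch of the positive-sample-weighted binary cross entropy applied to
# each of the four lesion probability maps; alpha=10 follows the text, the mean
# reduction is an assumption. x_out and g have shape (N, 4, H, W), values in [0, 1].
import torch

def weighted_bce(x_out: torch.Tensor, g: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    eps = 1e-7
    x_out = x_out.clamp(eps, 1.0 - eps)
    loss = -(alpha * g * torch.log(x_out) + (1.0 - g) * torch.log(1.0 - x_out))
    return loss.mean()

# total loss: sum over the four scales, each map upsampled to the input size first
```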
In the verification process, the optimal model is selected using the mean intersection-over-union (mIoU) as the evaluation index, calculated as follows:
mIoU = (1/(N×K)) Σi Σk |Pi,k ∩ Gi,k| / |Pi,k ∪ Gi,k|
Wherein, N is the number of samples, K is the number of lesion categories, and Pi,k and Gi,k respectively represent the detection probability map and the corresponding mask label of the k-th lesion of the i-th fundus color photograph sample.
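A minimal sketch of the mean intersection-over-union used for model selection, assuming the probability maps are binarized at 0.5, could be:

```python
# Minimal sketch of mIoU over N samples and K lesion categories.
# Assumption: probability maps are binarized at 0.5 before computing IoU.
import torch

def mean_iou(probs: torch.Tensor, masks: torch.Tensor, thr: float = 0.5) -> float:
    # probs, masks: (N, K, H, W)
    pred = (probs > thr).float()
    inter = (pred * masks).sum(dim=(2, 3))
    union = ((pred + masks) > 0).float().sum(dim=(2, 3))
    iou = inter / (union + 1e-7)        # avoid division by zero for empty classes
    return iou.mean().item()
```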
Preferably, the diabetic retinopathy screening in step S4 is based on the semantic segmentation attention network model: the four focus perception advanced features of different scales from the decoder are downsampled to the same size and concatenated along the channel dimension, a convolution layer operation is applied to obtain the screening feature, and a global pooling layer and a dense layer are used as the output module to obtain the diabetic retinopathy disease probability of the input fundus color photograph sample. In the fine-tuning training process, AdamW is used as the optimizer with a cosine annealing dynamic learning rate, where the initial learning rate is set to 3e-4, and cross entropy with label smoothing is used as the loss function; the calculation process is as follows:
L = -(1/N) Σi [ŷi·log(yi) + (1-ŷi)·log(1-yi)]
Wherein, N is the number of samples, yi and ŷi respectively represent the predicted disease probability score and the smoothed disease label of the i-th fundus color photograph sample; ŷi is calculated by the following formula:
ŷi = (1-ε)·qi + ε/C
Wherein, ε is the smoothing level parameter, set to 0.2 here; qi represents the original hard label, and C is the number of categories of the screening classification, which is 2 here.
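A minimal sketch of this label-smoothed cross entropy for the two-class screening task, assuming the binary form implied by the scalar disease probability yi, could be:

```python
# Minimal sketch of label-smoothed cross entropy for the binary screening head,
# following the epsilon=0.2, C=2 setting in the text; the binary form is an
# assumption based on the scalar disease probability y_i.
import torch

def smoothed_bce(y_pred: torch.Tensor, y_hard: torch.Tensor,
                 eps: float = 0.2, num_classes: int = 2) -> torch.Tensor:
    # y_pred: predicted disease probabilities in (0, 1); y_hard: 0/1 labels
    y_smooth = (1.0 - eps) * y_hard + eps / num_classes
    p = y_pred.clamp(1e-7, 1.0 - 1e-7)
    return -(y_smooth * torch.log(p) + (1.0 - y_smooth) * torch.log(1.0 - p)).mean()
```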
Compared with the prior art, the invention has the beneficial effects that:
Based on a mainstream backbone network as the encoder, the proposed focus perception module and feature retaining module can effectively capture and represent information related to diabetic retinopathy; the proposed modules are applied to lesion perception and screening deep neural network models, which give a lesion prediction map as effective interpretable evidence while automatically screening for the possibility of diabetic retinopathy; the invention performs well in both automatic diabetic retinopathy lesion detection and screening.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of a depth of focus neural network model structure according to the present invention;
FIG. 3 is a diagram of a screening deep neural network model structure of the present invention;
FIG. 4 is a comparison of the diabetic retinopathy lesion detection results of various methods according to embodiments of the present invention;
FIG. 5 is a ROC curve for diabetic retinopathy screening by various methods according to embodiments of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without creative efforts, are within the protection scope of the invention.
The invention is described in further detail below with reference to the accompanying drawings.
The invention relates to a method for identifying diabetic retinopathy focus images based on an attention model, which is realized by the following steps with reference to fig. 1:
S1, acquiring a diabetic retinopathy fundus color photograph data set, and performing medical image preprocessing on the original images;
S2, constructing an attention network model based on semantic segmentation to extract features;
S3, constructing a focus sensing module, a feature retaining module, a feature fusion module and a head attention module, extracting focus related information from the extracted features, and generating a focus detection probability map;
S4, using the focus detection probability map obtained by the attention model from the input fundus color photograph to guide eye disease screening, and obtaining a fundus photograph diagnosis result with diabetic retinopathy focus information.
Specifically, for step S1, the related datasets IDRiD and DDR are first obtained. IDRiD contains 81 fundus color photograph samples annotated with segmentation masks for four lesions (microaneurysms, hemorrhages, hard exudates and soft exudates), named IDRiD-Seg; DDR comprises 757 fundus color photograph samples annotated with the same lesion segmentation masks as IDRiD, named DDR-Seg; in addition, DDR contains 13673 fundus color photograph samples with diabetic retinopathy grading labels, from which a total of 11609 samples of the normal and non-proliferative grades were selected and designated DDR-Scr, wherein the non-proliferative grade contains the three specific grades of mild, moderate and severe. The two data sets are then each partitioned into a training set and a test set. Next, the preprocessing process is divided into the following steps:
A. Cutting redundant black background around the fundus color photograph by using an image processing technology to obtain a square normalized image;
B. Converting a sample image from the RGB color space to the Lab color space, extracting the luminance (L) channel and performing contrast-limited adaptive histogram equalization on it, recombining it with the original a and b color channels, and converting back to the RGB color space so as to enhance image details;
C. Scaling the image to H×W for input to the model, where H=W=512;
D. Performing min-max normalization so that image pixel values fall in the [0,1] interval, in order to speed up model training;
E. Augmenting the images in the training set with random-angle center rotation, random vertical flipping and random horizontal flipping, so as to prevent overfitting.
Specifically, for step S2, as shown in fig. 2, the attention network model based on semantic segmentation is composed of an encoder and a decoder, with ResNet-50 serving as the feature extraction module of the encoder.
Specifically, for step S3, the focus sensing module, the feature preserving module, the feature fusion module, and the head attention module constitute a decoder portion of the semantic segmentation-based attention network model.
Firstly, a preprocessed fundus color photograph sample is input, and the encoder extracts low-level features; specifically, the intermediate results f1, f2 and f3 of the first three residual layers and the result f4 of the last residual layer are output as low-level features of different scales and depths, where Hi′ = Wi′ = H/2^(i+1) and the channel number is Ci = 128×2^i, i = 1, 2, 3, 4.
Then, f4 passes through the head attention module and a focus sensing module to obtain the 0-th output of the decoder, expressed as xdec^0 = fLPB^0(fHAM(f4)), where xdec^0 has the size 16×16×256; a focus detection probability map with scale 16×16 is then obtained through a convolution with kernel size 3×3 and a number of output channels equal to the 4 lesion types, followed by a Sigmoid activation function layer, expressed as xout^0 = fout^0(xdec^0).
Next, f4 is passed through the feature retaining module to obtain xFPB, which, together with f3 and xdec^0, is fed into the feature fusion module to obtain the fusion feature; the fusion feature is then fed into a focus sensing module to obtain the 1st output of the decoder, expressed as xdec^1 = fLPB^1(fFFM^1(xdec^0, f3, xFPB)), where xdec^1 has the size 32×32×256; a focus detection probability map with scale 32×32 is then obtained through a convolution with kernel size 3×3 and a number of output channels equal to the 4 lesion types, followed by a Sigmoid activation function layer, expressed as xout^1 = fout^1(xdec^1).
And so on, focus perception advanced features xdec^0, xdec^1, xdec^2 and xdec^3 of four different scales and the corresponding focus detection probability maps xout^0, xout^1, xout^2 and xout^3 are obtained, where Hi″ = Wi″ = 16×2^i, C = 256, i = 0, 1, 2, 3. The overall process is defined by the formulas:
xdec^0 = fLPB^0(fHAM(f4))
xdec^i = fLPB^i(fFFM^i(xdec^(i-1), f(4-i), fFPB(f4))), i = 1, 2, 3
xout^i = fout^i(xdec^i), i = 0, 1, 2, 3
Wherein, f4 is the deep semantic feature output by the last layer of the encoder; fLPB^i(·), fFFM^i(·) and fFPB(·) respectively represent the i-th focus sensing module, feature fusion module and feature retaining module operations, with no weight sharing among the modules; fHAM(·) represents the head attention module operation; fout^i(·) is the i-th operation consisting of a convolution with kernel size 3×3 and a Sigmoid activation function layer; xout^i is the i-th group of focus detection probability maps.
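For illustration, the sketch below wires the modules sketched earlier into a decoder matching the scales described above (decoder outputs at 16, 32, 64 and 128 pixels with 256 channels); the 1×1 lateral projections used for channel alignment and the constructor arguments are illustrative assumptions.

```python
# Compact sketch assembling the modules sketched earlier into the decoder.
# Encoder features f1..f4 come from ResNet-50; the 1x1 lateral projections for
# channel alignment are an assumption not named in the text.
import torch.nn as nn

class LesionDecoder(nn.Module):
    def __init__(self, enc_chs=(256, 512, 1024, 2048), dec_ch=256, num_lesions=4):
        super().__init__()
        self.ham = HeadAttentionModule(enc_chs[3])
        self.fpb = FeaturePreservingBlock(enc_chs[3], dec_ch)
        self.lpb0 = LesionPerceptionBlock(enc_chs[3], dec_ch)
        self.lateral = nn.ModuleList(                      # channel alignment (assumption)
            [nn.Conv2d(c, dec_ch, 1) for c in (enc_chs[2], enc_chs[1], enc_chs[0])])
        self.ffm = nn.ModuleList(
            [FeatureFusionBlock(dec_ch, enc_chs[3], dec_ch) for _ in range(3)])
        self.lpb = nn.ModuleList(
            [LesionPerceptionBlock(dec_ch, dec_ch) for _ in range(3)])
        self.heads = nn.ModuleList(                        # fout: 3x3 conv + Sigmoid per scale
            [nn.Sequential(nn.Conv2d(dec_ch, num_lesions, 3, padding=1), nn.Sigmoid())
             for _ in range(4)])

    def forward(self, f1, f2, f3, f4):
        x = self.lpb0(self.ham(f4))                        # x_dec^0 at 16x16
        outs = [self.heads[0](x)]
        x_fpb = self.fpb(f4)
        for i, f_enc in enumerate((f3, f2, f1)):           # x_dec^1..3 at 32, 64, 128
            f_enc = self.lateral[i](f_enc)
            x = self.lpb[i](self.ffm[i](x, f_enc, f4, x_fpb))
            outs.append(self.heads[i + 1](x))
        return outs                                        # four lesion probability maps
```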
Then, xout^0, xout^1, xout^2 and xout^3 are upsampled to the original size 512×512 by bilinear interpolation, and the loss is calculated using a binary cross entropy function with positive sample weighting as the loss function, defined as follows:
L = -(1/(H×W)) Σi Σj [α·g(i,j)·log(xout(i,j)) + (1-g(i,j))·log(1-xout(i,j))]
Wherein, H and W are the height and width of the image respectively, g(i,j) represents the focus mask label, xout(i,j) represents the focus detection probability map, and α is the positive sample weight, set to 10 here;
The losses obtained from the four focus detection probability maps of different scales are then accumulated to obtain the total loss value, and the weights are optimized through back propagation.
Specifically, for step S4, as shown in fig. 3, NPDR denotes the probability of non-proliferative diabetic retinopathy. The screening deep neural network model keeps the focus perception deep neural network model unchanged. First, the four focus perception advanced features xdec^0, xdec^1, xdec^2 and xdec^3 of different scales from the decoder are downsampled to the same size by max pooling and concatenated along the channel dimension, and a convolution layer operation is applied to obtain the screening feature xscr, where the convolution layer consists of a convolution with kernel size 3×3 and 512 output channels, batch normalization and a ReLU activation function; a global pooling layer and a dense layer are then used as the output module: the results of global average pooling and global max pooling are added, flattened into a one-dimensional vector and fed into the dense layer to obtain the diabetic retinopathy disease probability of the input fundus color photograph sample. The loss is then calculated using the cross entropy function with label smoothing as the loss function, which is defined as follows:
L = -(1/N) Σi [ŷi·log(yi) + (1-ŷi)·log(1-yi)]
Wherein, N is the number of samples, yi and ŷi respectively represent the predicted disease probability score and the smoothed disease label of the i-th fundus color photograph sample; ŷi is calculated by the following formula:
ŷi = (1-ε)·qi + ε/C
Wherein, ε is the smoothing level parameter, set to 0.2 here; qi represents the original hard label, and C is the number of categories of the screening classification, which is 2 here.
And then performing weight optimization through back propagation.
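A minimal sketch of such a screening head on top of the (unchanged) lesion perception network could look as follows; the common pooling size of 16×16 and the two-logit dense output are assumptions.

```python
# Minimal sketch of the screening head: max-pool the four decoder features to a
# common size, concatenate, 3x3 conv (512 ch) + BN + ReLU, then the sum of global
# average and max pooling feeds a dense layer. The 16x16 target size and the
# two-logit output are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScreeningHead(nn.Module):
    def __init__(self, dec_ch: int = 256, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * dec_ch, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(512, num_classes)              # dense output layer

    def forward(self, dec_feats):                          # list of four x_dec^i tensors
        size = dec_feats[0].shape[-2:]                     # smallest scale, e.g. 16x16
        pooled = [F.adaptive_max_pool2d(f, size) for f in dec_feats]
        x_scr = self.conv(torch.cat(pooled, dim=1))        # screening feature xscr
        x = (F.adaptive_avg_pool2d(x_scr, 1) + F.adaptive_max_pool2d(x_scr, 1)).flatten(1)
        return self.fc(x)                                  # disease probability logits
```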
Specifically, after model training is completed, a fundus color photograph sample from the test set is input to obtain the four focus prediction maps and the diabetic retinopathy screening prediction probability, and the performance of the model is evaluated.
The test results were as follows:
Tables 1 and 2 are objective performance evaluations of the method (LANet) of the present invention on lesion detection tasks and of multiple mainstream segmentation models on two datasets, respectively.
TABLE 1 evaluation of lesion detection Performance for each method on DDR-Seg dataset
TABLE 2 evaluation of lesion detection Performance for each method on IDRiD-Seg dataset
Wherein LANet represents the cross-dataset test results of the model on the IDRiD-Seg dataset trained using the DDR-Seg dataset.
Fig. 4 is a visual comparison of the results of various methods for diabetic retinopathy lesion detection. By comparing the objective evaluation scores and subjective results of the various methods, it can be seen that the method of the present invention performs well on both data sets, both on small-sample data sets and in cross-dataset generalization.
Table 3 shows the results of the method (LASNet) of the present invention on the diabetic retinopathy screening task as compared to other methods for objective performance assessment.
TABLE 3 screening Performance evaluation for methods on DDR-Scr datasets
Fig. 5 is a ROC curve for various methods of screening for diabetic retinopathy. By comparison, the method has better performance than other methods.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
While the foregoing embodiments have shown and described the practice of the invention, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.