Technical Field
The invention relates to techniques in the fields of computer vision, medicine, Transformers, and deep learning, and in particular to a colon polyp segmentation method based on deep learning.
Background
As the world's population ages and risk factors continue to increase, the number of colorectal cancer patients is rising globally. Most colorectal cancers arise from the malignant transformation of benign adenomas. Adenomas appear as colon polyps in their early stages and can progress to cancer. Early polyp screening can therefore greatly reduce the incidence of colorectal cancer. Among the various screening methods, colonoscopy is regarded as the gold standard for adenoma screening.
Early automatic polyp segmentation methods typically relied on nonlinear diffusion filtering, boundary detection with shape priors, clustering, and similar techniques. However, these traditional polyp segmentation methods are difficult to apply widely in clinical settings because of their low accuracy, strong data specificity, and heavy reliance on prior knowledge.
Deep learning is a branch of machine learning that aims to learn the intrinsic patterns and representation hierarchies of sample data and is applicable to a wide range of downstream tasks. Deep networks typically capture low-level representations from the input data and progressively extract features, forming more abstract high-level representations and attribute categories. Deep learning is now widely used in recommendation and search, natural language processing, object detection, semantic segmentation, image generation, and other fields.
Several deep-learning-based colon polyp segmentation methods already exist. Most of them follow an encoder-decoder architecture: the encoder extracts features from the training images, and the decoder outputs a binary map of polyp pixel locations to achieve polyp segmentation and detection. However, most existing methods are difficult to apply widely in clinical environments, mainly because of two problems: insufficient feature utilization, and the semantic conflicts and information redundancy that arise when features from different levels are fused. An improved colon polyp segmentation method based on feature fusion and attention mechanisms can therefore make full use of the features extracted by the encoder and alleviate the conflicts and redundancy caused by fusing features from different levels.
Summary of the Invention
The purpose of this invention is to explore in depth the role of feature fusion schemes and attention mechanisms in automatic colon polyp segmentation.
To achieve the above objective, the technical solution adopted by the present invention is a clinical diagnosis method for potentially cancerous polyps based on feature fusion and an attention mechanism, comprising the following steps:
1) Divide the training set and test set required for automatic polyp segmentation from five public clinical colonoscopy polyp segmentation datasets.
2) Preprocess the datasets: resize the divided datasets to a uniform size and normalize the training set.
3) Feed the data preprocessed in step 2) into a neural network implemented with the PyTorch open-source framework and extract image features from the training set.
4) Compute the loss between the polyp location prediction maps output by the deep network model and the labels annotated by clinical experts, train and optimize the automatic polyp segmentation model, and record the model parameters when performance reaches its optimum.
5) Resize the test-set images to the uniform size, load the saved weights that performed best during training, compute the final segmentation prediction from the prediction maps output by the first and second layers of the model, and obtain the polyp segmentation image.
Specifically, the datasets used in step 1) are Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300. Kvasir contains 1000 colonoscopy images and their masks, with resolutions ranging from 332×487 to 1920×1072 pixels. CVC-ClinicDB contains 612 image frames extracted from 29 different colonoscopy sequences, with segmentation masks annotated by clinical experts; the image resolution is 384×288. CVC-ColonDB consists of 380 still images with a resolution of 574×500. ETIS contains 196 polyp images at a resolution of 1225×966. CVC-300 contains 60 colonoscopy images at a resolution of 574×500. The training set consists of 612 images from the Kvasir dataset and 838 images from the CVC-ClinicDB dataset. The test set consists of the remaining data from the five datasets.
In step 2), the divided training set is resized to 352×352 and normalized channel-wise with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
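As a minimal sketch, the resizing and normalization above can be expressed with torchvision-style transforms. The pipeline below is an illustration under that assumption; the transform names are not taken from the patent, and any additional data augmentation used in training is not shown.

```python
# Illustrative preprocessing for step 2), assuming torchvision is available.
import torchvision.transforms as T

image_transform = T.Compose([
    T.Resize((352, 352)),                        # unify the input size
    T.ToTensor(),                                # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],      # channel-wise normalization
                std=[0.229, 0.224, 0.225]),
])

mask_transform = T.Compose([
    T.Resize((352, 352)),                        # masks are resized but not normalized
    T.ToTensor(),
])
```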
In step 3), the network is shown in Figure 1 and described in detail below:
The network adopts an encoder-decoder structure. The encoder uses a PVTv2 network to extract features from the input data and produces four feature maps of different resolutions, denoted x1 to x4, where x1 is a shallow feature and x2 to x4 are deep features. x1 is passed through a convolutional layer with kernel size 1 that reduces its channel count to 32 and is then fed into the multi-scale attention module. x2 to x4 are likewise reduced to 32 channels by convolutional layers with kernel size 1 and then fed into the deep feature enhancement module, which fuses features from different levels and alleviates conflicts and redundancy. The decoder receives the deepest encoder feature map, and its global attention module fuses the decoder feature map with the skip-connection feature map processed at the same level. The decoder upsamples the feature maps layer by layer, gradually restoring a resolution consistent with the input image.
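A short sketch of the channel reduction applied to the four encoder feature maps follows. The PVTv2 backbone itself is assumed to be provided elsewhere and is not reproduced; the stage widths (64, 128, 320, 512) correspond to the PVTv2-b2 variant and are an assumption, since the text does not name the variant.

```python
# Illustrative 1x1 channel reduction for the four PVTv2 feature maps.
import torch
import torch.nn as nn

class ReduceToCommonWidth(nn.Module):
    def __init__(self, in_channels=(64, 128, 320, 512), out_channels: int = 32):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # feats: [x1, x2, x3, x4] from shallow to deep, each of shape B x C_i x H_i x W_i
        return [proj(x) for proj, x in zip(self.reduce, feats)]
```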
Multi-scale attention module: its structure is shown in Figure 2. The encoder feature x1, after channel reduction by a convolutional layer with kernel size 1, is fed into the multi-scale module through a residual connection. The multi-scale module consists of convolutional layers with increasing kernel sizes and a dilated convolution with kernel size 3. The feature map produced by the multi-scale module enters a channel attention module composed of a global pooling layer, a max pooling layer, and a Sigmoid; the result is then fed into a spatial attention module composed of spatial per-pixel averaging, spatial per-pixel maximum, and a Sigmoid. The process is described by the following formulas:
T = Attention_s(Attention_c(M))
Attention_c(x) = x ⊙ Sigmoid(MaxPool(x) + AvgPool(x))
Attention_s(x) = x ⊙ Sigmoid(Concat(MaxSpatial(x), AvgSpatial(x)))
Here T and M denote, respectively, the output feature map of the multi-scale attention module and the feature map produced by the multi-scale module. Attention_c(x) denotes the channel attention mechanism and Attention_s(x) denotes the spatial attention mechanism; x denotes the feature map fed into the attention mechanism. MaxPool denotes the max pooling layer and AvgPool denotes the average pooling layer. ⊙ denotes element-wise multiplication, and Concat denotes concatenation along the channel dimension of the feature maps.
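The channel- and spatial-attention steps defined by these formulas can be sketched as below. Assumptions: MaxPool/AvgPool are global poolings over the spatial dimensions, and a 7×7 convolution merges the two pooled maps in the spatial branch (the text does not state this merging layer explicitly, but it is needed to obtain a single-channel weight map).

```python
# Hedged sketch of Attention_c and Attention_s from the formulas above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention_c(x) = x * Sigmoid(MaxPool(x) + AvgPool(x))
        w = torch.sigmoid(F.adaptive_max_pool2d(x, 1) + F.adaptive_avg_pool2d(x, 1))
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.merge = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention_s(x) = x * Sigmoid(Concat(MaxSpatial(x), AvgSpatial(x)))
        max_map = x.max(dim=1, keepdim=True).values   # per-pixel maximum over channels
        avg_map = x.mean(dim=1, keepdim=True)         # per-pixel average over channels
        w = torch.sigmoid(self.merge(torch.cat([max_map, avg_map], dim=1)))
        return x * w

# T = Attention_s(Attention_c(M)) for the feature map M produced by the multi-scale module
```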
Deep feature enhancement module: its structure is shown in Figure 3. x4 is upsampled by bilinear interpolation to twice its original resolution. The upsampled feature map is fed through a convolutional layer with kernel size 3, a batch normalization layer, and a ReLU activation layer. The resulting feature map is multiplied element-wise with x3, and the product is passed through parallel channel attention and spatial attention mechanisms to alleviate the conflicts and redundancy introduced by the fusion. x4 is also upsampled by bilinear interpolation to four times its original resolution and passed through a convolutional layer with kernel size 3, a batch normalization layer, and a ReLU activation layer; x3 is upsampled by bilinear interpolation to twice its original resolution and adjusted by the same kind of convolution, batch normalization, and ReLU layers. The resized x4 and x3 are multiplied element-wise with x2. The fused feature maps at the second and third encoder levels are each fed into parallel channel attention and spatial attention modules. After these steps, the features from different levels complement one another and the conflicts and redundancy of the fusion process are alleviated.
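A hedged sketch of this module is given below, assuming that x2, x3, and x4 have already been reduced to 32 channels. The two attention branches are applied in parallel and fused by addition; that fusion rule is an assumption, since the text only states that the branches are parallel.

```python
# Illustrative deep feature enhancement module (not the patented implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class ParallelAttention(nn.Module):
    """Channel and spatial attention applied in parallel, fused by addition (assumption)."""
    def __init__(self):
        super().__init__()
        self.merge = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = x * torch.sigmoid(F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1))
        stats = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)
        sa = x * torch.sigmoid(self.merge(stats))
        return ca + sa

class DeepFeatureEnhancement(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.x4_to_x3 = conv_bn_relu(channels)
        self.x4_to_x2 = conv_bn_relu(channels)
        self.x3_to_x2 = conv_bn_relu(channels)
        self.att3 = ParallelAttention()
        self.att2 = ParallelAttention()

    def forward(self, x2, x3, x4):
        # enhance the third-level feature with the upsampled deepest feature
        x4_up2 = F.interpolate(x4, scale_factor=2, mode='bilinear', align_corners=False)
        e3 = self.att3(x3 * self.x4_to_x3(x4_up2))

        # enhance the second-level feature with both deeper features
        x4_up4 = F.interpolate(x4, scale_factor=4, mode='bilinear', align_corners=False)
        x3_up2 = F.interpolate(x3, scale_factor=2, mode='bilinear', align_corners=False)
        e2 = self.att2(x2 * self.x4_to_x2(x4_up4) * self.x3_to_x2(x3_up2))
        return e2, e3
```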
Global attention module: its structure is shown in Figure 4. The decoder feature map Di+1 is passed through a convolutional layer with kernel size 1 and a max pooling layer to adjust its dimensions, yielding Q, K, and V. The feature map Ti from the encoder at the same level is likewise transformed by convolutional and pooling layers into Ti'. Next, K is combined with Ti' and with V by inner products to compute the pixel-wise relationships between them. The global attention weight f is obtained by adding these pixel relationships and applying Softmax(·).
Here, f denotes the computed global attention weight, K and V denote the dimension-adjusted decoder feature maps, and Ti' denotes the dimension-adjusted feature map from the encoder at the same level.
Next, the global attention weight f is multiplied with Q. The reweighted feature map is passed through a convolutional layer with kernel size 1 to adjust the number of channels and is added to Di+1 as a residual, yielding the feature map Y. Finally, Ti is concatenated with Y along the channel dimension to obtain Di. This process is expressed as:
Y = Conv(f · Q) + Di+1
Di = Concat(Y, Ti)
Here, Y is the feature map after the residual connection, Di+1 is the input feature map of the global attention module, Conv denotes the convolution operation, f is the computed global attention weight, Q is the dimension-adjusted decoder feature map, Ti denotes the encoder feature from the same level, and Di denotes the output of the global attention module.
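A heavily hedged, non-local-style reading of this module is sketched below. The text does not give tensor shapes or the exact pooling used when forming Q, K, and V, so this sketch simply flattens the spatial dimensions and omits the max-pooling step; treat it as an illustration of the described computation rather than the patented design.

```python
# Illustrative global attention module under the assumptions stated above.
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_t = nn.Conv2d(channels, channels, kernel_size=1)
        self.adjust = nn.Conv2d(channels, channels, kernel_size=1)  # channel adjustment before the residual

    def forward(self, d_next: torch.Tensor, t_i: torch.Tensor) -> torch.Tensor:
        # d_next: decoder feature D_{i+1}; t_i: same-level encoder feature T_i (same spatial size assumed)
        b, c, h, w = d_next.shape
        q = self.to_q(d_next).flatten(2)          # B x C x N
        k = self.to_k(d_next).flatten(2)          # B x C x N
        v = self.to_v(d_next).flatten(2)          # B x C x N
        t = self.to_t(t_i).flatten(2)             # B x C x N  (T_i')

        # pixel relations between K and T_i' and between K and V, summed, then Softmax -> f
        f = torch.softmax(torch.bmm(k.transpose(1, 2), t) +
                          torch.bmm(k.transpose(1, 2), v), dim=-1)   # B x N x N

        y = torch.bmm(q, f.transpose(1, 2)).view(b, c, h, w)   # f applied to Q
        y = self.adjust(y) + d_next                             # residual connection -> Y
        return torch.cat([y, t_i], dim=1)                       # D_i = Concat(Y, T_i)
```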
In step 4), the loss between the polyp location prediction maps output by the deep network model and the labels annotated by clinical experts is computed using a weighted IoU loss and a weighted BCE loss. These losses assign weights to all pixels and pay more attention to the pixels that have a larger impact on the loss computation. The basic loss function is written as:
Lb(X, Y) = L_IoU^w(X, Y) + L_BCE^w(X, Y)
where X denotes the model's predicted output, Y denotes the ground truth, L_IoU^w denotes the weighted IoU loss, and L_BCE^w denotes the weighted BCE loss. As shown in Figure 1, the model outputs three prediction maps P1, P2, and P3, and the overall training loss is written as:
Lall = Lb(P1, G) + Lb(P2, G) + 0.5 × Lb(P3, G)
Here G denotes the ground truth. P1 and P2 are the main prediction maps and take part in both the loss computation during training and the segmentation prediction. P3 is an auxiliary prediction map and is used only as part of the training loss.
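The training loss can be sketched as below. The pixel-weighting scheme follows the widely used "structure loss" formulation (weighted BCE plus weighted IoU, with larger weights near object boundaries); the patent text names the two loss terms but does not spell out the weighting, so that part is an assumption. Predictions are assumed to be logits.

```python
# Hedged sketch of L_b and L_all under the assumptions stated above.
import torch
import torch.nn.functional as F

def weighted_bce_iou_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # larger weights where the mask differs from its local average (i.e. near boundaries)
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(p1, p2, p3, gt):
    # L_all = L_b(P1, G) + L_b(P2, G) + 0.5 * L_b(P3, G)
    return (weighted_bce_iou_loss(p1, gt) +
            weighted_bce_iou_loss(p2, gt) +
            0.5 * weighted_bce_iou_loss(p3, gt))
```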
Brief Description of the Drawings
Figure 1 is the overall framework diagram of the network model of the present invention;
Figure 2 is a diagram of the multi-scale attention module;
Figure 3 is a diagram of the deep feature enhancement module;
Figure 4 is a diagram of the global attention module.