CN113033570B - An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion - Google Patents

An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion

Info

Publication number
CN113033570B
Authority
CN
China
Prior art keywords
image
feature
convolution
output
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110344461.0A
Other languages
Chinese (zh)
Other versions
CN113033570A (en)
Inventor
高世伟
张长柱
张皓
王祝萍
黄超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202110344461.0A
Publication of CN113033570A
Application granted
Publication of CN113033570B
Status: Active (current)
Anticipated expiration


Abstract

The invention relates to an image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion, comprising the following steps: extracting image features in a deep convolutional neural network using an improved atrous convolution method; cascading and fusing the extracted deep feature images with shallow feature images to compensate for the loss of spatial information; learning boundary information from the multi-stage feature images through boundary refinement, then fusing them and restoring the original image resolution to generate a predicted segmentation map; and training the network with a cross-entropy loss function and evaluating model performance with mIoU. The invention improves on existing uses of atrous convolution and designs a deformable spatial pyramid structure, thereby improving the model's image feature extraction. It also designs a multi-level feature information fusion structure for image resolution restoration that fully exploits the local and global information contained in different levels, and introduces boundary refinement, effectively improving the accuracy of image semantic segmentation.

Description

Translated from Chinese
An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion

Technical Field

The invention relates to the field of computer vision and intelligent pattern recognition systems, and in particular to an image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion.

Background

Automated scene understanding is an important goal of modern computer vision. Image semantic segmentation is a fundamental scene understanding task in computer vision: it takes raw data (e.g., planar images) as input and converts it into masks with highlighted regions of interest, dividing the image into multiple regions that carry different semantic information. In recent years, owing to the excellent performance of deep convolutional neural networks on semantic segmentation tasks, segmentation quality has improved significantly compared with traditional methods such as GrabCut and N-Cut. Good segmentation algorithms are crucial for many practical applications, for example autonomous driving, medical image processing, computational photography, image search engines, and augmented reality. All of these applications require highly accurate pixel-wise predictions.

However, current semantic segmentation methods based on deep convolutional neural networks cannot achieve high prediction and classification accuracy in their segmentation results, owing to problems such as reduced image resolution and loss of global context information caused by repeated pooling and downsampling.

Summary of the Invention

The purpose of the present invention is to provide an image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion. The method effectively improves the information utilization and effectiveness of feature extraction, while also enriching shallow semantic information and learning global image context, thereby improving the accuracy of semantic segmentation of two-dimensional images.

The structure based on the improved atrous convolution method and multi-level feature information fusion improves the image segmentation effect without significantly increasing the computational load of the system. Compared with simply stacking convolutional layers, designing more suitable structures and methods for image feature extraction and spatial information compensation reduces the loss of feature information during downsampling, effectively improves pixel prediction accuracy, and enhances the image semantic segmentation effect.

To achieve the above objective, the technical solution adopted by the present invention is as follows:

An image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion, comprising the following steps:

S1: extract image features in a deep convolutional neural network using an improved atrous convolution method;

S2: cascade and fuse the extracted deep feature images with shallow feature images to compensate for the loss of spatial information;

S3: learn boundary information from the multi-stage feature images through boundary refinement, fuse them and restore the original image resolution to generate a predicted segmentation map;

S4: train the network with a cross-entropy loss function and evaluate model performance with mIoU.

The specific implementation of S1 comprises the following steps:

S1.1: take ResNet-101 as the backbone network and attach the improved skip-connected atrous convolution module after the third sampling module. The module contains three consecutive atrous convolution layers whose dilation rates are changed according to the resolution of the input image, with forward skip connections established between the different convolution layers; this further enlarges the receptive field without further shrinking the image and reduces information loss;

S1.2: feed the image processed by the skip-connected atrous convolution module into the improved deformable spatial pyramid pooling module. This combines the advantages of deformable convolution, whose receptive field adapts to changes in target scale and which aggregates information flexibly, with the advantages of standard multi-scale atrous convolution sampling, which can effectively classify arbitrary regions of the image, improving the model's ability to learn target deformations at the cost of a small increase in model complexity;

S1.3: retain the different levels of feature information contained in feature images of different resolutions at different stages of the downsampling process.

The specific implementation of S2 comprises the following steps:

S2.1: pass the feature layer processed by the skip-connected atrous convolution module through a 1×1 convolution and combine it with the feature image extracted at the deepest layer to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.2: combine the feature map output in S2.1 with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.3: upsample the feature image output in S2.2 by a factor of two through bilinear interpolation and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.4: upsample the feature image output in S2.3 by a factor of two through bilinear interpolation and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer.

The specific implementation of S3 comprises the following steps:

S3.1: upsample the deepest output feature image by a factor of four through bilinear interpolation;

S3.2: upsample the feature image output by the layer in S2.1 by a factor of four through bilinear interpolation;

S3.3: upsample the feature image output by the layer in S2.2 by a factor of four through bilinear interpolation;

S3.4: upsample the feature image output by the layer in S2.3 by a factor of two through bilinear interpolation;

S3.5: refine the boundaries of the feature images output in S2.4, S3.1, S3.2, S3.3 and S3.4 with the BR module and fuse them; after the two further steps of a 3×3 convolution and four-fold bilinear upsampling, restore the original image resolution to obtain the final predicted segmentation map.

The specific implementation of S4 comprises the following steps:

S4.1: compute the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map of the dataset, update the parameter weights of the model using the backpropagation algorithm, and obtain the final semantic segmentation model after training on the training set;

S4.2: test the predictive performance of the model on the test set of the dataset using the mIoU metric.

Owing to the above technical solution, the present invention has the following advantages and effects compared with the prior art:

The present invention fully considers the benefits and drawbacks of atrous convolution for semantic segmentation, improves on the existing uses of atrous convolution, and designs a deformable spatial pyramid structure, improving the model's image feature extraction. Moreover, compared with common upsampling methods, a multi-level feature information fusion structure is designed for image resolution restoration that fully exploits the local and global information contained in different levels, and boundary refinement is introduced, effectively improving the accuracy of image semantic segmentation.

Brief Description of the Drawings

Fig. 1 is a flowchart of the overall semantic segmentation method proposed by the present invention;

Fig. 2 is the network model diagram of the overall semantic segmentation algorithm proposed by the present invention;

Fig. 3 shows the skip-connected atrous convolution module in the network structure of the present invention;

Fig. 4 shows visualization results of the algorithm of the present invention on the Cityscapes dataset.

Detailed Description of Embodiments

The present invention is further described below with reference to the accompanying drawings and an embodiment:

An image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion comprises the following steps, as shown in Fig. 1:

S1: extract image features in a deep convolutional neural network using the improved atrous convolution method, as shown in the dashed box labeled "S1" in Fig. 2:

S1.1: first, take ResNet-101 as the backbone network and attach the improved skip-connected atrous convolution module after the third sampling module, where "Conv" stands for "Convolution" and denotes a convolutional layer. Fig. 3 shows the structure of this module: it contains three consecutive atrous convolution layers whose dilation rates are changed according to the resolution of the input image; the dilation rates of the three layers in Fig. 3 are 2, 4 and 8 in turn, with forward skip connections established between the different convolution layers. This further enlarges the receptive field without further shrinking the image and reduces information loss;
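For illustration, the following minimal sketch gives one plausible reading of this module, written with tf.keras to match the TensorFlow framework used in the experiments below; the filter count, the ReLU activations, and the use of elementwise addition to merge the skip connections are assumptions not fixed by the patent text:

```python
import tensorflow as tf

def skip_connected_atrous_block(x, filters=256, rates=(2, 4, 8)):
    """Three consecutive atrous convolutions (rates 2, 4, 8 as in Fig. 3)
    with forward skip connections between the layers."""
    # Channel alignment so the skip connections can be summed
    # (assumed merge operation; the patent only states that forward
    # skip connections are established between the layers).
    x = tf.keras.layers.Conv2D(filters, 1, padding="same")(x)
    outputs = [x]
    for rate in rates:
        # Each atrous layer sees the merged outputs of all earlier layers,
        # enlarging the receptive field without shrinking the image.
        inp = outputs[0] if len(outputs) == 1 else tf.keras.layers.Add()(outputs)
        y = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   dilation_rate=rate, activation="relu")(inp)
        outputs.append(y)
    return outputs[-1]
```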

S1.2: feed the image processed by the skip-connected atrous convolution module into the improved deformable spatial pyramid pooling module, composed in Fig. 2 of three atrous convolution layers, one deformable convolution layer and a max-pooling layer. This combines the advantages of deformable convolution, whose receptive field adapts to changes in target scale and which aggregates information flexibly, with the advantages of standard multi-scale atrous convolution sampling, which can effectively classify arbitrary regions of the image, improving the model's ability to learn target deformations at the cost of a small increase in model complexity;
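A hedged sketch of how this module could be assembled from the parts named in Fig. 2 (three atrous branches, one deformable branch, one pooling branch). The dilation rates 6/12/18 and the branch width are assumptions, and since core TensorFlow ships no deformable convolution, that branch is stubbed with an ordinary convolution that a real deformable layer would replace:

```python
def deformable_spatial_pyramid_pooling(x, filters=256):
    """Parallel multi-scale branches, concatenated and projected by 1x1 conv."""
    # Three atrous convolution branches (rates are assumptions).
    b1 = tf.keras.layers.Conv2D(filters, 3, padding="same", dilation_rate=6)(x)
    b2 = tf.keras.layers.Conv2D(filters, 3, padding="same", dilation_rate=12)(x)
    b3 = tf.keras.layers.Conv2D(filters, 3, padding="same", dilation_rate=18)(x)
    # Deformable branch: placeholder only; core TensorFlow provides no
    # deformable convolution, so a custom or third-party layer goes here.
    b4 = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    # Resolution-preserving max-pooling branch.
    b5 = tf.keras.layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    b5 = tf.keras.layers.Conv2D(filters, 1, padding="same")(b5)
    merged = tf.keras.layers.Concatenate()([b1, b2, b3, b4, b5])
    return tf.keras.layers.Conv2D(filters, 1, padding="same")(merged)
```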

S1.3: retain the different levels of feature information contained in feature images of different resolutions at different stages of the downsampling process.

S2: cascade and fuse the extracted deep feature images with shallow feature images to compensate for the loss of spatial information, as shown in the dashed box labeled "S2" in Fig. 2;

S2.1: as shown in Fig. 2, pass the feature layer processed by the skip-connected atrous convolution module through a 1×1 convolution and combine it with the feature image extracted at the deepest layer of the network model. In the figure, "C" stands for "Concatenate" and denotes the fusion of feature maps from different levels, used to supplement the semantic information of the shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.2: combine the feature map output in S2.1 with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.3: upsample the feature image output in S2.2 by a factor of two through bilinear interpolation (i.e., "upsample by 2") and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.4: upsample the feature image output in S2.3 by a factor of two through bilinear interpolation and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer.
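Steps S2.1 to S2.4 share one pattern: optionally upsample the deeper map by two, concatenate with the shallower map (the "C" node in Fig. 2), and project with a 1×1 convolution. A minimal sketch of that pattern, with the channel width and the wiring of the backbone stages (f_skip, f_deep, f3, f2, f1) assumed for illustration:

```python
def fusion_stage(deep, shallow, filters=256, upsample=False):
    """One S2 fusion stage: (optional) bilinear x2 upsampling,
    concatenation with the shallower feature map, then 1x1 conv."""
    if upsample:
        deep = tf.keras.layers.UpSampling2D(size=2,
                                            interpolation="bilinear")(deep)
    merged = tf.keras.layers.Concatenate()([deep, shallow])
    return tf.keras.layers.Conv2D(filters, 1, padding="same")(merged)

# Hypothetical wiring: f_skip is the skip-connected atrous output,
# f_deep the pyramid-pooling output, f3/f2/f1 shallower backbone stages.
# s21 = fusion_stage(f_deep, tf.keras.layers.Conv2D(256, 1)(f_skip))   # S2.1
# s22 = fusion_stage(s21, f3)                                          # S2.2
# s23 = fusion_stage(s22, f2, upsample=True)                           # S2.3
# s24 = fusion_stage(s23, f1, upsample=True)                           # S2.4
```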

S3: learn boundary information from the multi-stage feature images through boundary refinement, fuse them and restore the original image resolution to generate a predicted segmentation map, as shown in the dashed box labeled "S3" in Fig. 2;

S3.1: upsample the deepest output feature image by a factor of four through bilinear interpolation;

S3.2: upsample the feature image output by the layer in S2.1 by a factor of four through bilinear interpolation;

S3.3: upsample the feature image output by the layer in S2.2 by a factor of four through bilinear interpolation;

S3.4: upsample the feature image output by the layer in S2.3 by a factor of two through bilinear interpolation;

S3.5: refine the boundaries of the feature images output in S2.4, S3.1, S3.2, S3.3 and S3.4 with the BR (Boundary Refinement) module and fuse them; after the two further steps of a 3×3 convolution and four-fold bilinear upsampling, restore the original image resolution to obtain the final predicted segmentation map.
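The patent does not spell out the internals of the BR block. The sketch below assumes the common residual design for boundary refinement (identity plus two stacked 3×3 convolutions) and a 21-class output head; both are assumptions:

```python
def boundary_refinement(x):
    """Assumed BR block: residual correction that sharpens boundaries."""
    filters = x.shape[-1]
    r = tf.keras.layers.Conv2D(filters, 3, padding="same",
                               activation="relu")(x)
    r = tf.keras.layers.Conv2D(filters, 3, padding="same")(r)
    return tf.keras.layers.Add()([x, r])

def prediction_head(branches, num_classes=21):
    """S3.5: refine each branch (S2.4 plus the upsampled S3.1-S3.4
    outputs), fuse, then 3x3 conv and x4 bilinear upsampling back to
    the original image resolution."""
    refined = [boundary_refinement(b) for b in branches]
    fused = tf.keras.layers.Concatenate()(refined)
    logits = tf.keras.layers.Conv2D(num_classes, 3, padding="same")(fused)
    return tf.keras.layers.UpSampling2D(size=4,
                                        interpolation="bilinear")(logits)
```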

S4: train the network with a cross-entropy loss function and evaluate model performance with mIoU.

S4.1: compute the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map of the dataset, update the parameter weights of the model using the backpropagation algorithm, and obtain the final semantic segmentation model after training on the training set;

S4.2: test the predictive performance of the model on the test set of the dataset using pixel accuracy and mIoU.
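A hedged sketch of the S4 training and evaluation loop. Here `model`, `train_ds` and `val_ds` are hypothetical placeholders for the network above and tf.data pipelines of (image, label-map) pairs; the optimizer settings and epoch count are assumptions, and 21 classes matches PASCAL VOC 2012 (20 classes plus background):

```python
NUM_CLASSES = 21  # PASCAL VOC 2012: 20 classes plus one background class

def mean_iou(y_true, y_pred_labels, num_classes=NUM_CLASSES):
    """mIoU from a confusion matrix over integer label maps (S4.2)."""
    cm = tf.math.confusion_matrix(tf.reshape(y_true, [-1]),
                                  tf.reshape(y_pred_labels, [-1]),
                                  num_classes=num_classes, dtype=tf.float64)
    tp = tf.linalg.diag_part(cm)
    fp = tf.reduce_sum(cm, axis=0) - tp
    fn = tf.reduce_sum(cm, axis=1) - tp
    return tf.reduce_mean(tp / (tp + fp + fn + 1e-10))

def train_and_evaluate(model, train_ds, val_ds, epochs=50):
    # S4.1: cross-entropy loss; fit() runs the backpropagation updates.
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(train_ds, epochs=epochs)
    # S4.2: per-batch mIoU averaged over the evaluation set
    # (accumulating one global confusion matrix would be the exact variant).
    scores = [mean_iou(y, tf.argmax(model(x), axis=-1)) for x, y in val_ds]
    return tf.reduce_mean(tf.stack(scores))
```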

Experiments conducted according to the method of the present invention are described below to illustrate its predictive performance.

Test environment: Ubuntu 16.04; NVIDIA GTX 1080 Ti GPU; Python 3.5; TensorFlow framework.

Test dataset: the selected dataset is PASCAL VOC 2012, an image dataset for image segmentation in computer vision tasks. It covers four categories (vehicles, household, animals, person), further subdivided into 20 subcategories (plus one background class). The dataset contains 1464 training images, 1449 validation images and 1456 test images.

Test metric: the present invention uses mIoU as its performance evaluation metric. mIoU is the ratio of the intersection of the predicted region and the ground-truth region to their union, averaged over classes. Comparing this metric across existing algorithms demonstrates the good results achieved by the present invention in image semantic segmentation.
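Written out per class c over C classes, with TP_c, FP_c and FN_c the per-class true-positive, false-positive and false-negative pixel counts, this definition reads:

mIoU = (1/C) * Σ_{c=1}^{C} TP_c / (TP_c + FP_c + FN_c)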

The test results are as follows:

Table 1. Performance comparison of the present invention with different atrous convolution dilation rates in the deformable spatial pyramid pooling module; the comparison shows that appropriate parameter settings improve network performance.

[Table 1 is reproduced as an image in the original document.]

Table 2. Performance comparison of the present invention with the multi-level feature information fusion and boundary refinement modules added, demonstrating the effectiveness of the network design.

[Table 2 is reproduced as an image in the original document.]

Table 3. Performance comparison between the present invention and other algorithms on the PASCAL VOC 2012 dataset.

[Table 3 is reproduced as an image in the original document.]

The above comparative data show that the mIoU of the present invention is significantly improved over existing algorithms.

It should be emphasized that the examples described herein are illustrative; their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly, and the present invention includes but is not limited to the examples described in the detailed embodiments. All equivalent changes or modifications made according to the spirit and essence of the present invention shall fall within the protection scope of the present invention.

Claims (4)

Translated from Chinese
1. An image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion, characterized by comprising the following steps:

S1: extract image features in a deep convolutional neural network using an improved atrous convolution method;

S2: cascade and fuse the extracted deep feature images with shallow feature images to compensate for the loss of spatial information;

S3: learn boundary information from the multi-stage feature images through boundary refinement, fuse them and restore the original image resolution to generate a predicted segmentation map;

S4: train the network with a cross-entropy loss function and evaluate model performance with mIoU;

wherein the specific implementation of S1 comprises the following steps:

S1.1: take ResNet-101 as the backbone network and attach the improved skip-connected atrous convolution module after the third sampling module, the module containing three consecutive atrous convolution layers whose dilation rates are changed according to the resolution of the input image, with forward skip connections established between the different convolution layers, further enlarging the receptive field without further shrinking the image and reducing information loss;

S1.2: feed the image processed by the skip-connected atrous convolution module into the improved deformable spatial pyramid pooling module, combining the advantages of deformable convolution, whose receptive field adapts to changes in target scale and which aggregates information flexibly, with the advantages of standard multi-scale atrous convolution sampling, which can effectively classify arbitrary regions of the image, improving the model's ability to learn target deformations at the cost of a small increase in model complexity;

S1.3: retain the different levels of feature information contained in feature images of different resolutions at different stages of the downsampling process.

2. The image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion according to claim 1, characterized in that the specific implementation of S2 comprises the following steps:

S2.1: pass the feature layer processed by the skip-connected atrous convolution module through a 1×1 convolution and combine it with the feature image extracted at the deepest layer to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.2: combine the feature map output in S2.1 with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.3: upsample the feature image output in S2.2 by a factor of two through bilinear interpolation and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer;

S2.4: upsample the feature image output in S2.3 by a factor of two through bilinear interpolation and combine it with the feature image output by the previous module to supplement the semantic information of this shallow feature image; pass the output feature map through a 1×1 convolution as the output of this layer.

3. The image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion according to claim 1, characterized in that, in order to fuse feature images from different levels carrying different spatial and semantic information at the same resolution, the multi-level output feature images are uniformly brought to one quarter of the original image resolution through bilinear interpolation, the specific implementation of S3 comprising the following steps:

S3.1: upsample the deepest output feature image by a factor of four through bilinear interpolation;

S3.2: upsample the feature image output by the layer in S2.1 by a factor of four through bilinear interpolation;

S3.3: upsample the feature image output by the layer in S2.2 by a factor of four through bilinear interpolation;

S3.4: upsample the feature image output by the layer in S2.3 by a factor of two through bilinear interpolation;

S3.5: refine the boundaries of the feature images output in S2.4, S3.1, S3.2, S3.3 and S3.4 with the BR module and fuse them; after a 3×3 convolution and four-fold bilinear upsampling, restore the original image resolution to obtain the final predicted segmentation map.

4. The image semantic segmentation method based on improved atrous convolution and multi-level feature information fusion according to claim 1, characterized in that the specific implementation of S4 comprises the following steps:

S4.1: compute the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map of the dataset, update the parameter weights of the model using the backpropagation algorithm, and obtain the final semantic segmentation model after training on the training set;

S4.2: test the predictive performance of the model on the test set of the dataset using the mIoU metric.
CN202110344461.0A | 2021-03-29 | 2021-03-29 | An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion (Active; granted as CN113033570B (en))

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110344461.0A | 2021-03-29 | 2021-03-29 | An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110344461.0A | 2021-03-29 | 2021-03-29 | An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion

Publications (2)

Publication Number | Publication Date
CN113033570A (en) | 2021-06-25
CN113033570B (en) | 2022-11-11

Family

ID=76452856

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110344461.0A | An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion (Active; granted as CN113033570B (en)) | 2021-03-29 | 2021-03-29

Country Status (1)

Country | Link
CN (1) | CN113033570B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113506310B (en) * | 2021-07-16 | 2022-03-01 | 首都医科大学附属北京天坛医院 | Medical image processing method, device, electronic device and storage medium
CN113658200B (en) * | 2021-07-29 | 2024-01-02 | 东北大学 | Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113762476B (en) * | 2021-09-08 | 2023-12-19 | 中科院成都信息技术股份有限公司 | Neural network model for text detection and text detection method thereof
CN113762396A (en) * | 2021-09-10 | 2021-12-07 | 西南科技大学 | A method for semantic segmentation of two-dimensional images
CN113920099B (en) * | 2021-10-15 | 2022-08-30 | 深圳大学 | Polyp segmentation method based on non-local information extraction and related components
CN113936006A (en) * | 2021-10-29 | 2022-01-14 | 天津大学 | Segmentation method and device for processing high-noise low-quality medical image
CN114187442A (en) * | 2021-12-14 | 2022-03-15 | 深圳致星科技有限公司 | Image processing method, storage medium, electronic device, and image processing apparatus
CN114511766B (en) * | 2022-01-26 | 2025-01-24 | 西南民族大学 | Image recognition method and related device based on deep learning
CN114419449B (en) * | 2022-03-28 | 2022-06-24 | 成都信息工程大学 | A Semantic Segmentation Method for Remote Sensing Images Based on Self-Attention Multi-scale Feature Fusion
CN115375652B (en) * | 2022-08-22 | 2025-09-19 | 深圳大学 | Method, device, equipment and medium for detecting welding defects of new energy battery pole
CN115829980B (en) * | 2022-12-13 | 2023-07-25 | 深圳核韬科技有限公司 | Image recognition method, device and equipment for fundus photo and storage medium
CN117211758B (en) * | 2023-11-07 | 2024-04-02 | 克拉玛依市远山石油科技有限公司 | Intelligent drilling control system and method for shallow hole coring
CN117809043B (en) * | 2024-03-01 | 2024-04-30 | 华东交通大学 | Foundation cloud picture segmentation and classification method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108876793A (en) * | 2018-04-13 | 2018-11-23 | 北京迈格威科技有限公司 | Semantic segmentation methods, devices and systems and storage medium
CN109190626A (en) * | 2018-07-27 | 2019-01-11 | 国家新闻出版广电总局广播科学研究院 | A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN109190752A (en) * | 2018-07-27 | 2019-01-11 | 国家新闻出版广电总局广播科学研究院 | The image, semantic dividing method of global characteristics and local feature based on deep learning
CN109325534A (en) * | 2018-09-22 | 2019-02-12 | 天津大学 | A Semantic Segmentation Method Based on Bidirectional Multiscale Pyramid
CN109461157A (en) * | 2018-10-19 | 2019-03-12 | 苏州大学 | Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN110232394A (en) * | 2018-03-06 | 2019-09-13 | 华南理工大学 | A kind of multi-scale image semantic segmentation method
CN110706239A (en) * | 2019-09-26 | 2020-01-17 | 哈尔滨工程大学 | Scene segmentation method fusing full convolution neural network and improved ASPP module
CN111242138A (en) * | 2020-01-11 | 2020-06-05 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion
CN112446890A (en) * | 2020-10-14 | 2021-03-05 | 浙江工业大学 | Melanoma segmentation method based on void convolution and multi-scale fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109711413B (en) * | 2018-12-30 | 2023-04-07 | 陕西师范大学 | Image semantic segmentation method based on deep learning
CN110188817B (en) * | 2019-05-28 | 2021-02-26 | 厦门大学 | A real-time high-performance semantic segmentation method for street view images based on deep learning
CN111369563B (en) * | 2020-02-21 | 2023-04-07 | 华南理工大学 | A Semantic Segmentation Method Based on Pyramid Atrous Convolutional Network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110232394A (en) * | 2018-03-06 | 2019-09-13 | 华南理工大学 | A kind of multi-scale image semantic segmentation method
CN108876793A (en) * | 2018-04-13 | 2018-11-23 | 北京迈格威科技有限公司 | Semantic segmentation methods, devices and systems and storage medium
CN109190626A (en) * | 2018-07-27 | 2019-01-11 | 国家新闻出版广电总局广播科学研究院 | A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN109190752A (en) * | 2018-07-27 | 2019-01-11 | 国家新闻出版广电总局广播科学研究院 | The image, semantic dividing method of global characteristics and local feature based on deep learning
CN109325534A (en) * | 2018-09-22 | 2019-02-12 | 天津大学 | A Semantic Segmentation Method Based on Bidirectional Multiscale Pyramid
CN109461157A (en) * | 2018-10-19 | 2019-03-12 | 苏州大学 | Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN110706239A (en) * | 2019-09-26 | 2020-01-17 | 哈尔滨工程大学 | Scene segmentation method fusing full convolution neural network and improved ASPP module
CN111242138A (en) * | 2020-01-11 | 2020-06-05 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion
CN112446890A (en) * | 2020-10-14 | 2021-03-05 | 浙江工业大学 | Melanoma segmentation method based on void convolution and multi-scale fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DeepStrip: High Resolution Boundary Refinement; Peng Zhou et al.; arXiv; 2020-03-25; full text *
ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation; Sachin Mehta et al.; SpringerLink; 2018-12-31; full text *
Rethinking Atrous Convolution for Semantic Image Segmentation; Liang-Chieh Chen et al.; arXiv; 2017-12-05; full text *

Also Published As

Publication number | Publication date
CN113033570A (en) | 2021-06-25

Similar Documents

Publication | Publication Date | Title
CN113033570B (en) An Image Semantic Segmentation Method Based on Improved Atrous Convolution and Multi-level Feature Information Fusion
CN109740465B (en) A Lane Line Detection Algorithm Based on Instance Segmentation Neural Network Framework
CN112396607B (en) A Deformable Convolution Fusion Enhanced Semantic Segmentation Method for Street View Images
CN112132844B (en) Image segmentation method based on lightweight recursive non-local self-attention
CN109583340B (en) A video object detection method based on deep learning
CN108710830A (en) A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
CN108985250A (en) Traffic scene analysis method based on multitask network
CN110188685A (en) A target counting method and system based on double-attention multi-scale cascade network
CN112733919B (en) Image semantic segmentation method and system based on void convolution and multi-scale and multi-branch
CN113724271B (en) Semantic segmentation model training method for understanding complex environment mobile robot scene
CN111476249B (en) Construction method of multi-scale large receptive field convolutional neural network
CN113673590A (en) Rain removal method, system and medium based on multi-scale hourglass densely connected network
CN115035295B (en) Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112052783A (en) High-resolution image weak supervision building extraction method combining pixel semantic association and boundary attention
CN112465801B (en) An Instance Segmentation Method for Extracting Mask Features at Different Scales
CN116645598B (en) A remote sensing image semantic segmentation method based on channel attention feature fusion
CN113256546A (en) Depth map completion method based on color map guidance
CN113255675B (en) Image semantic segmentation network structure and method based on dilated convolution and residual path
CN109447897B (en) Real scene image synthesis method and system
CN112329801A (en) Convolutional neural network non-local information construction method
CN114299101A (en) Method, apparatus, apparatus, medium and program product for acquiring target area of image
CN113378704B (en) A multi-target detection method, device and storage medium
CN112634289B (en) A Fast Feasible Domain Segmentation Method Based on Asymmetric Atrous Convolution
CN114638408A (en) A Pedestrian Trajectory Prediction Method Based on Spatio-temporal Information

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
