



Technical Field
The present invention belongs to the field of intelligent vehicles and relates to a method for semantic segmentation of urban road scenes based on deep learning.
Background Art
In recent years, with continuing urbanization, urban road conditions have become increasingly complex: pedestrians, traffic lights, zebra crossings, and various types of vehicles all affect the speed and obstacle-avoidance behavior of intelligent vehicles. Deep-learning-based semantic segmentation can identify the environment around the vehicle well and provide corresponding feedback. Semantic segmentation assigns a preset category to every pixel of an image, which not only lets an intelligent vehicle understand its surroundings in real time while driving but also reduces traffic accidents. Research on deep learning for urban road environments has therefore long been a hot topic in vehicle intelligence. Existing deep-learning semantic segmentation methods include neural networks such as SegNet, FCN, and ResNet. Although these networks do not require a traditional object-recognition pipeline (they learn features automatically without hand-engineered design, and a suitable model can be obtained by training on a large number of images and producing segmentation results), their training suffers from the following problems: 1. overfitting caused by an excessive number of weights; 2. gradients that may shrink rapidly through the many layers (the vanishing-gradient problem); 3. the large datasets required for training and the resulting long training times. These problems make it difficult for deep networks to output accurate segmentation results, so an intelligent vehicle cannot obtain real-time feedback about its surroundings under complex road conditions, creating a safety hazard. It is therefore valuable to design a network that uses a smaller dataset, prevents gradients from shrinking too fast, and avoids overfitting during training.
Summary of the Invention
To overcome the shortcomings of the prior art and enable intelligent vehicles to better recognize their surroundings in complex environments such as urban roads, the present invention proposes a deep-learning method for semantic segmentation of urban road scenes that uses a smaller dataset, prevents gradients from shrinking too fast, and avoids overfitting during training.
The technical solution adopted by the present invention to solve its technical problem is as follows:
A method for semantic segmentation of urban road scenes based on deep learning, the method comprising the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a fixed time interval T, and each image, with resolution h×w, is fed to an image detection module to obtain valid images. The images are then passed to an annotation module; the system uses the publicly available GUI annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline the vehicles, pedestrians, bicycles, traffic lights, and neon lights in each image and label them as different categories. The generated annotation images represent the different object classes by different gray levels; from these gray levels the grayscale table list and the number of object categories K present in the images are obtained.
2) Data expansion of the annotation images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and are then transformed by an image affine matrix. The affine transformation is given by formula (1):

$$\begin{bmatrix} a' \\ b' \\ 1 \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ 1 \end{bmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation; c_1 is the scaling factor of the horizontal image coordinate and c_4 that of the vertical coordinate; c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping, and similar transformations maintain the original image resolution, yielding the dataset.
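As an illustrative sketch (not part of the claimed method), formula (1) can be applied with OpenCV's warpAffine; the parameter ranges below are assumptions chosen for demonstration, and the same transform must be applied to the image and its annotation map so the gray-level labels stay aligned:

```python
import cv2
import numpy as np

def random_affine(image, label):
    """Apply the affine transform of formula (1) to an image and its
    annotation map with identical, randomly drawn parameters."""
    h, w = image.shape[:2]
    c1, c4 = np.random.uniform(0.9, 1.1, 2)   # x / y scaling factors
    c2, c3 = np.random.uniform(-0.1, 0.1, 2)  # shear terms
    sx, sy = np.random.uniform(-10, 10, 2)    # translations in pixels
    M = np.float32([[c1, c2, sx],
                    [c3, c4, sy]])            # top two rows of the 3x3 matrix
    # Output size (w, h) keeps the original resolution; borders are zero-padded.
    image_t = cv2.warpAffine(image, M, (w, h))
    # Nearest-neighbour interpolation so label gray levels remain valid classes.
    label_t = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST)
    return image_t, label_t
```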
3) The data-expanded images and annotation images are used to train the network. The residual U-net consists of four parts: a downsampling part, a bridge part, an upsampling part, and a classification part.
The training parameters are the image height h, image width w, loss value L, number of training iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the network's predicted images and the actual label images, and backpropagation adjusts the network parameters so as to minimize L. The network is trained repeatedly up to the set number of iterations, with the network parameters tuned on the validation set during the process, finally yielding the optimal network model.
4) Road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the downsampling part is divided into four levels, each consisting of one residual network (the first- to fourth-level residual networks). The layers of the first-level residual network are connected in the order: convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; at the fusion layer the input image is fused with the processed feature image through an identity (shortcut) connection. The second- to fourth-level residual networks share the same form, with layers connected in the order: batch-normalization layer, softmax function layer, convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image is likewise fused with the processed feature image at the fusion layer through an identity connection. The convolution layers use 3×3 kernels, and the two convolution layers of the successive levels have 64, 128, 256, and 512 channels, respectively. Finally, the levels are connected by 2×2 pooling layers with stride 2, whose channel dimensions match the convolution layers of their level.
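A minimal PyTorch sketch of the second- to fourth-level block may clarify the layer order. The softmax activation follows the text's wording (ReLU is the more common choice in practice), and the equal input/output channel count inside one block is an assumption, since the text does not spell out where the 64→128→256→512 width changes occur:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Second- to fourth-level encoder block: BN -> softmax -> conv3x3 ->
    BN -> softmax -> conv3x3, fused with the block input at the fusion
    layer through an identity (shortcut) connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Softmax(dim=1),   # the text's "softmax function layer"
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Softmax(dim=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # fusion layer: identity connection

# One encoder level would follow this block with a 2x2, stride-2 pooling
# layer (nn.MaxPool2d(2)); the first-level block instead begins with a
# convolution and fuses with the raw input, as described above.
```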
Furthermore, in step 3), the bridge part prepares for the concatenation of the network's high- and low-level information. It consists of two batch-normalization layers, two softplus function layers, and two 3×3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and its layers are connected in the same order as in the second-level residual network. Finally, an upsampling layer resizes the feature image to the size required for concatenation.
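A hedged sketch of the bridge in the same PyTorch style; the 512 input channels (the fourth encoder level's output width) and the bilinear upsampling mode are assumptions not fixed by the text:

```python
import torch.nn as nn

class Bridge(nn.Module):
    """Bridge part: two (BN -> softplus -> conv3x3) groups at 1024 channels,
    no fusion layer or identity connection, then upsampling to the size
    expected by the first concatenation."""
    def __init__(self, in_channels=512, channels=1024):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Softplus(),
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Softplus(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, x):
        return self.up(self.body(x))
```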
Furthermore, in step 3), the upsampling part likewise consists of four levels of residual networks (the fifth- to eighth-level residual networks). The form of the residual networks and the connection order of their layers are essentially the same as in the downsampling part, except that in the fifth- to seventh-level residual networks the identity connection is replaced by a 1×1 convolution layer, while the eighth-level residual network is unchanged. The convolution layers of the upsampling levels have 512, 256, 128, and 64 channels, respectively. The levels are connected by upsampling layers and splicing layers; each splicing layer concatenates high- and low-level information of the corresponding size, as follows:
(3.1) The feature image output by the fourth-level residual network, after its pooling layer, is concatenated with the feature image output by the bridge part;
(3.2) The feature image output by the third-level residual network, after its pooling layer, is concatenated with the feature image output by the fifth-level residual network after its upsampling layer;
(3.3) The feature image output by the second-level residual network, after its pooling layer, is concatenated with the feature image output by the sixth-level residual network after its upsampling layer;
(3.4) The feature image output by the first-level residual network, after its pooling layer, is concatenated with the feature image output by the seventh-level residual network after its upsampling layer;
Concatenation changes the channel dimension of the feature image, so the 1×1 convolution layer that replaces the identity connection is used to adjust it; the four 1×1 convolution layers have 512, 256, 128, and 64 channels, respectively. The feature images are then fused at the fusion layer, as in the sketch below.
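A hedged sketch of one fifth- to seventh-level decoder step, combining the splicing of (3.1)–(3.4) with the 1×1 projection shortcut; the placement of the upsampling after the residual body and the assumption that the two inputs already share a spatial size are implementation choices, not dictated by the text (the eighth level would keep the plain identity shortcut instead):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Fifth- to seventh-level block: concatenate deep features with the
    matching encoder features, run the residual body, and replace the
    identity shortcut with a 1x1 convolution that fixes the channel count."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Softmax(dim=1),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.Softmax(dim=1),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, deep, skip):
        # Splicing layer: deep and skip are assumed to share a spatial size.
        x = torch.cat([deep, skip], dim=1)
        # Fusion layer with the 1x1 projection replacing the identity.
        return self.up(self.shortcut(x) + self.body(x))
```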
In step 3), the classification part consists of a 1×1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights, and background), the 1×1 convolution layer produces a 6-channel feature image. The pixel values of this raw feature image are not probabilities, however, so the softmax layer converts the output into a probability distribution. The softmax function is given by formula (2):

$$g_k(x) = \frac{\exp(d_k(x))}{\sum_{k'=1}^{K} \exp(d_{k'}(x))} \qquad (2)$$
where d_k(x) is the value of pixel x on channel k, K is the number of object categories, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The cross-entropy loss function is then used to measure the deviation between the prediction and the ground truth; the loss function is given by formula (3):

$$L = -\sum_{x} \log\bigl(g_{t(x)}(x)\bigr) \qquad (3)$$
where t(x) is the class of pixel x in the annotation image, so g_{t(x)}(x) is the predicted probability of that class; the smaller the loss, the closer the predicted image is to the annotation image. Through backpropagation of the loss, the internal parameters of the network are continuously optimized so that the loss keeps decreasing toward its ideal value.
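A minimal NumPy sketch of formulas (2) and (3); the array shapes and the max-shift for numerical stability are implementation choices rather than part of the text:

```python
import numpy as np

def softmax_probs(d):
    """Formula (2): d has shape (K, H, W); returns per-pixel class
    probabilities g_k(x) over the K channels."""
    e = np.exp(d - d.max(axis=0, keepdims=True))  # shift for stability
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(d, t):
    """Formula (3): t has shape (H, W) with entries in 0..K-1 giving the
    annotated class t(x) of each pixel; returns the summed loss L."""
    g = softmax_probs(d)
    h, w = t.shape
    # Gather g_{t(x)}(x) for every pixel via advanced indexing.
    gt = g[t, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -np.log(gt + 1e-12).sum()
```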
Finally, training the model also requires the number of iterations epochs, the batch size batch_size, and the validation-set ratio rate. The validation-set ratio divides the image set into a training set and a validation set; the training-set images are then fed into the network in batches of batch_size until the whole training set has been input, completing one iteration. The model is trained repeatedly for the chosen number of iterations to obtain the optimal neural network model.
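The training procedure can be sketched in PyTorch as follows; the Adam optimizer, the model-selection criterion (lowest validation loss), and the checkpoint filename are assumptions, while nn.CrossEntropyLoss applies formulas (2) and (3) internally (with mean rather than sum reduction):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs, batch_size, rate, device="cuda"):
    """Split by the validation ratio `rate`, minimize the loss L by
    backpropagation, and keep the model with the lowest validation loss.
    `dataset` is assumed to yield (image, label-index) pairs."""
    n_val = int(len(dataset) * rate)
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())  # optimizer assumed
    criterion = nn.CrossEntropyLoss()                 # softmax + formula (3)
    best = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()                           # back-propagate L
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader)
        if val_loss < best:                           # keep the best model
            best = val_loss
            torch.save(model.state_dict(), "best_model.pt")
```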
The main execution stages of the present invention are image acquisition and processing, neural network training, and image recognition with the trained model. The implementation can be divided into the following three stages:
First, image data acquisition: the acquisition-module time interval T is set, images are collected on different urban road sections and passed through the detection module to obtain a valid image set. The images are then annotated with Labelme 3.11.2: its instance scene-segmentation annotation function is used to outline the target objects in each image and label their categories, and the software generates annotation images in which different objects are marked with different gray levels. From these gray levels the grayscale list list=[] and the number of object categories K are obtained. Finally, the data expansion module expands both the images and the annotation images to form the dataset.
Second, network parameters and training: the parameters are the image height h, image width w, loss value L, number of iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the predicted images and the actual label images, and backpropagation adjusts the network parameters to minimize L. The network is trained up to the set number of iterations, with the parameters tuned on the validation set during the process, finally yielding the optimal network model.
Third, road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
The beneficial effects of the present invention are mainly as follows. 1. The network design addresses the vanishing gradients, the large dataset requirements, and the overfitting that can occur when training deep networks: batch normalization, residual networks, and concatenation of high- and low-level information are built into the network, effectively reducing gradient vanishing and the loss of image information, which helps improve segmentation accuracy. 2. The deep-learning road-condition detection system is simple in design and easy to understand, uses a small dataset, runs in real time, and is highly practical and adaptable.
Brief Description of the Drawings
Figure 1 shows the implementation flow of the deep-learning urban road scene semantic segmentation system.
Figure 2 shows the overall model design of the residual U-net used in the deep-learning urban road scene semantic segmentation system.
Figure 3 shows the structure of the second- to fifth-level residual networks within the residual U-net used by the deep-learning urban road scene semantic segmentation system.
Figure 4 shows example semantic segmentation results for urban road scenes.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
Referring to Figures 1 to 4, a method for semantic segmentation of urban road scenes based on deep learning comprises the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a fixed time interval T, and each image, with resolution h×w, is fed to an image detection module to obtain valid images. The images are then passed to an annotation module; the system uses the publicly available GUI annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline the vehicles, pedestrians, bicycles, traffic lights, neon lights, and similar objects in each image and label them as different categories. The generated annotation images represent the different object classes by different gray levels; from these gray levels the grayscale table list and the number of object categories K present in the images are obtained.
2) Data expansion of the annotation images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and are then transformed by an image affine matrix. The affine transformation is given by formula (1):

$$\begin{bmatrix} a' \\ b' \\ 1 \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ 1 \end{bmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation; c_1 is the scaling factor of the horizontal image coordinate and c_4 that of the vertical coordinate; c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping, and similar transformations maintain the original image resolution, yielding the dataset.
3) The data-expanded images and annotation images are used to train the network. The residual U-net consists of four parts: a downsampling part, a bridge part, an upsampling part, and a classification part.
The training parameters are the image height h, image width w, loss value L, number of training iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the network's predicted images and the actual label images, and backpropagation adjusts the network parameters so as to minimize L. The network is trained repeatedly up to the set number of iterations, with the network parameters tuned on the validation set during the process, finally yielding the optimal network model.
4) Road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the downsampling part is divided into four levels, each consisting of one residual network (the first- to fourth-level residual networks). The layers of the first-level residual network are connected in the order: convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; at the fusion layer the input image is fused with the processed feature image through an identity (shortcut) connection. The second- to fourth-level residual networks share the same form, with layers connected in the order: batch-normalization layer, softmax function layer, convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image is likewise fused with the processed feature image at the fusion layer through an identity connection. The convolution layers use 3×3 kernels, and the two convolution layers of the successive levels have 64, 128, 256, and 512 channels, respectively. Finally, the levels are connected by 2×2 pooling layers with stride 2, whose channel dimensions match the convolution layers of their level.
The bridge part prepares for the concatenation of the network's high- and low-level information. It consists of two batch-normalization layers, two softplus function layers, and two 3×3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and its layers are connected in the same order as in the second-level residual network. Finally, an upsampling layer resizes the feature image to a size suitable for concatenation.
The upsampling part likewise consists of four levels of residual networks (the fifth- to eighth-level residual networks). The form of the residual networks and the connection order of their layers are essentially the same as in the downsampling part, except that in the fifth- to seventh-level residual networks the identity connection is replaced by a 1×1 convolution layer, while the eighth-level residual network is unchanged. The convolution layers of the upsampling levels have 512, 256, 128, and 64 channels, respectively. The levels are connected by upsampling layers and splicing layers; each splicing layer concatenates high- and low-level information of the corresponding size, as follows:
(3.1) The feature image output by the fourth-level residual network, after its pooling layer, is concatenated with the feature image output by the bridge part.
(3.2) The feature image output by the third-level residual network, after its pooling layer, is concatenated with the feature image output by the fifth-level residual network after its upsampling layer.
(3.3) The feature image output by the second-level residual network, after its pooling layer, is concatenated with the feature image output by the sixth-level residual network after its upsampling layer.
(3.4) The feature image output by the first-level residual network, after its pooling layer, is concatenated with the feature image output by the seventh-level residual network after its upsampling layer.
Concatenation changes the channel dimension of the feature image, so the 1×1 convolution layer that replaces the identity connection is used to adjust it; the four 1×1 convolution layers have 512, 256, 128, and 64 channels, respectively. The feature images are then fused at the fusion layer.
The classification part consists of a 1×1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights, and background), the 1×1 convolution layer produces a 6-channel feature image. The pixel values of this raw feature image are not probabilities, however, so the softmax layer converts the output into a probability distribution. The softmax function is given by formula (2):

$$g_k(x) = \frac{\exp(d_k(x))}{\sum_{k'=1}^{K} \exp(d_{k'}(x))} \qquad (2)$$
where d_k(x) is the value of pixel x on channel k, K is the number of object categories, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The cross-entropy loss function is then used to measure the deviation between the prediction and the ground truth; the loss function is given by formula (3):

$$L = -\sum_{x} \log\bigl(g_{t(x)}(x)\bigr) \qquad (3)$$
where t(x) is the class of pixel x in the annotation image, so g_{t(x)}(x) is the predicted probability of that class; the smaller the loss, the closer the predicted image is to the annotation image. Through backpropagation of the loss, the internal parameters of the network are continuously optimized so that the loss keeps decreasing toward its ideal value.
Finally, training the model also requires the number of iterations epochs, the batch size batch_size, and the validation-set ratio rate. The validation-set ratio divides the image set into a training set and a validation set; the training-set images are then fed into the network in batches of batch_size until the whole training set has been input, completing one iteration. The model is trained repeatedly for the chosen number of iterations to obtain the optimal neural network model.
The main execution stages of this embodiment are image acquisition and processing, neural network training, and image recognition with the trained model. The implementation can be divided into the following three stages:
First, image data acquisition: the acquisition-module time interval is set to T = 4 s, images are collected on different urban road sections and passed through the detection module, yielding 1000 valid images. The images are then annotated with Labelme 3.11.2: its instance scene-segmentation annotation function is used to outline the various targets in each image and label their categories, and the software generates annotation images in which different target categories are marked with different gray levels. The grayscale list list = [0, 20, 80, 140, 180, 230] gives the pixel values of the different targets: background, neon lights, traffic lights, vehicles, pedestrians, and bicycles, for a total of K = 6 categories. Finally, the data expansion module expands both the images and the annotation images to form the dataset.
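A short sketch of the gray-level/class-index mapping implied by this grayscale list; the int64 index dtype is an assumption chosen to match common loss-function APIs:

```python
import numpy as np

# Gray levels from this embodiment: background, neon light, traffic light,
# vehicle, pedestrian, bicycle.
GRAY_LEVELS = [0, 20, 80, 140, 180, 230]

def gray_to_index(label_png):
    """Map an annotation image of shape (H, W) to class indices 0..K-1."""
    index = np.zeros(label_png.shape, dtype=np.int64)
    for k, g in enumerate(GRAY_LEVELS):
        index[label_png == g] = k
    return index

def index_to_gray(index_map):
    """Inverse mapping, used when returning predictions to the processor."""
    return np.asarray(GRAY_LEVELS, dtype=np.uint8)[index_map]
```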
Second, in the network parameter setting interface, the network parameters are entered as follows: image height h = 224, image width w = 224, loss function L, number of iterations epochs = 30, batch size batch_size = 4, and validation-set ratio rate = 0.1. The expanded set of 3000 images is split into a 2700-image training set and a 300-image validation set. During training, images are fed into the residual U-net four at a time according to batch_size until the whole training set has been processed; the loss L is computed from the predicted images and the actual label images, and backpropagation adjusts the network parameters to minimize L, completing one iteration. The network is trained for 30 iterations, with the parameters tuned on the validation set during the process, finally yielding a suitable network model.
Third, the acquisition-module time interval is changed to T = 0.2 s, and subsequently captured images are fed into the trained deep-learning model, which outputs real-time semantic segmentation results; the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
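The per-frame inference step can be sketched as follows; the (3, 224, 224) input shape follows the parameters above, while the preprocessing and normalization of the captured frame are assumed to happen upstream:

```python
import torch
import numpy as np

GRAY_LEVELS = np.asarray([0, 20, 80, 140, 180, 230], dtype=np.uint8)

def segment_frame(model, frame, device="cuda"):
    """Run one captured frame through the trained model and return the
    predicted segmentation as the gray-level image sent back to the
    processor. `frame` is a float tensor of shape (3, 224, 224)."""
    model.eval()
    with torch.no_grad():
        logits = model(frame.unsqueeze(0).to(device))  # (1, 6, 224, 224)
        pred = logits.argmax(dim=1)[0].cpu().numpy()   # class index per pixel
    return GRAY_LEVELS[pred]                           # back to gray levels
```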
The actual system design, the network construction process, and the results are shown in Figures 1 to 4: Figure 1 shows the implementation flow of the deep-learning urban road scene semantic segmentation system; Figure 2 the overall model design of the residual U-net; Figure 3 the structure of the second- to fifth-level residual networks within the residual U-net; and Figure 4 example segmentation results.
The above describes the favorable urban road scene semantic segmentation results achieved by one embodiment of the present invention. It should be noted that the above embodiment is intended to illustrate the present invention, not to limit it; any modification made to the present invention within its spirit and within the protection scope of the claims falls within the protection scope of the present invention.