



Technical Field
The present invention belongs to the field of intelligent vehicles and relates to a method for semantic segmentation of urban road scenes based on deep learning.
Background Art
In recent years, with continuing urbanization, urban road conditions have become increasingly complex: pedestrians, traffic lights, zebra crossings, and various types of vehicles all affect the speed and obstacle-avoidance behavior of intelligent vehicles. Deep-learning-based semantic segmentation can identify the environment around the vehicle well and provide corresponding feedback. Semantic segmentation assigns a preset category to every pixel of an image, which not only lets an intelligent vehicle understand its surroundings in real time while driving but also reduces traffic accidents. Research on deep learning for urban road environments has therefore long been a hot topic in vehicle intelligence. Existing deep-learning semantic segmentation methods include neural networks such as SegNet, FCN, and ResNet. Although these networks do not require a traditional object-recognition pipeline (they learn features automatically without hand-engineered design, and a suitable model can be obtained by training on a large number of images and producing segmentation results), their training suffers from the following problems: 1. overfitting caused by an excessive number of weights; 2. gradients that may shrink rapidly through the many layers (the vanishing-gradient problem); 3. the large datasets required for training and the resulting long training times. These problems make it difficult for deep networks to output accurate segmentation results, so an intelligent vehicle cannot obtain real-time feedback about its surroundings under complex road conditions, creating a safety hazard. It is therefore valuable to design a network that uses a smaller dataset, prevents gradients from shrinking too fast, and avoids overfitting during training.
Summary of the Invention
To overcome the shortcomings of the prior art and enable intelligent vehicles to better recognize their surroundings in complex environments such as urban roads, the present invention proposes a deep-learning method for semantic segmentation of urban road scenes that uses a smaller dataset, prevents gradients from shrinking too fast, and avoids overfitting during training.
The technical solution adopted by the present invention to solve its technical problem is as follows:
A method for semantic segmentation of urban road scenes based on deep learning, the method comprising the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a fixed time interval T, and each image, with resolution h×w, is fed to an image detection module to obtain valid images. The images are then passed to an annotation module; the system uses the publicly available GUI annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline the vehicles, pedestrians, bicycles, traffic lights, and neon lights in each image and label them as different categories. The generated annotation images represent the different object classes by different gray levels; from these gray levels the grayscale table list and the number of object categories K present in the images are obtained.
2) Data expansion of the annotation images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and are then transformed by an image affine matrix. The affine transformation is given by formula (1):

$$\begin{bmatrix} a' \\ b' \\ 1 \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ 1 \end{bmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation; c_1 is the scaling factor of the horizontal image coordinate and c_4 that of the vertical coordinate; c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping, and similar transformations maintain the original image resolution, yielding the dataset.
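As an illustrative sketch (not part of the claimed method), formula (1) can be applied with OpenCV's warpAffine; the parameter ranges below are assumptions chosen for demonstration, and the same transform must be applied to the image and its annotation map so the gray-level labels stay aligned:

```python
import cv2
import numpy as np

def random_affine(image, label):
    """Apply the affine transform of formula (1) to an image and its
    annotation map with identical, randomly drawn parameters."""
    h, w = image.shape[:2]
    c1, c4 = np.random.uniform(0.9, 1.1, 2)   # x / y scaling factors
    c2, c3 = np.random.uniform(-0.1, 0.1, 2)  # shear terms
    sx, sy = np.random.uniform(-10, 10, 2)    # translations in pixels
    M = np.float32([[c1, c2, sx],
                    [c3, c4, sy]])            # top two rows of the 3x3 matrix
    # Output size (w, h) keeps the original resolution; borders are zero-padded.
    image_t = cv2.warpAffine(image, M, (w, h))
    # Nearest-neighbour interpolation so label gray levels remain valid classes.
    label_t = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST)
    return image_t, label_t
```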
3) The data-expanded images and annotation images are used to train the network. The residual U-net consists of four parts: a downsampling part, a bridge part, an upsampling part, and a classification part.
The training parameters are the image height h, image width w, loss value L, number of training iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the network's predicted images and the actual label images, and backpropagation adjusts the network parameters so as to minimize L. The network is trained repeatedly up to the set number of iterations, with the network parameters tuned on the validation set during the process, finally yielding the optimal network model.
4) Road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the downsampling part is divided into four levels, each consisting of one residual network (the first- to fourth-level residual networks). The layers of the first-level residual network are connected in the order: convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; at the fusion layer the input image is fused with the processed feature image through an identity (shortcut) connection. The second- to fourth-level residual networks share the same form, with layers connected in the order: batch-normalization layer, softmax function layer, convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image is likewise fused with the processed feature image at the fusion layer through an identity connection. The convolution layers use 3×3 kernels, and the two convolution layers of the successive levels have 64, 128, 256, and 512 channels, respectively. Finally, the levels are connected by 2×2 pooling layers with stride 2, whose channel dimensions match the convolution layers of their level.
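A minimal PyTorch sketch of the second- to fourth-level block may clarify the layer order. The softmax activation follows the text's wording (ReLU is the more common choice in practice), and the equal input/output channel count inside one block is an assumption, since the text does not spell out where the 64→128→256→512 width changes occur:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Second- to fourth-level encoder block: BN -> softmax -> conv3x3 ->
    BN -> softmax -> conv3x3, fused with the block input at the fusion
    layer through an identity (shortcut) connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Softmax(dim=1),   # the text's "softmax function layer"
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Softmax(dim=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # fusion layer: identity connection

# One encoder level would follow this block with a 2x2, stride-2 pooling
# layer (nn.MaxPool2d(2)); the first-level block instead begins with a
# convolution and fuses with the raw input, as described above.
```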
Furthermore, in step 3), the bridge part prepares for the concatenation of the network's high- and low-level information. It consists of two batch-normalization layers, two softplus function layers, and two 3×3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and its layers are connected in the same order as in the second-level residual network. Finally, an upsampling layer resizes the feature image to the size required for concatenation.
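A hedged sketch of the bridge in the same PyTorch style; the 512 input channels (the fourth encoder level's output width) and the bilinear upsampling mode are assumptions not fixed by the text:

```python
import torch.nn as nn

class Bridge(nn.Module):
    """Bridge part: two (BN -> softplus -> conv3x3) groups at 1024 channels,
    no fusion layer or identity connection, then upsampling to the size
    expected by the first concatenation."""
    def __init__(self, in_channels=512, channels=1024):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Softplus(),
            nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Softplus(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, x):
        return self.up(self.body(x))
```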
Furthermore, in step 3), the upsampling part likewise consists of four levels of residual networks (the fifth- to eighth-level residual networks). The form of the residual networks and the connection order of their layers are essentially the same as in the downsampling part, except that in the fifth- to seventh-level residual networks the identity connection is replaced by a 1×1 convolution layer, while the eighth-level residual network is unchanged. The convolution layers of the upsampling levels have 512, 256, 128, and 64 channels, respectively. The levels are connected by upsampling layers and splicing layers; each splicing layer concatenates high- and low-level information of the corresponding size, as follows:
(3.1) The feature image output by the fourth-level residual network, after its pooling layer, is concatenated with the feature image output by the bridge part;
(3.2) The feature image output by the third-level residual network, after its pooling layer, is concatenated with the feature image output by the fifth-level residual network after its upsampling layer;
(3.3) The feature image output by the second-level residual network, after its pooling layer, is concatenated with the feature image output by the sixth-level residual network after its upsampling layer;
(3.4) The feature image output by the first-level residual network, after its pooling layer, is concatenated with the feature image output by the seventh-level residual network after its upsampling layer;
Concatenation changes the channel dimension of the feature image, so the 1×1 convolution layer that replaces the identity connection is used to adjust it; the four 1×1 convolution layers have 512, 256, 128, and 64 channels, respectively. The feature images are then fused at the fusion layer, as in the sketch below.
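A hedged sketch of one fifth- to seventh-level decoder step, combining the splicing of (3.1)–(3.4) with the 1×1 projection shortcut; the placement of the upsampling after the residual body and the assumption that the two inputs already share a spatial size are implementation choices, not dictated by the text (the eighth level would keep the plain identity shortcut instead):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Fifth- to seventh-level block: concatenate deep features with the
    matching encoder features, run the residual body, and replace the
    identity shortcut with a 1x1 convolution that fixes the channel count."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Softmax(dim=1),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.Softmax(dim=1),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, deep, skip):
        # Splicing layer: deep and skip are assumed to share a spatial size.
        x = torch.cat([deep, skip], dim=1)
        # Fusion layer with the 1x1 projection replacing the identity.
        return self.up(self.shortcut(x) + self.body(x))
```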
In step 3), the classification part consists of a 1×1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights, and background), the 1×1 convolution layer produces a 6-channel feature image. The pixel values of this raw feature image are not probabilities, however, so the softmax layer converts the output into a probability distribution. The softmax function is given by formula (2):

$$g_k(x) = \frac{\exp(d_k(x))}{\sum_{k'=1}^{K} \exp(d_{k'}(x))} \qquad (2)$$
where d_k(x) is the value of pixel x on channel k, K is the number of object categories, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The cross-entropy loss function is then used to measure the deviation between the prediction and the ground truth; the loss function is given by formula (3):

$$L = -\sum_{x} \log\bigl(g_{t(x)}(x)\bigr) \qquad (3)$$
where t(x) is the class of pixel x in the annotation image, so g_{t(x)}(x) is the predicted probability of that class; the smaller the loss, the closer the predicted image is to the annotation image. Through backpropagation of the loss, the internal parameters of the network are continuously optimized so that the loss keeps decreasing toward its ideal value.
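A minimal NumPy sketch of formulas (2) and (3); the array shapes and the max-shift for numerical stability are implementation choices rather than part of the text:

```python
import numpy as np

def softmax_probs(d):
    """Formula (2): d has shape (K, H, W); returns per-pixel class
    probabilities g_k(x) over the K channels."""
    e = np.exp(d - d.max(axis=0, keepdims=True))  # shift for stability
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(d, t):
    """Formula (3): t has shape (H, W) with entries in 0..K-1 giving the
    annotated class t(x) of each pixel; returns the summed loss L."""
    g = softmax_probs(d)
    h, w = t.shape
    # Gather g_{t(x)}(x) for every pixel via advanced indexing.
    gt = g[t, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -np.log(gt + 1e-12).sum()
```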
Finally, training the model also requires the number of iterations epochs, the batch size batch_size, and the validation-set ratio rate. The validation-set ratio divides the image set into a training set and a validation set; the training-set images are then fed into the network in batches of batch_size until the whole training set has been input, completing one iteration. The model is trained repeatedly for the chosen number of iterations to obtain the optimal neural network model.
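The training procedure can be sketched in PyTorch as follows; the Adam optimizer, the model-selection criterion (lowest validation loss), and the checkpoint filename are assumptions, while nn.CrossEntropyLoss applies formulas (2) and (3) internally (with mean rather than sum reduction):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs, batch_size, rate, device="cuda"):
    """Split by the validation ratio `rate`, minimize the loss L by
    backpropagation, and keep the model with the lowest validation loss.
    `dataset` is assumed to yield (image, label-index) pairs."""
    n_val = int(len(dataset) * rate)
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())  # optimizer assumed
    criterion = nn.CrossEntropyLoss()                 # softmax + formula (3)
    best = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()                           # back-propagate L
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader)
        if val_loss < best:                           # keep the best model
            best = val_loss
            torch.save(model.state_dict(), "best_model.pt")
```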
The main execution stages of the present invention are image acquisition and processing, neural network training, and image recognition with the trained model. The implementation can be divided into the following three stages:
First, image data acquisition: the acquisition-module time interval T is set, images are collected on different urban road sections and passed through the detection module to obtain a valid image set. The images are then annotated with Labelme 3.11.2: its instance scene-segmentation annotation function is used to outline the target objects in each image and label their categories, and the software generates annotation images in which different objects are marked with different gray levels. From these gray levels the grayscale list list=[] and the number of object categories K are obtained. Finally, the data expansion module expands both the images and the annotation images to form the dataset.
Second, network parameters and training: the parameters are the image height h, image width w, loss value L, number of iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the predicted images and the actual label images, and backpropagation adjusts the network parameters to minimize L. The network is trained up to the set number of iterations, with the parameters tuned on the validation set during the process, finally yielding the optimal network model.
Third, road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
The beneficial effects of the present invention are mainly as follows. 1. The network design addresses the vanishing gradients, the large dataset requirements, and the overfitting that can occur when training deep networks: batch normalization, residual networks, and concatenation of high- and low-level information are built into the network, effectively reducing gradient vanishing and the loss of image information, which helps improve segmentation accuracy. 2. The deep-learning road-condition detection system is simple in design and easy to understand, uses a small dataset, runs in real time, and is highly practical and adaptable.
Brief Description of the Drawings
Figure 1 shows the implementation flow of the deep-learning urban road scene semantic segmentation system.
Figure 2 shows the overall model design of the residual U-net used in the deep-learning urban road scene semantic segmentation system.
Figure 3 shows the structure of the second- to fifth-level residual networks within the residual U-net used by the deep-learning urban road scene semantic segmentation system.
Figure 4 shows example semantic segmentation results for urban road scenes.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
Referring to Figures 1 to 4, a method for semantic segmentation of urban road scenes based on deep learning comprises the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a fixed time interval T, and each image, with resolution h×w, is fed to an image detection module to obtain valid images. The images are then passed to an annotation module; the system uses the publicly available GUI annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline the vehicles, pedestrians, bicycles, traffic lights, neon lights, and similar objects in each image and label them as different categories. The generated annotation images represent the different object classes by different gray levels; from these gray levels the grayscale table list and the number of object categories K present in the images are obtained.
2) Data expansion of the annotation images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and are then transformed by an image affine matrix. The affine transformation is given by formula (1):

$$\begin{bmatrix} a' \\ b' \\ 1 \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ 1 \end{bmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation; c_1 is the scaling factor of the horizontal image coordinate and c_4 that of the vertical coordinate; c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping, and similar transformations maintain the original image resolution, yielding the dataset.
3) The data-expanded images and annotation images are used to train the network. The residual U-net consists of four parts: a downsampling part, a bridge part, an upsampling part, and a classification part.
The training parameters are the image height h, image width w, loss value L, number of training iterations epochs, batch size batch_size, and validation-set ratio rate. The dataset is split by rate into a training set and a validation set. During training, batches of batch_size images are fed into the residual U-net; L is computed from the network's predicted images and the actual label images, and backpropagation adjusts the network parameters so as to minimize L. The network is trained repeatedly up to the set number of iterations, with the network parameters tuned on the validation set during the process, finally yielding the optimal network model.
4) Road-condition classification: the acquisition-module time interval T is modified, subsequently captured images are fed into the trained deep-learning model, which outputs predicted semantic segmentation images, and the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the downsampling part is divided into four levels, each consisting of one residual network (the first- to fourth-level residual networks). The layers of the first-level residual network are connected in the order: convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; at the fusion layer the input image is fused with the processed feature image through an identity (shortcut) connection. The second- to fourth-level residual networks share the same form, with layers connected in the order: batch-normalization layer, softmax function layer, convolution layer, batch-normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image is likewise fused with the processed feature image at the fusion layer through an identity connection. The convolution layers use 3×3 kernels, and the two convolution layers of the successive levels have 64, 128, 256, and 512 channels, respectively. Finally, the levels are connected by 2×2 pooling layers with stride 2, whose channel dimensions match the convolution layers of their level.
The bridge part prepares for the concatenation of the network's high- and low-level information. It consists of two batch-normalization layers, two softplus function layers, and two 3×3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and its layers are connected in the same order as in the second-level residual network. Finally, an upsampling layer resizes the feature image to a size suitable for concatenation.
The upsampling part likewise consists of four levels of residual networks (the fifth- to eighth-level residual networks). The form of the residual networks and the connection order of their layers are essentially the same as in the downsampling part, except that in the fifth- to seventh-level residual networks the identity connection is replaced by a 1×1 convolution layer, while the eighth-level residual network is unchanged. The convolution layers of the upsampling levels have 512, 256, 128, and 64 channels, respectively. The levels are connected by upsampling layers and splicing layers; each splicing layer concatenates high- and low-level information of the corresponding size, as follows:
(3.1) The feature image output by the fourth-level residual network, after its pooling layer, is concatenated with the feature image output by the bridge part.
(3.2) The feature image output by the third-level residual network, after its pooling layer, is concatenated with the feature image output by the fifth-level residual network after its upsampling layer.
(3.3) The feature image output by the second-level residual network, after its pooling layer, is concatenated with the feature image output by the sixth-level residual network after its upsampling layer.
(3.4) The feature image output by the first-level residual network, after its pooling layer, is concatenated with the feature image output by the seventh-level residual network after its upsampling layer.
Concatenation changes the channel dimension of the feature image, so the 1×1 convolution layer that replaces the identity connection is used to adjust it; the four 1×1 convolution layers have 512, 256, 128, and 64 channels, respectively. The feature images are then fused at the fusion layer.
The classification part consists of a 1×1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights, and background), the 1×1 convolution layer produces a 6-channel feature image. The pixel values of this raw feature image are not probabilities, however, so the softmax layer converts the output into a probability distribution. The softmax function is given by formula (2):

$$g_k(x) = \frac{\exp(d_k(x))}{\sum_{k'=1}^{K} \exp(d_{k'}(x))} \qquad (2)$$
where d_k(x) is the value of pixel x on channel k, K is the number of object categories, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The cross-entropy loss function is then used to measure the deviation between the prediction and the ground truth; the loss function is given by formula (3):

$$L = -\sum_{x} \log\bigl(g_{t(x)}(x)\bigr) \qquad (3)$$
where t(x) is the class of pixel x in the annotation image, so g_{t(x)}(x) is the predicted probability of that class; the smaller the loss, the closer the predicted image is to the annotation image. Through backpropagation of the loss, the internal parameters of the network are continuously optimized so that the loss keeps decreasing toward its ideal value.
Finally, training the model also requires the number of iterations epochs, the batch size batch_size, and the validation-set ratio rate. The validation-set ratio divides the image set into a training set and a validation set; the training-set images are then fed into the network in batches of batch_size until the whole training set has been input, completing one iteration. The model is trained repeatedly for the chosen number of iterations to obtain the optimal neural network model.
The main execution stages of this embodiment are image acquisition and processing, neural network training, and image recognition with the trained model. The implementation can be divided into the following three stages:
First, image data acquisition: the acquisition-module time interval is set to T = 4 s, images are collected on different urban road sections and passed through the detection module, yielding 1000 valid images. The images are then annotated with Labelme 3.11.2: its instance scene-segmentation annotation function is used to outline the various targets in each image and label their categories, and the software generates annotation images in which different target categories are marked with different gray levels. The grayscale list list = [0, 20, 80, 140, 180, 230] gives the pixel values of the different targets: background, neon lights, traffic lights, vehicles, pedestrians, and bicycles, for a total of K = 6 categories. Finally, the data expansion module expands both the images and the annotation images to form the dataset.
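A short sketch of the gray-level/class-index mapping implied by this grayscale list; the int64 index dtype is an assumption chosen to match common loss-function APIs:

```python
import numpy as np

# Gray levels from this embodiment: background, neon light, traffic light,
# vehicle, pedestrian, bicycle.
GRAY_LEVELS = [0, 20, 80, 140, 180, 230]

def gray_to_index(label_png):
    """Map an annotation image of shape (H, W) to class indices 0..K-1."""
    index = np.zeros(label_png.shape, dtype=np.int64)
    for k, g in enumerate(GRAY_LEVELS):
        index[label_png == g] = k
    return index

def index_to_gray(index_map):
    """Inverse mapping, used when returning predictions to the processor."""
    return np.asarray(GRAY_LEVELS, dtype=np.uint8)[index_map]
```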
Second, in the network parameter setting interface, the network parameters are entered as follows: image height h = 224, image width w = 224, loss function L, number of iterations epochs = 30, batch size batch_size = 4, and validation-set ratio rate = 0.1. The expanded set of 3000 images is split into a 2700-image training set and a 300-image validation set. During training, images are fed into the residual U-net four at a time according to batch_size until the whole training set has been processed; the loss L is computed from the predicted images and the actual label images, and backpropagation adjusts the network parameters to minimize L, completing one iteration. The network is trained for 30 iterations, with the parameters tuned on the validation set during the process, finally yielding a suitable network model.
Third, the acquisition-module time interval is changed to T = 0.2 s, and subsequently captured images are fed into the trained deep-learning model, which outputs real-time semantic segmentation results; the gray levels in each image are returned to the processor, so that the vehicle can reliably identify which categories of objects lie ahead and react accordingly.
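The per-frame inference step can be sketched as follows; the (3, 224, 224) input shape follows the parameters above, while the preprocessing and normalization of the captured frame are assumed to happen upstream:

```python
import torch
import numpy as np

GRAY_LEVELS = np.asarray([0, 20, 80, 140, 180, 230], dtype=np.uint8)

def segment_frame(model, frame, device="cuda"):
    """Run one captured frame through the trained model and return the
    predicted segmentation as the gray-level image sent back to the
    processor. `frame` is a float tensor of shape (3, 224, 224)."""
    model.eval()
    with torch.no_grad():
        logits = model(frame.unsqueeze(0).to(device))  # (1, 6, 224, 224)
        pred = logits.argmax(dim=1)[0].cpu().numpy()   # class index per pixel
    return GRAY_LEVELS[pred]                           # back to gray levels
```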
The actual system design, the network construction process, and the results are shown in Figures 1 to 4: Figure 1 shows the implementation flow of the deep-learning urban road scene semantic segmentation system; Figure 2 the overall model design of the residual U-net; Figure 3 the structure of the second- to fifth-level residual networks within the residual U-net; and Figure 4 example segmentation results.
The above describes the favorable urban road scene semantic segmentation results achieved by one embodiment of the present invention. It should be noted that the above embodiment is intended to illustrate the present invention, not to limit it; any modification made to the present invention within its spirit and within the protection scope of the claims falls within the protection scope of the present invention.