




Technical Field
The present invention relates to the technical field of computer vision, and in particular to a lightweight saliency detection method with online positioning.
Background Art
Saliency detection is the task of segmenting from a scene the regions or objects to which human vision pays the most attention. It has a wide range of applications in many vision tasks, including image segmentation, image retrieval, object detection, visual tracking, image compression, and scene classification. In recent years, owing to the rapid development of convolutional neural networks (CNNs), deep-learning-based saliency detection methods have made great leaps in prediction accuracy. However, the cost of this improved accuracy is a larger network and more computation: these advanced saliency detection methods often have large model sizes and run slowly even on devices with high-performance graphics cards. Their application scenarios are therefore severely limited, and it is difficult for them to play a role on robots, mobile devices, and industrial equipment, where hardware performance is constrained by requirements on device size and stability.
The salient object detection (SOD) task requires both high-level semantic features and low-level fine-grained features, to locate salient objects and to recover their details, respectively. Multi-scale information is also needed to handle salient objects of different sizes in different scenes. Although lightweight backbone networks such as MobileNets and ShuffleNets have been widely used on mobile devices, these existing lightweight networks usually have limited feature representation capability because of their limited depth, and directly applying them to saliency detection rarely achieves satisfactory accuracy. In addition, in most saliency detection methods based on an encoder-decoder architecture, low-level features come from shallow layers and contain rich spatial information that can highlight the boundaries of salient objects, while high-level features come from deep layers and are rich in semantic information such as the locations of salient objects. However, this information may be gradually diluted during repeated upsampling. To make full use of multi-scale features during decoding, previous saliency detection methods have designed various feature fusion strategies. Fusion strategies based on nested dense connections do improve the final detection accuracy, but the overly dense nested connections greatly increase the number of parameters and the computational load, resulting in poor runtime efficiency.
Summary of the Invention
To address the technical problem of the low computational efficiency of existing models, the present invention proposes a lightweight saliency detection method with online positioning, which reduces the model size and increases the running speed of the model while maintaining prediction quality.
The technical solution of the present invention is achieved as follows:
A lightweight saliency detection method with online positioning comprises the following steps:
Step 1: input the images of the DUTS-TR dataset into the encoder to extract feature maps;
Step 2: decode the feature maps with the decoder to obtain a prediction map;
Step 3: refine the prediction map with the residual refiner to obtain a saliency map;
Step 4: compute the loss value between the saliency map and the ground-truth map with the loss function, and judge whether the loss value is less than a threshold; if so, obtain the trained encoder-decoder network and go to Step 5; otherwise, automatically update the weight parameters of all layers of the encoder network and the decoder network according to the loss value and return to Step 1;
Step 5: obtain the image to be detected, input it into the trained encoder-decoder network, and output its saliency map (a training-loop sketch covering Steps 1 to 5 is given below).
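As a concrete illustration of Steps 1 to 5, the following is a minimal PyTorch-style training-loop sketch. The function and argument names (`train_saliency_model`, `backbone`, `refiner`, `criterion`), the optimizer, the learning rate, and the stopping threshold are assumptions introduced for illustration only; the patent does not prescribe them.

```python
# Hedged sketch of the training procedure in Steps 1-5 (names and hyperparameters are illustrative).
import torch
from torch.utils.data import DataLoader

def train_saliency_model(backbone: torch.nn.Module,    # encoder-decoder network (Steps 1-2)
                         refiner: torch.nn.Module,     # residual refiner (Step 3)
                         loader: DataLoader,           # DUTS-TR images and ground-truth maps
                         criterion,                    # hybrid BCE + IoU loss (Step 4)
                         loss_threshold: float = 0.05, # assumed stopping threshold
                         lr: float = 1e-3,
                         max_epochs: int = 200):
    params = list(backbone.parameters()) + list(refiner.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(max_epochs):
        running = 0.0
        for image, gt in loader:
            coarse = backbone(image)                # Step 2: prediction map from the decoder
            refined = coarse + refiner(coarse)      # Step 3: add the learned residual map
            loss = criterion(refined, gt)           # Step 4: loss between saliency map and ground truth
            optimizer.zero_grad()
            loss.backward()                         # gradients for all encoder and decoder layers
            optimizer.step()                        # automatic weight update
            running += loss.item()
        if running / len(loader) < loss_threshold:  # Step 4: stop once the loss falls below the threshold
            break
    return backbone, refiner                        # Step 5: run inference on new images
```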
The encoder network comprises an encode1 module, an encode2 module, an encode3 module, an encode4 module, and an encode5 module. The input of the encode1 module is connected to a first input layer whose input is a color image; the output of the encode1 module is connected to the input of the encode2 module, the output of the encode2 module to the input of the encode3 module, the output of the encode3 module to the input of the encode4 module, and the output of the encode4 module to the input of the encode5 module. The outputs of the encode1, encode2, encode3, encode4, and encode5 modules are connected to the decoder network through full-size skip connections.
The encode1 module consists of convolution layer I, batch normalization layer I, and activation layer I; convolution layer I has a 3×3 kernel, a stride of 2, and 16 channels. The encode2 module consists of convolution layer II, batch normalization layer II, activation layer II, convolution layer III, and batch normalization layer III; convolution layer II has a 3×3 kernel, a stride of 1, 16 input channels, and 32 output channels, and convolution layer III has a 1×1 kernel, a stride of 1, 32 input channels, and 32 output channels. The encode3 module consists of convolution layer IV, batch normalization layer IV, activation layer IV, convolution layer V, and batch normalization layer V; convolution layer IV has a 3×3 kernel, a stride of 2, 32 input channels, and 64 output channels, and convolution layer V has a 1×1 kernel, a stride of 1, 64 input channels, and 64 output channels. The encode4 module consists of convolution layer VI, batch normalization layer VI, activation layer VI, convolution layer VII, and batch normalization layer VII; convolution layer VI has a 3×3 kernel, a stride of 2, 64 input channels, and 96 output channels, and convolution layer VII has a 1×1 kernel, a stride of 1, 96 input channels, and 96 output channels. The encode5 module consists of convolution layer VIII, batch normalization layer VIII, activation layer VIII, convolution layer IX, and batch normalization layer IX; convolution layer VIII has a 3×3 kernel, a stride of 2, 96 input channels, and 128 output channels, and convolution layer IX has a 1×1 kernel, a stride of 1, 128 input channels, and 128 output channels.
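The layer specification above can be read as the following PyTorch sketch. It is a minimal interpretation under stated assumptions: the padding values, the absence of an activation after the 1×1 convolutions, and the class name `LightweightEncoder` are illustrative, and the multi-scale attention modules attached after each stage are omitted here.

```python
# Sketch of the five-stage lightweight encoder (channel widths 16-32-64-96-128).
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, k, stride):
    # convolution -> batch normalization -> ReLU activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv_bn(in_ch, out_ch, k, stride):
    # convolution -> batch normalization (no activation, as in layers III, V, VII, IX)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class LightweightEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode1 = conv_bn_act(3, 16, k=3, stride=2)                                   # layer I
        self.encode2 = nn.Sequential(conv_bn_act(16, 32, 3, 1), conv_bn(32, 32, 1, 1))     # layers II-III
        self.encode3 = nn.Sequential(conv_bn_act(32, 64, 3, 2), conv_bn(64, 64, 1, 1))     # layers IV-V
        self.encode4 = nn.Sequential(conv_bn_act(64, 96, 3, 2), conv_bn(96, 96, 1, 1))     # layers VI-VII
        self.encode5 = nn.Sequential(conv_bn_act(96, 128, 3, 2), conv_bn(128, 128, 1, 1))  # layers VIII-IX

    def forward(self, x):
        f1 = self.encode1(x)
        f2 = self.encode2(f1)
        f3 = self.encode3(f2)
        f4 = self.encode4(f3)
        f5 = self.encode5(f4)
        return f1, f2, f3, f4, f5   # all five scales feed the full-size skip connections
```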
A multi-scale attention module is connected after each of the encode1, encode2, encode3, encode4, and encode5 modules. The feature map of the q-th encode module is $I_q \in \mathbb{R}^{W_q \times H_q \times C_q}$, where q = 1, 2, 3, 4, 5, $W_q$ is the width of the q-th feature map, $H_q$ is its height, and $C_q$ is its number of channels. Dilated convolutions of different sizes are used to extract its multi-scale features:
$F_{i,q} = D_i(\mathrm{Conv}_i(I_q))$;
where i = 1, 2, 3 is the scale index, $\mathrm{Conv}_i(\cdot)$ denotes a group of convolution operations at the i-th scale, consisting of a regular convolution, batch normalization, and a ReLU nonlinear activation, $D_i(\cdot)$ denotes the dilated convolution at the i-th scale, and $F_{i,q}$ is the resulting feature at the i-th scale;
The information of all scales is integrated by element-wise addition:
$F_q = \sum_{i=1}^{3} F_{i,q}$;
where $F_q$ is the integrated feature map;
The integrated feature map $F_q$ is then processed with two attention mechanisms to obtain the attention-weighted feature map $S_q$, where Channel(·) is the channel attention operation, Spatial(·) is the spatial attention operation, ⊗ denotes element-wise multiplication, and Softmax(·) is the activation function;
dilated convolutions $\mathrm{Conv}_i(\cdot)$ of different sizes are used to extract the multi-scale features of $S_q$, yielding $S_{i,q}$;
$S_{i,q}$ is fused with $F_{i,q}$ to obtain the output feature map:
$A_q = \mathrm{fuse}(S_{i,q} \oplus F_{i,q})$, i = 1, 2, 3;
where fuse(·) is the multi-scale fusion operation, $A_q$ is the output feature map, and ⊕ denotes element-wise addition.
The feature maps are decoded with the decoder through full-size skip connections, where P is the output feature of the pyramid pooling module, R(·) is the resizing function, ⊕ denotes element-wise addition, C(·) denotes concatenation along the channel dimension, and the fused terms are the feature maps output by the encoder together with the feature maps produced by the decoder at the other stages.
The saliency map is obtained as:
$M_{refine} = M_{coarse} \oplus M_{residual}$;
where $M_{refine}$ is the saliency map, $M_{coarse}$ is the prediction map output by the decoder, and $M_{residual}$ is the residual map between the prediction map and the ground-truth map.
The loss function is:
$L_{total} = L_{bce} + L_{iou}$;
where $L_{total}$ is the total loss, $L_{bce}$ is the BCE loss, and $L_{iou}$ is the IOU loss.
The BCE loss function is:
$L_{bce} = -\sum_{i=1}^{n}\left[y_i'\log p_i' + (1 - y_i')\log(1 - p_i')\right]$;
where n is the number of pixels in the image, $y_i'$ is the pixel value of the manually annotated ground-truth map, and $p_i'$ is the pixel value of the predicted map.
The IOU loss function is:
$L_{iou} = 1 - \dfrac{\sum_{i=1}^{n} y_i' p_i'}{\sum_{i=1}^{n}\left(y_i' + p_i' - y_i' p_i'\right)}$.
Compared with the prior art, the present invention has the following beneficial effects:
1) At the end of the network, the present invention appends a residual refiner to further refine the prediction map generated by the backbone network, making its interior more uniform and its boundaries sharper.
2) The present invention combines the IOU loss with the BCE loss function, taking into account both pixel-level accuracy and the accuracy of the overall structure, to better drive the learning of the model.
Brief Description of the Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is an overall architecture diagram of the network of the present invention.
FIG. 2 is a diagram of the encoder architecture of the present invention.
FIG. 3 is a diagram of the multi-scale attention module of the present invention.
FIG. 4 is a diagram of the full-size skip connection module of the present invention.
FIG. 5 is a diagram of the residual refiner module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a lightweight saliency detection method with online positioning, comprising the following steps:
Step 1: input the images of the DUTS-TR dataset into the encoder to extract feature maps. The proposed model is evaluated on six commonly used benchmark datasets for saliency detection: SOD, ECSSD, DUT-OMRON, PASCAL-S, HKU-IS, and DUTS. In addition, to meet practical needs, the present invention provides a self-built dataset of pig-leg X-ray images containing 343 X-ray images of different parts of pig legs. Following the practice of many recent state-of-the-art saliency detection models, the model of the present invention is trained on the DUTS-TR dataset, and random flipping is used during training to improve the generalization ability of the model.
The encoder part of the proposed model is a lightweight custom feature extraction network. As shown in FIG. 2, the encoder network has five stages, each consisting of several convolution operations; each encoding stage downsamples the features and expands the channels. Since no conventional large feature extraction network is used, the present invention controls the number of feature channels in the encoder: each encoding stage expands the channels only by a limited amount instead of doubling them as in other networks. After the five stages of feature extraction, a feature map of size 7×7×128 is obtained.
The image is fed into the encoder network. The encoder network comprises the encode1, encode2, encode3, encode4, and encode5 modules; the input of the encode1 module is connected to the first input layer, whose input is a color image; the output of the encode1 module is connected to the input of the encode2 module, the output of the encode2 module to the input of the encode3 module, the output of the encode3 module to the input of the encode4 module, and the output of the encode4 module to the input of the encode5 module; the outputs of the encode1, encode2, encode3, encode4, and encode5 modules are connected to the decoder network through full-size skip connections.
The encode1 module consists of convolution layer I, batch normalization layer I, and activation layer I; as shown in Table 1, convolution layer I has a 3×3 kernel, a stride of 2, and 16 channels.
Table 1 Structure of the encode1 module
The encode2 module consists of convolution layer II, batch normalization layer II, activation layer II, convolution layer III, and batch normalization layer III; as shown in Table 2, convolution layer II has a 3×3 kernel, a stride of 1, 16 input channels, and 32 output channels, and convolution layer III has a 1×1 kernel, a stride of 1, 32 input channels, and 32 output channels.
Table 2 Structure of the encode2 module
The encode3 module consists of convolution layer IV, batch normalization layer IV, activation layer IV, convolution layer V, and batch normalization layer V; as shown in Table 3, convolution layer IV has a 3×3 kernel, a stride of 2, 32 input channels, and 64 output channels, and convolution layer V has a 1×1 kernel, a stride of 1, 64 input channels, and 64 output channels.
Table 3 Structure of the encode3 module
The encode4 module consists of convolution layer VI, batch normalization layer VI, activation layer VI, convolution layer VII, and batch normalization layer VII; as shown in Table 4, convolution layer VI has a 3×3 kernel, a stride of 2, 64 input channels, and 96 output channels, and convolution layer VII has a 1×1 kernel, a stride of 1, 96 input channels, and 96 output channels.
Table 4 Structure of the encode4 module
The encode5 module consists of convolution layer VIII, batch normalization layer VIII, activation layer VIII, convolution layer IX, and batch normalization layer IX; as shown in Table 5, convolution layer VIII has a 3×3 kernel, a stride of 2, 96 input channels, and 128 output channels, and convolution layer IX has a 1×1 kernel, a stride of 1, 128 input channels, and 128 output channels.
Table 5 Structure of the encode5 module
Unlike a general encoder, the encoder network proposed in the present invention introduces a multi-scale attention module at each stage to refine the features. The attention mechanism plays a critical role in human cognition: unlike a computer, which can process a whole image at once, the human visual system adaptively filters out relatively unimportant information such as the background. Channel attention can explicitly exploit the relationships among feature channels and adaptively recalibrate the feature map channel by channel. On top of channel-wise attention, Woo et al. proposed the concept of spatial attention. Both channel attention and spatial attention belong to the category of self-attention; spatial and channel self-attention can adaptively emphasize the most informative feature regions and channels, respectively. The multi-scale attention module in the model of the present invention uses both attention mechanisms and adaptively adjusts the information flow of the different branches based on channel-wise dependencies and spatial context cues, as shown in FIG. 3. The multi-scale attention module can therefore extract as many effective features as possible within a lightweight network.
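For concreteness, the channel and spatial attention operations referred to here can be sketched as follows. This is a generic CBAM-style formulation in the spirit of Woo et al., given as an assumption for illustration; the patent does not specify the exact internal design of the two attention operations.

```python
# Hedged sketch of Channel(·) and Spatial(·) attention (CBAM-style; internal design is assumed).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeezes the spatial dimensions and produces one weight per channel.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)       # global average pooling
        mx = torch.amax(x, dim=(2, 3), keepdim=True)         # global max pooling
        return torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # per-channel weights in [0, 1]

class SpatialAttention(nn.Module):
    # Squeezes the channel dimension and produces one weight per spatial location.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # per-pixel weights in [0, 1]
```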
At the end of each encoding stage, a multi-scale attention module (MAM) further processes the feature map; that is, a multi-scale attention module is connected after each of the encode1, encode2, encode3, encode4, and encode5 modules. The feature map of the q-th encode module is $I_q \in \mathbb{R}^{W_q \times H_q \times C_q}$, where q = 1, 2, 3, 4, 5, $W_q$ is the width of the q-th feature map, $H_q$ is its height, and $C_q$ is its number of channels. Dilated convolutions of different sizes are used to extract its multi-scale features:
$F_{i,q} = D_i(\mathrm{Conv}_i(I_q))$;
where i = 1, 2, 3 is the scale index, $\mathrm{Conv}_i(\cdot)$ denotes a group of convolution operations at the i-th scale, consisting of a regular convolution, batch normalization, and a ReLU nonlinear activation, $D_i(\cdot)$ denotes the dilated convolution at the i-th scale, and $F_{i,q}$ is the resulting feature at the i-th scale.
The information of all scales is integrated by element-wise addition:
$F_q = \sum_{i=1}^{3} F_{i,q}$;
where $F_q$ is the integrated feature map.
Because of the independence of the channels, the attention mechanism has a strong dependence on channels: the features in different channels are computed by independent filters. In this case, if a feature from a particular channel has a positive influence on the final prediction, then the features of the same channel in the same branch also provide valuable information. The attention mechanism likewise requires spatial dependence, because SOD, as a mid-level task, needs a certain degree of reasoning over the neighboring pixels of each pixel in a given layer. Therefore, the integrated multi-scale information is processed with two attention mechanisms to obtain the attention-weighted feature map $S_q$, where Channel(·) is the channel attention operation, Spatial(·) is the spatial attention operation, ⊗ denotes element-wise multiplication, and Softmax(·) is the activation function. Dilated convolutions $\mathrm{Conv}_i(\cdot)$ of different sizes are then used to extract the multi-scale features of $S_q$, yielding $S_{i,q}$.
$S_{i,q}$ is fused with $F_{i,q}$ to obtain the output feature map:
$A_q = \mathrm{fuse}(S_{i,q} \oplus F_{i,q})$, i = 1, 2, 3;
where fuse(·) is the multi-scale fusion operation, $A_q$ is the output feature map, and ⊕ denotes element-wise addition.
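Putting the pieces together, a hedged sketch of the multi-scale attention module could look as follows. The dilation rates, the way the two attention maps are combined with $F_q$, and the implementation of fuse(·) as a 1×1 convolution are assumptions made for illustration only; `ChannelAttention` and `SpatialAttention` are the sketches given above.

```python
# Hedged sketch of the multi-scale attention module (MAM); dilation rates and fuse(·) are assumed.
# ChannelAttention and SpatialAttention are defined in the previous sketch.
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):      # dilation rates are an assumption
        super().__init__()
        def branch(d):
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # Conv_i: conv + BN + ReLU
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),  # D_i: dilated convolution
            )
        self.branches_f = nn.ModuleList([branch(d) for d in dilations])  # produce F_{i,q}
        self.branches_s = nn.ModuleList([branch(d) for d in dilations])  # produce S_{i,q}
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()
        self.fuse = nn.Conv2d(channels, channels, 1)                      # fuse(·): 1x1 conv (assumed)

    def forward(self, iq):
        f_iq = [b(iq) for b in self.branches_f]          # F_{i,q} = D_i(Conv_i(I_q))
        fq = sum(f_iq)                                   # F_q: element-wise sum over the scales
        sq = fq * self.channel_att(fq) * self.spatial_att(fq)  # S_q via both attentions (assumed combination)
        s_iq = [b(sq) for b in self.branches_s]          # S_{i,q}: multi-scale features of S_q
        aq = sum(s + f for s, f in zip(s_iq, f_iq))      # S_{i,q} ⊕ F_{i,q}, summed over scales
        return self.fuse(aq)                             # A_q = fuse(...)
```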
Step 2: decode the feature maps with the decoder to obtain a prediction map. The overall structure of the proposed decoder network is symmetric to that of the encoder. The feature information from the encoder enters the decoder network after passing through a pyramid pooling module. To make better use of the extracted features in the decoding stage, most U-Net-based models adopt densely nested connections of various forms for feature fusion; however, this greatly increases the number of parameters and the amount of computation. Inspired by UNet3+, the present invention designs a lightweight full-size skip connection module (SCM). Each decoder layer receives the smaller-scale and same-scale feature maps from the encoder as well as the larger-scale feature maps from the decoder, thereby fully capturing both fine-grained and coarse-grained semantic information.
FIG. 4 illustrates how the full-size skip connection is constructed. As in the UNet network, the feature map of each decoding stage first directly receives the feature map of the same level from the encoding stage. The difference is that the skip connection does not only use the same-level encoder features in the fusion: it also takes the encoder features whose scales are smaller than its own and the decoder features whose scales are larger than its own. Because the resolutions and channel numbers of these inputs differ from those of the target feature, they are upsampled, downsampled, or channel-adjusted accordingly before fusion. The feature map of each decoding stage is computed from these inputs, where P is the output feature of the pyramid pooling module, R(·) is the resizing function, which adjusts the size and number of channels according to the input, ⊕ denotes element-wise addition, C(·) denotes concatenation along the channel dimension, and the remaining terms are the feature maps output by the encoder and the feature maps produced by the decoder.
Because the model needs to remain lightweight as a whole, the present invention abandons full-channel concatenation in the channel-dimension concatenation operation. For the four feature maps participating in the fusion, their channel numbers are reduced to 1/4 before the channel-dimension concatenation, which avoids a multiplicative growth of the number of parameters during fusion. To minimize the accuracy loss caused by this change, at the end of each stage the present invention uses the idea of residuals and adds the fusion result element-wise to the output of the previous stage.
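A hedged sketch of one decoding stage with this lightweight full-size skip connection is given below. It assumes four fusion inputs per stage, 1×1 convolutions plus bilinear resizing as R(·), and a 1×1-adjusted previous decoder output for the residual addition; the exact set of inputs per stage follows FIG. 4 and is not fully fixed by the text, so the class name and details are illustrative.

```python
# Hedged sketch of one stage of the full-size skip connection (SCM); details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullSizeSkipFusion(nn.Module):
    def __init__(self, out_channels, in_channels_list, prev_channels):
        """out_channels must be divisible by 4; in_channels_list lists the four fusion inputs."""
        super().__init__()
        # R(·): reduce every fusion input to a quarter of the target channel count.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, out_channels // 4, kernel_size=1) for c in in_channels_list]
        )
        self.reduce_prev = nn.Conv2d(prev_channels, out_channels, kernel_size=1)
        self.post = nn.Sequential(                      # refine the concatenated features
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, inputs, prev_decoder_out, target_size):
        """inputs: four feature maps (same/smaller-scale encoder features, larger-scale decoder features)."""
        resized = [
            F.interpolate(r(x), size=target_size, mode='bilinear', align_corners=False)
            for r, x in zip(self.reduce, inputs)        # R(·): channel adjustment + resizing
        ]
        fused = self.post(torch.cat(resized, dim=1))    # C(·): concatenation along the channel dimension
        prev = F.interpolate(self.reduce_prev(prev_decoder_out), size=target_size,
                             mode='bilinear', align_corners=False)
        return fused + prev                             # residual: add the previous stage's output
```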
Step 3: refine the prediction map with the residual refiner to obtain a saliency map. The proposed refiner module (RFM) optimizes the final output of the model by learning the residual $M_{residual}$ between the prediction map $M_{coarse}$ output by the decoder and the ground truth.
To repair the incomplete regions and blurred boundaries in the saliency map, and inspired by Qin et al., the present invention designs a lightweight residual refiner. The refiner adopts a residual encoder-decoder architecture, whose structure is shown in FIG. 5. Its main architecture is similar to the prediction module of the present invention but simpler: it contains an encoder, a decoder, and a residual output layer. Both the encoder and the decoder have five stages, and each stage has only one convolution layer. The layers have 4, 8, 16, 24, and 36 filters of size 3×3 in turn, each followed by batch normalization and a ReLU activation. The encoder performs five downsampling operations and the decoder performs five upsampling operations. At the end of the refiner, the learned residual is added element-wise to the input prediction map to produce the final saliency map:
$M_{refine} = M_{coarse} \oplus M_{residual}$;
where $M_{refine}$ is the saliency map, $M_{coarse}$ is the prediction map output by the decoder, and $M_{residual}$ is the residual map between the prediction map and the ground-truth map.
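A hedged sketch of the residual refiner is given below, following the stated five one-convolution stages with 4, 8, 16, 24, and 36 filters of size 3×3 on the encoder side. The pooling/upsampling operators, the single-channel input and output, and the channel widths of the decoder side are assumptions; FIG. 5 fixes the exact structure.

```python
# Hedged sketch of the residual refiner (RFM); down/upsampling operators and decoder widths are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualRefiner(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(                       # one 3x3 conv per stage + BN + ReLU
                nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        enc_channels = [1, 4, 8, 16, 24, 36]            # refiner input is the 1-channel coarse map
        dec_channels = [36, 24, 16, 8, 4, 4]            # decoder widths mirrored (last width assumed)
        self.encoder = nn.ModuleList([block(enc_channels[i], enc_channels[i + 1]) for i in range(5)])
        self.decoder = nn.ModuleList([block(dec_channels[i], dec_channels[i + 1]) for i in range(5)])
        self.out = nn.Conv2d(dec_channels[-1], 1, 3, padding=1)   # residual output layer

    def forward(self, coarse):
        x, sizes = coarse, []
        for enc in self.encoder:                        # five downsampling stages
            x = enc(x)
            sizes.append(x.shape[-2:])
            x = F.max_pool2d(x, 2)
        for dec in self.decoder:                        # five upsampling stages
            x = dec(F.interpolate(x, size=sizes.pop(), mode='bilinear', align_corners=False))
        return self.out(x)                              # M_residual; the caller adds it to M_coarse
```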
Step 4: compute the loss value between the saliency map and the ground-truth map with the loss function, and judge whether the loss value is less than the threshold; if so, obtain the trained encoder-decoder network and go to Step 5; otherwise, automatically update the weight parameters of all layers of the encoder network and the decoder network according to the loss value and return to Step 1.
Most previous saliency detection networks use the binary cross-entropy (BCE) loss as the loss function. However, the BCE loss only considers the loss of each pixel individually and cannot optimize the salient object as a whole. The present invention therefore designs a hybrid loss function, defined as:
$L_{total} = L_{bce} + L_{iou}$;
where $L_{total}$ is the total loss, $L_{bce}$ is the BCE loss, and $L_{iou}$ is the IOU loss.
The BCE loss is one of the most commonly used loss functions in classification and segmentation tasks. It is defined as:
$L_{bce} = -\sum_{i=1}^{n}\left[y_i'\log p_i' + (1 - y_i')\log(1 - p_i')\right]$;
where n is the number of pixels in the image, $y_i'$ is the pixel value of the manually annotated ground-truth map, and $p_i'$ is the pixel value of the predicted map.
Intersection over Union (IOU) was originally used to measure the similarity of two sets and has since often been used as an evaluation metric for object detection and semantic segmentation tasks. The corresponding loss is defined as:
$L_{iou} = 1 - \dfrac{\sum_{i=1}^{n} y_i' p_i'}{\sum_{i=1}^{n}\left(y_i' + p_i' - y_i' p_i'\right)}$.
The IOU loss attends to the original structure of the whole image from a global perspective. The present invention combines the IOU loss with the BCE loss function, taking into account both pixel-level accuracy and the accuracy of the overall structure, to better drive the learning of the model.
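A minimal sketch of the hybrid loss is given below, assuming predicted and ground-truth maps normalized to [0, 1] and a per-image soft-IoU formulation averaged over the batch; the function name is illustrative. In the training-loop sketch given earlier, this function would be passed as `criterion`.

```python
# Hedged sketch of the hybrid loss L_total = L_bce + L_iou (averaging scheme is assumed).
import torch
import torch.nn.functional as F

def hybrid_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """pred, gt: saliency and ground-truth maps of shape (batch, 1, H, W) with values in [0, 1]."""
    # Pixel-wise binary cross-entropy: -[y*log(p) + (1 - y)*log(1 - p)], averaged over pixels.
    l_bce = F.binary_cross_entropy(pred, gt, reduction='mean')

    # Soft IoU loss: 1 - |P ∩ G| / |P ∪ G|, computed per image and then averaged.
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    l_iou = (1.0 - inter / (union + eps)).mean()

    return l_bce + l_iou
```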
Step 5: obtain the image to be detected, input it into the trained encoder-decoder network, and output the saliency map of the image to be detected.
The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.