Technical Field
The invention belongs to the field of image processing and relates to a salient object detection method for RGB-T images, in particular to an RGB-T image salient object detection method based on multi-level deep feature fusion, which can be used as an image preprocessing step in computer vision.
Background Art
Salient object detection aims to detect and segment the salient object regions in an image by means of a model or algorithm. As an image preprocessing step, salient object detection plays a vital role in vision tasks such as visual tracking, image recognition, image compression, and image fusion.
Existing salient object detection methods fall into two categories: traditional methods and deep-learning-based methods. Traditional salient object detection algorithms predict saliency from hand-crafted features such as color, texture, and orientation; they depend heavily on manually selected features, adapt poorly to different scenes, and perform poorly on complex datasets. With the wide application of deep learning, research on deep-learning-based salient object detection has made breakthrough progress, and detection performance has improved markedly compared with traditional saliency algorithms.
Most salient object detection methods, such as "Q. Hou, M. M. Cheng, X. Hu, et al. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4):815–828.", compute saliency values only from single-modality RGB images, so the scene information they obtain is limited; in challenging scenes with low illumination, low contrast, or cluttered backgrounds, it is difficult for them to detect salient objects completely and consistently.
To address this problem, salient object detection methods based on RGB-T images have been proposed. For example, "Li C, Wang G, Ma Y, et al. A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach. arXiv preprint arXiv:1701.02829, 2017." discloses an RGB-T image salient object detection method based on manifold ranking: it exploits the complementary information of RGB and thermal infrared images, builds a cross-modality-consistent manifold ranking model, and computes the saliency value of each node with a two-stage graph. Under low illumination and low contrast it detects salient objects more accurately than salient object detection methods that take only RGB images as input.
However, this method takes region blocks as the basic detection unit, so obvious block artifacts appear in the saliency map, the segmentation boundary between object and background is inaccurate, and the interior of the object is non-uniform. In addition, the method is built on hand-crafted features, which cannot fully express the intrinsic characteristics of different images; the complementary information between images of different modalities is not fully exploited, and the improvement in detection performance is limited.
Summary of the Invention
Object of the invention: In view of the above deficiencies of the prior art, the object of the present invention is to propose an RGB-T image salient object detection method based on multi-level deep feature fusion, so as to improve the completeness and consistency of salient object detection in images of complex and changeable scenes. It mainly solves the problem that the prior art cannot detect salient objects completely and consistently in such scenes.
The key to realizing the present invention is multi-level deep feature extraction and fusion for RGB-T images: saliency is predicted by fusing the multi-level single-modality features extracted from the RGB and thermal infrared images. For an RGB or thermal infrared image, coarse multi-level features are extracted from different depths of the backbone network; adjacent-depth feature fusion modules are built to extract improved multi-level single-modality features; multi-branch group fusion modules are built to fuse the features of different modalities; fused output feature maps are obtained; the network is trained to obtain the model parameters; and the pixel-level saliency map of the RGB-T image is predicted.
Technical solution: An RGB-T image salient object detection method based on multi-level deep feature fusion comprises the following steps:
(1) Extract coarse multi-level features from the input image:
Extract the 5 levels of features located at different depths of the base network as coarse single-modality features of the image;
(2) Build adjacent-depth feature fusion modules to improve the single-modality features:
Build multiple adjacent-depth feature fusion modules, process the 5-level coarse single-modality features obtained in step (1) through these modules, and fuse the features coming from adjacent depths at 3 levels to obtain improved 3-level single-modality features;
(3) Build multi-branch group fusion modules to fuse multi-modality features:
Build a multi-branch group fusion module containing two fusion branches, and fuse the different single-modality features located at the same feature level among the improved 3-level single-modality features obtained in step (2) to obtain fused multi-modality features;
(4) Obtain the fused output feature map:
Fuse the different levels of the fused multi-modality features obtained in step (3) in reverse, level by level, to obtain multiple side-output feature maps, and fuse all side-output feature maps to obtain the fused output feature map;
(5) Train the algorithm network:
On the training dataset, apply a deep supervision learning mechanism to the side-output feature maps and the fused output feature map obtained in step (4), and complete the training of the algorithm network by minimizing a cross-entropy loss function to obtain the network model parameters;
(6) Predict the pixel-level saliency map of the RGB-T image:
On the test dataset, using the network model parameters obtained in step (5), apply sigmoid classification to the side-output feature maps and the fused output feature map obtained in step (4) to predict the pixel-level saliency map of the RGB-T image.
Further, the image described in step (1) is an RGB image or a thermal infrared image.
Further, the base network in step (1) is the VGG16 network.
Still further, building the adjacent-depth feature fusion modules described in step (2) includes the following steps:
(21) Denote the 5-level coarse single-modality features obtained in step (1) by $F_n^{k}$, $k=1,\ldots,5$, where n=1 or 2 denotes the RGB image or the thermal infrared image, respectively;
(22) Each adjacent-depth fusion module contains 3 convolution operations and 1 deconvolution operation, and is used to obtain the d-th level single-modality feature, d=1,2,3.
Still further, step (22) includes:
(221) Apply a convolution with a 3×3 kernel, stride 2 and parameters $\theta_{n,d}^{1}$, a convolution with a 1×1 kernel, stride 1 and parameters $\theta_{n,d}^{2}$, and a deconvolution with a 2×2 kernel, stride 1/2 and parameters $\gamma_{n,d}$ to the adjacent features $F_n^{d}$, $F_n^{d+1}$ and $F_n^{d+2}$, respectively;
(222) Concatenate these 3 levels of features and pass them through a convolution with a 1×1 kernel, stride 1 and parameters $\theta_{n,d}^{3}$ to obtain the 128-channel d-th level single-modality feature $R_n^{d}$. The adjacent-depth fusion module can be expressed as follows:
$$R_n^{d}=\varphi\Big(C\Big(\mathrm{Cat}\big(C(F_n^{d};\theta_{n,d}^{1},2),\,C(F_n^{d+1};\theta_{n,d}^{2},1),\,D(F_n^{d+2};\gamma_{n,d},1/2)\big);\theta_{n,d}^{3},1\Big)\Big)$$
where:
Cat(·) denotes the cross-channel concatenation operation, and C(*;θ,s) and D(*;γ,s) denote convolution and deconvolution operations with parameters θ or γ and stride s;
φ(·) is a ReLU activation function.
Further, the multi-branch group fusion module in step (3) fuses the different single modalities at the same feature level and includes two fusion branches, a multi-group fusion branch and a single-group fusion branch, wherein:
the multi-group fusion branch has 8 groups, while the single-group fusion branch has only one group;
each fusion branch outputs 64-channel features, and the output features of the two fusion branches are concatenated to obtain 128-channel multi-modality features.
Still further, building the multi-branch group fusion module described in step (3) and fusing the different single modalities at the same feature level in the multi-group fusion branch to obtain fused multi-modality features includes the following steps:
(31) The input single-modality features $R_1^{d}$ and $R_2^{d}$ are each split along the channel dimension into M groups with the same number of channels, giving the two feature sets $\{R_1^{d,m}\}_{m=1}^{M}$ and $\{R_2^{d,m}\}_{m=1}^{M}$, where:
M is a positive integer in the range 2≤M≤128;
(32) Next, the corresponding RGB and thermal infrared features from the m-th group of the two same-level feature sets are combined by a concatenation operation and then passed through two stacked convolutions, a 1×1 convolution with 64/M channels and a 3×3 convolution with 64/M channels, to fuse the cross-modality features within the group, each convolution operation being followed by a ReLU activation function;
(33) The outputs of the M groups are concatenated to obtain the output feature $H_{1,d}$ of the multi-group fusion branch, expressed as:
$$H_{1,d}=\mathrm{Cat}\big(G(\mathrm{Cat}(R_1^{d,1},R_2^{d,1});\omega_d^{1}),\,\ldots,\,G(\mathrm{Cat}(R_1^{d,M},R_2^{d,M});\omega_d^{M})\big)$$
where:
$G(\cdot;\omega_d^{m})$ denotes the above stacked convolution operation with ReLU activation functions,
$\omega_d^{m}$ denotes the fusion parameters of the m-th group.
Still further, building the multi-branch group fusion module described in step (3) and fusing the different single modalities at the same feature level in the single-group fusion branch to obtain fused multi-modality features includes the following steps:
(3a) The single-group fusion branch can be regarded as the special case of the multi-group fusion branch with M=1, expressed as:
$$H_{2,d}=G\big(\mathrm{Cat}(R_1^{d},R_2^{d});\omega_d^{0}\big)$$
where:
$H_{2,d}$ is the d-th level fused feature output of the single-group fusion branch;
$G(\cdot;\omega_d^{0})$ contains two stacked convolutions, a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, each convolution operation being followed by a ReLU activation function;
$\omega_d^{0}$ denotes the fusion parameters of the single-group fusion branch;
(3b) The d-th level multi-branch group fusion feature $H_d$ is obtained by simply concatenating $H_{1,d}$ and $H_{2,d}$, expressed as:
$$H_d=\mathrm{Cat}(H_{1,d},H_{2,d}).$$
Beneficial effects: compared with the prior art, the RGB-T image salient object detection method based on multi-level deep feature fusion disclosed by the present invention has the following beneficial effects:
1) No manual feature design and extraction are required, and end-to-end pixel-level detection of RGB-T images is realized; the simulation results show that the present invention detects salient objects in images of complex and changeable scenes with a more complete and consistent effect.
2) The present invention improves the 5-level coarse single-modality features extracted from the backbone network by building multiple adjacent-depth feature fusion modules, obtaining 3-level single-modality features; these effectively capture the low-level details and high-level semantic information of the input image, while avoiding the sharp increase in overall network parameters caused by too many feature levels and thus reducing the difficulty of network training.
3) The present invention fuses features of different modalities by building a multi-branch group fusion module containing two fusion branches: the single-group fusion structure captures the cross-channel correlations among all the features of the different modalities from the RGB image and the thermal infrared image, while more salient features are extracted in the multi-group fusion branch, so cross-modality information from RGB and thermal infrared images can be captured effectively, which helps to detect more complete and consistent objects; at the same time, the fusion module requires fewer training parameters, which improves the detection speed of the algorithm.
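A rough weight count illustrates the parameter savings brought by the grouped branch, using the channel settings stated above (128-channel inputs per modality, 64-channel branch outputs, M=8) and ignoring biases; the snippet below is only an illustrative calculation, not part of the disclosed method:

```python
def conv_weights(c_in, c_out, k):
    """Number of weights of a k x k convolution, biases ignored."""
    return c_in * c_out * k * k

# Single-group branch: Cat(128 + 128) = 256 channels -> 64 (1x1 conv), then 64 -> 64 (3x3 conv).
single_group = conv_weights(256, 64, 1) + conv_weights(64, 64, 3)

# Multi-group branch with M = 8: each group fuses 16 + 16 = 32 channels -> 8 (1x1), then 8 -> 8 (3x3).
M = 8
per_group = conv_weights(32, 64 // M, 1) + conv_weights(64 // M, 64 // M, 3)
multi_group = M * per_group

print(single_group, multi_group)  # 53248 vs 6656: the grouped branch needs 8x fewer weights
```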
Description of the Drawings
Fig. 1 is a flowchart of the implementation of the RGB-T image salient object detection method based on multi-level deep feature fusion disclosed by the present invention;
Fig. 2 is a comparison of the simulation results of the present invention and the prior art on the RGB-thermal database;
Fig. 3a and Fig. 3b are simulation comparisons of the present invention and the prior art on the RGB-thermal database under two evaluation metrics, the P-R curve and the F-measure curve.
Detailed Description of the Embodiments
Specific embodiments of the present invention are described in detail below.
Referring to Fig. 1, the RGB-T image salient object detection method based on multi-level deep feature fusion includes the following steps:
Step 1) Extract coarse multi-level features from the input image:
For an RGB image or a thermal infrared image, extract the 5 levels of features located at different depths of the VGG16 network as coarse single-modality features, namely:
Conv1-2 (denoted $F_n^{1}$, containing 64 feature maps of size 256×256);
Conv2-2 (denoted $F_n^{2}$, containing 128 feature maps of size 128×128);
Conv3-3 (denoted $F_n^{3}$, containing 256 feature maps of size 64×64);
Conv4-3 (denoted $F_n^{4}$, containing 512 feature maps of size 32×32);
Conv5-3 (denoted $F_n^{5}$, containing 512 feature maps of size 16×16);
where n=1 or 2: n=1 denotes the RGB image and n=2 denotes the thermal infrared image;
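As a minimal sketch of this step (the patent implements the network in Caffe, so the PyTorch/torchvision backbone, the pretrained-weights argument and the layer indices below are illustrative assumptions, not the disclosed implementation), the five tap points can be read off a standard VGG16 feature extractor:

```python
import torch
import torchvision

class VGG16MultiLevel(torch.nn.Module):
    """Returns the 5 coarse single-modality features F^1..F^5 (conv1_2, conv2_2, conv3_3, conv4_3, conv5_3)."""
    def __init__(self):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
        # Indices of the ReLU outputs following conv1_2, conv2_2, conv3_3, conv4_3, conv5_3.
        self.tap_points = {3, 8, 15, 22, 29}

    def forward(self, x):
        taps = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.tap_points:
                taps.append(x)
        # For a 256x256 input: 64x256x256, 128x128x128, 256x64x64, 512x32x32, 512x16x16.
        return taps

extractor_rgb = VGG16MultiLevel()
extractor_t = VGG16MultiLevel()
F_rgb = extractor_rgb(torch.randn(1, 3, 256, 256))  # n = 1: RGB image
F_t = extractor_t(torch.randn(1, 3, 256, 256))      # n = 2: thermal image (replicated to 3 channels, an assumption)
```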
Step 2) Build adjacent-depth feature fusion modules to improve the single-modality features:
Common multi-modal vision methods use the five levels of features directly as single-modality features; with so many feature levels this leads to a huge number of network parameters and makes network training harder. The present invention instead treats the 5 levels of features at different depths as coarse single-modality features and, by building multiple adjacent-depth feature fusion modules, obtains 3 levels of improved RGB image features or thermal infrared image features;
Each adjacent-depth fusion module contains 3 convolution operations and 1 deconvolution operation. Specifically, to obtain the d-th level single-modality feature, d=1,2,3, a convolution with a 3×3 kernel, stride 2 and parameters $\theta_{n,d}^{1}$, a convolution with a 1×1 kernel, stride 1 and parameters $\theta_{n,d}^{2}$, and a deconvolution with a 2×2 kernel, stride 1/2 and parameters $\gamma_{n,d}$ are first applied to the adjacent features $F_n^{d}$, $F_n^{d+1}$ and $F_n^{d+2}$, respectively, to ensure that the 3 adjacent levels of features from the backbone network have the same spatial resolution and the same number of feature channels (128 channels in the present invention); these 3 levels of features are then concatenated and passed through a convolution layer with a 1×1 kernel, stride 1 and parameters $\theta_{n,d}^{3}$ to obtain the 128-channel d-th level single-modality feature $R_n^{d}$. The adjacent-depth fusion module can be expressed as follows:
$$R_n^{d}=\varphi\Big(C\Big(\mathrm{Cat}\big(C(F_n^{d};\theta_{n,d}^{1},2),\,C(F_n^{d+1};\theta_{n,d}^{2},1),\,D(F_n^{d+2};\gamma_{n,d},1/2)\big);\theta_{n,d}^{3},1\Big)\Big)$$
where Cat(·) denotes the cross-channel concatenation operation, C(*;θ,s) and D(*;γ,s) denote convolution and deconvolution with parameters θ or γ and stride s, and φ(·) is a ReLU activation function;
As shown above, the d-th level RGB or thermal infrared single-modality feature $R_n^{d}$ simultaneously contains feature information from 3 levels of the backbone network, namely $F_n^{d}$ together with its adjacent deep features $F_n^{d+1}$ and $F_n^{d+2}$. This indicates that $R_n^{d}$ contains richer detail and semantic information, which helps to identify the object accurately. In addition, compared with simply merging $F_n^{d}$, $F_n^{d+1}$ and $F_n^{d+2}$, the feature $R_n^{d}$ carries more compact data: through adjacent-depth feature fusion, the redundant information in the coarsely extracted features is compressed in the improved features;
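A minimal PyTorch-style sketch of one adjacent-depth fusion module follows; the per-branch width of 128 channels before concatenation, the padding choices, and the module and parameter names are assumptions made for illustration, consistent with the kernel sizes, strides and 128-channel output stated above:

```python
import torch
import torch.nn as nn

class AdjacentDepthFusion(nn.Module):
    """Fuses the adjacent features F^d, F^(d+1), F^(d+2) into the improved 128-channel feature R^d."""
    def __init__(self, c_d, c_d1, c_d2, out_channels=128):
        super().__init__()
        self.down = nn.Conv2d(c_d, out_channels, kernel_size=3, stride=2, padding=1)    # 3x3, stride 2
        self.keep = nn.Conv2d(c_d1, out_channels, kernel_size=1, stride=1)              # 1x1, stride 1
        self.up = nn.ConvTranspose2d(c_d2, out_channels, kernel_size=2, stride=2)       # 2x2, stride 1/2
        self.fuse = nn.Conv2d(3 * out_channels, out_channels, kernel_size=1, stride=1)  # 1x1 after Cat
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f_d, f_d1, f_d2):
        aligned = torch.cat([self.down(f_d), self.keep(f_d1), self.up(f_d2)], dim=1)
        return self.relu(self.fuse(aligned))  # R^d: 128 channels at the resolution of F^(d+1)

# d = 1: fuse Conv1-2 (64 ch), Conv2-2 (128 ch) and Conv3-3 (256 ch) into a 128 x 128 x 128 feature.
adf1 = AdjacentDepthFusion(64, 128, 256)
r1 = adf1(torch.randn(1, 64, 256, 256), torch.randn(1, 128, 128, 128), torch.randn(1, 256, 64, 64))
```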
Step 3) Build multi-branch group fusion modules to fuse multi-modality features:
The multi-branch group fusion module fuses the different single-modality features at the same feature level and contains two fusion branches, wherein:
the first fusion branch (also called the multi-group fusion branch) has M groups (M=8 in this embodiment); it mainly amplifies the role of each channel and reduces the number of network parameters;
the second fusion branch (also called the single-group fusion branch) has only one group; its main role is to fully capture the cross-channel correlations among all the input features of the different modalities. The two branches output features with the same number of channels (64 channels in this embodiment); therefore, the number of output feature channels of the multi-branch group fusion module is twice that of each fusion branch, and at the same time equals the number of channels of the RGB or thermal infrared image features input to the module (128 channels in this embodiment);
The multi-group fusion branch is built following the basic "split-transform-merge" idea. In the multi-group fusion branch, the input single-modality features $R_1^{d}$ and $R_2^{d}$ are each split along the channel dimension into M groups with the same number of channels (128/M), giving the two feature sets $\{R_1^{d,m}\}_{m=1}^{M}$ and $\{R_2^{d,m}\}_{m=1}^{M}$. Next, the corresponding RGB and thermal infrared features from the m-th group of the two same-level feature sets are combined by a concatenation operation and then passed through two stacked convolutions, a 1×1 convolution with 64/M channels and a 3×3 convolution with 64/M channels, to fuse the cross-modality features within the group; the first 1×1 convolution mainly reduces the number of feature channels, the second convolution mainly fuses the features, and each convolution operation is followed by a ReLU activation function. Finally, the outputs of the M groups are concatenated to obtain the output feature $H_{1,d}$ of the multi-group fusion branch, expressed as:
$$H_{1,d}=\mathrm{Cat}\big(G(\mathrm{Cat}(R_1^{d,1},R_2^{d,1});\omega_d^{1}),\,\ldots,\,G(\mathrm{Cat}(R_1^{d,M},R_2^{d,M});\omega_d^{M})\big)$$
where $G(\cdot;\omega_d^{m})$ denotes the above stacked convolution operation with ReLU activation functions and $\omega_d^{m}$ denotes the fusion parameters of the m-th group;
The single-group fusion branch can be regarded as the special case of the multi-group fusion branch with M=1, expressed as:
$$H_{2,d}=G\big(\mathrm{Cat}(R_1^{d},R_2^{d});\omega_d^{0}\big)$$
where $H_{2,d}$ is the d-th level fused feature output of the single-group fusion branch; $G(\cdot;\omega_d^{0})$ contains two stacked convolutions, a 1×1 convolution with 64 channels and a 3×3 convolution with 64 channels, which together fully capture the correlation information among all the input multi-modality features, each convolution operation being followed by a ReLU activation function; and $\omega_d^{0}$ denotes the fusion parameters of the single-group fusion branch;
Finally, after the multi-group fusion branch and the single-group fusion branch, the d-th level multi-branch group fusion feature $H_d$ is obtained by simply concatenating $H_{1,d}$ and $H_{2,d}$, expressed as:
$$H_d=\mathrm{Cat}(H_{1,d},H_{2,d})$$
As stated above, the multi-branch group fusion module captures, through the single-group fusion structure, the cross-channel correlations among all the features of the different modalities from the RGB image and the thermal infrared image, and at the same time extracts more salient features from the multi-group fusion branch. Therefore, through multiple multi-branch group fusion modules, multi-level fusion features based on multiple modalities are extracted; compared with commonly used fusion methods, they capture the cross-modality information of RGB and thermal infrared images more effectively and help to detect more complete and consistent objects. Owing to the idea of group convolution, the multi-branch group fusion module also requires fewer training parameters than the common fusion scheme of direct concatenation followed by a series of convolution and activation layers;
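A minimal PyTorch-style sketch of the multi-branch group fusion module follows; padding choices and the class and parameter names are illustrative assumptions, while the channel numbers and group count mirror those stated above (128-channel inputs per modality, two 64-channel branches, M=8):

```python
import torch
import torch.nn as nn

class MultiBranchGroupFusion(nn.Module):
    """Fuses same-level RGB and thermal features R_1^d, R_2^d (128 ch each) into H_d (128 ch)."""
    def __init__(self, in_channels=128, branch_channels=64, groups=8):
        super().__init__()
        self.groups = groups
        g_in = 2 * in_channels // groups   # concatenated RGB + thermal channels per group
        g_out = branch_channels // groups  # 64/M output channels per group
        # Multi-group branch: one small 1x1 -> 3x3 stack per group.
        self.group_fuse = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(g_in, g_out, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(g_out, g_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            )
            for _ in range(groups)
        ])
        # Single-group branch: one 1x1 -> 3x3 stack over all channels.
        self.single_fuse = nn.Sequential(
            nn.Conv2d(2 * in_channels, branch_channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(branch_channels, branch_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, r1, r2):
        r1_groups = torch.chunk(r1, self.groups, dim=1)
        r2_groups = torch.chunk(r2, self.groups, dim=1)
        h1 = torch.cat(
            [fuse(torch.cat([a, b], dim=1)) for fuse, a, b in zip(self.group_fuse, r1_groups, r2_groups)],
            dim=1,
        )                                                   # H_{1,d}: 64 channels
        h2 = self.single_fuse(torch.cat([r1, r2], dim=1))   # H_{2,d}: 64 channels
        return torch.cat([h1, h2], dim=1)                   # H_d: 128 channels

mbgf = MultiBranchGroupFusion()
h_d = mbgf(torch.randn(1, 128, 128, 128), torch.randn(1, 128, 128, 128))
```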
Step 4) Obtain the fused output feature map:
The features of different levels are fused in reverse, level by level, to obtain multiple side-output feature maps $\{P_d\,|\,d=1,2,3\}$, expressed as:
$$T_3=D(H_3;\gamma_3,(1/2)^3),\qquad T_d=C\big(\mathrm{Cat}\big(D(H_d;\gamma_d,(1/2)^d),\,T_{d+1}\big);\theta_d^{1},1\big)\ (d=2,1),\qquad P_d=C\big(T_d;\theta_d^{2},1\big)$$
where $D(*;\gamma_d,(1/2)^d)$ is a deconvolution layer with a $2^d\times 2^d$ kernel, stride $(1/2)^d$ and parameters $\gamma_d$, which gives the fused features the same spatial resolution, and $C(*;\theta_d^{1},1)$ and $C(*;\theta_d^{2},1)$ are two convolution layers with 1×1 kernels, stride 1 and parameters $\theta_d^{1}$ and $\theta_d^{2}$, used respectively to fuse the features of different levels and to produce the side-output feature map of each level. After the level-by-level information transfer, we obtain 3 side-output feature maps $\{P_d\,|\,d=1,2,3\}$ whose size equals that of the input single-modality images;
The multi-level features are then merged by a concatenation operation and fused by a convolution operation $C(*;\theta_0,1)$ with a 1×1 kernel, stride 1 and parameters $\theta_0$ to generate the feature map $P_0$, expressed as:
$$P_0=C\big(\mathrm{Cat}(P_1,P_2,P_3);\theta_0,1\big)$$
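The sketch below shows one plausible PyTorch-style reading of this top-down fusion, under the assumption that each fused feature $H_d$ is first brought back to the input resolution by $D(*;\gamma_d,(1/2)^d)$ and that the side outputs are single-channel maps; the exact wiring and all names are assumptions, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Upsamples H_1..H_3 to the input resolution, fuses them top-down, and emits P_0..P_3."""
    def __init__(self, channels=128):
        super().__init__()
        # D(*; gamma_d, (1/2)^d): a 2^d x 2^d deconvolution upsampling H_d back to the input size.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, kernel_size=2 ** d, stride=2 ** d) for d in (1, 2, 3)]
        )
        self.fuse = nn.ModuleList([nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in (1, 2)])
        self.side = nn.ModuleList([nn.Conv2d(channels, 1, kernel_size=1) for _ in (1, 2, 3)])
        self.final = nn.Conv2d(3, 1, kernel_size=1)  # C(*; theta_0, 1) applied to Cat(P_1, P_2, P_3)

    def forward(self, h1, h2, h3):
        t3 = self.up[2](h3)
        t2 = self.fuse[1](torch.cat([self.up[1](h2), t3], dim=1))
        t1 = self.fuse[0](torch.cat([self.up[0](h1), t2], dim=1))
        p1, p2, p3 = self.side[0](t1), self.side[1](t2), self.side[2](t3)
        p0 = self.final(torch.cat([p1, p2, p3], dim=1))
        return p0, p1, p2, p3

decoder = TopDownFusion()
outputs = decoder(torch.randn(1, 128, 128, 128), torch.randn(1, 128, 64, 64), torch.randn(1, 128, 32, 32))
```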
Step 5) Train the algorithm network:
On the training dataset, a deep supervision learning mechanism is adopted: the side-output feature maps and the fused output feature map $\{P_t\,|\,t=0,1,2,3\}$ are compared with the ground-truth map G, and the cross-entropy loss function L of the network model is computed:
$$L=-\sum_{t=0}^{3}\sum_{(i,j)}\Big[\beta\,G(i,j)\log\big(\sigma(P_t(i,j))\big)+(1-\beta)\big(1-G(i,j)\big)\log\big(1-\sigma(P_t(i,j))\big)\Big]$$
where $G(i,j)\in\{0,1\}$ is the value at position (i,j) of the ground-truth map G, $\sigma(P_t(i,j))$ is the value at position (i,j) of the probability map obtained by applying the operation $\sigma(P_t)$ to the feature map $P_t$, and σ(·) is a sigmoid activation function. In different images, the size of the region occupied by the salient object relative to the background region differs; to balance the foreground and background losses and improve the detection accuracy for salient objects of different sizes, a class-balance parameter β is used. β is the ratio of the number of background pixels in the ground-truth map to the total number of pixels in the ground-truth map, and can be expressed as:
$$\beta=\frac{N_b}{N_b+N_f}$$
where $N_b$ denotes the number of background pixels and $N_f$ denotes the number of foreground pixels;
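A compact sketch of this class-balanced cross-entropy, written in the same PyTorch style as the earlier snippets, is shown below; computing β over the whole batch and the small epsilon added for numerical stability are assumptions of the sketch:

```python
import torch

def balanced_cross_entropy(outputs, gt, eps=1e-8):
    """Class-balanced cross-entropy over the fused and side outputs {P_t | t = 0..3}.

    outputs: iterable of logit maps P_t of shape (N, 1, H, W); gt: binary float ground-truth map, same shape.
    beta = N_b / (N_b + N_f) weights the foreground term, (1 - beta) the background term.
    """
    n_total = gt.numel()
    n_fg = gt.sum()
    beta = (n_total - n_fg) / n_total
    loss = 0.0
    for p in outputs:
        prob = torch.sigmoid(p)
        loss = loss - (beta * gt * torch.log(prob + eps)
                       + (1.0 - beta) * (1.0 - gt) * torch.log(1.0 - prob + eps)).sum()
    return loss
```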
The present invention trains the network with a "3-step training method". In the first step, the branch network for RGB images is trained by minimizing the cross-entropy loss function; in this branch network the multi-branch group fusion modules are removed, and the multi-level visible-light image features output by the multiple adjacent-depth feature fusion modules are fed directly into the backward (top-down) pass to predict saliency. In the second step, the thermal infrared branch is built and trained in the same way as the RGB branch network in the first step. In the third step, based on the VGG16 backbone parameters and the adjacent-depth feature fusion module parameters obtained by the RGB and thermal infrared single-branch networks in the first two steps, the overall network for RGB-T image detection is trained to obtain the network model parameters;
When training the parameters of the thermal infrared single-modality branch network, no dataset for thermal infrared single-modality salient object detection is available. To make training possible, the present invention uses the R channel of the RGB image in place of thermal infrared single-modality data, because among the three channels of an RGB image the R-channel image is the closest to a thermal infrared image. The training datasets are constructed as follows:
The RGB images of the RGB-thermal dataset (one of every two) and the MSRA-B training dataset (one of every three) are used, forming a 1:2 data ratio, to train the RGB branch network model; correspondingly, the thermal infrared images of the RGB-thermal dataset (one of every two) and the R channel of the images in the MSRA-B training dataset (one of every three) are used, forming a 1:2 data ratio, to train the thermal infrared branch network model; for the RGB-T image multi-modality network model, the paired images of the RGB-thermal dataset (one pair of every two) are used for training;
During training, to avoid overfitting caused by too little training data, each image is rotated by 90°, 180° and 270° and flipped horizontally and vertically, expanding the original dataset to 8 times its size;
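A small sketch of an augmentation routine consistent with this description is given below; the specific combination of rotations and a horizontal mirror that yields exactly 8 variants is an assumption made for illustration:

```python
from PIL import Image

def eightfold_augment(img: Image.Image):
    """Returns 8 variants: the image and its horizontal mirror, each rotated by 0, 90, 180 and 270 degrees."""
    variants = []
    for base in (img, img.transpose(Image.FLIP_LEFT_RIGHT)):
        for angle in (0, 90, 180, 270):
            variants.append(base.rotate(angle, expand=True))
    return variants
```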
Step 6) Predict the pixel-level saliency map of the RGB-T image:
The half of the RGB-thermal dataset not used for training is used as test data. With the network model parameters obtained in step (5), the side-output feature maps and the fused output feature map obtained in step (4) are further classified; denoting all the output saliency maps of the network by $\{S_t\,|\,t=0,1,2,3\}$, $S_t$ can be expressed as follows:
$$S_t=\sigma(P_t)$$
where σ(·) is a sigmoid activation function;
Finally, $S_0$ is taken as the final predicted RGB-T saliency map.
The technical effects of the present invention are further described below in conjunction with simulation experiments:
1. Simulation conditions: all simulation experiments were carried out in the Ubuntu 16.04.5 environment using the Caffe deep learning framework, with Matlab R2014b software as the interface;
2. Simulation contents and result analysis:
Simulation 1
Salient object detection experiments were conducted on the public image database RGB-thermal with the present invention, existing RGB-image-based salient object detection methods, and existing RGB-T-image-based salient object detection algorithms, and some of the experimental results were compared visually, as shown in Fig. 2, where "RGB image" denotes the RGB image in the database used as experimental input, "T image" denotes the thermal infrared image paired with the RGB image used as experimental input, and GT denotes the manually annotated ground-truth map;
As can be seen from Fig. 2, compared with the prior art, the present invention suppresses the background better, achieves a more complete and consistent effect in salient object detection in complex scenes, and is closer to the manually annotated ground-truth map.
Simulation 2
The results obtained by conducting salient object detection experiments on the public image database RGB-thermal with the present invention, existing single-modality-image-based salient object detection methods, and existing RGB-T-image-based salient object detection algorithms were evaluated objectively with widely used evaluation metrics; the evaluation simulation results are shown in Fig. 3a and Fig. 3b, where:
Fig. 3a shows the results of evaluating the present invention and the prior art with the precision-recall (P-R) curve;
Fig. 3b shows the results of evaluating the present invention and the prior art with the F-measure curve;
As can be seen from Fig. 3a and Fig. 3b, compared with the prior art, the present invention achieves higher P-R and F-measure curves, which shows that the present invention detects salient objects with better consistency and completeness and fully demonstrates the effectiveness and superiority of the method of the present invention.
The embodiments of the present invention have been described in detail above. However, the present invention is not limited to the above embodiments, and various changes can be made within the scope of knowledge possessed by a person of ordinary skill in the art without departing from the gist of the present invention.