CN113205481A

Movatterモバイル変換

Info

Publication number: CN113205481A
Application number: CN202110297512.9A
Authority: CN
Inventors: 周武杰; 吴俊一; 雷景生; 万健; 甘兴利; 钱小鸿; 叶宁
Original assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Current assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2021-08-03

Abstract

The invention discloses a salient object detection method based on a stepped progressive neural network. The method comprises a training stage process and a testing stage process; a training stage process, namely constructing a convolutional neural network; inputting various original scene images with salient objects into a convolutional neural network for training to obtain corresponding salient object prediction images; calculating loss function values between image sets formed by predicted images of all the salient objects and image sets formed by real detection images of the corresponding salient objects to obtain optimal weight vectors and bias terms of the convolutional neural network; and in the testing stage process, inputting various scene images to be detected with the salient objects into the trained convolutional neural network to obtain the salient object prediction image. The invention improves the speed and the accuracy of the detection of the salient objects of various scenes.

Description

Salient object detection method based on stepped progressive neural network

Technical Field

The invention relates to the technical field of image processing, in particular to a salient object detection method based on a stepped progressive neural network.

Background

The saliency detection aims at detecting an object which most attracts human attention in a scene, and is widely applied to many visual tasks such as visual tracking, image segmentation and the like; in most existing methods, the prediction of a significant object from an RGB image or an RGB-D image is mainly considered, which depends heavily on the illumination condition, the weather condition and the quality of a depth map, while an infrared image can well make up the problems of insufficient illumination and the like, and capture the information missing from more visible light images; the infrared spectrogram provides a three-dimensional spatial relationship of a scene, and can effectively assist a salient object detection algorithm to avoid ambiguity caused by foreground and background colors; therefore, more and more work is being done to investigate how to better perform the salient object detection task, i.e., RGB-T salient object detection, by means of spectrograms.

Previous approaches to RGB-D salient objects often used depth contrast as an important prior; these methods actually focus on the foreground region using depth information; however, the quality of the depth map becomes important, and often affects the final prediction result of the network. The distribution of the foreground and the background is greatly different, and it is difficult to learn significant clues from the foreground and the background without distinction; some traditional methods propose strategies for reasoning significance regions from the foreground and the background respectively, but in the deep learning-based method, the simple and effective idea is not emphasized, and the problem caused by poor quality of a depth map can be effectively solved by using a spectrogram to perform a significance object detection task.

Disclosure of Invention

The invention aims to solve the technical problem of providing a method for detecting a significant object based on a stepped progressive neural network, which has high detection speed and high accuracy.

The technical scheme adopted by the invention for solving the technical problems is as follows: a salient object detection method based on a ladder-shaped progressive neural network is characterized by comprising a training stage and a testing stage;

the specific steps of the training phase process are as follows:

step 1_ 1: selecting Q original scene images with salient objects and salient object real detection images corresponding to the original scene images with the salient objects, and forming a training set by the Q original scene images with the salient objects and the corresponding salient object real detection images;

step 1_ 2: constructing a convolutional neural network, wherein the convolutional neural network mainly comprises 10 basic modules, 5 blending modules, a multi-scale sharpening feature module, a pyramid sharpening feature module and 4 guide modules;

step 1_ 3: inputting various original scene images with salient objects in the training set as original input images into a convolutional neural network for training to obtain corresponding salient object prediction images;

step 1_ 4: calculating loss function values between image sets formed by prediction images of all the salient objects and image sets formed by real detection images of the corresponding salient objects, and finishing the training of the convolutional neural network when the training times reach preset times to obtain a trained convolutional neural network;

the test stage process comprises the following specific steps:

step 2_ 1: selecting various scenes with salient objects to be detected in the p group in the test set;

step 2_ 2: inputting various scenes with the salient objects to be detected in the p-th group into the trained convolutional neural network, and outputting corresponding salient object prediction images by the trained convolutional neural network.

The convolutional neural network mainly comprises ten basic modules, five blending modules, a multi-scale sharpening feature module, a pyramid sharpening feature module and four guide modules, and specifically comprises the following steps:

the first basic module, the second basic module, the third basic module, the fourth basic module and the fifth basic module are sequentially connected, the sixth basic module, the seventh basic module, the eighth basic module, the ninth basic module and the tenth basic module are sequentially connected, and the input of the convolutional neural network is respectively input into the first basic module and the sixth basic module; the outputs of the first basic module and the sixth basic module are simultaneously input into the first blending module, the outputs of the second basic module and the seventh basic module are simultaneously input into the second blending module, the outputs of the third basic module and the eighth basic module are simultaneously input into the third blending module, the outputs of the fourth basic module and the ninth basic module are simultaneously input into the fourth blending module, the outputs of the fifth basic module and the tenth basic module are simultaneously input into the fifth blending module, the outputs of the first blending module, the second blending module, the third blending module and the fourth blending module are respectively input into the fifth input end, the fourth input end, the third input end and the second input end of the multi-scale sharpening feature module, and the output of the fifth blending module is input into the first input end of the multi-scale sharpening feature module after passing through the pyramid sharpening feature module; each guide module is provided with two input ends, a fifth output end, a fourth output end, a third output end and a second output end of the multi-scale sharpening feature module are respectively connected with the fourth guide module, the third guide module, the second guide module and the first input end of the first guide module, the first output end of the multi-scale sharpening feature module is connected with the second input end of the first guide module, the output of the first guide module is input to the second input end of the second guide module, the output of the second guide module is input to the second input end of the third guide module, the output of the third guide module is input to the second input end of the second guide module, and the output of the fourth guide module is used as the output of the convolutional neural network.

The multi-scale sharpening feature module specifically comprises:

the multi-scale sharpening feature module comprises four stacking modules, ten up-sampling modules and four feature filtering modules;

a fifth input end of the multi-scale sharpening feature module is input into the fourth stacking module, a fourth input end of the multi-scale sharpening feature module is input into the third stacking module, a third input end of the multi-scale sharpening feature module is input into the second stacking module, a second input end of the multi-scale sharpening feature module is input into the first stacking module, a first input end of the multi-scale sharpening feature module is input into the first up-sampling module, outputs of the first up-sampling module are respectively input into the first stacking module and the second up-sampling module, outputs of the second up-sampling module are respectively input into the second stacking module and the third up-sampling module, outputs of the third up-sampling module are respectively input into the third stacking module and the fourth up-sampling module, and outputs of the fourth up-sampling module are input into the fourth stacking module;

the second input end of the multi-scale sharpening feature module is also input into a fifth up-sampling module, the output of the fifth up-sampling module is respectively input into the second stacking module and the sixth up-sampling module, the output of the sixth up-sampling module is respectively input into the third stacking module and the seventh up-sampling module, and the output of the seventh up-sampling module is input into the fourth stacking module;

the third input end of the multi-scale sharpening feature module is also input into an eighth up-sampling module, the output of the eighth up-sampling module is respectively input into the third stacking module and the ninth up-sampling module, and the output of the ninth up-sampling module is input into the fourth stacking module; the input of the fourth input end of the multi-scale sharpening feature module is also input into a tenth up-sampling module, and the output of the tenth up-sampling module is input into a fourth stacking module;

the input of the first up-sampling module is used as the first output end of the multi-scale sharpening feature module, the output of the first stacking module after passing through the first feature filtering module is used as the second output end of the multi-scale sharpening feature module, the output of the second stacking module after passing through the second feature filtering module is used as the third output end of the multi-scale sharpening feature module, the output of the third stacking module after passing through the third feature filtering module is used as the fourth output end of the multi-scale sharpening feature module, and the output of the fourth stacking module after passing through the fourth feature filtering module is used as the fifth output end of the multi-scale sharpening feature module.

The guide module is specifically as follows: the guiding module comprises an eleventh up-sampling module, three convolution modules, three activation modules, a second segmentation module and a middle module;

the input of the first input end of the guide module is input into an eleventh up-sampling module, the eleventh up-sampling module is connected with the second segmentation module after sequentially passing through the first convolution module and the first activation module, the output of the multiplication of the input of the first input end of the guide module and the output of the second segmentation module is added with the input of the first input end of the guide module and then is input into the second convolution module, the output of the second convolution module is input into the second activation module, the output of the multiplication of the input of the first input end of the guide module and the output of the second segmentation module is input into the third convolution module, the third convolution module is connected with the middle module after passing through the third activation module, the output of the second activation module is multiplied with the input of the second input end of the guide module and then outputs a first middle output, the output of the middle module is multiplied with the input of the second input end of the guide module and then outputs a second middle output, and the output obtained by adding the first intermediate output, the second intermediate output and the input of the second input end of the guide module is used as the output of the guide module.

The blending module specifically comprises: the blending module comprises a fourth convolution module, a fifth convolution module, an adaptive module and a fourth activation module;

the fourth convolution module, the fifth convolution module, the self-adaptive module and the fourth activation module are sequentially connected, the input of the blending module is input into the fourth convolution module, and the output of the blending module after the input of the blending module is multiplied by the output of the fourth activation module is added with the input of the blending module to be used as the output of the blending module.

The pyramid sharpening feature module specifically comprises: the pyramid sharpening feature module comprises six convolution modules and a fifth stacking module;

The characteristic filtering module specifically comprises:

the characteristic filtering module comprises a first convolution module, a second convolution module, a first activation module and a first segmentation module; the first convolution module, the second convolution module, the first activation module and the first segmentation module are sequentially connected, the input of the characteristic filtering module is input into the first convolution module, and the output of the first segmentation module multiplied by the input of the characteristic filtering module is used as the output of the characteristic filtering module.

Compared with the prior art, the invention has the beneficial effects that:

1) the method comprises the steps of constructing a convolutional neural network, inputting images of various scenes with salient objects in a training set into the convolutional neural network for training to obtain a convolutional neural network saliency detection training model; and inputting various scene images to be detected into a convolutional neural network salient object detection training model, and predicting to obtain salient object images corresponding to the various scene images.

2) The method adopts the feature sharpening module, can well connect low-dimensional and high-dimensional features, and can better determine the spatial position of a significant object; and the cavity convolution is adopted, so that the receptive field is improved, and the network characteristics are better extracted.

3) The method uses the characteristic guide module in the construction of the convolutional neural network, and the stepped guide module better optimizes a significant object image by stepping and stepping, sharpens the boundary and obtains a more specific significant image.

Drawings

FIG. 1 is a block diagram of an implementation of the method of the present invention;

FIG. 2 is a multi-scale sharpening feature module framework;

FIG. 3 is a feature filter module frame;

FIG. 4 is a guide module frame;

FIG. 5 is an illustration of an blending module frame;

FIG. 6 is a pyramid sharpening feature module framework;

FIG. 7a is a first original image;

FIG. 7b is a salient object image obtained by detecting the first original image shown in FIG. 7a according to the method of the present invention;

FIG. 8a is a second original image;

FIG. 8b is a salient object image obtained by inspecting the second original image shown in FIG. 8a according to the method of the present invention;

FIG. 9a is a third original image;

fig. 9b is a salient object image obtained by detecting the third original image shown in fig. 9a by using the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying examples.

The invention provides a salient object detection method based on a stepped progressive neural network, and the overall implementation block diagram of the method is shown in fig. 1, and the method comprises a training stage and a testing stage;

the specific steps of the training phase process are as follows:

step 1_ 1: q original scene images with salient objects and salient object real detection images corresponding to the original scene images with the salient objects are selected, and Q original scene images with the salient objects are used for detecting the salient object real detection imagesVarious scene images of the literary objects and corresponding real detection images of the salient objects form a training set; recording the q-th original scene image with the salient object in the training set as

Centralize the training with

The corresponding real detection image of the salient object is recorded as

The original scene images with the salient objects are RGB images, Q is a positive integer, Q is more than or equal to 1000, if Q is 2500, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, the original scene images with the salient objects mainly comprise RGB images of different salient objects shot in different scenes and multispectral images thereof, spectral information of three wave bands of red, green and blue is recorded in the RGB images, the multispectral images record spectral information of other three different wave bands, the spectral information of each wave band is equivalent to a channel component, namely, each original scene image with the salient objects comprises an R channel component, a G channel component, a B channel component and the other three Thermal infrared channel components (Thermal) of the RGB images.

as shown in fig. 1, a first basic module, a second basic module, a third basic module, a fourth basic module and a fifth basic module are connected in sequence, a sixth basic module, a seventh basic module, an eighth basic module, a ninth basic module and a tenth basic module are connected in sequence, the input of a convolutional neural network is respectively input into the first basic module and the sixth basic module, the RGB image of each original various scene images with salient objects is input into the first basic module, and the multispectral image of each original various scene images with salient objects is input into the sixth basic module; the outputs of the first basic module and the sixth basic module are simultaneously input into the first blending module, the outputs of the second basic module and the seventh basic module are simultaneously input into the second blending module, the outputs of the third basic module and the eighth basic module are simultaneously input into the third blending module, the outputs of the fourth basic module and the ninth basic module are simultaneously input into the fourth blending module, the outputs of the fifth basic module and the tenth basic module are simultaneously input into the fifth blending module, the outputs of the first blending module, the second blending module, the third blending module and the fourth blending module are respectively input into the fifth input end, the fourth input end, the third input end and the second input end of the multi-scale sharpening feature module, and the output of the fifth blending module is input into the first input end of the multi-scale sharpening feature module after passing through the pyramid sharpening feature module; each guide module is provided with two input ends, a fifth output end, a fourth output end, a third output end and a second output end of the multi-scale sharpening feature module are respectively connected with the fourth guide module, the third guide module, the second guide module and the first input end of the first guide module, the first output end of the multi-scale sharpening feature module is connected with the second input end of the first guide module, the output of the first guide module is input to the second input end of the second guide module, the output of the second guide module is input to the second input end of the third guide module, the output of the third guide module is input to the second input end of the second guide module, and the output of the fourth guide module is used as the output of the convolutional neural network.

As shown in fig. 2, the multi-scale sharpening feature module specifically includes:

As shown in fig. 4, the first guiding module, the second guiding module, the third guiding module and the fourth guiding module have the same structure, and the guiding modules specifically include: the guiding module comprises an eleventh up-sampling module, three convolution modules, three activation modules, a second segmentation module and a middle module;

As shown in fig. 5, the blending module specifically includes: the blending module comprises a fourth convolution module, a fifth convolution module, an adaptive module and a fourth activation module;

As shown in fig. 6, the pyramid sharpening feature module specifically includes: the pyramid sharpening feature module comprises six convolution modules and a fifth stacking module;

As shown in fig. 3, the feature filtering module specifically includes:

The structures of the 5 basic modules of the Resnet-34 convolutional neural network are respectively the same as those of the first basic module, the second basic module, the third basic module, the fourth basic module and the fifth basic module of the present invention. The first base module and the sixth base module have the same structure, the second base module and the seventh base module have the same structure, the third base module and the eighth base module have the same structure, the fourth base module and the ninth base module have the same structure, and the fifth base module and the tenth base module have the same structure.

For the first base module. The device comprises a 1 st convolution layer, a 1 st normalization layer and a 1 st activation layer which are sequentially arranged; the input end of the first basic module receives RGB three-channel components of an original input image, the width of the original input image received by the input end is required to be W, the height of the original input image is required to be H, the output end of the first basic module outputs 64 pairs of feature maps, and a set formed by the 64 pairs of feature maps is recorded as N1; the size of convolution kernels in the 1 st convolution layer is 3 x 3, the number of convolution kernels is 64, the step size of the convolution kernels is 2, the filling coefficient of the convolution layer is 1, the bias parameter of the convolution layer is negative, the input feature number of the 1 st normalization layer is 64, the activation mode adopted by the 1 st activation layer is 'ReLU', the width of each feature map in N1 is W/2, and the height is H/2.

For the second base module. The system comprises a 1 st down-sampling layer, a 1 st residual block, a 2 nd residual block and a 3 rd residual block which are sequentially arranged; the input end of the second basic module receives all the characteristic diagrams in N1, the output end of the second basic module outputs 64 sub-characteristic diagrams, and the set formed by the 64 sub-characteristic diagrams is marked as N2; the 1 st down-sampling layer adopts maximum pooled down-sampling, the convolution kernel size of the maximum pooled down-sampling is 3 multiplied by 3, the step length of the maximum pooled down-sampling convolution kernel is 2, the filling coefficient of the maximum pooled down-sampling convolution kernel is 1, and the offset parameter is negative; the size of a first convolution kernel in the 1 st residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 1 st residual block is 64, the size of a second convolution kernel in the 1 st residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 1 st residual block is 64; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 64, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 64; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 64, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 64; each feature map in N2 has a width of W/4 and a height of H/4.

For the third base module. The device comprises a 1 st residual block, a 2 nd residual block, a 3 rd residual block and a 4 th residual block which are sequentially arranged; the input end of the third basic module receives all the feature maps in N2, the output end of the third basic module outputs 128 sub feature maps, and the set formed by the 128 sub feature maps is marked as N3; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 128, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 128; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 128, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 128; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 128, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 128; the size of a first convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 128, the size of a second convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 128; each feature map in N3 has a width of W/8 and a height of H/8.

For the fourth basic module. The device comprises a 1 st residual block, a 2 nd residual block, a 3 rd residual block, a 4 th residual block, a 5 th residual block and a 6 th residual block which are sequentially arranged; the input end of the fourth basic module receives all the feature maps in N3, the output end of the fourth basic module outputs 256 pairs of feature maps, and a set formed by the 256 pairs of feature maps is marked as N4; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 256, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 256; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 256, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 256; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 256, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 256; the size of a first convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 4 th residual block is 256, the size of a second convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 4 th residual block is 256; the size of a first convolution kernel in the 5 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of the first convolution kernels in the 5 th residual block is 256, the size of a second convolution kernel in the 5 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of the second convolution kernels in the 5 th residual block is 256; the size of a first convolution kernel in the 6 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 6 th residual block is 256, the size of a second convolution kernel in the 6 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 6 th residual block is 256; each feature map in N4 has a width of W/16 and a height of H/16.

For the fifth basic module. The device comprises a 1 st residual block, a 2 nd residual block and a 3 rd residual block which are sequentially arranged; the input end of the fifth basic module receives all the feature maps in N4, the output end of the fifth basic module outputs 512 sub-feature maps, and the set formed by the 512 sub-feature maps is marked as N5; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 512, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 512; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 512, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 512; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 512, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 512; each feature map in N5 has a width of W/32 and a height of H/32.

For the sixth basic module. The device comprises a 1 st convolution layer, a 1 st normalization layer and a 1 st activation layer which are sequentially arranged; the input end of the sixth basic module receives three-channel components of the original infrared image, the width of the original input image received by the input end is required to be W, the height of the original input image is required to be H, the output end of the first basic module outputs 64 pairs of feature maps, and a set formed by the 64 pairs of feature maps is recorded as N6; the size of convolution kernels in the 1 st convolution layer is 3 x 3, the number of convolution kernels is 64, the step size of the convolution kernels is 2, the filling coefficient of the convolution layer is 1, the bias parameter of the convolution layer is negative, the input feature number of the 1 st normalization layer is 64, the activation mode adopted by the 1 st activation layer is 'ReLU', the width of each feature map in N6 is W/2, and the height is H/2.

For the seventh base module. The system comprises a 1 st down-sampling layer, a 1 st residual block, a 2 nd residual block and a 3 rd residual block which are sequentially arranged; the input end of the seventh basic module receives all the feature maps in N6, the output end of the seventh basic module outputs 64 sub-feature maps, and the set formed by the 64 sub-feature maps is marked as N7; the 1 st down-sampling layer adopts maximum pooled down-sampling, the convolution kernel size of the maximum pooled down-sampling is 3 multiplied by 3, the step length of the maximum pooled down-sampling convolution kernel is 2, the filling coefficient of the maximum pooled down-sampling convolution kernel is 1, and the offset parameter is negative; the size of a first convolution kernel in the 1 st residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 1 st residual block is 64, the size of a second convolution kernel in the 1 st residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 1 st residual block is 64; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 64, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 64; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 64, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 64; each feature map in N7 has a width of W/4 and a height of H/4.

For the eighth base module. The device comprises a 1 st residual block, a 2 nd residual block, a 3 rd residual block and a 4 th residual block which are sequentially arranged; the input end of the eighth basic module receives all the feature maps in N7, the output end of the eighth basic module outputs 128 sub-feature maps, and the set formed by the 128 sub-feature maps is marked as N8; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 128, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 128; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 128, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 128; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 128, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 128; the size of a first convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 128, the size of a second convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 128; each feature map in N8 has a width of W/8 and a height of H/8.

For the ninth basic module. The device comprises a 1 st residual block, a 2 nd residual block, a 3 rd residual block, a 4 th residual block, a 5 th residual block and a 6 th residual block which are sequentially arranged; the input end of the ninth basic module receives all the feature maps in N8, the output end of the ninth basic module outputs 256 sub-feature maps, and a set formed by the 256 sub-feature maps is marked as N9; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 256, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 256; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 256, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 256; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 256, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 256; the size of a first convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 4 th residual block is 256, the size of a second convolution kernel in the 4 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 4 th residual block is 256; the size of a first convolution kernel in the 5 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of the first convolution kernels in the 5 th residual block is 256, the size of a second convolution kernel in the 5 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of the second convolution kernels in the 5 th residual block is 256; the size of a first convolution kernel in the 6 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 6 th residual block is 256, the size of a second convolution kernel in the 6 th residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 6 th residual block is 256; each feature map in N9 has a width of W/16 and a height of H/16.

For the tenth base module. The device comprises a 1 st residual block, a 2 nd residual block and a 3 rd residual block which are sequentially arranged; the input end of the tenth basic module receives all the feature maps in N9, the output end of the tenth basic module outputs 512 sub-feature maps, and the set formed by the 512 sub-feature maps is marked as N10; wherein, the size of the first convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 2, the filling coefficient of the convolution kernel is 1, the offset parameter is no, the number of the first convolution kernels in the 1 st residual block is 512, the size of the second convolution kernel in the 1 st residual block is 3 × 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is no, and the number of the second convolution kernels in the 1 st residual block is 512; the size of a first convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 2 nd residual block is 512, the size of a second convolution kernel in the 2 nd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 2 nd residual block is 512; the size of a first convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, the number of first convolution kernels in the 3 rd residual block is 512, the size of a second convolution kernel in the 3 rd residual block is 3 multiplied by 3, the step size of the convolution kernel is 1, the filling coefficient of the convolution kernel is 1, the offset parameter is negative, and the number of second convolution kernels in the 3 rd residual block is 512; each feature map in N10 has a width of W/32 and a height of H/32.

For the first blending module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 1 st self-adaptive module and a 1 st activation function which are sequentially arranged; the input end of the first blending module receives all the feature maps in N1 and N6, the output end of the first blending module outputs 64 sub-feature maps, and a set formed by the 64 sub-feature maps is marked as N11; the size of a convolution kernel in the 1 st convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the size of a convolution kernel in the 2 nd convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the 1 st self-adaptive module adopts self-adaptive average pooling, the size of an output characteristic diagram is 1 multiplied by 1, and the activation mode adopted by the 1 st activation function is 'ReLU'; each feature map in N11 has a width of W/2 and a height of H/2.

For the second blend module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 1 st self-adaptive module and a 1 st activation function which are sequentially arranged; the input end of the second blending module receives all the feature maps in N2 and N7, the output end of the second blending module outputs 128 sub-feature maps, and the set formed by the 128 sub-feature maps is marked as N12; the size of a convolution kernel in the 1 st convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the size of a convolution kernel in the 2 nd convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the 1 st self-adaptive module adopts self-adaptive average pooling, the size of an output characteristic diagram is 1 multiplied by 1, and the activation mode adopted by the 1 st activation function is 'ReLU'; each feature map in N12 has a width of W/4 and a height of H/4.

For the third blend module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 1 st self-adaptive module and a 1 st activation function which are sequentially arranged; the input end of the third blending module receives all the feature maps in N3 and N8, the output end of the third blending module outputs 256 pairs of feature maps, and a set formed by the 256 pairs of feature maps is recorded as N13; the size of a convolution kernel in the 1 st convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the size of a convolution kernel in the 2 nd convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the 1 st self-adaptive module adopts self-adaptive average pooling, the size of an output characteristic diagram is 1 multiplied by 1, and the activation mode adopted by the 1 st activation function is 'ReLU'; each feature map in N13 has a width of W/8 and a height of H/8.

For the fourth blend module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 1 st self-adaptive module and a 1 st activation function which are sequentially arranged; the input end of the fourth blending module receives all the feature maps in N4 and N9, the output end of the fourth blending module outputs 512 sub-feature maps, and a set formed by the 512 sub-feature maps is marked as N14; the size of a convolution kernel in the 1 st convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the size of a convolution kernel in the 2 nd convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the 1 st self-adaptive module adopts self-adaptive average pooling, the size of an output characteristic diagram is 1 multiplied by 1, and the activation mode adopted by the 1 st activation function is 'ReLU'; each feature map in N14 has a width of W/16 and a height of H/16.

For the fifth blend module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 1 st self-adaptive module and a 1 st activation function which are sequentially arranged; the input end of the fifth blending module receives all the feature maps in N5 and N10, the output end of the fifth blending module outputs 512 sub-feature maps, and a set formed by the 512 sub-feature maps is marked as N15; the size of a convolution kernel in the 1 st convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the size of a convolution kernel in the 2 nd convolution module is 3 multiplied by 3, the step size is 1, the filling is 1, and the offset parameter is negative, the 1 st self-adaptive module adopts self-adaptive average pooling, the size of an output characteristic diagram is 1 multiplied by 1, and the activation mode adopted by the 1 st activation function is 'ReLU'; each feature map in N15 has a width of W/32 and a height of H/32.

For a pyramid sharpening feature module. The system comprises a 1 st convolution module, a 2 nd convolution module, a 3 rd convolution module, a 4 th convolution module, a 5 th convolution module, a 6 th convolution module and a 1 st stacking module which are sequentially arranged; the pyramid sharpening feature module receives all feature maps in N15, 512 secondary feature maps are output by the output end of the pyramid sharpening feature module, and a set formed by the 512 secondary feature maps is recorded as N16; wherein, the convolution kernel size of the 1 st convolution module is 1 multiplied by 1, the convolution kernel step is 1, the padding is 0, and the bias parameter is negative, the convolution kernel size of the 2 nd convolution module is 3 multiplied by 3, the convolution kernel step is 1, the padding is 1, the void rate is 1, and the bias parameter is negative, the convolution kernel size of the 3 rd convolution module is 3 multiplied by 3, the convolution kernel step is 1, the padding is 6, the void rate is 6, and the bias parameter is negative, the convolution kernel size of the 4 th convolution module is 3 multiplied by 3, the convolution kernel step size is 1, the filling is 12, the void rate is 12, the offset parameter is negative, the convolution kernel size of the 5 th convolution module is 3 x 3, the convolution kernel step size is 1, the filling is 18, the void rate is 18, the offset parameter is negative, the convolution kernel size of the 6 th convolution module is 1 x 1, the convolution kernel step size is 1, the filling is 0, the offset parameter is negative, and the 1 st stacking module is stacked on the channel dimension; each feature map in N16 has a width of W/32 and a height of H/32.

For a multi-scale sharpening feature module. The device consists of a 1 st feature filtering module, a 2 nd feature filtering module, a 3 rd feature filtering module, a 4 th feature filtering module, a 1 st stacking module, a 2 nd stacking module, a 3 rd stacking module, a 4 th stacking module, a 1 st up-sampling module, a 2 nd up-sampling module, a 3 rd up-sampling module, a 4 th up-sampling module, a 5 th up-sampling module, a 6 th up-sampling module, a 7 th up-sampling module, an 8 th up-sampling module, a 9 th up-sampling module and a 10 th up-sampling module; the multi-scale sharpening feature module receives all feature maps of N11, N12, N13, N14 and N16, the multi-scale sharpening feature module has 5 output ends, the first output end outputs 64 sub feature maps, a set formed by the 64 sub feature maps is N17, the second output end outputs 128 sub feature maps, a set formed by the 128 sub feature maps is N18, the third output end outputs 256 sub feature maps, a set formed by the 256 sub feature maps is N19, the fourth output end outputs 512 sub feature maps, a set formed by the 512 sub feature maps is N20, the fifth output end outputs 512 sub feature maps, and a set formed by the 512 sub feature maps is N21; wherein, the 1 st stacking module, the 2 nd stacking module, the 3 rd stacking module and the 4 th stacking module are stacked on the channel dimension of the feature map, the 1 st up-sampling module, the 2 nd up-sampling module, the 3 rd up-sampling module, the 4 th up-sampling module, the 5 th up-sampling module, the 6 th up-sampling module, the 7 th up-sampling module, the 8 th up-sampling module, the 9 th up-sampling module and the 10 th up-sampling module all adopt 2 times bilinear interpolation up-sampling, the sizes of the first convolution kernels in the 1 st feature filtering module, the 2 nd feature filtering module, the 3 rd feature filtering module and the 4 th feature filtering module are 3 × 3, the step length is 1, the padding is 1, the offset parameter is no, the size of the second convolution kernel in the 1 st feature filtering module is 3 × 3, the step length is 1, the padding is 1, the characteristic filtering module is a 4 th up-sampling module, If the bias parameter is not, the activation mode adopted by the first activation module in the 1 st feature filtering module is 'Softmax', and the first cutting module in the 1 st feature filtering module cuts the feature map into 2 parts on the channel; each feature in N17 has a width W/32 and a height H/32, each feature in N18 has a width W/16 and a height H/16, each feature in N19 has a width W/8 and a height H/8, each feature in N20 has a width W/4 and a height H/4, and each feature in N21 has a width W/2 and a height H/2.

For the first boot module. The system comprises a 1 st up-sampling module, a 1 st convolution module, a 2 nd convolution module, a 3 rd convolution module, a 1 st activation function, a 2 nd activation function, a 3 rd activation function, a 1 st cutting function and a 1 st intermediate function which are sequentially arranged; the first guide module receives all the characteristic diagrams in N21 and N20, the output end of the first guide module outputs 512 sub characteristic diagrams, and the set of the 512 sub characteristic diagrams is marked as N22; the 1 st upsampling module adopts 2 times of bilinear interpolation upsampling, the size of a convolution kernel of the 1 st convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 2 nd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 3 rd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the activation mode adopted by the 1 st activation function is "Softmax", the activation mode adopted by the 2 nd activation function is "Sidmoid", the activation mode adopted by the 3 rd activation function is "Sidmoid", the 1 st cutting module cuts the feature map into 2 parts on the channel, and the expression of the 1 st intermediate function is as follows: (x) x + 1; each feature map in N22 has a width of W/16 and a height of H/16.

For the second boot module. The system comprises a 1 st up-sampling module, a 1 st convolution module, a 2 nd convolution module, a 3 rd convolution module, a 1 st activation function, a 2 nd activation function, a 3 rd activation function, a 1 st cutting function and a 1 st intermediate function which are sequentially arranged; the second guide module receives all the characteristic diagrams in N22 and N19, the output end of the first guide module outputs 256 sub-characteristic diagrams, and the set of the 256 sub-characteristic diagrams is marked as N23; the 1 st upsampling module adopts 2 times of bilinear interpolation upsampling, the size of a convolution kernel of the 1 st convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 2 nd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 3 rd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the activation mode adopted by the 1 st activation function is "Softmax", the activation mode adopted by the 2 nd activation function is "Sidmoid", the activation mode adopted by the 3 rd activation function is "Sidmoid", the 1 st cutting module cuts the feature map into 2 parts on the channel, and the expression of the 1 st intermediate function is as follows: (x) x + 1; each feature map in N23 has a width of W/8 and a height of H/8.

For the third guidance module. The system comprises a 1 st up-sampling module, a 1 st convolution module, a 2 nd convolution module, a 3 rd convolution module, a 1 st activation function, a 2 nd activation function, a 3 rd activation function, a 1 st cutting function and a 1 st intermediate function which are sequentially arranged; the third guide module receives all the characteristic diagrams in N23 and N18, the output end of the first guide module outputs 128 sub characteristic diagrams, and the set of the 128 sub characteristic diagrams is marked as N24; the 1 st upsampling module adopts 2 times of bilinear interpolation upsampling, the size of a convolution kernel of the 1 st convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 2 nd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 3 rd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the activation mode adopted by the 1 st activation function is "Softmax", the activation mode adopted by the 2 nd activation function is "Sidmoid", the activation mode adopted by the 3 rd activation function is "Sidmoid", the 1 st cutting module cuts the feature map into 2 parts on the channel, and the expression of the 1 st intermediate function is as follows: (x) x + 1; each feature map in N24 has a width of W/4 and a height of H/4.

For the fourth boot module. The system comprises a 1 st up-sampling module, a 1 st convolution module, a 2 nd convolution module, a 3 rd convolution module, a 1 st activation function, a 2 nd activation function, a 3 rd activation function, a 1 st cutting function and a 1 st intermediate function which are sequentially arranged; the fourth guide module receives all the characteristic diagrams in N24 and N17, the output end of the first guide module outputs 64 sub-characteristic diagrams, and the set of 64 sub-characteristic diagrams is marked as N25; the 1 st upsampling module adopts 2 times of bilinear interpolation upsampling, the size of a convolution kernel of the 1 st convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 2 nd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the size of a convolution kernel of the 3 rd convolution module is 3 × 3, the step size is 1, the padding is 1, the offset parameter is negative, the activation mode adopted by the 1 st activation function is "Softmax", the activation mode adopted by the 2 nd activation function is "Sidmoid", the activation mode adopted by the 3 rd activation function is "Sidmoid", the 1 st cutting module cuts the feature map into 2 parts on the channel, and the expression of the 1 st intermediate function is as follows: (x) x + 1; each feature map in N25 has a width of W/2 and a height of H/2.

Step 1_ 3: inputting various original scene images with salient objects in the training set as original input images into a convolutional neural network for training to obtain corresponding salient object predicted images, and recording the obtained corresponding salient object predicted images as J_pre；

Step 1_ 4: calculating loss function values between image sets formed by prediction images of all the salient objects and image sets formed by real detection images of the corresponding salient objects, and finishing the training of the convolutional neural network when the training times reach preset times to obtain a trained convolutional neural network; predicting the q-th salient object image

And the q th significant object real detection image

The value of the loss function in between is recorded as

Obtained using binary category cross entropy (binary cross entropy).

The specific steps of the test stage process are as follows:

step 2_ 1: selecting various scenes with salient objects to be detected in the p group in the test set; marking the p-th group of various scene images to be detected with the salient objects as I^pWherein, P is not less than 1 and not more than P, and P is 4, 3, 2 and 1.

Step 2_ 2: the p-th group of various scene images I with the salient objects to be detected^pThe R channel component, the G channel component, the B channel component and the three Thermal infrared channel components (Thermal) are input into a trained convolutional neural network, the trained convolutional neural network outputs a corresponding salient object prediction image, and the salient object prediction image is recorded as

To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.

And (3) building an architecture of the multi-scale perforated convolutional neural network by using a python-based deep learning library Pytrich 3.6. The VT800, VT1000 and VT5000 test sets are adopted to analyze how the effect of the significant object image is predicted by the method of the invention. Here, the detection performance of the predictive garbage classification is evaluated using 3 common objective parameters of the evaluation target detection method as evaluation indexes, i.e., recall rate, accuracy rate, and average absolute error.

The method is used for detecting each image in the test set to obtain a significant object image corresponding to each image, and recall rate, accuracy rate and average absolute error reflecting the target detection effect of the method are listed in table 1. From the data listed in table 1, the significant object images obtained by the method of the present invention are good, which indicates that it is feasible and effective to obtain significant object images of various scenes by using the method of the present invention.

TABLE 1 evaluation results on test sets using the method of the invention

FIG. 7a shows the 1 st original image; FIG. 7b illustrates the salient object detection performed on the original image shown in FIG. 7a by the method of the present invention to obtain a salient object image; FIG. 8a shows the 2 nd original image; FIG. 8b illustrates the salient object detection performed on the original image shown in FIG. 8a by the method of the present invention to obtain a salient object image; FIG. 9a shows the 3 rd original image; fig. 9b shows a salient object image obtained by detecting a salient object in the original image shown in fig. 9a by using the method of the present invention. Comparing fig. 7a and 7b, 8a and 8b, and 9a and 9b, it can be seen that the accuracy of the salient object image obtained by the method of the present invention is higher.

Claims

1. A salient object detection method based on a ladder-shaped progressive neural network is characterized by comprising the following steps: the method comprises a training stage process and a testing stage process;

the specific steps of the training phase process are as follows:

the test stage process comprises the following specific steps:

2. The salient object detection method based on the stepped progressive neural network according to claim 1, wherein: the convolutional neural network mainly comprises ten basic modules, five blending modules, a multi-scale sharpening feature module, a pyramid sharpening feature module and four guide modules, and specifically comprises the following steps:

3. The salient object detection method based on the stepped progressive neural network according to claim 2, wherein: the multi-scale sharpening feature module specifically comprises:

4. The salient object detection method based on the stepped progressive neural network according to claim 1, wherein: the guide module is specifically as follows: the guiding module comprises an eleventh up-sampling module, three convolution modules, three activation modules, a second segmentation module and a middle module;

5. The salient object detection method based on the stepped progressive neural network according to claim 1, wherein: the blending module specifically comprises: the blending module comprises a fourth convolution module, a fifth convolution module, an adaptive module and a fourth activation module;

6. The salient object detection method based on the stepped progressive neural network according to claim 1, wherein: the pyramid sharpening feature module specifically comprises: the pyramid sharpening feature module comprises six convolution modules and a fifth stacking module;

7. The salient object detection method based on the stepped progressive neural network according to claim 3, wherein: the characteristic filtering module specifically comprises: