Disclosure of Invention
The invention aims to address the following: a global model detects the overall outline of a target, which facilitates the detection of larger targets and assigns a uniform saliency value to the region it contains; a local optimization model is sensitive to high-frequency content in the image, such as edges and noise, and can refine outline details. By integrating the results of these two independent convolutional neural networks, end-to-end salient object detection in images is realized.
The invention adopts the following technical scheme to solve the above technical problems:
A deep network saliency detection method based on a global model and local optimization comprises the following steps:
Step 1: in the global model, additional connection layers are added to the VGG-16 network for training, and an end-to-end regression convolutional neural network is designed that maps the input image directly to a saliency map. The network generates multi-level features at multiple scales to detect salient targets, predicting the saliency of each candidate region from the overall perspective of the image, and produces 5 saliency maps based on the global model.
Step 2: in the local model, points or regions that attract attention are extracted using features such as color and texture among adjacent pixels. Through a deep network, saliency detection is converted into a binary classification problem, and multi-level image-block features are learned to predict the saliency value of each pixel and generate a local saliency map.
Step 3: in the saliency map fusion, the initial saliency maps obtained by the global and local models are fused using a conditional random field (CRF) method to generate the final saliency map.
As a preferred embodiment of the present invention, in step 1, an image of size 224 × 224 is first input into the convolutional neural network. The network model used in the invention has 13 convolutional layers in 5 groups with corresponding mapping units, 5 pooling layers, 2 fully connected layers and 1 output layer.
After the input image passes through the 5 pooling operations, a 14 × 14 × 512-dimensional feature vector is output.
The method improves this model: appropriate convolution kernels, strides, padding and pooling operations are designed to extract feature information at different levels, so that different layers of the model generate multiple saliency maps of the same scale.
As a preferred scheme of the present invention, in step 2, the invention uses image blocks of fixed size to traverse the whole image in the form of a sliding window, thereby generating the data set to be trained.
Training data is constructed from each divided image block P and the standard saliency map G. A positive sample is defined as an image block whose overlap with G satisfies:
|P∩G|≥0.8×min(|P|,|G|)
A negative training sample is defined by two conditions: 1) the center pixel of P is not salient, and 2) the overlap of P and G is less than a certain ratio:
|P∩G|<0.3×min(|P|,|G|)
The local network model designed by the invention consists of a 6-layer structure comprising 3 convolutional layers and 3 fully connected layers. Each of the three convolutional layers contains a ReLU nonlinear activation function and a max pooling layer. Among the fully connected layers, to prevent overfitting, the first two adopt the dropout method, and a softmax classifier serves as the output layer to generate the probability that the central pixel is salient.
Using a set of labels {l_i} as supervision for the training set {P_i}, the cross entropy of the softmax function is selected as the cost function, and the network is trained by the stochastic gradient descent method. The cost function is defined as:

L(θ_L) = −(1/m) Σ_{i=1}^{m} Σ_{j∈{0,1}} 1{l_i = j} log P(l_i = j | θ_L) + λ Σ_k ||W_k||²

where m is the number of samples, θ_L is the learning parameter of the network, including the weights and biases of all layers, P(l_i = j | θ_L) is the label probability of the i-th training sample predicted by the network, λ is the weight decay parameter, and W_k is the k-th layer weight.
As a preferred embodiment of the present invention, in step 3, the M saliency maps {S_1, S_2, …, S_M} obtained by the global and local models are fused using a conditional random field.
All saliency maps are combined into a feature vector x(p) = (S_1(p), S_2(p), …, S_M(p)), and y(p) denotes a binary label: salient pixels are labeled 1 and non-salient pixels 0. Therefore, given the feature vectors X = {x_p | p ∈ I}, the conditional distribution probability of the labels Y = {y_p | p ∈ I} is written as:

P(Y | X; θ) = (1/Z(X)) exp( Σ_{p∈I} f_d(x_p, y_p) + η Σ_{p∈I} Σ_{q∈N_p} f_s(x_p, x_q, y_p, y_q) )

where p is a pixel in the original image I, x_p is the saliency feature vector at point p, y_p is its saliency label, θ denotes the parameters of the CRF model to be trained, and Z(X) is the normalizing partition function. f_d(·) and f_s(·) are two feature functions describing, respectively, the individual pixel p and the features within its neighborhood N_p, and η represents a penalty term.
For the generated multi-level saliency maps {S_1, S_2, …, S_M}, the fusion parameter θ is solved with the maximum likelihood function through the CRF regression model, and the final saliency map is generated.
Compared with the prior art, the invention adopting the above technical scheme has the following advantages. A saliency detection method based on deep neural network training is adopted: a multi-level set of saliency maps is constructed through the two network structures of the global model and the local model, and a conditional random field is then used for fusion to generate the final refined saliency map. The global model can effectively utilize global saliency features to predict the saliency value of each target region, while the local model learns the local contrast, texture and shape information of the target from the divided multi-level superpixels. Training combines the rich features of the deep network and fuses the local and global features of the image, giving the algorithm strong robustness.
The method is suitable for images with complex and changeable scenes; the detection results are accurate, and salient targets with better consistency can be obtained.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
As shown in fig. 1, the present invention comprises the steps of:
Step one: for the original image I, an additional convolutional neural network is designed on the basis of the VGG-16 network to generate an initial global saliency map S_G.
Step two: and aiming at the original image I, constructing a training data set, designing a 6-layer convolutional neural network, training and predicting the significance value of each pixel, and generating a local significance map SL.
Step three: and fusing the saliency maps generated in the two models by using a conditional random field method to obtain a final saliency map S. The detailed steps are as follows:
1. Saliency detection based on the global model
(1) Read the original image I, adjust the image size to 224 × 224, and feed the image into the VGG-16 network.
(2) The network parameters are initialized with the model parameters obtained by training on the ILSVRC 2012 dataset.
(3) The VGG-16 model is improved; the specific network structure is shown in FIG. 2, and the specific operations are as follows:
The max pooling operations of the last two stages are omitted, and additional convolutional layers are designed and added;
After Pooling1, the first layer adds 128 convolution kernels of size 5 × 5 with stride 4 and padding 2, generating a 56 × 56 × 128 feature vector;
To ensure that the dimensions of the generated saliency map are not compressed further, the second layer maps this into a 56 × 56 × 64-dimensional feature vector using 1 × 1 convolution kernels;
The third layer also adopts a 1 × 1 convolution kernel to generate a 56 × 56 feature map;
Similarly, corresponding convolutional layers are added after Pooling2, Pooling3 and Pooling4; to ensure that saliency maps of the same size are finally generated, the convolution kernels are all 3 × 3 with strides of 2, 1 and 1 respectively, and the other parameters are kept the same. In the original network, since Pooling4 and Pooling5 are omitted, a 56 × 56 saliency map can be generated directly after conv5.
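As a non-limiting illustration, the modified backbone described in (1)–(3) might be assembled as in the following PyTorch sketch. The pretrained-weight loading and the layer indices used to drop the last two pooling layers follow torchvision's VGG-16 layout; the branch after Pooling1 uses the kernel sizes, stride and padding quoted above, while the tap points for the remaining branches are assumptions.

```python
# Hypothetical PyTorch sketch of the modified VGG-16 backbone; not a verbatim
# reproduction of the network in FIG. 2.
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)   # ILSVRC-pretrained initialization
# Omit the last two max-pooling layers (indices 23 and 30 in torchvision's layout).
backbone = nn.Sequential(*[m for i, m in enumerate(vgg.features)
                           if i not in (23, 30)])

# Extra branch after Pooling1 (64 input channels): 128 kernels of 5x5 with
# stride 4 and pad 2, then 1x1 convolutions mapping 128 -> 64 -> 1 so that a
# single-channel saliency map is produced; the branches after Pooling2-4 would
# follow the same pattern with 3x3 kernels and strides 2, 1 and 1.
branch_after_pool1 = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=5, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(128, 64, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, kernel_size=1),
)
```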
(4) The initial learning rate of the newly added network layers is set to 0.01, that of the original VGG-16 layers to 0.001, the momentum parameter to 0.9, and the weight decay to 0.0005;
(5) In the training stage, supervised learning is carried out against the standard saliency map using the stochastic gradient descent method, and the additionally added convolutional layers and fully connected layers generate 5 saliency maps based on the global model.
(6) The global model network designed by the invention can obtain saliency maps from the global structure of the image, and the information extracted by the different convolutional layers expresses the content of the original image at different scales. However, this network structure lacks a description of the detail information of the original image, and a single global model cannot meet the requirements of quantitative analysis or visual observation; the local information is therefore combined for adjustment to generate a more detailed saliency map.
2. Local model
(1) The original image I is traversed with a sliding window of size 51 × 51 and stride 10, proceeding in order from the upper-left corner to the lower-right corner of the image, to generate the data set to be trained.
(2) The image blocks are labeled using the standard saliency map (ground truth) to generate the positive and negative samples for training the network. The labeling rules are as follows:
A positive sample is defined as an image block P whose overlap with G satisfies |P ∩ G| ≥ 0.8 × min(|P|, |G|);
Similarly, a negative training sample is defined by two conditions: 1) the center pixel of P is not salient, and 2) the overlap of P and G is less than a certain ratio, |P ∩ G| < 0.3 × min(|P|, |G|);
In addition, image blocks that satisfy neither condition are discarded and not used for network training; a sketch of this labeling procedure is given below.
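The following Python sketch illustrates the sliding-window traversal and the labeling rules above; the array names, the grayscale ground-truth convention and the 0.5 binarization threshold are assumptions made for illustration.

```python
# Illustrative sketch of the block-labeling rules; not the invention's exact code.
import numpy as np

def label_block(gt, top, left, size=51):
    """Return 1 (positive), 0 (negative) or None (discarded) for one block P."""
    block = gt[top:top + size, left:left + size] > 0.5   # salient pixels of G inside P
    overlap = block.sum()                                # |P ∩ G|
    p_area = block.size                                  # |P|
    g_area = (gt > 0.5).sum()                            # |G|
    center_salient = gt[top + size // 2, left + size // 2] > 0.5
    if overlap >= 0.8 * min(p_area, g_area):
        return 1                                         # positive sample
    if not center_salient and overlap < 0.3 * min(p_area, g_area):
        return 0                                         # negative sample
    return None                                          # ambiguous block: discard

def make_dataset(img, gt, size=51, stride=10):
    """Traverse the image from upper-left to lower-right with a sliding window."""
    samples = []
    for top in range(0, gt.shape[0] - size + 1, stride):
        for left in range(0, gt.shape[1] - size + 1, stride):
            y = label_block(gt, top, left, size)
            if y is not None:
                samples.append((img[top:top + size, left:left + size], y))
    return samples
```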
(3) The local model extracts points or regions that attract attention using features such as color and texture among adjacent pixels, so that, with the standard saliency map, the saliency detection problem can be converted into a binary classification problem: each pixel either is salient (1) or is not salient (0) relative to its neighboring region.
(4) The local network model designed by the invention consists of a 6-layer structure comprising 3 convolutional layers and 3 fully connected layers. Each of the three convolutional layers contains a ReLU nonlinear activation function and a max pooling layer; ReLU gives the training data sparsity so as to achieve fast training, and the max pooling operation makes the representation translation invariant;
In the first layer of the network, local response normalization is applied to improve generalization. Among the fully connected layers, to prevent overfitting, the first two adopt the dropout method, and a softmax classifier serves as the output layer to generate the probability of whether the central pixel is salient;
The local model network structure follows this 6-layer layout.
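A minimal PyTorch sketch consistent with this layout (3 convolutional layers, each with ReLU and max pooling, local response normalization after the first layer, 3 fully connected layers with dropout on the first two, and a 2-way softmax output) might look as follows; the kernel sizes and channel/unit counts are illustrative assumptions, not the invention's exact configuration.

```python
# Hypothetical sketch of the 6-layer local network for 51x51 input patches.
import torch.nn as nn

local_net = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5),                   # local response normalization
    nn.MaxPool2d(2),                           # 47x47 -> 23x23
    nn.Conv2d(96, 128, kernel_size=3), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                           # 21x21 -> 10x10
    nn.Conv2d(128, 128, kernel_size=3), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                           # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(1024, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(1024, 2),                        # 2-way logits; softmax in the loss
)
```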
(5) The network training uses a set of labels {l_i} as supervision for the training set {P_i}; the cross entropy of the softmax function is selected as the cost function, and the network is trained by the stochastic gradient descent method. The cost function is defined as:

L(θ_L) = −(1/m) Σ_{i=1}^{m} Σ_{j∈{0,1}} 1{l_i = j} log P(l_i = j | θ_L) + λ Σ_k ||W_k||²

where m is the number of samples, θ_L is the learning parameter of the network, including the weights and biases of all layers, P(l_i = j | θ_L) is the label probability of the i-th training sample predicted by the network, λ is the weight decay parameter, and W_k is the k-th layer weight.
(6) The network is trained using the back propagation algorithm. During training, the batch size m is set to 256, the momentum to 0.9, the weight decay to 0.0005, and the learning rate to 0.01; the training period is set to 80 epochs. In the testing stage, the image is traversed with the sliding window, and the 6-layer neural network then predicts for each pixel the probability P(l = 1 | θ_L), which serves as the local saliency value.
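Under the configuration quoted above, the training loop might be sketched as follows in PyTorch; `local_net` and `train_set` are the assumed objects from the earlier sketches, and the weight decay term of the cost function is realized through the optimizer.

```python
# Hedged sketch of local-model training with the stated hyperparameters.
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_set, batch_size=256, shuffle=True)
criterion = torch.nn.CrossEntropyLoss()          # softmax cross entropy
optimizer = torch.optim.SGD(local_net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

for epoch in range(80):                          # training period of 80 epochs
    for patches, labels in loader:
        optimizer.zero_grad()
        loss = criterion(local_net(patches.float()), labels)
        loss.backward()                          # back propagation
        optimizer.step()

# Testing stage: P(l = 1 | θ_L) for a batch of window patches.
with torch.no_grad():
    probs = torch.softmax(local_net(patches.float()), dim=1)[:, 1]
```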
3. Conditional random field fusion
(1) M saliency maps {S_1, S_2, …, S_M} are obtained through the global and local models, and the conditional random field method adopted by the invention fuses these saliency maps effectively to generate the final saliency map S.
(2) Fusing the results with the conditional random field method to generate the saliency map S_F has the advantage of comprehensively considering both the pixel information and the surrounding neighborhood information.
(3) All saliency maps are combined into a feature vector, written as:
x(p) = (S_1(p), S_2(p), …, S_M(p))
y(p) denotes a binary label: salient is marked 1, non-salient is marked 0;
Therefore, given the feature vectors X = {x_p | p ∈ I}, the conditional distribution probability of the labels Y = {y_p | p ∈ I} is written as:

P(Y | X; θ) = (1/Z(X)) exp( Σ_{p∈I} f_d(x_p, y_p) + η Σ_{p∈I} Σ_{q∈N_p} f_s(x_p, x_q, y_p, y_q) )

where p is a pixel in the original image I, x_p is the saliency feature vector at point p, y_p is its saliency label, θ denotes the parameters of the CRF model to be trained, and Z(X) is the normalizing partition function.
f_d(·) and f_s(·) are two feature functions describing, respectively, the individual pixel p and the features within its neighborhood N_p, and η represents a penalty term.
f_d(x_p, y_p) relates only to the saliency maps S_m to be fused and is defined as:
f_d(x_p, y_p) = Σ_{m=1}^{M} λ_m ( y_p S_m(p) + (1 − y_p)(1 − S_m(p)) )
where S_m(p) is the saliency value of point p on the saliency map S_m, and λ_m is a CRF model parameter to be trained.
The feature function f_s(x_p, x_q, y_p, y_q) depends on the neighboring pixels and is defined as:
f_s(x_p, x_q, y_p, y_q) = Σ_{m=1}^{M} α_m |S_m(p) − S_m(q)| · 1(y_p ≠ y_q)
where 1(·) is an indicator function and α_m is a parameter to be trained.
(4) When, in the saliency map S_m, the pixel p is determined to be a salient point while the neighboring pixel q is a background pixel, the saliency probability of the point is correspondingly increased;
When pixels with similar color distributions in the original image are marked with different saliency labels, a penalty factor is added:
η · 1(y_p ≠ y_q) · exp(−β ||I(p) − I(q)||²), with β = (2E(||I(p) − I(q)||²))⁻¹
where I(p) − I(q) denotes the color difference in the original image and E(·) denotes the expectation function.
(5) For the multi-level saliency maps {S_1, S_2, …, S_M} generated in this way, the fusion parameters {λ, α} are solved with the maximum likelihood function through the conditional random field regression model, and the final saliency map is generated.
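Training the full CRF with pairwise terms requires an inference engine, which is beyond a short listing. Purely as an illustration, if the pairwise term is dropped, maximum-likelihood estimation of the per-map weights reduces to pixelwise logistic regression over the M saliency values; the following numpy sketch shows that simplified stand-in, not the complete model of step 3.

```python
# Simplified, unary-only stand-in for the CRF fusion: maximum-likelihood
# fitting of per-map weights by pixelwise logistic regression.
import numpy as np

def fit_fusion_weights(maps, gt, lr=0.1, iters=500):
    """maps: (M, H, W) saliency maps in [0, 1]; gt: (H, W) binary ground truth."""
    X = maps.reshape(len(maps), -1).T              # (H*W, M) feature vectors x_p
    y = (gt.reshape(-1) > 0.5).astype(np.float64)  # labels y_p
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):                         # gradient ascent on log-likelihood
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * X.T @ (y - p) / len(y)
        b += lr * np.mean(y - p)
    return w, b

def fuse(maps, w, b):
    """Fuse M saliency maps into one map with the learned weights."""
    X = maps.reshape(len(maps), -1).T
    s = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return s.reshape(maps.shape[1:])
```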
4. Experiment and evaluation
(1) The invention carries out simulation experiments on 4 common data sets: the MSRA, CSSD, SOD and PASCAL-S data sets.
The MSRA data set, published by Microsoft Research Asia, contains 5,000 images covering a large number of object categories, but most images contain only one salient object;
The CSSD data set contains 1,000 images collected from the Internet, which are closer to images in real scenes;
The SOD data set is derived from the Berkeley segmentation database and comprises 300 images in total, each containing multiple targets of different sizes at different positions;
PASCAL-S is a saliency detection data set of 850 images selected from the PASCAL VOC 2012 data set, whose images contain multiple objects and complex backgrounds.
(2) The method uses Precision, Recall, the F-measure and the Mean Absolute Error (MAE) as quantitative evaluation indexes of the saliency detection algorithm;
The precision refers to the percentage of correctly detected salient pixels among all detected salient pixels, with the formula:
Precision = |B ∩ G| / |B|
The recall is defined as the percentage of correctly detected salient pixels among the marked salient pixels, expressed as:
Recall = |B ∩ G| / |G|
b is a thresholded binary saliency map, and G is a manually labeled standard template;
The F-measure jointly considers the precision and the recall, and its calculation formula is:
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
where β² = 0.3 is usually taken.
The MAE is obtained by calculating the mean absolute difference between the saliency map S and the corresponding manually labeled template G, with the formula:
MAE = (1/(W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|
where W and H are the width and height of the image.
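The four measures can be computed as in the following Python sketch; here B is the thresholded binary map, S the continuous saliency map and G the binary ground truth, all as numpy arrays, and the guard against empty masks is an added assumption.

```python
# Sketch of the evaluation measures defined above.
import numpy as np

def precision_recall(B, G):
    tp = np.logical_and(B, G).sum()            # |B ∩ G|
    precision = tp / max(B.sum(), 1)           # |B ∩ G| / |B|
    recall = tp / max(G.sum(), 1)              # |B ∩ G| / |G|
    return precision, recall

def f_measure(precision, recall, beta2=0.3):   # beta^2 = 0.3 as in the text
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

def mae(S, G):
    return np.abs(S.astype(np.float64) - G.astype(np.float64)).mean()
```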
in summary, in the embodiment of the present invention, throughsteps 1 to 4, the global model and the local optimization model of the image saliency are constructed by fully utilizing the deep convolutional neural network, the relationship between the structures inside the image is deeply mined, the saliency map is generated, and the obtained result is complete and clear.
The success of the method lies in the two different attention mechanisms contained in the model: the global model is trained on the whole image, so that it can detect multiple target regions in the image without being affected by isolated noise, while the local model plays an optimizing role, making the detected result more accurate and refined.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any modifications made on the basis of the technical solutions according to the technical ideas presented in the present invention are within the scope of the present invention.