Disclosure of Invention
The invention aims to solve the technical problem of providing a golden monkey body segmentation algorithm that can meet the natural-scene data processing requirements of a golden monkey individual re-identification task.
In order to realize the task, the invention adopts the following technical scheme:
a golden monkey body segmentation algorithm under a natural scene comprises the following steps:
constructing a semantic segmentation network to realize end-to-end image segmentation; training the semantic segmentation network, and storing the trained network model for segmentation detection of the image to be segmented;
the semantic segmentation network comprises a classification network, a fusion part and an output part, wherein:
the classification network sequentially comprises, from front to back, a first convolution layer, a first max pooling layer, a second convolution layer, a second max pooling layer, a third convolution layer, a third max pooling layer, a fourth convolution layer, a fourth max pooling layer, a fifth convolution layer, a fifth max pooling layer, a sixth convolution layer and a seventh convolution layer; each of the first to fifth convolution layers comprises two consecutive convolution calculations, the sixth convolution layer comprises one convolution calculation, and their function is to perform feature extraction on the input image to obtain a feature map; the seventh convolution layer comprises one convolution calculation and a classification activation function, and is used to perform feature extraction and pixel-level classification to obtain a confidence map; the role of the first to fifth max pooling layers is to reduce the data dimension without losing features;
the fusion part comprises a first feature fusion layer and a second feature fusion layer; the first feature fusion layer up-samples the output of the seventh convolution layer and then performs feature fusion with the output of the fifth max pooling layer; the second feature fusion layer performs feature fusion on the output of the first feature fusion layer and the features extracted by the fourth max pooling layer to obtain a fused confidence map;
the output part comprises an output layer consisting of an up-sampling layer and a classification layer; the input of the up-sampling layer is the output of the second feature fusion layer, and its function is to expand the fused confidence map to the size of the original input image; the classification layer performs classification prediction on each pixel, finally obtaining a high-resolution class heat map consistent with the size of the original input image.
Further, the convolution kernel size of each of the first to fifth convolution layers is 2 × 2 with a step size of 2; the convolution kernel size of the sixth convolution layer is 7 × 7 with a step size of 1; the convolution kernel size of the seventh convolution layer is 1 × 1 with a step size of 1.
Further, the pooling kernel size of each of the first to fifth max pooling layers is 2 × 2, and the step size is 2.
Further, the first feature fusion layer up-samples the output of the seventh convolution layer and then performs feature fusion with the output of the fifth max pooling layer, as follows:
the first feature fusion layer comprises an up-sampling layer and a convolution layer; the input of the up-sampling layer is the output of the seventh convolution layer, the up-sampling factor is 2, and its function is to expand the pixels of the confidence map to facilitate feature fusion, obtaining a dimension-expanded confidence map A2; the input of the convolution layer is the output of the fifth max pooling layer, the convolution kernel size is 1 × 1, the step size is 1, and the activation function is the ReLU function, yielding a confidence map B of the fifth max pooling layer output; the final output of the first feature fusion layer is the sum of confidence map B and confidence map A2, denoted confidence map C.
Further, the second feature fusion layer performs feature fusion on the output of the first feature fusion layer and the features extracted by the fourth max pooling layer to obtain the fused confidence map, as follows:
the second feature fusion layer comprises an up-sampling layer and a convolution layer; the input of the up-sampling layer is confidence map C, the up-sampling factor is 2, and its function is to expand the pixels of the confidence map to facilitate feature fusion, obtaining a dimension-expanded confidence map C2; the input of the convolution layer is the output of the fourth max pooling layer, the convolution kernel size is 1 × 1, the step size is 1, and the activation function is the ReLU function, yielding a confidence map D of the fourth max pooling layer output; the final output of the second feature fusion layer is the sum of confidence map D and confidence map C2, denoted confidence map E, i.e. the fused confidence map.
Further, when the semantic segmentation network is trained, the loss function adopted is as follows:
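One written form of this loss, consistent with the symbol definitions below and with the derivation given in the detailed description (the binary cross-entropy structure and the averaging over height × width are reconstructed assumptions, not the verbatim formula of the original), is:

$$\mathrm{DWL} = -\frac{1}{height \times width}\sum_{i=1}^{height}\sum_{j=1}^{width} dw_{(i,j)}\Big[\, y_{(i,j)}\log \hat{y}_{(i,j)} + \big(1-y_{(i,j)}\big)\log\big(1-\hat{y}_{(i,j)}\big) \Big]$$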
where y(i,j) denotes the label value at pixel point (i, j) of the actually classified image corresponding to the input image, ŷ(i,j) denotes the predicted value at pixel point (i, j) of the output image obtained after the input image is processed by the semantic segmentation network, height and width denote the height and width of the image respectively, and dw(i,j) is the distance-constraint weight function, expressed as:
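In a form reconstructed from the values stated in the detailed description (weight α at the connected-domain center, α − 1 at the farthest ROI pixel when β is the maximum distance to the center, and weight 1 kept for the background), this can be written as:

$$dw_{(i,j)} = \begin{cases} \alpha - \dfrac{\mathrm{distance}\big(I_{(i,j)},\, center_{(i,j)}\big)}{\beta}, & I_{(i,j)}\ \text{inside the golden monkey ROI region} \\[6pt] 1, & I_{(i,j)}\ \text{in the environmental background region} \end{cases}$$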
where distance(I(i,j), center(i,j)) denotes the distance from the pixel point I(i,j) to the center center(i,j) of the connected domain in which it lies, and α and β are two constants.
The invention has the following technical characteristics:
1. The algorithm achieves golden monkey target detection by semantically segmenting the original image. It mainly separates the golden monkey from the natural environment with a fully convolutional network (FCN) in deep learning, and uses a distance-weight-based loss function, i.e. an improved cross-entropy loss function, to make the FCN model focus on the integrity of the golden monkey individual, thereby improving the final detection accuracy.
2. Experimental comparative analysis shows that the improved loss function better resolves the problem of a golden monkey body being split into several parts, and that the improved natural-scene golden monkey detection network better improves the image segmentation results affected by such problems.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. Golden monkey body segmentation in this scheme refers to detecting and segmenting the image region where the golden monkey's body is located from an image containing golden monkeys.
Step 1, constructing a semantic segmentation network to realize end-to-end image segmentation
The semantic segmentation network comprises a classification network, a fusion part and an output part, wherein:
First part, the classification network
The classification network is composed of 7 convolutional layers and 5 pooling layers, arranged from front to back as follows: a first convolution layer conv1, a first max pooling layer pool1, a second convolution layer conv2, a second max pooling layer pool2, a third convolution layer conv3, a third max pooling layer pool3, a fourth convolution layer conv4, a fourth max pooling layer pool4, a fifth convolution layer conv5, a fifth max pooling layer pool5, a sixth convolution layer conv6 and a seventh convolution layer conv7. Wherein:
The first to fifth convolution layers conv1 to conv5 each comprise two consecutive convolution calculations and perform feature extraction on the input image to obtain a feature map. The sixth convolution layer conv6 comprises one convolution calculation, performs feature extraction on its input to obtain a feature map, and passes it on to the next layer. The seventh convolution layer conv7 comprises one convolution calculation and a classification activation function, and performs feature extraction and pixel-level classification to obtain a low-resolution class heat map, namely a confidence map. The convolution kernel size of each layer from conv1 to conv5 is 2 × 2 with a step size of 2; the convolution kernel size of conv6 is 7 × 7 with a step size of 1; the convolution kernel size of conv7 is 1 × 1 with a step size of 1.
The first to fifth max pooling layers pool1 to pool5 reduce the data dimension without losing features, yielding feature maps of reduced dimension; the pooling kernel size of pool1 to pool5 is 2 × 2 with a step size of 2.
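For illustration, a minimal PyTorch sketch of this classification network is given below. Only the layer ordering (two convolutions per block in conv1 to conv5, one convolution in conv6, a 1 × 1 convolution in conv7, and 2 × 2 stride-2 max pooling) follows the description; the 3 × 3 stride-1 padded kernels inside conv1 to conv5, the channel widths, and the deferral of the classification activation to the output layer are assumptions made so that each pooling stage halves the spatial resolution.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs consecutive convolution + ReLU operations.
    Kernel size 3x3, stride 1, padding 1 are illustrative assumptions."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ClassificationNetwork(nn.Module):
    """conv1..conv5 (two convolutions each) with pool1..pool5, then conv6 and conv7."""
    def __init__(self, num_classes=2):
        super().__init__()
        widths = [64, 128, 256, 512, 512]              # illustrative channel widths
        blocks, in_ch = [], 3
        for w in widths:                               # conv1 .. conv5
            blocks.append(conv_block(in_ch, w, n_convs=2))
            in_ch = w
        self.blocks = nn.ModuleList(blocks)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)   # pool1..pool5: 2x2, stride 2
        self.conv6 = nn.Sequential(nn.Conv2d(512, 1024, kernel_size=7, stride=1, padding=3),
                                   nn.ReLU(inplace=True))
        self.conv7 = nn.Conv2d(1024, num_classes, kernel_size=1, stride=1)

    def forward(self, x):
        pooled = []                                    # keep pool outputs for the fusion part
        for block in self.blocks:
            x = self.pool(block(x))
            pooled.append(x)
        confidence = self.conv7(self.conv6(x))         # low-resolution confidence map
        return confidence, pooled[3], pooled[4]        # conv7, pool4 and pool5 outputs
```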
Second part, the fusion part
The fusion part comprises a first feature fusion layer and a second feature fusion layer. The first feature fusion layer up-samples the output of the seventh convolution layer and then performs feature fusion with the output of the fifth max pooling layer; the second feature fusion layer performs feature fusion on the output of the first feature fusion layer and the features extracted by the fourth max pooling layer.
The first fusion layer fuses the segmentation results of the bottom two layers of features and serves to fuse features output by multiple layers, thereby improving recognition accuracy; the second feature fusion layer fuses the segmentation results of the bottom three layers, realizing a cross-layer connection.
The feature fusion part is shown in figure 2:
The first feature fusion layer comprises an up-sampling layer and a convolution layer. The input of the up-sampling layer is the output of conv7, the up-sampling factor is 2, and its function is to expand the pixels of the confidence map to facilitate feature fusion, obtaining a dimension-expanded confidence map A2. The input of the convolution layer is the output of pool5, the convolution kernel size is 1 × 1, the step size is 1, and the activation function is the ReLU function, yielding a confidence map B of the pool5 output. The final output of the first feature fusion layer is the sum of confidence map B and confidence map A2, denoted confidence map C, i.e. the segmentation result map fusing the features of the bottom two layers; it fuses features output by multiple layers and improves recognition accuracy.
The second feature fusion layer comprises an up-sampling layer and a convolution layer. The input of the up-sampling layer is confidence map C (the segmentation result map output by the first feature fusion layer), the up-sampling factor is 2, and its function is to expand the pixels of the confidence map to facilitate feature fusion, obtaining a dimension-expanded confidence map C2. The input of the convolution layer is the output of pool4, the convolution kernel size is 1 × 1, the step size is 1, and the activation function is the ReLU function, yielding a confidence map D of the pool4 output. The final output of the second feature fusion layer is the sum of confidence map D and confidence map C2, denoted confidence map E, i.e. the segmentation result map fusing the bottom three layers; it realizes a cross-layer connection, fuses features output by multiple layers, and improves recognition accuracy.
Third part, the output part
The output part comprises an output layer, which consists of an up-sampling layer and a classification layer. The input of the up-sampling layer is the output of the fusion part, namely the second feature fusion layer; the up-sampling factor is 8, and its function is to enlarge the fused confidence map E to the size of the original input image (the image input to the classification network). The activation function of the classification layer is the SoftMax function, which performs classification prediction on each pixel, finally obtaining a high-resolution class heat map consistent with the size of the original input image.
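The fusion and output parts can be sketched as follows. The code mirrors the described operations (2x up-sampling of the incoming confidence map, a 1 × 1 stride-1 convolution with ReLU applied to the pooled features, elementwise addition, then up-sampling to the input size followed by per-pixel SoftMax); the use of bilinear interpolation, the channel counts, and the assumption that each skip feature map spatially matches the up-sampled confidence map are illustrative choices, not details from the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionLayer(nn.Module):
    """Up-sample the incoming confidence map by 2x, project the skip features
    with a 1x1 convolution + ReLU, and add the two maps elementwise."""
    def __init__(self, skip_channels, num_classes=2):
        super().__init__()
        self.proj = nn.Conv2d(skip_channels, num_classes, kernel_size=1, stride=1)

    def forward(self, confidence, skip):
        up = F.interpolate(confidence, scale_factor=2, mode="bilinear", align_corners=False)
        proj = F.relu(self.proj(skip))      # confidence map B (or D) of the pooled output
        return up + proj                    # confidence map C (or E)

class OutputLayer(nn.Module):
    """Enlarge the fused confidence map E to the original input size and apply
    per-pixel SoftMax to obtain the high-resolution class heat map."""
    def forward(self, fused, input_size):
        up = F.interpolate(fused, size=input_size, mode="bilinear", align_corners=False)
        return torch.softmax(up, dim=1)
```

In use, the first fusion layer would take the conv7 and pool5 outputs and yield confidence map C, the second would take confidence map C and the pool4 output and yield confidence map E, and the output layer would expand E to the original image size.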
Step 2, training the semantic segmentation network constructed in step 1
Creating a data set: in the data set, the pixels belonging to the golden monkey are uniformly marked as the target area. A golden monkey target detection data set under natural scenes, Nature golden monkey Segmentation, is then created according to the standard of the PASCAL VOC data set and named NGS. The NGS data set contains a total of 600 golden monkey images in natural scenes, covering 31 golden monkey individuals with a uniform distribution of gender and age groups and rich action types.
Training is carried out using the NGS data set; training data and test data are randomly split in a 4:1 ratio, the maximum number of iterations is set to 5000, and the network model is saved after training. The specific training process is as follows:
The final output of the semantic segmentation network, denoted ŷ, and the classified image of the input image, denoted y (namely the actual classification result of the original image), are input into the loss function and the loss is calculated; the result reflects the difference between the network's prediction and the actual result. The loss function is then differentiated with respect to the parameters in the network, and the network parameters are updated according to the derivatives; the learning rate is set to 0.00005 and kept constant throughout training.
In this scheme, a distance-weight-based loss function, distance-weight loss, denoted DWL, is adopted as the loss function. It is obtained by introducing a weight coefficient for each pixel point into the basic cross-entropy function and taking a distance-constraint weight function as that coefficient. The DWL loss function better measures the discrepancy between the predicted values and the true values of the samples, so that the model pays more attention to the central area of the golden monkey's body and learns the structural information of the golden monkey's body. A specific derivation of the DWL loss function is given below.
For the segmentation problem of the invention, the network model outputs, for each pixel, the predicted probability that it belongs to the golden monkey region. The cross-entropy function L of the network model can be derived from the basic classification cross-entropy function as follows:
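Written out for the two-class (golden monkey vs. background) case with the symbols defined below, this per-pixel cross-entropy can be expressed as (a reconstruction, averaging over all pixels):

$$L = -\frac{1}{height \times width}\sum_{i=1}^{height}\sum_{j=1}^{width}\Big[\, y_{(i,j)}\log \hat{y}_{(i,j)} + \big(1-y_{(i,j)}\big)\log\big(1-\hat{y}_{(i,j)}\big) \Big]$$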
where y(i,j) denotes the label value at pixel point (i, j) of the actually classified image corresponding to the input image, ŷ(i,j) denotes the predicted value at pixel point (i, j) of the output image obtained after the input image is processed by the semantic segmentation network, and height and width denote the height and width of the image respectively.
The invention makes two main improvements to the cross-entropy loss function: assigning different weights to the ROI region and the environmental background region, and introducing distance information from the center of the golden monkey to the edge positions within the ROI region.
First, a weight coefficient W(i,j) is introduced for each pixel point in the basic cross-entropy function; the new cross-entropy function WL is as follows:
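With the weight coefficient inserted into the same per-pixel cross-entropy (again a reconstructed form), WL reads:

$$\mathrm{WL} = -\frac{1}{height \times width}\sum_{i=1}^{height}\sum_{j=1}^{width} W_{(i,j)}\Big[\, y_{(i,j)}\log \hat{y}_{(i,j)} + \big(1-y_{(i,j)}\big)\log\big(1-\hat{y}_{(i,j)}\big) \Big]$$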
where W(i,j) denotes the loss weight at pixel point (i, j); the original cross-entropy loss function L can be understood as the case in which W(i,j) is constantly equal to 1, so that the weight can be ignored.
When calculating W(i,j), its value is assigned according to the label value y(i,j) of the pixel point and the position information of the pixel point (i, j): first, the weight of the environmental background region is kept unchanged while the weight coefficient of the golden monkey ROI region is increased, so that the model pays more attention to the golden monkey region; second, for the golden monkey ROI target region, in order to exploit the body structure information of the golden monkey, linearly decreasing weights are set from the central part of the golden monkey's body, namely the rectangular center of the ROI region, to the edge, strengthening the model's learning of the central body region while preserving as much as possible the important distance-constraint information between the body center and the hair edge. The distance-constraint weight function dw(i,j) is as follows:
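One expression consistent with the values stated below (dw equal to α at the connected-domain center, α − 1 at the farthest ROI pixel when β is the maximum distance to the center, and weight 1 kept for the background) is the following reconstruction:

$$dw_{(i,j)} = \begin{cases} \alpha - \dfrac{\mathrm{distance}\big(I_{(i,j)},\, center_{(i,j)}\big)}{\beta}, & I_{(i,j)}\ \text{inside the golden monkey ROI region} \\[6pt] 1, & I_{(i,j)}\ \text{in the environmental background region} \end{cases}$$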
where I(i,j) denotes the pixel at (i, j), center(i,j) denotes the center of the connected domain in which pixel point (i, j) lies, distance(I(i,j), center(i,j)) denotes the distance from pixel point I(i,j) to the center center(i,j) of its connected domain, and α and β are two constants controlling the weight value of pixel point I(i,j), so that the weights of different pixel points within the golden monkey ROI region fall in a bounded range determined by α and β. Let α be 2 and β be max(distance(I(i,j), center(i,j))); then when I(i,j) is the center of the connected domain, dw(i,j) takes the value 2, and when I(i,j) is farthest from the center of the connected domain, dw(i,j) takes the value 1, thereby achieving weights that decrease from the center to the edge of the connected domain within the golden monkey ROI region.
Taking the weight coefficient W(i,j) of the new cross-entropy function WL to be dw(i,j), the improved distance-weight-based loss function DWL is obtained as follows:
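In the same reconstructed form as above:

$$\mathrm{DWL} = -\frac{1}{height \times width}\sum_{i=1}^{height}\sum_{j=1}^{width} dw_{(i,j)}\Big[\, y_{(i,j)}\log \hat{y}_{(i,j)} + \big(1-y_{(i,j)}\big)\log\big(1-\hat{y}_{(i,j)}\big) \Big]$$

A minimal Python sketch of this loss is given below for illustration; the Euclidean distance to the connected-domain centroid, the handling of β as the per-component maximum distance, and the simple pixel average are assumptions consistent with the description above, not the exact implementation.

```python
import numpy as np
import torch
from scipy.ndimage import label as cc_label

def distance_weights(label_map, alpha=2.0):
    """Distance-constraint weights dw for one binary label map (H x W numpy array,
    1 = golden monkey ROI, 0 = background). Background pixels keep weight 1; ROI
    pixels decrease linearly from alpha at the connected-domain centre to
    alpha - 1 at its farthest pixel (beta = maximum distance to the centre)."""
    dw = np.ones(label_map.shape, dtype=np.float32)
    components, n = cc_label(label_map)
    for k in range(1, n + 1):
        ys, xs = np.nonzero(components == k)
        cy, cx = ys.mean(), xs.mean()                       # centre of the connected domain
        dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        beta = dist.max() if dist.max() > 0 else 1.0
        dw[ys, xs] = alpha - dist / beta
    return dw

def dwl_loss(pred, target, weights):
    """Distance-weight loss: weighted binary cross-entropy averaged over all pixels.
    pred and target are torch tensors of shape (H, W); pred holds probabilities."""
    eps = 1e-7
    w = torch.as_tensor(weights, dtype=pred.dtype)
    ce = -(target * torch.log(pred + eps) + (1 - target) * torch.log(1 - pred + eps))
    return (w * ce).mean()
```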
Step 3, storing the trained network model for segmentation detection of the image to be segmented
After the semantic segmentation network is trained in step 2, the trained network model is saved. In practical application, an image to be segmented is input into the network model, and the output of the network model is the segmented high-resolution heat map.
Experimental comparative analysis shows that the improved loss function better resolves the problem of the golden monkey's body being split into several parts, and that the improved natural-scene golden monkey detection network better improves the image segmentation results affected by such problems.
The experimental comparative analysis procedure is as follows:
The invention performs experiments on the NGS data set, with training data and test data randomly split in a 4:1 ratio. In the natural-scene golden monkey detection algorithm, the learning rate is set to 0.00005 and kept constant during learning, and the maximum number of iterations is set to 5000. After the segmentation result of the original natural-scene image is obtained, a rectangular golden monkey individual detection result is generated from the edges of the segmented image; in the generation process, target pixel regions that are too small to be used normally are eliminated, and the golden monkey individual data are finally obtained from the image region within the rectangular frame.
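As an illustration of this post-processing step, a short sketch is given below; the connected-component analysis, the bounding-box convention and the minimum-area threshold are assumptions, not values from the original experiments.

```python
import numpy as np
from scipy.ndimage import label as cc_label, find_objects

def boxes_from_mask(mask, min_area=500):
    """Generate rectangular detection boxes from a binary segmentation mask,
    discarding target pixel regions too small to be used normally.
    min_area is an illustrative threshold."""
    components, _ = cc_label(mask)
    boxes = []
    for k, sl in enumerate(find_objects(components), start=1):
        if sl is None:
            continue
        if (components[sl] == k).sum() < min_area:      # drop overly small regions
            continue
        ys, xs = sl
        boxes.append((xs.start, ys.start, xs.stop, ys.stop))   # (x1, y1, x2, y2)
    return boxes
```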
The IoU criterion can be used to quantitatively measure the correlation between the true value and the predicted value, as shown in fig. 3. In the present invention, the segmentation results obtained by the FCN network before and after applying the DWL function are compared: IoU is calculated between each segmentation result and the ground truth, and the average IoU over all 100 images in the test set is taken as the evaluation index of network performance. The results are shown in the following table:
IoU comparison before and after DWL function improvement
| Loss function | IoU |
| Cross entropy function | 85.28% |
| DWL function | 86.14% |
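For reference, the IoU of a single image can be computed from the predicted and ground-truth binary masks as in the short sketch below (a minimal illustration, not the evaluation code used in the experiments); averaging it over the test images gives values like those in the table.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over Union between a predicted binary mask and ground truth."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty
    return np.logical_and(pred, gt).sum() / union
```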
As can be seen from the table, the segmentation effect obtained by the original semantic segmentation network is already good on the basic data, which shows that the method is suitable for the segmentation task of the invention; after the loss function is improved, the result using the DWL function is 0.86 percentage points higher than that of the original network, which demonstrates that the improvement of the loss function raises the overall segmentation performance of the network model to a certain extent.
As shown in fig. 4, the rectangular frame in (a) shows the segmentation result of the original image, and the rectangular frame in (b) is the rectangular golden monkey individual detection result generated from the result image.
Fig. 5 shows in detail the influence of segmentation errors on the generation of rectangular detection results: (a) and (c) show two cases in which the lower edge of the golden monkey is not completely covered by the rectangular detection result, where the segmentation error leads to missing edge information of the golden monkey; (b) shows a case in which a single golden monkey is split into two detection boxes due to errors in the segmentation result, i.e. the segmentation error causes a single golden monkey to be divided into several rectangles. Clearly, the detection data shown in fig. 5 cannot be used for the golden monkey individual re-identification experiment.
With the method of the invention, the loss function of the original FCN network is improved by calculating the loss weights of different pixels based on distance, thereby introducing constraint information about the golden monkey's body structure; the segmentation results are shown in fig. 6 and exhibit a clear improvement. The figure revisits the two typical segmentation-error cases in which the rectangular detection results were unusable before the improvement: the segmentation results in both images are obviously improved and the area of erroneous results is significantly reduced. For the generated rectangular detection results, in (a) the lower edge of the golden monkey is completely included in the detection result, and in (b) the golden monkey originally split into two rectangular detection boxes is now correctly detected as a single complete golden monkey individual.
The distance-weight-based loss function DWL designed by the invention can greatly improve golden monkey segmentation results in natural scenes, thereby effectively improving the accuracy of the final rectangular detection results and yielding, from the rectangular region results, golden monkey individual image data that meet the requirements of the golden monkey individual re-identification experiment.