Disclosure of Invention
The invention aims to provide an interactive object segmentation method based on bounding box input and gaze point assistance, which performs interactive segmentation using an interactive form that combines an input box with an estimated gaze point map.
The technical solution for achieving this aim is an interactive object segmentation method based on bounding box input and gaze point assistance, comprising the following steps:
Step 1: obtain the bounding box of the object to be segmented in an image I and convert it into a binary bounding box map B; simultaneously obtain the gaze point map FM of the target image and erase gaze point information outside the input box to obtain a processed gaze point map;
Step 2: input the image I and the bounding box map B into an initial segmentation network, Coarse U-Net, to generate a coarse segmentation result MC and box-based multi-scale features;
Step 3: compute the similarity between the coarse segmentation result MC and the processed gaze point map, adjust the gaze point map according to this similarity, and obtain an adjusted gaze point map FM';
Step 4: concatenate the image I, the adjusted gaze point map FM', and the coarse segmentation result MC along the channel dimension, input the result to a refinement segmentation network, Refinement U-Net, extract refinement features, and fuse the box-based features extracted by Coarse U-Net layer by layer during this process;
Step 5: input the refinement features into the decoder of Refinement U-Net for decoding to obtain the final segmentation result M.
Further, step 1 (obtaining the bounding box of the object to be segmented in the image I, converting it into a binary bounding box map B, simultaneously obtaining the gaze point map FM of the target image, and erasing gaze point information outside the input box to obtain a processed gaze point map) comprises the following steps:
Step 1-1: label the object to be segmented in the image I by drawing a rectangular box, and record the lower-left corner coordinate (x_min, y_min) and the upper-right corner coordinate (x_max, y_max) of the box;
Step 1-2: compute the bounding box map B, in which the pixels inside the box (x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max) are set to 1 and all other pixels are set to 0 (a reconstruction of this formula is given after step 1-4);
Step 1-3: using the trained gaze point prediction model TranSalNet, input the image I and pass it sequentially through the CNN encoder, Transformer encoder, and CNN decoder of TranSalNet to obtain an estimated gaze point map FM;
Step 1-4: set all pixels of the estimated gaze point map FM that lie outside the input box to 0 by pixel-wise multiplication with B, so as to erase the gaze point information outside the input box and obtain the processed gaze point map.
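The formulas for steps 1-2 and 1-4 are not reproduced in the text above; the following is a reconstruction consistent with the surrounding description, where the hat notation for the processed gaze point map is introduced here only for readability and is not a symbol from the original:

```latex
B(x, y) =
\begin{cases}
1, & x_{\min} \le x \le x_{\max} \ \text{and} \ y_{\min} \le y \le y_{\max}, \\
0, & \text{otherwise},
\end{cases}
\qquad
\widehat{FM} = FM \odot B,
```

where ⊙ denotes pixel-wise multiplication.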
Further, step 2 (inputting the image I and the bounding box map B into the initial segmentation network Coarse U-Net to generate a coarse segmentation result MC and box-based multi-scale features) comprises the following steps:
Step 2-1: concatenate the image I with the bounding box map B along the channel dimension to obtain an input tensor Input;
Step 2-2: input Input into the encoder of Coarse U-Net for feature extraction. Each encoder layer applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the max-pooled (MaxPooling) output of the preceding layer; five features of different scales are finally obtained, one for each encoder layer i = 1, 2, ..., 5;
Step 2-3: input the encoder features into the decoder for decoding. Each decoder layer applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the channel-wise concatenation (Concat) of the upsampled (Upsample) output of the preceding decoder layer and the corresponding encoder feature; finally, the output of the last decoder layer is reduced to a single channel by a 3×3 convolution layer, and the preliminary segmentation result MC is obtained after a Sigmoid operation.
Further, step 3 (calculating the similarity between the coarse segmentation result MC and the processed gaze point map, and adjusting the gaze point map according to this similarity to obtain the adjusted gaze point map FM') is specifically as follows:
Step 3-1: compute the similarity α between the coarse segmentation result MC and the processed gaze point map as their intersection-over-union (IoU), using pixel-wise multiplication to form the intersection and pixel-wise summation over each map to form the union;
Step 3-2: globally adjust the processed gaze point map using α, multiplying each of its pixel values by α to obtain FM'.
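The similarity and adjustment formulas of steps 3-1 and 3-2 are likewise not reproduced; based on the description of α as an IoU-style measure and of the adjustment as a global scaling, a plausible reconstruction (using the hat notation introduced after step 1-4) is:

```latex
\alpha = \frac{\left| M_C \odot \widehat{FM} \right|}
             {\left| M_C \right| + \left| \widehat{FM} \right| - \left| M_C \odot \widehat{FM} \right|},
\qquad
FM' = \alpha \cdot \widehat{FM},
```

where ⊙ denotes pixel-wise multiplication and |·| denotes pixel-wise summation.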
Further, step 4 (concatenating the image I, the adjusted gaze point map FM', and the coarse segmentation result MC along the channel dimension, inputting the result to the refinement segmentation network Refinement U-Net, extracting refinement features, and fusing the box-based features extracted by Coarse U-Net layer by layer during this process) comprises the following steps:
Step 4-1: concatenate the original image I, the adjusted gaze point map FM', and the coarse segmentation result MC along the channel dimension to obtain an input tensor Input2;
Step 4-2: input Input2 into the encoder of Refinement U-Net to extract features layer by layer; at the same time, the cross skip connection module fuses the feature of the corresponding Coarse U-Net layer with the current feature. The convolution blocks extracting features in the Refinement U-Net encoder each consist of three 3×3 convolution layers with ReLU activations, and the fused feature at each layer is processed by two criss-cross attention operations (denoted Φ_r) to give the refinement feature of the i-th encoder layer.
Further, step 5 (inputting the refinement features into the decoder of Refinement U-Net for decoding to obtain the final segmentation result M) is specifically as follows: each decoder layer of Refinement U-Net applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the upsampled (Upsample) output of the preceding decoder layer; finally, the output of the last decoder layer is reduced to a single channel by a 3×3 convolution layer, and the final segmentation result M is obtained through a Sigmoid operation.
An interactive object segmentation system based on bounding box input and gaze point assistance implements the above interactive object segmentation method based on bounding box input and gaze point assistance, thereby realizing interactive object segmentation based on bounding box input and gaze point assistance.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above interactive object segmentation method based on bounding box input and gaze point assistance, thereby realizing interactive object segmentation based on bounding box input and gaze point assistance.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above interactive object segmentation method based on bounding box input and gaze point assistance, thereby realizing interactive object segmentation based on bounding box input and gaze point assistance.
Compared with the prior art, the invention has the following remarkable advantages:
1) The invention uses an interaction mode that combines the input box with a gaze point map, making full use of implicit interaction information to assist segmentation and thereby improving segmentation quality.
2) The dual U-Net segmentation network structure, the gaze point map adjustment module, and the cross skip connections with a self-attention mechanism proposed by the invention can effectively alleviate the discrepancy problem caused by the estimated gaze point map and further improve segmentation quality.
3) The method can be used directly as a plug-in optimization tool to improve the segmentation quality of other input-box-based interactive object segmentation models, allowing any such model to conveniently benefit from the estimated gaze point map.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to Figs. 1 and 2, the interactive object segmentation method based on bounding box input and gaze point assistance of the present invention comprises the following steps:
Step 1: acquire the object to be segmented and its bounding box in the target image, and estimate the gaze point map of the target image. Specifically:
For a given target image, a rectangular box is first drawn on the image to mark the target object. The box is then converted into a binary image in which pixels inside the box are set to 255 (representing the target region) and pixels outside the box are set to 0 (representing the background). Next, using the trained gaze point prediction model TranSalNet, the input image I passes sequentially through the CNN encoder, Transformer encoder, and CNN decoder of TranSalNet to produce an estimated gaze point map. Since the estimated gaze point map is based on free viewing, it must be processed in this step to erase the gaze points outside the box. Specifically, all pixels of the estimated gaze point map that lie outside the input box are set to 0 to avoid interference with subsequent processing, thereby ensuring that only gaze points within the target region are considered.
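A minimal sketch of this preprocessing, assuming PyTorch tensors; the helper names are hypothetical, a 0/1 box map is assumed so that multiplication acts as a pure mask, and the TranSalNet call in the usage comment stands in for whatever inference code is actually used:

```python
import torch

def build_box_map(h, w, xmin, ymin, xmax, ymax):
    """Binary bounding box map B: 1 inside the box, 0 outside.

    The patent text describes 255 for the foreground of the visualized box
    image; a 0/1 mask is assumed here so that pixel-wise multiplication
    erases gaze information outside the box without rescaling it.
    """
    box = torch.zeros(1, h, w)
    box[:, ymin:ymax + 1, xmin:xmax + 1] = 1.0
    return box

def mask_gaze_map(fm, box):
    """Erase gaze information outside the input box (step 1-4)."""
    return fm * box  # pixel-wise multiplication

# Usage sketch (TranSalNet inference assumed to be available elsewhere):
# fm = transalnet(image.unsqueeze(0))          # 1 x 1 x H x W saliency map
# b  = build_box_map(320, 320, 40, 60, 280, 300)
# fm_in_box = mask_gaze_map(fm.squeeze(0), b)  # processed gaze point map
```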
Step 2: input the target image and the input box into the initial segmentation network Coarse U-Net to obtain a coarse segmentation result and box-based multi-scale features. Specifically:
The target image is combined with the input box information to form a four-channel tensor as input. In Coarse U-Net, the encoder is responsible for extracting multi-scale features of the input data at five scales (i = 1, 2, ..., 5), each corresponding to a particular level of feature representation. The decoder then fuses the multi-scale features extracted by the encoder through skip connections and generates an initial coarse segmentation result MC.
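A condensed sketch of the Coarse U-Net structure described above, assuming PyTorch; the class name, channel widths, and upsampling mode are illustrative assumptions, while the three 3×3 convolutions per block, the max-pooled encoder, the concatenating decoder, and the 1-channel Sigmoid head follow the text:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Three 3x3 convolutions with ReLU, as described for each block."""
    layers = []
    for i in range(3):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class CoarseUNet(nn.Module):
    def __init__(self, in_ch=4, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.enc = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.enc.append(conv_block(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.ModuleList(
            [conv_block(widths[i + 1] + widths[i], widths[i]) for i in range(4)])
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.head = nn.Conv2d(widths[0], 1, 3, padding=1)  # reduce to 1 channel

    def forward(self, x):
        feats = []                        # box-based multi-scale features
        for i, block in enumerate(self.enc):
            x = block(x if i == 0 else self.pool(x))
            feats.append(x)
        d = feats[-1]
        for i in range(3, -1, -1):        # decode from deepest to shallowest
            d = self.dec[i](torch.cat([self.up(d), feats[i]], dim=1))
        mc = torch.sigmoid(self.head(d))  # coarse segmentation result MC
        return mc, feats
```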
Step 3: input the initial segmentation result and the gaze point map into the gaze point map adjustment module, which computes their similarity and adjusts the gaze point map accordingly.
As previously described, there may be a discrepancy between the estimated gaze area and the user's real target object. Accordingly, a gaze point map adjustment module is developed to estimate the reliability of the gaze point map and then adjust it. As described in step 1, only the part of the gaze point map within the input box is considered; its correlation with the coarse segmentation result MC is then measured.
In practice, this correlation α is computed as the intersection-over-union (IoU) between the processed gaze point map and MC: the pixel-wise product of the two maps is summed to give the intersection, and the pixel-wise sums of the two maps minus this intersection give the union. A higher α means higher reliability, i.e., better agreement between the gaze area in the estimated gaze point map and the user's real target object, and vice versa. α is therefore used to globally adjust the processed gaze point map, multiplying each of its pixel values by α to obtain the adjusted gaze point map FM'.
When α is low, the gaze point map is suppressed, which limits its influence. In the extreme case where α is zero, FM' is set entirely to zero, meaning that the gaze area is completely unrelated to the user's real target object and the segmentation network should rely entirely on the information provided by the input box.
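A minimal sketch of the adjustment computation, assuming soft maps with values in [0, 1]; the eps term is a numerical-stability assumption not mentioned in the text:

```python
import torch

def adjust_gaze_map(mc, fm_in_box, eps=1e-6):
    """Gaze point map adjustment module (step 3).

    mc:        coarse segmentation result MC, values in [0, 1]
    fm_in_box: gaze point map already masked by the input box (step 1-4)
    Returns the reliability alpha and the globally adjusted map FM'.
    """
    inter = (mc * fm_in_box).sum()
    union = mc.sum() + fm_in_box.sum() - inter
    alpha = inter / (union + eps)        # IoU-style similarity
    fm_adjusted = alpha * fm_in_box      # global scaling; zero when alpha == 0
    return alpha, fm_adjusted
```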
Step 4: concatenate the image I, the adjusted gaze point map FM', and the coarse segmentation result MC along the channel dimension, input the result to the refinement segmentation network Refinement U-Net, extract refinement features, and fuse the box-based features extracted by Coarse U-Net layer by layer during this process.
Specifically, the original image I, the adjusted gaze point map FM', and the coarse segmentation result MC are first concatenated along the channel dimension to obtain the input tensor Input2.
Then, Input2 is fed to the encoder of Refinement U-Net to extract features layer by layer; at the same time, the cross skip connection module fuses the features of the corresponding Coarse U-Net layers.
At each encoder layer of Refinement U-Net, a convolution block consisting of three 3×3 convolution layers with ReLU activations extracts the current feature; this feature is connected with the feature of the corresponding Coarse U-Net layer through the cross skip connection, and the connected features are processed by two criss-cross attention operations (denoted Φ_r) to produce the refinement feature of that layer.
Although the estimated gaze point map has been globally adjusted by the preceding gaze point map adjustment module, the adverse effects it may cause have not been completely eliminated. Compared with the estimated gaze point map, the input box provides more reliable interaction information about the target object. Therefore, during feature extraction in the Refinement U-Net encoder, the features extracted by Coarse U-Net are cross-connected with the current features at different levels, and these connected features are fused efficiently by the two criss-cross attention operations. In this way, the features extracted from the input box are further exploited to strengthen their dominant role, and the connected features are better fused through the self-attention mechanism, so that the gaze point information is incorporated appropriately.
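A structural sketch of the cross skip connection, assuming a CrissCrossAttention module (as in CCNet) is available from elsewhere; the 1×1 reduction used to match channel counts after concatenation is an assumption, since the text does not state how the fusion is dimensioned:

```python
import torch
import torch.nn as nn

class CrossSkipFusion(nn.Module):
    """Fuse a Refinement U-Net encoder feature with the corresponding
    Coarse U-Net feature, then apply two criss-cross attention operations.

    `attention_cls` is assumed to be an existing CrissCrossAttention
    implementation (e.g., from CCNet); it is not defined in the patent text.
    """
    def __init__(self, refine_ch, coarse_ch, attention_cls):
        super().__init__()
        self.reduce = nn.Conv2d(refine_ch + coarse_ch, refine_ch, 1)  # assumption
        self.att1 = attention_cls(refine_ch)
        self.att2 = attention_cls(refine_ch)

    def forward(self, refine_feat, coarse_feat):
        fused = self.reduce(torch.cat([refine_feat, coarse_feat], dim=1))
        return self.att2(self.att1(fused))  # two criss-cross attention passes
```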
Step 5: the fused features are decoded by the decoder of Refinement U-Net to obtain the final segmentation result M. Specifically:
Each decoder layer of Refinement U-Net applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the upsampled features from the preceding layer; finally, the output of the last decoder layer is reduced to a single channel by a 3×3 convolution layer and passed through a Sigmoid operation to obtain the final segmentation result M.
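A compact sketch of this decoder path; the class name and channel widths are assumptions, and because the text does not state whether encoder features are concatenated in this decoder, the sketch keeps the plain upsample-and-convolve form described above:

```python
import torch
import torch.nn as nn

def conv_block3(in_ch, out_ch):
    """Three 3x3 convolutions with ReLU, matching the block layout in the text."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class RefinementDecoder(nn.Module):
    """Sketch of the Refinement U-Net decoder described in step 5."""
    def __init__(self, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.dec = nn.ModuleList(
            [conv_block3(widths[i + 1], widths[i]) for i in range(4)])
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.head = nn.Conv2d(widths[0], 1, 3, padding=1)  # reduce to 1 channel

    def forward(self, deepest_feat):
        d = deepest_feat                    # refinement feature of the deepest layer
        for i in range(3, -1, -1):          # decode from deep to shallow
            d = self.dec[i](self.up(d))
        return torch.sigmoid(self.head(d))  # final segmentation result M
```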
Examples
In order to verify the effectiveness of the inventive scheme, the following simulation experiments were performed.
The interactive object segmentation model W-Net based on bounding box input and gaze point assistance is implemented as follows:
Step 1: given the target image I of size 3×320×320 in Fig. 2, the bounding box of the object to be segmented in the image is acquired and converted into a binary bounding box map B of size 1×320×320. At the same time, the gaze point map of the target image is acquired and processed.
Step 1-1: label the object to be segmented in the image I by drawing a rectangular box, and record the lower-left corner coordinate (x_min, y_min) and the upper-right corner coordinate (x_max, y_max) of the box.
Step 1-2: compute the bounding box map B by setting the pixels inside the box (x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max) to 1 and all other pixels to 0.
Step 1-3: select a trained gaze point prediction model, input the image I, and obtain an estimated gaze point map FM of size 1×320×320 by inference.
Step 1-4: set all pixels of the estimated gaze point map FM that lie outside the input box to 0 through pixel-wise multiplication with B, so as to erase the gaze point information outside the input box and obtain the processed gaze point map.
Step 2: generate the initial segmentation result. The image I and the bounding box map B are input into the initial segmentation network Coarse U-Net to generate the coarse segmentation result and box-based multi-scale features.
Step 2-1: concatenate the image I with the bounding box map B along the channel dimension to obtain an input tensor Input of size 4×320×320.
Step 2-2: input Input into the encoder of Coarse U-Net to extract features. Each encoder layer applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the max-pooled output of the preceding layer, finally yielding five features of different scales.
Step 2-3: input the encoder features into the decoder for decoding. Each decoder layer applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the concatenation of the upsampled output of the preceding decoder layer and the corresponding encoder feature; finally, the output of the last decoder layer is reduced to a single channel by a 3×3 convolution layer, and the preliminary segmentation result MC is obtained after a Sigmoid operation.
Step 3: adjust the gaze point map. The initial segmentation result MC and the processed gaze point map are input into the gaze point map adjustment module, which computes their similarity and adjusts the gaze point map accordingly. The specific steps are as follows:
Step 3-1: compute the similarity α between the initial segmentation result MC and the processed gaze point map as their intersection-over-union, using pixel-wise multiplication and summation.
Step 3-2: globally adjust the processed gaze point map by multiplying each of its pixel values by α, obtaining the adjusted gaze point map FM'.
Step 4: concatenate the image, the adjusted gaze point map, and the coarse segmentation result along the channel dimension, input the result to the refinement segmentation network Refinement U-Net, extract refinement features, and fuse the box-based multi-scale features layer by layer through the cross skip connection module. Specifically:
Step 4-1: concatenate the original image, the adjusted gaze point map, and the coarse segmentation result along the channel dimension to obtain an input tensor Input2 of size 5×320×320.
Step 4-2: input Input2 into the encoder of Refinement U-Net to extract features layer by layer, and fuse the features of the corresponding Coarse U-Net layers through the cross skip connection module. Each convolution block extracting features consists of three 3×3 convolution layers with ReLU activations, and the connected features at each layer are processed by two criss-cross attention operations (Criss-Cross Attention, denoted Φ_r) to give the refinement feature of that layer.
Step 5: input the refinement features into the decoder of Refinement U-Net for decoding. Each decoder layer applies a convolution block, consisting of three 3×3 convolution layers with ReLU activations, to the upsampled features from the preceding layer; finally, the output of the last decoder layer is reduced to a single channel by a 3×3 convolution layer, and the final segmentation result M is obtained through a Sigmoid operation.
In summary, the present invention exploits the complementarity between the input box and the estimated gaze point map of an object to improve input-box-based interactive object segmentation. The proposed W-Net framework ensures that segmentation is guided mainly by features extracted from the input box and assisted by auxiliary information extracted from the estimated gaze point map, improving segmentation accuracy.
1. Experimental setup
1.1. Data set
In the training phase, the W-Net network is trained on the augmented Pascal VOC dataset, which provides 10582 images for training and 1449 for validation; each object in an image is treated as a training sample. In the test phase, the performance of the proposed method is evaluated on three popular interactive object segmentation benchmarks: GrabCut, Berkeley, and DAVIS. The GrabCut dataset contains 50 images used to evaluate interactive segmentation models. The Berkeley dataset contains 96 images with a total of 100 target object masks for testing. The DAVIS dataset contains 50 videos, evaluated on 3440 frames containing the target objects. The GrabCut dataset provides its own input boxes; for the Berkeley and DAVIS datasets, the ground-truth bounding box of each object is used as the input box.
1.2. Implementation details
All input tensors are sized 320×320. First, Coarse U-Net is pre-trained for 50 epochs; its parameters are then frozen, and Refinement U-Net is trained for a further 50 epochs. Both networks are trained with binary cross-entropy loss. The learning rate is set to 10^-5. The batch size is 8 during Coarse U-Net training and 2 during Refinement U-Net training. In all experiments, the Adam optimizer with β1 = 0.9 and β2 = 0.999 is used. The method is implemented in the PyTorch framework, and training and testing are performed on an NVIDIA RTX 3080 Ti GPU.
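A schematic sketch of this two-stage schedule, assuming the CoarseUNet sketched earlier (returning the coarse mask and multi-scale features) and a RefinementUNet with a hypothetical forward signature; the dataloader names and what they yield are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train_two_stage(coarse_net, refine_net, coarse_loader, refine_loader, device="cuda"):
    """Pre-train Coarse U-Net for 50 epochs, freeze it, then train
    Refinement U-Net for 50 epochs, both with BCE loss (as described above)."""
    bce = nn.BCELoss()

    opt_c = torch.optim.Adam(coarse_net.parameters(), lr=1e-5, betas=(0.9, 0.999))
    for _ in range(50):                           # stage 1; batch size 8 assumed in loader
        for img_box, gt in coarse_loader:         # image + box map, ground-truth mask
            mc, _ = coarse_net(img_box.to(device))
            loss = bce(mc, gt.to(device))
            opt_c.zero_grad(); loss.backward(); opt_c.step()

    for p in coarse_net.parameters():             # freeze Coarse U-Net parameters
        p.requires_grad = False

    opt_r = torch.optim.Adam(refine_net.parameters(), lr=1e-5, betas=(0.9, 0.999))
    for _ in range(50):                           # stage 2; batch size 2 assumed in loader
        for input2, coarse_feats, gt in refine_loader:
            coarse_feats = [f.to(device) for f in coarse_feats]
            m = refine_net(input2.to(device), coarse_feats)  # hypothetical signature
            loss = bce(m, gt.to(device))
            opt_r.zero_grad(); loss.backward(); opt_r.step()
```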
1.3. Evaluation index
The error rate E is used as the evaluation index, i.e., the percentage of misclassified pixels within the input box. Here P and G denote the predicted segmentation mask and the ground-truth mask, respectively, S denotes the area of the input box, and |·| denotes pixel-wise summation; E is the number of misclassified pixels inside the box divided by S, expressed as a percentage. The error rate accounts not only for the misclassified pixels but also for the size of the input box: the same number of misclassified pixels yields a lower error rate for a loose input box than for a tight one, which is appropriate because a loose input box generally makes the segmentation task harder. A lower average error rate over a dataset indicates better overall performance of the segmentation method.
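A small sketch of this metric; since the formula image is not reproduced in the text, the implementation below assumes E = (pixels inside the box where P and G disagree) / S × 100 for binary masks:

```python
import numpy as np

def error_rate(pred, gt, box):
    """Error rate E: percentage of misclassified pixels within the input box.

    pred, gt: binary masks (H x W); box: binary input-box map (H x W).
    """
    inside = box.astype(bool)                                    # region S
    wrong = np.logical_xor(pred.astype(bool), gt.astype(bool)) & inside
    return 100.0 * wrong.sum() / inside.sum()
```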
2. Ablation experiments
The core idea of the method is to use the estimated gaze point map to assist input-box-based interactive object segmentation. Therefore, an ablation study was first conducted to investigate the effectiveness of the estimated gaze point map and of the various components in the proposed W-Net framework. The training procedure for the ablation experiments was the same as in the experimental setup above. Results were evaluated on the GrabCut, Berkeley, and DAVIS datasets.
2.1 Gaze point map analysis
For the estimated gaze point map, the sensitivity of the W-Net model to different gaze point prediction models needs to be evaluated on the one hand, and on the other hand the effectiveness of the estimated gaze point map should be studied by comparing segmentation performance with and without the gaze point map input. The specific procedure and results are as follows.
2.1.1 Sensitivity of the framework to gaze point map prediction models
To evaluate the sensitivity of the W-Net model to different gaze point prediction models, two additional models, TempSAL and RINet, were selected besides the TranSalNet model used in the present method. Each model generates its own gaze point maps, and W-Net is retrained and tested accordingly. The test results are shown in Table 1.
Table 1. Error rate (%) of W-Net with different gaze point prediction models.
With TranSalNet, W-Net achieves the best performance on all datasets, while RINet provides slightly better assistance than TempSAL. Their behavior reflects the influence of the gaze point prediction model in different situations. The three sets of results differ only slightly, indicating that the W-Net model is not very sensitive to the particular gaze point prediction model, although a higher-quality gaze point prediction model still yields better refinement. The TranSalNet gaze point prediction model is used in all subsequent experiments.
2.1.2 Effectiveness of the gaze point map
To verify the contribution of the estimated gaze point map to refinement, the estimated gaze point map is excluded from Refinement U-Net while the other components of the W-Net model are kept unchanged, i.e., the input box is the only interactive input.
Table 2. Error rate comparison with and without gaze point map input.
As shown in Table 2, the model without the assistance of the estimated gaze point map exhibits a higher error rate. After the estimated gaze point map is added, segmentation performance improves noticeably and the error rate decreases. This shows that segmentation quality can be improved if the estimated gaze point map is used appropriately in Refinement U-Net.
In addition, some segmentation results are shown in Fig. 3 to illustrate the assistance provided by the estimated gaze point map. The estimated gaze point map helps the model not only recover missed object areas (e.g., the petals in the first row and the person's arms in the second row) but also eliminate redundant, erroneously segmented areas (e.g., the trunks in the third row and the roofs in the fourth row).
2.2 Model structural analysis
In this section, ablation experiments are performed to verify the effectiveness of the components in the proposed framework, including Refinement U-Net, the gaze point map adjustment module, and the cross skip connection module with the self-attention mechanism.
2.2.1 Effectiveness of Refinement U-Net
To demonstrate that Refinement U-Net improves the coarse segmentation results of Coarse U-Net, two comparative experiments were performed. First, a plain Coarse U-Net, whose input consists of the image and the bounding box map, was retrained and taken as the baseline. In the second experiment, the estimated gaze point map was added as an extra channel to the Coarse U-Net input, and this single U-Net was retrained; this variant is denoted Coarse U-Net+FM.
Table 3. Error rate (%) comparison for different combinations of the gaze point map, Coarse U-Net, and Refinement U-Net.
The experimental results in Table 3 show that the W-Net model using Refinement U-Net performs best. Meanwhile, the segmentation results of Coarse U-Net+FM are even worse than the baseline (Coarse U-Net). This means that simple integration of the estimated gaze point map, such as concatenation at the input, does not improve segmentation quality; the discrepancy problem caused by the estimated gaze point map may even reduce it. Because Refinement U-Net can constrain the estimated gaze point map, its auxiliary effect is fully exploited while its adverse effects are suppressed as much as possible. Thus, compared with the baseline, Refinement U-Net refines the coarse segmentation results.
The role of Refinement U-Net in refining the coarse segmentation results is also illustrated by an example. In Fig. 4, if an estimated gaze point map with partial discrepancies is simply input into Coarse U-Net, the segmentation result becomes worse (Coarse U-Net+FM). In contrast, Refinement U-Net significantly improves the segmentation result by imposing constraints on the gaze point map.
2.2.2 Effectiveness of the gaze point map adjustment module
In the W-Net framework, the gaze point map adjustment module explicitly adjusts the estimated gaze point map, alleviating its adverse effects to some extent. To verify its effectiveness, a comparison is made with the gaze point map adjustment module removed, i.e., the estimated gaze point map restricted to the input box is directly concatenated with the image and the coarse segmentation result of Coarse U-Net as the input to Refinement U-Net.
Table 4. Error rate (%) comparison of the model with and without the gaze point map adjustment module.
As shown in Table 4, explicit adjustment of the gaze point map is a necessary step for ensuring segmentation quality. Specifically, the correlation between the coarse result output by Coarse U-Net and the estimated gaze point map is a reliable indicator for controlling the adverse effects of the estimated gaze point map.
2.2.3 Effectiveness of the cross skip connection module
In addition to the gaze point map adjustment module, a cross skip connection module with a self-attention mechanism was developed in the Refinement U-Net encoder to strengthen the dominance of the features extracted from the input box. To demonstrate the effectiveness of establishing cross skip connections between the corresponding layers of Coarse U-Net and Refinement U-Net, a comparative experiment was performed in which the cross skip connections with the self-attention mechanism were removed from the W-Net framework (i.e., the two U-Nets run independently).
Table 5. Error rate (%) comparison of the model with and without the cross skip connection module.
As shown in Table 5, after the cross skip connections are added, W-Net strengthens the dominant role of the features extracted from the input box and fuses these complementary features together, further alleviating the discrepancy problem in feature extraction and producing finer segmentation results. Tables 4 and 5 show that the gaze point map adjustment module and the cross skip connection module are both important and indispensable for Refinement U-Net to improve segmentation quality.
3. Comparative experiments
In addition to the ablation studies, comparisons were made with several state-of-the-art methods, namely DGC, SAM, and IOG, on the GrabCut, Berkeley, and DAVIS datasets. Meanwhile, W-Net is used as an optimization tool to refine SAM and IOG: within W-Net, the coarse segmentation result fed to Refinement U-Net is directly replaced by the segmentation result of SAM or IOG, so that SAM, IOG, and similar methods can be optimized. These two combinations are denoted SAM+W-Net and IOG+W-Net, respectively. In addition, the center of each original input box is kept fixed while its size is enlarged by different ratios (1.1, 1.2, 1.3, 1.4). All methods were tested on these loose input boxes to evaluate their ability to handle different input box conditions. Since SAM, as a generic segmentation model, only considers objects of appropriate size within a small range of the input box during training, it performs very poorly when the input box is loose or the object is too large; therefore, only the default output performance of SAM is reported for the cases it can handle. IOG requires one additional point inside the object area besides the input box; in the tests, this point is derived from the ground-truth segmentation result, as determined by its source code.
Table 6. Error rate (%) of the different segmentation methods on the GrabCut, Berkeley, and DAVIS datasets. The best results are shown in bold.
The comparison results are shown in Table 6. In the single-method comparison, W-Net achieves better performance on the GrabCut dataset, while SAM performs better than the other methods on the Berkeley and DAVIS datasets when the input box is tight. This reflects the strong capability of SAM as a generic segmentation model.
Some segmentation examples are shown in Fig. 5. With the aid of the gaze point map, W-Net can cover areas that other models miss, such as the hammer in the second column and the human bodies in the third and last columns, while also excluding non-object areas such as the trunks in the first column.
In the case of loose input boxes, W-Net performs worse than IOG. The main reason is that a loose input box degrades the performance of Coarse U-Net and worsens the adverse effect of the estimated gaze point map, so the coarse result cannot be refined well by Refinement U-Net. For IOG, the extra point that explicitly indicates the object region plays an important role in mitigating the adverse effects of a loose input box. However, as shown in Fig. 6, which compares the segmentation of W-Net and IOG under input boxes of different scales, W-Net still shows its advantages, and in some cases obtains stable segmentation results even as the input box size increases.
When W-Net is used as a segmentation optimization tool, the error rate of IOG is further reduced, and IOG+W-Net achieves the best performance in all cases. This shows that W-Net adds gaze point assistance on top of IOG and is an effective segmentation optimization tool. At the same time, SAM+W-Net reduces SAM's performance in some cases with tighter input boxes. The reason for these failure cases is that, although SAM achieves better overall quality, its variance in segmentation quality is larger than IOG's: some SAM segmentation results are of poor quality and cannot be used to constrain the gaze point map, and may even exacerbate its adverse effects, preventing Refinement U-Net from working well. It follows that W-Net also relies on the quality of the coarse segmentation results themselves to extract better features during refinement.
Figs. 7 and 8 illustrate the effects of SAM+W-Net and IOG+W-Net, respectively. In particular, the athlete in the first row and the woman in the last row of Fig. 7 show that the present method can help SAM complete the segmentation of the target object area, while the building in the second row and the railing in the third row show that it can exclude erroneous areas from the SAM segmentation result. Likewise, in Fig. 8, the gaze point distribution provides richer target object information than the limited range of the click used in IOG. Combining IOG with W-Net therefore optimizes the segmentation results.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples merely represent a few embodiments of the present application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and modifications without departing from the spirit of the application, and these all fall within the scope of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.