
Proposal-Refined Weakly Supervised Object Detection in Underwater Images

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11901)


Abstract

Recently, Convolutional Neural Networks (CNNs) have achieved great success in object detection due to their outstanding ability to learn powerful features from large-scale training datasets. One of the critical factors of this success is accurate and complete annotation of the training dataset. However, accurate annotation is difficult and time-consuming in some applications, such as object detection in underwater images, due to severe foreground clustering and occlusion. In this paper, we study the problem of object detection in underwater images with incomplete annotation. To solve this problem, we propose a proposal-refined weakly supervised object detection method that consists of two stages. The first stage is a weakly-fitted segmentation network for foreground-background segmentation. The second stage is a proposal-refined detection network that uses the segmentation results of the first stage to refine proposals and thereby improve detection performance. Experiments are conducted on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which contains 19,967 underwater images of three kinds of objects: sea cucumber, sea urchin and scallop. The annotation of the training set is incomplete. Experimental results show that the proposed method greatly improves detection performance compared to several baseline methods.


1 Introduction

Nowadays, aquaculture, the breeding of marine products, has become one of the most promising avenues for coastal fishermen [13], especially for high-quality marine products on the sea floor such as sea cucumbers, sea urchins and scallops. Underwater operations in traditional aquaculture are mainly carried out by manual labor, which suffers from low efficiency and high risk. Meanwhile, thanks to the development of artificial intelligence and the decrease of manufacturing costs, a huge demand has emerged for underwater fishing robots, which are low-cost and reliable platforms for improving the efficiency of catching marine products. Although underwater robots such as net cleaning robots have been widely used [13], the application of underwater fishing robots is still very challenging due to the difficulty of accurately detecting marine products in a complicated underwater environment.

With the development of Convolutional Neural Networks (CNNs), great improvements have been achieved in object detection on land. Detectors are mainly divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors adopt a region proposal-based strategy, whose pipelines have two stages [3,5,6,7,9,16]: the first stage generates a set of category-independent region proposals, and the second stage classifies them into foreground classes or background. One-stage detectors do not use a separate proposal step, making the overall pipeline single-stage [8,10,12,14,15]. Although some methods that do not rely on region proposals have been proposed, region proposal-based methods possess leading accuracy on benchmark datasets (e.g., PASCAL VOC [4], ILSVRC [19], and Microsoft COCO [11]). Faster R-CNN [16] is one of the most well-known object detection frameworks; it proposed an efficient and accurate Region Proposal Network (RPN) to generate region proposals. Since then, RPN-like proposals have become the standard for two-stage object detectors.

Existing object detectors heavily depend on a large number of accurately annotated images [4,11,19]. Annotating such benchmark datasets costs considerable time and labor. To reduce the cost of obtaining accurate annotation, some weakly supervised and semi-supervised object detection frameworks have been proposed over the past years. At present, weakly supervised detection mainly relies on image-level annotation instead of bounding-box annotation [20,21,23], while semi-supervised object detectors are trained with a small amount of annotated data and a large amount of unannotated data [1,2,17]. Nevertheless, the reduction of annotation cost usually comes at the price of degraded model accuracy. Though many promising ideas have been proposed in weakly supervised and semi-supervised object detection, their accuracy is still far from that of strongly supervised methods.

Unlike land images with common object categories, underwater images suffer from image degradation and color distortion due to the absorption and scattering of light through water. Besides, objects in the underwater environment are usually small and tend to cluster. These factors make annotating underwater objects particularly difficult and time-consuming. Therefore, as shown in Fig. 1, partially missing annotations often occur in underwater image datasets. Under these circumstances, negative examples are generated not only from the background but also from the unannotated foreground, which misguides the training of detectors. As a result, existing strongly and weakly supervised detection algorithms cannot achieve satisfactory results in underwater object detection.

Fig. 1. Example of an underwater image and the corresponding ground truth in URPC2017.

To solve this problem, we propose a proposal-refined weakly supervised object detection method that focuses on training detectors with an incompletely annotated dataset. We observe that there are great differences between foreground and background in underwater images. Inspired by this, we design a weakly-fitted segmentation network that separates the foreground from the background of an image using only the incompletely annotated detection dataset. Then, we use the segmentation map to control the generation of positive and negative examples when training the detection network, which is conducive to the generation of high-quality proposals. The proposed method is not restricted to a specific object detection framework; in fact, it can be incorporated into any advanced one. Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017) and show that the proposed method greatly improves the accuracy of object detection compared to several baseline methods.

2 The Proposed Method

2.1 Overview

In order to reduce the influence of missing annotations in the training images, we design a weakly-fitted segmentation network to separate the foreground from the background, and then utilize the segmentation results to guide the generation of positive and negative examples when training the detector. Figure 2 shows the overview of the proposed architecture. It consists of two stages: the first stage is a weakly-fitted segmentation network and the second stage is a proposal-refined object detection network. The details of each part of our model are introduced in the following sections.

Fig. 2. The architecture of the proposed object detection network.

2.2 Weakly-Fitted Segmentation

To segment the foreground and background of an underwater image, we utilize the idea of U-Net [18], which consists of a contracting path to capture context information and an expanding path to guarantee the accuracy of localization. A traditional well-trained U-Net cannot accurately separate the foreground from the background in our underwater images because there are many unannotated foreground areas in the training dataset. To address this problem, we propose two modifications. (1) As shown in Fig. 3, we design a lightweight U-Net with reduced capacity to fit the training dataset. More specifically, we use 7 convolutional layers for downsampling and 6 deconvolutional layers for upsampling, all of which are \(3\times 3\) convolutions with stride 2, and we do not double or halve the number of feature channels at each downsampling or upsampling step. This asymmetric design reduces the degree to which the model fits the incompletely annotated training dataset. Afterwards, the image size is restored by bilinear interpolation. (2) To segment as much of the foreground as possible, the network is trained with a modified MSE loss, denoted as

$$\begin{aligned} L(y,y^*) = \frac{1}{N}\sum _{i=1}^{N}(y_i^* -y_i)^2+\lambda \frac{1}{N}\sum _{i=1}^{N}y_i^*(y_i^* -y_i) \end{aligned}$$
(1)

where \(y\) is the output image of the weakly-fitted segmentation network and \(y^*\) is the ground-truth image generated from the bounding-box areas of the underwater object detection dataset. \(i\) is the index of a pixel in the image and \(N\) is the number of pixels in the image. The value of \(y_i^*\) equals 0 if pixel \(i\) belongs to the background and 1 if it belongs to the foreground. Because of the factor \(y_i^*\), the term \(y_i^*(y_i^* -y_i)\) is activated only for foreground pixels, so it enlarges the loss when the foreground is taken as background, a mistake encouraged by the confusion of the incompletely annotated dataset. The two terms are normalized by \(N\) and weighted by a balancing parameter \(\lambda \).
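To make the loss concrete, the following is a minimal PyTorch sketch of Eq. (1); the function name and tensor shapes are our own assumptions, not from the paper.

```python
import torch

def weakly_fitted_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                       lam: float = 2.0) -> torch.Tensor:
    """Modified MSE loss of Eq. (1).

    y_pred: network output in [0, 1], e.g. shape (B, 1, H, W)
    y_true: mask with 1 = annotated foreground, 0 = background
    lam:    balancing parameter lambda (set to 2 in Sect. 3.2)
    """
    mse = ((y_true - y_pred) ** 2).mean()
    # y_true * (y_true - y_pred) is non-zero only on foreground pixels,
    # so this term enlarges the loss when foreground is predicted as
    # background (the error that incomplete annotation encourages).
    fg_penalty = (y_true * (y_true - y_pred)).mean()
    return mse + lam * fg_penalty
```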

Fig. 3. The architecture of the weakly-fitted segmentation network.
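As a minimal PyTorch sketch of the architecture in Fig. 3: the 7-down/6-up layer counts, the \(3\times 3\) stride-2 convolutions, the fixed channel width and the final bilinear upsampling follow the text above, while the channel width of 64, the ReLU activations, the additive skip connections and the assumption of input sides divisible by 128 are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightUNet(nn.Module):
    """Weakly-fitted segmentation network (sketch): 7 stride-2 3x3
    convolutions down, 6 stride-2 3x3 deconvolutions up, with a fixed
    channel width (no doubling or halving)."""

    def __init__(self, in_ch: int = 3, width: int = 64):
        super().__init__()
        downs, ch = [], in_ch
        for _ in range(7):
            downs.append(nn.Conv2d(ch, width, 3, stride=2, padding=1))
            ch = width
        self.down = nn.ModuleList(downs)
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(width, width, 3, stride=2,
                               padding=1, output_padding=1)
            for _ in range(6)
        )
        self.head = nn.Conv2d(width, 1, 1)  # per-pixel foreground score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        feats = []
        for conv in self.down:
            x = F.relu(conv(x))
            feats.append(x)
        for i, deconv in enumerate(self.up):
            x = F.relu(deconv(x))
            x = x + feats[-(i + 2)]  # U-Net-style skip (assumed additive)
        x = torch.sigmoid(self.head(x))
        # the asymmetric 7-down/6-up depth leaves the map at half
        # resolution; bilinear interpolation restores the input size
        return F.interpolate(x, size=size, mode="bilinear",
                             align_corners=False)
```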

2.3 Proposal-Refined Object Detection

The quality of the proposals has a great influence on the performance of object detection. Therefore, various studies have focused on region proposal generation [22,24]. Among them, the Region Proposal Network (RPN) proposed in Faster R-CNN [16] is the most influential method in recent years. Accordingly, we build our strategy on the Faster R-CNN framework in this paper.

The architecture of Faster R-CNN can be divided into two parts: the Region Proposal Network (RPN) and the region-of-interest (ROI) classifier. When training the RPN, the traditional method assigns a negative label to an anchor if its Intersection-over-Union (IoU) ratio is lower than 0.3 for all ground-truth boxes. However, as shown in Fig. 4, many false negative examples that contain unlabeled objects are generated from an incompletely annotated dataset, which directly harms the learning of the RPN. To address this problem, we feed the segmentation map generated in the first stage to the RPN and the ROI classifier as an additional input. When the RPN and the ROI classifier assign a negative label to an anchor, they refer not only to the IoU with the ground truth but also to the segmentation map.

Fig. 4. An example of true and false negative examples.

The specific steps are as follows. (1) First, the foreground of an underwater image is obtained by the weakly-fitted segmentation network and denoted as \(S_1\). Then, we subtract the ground-truth boxes from \(S_1\) to obtain the unlabeled foreground region \(S_2\). (2) For training the RPN, positives are labeled with the same strategy as the traditional one. However, assigning a negative label to an anchor now needs to satisfy two conditions: (i) its IoU ratio is lower than 0.3 for all ground-truth boxes; (ii) its IoU ratio is lower than or equal to \(\beta \) for \(S_2\). Similarly, the generation of positive and negative examples is constrained by both the ground truth and the segmentation map during the training of the ROI classifier. A code sketch of this labeling rule is given below.
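The following NumPy sketch illustrates the refined labeling rule for RPN anchors under our reading of the two conditions; representing \(S_2\) as a set of boxes and the helper names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def iou_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2)."""
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def label_anchors(anchors, gt_boxes, s2_boxes, beta=0.3):
    """Refined RPN labeling (sketch). s2_boxes approximates the
    unlabeled foreground S2 as boxes; assumes at least one gt box.
    Returns labels: 1 = positive, 0 = negative, -1 = ignored."""
    labels = np.full(len(anchors), -1, dtype=np.int64)

    iou_gt = iou_matrix(anchors, gt_boxes)        # (A, G)
    max_iou_gt = iou_gt.max(axis=1)

    # positives: the standard Faster R-CNN rule, unchanged
    labels[max_iou_gt >= 0.7] = 1
    labels[iou_gt.argmax(axis=0)] = 1             # best anchor per gt box

    # negatives must satisfy BOTH conditions:
    # (i)  IoU < 0.3 with every ground-truth box, and
    # (ii) IoU <= beta with the unlabeled foreground S2, so anchors
    #      covering unannotated objects are not made false negatives
    if len(s2_boxes):
        max_iou_s2 = iou_matrix(anchors, s2_boxes).max(axis=1)
    else:
        max_iou_s2 = np.zeros(len(anchors))
    neg = (max_iou_gt < 0.3) & (max_iou_s2 <= beta) & (labels != 1)
    labels[neg] = 0
    return labels
```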

By controlling the generation of negative examples in this way, we eliminate false negative examples and thus provide more accurate positive and negative examples for training the object detection network to generate high-quality proposals. Following [16], classification and bounding-box regression losses are computed for both the RPN and the ROI classifier:

$$\begin{aligned} L_{total} = L_{cls}^{rpn}+c^*L_{reg}^{rpn}+L_{cls}^{roi}+p^*L_{reg}^{roi} \end{aligned}$$
(2)

where \(L_{cls}\) is the cross-entropy loss for classification and \(L_{reg}\) is the smooth L1 loss defined in [5] for regression; \(c^*L_{reg}^{rpn}\) and \(p^*L_{reg}^{roi}\) denote the regression losses activated only for positive anchors and non-background proposals, respectively. It is worth mentioning that although the proposed method is implemented on Faster R-CNN, it is applicable to other region proposal-based methods such as R-FCN [3], FPN [9] and Mask R-CNN [7].
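As an illustration, here is a minimal PyTorch sketch of the combined loss in Eq. (2); the tensor layout and the 1/0/-1 anchor-label convention are assumptions in line with common Faster R-CNN implementations.

```python
import torch
import torch.nn.functional as F

def detection_loss(rpn_logits, rpn_labels, rpn_deltas, rpn_targets,
                   roi_logits, roi_labels, roi_deltas, roi_targets):
    """Total loss of Eq. (2). rpn_labels uses 1/0/-1 for
    positive/negative/ignored anchors; roi_labels holds class
    indices with 0 = background. Shapes are assumptions."""
    keep = rpn_labels >= 0                      # drop ignored anchors
    l_cls_rpn = F.cross_entropy(rpn_logits[keep], rpn_labels[keep])
    pos = rpn_labels == 1                       # c*: positive anchors only
    l_reg_rpn = F.smooth_l1_loss(rpn_deltas[pos], rpn_targets[pos])
    l_cls_roi = F.cross_entropy(roi_logits, roi_labels)
    fg = roi_labels > 0                         # p*: non-background proposals
    l_reg_roi = F.smooth_l1_loss(roi_deltas[fg], roi_targets[fg])
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```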

3 Experiment

3.1 Dataset and Metric

Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which contains 3 object categories (sea cucumber, sea urchin and scallop) in a total of 19,967 underwater images. The dataset is divided into train, val and test sets with 17,655, 1,317 and 985 images, respectively. In the dataset, the amount of completely annotated data is smaller than that of incompletely annotated data. We train our segmentation and detection networks on the trainval set, which contains both completely and incompletely annotated images; the test set consists of accurately and completely annotated images. The dataset used to train the weakly-fitted segmentation network is generated from the bounding-box areas (see Fig. 5). Object detection accuracy is measured by mean Average Precision (mAP).

Fig. 5. The generation of the segmentation dataset.

3.2 Implementation Details

For the training of the weakly-fitted segmentation network, we use a learning rate of 0.0001 for 70k iterations and set \(\lambda \) = 2, which makes the two terms in Eq. 1 roughly equally weighted after normalization. For the training of the proposal-refined object detection network, we use Faster R-CNN as our baseline detection framework. VGG16 pre-trained on ImageNet is used as the backbone for feature extraction due to the small scale of the dataset. The initial learning rate is set to 0.0002 for the first 50k iterations and then decreased to 0.00002 for the following 20k iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively. Other hyper-parameters are identical to those defined in [16]. These settings are summarized in the sketch below.
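For reference, the hyper-parameters above can be collected in a small configuration sketch; the names and structure are ours, and the solver/dataloader wiring is omitted.

```python
# Training settings from Sect. 3.2, gathered into plain dicts.
SEGMENTATION_CFG = dict(
    lr=1e-4,                # fixed learning rate
    iterations=70_000,
    loss_lambda=2.0,        # lambda in Eq. (1)
)
DETECTION_CFG = dict(
    backbone="VGG16",       # ImageNet pre-trained
    lr_schedule=[(50_000, 2e-4), (20_000, 2e-5)],  # (iterations, lr)
    momentum=0.9,
    weight_decay=5e-4,
    beta=0.3,               # IoU threshold against S2 (chosen in Sect. 3.3)
)
```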

3.3 Experimental Results

The Influence of IoU Threshold. We explore the influence of the IoU threshold \(\beta \) of the segmentation map on the detector. \(\beta \) = 1 is the baseline result of the original Faster R-CNN, in which the generation of negative examples is not constrained by the segmentation map. As shown in Table 1, \(\beta \) = 0.3 outperforms the other choices and is \(12.1\%\) better than the baseline. This indicates that allowing negative examples to contain a small part of an object is beneficial to detection performance. When \(\beta \) = 0, the detector is trained on a large number of easily classified background examples, which does not help improve detection accuracy. Consequently, we choose \(\beta \) = 0.3 for the following experiments.

Table 1. Comparison results with different IoU thresholds of the segmentation map.

The Results of Weakly-Fitted Segmentation. Figure 6 shows the qualitative results of weakly-fitted segmentation: (a) is the input image, (b) is the segmentation result of U-Net, and (c) is the result of the proposed network. Under the same experimental setting, U-Net clearly cannot completely separate the foreground from the background, because the unannotated foreground areas impair its ability to distinguish them. In contrast, the proposed weakly-fitted segmentation network can segment the foreground and background of an underwater image, including the regions left unannotated in the object detection dataset, because it reduces the degree to which the model fits the training data and increases the penalty for regarding foreground as background.

Fig. 6. Qualitative segmentation results on the URPC2017 dataset.

The Results of Proposal-Refined Object Detection. To show how Faster R-CNN and the proposal-refined detector improve during learning, we plot the mAP of the two detectors over training iterations. As shown in Fig. 7, both detectors improve in the early stage of training, but the proposal-refined detector always has a higher mAP than Faster R-CNN, suggesting the effectiveness of the proposal-refined object detection network. Figure 8 shows the qualitative results of the Faster R-CNN baseline (top) compared with the proposal-refined detector (bottom). It can be seen that the proposal-refined detector detects more objects than the baseline framework, especially small and challenging objects.

Fig. 7. The mAP of Faster R-CNN and the proposal-refined detector on the URPC2017 dataset over the course of training.

Fig. 8. Qualitative detection results on URPC2017. Top: results of the Faster R-CNN baseline model. Bottom: results of the proposal-refined detector.

Comparisons with the State-of-the-Art. In this section, we present experimental results of our proposed method applied to other outstanding object detection networks: R-FCN [3], FPN [9] and Mask R-CNN [7]. As shown in Table 2, our method improves the mAP of the original object detectors by about 10%, indicating the effectiveness and robustness of proposal-refined weakly supervised object detection. By eliminating false negative examples, the proposed method counteracts the accuracy decrease caused by incompletely annotated datasets.

Table 2. Comparison results of different methods.

4 Conclusion

In this paper, we propose a simple but effective framework for object detection in underwater images with an incompletely annotated dataset. Our proposal-refined weakly supervised object detection system is composed of two stages: the first stage is a weakly-fitted segmentation network that separates the foreground from the background, and the second stage is a proposal-refined object detector that uses the segmentation map to generate high-quality proposals. Experiments show that the proposed method greatly improves detection performance compared to several baseline methods. Through our method, we can not only reduce the cost of dataset annotation but also offset the accuracy decrease caused by missing annotations. In addition, the idea of the proposed method applies not only to underwater object detection but also to other detection tasks with incomplete annotation.

References

  1. Chen, G., Liu, L., Hu, W., Pan, Z.: Semi-supervised object detection in remote sensing images using generative adversarial networks. In: 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 2503–2506 (2018)

  2. Choi, M.K., et al.: Co-occurrence matrix analysis-based semi-supervised training for object detection. In: 2018 25th IEEE International Conference on Image Processing, pp. 1333–1337 (2018)

  3. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, pp. 379–387 (2016)

  4. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)

  5. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)

  6. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)

  7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)

  8. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 765–781. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_45

  9. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 936–944 (2017)

  10. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2018). https://doi.org/10.1109/TPAMI.2018.2858826

  11. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  12. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2

  13. Naylor, R., Burke, M.: Aquaculture and ocean resources: raising tigers of the sea. Annu. Rev. Environ. Resour. 30, 185–218 (2005)

  14. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  15. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6517–6525 (2017)

  16. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)

  17. Rhee, P.K., Erdenee, E., Kyun, S.D., Ahmed, M.U., Jin, S.: Active and semi-supervised learning for object detection with imperfect data. Cogn. Syst. Res. 45, 109–123 (2017)

  18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

  19. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)

  20. Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2843–2851 (2017)

  21. Tang, P., et al.: Weakly supervised region proposal network and object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 370–386. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_22

  22. Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)

  23. Zhang, X., Feng, J., Xiong, H., Tian, Q.: Zigzag learning for weakly supervised object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4262–4270 (2018)

  24. Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_26


Author information

Authors and Affiliations

  1. School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China

    Xiaoqian Lv, An Wang, Qinglin Liu, Jiamin Sun & Shengping Zhang


Corresponding author

Correspondence to Shengping Zhang.

Editor information

Editors and Affiliations

  1. Beijing Jiaotong University, Beijing, China

    Yao Zhao

  2. The Australian National University, Canberra, Australia

    Nick Barnes

  3. Peking University, Beijing, China

    Baoquan Chen

  4. The Technical University of Munich, Munich, Bayern, Germany

    Rüdiger Westermann

  5. Zhejiang University, Hangzhou, China

    Xiangwei Kong

  6. Beijing Jiaotong University, Beijing, China

    Chunyu Lin

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Lv, X., Wang, A., Liu, Q., Sun, J., Zhang, S. (2019). Proposal-Refined Weakly Supervised Object Detection in Underwater Images. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science, vol. 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_34
