Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11901)
Abstract
Recently, Convolutional Neural Networks (CNNs) have achieved great success in object detection due to their outstanding ability to learn powerful features from large-scale training datasets. One of the critical factors of this success is accurate and complete annotation of the training dataset. However, accurately annotating the training dataset is difficult and time-consuming in some applications, such as object detection in underwater images, due to severe foreground clustering and occlusion. In this paper, we study the problem of object detection in underwater images with incomplete annotation. To solve this problem, we propose a proposal-refined weakly supervised object detection method that consists of two stages. The first stage is a weakly-fitted segmentation network for foreground-background segmentation. The second stage is a proposal-refined detection network that uses the segmentation results of the first stage to refine the proposals and thereby improve detection performance. Experiments are conducted on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which has 19967 underwater images containing three kinds of objects: sea cucumber, sea urchin and scallop. The annotation of the training set is incomplete. Experimental results show that the proposed method greatly improves detection performance compared to several baseline methods.
1 Introduction
Nowadays, aquaculture has become one of the most promising livelihoods for coastal fishermen through the breeding of marine products [13], especially high-quality marine products on the sea floor such as sea cucumbers, sea urchins and scallops. Underwater operations in traditional aquaculture are mainly carried out by manual labor, which is inefficient and risky. Meanwhile, with the development of artificial intelligence and the decrease of manufacturing costs, a huge demand has emerged for underwater fishing robots, which are low-cost, reliable and affordable platforms for improving the efficiency of harvesting marine products. Although underwater robots such as net cleaning robots have been widely used [13], the application of underwater fishing robots is still very challenging due to the difficulty of accurately detecting marine products in a complicated underwater environment.
With the development of Convolutional Neural Networks (CNNs), great improvements have been achieved in object detection on land. Existing detectors are mainly divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors adopt a region proposal-based strategy, whose pipelines have two stages [3,5,6,7,9,16]: the first stage generates a set of category-independent region proposals, and the second stage classifies them into foreground classes or background. One-stage detectors have no separate proposal-generation step, making the overall pipeline a single stage [8,10,12,14,15]. Although some methods that do not rely on region proposals have been proposed, region proposal-based methods possess leading accuracy on benchmark datasets (e.g., PASCAL VOC [4], ILSVRC [19] and Microsoft COCO [11]). Faster R-CNN [16] is one of the most well-known object detection frameworks; it proposed an efficient and accurate Region Proposal Network (RPN) to generate region proposals. Since then, RPN-like proposals have become the standard for two-stage object detectors.
Existing object detectors heavily depend on a large number of accurately annotated images [4,11,19]. Annotating such benchmark datasets costs substantial time and labor. To reduce the cost of obtaining accurate annotation, some weakly supervised and semi-supervised object detection frameworks have been proposed over the past years. At present, weakly supervised detection mainly relies on image-level annotation instead of bounding-box annotation [20,21,23], while semi-supervised object detectors are trained using a small amount of annotated data and massive unannotated data [1,2,17]. Nevertheless, the reduction of annotation cost usually comes at the cost of degraded model accuracy. Though many promising ideas have been proposed in weakly supervised and semi-supervised object detection, they are still far from comparable to strongly supervised methods.
Unlike land images with common object categories, underwater images suffer from image degradation and color distortion due to the absorption and scattering of light in water. Besides, objects in the underwater environment are usually small and tend to cluster. These factors make annotating underwater objects particularly difficult and time-consuming. Therefore, as shown in Fig. 1, partially missing annotations often occur in underwater image datasets. Under these circumstances, negative examples are generated not only from the background but also from the unannotated foreground, which misguides the training of detectors. As a result, existing strongly and weakly supervised detection algorithms cannot achieve satisfactory results in underwater object detection.
To solve this problem, we propose a proposal-refined weakly supervised object detection method that focuses on training detectors with incompletely annotated datasets. We observe that there are great differences between foreground and background in underwater images. Inspired by this, we design a weakly-fitted segmentation network to segment the foreground and background of an image using only the incompletely annotated detection dataset. Then, we use the segmentation map to control the generation of positive and negative examples when training the detection network, which is conducive to the generation of high-quality proposals. The proposed method is not restricted to a specific object detection framework; in fact, it can be incorporated into any advanced one. Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017) and show that the proposed method greatly improves the accuracy of object detection compared to several baseline methods.
2 The Proposed Method
2.1 Overview
To reduce the influence of missing annotations in the training images, we design a weakly-fitted segmentation network to separate the foreground from the background, and then utilize the results of the segmentation network to guide the generation of positive and negative examples when training the detector. Figure 2 shows an overview of the proposed architecture. It consists of two stages: the first stage is a weakly-fitted segmentation network and the second stage is a proposal-refined object detection network. The details of each part of our model are introduced in the following sections.
2.2 Weakly-Fitted Segmentation
To segment the foreground and background of an underwater image, we utilize the idea of U-Net [18], which consists of a contracting path to capture context information and an expanding path to guarantee the accuracy of localization. A traditionally well-trained U-Net cannot accurately separate the foreground from the background of our underwater images because there are many unannotated foreground areas in the training dataset. To address this problem, we propose two modifications: (1) As shown in Fig. 3, we design a light-weight U-Net to reduce the network's capacity to fit the training dataset. More specifically, we use 7 convolutional layers for downsampling and 6 deconvolutional layers for upsampling, all of which consist of \(3\times 3\) convolutions with stride 2, and we neither double nor halve the number of feature channels at each downsampling and upsampling step. This asymmetric design of the convolutional layers reduces the degree to which the model fits the incompletely annotated training dataset. Afterwards, the image size is restored by bilinear interpolation. (2) To segment as much of the foreground as possible, the network is trained via a modified MSE loss, denoted as

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^* - y_i\right)^2 + \lambda \, \frac{1}{N}\sum_{i=1}^{N} y_i^*\left(y_i^* - y_i\right), \qquad (1)$$
where \(y\) is the output image of the weakly-fitted segmentation network and \(y^*\) is the ground-truth image generated from the bounding-box areas of the underwater object detection dataset. \(i\) is the index of each pixel in an image and \(N\) is the number of pixels in an image. The value of \(y_i^*\) equals 0 if pixel \(i\) belongs to the background and 1 if it belongs to the foreground. Because of the factor \(y_i^*\), the term \(y_i^*(y_i^* - y_i)\) is active only for foreground pixels. It therefore enlarges the loss whenever foreground is taken as background, which is exactly the mistake encouraged by incompletely annotated datasets. Moreover, the two terms are normalized by \(N\) and weighted by a balancing parameter \(\lambda\).
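To make the two modifications concrete, the following PyTorch sketch implements a network of this shape together with the loss of Eq. 1. It is a minimal illustration under stated assumptions (the channel width of 32, the omission of skip connections, and the tensor layout are ours), not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightUNet(nn.Module):
    """Sketch of the light-weight asymmetric U-Net described above:
    7 stride-2 3x3 convolutions for downsampling and 6 stride-2 3x3
    deconvolutions for upsampling, with a constant channel width
    (channels are neither doubled nor halved)."""

    def __init__(self, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, stride=1, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(7))
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(channels, channels, 3, stride=2,
                               padding=1, output_padding=1)
            for _ in range(6))
        self.head = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        x = F.relu(self.stem(x))
        for down in self.downs:          # 7 downsampling steps
            x = F.relu(down(x))
        for up in self.ups:              # only 6 upsampling steps
            x = F.relu(up(x))
        x = torch.sigmoid(self.head(x))  # per-pixel foreground probability
        # The asymmetry leaves the map at half resolution; bilinear
        # interpolation restores the original image size.
        return F.interpolate(x, size=(h, w), mode='bilinear',
                             align_corners=False)


def weakly_fitted_loss(y, y_star, lam=2.0):
    """Modified MSE loss of Eq. 1: a plain MSE term plus a term that is
    active only on foreground pixels (y* = 1), both normalized by the
    number of pixels N and balanced by lambda."""
    mse = ((y_star - y) ** 2).mean()
    fg_term = (y_star * (y_star - y)).mean()
    return mse + lam * fg_term
```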
2.3 Proposal-Refined Object Detection
The quality of the proposals has a great influence on the performance of object detection. Therefore, various studies have focused on region proposal generation [22,24]. Among them, the Region Proposal Network (RPN) proposed in Faster R-CNN [16] is the most influential method in recent years. Accordingly, we build our strategy on the Faster R-CNN framework in this paper.
The architecture of Faster R-CNN can be divided into two parts: the Region Proposal Network (RPN) and the region-of-interest (ROI) classifier. When training the RPN, traditional methods assign a negative label to an anchor if its Intersection-over-Union (IoU) ratio is lower than 0.3 for all ground-truth boxes. However, as shown in Fig. 4, with an incompletely annotated dataset this generates many false negative examples that contain unlabeled objects, which directly harms the learning of the RPN. To address this problem, we feed the segmentation map generated in the first stage into the RPN and the ROI classifier as an additional input. When the RPN or the ROI classifier assigns a negative label to an anchor, it refers not only to the IoU with the ground truth but also to the segmentation map.
The specific steps are as follows: (1) First, the foreground of an underwater image is obtained by the weakly-fitted segmentation network and denoted as \(S_1\). Then, we subtract the ground-truth boxes from \(S_1\) to obtain the unlabeled foreground region \(S_2\). (2) For training the RPN, positives are labeled in the same way as in the traditional strategy. However, assigning a negative label to an anchor now requires two conditions: (i) its IoU ratio is lower than 0.3 for all ground-truth boxes; (ii) its IoU ratio with \(S_2\) is lower than or equal to \(\beta\). Similarly, the generation of positive and negative examples is constrained by both the ground truth and the segmentation map during the training of the ROI classifier.
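The refined negative-assignment rule can be expressed compactly. Below is a minimal NumPy sketch assuming \(S_2\) is given as a binary mask and anchors and ground-truth boxes as (x1, y1, x2, y2) arrays; measuring the "IoU ratio for \(S_2\)" as the fraction of the anchor covered by the mask is our reading, not a detail confirmed by the paper.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) pixel coordinates."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)


def s2_overlap(anchors, s2_mask):
    """Fraction of each anchor covered by the unlabeled-foreground mask S2
    (our stand-in for the paper's 'IoU ratio for S2')."""
    ratios = np.zeros(len(anchors), dtype=np.float32)
    for k, (x1, y1, x2, y2) in enumerate(anchors.astype(int)):
        region = s2_mask[max(y1, 0):y2, max(x1, 0):x2]
        ratios[k] = float(region.mean()) if region.size else 0.0
    return ratios


def negative_anchor_mask(anchors, gt_boxes, s2_mask, beta=0.3):
    """An anchor may serve as a negative only if (i) its IoU with every
    ground-truth box is below 0.3 and (ii) its overlap with the unlabeled
    foreground S2 is at most beta."""
    low_gt = pairwise_iou(anchors, gt_boxes).max(axis=1) < 0.3
    low_s2 = s2_overlap(anchors, s2_mask) <= beta
    return low_gt & low_s2
```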
By controlling the generation of negative examples, we eliminate false negative examples and thus provide more accurate positive and negative examples for training the object detection network to generate high-quality proposals. Following [16], a classification loss and a bounding-box regression loss are computed for both the RPN and the ROI classifier:

$$L = L_{cls}^{rpn} + c^* L_{reg}^{rpn} + L_{cls}^{roi} + p^* L_{reg}^{roi}, \qquad (2)$$
where \(L_{cls}\) is the cross-entropy loss for classification and \(L_{reg}\) is the smooth L1 loss defined in [5] for regression. The terms \(c^*L_{reg}^{rpn}\) and \(p^*L_{reg}^{roi}\) mean that the regression loss is activated only for positive anchors and non-background class proposals, respectively. It is worth mentioning that although the proposed method is implemented on Faster R-CNN, it is applicable to other region proposal-based methods such as R-FCN [3], FPN [9] and Mask R-CNN [7].
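A PyTorch sketch of Eq. 2 follows; the \(c^*\) and \(p^*\) gating is realized by computing the smooth L1 regression loss only on positive anchors and non-background proposals. Tensor shapes and the exact sampling of anchors and proposals are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def detection_loss(rpn_logits, rpn_labels, rpn_deltas, rpn_targets,
                   roi_logits, roi_labels, roi_deltas, roi_targets):
    """Joint loss of Eq. 2 over one batch of sampled anchors/proposals."""
    # RPN: binary objectness classification over sampled anchors.
    l_cls_rpn = F.binary_cross_entropy_with_logits(
        rpn_logits, rpn_labels.float())
    pos = rpn_labels == 1                       # c* = 1 only for positives
    l_reg_rpn = (F.smooth_l1_loss(rpn_deltas[pos], rpn_targets[pos])
                 if pos.any() else rpn_logits.sum() * 0.0)
    # ROI head: multi-class classification (class 0 = background).
    l_cls_roi = F.cross_entropy(roi_logits, roi_labels)
    fg = roi_labels > 0                         # p* = 1 for foreground classes
    l_reg_roi = (F.smooth_l1_loss(roi_deltas[fg], roi_targets[fg])
                 if fg.any() else roi_logits.sum() * 0.0)
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```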
3 Experiment
3.1 Dataset and Metric
Our experiments are carried out on the Underwater Robot Picking Contest 2017 dataset (URPC2017), which contains 3 object categories (sea cucumber, sea urchin and scallop) with a total of 19967 underwater images. The dataset is divided into train, val and test sets, which have 17655, 1317 and 985 images, respectively. In the dataset, completely annotated images are fewer than incompletely annotated ones. We train our segmentation and detection networks on the trainval set, which contains both completely and incompletely annotated images. The test set consists of accurately and completely annotated images. The dataset used to train the weakly-fitted segmentation network is generated from the bounding-box areas (see Fig. 5). Object detection accuracy is measured by mean Average Precision (mAP).
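Generating the segmentation ground truth from the detection annotations amounts to rasterizing the boxes. The sketch below shows one plausible construction; the (x1, y1, x2, y2) pixel box format is our assumption.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Build the binary ground-truth map y* for the segmentation network:
    pixels inside any annotated bounding box are foreground (1), all
    other pixels are background (0)."""
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes.astype(int):
        mask[max(y1, 0):min(y2, height), max(x1, 0):min(x2, width)] = 1.0
    return mask
```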
3.2 Implementation Details
For the training of the weakly-fitted segmentation network, we use a learning rate of 0.0001 for 70k iterations and set \(\lambda = 2\), which makes the two terms in Eq. 1 roughly equally weighted after normalization. For the training of the proposal-refined object detection network, we use Faster R-CNN as our baseline detection framework. VGG16 pre-trained on ImageNet is used as the backbone architecture for feature extraction due to the small scale of the dataset. The initial learning rate is set to 0.0002 for the first 50k iterations and then decreased to 0.00002 for the following 20k iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively. Other hyper-parameters are identical to those defined in [16].
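For reference, the detector's schedule maps onto a standard SGD setup. In the sketch below, the stated hyper-parameter values come from the text, while the placeholder module stands in for the actual detection network (an assumption for the sake of a runnable snippet).

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the Faster R-CNN detector.
detector = nn.Conv2d(3, 3, 3)

# SGD with momentum 0.9 and weight decay 0.0005, as stated above.
optimizer = torch.optim.SGD(detector.parameters(), lr=0.0002,
                            momentum=0.9, weight_decay=0.0005)

# lr = 0.0002 for the first 50k iterations, then 0.00002 for the last 20k.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50000], gamma=0.1)
```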
3.3 Experimental Results
The Influence of IoU Threshold. We explore the influence of the IoU threshold \(\beta\) of the segmentation map on the detector. \(\beta = 1\) gives the baseline result of the original Faster R-CNN, in which the generation of negative examples is not constrained by the segmentation map. As shown in Table 1, \(\beta = 0.3\) outperforms the other choices and is \(12.1\%\) better than the baseline. This indicates that allowing negative examples to contain a small part of an object is beneficial to detection performance. When \(\beta = 0\), the detector is trained on a large number of easily classified background examples, which does not help improve detection accuracy. Consequently, we choose \(\beta = 0.3\) for the following experiments.
The Results of Weakly-Fitted Segmentation. Figure 6 shows the qualitative results of weakly-fitted segmentation: (a) is the input image, (b) is the segmentation result of U-Net and (c) is the result of the proposed network. Under the same experimental setting, U-Net obviously cannot completely separate the foreground and background, because the unannotated foreground areas weaken its ability to distinguish foreground from background. In contrast, the proposed weakly-fitted segmentation network can segment the foreground and background of an underwater image, including the regions left unannotated in the underwater object detection dataset, because it reduces the degree to which the model fits the training data and increases the penalty for regarding foreground as background.
The Results of Proposal-Refined Object Detection. To show how Faster R-CNN and the proposal-refined detector improve during learning, we plot the mAP of the two detectors over training iterations. As shown in Fig. 7, both detectors improve at the beginning of training, but the proposal-refined detector always has a higher mAP than Faster R-CNN, suggesting the effectiveness of the proposal-refined object detection network. Figure 8 shows the qualitative results of the proposal-refined detector compared with the baseline Faster R-CNN. It can be seen that the proposal-refined detector detects more objects than the baseline framework, especially small and challenging objects.
Fig. 7. The changes in mAP for Faster R-CNN and the proposal-refined detector on the URPC2017 dataset during training.
Fig. 8. Qualitative detection results on URPC2017. Top: results of the Faster R-CNN baseline model. Bottom: results of the proposal-refined detector.
Comparisons with the State of the Art. In this section, we present experimental results of our proposed method applied to other outstanding object detection networks: R-FCN [3], FPN [9] and Mask R-CNN [7]. As shown in Table 2, our method improves the mAP of the original object detectors by about 10%, indicating the effectiveness and robustness of the proposal-refined weakly supervised object detection. By eliminating false negative examples, the proposed method counteracts the accuracy decrease caused by incompletely annotated datasets.
4 Conclusion
In this paper, we propose a simple but effective framework for object detection in underwater images with incompletely annotated datasets. Our proposal-refined weakly supervised object detection system is composed of two stages. The first stage is a weakly-fitted segmentation network that separates foreground from background. The second stage is a proposal-refined object detector that uses the segmentation map to generate high-quality proposals. Experiments show that the proposed method greatly improves detection performance compared to several baseline methods. With our method, we can not only reduce the cost of dataset annotation but also offset the accuracy decrease caused by missing annotations. In addition, the idea of the proposed method can be applied not only to underwater object detection but also to other detection tasks with incomplete annotation.
References
Chen, G., Liu, L., Hu, W., Pan, Z.: Semi-supervised object detection in remote sensing images using generative adversarial networks. In: 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 2503–2506 (2018)
Choi, M.K., et al.: Co-occurrence matrix analysis-based semi-supervised training for object detection. In: 2018 25th IEEE International Conference on Image Processing, pp. 1333–1337 (2018)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, pp. 379–387 (2016)
Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 765–781. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_45
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 936–944 (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2018). https://doi.org/10.1109/TPAMI.2018.2858826
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Naylor, R., Burke, M.: Aquaculture and ocean resources: raising tigers of the sea. Annu. Rev. Environ. Resour. 30, 185–218 (2005)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6517–6525 (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Rhee, P.K., Erdenee, E., Kyun, S.D., Ahmed, M.U., Jin, S.: Active and semi-supervised learning for object detection with imperfect data. Cogn. Syst. Res. 45, 109–123 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2843–2851 (2017)
Tang, P., et al.: Weakly supervised region proposal network and object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 370–386. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_22
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
Zhang, X., Feng, J., Xiong, H., Tian, Q.: Zigzag learning for weakly supervised object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4262–4270 (2018)
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_26
Author information
Authors and Affiliations
School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
Xiaoqian Lv, An Wang, Qinglin Liu, Jiamin Sun & Shengping Zhang
Corresponding author
Correspondence to Shengping Zhang.
Editor information
Editors and Affiliations
Beijing Jiaotong University, Beijing, China
Yao Zhao
The Australian National University, Canberra, Australia
Nick Barnes
Peking University, Beijing, China
Baoquan Chen
The Technical University of Munich, Munich, Bayern, Germany
Rüdiger Westermann
Zhejiang University, Hangzhou, China
Xiangwei Kong
Beijing Jiaotong University, Beijing, China
Chunyu Lin
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Lv, X., Wang, A., Liu, Q., Sun, J., Zhang, S. (2019). Proposal-Refined Weakly Supervised Object Detection in Underwater Images. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science, vol. 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34119-0
Online ISBN: 978-3-030-34120-6