
Real-Time Prediction of Segmentation Quality

Conference paper

Abstract

Recent advances in deep learning based image segmentation methods have enabled real-time performance with human-level accuracy. However, occasionally even the best method fails due to low image quality, artifacts or unexpected behaviour of black box algorithms. Being able to predict segmentation quality in the absence of ground truth is of paramount importance in clinical practice, but also in large-scale studies to avoid the inclusion of invalid data in subsequent analysis.

In this work, we propose two approaches for real-time automated quality control of cardiovascular MR segmentations using deep learning. First, we train a neural network on 12,880 samples to predict Dice Similarity Coefficients (DSC) on a per-case basis. We report a mean absolute error (MAE) of 0.03 on 1,610 test samples and 97% binary classification accuracy for separating low- and high-quality segmentations. Second, in the scenario where no manually annotated data is available, we train a network to predict DSC scores from estimated quality obtained via a reverse testing strategy. We report an \(\mathrm{MAE} = 0.14\) and 91% binary classification accuracy for this case. Predictions are obtained in real-time which, when combined with real-time segmentation methods, enables instant feedback on whether an acquired scan is analysable while the patient is still in the scanner. This further enables new applications of optimising image acquisition towards best possible analysis results.


1 Introduction

Finding out that an acquired medical image is not usable for the intended purpose is not only costly but can be critical if image-derived quantitative measures were meant to support clinical decisions in diagnosis and treatment. Real-time assessment of the downstream analysis task, such as image segmentation, is therefore highly desirable. Ideally, such an assessment could be performed while the patient is still in the scanner, so that in case an image is not analysable a new scan could be obtained immediately (even automatically). Such a real-time assessment requires two components: a real-time analysis method and a real-time prediction of the quality of the analysis result. This paper proposes a solution to the latter, with a particular focus on image segmentation as the analysis task.

Recent advances in deep learning based image segmentation have brought highly efficient and accurate methods, most of which are based on Convolutional Neural Networks (CNNs). However, even the best method will occasionally fail due to insufficient image quality (e.g., noise, artefacts, corruption) or show unexpected behaviour on new data. In clinical settings, it is of paramount importance to be able to detect such failure cases on a per-case basis. In clinical research, such as population studies, it is important to be able to detect failure cases in automated pipelines so that invalid data can be discarded in the subsequent statistical analysis.

Here, we focus on automatic quality control of image segmentation. Specifically, we assess the quality of automatically generated segmentations of cardiovascular MR (CMR) from the UK Biobank (UKBB) Imaging Study [1].

Automated quality control is dominated by research in the natural-image domain and is often referred to as image quality assessment (IQA). The literature proposes methodologies to quantify the technical characteristics of an image, such as the amount of blur, and, more recently, to assess the aesthetic quality of images [2]. In the medical image domain, IQA is an important topic of research in the fields of image acquisition and reconstruction. An example is the work by Farzi et al. [3], which proposes an unsupervised approach to detect artefacts. Where research is conducted into the quality or accuracy of image segmentations, it almost always assumes that a manually annotated ground truth (GT) labelmap is available for comparison. Our domain has seen little work on assessing the quality of generated segmentations, particularly on a per-case basis and in the absence of GT.

Related Work: Some previous studies have attempted to deliver quality estimates of automatically generated segmentations when GT is unavailable. Most methods rely on a reverse-testing strategy. Both Reverse Validation [4] and Reverse Testing [5] employ a form of cross-validation, training segmentation models on a dataset and then evaluating them either on a different fold of the data or on a separate test-set. Both of these methods require a fully-labeled dataset for training. Additionally, they are limited to conclusions about the quality of the segmentation algorithms rather than of individual labelmaps, as the same data are used for training and testing.

Where work has been done on assessing individual segmentations, it often also requires large sets of labeled training data. In [6], a model was trained using numerous statistical and energy measures from segmentation algorithms. Although this model is able to give individual predictions of accuracy for a given segmentation, it again requires a fully-annotated dataset. Moving away from this limitation, [7, 8] have shown that applying Reverse Classification Accuracy (RCA) gives accurate predictions of traditional quality metrics on a per-case basis. They accomplish this by comparing a set of reference images with manual segmentations to the test-segmentation, evaluating a quality metric between these, and then taking the best value as a prediction of segmentation quality. This is done with a set of only 100 reference images with verified labelmaps. However, the time taken to complete RCA on a single segmentation, around 11 min, prohibits its use in real-time quality control frameworks.
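For intuition, a minimal sketch of the RCA idea follows (our own illustration: `segment_with` and `dsc` are hypothetical placeholders for a single-atlas reverse segmenter and an overlap metric, not APIs from [7, 8]):

```python
def rca_predict(test_img, test_seg, references, segment_with, dsc):
    """Reverse Classification Accuracy, sketched after [7, 8]: build a
    'reverse' segmenter from the (test image, test segmentation) pair,
    apply it to reference images with trusted GT, and take the best
    overlap score as the quality estimate for the test segmentation."""
    reverse_segmenter = segment_with(test_img, test_seg)   # e.g. atlas-based
    scores = [dsc(reverse_segmenter(ref_img), ref_gt)
              for ref_img, ref_gt in references]           # ~100 references
    return max(scores)
```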

Contributions: In this study, we show that applying a modern deep learning approach to the problem of automated quality control in deployed image-segmentation frameworks can decrease the per-case analysis time to the order of milliseconds whilst maintaining good accuracy. We predict the Dice Similarity Coefficient (DSC) at large scale, analyzing over 16,000 segmentations of images from the UKBB. We also show that measures derived from RCA can be used to inform our network, removing the need for a large, manually-annotated dataset. When pairing our proposed real-time quality assessment with real-time segmentation methods, one can envision new avenues of optimising image acquisition automatically towards the best possible analysis results.

Fig. 1. (left) Histogram of Dice Similarity Coefficients (DSC) for 29,292 segmentations. The range [0, 1] is divided into 10 equally spaced bins. The red line marks the minimum count (1,610), in the DSC bin [0.5, 0.6), used to balance scores. (right) The 5 input channels of the CNNs in both experiments: the image and one-hot-encoded labelmaps for background (BG), left-ventricular cavity (LVC), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC).

2 Method and Material

We use the Dice Similarity Coefficient (DSC) as a metric of segmentation quality. It measures the overlap between a proposed segmentation and its ground truth (GT), usually a manual reference. We aim to predict DSC for segmentations in the absence of GT. We perform two experiments in which CNNs are trained to predict DSC. First, we describe our input data and the models.
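Concretely, for a proposed segmentation \(A\) and its GT \(B\), \(\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}\), which ranges from 0 (no overlap) to 1 (perfect agreement).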

Our initial dataset consists of 4,882 3D (2D-stacks) end-diastolic (ED) cardiovascular magnetic resonance (CMR) scans from the UK Biobank (UKBB) Imaging Study (see Note 1). All images have a manual segmentation, which is unprecedented at this scale. We take these labelmaps as reference GT. Each labelmap contains 3 classes: left-ventricular cavity (LVC), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC), which are separate from the background class (BG). In this work, we also consider the segmentation as a single binary entity comprising all classes: whole-heart (WH).

A random forest (RF) of 350 trees and maximum depth 40 is trained on 100 cardiac atlases from an in-house database and used to segment the 4,882 images at depths of 2, 4, 6, 8, 10, 15, 20, 24, 36 and 40, yielding segmentations of varying quality. We calculate DSC against the GT for the 29,292 generated segmentations. The distribution is shown in Fig. 1. Due to the imbalance in DSC scores, we take a random subset of 1,610 segmentations from each DSC bin, equal to the minimum count-per-bin across the distribution (see the sketch below). Our final dataset comprises 16,100 score-balanced segmentations with reference GT.
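A minimal sketch of this balancing step (our own illustrative code; the helper name and fixed seed are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_by_dsc(dsc_scores, n_per_bin=1610, n_bins=10):
    """Subsample indices so that each DSC bin in [0, 1] contributes
    the same number of segmentations."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each score to a bin index; clip so DSC = 1.0 falls in the last bin.
    bin_idx = np.clip(np.digitize(dsc_scores, edges) - 1, 0, n_bins - 1)
    keep = []
    for b in range(n_bins):
        members = np.where(bin_idx == b)[0]
        keep.extend(rng.choice(members, size=min(n_per_bin, len(members)),
                               replace=False))
    return np.asarray(keep)
```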

From each segmentation we create 4 one-hot-encoded masks: masks 1 to 4 correspond to the classes BG, LVC, LVM and RVC respectively. Each voxel carries a 4-element vector that is [0, 0, 0, 0] when the voxel does not belong to the mask's class, with the \(i^{th}\) element set to 1 otherwise. For example, the mask for LVC is [0, 0, 0, 0] everywhere except at voxels of the LVC class, which are given the value [0, 1, 0, 0]. This gives the network a greater chance to learn the relationships between the voxels' classes and their locations; a sketch of the encoding follows. An example of the segmentation masks is shown in Fig. 1.
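A sketch of this encoding, assuming integer voxel labels 0-3 for BG, LVC, LVM and RVC:

```python
import numpy as np

def one_hot_masks(labelmap, n_classes=4):
    """Convert an integer labelmap (0 = BG, 1 = LVC, 2 = LVM, 3 = RVC)
    into n_classes binary masks stacked on a trailing channel axis, so
    each voxel carries a one-hot vector such as [0, 1, 0, 0] for LVC."""
    return np.stack([(labelmap == c).astype(np.float32)
                     for c in range(n_classes)], axis=-1)
```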

At training time, our data-generator re-samples the UKBB images and our segmentations to a consistent shape of [224, 224, 8, 5], making our network fully 3D with 5 data channels: the image and the 4 segmentation masks. The images are also normalized such that the entire dataset falls in the range [0.0, 1.0].
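The resampling and channel stacking might look as follows (a sketch reusing `one_hot_masks` from above; for brevity we normalise per image here, whereas the paper normalises over the entire dataset):

```python
import numpy as np
from scipy.ndimage import zoom

def make_input(image, labelmap, target=(224, 224, 8)):
    """Resample an image/labelmap pair to a fixed grid, normalise the
    image intensities to [0, 1] and stack the 5 data channels."""
    factors = [t / s for t, s in zip(target, image.shape)]
    img = zoom(image, factors, order=1)     # trilinear for intensities
    lab = zoom(labelmap, factors, order=0)  # nearest-neighbour for labels
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    return np.concatenate([img[..., None], one_hot_masks(lab)], axis=-1)
```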

For comparison and consistency, we use the same input data and network architecture in each of our experiments. We employ a 50-layer 3D residual network written in Python with the Keras library and trained on an 11 GB Nvidia GeForce GTX 1080 Ti GPU. Residual networks are advantageous as they allow deeper networks to be trained by repeating smaller blocks, using skip connections that help information and gradients propagate through the network. We use the Adam optimizer with a learning rate of \(10^{-5}\) and decay of 0.005. Batch size is kept constant at 46 samples. We run validation at the end of each epoch for model-selection purposes.
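An illustrative Keras skeleton follows; it shows the residual-block pattern and the input shape only, not the authors' exact 50-layer architecture (block widths and depths here are our own choices):

```python
from tensorflow.keras import layers, models

def res_block(x, filters, stride=1):
    """Basic 3D residual block: two convolutions plus a skip connection,
    with a 1x1x1 projection when the shape changes."""
    shortcut = x
    y = layers.Conv3D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv3D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv3D(filters, 1, strides=stride, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

inp = layers.Input(shape=(224, 224, 8, 5))   # image + 4 one-hot masks
x = res_block(inp, 16)
x = res_block(x, 32, stride=2)
x = res_block(x, 64, stride=2)
features = layers.GlobalAveragePooling3D()(x)
```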

Experiments

Can we take advantage of a CNN's inference speed to give fast and accurate predictions of segmentation quality? This question matters for analysis pipelines that could benefit from increased confidence in segmentation quality without compromising processing time. To answer it, we conduct the following experiments.

Experiment 1: Directly Predicting DSC. Is it possible to directly predict the quality of a segmentation given only the image-segmentation pair? In this experiment we calculate, per class, the DSC between our segmentations and the GT. These are used as training labels. The final layer of the network has 5 nodes whose output is a vector \(X \in [0,1]^{5}\), representing the predicted DSC per class including background and whole-heart. We use mean-squared-error loss and report the mean absolute error between the output and the GT DSC. We split our data 80:10:10, giving 12,880 training samples and 1,610 samples each for validation and testing. Performing this experiment is costly as it requires a large manually-labeled dataset, which is not readily available in practice.
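Continuing the sketch above, the regression head and training loop might look as below (`X` and `y` are assumed arrays holding the 5-channel volumes and per-class DSC labels; the epoch count is illustrative):

```python
import numpy as np
from tensorflow.keras import models, optimizers

out = layers.Dense(5, activation="sigmoid")(features)  # DSC in [0, 1] per channel
model = models.Model(inp, out)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),  # paper also uses decay 0.005
              loss="mse", metrics=["mae"])

idx = np.random.permutation(len(X))                    # 80:10:10 split
tr, va, te = np.split(idx, [int(0.8 * len(X)), int(0.9 * len(X))])
model.fit(X[tr], y[tr], validation_data=(X[va], y[va]),
          batch_size=46, epochs=50)
```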

Experiment 2: Predicting RCA Scores. Given the promising results of the RCA framework [7, 8] in accurately predicting the quality of segmentations in the absence of large labeled datasets, can we use the predictions from RCA as training data to allow a network to give comparably accurate predictions on a test-set? In this experiment, we perform RCA on all 16,100 segmentations. To ensure that we train on balanced scores, we again perform histogram binning, this time on the RCA scores, and take equal numbers from each bin. We finish with a total of 5,363 samples split into training, validation and test sets of 4,787, 228 and 228 respectively. The per-class predictions are used as labels during training. As in Experiment 1, we obtain a single predicted DSC output for each class using the same network and hyper-parameters, but without the need for the large, often-unobtainable manually-labeled training set.
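Only the training labels change relative to Experiment 1; a hedged sketch, assuming `rca_scores` is an [N, 5] array of per-class RCA-predicted DSC values and reusing the helpers above:

```python
# Balance on the whole-heart RCA score (channel 4), then train the same
# model with RCA predictions, rather than GT DSC, as regression targets.
keep = balance_by_dsc(rca_scores[:, 4], n_per_bin=536)  # 10 bins, ~5,363 samples
X_rca, y_rca = X[keep], rca_scores[keep]
n = len(keep)
tr, va, te = np.split(np.random.permutation(n), [n - 456, n - 228])  # 228 + 228 held out
model.fit(X_rca[tr], y_rca[tr], validation_data=(X_rca[va], y_rca[va]),
          batch_size=46)
```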

Fig. 2. Examples showing excellent prediction of Dice Similarity Coefficient (DSC) in Experiment 1. Quality increases from top-left to bottom-right. Each panel shows (left to right) the image, test-segmentation and reference GT.

3 Results

Results from Experiment 1 are shown in Table 1. We report mean absolute error (MAE) and standard deviations per class between reference GT and predicted DSC. Our results show that our network can directly predict whole-heart DSC from the image-segmentation pair with an MAE of 0.03 (SD = 0.04). We see similar performance on individual classes. Table 1 also shows MAE over the top and bottom halves of the GT DSC range, suggesting that the error is equally distributed over poor and good quality segmentations. For WH we report that 72% of the data have an MAE of less than 0.05, with outliers (\(\mathrm{MAE} \ge 0.12\)) comprising only 6% of the data. Distributions of the MAEs for each class can be seen in Fig. 3. Examples of good and poor quality segmentations are shown in Fig. 2 with their GT and predictions. Results show excellent true-positive (TPR) and false-positive (FPR) rates on a whole-heart binary classification task with a DSC threshold of 0.70; a sketch of these statistics follows. The reported accuracy of 97% is better than the 95% reported with RCA in [8].
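These statistics can be computed as in the sketch below (our own code; we take 'good', i.e. \(\mathrm{DSC} \ge 0.7\), as the positive class):

```python
import numpy as np

def binary_quality_metrics(dsc_true, dsc_pred, threshold=0.7):
    """TPR, FPR and accuracy for classifying segmentations as 'good'
    (DSC >= threshold) vs 'poor' from true and predicted WH DSC."""
    pos = dsc_true >= threshold       # truly good segmentations
    pred_pos = dsc_pred >= threshold  # predicted good
    tpr = pred_pos[pos].mean()        # good correctly flagged as good
    fpr = pred_pos[~pos].mean()       # poor wrongly flagged as good
    acc = (pos == pred_pos).mean()
    return tpr, fpr, acc
```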

Fig. 3. Distribution of the mean absolute errors (MAE) for Experiments 1 (left) and 2 (right). Results are shown for each class: background (BG), left-ventricular cavity (LVC), left-ventricular myocardium (LVM), right-ventricular cavity (RVC) and for the whole-heart (WH).

Our results for Experiment 2 are recorded in Table 1. It is expected that direct predictions of DSC from the RCA labels are less accurate than in Experiment 1. The reasoning is two-fold: first, the RCA labels are themselves predictions and retain inherent uncertainty; second, the training set here is much smaller than in Experiment 1. Nevertheless, we report an MAE of 0.14 (SD = 0.09) for the WH case and 91% accuracy on the binary classification task. Distributions of the MAEs are shown in Fig. 3. LVM has a greater variance in MAE, which is in line with previous results using RCA [8]. Thus, the network would be a valuable addition to an analysis pipeline where operators can be informed of likely poor-quality segmentations, along with some confidence interval, in real-time.

On average, the inference time for each network was on the order of 600 ms on CPU and 40 ms on GPU. On GPU this is over 10,000 times faster than RCA (660 s) whilst maintaining good accuracy. In an automated image analysis pipeline, this method would deliver excellent performance at high speed and at large scale. When paired with a real-time segmentation method, it would be possible to provide real-time feedback during image acquisition on whether an acquired image is of sufficient quality for the downstream segmentation task.
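Such timings can be reproduced with a simple warm-up-then-measure pattern (a sketch reusing names from the earlier code; single-sample batch):

```python
import time

x = X[te[:1]]                 # one 5-channel volume
model.predict(x)              # warm-up: graph construction and transfers
t0 = time.perf_counter()
model.predict(x)
print(f"inference time: {1e3 * (time.perf_counter() - t0):.1f} ms")
```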

Table 1. For Experiments 1 and 2: mean absolute error (MAE) for poor (\(\mathrm{DSC} < 0.5\)) and good (\(\mathrm{DSC} \ge 0.5\)) quality segmentations over individual classes and whole-heart (WH), with standard deviations in brackets. (right) Statistics from binary classification (threshold \(\mathrm{DSC} = 0.7\) [8]): true-positive (TPR) and false-positive (FPR) rates over the full DSC range with classification accuracy (Acc).

4 Conclusion

Ensuring the quality of an automatically generated segmentation in a deployed image analysis pipeline in real-time is challenging. We have shown that we can employ Convolutional Neural Networks to tackle this problem with great computational efficiency and good accuracy.

We recognize that our networks are prone to learning features specific to assessing the quality of Random Forest segmentations. We could build on this by training the network with segmentations generated by an ensemble of methods. However, we must reiterate that the purpose of the framework in this study is to give an indication of the predicted quality and not a direct one-to-one mapping to the reference DSC. Currently, these networks will correctly predict whether a segmentation is ‘good’ or ‘poor’ at a given threshold, but will not confidently distinguish between two segmentations of similar quality.

Our trained CNNs are insensitive to small regional or boundary differences in labelmaps of good quality, so they cannot be used to assess the quality of a segmentation at a fine scale. Again, this may be improved by more diverse and granular training sets. The labels for training the network in Experiment 1 are not easily available in most cases. However, by performing RCA one can automatically obtain training labels for the network in Experiment 2, and this could be applied to segmentations generated with other algorithms. The cost of using data obtained with RCA is an increase in MAE, which is reasonable compared to the effort required to obtain a large, manually-labeled dataset.

Notes

  1. UK Biobank Resource under Application Number 2964.

References

  1. Petersen, S.E., et al.: Reference ranges for cardiac structure and function using cardiovascular magnetic resonance (CMR) in Caucasians from the UK Biobank population cohort. J. Cardiovasc. Magn. Reson. 19(1), 18 (2017)


  2. Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. 1, 1–14 (2016)


  3. Farzi, M., Pozo, J.M., McCloskey, E.V., Wilkinson, J.M., Frangi, A.F.: Automatic quality control for population imaging: a generic unsupervised approach. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 291–299. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_34


  4. Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross validation framework to choose amongst models and datasets for transfer learning. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010. LNCS (LNAI), vol. 6323, pp. 547–562. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_35


  5. Fan, W., Davidson, I.: Reverse testing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2006, p. 147. ACM Press, New York (2006)


  6. Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., Grady, L.: Evaluating segmentation error without ground truth. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 528–536. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33415-3_65


  7. Valindria, V.V., et al.: Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans. Med. Imaging 36, 1597–1606 (2017)


  8. Robinson, R., et al.: Automatic quality control of cardiac MRI segmentation in large-scale population imaging. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 720–727. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_82



Acknowledgements

RR is funded by the KCL & Imperial EPSRC CDT in Medical Imaging (EP/L015226/1) and GlaxoSmithKline; VV by the Indonesia Endowment for Education (LPDP) Indonesian Presidential PhD Scholarship; KF is supported by The Medical College of Saint Bartholomew’s Hospital Trust. AL and SEP acknowledge support from the NIHR Barts Biomedical Research Centre and an EPSRC programme grant (EP/P001009/1). SN and SKP are supported by the Oxford NIHR BRC and the Oxford British Heart Foundation Centre of Research Excellence. This project was supported by the MRC (grant number MR/L016311/1). NA is supported by a Wellcome Trust Research Training Fellowship (203553/Z/Z). The authors SEP, SN and SKP acknowledge the British Heart Foundation (BHF) (PG/14/89/31194). BG received funding from the ERC under Horizon 2020 (grant agreement No. 757173, project MIRA, ERC-2017-STG).

Author information

Authors and Affiliations

  1. BioMedIA Group, Department of Computing, Imperial College London, London, UK

    Robert Robinson, Ozan Oktay, Wenjia Bai, Vanya V. Valindria, Bernhard Kainz, Daniel Rueckert & Ben Glocker

  2. Research & Development, GlaxoSmithKline, Brentford, UK

    Chris Page

  3. NIHR Barts Biomedical Research Centre, Queen Mary University London, London, UK

    Mihir M. Sanghvi, Nay Aung, José M. Paiva, Filip Zemrak, Kenneth Fung, Aaron M. Lee & Steffen E. Petersen

  4. Barts Heart Centre, Barts Health NHS Trust, London, UK

    Mihir M. Sanghvi, Nay Aung, Filip Zemrak, Kenneth Fung, Aaron M. Lee & Steffen E. Petersen

  5. Radcliffe Department of Medicine, University of Oxford, Oxford, UK

    Elena Lukaschuk, Valentina Carapella, Young Jin Kim, Stefan K. Piechnik & Stefan Neubauer

  6. Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea

    Young Jin Kim

Authors

Robert Robinson, Ozan Oktay, Wenjia Bai, Vanya V. Valindria, Mihir M. Sanghvi, Nay Aung, José M. Paiva, Filip Zemrak, Kenneth Fung, Elena Lukaschuk, Aaron M. Lee, Valentina Carapella, Young Jin Kim, Bernhard Kainz, Stefan K. Piechnik, Stefan Neubauer, Steffen E. Petersen, Chris Page, Daniel Rueckert & Ben Glocker

Corresponding author

Correspondence to Robert Robinson.

Editor information

Editors and Affiliations

  1. University of Leeds, Leeds, UK

    Alejandro F. Frangi

  2. King’s College London, London, UK

    Julia A. Schnabel

  3. University of Pennsylvania, Philadelphia, PA, USA

    Christos Davatzikos

  4. Universidad de Valladolid, Valladolid, Spain

    Carlos Alberola-López

  5. Queen’s University, Kingston, ON, Canada

    Gabor Fichtinger


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Robinson, R. et al. (2018). Real-Time Prediction of Segmentation Quality. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science, vol. 11073. Springer, Cham. https://doi.org/10.1007/978-3-030-00937-3_66
