- Robert Robinson,
- Ozan Oktay,
- Wenjia Bai,
- Vanya V. Valindria,
- Mihir M. Sanghvi,
- Nay Aung,
- José M. Paiva,
- Filip Zemrak,
- Kenneth Fung,
- Elena Lukaschuk,
- Aaron M. Lee,
- Valentina Carapella,
- Young Jin Kim,
- Bernhard Kainz,
- Stefan K. Piechnik,
- Stefan Neubauer,
- Steffen E. Petersen,
- Chris Page,
- Daniel Rueckert &
- Ben Glocker
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11073)
Abstract
Recent advances in deep learning based image segmentation methods have enabled real-time performance with human-level accuracy. However, occasionally even the best method fails due to low image quality, artifacts or unexpected behaviour of black box algorithms. Being able to predict segmentation quality in the absence of ground truth is of paramount importance in clinical practice, but also in large-scale studies to avoid the inclusion of invalid data in subsequent analysis.
In this work, we propose two approaches to real-time automated quality control for cardiovascular MR segmentations using deep learning. First, we train a neural network on 12,880 samples to predict Dice Similarity Coefficients (DSC) on a per-case basis. We report a mean absolute error (MAE) of 0.03 on 1,610 test samples and 97% binary classification accuracy for separating low and high quality segmentations. Second, in the scenario where no manually annotated data is available, we train a network to predict DSC scores from estimated quality obtained via a reverse testing strategy. We report an \(\mathrm{MAE} = 0.14\) and 91% binary classification accuracy for this case. Predictions are obtained in real-time which, when combined with real-time segmentation methods, enables instant feedback on whether an acquired scan is analysable while the patient is still in the scanner. This further enables new applications of optimising image acquisition towards best possible analysis results.
Keywords
- Binary Classification Accuracy
- Mean Average Error (MAE)
- Automated Quality Control
- Human-level Accuracy
- UK Biobank (UKBB)
1 Introduction
Finding out that an acquired medical image is not usable for the intended purpose is not only costly but can be critical if image-derived quantitative measures should have supported clinical decisions in diagnosis and treatment. Real-time assessment of the downstream analysis task, such as image segmentation, is highly desired. Ideally, such an assessment could be performed while the patient is still in the scanner, so that in the case an image is not analysable, a new scan could be obtained immediately (even automatically). Such a real-time assessment requires two components, a real-time analysis method and a real-time prediction of the quality of the analysis result. This paper proposes a solution to the latter with a particular focus on image segmentation as the analysis task.
Recent advances in deep learning based image segmentation have brought highly efficient and accurate methods, most of which are based on Convolutional Neural Networks (CNNs). However, even the best method will occasionally fail due to insufficient image quality (e.g., noise, artefacts, corruption) or show unexpected behaviour on new data. In clinical settings, it is of paramount importance to be able to detect such failure cases on a per-case basis. In clinical research, such as population studies, it is important to be able to detect failure cases in automated pipelines, so invalid data can be discarded in the subsequent statistical analysis.
Here, we focus on automatic quality control of image segmentation. Specifically, we assess the quality of automatically generated segmentations of cardiovascular MR (CMR) from the UK Biobank (UKBB) Imaging Study [1].
Automated quality control is dominated by research in the natural-image domain and is often referred to as image quality assessment (IQA). The literature proposes methodologies to quantify the technical characteristics of an image, such as the amount of blur, and more recently a way to assess the aesthetic quality of such images [2]. In the medical image domain, IQA is an important topic of research in the fields of image acquisition and reconstruction. An example is the work by Farzi et al. [3] proposing an unsupervised approach to detect artefacts. Where research is conducted into the quality or accuracy of image segmentations, it is almost entirely assumed that there is a manually annotated ground truth (GT) labelmap available for comparison. Our domain has seen little work on assessing the quality of generated segmentations particularly on a per-case basis and in the absence of GT.
Related Work: Some previous studies have attempted to deliver quality estimates of automatically generated segmentations when GT is unavailable. Most methods tend to rely on a reverse-testing strategy. Both Reverse Validation [4] and Reverse Testing [5] employ a form of cross-validation by training segmentation models on a dataset that are then evaluated either on a different fold of the data or a separate test-set. Both of these methods require a fully-labeled set of data for use in training. Additionally, these methods are limited to conclusions about the quality of the segmentation algorithms rather than the individual labelmaps as the same data is used for training and testing purposes.
Where work has been done in assessing individual segmentations, it often also requires large sets of labeled training data. In [6] a model was trained using numerous statistical and energy measures from segmentation algorithms. Although this model is able to give individual predictions of accuracy for a given segmentation, it again requires the use of a fully-annotated dataset. Moving away from this limitation, [7, 8] have shown that applying Reverse Classification Accuracy (RCA) gives accurate predictions of traditional quality metrics on a per-case basis. They accomplish this by comparing a set of reference images with manual segmentations to the test-segmentation, evaluating a quality metric between these, and then taking the best value as a prediction for segmentation quality. This is done using a set of only 100 reference images with verified labelmaps. However, the time taken to complete RCA on a single segmentation, around 11 min, prohibits its use in real-time quality control frameworks.
Contributions: In this study, we show that applying a modern deep learning approach to the problem of automated quality control in deployed image-segmentation frameworks can decrease the per-case analysis time to the order of milliseconds whilst maintaining good accuracy. We predict the Dice Similarity Coefficient (DSC) at large scale, analyzing over 16,000 segmentations of images from the UKBB. We also show that measures derived from RCA can be used to inform our network, removing the need for a large, manually-annotated dataset. When pairing our proposed real-time quality assessment with real-time segmentation methods, one can envision new avenues of optimising image acquisition automatically toward the best possible analysis results.
Fig. 1. (left) Histogram of Dice Similarity Coefficients (DSC) for 29,292 segmentations. Range is [0, 1] with 10 equally spaced bins. The red line shows the minimum count (1,610), at the DSC bin [0.5, 0.6), used to balance scores. (right) The 5 channels of the CNNs in both experiments: the image and one-hot-encoded labelmaps for background (BG), left-ventricular cavity (LV), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC).
2 Method and Material
We use the Dice Similarity Coefficient (DSC) as a metric of quality for segmentations. It measures the overlap between a proposed segmentation and its ground truth (GT) (usually a manual reference). We aim to predict DSC for segmentations in the absence of GT. We perform two experiments in which CNNs are trained to predict DSC. First, we describe our input data and the models.
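As a concrete reference, the DSC between two binary masks can be computed in a few lines of NumPy. This is a minimal sketch; the function name and the convention of returning 1.0 when both masks are empty are our own choices, not part of the paper:

```python
import numpy as np

def dice_coefficient(pred, gt):
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |A intersect B| / (|A| + |B|), in [0, 1].
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    if denom == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * intersection / denom
```

For multi-class labelmaps, the same function is applied per class (LVC, LVM, RVC) and to the union of all foreground classes for the whole-heart (WH) score.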
Our initial dataset consists of 4,882 3D (2D-stack) end-diastolic (ED) cardiovascular magnetic resonance (CMR) scans from the UK Biobank (UKBB) Imaging Study (see Note 1). All images have a manual segmentation, which is unprecedented at this scale. We take these labelmaps as reference GT. Each labelmap contains 3 classes: left-ventricular cavity (LVC), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC), which are separate from the background class (BG). In this work, we also consider the segmentation as a single binary entity comprising all classes: whole-heart (WH).
A random forest (RF) of 350 trees and maximum depth of 40 is trained on 100 cardiac atlases from an in-house database and used to segment the 4,882 images at depths of 2, 4, 6, 8, 10, 15, 20, 24, 36 and 40. We calculate DSC against the GT for the 29,292 generated segmentations. The distribution is shown in Fig. 1. Due to the imbalance in DSC scores of this data, we take a random subset of 1,610 segmentations from each DSC bin, equal to the minimum number of counts per bin across the distribution. Our final dataset comprises 16,100 score-balanced segmentations with reference GT.
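The score-balancing step above can be sketched as follows. This is our own illustrative implementation, not the authors' code: the function name, the fixed random seed and the use of `np.digitize` for binning are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def balance_by_dsc(dsc_scores, n_per_bin, n_bins=10):
    """Return sorted indices of a score-balanced subset: n_per_bin
    samples drawn at random from each of n_bins equal-width DSC bins
    spanning [0, 1]."""
    dsc_scores = np.asarray(dsc_scores)
    # interior bin edges [0.1, 0.2, ..., 0.9]; digitize maps each
    # score to a bin index in 0..n_bins-1
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_idx = np.digitize(dsc_scores, edges)
    keep = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_idx == b)
        keep.extend(rng.choice(members, size=n_per_bin, replace=False))
    return np.sort(np.array(keep))
```

With n_per_bin = 1,610 over 10 bins this yields the 16,100 score-balanced segmentations described in the text.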
From each segmentation we create 4 one-hot-encoded masks: masks 1 to 4 correspond to the classes BG, LVC, LVM and RVC respectively. Each voxel of the \(i^{th}\) mask is the vector [0, 0, 0, 0] when the voxel does not belong to the mask’s class, with the \(i^{th}\) element set to 1 otherwise. For example, the mask for LVC is [0, 0, 0, 0] everywhere except for voxels of the LVC class, which are given the value [0, 1, 0, 0]. This gives the network a greater chance to learn the relationships between the voxels’ classes and their locations. An example of the segmentation masks is shown in Fig. 1.
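The one-hot encoding described above can be sketched in NumPy as follows (an illustrative version; the function name and channels-last convention are our own):

```python
import numpy as np

def one_hot_masks(labelmap, n_classes=4):
    """Convert an integer labelmap (0=BG, 1=LVC, 2=LVM, 3=RVC) into
    n_classes one-hot channels stacked on the last axis, so each voxel
    becomes a 4-element vector such as [0, 1, 0, 0] for LVC."""
    return np.stack([(labelmap == c).astype(np.float32)
                     for c in range(n_classes)], axis=-1)
```

Applied to a 3D labelmap of shape [224, 224, 8], this produces the 4 mask channels that are concatenated with the image channel at training time.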
At training time, our data-generator re-samples the UKBB images and our segmentations to a consistent shape of [224, 224, 8, 5], making our network fully 3D with 5 data channels: the image and 4 segmentation masks. The images are also normalized such that the entire dataset falls in the range [0.0, 1.0].
For comparison and consistency, we use the same input data and network architecture for each of our experiments. We employ a 50-layer 3D residual network written in Python with the Keras library and trained on an 11 GB Nvidia GeForce GTX 1080 Ti GPU. Residual networks are advantageous as they allow the training of deeper networks by repeating smaller blocks. They benefit from skip connections that allow data to travel deeper into the network. We use the Adam optimizer with a learning rate of \(10^{-5}\) and decay of 0.005. Batch sizes are kept constant at 46 samples per batch. We run validation at the end of each epoch for model-selection purposes.
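A heavily simplified Keras sketch of such a 3D residual regression network follows. Only the input shape [224, 224, 8, 5], the 5 regression outputs, the MSE loss and the Adam learning rate come from the text; the depth, filter counts, pooling scheme and sigmoid output activation are our assumptions (the paper's actual network is a 50-layer residual architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_block(x, filters):
    """Basic 3D residual block: two convolutions plus a skip connection."""
    shortcut = x
    y = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv3D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:  # match channel count for the add
        shortcut = layers.Conv3D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def build_model(input_shape=(224, 224, 8, 5), n_outputs=5):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv3D(16, 3, padding="same", activation="relu")(inp)
    for filters in (16, 32, 64):
        x = res_block(x, filters)
        x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)  # keep slice depth
    x = layers.GlobalAveragePooling3D()(x)
    # 5 regression outputs: per-class DSC incl. background and whole-heart;
    # sigmoid keeps predictions in [0, 1] (our assumption)
    out = layers.Dense(n_outputs, activation="sigmoid")(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="mse", metrics=["mae"])
    return model
```

The model regresses the 5-vector of DSC values directly from the stacked image-plus-masks input.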
Experiments
Can we take advantage of a CNN’s inference speed to give fast and accurate predictions of segmentation quality? This is an important question for analysis pipelines which could benefit from the increased confidence in segmentation quality without compromising processing time. To answer this question we conduct the following experiments.
Experiment 1: Directly Predicting DSC. Is it possible to directly predict the quality of a segmentation given only the image-segmentation pair? In this experiment we calculate, per class, the DSC between our segmentations and the GT. These are used as training labels. We have 5 nodes in the final layer of the network, where the output \(X\) is \(\{X \in \mathbb{R}^{5} \mid X_i \in [0.0, 1.0]\}\). This vector represents the DSC per class, including background and whole-heart. We use mean-squared-error loss and report mean-absolute-error between the output and GT DSC. We split our data 80:10:10, giving 12,880 training samples and 1,610 samples each for validation and testing. Performing this experiment is costly as it requires a large manually-labeled dataset which is not readily available in practice.
Experiment 2: Predicting RCA Scores. Considering the promising results of the RCA framework [7, 8] in accurately predicting the quality of segmentations in the absence of large labeled datasets, can we use the predictions from RCA as training data to allow a network to give comparatively accurate predictions on a test-set? In this experiment, we perform RCA on all 16,100 segmentations. To ensure that we train on balanced scores, we again perform histogram binning, this time on the RCA scores, and take equal numbers from each bin. We finish with a total of 5,363 samples split into training, validation and test sets of 4,787, 228 and 228 respectively. The predictions per class are used as labels during training. Similar to Experiment 1, we obtain a single predicted DSC output for each class using the same network and hyper-parameters, but without the need for the large, often-unobtainable manually-labeled training set.
Fig. 2. Examples showing excellent prediction of Dice Similarity Coefficient (DSC) in Experiment 1. Quality increases from top-left to bottom-right. Each panel shows (left to right) the image, test-segmentation and reference GT.
3 Results
Results from Experiment 1 are shown in Table 1. We report mean absolute error (MAE) and standard deviations per class between reference GT and predicted DSC. Our results show that our network can directly predict whole-heart DSC from the image-segmentation pair with an MAE of 0.03 (SD = 0.04). We see similar performance on individual classes. Table 1 also shows MAE over the top and bottom halves of the GT DSC range, suggesting that the error is equally distributed over poor and good quality segmentations. For WH, we report that 72% of the data have an MAE of less than 0.05, with outliers (\(\mathrm{MAE} \ge 0.12\)) comprising only 6% of the data. Distributions of the MAEs for each class can be seen in Fig. 3. Examples of good and poor quality segmentations are shown in Fig. 2 with their GT and predictions. Results show excellent true-positive (TPR) and false-positive (FPR) rates on a whole-heart binary classification task with a DSC threshold of 0.70. The reported accuracy of 97% is better than the 95% reported with RCA in [8].
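The binary classification metrics used here (accuracy, TPR and FPR at a DSC threshold of 0.70) can be derived from predicted and reference DSC values as follows; this is an illustrative sketch and the function name is ours:

```python
import numpy as np

def binary_quality_accuracy(pred_dsc, true_dsc, threshold=0.70):
    """Accuracy, TPR and FPR for classifying segmentations as good
    (DSC >= threshold) vs poor, given predicted and reference DSC."""
    pred_good = np.asarray(pred_dsc) >= threshold
    true_good = np.asarray(true_dsc) >= threshold
    tp = np.sum(pred_good & true_good)    # correctly flagged good
    tn = np.sum(~pred_good & ~true_good)  # correctly flagged poor
    fp = np.sum(pred_good & ~true_good)   # poor passed as good
    fn = np.sum(~pred_good & true_good)   # good rejected as poor
    acc = (tp + tn) / pred_good.size
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return acc, tpr, fpr
```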
Fig. 3. Distribution of the mean absolute errors (MAE) for Experiments 1 (left) and 2 (right). Results are shown for each class: background (BG), left-ventricular cavity (LV), left-ventricular myocardium (LVM), right-ventricular cavity (RVC) and for the whole-heart (WH).
Our results for Experiment 2 are recorded in Table 1. It is expected that direct predictions of DSC from the RCA labels are less accurate than in Experiment 1. The reasoning is two-fold: first, the RCA labels are themselves predictions and retain inherent uncertainty; second, the training set here is much smaller than in Experiment 1. However, we report an MAE of 0.14 (SD = 0.09) for the WH case and 91% accuracy on the binary classification task. Distributions of the MAEs are shown in Fig. 3. LVM has a greater variance in MAE, which is in line with previous results using RCA [8]. Thus, the network would be a valuable addition to an analysis pipeline where operators can be informed of likely poor-quality segmentations, along with some confidence interval, in real-time.
On average, the inference time for each network was of the order of 600 ms on CPU and 40 ms on GPU. This is more than 1,000 times faster than RCA (around 660 s per case) on CPU, and over 10,000 times faster on GPU, whilst maintaining good accuracy. In an automated image analysis pipeline, this method would deliver excellent performance at high speed and at large scale. When paired with a real-time segmentation method, it would be possible to provide real-time feedback during image acquisition on whether an acquired image is of sufficient quality for the downstream segmentation task.
4 Conclusion
Ensuring the quality of an automatically generated segmentation in a deployed image analysis pipeline in real-time is challenging. We have shown that we can employ Convolutional Neural Networks to tackle this problem with great computational efficiency and good accuracy.
We recognize that our networks are prone to learning features specific to assessing the quality of Random Forest segmentations. We can build on this by training the network with segmentations generated from an ensemble of methods. However, we must reiterate that the purpose of the framework in this study is to give an indication of the predicted quality and not a direct one-to-one mapping to the reference DSC. Currently, these networks will correctly predict whether a segmentation is ‘good’ or ‘poor’ at some threshold, but will not confidently distinguish between two segmentations of similar quality.
Our trained CNNs are insensitive to small regional or boundary differences in labelmaps which are otherwise of good quality. Thus, they cannot be used to assess the quality of a segmentation at fine scale. Again, this may be improved by a more diverse and granular training set. The labels for training the network in Experiment 1 are not easily available in most cases. However, by performing RCA, one can automatically obtain training labels for the network in Experiment 2, and this could be applied to segmentations generated with other algorithms. The cost of using data obtained with RCA is an increase in MAE. This is reasonable compared to the effort required to obtain a large, manually-labeled dataset.
Notes
- 1.
UK Biobank Resource under Application Number 2964.
References
Petersen, S.E., et al.: Reference ranges for cardiac structure and function using cardiovascular magnetic resonance (CMR) in Caucasians from the UK Biobank population cohort. J. Cardiovasc. Magn. Reson. 19(1), 18 (2017)
Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. 1, 1–14 (2016)
Farzi, M., Pozo, J.M., McCloskey, E.V., Wilkinson, J.M., Frangi, A.F.: Automatic quality control for population imaging: a generic unsupervised approach. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 291–299. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_34
Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross validation framework to choose amongst models and datasets for transfer learning. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) ECML PKDD 2010. LNCS (LNAI), vol. 6323, pp. 547–562. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_35
Fan, W., Davidson, I.: Reverse testing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2006, p. 147. ACM Press, New York (2006)
Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., Grady, L.: Evaluating segmentation error without ground truth. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 528–536. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33415-3_65
Valindria, V.V., et al.: Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE Trans. Med. Imaging 36, 1597–1606 (2017)
Robinson, R., et al.: Automatic quality control of cardiac MRI segmentation in large-scale population imaging. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10433, pp. 720–727. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66182-7_82
Acknowledgements
RR is funded by the KCL & Imperial EPSRC CDT in Medical Imaging (EP/L015226/1) and GlaxoSmithKline; VV by the Indonesia Endowment for Education (LPDP) Indonesian Presidential PhD Scholarship; KF is supported by The Medical College of Saint Bartholomew’s Hospital Trust. AL and SEP acknowledge support from the NIHR Barts Biomedical Research Centre and an EPSRC programme grant (EP/P001009/1). SN and SKP are supported by the Oxford NIHR BRC and the Oxford British Heart Foundation Centre of Research Excellence. This project was supported by the MRC (grant number MR/L016311/1). NA is supported by a Wellcome Trust Research Training Fellowship (203553/Z/Z). The authors SEP, SN and SKP acknowledge the British Heart Foundation (BHF) (PG/14/89/31194). BG received funding from the ERC under Horizon 2020 (grant agreement No. 757173, project MIRA, ERC-2017-STG).
Author information
Authors and Affiliations
BioMedIA Group, Department of Computing, Imperial College London, London, UK
Robert Robinson, Ozan Oktay, Wenjia Bai, Vanya V. Valindria, Bernhard Kainz, Daniel Rueckert & Ben Glocker
Research & Development, GlaxoSmithKline, Brentford, UK
Chris Page
NIHR Barts Biomedical Research Centre, Queen Mary University of London, London, UK
Mihir M. Sanghvi, Nay Aung, José M. Paiva, Filip Zemrak, Kenneth Fung, Aaron M. Lee & Steffen E. Petersen
Barts Heart Centre, Barts Health NHS Trust, London, UK
Mihir M. Sanghvi, Nay Aung, Filip Zemrak, Kenneth Fung, Aaron M. Lee & Steffen E. Petersen
Radcliffe Department of Medicine, University of Oxford, Oxford, UK
Elena Lukaschuk, Valentina Carapella, Young Jin Kim, Stefan K. Piechnik & Stefan Neubauer
Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
Young Jin Kim
Corresponding author
Correspondence to Robert Robinson.
Editor information
Editors and Affiliations
University of Leeds, Leeds, UK
Alejandro F. Frangi
King’s College London, London, UK
Julia A. Schnabel
University of Pennsylvania, Philadelphia, PA, USA
Christos Davatzikos
Universidad de Valladolid, Valladolid, Spain
Carlos Alberola-López
Queen’s University, Kingston, ON, Canada
Gabor Fichtinger
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Robinson, R. et al. (2018). Real-Time Prediction of Segmentation Quality. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. MICCAI 2018. Lecture Notes in Computer Science, vol 11073. Springer, Cham. https://doi.org/10.1007/978-3-030-00937-3_66
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00936-6
Online ISBN: 978-3-030-00937-3