
Authors

Peijie Qiu, Satrajit Chakrabarty, Phuc Nguyen, Soumyendu Sekhar Ghosh, Aristeidis Sotiras

Abstract

Deep learning has achieved state-of-the-art performance in automated brain tumor segmentation from magnetic resonance imaging (MRI) scans. However, the unexpected occurrence of poor-quality outliers, especially in out-of-distribution samples, hinders their translation into patient-centered clinical practice. Therefore, it is important to develop automated tools for large-scale segmentation quality control (QC). However, most existing QC methods have targeted cardiac MRI segmentation, which involves a single modality and a single tissue type. Importantly, these methods only provide a subject-level segmentation-quality prediction, which cannot inform clinicians where the segmentation needs to be refined. To address this gap, we propose a novel network architecture called QCResUNet that simultaneously produces segmentation-quality measures as well as voxel-level segmentation error maps for brain tumor segmentation QC. To train the proposed model, we created a wide variety of segmentation-quality results by using i) models trained for a varying number of epochs with different modalities; and ii) a newly devised segmentation-generation method called SegGen. The proposed method was validated on a large public brain tumor dataset with segmentations generated by different methods, achieving high performance on both the prediction of the segmentation-quality metric and the voxel-wise localization of segmentation errors. The implementation will be publicly available at https://github.com/peijie-chiu/QC-ResUNet.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_17

SharedIt: https://rdcu.be/dnwC1

Link to the code repository

https://github.com/peijie-chiu/QC-ResUNet

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors describe a neural network-based method for estimating the quality of a segmentation produced by a convolutional neural network such as a U-Net. The proposed architecture combines a ResNet and a U-Net to simultaneously predict an image-level estimate of segmentation quality and the location of segmentation errors given an input image and a segmentation mask, where both pathways share the same encoder. The method is trained and evaluated on image-segmentation pairs with simulated segmentation errors, with the results showing that the proposed method can correctly predict the quality of the segmentation in this setting.
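    For concreteness, a minimal sketch of such a shared-encoder, dual-head design in PyTorch is shown below. This is an illustrative toy model with assumed channel sizes and depths, not the authors' QCResUNet implementation:

```python
# Toy sketch of a shared-encoder, dual-head QC network: one head regresses a
# subject-level quality score (e.g., DSC), the other decodes a voxel-level
# segmentation-error map. Channel sizes and depths are illustrative only.
import torch
import torch.nn as nn

class DualHeadQCNet(nn.Module):
    def __init__(self, in_channels=5):  # e.g., 4 MRI modalities + 1 segmentation mask
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Regression head: global pooling -> predicted DSC in [0, 1]
        self.quality_head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid()
        )
        # Decoder head: upsample back to input resolution -> error-map logits
        self.error_head = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 2, stride=2),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.quality_head(z), self.error_head(z)

x = torch.randn(1, 5, 64, 64, 64)  # image and mask stacked along channels
dsc_pred, error_logits = DualHeadQCNet()(x)
```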

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors address an important challenge in translating automatic image processing algorithms into the clinical workflow. Since no algorithm is perfect, a key challenge is to automatically detect failure cases without requiring the radiologist to check every image, a burden that would reduce the confidence in and usefulness of automatic image analysis tools.

    The authors attempt to solve this problem by formulating segmentation quality as a regression problem and training a network to simultaneously predict the expected Dice score of a given segmentation and image, along with an error map. Having this information would be very beneficial in practice, because it allows the preselection of cases for manual review and also allows defining regions of interest that require the radiologist's attention.

    The method is well described, with a detailed explanation of the network architecture, loss function, and data-generation process using publicly available data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is that it does not consider under which circumstances current neural network-based approaches fail in practice. To build a suitable training set, it is important to understand the failure cases that occur in clinical practice, but no such analysis was done. From my personal experience, segmentation does not fail randomly; failures are usually due to out-of-distribution samples relative to the training set, caused by the use of different scanners and scanning protocols, differences between the training and target populations in terms of age, gender, ethnicity, and disease progression, and the presence of imaging artifacts. To create a suitable training set, it is necessary to simulate not only segmentation errors, but also images that would lead to segmentation errors. However, the proposed method was trained and evaluated only on images that are known to be segmented correctly by neural network-based approaches, as seen in the BraTS challenge, where competing methods produce very good results. This dataset is not representative of the failure cases seen in clinical practice and is not suitable to train and/or evaluate a quality control method.

    The authors partially addressed the challenge of out-of-distribution samples in the evaluation by also fabricating segmentation errors using a different method than the one used to produce the training set. However, this is still limited to the generation of test segmentations, and the input images are still not representative.

    The proposed method for detecting segmentation errors is very closely related to current approaches for deriving a segmentation. In essence, it is a U-Net whose encoding pathway is shared with a ResNet for predicting Dice scores. In contrast to segmentation networks, the network is trained on image-segmentation pairs to predict segmentation errors, instead of learning an image-to-segmentation mapping. However, since the network architecture and training procedure are very similar, it is quite likely that the segmentation network and the quality control network will have similar failure cases. In other words, quality control might fail on exactly the cases where the segmentation network fails, which are the cases where quality control is particularly important. This limitation is speculative, and it might well be that the network is still able to detect segmentation errors on failure cases. However, this needs to be evaluated using input images for which current segmentation approaches would fail, not images that fail only artificially because the networks “weren’t trained correctly” (in the experiments, the networks were intentionally trained to fail in order to produce failure cases, but such networks wouldn’t be used in clinical practice, which makes the evaluation not comparable to clinical practice).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is well described, and all the information that can be expected for a conference paper is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I think the method is very interesting and a good attempt in the right direction. Formulating segmentation quality as a regression problem that can be solved with deep learning seems reasonable in itself. However, as with any learning-based approach, the resulting model depends strongly on the provided training data, and generating training data for the out-of-distribution (OOD) case is difficult. You could always argue that, if you have OOD samples available for training the QC network, you could have used the same samples simply to improve the segmentation network, which makes those samples no longer OOD. I think segmentation prediction and quality estimation pose a bit of a chicken-and-egg problem that cannot easily be solved with a neural network, but requires a different test procedure. The reverse classification accuracy framework provides a good alternative, because it reverses the training/testing problem. You need ground truth for evaluation, but you don't have ground truth at test time. So when you reverse the problem and use images produced at test time for training, you can use your ground-truth training set for testing.
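    For reference, the reverse classification accuracy (RCA) idea mentioned above can be summarized in the short sketch below; `train_fn`, `dice_fn`, and `reference_db` are hypothetical placeholders standing in for an actual single-image training routine, Dice computation, and reference database, not a real RCA implementation:

```python
# Illustrative sketch of reverse classification accuracy (RCA): train a
# "reverse" segmenter on the test image with its predicted mask as pseudo
# ground truth, then score it on a reference database with known labels.
# The best score serves as a proxy for the test segmentation's quality.
def rca_quality_estimate(test_image, predicted_mask, reference_db,
                         train_fn, dice_fn):
    reverse_model = train_fn(test_image, predicted_mask)  # single-image training
    return max(dice_fn(reverse_model(img), gt) for img, gt in reference_db)
```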

    • In Fig. 1, is the entire image fed into the network or only patches?
    • Are the differences in means in Table 1 statistically significant? The reported standard deviation is about 6 times higher than the difference in means for MAE compared to ResNet50. It is mentioned that p-values were computed, but no p-values are reported in the paper.
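    One plausible way to answer the significance question — an assumption, since the paper's actual test is not reported here — is a paired test on per-subject absolute errors, e.g.:

```python
# Hypothetical illustration: compare per-subject absolute DSC prediction
# errors of two models with a paired Wilcoxon signed-rank test.
# The error arrays below are random placeholders, not real results.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
abs_err_qcresunet = rng.uniform(0.0, 0.10, size=100)                 # placeholder
abs_err_resnet50 = abs_err_qcresunet + rng.normal(0.01, 0.02, 100)   # placeholder
stat, p = wilcoxon(abs_err_qcresunet, abs_err_resnet50)
print(f"Wilcoxon signed-rank p-value: {p:.4f}")
```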
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors address a very important problem for the clinical adoption of neural networks for automated image segmentation. However, the evaluation of the approach might not represent clinical failure cases, which impedes an assessment of the clinical usefulness of the described approach.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision

    I think it’s important to show that the model for estimating where segmentation errors occur in the image is better than training a very good segmentation network and calculating the diff of the test segmentation and the segmentation produced by the model. In other words: you could simply train a U-Net like the nnU-Net on the BraTS data for the right number of epochs, use this model to predict what the segmentation should be, and compare it against the segmentation under test. In your setup, such an approach is expected to perform well, because we know that U-Nets can segment BraTS, and a U-Net wouldn’t make the same errors that were intentionally created to build the training and test sets. But we also know that this method is not suitable for detecting errors made by a network that was trained to the best of our abilities. To show the usefulness of a quality control method, you need to show that it is able to detect errors of networks that were trained to the best of our abilities, because these are the errors we are after. One way to test this would be to use the test set of BraTS and evaluate the QC network only on the segmentation masks produced by the best models, leaving the simulated errors out. Or use a different dataset where the best models would fail. The QC network itself is also “just” a U-Net and might fail on some cases, and it is important to show that it can perform well on cases where a regular segmentation U-Net would fail. But the described tests don’t cover this scenario; they only show that the method detects quality issues a poorly trained U-Net would produce, which are not necessarily the same errors a well-trained U-Net would make. I agree that this is a chicken-and-egg problem, but this is also why quality control is so difficult and so important.
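    The diff-based baseline the reviewer proposes could be sketched as follows for binary masks; the function and its inputs are hypothetical, with the "reference" prediction assumed to come from a well-trained segmentation model:

```python
# Sketch of the reviewer's proposed baseline: take a well-trained reference
# model's prediction, use its disagreement with the candidate mask as the
# error map, and use their overlap as a Dice-style quality estimate.
import numpy as np

def diff_baseline(reference_pred: np.ndarray, candidate_seg: np.ndarray):
    error_map = reference_pred != candidate_seg  # voxel-level disagreement
    intersection = np.logical_and(reference_pred, candidate_seg).sum()
    dice = 2.0 * intersection / (reference_pred.sum() + candidate_seg.sum() + 1e-8)
    return error_map, dice
```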



Review #2

  • Please describe the contribution of the paper

    This work aims to improve the automated quality control of 3-class brain tumor segmentations. To that end, the authors train a model that simultaneously predicts both subject- and voxel-level segmentation quality metrics. The model (QCResUNet) is trained on image-segmentation pairs from multiple MRI modalities generated by models of varying quality. Data from the BraTS challenge are used for training and evaluation. Evaluation is conducted on both in- and out-of-distribution samples by calculating Pearson’s r and MAE between the predicted DSC and the ground-truth DSC from manually annotated segmentations. This work also introduces a new segmentation generation method (SegGen), which generates augmentations of image-segmentation pairs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors demonstrate superior performance on subject-level predictions of segmentation quality compared to several comparative models. The work also adds an important, clinically relevant voxel-level evaluation of segmentation quality. The authors provide a well-written, thorough literature review which sets their work in context. Architectures, hyperparameters, datasets, and losses are all explained clearly and in detail which supports the reproducibility of the work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Using the SegGen regime, the authors produce only 3 augmented segmentations of each modality at each quality level. Did the authors see a benefit to this approach over employing a standard continuous augmentation approach on the image-segmentation pairs?

    2. For each subject, is each of the 4 MRI modalities segmented with each of the 5 nnUNet models at different checkpoints, and then augmented 3 times? It’s unclear how we get 48,000, 12,000, and 15,060 samples in the training sets.

    3. The voxel-level segmentation error mean DSC of 0.834 for in-distribution samples could be considered insufficient for brain tumor segmentations, where the complexity and heterogeneity of the tissue make high accuracy important. Do the authors have any further intuition about how such a model would be perceived in a clinical setting?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors provide details of the training regimes, architectures, and hyperparameters used in their work. Datasets have been described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. A limitation of this work is the need for all 4 MRI modalities as input. Have the authors evaluated the performance of the model in the scenario where fewer modalities are available? Do the authors think that the model could be extended to be made robust to missing modalities?

    2. An observation to add is that on out-of-distribution samples where the model should be less confident, we see QCResUNet correctly underpredicting the segmentation quality, similar to the ResNet models in Fig. 3.

    • Page 2: line 1 - capitalise Such.
    • Page 2: line 21 - typo “welch-characterized” -> “well-characterized”
    • Page 5: Data Generation - it would be useful to know how many checkpoints were sampled for each of the 5 nnUNet models.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper builds on previous work whilst providing a good evaluation against the state of the art. Choices were justified and the work should be reproducible.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a network architecture called QCResUNet that simultaneously produces segmentation-quality measures and voxel-level segmentation error maps for brain tumor segmentation quality control. QCResUNet is a 3D encoder-decoder architecture, composed of a ResNet-34 encoder for DSC prediction and a decoder for segmentation-error-map prediction. The authors validate their method on the BraTS 2021 challenge database and show statistically significant improvements over baseline methods from the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method provides both subject-level segmentation-quality prediction and voxel-level localization of segmentation failures, which is important for clinicians to guide the refinement of predicted segmentations. The authors also propose a new data-generation approach, called SegGen, that generates a wide range of segmentations of varying quality, ensuring unbiased model training and testing. The method is adapted to multimodal images and compared with baseline methods from the literature (UNet, ResNet-34, ResNet-50). Statistical analysis is performed to assess the performance of the subject-level segmentation-quality prediction in terms of the Pearson coefficient r and MAE between the predicted DSC and the ground-truth DSC.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The error localization is not compared with other existing methods. More experiments could be conducted to compare this method with quality-control methods based on uncertainty quantification.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    A public dataset (BraTS 2021) was used for the validation. The code and pre-trained model will be made available publicly.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It could be interesting to compare the proposed method with some recent quality-control methods based on uncertainty quantification and Bayesian deep learning, both in the state-of-the-art section and in the results, especially for the evaluation of the error localization.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a new method providing subject-level segmentation-quality prediction and voxel-level localization of segmentation failures, which is important for clinicians to guide the refinement of predicted segmentations. However, the paper suffers from a lack of comparison with other quality-control methods, especially methods based on uncertainty quantification.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    This paper should be interesting to the MICCAI audience. My main concerns have been addressed in the rebuttal.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received rather mixed reviews. Its strong points are the clinical relevance of and need for such a tool to assess segmentation quality, as well as being a clearly written paper. The results also seem quite encouraging. On the other hand, there are several criticisms regarding the experimental setup, most notably by R#1, who points out the omission of out-of-distribution samples; several reviewers also noted the lack of comparison to state-of-the-art approaches. These points would need to be addressed in the rebuttal.




Author Feedback

We appreciate the positive feedback on the novelty and importance of the work as well as the clarity of the presentation. We address the reviewers’ concerns below. Note that [] refers to references in the paper, while {} refers to references in the rebuttal.

Regarding the omission of OOD samples, we agree with R#1 that this is a challenging “chicken-and-egg problem”. Contrary to previous methods (e.g., 2018 MICCAI [10]) that overlooked OOD evaluation, we partially addressed it by leveraging segmentations produced by DeepMedic. To further assess the performance of our method on OOD samples, we conducted preliminary experiments on a private brain tumor dataset comprising 50 samples. Following the data-generation protocol outlined in the paper, we obtained an additional 1358 segmentations, with 1108 produced by nnUNet and SegGen after resampling and 250 generated by DeepMedic. For subject-level DSC prediction, QCResUNet showed superior performance (r=0.947; MAE=0.073) compared to UNet (r=0.822; MAE=0.115), ResNet-34 (r=0.930; MAE=0.076), and ResNet-50 (r=0.920; MAE=0.083). The proposed method also achieved a median DSC of 0.814 on voxel-level segmentation error localization. Despite a slight drop in performance on the OOD dataset, the proposed method demonstrated good generalizability.

Regarding R#1’s further concern that the dataset is not representative of the failure cases seen in clinical practice, the BraTS data come from the clinic, resulting in a heterogeneous dataset consisting of data from multiple sites with different levels of quality and different protocols. This is in contrast to the UK Biobank dataset used in [6].

Regarding R#3’s main concern about the absence of comparisons with state-of-the-art segmentation QC methods, we would like to clarify that we have already compared the performance of the proposed QCResUNet with the model used in [10] (i.e., ResNet-34). Additionally, we performed preliminary experiments to compare the proposed method with the RCA framework (mentioned by R#1). The RCA method consistently showed poor performance on the BraTS in-sample dataset (r=0.467; MAE=0.274), the BraTS out-of-sample dataset (r=0.694; MAE=0.191), and the additional private dataset (r=0.208; MAE=0.317). The proposed method outperformed the RCA framework, increasing the Pearson r correlation by 106%, 39%, and 355% on the in-sample BraTS dataset, out-of-sample BraTS dataset, and private dataset, respectively. The failure of the RCA method in brain tumor segmentation QC may be attributed to the high heterogeneity in brain tumor appearance, location, and shape. The RCA method [7, 8] has previously been evaluated on cardiac MRI segmentation QC, which involves simpler anatomy, binary segmentation, and less variability in appearance and location. These differences might explain the low RCA performance that we observed.

Lastly, the uncertainty-estimation-based methods (mentioned by R#3) can only provide very coarse segmentation error localization {1}, always occurring on the segmentation boundary where the model exhibits the highest level of uncertainty. As such, they cannot identify false-positive errors as our method can. Moreover, those methods need to find a proxy measure that relates the estimated uncertainty to the segmentation-quality metric, which adds a layer of complexity because not all proxy measures have shown a statistically strong relation to the quality metric {1}.

Regarding R#2’s request for clarifications about SegGen, SegGen generated 3 augmented segmentations per subject so that we could obtain a dataset with diverse segmentation qualities. Due to computational constraints, continuous sampling was not possible. However, we included standard continuous augmentation of image-segmentation pairs when training QCResUNet. We acknowledge the dependency on 4 MRI modalities as a limitation. We will include this in the discussion of the paper and address it in future work.

{1} doi:10.3389/fnins.2020.00282
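For completeness, the subject-level evaluation criteria quoted throughout the rebuttal (Pearson's r and MAE between predicted and ground-truth DSC) amount to the following computation. This is a minimal sketch assuming SciPy is available, not code from the paper:

```python
# Subject-level QC evaluation: Pearson's r and mean absolute error
# between predicted and ground-truth DSC values.
import numpy as np
from scipy.stats import pearsonr

def qc_metrics(dsc_pred, dsc_true):
    dsc_pred, dsc_true = np.asarray(dsc_pred), np.asarray(dsc_true)
    r, _ = pearsonr(dsc_pred, dsc_true)
    return r, np.abs(dsc_pred - dsc_true).mean()
```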




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewer #1 brings up a good point regarding testing the QC network only on segmentation masks produced by the best segmentation models and testing on out-of-distribution samples. These are extremely valid points that would need to be further addressed in subsequent studies in order to demonstrate the model’s performance in ideal scenarios with a separate test set. However, given the quality of the paper and the fact that it addresses a hot topic, I also feel this paper would bring out interesting points of discussion at the conference; it covers a topic of great interest, and I tend to be more on the positive side.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I think this paper is very interesting. Although the method is based on a relatively simple multitask learning framework, the direct prediction of Dice scores and segmentation error maps without ground truths is creative and interesting. The way of generating the training segmentation maps is also reasonable. On the other hand, I agree with reviewer #1 that the framework may be improved by considering more realistic scenarios in which segmentation is prone to failure. While the framework can still be improved, its current form should be interesting to the MICCAI audience and may stimulate discussion and new ideas.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In this paper, the authors propose to use a neural network to estimate the quality of segmentations automatically. The problem is very clinically relevant, and the paper is well written and easy to implement. The main problem is still the experimental design. I agree with R1 that it is more important to show that “the model for estimating where segmentation errors occur in the image is better than training a very good segmentation network and calculating the diff of the test segmentation and the segmentation produced by the model.” How could network #2 be powerful enough to identify the errors of network #1 if both are trained well? And if it could, can we simply use network #2 for segmentation? Then which network would be able to identify network #2’s errors? I also agree that this is a chicken-and-egg problem, but the rebuttal and paper did not address it.


