
Authors

Kaisar Kushibar, Victor Campello, Lidia Garrucho, Akis Linardos, Petia Radeva, Karim Lekadir

Abstract

Uncertainty estimation in deep learning has become a leading research field in medical image analysis due to the need for safe utilisation of AI algorithms in clinical practice. Most approaches for uncertainty estimation require sampling the network weights multiple times during testing or training multiple networks. This leads to higher training and testing costs in terms of time and computational resources. In this paper, we propose Layer Ensembles, a novel uncertainty estimation method that uses a single network and requires only a single pass to estimate the epistemic uncertainty of a network. Moreover, we introduce an image-level uncertainty metric, which is more beneficial for segmentation tasks compared to the commonly used pixel-wise metrics such as entropy and variance. We evaluate our approach on 2D and 3D, binary and multi-class medical image segmentation tasks. Our method shows competitive results with state-of-the-art Deep Ensembles, requiring only a single network and a single pass.
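
A minimal PyTorch sketch of this idea (illustrative only, not the authors' exact architecture; the official implementation is in the code repository linked below). Segmentation heads attached at several depths produce an ensemble of outputs in one forward pass, and the spread across heads serves as an epistemic uncertainty proxy:

    import torch
    import torch.nn as nn

    class LayerEnsembleNet(nn.Module):
        """Toy Layer Ensembles-style network: one segmentation head per depth."""
        def __init__(self, in_ch=1, n_classes=2, widths=(16, 32, 64)):
            super().__init__()
            self.blocks = nn.ModuleList()
            self.heads = nn.ModuleList()
            ch = in_ch
            for w in widths:
                self.blocks.append(nn.Sequential(
                    nn.Conv2d(ch, w, 3, padding=1), nn.BatchNorm2d(w), nn.ReLU()))
                self.heads.append(nn.Conv2d(w, n_classes, 1))  # head at this depth
                ch = w

        def forward(self, x):
            outputs = []
            for block, head in zip(self.blocks, self.heads):
                x = block(x)
                outputs.append(head(x))  # one ensemble member per depth
            return outputs  # all logits come from a single forward pass

    net = LayerEnsembleNet()
    outs = net(torch.randn(2, 1, 64, 64))
    probs = torch.stack([o.softmax(dim=1) for o in outs])  # (heads, B, C, H, W)
    mean_prob = probs.mean(dim=0)            # ensemble prediction
    pixel_unc = probs.var(dim=0).sum(dim=1)  # pixel-wise epistemic proxy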

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_49

SharedIt: https://rdcu.be/cVVp4

Link to the code repository

https://github.com/pianoza/LayerEnsembles

Link to the dataset(s)

https://www.ub.edu/mnms

https://www.bcdr.eu/information/about


Reviews

Review #2

  • Please describe the contribution of the paper

The authors proposed an uncertainty quantification method that estimates segmentation uncertainty in a single pass. They conducted experiments on two datasets and compared the proposed algorithm with the Deep Ensembles method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Most sections of the paper are well written and easy to follow.
    2. The idea of using layer ensembles to estimate uncertainty is interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. A computational performance analysis is needed to demonstrate the claimed computational advantages over Deep Ensembles.
    2. It is unfair to apply Simultaneous Truth And Performance Level Estimation (STAPLE) only to Layer Ensembles and not to Deep Ensembles in the quantitative evaluation.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Seems reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. A computational analysis (e.g. number of floating-point operations, computation time, and speed-up) is needed to show the efficiency of LE vs. DE; a minimal timing sketch follows this list.
    2. It is unfair to apply STAPLE only to LE.
    3. In the “Ensembles of networks of different depths” part of the Methodology, the authors claim that LE is equivalent to DE, which is not well supported. More evidence or theory is needed to verify this claim.
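
    A minimal timing sketch for the runtime comparison requested in point 1 (illustrative only: the models are stand-ins, not the paper's networks, and seconds_per_batch is a hypothetical helper):

        import time
        import torch

        def seconds_per_batch(models, batch, n_warmup=3, n_runs=20):
            with torch.no_grad():
                for _ in range(n_warmup):   # warm-up runs, excluded from timing
                    for m in models:
                        m(batch)
                start = time.perf_counter()
                for _ in range(n_runs):
                    for m in models:        # DE runs several models; LE runs one
                        m(batch)
            return (time.perf_counter() - start) / n_runs

        x = torch.randn(8, 1, 128, 128)
        le = [torch.nn.Conv2d(1, 2, 3, padding=1)]                    # stand-in single network
        de = [torch.nn.Conv2d(1, 2, 3, padding=1) for _ in range(5)]  # 5-member DE
        print(seconds_per_batch(le, x), seconds_per_batch(de, x))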
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The idea of using layer ensembles for uncertainty quantification is interesting.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

The paper proposes a method for lowering the cost of uncertainty estimation methods that are based on network weight sampling by introducing layer ensembles: instead of individual networks, an ensemble can be built from a single network's different depths. Given the high sampling cost of the state-of-the-art uncertainty estimation method, standard ensembles, this work is very valuable.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and easy to understand.
    • The speed-up of ensembles for uncertainty estimation is highly relevant, given that they are frequently used and are the state of the art in the field.
    • The motivation of the method is clear and well explained.
    • The results are promising.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The new metric is definitely on point for analysing the task at hand and yields a good measure for comparing different methods; however, it is not the first image-level metric of this kind for uncertainty estimation, and this claim needs to be refined. See the area under the sparsification error curve introduced in “Uncertainty estimates and multi-hypotheses networks for optical flow” (a sketch of that metric follows this list).
    • Epistemic uncertainty is not predictive uncertainty; it is rather empirically computed. This needs to be corrected throughout the manuscript.
    • The advantages of the method over multi-headed networks are not clear, given that multi-headed networks are also lightweight, adding only a small overhead at the latest layers. See the paper mentioned above.
    • The authors strongly correlate the easiness of a sample with high confidence. I wonder if this is a general mistake in the uncertainty estimation community. I believe not all easy samples should yield low uncertainty: an easy sample may appear at test time yet be totally new to the network, which has never seen such an image before (sample novelty). Correlating confidence with error is, in my opinion, the better framing.
    • I believe the compared methods should have an equal number of ensemble members for a fairer comparison.
    • The idea in the paper is very novel and worth testing in different scenarios, including large natural-image benchmarks where standard ensembles still dominate. It would be interesting to see a comparison in this regard, showing how the method generalises to different domains.
    • A comparison to multi-headed networks is missing from the tables, especially for runtime, since I believe that building ensembles on parts of the network might still take more time than one forward pass of a multi-headed network.
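
    A sketch of the area under the sparsification error curve (AUSE) referenced in the first point, as introduced by Ilg et al. (2018); the implementation and names below are illustrative, not taken from that paper:

        import numpy as np

        def sparsification_curve(errors, order, fractions):
            # Remove the first fraction f of pixels in `order`; average the rest.
            ranked = errors[order]
            n = len(ranked)
            return np.array([ranked[int(f * n):].mean() for f in fractions])

        def ause(uncertainty, errors, steps=20):
            u, e = uncertainty.ravel(), errors.ravel().astype(float)
            fractions = np.linspace(0.0, 0.99, steps)
            s = sparsification_curve(e, np.argsort(-u), fractions)  # by uncertainty
            o = sparsification_curve(e, np.argsort(-e), fractions)  # oracle: by error
            # Area between the curves: 0 means uncertainty ranks errors perfectly.
            return np.trapz(s - o, fractions)

        err = (np.random.rand(128, 128) < 0.1).astype(float)  # per-pixel 0/1 error
        unc = err + 0.2 * np.random.rand(128, 128)            # noisy but informative
        print(ause(unc, err))  # close to 0 -> well-ranked uncertainty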
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper is reproducible to a large extent.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The idea presented in the paper is very interesting and finally challenges standard ensembles, which are the state of the art for uncertainty estimation in segmentation. For this reason, in order to show the benefit of the proposed approach convincingly, I would like to see more comprehensive comparisons to multi-headed networks as well as runtime comparisons. This would strengthen the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although some comparisons are missing, I believe that the work as it stands is very insightful for the MICCAI community as a future research direction in this relevant field of uncertainty estimation.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #5

  • Please describe the contribution of the paper

This paper proposes a new measure of uncertainty to evaluate segmentation at the image level.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The uncertainty assessment only requires single-pass testing.
    • The idea of ensembling the outputs of multiple layers is newly proposed.
    • The method is validated on one 2D dataset and one 3D dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The rationale for using multi-layer outputs needs further justification. Unlike deep ensembling, which uses only the estimate from the last layer of each network, the proposed method considers estimates from internal layers. Note that internal layers are classifiers determined by low-level (or low-scale) features, so the method gives low-level features the same weight as high-level features in determining segmentation confidence or uncertainty. It is still questionable how much low-level features can be used to determine uncertainty. As shown in the experiment with more difficult segmentation, more layers are needed; this can be a potential limitation for extremely challenging segmentation tasks.

    • Significance of image-level uncertainty: as uncertainty evaluation is mostly used to highlight challenging regions, it is not clear what the specific application of image-level uncertainty is.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The work is based on publicly available data, and the code will be disclosed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Other noise corruptions in the segmentation-difficulty evaluation: in addition to Gaussian noise, the method could be further evaluated with convolution by a filter kernel, or by intentionally corrupting the boundary region to mimic more challenging segmentation scenarios with lower uncertainty.
    • As another state-of-the-art method, it would be great to also compare with MC Dropout (a minimal sketch follows this list).
    • The randomisation of data in testing could be improved by cross-validation.
    • In the MnM results of Fig. 3, there is a range (0 to 0.1) where DE-LV is lower than LE-LV. It would be great to add more discussion or justification of this issue.
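
    A minimal sketch of the suggested MC Dropout baseline (Gal and Ghahramani, 2016): dropout stays active at test time and several stochastic forward passes are averaged; the toy model below is illustrative, not the paper's network:

        import torch
        import torch.nn as nn

        def mc_dropout_predict(model, x, n_samples=10):
            model.eval()
            for m in model.modules():  # re-enable only dropout for stochastic inference
                if isinstance(m, (nn.Dropout, nn.Dropout2d)):
                    m.train()
            with torch.no_grad():
                probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
            return probs.mean(dim=0), probs.var(dim=0)  # prediction, epistemic proxy

        toy = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                            nn.Dropout2d(0.5), nn.Conv2d(8, 2, 1))
        mean_p, var_p = mc_dropout_predict(toy, torch.randn(1, 1, 32, 32))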
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The justification for using low-level features to determine uncertainty is the major factor affecting the overall score.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors propose a method to estimate uncertainty in deep learning for segmentation. The paper is well written. The uncertainty estimation method proposed by the authors is novel, and its usefulness is corroborated by the experiments performed on two distinct datasets. The reviewers made constructive comments regarding how to fine-tune and shape the messages of the paper; I recommend incorporating them in the final version. Please also note that in Table 1 there is no point in reporting the means with more precision (more digits) than the standard deviations.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the reviewers for appreciating the novelty and value of our article. We address the comments of the reviewers to further strengthen the manuscript. The main concerns were related to comparisons of LE with the multi-head (MH) approach.

R4.3: The advantages of LE over MH are not clear. A: We agree that both LE and MH are lightweight and single-pass, but they are structurally different. LE observes multiple levels of a network and combines information related to generalisation (from shallow layers) and memorisation (from deep layers) [3] “DL through the lens of example difficulty”, whereas MH combines outputs from the same depth - equally expressive networks with the same number of parameters. This property of LE helps to detect unusual cases where the disagreement between sequential outputs is high; the high correlation of the proposed AULA with segmentation metrics supports this (an illustrative sketch of an AULA-style score follows this feedback). In terms of segmentation performance, LE and MH are similar. BCDR-DSC: LE 87.2±.08 vs MH 86.5±.09; MnM-DSC: LE 90.3±.10 vs MH 89.2±.13. However, the NLL of MH is worse than that of LE and similar to Plain. BCDR-NLL: LE 0.30±.25, MH 2.19±1.31, Plain 2.31±1.35; MnM-NLL: LE 0.17±.37, MH 0.23±.58, Plain 0.18±.41. These observations will be included in the paper.

Another advantage of LE over MH and DE is the runtime (a runtime comparison was requested by R2.1 and R4). A: We measured the average seconds/batch for training (including backprop) and testing. Training: DE 0.99, MH 0.23, LE 0.20, Plain 0.18. Testing: DE 0.24, MH 0.052, LE 0.047, Plain 0.045. LE is faster than MH because backprop is done at different depths: the forward passes are similar, but the backward passes of LE and MH are not.

R5.1 questions how much the intermediate layer outputs can be used to determine uncertainty, which could be limiting for more challenging tasks due to the low-level features of shallow layers. A: Apart from the empirical results that uphold the idea of LE to measure uncertainty, we believe that the intermediate segmentation heads also contribute to learning better low-level features through deep supervision. Also, LE can be adapted to the task difficulty by adding more parameters to the intermediate output heads, e.g. more conv. layers in the i-th head than in the (i+1)-th.

R4.5 asks about the unequal number of models in DE. A: We tested DE with 10 models and observed no big improvement, only an increase in runtime, e.g. DSC 89.6 vs 89.7, NLL 0.16 vs 0.15, and correlation -0.38 vs -0.38, for 5 and 10 models, respectively. This will be mentioned in the paper.

The following comments, such as corrections in terminology or clarifications, will be addressed in the manuscript. R2.2 asked about using STAPLE only with LE. A: The results were very similar whether STAPLE was used or not; we will clarify this in the paper. R2.3: Our claim regarding LE being equivalent to DE is not strong. A: We agree that the sub-nets of LE are not the exact equivalent of DE; we will change it to an approximation of DE, similarly to the related work in [23], deep sub-ensembles. R4.1: AULA is not the first image-level metric. A: We agree with this statement and do not claim it in the paper, but we will clarify it better. R4.2: Epistemic is not a predictive uncertainty. A: We will use the term epistemic. R4.4: Relation of sample easiness to high confidence. A: We clarify “easiness” as most common and in-distribution. R5.2: Significance of image-level uncertainty. A: Image-level uncertainty is complementary to pixel-level uncertainty and useful for retrieval tasks and active learning.

R5.6: In Fig. 3, DE-LV is lower than LE-LV in [0-0.1]. A: We will add that this is due to the slightly better initial performance of DE. However, LE takes a steep decline, showing that it can detect poor segmentations faster. The recommendations in R4.6 (testing on natural image benchmarks), R5.3 (using other types of corrupted images), R5.4 (comparison with MC Dropout), and R5.5 (improving testing by cross-validation) are very interesting and relevant, and will be addressed in future work.
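
An illustrative sketch of an AULA-style image-level uncertainty score, assuming Dice as the agreement measure between consecutive head outputs (the exact definition is given in the paper and the linked repository):

    import numpy as np

    def dice(a, b, eps=1e-8):
        inter = np.logical_and(a, b).sum()
        return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

    def aula(layer_masks):
        """layer_masks: binary masks ordered from the shallowest to the deepest head."""
        agreements = [dice(layer_masks[i], layer_masks[i + 1])
                      for i in range(len(layer_masks) - 1)]
        # High area -> consecutive heads agree -> low epistemic uncertainty.
        return np.trapz(agreements, dx=1.0)

    m = np.zeros((64, 64), dtype=bool); m[20:40, 20:40] = True
    masks = [m.copy(), m.copy(), m.copy()]
    masks[0][20:22] = False  # small disagreement at the shallowest head
    print(aula(masks))       # slightly below the perfect-agreement value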




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors provided convincing arguments in their answers to the reviewers comments. All reviewers are now convinced this paper is worthy of publication. I trust the published paper has already benefited from this review cycle and I expect the final version of this paper to be an excellent read.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The main concern, that the advantages of LE over MH are not clear, was addressed by the authors during the rebuttal. All the reviewers have converged and recommended acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a method for segmentation uncertainty quantification based on the different layer confidences from a single pass through the network. This clearly comes with computational benefits, although the underlying motivation may seem unclear. The reviewers were all happy before the rebuttal and the primary AC didn’t seem to have any concerns or questions either, so frankly I think the paper should have been early accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6


