
Authors

Neerav Karani, Neel Dey, Polina Golland

Abstract

Neural network prediction probabilities and accuracy are often only weakly correlated. Inherent label ambiguity in training data for image segmentation aggravates such miscalibration. We show that logit consistency across stochastic transformations acts as a spatially varying regularizer that prevents overconfident predictions at pixels with ambiguous labels. Our boundary-weighted extension of this regularizer provides state-of-the-art calibration for prostate and heart MRI segmentation. Code is available at https://github.com/neerakara/BWCR.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_36

SharedIt: https://rdcu.be/dnwBw

Link to the code repository

https://github.com/neerakara/BWCR

Link to the dataset(s)

https://wiki.cancerimagingarchive.net/display/Public/NCI-ISBI+2013+Challenge+-+Automated+Segmentation+of+Prostate+Structures

https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper studies consistency regularization in segmentation networks. It shows that the calibration error almost always decreases, while the Dice score improves in some configurations. This result is most interesting for small datasets. Consistency regularization should therefore be considered as an alternative to data augmentation. The paper also shows that the improvement is larger when the consistency regularization has a spatially varying weight; the authors propose a regularization coefficient that is directly related to the label uncertainty near the boundaries.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) It is shown that the logit differences are related to the calibration error. (2) It is demonstrated that the consistency regularization prevents overfitting. (3) The consistency regularization is spatially-varying emphasizing pixels near boundaries. (4) Experiments are fully conducted on hyperparameters, ablation studies, and method comparison. (5) The proposed model was evaluated on two public datasets from different application domains.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Although it is interesting to know that the consistency regularization reduces the calibration error, it seems that there is no important improvement on classification error, even in small datasets, in comparison with the Data Augmentation approach. (2) Clinical metrics, such as the ejection fraction and the volumes of ventricles for the ACDC case, are missing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good. Datasets are publicly available, and methods are described with acceptable detail.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The comments on the weaknesses could be considered here.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even if the paper is interesting and provides insights to the relation between the consistency and the calibration, the potential clinical exploitation seems to be limited.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper proposes a method to calibrate convolutional neural networks so that the predicted probability values correspond to the uncertainty. The proposed method consists of minimizing the difference between two outputs of the same data-augmented image while emphasizing the consistency in the voxels at the border of the regions of interest.
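
The training objective summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the squared-difference form of the consistency term, the normalization by the sum of weights, and the names `nll_loss`, `consistency_loss`, `total_loss`, and `lam` are all assumptions made here for clarity. It also assumes the two logit maps have already been resampled to a common spatial frame.

```python
import numpy as np

def log_softmax(logits, axis=0):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def nll_loss(logits, labels):
    """Mean negative log-likelihood; logits (C, H, W), integer labels (H, W)."""
    logp = log_softmax(logits, axis=0)
    picked = np.take_along_axis(logp, labels[None], axis=0)[0]
    return -picked.mean()

def consistency_loss(logits_a, logits_b, weights=None):
    """Weighted mean squared difference between two logit maps (C, H, W).

    Assumption: squared difference averaged over classes; the paper's exact
    consistency function may differ.
    """
    sq_diff = ((logits_a - logits_b) ** 2).mean(axis=0)  # shape (H, W)
    if weights is None:
        weights = np.ones_like(sq_diff)  # uniform weights: plain CR
    return (weights * sq_diff).sum() / weights.sum()

def total_loss(logits_a, logits_b, labels, lam=1.0, weights=None):
    # Supervised term on one view plus lam times the consistency term.
    return nll_loss(logits_a, labels) + lam * consistency_loss(
        logits_a, logits_b, weights
    )
```

Passing a per-pixel `weights` map that is larger near region boundaries would give a boundary-weighted variant in the spirit of the description above; with `weights=None` every pixel contributes equally.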

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Interesting and very relevant topic for the research community. The accurate estimation of the uncertainty in segmentations has clinical impact, as it shows the areas where clinicians can focus, saving them time. This also relates to interpretability.

    2) Interesting application of the method. It is shown that consistency regularization in the logits can help in calibrating the output of a neural network.

    3) Very detailed Related Work section.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Limited methodological novelty. The proposed method boils down to a typical contrastive approach in which the goal is to increase the consistency of the predictions across different augmented versions of the same image. On top of this, this method emphasizes the consistency in the voxels near the boundaries of the regions of interest by penalizing a bit more when such voxels are not consistent (Eqs. 2-3).

    2) One of the two major claims of this paper was not demonstrated. “We show that CR (consistency regularization) can automatically discover such pixels” (here, “such pixels” refers to pixels with ambiguous labels). The proposed method performs slightly better than the compared methods, and the proposed method produces (in the test set) uncertainty maps. However, it is not shown that the proposed method discovered pixels with ambiguous labels, which would require a visualization and interpretation of uncertainty maps in the training set.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility checklist agrees with what can be seen in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Discussion. I think that this paper would benefit a lot by discussing some aspects of the methods and results. Specifically:

    • A short discussion about how to choose R (Eq. 3) depending on the images/task.
    • Discussion about the limitations of the method, or when it makes more sense to use it. This method is oriented for medical images with challenging boundaries due to low contrast, partial volume effect, etc. To what extent would this method work if used in medical images with clear boundaries between the regions of interest?
    • Figure 3 shows the uncertainty maps produced by different methods. However, without a discussion, it is hard to understand which method is superior. This is particularly important because there is no ground truth of the uncertainty maps. It makes sense that the boundaries are the most uncertain, but, for instance, why is the uncertainty map derived from the proposed method (BWCR) better than SVLS in the first and second row?

    Minor suggestions to improve the paper

    • Small correction in the second sentence of the paper “as is often” -> “as it is often”.
    • Last sentence in page 4: “where $r^j$ is the absolute distance function at pixel $j$”. Here, I would clarify that this is the distance to the nearest pixel at the boundary.
    • Remove the colors of the tables to adhere to MICCAI’s template.
    • For some reason, one third of the references don’t include the year of publication. See, e.g., references 8, 9, 11, 16-19.
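
To make the reading of $r^j$ suggested in the minor comments concrete (the distance from pixel $j$ to the nearest boundary pixel), the sketch below computes such a distance map by brute force. The weighting function is purely hypothetical: Eq. 3 is not reproduced in this review, so the linear ramp and the names `boundary_weights`, `w_max`, and `w_min` are assumptions for illustration only; only $R$ (the boundary width) comes from the review text.

```python
import numpy as np

def boundary_mask(labels):
    """True where a pixel has a 4-neighbour with a different label."""
    b = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]   # vertical neighbours differ
    diff_h = labels[:, 1:] != labels[:, :-1]   # horizontal neighbours differ
    b[1:, :] |= diff_v
    b[:-1, :] |= diff_v
    b[:, 1:] |= diff_h
    b[:, :-1] |= diff_h
    return b

def distance_to_boundary(labels):
    """Euclidean distance from each pixel to the nearest boundary pixel.

    Brute force, suitable only for small images.
    """
    ys, xs = np.nonzero(boundary_mask(labels))
    if ys.size == 0:
        return np.full(labels.shape, np.inf)  # no boundary in this image
    gy, gx = np.indices(labels.shape)
    d2 = (gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2
    return np.sqrt(d2.min(axis=-1))

def boundary_weights(labels, R=3.0, w_max=1.0, w_min=0.1):
    """Hypothetical weighting: w_max at the boundary, decaying linearly to
    w_min at distance R and beyond. Not the paper's Eq. 3."""
    r = distance_to_boundary(labels)
    ramp = np.clip(1.0 - r / R, 0.0, 1.0)
    return w_min + (w_max - w_min) * ramp
```

For realistic image sizes, a dedicated distance transform (for example `scipy.ndimage.distance_transform_edt` applied to the inverted boundary mask) would likely be more practical than the brute-force search used here.
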
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the methodological contribution is limited, the application (i.e., calibrating convolutional neural networks with the proposed method) has more novelty. The paper is very nicely written and easy to follow, has a complete “Related work” section (to the extent possible with MICCAI’s template), and includes experiments that compare the proposed method with previous methods and a baseline.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • The main contribution is to use consistency learning to reduce miscalibration of deep learning models for medical segmentation.
    • In the paper, miscalibration in medical image segmentation is addressed. It proposes to use regularization by consistency learning to tackle overconfident predictions, especially in ambiguous boundary regions. A weighted consistency loss is added to the negative log-likelihood loss to tackle miscalibration during training.
    • The method is motivated by aleatoric uncertainty estimation from augmentations, which is incorporated into the training by using the proposed loss. The authors give insightful information about the loss landscape of their optimization criterion, which strengthens the theoretical idea of the paper.
    • The idea of CR regularization for confidence calibration is extended to a boundary-weighted variant (BWCR), which is motivated by the fact that uncertainty in segmentation is high in boundary regions.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The main strength is the introduction of a simple, but novel and effective regularization for improving model calibration in segmentation.

    • The paper tackles an important problem of medical image analysis with deep learning. Calibration is often overlooked but very important to consider for clinical use of deep models. The clinical data distribution can vary greatly from the training distribution and overconfident networks often fail to show their inaccuracies in this scenario. The presented method greatly helps to address that.
    • The paper is very well written and easy to follow. Detailed explanations, such as the link between consistency learning and aleatoric uncertainty in §2, help the reader to better grasp the idea and emphasize the theoretical background of the method.
    • The proposed method is novel, well motivated and well described, and the theoretical explanation in the form of the loss landscape in §3.1 and Fig. 1 is a great addition. I’m not aware of any prior work that linked consistency learning to calibration, especially in the medical domain.
    • The experimental evaluation is sound and methodologically correct. The use of commonly accepted calibration metrics is suitable to measure the effect of the proposed method.
    • I applaud the authors for reporting performance metrics with standard deviations from repeated experiments and for conducting hypothesis testing to assess statistical significance of their findings. Unfortunately, this is not yet standard in the (medical) imaging community and should be mandatory.
    • The conclusion gives a short outlook to future work that would be of high interest in combination with the proposed method but was not covered by the paper, e.g., the influence of the consistency loss or out-of-distribution detection.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main weaknesses are the insufficient explanation of BWCR, especially Eq. (3), the missed opportunity to tune \lambda per data set, and the failure to address the apparent side effect of CR.

    • Even after reading the paper twice, I would not be able to implement BWCR using the given explanations. In Eq. (3), how is the width of the boundary R defined? Which absolute distance is given by r^j and how is this defined? How does R affect the result? Did you optimize this parameter or is it defined by the data set?
    • Tab. 1 analyzes the effect of \lambda. However, the authors fail to optimize this parameter, including the range for \lambda in BWCR, per data set using the validation set. It seems like the same values are used for all experiments. I would expect that the poor result on ACDC n=95 happened due to using an unfavorable value for \lambda.
    • The core problem with the presented method is not addressed in the discussion of the results at all. You can clearly see in Fig. 3 that training with (BW)CR leads to “confidence leaking” between the foreground classes. E.g., in the lower two rows, you can see that both CR and BWCR exhibit higher confidence along the boundary of the other class, whereas the compared methods do not show this effect. Depending on the class threshold value, this could lead to clearly false predictions.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper can be improved.

    Pros

    • The authors state in their reproducibility report that all code will be released upon acceptance.
    • The paper uses publicly available data sets to conduct experiments.
    • The general method is well described, which makes reimplementation of CR possible.
    • Most parameters of the training procedure are reported.

    Cons

    • As stated above, BWCR is described insufficiently and could not be re-implemented given only the information in the paper.
    • The parameters of the geometric transformations T_\psi are not given in the paper.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please:

    • address the “confidence leakage” of CR and BWCR in your discussion,
    • conduct a line search/hyperparameter optimization for \lambda for the bigger data sets, if time permits, or else, discuss the effect of the data set size on the optimal value for \lambda,
    • improve the explanation of BWCR in § 3.2,
    • state the parameters of the geometric augmentations.

    • Moreover, I think the paper would benefit from a (short) discussion on consistency functions, e.g., cosine similarity, mutual information, etc.
    • The abstract is very short and could benefit from a bit more details.
    • Approximate Bayesian neural nets usually provide better calibrated uncertainties than point estimates. For future work, I wonder how the presented approach can be combined with BNNs.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has good merits and I would like to see it discussed at MICCAI. However, the paper has some issues in its current form that need to be addressed to transform it to a clear accept.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper uses consistency regularization to improve the calibration of segmentation models. Multiple augmentations are employed to implicitly find ambiguous locations and then enforce consistency of logits there, which in turn avoids generating overly confident erroneous probabilities, thereby improving calibration. While no reviewer was 100% (strongly) in favour of direct acceptance, all of them found the paper interesting, well motivated and with meaningful experimental validation, and all of them recommended (weak) acceptance. I briefly went through the paper and, although I did find some odd superficial aspects (a weirdly short abstract, no decimal figures in metrics like ECE), I did not spot any reason not to accept this work. I would like to draw the attention of the authors towards the very nice review provided by R2, which contains several interesting pointers to some aspects of the paper that could be improved if it were to be extended to a journal version.




Author Feedback

We thank reviewers for their valuable time and feedback. Reviewers acknowledged the novel connection established between logit consistency and calibration, as well as the thorough experimental validation and clear writing and organization of the paper.

Additionally, reviewers provided many great suggestions to improve the paper’s quality. Accordingly, we have:

  1. expanded the abstract to incorporate background information and motivation (MR1, R2)
  2. enhanced BWCR explanation, accompanied by an additional figure (Fig 2) (R1, R2)
  3. expanded the discussion of the results to acknowledge (i) the issue of confidence leakage (R2) and (ii) relatively modest level of improvement in segmentation accuracy in the majority of cases (R3)
  4. included discussion about alternative consistency losses (R2)
  5. clarified that R and other hyperparameters have been selected heuristically, but may be tuned per dataset using a validation set (R1, R2)
  6. clarified parameters of geometric transformations (R2)
  7. corrected formatting errors in tables and references (R1)

The additional points raised will be thoroughly investigated in both our current and future work, and we intend to incorporate them into an extended version of this paper. These points include:

  1. visualizing and interpreting predictions on training images, as well as tracking their evolution during training iterations (R1)
  2. understanding why “confidence leakage” happens in CR, and to a lesser extent in BWCR (R2)
  3. hyperparameter optimization using a specific validation set (R1, R2)
  4. exploring connections of the proposed method with Bayesian neural networks (R2)
  5. evaluating in terms of clinical metrics such as ejection fraction (R3)

Finally, we wish to clarify the following points.

  1. The main goal of the paper is not to introduce a novel loss function but rather to demonstrate the previously neglected impact on network calibration for a well-known loss function, namely logit consistency across stochastic transformations (R1)
  2. Dice and ECE values are reported as percentages without decimals to save space without loss of information (MR1)
  3. In the datasets used in our experiments, there are some images with clear boundaries, and some with ambiguous boundaries. We expect this to be the case in most medical imaging datasets. Thus, the proposed method would potentially improve calibration for other medical image segmentation tasks as well (R1)
  4. Given that no ground truth exists for uncertainty maps, we acknowledge the limitations of qualitative discussion and have supplemented it with two metrics that quantify the level of calibration (R1)
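
For readers unfamiliar with ECE, one of the calibration metrics referred to above, the following is a minimal sketch of the standard binned estimator. It is not taken from the authors' code: the equal-width binning, the default `n_bins=10`, and the choice of which pixels to include are assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 where the prediction matches the label, else 0.
    """
    confidences = np.asarray(confidences, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin k covers (edges[k], edges[k+1]]; a confidence of 0 falls in bin 0.
    bin_ids = np.clip(
        np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1
    )
    ece = 0.0
    for k in range(n_bins):
        in_bin = bin_ids == k
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece
```

For segmentation, `confidences` and `correct` would typically be flattened per-pixel arrays; multiplying the result by 100 gives a percentage of the kind reported without decimals in the paper's tables.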


