
Authors

Mayank Gupta, Soumen Basu, Chetan Arora

Abstract

Deep Neural Networks (DNNs) have been successful in various computer vision tasks, but are known to be uncalibrated and to make overconfident mistakes. This erodes a user’s confidence in the model and is a major concern for their applicability to critical tasks like medical imaging. In the last few years, researchers have proposed various metrics to measure miscalibration, and techniques to calibrate DNNs. However, our investigation shows that for small datasets, typical of medical imaging tasks, the common calibration metrics have a large bias as well as variance. This makes these metrics highly unreliable, and unusable for medical imaging. Similarly, we show that state-of-the-art (SOTA) calibration techniques, while effective on large natural image datasets, are ineffective on small medical imaging datasets. We discover that the reason for failure is the large variance in density estimation from a small sample set. We propose a novel evaluation metric that incorporates the inherent uncertainty in the predicted confidence and regularizes the density estimation using a parametric prior model. We call our metric Robust Expected Calibration Error (RECE); it gives a low-bias, low-variance estimate of the expected calibration error, even on small datasets. In addition, we propose a novel auxiliary loss - Robust Calibration Regularization (RCR) - which rectifies the above issues to calibrate the model at train time. We demonstrate the effectiveness of our RECE metric as well as the RCR loss on several medical imaging datasets and achieve SOTA calibration results on both standard calibration metrics and RECE. We also show the benefits of using our loss on general classification datasets. The source code and all trained models have been released.
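
For context, a minimal Python sketch of the standard binned ECE that the abstract argues becomes unreliable on small test sets (an illustrative implementation, not the authors' released code):

    import numpy as np

    def binned_ece(confidences, correct, n_bins=10):
        """Standard Expected Calibration Error with equal-width confidence bins."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        n, ece = len(confidences), 0.0
        for i in range(n_bins):
            lo, hi = edges[i], edges[i + 1]
            in_bin = (confidences > lo) & (confidences <= hi)
            if i == 0:
                in_bin |= confidences == 0.0  # put confidence exactly 0 into the first bin
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += (in_bin.sum() / n) * gap
        return ece

With only a few hundred test samples, many bins contain just a handful of points, so the per-bin accuracy estimates become noisy; this is the high-bias, high-variance behaviour the abstract refers to.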

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_15

SharedIt: https://rdcu.be/dnwAM

Link to the code repository

https://github.com/MayankGupta73/Robust-Calibration

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In their manuscript, the authors propose a novel metric for reliability assessment of classifier decisions called Robust Expected Calibration Error (RECE), as well as a calibration loss function called Robust Calibration Regularization (RCR). They demonstrate their applicability to small and large data scenarios and provide a thorough assessment on various datasets and setups.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall, the paper addresses a very relevant point with high practical importance. By doing so, the authors demonstrate in-depth knowledge and propose a method derived by addressing the disadvantages of current methods in a straightforward fashion. Finally, the proposed metric and regularization loss are thoroughly evaluated using a variety of publicly available datasets. The paper is well-written and follows a clear and succinct structure.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    To me, the main weakness of the paper lies in the limited demonstration of the relationship between accuracy and RECE. While the authors show that learning with the RCR loss provides the lowest RECE, it was not completely clear to me whether this always translates into overall better calibration. I felt that the discussion could have been more thorough.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall, the paper seems to be highly reproducible. Notably, the authors provide all training code and models, which allows for cross-checking and building upon the provided results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    All in all, I felt that the manuscript was very clear and succinct. Still, I felt that a few points should be addressed by the authors, which I would like to point out as follows:

    ###################### Major ######################

    • I felt that the paper, in its current state, while clearly motivated, does not provide enough evidence for the validity of the RECE metric. While Fig. 2 and Tab. 1 seem to show that small-sample estimates correlate better with the calibration metric on the full dataset, the adequacy of the metric on this full dataset has, from my perspective, not been demonstrated clearly.
    • Taking into account Tab. 2, this in fact does not necessarily seem to be the case, as lower RECE values in the RCR setting seem to correlate with lower overall accuracies.
    • While I still see a clear value of a reliable metric that can be estimated from little data, this, as well as the RCR lowering overall performance, are clear limitations that should be discussed in detail.

    ###################### Minor ######################

    • The manuscript contains various typos, such as: “various bin”, p3; “Both ECE and SCE suffers”, p3; “we generate 10 observation”; p5; unneeded “hence”, p5; “We call the loss function as”, p5; “along side”, p5; “We use following publicly” p6; “covid”, p6;
    • Abbreviations should always be introduced on first mention, e.g., ECE, MDCA, GBCU, POCUS, …
    • Section 2 presents multiple related approaches. However, the results from Carneiro and Rajaraman are not yet set in contrast to the method at hand. I would recommend adding this.
    • Eq10 currently contains a convolution operator. This should rather be a \cdot.
    • Larger numbers, such as 50089 and 33132 might benefit from using thousand commas, e.g. 50,089 and 33,132.
    • The choice of the hyperparameters and their variation across datasets should be better motivated
    • Table 2 does not yet contain any confidence intervals or measures of statistical uncertainty. This might be added.
    • The choice of the terminology “We demonstrated the ineffectiveness of existing calibration metrics” seems rather harsh. I would suggest using a weaker formulation here.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I felt that the main contribution of the paper (i.e., robust assessment of calibration error in small data setups) is clearly demonstrated. The paper is overall well-written and shows mostly minor weaknesses that can be easily addressed by the authors.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I feel that the authors were able to address several of the mentioned issues. Regarding the response to the decrease in accuracy, I specifically appreciate that the authors conducted an additional evaluation with the Brier score, directly following the suggestion of the AC.

    Still, it is noteworthy that, according to Tab. 2, training with RCR led to relative increases in error rates over the respective best result by 17% (GBCU), 100% (BUSI), 98% (POCUS), and 162% (Diabetic Retinopathy), which of course, in turn, might come with worse calibration performance, and therefore should be thoroughly discussed to give the reader a better understanding.

    Especially with respect to the Brier score evaluation, the specific choice of the POCUS set felt rather unlucky to me, as both MDCA and FL clearly outperformed RCR with respect to accuracy on every other dataset but this one, and therefore had expectedly inferior Brier scores. Notably, for FL, even there the difference was only slight (.1582 vs. .1522), which calls for further discussion. I would strongly appreciate it if the authors added some additional comments on that.



Review #2

  • Please describe the contribution of the paper

    The paper addresses an important problem - developing reliable means for the estimation of prediction confidence. The authors propose a metric treating every confidence prediction value as a value sampled from a probability distribution (Gaussian or a mixture of Gaussians). To calibrate the confidence value, the latter is weighted by the probability of being observed in a particular bin. In addition, a loss function corresponding to the metric is proposed. The method is thoroughly tested on multiple datasets and compared against existing SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a somewhat simple but apparently effective calibration measure. As Tab. 1 shows, the proposed measure captures calibration characteristics in the low-data regime better than existing metrics. Methods combining both simplicity and effectiveness are strong candidates for practical use.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    As a calibration metric, the proposed measure seems to demonstrate its value. However, using the proposed calibration loss comes at the cost of losing a few percentage points of prediction accuracy.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The datasets used are public, and the code is anonymously released, which makes the results reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Can you elaborate on why the normalization over the set of bins (the denominator in Eq. 6 and 7) is needed in your formulation? Can you provide references for the methods in Table 2? Minor: “must predicts it’s”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It’s a thoroughly validated study of a calibration metric for assessing the confidence of a model’s predictions. This is an important problem not yet solved by the community, and solutions to it are in demand.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors addressed my minor comments and the concerns of R3 (weak reject), so I stay with the “accept” score.



Review #3

  • Please describe the contribution of the paper

    The paper proposes two alternative measures of calibration for deep learning models that are more robust to smaller dataset sizes. These measures are based on sampling either the predictive probabilities with Gaussian distributions (RECE-G) or test images with random augmentations (RECE-M). The paper then sets out how to incorporate these measures of calibration into a regularization term added to the loss function used during training (RCR). The paper conducts experiments with multiple datasets across medical imaging domains. To evaluate their measures of calibration, experiments were conducted comparing the measure on 100% of the test dataset to the same measure on varying fractions of the test set. The authors also compared their calibration method with other calibration methods and found that models trained with their loss function obtained the best calibration performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper aims to solve an issue with calibration metrics that is not addressed much in the literature on calibration methods, despite their importance to medical image analysis systems. The regularization term that is incorporated into the loss function is a good way of encouraging the model to be better calibrated, and the presented results show that it works well for that purpose. The experiments in the paper use a wide range of medical imaging datasets from a range of medical domains. There is a good ablation study on the effect of changing the standard deviation, which gives an indication of how RECE works.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    RECE is based on ECE and addresses some of the limitations of ECE that make it less robust when dealing with smaller data sizes. While other measures of calibration are mentioned, i.e., SCE and Adaptive ECE, the MICCAI community has mostly moved to using the more recent ECE KDE, based on kernel density estimators instead of binning (see Metrics reloaded: Pitfalls and recommendations for image analysis validation). Its absence from the experiments is therefore a shame. The experiments in Table 1 seem to be inconclusive: many of the results indicate that RECE is a better measure of calibration when the dataset is smaller, but the standard deviations suggest that this is not statistically significant and not a clear improvement over other measures such as Adaptive ECE and SCE. The results explanation and conclusion sections are very short and do not provide much insight into the results. The results also fail to discuss the trade-off between accuracy and calibration that would be introduced by the regularization term.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work uses publicly available datasets and all code has been made available online.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It is not clear what methods are being tested in Table 2. The experiments in Table 2 could be repeated multiple times; displaying the mean and standard deviation would show the stability of the method. Other, more modern measures of calibration have been accepted in the MICCAI community, namely ECE KDE. This should be used as a comparison in the experiments and should be considered in your method development, as it addresses some of the pitfalls of other measures mentioned in this paper by avoiding binning. The results could be explained and discussed more, as the current results and conclusion sections are very short.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper tries to deal with calibration measures not being robust on smaller datasets. This is a novel problem, and the authors attempt to solve it using a sampling-based method; they also propose a method to improve calibration during training. However, the experiments do not show a statistically significant improvement, and other modern methods of measuring calibration are not tested, which leaves this paper at a disadvantage.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I appreciate the rebuttal presented by the authors and the additional experiments and discussion highlighting the benefits of RECE in situations where only smaller test sets are available. The results and discussions would be valuable if included in the paper and would make the paper (like RECE) more robust.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work introduces a calibration metric, Robust Expected Calibration Error (RECE), and an associated loss function, Robust Calibration Regularization (RCR), that are meant to work better when we have a small amount of data. The paper received two accept recommendations from R1 (“The paper is overall well-written and shows mostly minor weaknesses”) and R2 (“the proposed calibration measure is capable to capture calibration characteristics in low data regime better than existing metrics.”), and a weak rejection opinion from R3, who mentions a couple of concerns. After going through the reviews and the paper myself, I tend to align with R3 regarding those concerns, and prefer to see this paper go through a rebuttal phase so that we can have a longer discussion about them and ensure its contributions are meaningful. Please see my comments below:

    1) Specifically, R3 mentions that an important calibration metric has been left out of the analysis: ECE KDE. This is based on a kernel density estimation replacing ECE binning, and shares the motivation of RECE to achieve an unbiased calibration error estimation. I share the point of view of R3 that this is becoming the standard in the MICCAI community, as it is recommended in the current version of [A], and would like to see this discussed in the rebuttal (this = “how ECE KDE compares to RECE”).

    2) R1 and R2 observe that the improvement in calibration of the proposed loss function may come at the expense of reduced discriminative capability: “Lower RECE values in the RCR setting seem to correlate with lower overall accuracies. (R1)”, “However, using the proposed calibration loss comes at the cost of losing some percent of prediction accuracy. (R2)”. In my personal experience, there are many techniques that can improve calibration performance (Focal Loss, MixUp, Label Smoothing) if you are willing to pay the price of sacrificing accuracy by adding a certain amount of underfitting. I believe that the way to guarantee that this is not what is happening is to report a Proper Scoring Rule value, which aggregates calibration and discrimination abilities into a single value. In this case, I guess the authors could report the Brier score and/or Cross-Entropy/Negative Log-Likelihood, see [A].

    3) R3 also notes that “the experiments do not show any statistical significant improvement”. May the authors discuss the statistical significance of their analysis?

    [A] Metrics reloaded: Pitfalls and recommendations for image analysis validation




Author Feedback

We are encouraged that the reviewers found our work to be of high importance and effective. Below we address some of the concerns raised.

R3, AC: Comparisons with ECE-KDE. We have done the experiment corresponding to Table 1 in the manuscript. We achieve an ECE-KDE of 0.2915 on 100% of the test set, and 0.902 +- 0.250, 0.489 +- 0.244, 0.399 +- 0.154, 0.309 +- 0.078, and 0.313 +- 0.0474 on 1%, 5%, 10%, 25%, and 50% of the test set, respectively. Table 1 in the manuscript shows that RECE-G and RECE-M converge to the expected value using a smaller amount of data than ECE-KDE. We would also like to point out that in the work cited by R3 ([A], p. 171), the authors “advise against” using ECE-KDE for small sample sets as it is dependent on the sample size. Our proposed metric fills this gap by giving a low-bias estimate of the calibration error on the small datasets common in medical imaging (as shown in Table 1).

R3, AC: Differences between RECE and ECE-KDE. While ECE-KDE removes binning in an attempt to mitigate bias, it is still dependent on sample size. Our method mitigates this dependency on sample size. While we focus on top-label calibration in this paper, we believe the idea can be extended to all-label calibration as in MDCA. We leave this for future work.

R1, R2, R3, AC: Reduction in accuracy due to RCR calibration. Calibrating a model may lead to learning a different, more robust representation than training with cross-entropy alone. This may sometimes cause a slight decrease in accuracy. This behavior of RCR is consistent with other SOTA techniques, e.g., MDCA.

AC: Brier Score. We have evaluated Brier score (BS) values on the POCUS dataset to assess both discriminative and calibration performance. RCR achieves a better BS of 0.1522, compared to 0.2014 for MDCA and 0.1584 for FL.
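
For reference, a minimal sketch of the multi-class Brier score (BS) reported here, assuming the common convention of summing the squared error over classes and averaging over samples (illustrative code, not the authors' evaluation script):

    import numpy as np

    def brier_score(probs, labels):
        """Multi-class Brier score: mean over samples of the squared error
        between the predicted probability vector and the one-hot target.
        probs: (N, K) predicted probabilities; labels: (N,) integer classes."""
        probs = np.asarray(probs, dtype=float)
        labels = np.asarray(labels, dtype=int)
        onehot = np.zeros_like(probs)
        onehot[np.arange(len(labels)), labels] = 1.0
        return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

As a proper scoring rule, it penalizes both miscalibration and poor discrimination, which is why the AC suggested reporting it.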

R3, AC: Statistical Analysis of Calibration Metrics. We have performed paired t-tests for the results in Table 1, comparing the absolute difference in metric value between RECE and AECE at different data sizes. The tests yielded p-values of 0.005, 0.258, 0.010, 2e-7, and 7e-5 for 1%, 5%, 10%, 25%, and 50% of the test set, respectively. This shows that the results are statistically significant.
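
A minimal sketch of the paired test described above, assuming that for each random subsample one records the absolute deviation of RECE and of AECE from their full-test-set values and pairs the two across subsamples (the deviation values below are purely hypothetical; scipy is used for the test itself):

    import numpy as np
    from scipy.stats import ttest_rel

    # Hypothetical |metric(subsample) - metric(full test set)| values for the
    # same 10 random subsamples at one subsampling fraction.
    rece_dev = np.array([0.012, 0.020, 0.008, 0.015, 0.011, 0.018, 0.009, 0.014, 0.016, 0.010])
    aece_dev = np.array([0.045, 0.038, 0.052, 0.041, 0.060, 0.047, 0.055, 0.043, 0.049, 0.058])

    # Two-sided paired t-test: do the paired deviations differ on average?
    t_stat, p_value = ttest_rel(rece_dev, aece_dev)
    print(f"t = {t_stat:.3f}, p = {p_value:.4g}")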

R1, R3: Statistical Analysis of Table 2. We have performed a 10-fold analysis for the experiments in Table 2. Training with the RCR loss achieves Accuracy 0.877 +- 0.052, ECE 0.078 +- 0.030, SCE 0.076 +- 0.023, and RECE-M 0.059 +- 0.024. In contrast, MDCA training leads to inferior results: Accuracy 0.870 +- 0.045, ECE 0.083 +- 0.035, SCE 0.077 +- 0.022, and RECE-M 0.064 +- 0.024.

R1: Evidence of effectiveness of RECE. To further demonstrate the effectiveness of RECE, we have performed the following experiment. We consider the ECE computed at a large sample size (10,000 in CIFAR10) to be our ground truth and evaluate metrics on whether they converge to this ground-truth value and how much data they require. RECE converges to the ground-truth value of 0.0295 exactly with 25% of the data. SCE and AECE converge to a different value, while ECE does not converge even with 50% of the data. We feel this supports our claim of the effectiveness of RECE.
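
A sketch of the convergence protocol described in this paragraph: evaluate a calibration metric on random subsets of increasing size and compare against the full-test-set value (illustrative code; metric_fn stands for any metric with the signature metric_fn(confidences, correct)):

    import numpy as np

    def convergence_curve(metric_fn, confidences, correct,
                          fractions=(0.01, 0.05, 0.10, 0.25, 0.50),
                          n_repeats=10, seed=0):
        """Mean and std of a calibration metric on random subsets of the test
        set, alongside the value on the full set (the "ground truth")."""
        rng = np.random.default_rng(seed)
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        n = len(confidences)
        full_value = metric_fn(confidences, correct)
        curve = {}
        for frac in fractions:
            k = max(1, int(round(frac * n)))
            vals = []
            for _ in range(n_repeats):
                idx = rng.choice(n, size=k, replace=False)
                vals.append(metric_fn(confidences[idx], correct[idx]))
            curve[frac] = (float(np.mean(vals)), float(np.std(vals)))
        return full_value, curve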

R1: Correlation of RECE with lower accuracy. In Table 2, the comparison between DCA and FLSD on the GBCU dataset shows that RECE is not negatively correlated with accuracy. Rather, as discussed above, calibration methods like RCR do lead to a small loss of accuracy.

R2: Details of normalization over bins in Eq. 6 and 7. In Eq. 6 and 7, we normalize over all bins to ensure that, for every sample, the weight distributed across bins sums to 1. This is essential to ensure that each sample has equal weight in the metric formula, and it also keeps the metric on the same scale as common metrics like ECE and AECE for comparison.
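
A minimal sketch of the per-sample normalization described here, reconstructed from this explanation rather than from the paper's exact Eq. 6 and 7: each sample's confidence is smoothed by a Gaussian, its probability mass over each bin is computed, and the weights are renormalized to sum to 1 per sample before aggregating a binned calibration gap.

    import numpy as np
    from scipy.stats import norm

    def soft_binned_calibration_error(confidences, correct, n_bins=10, sigma=0.05):
        """Each sample spreads unit weight over the bins according to a Gaussian
        centred at its confidence (std = sigma); the weighted per-bin gaps are
        then aggregated as in ECE, so the result stays on the ECE/AECE scale."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)

        # w[i, b] = Gaussian mass of bin b under N(confidence_i, sigma^2)
        upper = norm.cdf(edges[None, 1:], loc=confidences[:, None], scale=sigma)
        lower = norm.cdf(edges[None, :-1], loc=confidences[:, None], scale=sigma)
        w = upper - lower
        w /= w.sum(axis=1, keepdims=True)  # per-sample normalization (the denominator)

        n, err = len(confidences), 0.0
        for b in range(n_bins):
            mass = w[:, b].sum()
            if mass > 0:
                bin_acc = (w[:, b] * correct).sum() / mass
                bin_conf = (w[:, b] * confidences).sum() / mass
                err += (mass / n) * abs(bin_acc - bin_conf)
        return err

This is only an illustrative reconstruction; the paper's exact kernel (Gaussian vs. mixture) and aggregation may differ, but each sample contributing total weight 1 is the property the response emphasizes.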

R2, R3: Citation of methods used in Table 2. We have used similar baselines to those used in the recent SOTA work (MDCA). We will add these citations in the final version.

R1, R2: Typos, Missing Abbreviations, Formatting. Thank you for pointing these out. We will rectify them in the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper originally attracted two votes for direct Acceptance, but R3 voted for Weak Rejection. After going over the authors’ response letter, R1 and R2 kept their Acceptance recommendation, and R3 raised their rating to Weak Acceptance. During the detailed discussion, both R3 and I appreciated the response by the authors, specifically the comment on the usefulness of RECE in a low-data regime compared with standard ECE, the statistical significance tests, and the extra comparisons with Kernel Density Estimate-based ECE. Therefore I will be recommending acceptance. Please note that I strongly agree with R3 that the authors should make an effort to incorporate into the camera-ready version, as much as possible, the comments and results presented to reviewers during the rebuttal, since they contribute noticeably to the value of this paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a robust metric for measuring calibration error – the RECE – with a corresponding loss function. As existing measures of calibration error are known to have a sample size bias, robust estimates are highly welcome (a highly related recent work is found in Petersen et al, On (assessing) the fairness of risk score models, FAccT 2023).

    Reviewer 3 had a valid concern that the commonly used ECE KDE was not included, but as explained by the authors in their rebuttal, this metric would still have a sample size bias.

    I am therefore happy to recommend acceptance of this paper – of course, the enlightening explanations and added experiments from the rebuttal should be added to the paper to maximize its impact.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper was discussed, and the rebuttal seems to address most of the concerns. The paper might be able to address all remaining issues in the camera-ready version.


