
Authors

Lior Frenkel, Jacob Goldberger

Abstract

Calibrating neural networks is crucial in medical analysis applications where the decision making depends on the predicted probabilities. Modern neural networks are not well calibrated and they tend to overestimate probabilities when compared to the expected accuracy. This results in a misleading reliability that corrupts our decision policy. We define a weight scaling calibration method that computes a convex combination of the network output class distribution and the uniform distribution. The weights control the confidence of the calibrated prediction. The most suitable weight is found as a function of the given confidence. We derive an optimization method that is based on a closed form solution for the optimal weight scaling in each bin of a discretized value of the prediction confidence. We report experiments on a variety of medical image datasets and network architectures. This approach achieves state-of-the-art calibration with a guarantee that the classification accuracy is not altered.
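
To make the mechanism concrete, below is a minimal NumPy sketch of the kind of confidence-binned weight scaling the abstract describes. The function names, the bin layout, and the per-bin closed form (matching the mean calibrated confidence to the bin accuracy) are illustrative assumptions rather than the authors' exact algorithm; the structural point is that mixing with the uniform distribution never changes the argmax, so classification accuracy is preserved.

```python
import numpy as np

def weight_scale(probs, w):
    """Convex combination of the predicted distribution and the uniform one.
    w in [0, 1] controls the calibrated confidence; the argmax is unchanged,
    so the predicted class (and hence the accuracy) is preserved."""
    k = probs.shape[-1]
    return w * probs + (1.0 - w) / k

def fit_bin_weights(probs, labels, n_bins=15):
    """Per confidence bin, choose the weight so that the mean calibrated
    confidence matches the bin accuracy (illustrative closed form: solve
    w * conf_mean + (1 - w) / K = acc for w, clipped to [0, 1])."""
    k = probs.shape[1]
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(1.0 / k, 1.0, n_bins + 1)
    weights = np.ones(n_bins)
    for b in range(n_bins):
        in_bin = (conf > edges[b]) & (conf <= edges[b + 1])
        if in_bin.any() and conf[in_bin].mean() > 1.0 / k:
            acc, c = correct[in_bin].mean(), conf[in_bin].mean()
            weights[b] = np.clip((acc - 1.0 / k) / (c - 1.0 / k), 0.0, 1.0)
    return edges, weights

def calibrate(probs, edges, weights):
    """Apply the per-bin weight selected by each sample's own confidence."""
    conf = probs.max(axis=1)
    bins = np.clip(np.digitize(conf, edges) - 1, 0, len(weights) - 1)
    return weight_scale(probs, weights[bins][:, None])
```

In practice the bin weights would be fitted on a held-out validation set (fit_bin_weights) and then applied to the test predictions (calibrate).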

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_61

SharedIt: https://rdcu.be/cVVqi

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The approach tackles the important problem of network calibration. This is especially important when medical staff make decisions based on network confidences. The paper proposes Confidence based Weight Scaling (CWS), a technique for calibrating the outputs of deep learning classification networks. The approach achieves state-of-the-art calibration with a guarantee that the classification accuracy is not altered.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed calibration technique is well described and formulated. The calibration can be potentially applied to any network and classification dataset.
    • Multiple networks and datasets are used for the validation. The validation is overall fair and reliable. Performing the validation on three public datasets makes the obtained results stronger.
    • Results seem to show a clear benefit of this calibration method with respect to other existing works.
    • The paper is clearly written and easy to follow
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Adding some description of the limitations and problems of the proposed method would be desirable, as well as insights about future lines of work.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The formulation seems clear and reproducible. However, code will not be provided. The validation is performed on three publicly available datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Making the code available will be beneficial for the community

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper tackles a very important problem in deep learning, network calibration. A novel calibration procedure is proposed and the fair evaluation shows a clear improvement with respect to existing calibration procedures.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    A weight scaling calibration method that, like the commonly used post-hoc temperature scaling method, does not alter the accuracy of the predictive model. The approach achieves improved calibration compared to current calibration methods and can be applied to any trained model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea is interesting for a calibration method, but a similar approach has been explored before in a paper released last year (https://arxiv.org/pdf/2108.00106.pdf), which is my only concern regarding novelty. However, the premise of not altering the accuracy of confident samples is different and, as the paper indicates, is not guaranteed in prior work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The concept of keeping the same samples’ confidences high is the aspect that is not entirely clear to me when reading the paper: if it is a weight scaling that is then fed back to the classifier, would the predicted confidence values not change, since they still pass through a softmax layer?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The equations are provided, and the data is available along with code links, so I think it has a high chance of reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I enjoyed reviewing this paper and only have a few questions:

    1. A bit more clarity regarding the method: why was only the ECE of the top-1 predictions examined? It is not clear what you refer to here; apologies if something was missed.
    2. On page 7, regarding “WS calibration was lower than the ECE… by more than half”: it is worth noting that the HAM10000 dataset did not perform well for any of the TS->WS methods; perhaps you have an indication as to why that would be? Also consider some statistical significance testing.
    3. Regarding the level of confidence in Fig. 2, how come there is no representation for the COVID group?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well done on trying the approach; it is interesting and a nice read. My only concern is that I see it as closely related to another paper (as mentioned above), but the methodology is interesting and, even though similar, the findings are interesting. I am a little skeptical regarding the final results, and more clarity on the method regarding the points raised above would be good.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I believe the authors addressed the reviewers’ concerns quite well, and I hope to see all updates in the final paper, also with respect to the limitations. I agree with their point of view that their paper is different in its final implementation, but I do see some similarity in the concept of using ECE as the metric to optimise for improvements in calibration, even though in their paper it is done via weight scaling versus a loss-based smoothing of ECE in the Karandikar et al. paper. In light of this I change my rating and promote the paper, as I feel the area of calibration for medical imaging applications requires different approaches in order to tackle the problem and to improve the measures and values used to assess performance.



Review #3

  • Please describe the contribution of the paper

    The paper investigates calibration methods for DNNs trained for medical image classification systems and proposes a weight scaling methodology that helps calibrate models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation of the paper is sound. Model calibration is an important property, especially in the medical field. The paper uses a number of different architectures and datasets to showcase its experiments, which is a plus.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The notation used in the paper is not clear. Please use vector notation to make variables clear to the reader (for example, some variables are referred to as being vectors but they are written as constants; the same goes for subscripts). Also, a number of symbols are used to denote different purposes, which makes reading the manuscript confusing (x and y are referred to as inputs and outputs, respectively, but later on both variables are used to refer to different patients [e.g., patient x and patient y]).

    The main problem I see in the paper is the usage of ECE as a metric to showcase results. As noted by the authors, although ECE remains a top contender as a metric of study, the shortcomings of ECE as a calibration metric are well documented [1]. In light of this, having better results showcased (only) with ECE does not mean much. The authors also note that some of the shortcomings of ECE are alleviated with adaECE but refrain from providing experimental results on this metric; why? While reading those lines, I expected to see results for both ECE and adaECE, yet they are not there. This is especially confusing since the authors themselves acknowledge that ECE is not a good evaluation metric. Then why would a new method whose superior results are showcased only with ECE be useful?

    A number of alternatives to ECE have been discussed in [1] (also referred to in the paper); would it be possible to show results with the metrics discussed in that work (for example adaECE, TACE, SCE)? Does the proposed method still achieve better results when measured with these metrics? If the use of those metrics is not possible, what is the justification?

    [1] Nixon et al., Measuring Calibration in Deep Learning

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It would be desirable to have algorithm 1 in the form of a function implemented in any language and any framework of choice for the sake of reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1- Employ vector notation for mathematical equations and be consistent with variable usage.

    2- Showcase results with other metrics.

    3- Discuss the differences (if any) obtained with different metrics.

    4- Please be consistent in your referencing style, capitalization of venues, abbreviations etc.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The two main problems of the paper are the unclear math notation and the lack of experimental results using other metrics.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces Confidence based Weight Scaling (CWS) to calibrate the output of deep net classifiers. CWS is based on a convex combination of the classifier output and a uniform distribution. The method shows SOTA calibration results on three public datasets. Reviewers identified positive and negative points associated with the paper. As for the positive aspects, we have: 1) the paper is well written, 2) the method is successfully applied to multiple networks and datasets, and 3) SOTA results. The negative points are as follows: 1) the method is similar to [1]; 2) poor discussion of the limitations and future work; 3) the paper should clarify the concept of keeping the same samples’ confidences high; 4) the notation should be clarified; and 5) the paper should explain why results based on MCE, adaECE, TACE, and SCE are not shown. For the rebuttal, the authors should focus their reply on the negative aspects above.

    [1] Nixon et al., Measuring Calibration in Deep Learning. CVPR Workshops 2019.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank the reviewers for their insightful comments and constructive feedback. We will answer the major points below and make the suggested changes to the main text. Code will be provided.

Reviewer 1: In the final version we will add a discussion of the limitations of the proposed method and of future research directions (e.g., extending the proposed method to the calibration of segmentation tasks and to a class-based calibration).

Reviewer 2: The paper “Soft Calibration Objectives…” by Karandikar et al. is also about calibration, but other than that it is completely different. The standard calibration approach is temperature scaling, and the cross-entropy objective is used to find the optimal temperature. That paper proposes an optimization objective (based on directly optimizing a differentiable variant of ECE) to find the optimal temperature scaling. In contrast, we propose a new calibration algorithm (weight scaling) and show that it outperforms temperature scaling. We also propose a new closed-form method to find the optimal weight scaling parameters. We will explain this difference in the final version.

Regarding your question “The concept of keeping the same samples’ confidences high is the aspect that is not exactly clear to me …”: weight scaling is performed directly on the network output probabilities. Unlike temperature scaling, we do not pass the calibrated values through the softmax layer. The calibration procedure is described at the top of page 5.
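
To illustrate the contrast drawn in this response, here is a minimal sketch under the stated assumption that logits are available; the temperature T and weight w below are placeholders rather than fitted values. Temperature scaling rescales the logits and re-applies the softmax, while weight scaling mixes the already-normalized probabilities with the uniform distribution and applies no further softmax, leaving the predicted class unchanged.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T=1.5):
    # Temperature scaling: rescale the logits, then re-apply the softmax.
    return softmax(logits / T)

def weight_scale(probs, w=0.8):
    # Weight scaling: mix the already-normalized output probabilities with
    # the uniform distribution; no softmax follows, and the argmax
    # (the predicted class) is left unchanged.
    k = probs.shape[-1]
    return w * probs + (1.0 - w) / k
```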

Reviewer 3: ECE appears in our paper in two places: first as part of the proposed calibration algorithm and second as an evaluation criterion. In our algorithm, following Nixon et al. [1] and others, we used adaECE since it is a more suitable calibration measure than ECE. However, ECE is still the standard way to report calibration results, and it is used by all other papers in the field. Hence, we also used ECE to report our calibration results to enable comparison with previous studies. Our method also achieves better results when it is evaluated using adaECE. In the final version we will report calibration results using other relevant calibration measures.
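
For context on this exchange, a brief sketch of the two estimators under their common definitions: ECE with equal-width confidence bins and adaECE with equal-mass (quantile) bins. The bin count and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def _binned_ece(conf, correct, edges):
    # Weighted average of |accuracy - confidence| over the given bins.
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def ece(probs, labels, n_bins=15):
    # ECE: equal-width confidence bins over [0, 1].
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    return _binned_ece(conf, correct, np.linspace(0.0, 1.0, n_bins + 1))

def ada_ece(probs, labels, n_bins=15):
    # adaECE: adaptive bins holding (roughly) equal numbers of samples.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # ensure every sample falls inside a bin
    return _binned_ece(conf, correct, edges)
```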

Meta-review: 1) “method is similar to Nixon”: see answer to Reviewer 2 (who asked about similarity to an arXiv paper by Karandikar et al.); Nixon is mentioned by Reviewer 3 as a reference for adaECE. 2) “poor discussion of the limitations and future work”: see answer to Reviewer 1. 3) “the paper should clarify the concept…”: see answer to Reviewer 2. 4) “notation should be clarified”: will do. 5) “paper should explain why results based on MCE, adaECE, TACE, and SCE are not shown”: see answer to Reviewer 3.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Paper strengths: 1) the paper is well written; 2) the method is successfully applied to multiple networks and datasets; 3) SOTA results.

    Paper weaknesses: 1) the method is similar to [Measuring Calibration in Deep Learning. CVPR Workshops 2019]; 2) poor discussion of the limitations and future work; 3) the paper should clarify the concept of keeping the same samples’ confidences high; 4) the notation should be clarified; 5) the paper should explain why results based on MCE, adaECE, TACE, and SCE are not shown.

    The rebuttal does not provide a clear discussion of limitations. The difference from Nixon et al. is still unclear. The reason why results based on MCE, adaECE, TACE, and SCE are not shown is also unclear. The rebuttal clarified the other points. Even though this is a borderline paper, I believe it has more pros than cons, so I recommend acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed the major concerns. Although a couple of points remain to be further clarified, I think this paper is worthy of being presented at MICCAI. I recommend the acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Calibration is probably one of the most important topics for successful clinical translation. R2 increased their score after a good rebuttal, and R1 votes for a confident accept. The authors have promised to resolve most concerns in the camera-ready version. Dear authors, please keep your promise.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6


