
Authors

Zeju Li, Konstantinos Kamnitsas, Mobarakol Islam, Chen Chen, Ben Glocker

Abstract

Machine learning models are typically deployed in a test setting that differs from the training setting, potentially leading to decreased model performance because of domain shift. If we could estimate the performance that a pre-trained model would achieve on data from a specific deployment setting, for example a certain clinic, we could judge whether the model could safely be deployed or if its performance degrades unacceptably on the specific data. Existing approaches estimate this based on the confidence of predictions made on unlabeled test data from the deployment’s domain. We find existing methods struggle with data that present class imbalance, because the methods used to calibrate confidence do not account for bias induced by class imbalance, consequently failing to estimate class-wise accuracy. Here, we introduce class-wise calibration within the framework of performance estimation for imbalanced datasets. Specifically, we derive class-specific modifications of state-of-the-art confidence-based model evaluation methods including temperature scaling (TS), difference of confidences (DoC), and average thresholded confidence (ATC). We also extend the methods to estimate Dice similarity coefficient (DSC) in image segmentation. We conduct experiments on four tasks and find the proposed modifications consistently improve the estimation accuracy for imbalanced datasets. Our methods improve accuracy estimation by 18% in classification under natural domain shifts, and double the estimation accuracy on segmentation tasks, when compared with prior methods.
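To make the recipe in the abstract concrete, below is a minimal, hypothetical NumPy sketch of class-wise temperature scaling followed by an average-confidence estimate of accuracy on unlabeled data. The grid-search fitting and the names `fit_classwise_temperature` and `estimate_accuracy` are illustrative assumptions, not the authors' released implementation, which also covers DoC, ATC, and DSC estimation for segmentation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature; T may be a scalar or a per-sample column vector."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_classwise_temperature(val_logits, val_labels, grid=np.linspace(0.1, 10.0, 200)):
    """Fit one temperature per class on labeled validation data so that, among
    samples *predicted* as class c, the mean confidence matches the observed accuracy."""
    preds = val_logits.argmax(axis=1)
    n_classes = val_logits.shape[1]
    temps = np.ones(n_classes)
    for c in range(n_classes):
        idx = preds == c
        if not idx.any():
            continue  # class never predicted on the validation data
        acc_c = (val_labels[idx] == c).mean()
        gaps = [abs(softmax(val_logits[idx], T).max(axis=1).mean() - acc_c) for T in grid]
        temps[c] = grid[int(np.argmin(gaps))]
    return temps

def estimate_accuracy(test_logits, temps):
    """Average-confidence estimate of accuracy on unlabeled test data, applying
    the temperature of each sample's predicted class (argmax is unchanged by scaling)."""
    preds = test_logits.argmax(axis=1)
    T = temps[preds][:, None]
    conf = softmax(test_logits, T).max(axis=1)
    return conf.mean()

# Usage with hypothetical arrays:
#   temps = fit_classwise_temperature(val_logits, val_labels)
#   est_acc = estimate_accuracy(test_logits, temps)  # no test labels needed
```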

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_66

SharedIt: https://rdcu.be/cVRXE

Link to the code repository

https://github.com/ZerojumpLine/ModelEvaluationUnderClassImbalance

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript describes a method for DL model output probability calibration, which adds parameters for class-specific tuning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The description of the method is clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The method described is not innovative. Probability calibration is a well-studied field, and there are many well-established methods that need to be compared against (see On Calibration of Modern Neural Networks by Guo et al.). In addition, class-specific calibration has already been explored and widely adopted to address issues caused by training data imbalance (see Improving Class Probability Estimates for Imbalanced Data by Wallace et al.).
    • This is not merely a domain adaptation problem but a general ML problem; tying it to DA is restrictive.
    • Results: The Authors should also compare with the unseen test set performance within the same domain to evaluate how calibration works without domain shift.
    • Results are unreliable: Fig 3 shows that the predicted accuracy values do not spread well across the range from 0 to 1, making the linear fitting unreliable.
    • The manuscript also does not explain well why probability calibration is an important topic for medical imaging problems.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method should be simple to implement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Probability calibration, especially in neural networks, is an interesting and important topic. The Authors should take a look at the current SOTA probability calibration work (many review papers) and go from there.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Low innovation level of the work. Unconvincing experiment & results.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper addresses the problem of estimating the predictive performance of a machine learning model on unseen data, where no ground truth labels are available. The authors adopt the concept of average confidence scores and state that the confidence is especially miscalibrated on imbalanced datasets. The main contributions are class-wise confidence calibration methods that greatly improve the performance estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Performance estimation on new, unlabeled data is a highly relevant problem for integrating trained machine learning models into clinical routine. The miscalibration of overconfident classifiers is a problem in its own right. The authors tackle the first problem by solving the latter, which is an interesting approach. Existing methods are extended by class-wise confidence calibration, which helps to improve performance estimation. The proposed method is extensively evaluated on different classification and segmentation datasets with introduced domain shift. The authors clearly show the benefit of class-specific over global calibration in Tab. 1.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The main weakness of this paper is that it presents class-wise calibration, especially with temperature scaling, as its own contribution without acknowledging prior work. E.g., Guo et al. (2017) already suggested vector scaling as a class-specific extension to temperature scaling. Kull et al. (2019) and Nixon et al. (2019) further discuss these topics in the context of calibration error metrics. The actual contribution of this paper seems to be the use of class-wise calibration in the context of performance estimation. This should clearly be stated as such.

    2. Given my first point, the actual novelty of this paper seems quite low. Calibrated performance estimation has already been intensively investigated by Guillory et al. (2021) and class-wise calibration is also not novel. Luckily, the authors clearly show the benefit of the combination of the two approaches (see strengths).

    3. I expect Tab. 1 to at least include MAE ± std. dev., or even better, box plots instead. Instead of using bold font to highlight the best MAE, proper statistical tests showing statistically significant improvements would be much appreciated.

    Kull, M., Perello Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., & Flach, P. (2019). Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems, 32.

    Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019, June). Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    With the additional information in the supplemental material, it should be possible to reproduce the results of the paper. The use of public datasets further helps with reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I think this paper would be a good fit for MICCAI. However, the contribution statement has to be fixed prior to acceptance and the results have to be presented as described above (see weaknesses).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty is low, but the addressed problem is important and the proposed method seems effective. I therefore suggest a weak accept.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors have addressed my main concerns in the rebuttal. They promised to fix the contribution statement. Having read the diverging reviews, I agree with the authors that some reviewers might have interpreted the paper’s main claim from the wrong viewpoint of advancing model calibration. However, this can also be attributed to the unclear contribution statement. I increase my score to ‘accept’.



Review #3

  • Please describe the contribution of the paper

    Various methods have been developed to estimate how well a trained model will perform on out-of-distribution data. These methods do not account for class imbalances. The authors propose new class-aware modifications to such methods to take rare classes into account when estimating a model’s confidence. They carry out a thorough evaluation for both image classification and segmentation on different distribution shifts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The evaluation is very thorough and for multiple use cases (classification and segmentation). The data corpus includes many publicly available datasets.

    • The results show a clear improvement after taking advantage of the proposed modification.

    • The paper follows a clear structure and maintains good writing quality. The authors clearly state how their method is different from existing ones and the math is rigorous. Figure 2 also provides a nice summary. Only very limited prior knowledge on the topic is required to understand the paper. The authors also explain how they adapt their methods to image segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors state that their method differs from class-distribution-aware TS for long-tail problems and list a few reasons. Yet I have trouble grasping how significant these differences are. It would have been interesting to compare the authors’ method with these existing methods on long-tailed problems.

    • I find it unusual how the background class is handled for the segmentation task. Similar methods such as [29] also require separate handling of the background compared to other classes. It might be interesting to verify how relevant the Dice for the background is for the problem at hand, especially considering that the background is most often the majority class in segmentation problems and that this Dice is rarely reported.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficient reproducibility is ensured.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Writing:

    • Since the authors aim for American English, the correct spelling for “labelled” is “labeled”
    • Please check all the indices in Fig 2 for DoC, ATC, and TS-ATC
    • Eq (4): “d_y’” → “d_j”

    Some other suggestions:

    • Abstract: “be deployed or its performance” → “be deployed or if its performance”
    • Section 2 Method: “achieve the goal” → “achieve this goal”
    • Section 2 Method “all the training pairs for the case z of totally n pixels”: “of totally n pixels” should be rephrased.
    • Figure 1: there is too much content for a quite small figure. At least fill the entire page width, but I would recommend removing some details.
    • Table 1: Even though the best results are in bold, they are not easily noticeable.
    • Supplementary material: Section C appears empty because the table appears on the next page. Maybe rearrange the section placement.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The quality of this paper is above average. The content is clear, the authors clearly explain the novelty of their methods and the results are good.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I continue to recommend accepting this paper. The problem discussed is highly relevant, as also highlighted by reviewer 2, and the contribution is valuable. In addition, now that several concerns have been answered, I am further convinced that this paper is a good fit for MICCAI. So I stand by my initial statement.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The manuscript addresses an important problem, deep learning model calibration on imbalanced datasets, by using class-specific confidence scores. While there are merits in the thorough evaluation and clear performance improvement, reviewers raised important concerns on (1) novelty of the method (e.g., Guo et al. (2017) already suggested vector scaling as a class-specific extension to temperature scaling; calibrated performance estimation and class-wise calibration are not new, though the combination of the two did show improvement in this paper); (2) reliability of the results (e.g., Fig. 3); and (3) evaluation: a comparison with unseen test set performance within the same domain to assess how calibration works without domain shift.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank all reviewers for their time and feedback. Reviewers point out that the addressed problem is important (R1-2), the approaches are interesting and sound (R2-3), the evaluation is thorough (R2-3: 4 tasks, 310 conditions), the improvements are significant (R2-3: 2x accuracy for segmentation), and the presentation is clear (R1-3). As highlighted by R2, our paper is a good fit for MICCAI, as it investigates how to estimate model performance on unseen data, to enable safer ML deployment in healthcare, and highlights the importance of tackling class imbalance when using confidence-based performance-estimation methods in medical imaging. We provide clarifications below about the main concerns to resolve misunderstandings.

  1. R1 on problem definition, ties to domain shift being restrictive, and no evaluation without domain shift as weaknesses: The assessment by R1 seems mainly done from the viewpoint of advancing model calibration. We believe this is a misunderstanding that may not have allowed R1 to appreciate the work’s strengths. As explained in Sec1 and recognized by R2/R3, our primary goal is estimating model performance on unlabeled data, to identify potential failure at deployment. Domain shifts cause performance degradation and we therefore study them, as in related works (e.g., [11]). We study confidence-based methods for performance estimation (Sec1 par2), hence calibration is the means, not the main goal. We note that uncalibrated confidence is more likely under domain shift (see ‘Can you trust your model’s uncertainty?’). This may be the reason why confidence-based performance-estimation methods were not previously used in medical imaging, where domain shifts are common, making this study valuable. Finally, we believe our extensive evaluation with multiple synthetic and natural domain shifts should be acknowledged as a strength (as per R2/R3), rather than a weakness. We kindly ask the reviewer to reconsider the value of our work within the context of performance estimation rather than mainly as advancing model calibration.

  2. R1, R2 on method novelty: We agree that class-wise calibration was explored previously for obtaining better uncertainty. Importantly, we emphasize that the main contribution of our work is showing the importance of introducing class-wise calibration within the framework of performance estimation, especially in medical imaging where class imbalance is prevalent. This contribution is recognized by R2, and we will rephrase the claims to make them crystal clear. On the technical side, we already discuss differences from class-distribution-aware TS (CDA TS) [17,29] in Sec1 par3. Vector scaling (VS) [12] differs from our work in that CS TS is optimized so that class-specific confidence matches class-specific accuracy (cf. Eq2), whereas in VS the parameters are optimized such that overall confidence matches overall accuracy, similar to the TS objective [12] (a brief code sketch contrasting the two objectives follows this list). We validate this by adding further experiments in Tab1 with ATLAS-Syn TS:9.7, CS TS:1.6, VS:11.4, CDA TS:21.4; Prostate-Syn TS:3.7, CS TS:3.0, VS:4.8, CDA TS:6.8; Prostate-Nat TS:9.2, CS TS:7.8, VS:11.2, CDA TS:14.8. Moreover, our work introduces CS variants of multiple methods, which has not been done before, and, importantly, shows for the first time their application to segmentation, with very promising results.

  3. R1 is concerned that results are unreliable due to the linear fitting in Fig3: We believe this is a misunderstanding. The primary quantitative evaluation uses MAE and R2 score (Tab1 & supp), on which the main claims of the text are based. Tab1 shows clear improvements, as agreed by R2/R3. The linear fitting is just an auxiliary visual representation of the results. It allows us to observe that our methods are closer to y=x in all settings, which is a reassuring indication, complementing the results in Tab1.

  4. R3, we apply CDA TS from [17] and summarize the results in point 2 above.

  5. R2 suggests addition of std.dev and statistical tests in Tab1: Absolutely. Calculated. Consider it done.
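As an editorial illustration of the distinction drawn in point 2 of the feedback, the hypothetical NumPy sketch below writes out the two fitting criteria side by side: the per-class confidence-accuracy matching described for CS TS, and the standard negative log-likelihood objective with a per-class scale and bias that defines vector scaling in Guo et al. (2017). Names and details are assumptions for illustration only, not the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cs_ts_objective(temps, logits, labels):
    """Class-specific TS as characterized above: for each class c, the mean
    confidence of samples predicted as c should match the accuracy on them."""
    preds = logits.argmax(axis=1)
    gaps = []
    for c, T in enumerate(temps):
        idx = preds == c
        if not idx.any():
            continue
        conf_c = softmax(logits[idx] / T).max(axis=1).mean()
        acc_c = (labels[idx] == c).mean()
        gaps.append(abs(conf_c - acc_c))
    return float(np.sum(gaps))  # one matching constraint per class

def vs_objective(w, b, logits, labels):
    """Vector scaling (Guo et al., 2017): per-class scale w and bias b on the
    logits, fitted with a single global criterion (validation NLL)."""
    probs = softmax(logits * w + b)
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(nll.mean())  # one global objective, no per-class matching
```

The contrast is that `vs_objective` has class-specific parameters but a single global criterion, whereas `cs_ts_objective` enforces a separate confidence-accuracy match for every class, which is what makes it suitable for class-wise accuracy estimation under imbalance.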




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Authors have comprehensively addressed reviewers’ concerns and clarified their contributions. I would recommend accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    11



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors clarified the novelty of their work in the rebuttal, which sounds convincing. Estimating model performance on an out-of-distribution new dataset without labels is a key problem for the community to study. Thus, this paper may be an interesting contribution to be presented at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper introduces class-wise calibration within the framework of performance estimation for imbalanced datasets. However, the novelty is doubted by reviewers R1 & R2, despite a positive score being given.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9


