
Authors

Adrian Galdran, Johan W. Verjans, Gustavo Carneiro, Miguel A. González Ballester

Abstract

Delivering meaningful uncertainty estimates is essential for a successful deployment of machine learning models in clinical practice. A central aspect of uncertainty quantification is the ability of a model to return predictions that are well-aligned with the actual probability of the model being correct, also known as model calibration. Although many methods have been proposed to improve calibration, no technique can match the simple but expensive approach of training an ensemble of deep neural networks. In this paper we introduce a form of simplified ensembling that bypasses the costly training and inference of deep ensembles, yet retains their calibration capabilities. The idea is to replace the common linear classifier at the end of a network with a set of heads that are supervised with different loss functions to enforce diversity in their predictions. Specifically, each head is trained to minimize a weighted cross-entropy loss, but the weights differ across branches. We show that the resulting averaged predictions can achieve excellent calibration without sacrificing accuracy on two challenging datasets for histopathological and endoscopic image classification. Our experiments indicate that Multi-Head Multi-Loss classifiers are inherently well-calibrated, outperforming other recent calibration techniques and even challenging the performance of Deep Ensembles. Code to reproduce our experiments can be found at https://github.com/witheld
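
A minimal sketch of the idea described in the abstract, assuming a PyTorch-style setup (class and function names, and the averaging at inference, are illustrative rather than the authors' exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadClassifier(nn.Module):
    """Shared backbone with several linear classification heads."""
    def __init__(self, backbone: nn.Module, feat_dim: int, n_classes: int, n_heads: int):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_classes) for _ in range(n_heads)]
        )

    def forward(self, x):
        feats = self.backbone(x)
        return [head(feats) for head in self.heads]  # one logit vector per head

def multi_loss(logits_per_head, target, class_weights_per_head):
    """Each head minimizes a cross-entropy with its own class weights."""
    losses = [F.cross_entropy(logits, target, weight=w)
              for logits, w in zip(logits_per_head, class_weights_per_head)]
    return sum(losses) / len(losses)

def predict(logits_per_head):
    """At inference, average the per-head softmax outputs."""
    probs = torch.stack([l.softmax(dim=-1) for l in logits_per_head])
    return probs.mean(dim=0)
```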

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_11

SharedIt: https://rdcu.be/dnwAI

Link to the code repository

https://github.com/agaldran/mhml_calibration

Link to the dataset(s)

https://bupt-ai-cz.github.io/HSA-NRL/

https://datasets.simula.no/hyper-kvasir/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors improve the calibration of an image-based multi-class classifier by adding multiple heads to the end of the model, with the loss of each head increased for a unique class. They compare with other approaches that aim to improve calibration during training, showing improvements across multiple architectures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors’ approach is simple, well justified, and well explained; easy to follow and understand.
    • Good experiments showing generalization across multiple architectures, with experiments showing the approach achieved better accuracy and calibration than most other methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty is limited: it is a simpler version of the mentioned reference [21]. It would be interesting to see a direct comparison with this method, as it seems applicable to this task.

    • Contributions II and III are a bit overstated. I don’t necessarily see why not relying on post-processing methods or deep ensembles is a contribution, since methods that do not use them already exist.

    • No confidence intervals/standard deviations are given in the results.

    • There could be more theoretical justification. It is surprising to me that, for a given class, the unweighted mean of all the heads gives better results than only the head that was specialized on that class. I would like to have seen some discussion of why this is the case. What would the results look like if, at inference, only the specialized head were used for the classification prediction?

    • Further, I can hypothesize that the better calibration is due to very easy samples being classified easily even by the heads not specialized in them, and the hard classes being very difficult for the unspecialized heads, etc. However, I’m not sure, and it would be insightful to see some deeper discussion and analysis of this aspect.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Code will be provided, so under that condition reproducibility is not an issue.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • As mentioned in the weaknesses, more theoretical justification and discussion of why the authors’ method improves calibration would be valuable.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Good evaluation against SOTA across multiple architectures, and a well-explained method. However, the method has somewhat limited novelty, and the discussion of the results is limited.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My main concern was regarding the theoretical/empirical justification of the weighting of the heads, or insight into why it works. The authors did not address this satisfactorily in their rebuttal, nor did they give empirical results. Therefore, my overall opinion remains the same.



Review #3

  • Please describe the contribution of the paper

    The authors present a simple neural network ensembling approach for model calibration for classification problems. The instances in the ensemble share an encoder but have different prediction heads subject to different terms in the training loss. The resulting model is more parameter and compute efficient than deep ensembling (where the instances are completely distinct) while retaining many of the calibration and predictive capabilities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Calibration is an important problem and existing solutions have important drawbacks in terms of predictive performance, calibration performance, or compute efficiency. The simple approach put forward by this paper measures well across each of these criteria for the datasets under consideration.
    2. The authors perform a thorough evaluation of their approach relative to other available approaches including a dispersion analysis currently available as supplemental material.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The first main weakness of the paper surrounds the motivation for the weighting scheme and the lack of comparison with alternative weighting strategies for the multiple prediction heads. The authors rightly point out the issue of symmetry among the multiple prediction heads and argue that this can lead to dead heads as in [21]. Following equation 4, however, they argue “It follows that we indeed require a different weight in each branch”. This is not a sound argument because Eq. 4 uses the assumption p^m \approx p^{\mu}, which may not be the case depending on how the heads are initialized. It may be the case in practice. If so, the authors should provide support.

    Setting this issue aside and assuming that it is a good idea to explicitly break the symmetry between the various heads, the authors’ proposed weighting scheme is poorly motivated and there is no experimentation in place to test obvious alternative approaches. First, the authors limit themselves to the case M<=K, “as otherwise we would need to have different branches specializing in the same category”. Relative to deep ensembling, this seems to place an undue restriction on the possible diversity in the resulting ensemble. Why not have more heads than classes and then randomly assign weights for each head? It is unclear why it is desirable to have heads that specialize to specific classes. The authors may have a good reason, but it is unclear from the text why this decision was made. Having made this decision, the authors describe their procedure for setting the weights for each head. Again, very little motivation is given for this procedure, and the reader can think of many reasonable alternatives, such as assigning twice the weight to the specialized classes as to the others, or using the scheme described in the paper but then normalizing so the weights add to one. Or simply choosing weights randomly and normalizing.

    The second weakness is insufficient detail about the evaluation procedure. Though the evaluation metrics are described, it is unclear how these measures were computed (e.g., via cross-validation or a single holdout set). What were the sizes of the evaluation sets? The reproducibility checklist says that these details are included in the public repository, but since these details are important for evaluating the results, I believe it is essential to include them in the paper itself.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work uses publicly available datasets, readily available model architectures, and the authors express intention to share all the relevant code. Highly reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The main paper omits the dispersion analysis to save space. This analysis is available in the supplemental material. Considering the dispersion results, some of the paper’s claims are more ambiguous. For instance, “as we diversify the losses and increase the number of heads we tend to improve calibration” (Chaoyang results paragraph, last sentence). This tendency is probably correct, but it is not unambiguous looking at the dispersion results. I think the paper would be significantly strengthened by including some of the dispersion analysis in the main paper itself.
    2. The results would be strengthened if the authors evaluated the idea of having more prediction heads and explored a range of reasonable choices for the weighting scheme: the null option (no weighting, hoping that random initialization is sufficient to break the symmetry), random weighting, and the procedure described in the paper. The current sensitivity analysis is very limited. I expect the authors can achieve even better results by modifying this procedure. If the authors tried other schemes before settling on the current one, they should report on this process as motivation.
    3. The paragraph immediately before section 2.3 is very difficult to read and I believe uses inconsistent notation (K is referred to as the number of classes at one point and then used as the weight assignment later in a way that is inconsistent with the provided example). I was unable to fully understand the authors’ intended procedure.
    4. Sec 3.1: the Swin Transformer reference is incorrectly reused for the ConvNeXt.
    5. The paper claims that the supplemental material includes results on additional datasets. This does not appear to be the case.
    6. The last sentence of the Kvasir results paragraph says “two out of three” when the table seems to show “three out of three”.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of clarity in the motivation and a lack of experimentation around the weighting approach are serious issues. However, the work proposes a novel approach to an important problem and demonstrates that one manifestation of this approach produces promising results. I would like to see a more thorough examination of this idea, but the paper still has sufficient value for publication.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper
    • The paper addresses miscalibration of classification models in medical imaging, which is an important and often overlooked problem.
    • A simple, yet effective multi-head multi-loss approach is proposed that tries to bridge the gap between point estimates of deterministic classifiers and deep ensembles.
    • The proposed approach has considerably lower computational demands than a full deep ensemble, but comes very close to the ensemble’s performance with respect to classification accuracy and model calibration.
    • A multi-loss (differently weighted cross-entropy) is proposed to diversify the posterior distributions of the different heads.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper presents a simple, yet effective method. It enables much better calibration by diversifying the posteriors of the M-heads without greatly increasing the computational burden (in contrast to Bayesian neural nets or deep ensembles). This could be of great help for deploying deep models to clinical practice and make better calibrated models more affordable.
    • The method allows for combination with post-hoc recalibration, such as temperature scaling, which improves the model calibration even further.
    • Evaluation is done on two challenging datasets that exhibit high uncertainty due to label ambiguity and noise.
    • The figures are of high quality and aid the reader in understanding the core concept of the approach.
    • The paper is well-written and easy to follow.
    • The relevant related work is referenced.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • A paper with a simple core contribution of limited novelty is always appreciated if the benefit and effectiveness of the method are analyzed well and its many applications are shown in an extended experimental evaluation.
    • However, this paper does not present or evaluate its simple method on any of the many possible downstream tasks that could have been used to show the advantages of MHML model calibration (see comments below for suggestions).
    • The paper uses a dataset with a comparatively high number of classes (23). However, it is well-known that ECE fails at capturing the miscalibration in this case and more appropriate metrics should be used, e.g., the classwise ECE that was presented by Kull et al. (2019), a paper that the authors already reference.
    • The authors use a “rank” metric to rank the comparative methods. It is unclear to the reader how “rank” is computed. The last sentence in § 2.3 does not sufficiently explain the metric.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is very good as,

    • the method is described sufficiently well to be reimplemented without the code,
    • the code will be released after acceptance,
    • publicly available datasets are used,
    • common network architectures are used, and
    • all important hyper-parameters are reported.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • As mentioned above, I would expect a detailed and exhaustive experimental evaluation of the presented method to show its effectiveness and justify publication.
    • Instead of analyzing different model architectures, I would expect a greater variety in tasks, e.g., show that MHML calibration also works for segmentation, regression (bounding box, landmarks, etc.).
    • A more detailed evaluation would also include the presentation of downstream tasks; i.e., what to do with the calibrated uncertainty, e.g., rejection of uncertain predictions, out-of-distribution detection, maybe even improved active learning.
    • Please use more appropriate calibration metrics for multi-class calibration, such as classwise ECE (Kull et al., 2019).
    • The authors already compare to some training-time calibration methods for deterministic neural networks. Additionally, I would appreciate a comparison to at least one Bayesian approach (MC dropout, SWA-Gaussian, MFVI, KFAC, etc.).
    • The authors motivate their method by addressing the reduced computational demand compared to deep ensembles. Yet, no runtimes are reported. To fully assess the trade-off between full ensembles and M-heads, one has to consider runtime and memory demands.
    • Regarding the presentation of results: in the medical domain, results should always be reported with confidence intervals (from, e.g., bootstrapping or, ideally, repeated runs with different random seeds), and statistical tests should be used to show the significance of the results, e.g., a paired t-test with multiple-comparisons correction for model evaluation. Simply stating a single mean value from a single run is not enough.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The core contribution of this paper boils down to the cross-entropy weighting in the loss. Given the very limited experimental evaluation of the simple method (see my suggestions above), I cannot recommend acceptance for MICCAI. However, the presented content of the paper seems to be correct, and the method could be of interest to the community if its effectiveness had been analyzed in depth and shown on a variety of downstream tasks. I therefore vote for “weak reject”.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Most of my concerns were addressed in the rebuttal, including classwise ECE, statistical tests and comparison to Bayesian approaches. However, my suggestion of an in-depth evaluation with different applications was shifted to future work. I increase my score accordingly.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work addresses the miscalibration of classification models by using multiple heads in the prediction layer, and demonstrates simple yet effective evaluation results. The proposed method is well justified, explained, easy to follow, and can be combined with post-hoc recalibration techniques like temperature scaling. However, several concerns were raised by the reviewers: (i) limited novelty, with closely related works missing (R1, R3); (ii) an additional ablation study on the weighting is required (R3); (iii) lack of evaluation on more possible downstream tasks like segmentation and detection (R4); (iv) classwise ECE should be provided, which captures miscalibration better than vanilla ECE (R4).




Author Feedback

We appreciate that the reviewers found aspects of our paper appealing. Following the AC’s indications, we answer the main issues raised by R1, R3, and R4, which we organize into two blocks: Evaluation and Motivation.

A) Evaluation: the main concern of R4. A1) R4 mentions that ECE is not suitable for datasets with many categories, suggesting class-wise ECE (i.e., macro-ECE: one-vs-rest ECE per class, then averaged). Following this advice we computed cw-ECE, finding that this metric favors our approach even more strongly, which suggests that our technique might better handle minority classes.
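
For concreteness, a minimal sketch of class-wise ECE in the sense described above (one-vs-rest ECE per class, then averaged); the binning scheme and function name are illustrative, not the paper’s exact evaluation code:

```python
import numpy as np

def classwise_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) int labels."""
    n_classes = probs.shape[1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    per_class = []
    for k in range(n_classes):
        conf_k = probs[:, k]                    # predicted probability of class k
        hit_k = (labels == k).astype(float)     # one-vs-rest ground truth
        ece_k = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (conf_k > lo) & (conf_k <= hi)
            if in_bin.any():
                # |accuracy - confidence| gap, weighted by the fraction of samples in the bin
                gap = abs(hit_k[in_bin].mean() - conf_k[in_bin].mean())
                ece_k += in_bin.mean() * gap
        per_class.append(ece_k)
    return float(np.mean(per_class))
```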

A2) R1 & R4 also request that results include averaging over multiple runs and reporting dispersion. In fact, dispersion measures were already present in the appendix, which is probably why they went unnoticed. We now discuss them briefly within the paper, as proposed by R3, add p-values for pairwise bootstrapped performance differences (with Bonferroni correction), and stress their presence in the appendix.

A3) R4 observes that our approach could be applied in other scenarios: segmentation, object detection, OoD rejection, active learning, and so on. As much as we would like to, space in MICCAI is limited and we already struggled to fit the current version. However, we already have positive results on brain MRI lesion segmentation tasks, and we will address R4’s suggestions in follow-up work.

A4) Finally, R4 asks for a comparison with a Bayesian-inspired approach. We trained MC-Dropout models for both datasets; results were consistently below the proposed approach, for all head combinations.

B) Motivation: Why and how to weight heads? The main concern of R1 and R3. B1) Why category-weighted heads? We hypothesize that, in datasets with some class imbalance, letting heads focus on a subset of categories by “tuning down” the loss of samples from the other classes favors calibration. For this reason, we emphasize the loss of some classes in each head. How do we decide which classes? This should probably be dataset-dependent, and we expect to develop at least a sound heuristic strategy in future work. Here, in the absence of a clear answer, we decided to do this randomly and evaluate by repeated training runs.

B2) R1 and R3 also mention a close work [Linmans et al., MedIA 2023], as they also train multiple heads for uncertainty quantification. The fundamental difference is that, for a given batch, they backpropagate through the head achieving the lowest loss, regardless of the categories of the elements within the batch. This provides a certain amount of specialization, but fails to explicitly model the inherent difficulty posed by minority-class hard examples, which will occasionally appear in a batch with other majority-category examples: their loss will be averaged out within the batch loss. By leveraging native CE class weighting, we manage to explicitly make branches specialize without relying on batch-level loss values.

  • Remark: we used the code by Linmans et al. for training, obtaining lower performance, -4.04/-2.42 NLL (p<0.0001) using 4 heads. Note that this method was proposed for binary classification.
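
To make the contrast in B2 concrete, here is a minimal sketch under our reading of the two strategies (function names are illustrative; neither snippet is the reference implementation of [Linmans et al.] or of the paper):

```python
import torch
import torch.nn.functional as F

def min_loss_head_update(logits_per_head, target):
    """Winner-takes-all update: only the head with the lowest batch loss
    receives gradients when this value is backpropagated."""
    losses = torch.stack([F.cross_entropy(l, target) for l in logits_per_head])
    return losses.min()

def class_weighted_heads_update(logits_per_head, target, class_weights_per_head):
    """Alternative used here: every head is always updated, but each with its
    own class weighting inside the cross-entropy."""
    losses = [F.cross_entropy(l, target, weight=w)
              for l, w in zip(logits_per_head, class_weights_per_head)]
    return sum(losses) / len(losses)
```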

B3) How to weight heads? R3 saw a lack of motivation, which we hope has been partly addressed above. R3 (and the AC) also requested a sensitivity analysis for weighting schemes. Why not random weighting, or no weights at all? To answer this, let us first note that no-weighting was already included (as 2HSL, where only random initialization breaks head symmetry). Random weighting is an interesting suggestion, as it tests the hypothesis that letting some heads learn certain categories better favors calibration, particularly under the typical class imbalance of medical datasets. We experimented with random and random+normalized weighting; results are consistently below our weighting scheme, and more unstable across training runs. Due to lack of space, we cannot go deeper into this, but a study on weighting will be part of a future extension, including (as per R1) an analysis of head behavior on easy/hard examples and its impact on calibration.
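
A minimal sketch of how per-head class weights could be assembled, alongside the alternatives discussed above; the emphasis factor, the random disjoint assignment, and the function name are assumptions for illustration, not the paper’s exact scheme:

```python
import torch

def make_head_weights(n_classes: int, n_heads: int, emphasis: float = 2.0, seed: int = 0):
    """Each head up-weights a random, disjoint subset of classes (illustrative scheme)."""
    g = torch.Generator().manual_seed(seed)
    groups = torch.randperm(n_classes, generator=g).chunk(n_heads)
    weights = []
    for group in groups:
        w = torch.ones(n_classes)
        w[group] = emphasis          # classes this head specializes in
        weights.append(w)
    return weights

# Alternatives raised in the reviews/rebuttal:
#   no weighting:      w = torch.ones(n_classes) for every head (only random init breaks symmetry)
#   random weighting:  w = torch.rand(n_classes)   (optionally w / w.sum() to normalize)
```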




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified most of the major concerns of the reviewers, specifically by using the suggested evaluation metrics, providing additional performance comparisons, and clarifying the motivation. Although a summary of the exact quantitative analysis of the additional evaluation is missing from the rebuttal, the rebuttal manages to address most of the main concerns of the reviewers. I think the paper is in an acceptable state and would make an interesting addition to MICCAI.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a simple and effective approach for calibration of multi-head classifiers.

    After rebuttal, the paper has three borderline positive reviews – the reviewers appreciate the simple yet effective approach, but at the same time request a more thorough experimental validation on the types of problems where the calibration could be expected to make a difference. In their rebuttal, the authors have deferred this to future work. On the positive side, however, they have carried out validation with respect to additional metrics (in particular classwise ECE, as requested by reviewer 4), but they do not provide the actual numbers. Note also that classwise ECE is likely to suffer from sample size bias, see e.g. Petersen et al, FAccT 2023, which emphasizes the need for actually seeing the numbers to critically assess them.

    This is not critical enough to go against the vote of 3 reviewers, and I will go with their decision to recommend acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors addressed the concerns well, and the paper describes a relevant and very interesting problem. Calibration is important and this work might add positively to MICCAI.


