
Authors

Changjian Shui, Justin Szeto, Raghav Mehta, Douglas L. Arnold, Tal Arbel

Abstract

Trustworthy deployment of deep learning medical imaging models into real-world clinical practice requires that they be calibrated. However, models that are well calibrated overall can still be poorly calibrated for a sub-population, potentially resulting in a clinician unwittingly making poor decisions for this group based on the recommendations of the model. Although methods have been shown to successfully mitigate biases across subgroups in terms of model accuracy, this work focuses on the open problem of mitigating calibration biases in the context of medical image analysis. Our method does not require subgroup attributes during training, permitting the flexibility to mitigate biases for different choices of sensitive attributes without re-training. To this end, we propose a novel two-stage method: Cluster-Focal to first identify poorly calibrated samples, cluster them into groups, and then introduce group-wise focal loss to improve calibration bias. We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients. In addition to considering traditional sensitive attributes (e.g. age, sex) with demographic subgroups, we also consider biases among groups with different image-derived attributes, such as lesion load, which are required in medical image analysis. Our results demonstrate that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, and outperforming recent baselines.
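The two stages named in the abstract (identify poorly calibrated samples, cluster them, then apply a group-wise focal loss) can be illustrated with a minimal sketch. It is not the authors' implementation. The gap definition (absolute difference between confidence and 0/1 correctness), the choice of two clusters, and the averaging of per-cluster losses are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def calibration_gap(confidence, correct):
    """Assumed gap: distance between predicted confidence and 0/1 correctness."""
    return abs(confidence - float(correct))


def two_means_1d(values, iters=20):
    """Tiny 1-D k-means with k=2 (assumed k); returns a 0/1 label per value."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer of the two centres.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        g1 = [v for v, l in zip(values, labels) if l == 1]
        if g0:
            lo = sum(g0) / len(g0)
        if g1:
            hi = sum(g1) / len(g1)
    return labels


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for the probability assigned to the true class."""
    p = min(max(p_true, 1e-12), 1.0)
    return -((1.0 - p) ** gamma) * math.log(p)


def group_wise_focal_loss(p_true, labels, gamma=2.0):
    """Assumed aggregation: mean focal loss per cluster, averaged over clusters."""
    groups = {}
    for p, l in zip(p_true, labels):
        groups.setdefault(l, []).append(focal_loss(p, gamma))
    return sum(sum(g) / len(g) for g in groups.values()) / len(groups)


# Stage 1: gaps from an identification model's confidences, then clustering.
confidences = [0.95, 0.9, 0.85, 0.9, 0.6]
correct = [1, 1, 1, 0, 0]
gaps = [calibration_gap(c, y) for c, y in zip(confidences, correct)]
clusters = two_means_1d(gaps)

# Stage 2: the cluster labels weight the loss of the prediction model.
p_true = [0.95, 0.9, 0.85, 0.1, 0.4]
loss = group_wise_focal_loss(p_true, clusters)
```

Averaging per cluster gives the small, poorly calibrated cluster the same influence as the large, well calibrated one, which is the intuition behind a group-wise loss; no demographic attribute appears anywhere in the sketch.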

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_19

SharedIt: https://rdcu.be/dnwAQ

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    To mitigate calibration bias in medical image analysis, the proposed method does not require subgroup attributes during training. It first trains a model (fid) to measure calibration properties and computes a gap based on the predictions of fid. The samples are then grouped according to this gap, and focal loss is applied to each group when training the prediction model. The proposed method showed the lowest calibration error in terms of Q(uantile)-ECE.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method showed the lowest calibration error in terms of Q(uantile)-ECE.
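For readers unfamiliar with the metric, the sketch below shows one common way to compute a quantile-binned expected calibration error and its worst-subgroup value. It assumes equal-count bins over sorted confidences; the paper's exact binning and bin count are not stated on this page, and the function names are hypothetical.

```python
def quantile_ece(confidences, correct, n_bins=2):
    """ECE with equal-count (quantile) bins instead of equal-width bins."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n = len(order)
    ece = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        # Weight each bin's |accuracy - confidence| by its share of samples.
        ece += len(idx) / n * abs(acc - conf)
    return ece


def worst_group_ece(confidences, correct, groups, n_bins=1):
    """Largest quantile-binned ECE over the subgroups given at evaluation time."""
    worst = 0.0
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        value = quantile_ece([confidences[i] for i in idx],
                             [correct[i] for i in idx], n_bins)
        worst = max(worst, value)
    return worst


overall = quantile_ece([0.6, 0.6, 0.9, 0.9], [1, 0, 1, 1])
worst = worst_group_ece([0.6, 0.6, 0.9, 0.9], [1, 0, 1, 0],
                        ["a", "a", "b", "b"])
```

Subgroup labels are used only at evaluation time here, so the same trained model can be scored against different choices of sensitive attribute.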

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method seems just a combination of existing methods rather than a novel technique.
    • A sensitivity study for the k value in K-means clustering and the focal loss hyperparameter should be provided.
    • A more novel method, such as graph-based clustering, may be needed.
    • In the ablation study, focal loss seems to account for most of the performance gain, and clustering shows only a slight additive effect.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Maybe, the algorithm is not complex.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to the comments in No. 6.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though it is a combination of existing methods, the proposal showed the lowest calibration error in terms of Q(uantile)-ECE.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper addresses the problem of subpopulation disparities in terms of classifier calibration, in contrast to previous work that addresses subgroup biases w.r.t. performance (e.g. accuracy). The authors demonstrate that their method shows better worst-group calibration error than a suite of comparison methods, while maintaining reasonable classifier performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This is solid work and a really well written paper. The authors address an important problem and do a good job at defining and motivating the problem setting clearly.

    • The method is evaluated thoroughly and meaningfully, and the ablation experiments make sense.

    • The approach does not require subgroup attributes during training.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I suspect the K-means clustering part of the two-step pipeline is a bit overly complicated. The variable being clustered (the “calibration gap”) is a scalar value, so anyway I would not recommend K-means clustering. Furthermore, the clusters serve a very simple purpose: They are only used to weigh the training samples according to their miscalibration. I believe this could be achieved much more simply and directly with a simple histogram of the calibration gap, and then reweighting the individual samples according to the histogram bins for training the final classifier.
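The histogram-based alternative suggested in this comment could look roughly like the sketch below. It is one possible reading of the suggestion, not something evaluated in the paper: the equal-width bins on [0, 1], the inverse-frequency weighting, and the function name are all assumptions.

```python
def histogram_weights(gaps, n_bins=5):
    """Per-sample weights inversely proportional to the sample's bin count."""
    # Equal-width bins on [0, 1]; a gap of exactly 1.0 falls in the last bin.
    bins = [min(int(g * n_bins), n_bins - 1) for g in gaps]
    counts = {}
    for b in bins:
        counts[b] = counts.get(b, 0) + 1
    occupied = len(counts)
    n = len(gaps)
    # Each occupied bin receives the same total weight; weights sum to n.
    return [n / (occupied * counts[b]) for b in bins]


weights = histogram_weights([0.05, 0.1, 0.15, 0.9])
```

In this example the single sample with a large gap receives a weight of 2.0, while the three samples sharing the first bin receive about 0.67 each, so rare, poorly calibrated samples are emphasised without any clustering step.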
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Looks fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It would be interesting to see histograms of the calibration gap variable, perhaps even grouped by subgroup.

    • The text in the figures is too small.

    • What is meant by “calibration bias mitigation” during test time (mentioned in introduction)? This becomes clear with further reading, but not yet at that part of the manuscript.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important problem, the manuscript is well written and the solution is well executed and evaluated.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a new two-stage method, named Cluster-Focal, that aims to reduce calibration bias in medical imaging models. One of the main advantages of this method is that it does not require subgroup attributes during training, which makes it flexible enough to mitigate biases for different choices of sensitive attributes without the need for re-training. The authors tested the proposed method on skin lesion classification using the public HAM10000 dataset, as well as on predicting future lesional activity for multiple sclerosis (MS) patients. The results demonstrate that the Cluster-Focal method is successful in reducing calibration error in the worst-performing subgroups without causing severe degradation in prediction performance. This makes it a promising approach for improving fairness and trustworthiness in the deployment of deep learning medical imaging models in real-world clinical practice.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel formulation: The Cluster-Focal method is a two-stage training strategy that aims to reduce calibration bias in deep learning medical imaging models. This approach identifies poorly calibrated samples and introduces group-wise focal loss to mitigate the bias, without requiring pre-labeled subgroups during training. This makes it flexible for different choices of sensitive attributes.
    2. Original way to use data: The authors tested the Cluster-Focal method on skin lesion classification with the HAM10000 dataset and predicting future lesional activity for multiple sclerosis (MS) patients. They considered biases among groups with different image-derived attributes, such as lesion load, which are necessary for medical image analysis.
    3. Clinical feasibility: The proposed method can improve fairness and trust in the deployment of deep learning medical imaging models in real-world clinical practice. The successful reduction of calibration error in the worst-performing subgroups without severely affecting prediction performance is particularly promising for improving patient care.
    4. Strong Evaluation: The authors evaluated the proposed method on multiple datasets and demonstrated its effectiveness in reducing calibration error while maintaining good prediction performance across various subgroups. They also compared their method with other popular fairness mitigation methods and found that Cluster-Focal works best in mitigating calibration bias for older patients, which is a challenging subgroup to correct for.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited evaluation on other datasets: The authors evaluate the proposed method on multiple datasets, but only medical imaging datasets are considered. It would be interesting to see how the method performs on other types of datasets.
    2. Lack of comparison with state-of-the-art methods: The authors compare their method with popular fairness mitigation methods, but they do not compare it with state-of-the-art methods thoroughly for mitigating calibration bias in deep learning models.
    3. Limited discussion on potential limitations: The authors do not discuss potential limitations of their proposed method, such as its sensitivity to the choice of clustering algorithm or its scalability to larger datasets.
    4. Lack of clinical validation: While the proposed method has potential clinical applications, it has not been validated in a clinical setting yet. Further studies are needed to evaluate its effectiveness and feasibility in real-world clinical practice.
    5. Lack of novelty in some aspects: While the proposed method is novel in its formulation and application to medical imaging analysis, some aspects, such as focal loss and clustering algorithms, have been used in previous works for mitigating calibration bias in deep learning models.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper seems to be reproducible, and the authors have made an effort to provide all necessary information for others to reproduce their experiments. However, I have not seen any publicly available code or links.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In general, I found the paper to be well-written and engaging. The authors present a new method for mitigating calibration bias in deep learning medical imaging models, which has the potential to improve fairness and trustworthiness in the deployment of these models into real-world clinical practice. The authors test their proposed method on multiple datasets and demonstrate its effectiveness in reducing calibration error while maintaining good prediction performance across various subgroups of interest.

    However, there are some areas where the paper could be improved:

    1. The authors could provide additional details on potential limitations of their proposed method, such as its sensitivity to the choice of clustering algorithm or its scalability to larger datasets. This would help readers better understand the scope and applicability of the proposed method.
    2. Although the authors compare their method with popular fairness mitigation methods, they do not compare it with state-of-the-art methods thoroughly for mitigating calibration bias in deep learning models.
    3. The authors should provide more information on how their proposed method can be integrated into existing medical imaging pipelines or workflows. This would help readers better understand how the proposed method can be practically applied in real-world clinical practice.
    4. While the authors have taken steps to ensure that their experiments are reproducible by providing a step-by-step guide, they could also consider making their code available through a separate repository or website for easier access.

    Overall, I think this is an engaging and well-written paper that makes a valuable contribution to the field of medical image analysis. With some revisions and additions, it could become a contribution that will be of interest to researchers and practitioners alike.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The novelty and significance of the proposed method for mitigating calibration bias in deep learning medical imaging models.
    2. The thoroughness of the experimental evaluation, which includes multiple datasets and subgroups of interest.
    3. The clarity and organization of the writing, which makes it easy to understand the proposed method and its experimental results.
    4. The potential clinical applications of the proposed method in improving fairness and trustworthiness in the deployment of deep learning medical imaging models into real-world clinical practice.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a method for improving the fairness of a classifier without ever requiring subgroup information during training, which all three reviewers find to be a really nice feature. I agree with R3 that looking at calibration through the lens of fairness is rare but interesting. R3 also found the paper to be “solid work” and a “well written paper”. R2 agreed (“engaging and well-written paper”), and while R1 found some lack of novelty, I do not think the fact that the proposed method is a combination of other existing techniques (focal loss, k-means clustering) renders the paper less valuable. On the contrary, I believe that leveraging well-established methods in a novel manner to solve other problems is a good thing; not every paper needs to have a series of N novel components to be a useful piece of work. I will back the general consensus and recommend direct acceptance.




Author Feedback

N/A


