
Authors

Charles Jones, Mélanie Roschewitz, Ben Glocker

Abstract

We investigate performance disparities in deep classifiers. We find that the ability of classifiers to separate individuals into subgroups varies substantially across medical imaging modalities and protected characteristics; crucially, we show that this property is predictive of algorithmic bias. Through theoretical analysis and extensive empirical evaluation, we find a relationship between subgroup separability, subgroup disparities, and performance degradation when models are trained on data with systematic bias such as underdiagnosis. Our findings shed new light on the question of how models become biased, providing important insights for the development of fair medical imaging AI.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_18

SharedIt: https://rdcu.be/dnwAP

Link to the code repository

https://github.com/biomedia-mira/subgroup-separability

Link to the dataset(s)

https://stanfordmlgroup.github.io/competitions/chexpert/

https://physionet.org/content/mimic-cxr-jpg/2.0.0/

https://www.nature.com/articles/sdata2018161#Sec10

https://www.nature.com/articles/s41597-022-01388-1#Sec6

https://github.com/mattgroh/fitzpatrick17k


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper explores the potential effects of subgroup separability on model bias.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Good problem definition, followed by establishing levels of subgroup separability in real-world datasets.
    2. Exploration of multiple datasets.
    3. Experiments with different subgroups, enabling a clearer understanding of potential bias.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited experiments on performance degradation with respect to levels of induced noise; i.e., while there is potential bias as a function of subgroup separability, evaluating at a single noise level makes it unclear whether the degradation is a function of subgroup separability, sample size, or the amount of noise.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. An analysis of the effect of the degree of label bias on performance, and hence on the effect of group separability, could be performed.
    2. An analysis of the effect of sample size on the encoding of sensitive information, in association with subgroup separability, could be performed, as there appears to be some association between the two.
    3. Would attribute noise in subgroups introduce bias?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite researchers' best efforts to avoid bias in the data via stratification and other strategies, models can still be biased, and understanding subgroup separability is one step in that direction. This work is thus important. Several datasets and subgroups were studied, enabling a clearer understanding of the problem.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper studies the problem of group-fair subgroup separability for medical image classification. The authors model subgroup separability as an estimator of false positive rate for binary classification. Empirical studies on various datasets demonstrate the role of subgroup separability and classification effectiveness in the context of group fairness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1: this paper studies an interesting problem and demonstrates empirically and theoretically that subgroup separability varies across real-world modalities and protected characteristics.

    S2: extensive testing on real-world medical datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1: this paper only considers the task of binary classification. I am wondering whether the approach works for multi-class classification and other tasks such as segmentation and detection.

    W2: In terms of subgroup separability, I am wondering whether the theoretical aspects of this work are related to Fisher's discriminant analysis or support vector machines. In particular, the formulas in Section 3 look close to FDR control techniques in statistics.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Should be able to reproduce the experiments once the code is released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please address the weakness points.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a high-quality manuscript studying an interesting problem. However, the authors should consider extending their approach to a wider technical basis, such as detection or multi-class classification.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper analyses the impact of subgroup separability on the bias of classification models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A novel perspective for analysing bias in classification models, through the lens of subgroup separability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some of the claims in Section 3 are not well supported. The empirical analysis is not sufficiently rich.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficiently reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    see above

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some of the claims in Section 3 are not well supported. The empirical analysis is not sufficiently rich.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper explores the potential effects of subgroup separability on understanding model bias, presenting a well-written study on a relevant and interesting problem. The authors provide empirical and theoretical demonstrations; however, some areas require improvement and clarification before publication: (1) improving the experiments on performance degradation with respect to levels of induced noise; (2) addressing the comments regarding performance in multi-class classification settings compared to binary classification, as insights into how the proposed method performs with multiple classes would enhance the applicability and generalizability of the findings; and (3) clarifying the claims and formulation of Section 3.




Author Feedback

We thank the reviewers and meta-reviewer for their encouraging and constructive feedback, which has helped us to improve our work. We address each review in detail below, adding the requested label noise experiment and discussing applicability in multiclass settings.

R1: Many thanks for the valuable comments and encouraging feedback. We appreciate that the reviewer felt this work is important and enables a better understanding of the problem of model bias. We agree with the reviewer that further analysis of the different contributing factors would be interesting to isolate specific effects. On label noise, we had already performed experiments with different levels of noise but left them out for space reasons. Supporting our main results, we found that, as label noise intensity increases, Group 1 performance degrades faster in datasets with high subgroup separability. Importantly, subgroup separability remains predictive of performance degradation at all levels of label noise intensity. While we cannot share figures or updates to the manuscript in this response due to MICCAI guidelines, we can add these results to the appendix, with a brief discussion in the main paper and updates to the code repository.
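
As an illustration of the kind of experiment described above, underdiagnosis label noise of varying intensity could be injected as in the following minimal sketch (the function name and synthetic data are hypothetical stand-ins, not our actual pipeline):

```python
import numpy as np

def inject_underdiagnosis_bias(labels, groups, target_group, noise_rate, seed=0):
    """Flip positive labels to negative for one subgroup at the given rate,
    simulating systematic underdiagnosis bias of configurable intensity."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    # Positive cases belonging to the affected subgroup are flip candidates.
    candidates = np.where((labels == 1) & (groups == target_group))[0]
    flip = rng.choice(candidates, size=int(noise_rate * len(candidates)), replace=False)
    labels[flip] = 0  # positive -> negative (missed diagnosis)
    return labels

# Toy demonstration with synthetic labels and a binary protected attribute.
rng = np.random.default_rng(0)
y, g = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    y_biased = inject_underdiagnosis_bias(y, g, target_group=1, noise_rate=rate)
    print(rate, (y_biased[g == 1] == 1).mean())  # Group 1 positive rate shrinks
```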

Further investigations into sample size and attribute noise are interesting future directions (e.g. on smaller datasets, models may be less capable of separating individuals into subgroups), which we will highlight in our discussion. With the release of our code and the public availability of all data used in this work, we hope our study will inspire and facilitate further investigations into model bias.

R3: Many thanks for the comments; we appreciate that the reviewer considered this a “high-quality manuscript on an interesting problem”. Regarding binary versus multiclass prediction tasks, we performed our empirical analysis in the binary setting to align our experimental protocol with the pre-existing MEDFAIR benchmark for fairness (the key difference being our addition of simulated underdiagnosis bias). Similarly, we used a binary formulation of the problem in Section 3 for consistency with the experiments. However, our formulation and experimental setup are readily extendable to the multiclass setting, as multiclass classification may be reduced to multiple binary problems. The main conclusion that performance degradation is a function of subgroup separability should equally hold in multiclass problems.
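
To make the reduction concrete, a one-vs-rest loop of the following form would let the binary per-group analysis be repeated class by class (an illustrative sketch with synthetic data, not code from the paper):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# One-vs-rest reduction: each class becomes a binary problem, so a binary
# per-group bias analysis can be repeated for every class.
X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
for c in np.unique(y):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, (y_tr == c).astype(int))
    auc = roc_auc_score((y_te == c).astype(int), clf.predict_proba(X_te)[:, 1])
    print(f"class {c} vs rest: AUC = {auc:.3f}")
```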

Regarding further tasks such as segmentation and detection, we agree that these will be exciting directions for future work. We will highlight this in our discussion section.

Regarding the relationship to Fisher's (and more generally linear) discriminant analysis: LDA can be used as a supervised technique to identify group-separating features, so it is related in the sense that LDA can help to identify whether and how strongly groups are separable, given attribute labels according to group membership (e.g. sex, race, age). This is related to our experiments where we quantify group separability. In the context of group separation in downstream prediction tasks (such as disease detection), we are then interested in studying the implicit (unsupervised) separation that may happen as a by-product of the optimisation process (and under scenarios such as label noise). FDR control mechanisms are not directly related to our work but have been studied in the context of fairness in machine learning; for example, studying specific performance disparities in subgroups on certain metrics (e.g. TPR) while holding other metrics (e.g. FPR) constant can be useful.
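
For illustration, group separability in this supervised sense could be quantified with a sketch like the following, using scikit-learn's LDA and test AUC as a separability score (synthetic features stand in for learned image representations; this is not the measurement code from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic features standing in for learned image representations; `group`
# plays the role of a protected attribute (e.g. sex, race, age group).
X, group = make_classification(n_samples=2000, n_features=20, n_informative=5,
                               random_state=0)
X_tr, X_te, g_tr, g_te = train_test_split(X, group, test_size=0.5, random_state=0)

# Supervised separability: how well can LDA recover group membership?
lda = LinearDiscriminantAnalysis().fit(X_tr, g_tr)
separability = roc_auc_score(g_te, lda.decision_function(X_te))
print(f"subgroup separability (test AUC): {separability:.3f}")
```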

R4: Unfortunately, the reviewer has not provided any details, and therefore we feel unable to provide a meaningful rebuttal. The reviewer’s assessment seems inadequate and at odds with the feedback from the other reviewers and the area chair.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have responded to the concerns raised by the reviewers, specifically addressing issues related to enhancing the experiments on performance degradation, performance in multi-class classification settings, and certain claims made in the paper. After a thorough evaluation of the authors' rebuttal, I recommend accepting the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper examines subgroup separability as a means of understanding bias in classification models. There is consensus about the interest and novelty of the approach. The rebuttal is responsive to the feedback. Overall, the paper seems suitable for publication at MICCAI.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents an interesting and well-conducted analysis of subgroup separability in medical image classification as a surrogate for algorithmic bias. The paper was mostly received positively by the reviewers, and the authors did a good job in their rebuttal of clarifying potential limitations of their work (e.g., applicability to multi-class problems). In summary, I think this is a solid paper without any major weaknesses.


