Authors
Zikang Xu, Shang Zhao, Quan Quan, Qingsong Yao, S. Kevin Zhou
Abstract
Deep learning is becoming increasingly ubiquitous in medical research and applications while involving sensitive information and even critical diagnosis decisions. Researchers observe a significant performance disparity among subgroups with different demographic attributes, called model unfairness, and put considerable effort into carefully designing elegant architectures to address it, which imposes a heavy training burden, generalizes poorly, and reveals the trade-off between model performance and fairness. To tackle these issues, we propose FairAdaBN, which makes batch normalization adaptive to sensitive attributes. This simple but effective design can be adapted to several classification backbones that are originally unaware of fairness. Additionally, we derive a novel loss function that restrains statistical parity between subgroups on mini-batches, encouraging the model to converge with considerable fairness. To evaluate the trade-off between model performance and fairness, we propose a new metric, named Fairness-Accuracy Trade-off Efficiency (FATE), which computes normalized fairness improvement over accuracy drop. Experiments on two dermatological datasets show that our proposed method outperforms other methods on fairness criteria and FATE. Our code is available at https://github.com/XuZikang/FairAdaBN.
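As a rough illustration of the core idea described in the abstract (not the authors' implementation; see the linked repository for that), the PyTorch sketch below keeps one batch normalization layer per sensitive-attribute subgroup while all other parameters are shared. The module name, the two-group setting, and the per-batch group argument are illustrative assumptions.

    # Minimal sketch of attribute-adaptive batch normalization: each subgroup
    # gets its own BatchNorm2d statistics and affine parameters.
    import torch
    import torch.nn as nn

    class AdaptiveBN2d(nn.Module):
        def __init__(self, num_features: int, num_groups: int):
            super().__init__()
            self.bns = nn.ModuleList(
                [nn.BatchNorm2d(num_features) for _ in range(num_groups)]
            )

        def forward(self, x: torch.Tensor, group: int) -> torch.Tensor:
            # 'group' is the sensitive-attribute index of the current mini-batch
            return self.bns[group](x)

    # Usage: normalize a batch whose samples share the same attribute value
    bn = AdaptiveBN2d(num_features=64, num_groups=2)
    y = bn(torch.randn(8, 64, 32, 32), group=0)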
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43895-0_29
SharedIt: https://rdcu.be/dnwyI
Link to the code repository
https://github.com/XuZikang/FairAdaBN
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a method for mitigating performance disparities between patient groups in deep learning-based medical image analysis. Their approach is based on two key aspects, i) replacing standard batch normalization layers by a group-specific version, and ii) a loss function that combines cross-entropy with a term that quantifies performance disparities between groups. Moreover, they describe an evaluation metric that summarizes changes in accuracy and fairness metrics relative to a baseline model into a single number.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors propose a novel method for mitigating fairness-related issues, group-specific batch normalization layers. This is an interesting approach that I have not seen described before, and it enables group-specific model customization without requiring the training of fully separate (group-specific) models.
The method is evaluated on two skin lesion classification datasets and compared to a number of baseline methods, across a number of metrics. An ablation study is also included.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I see three main weaknesses in the current manuscript.
Firstly, the engagement with the existing algorithmic fairness literature seems superficial; I am discussing specific instances and providing many references below in the detailed comments.
Secondly, as a consequence of this, I am skeptical regarding some of the results and their interpretations given in the paper. All metrics considered in the paper are susceptible to changes in decision thresholds. Thus, as a very simple baseline approach, group-specific threshold selection on the vanilla model should be included, ideally also some threshold-independent metric such as AUROC. The addition of more baseline models for comparison should be considered, cf. the various references provided below.
Thirdly, I am skeptical as to the value of the proposed FATE metric. The metric is based on one very particular way of putting numbers on trade-offs between performance and fairness, and it is far from evident to me why this particular version should be the most meaningful way to evaluate the trade-off. Figure 2 nicely illustrates this: (why) should we consider models on straight lines in these diagrams equivalent - and not, say, on convex curves? In the rightmost panel, is a model with accuracy ~0.59 and equalized odds disparity ~0.07 really “equivalent” to the baseline model (acc ~0.88, EO ~0.105)? The metric seems to be comparing apples with oranges to me. (Also note that the scale of meaningful accuracy values is bounded below by the performance of the random classifier, while fairness metrics could really reduce to 0 - another reason why relative changes in the two seem incomparable to me. See Flach and Kull, Precision-Recall-Gain Curves: PR Analysis Done Right, for a discussion of a similar issue.)
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The authors have indicated that they will make all code required for running the experiments available after publication. The used datasets are publicly available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The following are some specific instances in which I was missing a more thorough engagement with the existing algorithmic fairness literature:
- In the introduction, the authors write that “The second group [of methods] explicitly takes sensitive attributes into consideration when training models, that is, train independent models for unfairness mitigation with no parameters are shared between subgroups”. However, only a very small subset of methods that “explicitly take sensitive attributes into consideration when training models” train fully independent group-specific models. In the simplest case, models may simply use group membership as an input, or as a conditioning variable. Various domain invariance approaches (some of which the authors even compare their method against) also take groups into account during training but learn a single model. Also see, e.g., Suryakumar et al., When Personalization Harms: Reconsidering the Use of Group Attributes in Prediction.
- A very large number of papers have used modified cost functions such as the one described here for in-processing fairness mitigation. None of these are mentioned / cited / compared against as baselines. See, e.g., section 4.2 (In-process mechanisms). in Pessach et al., A Review on Fairness in Machine Learning, or Zhang et al., Improving the fairness of chest X-ray Classifiers.
- The authors write that pre-processing fairness mitigation techniques “need huge effort due to the preciousness of medical data”. However, the most standard approach from this category is simply group balancing, which certainly does not “need huge effort”. See, e.g., Zhang et al., Improving the Fairness of Chest X-ray Classifiers, or Idrissi et al., Simple data balancing achieves competitive worst-group-accuracy.
- For post-processing methods, the authors only mention FairPrune, which is far from being a standard method in this category. The most basic approaches simply rely on selecting group-specific decision thresholds, see, e.g., Hardt et al., Equality of Opportunity in Supervised Learning.
- There is extensive research on the “fairness-accuracy Pareto frontier” that the authors also explore here. See, e.g., Little et al., “To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier”, or Zietlow et al., “Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers.”
Further, minor comments:
- In the tables, I personally find red to be a counterintuitive choice for the best model.
- Equation (7) looks wrong to me. Also, I assume all of Eqs. (5)-(7) should be absolutes?
- The “universality” of FairAdaBN is not “proved” by “plugging into several backbones”.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
4
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The discussed mitigation approach, a group-specific batchnorm layer, is an interesting and novel strategy that I have not seen mentioned before. Unfortunately, the engagement with prior algorithmic fairness work is severely lacking, resulting in a (to me) unconvincing performance evaluation. Most of my concerns could be addressed by adding more baselines and metrics to the performance evaluation, and by appropriately revising the discussion of other fairness mitigation approaches.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
5
- [Post rebuttal] Please justify your decision
The authors have addressed and alleviated several of my main concerns in their rebuttal. While I am personally still not very convinced of the value of the FATE metric, the idea to use fair adaptive batch normalization is interesting. The validation is now sufficiently comprehensive that readers can properly assess the utility of this new mitigation technique.
Review #3
- Please describe the contribution of the paper
The issue of model unfairness in medical research arises due to significant performance disparities observed among subgroups with different demographic attributes. In response to this concern, this paper proposes a novel approach known as FairAdaBN, which aims to address and mitigate model unfairness in deep learning models, particularly in the context of dermatological disease classification.
FairAdaBN presents a straightforward yet effective framework that tackles model unfairness by introducing individual batch normalization modules for each subgroup. This adaptive nature of batch normalization, achieved through FairAdaBN, helps mitigate performance gaps among subgroups. Additionally, the paper introduces a statistical disparity loss to minimize disparities in performance between subgroups.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
FairAdaBN offers a simple yet effective framework for addressing model unfairness in deep learning models, making it adaptable to classification backbones utilizing batch normalization. By introducing a novel loss function that regulates statistical parity between subgroups in mini-batches, FairAdaBN enables batch normalization to be responsive to sensitive attributes. The authors have introduced the Fairness-Accuracy Trade-off Efficiency (FATE) metric, which provides a meaningful evaluation of the trade-off between model performance and fairness.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Weaknesses:
- FairAdaBN requires the use of sensitive attributes during the testing stage, which may be unfair when compared to EnD and CFair.
- Batch normalization relies on the quality of batch statistics, and if certain subgroups are underrepresented in the dataset, the quality of the BN module’s parameters may be compromised. It is recommended to provide additional analysis when data for certain subgroups is scarce.
- Exploring the simultaneous impact of incorporating FairAdaBN and LSD (statistical disparity loss) on fairness improvement could provide insights into how the parameters of AdaBN change in the presence or absence of LSD. Additionally, investigating the potential link between LSD and the learning of the BN module for minor groups would be interesting.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Overall good; providing code is recommended.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Batch normalization relies on the quality of batch statistics, and if certain subgroups are underrepresented in the dataset, the accuracy of the BN module’s parameters may be compromised. It is recommended to provide additional analysis when data for certain subgroups is scarce. Exploring the simultaneous impact of incorporating FairAdaBN and LSD (statistical disparity loss) on fairness improvement could provide insights into how the parameters of AdaBN change in the presence or absence of LSD. Additionally, investigating the potential link between LSD and the learning of the BN module for minor groups would be interesting.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although there are some limitations in the paper, I think the idea is interesting and the paper’s quality is good, so I would like to recommend acceptance.
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #4
- Please describe the contribution of the paper
The authors propose an adaptive batch normalization for unfair data, and a statistical disparity loss function is constructed for the optimization. The authors formulate a novel metric to evaluate the balance between the normalized improvement of fairness and the normalized drop of accuracy.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The adaptive batch normalization and statistical disparity loss function are novel formulations. The FATE metric balances fairness and accuracy.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
How is the performance of FATE defined? Why is it balanced? Different loss functions should be compared with L_SD.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Sounds good
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
It would be better to use absolute values in Eqs. (5)-(7). What is the meaning of a negative FATE? Using an absolute value in Eq. (8) sounds good, but in Table 1 there are many negative results whose absolute values are larger than the current results.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The adaptive batch normalization and statistical disparity loss function are novel formulations, but the FATE metric should be improved.
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The paper presents a new way to address model unfairness in deep learning classifiers by replacing global batch normalization modules with attribute/group-aware FairAdaBN modules and a new statistical disparity loss that minimizes performance disparities between subgroups in this setup. Moreover, a new metric (FATE) is introduced to evaluate a model’s fairness-accuracy trade-off with respect to a baseline model. The methods are evaluated on two skin lesion classification datasets, and a comparison to several baseline approaches for unfairness mitigation shows favourable results (e.g., a small drop in accuracy and better fairness results).
All reviewers highlighted the novelty of this interesting idea and the overall presentation of the paper. A general methodological weakness of the method is its reliance on the availability of the sensitive attribute during inference. This is a significant downside compared to competing methods and raises questions regarding the presented evaluation, as two of the main baselines utilized (EnD and CFair) do not require knowledge about the sensitive attribute. R1 sees a general lack of engagement with the available literature on unfairness mitigation, which leads to additional questions regarding the representativeness of the baselines chosen for the evaluation (e.g., comparisons against other modified cost functions are missing) and the robustness of the results (e.g., to changes in the decision threshold). The reviewers also identified some problems regarding FATE, as the meaning of negative values remains unclear (see R4) and it remains unclear why performance/fairness trade-offs should be assessed this way (see esp. R1).
The rebuttal should primarily focus on the concerns raised regarding the evaluation and on justifying/clarifying FATE.
Author Feedback
We thank the reviewers for their comments and are pleased that 2 of the 3 reviewers recommended acceptance. They appreciate that our method is novel, interesting, and described clearly.
Response for Meta-Reviewer: We engage with additional literature on unfairness mitigation (R1Q1), add two more baseline methods (R1Q2), and further clarify the meaning of FATE (R1Q3, R4Q1&4).
Response for R1: Q1: We admit that we missed some references and will include them in the revised version. In the introduction, we mention the 2nd group of methods to emphasize the difference between fairness through unawareness/awareness. In the post-processing part, we select FairPrune because it is the first post-processing method for fair MIC; we will include calibration in the revised version.
Q2: Our experiments are conducted on multi-class, instead of binary, classification; thus, threshold-based methods are inapplicable. However, we implement two additional baselines following your suggestion, i.e., GroupDRO and Resampling. On the Fitz17k dataset, GroupDRO has an ACC of 86.62%, and its fairness criteria (×1e-2) are (0.94, 8.04, 8.23). Resampling has an ACC of 87.73%, and its fairness criteria are (1.11, 10.43, 10.78). FairAdaBN has the best fairness criteria (0.48, 7.67, 7.73) compared to them. Similar results are found on ISIC. We will compare against other baselines in the future.
Q3: Firstly, we want to emphasize the importance of FATE. Many mitigation methods tend to reduce the utility of both groups to achieve fairness, so the idea behind FATE is direct: how much does the unfairness mitigation model improve when sacrificing one unit of accuracy? Secondly, FATE should be combined with utility metrics and fairness metrics, rather than used independently. We admit that this FATE metric has its limitations too, thus we change its equation to: FATE = \frac{ACC_m - ACC_b}{ACC_b} - A \times \frac{FC_m - FC_b}{FC_b}, where A is a weighting factor that adjusts the requirement for fairness, pre-defined by the user considering the real application. Specifically, if A is selected properly, the intercept on the y-axis of the line FATE=0 is slightly smaller than the baseline’s accuracy. Moreover, similar trade-off metrics are also proposed in [1-2].
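To make the sign convention of the revised formula concrete, here is a small illustrative computation with made-up numbers (not results from the paper); the helper name fate, the weighting factor a=1, and the chosen values are purely hypothetical, and a lower fairness criterion FC means a fairer model.

    # Illustrative computation of the revised FATE with hypothetical numbers.
    def fate(acc_m, acc_b, fc_m, fc_b, a=1.0):
        return (acc_m - acc_b) / acc_b - a * (fc_m - fc_b) / fc_b

    # Small accuracy drop, clear fairness gain -> positive FATE (beneficial)
    print(round(fate(acc_m=0.87, acc_b=0.88, fc_m=0.07, fc_b=0.10), 3))   # 0.289
    # Large accuracy drop, marginal fairness gain -> negative FATE (not beneficial)
    print(round(fate(acc_m=0.60, acc_b=0.88, fc_m=0.095, fc_b=0.10), 3))  # -0.268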
Minor Comments: Q4: “Color of the table”. We will revise it.
Q5: “Equations”. We define A=0 as the unprivileged group, thus the first term is always larger than the second, and the absolute operation can be omitted. To avoid readers’ confusion, we will add the absolute operation in the revised version.
Q6: “Universality”. We regard universality as the efficient improvement of fairness and FATE on various architectures. We will use “Generalization ability” instead.
Response for R3: Q1: We admit this shortcoming of our method and will try to solve this problem by adding classifiers to predict the pseudo-sensitive attributes in future work.
Q2: We will dive deeper into “analyzing scarce data” in future work.
Q3: We have presented the result of combining FairAdaBN and LSD in Tab. 2 (8th row), and further investigation of LSD will be conducted in future work.
Response for R4: Q1&4: “FATE”. We want to emphasize that FATE is a signed variable. We prefer an algorithm that obtains a higher FATE, since a higher FATE denotes greater unfairness mitigation with a low drop in utility, and a negative FATE denotes that the mitigation model cannot decrease unfairness while preserving enough accuracy (not beneficial). We will also revise the formula of FATE (refer to R1: “FATE”).
Q2: “Loss”. As shown in Tab. 2, lines 5 & 6 are trained by L_{CE}, and line 8 is trained by L_{CE}+L_{SD}. The result of the combined loss is better. We will examine other loss functions in future work.
Q3: “Absolute value”. We will revise Eqs. (5)-(7).
[1] Learning privacy-enhancing face representations through feature disentanglement.
[2] Unsupervised privacy-enhancement of face representations using similarity-sensitive noise transformations.
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The authors do a good job in the rebuttal to address the concerns highlighted in my initial meta-review (unclearness surrounding the evaluation and the proposed FATE metric). I would suggest that the authors clearly mention the major disadvantage of their method (= reliance on the availability of the sensitive attribute during inference) in the final version.
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The work proposes an interesting and novel method for tackling unfairness issues in skin lesion classification by introducing adaptive individual batch normalisation for each subgroup. The proposed approach is new and interesting, with novel formulations. The rebuttal has addressed most of the concerns, including the missing literature, comparisons against representative baselines, and clarification of the FATE metric. Though the drawback of using sensitive attributes during inference was not addressed and has been left for future work, I would suggest discussing this in the current manuscript to acknowledge it. Considering all the strengths and drawbacks of the work, the new method proposed will be of great interest to readers and will bring some insights to the research community despite the weaknesses. Therefore a recommendation of accept is suggested.
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The paper introduces a novel strategy for addressing model unfairness in deep learning classifiers. This involves substituting global batch normalization modules with attribute/group-aware FairAdaBN modules and applying a new statistical disparity loss to minimize performance disparities among subgroups. The authors also propose a new metric, FATE, to evaluate the fairness-accuracy trade-off of a model relative to a baseline model. The authors evaluate their methods on two skin lesion classification datasets, with the results suggesting a favorable performance compared to several unfairness mitigation baselines, including a small drop in accuracy coupled with improved fairness results. Reviewers commended the innovative concept and the presentation of the paper, but pointed out a major methodological weakness: the dependency of the method on the availability of the sensitive attribute during inference. This presents a significant drawback when compared to other methods, notably the EnD and CFair baselines, which do not require knowledge about the sensitive attribute. Reviewer 1 also identified a general lack of engagement with existing literature on unfairness mitigation, leading to further questions about the selection of baselines for evaluation and the robustness of the results. Furthermore, issues were raised concerning the FATE metric, including the unclear meaning of negative values and the reasoning behind using this metric to assess performance/fairness trade-offs. The authors’ rebuttal should prioritize addressing the concerns about the evaluation process and justifying or clarifying the FATE metric. In their response, the authors have successfully addressed several of the main concerns. While there remains some skepticism about the value of the FATE metric, the idea of using fair adaptive batch normalization is compelling, and the validation process now enables readers to adequately assess the usefulness of this new mitigation technique. I would recommend acceptance for this paper.