
Authors

Yawen Wu, Dewen Zeng, Xiaowei Xu, Yiyu Shi, Jingtong Hu

Abstract

Many works have shown that deep learning-based medical image classification models can exhibit bias toward certain demographic attributes such as race, gender, and age. Existing bias mitigation methods primarily focus on learning debiased models, which may not guarantee that all sensitive information is removed and usually come with considerable accuracy degradation on both privileged and unprivileged groups. To tackle this issue, we propose FairPrune, a method that achieves fairness by pruning. Conventionally, pruning is used to reduce the model size for efficient inference. However, we show that pruning can also be a powerful tool for achieving fairness. Our observation is that during pruning, each parameter in the model has a different importance for each group’s accuracy. By pruning parameters based on this importance difference, we can reduce the accuracy gap between the privileged and unprivileged groups to improve fairness without a large accuracy drop. To this end, we use the second derivative of the loss with respect to the parameters of a pre-trained model to quantify the importance of each parameter to each group’s accuracy. Experiments on two skin lesion diagnosis datasets over multiple sensitive attributes demonstrate that our method can greatly improve fairness while keeping the average accuracy of both groups as high as possible.
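No code is released for the paper (the repository link below is N/A). For concreteness, the following is a minimal illustrative sketch, in PyTorch, of the pruning-for-fairness idea the abstract describes. The squared-gradient (empirical Fisher) approximation of the Hessian diagonal, the trade-off score beta * s_unpriv - (1 - beta) * s_priv, and all names (group_saliency, fairness_prune) are our assumptions for illustration, not the authors' implementation.

    # Illustrative sketch, NOT the authors' code: prune the parameters whose
    # estimated removal cost is low for the unprivileged group relative to
    # the privileged group, so the accuracy gap between groups shrinks.
    import torch

    def group_saliency(model, loader, loss_fn, device="cpu"):
        """Per-parameter importance for one group: 0.5 * h_ii * theta_i^2,
        with the Hessian diagonal h_ii approximated by the mean squared
        gradient (empirical Fisher) over that group's data."""
        sal = [torch.zeros_like(p) for p in model.parameters()]
        n_batches = 0
        for x, y in loader:
            model.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            for s, p in zip(sal, model.parameters()):
                if p.grad is not None:
                    s += p.grad.detach() ** 2
            n_batches += 1
        return [0.5 * (s / max(n_batches, 1)) * p.detach() ** 2
                for s, p in zip(sal, model.parameters())]

    def fairness_prune(model, loader_priv, loader_unpriv, loss_fn,
                       beta=0.5, ratio=0.1):
        """Zero out the `ratio` fraction of parameters with the lowest
        trade-off score; the sign convention (penalize importance to the
        privileged group, protect the unprivileged group) is a guess."""
        sal_p = group_saliency(model, loader_priv, loss_fn)
        sal_u = group_saliency(model, loader_unpriv, loss_fn)
        scores = torch.cat([(beta * su - (1 - beta) * sp).flatten()
                            for su, sp in zip(sal_u, sal_p)])
        k = max(1, int(ratio * scores.numel()))
        threshold = scores.kthvalue(k).values  # k-th smallest score
        with torch.no_grad():
            for p, su, sp in zip(model.parameters(), sal_u, sal_p):
                keep = (beta * su - (1 - beta) * sp) > threshold
                p *= keep  # pruned entries become exactly zero

Per Review #4 below, the authors reportedly set beta and the pruning ratio by a grid search on the validation set.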

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16431-6_70

SharedIt: https://rdcu.be/cVD7q

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a method for modifying a pre-trained classifier to achieve fairness with respect to certain sensitive attributes. The method is based on identifying network parameters (nodes) with high saliency difference between different demographic groups and then pruning those nodes. This is a novel way of pursuing fairness compared to the traditional adversarial-training-based strategies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is a new way of debiasing a model that is conceivably more stable and efficient than the adversarial strategies. It does not require re-training the model.

    2. The results are relatively comprehensive on two large datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposal is confined to binary sensitive attributes. Future work should extend it to categorical (e.g., race) and continuous (e.g., age) attributes.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility should be good if the authors can release their code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I generally enjoyed reading the manuscript. Although some intuition is given in the text, I still feel that the proposed approach ends up being a series of engineering steps without clearly showing why the objective of Eq. 4 is equivalent to fairness. It seems that the authors define fairness as the accuracy difference between demographic groups. If so, please state this explicitly, because there are broader definitions of fairness which I do not believe this article is pursuing.

    Some technical questions, in case I missed something: 1. Why does \Delta E (the change in the objective function) have to be identical to the accuracy drop? 2. Why can we ignore the third-order terms in Eq. 1 if \theta is not near zero?
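    For context (an assumption on our part, since the paper's Eq. 1 is not reproduced here): if Eq. 1 is the usual Optimal-Brain-Damage-style Taylor expansion of the objective around the trained parameters, it presumably has the (diagonal-approximation) form

        \Delta E = E(\theta + \Delta\theta) - E(\theta) \approx \sum_i g_i \Delta\theta_i + \frac{1}{2} \sum_i h_{ii} \Delta\theta_i^2 + O(\|\Delta\theta\|^3),

    where g_i = \partial E / \partial \theta_i and h_{ii} is the i-th diagonal entry of the Hessian. At a converged minimum the first-order term vanishes, and pruning parameter i corresponds to \Delta\theta_i = -\theta_i, so the neglected higher-order terms are small only when \theta_i itself is small, which is presumably the gap the reviewer's second question is probing.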

    Minor: bold the accuracy columns in Table 1

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposal is novel and the results are sufficient.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a pruning approach to remove model bias toward sensitive attributes, specifically skin tone and gender. The approach operates on a pre-trained model and has the added benefit of reducing the model size. Each parameter of the model has a different importance for each group’s accuracy, so pruning parameters based on the difference in importance removes the effect of the sensitive attribute and narrows the accuracy gap between the two groups.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written and easy to follow.
    • The method is intuitive and has clear motivation.
    • Bias in machine learning datasets is a critical issue, so the approach has wide clinical applicability, especially since it has the additional benefit of producing smaller models after pruning.
    • The related work section gives a good overview and taxonomy of the existing methods for the task of overcoming dataset bias.
    • There were many comparative baselines from different de-biasing method categories.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The sensitive attribute examined in the second dataset is gender. First, it should be mentioned whether there are clinically relevant differences between genders with regard to skin conditions. Moreover, the difference between the two groups was marginal even for the vanilla method, so it is critical to justify the clinical need for fairness mitigation in this case.
    • The groups are named ‘privileged’ and ‘unprivileged’. I would replace these terms since they are not descriptive of the particular situation: here, ‘privileged’ simply denotes the group on which the vanilla model achieves higher performance, which collides with the common use of ‘privileged’/‘unprivileged’ for majority and minority groups that enjoy different rights and face different societal issues.
    • It is crucial to mention how many samples were included in each group (‘light’ vs. ‘dark’ skin, male vs. female) and whether there was class imbalance. Why does one group achieve higher performance than the other? Could it be attributed to more training samples originating from that group?
    • It would be interesting to show that the proposed pruning approach also works for more complex architectures like ResNet-50, and not only for simpler architectures like VGG-11.
    • The experiments were not repeated or cross-validated, so no standard deviations were reported.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The datasets used were publicly available.
    • There was no mention that the code would become public upon acceptance.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The word ‘significantly’ is widely used in the results and discussion sections. Since no statistical tests were performed I would replace it with ‘substantially’.
    • It would be interesting, maybe as future work, to show how the method would generalize to a multi-class setting, for example if we performed the classification for each skin tone class separately.
    • There is a minor error in Table 1: for DomainIndep, the calculated differences (Diff) are wrong; they appear to have been copied from AdvRev and not replaced with the actual values.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic is very interesting and the method is intuitive and well-explained. The experiments were run only once, the wording privileged/unprivileged groups was confusing, and there was little explanation of why gender should be considered a sensitive attribute for skin disease classification. Nevertheless, the topic is timely and the method could be widely applicable in such scenarios.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper describes a method for increasing the fairness of machine learning models whilst minimising the drop in accuracy for protected groups. The method is based on the idea of ‘pruning’, an approach normally used for reducing model complexity. The authors propose a novel metric of parameter saliency that enables the pruning operation to address the lack of fairness in the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As far as I know, the proposed idea of using pruning to improve fairness is novel

    The paper is well-written and easy to follow

    Experiments are extensive and include analysis of the effects of key hyperparameters

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The literature review on fairness in medical imaging applications is slightly limited

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall good. I would like to see code made available if the paper is accepted. The authors do state that a grid search using the validation set was used to set the beta and pruning ratio hyperparameters. I would also like to see a statement of how other hyperparameters were set, e.g. batch size (for pre-training and for saliency calculations), learning rate, number of minibatches used for saliency calculation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I enjoyed reading the paper and found the central idea of using pruning for fairness to be very interesting. The idea is simple (as many of the best ideas are) but as far as I know this has not been proposed before so I believe this paper has a high degree of novelty compared to other MICCAI submissions. I believe the paper would be a good addition to the MICCAI program. The comments below are aimed at improving it still further.

    The paper is generally well-written, although see below for minor comments/suggestions. The introduction sets the methodological context quite well, and there are a good number of relevant papers cited from the computer vision literature. However, I thought that the review of papers on fairness in medical imaging was slightly limited. The authors cite Larrazabal et al [13] which is certainly relevant, but there are other equally relevant papers that were not mentioned. In particular I would highlight Abbasi-Sureshjani et al (https://doi.org/10.1007/978-3-030-61166-8_20), Seyyed-Kalantari et al (https://doi.org/10.1038/s41591-021-01595-0) and Puyol-Anton et al (https://doi.org/10.1007/978-3-030-87199-4_39). They could even distinguish between papers that assessed bias (Larrazabal et al, Seyyed-Kalantari et al) and those that also applied mitigation strategies (Abbasi-Sureshjani et al, Puyol-Anton et al) to make the discussion more relevant to this paper. Also, although the specific idea of using pruning to promote fairness is novel, a few papers have analysed the impact of pruning on fairness and so these could also be mentioned (https://doi.org/10.48550/arXiv.2009.09936, https://doi.org/10.48550/arXiv.2201.01709).

    In addition, as noted above I think it would be useful for the authors to state the approach they used to set hyperparameters for their model (& comparative approaches?) and the data used.

    Other minor suggested edits:
    • Section 1, para 1, line 5: “rthe” -> “the”
    • Section 1, para 1, line 8: “turns to perform” -> “performs”
    • Section 1, para 1, lines 10-11: Put dataset details in brackets. Also, I think “ISIC 2018” should be “ISIC 2019”?
    • Section 1, para 1, line 16: “with different” -> “from certain”
    • Section 1, para 1, line 18: “biased” -> “bias”
    • Section 1, para 2, line 9: “proxy of” -> “proxy for”
    • Section 1, para 3, line 13: “Besides” -> “In addition”
    • Section 2, para 3, line 3: “regularizing” -> “regularize”
    • Section 2, para 3, line 4: “sensitive related” -> “sensitive attribute related”
    • Section 3.1, para 1, line 1: “Given” -> “We define”
    • Section 3.2, para 1, line 6: “row i and column i of second” -> “row i and column i of the second”
    • Section 3.2, para 2, line 6: “and is biased” -> “which is biased”. Also, I think it should be “biased against” not “biased for”?
    • Section 3.2, para 2, line 7: “In the coordinate” -> “In the bottom illustration”
    • Section 4, Baselines section, line 6: italicise “DomainIndep”?
    • Section 4, Ablation study section, line 9: “consistent” -> “consistently”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The points I raised above are relatively minor. The paper has novelty, is well-written and the experiments are extensive. The subject of fairness in AI for medical imaging is topical and one which I expect to grow in years to come, hence I believe there will be significant interest in this paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers agree that the paper is solid and makes clear contributions to the field.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

N/A


