Authors

Agnieszka Mikołajczyk, Sylwia Majchrowska, Sandra Carrasco Limeros

Abstract

New medical datasets are now more open to the public, allowing for better and more extensive research. Although prepared with the utmost care, new datasets might still be a source of spurious correlations that affect the learning process. Moreover, data collections are usually not large enough and are often unbalanced. One approach to alleviate the data imbalance is using data augmentation with Generative Adversarial Networks (GANs) to extend the dataset with high-quality images. GANs are usually trained on the same biased datasets as the target data, resulting in more biased instances. This work explored unconditional and conditional GANs to compare their bias inheritance and how the synthetic data influenced the models. We provided extensive manual data annotation of possibly biasing artifacts on the well-known ISIC dataset with skin lesions. In addition, we examined classification models trained on both real and synthetic data with counterfactual bias explanations. Our experiments showed that GANs inherited biases and sometimes even amplified them, leading to even stronger spurious correlations. Manual data annotation and synthetic images are publicly available for reproducible scientific research.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_42

SharedIt: https://rdcu.be/cVVpX

Link to the code repository

https://github.com/AgaMiko/debiasing-effect-of-gans

Additional website: https://biasinml.netlify.app/bias-in-gans/

Link to the dataset(s)

https://drive.google.com/drive/u/3/folders/1ib7b5sopgUEK9TxqEPhgjD7XEdZXBvuV

Reviews

Review #1

Please describe the contribution of the paper

This paper discusses the (de)biasing effect of using GAN-based data augmentation and introduces a dataset with manual annotations of biasing artifacts in six thousand synthetic and real skin lesion images, which can serve as a benchmark for further studies.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The (de)biasing effect of using GAN-based data augmentation is discussed in great detail.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The novelty of the paper needs to be stated clearly in the introduction section. GANs are widely used for data augmentation. How will the (de)biasing change the overall workflow?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Data is available online
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Cross-validation results need to be reported rather than results from testing on some fixed data.

More details are needed for unconditional and conditional GANs.

Results should be evaluated on another, bigger dataset.

A discussion section needs to be added, as the results section currently looks weak from a discussion point of view.

Please describe the parameter tuning of the GANs in full.

Compare the classification results with other state-of-the-art work on the same dataset.

In Table 2, report the standard deviation along with the mean value. Grammatical errors need to be corrected throughout the manuscript.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper raises the interesting topic of debiasing in GANs.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The paper investigates the tendency for conditional and unconditional GANs to exacerbate biases in their training databases, specifically looking at artifacts (e.g. dermatological markings, frames, etc…) and natural features (e.g. hair). The authors have found that for their GANs, strong correlations, spurious or otherwise, tend to be amplified and rare events suppressed. Interestingly, the authors suggest that unconditional GANs (trained separately on the two data classes) are less biased than conditional GANs (one GAN to generate both classes).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The second experiment on counterfactual artifact insertion is the defining strength which strongly supports their conclusions regarding the exacerbation of certain biases in the data caused by artifacts present in both conditional and unconditional GANs. The first experiment demonstrates some of the effect that the authors claim, although not to the same degree.

The cautionary motivation for the paper is really quite strong, and it provides a measured response to the argument that generative models can completely solve the problem of data availability.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The authors in the introduction clearly state that they are interested in understanding algorithmic bias, setting aside questions of racial, gender, etc… bias. This is very odd considering that the domain they are investigating, dermatological image processing, is definitely very affected by ethnicity and racial considerations.

There is a clarity issue for Section 3.3, notably the difference between the aug. GANs and the synth. GANs. It appears that artifacts were inserted into the evaluation dataset (in order to calculate the number of “switches”) and also into the training set for the aug. GANs (and not the real data nor the synth. GANs), but this should be made explicit.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper is conceptually reproducible, using well-known methods and well-defined techniques.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Overall, the authors seem to confuse artifact removal with bias removal, which are not necessarily the same thing. The authors appear to use the term “debiasing” to refer to the removal of particular artifacts such as short hairs, but this isn’t really debiasing the data. One would assume that debiasing would somehow be equilibrating the frequency of certain features causally known to be separate from the task at hand between the two classes.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written and motivated and I think it would pose a good contrast to the GAN papers that normally appear at MICCAI that propose new models. The main weakness I see is the lack of expansion of the method to sources of bias that are also very important in the area (namely race) which are harder to address using simulations in the same way that adding frames can be addressed.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

In this paper, the authors analyse the impact of data augmentation and bias inheritance on melanoma classification from skin lesion images. They experiment with different settings such as manual annotation, GAN augmentation, and artifacts such as hair, gel, ruler, frame, … Results give specific insights on the bias generated with each method.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors explore a large and relevant range of possible biases in skin lesion images; sharing these annotations is commendable. Besides classical metrics, counterfactual bias insertion metrics provide complementary information on the bias effect.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The quality of the annotation is not evaluated; for instance, no information is given on the expertise of the annotator (practitioner, naive, …). This has a direct impact on the final evaluation results.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Implementation details are given. Manual annotations will be shared publicly.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

It would be interesting to provide some significance metrics when comparing different parameters. Also, the title can be misleading as it seems generic. It would have been better to mention that it is about bias on skin lesion images.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The experimental setting is interesting. However, the conclusions are ‘local’ and may not generalize to other medical applications.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
- Using GAN to debias via data augmentation on the skin dataset
- The following MUST be addressed in the rebuttal
  - “Cross validation result need to be reported rather than considered testing on some fixed data.”
  - “Compare the results of classification with other state of the art work on same dataset.”
  - “The quality of the annotation is not evaluated”
  - “significance metrics when comparing different parameters” should be provided
  - I agree with the reviewer that the title is too general and can be misleading, please change it accordingly
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

4

Author Feedback

We would like to thank the reviewers for their assessment of the paper and constructive suggestions. Below we address their major concerns and clarify misunderstandings.

The main novelty of our work lies in the broad analysis of the bias inheritance effect in Generative Adversarial Networks and of how GAN-based data augmentation affects the model, which has not been done before to such an extent. We will clarify the novelty in the introduction. We agree that the title “The (de)biasing effect of GAN-based augmentation methods on skin lesion images” is more appropriate.

Along with the examination of bias influence, we provided manual annotations for six thousand real and synthetic images, marking selected artifacts that are considered potentially biasing. The annotation process was carried out by a trained professional who has worked with the ISIC collection and on the identification of its biases for over 4 years. Additionally, we will measure inter-annotator agreement on a small subsample of the data and provide the value of Cohen’s kappa coefficient.
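For readers unfamiliar with the agreement measure mentioned above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch only, not the authors' evaluation code; the function name `cohens_kappa` and the example labels are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in all_labels) / (n * n)

    if p_expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up example: 1 = artifact present, 0 = artifact absent.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this made-up example the annotators agree on three of four images (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5.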

Regarding bias removal: we agree that bias removal and artifact removal are not the same thing. However, in the examined case, artifacts were strongly connected with specific classes despite having no causal relation to them, so they can be perceived as potentially biasing.

We agree that ethnicity and racial origin are interesting and important considerations, but the ISIC2020 dataset we used contains dermoscopic images collected from European institutes, which limits the possibilities for assessing those biases. From this perspective, ethnicity and racial biases are of secondary importance and hence were not evaluated.

We agree with Reviewer #1 that we did not provide a comparison of the classification results with other SOTA work on the same dataset. This was mainly motivated by the page limit and the fact that classification results on ISIC2020 are publicly available on the Kaggle platform. We will provide a brief discussion in the final version of the paper.

Regarding the lack of clarity of Section 3.3 mentioned by Reviewer #2: in our research, we compared the performance of the classifier under different training scenarios: only real data (acronym real), a mixture of real and synthetic images (acronym aug.), and only synthetic data (acronym synth.). The exact numbers of images belonging to each class were provided in Supplementary Table 1.

Regarding the comment about the cross-validation, we would like to emphasize that all Counterfactual Bias Insertion (CBI) experiments were extensively tested by repeating them five times for each bias type, on the whole test dataset. Our preliminary results on skin lesion classification showed that there is insignificant variance in the classification accuracy while doing a 5-fold cross-validation, which agrees with reported studies from the literature. Five-fold cross-validation for the classification model would require repetition for each of the 5 scenarios (real, augmented with cGANs/uGANs, and synthetic data cGANs/uGANs) for five folds = 25 full trainings. Adding GAN trainings (2 types of GANs x 5 folds) makes it 35. Extensive testing would have to be repeated 5 scenarios x 5 folds x 5 CBI tests x 5 artifacts = 3125 times. Moreover, cross-validation of GANs would require additional, costly manual data annotation. Hence, during the phase of experiment design, we decided to forgo cross-validation in order to reduce our carbon footprint. This complies with the MICCAI policy on reducing costs and carbon footprint. However, if necessary, we will take it into consideration and repeat additional experiments.
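To make the repeated CBI testing described above concrete, the sketch below counts how many predictions "switch" class once an artifact is inserted, then summarizes the counts over repeated runs with a mean and a sample standard deviation. It is an illustration under stated assumptions, not the authors' implementation: the function names `count_switched` and `summarize_runs` and all prediction values are hypothetical.

```python
from statistics import mean, stdev


def count_switched(preds_original, preds_with_artifact):
    """Number of test images whose predicted class changes after artifact insertion.

    Hypothetical helper; both lists hold one class prediction per test image,
    in the same order.
    """
    if len(preds_original) != len(preds_with_artifact):
        raise ValueError("Prediction lists must cover the same test images.")
    return sum(o != m for o, m in zip(preds_original, preds_with_artifact))


def summarize_runs(switch_counts):
    """Mean and sample standard deviation of switched counts over repeated runs."""
    if len(switch_counts) < 2:
        raise ValueError("At least two runs are needed for a standard deviation.")
    return mean(switch_counts), stdev(switch_counts)


# Made-up predictions (0 = benign, 1 = malignant) for four test images.
original = [0, 1, 1, 0]
with_frame = [0, 0, 1, 1]  # same images after inserting a hypothetical frame artifact
switched = count_switched(original, with_frame)

# Made-up switched counts from three repeated insertions of the same artifact.
avg, spread = summarize_runs([2, 4, 6])
```

Reporting both values per artifact type would give the spread that the reviewers asked for alongside the mean.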

The manuscript will be sent for professional proofreading in order to correct grammatical errors throughout the manuscript.
