Authors
Luyang Luo, Dunyuan Xu, Hao Chen, Tien-Tsin Wong, Pheng-Ann Heng
Abstract
Deep learning models have frequently been reported to learn from shortcuts such as dataset biases. As deep learning plays an increasingly important role in the modern healthcare system, there is a pressing need to combat shortcut learning in medical data and to develop unbiased and trustworthy models. In this paper, we study the problem of developing debiased chest X-ray diagnosis models from biased training data without knowing the bias labels exactly. We start with the observations that imbalance in the bias distribution is one of the key causes of shortcut learning, and that dataset biases are preferred by the model when they are easier to learn than the intended features. Based on these observations, we propose a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels via a generalized cross entropy loss, and then trains a debiased model using the pseudo bias labels and a bias-balanced softmax function. We constructed several chest X-ray datasets with various dataset bias situations and demonstrated through extensive experiments that our proposed method achieves consistent improvements over state-of-the-art approaches.
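For concreteness, below is a minimal PyTorch sketch of the two-stage recipe the abstract describes. The generalized cross entropy (GCE) loss is given in its standard form; the pseudo-label rule (correct-vs-incorrect under the bias-capturing model) and the log-prior offset are an editor's assumptions about how the bias-balanced softmax could be realized, not the authors' released code.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# The GCE loss follows its standard definition; the pseudo-label rule and
# the log-prior offset are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def gce_loss(logits: torch.Tensor, targets: torch.Tensor, q: float = 0.7) -> torch.Tensor:
    """Generalized cross entropy, (1 - p_y^q) / q.

    Compared with standard cross entropy, GCE emphasizes samples the model
    already finds easy, so an auxiliary model trained with it tends to latch
    onto easy-to-learn shortcut features (the dataset bias).
    """
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()

@torch.no_grad()
def pseudo_bias_labels(biased_model, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Assumed rule: samples the bias-capturing model classifies correctly are
    treated as bias-aligned (pseudo bias label 1), the rest as bias-conflicting (0)."""
    preds = biased_model(x).argmax(dim=1)
    return (preds == y).long()

def bias_balanced_loss(logits: torch.Tensor, targets: torch.Tensor,
                       log_prior: torch.Tensor) -> torch.Tensor:
    """Balanced-softmax-style debiasing step: offset the logits by the log
    class prior conditioned on each sample's pseudo bias group, then apply
    ordinary cross entropy. `log_prior` has the same shape as `logits`."""
    return F.cross_entropy(logits + log_prior, targets)
```

The design point is that the two models play complementary roles: one is encouraged (via GCE) to overfit the shortcut and thereby expose per-sample bias labels, and the other is then trained against the class priors of the resulting pseudo bias groups.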
Link to paper
DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_59
SharedIt: https://rdcu.be/cVVqg
Link to the code repository
Link to the dataset(s)
Reviews
Review #2
- Please describe the contribution of the paper
A pseudo bias-balanced learning algorithm that first captures and predicts per-sample bias labels via a generalized cross entropy loss, and then trains a debiased model using the pseudo bias labels and a bias-balanced softmax function.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
novel approach to debiasing
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
It is not clear how realistic the training sets are: what happens when the data has no bias, or contains multiple forms of bias?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
not easily reproducible
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
An important topic is studied, and not requiring labelled bias information makes the approach practical. However, I am missing a discussion of the method's impact in the no-bias case and when multiple forms of bias exist in the data.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
4
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see the comments in 8.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
3
- Reviewer confidence
Somewhat Confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
The paper proposes a novel model for learning predictions from a biased dataset. The methodology is twofold. The first part estimates a pseudo bias label from the sensitivity and specificity on the training set. The second part incorporates this bias label into the bias-balanced softmax function, where a generalized cross entropy loss is used to capture the discrepancy between the biased training set and the unbiased testing set.
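For reference, a plausible reading of the bias-balanced softmax referred to here, in the spirit of balanced softmax for long-tailed recognition (the paper's exact conditioning on the pseudo bias label may differ), is:

```latex
% A plausible form of the bias-balanced softmax (editor's assumption,
% not verbatim from the paper): the class prior conditioned on the
% (pseudo) bias label b reweights the ordinary softmax over logits z.
p(y = j \mid x, b) \;=\; \frac{p(y = j \mid b)\, e^{z_j}}
                              {\sum_{k} p(y = k \mid b)\, e^{z_k}}
```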
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Conceptual innovation
- Novel methodology
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Questionable experiments given the problem statement and dataset definition
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
All data and code will be open-sourced.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. In the problem statement, Source-biased Pneumonia (SbP): "We then sampled 5,000×r% pneumonia cases from NIH and the same amount of healthy cases from MIMIC-CXR. Here, the data source became the dataset bias, and health condition is the target to be learned." Do you mean that this subset of the data can cause the bias? As I see it, you randomly sample the exact same number from each set/class, so I did not get how this can mimic a bias.
2. In the problem statement, Gender-biased Pneumothorax (GbP): here it is much more straightforward. However, I wonder why you did not consider a direct bias from the label, instead of this sort of indirect bias through one of the covariates?
3. In Algorithm 1: are fB and fD independent networks that do not share weights?
4. On page 5, "Giving f(x) the softmax output of the model, denoting fy=j(x) the probability of x being classified to class y = j and θ the parameters of model": it would be better if θ were included in the definition of the function f.
5. In Table 1, G-DRO appears to achieve the best results with ground-truth bias labels. Is that comparison performed on an independent dataset? If it is performed on the same dataset, then the ground-truth bias labels should be available to your model as well.
6. The improvement in results looks marginal.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Questionable experiments, but I can accept if appropriate explanations are provided. Weak accept given the incremental results.
- Number of papers in your stack
5
- What is the ranking of this paper in your review stack?
3
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #4
- Please describe the contribution of the paper
This paper proposes a novel algorithm, pseudo bias-balanced learning (PBBL), to tackle dataset bias problems in medical images. The method first estimates the bias level for each case, then uses this pseudo bias label to train a debiased model that avoids the shortcuts and learns directly from the intended information.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Dataset bias is a truly noteworthy problem in the medical image domain. As a simple example, most subjects who choose to receive specific medical scans may have similar symptoms and diseases. Such conditions are often hard to handle, since the labeling process is troublesome and requires additional expert labor. The method proposed in this paper tries to deal with this problem without an explicit labeling process.
- The method itself, PBBL, is straightforward, easy to follow, and seems feasible to me. The use of data is also very reasonable. The authors first generate two types of highly biased datasets, biased towards data source and gender information respectively. Then the model is trained with a bias-balanced softmax.
- To generalize the algorithm to unknown biases, beyond data source or gender information, the authors take a step further by applying a generalized cross entropy loss. The major assumption is that "dataset biases would be preferred when they were easier to be learned than the intended features", which matches the intuition.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- My major concern is that the paper lacks proper visualization of the biased/unbiased models. For example, model A could be biased towards gender information while model B, trained using the proposed method, is bias-balanced. It is important to show the class activation maps of the models before/after the generalized learning.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The authors provide all the source code in the checklist and supplemental materials, and they use public datasets, which is beneficial for reproducing the work.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- Consider superimposing the class activation maps on the X-ray images to show on which part of the image the biased/debiased model bases its final decision. This may help readers get a more intuitive understanding.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors explained their method in a clear and reasonable way. Each step is well-founded and seems feasible to me.
- Number of papers in your stack
7
- What is the ranking of this paper in your review stack?
2
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The paper initially received mixed reviews. After further investigating the arguments raised in the reviews, the AC found that R1's comments were not properly evidenced. The AC contacted the reviewer multiple times to update their review or provide more feedback, but this did not happen. As a result, the AC reviewed the paper thoroughly, sided with the two positive reviewers, and decided to recommend acceptance. If the paper is accepted, the authors are strongly encouraged to address the comments (about visualization from R4 and the detailed comments from R3) in the final version.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
5
Author Feedback
We thank the meta-reviewer and the reviewers for their affirmation and constructive comments; all concerns will be thoroughly addressed in the final version. Please find our specific responses below.
R2.1. [Why data source causes bias] The data source is a typical type of indirect bias that can lead to a biased classifier, as discussed in [a], where the way the data are collected reveals clues that identify their source. In chest X-rays, differences among data sources can arise for various reasons, such as different scanning protocols, different imaging machines, or even patient positions [b]. In the SbP dataset, most pneumonia cases were from MIMIC-CXR, and most healthy cases were from NIH. Such a correlation can lead to a biased classifier, which can also be seen from the experimental results, where the vanilla model almost became a source classifier when r was set to 1.
R2.2. [Why study indirect bias] Bias can arise for many reasons, for example, imbalanced positive-negative ratios of a disease, or the direct label bias the reviewer mentioned. Direct label bias mainly leads to a related but different problem, known as class imbalance or long-tailed distribution. Our study focuses on another important yet often ignored bias caused by covariates (sometimes called spurious correlations), where the model may make decisions for the wrong reasons (the biased model nearly became a source classifier in the SbP case) or give unfair predictions for different populations (as in the GbP case). More importantly, such biases may not be explicitly annotated unless they are eventually found harmful, which motivated us to propose the PBBL algorithm.
R2.3. [Independence of fB and fD] fB and fD are independent networks without weight sharing. We will clarify this in the final version.
R2.4. [Definition of fθ] We shall add θ to the definition of the function f in the final version.
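For concreteness, the amended notation would presumably read along these lines (an editor's rendering, not the authors' final text):

```latex
% Assumed notation: f parameterized explicitly by \theta, matching the
% page-5 definition quoted by the reviewer.
f_{\theta}(x) \ \text{denotes the softmax output of the model, and} \quad
f_{\theta,\,y=j}(x) = \Pr(y = j \mid x;\, \theta)
```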
R2.5. [Why compare with G-DRO while not using bias labels in PBBL] G-DRO was used on the same dataset to show the upper bound of the debiasing methods when the bias labels were known. However, in many practical scenarios, the labels for the dataset biases may not be known, which motivated us to propose the PBBL algorithm.
R2.6. [Concerns on improvements in results] We have made consistent improvements over state-of-the-art algorithms such as LfF and DFA on two challenging datasets, covering five different scenarios in total. Overall, we believe the improvements are quite meaningful.
R3. [Visualization of model results] We observed that the biased (vanilla) model tended to make decisions based on the wrong regions more often than the model trained by PBBL, and we will show the visualization results in the final version as well. However, not all kinds of wrong reasons can be directly visualized from the heatmaps, as discussed in [c]: a model could still learn the wrong reasons even when limited to learning from the lesion regions, and more work is needed to further promote research on model debiasing.
[a] Torralba, Antonio, and Alexei A. Efros. "Unbiased look at dataset bias." CVPR 2011.
[b] DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. "AI for radiographic COVID-19 detection selects shortcuts over signal." Nature Machine Intelligence 3.7 (2021): 610-619.
[c] Viviano, Joseph D., et al. "Saliency is a Possible Red Herring When Diagnosing Poor Generalization." International Conference on Learning Representations, 2021.