
Authors

Susu Sun, Lisa M. Koch, Christian F. Baumgartner

Abstract

While deep neural network models offer unmatched classification performance, they are prone to learning spurious correlations in the data. Such dependencies on confounding information can be difficult to detect using performance metrics if the test data comes from the same distribution as the training data. Interpretable ML methods such as post-hoc explanations or inherently interpretable classifiers promise to identify faulty model reasoning. However, there is mixed evidence whether many of these techniques are actually able to do so. In this paper, we propose a rigorous evaluation strategy to assess an explanation technique’s ability to correctly identify spurious correlations. Using this strategy, we evaluate five post-hoc explanation techniques and one inherently interpretable method for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task. We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net provide the best performance and can be used to reliably identify faulty model behavior.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_40

SharedIt: https://rdcu.be/dnwyT

Link to the code repository

https://github.com/ss-sun/right-for-the-wrong-reason

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper investigates the ability of various explanation techniques to identify spurious correlations in black-box neural network classifiers used for medical imaging applications. Spurious correlations arise when training data is confounded by additional variables unrelated to the diagnostic information to be predicted. The authors focus on diagnosing cardiomegaly from chest x-ray data, incorporating three types of artificially generated spurious correlations.

    The authors propose two novel metrics to evaluate explanation techniques: Confounder Sensitivity (CS) and explanation NCC, which captures the sensitivity of explanations to prediction changes. CS measures the ability of an explanation to correctly attribute the confounder, while NCC assesses whether explanations change when a classifier’s prediction changes due to the addition or removal of a confounder. The findings show that SHAP and Attri-Net are better able to spot spurious correlations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very relevant topic
    • Nice experimental design with the three types of perturbations, which are inspired by real perturbations of these images.
    • Simple yet effective novel evaluation metrics: confounder sensitivity and NCC of saliency maps with and without confounding.
    • The authors present first evidence on the increasing effect of spurious correlations and on how models can over-focus on the presence of a confounder in medical imaging.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors used only one class, cardiomegaly, out of many. It is not clear how the findings would carry over to other classes, especially considering that cardiomegaly is one of the easiest classes to predict.
    • Following the first point, the title would need to be adapted to avoid a potential overstatement.
    • I would argue that some other important XAI methods are needed here, such as LRP, DeepLift, DeepTaylor, Integrated Gradients, and similar ones that, in my opinion, are more advanced versions of guided backpropagation, which is known to be very insensitive to input and model perturbations.
    • The sensitivity metric might be biased, as methods yielding larger blobs of activations might obtain a higher metric value. In the GradCAM results shown here for cardiomegaly, those areas are not detected, but other methods might have that advantage.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Pretty good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper is really well written and structured. More awareness from the community on shortcut learning in connection to XAI methods is needed and the paper highlights it.

    Can the authors explain why results are not shown for other classes? Some other conditions in the CheXpert dataset are very interesting, and it would have been interesting to see how the models and the proposed metrics behave for the most challenging class labels. So far I see this as a main weakness of the paper, especially for a multilabel/multiclass dataset.

    Please comment/argue on the potential bias of the sensitivity metric alone. See the weaknesses section for the related thought about it.

    Interesting results for Attri-Net. I advise the authors to note in the paper, though, that Attri-Net is not model-agnostic, as SHAP and others are, but is intrinsically both the model and the attribution generator. This can be a limitation for future development of DL models. Eventually the authors will need to defend the performance and robustness of Attri-Net, in exchange for its being intrinsically interpretable, against mainstream models and those currently being promoted, such as transformers and the like.

    What happened to the missing points in Figure 2, second and third rows?

    Why not use SSIM instead of the proposed NCC? SSIM has been used before to measure the similarity of saliency maps.
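
    For instance, something along these lines could be compared against NCC (an illustrative sketch using scikit-image; the min-max normalisation and function name are my own assumptions, not from the paper):

    ```python
    # Illustrative sketch: SSIM between two saliency maps (not the authors' code).
    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    def saliency_ssim(map_a: np.ndarray, map_b: np.ndarray) -> float:
        """SSIM between two saliency maps, each min-max normalised to [0, 1]."""
        def norm(m):
            m = m.astype(np.float64)
            return (m - m.min()) / (m.max() - m.min() + 1e-8)
        return ssim(norm(map_a), norm(map_b), data_range=1.0)
    ```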

    Supplementary Figure 1, bottom, is quite interesting (and a pity it couldn’t fit into the 8 pages!): it shows that models might actually be more resilient to shortcut learning than we thought (?). Performance drops only at very large contamination percentages (p > 80). I think this finding in itself could be the topic of a study…

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-structured paper on a very relevant topic. Some work is still needed, though, on showing results for more classes, more up-to-date gradient-based XAI approaches, and other datasets.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • The authors propose an evaluation strategy to assess an explanation technique’s ability to identify spurious correlations.
    • They evaluate six methods for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors come up with a novel work regarding the ability of interpretability methods to identify confounders in a medical imaging task. They propose two metrics to assess the capacity of the interpretability methods to detect the confounders. They assess both post-hoc and inherently interpretable methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Focusing on samples where the confounder flips the class may be too restrictive. Confounders may severely affect the probability without flipping the class, and that may also be relevant.
    • Some interpretability methods produce sparser explanations than others, and that is not taken into consideration.
    • Some interpretability methods generate maps that highlight neither the confounder nor a disease-relevant region (e.g., LIME). That is neither explored nor described.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Reproducibility is guaranteed.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors should detail the impact of explanation sparsity on the assessment. They should also take into account cases where confounders do not flip the class but contribute strongly to the uncertainty of the classification.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Interesting and important study on the ability of state-of-the-art interpretability methods to detect confounders.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Authors’ rebuttal didn’t modify my assessment of the paper.



Review #3

  • Please describe the contribution of the paper

    The manuscript conducts an experiment to evaluate five post-hoc methods and one inherently interpretable model on their ability to detect spurious features. The evaluation of confounder sensitivity showed that the post-hoc method SHAP and the interpretable model Attri-Net outperformed the others in detecting spurious feature correlations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The studied problem of bias detection with explanations is important
    • Thorough review of the related works
    • The evaluation conclusion adds to the existing body of knowledge in medical image analysis tasks
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses

    • The conclusions were drawn from qualitative visual correlations and comparisons, without quantitative measures of the correlation between the ground truth and the evaluation metrics.
    • The conclusion that the methods “can be used to reliably identify faulty model behaviour” is too absolute. Given that the evaluation was conducted on one task/dataset, it is unknown how well the results would apply to other tasks/datasets.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Details on the weaknesses:

    In the introduction: “older patients in our training data may be more likely to present with a disease than younger patients. A classifier trained on this data may inadvertently learn to base its decision on image features related to age rather than the pathology”. This statement may not be an appropriate example and is not aligned with medical knowledge, because protected features in other ML tasks, such as age and gender, are actually important determining medical information for differentiating among diagnoses. The authors may consider giving another example of spurious features, such as extra text or a ruler marker in medical or skin images.

    In the introduction: “such faulty behaviour cannot be identified using performance metrics such as accuracy”. This statement may not hold. For example, if we suspect a spurious feature, we may identify it by comparing fine-grained accuracy scores on the subsets of the test data that do and do not contain the spurious feature.
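
    As an illustration of this suggestion, a minimal sketch of such a subgroup comparison (variable and function names are hypothetical, not from the paper):

    ```python
    # Compare accuracy on test subsets with and without a suspected spurious feature.
    import numpy as np

    def subgroup_accuracies(labels, preds, has_confounder):
        labels = np.asarray(labels)
        preds = np.asarray(preds)
        mask = np.asarray(has_confounder, dtype=bool)
        acc_with = float((labels[mask] == preds[mask]).mean())
        acc_without = float((labels[~mask] == preds[~mask]).mean())
        return acc_with, acc_without  # a large gap hints at shortcut learning
    ```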

    In Section 2.1: “based classifier contains an unknown spurious correlation with the target label”. Why is the spurious correlation unknown? In the experiment, the authors designed three types of spurious correlations that are known as ground truth. Please define what an “unknown” spurious correlation means.

    In Section 2.1: “we can say with certainty that the classifier did not learn a spurious signal because there was none in the data.” The statement is too absolute, with no grounds to support it. It is unknown whether the chest X-ray dataset contains its own intrinsic spurious patterns or not. The work in ref. [2] divides spurious features into known and unknown features. Maybe in this sentence the authors are referring to unknown spurious features in the dataset? Please revise the statement accordingly.

    In Section 2.2 (Studied Confounders), is there any reason for choosing obstruction as the third type of spurious correlation? What exactly are the “foreign materials” in the second type of spurious correlation?

    The main evaluation results in Section 3 (Explanations) were presented in Fig. 2 as qualitative results, and the conclusions comparing different methods rely on qualitative inspection of the graphs. There is a lack of quantitative results on the correlation between the evaluation results and the ground-truth spurious-correlation percentage. Without a quantitative correlation as a performance metric, it may be difficult to compare many (>10) methods by inspecting the graphs and to break ties.

    In Figs. 4 and 5, Attri-Net highlights the spurious correlation with negative feature-importance values (blue). This seems to contradict the confounder sensitivity metric, which only counts the “top 10% attributed pixels”. Did the saliency map undergo post-processing, i.e., were the top 10% taken over absolute, rather than positive, values?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concerns are that 1) the conclusions were limited to one dataset while being stated in absolute, general terms, and 2) the conclusions were drawn from qualitative visual correlations and comparisons, without quantitative measures of the correlation between the ground truth and the evaluation metrics. These two concerns limit the applicability and validity of the results as new knowledge.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths

    • The paper is well written
    • The paper deals with the interesting topic of detecting spurious correlations using XAI techniques.
    • Experimental design looks interesting
    • Novel evaluation metrics are proposed

    Weaknesses

    • Experiments have been conducted on one class in a single dataset. It is not clear how the findings would generalize to other, more challenging classes
    • Some important XAI methods are missing
    • The title seems to be an overstatement.
    • Limited XAI methods are used
    • The metric might be biased

    The paper is dealing with a very interesting topic but experiments are limited. In particular, the experiments are conducted on only one class in a single dataset. Also, some important XAI methods are missing. It would be great to address the concerns on the generalization of the findings during the rebuttal. Also, the potential bias of the metric needs to be discussed.




Author Feedback

We thank the reviewers for their valuable feedback. We were encouraged to hear that they found the work very relevant (R1), novel (R2), and important (R3).

[R1,R3] Evaluations on more tasks/datasets: We have verified that the same findings hold for the class “Pneumothorax” in the CheXpert dataset and the class “Aortic Enlargement” in the VinDr-CXR dataset. Regrettably, due to space restrictions, these additional findings cannot be included in the final manuscript. While we agree with R1 and R3 on the potential benefits of evaluating more diagnostic tasks and datasets, we underscore that our primary contributions lie in assessing three distinct types of realistic confounders in chest X-ray scans, as well as proposing a novel evaluation strategy for comparing different XAI methods. Our research is not chiefly centered on the diagnostic task, and we anticipate that our results will generalise to a wide range of tasks and datasets.

[R1] More XAI methods: Our choice of baseline methods reflects diverse explanation strategies (gradient-based, linear approximations, counterfactual explanations), as well as examples from both the post-hoc explanation and inherently interpretable XAI paradigms. We believe the chosen methods are representative of current research trends and allow some broad conclusions. We have also evaluated LRP, DeepLift and Integrated Gradients and found that the inclusion of those techniques would not change any of our conclusions, but would overload the paper. We thus leave the study of those methods for future work. With the present baselines and the proposed evaluation framework we hope to lay the groundwork for future benchmarking of methods.

[R1,R3] Overstatements in title and text: We will adjust our title to reflect the study’s exploratory nature as suggested by R1, and we will qualify our claims about the methods’ ability to detect faulty model behaviour (R3).

[R1] Potential bias of confounder sensitivity: We would like to stress that our proposed confounder sensitivity metric is not substantially affected by the resolution of the explanations, as it measures how many of the top 10% attributed (super-)pixels coincide with the confounder pixels. It can be argued that XAI methods relying on superpixels have a higher chance of randomly hitting the correct confounder. However, we find that other effects dominate the explanations’ behaviour.
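
Schematically, the metric can be sketched as follows (a simplified illustration; details such as superpixel handling are omitted, and the use of absolute attributions is an assumption here):

```python
# Sketch: fraction of the top-attributed pixels that fall inside the confounder mask.
import numpy as np

def confounder_sensitivity(attribution, confounder_mask, top_fraction=0.10):
    flat = np.abs(attribution).ravel()          # absolute attributions (an assumption)
    k = max(1, int(top_fraction * flat.size))
    top_idx = np.argpartition(flat, -k)[-k:]    # indices of the k most attributed pixels
    return float(confounder_mask.ravel()[top_idx].astype(bool).mean())
```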

[R2] Focus on flipped samples too restrictive: Our strategy of focusing solely on samples that led to flipped predictions restricts the evaluation to samples where the confounder had a verifiable impact on the prediction. This tactic is crucial for both evaluation metrics. For example, for Explanation NCC we postulate that if the prediction is flipped, the explanation should also be different. This would be much more difficult to formalise with a real-valued decline in classifier confidence. Furthermore, we observed that for the different classification networks in our study (Attri-Net and ResNet50) the decline in confidence behaves very differently, which would make them difficult to compare in this manner.
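
Schematically, the comparison between an explanation computed with and without the confounder can be sketched as follows (a simplified illustration; the exact formulation in the paper may differ):

```python
# Sketch: normalised cross-correlation between two explanation maps.
import numpy as np

def explanation_ncc(expl_with, expl_without):
    a = expl_with.ravel().astype(np.float64)
    b = expl_without.ravel().astype(np.float64)
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b))  # ~1: same pattern, ~0: unrelated, ~-1: inverted
```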

[R2] Effect of explanation sparsity: We acknowledge that diverse XAI methods exhibit varying degrees of sparsity. Comparing different XAI methods is a multifaceted issue, and multiple metrics will be required to fully characterise a method’s behaviour. We note that the Explanation NCC measure remains unaffected by sparsity. We will clarify the effect of sparsity on the explanation sensitivity metric in the revised manuscript.

[R3] Conclusions based on qualitative observations: Note that our conclusions are drawn from visually interpreting plots of quantitative results rather than from purely qualitative observations. However, we have now calculated the correlation between the proportion of confounders and the respective scores, and have found that it supports our findings. We will include a small table with these results in the supplementary materials.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the paper is well written. The authors present an interesting analysis of the detection of confounders using feature attribution methods. Although the results are not strongly conclusive due to the limited experiments on one class, I think the paper makes novel contributions and will be of interest to the MICCAI community. It would be great to update the final version of the paper by taking the reviewers’ comments into account.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper explores whether XAI approaches can detect shortcuts. There were concerns about overstatement in the title, which are addressed by changing the title. Some questions about the experiments are answered. The quantitative aspect of the paper may not be as strong as one may want.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    As pointed out by the reviewers and the primary AC, this paper deals with a very interesting topic, but the experiments are limited. This, however, cannot be fully addressed through a rebuttal given the very limited time. Balancing the strengths and weaknesses evaluated by the reviewers, and also taking into consideration R2’s post-rebuttal feedback, the paper may not be ready for presentation at MICCAI yet. However, the work is very interesting and the authors should pursue it further.


