
Authors

Frederik Pahde, Maximilian Dreyer, Wojciech Samek, Sebastian Lapuschkin

Abstract

State-of-the-art machine learning models often learn spurious correlations embedded in the training data. This poses risks when deploying these models for high-stakes decision-making, such as in medical applications like skin cancer detection. To tackle this problem, we propose Reveal to Revise (R2R), a framework entailing the entire eXplainable Artificial Intelligence (XAI) life cycle, enabling practitioners to iteratively identify, mitigate, and (re-)evaluate spurious model behavior with a minimal amount of human interaction. In the first step (1), R2R reveals model weaknesses by finding outliers in attributions or through inspection of latent concepts learned by the model. Secondly (2), the responsible artifacts are detected and spatially localized in the input data, which is then leveraged to (3) revise the model behavior. Concretely, we apply the methods of RRR, CDEP and ClArC for model correction, and (4) (re-)evaluate the model’s performance and remaining sensitivity towards the artifact. Using two medical benchmark datasets for Melanoma detection and bone age estimation, we apply our R2R framework to VGG, ResNet and EfficientNet architectures and thereby reveal and correct real dataset-intrinsic artifacts, as well as synthetic variants in a controlled setting. Completing the XAI life cycle, we demonstrate multiple R2R iterations to mitigate different biases. Code is available at https://github.com/maxdreyer/Reveal2Revise.
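For orientation, the four-step life cycle described in the abstract might be sketched as the following loop. This is a minimal sketch in Python; every helper name below is a hypothetical placeholder, not the authors' published API (their actual implementation lives in the linked repository):

```python
# Minimal sketch of the R2R life cycle; all helpers below are
# hypothetical placeholders, not the authors' actual code.

def reveal_to_revise(model, data, max_iterations=3):
    for _ in range(max_iterations):
        # (1) Reveal: find candidate artifacts via outliers in attributions
        #     or by inspecting the latent concepts the model has learned.
        artifacts = find_suspicious_concepts(model, data)
        if not artifacts:
            break  # no remaining spurious behavior detected
        # (2) Detect and spatially localize the artifacts in the input data.
        masks = localize_artifacts(model, data, artifacts)
        # (3) Revise the model, e.g., with RRR, CDEP, or ClArC.
        model = correct_model(model, data, masks, method="ClArC")
        # (4) (Re-)evaluate accuracy and remaining artifact sensitivity.
        print(evaluate(model, data, masks))
    return model
```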

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_56

SharedIt: https://rdcu.be/dnwzo

Link to the code repository

https://github.com/maxdreyer/Reveal2Revise

Link to the dataset(s)

https://challenge.isic-archive.com/landing/2019/

https://www.rsna.org/education/ai-resources-and-training/ai-image-challenge/rsna-pediatric-bone-age-challenge-2017


Reviews

Review #4

  • Please describe the contribution of the paper

    The authors describe a general workflow for the largely automated detection and mitigation of shortcut learning in medical imaging. The method comprises the semi-automated detection of potential shortcuts based on previously described XAI methods, as well as the integration of these detection methods with previously described methods for mitigating detected shortcuts. This cycle can be repeated to mitigate multiple separate sources of shortcut learning. The performance of the whole workflow is evaluated on skin lesion classification and bone age estimation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript describes an innovative approach for integrating XAI methods and artifact mitigation strategies into a coherent and semi-automated workflow for mitigating the detrimental effects of shortcut learning. The paper is well-written and easy to follow despite the complexity of the setup, and the well-designed figures work very well to illustrate the approach. The proposed approach is highly general and can be adapted for use with XAI methods and artifact-invariant learning approaches other than the ones considered by the authors here.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I see two main weaknesses in the manuscript. The first weakness, which the authors acknowledge themselves, is that their approach is predominantly designed to work with clearly localized artifacts, such as rulers or skin markings. It is not immediately clear to me how to extend this approach to non-localized sources of spurious correlations, such as differences in recording equipment or skin type. The authors mention this challenge as a topic for future research.

    Second, and perhaps more relevant to the present manuscript, the quantitative performance evaluation is not completely convincing to me. The authors evaluate on artificially poisoned datasets, finding that their method improves model performance. However, performance is still significantly lower than on the original, uncontaminated test set. In this regard, the authors note that “artifacts might overlap clinically informative features in poisoned samples, limiting the comparability of poisoned and original test performance.” While I appreciate this argument, it unfortunately limits our ability to conclude whether the modified models are indeed fully robust to the induced artifacts. For this reason, and because one might generally hope for an artifact-robust method to generalize better, I would have appreciated an additional evaluation on an external (out-of-distribution) test set, such as the Diverse Dermatology Images (DDI) dataset or another suitable external dataset.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The study uses publicly available datasets, and code to reproduce the experiments was provided to the reviewers. The authors have indicated that the code will be made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In addition to my main criticism above concerning the performance evaluation, I have the following, rather minor comments.

    1. The authors cite a lot of work from a single group, but many research groups have worked on (preventing) shortcut learning. As there is some space left in the reference section, I would encourage citing a bit more broadly. Some possible examples:
      • Geirhos et al., Shortcut learning in deep neural networks
      • Robinson et al., Can contrastive learning avoid shortcut solutions?
      • Nauta et al., Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis
      • Makar et al., Causally motivated shortcut removal using auxiliary labels
      • Puli et al., Out-of-distribution generalization in the presence of nuisance-induced spurious correlations

    Some of the methods described in these papers might also be interesting further candidates for the “revise” stage of the authors’ methodology, or might make for interesting baseline comparisons.

    2. I was confused when I first looked at the CRP concept visualizations in Figure 1, because I did not understand that these are parts of the band-aids shown in the figure above. Can the authors make this a bit clearer, either visually or in the text? It might also be worth emphasizing, for readers unfamiliar with the CRP methodology, that the concept visualizations are “zoomed in”.

    3. In various places, the authors write about “model biases”. This is a highly loaded term that means very different things to different readers. If possible, I would suggest using another, less ambiguous term.

    4. I was not able to fully understand how the artifact localization stage works. Is the CAV “trained”? On which model is the “modified backward pass with LRP” performed: the original one, or the artifact classifier? And do I understand correctly that the authors essentially just put a threshold on the “artifact relevances” R(x) to segment the artifact? I sketch my current reading below.
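    For concreteness, here is my current reading of the localization stage, sketched under my own assumptions; `get_activations` and `lrp_from_direction` are hypothetical stand-ins, not the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

# (a) The CAV is "trained" as a linear classifier separating layer
#     activations of samples with vs. without the artifact; its weight
#     vector is the concept direction.
A_art = get_activations(model, artifact_samples, layer)  # hypothetical helper
A_cln = get_activations(model, clean_samples, layer)     # hypothetical helper
X = np.concatenate([A_art, A_cln])
y = np.concatenate([np.ones(len(A_art)), np.zeros(len(A_cln))])
cav = LinearSVC(C=1.0).fit(X, y).coef_.ravel()

# (b) As I read it, the modified LRP backward pass runs on the ORIGINAL
#     model, initialized at the chosen layer with the CAV direction,
#     yielding per-pixel "artifact relevances" R(x).
R = lrp_from_direction(model, x, layer, direction=cav)   # hypothetical helper

# (c) The artifact segment then seems to be a simple threshold on R(x).
mask = R > np.percentile(R, 95)
```

    If this reading is correct, stating it explicitly in Section 3.1 would help the reader considerably.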

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-written, interesting paper about an innovative approach to a practically highly relevant problem: the detection and mitigation of shortcut learning in (deep learning-based) medical image analysis.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper highlights the potential limitations of current Deep Neural Networks (DNNs) in making accurate predictions due to their inductive biases. Specifically, the authors address the issue of spurious correlations within medical imaging datasets, which can cause a model to rely solely on artifact-signal correlations associated with the true labels, rather than on genuine evidence. To address this challenge, the authors propose a framework that reduces the risk of shortcut learning by incorporating existing methods for bias identification and model correction. Additionally, the authors suggest an automated strategy for replacing artifacts in input images, and their empirical findings suggest that their pipeline could further enhance model robustness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed solution is straightforward and does not require any additional modifications to the selected model architecture. The authors also draw on established research on artifact identification and debiasing strategies. In terms of empirical evidence, the authors have conducted their main experiments thoughtfully, and their results could have various practical uses.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is quite hard to follow. In particular, the methods section could be more descriptive:

    • The input and output of various steps are not clearly stated, which makes it difficult for readers to understand the approach;
    • The transformation from heatmap to binary mask mentioned in Section 3.1 is not adequately explained; one plausible recipe is sketched after this list.
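    For instance, one common recipe, which I assume is roughly what Section 3.1 intends (this is my guess, not necessarily the authors' procedure), would be:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes, binary_opening

def heatmap_to_mask(heatmap, q=92):
    """Turn a relevance heatmap into a binary artifact mask (assumed recipe)."""
    h = np.clip(heatmap, 0, None)              # keep positive relevance only
    h = h / (h.max() + 1e-8)                   # normalize to [0, 1]
    mask = h > np.percentile(h, q)             # threshold at a fixed percentile
    mask = binary_opening(mask, iterations=2)  # remove small speckles
    return binary_fill_holes(mask)             # fill holes for a contiguous mask
```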

    The related work section is limited, and the authors could expand their literature review to include more studies on debiasing and bias removal, e.g. [1, 2, 3, 4].

    Additionally, the authors should consider addressing scenarios where the target and the artifact overlap in the image, and should consider other bias signals, such as texture, shape, sensitive attributes (e.g., gender), and data sampling, in their experiments.

    The novelty of the proposed method is relatively low, as it appears to be a combination of existing methods. Additionally, given the lack of competitors in the experiments and the use of single datasets, it is difficult to establish the real impact of the proposed framework in the field of bias removal and the extent to which it can be applied to different scenarios.

    [1] Bahng, H., Chun, S., Yun, S., Choo, J., & Oh, S. J. (2020). Learning de-biased representations with biased representations. In International Conference on Machine Learning (pp. 528-539). PMLR.
    [2] Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2020). Distributionally robust neural networks. In International Conference on Learning Representations.
    [3] Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., … & Chi, E. (2020). Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems, 33, 728-740.
    [4] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1-35.
    [5] Sohoni, N., Dunnmon, J., Angus, G., Gu, A., & Ré, C. (2020). No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33, 19339-19352.
    [6] Nam, J., Cha, H., Ahn, S., Lee, J., & Shin, J. (2020). Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33, 20673-20684.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide the source code as supplemental material. The README accompanying the code is well organized and allows any reader to reproduce the experiments described in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper would benefit from improved clarity and a more comprehensive problem statement, including the types of bias encountered in medical applications, and from a more thorough literature review of existing pre-, in-, and post-processing debiasing methods.

    To avoid distracting the reader, it is recommended to avoid citing works in the method section.

    Additionally, each module of the framework should be described in terms of input and output, and the objective function minimized should be clearly stated, along with whether the training is end-to-end. The dimensionalities of the problem and variables involved should also be formalized.

    The abstract should be treated as a stand-alone piece of text: acronyms used there should be defined within it, and acronyms defined only in the abstract should be redefined in the main text, to ensure smooth reading for non-experts.

    To enhance the experiments’ robustness, additional scenarios such as larger datasets and attributes, different types of bias, and fairness metrics, should be included and discussed.

    Figures and tables should be placed as near as possible to where they are cited, to avoid the reader moving back and forth in the paper.

    The paper should be self-contained, so it is highly recommended to avoid extensively referencing appendixes provided only as supplemental material.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper lacks proper structure, which makes it difficult to understand even the optimization process employed. The novelty of the work is limited, as it combines various existing approaches to identify and correct biased models. Given the nature of the work, it would have been interesting to evaluate and analyze the pipeline on larger datasets (e.g., WSI) and on more tasks (e.g., regression, segmentation).

    Under these circumstances, it is not possible to evaluate any potential application in a real-world scenario as the results do not introduce any novelty in the field of bias removal.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    A very interesting approach to iteratively correcting bias in AI models via explainability methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Important topic: explaining problems and then mitigating them. Well-written text. Novelty in the way the approach is set up. Interesting experiment with the embedded bias.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Quantitative evaluation is hard; a user test might have been a good idea. The examples given are relatively simple, and more complex concepts may be hard. It would also be interesting to look at regression concept vectors.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Public datasets are used, which is good. It may still be difficult to fully reproduce the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    An interesting and well-written paper. It would be good to extend the work from Concept Activation Vectors towards regression concept vectors, which seem to allow for more complexity in a medical setting; see the sketch below. It would also be good to run a user test to see how the approach works in a prospective setting.
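    To make the suggestion concrete: a regression concept vector replaces the binary concept classifier behind a CAV with a regression against a continuous concept measure. A minimal sketch under my own assumptions follows; `get_activations` and `measure_concept` are hypothetical helpers, not part of the authors' code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of a regression concept vector: instead of classifying
# artifact vs. clean samples, regress a continuous concept measure
# (e.g., a clinical score) on the layer activations.
acts = get_activations(model, samples, layer)  # hypothetical helper
scores = measure_concept(samples)              # hypothetical continuous measure
rcv = LinearRegression().fit(acts, scores).coef_
rcv = rcv / np.linalg.norm(rcv)                # unit-norm concept direction
```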

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical novelty, an interesting setup for an important topic, and good experiments.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper introduces a very interesting approach to iteratively correcting the bias of deep models using explanations. It is an interesting topic, and the method looks novel. The limitations of the paper are well discussed. Although there are some concerns, I think the paper's merits outweigh them. Many researchers in this community would be interested in this paper. It would be great to consider the reviewers' constructive comments to further improve the final version. I recommend that this paper be presented at MICCAI.




Author Feedback

N/A


