
Authors

Mitchell Pavlak, Nathan Drenkow, Nicholas Petrick, Mohammad Mehdi Farhangi, Mathias Unberath

Abstract

To safely deploy deep learning-based computer vision models for computer-aided detection and diagnosis, we must ensure that they are robust and reliable. Towards that goal, algorithmic auditing has received substantial attention. To guide their audit procedures, existing methods rely on heuristic approaches or high-level objectives (e.g., non-discrimination in regards to protected attributes, such as sex, gender, or race). However, algorithms may show bias with respect to various attributes beyond the more obvious ones, and integrity issues related to these more subtle attributes can have serious consequences. To enable the generation of actionable, data-driven hypotheses which identify specific dataset attributes likely to induce model bias, we contribute a first technique for the rigorous, quantitative screening of medical image datasets. Drawing from literature in the causal inference and information theory domains, our procedure decomposes the risks associated with dataset attributes in terms of their detectability and utility (defined as the amount of information the attribute gives about a task label). To demonstrate the effectiveness and sensitivity of our method, we develop a variety of datasets with synthetically inserted artifacts with different degrees of association to the target label that allow evaluation of inherited model biases via comparison of performance against true counterfactual examples. Using these datasets and results from hundreds of trained models, we show our screening method reliably identifies nearly imperceptible bias-inducing artifacts. Lastly, we apply our method to the natural attributes of a popular skin-lesion dataset and demonstrate its success. Our approach provides a means to perform more systematic algorithmic audits and guide future data collection efforts in pursuit of safer and more reliable models. Full code is available at https://github.com/mpavlak25/data-audit.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_43

SharedIt: https://rdcu.be/dnwBD

Link to the code repository

https://github.com/mpavlak25/data-audit

Link to the dataset(s)

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript addresses an important and interesting question: can we flag the presence and importance of potential “shortcuts” in medical image databases, prior to and independent of training a particular model? The authors provide a formal framework for addressing this question, by considering whether 1) a certain characteristic is observable from the images, and 2) how much information it provides about the target label. They propose a methodology based on mutual information estimates and statistical hypothesis testing for quantifying both properties. This enables assessing the risk for a certain characteristic to be exploited as a shortcut. The methodology is illustrated both on synthetic and real test cases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors describe an interesting way of quantifying the presence and potential severity of shortcuts in medical image databases. This is an important and, to my knowledge, largely open problem. The method described in the present paper may help to proactively address such problems instead. It is the first approach of this type that I am reading about.

    The statistical methodology employed in the article is notably thorough and enables drawing reliable conclusions based on the developed approach. The experimental evaluation is comprehensive and convincing.

    The paper is very well-written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I believe there is additional prior work that should at least be discussed. For example, Section 3.1 of Fabbrizzi et al. (2022), A survey on bias in visual datasets, mentions several visual bias discovery methods (some of them based on notions of mutual information) that seem closely related to the method discussed here.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have indicated that the full code, including data splits etc., will be made available upon acceptance. The study uses publicly available datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I have three larger questions concerning the methodology employed in the paper, although these might fall more under the “future work” category.

    First, could the authors discuss the possibility of extending this methodology to the unsupervised case, i.e., when explicit artifact labels are not available during training? This would further increase the utility of this method. There are some works in the “unsupervised bias discovery” area; maybe some of these can serve as inspiration? E.g., Wamburu et al., Systematic Discovery of Bias in Data, or some of the references in the survey paper mentioned above.

    Second, while the authors propose their methodology for screening datasets for potential biases, the method does rely on specific DNNs, in particular for predicting the attributes from the images. Did the authors investigate the robustness of their results to different choices of attribute prediction models? Do they believe their method is sufficiently robust to this factor to label the result a “dataset property” and not a “dataset+model property”?

    Third, the authors note that in their experiments, artifact detectability and utility do not seem to be independent. This seems unexpected and undesirable to me. Can the authors elaborate on this issue?

    In addition, I have a few smaller comments and questions.

    • I am not sure I understand Figure 2 correctly; an explanatory caption would be helpful. Are the arrows part of the synthetic artifacts, or are they added to point at the actual artifacts? Also, what is a “Noise” local artifact, and am I supposed to see it?
    • In Figs. 3-5, the authors could consider adding the explicit keywords “Utility” and “Detectability” to the axes, to make it easier for readers to map their intuitive understanding onto the graphs. Also, I would suggest noting the arguments of “CMI”, as is done for “MI(Artifact, Y)”.
    • Is the performance drop in Fig. 3b with respect to the worst-case test set? Also, does the “compression-30” line in this graph precisely correspond to the worst-case line shown in Fig 3a?
    • I believe I have now inferred that the AUC in Fig. 4 is the AUC with respect to predicting A, not Y? If so, this is very unintuitive to me and should be made clear(er).
    • Also in Figure 4, the information conveyed by the rightmost panel is not clear to me. What is the specific meaning of the 95th percentile threshold?
    • Also in Figure 4, is the scale on the CMI axis here the same as on the colorbar in Fig. 3b? If yes, does the large scale difference have an intrinsic meaning as well, maybe akin to the relationship between effect size and statistical significance?
    • In table 1, what exactly is AUC here? Is that related to a model specifically trained to detect Gaussian noise? What does “because we introduce artifacts with no relationship to detectable image characteristics, AUC represents a valid ground truth” mean?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The addressed problem is interesting and important, and I have not seen an approach similar to the one described here before. The paper is very well-written and methodologically sound. If the authors can address some of my concerns, this will be an excellent contribution.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The proposed method for identifying and detecting unwanted bias in machine learning models is a rigorous, quantitative screening technique that focuses on the dataset itself. The authors develop a variety of datasets with synthetically inserted artifacts with different degrees of association to the target label that allow evaluation of inherited model biases via comparison of performance against true counterfactual examples. Using these datasets and results from hundreds of trained models, the authors show that their screening method reliably identifies nearly imperceptible bias-inducing artifacts. Lastly, they apply their method to the natural attributes of a popular skin-lesion dataset and demonstrate its success.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper decomposes the risks associated with dataset attributes in terms of their detectability and utility.
    2. It is capable of generating targeted hypotheses about a much broader set of attributes, beyond the more obvious ones.
    3. It provides a way to help determine the attributes that may introduce unfairness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While this method can identify and investigate hypotheses about potential sources of bias in a dataset, it does not provide a solution for correcting or mitigating those biases.
    2. The method requires labels for the sensitive attributes; its success may depend on the quality and representativeness of the datasets used for screening.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is recommended to provide the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The method requires labels for the sensitive attributes; its success may depend on the quality and representativeness of the datasets used for screening.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper decomposes the risks associated with dataset attributes in terms of their detectability and utility. It may provide a way to help determine the attributes that may introduce unfairness, which is important.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a method for identifying bias-inducing artifacts of medical imaging datasets for computer vision models. The authors develop datasets with synthetically inserted artifacts and use their screening method to identify nearly imperceptible biases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper studies a timely topic and presents an interesting angle for identifying model bias: data auditing. The authors analyze the bias from the perspectives of information theory and causality.

    2. The experiments are well-motivated and carefully designed, and help to demonstrate the effectiveness and sensitivity of the proposed method.

    3. The paper is overall well-written and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Are there any other related methods/algorithms that can be used as baselines for comparison besides the “random chance”? The current experiments can only show the performance of the proposed method itself.

    2. It would be more convincing if the authors could use more datasets from other modalities than the skin lesion datasets only.

    3. In the second paragraph of the method section, it says “We assume that Y (the diagnosis) is the causal parent of X (the image) given that the diagnosis affects the image appearance but not vice versa”. I would assume the authors are trying to say that the “disease” is the causal parent of the image instead of the “diagnosis”, as the disease leads to the specific appearance of the image, and the image leads to the diagnosis. I think the diagnosis and the actual disease are different things (e.g., the diagnosis may be wrong). I suggest clarifying the writing; the current phrasing seems confusing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Looks clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Use more datasets for the experiments, as per weakness 2.

    2. Clarify the writing as per weakness 3.

    3. The illustrations of the different synthetic artifacts in Figure 2 look similar to me. I guess this is because the figures are small, making it hard to show the differences. It would be clearer if the authors could come up with a better way to present them.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting study with convincing experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents an interesting twist on the issue of potential shortcut learning in deep learning models: Instead of analyzing whether a specific model exploits image information related to an attribute that is not causally linked to the actual target task, the paper proposes a statistical method rooted in information theory and causality to identify the existence of such unwanted links in the training data that could be exploited.

    The reviewers acknowledge that such a data auditing approach is a very interesting idea for identifying sensitive attributes that may introduce unfairness/biases into downstream models. It is also highlighted that the paper is well-written and easy to follow, the presentation of the method is clear and sound, and the experiments/results are convincing and showcase the usefulness of the approach. Moreover, all reviewers agreed that this work is above the acceptance level. The weaknesses mentioned by the reviewers mostly relate to potential future work, such as a better analysis of the impact of the specific choice of DNN used to predict the attributes from the images, the use of additional datasets beyond the skin images analyzed, or the extension of the technique to unsupervised scenarios where knowledge about (potentially) bias-inducing attributes is not available.

    Interesting idea (systematic data auditing to screen for potential shortcuts) that I have not seen before in our area. The work is carried out nicely and the paper is clearly above the acceptance threshold for me.




Author Feedback

We thank the reviewers for their time and insightful feedback. We are deeply excited that all reviewers recognize the novelty and importance of screening image datasets for concerning features compared to auditing individual task models. Regarding weaknesses and areas for future work, we have the following comments:

== Unsupervised Case == We agree with R1 and R3 that adapting our method to situations without attribute labels could enhance its utility. We expect our work could greatly complement existing unsupervised, model-specific auditing approaches such as Eyuboglu et al.’s Domino. However, approaches in this space typically find underperforming clusters in the feature space of the model to be audited. How well these clusters remain coherent across different model sizes and architectures, i.e., whether they represent a dataset property rather than a model + dataset property, is an open question. These questions are good candidates for future work, and we will make a note of this in the conclusion.

== Additional Datasets / Modalities == We acknowledge the importance of applying our method to additional datasets and modalities and are actively pursuing this. We will make a note of this in the discussion.

== DNN Selection == Because we use DNNs to detect the presence or absence of attributes, R1 asked about our experiences with the robustness of the method to DNN selection. We believe strongly that the approach is robust to reasonable choices in DNNs for attribute prediction models. All results use attribute prediction models (ResNet18) that are substantially different and weaker than task prediction models (Swin Transformer tiny with RandAugment). Further, we emphasize that our already impressive results could be easily improved by simply selecting stronger attribute detection models.
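
To make the role of the attribute prediction model concrete, here is a minimal sketch (our own illustration using torchvision, not the authors' training code) of a ResNet-18 attribute detector whose held-out predictions would play the role of A_hat in the detectability analysis discussed below:

```python
# Minimal sketch, not the authors' code: an attribute prediction model of the
# kind described (ResNet-18), whose held-out predictions serve as A_hat.
import torch
import torchvision

attribute_model = torchvision.models.resnet18(weights=None)
attribute_model.fc = torch.nn.Linear(attribute_model.fc.in_features, 2)  # binary attribute A

# After standard cross-entropy training on (image, attribute-label) pairs,
# held-out predictions would be collected as, e.g.:
# a_hat = attribute_model(images).argmax(dim=1)
```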

== Detectability / Utility Dependence == R1 commented on the lack of independence between Detectability and Utility. This is visible in Fig 3b, where at high utility levels the detectability associated with a given attribute decreases. We believe this is an artifact of how we measure detectability. To avoid falsely claiming an artifact is detectable (as in Fig 4, left), we need to condition on Y to account for task-related image features. As a result, we measure detectability as I(A; A_hat | Y). Intuitively, in the case where A and Y are strongly related (Utility is high), knowing the task label means A is nearly determined, so learning A_hat does not convey much new information and detectability is smaller.

That being the case, we will clarify the following in the main text:

  1. This issue only affects our ability to compare relative magnitudes of detectability. The permutation scheme developed by Runge maintains the level of association between A and Y, so judgements on whether or not an artifact is detectable are not affected.
  2. Even with this limitation, given attributes of roughly equal utility, detectability is highly informative of the magnitude of counterfactual performance drop a task model trained on a given dataset will experience (Fig 3b and Section 4.4).

We are hopeful a corrective factor could be derived to address this.
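
For concreteness, the following is a minimal sketch of the quantities and test described above, under our own simplifying assumptions (discrete A, A_hat, and Y; a plug-in estimate of the conditional mutual information; and shuffling A_hat within each Y stratum as a simplified stand-in for Runge's permutation scheme). It is an illustration, not the implementation used in the paper:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mutual_information(a, a_hat, y):
    """Plug-in estimate of I(A; A_hat | Y) = sum_y p(y) * I(A; A_hat | Y=y)."""
    cmi = 0.0
    for y_val in np.unique(y):
        mask = (y == y_val)
        cmi += mask.mean() * mutual_info_score(a[mask], a_hat[mask])
    return cmi

def detectability_test(a, a_hat, y, n_permutations=1000, seed=0):
    """Null distribution of the CMI obtained by shuffling A_hat within each Y
    stratum, which preserves the A-Y association (utility) while breaking any
    A-A_hat link; the attribute is flagged as detectable only if the observed
    CMI exceeds the 95th percentile of this null."""
    rng = np.random.default_rng(seed)
    observed = conditional_mutual_information(a, a_hat, y)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        permuted = a_hat.copy()
        for y_val in np.unique(y):
            idx = np.flatnonzero(y == y_val)
            permuted[idx] = permuted[rng.permutation(idx)]
        null[i] = conditional_mutual_information(a, permuted, y)
    threshold = np.percentile(null, 95)
    return observed, threshold, observed > threshold
```

Because the shuffling happens within Y strata, the null samples preserve the A-Y association (the utility), which is why judgements about whether an artifact is detectable are unaffected by the utility level.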

== Minor Clarifications / Suggestions ==
- We will add R1’s suggested work to our related works section.
- We will adjust the wording as per R2’s suggestion.
- We will adjust the figures in accordance with the reviewers’ feedback.

For clarification:
- Fig 2: As noted by R1 and R2, we explore challenging cases and nearly imperceptible artifacts. We will update the figure so the differences are more apparent.
- Fig 4, Left: AUC measures uncorrected performance at detecting the nonexistent artifact.
- Fig 4, Middle: The scale is the same as in Fig 3b. Here, no artifacts are introduced to the images (but the artifact labels are biased towards the task label). Because the CMIs are near zero, we (correctly) conclude no artifacts are present.
- Fig 4, Right: Exceeding 95% of the CMI values from samples permuted with Runge’s method is our preferred conservative threshold for artifact presence.
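
The distinction between the left and middle panels of Fig 4 can also be illustrated with a small synthetic example of our own construction (not the paper's data or code): when the artifact labels are biased towards the task label but no artifact is actually present, a naive association between A and the detector output A_hat can appear positive simply because both relate to Y, while the conditional quantity I(A; A_hat | Y) stays near zero.

```python
# Synthetic illustration (our own construction): biased artifact labels, no real artifact.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 20_000
y = rng.integers(0, 2, size=n)                    # task label
a = np.where(rng.random(n) < 0.8, y, 1 - y)       # artifact label biased toward Y (utility > 0)
a_hat = np.where(rng.random(n) < 0.7, y, 1 - y)   # "detector" output that only picks up Y-related features

utility = mutual_info_score(a, y)                 # MI(A, Y): clearly positive
naive = mutual_info_score(a, a_hat)               # MI(A, A_hat): also positive, misleading on its own
cmi = sum((y == v).mean() * mutual_info_score(a[y == v], a_hat[y == v])
          for v in np.unique(y))                  # I(A; A_hat | Y): near zero -> no artifact detected
print(f"MI(A, Y)={utility:.3f}  MI(A, A_hat)={naive:.3f}  I(A; A_hat | Y)={cmi:.4f} (nats)")
```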


