
Authors

Emma A. M. Stanley, Matthias Wilms, Nils D. Forkert

Abstract

Despite the remarkable advances in deep learning for medical image analysis, it has become evident that biases in datasets used for training such models pose considerable challenges for clinical deployment, including fairness and domain generalization issues. Although the development of bias mitigation techniques has become ubiquitous, the nature of inherent and unknown biases in real-world medical image data prevents a comprehensive understanding of algorithmic bias when developing deep learning models and bias mitigation methods. To address this challenge, we propose a modular and customizable framework for bias simulation in synthetic but realistic medical imaging data. Our framework provides complete control and flexibility for simulating a range of bias scenarios that can lead to undesired model performance and shortcut learning. In this work, we demonstrate how this framework can be used to simulate morphological biases in neuroimaging data for disease classification with a convolutional neural network as a first feasibility analysis. Using this case example, we show how the proportion of bias in the disease class and proximity between disease and bias regions can affect model performance and explainability results. The proposed framework provides the opportunity to objectively and comprehensively study how biases in medical image data affect deep learning pipelines, which will facilitate a better understanding of how to responsibly develop models and bias mitigation methods for clinical use. Code is available at github.com/estanley16/SimBA.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_46

SharedIt: https://rdcu.be/dnwyZ

Link to the code repository

https://github.com/estanley16/SimBA

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a pipeline to simulate brain MRI with varying disease and bias effects by applying diffeomorphic transformations to a real-world template MRI. This allows the systematic study of how biases in medical imaging data affect deep learning models.
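
    For illustration, the core generation step described above can be sketched as follows (a minimal sketch, not the authors' SimBA implementation: the array shapes, deformation amplitudes, the use of scipy, and the simple addition of displacement fields in place of a proper composition of diffeomorphic transformations are all assumptions made for brevity):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(template, displacement):
    """Warp a 3D volume with a dense displacement field of shape (3, D, H, W)."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in template.shape], indexing="ij"))
    coords = grid + displacement  # pull-back sampling coordinates
    return map_coordinates(template, coords, order=1, mode="nearest")

rng = np.random.default_rng(42)
template = rng.random((32, 32, 32))                      # stand-in for the template MRI

d_subject = 0.5 * rng.standard_normal((3, 32, 32, 32))   # sampled anatomical variability
d_disease = np.zeros_like(d_subject)
d_disease[:, 8:12, 8:12, 8:12] = 1.5                     # localized "disease" deformation
d_bias = np.zeros_like(d_subject)
d_bias[:, 20:24, 20:24, 20:24] = 1.5                     # localized "bias" deformation

# Simplification: fields are added rather than composed as diffeomorphisms.
synthetic_image = warp(template, d_subject + d_disease + d_bias)
```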

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Tackling the problem of bias in deep learning and in medical applications is essential for the deployment of such models in the clinic. Since real-world data is often subject to subtle and unknown sources of bias, creating semi-synthetic data for systematic evaluation is highly relevant to advancing the field. The proposed approach is relatively simple, yet allows simulating the types of morphological biases one would encounter in the wild.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The last couple of years have seen many works in the area of bias in machine learning models for medical applications. Unfortunately, the paper does not capture this and misses much important previous work. Moreover, the experimental evaluation is itself based on a biased test dataset, which is not ideal, since bias can affect the performance measure too. It is highly recommended to evaluate on test data that is not biased, to precisely analyze to what extent a model trained on biased data relies on the bias.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility appears to be fair.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper addresses the need for generating semi-synthetic medical datasets in which one can control the type and amount of bias. It is an important step toward providing benchmark datasets for studying bias mitigation techniques in medical applications. Nevertheless, the paper could be improved substantially.

    First, the paper leaves out much important previous work on bias in medical data, such as: https://doi.org/10.1016/j.neuroimage.2020.117002, https://doi.org/10.1007/978-3-030-87199-4_39, https://doi.org/10.1016/j.media.2020.101879, https://arxiv.org/abs/2106.01132, https://doi.org/10.1007/978-3-031-16452-1_55, https://doi.org/10.1007/978-3-031-16431-6_9, https://doi.org/10.1073/pnas.1919012117, https://doi.org/10.1007/978-3-031-16452-1_59.

    Second, I would suggest precisely defining what type of bias the proposed approach is supposed to tackle. Unfortunately, the term “bias” has been used very loosely in the community, despite the fact that causal inference provides a precise definition. Discussing what “bias” means and which type of bias (confounding, selection bias, M-bias) is simulated would help to clarify which scenarios the proposed approach tries to emulate.

    Another concern is with respect to the evaluation procedure. It is my understanding that evaluation has been performed on test data that is biased too, i.e., it contains the same imbalances as the training data. This is a problematic setup for studying to what extent a deep learning model depends on the bias in the training data, because the bias will also impact the evaluation on the test data. Therefore, I would recommend evaluating on test data that is unbiased, i.e., the regional bias transformation is applied equally to the disease and non-disease class. This way, one would only evaluate the performance for classifying the “real” disease. Currently, Tables 1 and 2 in the supplement contain conflicting results: for “Near”, the “No Bias” results are worse than the “Bias” results in Table 1, while “No Bias” is better than “Bias” in Table 2. This appears counter-intuitive, and I wonder whether it has to do with evaluating on biased test data.
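
    One way to construct such an unbiased test split is sketched below (a hypothetical illustration, not the authors' code: the variable names and the 90%/10% co-occurrence rates are assumptions). The bias transformation is applied with the same probability in both classes, so bias carries no information about the class label at test time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 150

disease = rng.integers(0, 2, size=n_test)          # 0 = control, 1 = disease

# Biased (training-like) split: bias co-occurs with the disease class in, e.g., 90% of cases.
bias_biased = np.where(disease == 1,
                       rng.random(n_test) < 0.9,   # bias mostly applied in the disease class
                       rng.random(n_test) < 0.1)

# Unbiased test split: the bias transform is applied with equal probability in both classes.
bias_unbiased = rng.random(n_test) < 0.5

for name, b in [("biased", bias_biased), ("unbiased", bias_unbiased)]:
    print(f"{name}: P(bias|disease)={b[disease == 1].mean():.2f}  "
          f"P(bias|control)={b[disease == 0].mean():.2f}")
```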

    Another problem arises when studying confounding bias (if the bias region causally affects the disease region and the class label). In this setting, evaluating the total performance would entangle the direct effect (bias region → class label) and the indirect effect (bias region → disease region → class label). Mediation analysis would disentangle these two paths and allow quantifying to which extent a model relies on the disease mechanism for prediction.

    Regarding the generation of the synthetic class label: it is not clear to me how the binary class label was generated from the subject, disease, and bias effects. Figure 1 in the supplement only provides a high-level view. Is a class label sampled (from a binomial distribution) or assigned deterministically?
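
    For concreteness, the two alternatives this question refers to could look as follows (a hypothetical sketch; the variable names and probabilities are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
has_disease_effect = rng.integers(0, 2, size=10).astype(bool)

# Deterministic assignment: the label simply equals the presence of the disease effect.
label_deterministic = has_disease_effect.astype(int)

# Stochastic assignment: the disease effect only raises the probability of a positive label.
p_positive = np.where(has_disease_effect, 0.9, 0.1)
label_sampled = rng.binomial(1, p_positive)
```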

    Please define which regions are “near”, “middle”, “far”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I acknowledge that the paper addresses an unmet need to systematically study bias in medical data. However, I would suggest defining “bias”, as it is used in this paper, in the language of causal inference to avoid ambiguities. Moreover, the discussion of related work does not capture the current state of the community. Finally, the evaluation on biased test data is problematic.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    If the final paper improves the discussion of related work and defines the paper’s notion of “bias” more precisely, I would recommend accepting the paper.



Review #2

  • Please describe the contribution of the paper

    This paper describes a framework for generating synthetic neuroimaging data with realistic, controllable distributions of bias and disease effects, and for studying how deep learning networks may take them into account.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The issue the paper addresses is a critical one: determining whether a deep learning network learnt a spurious correlation from a biased distribution is still not straightforward. The research domain of deep learning explainability, and the study of the reliability of the explainability methods themselves, is still a field in which no consensus has been reached. The experiments conducted are insightful, as they show that the interaction between bias and disease effects may also depend on the distance between these effects.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main challenge of a generative procedure is to show that its output is diverse. Here, the authors explain that the set of deformations they sample from is estimated from 50 images, and they then generate 500 images. They also conclude that the framework allows generating a “synthetic dataset of arbitrary size”. The authors should be more careful with this conclusion and actually check that new images bring information compared to the previous ones: possibly they would obtain the same performance/results with only 50 synthetic images, or by using the 50 original images and applying random transforms directly to them to simulate bias and disease effects. Secondly, this reviewer actually draws different conclusions than the authors from the conducted experiments. The authors state: “When the regions are farther apart from each other, the CNN filters become more tuned to recognize bias effects separately from disease effects.” But what is visible in Figure 2 is that the bias always affects the results, only differently depending on the distance: if the effects are close, most biased images are labelled as diseased and the non-biased group is correctly classified, whereas if the effects are far apart, biased images are correctly classified and this time non-biased images are mostly classified as non-diseased. So in the first case an image is considered diseased if there is a bias OR a disease effect, whereas in the second case an image is considered diseased only if there are bias AND disease effects; in both cases, the CNN needs to recognise both effects to apply the rule (a toy illustration of these two decision rules is sketched below). The attribution maps (Figure 2.B) could also lead to interesting conclusions, but visualising only one participant is not enough to support the conclusion that “XAI may not always be a reliable tool to uncover sources of bias in medical image data” (though this reviewer agrees with this statement). To strengthen their point, a quantitative analysis of the attribution maps should be conducted to evaluate the relative influence of the bias and disease regions in the attribution map.
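
    A toy illustration of the two hypothesised decision rules (purely illustrative; not derived from the paper's model or data):

```python
import numpy as np

bias    = np.array([0, 0, 1, 1], dtype=bool)
disease = np.array([0, 1, 0, 1], dtype=bool)

# "Near" scenario as described above: the model effectively predicts "diseased" if bias OR disease is present.
pred_near = bias | disease
# "Far" scenario: the model effectively predicts "diseased" only if bias AND disease are present.
pred_far = bias & disease

# Columns: bias, disease, prediction (near), prediction (far)
print(np.column_stack([bias, disease, pred_near, pred_far]).astype(int))
```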

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The data set, template, and software (version missing though) used to estimate the deformation fields are clearly stated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The authors should show that their generative model produces diverse images (otherwise there is no point in generating an arbitrarily large data set or, more specifically, in simulating normal anatomical variability instead of directly applying disease/bias effects to normal participants). To achieve this, results obtained with different data set sizes could be compared.
    2. This reviewer is actually more interested in the results of the experiments conducted with the synthetic data than in the framework itself. Ideally, a better analysis of the attribution maps obtained with SmoothGrad could be conducted (the authors could, for example, use the metrics from doi:10.1117/12.2653809); a minimal sketch of one such region-level quantification is given after this list.
    3. Please also reconsider the conclusions on the recognition of bias and disease effects depending on their distance.
    4. “Each experiment simulated and used 500 datasets of voxel dimensions (173x211x155) with a 55%/15%/30% train/validation/test split, stratified by disease and bias labels.” — here the reviewer assumes that the word “datasets” should be replaced by “samples”. This reviewer is very enthusiastic about the possible experiments that could be conducted by the team, especially on the ability of explainability methods to correctly identify spurious correlations.
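
    As referenced in point 2 above, one simple region-level quantification would be the fraction of total absolute attribution falling inside the disease and bias regions (a minimal sketch under assumed mask and map shapes; this is not the metric from the cited reference, and SmoothGrad itself is not computed here):

```python
import numpy as np

def region_attribution_fraction(attr_map, mask):
    """Fraction of total absolute attribution that falls inside a binary region mask."""
    a = np.abs(attr_map)
    return a[mask].sum() / a.sum()

rng = np.random.default_rng(0)
attr_map = rng.random((32, 32, 32))              # stand-in for a SmoothGrad attribution volume
disease_mask = np.zeros_like(attr_map, dtype=bool)
disease_mask[8:12, 8:12, 8:12] = True
bias_mask = np.zeros_like(attr_map, dtype=bool)
bias_mask[20:24, 20:24, 20:24] = True

print("disease region:", region_attribution_fraction(attr_map, disease_mask))
print("bias region:   ", region_attribution_fraction(attr_map, bias_mask))
```
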
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main outcome of the paper is the generative framework (the conclusions of the experiments are not even given in the conclusion). This is a bit light considering the potentially weak advantage this generator offers compared to less realistic synthetic data. However, the results of the experiments are interesting, and there is a real need for discussing this topic (evaluating the limitations of deep learning networks and explanation methods) in our community.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    This paper leads to interesting considerations and future developments, which is why this reviewer supports its publication. Supplementary experiments would be necessary to support the scientific findings, but these cannot be performed at this stage.



Review #3

  • Please describe the contribution of the paper

    The article proposes a framework to introduce biases (in the form of morphological changes), a disease (synthetic, in the form of a morphological variation of the original template), and inter-subject variability. The purpose is to have a framework that allows disentangling the different effects and hence provides some control over, and understanding of, their impact.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a good foundation and generic framework to evaluate biases in medical images from a morphological perspective. The explanations and “simplicity” (compared to other potential contributions geared towards generative and diffusion models) are a strong point of the work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In spite of the clear objective and the interesting approach to tackling potential morphological biases, the article seems to miss a more comprehensive evaluation. The authors tackle the evaluation for a particular set-up (brain MRI) without specifying the disease and, more importantly, what “morphological bias” means and entails in this particular context. The definition of “morphological bias” remains a bit blurry, and it is unclear to me how the authors interpret it. Further, the generic evaluation does not help in understanding the utility of the framework in a specific clinical context, limiting the potential of the study.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will share the code if accepted. The experiments seem reproducible based on seeding information.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The article is well structured and the writing is clear, since it is fairly easy to follow the development of the objectives of the work. However, the evaluation and the definition of “morphological bias” are, in my opinion, not clear enough. The whole work revolves around the analysis of such morphological biases in a very specific setting (disease), and it is hard to assess the utility of the framework per se without a concrete case scenario with a well-defined disease, where we have some prior knowledge to hypothesise how the bias-disease distance and different morphological changes might affect the decision-making for the specific disease. I understand the authors wanted to keep the work as generic as possible, highlighting its utility in other contexts, but in my personal opinion it would have been better to be clearer and more specific in the evaluation protocol (a specific disease, the kinds of biases, the basis for choosing them, how the “near”/“middle”/“far” distances are defined and why, and their clinical relevance).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I see the potential of the framework, but I struggle to see its utility given the current evaluation. I think the article has potential, but it is borderline.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    There is consensus among the reviewers that the presented work aims to address an important problem in the field. However, there is variability in the scores, which reflects concerns regarding different aspects of the work. First, the paper does a poor job of providing appropriate context for the work and misses relevant literature. Second, and more importantly, there are concerns regarding the evaluation setting. These are driven in part by differences in interpreting the results (R#1, R#2), but also by a lack of clarity regarding what bias means and which scenarios the proposed approach tries to emulate (R#1, R#3). Importantly, some of the conclusions are drawn from attribution maps derived for only one participant; a more systematic quantitative approach is required, and a colormap for the figure is also missing. Lastly, some methodological concerns should be noted. There are concerns regarding the ability of the method to produce datasets of arbitrary size, and some clarification about the data generation procedure is required. Beyond what the reviewers note, some additional methodological limitations need to be discussed. First, is it appropriate to do PCA on velocity fields? Second, the combination of one atlas and diffeomorphisms means that the cortical anatomy doesn’t change much, in the sense that we always have the same gyri and sulci, which is not realistic. Third, why interpolate v_d and v_b? Aren’t these supposed to be localized? Fourth, it is not clear how one would create biased data with a spatially dispersed bias effect (e.g., a different contrast).




Author Feedback

We thank the reviewers and AC for the positive feedback and thoughtful comments on our novel method for simulating and evaluating bias in MIC models. We appreciate that all reviewers highlight that our work addresses an important, unmet need in the domain. We apologize for not initially including more literature on bias in MIC. However, we would like to emphasize that none of the uncited papers propose methodologies similar to ours, and some of the papers suggested by R1 are already cited in our paper ([9, 22]). Nevertheless, we agree that a more extensive review of challenges surrounding bias is helpful and will add more references to the final paper. We appreciate the comments by R1/R3 on how the word bias is used. We agree that it was not well enough defined and will modify this in the final paper. More precisely, bias in our work is understood as a property of the data (e.g., class/attribute imbalance, spurious correlations) used for training a MIC model that can lead to model shortcut learning and/or failure to adequately represent data subgroups, which may lead to reduced generalizability and/or fairness when applied in real world scenarios. While R1 is correct in saying that the term bias is used loosely and that future definitions rooted in causality are needed, our definition aligns well with how others in the community have used it (see [3,5,9]). We agree with R1 that evaluating models on unbiased data is valuable to tease out differences between dataset and model bias. While we did perform experiments on balanced test data, which showed trends similar to those shown in the paper, we limited our initial evaluation due to space constraints and since inference on biased data is commonly done in literature and is also representative of the real world (i.e., a balanced dataset is rarely feasible in clinical settings). R2 mentions an important point about diversity and dataset size. Although our PCA-based mechanism has long been utilized in the statistical shape modeling community and we are confident that it can simulate a large amount of variability, it is unknown how well those variations will be picked up by models. We will add a discussion point that an additional benefit of our framework is that it can be used to investigate how differing levels of variability are interpreted by various architectures. We appreciate R2’s thoughtful interpretation of the results, which does not necessarily conflict with our interpretation, but rather provides an alternate explanation of how the CNN may perform shortcut learning. We will add that understanding these mechanisms exemplifies a highly interesting direction for future work to the paper. Regarding the AC’s comments: Stationary velocity fields in the log-Euclidean framework of diffeomorphic transformations form a linear space and, therefore, it is theoretically sound to apply linear statistics such as PCA [1]. We acknowledge the limitation that cortical topology will not be altered in the current setup. However, topology-breaking intensity alterations can be easily integrated in our framework via additive transformations (e.g., via PCA-based active appearance models). We also apologize for the confusion around v_D/v_B interpolation and will clarify in the paper that we densified the sparse fields. We would also like to clarify to reviewers that the saliency maps shown were averages of 25 subjects rather than just one. We will add the missing colorbar (AC) and clarify that class labels are assigned deterministically (R1). 
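
For illustration, the PCA-based sampling mechanism referred to above can be sketched as follows (a minimal sketch with assumed field shapes, subject counts, and scikit-learn's PCA; this is not the released SimBA code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_subjects, field_shape = 50, (3, 16, 16, 16)   # 50 template-to-subject stationary velocity fields

# Stationary velocity fields form a linear space, so PCA on the flattened fields is well defined.
V = rng.standard_normal((n_subjects, np.prod(field_shape)))   # stand-in for registered fields

pca = PCA(n_components=10)
pca.fit(V)

# Sample a new velocity field by drawing PCA coefficients from the fitted Gaussian model.
coeffs = rng.standard_normal(10) * np.sqrt(pca.explained_variance_)
v_new = (pca.mean_ + coeffs @ pca.components_).reshape(field_shape)
```
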
We appreciate the suggestions for further evaluation, including quantification of XAI (R2), mediation analysis (R1), and clinical disease modeling (R3). Although we cannot include additional results due to MICCAI guidelines, we would like to emphasize that these future directions again highlight the utility of the proposed, publicly available framework for enabling researchers to work together to investigate the various challenges related to bias.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have sufficiently addressed the concerns that were previously raised. There is a consensus that the merits of the paper outweigh its weaknesses.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed method tackles an important problem of simulating brain data with different bias effects. The rebuttal clarified most questions and the reviewers agreed to accept the paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the reviews and the authors' feedback, it seems that the paper is still lacking some major comparisons and details. I would recommend rejection.


