
Authors

Milda Pocevičiūtė, Gabriel Eilertsen, Stina Garvin, Claes Lundström

Abstract

Multiple-instance learning (MIL) is an attractive approach for digital pathology applications as it reduces the costs related to data collection and labelling. However, it is not clear how sensitive MIL is to clinically realistic domain shifts, i.e., differences in data distribution that could negatively affect performance, nor whether existing metrics for detecting domain shifts work well with these algorithms. We trained an attention-based MIL algorithm to classify whether a whole-slide image of a lymph node contains breast tumour metastases. The algorithm was evaluated on data from a hospital in a different country and on various subsets of this data that correspond to different levels of domain shift. Our contributions include showing that MIL for digital pathology is affected by clinically realistic differences in data, evaluating which features from a MIL model are most suitable for detecting changes in performance, and proposing an unsupervised metric named Fréchet Domain Distance (FDD) for quantification of domain shifts. Shift measure performance was evaluated through the mean Pearson correlation to change in classification performance, where FDD achieved 0.70 on 10-fold cross-validation models. The baselines included Deep ensemble, Difference of Confidence, and Representation shift, which resulted in 0.45, -0.29, and 0.56 mean Pearson correlation, respectively. FDD could be a valuable tool for care providers and vendors who need to verify whether a MIL system is likely to perform reliably when implemented at a new site, without requiring any additional annotations from pathologists.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_16

SharedIt: https://rdcu.be/dnwGT

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the urgent problem of identifying domain shifts between the datasets used for training a model and the clinical data it is being deployed on. Specifically it proposes a method of quantifying domain drift for MIL applications and demonstrates its applicability on two large publicly available datasets that illustrate several clinically relevant sources of domain drift.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper are:

    1. It looks at domain drift in MIL - an understudied area of great clinical relevance
    2. It proposes a simple yet effective modification to the Fréchet Inception Distance (a standard method for assessing the quality of GAN outputs) that is motivated by the attention mechanism commonly used in MIL.
    3. The evaluation is based on a widely used, publicly available MIL method and is very thorough. Ablation studies comparing different aggregation policies for the proposed metric are presented, and comparisons to existing approaches to this problem are provided.
    4. The selection of the different domains represented by the datasets is well motivated and clinically relevant.
    5. The method could be applied to any MIL problem and is not restricted to digital pathology.
    6. The paper is clearly written and easy to understand.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are no major weaknesses in the paper. Some comment on the downstream implications and significance of the relatively modest domain shifts reported in Table 1 would have been useful.
    In future work it may be useful to include a task and dataset that reflect a more significant degradation in performance due to domain shift. Since I am unaware of any such task, and since the experiments presented here already represent a significant contribution, this is not really a criticism of the current paper. There are also advantages in looking at a task with subtle domain shifts, since this is a more taxing problem.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The core CLAM method is publicly available and has been widely used as a baseline in MIL studies. Both datasets are available to researchers (the more recent dataset requires a specific request to be made, but presumably this will not prevent someone from attempting to reproduce the experiments). The supplemental file was missing on the CMT site - I assume this would have contained additional information needed to reproduce the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This is a well written paper that tackles an urgent and important problem in the clinical deployment of AI models. The clinical problem is well posed and the methods and dataset selected provide an excellent test case for the proposed metric.

    Some minor comments on the manuscript.

    1. The supplementary material is absent from the CMT site - this should be addressed. The supplementary material should only provide additional tables for parameters, model architecture, etc. - there are several references to further discussion in the supplementary material (Sections 5.3 and 5.2), suggesting that too much additional material has been added.
    2. Table 2 does not contain rows reporting the results for the different aggregation strategies (presumably a space issue). Some of the results are summarized in the text and the plots in Fig. 2 show the details. It would be useful to add the mean (SD) PC to the boxplot figures (or, if the submission rules allow for additional space, reinstate the missing rows in Table 2).
    3. The last sentence of the penultimate paragraph on page 6 is unclear. Should there be a comma after “For most models”?
    4. This section is confusing and seems to be over-reaching a little: “From Fig. 1 we can see that if we further investigated all model-dataset combinations that resulted in FDD64 above 0.5, we would detect many cases with a drop in performance larger than 0.05. However, the drop is easier to detect on axillary and lobular datasets compared to others.”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an understudied problem that has to be solved in order to safely deploy AI models in the clinic. The clinical task chosen is highly relevant as it is likely one of the first applications to be widely deployed in digital pathology. The experimental design was excellent and the proposed metric is logical and was well motivated.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    8

  • [Post rebuttal] Please justify your decision

    I think the rebuttal answered most questions well. The main criticism of R3 is that “FDD is only useful when you know the unknown domain.” - this is not really a major disadvantage at all. We are a very long way from a situation where a new AI algorithm is adopted in a large medical centre without some kind of validation. In such a scenario, test images from the new domain will certainly be available, but ground-truth labels may not (e.g., for survival studies it takes a follow-up period of many years to get ground truth). This method provides a way to evaluate whether significant domain drift is present when it is not possible to check the accuracy of a model directly. It could also be used to detect domain drift due to a change in practice or equipment, and therefore may be applicable in an MLOps QA pipeline.



Review #2

  • Please describe the contribution of the paper

    The paper shows that MIL models for digital pathology are affected by domain shifts, evaluates which features from a MIL model are most suitable for detecting changes in performance in these settings, and finally proposes a metric, the Fréchet Domain Distance (FDD), for quantifying domain shifts. The experiments show how the FDD metric correlates with a drop in performance when tested on a domain-shifted test set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper tackles a very relevant practical problem for AI applications in digital pathology. Domain shifts are a key factor limiting the use of ML models in the wild. These experiments can be useful for detecting distributional shifts in data when such tools are used in real-world clinical applications.

    • The paper presents a novel metric, motivated by Fréchet distances, to quantify domain shift. This is done by aggregating the features across all WSIs of each dataset and computing the mean and covariance matrix of these features (a minimal illustrative sketch follows this list).

    • Analysis of the proposed FDD metric shows that it is most correlated with changes in performance on other held-out test sets which are constructed to simulate domain shifts.

    • The paper shows many interesting findings. For example, the other baseline methods like DE and DoC are unable to reflect the loss of performance in the metrics. Also, high attention weights are a good indicator of patches relevant for measuring domain shifts. The random K patch selection for features provides a good baseline for measuring the efficacy of the patch features. Analyzing both the mean Pearson correlation and its standard deviation helps convince the readers of which measure is best for quantifying drops in performance when a different test set is introduced. The low std of the FDD metric is a good sign.
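
    To make the second bullet concrete, below is a minimal sketch of what such an attention-guided Fréchet-distance computation could look like. The function names, the top-K attention selection (K=64), and the data layout are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def dataset_statistics(wsi_features, wsi_attention, top_k=64):
    """Pool the top_k highest-attention patch features of every slide,
    then compute the mean and covariance over the pooled features."""
    pooled = []
    for feats, attn in zip(wsi_features, wsi_attention):  # feats: (n_patches, d), attn: (n_patches,)
        top_idx = np.argsort(attn)[-top_k:]               # most-attended patches of this WSI
        pooled.append(feats[top_idx])
    pooled = np.concatenate(pooled, axis=0)               # (n_slides * top_k, d)
    return pooled.mean(axis=0), np.cov(pooled, rowvar=False)

# Hypothetical usage: reference statistics come from the training-domain slides,
# new-site statistics from the unlabelled slides of the new hospital.
# mu_ref, cov_ref = dataset_statistics(ref_features, ref_attention)
# mu_new, cov_new = dataset_statistics(new_features, new_attention)
# shift_score = frechet_distance(mu_ref, cov_ref, mu_new, cov_new)
```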

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Domain shifts are hard to quantify, so drops in classification performance from the in-distribution test set to other test sets are used as a proxy for the extent of domain shift. Table 1 & Fig. 1 show this effect. However, the direction of performance change is not consistent between MCC and ROC, highlighting the importance of the choice of metric.

    • The MCC metric depends on the positive prevalence of the label. Hence, when quantifying domain shifts using drops in MCC, the paper fails to acknowledge that part of the reason why the MCC value is different on a new test set can potentially be a different prevalence rate (a toy illustration follows the references below). The label shift from Camelyon to the other datasets is small, hence the effect on MCC will be small as well, but seeing these Pearson correlations replicated for AUROCs and AUPRCs would have been more convincing.

    • There is a wide variety of methods and distances which have been used before to detect OOD samples and domain shifts [1,2,3]. The baseline methods for comparison could have been more up to date with recent literature.

    [1] Yang, J., Zhou, K., Li, Y., & Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.
    [2] Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., … & Lakshminarayanan, B. (2019). Likelihood ratios for out-of-distribution detection. Advances in Neural Information Processing Systems, 32.
    [3] Fort, S., Ren, J., & Lakshminarayanan, B. (2021). Exploring the limits of out-of-distribution detection. Advances in Neural Information Processing Systems, 34, 7068-7081.
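
    As a toy illustration of the prevalence point raised above (hypothetical class counts and error rates, using scikit-learn): a classifier with fixed sensitivity and specificity yields different MCC values purely because the positive prevalence of the test set changes.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_at_prevalence(n_pos, n_neg, sensitivity=0.8, specificity=0.9):
    """MCC of a hypothetical classifier with fixed per-class error rates,
    evaluated on a test set with the given class counts."""
    tp = int(round(sensitivity * n_pos)); fn = n_pos - tp
    tn = int(round(specificity * n_neg)); fp = n_neg - tn
    y_true = np.array([1] * n_pos + [0] * n_neg)
    y_pred = np.array([1] * tp + [0] * fn + [0] * tn + [1] * fp)
    return matthews_corrcoef(y_true, y_pred)

print(mcc_at_prevalence(450, 550))  # ~45% positives: MCC ≈ 0.71
print(mcc_at_prevalence(100, 900))  # ~10% positives: MCC ≈ 0.56, same error rates
```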

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is reproducible and the details are enumerated clearly.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Consider repeating the study with metrics other than MCC, adjusting MCC to be invariant to the label prevalence, or ensuring that the label distribution is exactly the same across the test sets.

    Also, repeating this exercise for another MIL problem in digital pathology would make the case for FDD stronger, as different MIL problems behave very differently with respect to the final attention distribution and patch features. Thus, some of the findings might be specific to the Camelyon problem.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an interesting way to measure domain shifts in data using the Fréchet distance and attention values from an MIL model. The experiments conducted on Camelyon and the other test sets show a high correlation between changes in performance and the proposed metric. Different subsets of the test sets, which are used to simulate domain shifts, also show some expected behavior in terms of the proposed metrics. The paper still has issues with the choice of metrics used for quantifying drops in performance as well as the different aggregation strategies used. However, the proposed metric is simple, intuitive, and shown to be empirically useful. Hence my vote is for an accept.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The paper proposes an interesting way to measure domain shifts in data using the Frechet Distance and attention values for an MIL model. The proposed metric is simple, intuitive and shown to be empirically useful. The rebuttal answers some of my questions. My vote is for an accept.



Review #3

  • Please describe the contribution of the paper

    The paper reports how clinically realistic domain shifts affect attention-based MIL for digital pathology, showing that domain shift may affect MIL. It also advocates using attention for feature selection and proposes a distance metric for quantifying the expected performance drop. The Fréchet Domain Distance seems to be the main proposal.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written. The experiments are clear. The data seems to be sufficient for the FDD purpose.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Proposing to use the well-known Fréchet distance to estimate domain shift appears to have some benefit in the lab. If I know the unknown domain well enough to measure FDD, then it is not unknown anymore. FDD seems to have limited value beyond an experimental setting. Besides, why should MIL be more (or less) susceptible to domain shift?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Can you think of a scenario where you do not know the new hospital and FDD is still useful? Adding a classifier that shows the same behavior/sensitivity as MIL would help to clarify some things. Also, please clarify how FDD can be used when the other domains are not known.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    FDD is only useful when you know the unknown domain.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes the Fréchet Domain Distance, a method to quantify domain shifts between training and testing data in histopathology, specifically for multiple instance learning (MIL). Key strengths of the paper include the relevance and timeliness of the application, a simple but effective metric, and a broad evaluation of domain shift and the proposed method in the context of MIL.

    Major concerns include the choice of MCC as the central metric for evaluation given its dependence on label prevalence, the limited range of the selected baseline methods, and the missing connection to out-of-distribution (OOD) detection. Furthermore, the variance of the MCC and - potentially even more - of the FDD in the cross-validation setting seems to point to fairly unstable training of the MIL model. For the rebuttal, the authors may want to comment on this aspect as well as clarify the choice of metric for quantifying drops in performance and the (potentially missing) results on aggregation strategies. R#3 further criticizes that the domain needs to be known to assess the domain shift. Rephrasing this question, the authors may want to highlight to what extent their approach will work on single images and/or small datasets, and to what extent the method allows identifying subsets with high/low domain shift in a large dataset from a new domain. Currently, the authors derive the subsets based on known labels/subtypes.

    Note: The supplementary material was indeed removed due to excessive length.




Author Feedback

We thank the reviewers for the constructive feedback. It is encouraging to see that the reviewers deem our work clinically relevant and innovative (R1, R2). Below, we hope to have addressed the concerns raised.

Suitability of MCC metric (R2)
R2 raises a valid point that the MCC metric could be affected by differences in label prevalence. Label prevalence varies between 35% and 45% in the test data, which should not result in substantial effects on the MCC metric. We cite a paper showing that MCC is a suitable metric for the imbalanced datasets common in medical domains such as ours. Moreover, we argue that fixing a threshold on validation data better represents a real-world setup. We agree that ROC-AUC is also an interesting metric, but the page limit forced us to choose, and MCC is our priority. We will improve the discussion regarding this in the manuscript.

Comparison to OOD detection (R2, R3)
We agree that OOD methods are related, but due to the focus on domain shift and the page limit we felt that other references had higher priority. Domain shift detection refers to the overall change in the expected performance of a model between datasets, for example, quantifying the change in performance between two medical centres without pinpointing the exact samples where performance will deteriorate the most. In contrast, OOD detection aims to identify individual samples that significantly differ from the in-domain data. The proposed FDD metric is for domain shift quantification and does not address OOD, i.e., individual significantly different samples. Therefore, we deem that the selected baseline methods, which include approaches based on uncertainty, softmax score, and latent features, cover the alternative options well. We will clarify the difference between OOD and domain shift detection in the manuscript.

Unidentified domains (R3)
R3 says that the domain shift needs to be known and, like MR1, we understand this to refer to the fact that the study subsets need to be identified beforehand. This is true; the FDD method does not include subset identification. Nevertheless, the method is widely applicable, as factors known to be relevant for domain shift are abundant: disease or demographic subgroups, equipment or processing differences, etc. MR1 further asks for clarification on whether FDD works for single images or small subsets. FDD does not work for single images, as it is not meant for OOD detection. However, it can work well on quite small subsets, given that their features can be approximated by a normal distribution in the Fréchet distance computation. Due to the page limit, we are unable to perform a study to establish the smallest subset size that still achieves useful domain shift detection with FDD.
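
For reference, the quantity behind the Gaussian approximation mentioned above is the standard closed-form (squared) Fréchet distance between two Gaussians (standard notation, not the paper's):

```latex
d_F^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)
```

The requirement on subset size amounts to having enough slides for the mean and covariance estimates in this expression to be meaningful.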

Variation in MCC and FDD results (R2)
There is substantial variation in the MCC and FDD outcomes in the cross-validation, and it is relevant to consider whether this could indicate unstable training of the MIL model. We are confident, however, that this is not the case. For MCC, there is a low standard deviation on the in-domain test data, showing consistent performance of all models. The other evaluation datasets were designed to cause domain shifts, hence it is reasonable that among the 10 cross-validated models some are slightly more robust than others by chance. Regarding FDD, there is indeed varying performance. However, all evaluated domain shift metrics exhibited this behaviour, with FDD having the smallest variation. Future work could study the connection between the robustness of a MIL model to domain shift and the effectiveness of FDD.

Aggregation (R2)
The explanation for the missing information on aggregation strategies is that this indeed was part of our submitted supplementary material, which unfortunately was removed as it exceeded the page limit (a miscommunication regarding the guidelines). We will remove all references to supplementary content from the manuscript.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper addresses a relevant problem in the clinical application of MIL approaches by addressing/identifying domain shifts in unlabelled data. The reviews in the initial round were quite diverse, with strengths including the clinical importance of the problem (domain shift), the simplicity of the selected metric, and a fairly broad evaluation. The most critical points include the choice of metric and the applicability/use case of the proposed approach. In their rebuttal, the authors answered most questions satisfactorily from my perspective. While R#3 was very critical of this work and some comments may be valid for a number of use cases, the approach still has value given potentially large archives/datasets where only meta-information is available. While this work is, from my perspective, still at an early stage (e.g., what are the actual implications when the metric is above/below a certain value? How does one move from some form of quantification of domain shift to an intervention strategy?), research in this direction is important and justifies discussion and presentation at MICCAI.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper had a wide range of scores, from 3 (reject) to 8 (accept, award-winning). Based on the great answers in the authors' rebuttal and the updated scores from the reviewers, I believe this study should be accepted on its merits.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents how MIL models for digital pathology are affected by domain shifts and proposes the Fréchet Domain Distance (FDD) for quantifying domain shifts. Strengths include the relevance of the problem and a much-needed new metric in this regard. The rebuttal does not do a great job of addressing the critiques on MCC, the relationship to OOD, the need to know the domain, and the variations in performance between MCC and FDD. The contribution and solution presented appear important, but some concerns remain about their relevance in the broader context of the field.


