Authors
Patrick Godau, Piotr Kalinowski, Evangelia Christodoulou, Annika Reinke, Minu Tizabi, Luciana Ferrer, Paul F. Jäger, Lena Maier-Hein
Abstract
Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between class frequencies in the data used for a method’s development/validation and those in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of missing prevalence handling by analyzing (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population as a function of the discrepancy between development and deployment prevalence. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments based on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice.
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43898-1_38
SharedIt: https://rdcu.be/dnwBy
Link to the code repository
https://github.com/IMSY-DKFZ/prevalence-shifts
Link to the dataset(s)
www.kaggle.com/ahemateja19bec1025/covid-xray-dataset
doi.org/10.34740/KAGGLE/DSV/1370629
doi.org/10.6084/m9.figshare.1512427.v5
ftp.itec.aau.at/datasets/ovid/CatRelevanceCompression
stanfordmlgroup.github.io/competitions/chexpert/
data.mendeley.com/datasets/rscbjbr9sj/3
ftp.itec.aau.at/datasets/LapGyn4/
dl.acm.org/do/10.1145/3193165/full/
stanfordmlgroup.github.io/competitions/mura/
Reviews
Review #2
- Please describe the contribution of the paper
The paper uses large-scale empirical experiments across 30 different medical image classification tasks to quantitatively show the importance of prevalence handling during deployment. Prevalence-shift handling includes model re-calibration using estimates of the deployment prevalences, and using metrics that are invariant to prevalence shifts to support prevalence-aware decision rules. Findings include that argmax is generally not an optimal decision rule under prevalence shifts, and that prevalence-invariant metrics such as Expected Cost (EC) or Balanced Accuracy (BA) are needed to assess deployment performance.
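To make the re-calibration idea concrete for readers, the following is a minimal, purely illustrative sketch of the standard prior-shift correction described above; the function and variable names are this review's, not the authors', and the paper's exact workflow should be taken from the linked repository.

    import numpy as np

    def prior_shift_adjust(probs_dev, prev_dev, prev_dep):
        """Re-weight development-time posteriors by the deployment/development
        prevalence ratio (standard prior-shift correction) and re-normalize."""
        ratio = np.asarray(prev_dep) / np.asarray(prev_dev)
        adjusted = probs_dev * ratio
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    # Toy example: a classifier calibrated on a balanced development set,
    # deployed where class 1 has only 10% prevalence.
    probs_dev = np.array([[0.40, 0.60],    # plain argmax would predict class 1
                          [0.70, 0.30]])
    adjusted = prior_shift_adjust(probs_dev, prev_dev=[0.5, 0.5], prev_dep=[0.9, 0.1])
    print(adjusted.argmax(axis=1))          # the first case flips to class 0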
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The evaluation has been done on a very wide variety of data sets (30 classification tasks, with sample sizes varying from ~1,000 to ~120,000 and class imbalance ratios in the range [2, 8]). I particularly like the experiments where the deployment prevalences may not be exactly known (a very likely real-world scenario); here, the authors add perturbations to the prevalence estimates and report the findings. The experiments across a range of class imbalance ratios are also very useful, along with an ablation-style analysis of the effects of re-calibration and metric choices.
I find the paper an exciting read for the MICCAI community. It will help explore issues related to clinical translation and deployment, together with validation approaches and metrics that are more suitable in this context and invariant to some of the common underlying deployment issues.
Many of the pieces of the pipeline are from prior work (e.g., temperature scaling - Guo 2017, weight adaptation in the loss function - Lipton 2018, bias addition - Brummer 2006, importance of metrics like EC and BA for the deployment of image analysis techniques - Maier-Hein 2022). The novelty is in the large-scale empirical experiments conducted to support and integrate the claims across these methods.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1) The strategy behind the selection of the 30% deployment test set (D-dep) lacks several key details. Since the below suggestions may lead to more experiments, they might be better suited for an extended journal paper. However, these details are important to discuss and clarify in the current manuscript.
a) Was D-dep pulled out from the original data set D only once at the beginning? Or was the 30% hold-out for deployment randomly changed each time (like in bootstrap) or swapped for another non-overlapping 30% partition of D (like in cross-validation)? This is in addition to the varying IR for a given selection of D-dep. Based on the theoretical principles behind the validity of bootstrap and cross-validation subsampling strategies, for the paper’s conclusions to be valid, any data point should be allowed as D-dep under the principles of exchangeability of observations and to meet the assumption that the samples are representative of the underlying population (Efron & Tibshirani, 1994 “An introduction to the bootstrap”, Politis 2001, “On the asymptotic theory of subsampling,”). This holds true for both the development data as well as the deployment data - because despite the prevalence shift, samples from both data sets are still assumed to be representative of the ‘same’ underlying population.
b) Was D-dep sampled to ensure class representation for each of the C classes? This may be a non-issue for the data sets selected by the authors (N>1000 with C<=8, therefore a reasonably high probability that each class was represented in the 30%). However, it may lead to certain classes missing representation in much smaller datasets. For deployment to work well, it will be important to explicitly test for this scenario and show that the conclusions hold true irrespective of how many classes are represented in D-dep (in addition to the prevalence shift in all the existing classes); an illustrative sketch of such a check is given after this list. In a way, would the proposed prevalence-handling approach work equally well while scaling up and scaling down the data size (the latter often being a real-world clinical issue when existing AI/ML methods are deployed and tested in limited-sample-size trials and setups)?
2) The source of the expected ‘prevalence shifts’ is not clear. Would this affect the conclusions? Possible sources:
- A different population (which might imply a completely different underlying statistical distribution)
- Different geographical locations in proximity, or changes over time (might imply a gradual/overlapping shift)
- Differences due to varying image acquisition techniques within the data (e.g. Kilim, 2022, “Physical imaging parameter variation drives domain shift”)
- Sample selection bias, annotation labels shift (Dockes 2021 [10])
- Instead of prevalence ‘shifts’, what about outliers or extreme out-of-distribution samples? Would the conclusion still hold true?
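As referenced in point 1b above, here is a minimal illustrative sketch of the kind of check meant there, assuming a simple per-class subsampling scheme; all names and logic are hypothetical and not taken from the paper's code.

    import numpy as np

    def subsample_to_prevalence(labels, target_prev, n_samples, seed=0):
        """Draw a deployment-style subset whose class proportions approximate
        target_prev, and verify that every class remains represented."""
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        chosen = []
        for c, p in enumerate(target_prev):
            pool = np.flatnonzero(labels == c)
            n_c = int(round(p * n_samples))
            if n_c > len(pool):
                raise ValueError(f"class {c}: requested {n_c} samples, only {len(pool)} available")
            chosen.append(rng.choice(pool, size=n_c, replace=False))
        idx = np.concatenate(chosen)
        # The concern raised above: with small data sets, some classes may end up empty.
        assert len(np.unique(labels[idx])) == len(target_prev), "a class lost representation"
        return idx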
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The reproducibility aspect of the paper is good (public data set, clearly mentions network parameters and cites algorithms that have been used from prior work).
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- In general, which of the steps, choices, or findings in the framework might change if regression tasks were assessed instead of classification tasks?
- Adding more background information for readers not familiar with ML vocabulary would help: e.g., the intuition behind metrics like EC and BA, which are currently not popular metrics in the medical imaging literature (a brief illustrative example follows this list). Without that, the current manuscript requires significant background reading regarding calibration techniques, decision rules, and metrics, and is written with a relatively technical, computer science (CS) audience in mind. The manuscript would benefit from making the text more stand-alone and accessible to non-CS audiences within the MICCAI community.
- The same is true for the prior work, which is mostly CS-oriented. It would be useful to add relevant medical imaging literature as well, e.g.: Zhang 2022, Nature Biomedical Engineering, “Shifting machine learning for healthcare from development to deployment and from models to data”; Cohen 2021, Canadian Medical Association Journal, “Problems in the deployment of machine-learned models in health care”; Kilim 2022, Nature Scientific Reports, “Physical imaging parameter variation drives domain shift”; and another empirical study of dataset shifts, Rabanser 2019, NeurIPS, “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift”. Papers on “importance sampling approaches” to handling dataset shift would also be a good addition, given the strong overlap with the paper.
- It would be helpful to increase the size of the plots for readability. At the current size and resolution, I found it difficult to match and verify the quantitative numbers reported in the text against the graphically plotted results, e.g., Sect. 3, last paragraph: the deviation numbers for metric scores like Accuracy (0.41/0.18), etc., versus Fig. 5.
- For some papers in the bibliography, older arXiv versions are listed even though their corresponding peer-reviewed versions are available [14, 22, 24, 30].
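The illustrative example referred to above, with toy numbers (not taken from the paper), showing why metrics built on per-class rates travel across a prevalence shift while plain accuracy does not:

    import numpy as np

    # Per-class recalls are properties of the classifier that do not depend on
    # class prevalence, so:
    # - Balanced Accuracy (mean per-class recall) is invariant to prevalence shifts;
    # - Expected Cost plugs the (estimated) deployment prevalences and error costs
    #   into those per-class rates, so it targets the deployment population;
    # - plain accuracy is a prevalence-weighted average and can change sharply.
    recall = np.array([0.9, 0.6])                   # class-conditional recalls
    for prev in ([0.5, 0.5], [0.95, 0.05]):         # development vs. shifted deployment prevalences
        prev = np.array(prev)
        acc = (prev * recall).sum()                 # changes with the shift
        ba = recall.mean()                          # stays the same
        ec = (prev * (1 - recall)).sum()            # expected cost with unit error costs
        print(f"prev={prev}: accuracy={acc:.2f}, BA={ba:.2f}, EC={ec:.3f}")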
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, a well-written paper with extensive experiments across a diverse set of data, with important findings and insights that will be useful for the broader MICCAI community.
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
Prevalence shifts are common in the setting of ML model deployment, due to, among other things, different populations, locations, and even time. Systematically understanding how these impact model performance, and how to assess and overcome this issue, enables more robust and more widespread model deployment in increasingly diverse settings.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Clear description of the research problem and of the gap in the literature motivating the work herein
Both a methodological and an empirical contribution, addressing the likely impact of prevalence shift on a wide array of ML deployment settings
Clear empirical demonstration of the methodological contribution in diverse deployment settings (different data types, degrees of prevalence shift, performance metrics)
A very nice summary of rules for model development, testing, and deployment for real-world settings in the discussion. Particularly for those new to the field, these guidelines should prove incredibly informative.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
While the steps are laid out clearly in the methods and I’m sympathetic to the space constraints, a clearer mathematical flow through the methods would be helpful. For example, being more explicit about how the weights computed in step 2 are used for the affine scaling and for the expected cost that follows would make it easier for the reader to follow. Indeed, slightly less written description and slightly more mathematical description would clarify this method tremendously.
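For example, one plausible reading of that flow, written out as a sketch; the notation and helper names here are this review's assumptions, not necessarily the authors' exact formulation.

    import numpy as np

    def recalibrate(logits, temperature, prev_dev, prev_dep):
        """Affine adjustment of the logits: temperature scaling followed by a
        per-class bias equal to the log prevalence ratio (the 'weights')."""
        w = np.asarray(prev_dep) / np.asarray(prev_dev)
        z = logits / temperature + np.log(w)
        z -= z.max(axis=1, keepdims=True)           # numerical stability
        p = np.exp(z)
        return p / p.sum(axis=1, keepdims=True)     # adjusted posteriors

    def decide(posteriors, cost):
        """Pick, per sample, the class k minimizing sum_c cost[c, k] * p(c | x),
        i.e. the minimum-expected-cost decision rule."""
        return np.argmin(posteriors @ cost, axis=1)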
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Provided the enclosed code for this work is well commented, this work appears to be highly reproducible. It would benefit from a clearer mathematical description of the methods (as already detailed in the “weaknesses” section).
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Overall this is an important work with a clear contribution. A more detailed mathematical description of the methods with less textual commentary and slightly more concrete steps for the reader would improve this work.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
7
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper describes the impact of a very common and practical issue in MIC settings and provides a concrete method for how to overcome it. While recalibration methods are not new, they are not widespread in the given field, and, as the authors point out, metrics which suffer dramatically from these impacts (such as accuracy and F1 score) are still in widespread use. Therefore, I believe it should be accepted with minor revisions clarifying the mathematical steps of the methods.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #1
- Please describe the contribution of the paper
- The authors demonstrate the consequences of missing prevalence handling by analysing the extent of miscalibration, the deviation of the decision threshold, and the ability of validation metrics to reflect neural network performance.
- They propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a classifier to the deployment dataset.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper presents an interesting study on the effects of not taking prevalence shifts into account.
- The authors propose a novel workflow to deal with prevalence shifts.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The method assumes we have access to the deployment dataset (or at least to the prevalence information), which may not be the case in translational scenarios.
- The experiments are performed in a simplified setting, as the deployment test sets are from the same cohort and follow the same acquisition process as the training and validation sets. That is not the case in realistic deployment settings.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- Experiments were done with public datasets.
- Code will be made available.
- Reproducibility is guaranteed.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- It would be interesting to know how the method performs in external test sets (with different cohorts and acquired with different scanners). The authors should include that in an extended version of the paper.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The paper is clear and presents an interesting and complete study and a novel proposal for the community. Prevalence shifts are extremely common in medical imaging tasks.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
This paper generates a wide consensus across all reviewers (and I agree) that it should be accepted and presented orally at MICCAI. There is no fancy novelty here, but large-scale, rigorously tested experimental insights, which are probably more interesting than the latest variant of a transformer architecture reaching a +0.001 better Dice score. Although R1 pointed to the weakness of assuming known test-time prevalences, R2 stressed that they “liked the experiment where the deployment prevalences may not be exactly known (a very likely real-world scenario)”, in which case the authors use noisy estimates. I would like to point the authors to the very detailed and interesting review provided by R2, which might be useful for a potential extension of this work. Congratulations.
Author Feedback
We thank the reviewers for unanimously suggesting acceptance of our paper and for providing helpful suggestions for improvement. Comments:
Reviewer #1:
- Knowledge on deployment prevalences: The important point of uncertainties related to the estimation of deployment prevalences has been addressed in the supplementary material (Fig. 7, 8, 9). We will further make it more explicit that previous work has proposed effective data-driven solutions for prevalence estimation.
- Realism: We agree that in general deployment scenarios, multiple dataset shifts, such as acquisition shift or manifestation shift, might occur at the same time [1], and such shifts equally need to be taken into account. However, these are out of the scope of our work, since we focus on the prevalence shift aspect in isolation. As suggested by reviewer #2, we will clarify potential causes of prevalence shifts.
Reviewer #2:
- Design of deployment dataset: In our original experiments, we split the data of each task once into four sets: training, validation, development test, and deployment test. In response to the reviewer’s feedback, we repeated all experiments with an additional splitting seed as well as multiple model training seeds to investigate the robustness of our results. All conclusions were confirmed, and we will add this information to the manuscript.
- Causes of prevalence shifts: We will add some to the manuscript.
- References: Will be integrated.
Reviewer #3:
- Clarity: We will make the usage of the weights in the re-calibration explicit in the text.