
Authors

Jameson Merkow, Arjun Soin, Jin Long, Joseph Paul Cohen, Smitha Saligrama, Christopher Bridge, Xiyu Yang, Stephen Kaiser, Steven Borg, Ivan Tarapov, Matthew P Lungren

Abstract

Clinical AI applications, particularly medical imaging, are increasingly being adopted in healthcare systems worldwide. However, a crucial question remains: what happens after the AI model is put into production? We present our novel multi-modal model drift framework capable of tracking drift without contemporaneous ground truth, using only readily available inputs, namely DICOM metadata, an image appearance representation from a variational autoencoder (VAE), and model output probabilities. CheXStray was developed and tested using the CheXpert, PadChest and Pediatric Pneumonia Chest X-ray datasets, and we demonstrate that our framework generates a strong proxy for ground truth performance. In this work, we make three key contributions to real-time medical imaging AI monitoring: (1) a proof-of-concept for medical imaging drift detection, including the use of a VAE and domain-specific statistical methods; (2) a multi-modal methodology for measuring and unifying drift metrics; and (3) new insights into the challenges and solutions for observing deployed medical imaging AI. Our framework is released as open-source tools so that others may easily run their own workflows and build upon our work. Code available at: https://github.com/microsoft/MedImaging-ModelDriftMonitoring
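The multi-modal unification the abstract describes can be illustrated, in spirit, by computing a per-stream two-sample drift statistic against a reference window and averaging the results into one score. The sketch below is purely illustrative and is not the authors' released implementation (see the linked repository); all function names, feature names, and the choice of the Kolmogorov-Smirnov statistic are assumptions made for this example.

```python
# Illustrative sketch only (NOT the CheXStray implementation): unify drift
# statistics from several continuous streams (e.g. VAE latent dimensions,
# model output probabilities, numeric DICOM tags) into a single score.
import numpy as np
from scipy import stats


def feature_drift(ref, cur):
    """Two-sample Kolmogorov-Smirnov statistic for one feature.

    0 means the reference and current samples look identically
    distributed; values near 1 indicate strong distributional drift.
    """
    return stats.ks_2samp(ref, cur).statistic


def unified_drift(reference, current, weights=None):
    """Weighted average of per-feature drift statistics.

    `reference` and `current` map feature names to 1-D sample arrays;
    `weights` optionally maps the same names to relative importances.
    """
    names = sorted(reference)
    scores = np.array([feature_drift(np.asarray(reference[n]),
                                     np.asarray(current[n]))
                       for n in names])
    w = (np.ones(len(names)) if weights is None
         else np.array([weights[n] for n in names]))
    return float(np.average(scores, weights=w))


rng = np.random.default_rng(0)
# Hypothetical reference window: one VAE latent and one output probability.
ref = {"vae_z0": rng.normal(0.0, 1.0, 500),
       "prob_pneumonia": rng.beta(2, 5, 500)}
# A current window drawn from the same distributions (low drift) ...
same = {"vae_z0": rng.normal(0.0, 1.0, 500),
        "prob_pneumonia": rng.beta(2, 5, 500)}
# ... versus a shifted window (high drift).
shifted = {"vae_z0": rng.normal(1.5, 1.0, 500),
           "prob_pneumonia": rng.beta(5, 2, 500)}

print(unified_drift(ref, same) < unified_drift(ref, shifted))  # True
```

In a deployment setting the reference window would come from data seen at model validation time, and the score would be tracked over a sliding window of production inputs; categorical DICOM tags would need a categorical test (e.g. chi-square) in place of the KS statistic.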

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_32

SharedIt: https://rdcu.be/dnwBs

Link to the code repository

https://github.com/microsoft/MedImaging-ModelDriftMonitoring

Link to the dataset(s)

https://bimcv.cipf.es/bimcv-projects/padchest/padchest-dataset-research-use-agreement/

https://stanfordaimi.azurewebsites.net/datasets/8cbd9ed4-2eb9-4565-affc-111cf4f7ebe2

https://data.mendeley.com/datasets/rscbjbr9sj/2


Reviews

Review #1

  • Please describe the contribution of the paper

    The presented solution relies on statistics of input data, deep-learning based pixel data representations, and output predictions to develop a performance monitoring workflow.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Since the availability of real-time ground truth data in healthcare is often limited, accurate and timely performance monitoring is challenging. Moreover, medical imaging data include both pixel and non-pixel components, while common monitoring methods are designed to work only with structured data. This justifies the aim of the work: to develop an approach for real-time monitoring of medical imaging AI models without contemporaneous ground truth labels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The joint use of two large datasets is reasonably handled (however, detailed information is lacking from the submission). The methodological part is difficult to read (in particular, the formal part is not explained clearly enough) and the adopted solutions/metrics are not well justified. The experiments seem reasonable as presented, but there are no comparisons against possible variants, nor verifications across diversified experiments.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code sharing is declared upon acceptance; supplementary information is lacking.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In Sec. 2.1 the authors refer to supplementary material that I did not find in the submission.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I found it difficult to determine from this paper why the proposed methods are better than other possible monitoring strategies.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a framework for real-time model monitoring. The framework uses three sources to calculate drift: changes in the model output probabilities, DICOM metadata, and a self-defined novelty measure over the model's internal representation using the latent VAE space. They test the framework using three publicly available datasets (two for adults and one for children) from three different locations, and evaluate its performance in three realistic scenarios.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1: The paper tackles a real problem in the clinic, and addressing it is a must for clinical translation of models. 2: The framework uses multimodal input data, which can help with robustness. 3: They propose a unification of the framework's metrics. 4: The solution can run at the same time the model makes its predictions, so it is more likely to be useful in the clinic. 5: They evaluate the robustness of the framework in realistic scenarios (Scenarios 2 and 3).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1: Although scenarios 2 and 3 evaluate two possible situations (incomplete or incorrect metadata, and no metadata at all), the concrete examples they use to evaluate these scenarios (X-ray position and age) seem to be the easiest features. It would be more useful to check additional variables in each scenario.

    2: The entire framework is based on the assumption that potential drift can be detected through the similarity between the training dataset and the test data, as reflected in the variables they measure. However, if a new disease appears, as happened with COVID-19, it is likely that the changes will not be significant enough. The performance of the algorithm will be affected by factors such as the imaging machines, the age and gender distribution, and the manifestation of pneumonia, but a priori these will not be detectable by the proposed framework.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This is a very nice paper that addresses a real problem for ML models in production. It would be interesting to include other variables in Scenarios 2 and 3. For instance, changes in age could be considered, as they may be related to the risk of disease or potential comorbidities, if the data is available. It would also be interesting to test the framework on other data modalities, such as 3D brain datasets, for which several public datasets with metadata are available.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the paper and implication for clinical workflow.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a framework to detect data drift to monitor in real-time an AI tool for medical imaging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Well organized
    • Strong study design
    • High quality figures
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible since it is trained on publicly available datasets

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Recommend the authors convert this to a full manuscript submission

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Strong technical background

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a framework for real-time model monitoring that calculates drift from changes in the model output probabilities, DICOM metadata, and a self-defined internal representation. Key strengths:

    1. The paper tackles a real and important problem in the clinic.
    2. The paper proposes a novel and unified framework for workflow monitoring.
    3. The evaluation was conducted in a real clinical scenario.

    Key weaknesses:

    1. The methodological part is difficult to read.
    2. No comparison with other approaches or factors.
    3. Would need to consider more factors in the clinic.




Author Feedback

N/A


