
Authors

Chantal Pellegrini, Matthias Keicher, Ege Özsoy, Petra Jiraskova, Rickmer Braren, Nassir Navab

Abstract

Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which is often scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaptation to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis in the clinical workflow, methods should be transparent and explainable, increasing medical professionals’ trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations, which a radiologist would look for on an X-ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.
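For illustration, the observation descriptors the abstract refers to might look like the following. These strings are hypothetical editorial examples, not the ChatGPT-generated, radiologist-refined descriptors from the paper.

    # Hypothetical pathology -> descriptor mapping (illustrative only).
    descriptors = {
        "Cardiomegaly": [
            "an enlarged cardiac silhouette",
            "a cardiothoracic ratio greater than 0.5",
        ],
        "Pneumothorax": [
            "a visible pleural line with absent lung markings beyond it",
            "a deep sulcus sign on a supine film",
        ],
    }

Each descriptor is classified as present or absent with a vision-language model, and the per-descriptor probabilities are then aggregated into a diagnosis likelihood (see the sketch in the Author Feedback section below).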

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_41

SharedIt: https://rdcu.be/dnwHm

Link to the code repository

https://github.com/ChantalMP/Xplainer

Link to the dataset(s)

https://stanfordaimi.azurewebsites.net/datasets/23c56a0d-15de-405b-87c8-99c30138950c

https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345


Reviews

Review #1

  • Please describe the contribution of the paper

This paper presents Xplainer, a novel framework following a zero-shot approach for X-ray diagnosis prediction. The proposed approach uses CLIP image and text encoders based on BioViL. It uses ChatGPT’s responses to describe observations and predicts the final diagnosis based on these descriptions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper explores a new direction of explainability in the clinical domain, which is interesting. It is also adaptable to any specific population.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

a) The approach, leveraging classification by description, is already well established in the CV and NLP fields. b) The approach does not need data labels but still requires human effort to refine the descriptions. How reliable are ChatGPT’s descriptive observations? How many times did the radiologist have to refine the descriptions?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The actual prompt-tuning setting used to generate good responses is not clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Prompt generation here is not engineering; it is mainly tuning or finding the proper prompt. I would suggest using ‘prompt tuning/finding’ instead.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Performance is marginally improved compared to the SOTA approach, and the method still requires human-level effort to refine ChatGPT’s responses, which is similar to data labeling.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

The authors have cleared up some confusion about the code/prompts/labeling and promised to publish the code.



Review #2

  • Please describe the contribution of the paper

    The paper presents a zero-shot approach for chest X-ray diagnosis prediction using the BioVil pre-trained model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    novel approach in computer-aided diagnosis

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There seem to be no significant changes in the performance measures, as observed in Table 1. The paper reports around 72% AUC on the ChestX-ray14 dataset, while previous work shows a higher AUC. Why is that?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    to an extent

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

How is the dataset divided into training, validation, and test sets? Specifically, how is the data partitioned: randomly or by subject? Would there be any difference between these two approaches?

What drawbacks are observed in the method? Where does the method fail, and are there possible ways to fix them?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    novel approach in CAD

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

The authors propose a classification-by-description framework for more interpretable zero-shot multi-class medical image classification. ChatGPT is employed to generate a caption for each observation of each pathology, mimicking the style of a radiology report. This information is further used as auxiliary input via the text encoder to get better predictions. Xplainer is evaluated on the CheXpert and ChestX-ray14 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The BioViL CLIP image+text encoder architecture is utilized to perform contrastive matching between observations and the image for the prediction of a particular pathology. Prompt engineering is carried out with the help of ChatGPT to generate better prompts for each observation in an image, indicating the presence or absence of a particular pathology. Many prompt styles have been explored, and the report style is inferred to have worked the best.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The only main contribution comes from how the prompts are designed for the model’s contrastive matching between images and text. It is not clear how all the observations for all pathologies are used by the model for contrastive matching: does the model generate the probabilities for each pathology one by one, or all at once? Positive and negative prompts are used to generate positive and negative probabilities, but how the authors use a softmax to arrive at the probability of the presence of an observation is not clear. Furthermore, the joint probability employed to arrive at the probability of a pathology is also not clear. In Eqn. 1 it should be “/” and not “➗”.

    The design choice of the prompts and its impact on the prediction are interesting but not ground-breaking; other approaches also work decently. Moreover, the comparison with other methods is very poor, and it is hard to justify the increase in performance. I assume the method does not take text as input at inference; how does it accommodate this criterion? I did not find an explanation regarding this. Moreover, does the framework need to know the pathology classes in the dataset beforehand, or is it completely unsupervised?

    Extensive ablations have been conducted on the prompting styles and on the modeling of the ‘no finding’ label. The latter is not that interesting, as it is just an insignificant dataset-specific class and can be avoided. Again, I am not sure how the approach remains zero-shot if one knows the classes beforehand.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Not many experimental details are provided. It may be hard to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please consult the points mentioned in weakness section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Limited novelty, with just an exploration of prompt styles over pretrained CLIP-based architectures.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper explores a novel direction of explainability in the clinical domain, specifically for X-ray images. While the paper has merits, the reviewers raised concerns regarding its novelty and contribution (R3); reproducibility and training details (R1, R2, and R3); and clarification of the marginal improvement in results and the human-level text-refinement cost (R1). These concerns should be addressed in the rebuttal to strengthen the paper’s overall quality.




Author Feedback

We thank all reviewers for their valuable feedback and for recognizing that the main contribution of Xplainer is to introduce a novel direction of explainability in the clinical domain (R1,MR). Further, the reviewers also value the adaptability (R1) of our new computer-aided diagnosis approach (R2).

Regarding our contribution over CLIP [13] (R3), while CLIP is designed for zero-shot classification, Xplainer’s primary aim is semantic explainability in zero-shot diagnosis. We achieve this by breaking down diseases into a collection of radiological observations that indicate their presence, and we focus on identifying these observations rather than directly targeting the diseases themselves. The diagnosis is modeled as a joint probability of the respective observations, showing the influence of each of them. This allows the examination of predictions for plausibility, illuminates the origins of errors, and stands in contrast to previous explainability methods, which interpret results after prediction. We emphasize that Xplainer presents the first interpretable diagnosis-by-observations paradigm for medical image understanding.
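As an editorial sketch of this modeling choice (assuming the descriptors are treated as conditionally independent; the paper’s exact Eqn. 1 is not reproduced on this page), the diagnosis probability can be written as

    p(d | x) ≈ ( ∏_{i=1}^{N} p(o_i | x) )^{1/N},

where x is the image and o_1, …, o_N are the observations (descriptors) associated with diagnosis d; the 1/N exponent (a geometric mean) keeps scores comparable across diagnoses with different numbers of descriptors.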

Our findings show that the diagnosis-by-observations paradigm, in addition to offering explainability, also delivers improved SOTA zero-shot results on CheXpert and ChestX-ray14 (Table 1). As for the one method with a seemingly superior performance on ChestX-ray14 (R2), we clarify that this is not comparable because it achieves its results in an in-domain setting. Unlike our setup, they use ChestX-ray14 for CLIP pre-training [14]. We outperform the out-of-domain results reported by Seibold et al. [14], further supporting the effectiveness of the proposed method.

We affirm that the human refinement cost (R1) was minimal, requiring only a few hours. The radiologist reviewed the descriptors once, flagging any inaccuracies. As for the reliability of ChatGPT observations (R1), we draw attention to Table 4, which presents performance metrics both with and without refinement. While refinement enhances the results, the initial descriptors already exhibit high performance.

In Fig. 2, we show and discuss a few failure cases (R2), demonstrating the dependency of our predictions on the correct detection of observations. As Xplainer is not tied to specific image and text encoders, orthogonal works that lead to better encoders can be used to improve our results further. We could extend the discussion section to reflect this orthogonal nature of our work.

Regarding reproducibility and training details (R1, R2, R3), we will publish our code and prompts upon acceptance. Moreover, we would like to provide further details on our methodology. In deep learning, “zero-shot” denotes a model’s ability to generalize to unseen classes (R3). CLIP [13] operates under this zero-shot paradigm, as it merely requires class names during inference. Likewise, in Xplainer, which is based on CLIP, pathology classes are introduced only during inference, eliminating the need for labeled data. During inference, Xplainer encodes an image and a list of positive and negative text descriptors with the image and text encoders, respectively; therefore, it indeed takes text as input (R3). Then, the model calculates the similarities between the image and each descriptor embedding. As an approximation for the presence/absence probability of a given observation, we apply softmax between the similarities of the positive and negative descriptors to the image. We then compute an estimated pathology probability as the joint probability of descriptor probabilities (R2). We evaluate our method using the official validation and test splits of both datasets (R2). We will make these points clearer in the camera-ready version.
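To make the inference procedure described above concrete, here is a minimal editorial sketch in Python. It assumes CLIP-style encoders with a shared embedding space; encode_image and encode_text are random stubs standing in for BioViL’s encoders, and the prompt templates, temperature, and geometric-mean joint are illustrative assumptions, not the paper’s exact implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode_image(image):
        # Stub for BioViL's image encoder (hypothetical): unit-norm embedding.
        v = rng.normal(size=128)
        return v / np.linalg.norm(v)

    def encode_text(prompt):
        # Stub for BioViL's text encoder (hypothetical): unit-norm embedding.
        v = rng.normal(size=128)
        return v / np.linalg.norm(v)

    def descriptor_probability(img_emb, pos_prompt, neg_prompt, temperature=0.5):
        # Softmax over the image's similarity to the positive vs. negative
        # phrasing of one descriptor, as described in the rebuttal.
        sims = np.array([
            img_emb @ encode_text(pos_prompt),
            img_emb @ encode_text(neg_prompt),
        ])
        exp = np.exp(sims / temperature)
        return exp[0] / exp.sum()  # probability that the observation is present

    def pathology_probability(image, descriptors):
        # Combine descriptor probabilities into one pathology score. The exact
        # joint from the paper's Eqn. 1 is not reproduced on this page; a
        # geometric mean over independent descriptors is one plausible choice.
        img_emb = encode_image(image)
        probs = [
            descriptor_probability(img_emb,
                                   f"There is {d}.",     # positive prompt (illustrative template)
                                   f"There is no {d}.")  # negative prompt (illustrative template)
            for d in descriptors
        ]
        return float(np.prod(probs) ** (1.0 / len(probs)))

    # Example: score one pathology from its (hypothetical) descriptors.
    score = pathology_probability("chest_xray.png",
                                  ["an enlarged cardiac silhouette",
                                   "a cardiothoracic ratio greater than 0.5"])
    print(f"estimated cardiomegaly probability: {score:.3f}")

Because only the two stub encoders are model-specific, swapping in stronger image and text encoders changes nothing else in this pipeline, which mirrors the orthogonality argument made in the rebuttal above.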

We hope we have clarified the open questions and are confident that the proposed semantic explainability approach can be of great value to the MICCAI community.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper delves into a new avenue of explainability for X-ray images. Although the paper has commendable qualities, reviewers expressed concerns regarding its novelty and contribution, reproducibility and training details, and the need for further clarification on the marginal improvement in results and the human-level text-refinement cost. However, the authors have effectively addressed these concerns in the rebuttal, strengthening the overall quality of the paper. They have also promised to release the code, which will enhance the reproducibility of their work. Therefore, I recommend accepting the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper proposes to identify concepts using CLIP, with template prompts fed to ChatGPT to obtain the final outcome. The final result is explainable. There are issues with the writing, but overall this is an acceptable paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper introduces a novel direction of explainability in the clinical domain and presents improved results on relevant datasets. In their rebuttal, the authors provide satisfactory explanations and clarifications for most of the concerns raised by reviewers. They address the concerns about novelty and contribution, reproducibility and training details, human refinement cost, and failure cases. They also highlight the potential value of their semantic explainability approach to the MICCAI community. Taking all these factors into account, I believe the paper has addressed the reviewers’ concerns adequately, and the strengths of the work outweigh the weaknesses. Therefore, recommending acceptance.


