
Authors

Ajay K. Tanwani, Joelle Barral, Daniel Freedman

Abstract

Writing reports by analyzing medical images is error-prone for inexperienced practitioners and time-consuming for experienced ones. In this work, we present RepsNet, which adapts pre-trained vision and language models to interpret medical images and generate automated reports in natural language. RepsNet consists of an encoder-decoder model: the encoder aligns the images with natural language descriptions via contrastive learning, while the decoder predicts answers by conditioning on encoded images and the prior context of descriptions retrieved by nearest neighbor search. We formulate the problem in a visual question answering setting to handle both categorical and descriptive natural language answers. We perform experiments on two challenging tasks, medical visual question answering (VQA-Rad) and report generation (IU-Xray), on radiology image datasets. Results show that RepsNet outperforms state-of-the-art methods with 81.08% classification accuracy on VQA-Rad 2018 and a 0.58 BLEU-1 score on IU-Xray. Supplementary details are available at: https://sites.google.com/view/repsnet
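
As a concrete illustration of the contrastive image-text alignment the abstract describes, here is a minimal CLIP-style sketch; the function name, shapes, and temperature are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of contrastive image-text alignment,
# in the spirit of the encoder described in the abstract.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) tensors from the vision and language
    encoders; matching pairs share the same batch index.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Align image i with text i, in both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```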

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_68

SharedIt: https://rdcu.be/cVRzm

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses two applications: (a) classification of answers to given questions in the VQA-Rad dataset and (b) a text generation task on the IU dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Using prior context knowledge for open-ended answers is an interesting idea. The contrastive loss for joint vision-and-text learning is also a good idea. The paper is well written, understandable, and reproducible. Extensive experimentation, as well as a demo, is shown in the supplementary material.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Novelty in terms of network design is limited, but the proposed methodology is new. A comparison in terms of space and time is missing. Network training needs more elaboration.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible with some effort.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    A few other questions need to be addressed: Why is there no mention of questions and answers in the IU dataset? Where is misalignment defined before generating a heat map? How is data bias addressed, given that abnormal patients far outnumber normal patients?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a nice application.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This paper proposes an approach for generating clinical reports using an image/text encoder-decoder model. Notably, it comprises a bi-linear attention network for image/text fusion and a self-supervised contrastive alignment with natural language descriptions. Results are provided on medical visual question answering and radiology report generation.
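
The bi-linear attention fusion mentioned here can be sketched roughly as follows; this is a simplified, hypothetical rendering in the spirit of BAN (Kim et al., 2018), with illustrative names and dimensions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionFusion(nn.Module):
    """Simplified low-rank bilinear attention for image/text fusion.

    Fuses image-region features with question-token features via a
    single attention glimpse; a full BAN uses multiple glimpses.
    """
    def __init__(self, d_img, d_txt, d_joint):
        super().__init__()
        self.proj_img = nn.Linear(d_img, d_joint)
        self.proj_txt = nn.Linear(d_txt, d_joint)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, n_regions, d_img); txt_feats: (batch, n_tokens, d_txt)
        v = self.proj_img(img_feats)              # (batch, n_regions, d_joint)
        q = self.proj_txt(txt_feats)              # (batch, n_tokens, d_joint)
        # Bilinear attention map over all region/token pairs.
        att = torch.einsum('brd,btd->brt', v, q)  # (batch, n_regions, n_tokens)
        att = F.softmax(att.flatten(1), dim=-1).view_as(att)
        # Attention-weighted bilinear pooling into one joint vector.
        fused = torch.einsum('brt,brd,btd->bd', att, v, q)
        return fused                              # (batch, d_joint)
```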

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents an important and original contribution. The work appears to be technically sound and well thought out. Results are well presented.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The transfer of the developed methodology to clinical practice remains uncertain. The limitations of the work are not discussed.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It appears that the authors meet the requirements for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The transfer of the developed methodology to clinical practice remains uncertain. What are the limitations of your approach? How would this approach integrate into a clinical routine? What are the difficulties? What is the uncertainty associated with the prediction?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The gap between the technical pipeline and its transfer to clinical practice.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    This paper presents an encoder-decoder method that combines two modalities (medical image and text) for medical visual question answering. The method consists of a contrastive image-text encoder and a conditional language decoder. Experiments are performed on two public datasets (VQA-Rad and IU-Xray).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work introduces contrastive learning into feature representation learning.
    • The proposed method combines medical images and text for multiple tasks (categorical and descriptive natural language answers).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Most parts of the proposed method are a combination of existing approaches: the bilinear attention network (Kim 2018), contrastive vision-and-language learning (Chen 2020), and prior context knowledge (Johnson 2017).
    • Experimental results: 1) There are no BLEU evaluation results on VQA-Rad 2019; Table 1 reports only accuracy on VQA-Rad. 2) It is not clearly stated whether the VQA-Rad experiments are performed on the validation set or the test set. 3) The authors claim RepsNet targets the medical report generation task, yet they evaluate generation only on the IU-Xray dataset.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Are the reported CLEF (Abacha 2019) results in Table 1 evaluated on the test set of VQA-Rad 2019? Are the authors' results evaluated on the validation set of VQA-Rad 2019?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Please improve the manuscript's clarity, e.g., the results discussion, motivation, etc.
    • Please check the consistency of the notation, e.g., Equation 5 for the decoder has conditional inputs {y, \bar{X}, \bar{C}} while Fig. 2 shows inputs {\bar{X}, \bar{C}, \bar{Q}, \bar{Y}}.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is not compelling, and the VQA-Rad results are not convincing enough.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors' response resolved my main concerns about this paper.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers commented that the proposed methodology, which consists of a contrastive image-text encoder and a conditional language decoder, is interesting and promising. While they are impressed with the results for radiology report generation on the IU-Xray dataset, they are less convinced by the VQA results on the VQA-Rad dataset. There were some concerns about insufficient novelty (since the main modules, such as the bilinear attention network and contrastive learning, come from the existing literature) and about the need to improve manuscript clarity.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

R1.1: Novelty in terms of network design is limited, but the proposed methodology is new? Thanks for acknowledging the proposed methodology. In terms of network design, conditioning the language decoder on auxiliary features of the input image and the nearest-neighbor prior report, aligned via contrastive learning, took extensive experimentation.

R1.2: No mention of questions and answers in the IU dataset? We only report results for the findings section of the IU-Xray dataset, which has the same question for every image, i.e., "what are the findings in the image?" In subsequent work, we also compare this with other sections, including impressions and manual tags, with their respective questions.

R1.3: Misalignment reference for generating a heat map? The heat map is generated with Grad-CAM, which visualizes class activation maps; it does not use a reference image for estimating misalignment. See Grad-CAM details in [31].
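
For reference, a minimal Grad-CAM sketch in PyTorch, assuming a CNN classifier and a chosen convolutional target_layer; all names and shapes here are hypothetical, not from the paper:

```python
import torch

def grad_cam(model, image, target_layer, class_idx):
    """Weight each activation map of target_layer by the spatial average
    of its gradient w.r.t. the target class score (Grad-CAM).
    image: (1, C, H, W); model output assumed to be (1, num_classes).
    """
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    score = model(image)[0, class_idx]  # forward pass, pick class score
    model.zero_grad()
    score.backward()                    # populates grads via the hook
    h1.remove()
    h2.remove()

    weights = grads['g'].mean(dim=(2, 3), keepdim=True)  # (1, K, 1, 1)
    cam = torch.relu((weights * acts['a']).sum(dim=1))   # (1, H', W')
    return cam / (cam.max() + 1e-8)                      # normalized heat map
```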

R1.4: Data bias with abnormal patients far outnumbering normal patients? For the IU-Xray dataset, we do not have access to normal/abnormal labels, only descriptive findings for each patient. We use contrastive learning to represent different cases distinctly in the feature space, and we feed in prior reports for case-based reasoning to handle rare examples.

R3.1: Transfer of the developed methodology to clinical practice? We are currently working with clinical sites on automated reporting in gastroenterology. The main challenges include liability for errors in generated reports, and comparing performance metrics (time and error rates) against physicians. This paper benchmarks our methodology against the SOTA on publicly available datasets.

R3.2: Limitations of the approach and uncertainty in the prediction? We are currently incorporating attention mechanisms for conditional visualization of generated text on image patches as a measure of prediction uncertainty. Making these reports self-explainable would help their wide adoption. Beyond that, we are addressing rare, unseen cases by incorporating medical procedural knowledge as part of ongoing work.

R4.1: Proposed method is a combination of other approaches, namely BAN, contrastive learning, and prior knowledge context? Combining these approaches under a visual question answering framework is novel. Moreover, using contrastive learning to fetch the nearest neighboring reports from the database and feeding them to the conditional language decoder is also a novel highlight.
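
For illustration, a minimal sketch of this retrieval step, assuming the image and report embeddings already live in the shared contrastive space; the names and the choice of k are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def retrieve_prior_reports(query_img_emb, report_embs, reports, k=3):
    """Fetch the k nearest prior reports for an encoded image by cosine
    similarity in the shared embedding space; the retrieved texts become
    the prior context fed to the conditional language decoder.
    """
    q = F.normalize(query_img_emb, dim=-1)   # (dim,)
    db = F.normalize(report_embs, dim=-1)    # (n_reports, dim)
    sims = db @ q                            # cosine similarities
    topk = torch.topk(sims, k).indices
    return [reports[i] for i in topk.tolist()]
```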

R4.2: VQA-Rad results on test or validation set, and VQA-Rad results are not convincing? The VQA-Rad datasets are summarized in App. Tab. 1. We use the standard training and evaluation splits as released by the dataset authors (test, eval, and validation set refer to the same split in the paper). See also [1] for comparison, where the training and validation sets are described in Sec. 3.3. We observe consistent improvement across all VQA-Rad datasets. We plan to release the code for benchmarking the performance.

R4.3: Report generation task evaluated only on the IU-Xray dataset, not on VQA-Rad (no BLEU evaluation)? We pose VQA-Rad as a classification problem over the set of all possible answers. Consequently, we compare the predicted and ground-truth categories by their indices (not by matching n-gram sequences with a BLEU score, as in answer generation for the IU-Xray dataset). We believe automating medical reports requires both classification and text generation; the proposed methodology combines both aspects and evaluates each on the respective dataset.
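
To make the distinction concrete, classification over a fixed answer vocabulary is scored by index agreement rather than n-gram overlap; a tiny hypothetical sketch:

```python
import torch

def vqa_accuracy(logits, answer_ids):
    """Score VQA as classification: compare predicted and ground-truth
    answer indices directly, with no BLEU-style n-gram matching.
    logits: (batch, n_answers); answer_ids: (batch,) ground-truth indices.
    """
    preds = logits.argmax(dim=-1)
    return (preds == answer_ids).float().mean().item()
```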

R4.4: Manuscript clarity and consistency of notation: Fig. 2 has \bar{Q}, which is not present in Eq. 5 for the conditional decoder? On page 4, after the BAN description, we clarify that we use \bar{X} to denote both the image and question features in the encoder and decoder sections for the sake of brevity. We will further clarify the difference with the figure and add experimental details in the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper initially received two weak accepts and one weak reject. The authors submitted a strong rebuttal clarifying the questions about novelty and the results on VQA-Rad, and answered the other reviewer questions as well. Reviewer 4, who initially gave a weak reject rating, changed the rating to weak accept after reading the rebuttal. The paper should be acceptable given three weak accept ratings post rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes to generate clinical reports using an image/text encoder-decoder model. Concerns were originally raised mainly about the VQA results and insufficient novelty. The authors addressed most of the reviewers' concerns in the rebuttal, and the lowest-rating reviewer raised their score from 4 to 5. I recommend acceptance and ask the authors to reflect the rebuttal points in the paper if it is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a deep learning-based vision-language framework that combines an image encoder and a text encoder for visual question answering and automated medical report generation. Although the different components of the framework are mainly based on existing algorithms, it successfully integrates them into a single framework, and the experimental results are impressive. The rebuttal addresses the reviewers' major concerns, and all reviewers recommend (weak) acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


