
Authors

Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek, Marcel Worring

Abstract

Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
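
To make the prompting scheme concrete, the following is a minimal sketch of the kind of mapping network the abstract describes: a small MLP that turns one extracted image feature vector into k token embeddings that are prepended to the question. The feature dimension, embedding dimension, token count, and layer sizes here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    """Map one image feature vector to k prompt-token embeddings for the LM."""

    def __init__(self, feat_dim: int = 512, lm_dim: int = 768, k: int = 8):
        super().__init__()
        self.k, self.lm_dim = k, lm_dim
        hidden = (lm_dim * k) // 2
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * k),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (batch, feat_dim) -> visual tokens: (batch, k, lm_dim)
        return self.mlp(image_feat).view(-1, self.k, self.lm_dim)

# The k visual tokens are concatenated with the embedded question tokens and
# fed to the language model as a single prompt, e.g.:
#   inputs_embeds = torch.cat([visual_tokens, question_embeds], dim=1)
```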

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_70

SharedIt: https://rdcu.be/dnwIi

Link to the code repository

https://github.com/tjvsonsbeek/open-ended-medical-vqa

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This work is concerned with the task of medical visual question answering. To address the open-ended setting, it takes a generative approach instead of the classification-based approach often used for the closed-ended setting. The work takes advantage of pre-trained language models and proposes an effective way to incorporate the visual information of the image. Various fine-tuning strategies are investigated to adapt the language model to specific medical VQA tasks. An experimental study is conducted on benchmark datasets to show the efficacy of the proposed approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The open-ended visual question answering (VQA) setting is closer to real-world situations and more general than its closed-ended counterpart. Research in this setting is welcome.
    2. Taking advantage of pre-trained language models to address open-ended medical VQA tasks is technically sound, and particularly worth exploring given the recent advent of a number of large language models like ChatGPT.
    3. To incorporate the image information into the language model, the work proposes to map visual features into learnable tokens, which is a sound method.
    4. The experimental study demonstrates the promising performance obtained by the proposed approach.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This work can be strengthened by addressing the following issues:

    1. After investigating the four parameter-efficient strategies, could this work discuss their shortcomings and point out the probable ways to improve them? This will enhance the technical novelty of this work.
    2. When evaluating the generated answers for open-ended VQA, criteria related to clinical correctness may need to be considered in addition to the BLEU criterion, which only assesses the fluency of the generated answer.
    3. In the experimental study, it will be helpful to indicate the network backbone used by the methods in Table 3. This will help to better interpret the result.
    4. Table 4 could be better introduced, especially regarding the four settings being compared. Also, it is mentioned that “LM ignores the visual information if it is placed in front of the question….” Please give a bit more detail on this explanation.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed method is not difficult to understand and has been clearly described in the paper. The authors plan to make the code publicly available. Reproducibility should not be an issue.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    An overall well organised and presented work. No further comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work addresses a more realistic setting for medical visual question answering and proposes an approach to efficiently utilize and adapt pre-trained language models for this setting. The experimental study shows the improved performance achieved by this approach. The work is particularly interesting in light of the great potential demonstrated by the recently released large language models.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a new pipeline for medical VQA. They leverage large-ish language models (1-3B parameter range) and consider different fine-tuning approaches and different ways to condition these models on the image features. They focus on open-ended VQA and their approach leads to significant improvements over the existing methods on three different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • They successfully translate promising approaches from the general domain (CLIP, larger language models, parameter-efficient fine-tuning approaches) to medical images, showing that these approaches are more successful for medical VQA than existing approaches.
    • The writing is clear and well organised.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors evaluate different parameter-efficient fine-tuning techniques, arguing that these are beneficial for small datasets such as those used in medical VQA. It would be useful to demonstrate this further, e.g., by comparing against fully fine-tuning the language model. Currently, the only comparison is between different kinds of parameter-efficient fine-tuning (less than 1% of model parameters).
    • It remains unclear if the novel method is “particularly suited for small, domain-specific, medical datasets” or if the increases in performance come mostly from using large language models and the CLIP image encoders.
    • It would have been interesting to see how this method performs on other tasks, such as radiology report generation, where there are more approaches to compare against. The architecture can easily be modified for this, by simply removing the question section or by replacing it with the indication section of a report.
    • The authors claim that their approach is not bounded by the class-imbalance issue, as they approach it as open-ended VQA, i.e. answer generation. Is this really true? Isn’t there still an imbalance in terms of the answers generated by the model?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is mentioned that the code will be made publicly available.

    However, some details should be mentioned more explicitly: for example, how were positional embeddings and token types handled in the language model?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • I am surprised by the low performance of BioGPT, although this is not the first time language models pre-trained on medical text perform worse on medical tasks. However, given that the BioGPT approach performs almost as badly as completely ignoring the image, I am curious about what is going wrong there. The argument made in the paper that the reason is a lack of generalisation ability, even though the medical datasets should be within the domain it was trained on, is not convincing to me.
    • What is the intuition for the sometimes very big differences in performance of different fine-tuning strategies?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors show that carefully combining powerful (vision-)language models can lead to significant improvements on medical VQA. This is of relevance to the MICCAI community. However, it would have been nice to see a more thorough analysis of the approach, perhaps beyond medical VQA, since it is possible that the size and pre-training of the current models alone led to the improvements.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    Proposes to perform open-ended medical visual question answering by generating a representation of the image as a sequence of embedding vectors in the input space of the language model. Four ways of integrating this sequence into the language model are explored. The resulting generative approach is implemented using BioGPT, BioMedLM and GPT-2, of which the combination of GPT-2 and Low-Rank Adaptation (LoRA) showed the best performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper applies the ClipCap approach to providing an image to a language model for visual question answering. The exploration of four progressively more complex approaches to tuning is also interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a missed opportunity to investigate where these images map to in the embedding space. Are they near other tokens, or off in a space of their own?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be made available, and the datasets are public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In Fig. 3, “metacarpal” is misspelled as “metacarpel” - is it actually like that in the dataset? Generating a weird misspelling suggests memorization.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technique is interesting and several variants are explored, across several datasets. Code will be made available.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces a method for conducting open-ended medical visual question answering by generating an image representation as a sequence of embedding vectors in the input space of a language model. Four different approaches for integrating this sequence into the language model are investigated. The proposed generative approach is implemented using BioGPT, BioMedLM, and GPT-2; among these, the combination of GPT-2 and Low-Rank Adaptation (LoRA) demonstrates the strongest performance. This is a very nice paper with solid and extensive results, and the topic is very relevant for the community. The camera-ready version still needs to address a few remaining questions and suggestions, including more intuitive explanations for the significant performance gap observed among different fine-tuning strategies and an investigation into the embedding space. Providing further insights and analysis in these areas would greatly enhance the overall quality of the paper. Thanks for this high-quality work!




Author Feedback

We want to thank the reviewers for their encouraging and positive assessment of the paper and their constructive feedback. We will address the feedback of each reviewer below:

R1

Strengths/weaknesses of fine-tuning strategies: In our paper, we explain how fine-tuning strategies that adapt the attention layers inside language models (e.g., LoRA) are more effective than prefix-tuning. LoRA directly modifies the Q and V weight matrices of the self-attention layers, influencing the weights of the language backbone when adapting it for VQA. This makes LoRA more effective for adapting language models to specific tasks. The same performance gap between LoRA and prefix-tuning methods can be observed in related work, supporting our findings.
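
For illustration, below is a minimal sketch of the low-rank update LoRA applies to a frozen projection such as a Q or V matrix; the rank and scaling values are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A (r x d_in) and B (d_out x r)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no update at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

In a transformer backbone, this wrapper would replace the query and value projections of each self-attention layer, leaving all other weights frozen.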

BLEU metric validity: The concern about the clinical correctness of the BLEU metric is completely valid. For this reason we evaluated performance across this metric and three others, namely BERTScore, F1 and accuracy/exact match. Jointly considering the performance across all four metrics gives a more complete assessment of our method.
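
As an illustration, token-level F1 and exact match are commonly computed along the lines of the sketch below; the exact normalization and tokenization used in the paper may differ.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """1.0 iff prediction and reference are identical after light normalization."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("left lung", "the left lung"))  # 0.8
```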

Type of image encoder in prior works: We agree that indicating the type of image encoder is useful for the comparison between prior works and our method in Table 3. We added this for the camera-ready version.

Table 4: We improved and clarified the introduction and discussion of the results in Table 4 for the camera-ready version.

R2

Visualization of visual tokens: We agree that a qualitative visualization of the text tokens nearest to the visual tokens can give further insight into our method. We visualized this using k-Nearest Neighbour similarity search and added the visualization to the camera-ready version.
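
A minimal sketch of such a nearest-neighbour lookup, assuming the visual tokens and the language model's input-embedding table share the same space and that cosine similarity is the metric (the rebuttal does not specify one):

```python
import torch
import torch.nn.functional as F

def nearest_text_tokens(visual_tokens, embedding_matrix, tokenizer, top_k=5):
    """For each visual token, return the text tokens whose input embeddings
    have the highest cosine similarity.

    visual_tokens:    (k, d) tensor produced by the mapping network
    embedding_matrix: (vocab, d) LM input-embedding table, e.g.
                      model.get_input_embeddings().weight for a Hugging Face GPT-2
    """
    v = F.normalize(visual_tokens, dim=-1)
    e = F.normalize(embedding_matrix, dim=-1)
    sims = v @ e.T                           # (k, vocab) cosine similarities
    top = sims.topk(top_k, dim=-1).indices
    return [[tokenizer.decode([i]) for i in row] for row in top.tolist()]
```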

Misspelling of Metacarpal: Thank you for noting this finding from Figure 3. Further inspection shows that this is indeed a typo and it will be corrected in the camera-ready version.

R3

Full fine-tuning of the LM: A comparison against full fine-tuning of the language model is not included in our results. These experiments showed (1) a steep increase in computational cost, (2) fast (near-immediate) overfitting, and (3) lower performance on low-prevalence classes. Since this setting negates most of the perceived benefits of our methodology, we did not include it in the paper.

Class imbalance: We agree that our method is not completely immune to class imbalance. However, our method performs especially well on the dataset with the highest class imbalance (PathVQA, see Table 1), showing better robustness against class imbalance than earlier methods. This will be clarified in the camera-ready version.

Report generation: We agree that another use case of our method could be radiology report generation. This is a challenging task that differs from VQA in that (1) the generated reports usually contain multiple sentences, and (2) the coherence and logical flow between sentences need to be preserved. This interesting outlook will be added to the camera-ready version.

BioGPT performance: We agree that the low performance of BioGPT is peculiar, and we discuss it further in the results section of the paper. Our hypothesis is that it is easier for a general language model to “focus” on a small subarea of knowledge than for a domain-specific model to generalize towards knowledge even slightly outside its domain. This finding is in line with existing works comparing general models with models fine-tuned on a specific domain, where the general models usually outperform their domain-specific variants (e.g., CLIP vs. MedCLIP).

Performance difference between fine-tuning methods: We will clarify in the paper the reason for the fluctuating performance between the parameter-efficient fine-tuning methods. It stems from dataset size: the fluctuation is more apparent on the smaller SLAKE dataset than on the larger OVQA and PathVQA datasets.


