Authors

Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman

Abstract

Visual Question Answering (VQA) models aim to answer natural language questions about given images. Due to its ability to ask questions that differ from those used when training the model, medical VQA has received substantial attention in recent years. However, existing medical VQA models typically focus on answering questions that refer to an entire image rather than where the relevant content may be located in the image. Consequently, VQA models are limited in their interpretability power and the possibility to probe the model about specific image regions. This paper proposes a novel approach for medical VQA that addresses this limitation by developing a model that can answer questions about image regions while considering the context necessary to answer the questions. Our experimental results demonstrate the effectiveness of our proposed model, outperforming existing methods on three datasets. Our code and data are available at https://github.com/sergiotasconmorales/locvqa.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_34

SharedIt: https://rdcu.be/dnwyN

Link to the code repository

https://github.com/sergiotasconmorales/locvqa

Link to the dataset(s)

https://github.com/sergiotasconmorales/locvqa

Reviews

Review #1

Please describe the contribution of the paper

The paper studies medical visual question answering in a setup where the region of interest of the question is provided together with the image and the question. The model is thus able to answer questions about regions in the images. This may be beneficial in some specific applications.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Visual question answering and visual entailment are important research topics in all domains and especially with medical images.
2. The problem of answering localized questions is quite novel.
3. The used methods are state of the art.
4. The experiments and the results obtained from them seem trustworthy.
5. The results are promising.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The paper does not propose anything methodologically novel, but only applies SoTA methods in a task that is at least somewhat novel.
2. The clinical applicability of the proposed setup is not clear. Would medical practitioners ask such localized questions? Don’t they recognize needles and retractors without help from a computer vision system?
3. Wouldn’t it be better if the model returned the region as output rather than needed it as input, such as in visual grounding problems?
4. The role of “glimpses” remained obscure and untested. Could a number of glimpses other than G=2 have worked better overall or with some of the datasets?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The code has been promised to be made available. The experiments seem reproducible with it. The answers to the checklist questions seem reasonable.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. See my comments above concerning the weaknesses.
2. In Table 1 column Macula, a wrong result is bolded.
3. Try to motivate the clinical need for such a visual question answering system!
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

From the computer vision and vision-language points of view, the work is acceptable even though its clinical need and applicability remain unclear at best.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

After emergency reviews I had a total of 6 papers, and this was still the 2nd best of them. Accordingly, I remain with my original recommendation. The authors have responded to my and other reviewers’ concerns sufficiently in their rebuttal.

Review #2

Please describe the contribution of the paper

This work focused on the medical VQA problem. In order to achieve a better performance, the authors utilized the attention mechanism, training the network to focus on the region of interest that is provided by the dataset. Evaluated on three medical VQA datasets, their full pipeline performed better than the methods without using the region information or without using attention.
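To make the idea of attending to a given region while still using the rest of the image concrete, here is a minimal sketch of one plausible reading. It is not the paper's Eq. 2, which is not reproduced on this page. The assumptions are: attention weights are computed over all image locations, so context still influences them, and the provided region mask is then applied so that only locations inside the region contribute to the pooled feature. The function names, the flat list layout, and the single-glimpse setup are all hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localized_attention(
    features: List[List[float]],
    scores: List[float],
    region_mask: List[int],
) -> List[float]:
    """Pool N feature vectors using attention restricted to a region.

    features:    N vectors of dimension D, one per image location.
    scores:      N question-conditioned relevance scores (assumed given).
    region_mask: N values in {0, 1}; 1 marks locations inside the region.
    """
    # Attention is normalized over the whole image, so context matters.
    attention = softmax(scores)
    # Assumption: the region mask zeroes out weights outside the region.
    masked = [a * m for a, m in zip(attention, region_mask)]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(masked, features)) for d in range(dim)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(localized_attention(feats, scores=[0.0, 0.0], region_mask=[0, 1]))
```

In this sketch the masked weights are not renormalized, so a region that receives little attention yields a small pooled vector. Whether the authors renormalize is not stated in the reviews or the rebuttal.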
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Network design is simple and straightforward.
- Without a lot of medical VQA datasets that provide the region information, the authors generated two datasets themselves to strengthen their evaluation.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The methodology contribution of this work is limited. The attention mechanism has been well explored in the VQA task, and the authors didn’t bring much insight into this method.
- The way the RIS-VQA and INSEGCAT-VQA datasets were generated is questionable. The answer was labeled as “yes” as long as one pixel of the instrument is in the region. But one or two pixels are barely visible, which is unfair to the “crop region” and “draw region” baselines. It is also not meaningful from an application point of view when only a few pixels are in sight.
- Again, VQA is a well-studied task in general, and more state-of-the-art baseline methods could have been compared.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I think the result could be reproduced if proper code was provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Please see the weaknesses section.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Please see the weaknesses section.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This paper presents a novel approach to medical Visual Question Answering (VQA) that tackles the limitations of existing models by focusing on answering questions about specific image regions rather than the entire image. By considering the context necessary to answer these questions, the proposed method enhances the interpretability and capabilities of medical VQA models. Experimental results demonstrate the effectiveness of this innovative approach, with the proposed model outperforming existing methods on three distinct datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Original data utilization: The authors introduce an innovative approach to guide the learning process of medical VQA using question-related masks. This technique enables the model to focus on relevant image regions while considering the context necessary to answer the questions.
2. Novel application: The paper presents a creative method to embed the mask into the input image using an attention mechanism. This novel application allows the model to effectively incorporate spatial information and improve its interpretability and capabilities in answering region-specific questions.
3. Strong evaluation: The proposed method has been rigorously evaluated on three distinct datasets, demonstrating its effectiveness in outperforming existing approaches. This comprehensive evaluation highlights the potential of the proposed approach to advance the field of medical VQA and improve the performance of region-specific question answering.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

One main weakness of the paper is as follows: Limited comparison with existing methods: The paper would benefit from a more comprehensive comparison with existing Medical VQA methods on the same datasets. Providing an extensive evaluation against a wider range of state-of-the-art approaches would help to further establish the effectiveness and novelty of the proposed method. For example, the authors could re-implement the previous works shown below on this task: (1) Overcoming data limitation in medical visual question answering; (2) Medical visual question answering via conditional reasoning; (3) VQAMix: Conditional triplet mixup for medical visual question answering; (4) A question-centric model for visual question answering in medical imaging.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I believe this work could be reproduced if the code is given.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Extensive comparison with existing methods: As mentioned in the previous question, the paper would benefit from a more comprehensive comparison with existing Medical VQA methods on the same datasets. Including additional methods in the evaluation will help to further establish the effectiveness and novelty of the proposed method and offer a clearer understanding of its position within the field of medical VQA.
2. Expand the literature review: The authors should broaden the literature review to cover the most recent and relevant works in the field of medical VQA. This will help to position the proposed method within the context of the current state of the art and emphasize its novelty and contributions.
3. Clarify the methodology: The authors should provide a more detailed explanation of the various components and steps involved in the proposed method, such as the attention mechanism used to embed the mask into the input image and the process of generating question-related masks. Ensuring that the methodology is clearly described will help readers to understand and appreciate the novelty and effectiveness of the approach.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
1. Novelty of the approach: The paper presents a novel approach to medical VQA by focusing on region-specific questions and using question-related masks to guide the learning process. This innovative technique has the potential to advance the field and improve the performance of region-specific question answering.
2. Comprehensive evaluation: The paper should provide a thorough evaluation of the proposed method on multiple datasets and include comparisons with existing state-of-the-art medical VQA methods. A strong evaluation will help to establish the effectiveness and novelty of the proposed method.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The work presents an interesting VQA method with localized questioning for medical visual question answering. While there are merits in the novel application and strong evaluation, reviewers raised major concerns about (i) poor technical novelty in the method (R1, R2); (ii) doubts on dataset generation (R2) and clinical application (R1); and (iii) limited comparison (R3).

Author Feedback

We thank all the reviewers for their constructive feedback. We are pleased with R1 and R3’s positive feedback. While R2 has rated our paper very low, we believe our clarifications will alleviate all concerns.

[R1/R2/Meta] Limited technical novelty: Like R3, we consider our approach to be novel. While it is true that attention mechanisms have been studied in VQA, embedding a target region in them without disregarding context (i.e., localized attention) is a novel concept. In addition, our experimental results show our approach to be superior to other previously proposed methods. That is, this work’s novelty lies in how to integrate localized attention into VQA models to answer questions about specific regions. Our approach is the first to do so to the best of our knowledge.

[R3/R2/Meta] Limited comparison to SOTA: We wish to clarify that our proposed work is not a specific VQA architecture but a method to focus on answering questions about specific image regions. For this reason, a distinction must be made between VQA methods for questions about entire images and VQA methods that also answer localized questions. Since our method deals with the latter, comparisons were made not to other state-of-the-art architectures but to methods for localized questions. We precisely compared our approach to all methods found in the literature that do so.

[R2/Meta] Doubts on dataset generation: Though not reported in the paper, we also experimented with other thresholds to define positive answers. For example, making the threshold dependent on region area (1%), or by using 10 px as threshold. In both cases, we observed significantly better performance in our method. For example, using the mentioned thresholds on the RIS-VQA dataset, our method has an AUC 0.04 higher than the baseline Crop Region, which is very similar to the difference reported in the paper (0.05). We considered 1 px the least arbitrary threshold and reported results for it.

[R1/Meta] Clinical application: We consider our method to be clinically applicable in two ways: (1) our method provides a more localized evaluation of images, which can be useful, for example, as a second opinion for medical practitioners. Evaluation of suspicious regions of an image can be useful for diagnosis. (2) our method can contribute to the trustworthiness of medical VQA models. Often, the decision of a VQA model is obscure. Consider, for example, asking about the DME risk grade of a fundus image. The model may provide the correct answer, but this could happen for the wrong reasons (e.g., shortcuts). By asking localized questions, the level of agreement between global and local questions can be evaluated. We agree with R1 that medical experts do not need AI to recognize tools, but in the absence of datasets with localized questions, using these is useful to test our method’s robustness and failures.

[R3] Methodology clarification and more extensive literature review: As R2 pointed out, attention mechanisms have been widely studied in VQA. Therefore, we judged it inconvenient to provide more details than necessary. However, we provide references to the works we rely on. Furthermore, Eq 2 describes all operations that take place to compute the attention maps. Regarding question generation, we describe relevant aspects in Sec 3.1. In addition, generation parameters will be available in our source code. We refer to a wide range of recent VQA work in our introduction, mentioning 13 papers [4, 6, 8, 13-15, 17-19, 21-24] and grouping relevant literature into data-oriented and architecture-oriented approaches, but also mentioning emerging trends. Because our paper focuses mainly on localized questions in VQA, we emphasize SOTA methods on this topic.

[R1] Obscurity of the role of glimpses: We experimented with different numbers of glimpses, concluding that G=2 works best for all baselines. Well-known previous works (e.g., Fukui et al., 2016, arXiv:1606.01847; Kim et al., ICLR 2017) also report using G=2.
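The labeling thresholds discussed in the rebuttal's dataset-generation paragraph (1 px, 10 px, or 1% of the region area) can be illustrated with a short sketch. This is not the authors' generation code. The function names, the list-of-lists mask format, and the reading of the 1% rule as "overlap must reach that fraction of the region's area" are assumptions made for illustration only.

```python
from typing import List, Optional

Mask = List[List[int]]  # binary mask; 1 marks a pixel of the object or region


def overlap_pixels(instrument: Mask, region: Mask) -> int:
    """Count pixels set in both the instrument mask and the region mask."""
    return sum(
        1
        for inst_row, reg_row in zip(instrument, region)
        for inst_px, reg_px in zip(inst_row, reg_row)
        if inst_px and reg_px
    )


def answer_for_region(
    instrument: Mask,
    region: Mask,
    min_pixels: int = 1,
    min_area_fraction: Optional[float] = None,
) -> str:
    """Return "yes" if enough instrument pixels fall inside the region.

    If min_area_fraction is given, the threshold is that fraction of the
    region's own area (assumed reading of the 1% rule); otherwise the
    absolute pixel count min_pixels is used (1 px or 10 px in the rebuttal).
    """
    overlap = overlap_pixels(instrument, region)
    if min_area_fraction is not None:
        region_area = sum(1 for row in region for px in row if px)
        threshold = min_area_fraction * region_area
    else:
        threshold = min_pixels
    # An empty overlap is never a positive answer, whatever the threshold.
    return "yes" if overlap > 0 and overlap >= threshold else "no"


if __name__ == "__main__":
    instrument = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
    region = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
    print(answer_for_region(instrument, region, min_pixels=1))
    print(answer_for_region(instrument, region, min_pixels=10))
    print(answer_for_region(instrument, region, min_area_fraction=0.01))
```

In this toy example two instrument pixels fall inside a nine-pixel region, so the 1 px and 1% rules label the pair "yes" while the 10 px rule labels it "no". This is the kind of borderline case R2's concern is about.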

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have clarified most of the major concerns of reviewers, specifically technical novelty, doubts about dataset generation, and usage in the clinical domain. Although technical novelty is limited in this work, I found the application novel from a clinical perspective: it provides a more localized evaluation of images, which can be considered a deep evaluation of suspicious regions of a diagnostic image. There is still considerable scope to improve the model with more complex localized QA instead of just binary (yes/no) QA. Considering the novel application in healthcare, I am inclined to accept this paper and give the MICCAI community an opportunity to discuss further methodological improvement and the clinical usefulness of the work.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper introduces a method for medical VQA by focusing on answering questions about specific image regions rather than the entire image. Reviewers have concerns about the very limited technical contributions and the limited discussion of existing studies. The findings of this study also do not look surprising. I think this paper does not have enough contributions for MICCAI.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

In their rebuttal, the authors provide satisfactory explanations and clarifications for most of the concerns raised by reviewers. They address the issues of technical novelty, dataset generation, clinical applicability, methodology clarification, and literature review. The potential clinical applicability of the proposed approach adds value to the research. Taking all these factors into account, I believe the paper has addressed the reviewers’ concerns adequately, and the strengths of the work outweigh the weaknesses.

Localized Questions in Medical Visual Question Answering