
Authors

Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman

Abstract

Visual Question Answering (VQA) models take an image and a natural-language question as input and infer the answer to the question. Recently, VQA systems in medical imaging have gained popularity thanks to potential advantages such as patient engagement and second opinions for clinicians. While most research efforts have been focused on improving architectures and overcoming data-related limitations, answer consistency has been overlooked even though it plays a critical role in establishing trustworthy models. In this work, we propose a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process. Specifically, we consider the case where implications between perception and reasoning questions are known a-priori. To show the benefits of our approach, we evaluate it on the clinically relevant task of Diabetic Macular Edema (DME) staging from fundus imaging. Our experiments show that our method outperforms state-of-the-art baselines, not only by improving model consistency, but also in terms of overall model accuracy. Our code and data are available at https://github.com/sergiotasconmorales/consistency_vqa.
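The paper's exact loss is not reproduced on this page. As a rough, hypothetical illustration of the idea the abstract describes (penalizing answers that violate a known implication between a reasoning question and a perception sub-question), one might sketch a hinge-style penalty like the following; the function names and the hinge form are assumptions for illustration, not the authors' actual formulation:

```python
def implication_penalty(p_reason, p_percept):
    """Hinge-style penalty for a known implication A -> B.

    If the model is confident in the reasoning answer A (e.g. "DME grade >= 1")
    but not in the perception answer B it implies (e.g. "hard exudates are
    present"), the pair is inconsistent and incurs a penalty.
    p_reason, p_percept: predicted probabilities for A and B.
    """
    return max(0.0, p_reason - p_percept)


def consistency_loss(pairs):
    """Average penalty over (antecedent, consequent) probability pairs,
    to be added as a regularizer to the usual answer-classification loss."""
    return sum(implication_penalty(a, b) for a, b in pairs) / len(pairs)
```

A confident reasoning answer paired with an unconfident perception answer, e.g. `implication_penalty(0.9, 0.2)`, yields a positive penalty, while the reverse direction (`p_percept >= p_reason`) costs nothing, matching the asymmetry of an implication.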

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_37

SharedIt: https://rdcu.be/cVVpS

Link to the code repository

https://github.com/sergiotasconmorales/consistency_vqa

Link to the dataset(s)

https://drive.google.com/file/d/1qKW6OIL2QdoJ9_xVwaDpfuVBLVnjQr-V/view?usp=sharing


Reviews

Review #1

  • Please describe the contribution of the paper

    • The paper proposes a VQA model with a focus on consistency across model replies, to ensure that the model doesn’t contradict itself when answering multiple questions over the same image.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The paper is well-written and easy to follow.
    • The idea of enforcing consistency by letting the user ask questions about a specific region in the image (by providing a mask) is interesting.
    • A strength of the paper lies in learning from a special dataset that provides presence or segmentation labels of hard exudates.
    • This paper is a good example of extending a deep learning model to incorporate domain-specific constraints. Though the extended VQA is trained using a specially supervised dataset, the training remains compatible with standard datasets without any consistency information.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Gathering a special dataset with expert annotations, either for marking the presence or segmentation of hard exudates or for marking the fovea to compute the macula region, is expensive.
    • To support clinical deployment, the authors should report results on a third dataset that has only DME labels. The authors can report results in two scenarios:
      o Consider this dataset as a deployment dataset, for inference time only.
      o Consider this dataset for fine-tuning the pre-trained VQA. Report how VQA helped build trust by identifying the DME scale and specifying regions with hard exudates.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper will be reproducible once the code is released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    • The attention vectors can help identify the important regions in the image for answering a question. For a given main question, an analysis of when the main and sub-questions attend to the same or different regions would further help in understanding the reasoning process of the VQA.
    • At inference time, for an image with a DME scale of 1 or more, one can create multiple sub-questions while gradually questioning the entire image. Aggregating positive replies can create a weak attention map over the image, highlighting the locations of hard exudates.
    • The authors could consider an additional question, “Is this region the macula?” This can help identify the macula in un-annotated images, enabling further evaluation of the question “Are there hard exudates in the macula?”
    • The current scope of the study is very narrow, with a specific imaging modality highlighting a specific disease. The authors should consider evaluating their method on another study to demonstrate the generalizability of their approach.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It’s a novel approach for a very specific medical problem.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper proposes a method that enforces VQA consistency specifically by regularizing the answers to perception questions and to reasoning questions. The authors demonstrate improvements on model consistency as well as model accuracy in diabetic macular edema staging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Solid experimental results on improving both model accuracy and consistency in the diabetic macular edema staging experiment.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is unclear what the difference is between the proposed approach and the one in Selvaraju, Ramprasaath R., et al. “SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Adding a paragraph to discuss related work would be helpful. The authors should be explicit about the contributions / novelty of this paper and consider discussing Wang, Peiqi, et al. “Image Classification with Consistent Supporting Evidence.” Machine Learning for Health. PMLR, 2021.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors demonstrate improvements on model consistency as well as model accuracy in diabetic macular edema staging. The novelty of the proposed method, however, is not clear.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel loss function and corresponding training procedure that allows the inclusion of relations between questions into the training process. The proposed method is evaluated on Diabetic Macular Edema (DME) staging from fundus imaging and is shown to outperform state-of-the-art baselines, not only by improving model consistency but also in terms of overall model accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem addressed in this paper is interesting and important. The model outperforms a baseline on one dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No technical novelty. Only one dataset is used, so the generalizability of the method is unproven. Recent VQA methods in the medical domain (MedVQA) are not discussed. Recent SOTA methods are not used to compare against the performance of the proposed method. The writing is a bit poor; there are a lot of typos and grammar issues.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    none

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    –First, this work lacks novelty.

    –The experiments are not sufficient, and evaluation on only a single dataset is not convincing.

    –This paper lacks a discussion of the experimental results and an analysis of the motivation.

    –Fig. 2 occupies more than half of page 4; this space could have been used to better discuss the motivation and results.

    –I suggest the authors add recently proposed baselines.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lack of novelty, and the generalizability of the method is unproven. Further, no recent SOTA methods are used for comparison.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    RESPONSE TO AUTHORS:

    I thank the authors for submitting a detailed rebuttal. Their explanation helped me better understand the actual scope of the contribution. Addressing my point about generalizability and the dataset would have made the paper very strong, in my opinion. And I am satisfied by the authors’ promise to share the data with the public. I have therefore changed my score and recommend acceptance.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a model to address consistency in medical VQA, where multiple questions over the same image are enforced to be consistent. It is an interesting topic and a well-written paper. However, some concerns were raised, chiefly about the technical novelty and the experimental settings. As pointed out by R2 and R3, the difference between this paper and the CVPR2020 and PMLR2021 papers is unclear in terms of technical novelty. In particular, the proposed method was validated only on one specific dataset with presence or segmentation labels of hard exudates. It is not compared with other methods or on other datasets, and would be hard to deploy clinically. Therefore, the authors should address these concerns in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8




Author Feedback

We thank all the reviewers for their comments and constructive feedback. We are pleased about R1 and R2’s positive feedback. While R3 has rated our paper very low, we believe our clarifications should alleviate the raised concerns.

[R2/R3/Meta] Unstated/unclear or lacking novelty: We consider our method to be both conceptually and technically novel. That is, 1) this is the first work that focuses on VQA consistency for medical imaging, and 2) we achieve this using a novel regularization approach and allow our VQA method to use questions on arbitrary image regions in order to further help construct (in-)consistent question pairs. To the best of our knowledge, no previous work on medical VQA focuses on consistency, nor does any tackle it as explicitly as we do. In addition, the use of a regularization approach for inconsistency minimization is novel. We have described these contributions at the end of our introduction (page 2).

[R3/Meta] No comparison to SOTA: We are somewhat surprised that R3 indicated that we did not compare against the SOTA, given that we show performance against SQuINT (Selvaraju et al., CVPR2020) on page 7. This is by far the most relevant and recent SOTA, as it is the only existing method we (and all reviewers) have found that considers consistency for main and sub-questions. Other methods mentioned in our introduction mainly deal with consistency in terms of question rephrasings or logical relationships; as such, they do not fit our intended goal. Please note that PMLR2021 is about image classification, not VQA, hence it is difficult to use it as a direct comparison to our method.

[R2/R1/Meta] Difference to CVPR2020: CVPR2020 trains a model that enforces similar attention maps for main and sub-questions. As mentioned in our paper, the main limitation of CVPR2020 is that consistency is achieved at the expense of overall accuracy. In contrast, our approach directly minimizes inconsistency in question responses with a new loss and avoids the need for attention maps altogether. Hence, conceptually and practically, the two methods are quite different, and our experiments show that our approach performs significantly better in terms of accuracy and consistency (Table 1). As noted by R1, observing how the attention maps from our approach differ from those of CVPR2020 would be interesting, but we leave this as future work.

[R1/R3/Meta] Generalizability and dataset: (1) We would like to point out that there is no open medical imaging VQA dataset for consistency: PathVQA, VQAMed, and VQA-RAD do not have question-answer pairs that allow consistency to even be evaluated. (2) Building medical VQA datasets is a new and difficult task, with virtually all public datasets less than 5 years old. Adding main and sub-questions adds significant overhead to that process. Hence, the fact that we are coming forward with the first medical VQA dataset that allows for consistency is extremely positive and moves the field forward in an important way. We will make our data (questions, types, and answers) available to the public for future use (an obligation of our funding). (3) Previous MICCAI VQA-related papers have almost always focused on a single type of dataset (e.g., radiology), hence our paper does not actually stand out in this regard.

[R2/R3] Missing discussion about related work (e.g., MedVQA): We refer to a wide range of recent VQA work in our introduction, including MedVQA, for which we mention 8 different papers [6, 8, 12-14, 19, 24, 27]. Because the focus of our paper is mainly on consistency in VQA, we discuss SOTA on consistency. As stated in our paper, recent approaches on consistency have exclusively been proposed for general computer vision VQA. R2’s reference to PMLR2021 (which we will add to our references) does bear some similarity to our proposed regularization but is focused on image classification. Furthermore, in our case the inclusion of region-specific questions adds an important dimension to VQA consistency.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed the main concerns of the reviewers, and three positive ratings were given.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper is well-written and deals with important problems. The experiments are well designed, and the authors verify the effectiveness of the proposed method. During the rebuttal, the authors successfully addressed the concerns of the reviewers. As a result, there is a consensus in the reviews. I also recommend acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a novel VQA model that enforces consistency across model replies by regularizing the answers to perception and reasoning questions. While all reviewers think the paper is well-written, the problem is important, and the performance is good, the reviews were split in the first stage due to concerns about the difference from existing work on VQA consistency (CVPR2020, PMLR2021) and the lack of generalizability stemming from validation on only one dataset. The authors were able to address all questions raised by the reviewers well, especially those of Reviewer 3, so the stage-2 reviews after the rebuttal all endorse acceptance of this paper based on its novelty, its contribution of an open medical VQA dataset, and its excellent performance in consistency and accuracy. I also recommend acceptance of this paper based on the consensus of the reviews.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


