
Authors

Pengfei Li, Gang Liu, Jinlong He, Zixu Zhao, Shenjun Zhong

Abstract

Medical visual question answering (VQA) is a challenging task that requires answering clinical questions about a given medical image by considering both visual and language information. However, due to the small scale of training data for medical VQA, the pre-training and fine-tuning paradigm has become a commonly used solution to improve model generalization performance. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image-text matching as pre-training objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets with significant accuracy improvements of 2.2%, 14.7%, and 1.7% respectively. Besides, we conduct a comprehensive analysis to validate the effectiveness of different components of the approach and study different pre-training settings. Our codes and models are available at https://github.com/pengfeiliHEU/MUMC.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_36

SharedIt: https://rdcu.be/dnwcN

Link to the code repository

https://github.com/pengfeiliHEU/MUMC

Link to the dataset(s)

https://osf.io/bd96f

https://github.com/UCSD-AI4H/PathVQA

https://www.med-vqa.com/slake


Reviews

Review #2

  • Please describe the contribution of the paper

    While medical Visual Question Answering (VQA) is an important problem, few multi-modal datasets are available for it. This makes it challenging to apply deep learning methods, as they require a large amount of annotated data. To address this, the authors design a transformer-based encoder that utilizes contrastive losses for both unimodal and multimodal representations. In their experiments, they demonstrate that the proposed method outperforms state-of-the-art methods and provide an ablation study to justify the necessity of the modules in the proposed network. They also present visualizations that can be used for explaining the method in the future.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    During the pre-training phase, the authors effectively combine several valid methods, including contrastive losses for learning both unimodal and multimodal representations, image-text matching, masked language modeling, and a masked-image strategy for data augmentation. While deep learning models such as transformers require a large amount of annotated data, the proposed method addresses this lack of annotated data.

    The authors validate the model structure through various experimental results compared to state-of-the-art methods and an ablation study. The baselines used are the state-of-the-art methods published between 2019 and 2022. The proposed method outperforms them across diverse datasets such as VQA-RAD, PathVQA, and SLAKE. The ablation study also demonstrates that the modules in the network structure are necessary. Finally, the visualization using Grad-CAM also shows which regions of the image the network is attending to when presented with different types of medical images and questions. This helps validate that the proposed method learns appropriate features from the image given the associated text.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some descriptions are not clear as below.

    In Fig. 1, it appears that the authors have implemented a momentum model for the contrastive loss, but they do not provide a clear description of what it is. To improve clarity, the paper should include a brief explanation of this technique. It is suspected to be related to the momentum update technique of MoCo [19].

    The authors address both types of VQA questions, closed-ended and open-ended. Closed-ended questions can easily be evaluated with classification accuracy, as shown in Table 1. However, it is not clear how the accuracy for open-ended questions is measured. The predicted answer and the ground truth may have different sequence lengths, and it is unclear whether accuracy is compared at the character or word level. The paper should clarify how accuracy is measured for open-ended questions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide a detailed description of the proposed network structure and losses in the paper and the reproducibility checklist, except for the momentum model in Fig. 1, which is suspected to be MoCo [19].

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors address the important problem that deep learning-based models for medical VQA require a large amount of annotated data by utilizing a pre-training method that combines necessary components. Overall, the paper is well written, except for the weakness discussed in question 6. To further improve the paper, the authors could include additional descriptions for a journal version.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is recommended for MICCAI 2023. The authors address the important problem that deep learning-based models for medical VQA require a large amount of annotated data, while such data is often not available. They combine valid components to overcome the data scarcity. Furthermore, they conduct extensive experiments, including comparisons with state-of-the-art methods, an ablation study, and visualizations.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors propose to address the medical visual question answering (VQA) problem with a self-supervised method based on contrastive learning (MoCo, CVPR 2020). The proposed method includes a pre-training phase that learns to align features from the image and text domains using medical image caption datasets, and is then fine-tuned to address the VQA problem. The proposed method was validated on three public VQA datasets and outperformed baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method addresses the data scarcity problem of VQA via pretraining on the more abundant image captioning datasets, which can be of inspiration to the community.
    • The comparison to SOTA seemed sound and comprehensive.
    • The ablation studies with pretraining designs are comprehensive and sound.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Novelty of the pre-training model design seemed to be limited to adopting and integrating the existing designs from contrastive learning (MOCO), ViT and BERT.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    In general, it seems feasible to reproduce the results from this paper. However, it is not mentioned that code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Would other types of self-supervised learning schemes, other than contrastive learning, achieve performance similar to the MoCo-based learning scheme?
    • How big do the image captioning datasets need to be to learn sufficiently good representations? What about an ablation study on the amount of pretraining data?
    • What are the failure cases and their corresponding Grad-CAM results (would the model look at unrelated regions in the images)?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • This paper focuses on the VQA problem, which is of great interest to the community and potentially helpful in clinical applications.
    • The proposed pretraining scheme can be of inspiration to researchers with similar focus.
    • Experimental validation and ablation studies are comprehensive and seemed solid.
  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors have developed an approach that effectively addresses the limited availability of training data in medical VQA by leveraging medical image caption datasets for pre-training. By combining unimodal and multimodal contrastive losses with masked language modeling and image-text matching, the authors have achieved state-of-the-art performance on three benchmark medical VQA datasets. The paper also provides a comprehensive analysis of the effectiveness of each component of the approach, studying different pre-training settings, and demonstrating how this combination of techniques can significantly improve model generalization in medical VQA tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Application of existing techniques to medical VQA: The paper applies existing self-supervised learning techniques, such as masked vision and language pre-training and multimodal contrastive losses, to the medical VQA domain. By adapting these techniques to medical VQA tasks, the authors demonstrate their effectiveness in improving model generalization performance when dealing with limited training data.
    2. Leveraging medical image caption datasets for pre-training: The paper presents an original way to use data by utilizing medical image caption datasets for pre-training the model. This approach addresses the limited availability of medical VQA training data and allows the model to learn more robust and generalizable features.
    3. Combination of unimodal and multimodal contrastive losses: The proposed approach combines both unimodal and multimodal contrastive losses for learning feature representations, which enhances the model’s ability to capture the interaction between visual and textual information. This combination has led to improved performance on medical VQA tasks.
    4. State-of-the-art performance on benchmark datasets: The paper demonstrates significant improvements in accuracy on three publicly available medical VQA datasets compared to existing state-of-the-art methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of novelty in the methods: Although the paper applies existing self-supervised learning techniques to the medical VQA domain, the individual methods, such as masked vision-and-language pre-training and multimodal contrastive losses, are not novel in themselves. References to prior work are listed below: (a) Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training. (b) Contrastive Pre-training and Representation Distillation for Medical Visual Question Answering Based on Radiology Images.
    2. Disobeying paper format rules: It seems that the paper does not adhere to the formatting guidelines, which can negatively impact the paper’s overall presentation and its chances of being accepted for publication.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Lack of novelty in the methods: Although the paper applies existing self-supervised learning techniques to the medical VQA domain, the individual methods, such as masked vision and language pre-training and multimodal contrastive losses, are not novel in themselves. The work could benefit from introducing new or innovative techniques to further advance the field.

    Disobeying paper format rules: The paper does not adhere to the formatting guidelines, which can negatively impact the paper’s overall presentation and its chances of being accepted for publication. Following the proper formatting rules is essential for maintaining consistency and clarity in academic publications.

    Confusing and low-quality figures: The paper contains confusing and low-quality figures, making it difficult for readers to understand the visualizations and interpret the results. Improving the quality and clarity of the figures would enhance the overall presentation of the paper.

    Insufficient experiments: The paper could benefit from additional experiments to further validate the proposed approach and its performance. This may include experiments with different random seeds, alternative pre-training strategies, and evaluation on a wider range of medical VQA tasks. More comprehensive experiments would provide stronger evidence for the effectiveness of the proposed method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I think the method in this paper can be reproduced from the details provided in the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper applies existing self-supervised learning techniques, such as masked vision and language pre-training and multimodal contrastive losses, to the medical VQA domain, demonstrating their effectiveness in improving model generalization performance when dealing with limited training data.
    2. The proposed approach combines both unimodal and multimodal contrastive losses for learning feature representations, which enhances the model’s ability to capture the interaction between visual and textual information.
    3. The paper demonstrates improvements in accuracy on three publicly available medical VQA datasets compared to existing state-of-the-art methods, highlighting the effectiveness of the proposed approach.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a medical visual question answering (VQA) solution based on self-supervised learning, more specifically on MoCo. The proposed method includes a pre-training phase that learns to align features from the image and text domains using medical image caption datasets, and is then fine-tuned to address the VQA problem. The proposed method was validated on three public VQA datasets and outperformed baselines. The novelty of the pre-training model design seems limited to adopting and integrating existing designs from MoCo, ViT, and BERT, which is still valuable. There are several areas that require improvement and clarification before publication: (1) the paper would benefit from an ablation study that explores and discusses the impact of other self-supervised techniques, which would provide a more comprehensive understanding of the proposed method's effectiveness compared to alternative approaches; (2) novelty concerns should be addressed, and reproducibility improved by addressing the comments regarding model training details; (3) the quality of the figures seems degraded, possibly due to the absence of vector images, and addressing this would enhance the visual presentation of the paper.




Author Feedback

Dear reviewers,

We would like to express our sincere gratitude for providing us with valuable feedback. We have carefully considered each comment and suggestion and provide a detailed response to the main issues raised by the reviewers.

Addressing the reviewers’ comments on the novelty of using existing pre-training techniques, we acknowledge the reviewers’ perspective. The focus of our work is not on new self-supervised techniques, but on addressing the challenge of the limited availability of medical VQA training data by incorporating multiple objectives into visual-language pre-training on non-VQA datasets, so as to learn robust visual/textual feature encoders and multimodal fusion encoders for downstream VQA tasks. Specifically, we introduced a strategy that combines unimodal feature learning using contrastive learning objectives with multimodal feature learning via ITM and MLM losses during the pre-training stage. Through the experiments, we demonstrated that this combined pre-training approach led to performance improvements over the previous SOTA method, M3AE, which applied a masked strategy only.
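For illustration, a minimal PyTorch sketch of how such objectives can be combined in a single pre-training step (this is not our actual implementation; the tensor shapes, temperature, and placeholder heads below are assumptions, and the unimodal momentum-contrast terms are sketched separately after the momentum-model paragraph below):

```python
import torch
import torch.nn.functional as F

# Toy feature tensors standing in for encoder outputs (batch of 8, dim 256).
img_feat = F.normalize(torch.randn(8, 256), dim=-1)   # image [CLS] embeddings
txt_feat = F.normalize(torch.randn(8, 256), dim=-1)   # text  [CLS] embeddings

# Image-text contrastive loss (InfoNCE over in-batch negatives).
logits = img_feat @ txt_feat.t() / 0.07                # temperature 0.07 (placeholder)
targets = torch.arange(8)
loss_itc = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Image-text matching: binary classification on fused embeddings (placeholder head).
fused = torch.randn(8, 256)                            # multimodal encoder output
itm_head = torch.nn.Linear(256, 2)
loss_itm = F.cross_entropy(itm_head(fused), torch.randint(0, 2, (8,)))

# Masked language modeling: predict masked tokens over a toy vocabulary of 100.
mlm_logits = torch.randn(8, 16, 100)                   # (batch, seq_len, vocab)
mlm_labels = torch.randint(0, 100, (8, 16))
loss_mlm = F.cross_entropy(mlm_logits.view(-1, 100), mlm_labels.view(-1))

# Objectives are summed for pre-training (the equal weighting here is an assumption).
loss = loss_itc + loss_itm + loss_mlm
```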

The implementation of the momentum model in our method (mentioned by Reviewer #2) does follow the design of MoCo; however, we extend it to a multi-modal setup, in which we maintain two momentum models, one for the image encoder and one for the text encoder. Correspondingly, we use two queues to buffer the image and text embeddings encoded by the momentum models. The contrastive losses in our work are applied to (1) align image and text features; (2) learn the unimodal image encoder via momentum contrast between different views of the same image (different views are generated by different image masks); and (3) learn the unimodal text encoder via momentum contrast. We will add the above information regarding the momentum model in the next update of the manuscript.
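A minimal sketch of the momentum-encoder-and-queue mechanism described above (the simple linear encoders, dimensions, and hyperparameters are placeholders, not our actual ViT/BERT encoders or settings):

```python
import torch
import torch.nn.functional as F

dim, queue_size, momentum, temp = 256, 1024, 0.995, 0.07

# Placeholder online and momentum encoders; the momentum encoder is never backpropagated.
encoder = torch.nn.Linear(512, dim)
encoder_m = torch.nn.Linear(512, dim)
encoder_m.load_state_dict(encoder.state_dict())
for p in encoder_m.parameters():
    p.requires_grad = False

queue = F.normalize(torch.randn(queue_size, dim), dim=-1)   # buffer of past momentum embeddings

def momentum_contrast(view_a, view_b):
    """InfoNCE between one view (online encoder) and another view of the same
    input (momentum encoder), with the queue providing extra negatives."""
    q = F.normalize(encoder(view_a), dim=-1)                 # queries
    with torch.no_grad():
        k = F.normalize(encoder_m(view_b), dim=-1)           # keys (no gradient)
    l_pos = (q * k).sum(dim=-1, keepdim=True)                # positive logits
    l_neg = q @ queue.t()                                    # negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / temp
    labels = torch.zeros(q.size(0), dtype=torch.long)        # positives sit at index 0
    return F.cross_entropy(logits, labels), k

# One toy step: two masked views of the same (flattened) image batch.
va, vb = torch.randn(8, 512), torch.randn(8, 512)
loss, keys = momentum_contrast(va, vb)

# Momentum update of the key encoder, then enqueue the new keys and drop the oldest.
with torch.no_grad():
    for p, p_m in zip(encoder.parameters(), encoder_m.parameters()):
        p_m.mul_(momentum).add_(p, alpha=1 - momentum)
queue = torch.cat([keys, queue], dim=0)[:queue_size]
```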

In response to Reviewer #4’s remarks regarding the earlier works CPRD (MICCAI 2021) and M3AE (MICCAI 2022): they significantly inspired our current work; however, we would like to highlight the differences that set our method apart from these studies. Neither M3AE nor CPRD applied contrastive losses to align image and text in the pre-training phase, which is one of the main contributions of our work. CPRD followed MoCo in training a teacher model for the visual encoder via a contrastive loss over different image views (obtained by data augmentation), whereas our method applies a random patch-wise masking strategy to create multiple views for computing the contrastive loss. Compared to M3AE, we excluded the image reconstruction loss (MIM) from our network design and instead focused more on multi-modality alignment. Both CPRD and M3AE are significant works in the field of medical VQA, and we included them in our experiments for comparison.
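As an illustration of the patch-wise masking used to create contrastive views (the patch size and mask ratio shown here are placeholders, not our actual settings):

```python
import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.25):
    """Zero out a random subset of non-overlapping patches, producing one masked view.
    Calling this twice on the same batch yields two different views for contrastive learning."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    keep = (torch.rand(b, ph, pw) > mask_ratio).float()           # 1 = keep patch, 0 = mask
    mask = keep.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    return images * mask.unsqueeze(1)                              # broadcast over channels

images = torch.randn(8, 3, 224, 224)
view_a, view_b = random_patch_mask(images), random_patch_mask(images)
```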

Responding to Reviewer #3’s constructive feedback on “ablation experiments on pre-training data size”: we did not cover this since our focus was on validating our method’s effectiveness rather than model scalability. We are investigating this in another study.

Addressing Reviewer #2’s question on measuring open-ended question accuracy: we treated VQA as a generative task. Our method calculates similarities between the generated answer and the answers in the candidate list, selecting the candidate with the highest score as the final answer. We will clarify this in our manuscript.
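For illustration, a minimal sketch of this candidate scoring (the bag-of-words cosine similarity here is only a simplified stand-in for the similarity measure in our implementation):

```python
from collections import Counter
import math

def cosine_sim(a, b):
    """Bag-of-words cosine similarity between two answer strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_answer(generated, candidates):
    """Map the generated free-form answer onto the most similar candidate,
    so accuracy can be computed as exact match against the ground-truth candidate."""
    return max(candidates, key=lambda c: cosine_sim(generated, c))

print(select_answer("left lower lobe of the lung",
                    ["right upper lobe", "left lower lobe", "mediastinum"]))
```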

Besides, Reviewer #3’s concern about “failure cases and corresponding Grad-CAM results” is insightful. In the experiments, we randomly sampled some examples and indeed found instances where the model focused on irrelevant regions. We will explicitly state in our manuscript that not all attention maps are clearly interpretable, to prevent any potential misinterpretation.

We will publish the code and model weights on GitHub after the peer review process, and attach the high-resolution figures in the updated manuscript.

Once again, we sincerely appreciate your review and evaluation of our work.

Best regards




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on a thorough assessment of the author’s rebuttal, I recommend accepting the paper. The rebuttal convincingly addresses concerns regarding previous work and novelty by clearly outlining the differences. I suggest modifying the related work sections to incorporate this new information. Additionally, the authors have committed to publishing the code and model, thereby enhancing the reproducibility of their work. This further supports the acceptance of the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In this paper, the authors propose to address the medical visual question answering (VQA) problem with a self-supervised method based on contrastive learning.

    key strengths:

    1. extensive experiments
    2. novel way to combine unimodal and multimodal contrastive losses

    key weaknesses:

    1. missing some details in the method description
    2. formatting and figure quality issues

    The rebuttal adequately addresses the novelty issues and comparison with existing methods.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Strengths:

    • Adoption and integration of existing designs from MoCo, ViT, and BERT.
    • Improved performance compared to the state-of-the-art method M3AE.

    Weaknesses:

    • Lack of an ablation study exploring the impact of other self-supervised techniques.
    • Concerns about novelty and reproducibility, particularly in model training details.
    • Degraded quality of figures.

    In their rebuttal, the authors addressed these concerns by clarifying their focus on addressing limited training data, explaining the extension of the momentum model design, and highlighting the differences from previous works. Considering the authors’ response and their efforts to address the reviewers’ feedback, the paper presents a valuable contribution in medical VQA. However, it is recommended that the authors incorporate the suggested improvements, such as including additional training details and enhancing figure quality. Adding the ablation study on the impact of other self-supervised techniques is strongly recommended.


