
Authors

Long Bai, Mobarakol Islam, Hongliang Ren

Abstract

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without localizing them, and vision-language (ViL) embedding remains under-explored for these tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would help medical students and junior surgeons learn and understand from recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed into a standard Data-Efficient Image Transformer (DeiT) module, followed by a parallel classifier and detector for joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis 2017 and 2018 challenges. The experimental results highlight the superior performance and robustness of our proposed model compared to state-of-the-art approaches, and ablation studies further confirm the contribution of each proposed component. The proposed method provides a promising solution for surgical scene understanding and takes a first step toward an Artificial Intelligence (AI)-based VQLA system for surgical training. Our code is available at github.com/longbai1006/CAT-ViL.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_38

SharedIt: https://rdcu.be/dnwPi

Link to the code repository

https://github.com/longbai1006/CAT-ViL

Link to the dataset(s)

https://endovissub2018-roboticscenesegmentation.grand-challenge.org/home/

https://endovissub2017-roboticinstrumentsegmentation.grand-challenge.org/


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces a new transformer-based pipeline for the medical VQA task. Guided attention is exploited to let the text features interact with the visual features, and the two modalities are fused by a gated fusion module. The DeiT architecture is also incorporated into the pipeline. The results show that it outperforms previous works.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • It introduces a new pipeline, termed CAT-ViL, for the medical VQA task. A gated fusion module is proposed to fuse the two modalities, and DeiT is also incorporated to boost performance.
    • The paper is well-organized
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Is the guided attention the same as cross-attention?
    • In the comparison with previous works shown in Table 2, do those works also involve DeiT? For an apples-to-apples comparison, it would be better to also include DeiT in the previous works, or to run an ablation study on the DeiT component.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper has provided most of the details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be better to do more ablation studies to verify some components. For example, what about guided attention from the visual part to the text part, or bi-directional guided attention? Another interesting ablation study would be the number of guided-attention layers.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a transformer-based pipeline for visual question answering. Co-attention gated fusion is proposed. Some ablation studies are missing. Although the overall technical novelty is marginal, the vision-language model for the medical VQA task would be interesting.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    After going through the rebuttal and comments from other reviewers, I am still a little concerned about the novelty, which is a concern shared by all reviewers.

    I am ok if it ends up with acceptance.



Review #2

  • Please describe the contribution of the paper

    The paper proposes an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The proposed method uses two prior modules, a guided-attention module and a gated module, to fuse visual and text embeddings. The method was evaluated on the EndoVis-17 and EndoVis-18 datasets and performed better than the baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method addresses a shortcoming of the recent VisualBERT ResMLP work, which naively concatenates visual and text embeddings, by introducing guided attention and a gated module, and it successfully achieves higher performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Incorporating the guided-attention module and the gated module from previous works into VisualBERT ResMLP is an effective way to address the limitations of the previous method. However, there appears to be a lack of originality in the algorithmic approach, which simply uses these modules without modification. The co-attention between the self-attention and guided attention is also almost the same as in previous work [24].

    The ablation study is unclear about how each fusion strategy is incorporated. Are all strategies based on the same feature extraction of visual and text embeddings, differing only in the fusion mechanism by inserting or removing modules?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This research is reproducible based on the checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I think highly of the reasonable approach of overcoming the shortcomings of each previous method by fusing them and achieving better performance. However, I would hope for more innovation in the form of a new algorithm or model.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe that, while there may be a lack of innovation from an algorithmic perspective, the work presented in this paper makes a reasonable improvement to the application of CAI. Therefore, I recommend that the MICCAI community accept this paper.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The contribution of this paper is a novel attention structure for VQLA tasks that combines fine-grained text embeddings with visual attention in images through a guided-attention module and a gated module to improve object localization performance. The proposed solution addresses a real-world problem in VQA and has the potential to benefit the medical imaging community by inspiring further research. The paper's clear presentation and reproducibility make it a valuable contribution to the VQA research community.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The authors propose a novel attention structure that combines fine-grained text embeddings of VQA questions into each layer of visual attention in images through a guided-attention module. They implement an additional gated module to further incorporate visual and textual embeddings to improve object localization performance.

    (2) The motivation of this paper is commendable as it addresses a real-world problem in VQA and proposes a corresponding solution. Its contribution could potentially benefit the medical imaging community by inspiring further research.

    (3) The presentation of this paper is exemplary. The authors provide a clear background of the scenario and describe their proposed improvements for VQLA tasks. The structure of the paper is well-organized and easy to follow. The visual aids and formulas are of high quality, making this paper highly accessible to readers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Although the authors highlighted some issues with current VQA models, such as the simplistic fusion of attention methods to incorporate heterogeneous features, their proposed method can still be classified within this genre to some extent. While the proposed method does show improvements in performance, some key illustrations, such as the justification for the guided-attention module and gated module, are not provided.

    (2) In the comparison section, the CAT-ViL DeiT model only slightly outperforms other models, and the authors did not mention the potential influence of the choice of random seed or the use of repeated experiments during evaluation. Therefore, randomness may significantly affect the final results, which the authors did not address.

    [1] Praveen R G, de Melo W C, Ullah N, et al. A joint cross-attention model for audio-visual fusion in dimensional emotion recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 2486-2495.
    [2] Georgescu M I, Ionescu R T, Miron A I, et al. Multimodal multi-head convolutional attention with various kernel sizes for medical image super-resolution. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023: 2195-2205.
    [3] Wu Z, Liu L, Zhang Y, et al. Multimodal crowd counting with mutual attention transformers. 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022: 1-6.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I believe that this paper has the potential to contribute to the community through its reproducibility, especially if the authors open-source their code and software. This would allow other researchers to easily replicate their results and build upon their findings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In conclusion, this is an above-average work that focuses on a specific task in VQLA and proposes a hybrid of a refined guided-attention module and a sequence-model-style gated fusion of visual and text information. I would like to suggest the following improvements: (1) Provide more justification for the design of the guided-attention module and the gated module. (2) Explain whether the final results are affected by randomness. (3) Conduct additional experiments to demonstrate the potential significance of this work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is slightly lacking in novelty, but, in general, it is well-shaped with good motivation and topic selection.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) model for VQLA in surgical scenarios, which fuses visual and text embeddings to improve object localization. Experiments on surgical videos from the EndoVis Challenge datasets demonstrate superior performance of the method compared to the state of the art in end-to-end, real-time applications. The reviewers highlight the clinical motivation, the effectiveness of the proposed approach, the improved performance compared to recent work, and the clear presentation and reproducibility as the strengths of the paper.

    The main criticisms of the work are concerns regarding lack of novelty in the algorithmic approach (combining prior modules), concerns regarding randomness in the results, and clarification regarding the ablation studies. More thorough ablation studies on some of the components are also recommended.

    The following points should be addressed in the rebuttal:

    • Justification for the design of the guided attention module and gated module (and comments on the originality of the co-attention between the self and guided attention modules)
    • Clarification whether the final results are affected by randomness
    • Clarification regarding ablation results presented in Table 2, how each fusion strategy is incorporated, and the use of DeiT




Author Feedback

We thank the reviewers (R) for their critical assessment, insightful suggestions, and overall positive ratings (6, 5, 4) for our paper. We also appreciate the meta-reviewer (MR) for granting us the opportunity to clarify the major critiques as follows:

Justification of guided attention vs. cross-attention, and of the design of the guided-attention and gated modules (MR, R1, R2, R3): Existing cross-attention for text-vision fusion either uses vision to guide text and text to guide vision simultaneously (Parthasarathy et al., STL 2021) or performs weighted fusion (Praveen et al., FG 2021), whereas our guided attention only uses text to guide vision. It takes the key and value from the text embedding and the query from the visual embedding, which better helps the model focus on the image context relevant to the question.
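
To make this concrete, below is a minimal, hypothetical PyTorch sketch of such text-guided attention (module and variable names are illustrative, not the authors' released implementation):

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Illustrative sketch: visual tokens query the text tokens (Q from vision, K/V from text)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_v, dim) image tokens; text: (B, N_t, dim) question tokens
        guided, _ = self.attn(query=visual, key=text, value=text)
        return guided  # (B, N_v, dim): one output token per visual position, enriched by question context
```

In the actual model this unit would typically be wrapped with residual connections and feed-forward layers; the sketch only shows the attention pattern.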

Regarding originality: the self-attention unit is a widely used mechanism, and the guided-attention unit is similar to that of [24] (as R2 noted). However, [15, 19, 24] only concatenate or sum the multimodal outputs, and [24] uses a simple MLP after the co-attention modules. Naive fusion with an MLP cannot fully exploit the multimodal input. We therefore use the gated module, rather than naive concatenation or summation, for better fusion, and DeiT, rather than an MLP, for better feature interaction. The gated module constrains and balances the weights of the two modalities and searches for the best intermediate combination of features. The ablation study with and without the gate also demonstrates its effectiveness.
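
As an illustration of this gated balancing (a sketch under our own assumptions, not the paper's exact formulation), a learned sigmoid gate can produce per-feature weights that trade off the two modalities:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion: a sigmoid gate weighs visual vs. text features instead of plain concat/sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual, text: (B, N, dim) embeddings of matching shape
        z = self.gate(torch.cat([visual, text], dim=-1))  # per-feature weights in (0, 1)
        return z * visual + (1.0 - z) * text              # convex combination of the two modalities
```

Because the gate is learned jointly with the rest of the network, it can suppress an uninformative modality per feature instead of committing to a fixed concatenation.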

Randomness and multi seeds (MR, R3): We have conducted additional validation on 5 random seeds and our method still outperforms the SOTA methods. The average results are shown below and will be updated in the final manuscript:

Method | Endo18 (Acc, F-Score, mIoU) | Endo17 (Acc, F-Score, mIoU)
VB | 0.627, 0.333, 0.739 | 0.401, 0.338, 0.707
VBRM | 0.630, 0.339, 0.735 | 0.419, 0.337, 0.714
MCAN | 0.629, 0.334, 0.753 | 0.414, 0.293, 0.703
DeiT | 0.610, 0.316, 0.734 | 0.380, 0.286, 0.691
MUTAN | 0.628, 0.340, 0.764 | 0.424, 0.348, 0.722
MFH | 0.628, 0.325, 0.759 | 0.4103, 0.350, 0.722
BT | 0.620, 0.329, 0.765 | 0.422, 0.352, 0.729
Ours | 0.645, 0.332, 0.771 | 0.449, 0.362, 0.732

Clarification of the ablation in Table 2 (MR, R1, R2): The ablation uses the same feature extraction (R2) and the same DeiT backbone (R1, MR); we only change and compare the fusion methods (in Fig. 1, after the visual and text embeddings and before DeiT). In Table 2, the rows without Gate denote that we use different attention mechanisms to boost self-/mutual interaction between the two modalities and then concatenate the visual and text embeddings. The rows with Gate denote that, after the attention modules, we use the gated module for fusion instead of concatenation.
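
In other words, only the fusion step is swapped between ablation rows; a hypothetical sketch of the switch (names and the concatenation axis are illustrative, not the released code) is:

```python
import torch

def fuse_embeddings(visual: torch.Tensor, text: torch.Tensor, gate_module=None) -> torch.Tensor:
    """Table 2 ablation switch (illustrative): extractors, attention modules, and DeiT stay fixed."""
    if gate_module is not None:              # "w/ Gate" rows: gated fusion of the two modalities
        return gate_module(visual, text)
    return torch.cat([visual, text], dim=1)  # "w/o Gate" rows: plain concatenation (axis choice illustrative)
```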

Additional comparisons (R3): With the same feature extraction and DeiT backbone, additional comparisons have been added, and ours still outperforms them. These results will be added to Table 2 in the final manuscript.

Method | Endo18 (Acc, F-Score, mIoU) | Endo17 (Acc, F-Score, mIoU)
[1] JCA | 0.602, 0.301, 0.753 | 0.375, 0.284, 0.715
[2] MMHCA | 0.610, 0.312, 0.745 | 0.358, 0.300, 0.708
[3] MAT | 0.619, 0.318, 0.742 | 0.337, 0.285, 0.696

Additional ablation studies (R1): As suggested, we conducted ablations with guided attention from vision to text, bi-directional guided attention, and different numbers of attention layers. Our method yields the highest scores.

Method | Endo18 (Acc, F-Score, mIoU) | Endo17 (Acc, F-Score, mIoU)
Bi-Attn | 0.606, 0.309, 0.721 | 0.364, 0.308, 0.704
Bi-Attn Gated | 0.623, 0.312, 0.742 | 0.426, 0.359, 0.728
V-Guide-T Attn | 0.639, 0.326, 0.722 | 0.345, 0.227, 0.714
V-Guide-T Attn Gated | 0.635, 0.326, 0.760 | 0.430, 0.354, 0.707

Number of attention layers in CAT-ViL:

Layers | Endo18 (Acc, F-Score, mIoU) | Endo17 (Acc, F-Score, mIoU)
2 | 0.621, 0.310, 0.769 | 0.457, 0.340, 0.735
4 | 0.626, 0.335, 0.755 | 0.436, 0.340, 0.718
6 | Ours
8 | 0.636, 0.307, 0.770 | 0.462, 0.327, 0.725
10 | 0.631, 0.314, 0.770 | 0.388, 0.302, 0.726

Reproducibility (R3): We have already provided our anonymous code repository and reproduction instructions in the supplementary material. We will also make them public.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addresses the concerns regarding randomness of the results and multiple seeds by presenting additional validation results, and it provides clarification regarding the ablations presented in Table 2 and the use of DeiT. However, the comments regarding additional comparisons and ablation studies cannot be addressed without substantial additions to the results (which cannot be considered at this stage), and the concerns regarding novelty still remain.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors propose a visual Q&A system for endoscopic applications. The paper received mixed reviews, mainly because no novel architectural component is proposed. From my point of view, I don't see this as a major problem, as the idea is particularly novel and the evaluation is performed well.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors provide their implementation of VQA using EndoVis data with additional labels. The clinical motivation presented by the authors is sound. While there are concerns regarding novelty, as noted by the reviewers, the application side of the paper is well motivated and explained. The rebuttal does seem to address the initial concerns, except for novelty; R1, who slightly downgraded their initial score, commented that this downgrade was based on concerns regarding novelty and that they were otherwise fine with acceptance. On the whole, the experiments, including those in the rebuttal, address the major concerns other than novelty, which leans me toward accept.


