
Authors

Lalithkumar Seenivasan, Mobarakol Islam, Adithya K Krishna, Hongliang Ren

Abstract

Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and often overloaded with clinical and academic workloads. This overload limits the time they can spend answering questions from patients, medical students, or junior residents about surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures have been made available for them to observe and improve their skills, they still rely heavily on medical experts to answer their questions. A Surgical-VQA system serving as a reliable ‘second opinion’ could act as a backup and ease the load on medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questions on surgical procedures based on the surgical scene. By extending the MICCAI endoscopic vision challenge 2018 dataset and the workflow recognition dataset, we introduce two Surgical-VQA datasets with classification- and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and of temporal visual features on model performance in both classification- and sentence-based answering.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_4

SharedIt: https://rdcu.be/cVRUK

Link to the code repository

https://github.com/lalithjets/Surgical_VQA.git

Link to the dataset(s)

https://drive.google.com/drive/folders/1QbLWTg2hmeSmx4_6RExuGol_mxy0PeKD?usp=sharing


Reviews

Review #1

  • Please describe the contribution of the paper

This paper introduces a VisualBERT- and ResMLP-based model for Surgical-VQA. It is evaluated on answer-classification and sentence-generation tasks on three datasets and achieves improved results over previous methods and the baseline. Visualizations of the generated words/sentences and ablation studies are presented.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The introduced model is evaluated on three datasets and achieves promising results. Rich experiments are conducted on two tasks, along with module ablations.

2. Leveraging VisualBERT and ResMLP for Surgical-VQA is novel and is shown to be effective compared with previous methods. This helps promote related work with similar data scales and requirements.

3. Ablative studies on temporal visual features are also given, which is helpful for related video tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The novelty of the proposed VisualBert ResMLP is limited, considering that it is a combination of existing VisualBert and ResMLP modules.

2. VisualBert ResMLP achieves performance very close to VisualBert alone in Tables 1 and 2, as can also be seen in Figure 4(c). This makes me question the effectiveness of adding the ResMLP module.

3. In Table 1, the accuracy of MedFuse differs from the results reported in the original MedFuse paper. Can the authors elaborate on this?

4. How long does training take, and what are the GPU requirements for these experiments?

5. The font size should be increased for better visibility in Figures 2, 3, and 4.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide code and documentation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to previous comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I think this paper is an interesting paper proposing a BERT based framework on Surgical-VQA task. I hope the authors can address my concerns especially question 3 in the ‘Weakness’ and I’m willing to raise my score.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This work proposes to use VisualBERT[15] together with ResMLP[24] to approach the task of surgical VQA. This paper presents results in three datasets: Med-VQA, Endovis18-VQA, and Cholec80-VQA. Furthermore, it reports an ablation study on the proposed architecture.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The task of surgical VQA is relevant to the medical image analysis community.
    • The paper is easy to read.
    • This work reports a complete ablation study over the model hyperparameters.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The principal weaknesses of the paper are listed below:

• This work does not present technical novelty. All the individual components already exist and are already implemented; the paper presents a combination with almost no modification. Specifically, the main architecture is taken from [15], and the ResMLP is from [24].
    • The improvements by incorporating the ResMLP into the main architecture are marginal. The reported results do not demonstrate the required empirical contribution of the paper.
• The experimental setup does not allow an assessment of the model’s generalization capacity. Optimizing the architecture on the test sets may result in overfitting on the three benchmark datasets; without results on an independent set of data, that possibility cannot be ruled out.
• The data used is not public. According to the reproducibility checklist, the data will be released upon acceptance; however, the text expresses no clear intention to make it publicly available or to present it as a relevant contribution of the paper. If the data were not released, it would limit the reproducibility of the results and progress on this task. Considering that the paper provides ablation experiments, these should be performed on a validation set to guarantee generalization.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

According to the reproducibility checklist, the source code and pretrained models will be made publicly available, which is essential to guarantee the reproducibility of the results. Additionally, the method was developed using a public benchmark dataset for surgical VQA, which promotes research in the area. Despite the main text not stating any intention to publicly release the data, the reproducibility checklist mentions that it will be released upon acceptance. Furthermore, more statistics about the new data should be included. How were these annotations generated? This information would be valuable for new research on this task.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The submission instructions must be followed correctly. The paper ID must be added at the beginning of the submission.
    • The legends in Figure 4 are not consistent with the reported results in Tables 1 and 2.
    • How is the annotation generation process performed? Is it a completely automatic process?
    • Do the questions and sentences follow natural language templates?
    • It is not clear why the performance drops when using temporal features. Are there any additional insights?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the task of surgical VQA is relevant for the medical image analysis community, the lack of technical novelty and the marginal improvement of the final model significantly affect the paper’s main contributions. Additionally, it is unclear whether the data will be made publicly available upon acceptance. This work requires further modifications before being accepted.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

The authors propose a deep learning method to answer potential questions from students, conditioned on the surgical video. It uses a BERT-based model to take visual tokens and text tokens in the feature-extraction process and applies a conventional transformer-based decoder to predict the output answer. In this paper, ResMLP is proposed to enhance the representational ability of the transformer encoder. The work also proposes extended versions of the EndoVis-18 and Cholec80 datasets that include questions and answers. Finally, the method is compared with the state-of-the-art model on the Cholec80 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of the paper is the novel application: visual question answering to relieve the burden on clinical experts. It introduces VisualBERT and its ResMLP variant to achieve cross-token and cross-channel fusion during text-image feature extraction. Second is the newly proposed classification-based and sentence-based VQA dataset built on Cholec80. Finally, the proposed method is compared with state-of-the-art methods and achieves promising results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

• For the ResMLP, the figure illustration and Equations (1) and (2) are so general that the input shape, output shape, and processing are not well explained.
• Computational resources: the transformer is a high space-complexity architecture and has a fundamental disadvantage in handling long sequences. The computational resources used to train and test the model should be discussed, and the length of the input sequence should be clarified.
• ResMLP: the effect of ResMLP seems to be minor according to Table 1.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    According to the materials, the paper is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

• Beyond the weaknesses above, the authors could give more details about the motivation behind this VQA task as well as its clinical benefits. For now, the system can only act as a second opinion to roughly clarify students’ confusion. Are there any intra-operative applications that could use this system?
• The effect of the decoder is unknown. As far as I understand, the transformer decoder, unlike auto-regressive decoders, can predict the sentence all at once. Can you explain the choice of decoder, and could it be replaced with other methods such as LSTM or GRU?
• Will you make the dataset public?
• Can you compare to other captioning methods?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

High performance compared to the baseline model, and a new caption dataset.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

Since part of the VQA dataset is generated automatically, the authors should at least provide some details, pseudocode, or a figure to illustrate the pipeline.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Although the reviewers see merits in the paper and the methods, several concerns are raised. Some of the major ones include the lack of technical novelty and the marginal improvements, as well as the inconsistency of the results compared to other published papers.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12




Author Feedback

We thank the reviewers for the following positive feedback; it motivates us to strive for better.

• Novel application (R1-Q4.2, R3)
• Novel datasets (R3)
• Relevant to medical image analysis (R2)
• Introduces cross-token and cross-channel fusion into VisualBERT (R3)
• Rich experiments (R1) and ablation studies (R1, R2)
• Better than baseline models (R3)
• Paper clarity (R1, R2, R3)
• Reproducibility: code provided (R1, R3)

Response to major comments:

1. The technical novelty is limited. The proposed VisualBert ResMLP (VBRM) is not a naive combination of VisualBert (VB) and ResMLP. As recognized by R3, carefully selected modules (cross-token and cross-channel) from the MLP-based ResMLP model are used to replace the intermediate and output modules in the attention-based VB model for specific purposes: to enforce interaction among all tokens (global reasoning) and to reduce the model size (see comment 2). A minimal sketch of this substitution is given after this list. Additionally, this work contributes novel datasets and a novel application (Surgical-VQA).
2. The improvements from incorporating the ResMLP are marginal. The marginal improvement is still significant, as it is observed across all datasets (Table 1). This improvement is achieved in classification tasks with 13.64% fewer parameters, and on-par performance is achieved in sentence tasks with 11.98% fewer parameters.

Parameter size:
Task | VB | VBRM
Classification | 184.2M | 159.0M
Sentence | 209.8M | 184.7M

Additionally, a new multi-fold study on the EndoVis-18-VQA (C) dataset was conducted to further justify the significance of the improvement.

Fold | VB (Acc, F-score) | VBRM (Acc, F-score)
1 | see Table 1 | see Table 1
2 | (0.605, 0.313) | (0.649, 0.347)
3 | (0.578, 0.337) | (0.585, 0.373)
3. The MedFuse performance differs from the original paper. The code for MedFuse [20] is taken from its official GitHub repo. As the pre-trained weights are not released, the model was trained from scratch using the parameters and train/test split stated in the original paper. The difference could arise from changes in the system environment, GPU, and random seeds.
4. Will the dataset be made public? How were the annotations generated? Include statistics on the new data. The two novel datasets will be made public. Based on the tissue/tool/interaction/location/phase annotations in the EndoVis-18 [13] and Cholec80 [25] datasets, randomized Q&A pairs were generated from standard templates in an automated process (a sketch of this pipeline is given after this list). Due to the space limit, the statistics are not included in the paper; the dataset zip will include the Q&A pair generation code and the dataset statistics, and the download link will be added to our GitHub repo. To remain anonymous during review, the GitHub link was temporarily removed from the manuscript. By releasing the dataset, we aim for others to use our work as a baseline comparison model.
5. What are the GPU requirements and training time? An Nvidia GTX TITAN X was used in this work. Our model takes ~0.21 s per training batch (batch size = 50) for sentence tasks.
6. The experimental setup does not allow an assessment of the model’s generalization capacity; a validation set should be used. Some instruments/actions are present in fewer than 3 video sequences, making it difficult to split the dataset into train/val/test sets. The Med-VQA and EndoVis-18 train/test splits follow existing works [20, 19].
7. The effect of the decoder is unknown. Explain the choice of decoder; can it be replaced by LSTM or GRU? Decoder effects will be studied in future work. A plain transformer decoder is used to incorporate multi-head attention at the decoding stage; it can be replaced by an LSTM/GRU.
8. Any insights into the performance drop when using temporal features? The instrument action or surgical phase can differ between consecutive temporal frames. This could cause the temporal features to carry contradictory information, resulting in lower performance.
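For concreteness, below is a minimal PyTorch sketch of the substitution described in point 1: a ResMLP-style cross-token/cross-channel block standing in for the feed-forward (intermediate/output) sublayer of a VisualBERT encoder layer. The class and parameter names (Affine, CrossTokenCrossChannelBlock, num_tokens, hidden_dim) are illustrative assumptions, not the authors’ released implementation; see the linked repository for the actual code.

    import torch
    import torch.nn as nn

    class Affine(nn.Module):
        """Per-channel affine transform, used by ResMLP in place of LayerNorm."""
        def __init__(self, dim):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(dim))
            self.beta = nn.Parameter(torch.zeros(dim))

        def forward(self, x):
            return self.alpha * x + self.beta

    class CrossTokenCrossChannelBlock(nn.Module):
        """ResMLP-style sublayer replacing the feed-forward (intermediate/
        output) block of a VisualBERT encoder layer. num_tokens is the fixed
        combined length of the text and visual token sequence."""
        def __init__(self, num_tokens, dim, hidden_dim):
            super().__init__()
            self.norm1 = Affine(dim)
            # Linear layer over the token axis: every token linearly mixes
            # with every other token, enforcing global text-visual interaction.
            self.cross_token = nn.Linear(num_tokens, num_tokens)
            self.norm2 = Affine(dim)
            # Standard two-layer MLP over the channel axis.
            self.cross_channel = nn.Sequential(
                nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
            )

        def forward(self, x):                            # x: (batch, num_tokens, dim)
            y = self.norm1(x).transpose(1, 2)            # (batch, dim, num_tokens)
            x = x + self.cross_token(y).transpose(1, 2)  # residual cross-token mix
            x = x + self.cross_channel(self.norm2(x))    # residual cross-channel mix
            return x

    # Usage with hypothetical sizes (26 tokens, BERT-base width):
    block = CrossTokenCrossChannelBlock(num_tokens=26, dim=768, hidden_dim=3072)
    out = block(torch.randn(2, 26, 768))  # -> (2, 26, 768)

The cross-token linear layer requires a fixed sequence length, which is consistent with the fixed number of image patches and padded question length used in the paper, and it replaces a larger attention-adjacent MLP, which is one plausible source of the reported parameter reduction.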
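Reviewer #3’s post-rebuttal comment asks for pseudocode of the automated Q&A generation pipeline described in point 4. The following is a minimal sketch under stated assumptions: the label fields, template strings, and names (frame_labels, templates, generate_qa_pairs) are hypothetical stand-ins; the actual templates and generation code ship with the released dataset.

    import random

    # Hypothetical label record for one annotated frame (EndoVis-18-style
    # tissue/tool/interaction annotations); field names are illustrative.
    frame_labels = {"tool": "monopolar curved scissors",
                    "tissue": "kidney",
                    "interaction": "cutting"}

    # Illustrative question/answer templates; the real template set may differ.
    templates = [
        ("What is the state of {tool}?", "{interaction}"),
        ("What organ is being operated on?", "{tissue}"),
        ("Which tool is used for {interaction}?", "{tool}"),
    ]

    def generate_qa_pairs(labels, templates, n=2):
        """Sample n randomized question-answer pairs for one annotated frame."""
        picked = random.sample(templates, k=n)
        return [(q.format(**labels), a.format(**labels)) for q, a in picked]

    print(generate_qa_pairs(frame_labels, templates))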

Minor remarks: The legends in the figures and tables will be updated for better clarity.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper addresses an interesting topic and reports good supporting experiments. The rebuttal addresses most of the reviewers’ concerns. Although some of the reviewers did not engage in the post-rebuttal discussion, the AC finds the paper in an acceptable state provided the pre-rebuttal concerns are addressed in the final paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

A framework for Surgical-VQA (visual question answering) is presented to predict answers to questions about surgical procedures, surgical tools, and their interaction with tissue, based on the surgical scene. The work is relevant and of interest, and the method is thoroughly validated on two new datasets created for the Surgical-VQA task (by extending the MICCAI EndoVis-18 and Cholec80 datasets) and on an existing public dataset, along with thorough ablation experiments. The main criticisms of the work concerned the technical novelty of the approach, questions about the results (including the impact of the ResMLP module and the performance of one of the SOTA methods), and concerns regarding the experimental setup. The rebuttal manages to address most of the main concerns from the reviewers.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed the main concerns of the reviewers. The clinical problem is highly relevant, and the contributed datasets along with the proposed method will serve as a benchmark.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9


