
Authors

Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, Hongliang Ren

Abstract

Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, exponentially increasing its use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, for visual question answering (VQA) tasks that require both vision and language processing, models with bi-directional attention or models employing fusion techniques are often employed to capture the context of multiple modalities all at once. As GPT does not natively process vision tokens, to exploit the advancements in GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands the GPT2 model to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention in GPT models and their ability to generate coherent long paragraphs, we carefully sequence the word tokens before vision tokens, mimicking the human thought process of understanding the question to infer an answer from an image. Quantitatively, we prove that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets to include question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing, token type and pose embedding for vision tokens in the LV-GPT model.
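To make the architecture described above concrete, the following is a minimal PyTorch-style sketch of sequencing word tokens before projected vision tokens in a GPT2 backbone. The class name, the ResNet18 vision tokenizer, the token-type embedding, and the final-token classification head are illustrative assumptions, not the authors' released implementation (see the code repository linked below).

```python
# Illustrative sketch only (not the authors' released code): a GPT2 backbone that
# receives question word tokens first, followed by projected vision tokens.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import GPT2Model


class LVGPTSketch(nn.Module):
    def __init__(self, num_answers, n_embd=768):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        # Vision tokenizer (assumption: ResNet18): 7x7 feature map -> 49 tokens of dim 512.
        cnn = resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        self.vis_proj = nn.Linear(512, n_embd)        # learnable vision-token embedder
        self.type_emb = nn.Embedding(2, n_embd)       # 0 = word token, 1 = vision token
        self.classifier = nn.Linear(n_embd, num_answers)

    def forward(self, input_ids, attention_mask, image):
        word_emb = self.gpt2.wte(input_ids)                              # (B, Lw, 768)
        vis_tokens = self.backbone(image).flatten(2).transpose(1, 2)     # (B, 49, 512)
        vis_emb = self.vis_proj(vis_tokens)                              # (B, 49, 768)
        # Token-type embeddings distinguish the two modalities.
        word_emb = word_emb + self.type_emb(torch.zeros_like(input_ids))
        vis_type = torch.ones(vis_emb.shape[:2], dtype=torch.long, device=vis_emb.device)
        vis_emb = vis_emb + self.type_emb(vis_type)
        # Word tokens BEFORE vision tokens ("understand the question, then look at the image").
        inputs_embeds = torch.cat([word_emb, vis_emb], dim=1)
        vis_mask = torch.ones(vis_emb.shape[:2], dtype=attention_mask.dtype,
                              device=attention_mask.device)
        mask = torch.cat([attention_mask, vis_mask], dim=1)
        hidden = self.gpt2(inputs_embeds=inputs_embeds, attention_mask=mask).last_hidden_state
        return self.classifier(hidden[:, -1])  # answer-class logits from the final token
```

In this simplified sketch, GPT2 still adds its default positional embeddings to the whole sequence; the paper additionally studies token-type and pose embeddings specifically for the vision tokens.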

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_27

SharedIt: https://rdcu.be/dnwO1

Link to the code repository

https://github.com/lalithjets/SurgicalGPT

Link to the dataset(s)

https://github.com/lalithjets/SurgicalGPT


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a language-vision model that expands the GPT2 model to include vision tokens. The contributions of the paper are: 1) A new method that augments the GPT2 model with a learnable vision token extractor. 2) A method to combine text and visual tokens. 3) A new annotated dataset for Surgical VQA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper proposes a new method to combine visual and text embeddings as input to a transformer model by sequencing the embeddings of the word tokens before the vision tokens.

    2) Thorough evaluation that compares the LV-GPT to other SOTA VQA transformer models, with an ablation study on the models used to obtain the vision embeddings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The idea of passing a linearized embedding vector representation of the images along with text to a transformer has been studied in previous works such as Unicoder-VL.

    2. The reasoning behind selecting certain baseline models for comparison is not evident. Although some models like Mutan have been used for comparison, others like BEiT have not been included in the selection.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors clearly describe their experimental and hardware setup and the parameter configurations of the model. Further, the authors, upon acceptance, will publish the code, datasets, and the pre-trained model. With this information, I’m confident that the results in this paper can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The use of pose embeddings and positional embeddings interchangeably is confusing. It would benefit the reader if consistent terminology was used across the paper.

    The training methodology is not sufficiently elaborated in the paper. It would be beneficial to gain more insights regarding the training of the vision tokenizer model. For example, do the vision tokenizers have an independent training process?

    It’s counter-intuitive to me that zero position embeddings for all the vision tokens provided better results. It would be interesting to hear the authors’ perspective on why this is the case.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper was well written, with well-designed experiments. However, the paper did not clearly explain the choice of baseline models against which the proposed approach was evaluated.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a language-vision GPT for surgical VQA, which combines the uni-directional GPT2 with visual tokens.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper applies the highly popular GPT model to surgical VQA.

    2. The proposed method achieves good performance on three datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method is not new. Large vision-language models are a popular topic in computer vision, e.g., the VisualBERT model mentioned in the paper. Compared to these general models in CV, the proposed method offers limited methodological improvement and, in particular, limited insight into the surgical problem.

    2. This might not be a problem specific to this paper, but I think the surgical VQA task is overly simplified. It mostly queries a single word or a very short answer using basic questions, without the need for high-level reasoning or comprehensive surgical knowledge. It is more like a naive classification problem than real VQA (Fig. 3). The authors motivate VQA by arguing that it will reduce the burden on senior surgeons, but I think these basic questions can be readily answered by novices. So the clinical usefulness is unknown.

    More importantly, although this is an existing problem with previous datasets, the newly proposed dataset in this paper does not attempt to address it and thus provides limited added value.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is good given the released codes.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would recommend constructing a challenging VQA dataset that requires a deep understanding of the procedure. That would be a major contribution.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the limited added value of the method and dataset, I would recommend as above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present their work on creating an end-to-end trainable GPT model with vision-language processing capabilities for surgical scenes, based on EndoVis18-VQA and Cholec80-VQA as well as VQA annotations that the authors added to the PSI-AVA dataset.

    In their work, the authors utilize GPT2 as the language model for queries and vision models (ResNet18, Swin, and ViT) for visual tokens to generate word and vision embeddings. The goal is to classify frames into relevant outputs based on the query. VisualBERT and other SOTA models were used for comparison. The authors demonstrated improved performance on most metrics, particularly for Cholec80-VQA.

    Strengths of the work include a clear description of the token embeddings and the good performance of the model on the proposed task. The topic itself is hugely popular at the moment.

    A criticism of the study is its relative lack of novelty, given how active the topic has been and the growing interest in vision-language models. Additional clarification would help address the perceived weaknesses of the study:

    1. Additional information about the training of the vision tokenizer model would be important to further understand the methods.

    2. The authors themselves note that they had hypothesized that positional embeddings would improve performance but found that zero positional embeddings performed better. While the authors state that future work should investigate the role of positional embeddings in these networks, it would be helpful to understand why the authors think these results were found and whether any experimentation was done to explore this seemingly counter-intuitive finding.

    3. Given the datasets used, it would be helpful to understand whether there was a clear restriction or recommendation on the format of the language queries to optimize performance. Given the idea of using natural language queries to obtain visual results, understanding the effect of query type, or at least a discussion of the impact of query format, would provide helpful context.

    4. The authors explain their use of VisualBERT as a comparison but do not motivate the selection of the other comparison models, nor the choice of ResNet18, ViT, and Swin as visual tokenizers over other methods.




Author Feedback

We thank reviewers (R1,R3,MR) for the positive comments:

  • Propose end-to-end trainable GPT2 model, expands GPT2 for vision-language processing (R1,MR)
  • New visual and text embedding combination method (R1)
  • Popular topic (R3,MR)
  • Good performance (R3,MR)
  • Thorough SOTA evaluation and ablation (R1)
  • Novel dataset (R1,MR)
  • Code reproducibility & Good paper writing (R1,R3)

Response to feedback:

  1. Impact of query format (MR, R3). The test query format (used for the results in the paper) is similar to the queries in the training set. Based on MR’s input, we evaluated our model’s performance on EndoVis-18 after modifying (rephrasing & substituting synonyms in) the test query formats. Observation: our models remain robust to the modified queries and still surpass most SOTA models in Table 1 (whose test queries are similar to the train queries). LV-GPT(RN18): Acc:0.598|F1:0.386; LV-GPT(Swin): Acc:0.606|F1:0.413; LV-GPT(ViT): Acc:0.631|F1:0.413.

  2. Authors’ view on why zero position embeddings for vision tokens provided better results (R1, MR), and was this counter-intuitive finding explored? (MR). Upon in-depth analysis, we observe that our RN18 (CNN)-based model improved with position embedding (PosEmb) (Tables 3 & 4). In the ViT/Swin (transformer)-based models, PosEmb is already incorporated at the ViT/Swin layer, and adding PosEmb at the GPT level results in double PosEmb. The “zero-position” result can thus be interpreted as “our model only requires one layer of PosEmb”. We will update this analysis in the paper. (A hedged illustration of this design choice is sketched after this list.)

  3. Vision-language models already exist (VisualBERT (R3), Unicoder-VL & BEiT (R1)); given the hot topic & growing interest in vision-language models, this work lacks novelty (R3, MR). Explain the choice of baseline models (R1, MR) and the training methodology of the vision tokenizer model (MR). Most vision-language models (VisualBERT, BEiT and Unicoder-VL) are bidirectional encoders, whereas GPT is a unidirectional decoder (generator). GPT’s text-generative ability motivated us to integrate vision without changing its core structure, which lets us exploit its generative ability and extend it easily when trained on comprehensive surgical knowledge. We agree that large models are a hot topic (R3, MR). In the GPT-based generative-model domain, we propose an end-to-end trainable language-vision GPT model (R1, MR). Before the MICCAI deadline (9 Mar), ChatGPT (GPT-3) only supported text input; GPT-4 (vision & text) was announced on 14 Mar. While the topic, in general, is hot, we request the reviewers to consider the MICCAI deadline date for a fair comparison. To keep our end-to-end model a unidirectional decoder, we did not integrate it with a multi-modal encoder. Instead, we employed standard pre-trained ResNet18/ViT/Swin backbones for the vision features, to experiment with both CNN and transformer models. The vision token (patch) features from these models are then passed through an embedder (a learnable linear layer) and through type and position embeddings. We also provide results for the encoder-only BEiT-VQA model (R1) on EndoVis-18: Acc:0.595|F1:0.270; it performs below our model.

  4. Surgical VQA is overly simplified, querying a single word using basic questions (R3). Identifying tool actions and the surgical phase also requires context awareness and deep reasoning over both the vision input and the question. We present this work as a foundation for multi-modality input for text-generative tasks in the surgical domain. Given GPT’s generation ability, and since our core GPT remains unchanged, our work can easily be extended to complex text generation without significant architectural change by training on comprehensive surgical knowledge. In the absence of such a complex dataset, we incrementally improve both model and dataset. Furthermore, our model is end-to-end trainable and our code is open-sourced, allowing other researchers to extend it easily. To demonstrate the model’s sentence-generation ability, we naively trained our end-to-end LV-GPT to answer in sentence form in a generative way (BLEU-3: 0.610, BLEU-4: 0.574). We also benchmarked our model on more complex text queries (Response 1).
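To make the vision-token embedding discussed in Responses 2 and 3 concrete, the following is a minimal PyTorch-style sketch. The class name, shapes, and initialization are illustrative assumptions rather than the released SurgicalGPT implementation.

```python
# Illustrative sketch only (class name, shapes, and initialization are assumptions): a
# vision-token embedder with exactly one layer of positional embedding for CNN features
# and a "zero" positional embedding for ViT/Swin features, as argued in Response 2.
import torch
import torch.nn as nn


class VisionTokenEmbedder(nn.Module):
    def __init__(self, feat_dim, n_embd, num_vis_tokens, backbone_is_transformer):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_embd)   # learnable linear embedder for vision features
        self.type_emb = nn.Parameter(torch.zeros(1, 1, n_embd))  # "vision" token-type vector
        if backbone_is_transformer:
            # ViT/Swin already encode position internally; adding PosEmb again at the GPT
            # level would double-count it, hence the zero-position choice.
            self.pos_emb = None
        else:
            # ResNet18 patch features carry no explicit position, so learn one PosEmb layer.
            self.pos_emb = nn.Parameter(torch.zeros(1, num_vis_tokens, n_embd))
            nn.init.normal_(self.pos_emb, std=0.02)

    def forward(self, vis_feats):                  # vis_feats: (B, num_vis_tokens, feat_dim)
        vis_emb = self.proj(vis_feats) + self.type_emb
        if self.pos_emb is not None:
            vis_emb = vis_emb + self.pos_emb       # exactly one layer of position embedding
        return vis_emb
```

For example, a ResNet18 backbone might use `VisionTokenEmbedder(512, 768, 49, backbone_is_transformer=False)`, while a ViT/Swin backbone would pass `backbone_is_transformer=True` and rely on its internal positional encoding.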




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I appreciate the authors’ response regarding zero PE, and it intuitively makes sense. I am not fully convinced regarding novelty, though I note the difference between BERT and GPT as stressed by the authors. That being said, the authors do provide a complete rebuttal to all points of clarification requested at meta-review and provide additional data that further supports their work and answers the questions raised at the initial review.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents an end-to-end trainable multi-modality Language-Vision GPT (LV-GPT) model for VQA tasks in robotic surgery, obtained by expanding the language-only GPT model to incorporate vision tokens. The results are validated on the EndoVis18 and Cholec80 datasets as well as a newly annotated dataset for VQA.

    The paper is well written and well presented, the validation experiments are thorough, and the results are promising. The main concerns regard the weak novelty of the proposed approach, the over-simplification of the VQA task, the added value of the introduced dataset, and the justification for selecting the baseline methods.

    While the rebuttal addresses some of the questions/concerns regarding the choice of baseline models and the training methodology of the vision tokenizer model, and provides additional insights into the results (as well as an additional analysis showing robustness to modified query formats), other concerns regarding the incremental value/novelty of the model and of the introduced dataset remain.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors show the applicability of a visual question answering system in the surgical domain with interesting results. The authors presented the results in a comprehensive manner with an adequate comparison to SOTA. It was criticized that the questions presented to the model were not complex enough; however, I think the paper nevertheless has merit and is of value to the community. I therefore recommend acceptance.


