
Authors

Zhen Chen, Qingyu Guo, Leo K. T. Yeung, Danny T. M. Chan, Zhen Lei, Hongbin Liu, Jinqiao Wang

Abstract

Automatic surgical video captioning is critical to understanding surgical procedures, and can provide intra-operative guidance and post-operative report generation. At the intersection of surgical workflow and vision-language learning, this cross-modal task expects precise text descriptions of complex surgical videos. However, current captioning algorithms neither fully leverage the inherent patterns of surgery, nor coordinate the knowledge of the visual and text modalities well. To address these problems, we introduce surgical concepts into captioning, and propose the Surgical Concept Alignment Network (SCA-Net) to bridge the visual and text modalities via surgical concepts. Specifically, to enable the captioning network to accurately perceive surgical concepts, we first devise Surgical Concept Learning (SCL) to predict the presence of surgical concepts from the representations of the visual and text modalities, respectively. Moreover, to mitigate the semantic gap between the visual and text modalities in captioning, we propose Mutual-Modality Concept Alignment (MC-Align) to mutually coordinate the encoded features with the surgical concept representations of the other modality. In this way, the proposed SCA-Net achieves surgical concept alignment between the visual and text modalities, thereby producing more accurate captions with aligned multi-modal knowledge. Extensive experiments on neurosurgery videos and nephrectomy images confirm the effectiveness of our SCA-Net, which outperforms the state of the art by a large margin. The source code is available at https://github.com/franciszchen/SCA-Net.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_3

SharedIt: https://rdcu.be/dnwOD

Link to the code repository

https://github.com/franciszchen/SCA-Net

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper presents a new surgical video captioning method using visual and text encoders followed by a multi-modal decoder. To improve the results, the visual and text features are aligned with each other. The method is validated on two different datasets and compared with SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Evaluation performed on two datasets with SOTA comparison
    • Integration of two components to take surgical concepts and multi-modal concept alignment into account (SCL and MC-Align)
    • Integration of an ablation study on one dataset to demonstrate the impact of the two components

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The description of SCL and MC-Align is not always clear
    • Some SOTA comparisons were done with the image captioning method, whereas a video captioning method was also presented in that paper (V-SwinMLP-TranCAP [24])
    • Method limitations are not discussed.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some information was provided to allow reproducibility, thanks to the source code and the use of one public dataset. However, the authors indicated that the following points were included, but they are not:
    • An analysis of situations in which the method failed: the qualitative comparison only presents cases where the method works. In the supplementary material, failed cases are present but not analyzed.
    • Discussion of clinical significance: not included.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Major comments:

    1. The SCL objective for the visual modality is very clear: associate surgical concepts with the visual features, which is close to workflow recognition methods. But for the text modality, the objective is not very clear. In the example provided in Figure 2, the text and the concepts seem to be the same.
    2. In MC-Align, the authors specify that text tokens and visual tokens are averaged. What does this mean?
    3. For the Neurosurgery Video Captioning Dataset, it is specified that necessary data cleaning was performed. What does it involve?
    4. The authors chose to use SwinMLP-TranCAP, an image captioning method, as one of the SOTA methods. However, in paper [24], a video captioning method is also presented (V-SwinMLP-TranCAP). As both datasets are composed of video clips, why not use V-SwinMLP-TranCAP? Especially because, on EndoVis-2018, the best METEOR and SPICE results were obtained by this model.
    5. For the EndoVis-2018 dataset, why is the ROUGE metric not used?
    6. Adding and analyzing a case where the proposed method failed directly in the main paper would be a plus.
    7. In the video provided as supplementary material, it is difficult to understand whether the method provides the correct prediction or not.

    Minor comments:

    1. The following sentence: “However, given the differences between two modalities with separate encoders, it is inappropriate for the decoder to directly explore the cross-modal relationship between visual and text tokens [26,24].” is redundant with this passage in the introduction: “Second, existing studies [9,26,24] simply processed visual and text modalities in sequential, while ignoring the semantic gap between these two modalities. This restricts the integration of visual and text modality knowledge, thereby damaging the captioning performance.”
    2. Will the Neurosurgery Video Captioning Dataset be made public?
    3. Were the SOTA methods reimplemented, or were the results extracted from the articles? In the second case, there is a typo in the SwinMLP-TranCAP METEOR score: the paper reports 31.3, not 31.1.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper needs some clarification of the SCL and MC-Align methods and a discussion of the limitations, which could easily be addressed. The major limitation is the use of an image captioning method instead of a video captioning one for the SOTA comparison, especially because that method outperformed the one proposed by the authors on two metrics.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a novel approach to automatic surgical video captioning by introducing the concept of “Surgical Concept Learning”, which bridges the gap between the visual and text modalities. To this end, the authors introduce a Surgical Concept Alignment Network (SCA-Net), which is trained to predict the presence of surgical concepts from the representations of the visual and text modalities, as well as to generate a caption. The proposed method achieves state-of-the-art results on two datasets, and its effectiveness is further demonstrated through ablation studies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method leverages surgical concepts to bridge the gap between visual and text modalities. This is a novel approach that improves results compared to current approaches.
    • The evaluations and ablation studies are comprehensive and show the effectiveness of the proposed approach.
    • The writing is clear.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is unclear to what extent “Tool-Tissue Interaction” annotations are needed. To my knowledge, the EndoVis image captioning dataset does not include such labels, therefore it is unclear how the method could be applied to that dataset.
    • The authors state that they “utilize the Vision Transformer (ViT) with causal mask as the text encoder”. This is confusing, as the Vision Transformer is an image encoder and cannot take text as input without modifications. Whether the authors meant “Transformer”, or are using a ViT with certain modifications, is not currently clear.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will release the code upon acceptance (indicated in the paper / checklist), and also the dataset (indicated in the checklist).

    It is however unclear, if the authors intend to release their newly collected large Neurosurgery Video Captioning Dataset, or only refer to the already public EndoVis Image Captioning Dataset. I would strongly encourage the authors to publish the Neurosurgery Video Captioning Dataset, if possible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It would be important to clarify which labels are needed in the datasets to accomplish “Surgical Concept Learning”. Specifically, are “Tool-Tissue Interaction” labels necessary? If yes, can you clarify how you modified your approach for EndoVis?

    • Please clarify the confusion mentioned in the second weakness. If a vanilla Transformer is used as the text encoder, this should be stated as such. If for a certain reason a Vision Transformer is employed, the motivations and how this is achieved with text input should be stated very clearly.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper tackles the interesting task of surgical video captioning, and proposes a novel and effective method, surgical concept learning to this end. The clarity of the paper would be improved, if the authors make the suggested changes.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents an interesting learning-based approach for surgical video captioning. The topic is interesting and very active in the CAI community.

    Main contribution is the introduction of Surgical Concept Learning (SCL) and Mutual-Modality Concept Alignment (MC-Align) and their implementation to the SCA-Net.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow, with most concepts adequately explained and discussed. Key strengths are summarised below:

    1) SCL with MC-Align is a novel concept for surgical captioning.

    2) Experimentation with both private and public datasets is convincing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limitations are summarised as:

    1) Additional insight into the obtained results would be interesting. Specifically, what is the performance on the overrepresented versus the underrepresented SC classes?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors commit to releasing the neurosurgery dataset and code which will allow results reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would like the authors to consider the following comments/suggestions

    1) it would be very interesting to see the accuracy results on individual surgical concepts.

    • Is captioning affected for less represented concepts (Fig. S1)? Is it less accurate? This would also allow further insight into how the SC distribution in the dataset might affect performance.

    • Are the three types of SCs (instruments, targets, actions) captioned with similar accuracy?
    • Are there any similarities in the captioning performance of similar types of SCs across the two datasets/procedures? I.e., is it easier to caption instruments than targets/actions?

    2) I believe that further relationships exist between SCs: specific instruments operate on specific targets, performing specific actions. Could these relationships be exploited to further improve performance? If so, could this be integrated into SCA-Net? Please comment/elaborate.

    3) Are the two modalities equal contributors? I would assume not. Is there a way to assess the contribution of each modality and take this into consideration in the MC-Align step? Please comment/elaborate.

    4) In connection with comment 3), it would be very interesting to explain why the loss coefficient λ1 (0.1) of L_SCL is 10 times larger than λ2 (0.01) of L_MCA. How sensitive is the model to the lambda hyperparameters?

    5) Page 5, paragraph 2: I would appreciate a bit more clarification on the MC-Align process. It would be nice to link this part with Fig. 2, in particular the following statements: “For text modality, we parse the label of each text token and average text tokens of each surgical concept as tc and update historical text concept representations” and “For visual modality, we average visual tokens as the representation of each surgical concept present in the input video and update historical visual concept representations”.

    6) A comparison against SwinBERT [13] would be very interesting.

    7) How does MC-Align compare to the following two recent semantic alignment methods?

    i) D. Wu, X. Dong, L. Shao and J. Shen, “Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 4986-4995, doi: 10.1109/CVPR52688.2022.00494.

    ii) Zejun Li, Zhihao Fan, Huaixiao Tou, Jingjing Chen, Zhongyu Wei, and Xuanjing Huang. “MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning.” In Proceedings of the 30th ACM International Conference on Multimedia (MM ‘22). Association for Computing Machinery, New York, NY, USA, 4395–4405. https://doi.org/10.1145/3503161.3548341

    Minor: Page 1, paragraph 1: Please review “to understanding the surgery with complicated operations, and can produce the natural language description with given surgical videos [26,24].”

    Page 2, paragraph 2: Please change “visual and text modalities in sequential” to “visual and text modalities in sequence”

    Page2, paragraph 2: Please revise “Due to the variability of lesions …”

    Page 6, Section 3.1: Please revise “into 11, 004 thirty second video clips with clear surgical purposes.”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In my opinion, the paper introduces a new methodology for surgical captioning, including two interesting and novel concepts, SCL and MC-Align.

    The theoretical formulation of SCA-Net is sound and the experimentation is thorough.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper claims to have developed methods to caption surgical videos. A proof-of-concept method is shown that appears to have better metrics than other methods. The improvements claimed in this method include a multimodal decoder and aligning visual and text tokens.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths include an interesting way to use annotations in an existing dataset. There is an incremental advance with using an alignment module for the video and text tokens. Evaluation was done on two datasets and against multiple other networks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The conclusions overreach the inferences that can be drawn from the evaluation.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will reportedly be made available. Accessibility of the neurosurgery dataset is unclear. It is not clear what code was used to implement the state-of-the-art networks.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The data are not clear. What do the “surgical captions” look like in the two datasets used in this study?

    2. How were the other state-of-the-art algorithms implemented? How is it assured that the algorithms were developed and evaluated with the same rigor as that used for the current algorithm?

    3. Data were split at the patient level for algorithm training, but computing the evaluation metrics did not account for correlation in algorithm performance between clips/images from the same patient. For an early-stage evaluation such as this study, I don’t think this is a critical limitation. However, findings and conclusions must be tempered to respect this limitation.

    4. Were evaluation metrics for the other algorithms computed using the same partitions of the data?

    5. Were any of the test set data used to select hyperparameters?

    6. Ablation studies are fine. However, missing ablation studies include the effect of different values for the loss coefficients that were empirically set. How were they “empirically set”, by the way?

    7. Qualitative study findings are interesting, but they show cherry-picked instances where the algorithm seems to do well.

    8. The text within Figure 1 in the supplement is too small.

    9. The conclusion implies more confidence in the findings than warranted.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting idea and the experiments were well designed. Some details of the implementation are unclear. The conclusions claim more than what is supported by the findings in this early stage evaluation of a new algorithm.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present their work performed on neurosurgical and nephrectomy videos. While on an initial read the idea of a “surgical concept” sounds similar to the action triplets popularized by the CholecT45 dataset, the authors’ work appears to extend beyond simple three-word lists of {instrument, verb, target}. Figure 2 illustrates their Surgical Concept Alignment Network, which takes multimodal text and visual data through a surgical concept learner and a mutual-modality concept alignment.

    The paper has strengths: 1) The overall notion of surgical concepts as an approach to tackle surgical video captioning is well motivated and explained, and has some novelty as well. 2) Ablation studies demonstrate the contribution of the component elements of the approach (SCL and MC-Align) to its performance. 3) The authors demonstrate their approach against other SOTA approaches. It is of interest, however, that the authors compare to an image captioning method rather than to video captioning approaches (Reviewer #3).

    A few key weaknesses in the study that are potentially outweighed by the strengths but should be addressed: 1) Reviewer #1 noted that data were split at the patient level for algorithm training, but computing the evaluation metrics did not account for correlation in algorithm performance between clips/images from the same patient. Were such correlations a concern? 2) Clarification would be helpful, as Reviewer #4 noted: the authors state that they “utilize the Vision Transformer (ViT) with causal mask as the text encoder”. This is confusing, as the Vision Transformer is an image encoder and cannot take text as input without modifications. Whether the authors meant “Transformer” or are using a ViT with certain modifications is not currently clear. 3) While overall performance is reported, I, too, am curious, like the reviewers, about performance on individual concepts. Are the three types of SCs (instruments, targets, actions) captioned with similar accuracy? As Reviewer #2 notes, are there any similarities in the captioning performance of similar types of SCs across the two datasets/procedures, i.e. is it easier to caption instruments than targets/actions? 4) The qualitative analysis does not add much to the paper, as it only demonstrates one example, chosen by the authors, where their method had strengths versus the Swin methods. This space may be better utilized addressing the clarifications suggested above and by the reviewers.




Author Feedback

We thank the AC and Reviewers for their time and effort in reviewing our paper. Here, we attempt to clarify the main concerns raised by the AC and Reviewers.

Q: What are the two datasets used for “surgical captions” in this study? A: One is the public EndoVis-2018 dataset for surgical image captioning, and the other is the collected neurosurgery dataset for surgical video captioning. These two datasets confirm the effectiveness of our captioning framework on both images and videos.

Q: How were the other state-of-the-art algorithms implemented? How is it assured that the algorithms were developed and evaluated with the same rigor as that used for the current algorithm? A: The state-of-the-art algorithms are open-sourced, and we implement them on the neurosurgery dataset under the same training and evaluation settings as our algorithm.

Q: Were evaluation metrics for the other algorithms computed using the same partitions of the data? A: The evaluation metrics of our algorithm and the other algorithms are computed in the same manner.

Q: In MC-Align, the authors specify that text tokens and visual tokens are averaged. What does this mean? A: In MC-Align, the averaged tokens in the text and visual modalities are called historical concept representations, which represent the semantics of each surgical concept in each modality. This averaging operation is consistent with widely used prototype learning. We pull the concept representation toward the corresponding concept semantics and push it away from the other concept semantics, as formulated in Eq. (2).
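To help picture this step, below is a minimal sketch of such prototype-style averaging and push/pull alignment in PyTorch. The function names, EMA momentum, and temperature are illustrative assumptions, not the authors' actual implementation of MC-Align or Eq. (2).

```python
import torch
import torch.nn.functional as F

def update_concept_prototypes(tokens, concept_ids, prototypes, momentum=0.9):
    """Average the tokens of each surgical concept in one modality and use the
    result to update the historical concept representations (EMA prototypes).

    tokens:      (N, D) encoded tokens of one modality (visual or text)
    concept_ids: (N,)   surgical-concept label of each token
    prototypes:  (C, D) historical concept representations for this modality
    """
    for c in concept_ids.unique():
        concept_mean = tokens[concept_ids == c].mean(dim=0)  # average tokens of concept c
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * concept_mean
    return prototypes

def cross_modal_alignment_loss(features, concept_ids, other_modality_prototypes, tau=0.07):
    """Pull each encoded feature toward the prototype of its own concept from the
    other modality and push it away from the other concepts (InfoNCE-style)."""
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(other_modality_prototypes, dim=-1)
    logits = feats @ protos.t() / tau  # (N, C) similarity to every concept prototype
    return F.cross_entropy(logits, concept_ids)
```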

Q: For the Neurosurgery Video Captioning Dataset, what did the necessary data cleaning involve? A: Some video clips in the neurosurgery videos do not have clear surgical purposes, and we discarded these samples when preparing the dataset. We will add more details about this dataset in the camera-ready version.

Q: Will the Neurosurgery Video Captioning Dataset be made public? A: Currently, we do not claim releasing the dataset as a contribution of this paper. We also need to go through the approval process for releasing this dataset.

Q: Were the SOTA methods reimplemented, or were the results extracted from the articles? A: For the public EndoVis-2018 dataset, we collect the results reported in their papers. For the neurosurgery video captioning dataset, we train and evaluate the state-of-the-art methods with their open-source code under a fair comparison.

Q: In the second case, there is a typo in the SwinMLP-TranCAP METEOR score; the paper reports 31.3, not 31.1. A: Thanks for pointing out this typo. We will fix this mistake in the camera-ready version.

Q: It would be important to clarify which labels are needed in the datasets to accomplish “Surgical Concept Learning”. Specifically, are “Tool-Tissue Interaction” labels necessary? If yes, can you clarify how you modified your approach for EndoVis? A: Tool-Tissue Interaction (TTI) is a principle for describing surgical maneuvers. CIDA [13] asked surgeons to annotate the EndoVis-2018 dataset according to this principle. Consistent with previous works [13, 24, 26], our neurosurgery video captioning dataset is also prepared according to this TTI principle.

Q: Please clarify the confusion mentioned in the second weakness. If a vanilla Transformer is used as the text encoder, this should be stated as such. If for a certain reason a Vision Transformer is employed, the motivations and how this is achieved with text input should be stated very clearly. A: After the input-specific tokenization step, ViT can flexibly handle token inputs of different dimensions, including 3D videos, 2D images, and 1D text. We adopt this description of the text encoder for consistency with the visual encoder.
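As an illustration of this answer (not the authors' code), the sketch below shows how a ViT-style encoder can serve as a causally masked text encoder: once a modality-specific embedding step has produced a sequence of D-dimensional tokens, the transformer blocks are agnostic to whether those tokens came from video patches, image patches, or words. The class name, dimensions, and use of PyTorch's built-in encoder layers are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ViTStyleTextEncoder(nn.Module):
    """ViT-style encoder reused for 1D text: only the tokenization/embedding step
    is modality-specific; the transformer blocks just see a token sequence."""

    def __init__(self, vocab_size, dim=768, depth=6, heads=8, max_len=77):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)               # replaces ViT's patch projection
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))  # learnable positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):
        # token_ids: (B, L) integer word indices
        B, L = token_ids.shape
        x = self.tok_embed(token_ids) + self.pos_embed[:, :L]
        # Boolean causal mask: True entries block attention to future tokens.
        causal_mask = torch.ones(L, L, device=token_ids.device).triu(1).bool()
        return self.blocks(x, mask=causal_mask)  # (B, L, dim) text token features
```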


