
Authors

Nisarg A. Shah, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

Abstract

Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgeries. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate at short-range temporal resolution when learned jointly. However, the benefits of joint modeling of spatio-temporal features and long-range information are not taken into account. In this paper, we propose a vision transformer-based approach to jointly learn spatio-temporal features directly from a sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, namely Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer
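The gated fusion idea in the abstract can be illustrated with a minimal sketch: a sigmoid gate, computed from both streams, mixes short-term and long-term token features element-wise. This is a hypothetical illustration under assumed shapes and a simple linear gate (the function and variable names are inventions for exposition), not the authors' released implementation:

```python
import numpy as np

def gated_fusion(z_st, z_lt, W, b):
    """Illustrative gated fusion (hypothetical, not the authors' code):
    a sigmoid gate derived from both streams produces a per-element
    weight g in (0, 1) that mixes short- and long-term features."""
    # Concatenate the two streams and project to a gate of the feature dim.
    logits = np.concatenate([z_st, z_lt], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate in (0, 1)
    # Convex combination: output lies between the two streams element-wise.
    return g * z_st + (1.0 - g) * z_lt

rng = np.random.default_rng(0)
dim = 8
z_st = rng.normal(size=(4, dim))        # 4 tokens, short-term stream
z_lt = rng.normal(size=(4, dim))        # 4 tokens, long-term stream
W = rng.normal(size=(2 * dim, dim)) * 0.1  # gate projection (learned in practice)
b = np.zeros(dim)
out = gated_fusion(z_st, z_lt, W, b)
print(out.shape)  # (4, 8)
```

Because the gate is strictly between 0 and 1, each output element is a convex combination of the corresponding short-term and long-term features; in the paper the gate parameters would be learned end-to-end rather than fixed.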



Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_37

SharedIt: https://rdcu.be/dnwPg

Link to the code repository

https://github.com/nisargshah1999/GLSFormer

Link to the dataset(s)

http://ftp.itec.aau.at/datasets/ovid/cat-101/


Reviews

Review #3

  • Please describe the contribution of the paper

    The authors propose a vision transformer-based method to jointly learn spatio-temporal information from video frames sampled both in the short term and the long term. The proposed Gated Temporal Attention allows the model to better capture the relationship between short-term and long-term features. Furthermore, the proposed method can be trained in an end-to-end manner. Evaluation results show that the proposed method outperforms several state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The design of Gated Temporal Attention seems to be novel. As short-term and long-term features are input into the model at the same time, the network is able to jointly capture and weight the short-term and long-term information with Gated Temporal Attention. (2) The proposed method can be trained in an end-to-end manner, making it easier to train than two-stage designs that train a feature extraction network in the first stage and a long-term temporal modeling network in the second stage.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Lack of details about the versions of ViT and TimesFormer used in the comparison. From my point of view, the proposed method is a modification of TimesFormer with two changes. One is the input: GLSFormer takes frames sampled at different sampling rates. The other is Gated Temporal Attention, which GLSFormer uses to fuse short-term and long-term information. For ViT or TimesFormer, different model sizes perform very differently, and for TimesFormer, different input lengths of video frames also affect the performance a lot. It is important to share these details.

    (2) Small typo in the Comparison with the state-of-the-art methods section: “In contrast, ViT and TimesFormer capture short-term temporal and spatial information efficiently.” ViT cannot capture temporal information, so this is a typo. Small typo in the Datasets section: “We randomly shuffled videos and select 60, 20 and 20 videos for training, validation and testing respectively.” The dataset contains only 99 videos, while 60 + 20 + 20 = 100, so this is a typo.

    The rest of this point can be difficult to address, so it is completely OK not to address the rest of point (2). As the authors stated, six models are ResNet-based, whereas GLSFormer is a TimesFormer-based model. The pre-training datasets are also quite different (ImageNet vs. K400). Could the ResNet in the six models also be pre-trained on K400? Could ViT or other advanced feature extraction networks replace ResNet in these six models for a fairer comparison? As GLSFormer already seems to outperform TimesFormer, I think the authors have already justified their method. The rest of point (2) can be considered future work, or addressed in detail if the paper is accepted and extended for a journal.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I did not see any problem on the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Besides the points mentioned in the previous section: (1) The authors only tested the design with temporal attention first and spatial attention second. Similar to TimesFormer, do you think testing the following designs would be beneficial to readers? (a) Spatial attention first, temporal attention second. (b) Spatial and temporal attention jointly. This can be addressed in future work.

    (2) It seems that in the final design, 8 frames are used for both the short stream and the long stream. Would using more frames improve the performance of the model? In particular, would using more frames in the long stream improve performance, since it could capture more long-term information? This can be addressed in future work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The design of Gated Temporal Attention seems novel to me. The authors modified TimesFormer and justified their design with experiments. The model can be trained end-to-end and outperforms some of the previous two-stage designs.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    1. The paper proposes a new approach for automated surgical step recognition using a vision transformer-based method that jointly learns spatio-temporal features directly from a sequence of frame-level patches.
    2. The proposed approach incorporates a gated-temporal attention mechanism to combine short-term and long-term spatio-temporal feature representations, resulting in superior performance compared to existing methods.
    3. This contribution is significant as it can improve patient safety and decision-making during surgeries by accurately recognizing the different steps involved in the surgery process.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel approach: The proposed method is a novel approach for automated surgical step recognition that uses a vision transformer-based method to jointly learn spatio-temporal features directly from a sequence of frame-level patches. This is an original way to use data and incorporates long-range information, which has not been taken into account in existing methods.
    2. Gated-temporal attention mechanism: The proposed approach incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations, resulting in superior performance compared to existing methods.
    3. Evaluation on multiple datasets: The authors extensively evaluate their approach on two cataract surgery video datasets, namely Cataract-101 and D99, demonstrating superior performance compared to various state-of-the-art methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the authors claim superior performance compared to various state-of-the-art methods, they do not provide a detailed comparison with each individual method or explain why their approach outperforms them. This could make it difficult for readers to fully understand how their proposed approach compares with existing methods.
    2. Finally, while this work has strong clinical feasibility implications as mentioned earlier, there is no discussion in this paper about any practical implementation challenges or limitations of using automated surgical step recognition in real-world settings.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have filled out a reproducibility checklist upon submission, which is a positive sign for the reproducibility of their work. They have provided detailed information about their experimental setup and evaluation metrics, as well as links to the datasets used in this study. However, while they state that code will be made publicly available after the review process, it is not currently available at the time of publication. Overall, based on the information provided in their checklist and paper itself regarding data availability and experimental details, it seems that reproducing this work would be feasible with some effort. However, without access to code or more detailed implementation instructions beyond what was included in their paper, full reproducibility may be challenging for some readers.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be beneficial if you could provide more details about how your approach outperforms each individual method or explain why this is so. Additionally, while I understand that code will be made publicly available after the review process, it would have been helpful if some implementation instructions were included in the paper itself (e.g., sufficient hyperparameter settings). This would make reproducing and building upon your work easier for readers. Finally, as mentioned earlier there was no discussion about any practical implementation challenges or limitations of using automated surgical step recognition in real-world settings. It may be useful to include such discussions as part of future work sections or implications sections since these are important considerations when translating research into clinical practice.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My overall score of 4 is based on the strengths and weaknesses of the paper. The proposed approach for automated surgical step recognition, using a vision transformer-based method that jointly learns spatio-temporal features directly from a sequence of frame-level patches with a gated-temporal attention mechanism, is novel and promising, with strong potential impact on improving patient safety during surgeries. Additionally, the authors have provided detailed information about their experimental setup and evaluation metrics in their reproducibility checklist. However, there are some areas for improvement, such as providing more detailed comparisons with existing methods to better understand how their approach outperforms them. Also, including implementation instructions in the paper itself would aid reproducibility, which could help other researchers build upon this work.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper addresses the problem of automated surgical step recognition. The authors propose to combine spatio-temporal and long-range temporal information using a vision transformer for step recognition in surgical videos. The key contribution is the proposed two-stream model GLSFormer, which uses gated attention to combine the short-range and long-range temporal information in the latent space. Experimental evaluation along with ablation studies shows the superior performance of the proposed method in comparison with prior art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The duration of surgical steps is highly variable depending on the complexity of the condition and the surgeon’s skill. A temporal model that considers only short-range or long-range temporal information could potentially fail owing to the variable nature of surgical procedures. The proposed method takes care of this by capturing both short- and long-range temporal information.
    2. Prediction of gating parameters in the attention module based on the spatio-temporal representations allows dynamic gating even during inference.
    3. Extensive experimental validation on two cataract surgery datasets, with ablation studies on the video sampling rate and on various versions of the temporal stream in the proposed network.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited novelty.
    2. The methods section is not well written. Non-standard usage of notation, e.g., vectors are not formatted in bold font. ‘st’ and ‘lt’, used as shortened notation for short-term and long-term, make the equations hard to read and cause confusion with ‘t’, which is used to represent the time step. A similar issue arises with ‘Gt’, used to represent the gating parameter. There are inconsistencies in notation, which adds to the confusion, e.g., z_{p,t}^{st} vs z_{l-1}^{st}(p,t).
    3. Comparison of the computational and time complexity of the proposed method with prior art is missing.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. Details of the datasets and evaluation metrics used in the experiments are provided.
    2. Implementation details, including chosen hyperparameters for training, are available in the paper.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The motivation for the design of the proposed two-stream vision transformer-based network is well explained and justified. Experimental results also show the superior performance of the proposed method in comparison with the network which captures only short-term dependencies.
    • While the section on experiments and results is presented well, Sec. 2 on the GLSFormer model lacks clarity. Please refer to the noted weaknesses to make appropriate changes to the notations used.
    • Using multiple letters in a single notation can be confusing, as it could be wrongly interpreted as matrix multiplication; it is therefore recommended to use single letters where possible. E.g., in the explanation of Gated Temporal Attention, ‘A’ denotes the number of attention heads, while later ‘MSA’ is used to denote multi-head self-attention.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The concept of capturing both short and long-range temporal information through gated attention modules that also models their interactions is an interesting direction to explore in the context of surgical video analytics. The proposed approach could be extended to related tasks such as surgical tool recognition, action triplet recognition, etc.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present a vision transformer-based paradigm designed to concurrently learn spatio-temporal information from video frames sampled across both short-term and long-term intervals. Though the reviewers generally like the innovative idea developed in the paper, they have raised several concerns, including insufficient comparison with SOTA methods, unclear exposition in the methodology section, and the lack of a time-complexity analysis, which is very important in surgical video tasks. I invite the authors to submit a rebuttal focusing on addressing the reviewers’ comments.




Author Feedback

N/A


