Authors

Saurav Sharma, Chinedu Innocent Nwoye, Didier Mutter, Nicolas Padoy

Abstract

Surgical action triplets describe instrument-tissue interactions as 〈instrument, verb, target〉 combinations, thereby supporting a detailed analysis of surgical scene activities and workflow. This work focuses on surgical action triplet detection, which is challenging but more precise than the traditional triplet recognition task as it consists of joint (1) localization of surgical instruments and (2) recognition of the surgical action triplet associated with every localized instrument. Triplet detection is highly complex due to the lack of spatial triplet annotation. We analyze how the amount of instrument spatial annotations affects triplet detection and observe that accurate instrument localization does not guarantee a better triplet detection due to the risk of erroneous associations with the verbs and targets. To solve the two tasks, we propose MCIT-IG, a two-stage network, that stands for Multi-Class Instrument-aware Transformer - Interaction Graph. The MCIT stage of our network models per class embedding of the targets as additional features to reduce the risk of misassociating triplets. Furthermore, the IG stage constructs a bipartite dynamic graph to model the interaction between the instruments and targets, cast as the verbs. We utilize a mixed-supervised learning strategy that combines weak target presence labels for MCIT and pseudo triplet labels for IG to train our network. We observed that complementing minimal instrument spatial annotations with target embeddings results in better triplet detection. We evaluate our model on the CholecT50 dataset and show improved performance on both instrument localization and triplet detection, topping the leaderboard of the CholecTriplet challenge in MICCAI 2022.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_48

SharedIt: https://rdcu.be/dnwPt

Link to the code repository

https://github.com/CAMMA-public/mcit-ig

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper proposes a method for surgical action triplet detection, which combines transformers, attentions, and graph networks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is a good combination of existing techniques, including Deformable DETR, GAT, attentions and so on.

    2. The proposed method achieves improved performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is stated in the reproducibility checklist that central tendency and variation are not applicable. I don’t see why the authors cannot provide the mean and standard deviation of the results, for example over multiple runs.

    2. What is the “class id embedding” on Line 11, Page 4?

    3. Below Eq. 1, why can the authors state that “MCIT learns meaningful class embeddings of the target”? How can the N tokens correspond to each class, given that they are averaged at the end?
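
    As an illustration of point 1, the mean and standard deviation over repeated training runs could be reported with a few lines such as the sketch below. The function name `summarize_runs` and the score values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric over several runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed to estimate variation")
    return mean(scores), stdev(scores)


# Placeholder values standing in for one metric measured over three seeds.
# These are NOT numbers from the paper.
placeholder_ap = [0.10, 0.12, 0.14]
run_mean, run_std = summarize_runs(placeholder_ap)
print(f"AP = {run_mean:.3f} ± {run_std:.3f} over {len(placeholder_ap)} runs")
```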

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is good given that the code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents contributions that are, however, incremental. Therefore, I recommend weak accept.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors proposed a two-stage pipeline for triplet detection in laparoscopic cholecystectomy procedures. They introduced a transformer-based method for learning per class embeddings of target anatomical structures in the absence of target instance labels, and an interaction graph that dynamically associates the instrument and target embeddings to detect triplets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well organized and relatively easy to follow
    • The proposed method is novel and interesting
    • The authors demonstrate a comprehensive evaluation of the proposed method, including comparison with SOTA and extensive ablation studies.
    • The method outperforms the top methods from the CholecTriplet 2022 challenge.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Why is there no comparison to the triplet recognition results from the CholecTriplet 2022 challenge?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides enough implementation details to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It would be interesting to hear the authors’ opinion on how this method can be extended to the video domain and to the more general domain of action recognition.

    • The CholecTriplet 2022 challenge has three criteria:

      1. Classification AP for action triplet recognition
      2. Localization AP for surgical instrument localization
      3. Detection AP for box-triplet association

      Why does Table 4 include only the last two criteria?
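
    To make the difference between the localization and detection criteria concrete, the sketch below shows one common way a predicted box is matched to a ground-truth box. It is a generic illustration, not the challenge's evaluation code: the 0.5 IoU threshold, the dictionary keys, and the function names are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, require_triplet, iou_threshold=0.5):
    """Decide whether a prediction matches a ground-truth instance.

    pred and gt are dicts with hypothetical keys 'box', 'instrument', 'triplet'.
    require_triplet=False mimics a localization-style check (box + instrument);
    require_triplet=True mimics a detection-style check (box + full triplet).
    The 0.5 threshold is an assumed default, not taken from the challenge.
    """
    if iou(pred["box"], gt["box"]) < iou_threshold:
        return False
    if pred["instrument"] != gt["instrument"]:
        return False
    if require_triplet and pred["triplet"] != gt["triplet"]:
        return False
    return True


# A well-localized instrument whose verb or target was associated wrongly:
gt = {"box": (0, 0, 10, 10), "instrument": "grasper", "triplet": 1}
pred = {"box": (1, 1, 10, 10), "instrument": "grasper", "triplet": 2}
print(is_match(pred, gt, require_triplet=False))  # localization-style check
print(is_match(pred, gt, require_triplet=True))   # detection-style check
```

    Under these assumptions, a prediction can pass the localization-style check and still fail the detection-style check, which is the association risk the abstract describes.
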
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well organized and the proposed method is novel. The method outperforms the top methods from the CholecTriplet 2022 challenge. I think the paper should be accepted.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    Proposed MCIT-IG, a two-stage network for surgical instrument localization and surgical triplet detection.
      a. MCIT (Multi-Class Instrument-aware Transformer): learns instrument-aware target class embeddings.
      b. IG (Interaction Graph): models instrument-target interaction to detect surgical triplets.
      c. Training based on mixed-supervised learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Proposes a novel two-stage MCIT-IG model for instrument localization and surgical triplet detection:
      a. Technical Novelty: Extending [21], this work develops MCIT to improve target class embedding.
      b. Application Novelty [minor]: Integrating MCIT and Interaction graph for surgical triplet detection and instrument localization.
    2) Quantitative Analysis:
      a. The model outperforms the baseline model (RDV [13]) in Table 1.
      b. Ablation study (Table 3) clearly shows the significance of each sub-module.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Limited SOTA model comparison: Results reported in Tables 2 and 4 are solely based on the quoted challenge results [minor weakness]. In my view, quoting and comparing with challenge results doesn’t fully justify a fair SOTA comparison. Firstly, there is no consistency in pre-training, as some models are trained on additional datasets and some are not, making it an unfair comparison of model performance. Secondly, they are not reproduced (trained) in the same system environment (library initial weights may affect model performance). For a fair comparison, all models should be trained on the same dataset and implemented in the same environment. Thirdly, the challenge models may not represent the latest SOTA models (surgical and computer vision domain), as this depends on the comfort zone of the challenge participants. However, given the significant difference in the challenge results, I consider this a minor weakness.
      a. Only one baseline model is compared (RDV [13]).
      b. Lacks benchmarking against SOTA triplet detection models (CNN/transformer/graph-based models) reported in the surgical/computer vision domain, e.g., SIRNet, Forest Graph Convolutional Network.
      c. Lacks benchmarking against SOTA interaction detection models (graph and transformer-based) reported in the surgical domain. While an ablation study is reported in the supplementary material, those graph models are not recent. Works based on the SOTA graph parsing neural network and visual-semantic graph attention network have been reported in the surgical domain.
    2) Manuscript lacks qualitative analysis on both localization and triplet detection.
    3) Lacks details on model size/inference speed.
    4) Figure 1 quality needs improvement:
      a. Need visualization of nodes and edges in the interaction graph.
      b. Improve space utilization.
      c. Improve choice of colors (words are not clear).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The model and experiment setup are described. However, the train/test code is not available. I assume it will be made public upon paper acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This work has both application (minor) and technical novelty and is commendable. Points to improve:
    1) Improve the quality of Figure 1:
      - If possible, provide visualization of ideal nodes and edges for a given scene.
      - Improve space utilization and choice of color.
    2) Include qualitative analysis:
      - As the SOTA results are quoted from the challenge, comparing against the top 2 models will suffice. The remaining space could be better utilized for qualitative analysis.
    3) Reproduce challenge models: reproduce the challenge models and train them on the same datasets in the same system environment for a fair comparison.
    4) Compare against SOTA action triplet models and graph models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work contains clear technical novelty. The proposed model outperforms a base model (RDV) and the ablation study clearly shows the significance of each module. However, the manuscript lacks qualitative analysis and mostly quotes challenge results (in my view, not a fully fair comparison). It also lacks comparison against SOTA action triplet detection models and interaction detection models. Taking into account these factors and the significant increase in performance, I recommend weak acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a framework for improved surgical action triplet detection through a two-stage mechanism of building instrument-aware classwise target embeddings and an interaction graph to learn the triplet associations. The framework is validated on the CholecT50 dataset through extensive ablation studies and comparison with SOTA, and outperforms the top methods from the CholecTriplet 2022 challenge. The proposed approach is novel and interesting, and the paper is well-written and well presented.
    Feedback and questions from the reviewers regarding some of the details in the methodology, details on model size/inference speed, and improvements in figures (Figure 1) should be incorporated in the final submission.




Author Feedback

We thank the AC and reviewers for their diligent evaluation of our manuscript and especially for finding our proposed method on surgical action triplet detection insightful and valuable to the research community. We have made appropriate revisions to enhance the manuscript.

  • Response to Reviewer 2: We kindly re-emphasize that our work focuses solely on the triplet detection task, where the model is designed to detect only the triplet instances that are present instead of predicting probabilities for all valid triplets. Hence, the model’s evaluation does not include classification AP in Table 4. Regarding the extension of our work to the video domain, we think that MCIT-IG can be augmented with space-time graphs to incorporate instrument instances across time in a causal manner.
  • Response to Reviewer 3: We follow the standard evaluation and result reporting protocol of the CholecTriplet 2022 challenge to ensure direct result comparison with the existing methods. The class id embeddings (mentioned on page 4) refer to the d-dimensional vectors generated by transforming the class labels of the detected instruments. For the target class, MCIT, designed to learn per class token embeddings of the target, also applies class-wise positional embeddings to the features before averaging, which ensures that the N tokens correspond to the N classes. Combining these with the instrument instance features enhances the semantics of the target features.
  • Response to Reviewer 4: At the time of the paper submission, the challenge methods were the latest SOTA on triplet detection and localization. We compare against these methods because our work uses the same dataset and follows the same data split. We chose RDV as a baseline for comparison in Table 1 because its code is publicly available, which allows us to re-execute it to obtain predictions (bounding boxes and triplet classes). The boxes, being weakly supervised, are re-mapped to more accurate boxes from the supervised instrument detector before the comparison. Moreover, Table 4 provides a direct comparison with other methods published in the challenge. Benchmarking against methods such as SIRNet and Forest GCN is not possible as they do not perform triplet detection. Benchmarking against other interaction detection methods outside the scope of surgical action triplet detection would require modifying those approaches and changing their model architecture, so a comparison would not reflect the original authors’ proposals. The supplementary material contains the qualitative analysis of MCIT-IG.

For the final submission, we have revised the manuscript to include details on model size and inference speed, as well as a clearer diagram of the proposed model architecture in Figure 1.

We thank all reviewers for their invaluable feedback.
