Authors

Yiliang Chen, Shengfeng He, Yueming Jin, Jing Qin

Abstract

Including context-aware decision support in the operating room has the potential to improve surgical safety and efficiency by utilizing real-time feedback obtained from surgical workflow analysis. In this task, recognizing each surgical activity in the endoscopic video as a triplet {instrument, verb, target} is crucial, as it helps to ensure actions occur only after an instrument is present. However, recognizing the states of these three components in one shot poses extra learning ambiguities, as the triplet supervision is highly imbalanced (positive when all components are correct). To remedy this issue, we introduce a triplet disentanglement framework for surgical action triplet recognition, which decomposes the learning objectives to reduce learning difficulties. Particularly, our network decomposes the recognition of triplet into five complementary and simplified sub-networks. While the first sub-network converts the detection into a numerical supplementary task predicting the existence/number of three components only, the second focuses on the association between them, and the other three predict the components individually. In this way, triplet recognition is decoupled in a progressive, easy-to-difficult manner. In addition, we propose a hierarchical training schedule as a way to decompose the difficulty of the task further. Our model first creates several bridges and then progressively identifies the final key task step by step, rather than explicitly identifying surgical activity. Our proposed method has been demonstrated to surpass current state-of-the-art approaches on the CholecT45 endoscopic video dataset.
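The "numerical supplementary task" mentioned in the abstract can be pictured with a minimal sketch. It assumes that each frame is annotated with a list of (instrument, verb, target) tuples and that the soft label is simply the number of distinct instruments, verbs, and targets in the frame. The function name `soft_labels` and the label layout are illustrative assumptions, not the authors' implementation.

```python
def soft_labels(triplets):
    """Count the distinct components present in one frame.

    `triplets` is an iterable of (instrument, verb, target) tuples.
    The returned counts are a hypothetical form of the existence/number
    target; the paper's exact encoding is not given on this page.
    """
    triplets = list(triplets)
    instruments = {instrument for instrument, _, _ in triplets}
    verbs = {verb for _, verb, _ in triplets}
    targets = {target for _, _, target in triplets}
    return {
        "instrument": len(instruments),
        "verb": len(verbs),
        "target": len(targets),
    }


# Illustrative frame with two active triplets sharing one target.
frame = [
    ("grasper", "retract", "gallbladder"),
    ("hook", "dissect", "gallbladder"),
]
counts = soft_labels(frame)
```

A count of zero for a component doubles as an "absent" signal, which is one way a single numerical target could cover both existence and number.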

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_43

SharedIt: https://rdcu.be/dnwPo

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #4

Please describe the contribution of the paper

To tackle the Surgical Activity Triplet Recognition problem (recognizing the three triplet components: instrument, verb, and target), the authors propose to decompose the recognition of triplets into multiple complementary and simplified sub-problems and train sub-networks accordingly. The first network focuses on learning the existence/number of the three triplet components. The second network focuses on learning the association between the three triplet components. The other networks focus on predicting each of the triplet components. The authors trained their networks in a hierarchical way and show that their networks can outperform previous state-of-the-art designs.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) While learning multi-tasks together is one trending research direction, the authors move forward in the other direction of research which is decomposing the recognition of triplets into multiple complementary and simplified sub-problems and training sub-networks accordingly. This seems able to lead to some good discussions if the paper gets a chance to be presented at MICCAI.

(2) The Hierarchical Training Schedule seems to be novel. This training method contains multiple stages and can break down the complexity of the task and improve the performance of each component at each stage.

(3) While the dataset is very small, the authors conduct k-fold cross-validation.
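For readers unfamiliar with the evaluation protocol mentioned in (3), a minimal sketch of k-fold cross-validation follows. It assumes the split is made at the video level so that frames from one video never appear in both training and test sets. The fold count of 5, the 45 video IDs, and the strided assignment are assumptions for illustration; the dataset's official split may group videos differently.

```python
def kfold_by_video(video_ids, k):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Videos are assigned to folds by striding through the list, so each
    video is used for testing exactly once.
    """
    folds = [video_ids[i::k] for i in range(k)]
    for i in range(k):
        test_ids = folds[i]
        train_ids = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train_ids, test_ids


# Hypothetical IDs 1..45; the real video numbering may differ.
splits = list(kfold_by_video(list(range(1, 46)), k=5))
```

Reported scores would then be averaged over the k test folds.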
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The abstract, figure 2, and the text for Figure 2 seem not to match up very well. In the abstract, the authors stated: “Our network decomposes the recognition of triplet into five complementary and simplified sub-networks.” In the figure, it seems there are only 4 sub-networks, looks like verb and target are learned together in one sub-network. In section 2.3, the authors stated: “We divide our triplet recognition task into three sub-networks: the tool network, verb network, and target branch network.” This can be a little bit confusing for the reader; it would be great if the authors fixed this mismatch.

(2) The text for Table 3 does not match the results in Table 3. Results w/o SC are very close to the baseline results. This contradicts the conclusion that the inclusion of these modules can significantly contribute to the overall performance of the model. The authors should rewrite this part so that the claims agree with the numbers.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I did not see any problems with the reproducibility of the work.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

(1) One of the well-known previous works on Surgical Activity Triplet Recognition [1] is not cited. That work proposes many different methods for solving Surgical Activity Triplet Recognition; it would be great if the authors cited it in their Introduction section.

[1] Nwoye C I, Alapatt D, Yu T, et al. Cholectriplet2021: A benchmark challenge for surgical action triplet recognition[J]. Medical Image Analysis, 2023: 102803.

(2) The point below can be very difficult to address, so please just share your thoughts on it. The backbone network for both RDV and RiT is ResNet18, while the authors use I3D (ResNet50). I3D (ResNet50) was also pre-trained on the ImageNet and Kinetics datasets, whereas ResNet18 was trained on ImageNet only. Could the improvement we see here be due to the change in the backbone network?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Instead of learning different tasks together, the authors propose to decompose the problems and ask different networks to work on different sub-problems. I believe this idea itself is very interesting and might be great to share with the MICCAI community.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The paper addresses the task of surgical activity recognition from video in laparoscopic cholecystectomy. The model proposes to decompose the objective into separate tasks. The proposed method considerably outperforms prior work in the area.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper leverages pre-training on a soft-labelling task, decomposing the triplet task, and prior work on class activation-guided attention
- Results considerably outperform prior work
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- More clarity on the verb and target networks could be provided
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I believe this paper is reproducible. Methods are clearly described, and code will be released on publication. The paper uses a common dataset that is well known in the area.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

It is unclear to me if Backbone 3 is shared by the verb and target prediction tasks, and if so, why this decision was made. The motivation for this paper seems to be trying to disentangle the instrument, verb, and target tasks. Figure 2 could be updated to clarify this.

The proposed hierarchical training schedule is interesting. It seems that pre-training on the soft-labelling task has added value.

The added value over prior work (including [20] which uses CAGAM too) is the disentanglement of the tasks of predicting the three elements in the triplet.

It would be interesting to ablate on the disentanglement (e.g. use only the triplet network). The ablation study seems to only consider the soft-labelling task and the separation of the tool network from the verb and target networks.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major factors leading to this score are: (1) the interesting method of leveraging pre-training and decomposing the task and (2) the impressive results on the dataset.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The paper proposes a triplet disentanglement framework for surgical action triplet recognition in endoscopic videos. The objective is to address the imbalance of triplet supervision that arises when all three components are recognized in one shot.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The mentioned issues in Fig.1 are realistic and not seriously considered in previous work of surgical activity recognition.
2. The proposed disentanglement network has novelty. It has a multi-task and hierarchical structure that facilitates the learning of each element in the triplet.
3. The experiment shows an improvement of the performance over other SOTA methods.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The designed soft label branch functions more like an auxiliary task in the multi-task learning setting, which does not explicitly incorporate the soft labels into the recognition task.
2. The hierarchical training schedule may increase the difficulty of learning, where the errors in preceding stages propagate to the current stage and there are many hyperparameters to tune.
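The concern in point 2 about errors propagating between stages can be made concrete with a minimal sketch of a staged schedule. Everything here is an assumption for illustration: the stage order, the sub-network names (`soft_label_net`, `tool_net`, `verb_net`, `target_net`, `triplet_net`), and the choice to freeze modules trained in earlier stages are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple  # sub-networks whose weights are updated in this stage


# Hypothetical easy-to-difficult ordering of the five sub-networks.
SCHEDULE = (
    Stage("soft_label", ("soft_label_net",)),
    Stage("components", ("tool_net", "verb_net", "target_net")),
    Stage("association", ("triplet_net",)),
)


def plan(schedule):
    """Yield (stage name, trainable modules, frozen modules) per stage.

    Modules trained in earlier stages are treated as frozen later on,
    which is why an error learned early cannot be corrected afterwards.
    """
    seen = []
    for stage in schedule:
        yield stage.name, stage.trainable, tuple(seen)
        seen.extend(stage.trainable)


stages = list(plan(SCHEDULE))
```

Each stage also carries its own learning rate, epoch count, and loss weights in practice, which is the source of the extra hyperparameters the reviewer mentions.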
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The experiments in the paper are carried out on the public dataset CholecT45. The authors also state that they will release the training code of the proposed method. Therefore, it should be easy to reproduce the work and compare the performance with the results presented in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. The performance improvement shown in Table 1 is not very strong. Further analysis of the special cases (like those described in Figure 1) needs to be added to justify the practical value of the proposed method.
2. From Table 3, it is hard to conclude that the inclusion of those modules can “significantly” contribute to the performance.
3. The detailed network for each branch, beyond the backbone, should be described.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The issues to be addressed in the paper are interesting and the proposed method has novelty. The performance improvement is not significant and the merits of designed modules need further analysis and discussion.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors present their work on “disentangling” the elements of the action triplet through use of the publicly available CholecT45 dataset. The authors present the motivation for their work, namely that one-shot recognition of action triplets can be difficult and unnecessarily hampered by elements in the data such as multiple action triplets in one frame, similar instruments/verbs/targets in one frame, and potentially irrelevant data from competing instruments/targets in a frame. They thus propose disentangling the elements of an action triplet to improve performance in action triplet recognition.

The manuscript is very well written and logical to follow. The strengths that drive the score are: 1) Improvement in performance over SOTA methods while also presenting a novel approach to the task through “disentanglement”. The largest gains in performance were in top 5 accuracy and in mAP of the overall triplet. The design of their approach has, as one reviewer noted, “multi-task and hierarchical structure that facilitates the learning of each element in the triplet.” 2) The motivations of each of their approaches, such as how the authors determined “irrelevant” surgical activity, are intuitive and clearly explained. Each step of their approach appears to fit well with their concisely stated motivation and has justification based on the design of the dataset.
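The two metrics named above, top 5 accuracy and mAP, can be sketched in plain Python for a multi-label setting. This is a generic illustration under stated assumptions: average precision is taken as the mean of precision values at each positive in the score-ranked list, classes without positives are skipped, and top-K accuracy is the fraction of ground-truth labels found among the K highest-scoring classes of their frame. The official CholecT45 evaluation may define these details differently.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_rows, label_rows):
    """Mean of per-class AP over classes that have at least one positive."""
    aps = []
    for c in range(len(score_rows[0])):
        ap = average_precision([row[c] for row in score_rows],
                               [row[c] for row in label_rows])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else None


def top_k_accuracy(score_rows, label_rows, k):
    """Fraction of true labels ranked within the top k scores of their frame."""
    hits = total = 0
    for scores, labels in zip(score_rows, label_rows):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top = set(ranked[:k])
        for i, is_positive in enumerate(labels):
            if is_positive:
                total += 1
                hits += i in top
    return hits / total if total else None
```

Here each row is one frame and each column one triplet class; a metric such as the AP_IVT cited in the author feedback would correspond to running the mAP computation over the triplet classes.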

The paper is not without its weaknesses, though they are largely outweighed by the strengths: 1) A reviewer astutely pointed out: “The abstract, figure 2, and the text for Figure 2 seem not to match up very well. In the abstract, the authors stated: ‘Our network decomposes the recognition of triplet into five complementary and simplified sub-networks.’ In the figure, it seems there are only 4 sub-networks, looks like verb and target are learned together in one sub-network. In section 2.3, the authors stated: ‘We divide our triplet recognition task into three sub-networks: the tool network, verb network, and target branch network.’ This can be a little bit confusing for the reader…”

As a minor point, while the authors cite much of the initial work in this space by Nwoye et al, they do leave out a paper that summarizes the approaches of different groups related to the Action Triplet Challenge: Nwoye C I, Alapatt D, Yu T, et al. Cholectriplet2021: A benchmark challenge for surgical action triplet recognition[J]. Medical Image Analysis, 2023: 102803.

Author Feedback

We appreciate the reviewers’ feedback and recognition of our work’s novelty and contribution. We have carefully considered the raised concerns and will address them as follows:

R#2Q1 Whether Backbone 3 is shared by the verb and target prediction tasks.

Backbone3 includes two sub-networks for verbs and targets respectively for better disentanglement. We will revise this part to make it clearer.

R#2Q2 “Ablate on the disentanglement (use only the triplet network).”

Regarding the question about ablating by using only the triplet network, we did conduct similar experiments. However, training a triplet network alone posed significant challenges, and the training did not converge. This is because training a triplet network without other support is too ambiguous. We will add more discussion on this.

R#3Q1 justify the practical values of proposed method.

While the overall improvement may not appear substantial at first glance, we achieved about a 4% improvement on the AP_IVT metric. In addition, our proposed strategies can remedy the training difficulty and can potentially inspire other applications.

R#3Q2 Table 3 doesn’t clearly show that including these modules significantly improves performance.

We agree that the term “significantly” may not be accurate. However, we would like to emphasize that our primary contribution lies in the hierarchical training strategy, which has demonstrated notable effects on the overall performance.

R#3Q3 A more detailed description of the network.

We will add more detailed discussions on each branch in the revision.

R#4Q1 One of the well-known previous works is not cited.

We will update our Introduction section to include a discussion of the cited work.

R#4Q2 The differences of different backbone networks.

It is crucial to note that the differences in tool recognition, as mentioned, are not significant due to our frame-by-frame predictions. To ensure consistency in the temporal dimension, we modified our 3D-CNN approach, considering that our frame length is relatively short. Although the 3D-CNN has a stride of 1 in the temporal dimension, it still captures some temporal information. The verb recognition, on the other hand, may benefit more from the 3D-CNN approach. We did explore adding LSTM/TCN layers after the extractor, but the results were unsatisfactory. Therefore, our current approach strikes a balance, optimizing the performance of the system.

Surgical Activity Triplet Recognition via Triplet Disentanglement