
Authors

Mona Fathollahi, Mohammad Hasan Sarhan, Ramon Pena, Lela DiMonte, Anshu Gupta, Aishani Ataliwala, Jocelyn Barker

Abstract

Mastering the technical skills required to perform surgery is an extremely challenging task. Video-based assessment allows surgeons to receive feedback on their technical skills to facilitate learning and development. Currently, this feedback comes primarily from manual video review, which is time-intensive and limits the feasibility of tracking a surgeon’s progress over many cases. In this work, we introduce a motion-based approach to automatically assess surgical skills from surgical case video feed. The proposed pipeline first tracks surgical tools reliably to create motion trajectories and then uses those trajectories to predict surgeon technical skill levels. The tracking algorithm employs a simple yet effective re-identification module that reduces ID switches compared to other state-of-the-art methods. This is critical for creating reliable tool trajectories when instruments regularly move on- and off-screen or are periodically obscured. The motion-based classification model employs a state-of-the-art self-attention transformer network to capture short- and long-term motion patterns that are essential for skill evaluation. The proposed method is evaluated on an in-vivo (Cholec80) dataset where an expert-rated GOALS skill assessment of the Calot Triangle Dissection is used as a quantitative skill measure. We compare transformer-based skill assessment with traditional machine learning approaches using the proposed and state-of-the-art tracking. Our results suggest that using motion trajectories from reliable tracking methods is beneficial for assessing surgeon skills based solely on video streams.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_52

SharedIt: https://rdcu.be/cVRXq

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to assess a surgeon’s skills in minimally invasive surgery. The proposed method is video-based and has two main steps. The first is creating motion trajectories of the surgical instruments by tracking their motion in the video. The second is using these trajectories to classify skill into two classes: good and poor performance. The proposed method was tested on real surgery videos from the Cholec80 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is generally well written. The fact that the proposed method only needs the surgery videos makes it applicable in a wide range of surgical settings. Testing the proposed method on real surgery (rather than, for example, dry-lab exercises) is another strength of this paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems unfair to compare the proposed method with the GOALS assessment tool (or at least to argue that it can replace GOALS). The reason is that GOALS yields a continuous value representing the assessors’ feedback, whereas the proposed method only considers two classes. In this context, from a trainee’s perspective, GOALS gives more useful information than the proposed method.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think that it would be better if the authors share more details on how they evaluated their proposed method to make it easier to reproduce their results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1. On using the methods in refs. 15 and 24 for longer videos: would not decomposing longer videos into shorter ones make it possible to use these methods for long videos as well, instead of creating a brand-new method to deal with this problem?

    2. How sensitive is the proposed method to the authors’ choice of the 3.5 threshold? (in the Dataset Description section)
    3. It seems unfair to compare the proposed method with the GOALS assessment tool. The reason is that GOALS yields a continuous value representing the assessors’ feedback, whereas the proposed method only considers two classes. In this context, from a trainee’s perspective, GOALS gives more useful information than the proposed method.
    4. In the discussion section, the authors write: “This indicates the classification created by the model is comparable to human level performance.” I am not sure the ultimate goal should be having models comparable to human-level performance, because humans vary a lot in the surgical skill assessment task and we do not want to reproduce this variability at all.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the work in this paper is interesting. However, the points I outlined above need to be addressed properly for the paper to be truly useful to readers.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper provides some incremental contributions on algorithms for video-based tracking of surgical procedures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    To the best of my knowledge, this paper proposes new ways to track and assess surgical tools in a video sequence. At least in the fields of surgical tracking and surgical workload, the proposed algorithms are innovative and provide good results (vs. ByteTrack, which is not specifically designed for this purpose).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    From my point of view, the main drawback of this paper is the lack of comparison of its results with previous works that are highly related to this topic. The authors used ByteTrack as the baseline for performance comparison, but I suggest including some others:
    doi: 10.1007/s10916-020-1525-9
    doi: 10.1109/WACV.2018.00081
    doi: 10.1007/s11548-016-1388-1

    Additionally, the main contributions (cost function and track recovery) have not been discussed or compared with approaches from previous studies.

    Finally, I cannot clearly see the novel contribution stated by the authors, “classify skill directly from the tool tracking”, in the discussion section. Maybe I missed something, but for me “classification” implies grouping these skills in some way (novice, intermediate, expert… or something similar). From my reading of the paper, I cannot directly see that the algorithm provides these groups.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I do not have any additional comments on this.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    As said before, I suggest comparing your proposal with other convolutional networks that were specifically developed for surgical tool tracking. There are several in the bibliography, and they should be included, at least qualitatively, in the discussion section. The real contribution of the proposed algorithms can only be fairly assessed after this analysis, because an important part of the state of the art is currently missing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lack of comparison with previous contributions in video-based tracking of surgical tools.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The manuscript proposes an automated way of assessing a surgeon’s skills based on the tool tracking method employed on the video feed. The idea is to track the instruments over a longer time span and use the trajectories to predict skill levels. The proposed tracking model uses a transformer architecture and is evaluated on the Cholec80 dataset with skill ratings provided on Calot Triangle Dissection. The manuscript further compares the transformer method with traditional ML methods like a random forest with the goal of reducing the number of identity switches. The manuscript provides results on both tracking and skills assessment where it outperforms a state-of-the-art tracking method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The abstract is clear and well written.
    • The motivation for the work is clear and matches the goal provided in the abstract. The requirement of automated skills assessment and what those typical skills consist of are clearly mentioned.
    • The novelty lies in the new tracking algorithm for generating a cost function and track recovery method based on the Hungarian algorithm.
    • There is potential to extend the public Cholec80 dataset with spatial labels if the annotations used are made public.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The evaluation dataset is very small and limits the reliability of the obtained results.
    • The rationales for architectural choices are omitted.
    • Ablation studies are needed to justify many of the modeling choices in this paper.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The manuscript mentions the use of data augmentation strategies but does not provide details on the different augmentation methods that may have been used.
    • The basic hyperparameters are not presented. The experiments performed should be accompanied by training details, but none are found in the manuscript.
    • The manuscript mentions that a Bayesian hyper-parameter search was used for the learning-based models but does not include the initial values or the parameter ranges over which the search was carried out. This is missing from the experimental setup section.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Title matching content: Yes, the manuscript describes a novel tool tracking method for skills assessment and thus matches the title.

    2. Abstract summarizing content: The abstract mentions the problem statement clearly along with the problems associated with prevalent methods for skills assessment.
      The abstract mentions the use of the self-attention transformer network but does not justify the rationale behind this choice. It would be nice to mention the percentage improvement obtained by the proposed model for skills assessment.

    3. Knowledge advancement: The method proposes a new cost function and uses an existing re-identification network with a transformer to perform tool tracking. However, the manuscript does not provide details on how it compares with other standard temporal models such as LSTMs or TCNs, and hence fails to provide a holistic picture. The method is evaluated on a small subset, the dissection phase, among all the phases present in Cholec80 for skills assessment. This limits the novelty to the specific Calot Triangle Dissection phase.

    4. Positioning with existing literature: The bulk of the reviewed works on object tracking are outside the medical domain. However, there are many published works on tool tracking in the surgical domain, such as [1-3]; the manuscript needs to be properly positioned with respect to them. Related references for the tool trajectory metrics, the transformer, and the datasets are provided.
    References:
    [1] Nwoye, Chinedu Innocent, et al. “Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos.” International Journal of Computer Assisted Radiology and Surgery 14.6 (2019): 1059-1067.
    [2] Robu, Maria, et al. “Towards real-time multiple surgical tool tracking.” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 9.3 (2021): 279-285.
    [3] Zhang, Lin, et al. “Real-time surgical tool tracking and pose estimation using a hybrid cylindrical marker.” International Journal of Computer Assisted Radiology and Surgery 12.6 (2017): 921-930.

    5. Method description and rationales: The method is divided into three modules, each described in sufficient detail and with a clear purpose: the tracking algorithm, feature-based skill assessment, and learning-based skill assessment.

    The method uses YOLOv5 to detect tools in a scene, and YOLOv5 was evaluated on the “last 5 videos of the dataset”. This part is confusing, as it does not clearly state which dataset is used for evaluation: is it Cholec80, or the 15 videos annotated with bounding boxes (Section 2.1)? It would be nice to clarify this. It would also be nice to justify the Kalman filter used for tool tracking and to state which formulation of the Kalman filter is used. This needs to be explained better, as the tracked locations are further used to construct the cost matrix for the Hungarian assignment, a novelty the manuscript claims (a generic sketch of this assignment step is given after this list).

    The description of the cost function under “Cost Function Definition” is out of order: the third term of Equation 1 is described as the second term in the text (pg. 5), and the second term of Equation 1 is described last. Such structural mistakes should be avoided to prevent confusion for readers.

    Even though transformers work well on time-series data, other temporal models such as LSTMs and TCNs also perform well, and the manuscript does not justify choosing the transformer beyond the standard “best performing model” claim. This leaves readers to guess how other temporal models would compare; a simple ablation with other temporal models would have been sufficient. The manuscript also does not describe the transformer configuration or its parameters.

    The traditional machine learning model used for feature-based skill assessment is a random forest, but the manuscript provides no rationale for this choice. What happens if other models, such as XGBoost or other tree-based methods, are used? The manuscript should provide ablations justifying the use of the random forest model.

    6. Standalone figures and tables: The problem statement in Fig. 1 is clear and summarizes the goal of the paper. The caption of Fig. 1 should briefly describe what the x, y, d, and w variables represent, and the intermediate cube annotated with d and w/2 should be explained. The full names of the acronyms used in Table 2 (IDs, MOTA, FP, and FN) are also missing from the paper.

    7. Data contribution/usage: There is no dataset contribution; however, the method is implemented and evaluated on the dissection phase of the Cholec80 dataset. The manuscript does not mention the train/val/test splits used for training and evaluation of the model. The CholecT50 triplet dataset is annotated with bounding boxes (133k) for surgical instruments for detection during tracking.

    8. Results presentation: The results in the tables are clear, but the best results should be highlighted in bold in Table 2. The results are not reported with mean and standard deviation across different runs, which would have made it easier to judge the stability of the experiments.

    9. Discussion of results and method justification: Skill efficiency is separated into two classes, low and high, based on the 3.5 threshold value. Is there a justification for choosing 3.5 as the threshold? The authors should provide more details on this. The manuscript points to the lower number of identity switches with the proposed method, but it is important to know how the ID switches are computed, a detail that is missing from the paper. To avoid the class imbalance problem, the manuscript mentions the use of random oversampling but does not provide the number of samples in the two skill classes (low and high), so readers have no idea how many samples were used.

    10. Comparative analysis with existing works: The results were compared with the baseline ByteTrack and Random Forest models as there is no existing work dedicated to GOALS-based skill assessment on Calot Triangle Dissection. Limitations are clearly specified and provided with reasoning.

    11. Conclusion: The manuscript presents insights for continuing this research by extending skill assessment to other phases of Cholec80.

    12. Arguable claims: The manuscript claims the transformer to be the best-performing model yet does not provide ablations with other temporal models to establish its importance. This raises serious questions about the nature of the experiments performed.

    13. Manuscript writing and typographical corrections: A reference to the Hungarian algorithm paper is missing (pg. 4, first row). [Aglorithm] in Algorithm 1 should be [Algorithm]; [hungrian] in Algorithm 1 should be [Hungarian]; [inacive] in Algorithm 1 should be [inactive]; [cos t] in the Track Recovery subsection should be [cost]; [Acuuracy] in Table 3 should be [Accuracy].
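
    As referenced in comment 5 above, the following is a generic, minimal sketch (this editor's illustration in Python with SciPy, not the authors' code) of the Hungarian assignment step between active tracks and per-frame detections; the placeholder costs stand in for the paper's multi-term cost function (Equation 1).

```python
# Generic sketch: Hungarian matching between active tracks and current-frame
# detections via SciPy. Placeholder costs stand in for the paper's Eq. 1.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks_to_detections(cost_matrix, max_cost=0.8):
    """cost_matrix[i, j] is the cost of assigning track i to detection j."""
    track_idx, det_idx = linear_sum_assignment(cost_matrix)
    matches, unmatched = [], set(range(cost_matrix.shape[0]))
    for t, d in zip(track_idx, det_idx):
        if cost_matrix[t, d] <= max_cost:  # gate out implausible assignments
            matches.append((t, d))
            unmatched.discard(t)
    return matches, sorted(unmatched)  # unmatched tracks become inactive

# Example: 3 active tracks, 2 detections in the current frame.
cost = np.array([[0.10, 0.90],
                 [0.70, 0.20],
                 [0.95, 0.90]])
print(match_tracks_to_detections(cost))  # ([(0, 0), (1, 1)], [2])
```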

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution is novel and relevant to the community. There is a reported data annotation effort. However, comparison with the right baselines is missing.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers agree that this is an interesting work with sufficient novelty. R2 and R3 agree that the performance evaluation study would be improved by including comparison to existing video-based surgical tool tracking methods.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

-One suggestion by R1 was regarding long-term tracking: decomposing the video into shorter streams rather than creating a new method. The drawback of this would be losing the identities of the tracked tools and needing a post-processing step to merge the tracks from the shorter video streams, which may result in more identity switches.

-Regarding using GOALS in binary mode rather than ranking: the authors do not claim that the method replaces GOALS, but rather that it gives a binary decision on surgeon efficiency that is as good as human performance. The proposed method and GOALS are compared in the binarized use case to ensure that the comparison is fair. We also note that while this model cannot replace the granularity of GOALS, because the method is automated it can be scaled to many more procedures than GOALS can. Thus, a surgeon can view their progress through the proportion of above-average cases over time rather than relying on one or two GOALS assessments.

-We decided to binarize our output due to the limited size of the dataset. With 5 classes, the individual class representation was too small for learning. The cutoff point (3.5) was selected because it shows the best agreement between the annotators. Using a cutoff threshold of 3.5, 29 cases belong to the low-performing group and the other 51 to the high-performing group. Performance may degrade at different thresholds, as agreement between the annotators is lower there.
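
As a small illustration of this binarization (a sketch with placeholder scores, not the authors' data pipeline):

```python
# Sketch of the binarization described above, with placeholder scores
# (the real labels come from the consensus of two surgeon annotators).
import numpy as np

rng = np.random.default_rng(0)
goals_scores = rng.uniform(1.0, 5.0, size=80)  # stand-in for 80 cases

labels = (goals_scores > 3.5).astype(int)  # 0 = low, 1 = high performing
print("low:", (labels == 0).sum(), "high:", (labels == 1).sum())
# On the authors' data this split is 29 low vs. 51 high.
```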

-R1 has concerns about the variability in GOALS scores and the model reproducing this variability. We try to mitigate this variability by asking two different surgeons to annotate the same data independently. We then use the consensus of the two annotators as the ground truth.

-R2 suggests comparing the method with other skill assessment methods, which is highly desirable but not applicable to the suggested works, as they either tackle different camera settings or a different assessment baseline. For example, Pérez-Escamirosa et al. use orthogonal cameras in a lab setting. Ganni et al. use motion features extracted from motion tracks, which are covered in the paper as the first results line of Table 3.

-Comparing our tracker with other SOTA methods: we compared our proposed tracking method with ByteTrack, the current state-of-the-art method (as its authors claim). We will include the suggested tracking methods in the literature review. Comparing our approach to the recommended trackers would require reproducing their models. Some works cannot be compared with the proposed method because they require markers to be installed on the tools to achieve pose estimation and tracking (e.g., Zhang et al.).

-Small size of the evaluation dataset for skill assessment: the authors mitigate this problem by using 5-fold CV and McNemar’s statistical significance test, which came out significant (p-value < 0.05).
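
For concreteness, a minimal sketch of such a paired test, assuming the statsmodels implementation of McNemar's exact test and hypothetical prediction arrays:

```python
# Sketch of McNemar's exact test on paired predictions from two classifiers
# (e.g., pooled across the 5 CV folds). Arrays here are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred_a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # model A
pred_b = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1])  # model B

ok_a, ok_b = pred_a == y_true, pred_b == y_true
# 2x2 table of (A correct?, B correct?) counts; off-diagonals drive the test.
table = [[np.sum(ok_a & ok_b), np.sum(ok_a & ~ok_b)],
         [np.sum(~ok_a & ok_b), np.sum(~ok_a & ~ok_b)]]
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.4f}")  # significant if < 0.05
```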

-Regarding the choice of random forest: the authors were keen to build a model that would not overfit the data, and random forest models tend to avoid overfitting best. While the XGBoost model was not evaluated here, other classification models were also tested, and the random forest came out as the best performer. However, the XGBoost algorithm has properties that would be beneficial to consider in future studies.
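
A minimal sketch of this feature-based baseline under stated assumptions: placeholder motion features, and imbalanced-learn as an assumed implementation of the random oversampling mentioned in the reviews:

```python
# Sketch of the feature-based baseline: random oversampling of the minority
# class, then a random forest. X is a placeholder feature matrix; the use of
# imbalanced-learn (imblearn) here is this editor's assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))        # 80 cases x 12 motion features
y = np.array([0] * 29 + [1] * 51)    # 29 low / 51 high, as stated above

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)
print("training accuracy:", clf.score(X, y))
```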

-As for using a transformer as opposed to an LSTM: transformers have been shown to perform well on longer sequences and can better handle training on a smaller amount of data (Vaswani et al., Attention Is All You Need). We also trained an Inception1D model, but it fell short compared to the transformer (accuracy: 0.7375; kappa: 0.452).

-The tracking algorithm was evaluated on 15 videos of Cholec80.

-Our model consists of two 1D-conv blocks and a transformer encoder. Each 1D-conv block is a 1D convolution + BatchNorm + ReLU. We optimized the number of output channels d and the kernel sizes k: Conv1 (k=11, stride=1), Conv2 (k=3, stride=2), d=128. Transformer: num_heads=7, num_layers=2, dim_feedforward=56.
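
A hedged sketch of this architecture follows; it assumes PyTorch, infers the layer wiring from the description above, and deviates on the head count purely for runnability (see the comments):

```python
# Sketch of the described architecture, assuming PyTorch. The wiring is
# inferred from the text above. Note: nn.TransformerEncoderLayer requires
# d_model to be divisible by nhead, so this sketch uses 8 heads instead of
# the stated 7 so that it runs with d = 128.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1D-conv block as described: Conv1d + BatchNorm + ReLU."""
    def __init__(self, c_in, c_out, k, stride):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, k, stride=stride, padding=k // 2),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class SkillTransformer(nn.Module):
    def __init__(self, in_ch, d=128, n_classes=2):
        super().__init__()
        self.conv1 = ConvBlock(in_ch, d, k=11, stride=1)
        self.conv2 = ConvBlock(d, d, k=3, stride=2)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=8,          # feedback states num_heads=7
            dim_feedforward=56, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, x):                     # x: (batch, channels, time)
        z = self.conv2(self.conv1(x))         # (batch, d, time')
        z = self.encoder(z.transpose(1, 2))   # (batch, time', d)
        return self.head(z.mean(dim=1))       # pool over time, classify

model = SkillTransformer(in_ch=4)             # e.g., (x, y) for two tools
logits = model(torch.randn(2, 4, 256))        # two trajectories, 256 steps
print(logits.shape)                           # torch.Size([2, 2])
```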


