
Authors

Xinpeng Ding, Ziwei Liu, Xiaomeng Li

Abstract

Self-supervised learning has witnessed great progress in vision and NLP; recently, it has also attracted much attention in various medical imaging modalities such as X-ray, CT, and MRI. Existing methods mostly focus on building new pretext self-supervision tasks, such as reconstruction, orientation, and masking identification, according to the properties of medical images. However, the publicly available self-supervision models are not fully exploited. In this paper, we present a powerful yet efficient self-supervision framework for surgical video understanding. Our key insight is to distill knowledge from publicly available models trained on large generic datasets to facilitate the self-supervised learning of surgical videos. To this end, we first introduce a semantic-preserving training scheme to obtain our teacher model, which not only contains semantics from the publicly available models but also can produce accurate knowledge for surgical data. Besides training with only contrastive learning, we also introduce a distillation objective to transfer the rich learned information from the teacher model to self-supervised learning on surgical data. Extensive experiments on two surgical phase recognition benchmarks show that our framework can significantly improve the performance of existing self-supervised learning methods. Notably, our framework demonstrates a compelling advantage under a low-data regime.


Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_35

SharedIt: https://rdcu.be/cVRXa

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The work presented in this paper considers the task of surgical phase recognition on videos and tackles specifically one of its challenges: the lack of annotated data. In response, the paper proposes a two-fold strategy: a self-supervision approach (teacher/student models) and the use of publicly available models trained on large generic datasets to train the teacher model.

    To do so, the presented method has three main characteristics. (1) It follows a contrastive learning approach, training an encoder for a dictionary look-up task. (2) It preserves the semantics extracted from the model trained on large datasets, by freezing its backbone and updating only its projection head. (3) It self-trains the student model with a specific distillation strategy in which the similarity matrices of the teacher and student models are constrained to resemble each other (see the sketch below).

    Experiments with cholecystectomy videos reveal that (2) and (3) are effective strategies, leading to a general improvement.
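
    To make characteristic (3) concrete, here is a minimal PyTorch-style sketch of the similarity-matrix distillation as I understand it; the function name, shapes, and the KL form of the loss are my own illustration, and the exact objective in the paper may differ.

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_feats, teacher_feats, tau=0.07):
    """Hypothetical sketch: make the student's batch similarity matrix
    resemble the teacher's. Inputs are (B, D) embeddings of the same batch."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    sim_s = s @ s.t() / tau  # student similarity matrix, shape (B, B)
    sim_t = t @ t.t() / tau  # teacher similarity matrix, shape (B, B)
    # Soft cross-entropy between the row-wise distributions, teacher as target.
    return F.kl_div(F.log_softmax(sim_s, dim=1),
                    F.softmax(sim_t, dim=1),
                    reduction="batchmean")
```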

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Well-organized and well-written. The idea of using knowledge from a larger dataset to improve a task for which the dataset is small is relevant in general, and especially for surgical videos, where annotation is particularly time-consuming and definitely requires expertise. The underlying assumption (“using the same backbone and self-supervised learning method, the model trained with ImageNet data can yield a comparable performance for surgical phase recognition with that trained with surgical video data”) is tested and verified numerically. There is no new component, but the combination is new, and so is its application. The strategy of semantic preservation is experimentally verified by the ablation study and makes intuitive sense. The experimental section is well furnished: 2 datasets and several appropriate metrics.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major weakness: potential overstatement and incomplete state-of-the-art. Note that I am not an expert on this particular topic.

    P3 “To best of our knowledge, we are the first to investigate the use of self-supervised training on surgical videos.” This claim may be removed considering the following publication: “Learning from a Tiny Dataset of Manual Annotations: a Teacher/Student Approach For Surgical Phase Recognition”, Tong Yu et al., IPCAI 2019

    After looking at the literature, I found that this paper only reports existing works on self-supervised learning for images, while there are some works on videos. Note that I could not find any paper that shares enough similarities with this work to harm its novelty: the papers below consider other problem configurations and do not have the same assumptions. Here are two papers I found:
    - “Learning from a Tiny Dataset of Manual Annotations: a Teacher/Student Approach For Surgical Phase Recognition”, Tong Yu et al., IPCAI 2019
    - “Teaching Yourself: A Self-Knowledge Distillation Approach to Action Recognition”, Vu et al., IEEE Access, vol. 9, pp. 105711-105723, 2021, doi: 10.1109/ACCESS.2021.3099856
    → Why is this group of works not mentioned in the literature review?

    Minor weakness: lack of qualitative results and impact of MS-TCN. One of the main differences with MoCo v2, from which this work is inspired, is that the data used in the submitted work has a temporal component. This work handles it using a multi-stage temporal convolutional network (MS-TCN). Even if this choice is supported by three references, no surgical phase sequences are displayed, which makes this aspect very hard to appreciate. Considering that there is no space left in the current version of the paper, at least an appendix would have been great.

    Minor weakness: Lack of clarity

    Fig 3 is not clear about the meaning of the x-axis: does a label fraction of X% mean that X% of the data is labeled? Also, it is hard to relate the results of Fig 3 and Table 1.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not all parameters for the training are given, such as the mini-batch size and the number of epochs. The hardware could have been mentioned. The datasets are clearly identified. The authors did not mention that they will release the code. However, this code should share some similarities with the MoCo v2 code, which is available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    ** General ** The list of key contributions is somewhat redundant with the previous paragraph. Trimming it would save space for the display of some sequences of surgical phases. I am not requesting this for the paper's acceptance.

    Sometimes “et al.” is written in italic and sometimes not. Can you make it consistent?

    Section 2.1, L6: what is the meaning of the index “i”? What is the meaning of the subscript “+” under “k”?

    P4, Figure 2: the “x” at the bottom of the figure should be a \nu instead. Otherwise, you need to change Eq. 1 and the first line of the first paragraph in Section 2.1.

    I would appreciate a brief motivation for the use of the momentum-updated encoder. This motivation is given in [10], but I think the paper would gain in clarity by restating it.
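
    For instance, the motivation could be stated in one line: the key encoder is an exponential moving average of the query encoder, which keeps the keys in the dictionary queue consistent across iterations. A minimal sketch of the update follows; the rule and the m = 0.999 default come from MoCo, not from this paper:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Slowly drag the key encoder toward the query encoder; a large m keeps
    # the dictionary keys consistent across training iterations.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```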

    Section 3.1: can you provide the same type of information as for Cholec80, e.g., the type of surgeries and the image size? Also, reporting the average length of the videos would give an idea of the dataset size.

    P6, Section 3.1, L3: “We sample the videos into 5 fps”. I think the reason should be mentioned, even if it is purely computational.

    P6, Section 3.2, L8: “by self-supervised learning approaches”: you could add a reference to the next section, 3.3, to make clear what these approaches are.

    Table 1: usually, the best value is in bold; however, this common practice is not followed for:
    - Recall on Cholec80: Ours → MoCo v2
    - Precision on M2CAI16: Ours → MoCo v2
    Table 3: similarly for Recall, where the third configuration of the ablation study, not the second, gives the highest value.

    Tables 1, 2 and 3: can you center the metric names? Tables 2 and 3: Can you add in the legend that the results are given for Cholec80?

    ** Typos **
    - P1, Paragraph 1, L1: require → requires
    - P2, Paragraph 1, L6: corrupted images reconstruction → reconstruction of corrupted images
    - P3, 2nd bullet point: dataset to improve → dataset improves
    - P3, Section 2, Paragraph 1, L2: Section. 2.1. → Section 2.1.
    - P3, Section 2, Paragraph 1, L3: illustrated → presented
    - P4, Legend of Figure 2, L4: model ,i.e., → model, i.e.,
    - P4, Last Paragraph, L5: formulate → formulated
    - P5, Section 2.3, Paragraph 1, L7: i.e.sim → i.e., sim
    - P5, Section 2.3, Paragraph 2, L4: the the → the
    - P6, Section 3.2, Paragraph 1, L1: frames → frame
    - P6, Section 3.3, Paragraph 1, L5: the performance of self-supervised training on ImageNet outperforms that trained from scratch → to reformulate (“performance… outperforms”)
    - P6, Section 3.3, Paragraph 1, L7: motivate → motivates
    - P6, Section 3.3, Paragraph 1, L9: outperform → outperforms
    - P7, Section 3.4, Paragraph 2, L1: We conduct ablation study → We conduct an ablation study
    - P7, Section 3.4, Paragraph 2, L4: model ,i.e., → model, i.e.,
    - P7, Section 3.4, Paragraph 2, L4: can not → cannot
    - P8, Paragraph 1, L2: fine-tuning → fine-tune
    - P8, Paragraph 1, L3: approaches → approach
    - P8, Figure 3, Legend, L3: scratch, → scratch.
    - P8, Figure 3, Legend, L4: MoCo v2. → MoCo v2, respectively.

    ** Future works ** It would have been interesting to get an idea of the performance of the proposed method on videos from other surgeries, such as cataract surgery. See the Cataract-101 dataset: “Cataract-101 - Video dataset of 101 cataract surgeries”, K. Schoeffmann et al., Proceedings of the 9th ACM Multimedia Systems Conference, MMSys 2018, pp. 421–425, 2018.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The challenge tackled by this paper and the proposed strategy are relevant. The strategy combines three previously published components, but remains novel since this combination has never been evaluated and applied to surgical video understanding. The missing related works do not impact the novelty of this work. The paper appears quite promising to me, inviting a significant number of additional experiments to see how far already-trained models can take us on surgical video understanding problems. Because it opens a new door for this problem, I think the paper can be accepted at MICCAI.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper investigates the use of self-supervised training for surgical video analysis tasks. This is an important topic for the community, given the scarce amount of labeled data available.

    The paper proposes a methodology to leverage large-scale publicly available datasets to enhance performance on surgical video analysis tasks. Semantic-preserving training (via contrastive learning) for the teacher network and a distillation objective function for the student network are interesting and impactful techniques.

    The authors have run experiments on two publicly available datasets and compared their method with a few other self-supervised approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a well-written paper with clear motivation, description of methods and experimental validation on publicly available datasets.

    The paper proposes a novel training technique to leverage large-scale publicly available datasets to enhance performance in surgical video segmentation tasks. The method is well described and compared with other SOTA approaches.

    Experimental validation and ablation studies are reasonable and reproducible on publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In recent years, the computer vision community has shifted to 3D ConvNet-style models for video action recognition. Models like I3D, SlowFast, TimeSformer, and Swin have been consistently beating ResNet+LSTM/GRU/TCN-style models. The proposed method applies to the older approaches and not to the newer models that are more applicable to this problem. It would be good to apply the same method to Kinetics for pre-training a clip-based model and then evaluate it with a 3D ConvNet.

    It would be good to add ablation studies on the effect of the parameters Tau and Lambda.

    mAP is a metric often used in the computer vision community for evaluating action recognition models on long videos. It might be good to add this metric to the paper for completeness.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is reproducible. Methods are clearly described and tested on publicly available datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It would be good to add ablation studies on the effect of the hyper-parameters Tau and Lambda.

    mAP is a metric often used in the computer vision community for evaluating action recognition models on long videos. It might be good to add this metric to the paper for completeness.

    The method should be applied to a more up-to-date approach for video action recognition, i.e., 3D ConvNets, which outperform 2D-CNN-based models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clear motivation and clinical relevance, novel method with clear description, complete experiments to show the impact.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This manuscript presents an approach to adapt a self-supervised method (MoCo v2) from the general computer vision domain to the surgical domain. The authors propose a two-stage self-supervised training approach: in the first stage, a “semantic-preserving training scheme” trains just the projection head of the self-supervised model on the surgical videos; in the second stage, the trained model from the first stage, called the teacher model, guides a student model using distillation and contrastive losses. They show improved results with their training approach on the Cholec80 and M2CAI16 datasets.
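
    For clarity, the first stage as I read it amounts to the following (an illustrative PyTorch sketch with hypothetical backbone and head choices, not the authors' code):

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical stage-1 setup: freeze an ImageNet-pretrained backbone and
# adapt only the projection head to the surgical domain with the
# contrastive objective, preserving the generic semantics.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = nn.Identity()  # expose the 2048-d pooled features
projection_head = nn.Sequential(  # MoCo v2-style two-layer MLP head
    nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128))

for p in backbone.parameters():
    p.requires_grad = False  # the "semantic-preserving" part

optimizer = torch.optim.SGD(projection_head.parameters(), lr=0.03, momentum=0.9)
```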

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper presents the first study using the state-of-the-art self-supervised approaches in the surgical domain.
    • The proposed two-stage self-supervised training approach provides improved results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The results show some counter-intuitive analysis, where the results obtained with 20% of the annotated labels are better than those obtained with 100% of the labels.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Datasets and code: the authors used public datasets for their methods. The authors have neither provided nor mentioned the availability of the models or the training/evaluation code upon acceptance.

    • Experimental results: no results on different hyperparameter settings or on the sensitivity of the results to the hyperparameters. The authors used fixed hyperparameters.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper is well written and presented. It proposes to distill the abundant knowledge learned from general computer vision datasets by current self-supervised approaches into the surgical domain. The proposed two-stage training approach, which preserves the semantics in the first stage and distills the knowledge in the second, helps obtain improved results. However, Fig. 3 shows some counter-intuitive results, where using a smaller percentage of labels gives better results than using all the labels. For example, results with 20% of the labels achieve more than 90% accuracy on the Cholec80 dataset, whereas they reach 87% accuracy with 100% of the labels. The authors should thoroughly check the correctness of the results in Fig. 3. Moreover, the results also suggest (if we assume they are off by some margin) that the underlying task is relatively straightforward: the methods reach more than 80% accuracy on the Cholec80 dataset using as few as 5% of the labels. The authors should discuss these points or apply their approach to more complex surgical data science tasks to evaluate it.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors proposed an effective self-supervised approach for surgical workflow recognition and obtained better results, specifically in the low-labeled-data regime. The authors should consider addressing the points discussed above.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a self-supervision framework for surgical video understanding, proposing a novel distilled contrastive learning scheme to transfer the representation ability of publicly available models trained on large datasets and thereby improve self-supervised training on surgical data. The framework is validated extensively on two surgical phase recognition benchmarks (Cholec80 and M2CAI16), indicating that the method can improve the performance of existing self-supervised learning methods. The topic is of interest and well motivated, the paper is well written, the approach is novel, and the validation experiments are thorough. Feedback from the reviewers regarding the state-of-the-art, further discussion and clarification of the results (Fig. 3), consideration of other metrics, and improvements to figures and text should be incorporated in the final submission.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

Meta-Reviews: Thank you for the comments. We will revise the manuscript accordingly.

Reviewer #1: Thank you for the comments.

Q1: Typos, missing definitions, notation inconsistencies and writing suggestions: We have revised the manuscript following your suggestions to polish our paper.

Q2: Redundant key contributions: We have addressed this problem in the revised manuscript; specifically, we merged the first and second contributions.

Q3: Future works: Our proposed approach is general and can be applied to different surgical videos. In future work, we will apply our method to the dataset mentioned by the reviewer, i.e., the Cataract-101 dataset.

Q4: Reproducibility: we will release our code at https://github.com/xmed-lab/DistillingSelf

Reviewer #2: Thank you for the comments.

Q1: 3D backbone: All current surgical video works use a 2D backbone to extract frame-wise features. In our paper, we follow the same setting for a fair comparison. In future work, we will follow the reviewer's suggestion and use a 3D backbone to improve the performance.

Q2: Ablation studies on the effect of parameters Tau and Lambda: For Tau, we follow MoCo v2. We conducted ablation studies on the effect of Lambda on Cholec80 (accuracy in %): Lambda = 1 → 86.3; 2 → 86.6; 3 → 86.7; 4 → 87.1; 5 → 87.3; 6 → 87.0. Due to the page limit, we cannot provide detailed results in the current version.
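
(Presumably, Lambda weights the distillation term in a combined objective of the form L_total = L_contrastive + Lambda * L_distill; this is an inference from the abstract rather than an explicit statement in the rebuttal. The sweep suggests the result is fairly insensitive to Lambda, peaking at Lambda = 5.)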

Q3: mAP metric: We follow the same evaluation protocol as current state-of-the-art methods.

Reviewer #3: Thank you for the comments.

Q1: The model trained with 20% of the labels outperforms the model trained with 100% of the labels: In Tables 1-3, the models are obtained by linear fine-tuning, while in Fig. 3, the models are obtained by full fine-tuning. Hence, models trained with 20% of the labels in Fig. 3 outperform the models trained with 100% of the labels in Tables 1-3. We have added this explanation in Section 3.4 of the revised manuscript.
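
To illustrate the difference (a schematic sketch, not our exact training code): linear fine-tuning freezes the backbone and trains only the classifier, whereas full fine-tuning updates all parameters, which explains the higher absolute numbers in Fig. 3.

```python
import torch
import torch.nn as nn
import torchvision

num_phases = 7  # e.g., the 7 surgical phases of Cholec80

model = torchvision.models.resnet50(pretrained=False)
model.fc = nn.Linear(model.fc.in_features, num_phases)

# Linear fine-tuning (Tables 1-3): freeze everything except the classifier.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc.")
linear_opt = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1)

# Full fine-tuning (Fig. 3): all parameters are updated.
for p in model.parameters():
    p.requires_grad = True
full_opt = torch.optim.SGD(model.parameters(), lr=0.01)
```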

Q2: Ablation studies on parameters: Please refer to Reviewer #2 Q2.


