
Authors

Amine Yamlahi, Thuy Nuong Tran, Patrick Godau, Melanie Schellenberg, Dominik Michael, Finn-Henri Smidt, Jan-Hinrich Nölke, Tim J. Adler, Minu Dietlinde Tizabi, Chinedu Innocent Nwoye, Nicolas Padoy, Lena Maier-Hein

Abstract

Surgical scene understanding is a key prerequisite for context-aware decision support in the operating room. While deep learning-based approaches have already reached or even surpassed human performance in various fields, the task of surgical action recognition remains a major challenge. With this contribution, we are the first to investigate the concept of self-distillation as a means of addressing class imbalance and potential label ambiguity in surgical video analysis. Our proposed method is a heterogeneous ensemble of three models that use Swin Transformers as backbone and the concepts of self-distillation and multi-task learning as core design choices. According to ablation studies performed with the CholecT45 challenge data via cross-validation, the biggest performance boost is achieved by the usage of soft labels obtained by self-distillation. External validation of our method on an independent test set was achieved by providing a Docker container of our inference model to the challenge organizers. According to their analysis, our method outperforms all other solutions submitted to the latest challenge in the field. Our approach thus shows the potential of self-distillation for becoming an important tool in medical image analysis applications.
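
The soft-label idea in the abstract can be illustrated with a minimal sketch. It assumes a multi-label setting in which a teacher's per-class sigmoid outputs replace the 0/1 annotations as training targets for a student of the same architecture. The function names, the four-class example and the plain binary cross-entropy loss are illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def soft_labels(teacher_logits):
    """Turn teacher logits into per-class soft targets in (0, 1)."""
    return [sigmoid(z) for z in teacher_logits]

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical frame with four triplet classes; only class 0 is annotated.
hard = [1.0, 0.0, 0.0, 0.0]
# Hypothetical teacher logits: the teacher also sees evidence for class 1.
teacher_logits = [3.0, 1.0, -4.0, -5.0]
soft = soft_labels(teacher_logits)

# Hypothetical student predictions for the same frame.
student_probs = [sigmoid(z) for z in [2.5, 0.5, -3.0, -4.0]]

loss_hard = bce(student_probs, hard)  # penalises class 1 as a pure negative
loss_soft = bce(student_probs, soft)  # class 1 is treated as plausibly present
```

With soft targets, a class that the teacher considers plausible but that is absent from the annotation no longer counts as a strict negative, which is one way soft labels could help with ambiguous or missing labels.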

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_61

SharedIt: https://rdcu.be/dnwP5

Link to the code repository

N/A

Link to the dataset(s)

https://github.com/CAMMA-public/cholect45


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper addresses the problem of surgical action recognition on a surgical video dataset. The proposed approach uses self-distillation to boost the performance of a single classifier, with the Swin Transformer as the backbone model. Results on an open-source dataset, presented together with an ablation study, show that the best-performing configuration combines the Swin Transformer with self-distillation and an ensemble of three student models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The visualization of the dataset is very clear and the task at hand is well presented and detailed.
    • Self-distillation, which is the motivation of the paper, clearly yields the best-performing model.
    • Results with an ablation study are well presented to motivate the usage of each feature in the proposed model.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Figures should be in a vector format for better resolution; the ones in the paper appear to be PNG images.
    • Self-distillation essentially means that the teacher and student models share the exact same architecture. For this reason, the authors must change how the student model is presented. In Figure 2.a, the student appears to contain three Swin Transformers. This came as a surprise, and only on reading the rest of the paper did it become clear that it denotes an ensemble of three Swin Transformers. The figure should state this explicitly to avoid misunderstanding the proposed model.
    • In section 3, paragraph “Analysis of soft labels”, I read the authors’ analysis of why soft labels would improve performance several times and still find it hard to follow. The authors are advised to revisit this section and give it a clearer structure.
    • In section 4, point 4 of “findings”: “Ensembling increased performance further, as also suggested by various publications in a wide range of fields.” This is not actually a finding but a known fact. Ensembling will almost always increase performance.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors are advised to divide section 4 into two subsections: one for contributions and findings, and one for a general conclusion.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very interesting, but the authors should make the changes needed to address the weaknesses listed in this review.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A
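
Review #2 notes that the student is an ensemble of three Swin Transformers and that ensembling almost always helps. A minimal sketch of one common fusion rule, unweighted averaging of per-class probabilities, is given below. The fusion rule actually used in the paper is not described in these reviews, so the averaging, the function name and the example values are assumptions.

```python
def ensemble_mean(per_model_probs):
    """Average per-class probabilities over models (unweighted)."""
    if not per_model_probs:
        raise ValueError("need at least one model output")
    n_classes = len(per_model_probs[0])
    if any(len(p) != n_classes for p in per_model_probs):
        raise ValueError("all models must predict the same classes")
    n_models = len(per_model_probs)
    return [
        sum(p[c] for p in per_model_probs) / n_models
        for c in range(n_classes)
    ]

# Hypothetical outputs of three models for one frame and three classes.
probs_a = [0.9, 0.2, 0.1]
probs_b = [0.7, 0.4, 0.1]
probs_c = [0.8, 0.6, 0.4]

avg = ensemble_mean([probs_a, probs_b, probs_c])
```

Averaging smooths out disagreements between the individual models, which is the usual explanation for why an ensemble tends to score at least as well as its typical member.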



Review #4

  • Please describe the contribution of the paper

    This paper presents a framework for surgical action triplet recognition. The framework uses the concept of self-distillation, which includes teacher and student networks. Swin Transformer is adopted as the backbone, and an ensemble architecture is used to combine Transformers of different scales. The proposed framework has been evaluated on the CholecTriplet challenge dataset following the established evaluation pipeline. A detailed ablation study and discussion are provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is well-written and has provided a comprehensive review in the Introduction. The flow of the paper is easy for readers to follow.

    The technical novelty of this paper is sufficient. In particular, the work claims to pioneer the use of self-distillation for surgical data.

    A thorough ablation study is presented. A detailed discussion is also provided and offers sufficient insight.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is the comparison against the state of the art. Only a limited number of approaches are included as baselines.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors confirmed in the form that they will make the code publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper is well written and well structured. Given that the technical novelty is also good, this reviewer has only one comment, regarding the comparison study.

    Please consider including more state-of-the-art approaches in the comparison study. For example, graph neural networks have also been employed for a similar purpose on the same data. It would be nice if the authors could include more representative results of the approaches used in the CholecTriplet challenge for a direct comparison. Apart from the ablation study, it is not straightforward to see that this approach performs better overall than other approaches. Addressing class imbalance is a claimed feature of the proposed framework; this claim would be better justified if the authors compared the framework to more state-of-the-art methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is well written and the novelty is good too. The methodology section provides sufficient detail. A thorough ablation study is also provided. This reviewer feels that the paper would be even better if the authors included more quantitative results comparing the framework against more state-of-the-art methods.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    In this paper, the authors address the task of surgical action triplet recognition in laparoscopic surgery. One of the major challenges associated with this task is the inherent imbalance in the distribution of action triplets during surgery. The authors propose a self-distillation-based approach that uses soft labels to train a student network. This helps in tackling the problem of class imbalance as well as potentially incorrect labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Well written and easy to understand.
    2. While the method of self-distillation using soft labels is similar to the classical knowledge-distillation approach [1], leveraging it as a method to tackle ambiguous/missing annotations is a promising contribution.
    3. Detailed experimental analysis that shows the benefit of each component in the proposed solution.

    [1] Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited novelty. Self-distillation using soft labels is in theory the same as knowledge distillation, where the ‘softmax’ predictions of a teacher network are used to train a student network.
    2. It was hypothesised that soft labels help in tackling faulty annotations. However, since the evaluation is performed using hard labels, which could themselves be faulty, the observed improvement in performance is not justified. Even if the model learns to predict triplets that were originally missed in the annotation, the current evaluation scheme does not factor that in.
    3. Ensembling is considered in the ablation study; however, the importance of each network within the ensemble is not studied. For example, do we need a separate SwinT with multi-task learning of instrument, verb and target when the third model in the ensemble also does the same with the addition of surgical phase prediction?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Details of the network architecture, the dataset used and the choice of training parameters are provided in the paper. However, details about the hardware required to train the proposed model are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The results show considerable improvement over the SOTA method. The ablation study, in which each component is gradually added, shows the impact of the individual components. It would be interesting to see whether a similar improvement in performance can be achieved with simpler models. That could help in establishing the generalisability of the proposed method across several tasks and the possibility of working with limited compute resources. In the current setting, the teacher network is trained using hard labels, which are difficult to acquire for a large dataset. In my view, that remains a challenge to be addressed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method achieves significantly better performance than the current state of the art in the task of surgical action triplet recognition. Results of the ablation studies clearly show the impact of self-distillation using soft labels over the vanilla backbone model. This work shows the need for designing methods that can deal with faulty annotations during training and evaluation in the domain of surgical video analysis.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A
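
Review #1's concern about evaluating against possibly faulty hard labels can be made concrete with a small sketch. It assumes an average-precision-style metric computed per class. The metric, the function name and the four-frame example are illustrative assumptions and are not taken from the paper or the challenge protocol.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean precision at each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical model scores for one triplet class over four frames.
scores = [0.9, 0.8, 0.7, 0.1]

# Hypothetical ground truth: the triplet is present in the first three frames.
true_labels = [1, 1, 1, 0]
# Hypothetical annotation in which the second frame was missed.
annotated_labels = [1, 0, 1, 0]

ap_true = average_precision(scores, true_labels)
ap_annotated = average_precision(scores, annotated_labels)
```

In this example the model ranks all truly positive frames first, yet it scores lower against the incomplete annotation than against the true labels. A correct prediction of a missed triplet is counted as a false positive, which is the effect the reviewer says the current evaluation scheme does not factor in.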




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a self-distillation approach as a means of addressing class imbalance and potential label ambiguity in surgical videos, for the task of surgical action recognition. The approach is validated in a comprehensive ablation study on the CholecT45 dataset, and through external validation on the CholecTriplet 2022 challenge test set demonstrating improvement over the SOTA method. The topic is of interest and well-motivated, the contribution is valuable, the paper is well-written, and experimental results are thorough and well presented.

    Feedback from reviewers regarding further clarification on analysis of soft labels and justification for performance improvement, suggestions for improvements in the figures, and comments regarding the ablation studies should be incorporated in the final submission.




Author Feedback

We thank the reviewers for unanimously suggesting acceptance of our paper and will briefly comment on the suggestions:

Comparison to further methods: Due to the anonymization guidelines we were initially not allowed to reveal our identity as the winner of the CholecTriplet 2022 surgical action recognition challenge, where we naturally competed successfully against the latest state-of-the-art methods. We can now update the manuscript with this information.

Validation with hard labels: We agree on the weaknesses of validation with hard labels, which is a common design decision by state-of-the-art biomedical challenges. On the positive side, we see a key advantage of our approach in the external validation via an international challenge, which prevented us from overfitting on the test data and from implementing competing methods in an “unfair” manner. We will add this information to the manuscript.

Novelty: To our knowledge, at the time of our challenge participation, we were the first to explore the concept of self-distillation in the field of biomedical image analysis. For an in-depth discussion of the concept in relation to the commonly applied knowledge distillation approaches, we refer the reader to a recent ICLR 2023 paper (Allen-Zhu, Z., & Li, Y.).

Clarity of presentation: We have carefully revised our manuscript based on the reviewers’ feedback (e.g., inclusion of hardware used for training, vectorization of Fig. 1).


