
Authors

Roy Hirsch, Mathilde Caron, Regev Cohen, Amir Livne, Ron Shapiro, Tomer Golany, Roman Goldenberg, Daniel Freedman, Ehud Rivlin

Abstract

Self-supervised learning (SSL) has led to important breakthroughs in computer vision by allowing learning from large amounts of unlabeled data. As such, it might have a pivotal role to play in biomedicine where annotating data requires a highly specialized expertise. Yet, there are many healthcare domains for which SSL has not been extensively explored. One such domain is endoscopy, minimally invasive procedures which are commonly used to detect and treat infections, chronic inflammatory diseases or cancer. In this work, we study the use of a leading SSL framework, namely Masked Siamese Networks (MSNs), for endoscopic video analysis such as colonoscopy and laparoscopy. To fully exploit the power of SSL, we create sizable unlabeled endoscopic video datasets for training MSNs. These strong image representations serve as a foundation for secondary training with limited annotated datasets, resulting in state-of-the-art performance in endoscopic benchmarks like surgical phase recognition during laparoscopy and colonoscopic polyp characterization. Additionally, we achieve a 50% reduction in annotated data size without sacrificing performance. Thus, our work provides evidence that SSL can dramatically reduce the need of annotated data in endoscopy.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_55

SharedIt: https://rdcu.be/dnwH1

Link to the code repository

https://github.com/RoyHirsch/endossl

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This article applies the Masked Siamese Networks method to endoscopic video tasks and proposes two new datasets. Adequate experiments and analyses are done, but there is nothing of interest in this paper.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper uses the masked siamese network for video analysis.
    Detailed experiments are done for analysing the performance of the masked siamese network (Assran et al.) with different training methods and configurations on different datasets. As a result, “we show that this methodology results in strong generalization, achieving similar performance using 50% of labeled data in comparison with results obtained when training on the entire labeled dataset.”

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper describes the method and the experiments in detail, but no improvement is shown. What I can see are the new datasets and the recently published deep learning method, which makes this article a technical report rather than a research article. Furthermore, the authors adapt an image classification method to a video task, which is acceptable but not ideal unless it is used only for pre-training.

    For the proposed new datasets, the description and analysis are poor. Is their labeling the same as that of the public datasets? How is the labeling done? How is the correctness of the labels verified, and how are mistakes avoided? Are these labels useful for clinical applications?

    This paper analyzes the effect of the newly proposed model and of various data augmentations to show the improvement of the algorithm. However, the borrowed algorithm does not seem closely related to the task, even though the model is a main part of this paper.

    Some of the numbers in this article should be further checked and the source should be shown in the citation. In Fig.1, “See paper body for details.” is not a description of the figure.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The algorithm is borrowed, but the results cannot be reproduced because the private datasets are not available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper seems to be half-done work; improving the dataset analysis and the algorithm is necessary.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Further improvement is recommended so that the paper is more than a technical report.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper examines the effectiveness of Masked Siamese Networks (MSNs) in learning frame visual representations from a large, unlabeled private endoscopic dataset through self supervision to be utilised for various endoscopy tasks, such as laparoscopy phase recognition and optical characterization of colorectal polyps. The authors demonstrate that these representations achieve state-of-the-art performance and can match fully supervised performance using only 50% of the annotated data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a well-structured evaluation of the usefulness of MSNs in endoscopy application.

    The reported results are significant for the community, as they show that MSNs can improve SOTA performance and reduce the requirement for annotated data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper uses a proprietary dataset and applies standard MSNs to endoscopy applications without modifying the approach for medical imaging. It only showcases how using MSNs can achieve state-of-the-art performance in endoscopy, which has been demonstrated in other domains.

    In Table 2, the fully supervised approaches are consistently implemented using ResNet50 as the frame encoder, whereas the self-supervised models employ Vision Transformers (ViTs) instead. I believe it would have been preferable to include the results using the best-performing ViT as the frame encoder and perhaps discarding the TeCNO result.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code is provided and the reliance of all the important results on a proprietary dataset limits the potential for results replication.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    No architecture details about the Multi-Stage Temporal Convolution Networks (MS-TCN) used for temporal evaluation and its difference in number of parameters with the TeCNO and OperA architecture are reported. I believe they could have been added to the supplementary as well.

    Given the very good performance of the SSL pretrained models in the Low-Shot regime, why did the authors not also compute performance using even less than 12% of the images?

    I would suggest adding more references to the first two paragraphs of the Introduction. Additionally, it might be beneficial to consider reducing the length of Section 2, as it appeared excessively lengthy.

    Very minor - Fig. 3 Low-shot

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper does not introduce or evaluate any notable technical innovation to MSNs for medical imaging and the results are based on a large proprietary dataset inaccessible to the community, I believe it is a comprehensive and robust application study of MSNs for SSL in the field of endoscopy which would likely be of interest to the MICCAI community.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The release of the training code and trained model definitely adds value to the paper. I think this is a well-thought-out application study and, since all my concerns have been addressed in the rebuttal, I am happy to raise my score to Accept.



Review #3

  • Please describe the contribution of the paper

    This paper expands the application of the self-supervised learning approach in medical endoscopy and demonstrates its benefits over supervised counterparts in terms of amount of training data for tasks such as cholecystectomy phase recognition.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript touches on an important topic in the field of deep learning applications in medical endoscopy by utilizing self-supervised methods and comparing their performance to that of supervised methods.

    It is well written and also the results provided are convincing. The references are also relevant and recent.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No code implementation is provided, nor do the authors mention that they will release one in the future. Although it is not required, it could help the reviewers verify the results.

    The overall novelty of the work is on the application side rather than the methodology.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As mentioned earlier, no code implementation is provided, nor do the authors mention that they will release one in the future.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The manuscript reads well and discusses self-supervised methods in deep learning, which is an important topic.

    In the abstract, the sentence which mentions “… 50% the annotated data…” is a bit confusing. It becomes clear when it is mentioned again in the Introduction. It would read better if the authors were clearer about which data the 50% refers to.

    Page 5, Line 6: What was the low confidence threshold used for filtering the scores?

    Page 7, Line 3: typo “…to the to the …”

    Fig 3. PolypSet Characterization, it looks odd that the blue line drops dramatically at 25%. Is there any reason?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and addresses an important topic in the field of machine learning.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present their work on self-supervised learning through use of a Masked Siamese Network that was trained and tested on a large private dataset from 8 hospitals and tested on a subset of the private data and on cholec80.

    The authors perform extensive testing of their application of the previously presented (ECCV) Masked Siamese Network. The evaluation is dependent on the use of a large private dataset of 7877 laparoscopic videos from 8 institutions.

    Strengths: 1) A large dataset from multiple institutions was used and this dataset would be hard to replicate. However, this strength is tempered significantly by the private nature of the data and the lack of details around the data as noted below in weaknesses. 2) The exploration of their data with regards to fractional use of labeled data is well reported and tested.

    Key weaknesses are in lack of methodological novelty. The authors utilize a previously described model and apply it to a very large (relative to other surgical datasets) dataset. As one reviewer suggested, this thus reads more like a technical report on Masked Siamese Networks on a given dataset rather than a research investigation. Secondly, though their dataset is large, it appears to be private and there is little information about the datasets other than authors noting that private pre-trained models were used to generate bounding boxes for the colonoscopy videos. There is no information on how the private laparoscopy dataset was labeled and whether this paralleled the labels in cholec80 on which results are reported. Was the privately pre-trained model that generated labels for their private dataset trained on the publicly available PolypsSet on which their evaluation results are reported? This information is critical to understand potential for contamination of their results.

    Some additional clarification is also requested regarding their use of segments of the annotated data as this is their key result. In Table 1, the authors report pretraining on public data (and separate this from training on ImageNet1K). Is this to mean that pretraining occurred on cholec80 and PolypsSet? These are the only two public datasets mentioned in the article. The results are then reported on cholec80 and PolypsSet. In the Low Shot Regime section, authors note using k portion of annotated videos. Are these the private dataset videos for which labeling was not described? Additional interpretation by the authors on some of their results would be helpful. For example, in Figure 3, what do the authors hypothesize contributes to the drop in performance in ViT-S MSN with 25% of the data in polyp detection when otherwise across tasks each model appears to improve and plateau without marked variability?




Author Feedback

We thank the area chair and the reviewers for their valuable feedback and constructive criticism. Please note that several minor comments (e.g. typos, implementation details, etc.) are not mentioned here due to lack of space, yet we have considered them and revised the manuscript accordingly. Below we address the major reviewers’ concerns.

Impact: Our work aims to provide a comprehensive application study of MSNs for SSL in the field of endoscopy, leading to SOTA results that may be significant to the medical community. Please note that applying MSNs for endoscopy is not straightforward as it requires massive data collection, extensive experiments and methodological innovation by carefully adapting MSNs to this domain. Ultimately, our work adopts state-of-the-art methods to a new problem or context which answers the MICCAI definition of application studies, hence, we submitted it as an application study.

Reproducibility: We understand the concerns regarding the lack of reproducibility. Hence, we will provide our implementation code and trained models, allowing researchers to reproduce and further investigate our results.

Details on Private Data: We understand that there are a few misunderstandings regarding the private datasets and we apologize for that. We would like to clarify the following points:

Data collection - we built two large datasets, one for laparoscopy and the other for colonoscopy. In colonoscopy, we collected videos and then processed them using a pre-trained polyp detector to extract frames of polyps, thus ultimately creating an image dataset of polyps with no labels. It is important to note that the pre-trained detector was trained in a supervised fashion using per-frame bounding boxes, yet the labels contain no information about the polyp types (adenoma or hyperplastic).

Labeling - SSL requires no labels, so we did not annotate the private datasets; hence there is no label information for them.

Training - we perform training in two stages: self-supervised pre-training followed by supervised training. The first stage is done without labels using either private or public datasets. The second stage is supervised training using only the public datasets (PolypSet or Cholec80).

Low-data regime - here we study the model performance when trained on small portions of annotated samples. As the private datasets have no labels, annotated samples are taken solely from public datasets.
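The low-data protocol described above can be illustrated with a minimal sketch. It assumes that subsets are drawn at the level of whole annotated videos with a fixed random seed; the function name, the seed handling, and the 40 placeholder video identifiers are hypothetical and are not taken from the paper or its code release.

```python
import random


def sample_low_shot_videos(video_ids, fraction, seed=0):
    """Return a reproducible subset holding roughly `fraction` of the videos."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    shuffled = sorted(video_ids)  # sort first so the result does not depend on input order
    rng.shuffle(shuffled)
    k = max(1, round(fraction * len(shuffled)))  # always keep at least one video
    return sorted(shuffled[:k])


# Hypothetical identifiers standing in for the annotated public training videos.
train_videos = [f"video{i:02d}" for i in range(1, 41)]

# One subset per annotated-data fraction discussed in the reviews (12%, 25%, 50%, 100%).
subsets = {f: sample_low_shot_videos(train_videos, f) for f in (0.12, 0.25, 0.5, 1.0)}
```

Sampling whole videos rather than individual frames keeps frames of the same procedure from being split between the retained and discarded portions, which is one plausible reading of "k portion of annotated videos".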

Following the above, we will extend the description of the private datasets and emphasize the points mentioned.

Additional Interpretations: We will extend the discussion about the reported results. For instance, the performance drop at 25% in Fig 3. PolypSet Characterization is due to the small size of PolypSet. We found that using only small portions of PolypSet (12%, 25%) hinders the training process and makes it sensitive to the selected portion, as evident from the large variance of the results. Thus, we cannot say the performance at 12% is better than that at 25%. Above 50%, this behavior stops and the training process stabilizes.
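The sensitivity to the selected portion can be made concrete by repeating the subset draw with several seeds and reporting the mean and standard deviation of the resulting scores. The sketch below is an illustration only: `low_shot_spread`, the seed list, and the toy metric (which stands in for "train on the subset, then evaluate") are assumptions, not the procedure used in the paper.

```python
import random
import statistics


def low_shot_spread(items, fraction, evaluate, seeds=(0, 1, 2, 3, 4)):
    """Score several random subsets of the same size; return (mean, std) of the scores."""
    pool = sorted(items)
    k = max(1, round(fraction * len(pool)))
    scores = []
    for seed in seeds:
        subset = random.Random(seed).sample(pool, k)  # a different draw for each seed
        scores.append(evaluate(subset))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy data: 60 hypothetical samples, one third of which carry a positive label.
labels = {i: i % 3 == 0 for i in range(60)}


def toy_metric(subset):
    """Placeholder for a real train-and-evaluate run: positive-label rate of the subset."""
    return sum(labels[i] for i in subset) / len(subset)


mean_12, std_12 = low_shot_spread(list(labels), 0.12, toy_metric)
mean_full, std_full = low_shot_spread(list(labels), 1.0, toy_metric)
```

With the full set every draw contains the same samples, so the spread collapses to zero, while small fractions generally give different scores for different draws. Reporting the spread alongside the mean makes it easier to judge whether a gap such as the one between 12% and 25% is meaningful.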

ResNet50 Baselines: We found that training a fully-supervised ResNet50 leads to better performance than that obtained using ViT-S, which has a comparable number of parameters. For this reason and to be consistent with previous works, we use ResNet50 as a baseline for fully-supervised approaches. We clarify this in the revised paper and include additional ViT-S baselines in the supplementary.

MSN for Video Tasks: Indeed, MSN can be seen as an image classification method which we adapt to video tasks. Yet, pre-training with MSN produces general representations shown to be beneficial for various downstream tasks. To the best of our knowledge, we are the first to demonstrate the effectiveness of MSN for endoscopy, leading to SOTA performance. Furthermore, we utilize the temporal information by training an MS-TCN on top of the per-frame features, making our method suitable for video tasks.
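To give a sense of how much temporal context a temporal convolutional stage can aggregate over per-frame features, the following sketch computes the receptive field of a stack of dilated 1-D convolutions. It assumes kernel size 3 and a dilation that doubles with each layer, as in the original MS-TCN design; the configuration actually used in the paper is not stated on this page, so the layer count below is purely illustrative.

```python
def receptive_field(num_layers, kernel_size=3, dilation_base=2):
    """Number of input frames that influence one output frame of a dilated conv stack."""
    rf = 1
    for layer in range(num_layers):
        dilation = dilation_base ** layer  # 1, 2, 4, ... for the default base
        rf += (kernel_size - 1) * dilation  # each layer widens the window on both sides
    return rf


# Illustrative depth only: a 10-layer stage sees 2047 frames around each prediction.
frames_seen = receptive_field(10)
```

Because the window grows exponentially with depth, a single stage of modest depth can cover a long stretch of a procedure, which is why such a model can add temporal reasoning on top of an image-level encoder.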




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors provide a well constructed and organized rebuttal with point-by-point response to concerns brought up in the initial meta-review and by reviewers. While there is still a concern regarding overall technical novelty and the lack of availability of the dataset, the authors do commit to publicly releasing their code so that other researchers can assess reproducibility on their own datasets. On the basis of their rebuttal, I am inclined to lean toward accept given their extensive and methodical approach to testing of MSNs.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal does a decent job of addressing the concerns. I feel the paper is a good contribution to endoscopic video analysis and something the MICCAI community will want to explore and hear about.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the reviewers have not responded to the rebuttal or changed their scores, I read the rebuttal and I think the authors addressed the majority of the concerns.


