Authors

Zhao Wang, Chang Liu, Shaoting Zhang, Qi Dou

Abstract

Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downtream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods by a significant margin, such as VCL (3.1% F1, 4.8% Dice, and 5.5% F1 for classification, segmentation, and detection) and ST-Adapter (5.9% F1, 9.6% Dice, and 9.9% F1 for classification, segmentation, and detection). Code, datasets, and models are released at https://github.com/med-air/Endo-FM.
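The abstract's idea of pre-training on global and local views can be illustrated with a small sampling sketch. This is a hypothetical illustration only, not the authors' implementation: the function names `sample_view` and `make_views`, the frame counts, the strides, and the crop sizes are all assumptions chosen for readability. It shows one plausible way to draw clips at different frame rates and spatial crops from a single video.

```python
import numpy as np


def sample_view(video, num_frames, stride, crop, rng):
    """Sample one spatial-temporal view from a (T, H, W, C) video array.

    num_frames, stride and crop are illustrative knobs (assumptions),
    not values taken from the paper.
    """
    t, h, w, _ = video.shape
    span = (num_frames - 1) * stride + 1  # frames covered by the clip
    if span > t:
        raise ValueError("video too short for the requested frames and stride")
    start = rng.integers(0, t - span + 1)        # random temporal offset
    idx = start + stride * np.arange(num_frames)  # a larger stride mimics a lower frame rate
    top = rng.integers(0, h - crop + 1)           # random spatial crop position
    left = rng.integers(0, w - crop + 1)
    return video[idx, top:top + crop, left:left + crop]


def make_views(video, rng):
    """Return (global_views, local_views) with assumed sizes."""
    # Global views: more frames and a larger crop.
    global_views = [sample_view(video, 8, s, 112, rng) for s in (1, 2)]
    # Local views: fewer frames and a smaller crop, at several frame rates.
    local_views = [sample_view(video, 4, s, 48, rng) for s in (1, 2, 3, 4)]
    return global_views, local_views
```

For example, calling `make_views(np.zeros((32, 128, 128, 3)), np.random.default_rng(0))` yields two global clips of shape `(8, 112, 112, 3)` and four local clips of shape `(4, 48, 48, 3)`.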

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_10

SharedIt: https://rdcu.be/dnwOK

Link to the code repository

https://github.com/med-air/Endo-FM

https://github.com/openmedlab/Endo-FM

Link to the dataset(s)

https://mycuhk-my.sharepoint.com/:f:/g/personal/1155167044_link_cuhk_edu_hk/EmB8iuYtsGdDrpIQSO6AMHEBtaSW-DY-dRfHCmfd96kCTg?e=KWUWsd

Reviews

Review #5

Please describe the contribution of the paper

The authors have developed an endoscopy video analysis foundation model that is pre-trained on a large-scale dataset. The pre-trained model has shown superior performance compared to the state-of-the-art (SOTA) on three downstream tasks, indicating its potential as a robust and effective tool for endoscopy video analysis.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) The proposed method uses varying frame rates and spatial crops to train a spatial-temporal encoder, which successfully captures the relationships between different spatial-temporal variations. 2) The authors report details of the datasets, training process, and ablation studies, which makes the paper easy to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The paper looks good
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

It is easy to reproduce the work of this paper if the authors release the code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The ablation study, experiments and description of proposed method are sufficient.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The ablation study, experiments and description of proposed method are sufficient.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #1

Please describe the contribution of the paper

This paper presented a large-scale self-supervised pre-training method for endoscopy video analysis and a corresponding foundation model. Specifically, a video transformer (Endo-FM) that captures local and global long-range dependencies was proposed and trained on a large-scale endoscopy video dataset constructed by the authors from public datasets and a private one. Experiments on three downstream tasks show the effectiveness of the proposed method. The main contributions are the proposed method for endoscopy data and the large-scale dataset.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- A new large-scale dataset was presented, which might be useful for follow-up research if it is released.
- The proposed method achieved better performance than the compared recent works.
- Ablation studies were performed to show the effectiveness of the proposed method components.
- The paper is generally well-written and easy to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The motivation of designing a foundation model for endoscopic videos is unclear. From the current description, it was motivated by “lack of foundation models…”, but this was unconvincing. The necessity is unclear.
- It seems the main contribution and the difference from prior works is that the “rich spatial-temporal information” needs to be captured, as claimed by the authors. But it is unclear why this is the “key to learning from endoscopy video data”. Is there any particular difference from other video data?
- The technical novelty and contributions are a bit limited. Considering large pre-training (or foundation) models were proposed in many recent works and also in medical data (as acknowledged by the authors as well), and the similarity to related work [6], the main difference between this work and prior works is the application to the endoscopy data, which is a bit limited.
- Missing discussion of recent foundation (large pre-training) models in medical imaging, e.g. [1], [2], [3], to name a few (though [3] is quite new, so it would not be considered a “missing review”).
  [1] Fu, Zeyu, et al. “Anatomy-aware contrastive representation learning for fetal ultrasound.” Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III. Cham: Springer Nature Switzerland, 2023.
  [2] Tang, Yucheng, et al. “Self-supervised pre-training of swin transformers for 3d medical image analysis.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  [*3] Moor, Michael, et al. “Foundation models for generalist medical artificial intelligence.” Nature 616.7956 (2023): 259-265.
- It is unclear why transformers have to be used instead of regular CNNs.
- The authors claimed that the proposed method was “specifically designed for endoscopy”, but it is unclear what is unique to endoscopy. From the current description, the approach was designed for video data and seems applicable to video data other than endoscopy.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Based on the description, it seems possible to reproduce the main technical method. But given that no code was provided, it is uncertain whether the exact same results could be reproduced.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- It would be better to clarify the uniqueness of endoscopy video data and its key differences and challenges compared to other medical (or even more general video) data. An experimental comparison would be more convincing (note that the reviewer is not asking for more experiments; this is a suggestion for future work).
- As there are segmentation and detection experiments, it would be better to show some qualitative results/comparisons for these tasks.
- It would be better to increase the font size in the charts in Fig. 3.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, this is a borderline paper, showing good results and presenting a method for endoscopy data. But considering the technical contributions and novelty, it makes the reviewer hesitate to recommend a clear accept.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #4

Please describe the contribution of the paper

The article proposes a novel foundation model, Endo-FM, for analyzing endoscopic videos through a self-supervised, spatial-temporal pre-training strategy.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The strength of the proposed approach lies in effectively capturing the rich spatial-temporal information in endoscopy videos through a teacher-student pre-training scheme with spatial-temporal matching on diverse video views. Compared with the baselines trained from scratch, Endo-FM achieves high effectiveness and outperforms state-of-the-art (SOTA) methods on polyp diagnosis, segmentation, and detection tasks. The proposed dynamic motion modeling and prediction during pre-training under dynamic endoscope scenes effectively addresses the challenges posed by varied motion speeds and ranges in different endoscopy videos.
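A teacher-student scheme of the kind mentioned above could look roughly like the sketch below. Everything here is an assumption for illustration: the review does not specify the loss or the update rule, so this uses a generic self-distillation form (temperature-scaled cross-entropy between teacher and student outputs, and an exponential moving average for the teacher weights). The names `matching_loss` and `ema_update` and all default values are hypothetical.

```python
import numpy as np


def softmax(x, temp):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def matching_loss(teacher_logits, student_logits, t_temp=0.04, s_temp=0.1):
    """Cross-entropy between teacher and student distributions.

    The teacher output would come from one view and the student output
    from another view of the same video; temperatures are assumed values.
    """
    p = softmax(teacher_logits, t_temp)                     # target, treated as constant
    log_q = np.log(softmax(student_logits, s_temp) + 1e-12)  # student log-probabilities
    return float(-(p * log_q).sum(axis=-1).mean())


def ema_update(teacher_w, student_w, momentum=0.996):
    """Move teacher weights slowly toward the student weights (assumed rule)."""
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

In this sketch the loss is small when the student agrees with the teacher across views and large when it does not, which is the behaviour a view-matching objective is meant to encourage.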
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The proposed approach could be further evaluated on larger-scale datasets for benchmarking, and for generalizability to other domains of video analysis beyond endoscopy videos.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Because private datasets are involved, I think the work is very difficult to reproduce.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

I am not an expert in this field and I cannot give a definitive opinion. But I’m looking forward to seeing whether the authors will open-source the code.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I am not an expert in this field and I cannot give a definitive opinion. But I’m looking forward to seeing whether the authors will open-source the code.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a pre-trained endoscopy video analysis foundation model that outperforms the state-of-the-art on three downstream tasks. All three reviewers acknowledge the potential impact of the proposed method. The main strengths of the paper include the use of varying frame rates and spatial crop to train spatial-temporal encoders, description of datasets, training process and ablation studies, and the potential for real-world clinical applications.

The main weaknesses are limited technical novelty, unclear motivation, and the lack of code and data release, given that a private dataset of more than 24k videos, which constitutes 70% of the total data used, was used for pre-training. Moreover, an insightful discussion of the contributions and a discussion of existing foundation models in medical imaging are missing. We recommend addressing these weaknesses in the camera-ready version.

Author Feedback

N/A


Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train