Authors

Yu Tian, Guansong Pang, Fengbei Liu, Yuyuan Liu, Chong Wang, Yuanhong Chen, Johan W Verjans, Gustavo Carneiro

Abstract

Current polyp detection methods from colonoscopy videos use exclusively normal (i.e., healthy) training images, which i) ignore the importance of temporal information in consecutive video frames, and ii) lack knowledge about the polyps. Consequently, they often have high detection errors, especially on challenging polyp cases (e.g., small, flat and partially visible polyps). In this work, we formulate polyp detection as a weakly-supervised anomaly detection task that uses video-level labelled training data to detect frame-level polyps. In particular, we propose a novel convolutional transformer-based multiple instance learning method designed to identify abnormal frames (i.e., frames with polyps) from anomalous videos (i.e., videos containing at least one frame with polyp). In our method, local and global temporal dependencies are seamlessly captured while we simultaneously optimise video and snippet-level anomaly scores. A contrastive snippet mining method is also proposed to enable an effective modelling of the challenging polyp cases. The resulting method achieves a detection accuracy that is substantially better than current state-of-the-art approaches on a new large-scale colonoscopy video dataset introduced in this work. Our code and dataset will be publicly available upon acceptance.
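The multiple instance learning setup described in the abstract can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that a video-level score is formed by averaging the k highest snippet-level scores, a common pooling choice in weakly supervised video anomaly detection. The function name `video_score`, the value of `k`, and the example scores are hypothetical and are not taken from the authors' code.

```python
def video_score(snippet_scores, k=3):
    """Average the k highest snippet anomaly scores (assumed top-k MIL pooling)."""
    if not snippet_scores:
        raise ValueError("need at least one snippet score")
    top_k = sorted(snippet_scores, reverse=True)[:k]
    return sum(top_k) / len(top_k)


# Hypothetical snippet-level anomaly scores in [0, 1] for two videos.
normal_video = [0.05, 0.10, 0.08, 0.12, 0.07]
polyp_video = [0.06, 0.11, 0.92, 0.88, 0.09]

normal_score = video_score(normal_video)
polyp_score = video_score(polyp_video)
```

Under this assumption, only the video-level label is needed to supervise the pooled score during training, while the individual snippet scores can be read out at test time to flag frames that are likely to contain polyps.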

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_9

SharedIt: https://rdcu.be/cVRsV

Link to the code repository

https://github.com/tianyu0207/weakly-polyp

Link to the dataset(s)

https://github.com/tianyu0207/weakly-polyp

Reviews

Review #1

Please describe the contribution of the paper

The paper introduces a novel, robust video anomaly classification method to detect polyp frames in colonoscopy videos.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The idea is novel, with new proposed hard/easy example mining based on the task.
2. The illustration is clear.
3. The experiments are robust and the results are promising.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

See comments.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors have provided many experimental details.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. The reason why a depth-wise 1D convolution works better than an FC layer for temporal modeling is not fully explained.
2. If I3D is not fine-tuned on any medical dataset, it is recommended to at least show what the (expected) output of I3D is. Otherwise, it is confusing why it would work, since the pretraining dataset contains nothing similar.
3. One of the key contributions, as stated in the paper, is the selection of hard/easy examples. However, the description is hard to follow. It is strongly recommended to expand Fig. 2 with a much more detailed explanation. For example, how the hard abnormal snippets are formed is not directly shown in the figure.
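For context on comment 1, the operation in question can be sketched in general terms. A depth-wise 1D convolution applies one small temporal filter per feature channel, so it mixes neighbouring snippets along time without mixing channels and uses C × K weights regardless of video length. The kernel size, zero padding, function name, and toy values below are illustrative assumptions, not the configuration used in the paper.

```python
def depthwise_conv1d(features, kernels):
    """Filter each channel along time with its own kernel (zero padding).

    features: list of T time steps, each a list of C channel values.
    kernels:  list of C filters, each of the same odd length K.
    """
    T, C = len(features), len(kernels)
    K = len(kernels[0])
    pad = K // 2
    out = [[0.0] * C for _ in range(T)]
    for c in range(C):
        for t in range(T):
            for j in range(K):
                s = t + j - pad  # index of the neighbouring time step
                if 0 <= s < T:
                    out[t][c] += kernels[c][j] * features[s][c]
    return out


# Toy input: 4 snippets, 2 feature channels.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
# Channel 0 uses a moving average, channel 1 uses an identity filter.
kernels = [[1 / 3, 1 / 3, 1 / 3], [0.0, 1.0, 0.0]]
smoothed = depthwise_conv1d(features, kernels)
```

In this toy case channel 0 is smoothed over adjacent snippets while channel 1 passes through unchanged, which shows that each channel keeps its own local temporal filter.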
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written, although some parts are not perfectly organized, which may be due to limited space.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper proposes a weakly-supervised framework based on transformers and a contrastive snippet mining approach to identify frames with abnormalities (e.g., polyps) in colonoscopy videos. An imbalanced dataset is collected from publicly available colonoscopy data and used in this work. The method is compared against several SOTA methods, and the results indicate better performance in abnormality detection.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper deals with one of the challenging problems in colonoscopy: detecting abnormalities, especially in video snippets with flat or small polyps. A dataset including normal and abnormal (polyp) data is collected from publicly available data.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- A clarification of the Cls token is required: “The Cls token is applied for a video classifier to predict if a video contains anomalies.” Please elaborate on this, as not all readers are familiar with transformers.
- The paper seems to be missing a comparison against another SOTA method that uses a similar snippet mining approach: “CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning, Zhang et al, 2021”.
- More clarification about the dataset is required, as one of the claims is that this work performs better when small or flat lesions appear. Therefore, a table presenting the percentage of these kinds of lesions and the performance of the framework on them is required.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Some information is missing, for example how the Cls token is calculated, but I assume that is explained in the code repository.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

please refer to section 5
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper proposes a framework to identify abnormalities (frames that include polyps) from colonoscopy video snippets. The authors collected a dataset from publicly available data. This data should be approved by experts before publication, and further experiments are required to validate this method against the one which deployed a similar approach.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The authors propose to use I3D to extract features from colonoscopy videos for the convolutional transformer to achieve Polyp Frame Detection. The authors utilize multiple instance learning (MIL) and contrastive snippet mining (CSM) to further improve the performance of the model. Experiment results show that their proposed method can outperform some SOTA methods for Polyp Frame Detection on a new dataset.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors combine multiple methods in the literature as their method for Polyp Frame Detection. This combination does have its advantage and outperforms many SOTA methods. From this aspect, the system is well-designed and has some novelty.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The authors did not cite previous work in the Contrastive Snippet Mining section. Please also fix the words in the contribution section to reflect this.

Previous work is here: Zhang, Can, et al. “Cola: Weakly-supervised temporal action localization with snippet contrastive learning.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

It is called Snippet Contrast (SniCo) Loss in the above-mentioned paper. The Polyp Frame Detection is a special use case in the Temporal Action Localization problem category.

(2) The authors combined several public colonoscopy datasets and built a new large-scale, diverse colonoscopy video dataset to benchmark their method against other methods. However, the authors only provided the training details for their proposed method. As this is a new dataset, training parameters might differ from those in the original papers. Please also provide training details for all other methods used in the benchmark. It would also be clearer for readers if the authors listed the differences between methods in Table 1 (for example: Is MIL used? Which loss is used in training?).
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide training details for their proposed method in the paper and they will share their code and dataset if the paper is accepted. I do not see a reproducibility issue for their proposed method.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

For weakness point (1), please cite previous work or share the reasoning behind why citation is not needed. For weakness point (2), please provide training details for all methods benchmarked on the new dataset.

It would be helpful to share the reasoning behind combining multiple open-source datasets instead of applying the proposed method to each dataset and comparing it with the papers published on that dataset.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper is more of an application paper in my eyes. Although the building blocks of the system come from the literature, the proposed system does have a great performance. It can be a novel application paper from this aspect. This is the reason I vote “Weak accept” for the paper.
Number of papers in your stack

6
What is the ranking of this paper in your review stack?

4
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The reviewers have consistent reviews of your paper. I believe that their comments are of great value, so please take them into careful consideration to further enhance your paper.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

1

Author Feedback

We appreciate the reviews from all reviewers. We will clarify that the I3D is not fine-tuned on any medical datasets. We will follow R1’s suggestion to refine the description of the selection of hard/easy examples. We will follow R2’s suggestion to add more details about the transformer. We will cite the paper “CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning, Zhang et al, 2021” in the final version. Following R3’s suggestion, we will add more details about the comparison methods in the supp material.


Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection