
Authors

Weijie Ma, Ye Zhu, Ruimao Zhang, Jie Yang, Yiwen Hu, Zhen Li, Li Xiang

Abstract

Colorectal polyp classification is a critical clinical examination. To improve classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, acquiring these specific images requires manually switching the light mode after polyps have been detected using White-Light (WL) images, since NBI alone usually suffers from missed detections in real clinical scenarios. To avoid this situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e., NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens for a specific modality. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer learns to keep both global and local representation consistency across the two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a clear margin, realizing multi-modal prediction with a single Transformer while greatly improving classification accuracy when only WL images are available.
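The SAM-style similarity described above can be sketched as follows. This is a minimal illustration with assumed tensor shapes and a cosine-similarity formulation, not the authors' actual implementation (which is available in the linked code repository).

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative SAM-style map: similarity between the class token and
    every patch token of a single-modality token sequence.

    tokens: (B, 1 + N, D) output of a Transformer block; index 0 is the
            class token, the remaining N entries are patch tokens.
    Returns a (B, N) attention map over patches, normalized with softmax.
    """
    cls_tok = tokens[:, :1, :]                            # (B, 1, D)
    patches = tokens[:, 1:, :]                            # (B, N, D)
    sim = F.cosine_similarity(cls_tok, patches, dim=-1)   # (B, N)
    return sim.softmax(dim=-1)

# Toy usage with hypothetical shapes: 196 patches, 384-dim embeddings.
maps_wl = spatial_attention_map(torch.randn(2, 197, 384))
maps_nbi = spatial_attention_map(torch.randn(2, 197, 384))
```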

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_14

SharedIt: https://rdcu.be/cVRs0

Link to the code repository

https://github.com/WeijieMax/CPC-Trans

Link to the dataset(s)

The public part of the dataset: https://drive.google.com/drive/folders/1e2t5HhQf08sTAE_CPRNVgpi6YUKgQSHn?usp=sharing

The bounding-box annotation of the dataset: https://drive.google.com/file/d/1K06-VFm6b64Rhu-ehBtJ4OY6Yk7YZyIm/view?usp=sharing


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript presents an approach for automatic colorectal polyp classification via structured cross-modal representation consistency on WL and NBI images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and addresses a clinical issue. It also presents a newly designed Spatial Attention Module to calculate similarities.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed method is evaluated on a relatively small dataset, and no information is provided regarding computation costs, nor is there a discussion of real-time application in clinical settings.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The technical implementation process is well described and the method is tested on public datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Did you perform any image augmentation in order to increase the number of images and avoid possible overfitting?
    2. Please clarify whether the results presented in Table 1 and Table 2 are on the same dataset or not.
    3. I highly recommend adding an explanation regarding the real-time clinical application of the proposed approach.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an interesting approach that can be a solution to a clinical problem. However, evaluation on a bigger dataset, a better presentation of the advantages over the state of the art, and a better discussion of real-time clinical use need to be addressed in the manuscript.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript presents an algorithm based on structured cross-modal representation consistency for polyp classification in colonoscopy images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Well written paper
    • The addressed topic is of interest for the surgical data science community
    • Experiments and comparison with the literature are performed
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The methodological innovation should be stated clearly
    • The research hypotheses of the work should be stated (e.g., which is the hypothesis behind using SAM?)
    • Limits in the literature should be highlighted.
    • The discussion of the results can be improved to give more insights to the readers.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The introduction is a bit verbose (avoid using the same sentences for both abstract and introduction)
    • The survey of the state of the art should be improved. Limits in the state of the art have not been clearly highlighted. Open challenges the authors want to tackle are not listed.
    • The listed contributions are not clear to me. The authors state that they introduce a general framework of multi-modal learning for medical image analysis, but only one dataset is considered for the experiments. Point (3) is not a contribution.
    • The authors should try to clarify why the introduced contributions allow them to overcome the state of the art (without just presenting the results).
    • The clinical rationale behind the work is not clear. The authors write “Despite the enhanced imaging, endoscopists rely on WL images before they change the light mode to detect the possible polyps, which means that the WL images may fail and lead to missing detection”. What does this sentence mean? If clinicians have the possibility to exploit NBI, I do not see the point of developing algorithms that work with WL. The issue may be that some centers do not have NBI endoscopy.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Nice paper but there are some weaknesses that have to be solved.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This manuscript proposes a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This manuscript provides detailed mathematics, designs, and implementations for each module, making it easy to follow even for readers who are not familiar with attention models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The results section needs more visual representation.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The network design is very easy to follow; with the information provided by the authors, it should be possible to reproduce the network without major issues.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please provide more information on attention and vision transformers in the medical imaging domain, rather than the general computer vision domain.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This manuscript uses a currently popular deep learning architecture, the attention-based vision model, to solve an important cross-modal colorectal polyp recognition problem. It also improves the baseline model with its newly designed module. Similar methods should be applicable to other related problems.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper has sufficient originality related to using a spatial attention module for colorectal polyp recognition. Reviewers think that the experiments are well described and conclusive, and that the paper is well written with detailed mathematics, making it easy to follow even for readers without much knowledge of attention modules. However, the paper does have moderate weaknesses related to insufficiently highlighted contributions and an unclear clinical rationale. After considering all reviewers’ feedback, the area chair recommends accepting this paper for publication.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8




Author Feedback

We sincerely thank the reviewers for their comments. We summarize the reviewers’ comments and suggestions and respond to the main concerns in the following aspects; the Q&A list is as follows:

Reviewer 1 Q1. The proposed method is evaluated on a relatively small dataset, and no information regarding computation costs or real-time application in clinical settings is provided. A1. Thanks for pointing out this problem. We believe that the CPC-Paired Dataset we used is already a relatively large dataset for current colorectal polyp classification tasks, given the difficulty of obtaining medical images. In the testing phase, the size and FLOPs of our backbone are similar to (or even less than) those of ResNet-50. Moreover, in the training stage, unlike the previous SOTA method that uses three extra, much larger networks to improve the main model, we only adopt one simpler cross-modal attention module as the external part, so the reduction in computation cost is clear. For real-time application in clinical scenarios, only WL images are required for our proposed method to perform accurate colorectal polyp classification. Our model does not need any NBI images at the inference stage, since it has already extracted more discriminative representations of WL images from NBI through global and local consistency learning in the training phase.

Q2. Did you perform any image augmentation in order to increase the number of images and avoid possible overfitting, and are the presented results in Table 1 and Table 2 on the same dataset or not? A2. Thank you for pointing out this problem. As discussed in the paper (Sec. 4.2, Implementation Details), we only adopted random resized cropping and horizontal flipping as augmentation to avoid possible overfitting, but did not increase the number of training images. The presented results in Table 1 and Table 2 are both on the CPC-Paired Dataset.
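The two augmentations mentioned in A2 correspond to standard torchvision transforms; the sketch below is illustrative only, and the image size, crop scale, and normalization statistics are assumptions rather than values reported in the paper.

```python
from torchvision import transforms

# Random resized crop + horizontal flip, as mentioned in A2.
# Image size, crop scale, and normalization statistics are illustrative only.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```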

Reviewer 2 Q3. The methodological innovation should be stated clearly, the research hypotheses of the work should be stated (e.g., which is the hypothesis behind using SAM?), and the authors should clarify why the introduced contributions allow them to overcome the state of the art (without just presenting the results). A3. Thanks for indicating this problem. The module we proposed is motivated by the fact that NBI is capable of presenting more discriminative visual information than the relatively ordinary WL images. On the basis of cross-modal global alignment, we utilize the aggregated proxy (the class token) to further interact with the local features, which eventually produces a domain-specific response map. As a result, after constraining the distance between the aforementioned correlation maps, the model focuses more on lesion-relevant but possibly indistinct areas of the WL image. The visualization results also support our hypothesis: more semantically meaningful areas are attended to, rather than the previously false regions. In contrast, the previous state-of-the-art method is complex and somewhat over-engineered: it uses three extra, heavier networks to promote the main student network, and requires more training modules and parameters to construct auxiliary losses. Moreover, its interpretability appears weaker based on its visualization results; readers can only loosely infer the semantic representation of that model from the similar color style. We instead exploit the potential of the class proxy based on the characteristics of our architecture and the attention mechanism.
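The global and local consistency described in A3 can be sketched roughly as follows; the distance functions (MSE for class tokens, KL divergence for response maps) and loss weights are assumptions for illustration, not necessarily those used in the paper.

```python
import torch.nn.functional as F

def cross_modal_consistency(cls_wl, cls_nbi, map_wl, map_nbi,
                            w_global=1.0, w_local=1.0):
    """Hypothetical global + local consistency loss.

    cls_wl, cls_nbi: (B, D) class tokens of paired WL / NBI images.
    map_wl, map_nbi: (B, N) SAM response maps over patch tokens
                     (assumed to be softmax-normalized).
    """
    # Global consistency: align the class-token (proxy) representations.
    loss_global = F.mse_loss(cls_wl, cls_nbi)
    # Local consistency: align the spatial response maps, here via KL divergence
    # from the NBI map (treated as the reference signal) to the WL map.
    loss_local = F.kl_div(map_wl.clamp_min(1e-8).log(), map_nbi,
                          reduction="batchmean")
    return w_global * loss_global + w_local * loss_local
```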

Reviewer 3 Q4. Please provide more information on attention and vision transformers in the medical imaging domain, rather than the general computer vision domain. A4. Thanks for the kind suggestion; we will add relevant material to the final paper.


