
Authors

Linghan Cai, Meijing Wu, Lijiang Chen, Wenpei Bai, Min Yang, Shuchang Lyu, Qi Zhao

Abstract

Automatic and precise polyp segmentation is crucial for the early diagnosis of colorectal cancer. Existing polyp segmentation methods are mostly based on convolutional neural networks (CNNs), which usually use global features to enhance local features through well-designed modules, thereby handling the diversity of polyps. Although CNN-based methods achieve impressive results, they cannot model explicit long-range relations, which limits their performance. Unlike CNNs, the Transformer has a strong capability for modeling long-range relations owing to self-attention. However, self-attention tends to spread attention to unexpected regions, and the Transformer's local feature extraction is insufficient, resulting in inaccurate localization and fuzzy boundaries. To address these issues, we propose PPFormer for accurate polyp segmentation. Specifically, we first adopt a shallow CNN encoder and a deep Transformer encoder to extract rich features. In the decoder, we present PP-guided self-attention, which uses prediction maps to guide self-attention toward hard regions so as to enhance the model's perception of polyp boundaries. Meanwhile, a Local-to-Global mechanism is designed to encourage the Transformer to capture more information within local windows for better polyp localization. Extensive experiments on five challenging datasets show that PPFormer outperforms other advanced methods and achieves state-of-the-art results on six metrics, e.g. mean Dice and mean IoU.
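
Since no code is released (the repository link below is N/A), the following heavily hedged skeleton illustrates the data flow the abstract describes. Every concrete choice (channel width, layer count, a vanilla nn.TransformerEncoder standing in for the deep Transformer encoder) is an illustrative assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class PPFormerSketch(nn.Module):
    """Illustrative skeleton: a shallow CNN encoder plus a deep
    Transformer encoder producing a coarse prediction map that, in the
    full model, would guide the decoder's PP-guided self-attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.cnn_encoder = nn.Sequential(              # shallow CNN branch (local features)
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer_encoder = nn.TransformerEncoder(   # stand-in for the deep branch
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.predict = nn.Conv2d(dim, 1, 1)            # coarse per-pixel prediction head

    def forward(self, x):                              # x: (B, 3, H, W)
        f = self.cnn_encoder(x)                        # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)          # (B, N, dim) token sequence
        g = self.transformer_encoder(tokens)           # long-range context via self-attention
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.predict(g))          # prediction map for guiding attention

pred = PPFormerSketch()(torch.randn(1, 3, 64, 64))    # (1, 1, 16, 16)
```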

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_60

SharedIt: https://rdcu.be/cVRwN

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes PPFormer for accurate polyp segmentation. It combines a Transformer and a CNN to improve polyp segmentation accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this paper is the combination of a Transformer and a CNN: the Transformer is used to capture long-range features. The authors achieve very high scores on several datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some rewriting is needed to improve readability. Also, there is no explanation or presentation of polyp segmentation from a medical perspective: what is the detection rate for flat polyps? What about concave polyps? The paper also depends heavily on public datasets.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No problem.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In the local-to-global approach, do you think more multi-scale layers would improve segmentation accuracy?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a good example of combining a Transformer and a CNN.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper proposes a novel Transformer block, called the L2G PPFormer block, to better model boundary information in polyp segmentation. Experiments demonstrate that the proposed method advances the SOTA results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) A novel self-attention block that encourages the model to focus on high-confidence regions, which potentially improves the quality of attention; (2) a bottom-up, two-stage self-attention strategy that better captures local context. Both techniques are shown to improve performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Some details are missing from the paper, e.g., how does the Transformer decoder work, and how is the attention map in Figure 1 generated? (2) There is no comparison in terms of model size and FLOPs.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some model details are missing. The authors do state in the reproducibility checklist that they will release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    (1) Regarding PP-guided self-attention: (a) how is the parameter alpha properly determined, and is the model's performance sensitive to the choice of alpha? (b) The vanilla attention matrix, i.e. M_{SA} in the paper, is normalized by a softmax operation, but the proposed attention is not; does this affect model performance and training stability? (A sketch of one plausible form of the combination is given below.)

    (2) In Table 1, some numbers are inconsistent with prior works; e.g., TransFuse [24] achieves 0.942 mDice on ClinicDB in its own paper, but the number in Table 1 is only 0.908. Why is the gap so large? Also, to ensure a fair comparison with TransFuse and PraNet, please compare model size and inference speed.

    (3) Where does the attention map in Figure 1 come from? The paper claims that the proposed method enhances the model's perception of the polyp boundary; it would be better if the paper computed the similarity between the attention map and the ground-truth segmentation in the boundary region.

    (4) Most medical segmentation applications involve 3D images. It would be good to see how the proposed model performs on 3D data, e.g. MRI or CT, and to compare the results with 3D nnU-Net (https://github.com/MIC-DKFZ/nnUNet).
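
    For concreteness, here is a minimal sketch of the combination questioned in (1b), assuming the guidance term is the outer product of the flattened prediction map, added to the softmax-normalized M_{SA} with weight alpha. The function name, shapes, and exact form are illustrative assumptions, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def pp_guided_attention(q, k, v, pred, alpha=1e-2):
    """Hypothetical PP-guided self-attention.
    q, k, v: (B, N, d) token projections; pred: (B, N, 1) flattened
    prediction map with values in [0, 1]; alpha scales the guidance term."""
    d = q.size(-1)
    m_sa = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)  # normalized M_SA
    m_pp = pred @ pred.transpose(-2, -1)  # (B, N, N) prediction-based guidance
    attn = m_sa + alpha * m_pp            # combined matrix; rows no longer sum to 1
    return attn @ v
```

    With alpha = 0 this reduces to vanilla attention, which is consistent with the rebuttal's ablation (mean Dice 0.933 at alpha = 0 vs. 0.946 at alpha = 1e-2 on ClinicDB).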

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a novel Transformer block for polyp segmentation and advances the SOTA considerably. On the other hand, it lacks an analysis of model size and inference speed, which weakens the comparison with prior models.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present a set of attention techniques to perform polyp segmentation on colonoscopic images. The main specific contributions are:

    • PPFormer, a neural network combining Transformers and CNNs for global and local feature extraction, respectively.
    • PP-Guided self-attention, a technique used to guide the model to focus on regions that are difficult to classify.
    • L2G (local to global), a mechanism designed to capture first local then global information in each transformer block.
    • State-of-the-art results on multiple representative datasets.

    Note: “PP” refers to a dot product performed in the L2G blocks, P·P, where P is the flattened feature map from a level with smaller resolution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method presented in the paper is well explained and the design decisions are well justified. It mixes modern techniques from other works that leverage both attention mechanisms and CNNs to design a novel architecture and blocks, achieving results that appear to considerably improve the state of the art.

    The paper is well structured and complete, and the motivation and challenges are clearly described.

    The quantitative evaluation is thorough and suggests significant improvements over the state of the art for multiple relevant datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limitations of the approach are not presented, nor are failure cases. Moreover, no statistical analyses have been performed, and the variance of results is also missing. This makes me skeptical of the reported results and the superiority of the presented methodology, although the differences in mean metrics indicate that this might not be a concern.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. For all models and algorithms, check if you include: A clear declaration of what software framework and version you used. [Yes]

    No, the version is not mentioned.

    2. For all datasets used, check if you include:

    3. For all code related to this work that you have made available or will release if this work is accepted, check if you include: Specification of dependencies. [Yes] Training code. [Yes] Evaluation code. [Yes] (Pre-)trained model(s). [Yes] Dataset or link to the dataset needed to run the code. [Yes] README file including a table of results accompanied by precise command to run to produce those results. [Yes]

    None of this is mentioned in the manuscript!!

    4. For all reported experimental results, check if you include: The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results. [Yes]

    No; no; some.

    An analysis of situations in which the method failed. [Yes]

    Not found in the manuscript.

    A description of the memory footprint. [Yes]

    Not found in the manuscript.

    The average runtime for each result, or estimated energy cost. [Yes]

    Not found in the manuscript.

    An analysis of statistical significance of reported differences in performance between methods. [Yes]

    No statistical analyses have been performed!

    A description of results with central tendency (e.g. mean) & variation (e.g. error bars). [Yes]

    No variation of results is described!

    The details of train / validation / test splits. [Yes]

    Validation splits are not mentioned.

    The exact number of training and evaluation runs. [Yes]

    Not found in the manuscript.

    Information on sensitivity regarding parameter changes. [Yes]

    Not found in the manuscript.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I recommend adding measures of dispersion to the quantitative results and justifying the choice of examples for the qualitative analyses.

    Also, please check the language. Some examples that need reviewing are:

    • “six metrics, i.e. mean Dice, mean IoU and etc.”
    • “an architecture consists”
    • “using prediction map”
    • “decoder has two stage”

    In the implementation details I see the authors employ a “multi-scale strategy”, but it is not clear what this means. I would also like to know why vertical flip was not used.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I found the paper well written; all contributions were explicit and clear, and the motivation was also clearly presented. The structure is good and the figures are helpful. I would have liked to see a stronger effort on reproducibility and a statistical analysis of the reported metrics. I am not happy with many of the claims in the reproducibility checklist, which do not correspond to what can be found in the manuscript.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #5

  • Please describe the contribution of the paper

    Three strategies are proposed to advance polyp segmentation: 1. two parallel encoder branches (CNN and Transformer) extract local and global context, which are later fused; 2. a new prediction-guided self-attention block enhances the model's focus on boundary pixels; 3. windowed attention retains more local information.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Explores the use of Transformers in medical image segmentation;
    2. utilizes the high-level, low-resolution output to guide the low-level, high-resolution prediction;
    3. achieves a considerable segmentation performance improvement compared to prior works.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The writing of this paper should be improved, especially its precision and concision; many sentences are confusing and sometimes misleading.
    2. The parallel-branch design and the windowed attention are not novel: the former is similar to [24] in the paper, and the latter is similar to the Swin Transformer.
    3. Some major details and justifications of the proposed methods are missing.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide some implementation details, but they are insufficient to reproduce the whole work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In the Introduction, the paper writes "Transformer-based methods frequently adopt down-sampling…". In fact, vanilla ViT adopts only a single patch projection operation for down-sampling, so "frequently" is not accurate here.
    2. Also in the Introduction, the authors claim that the proposed method "guide[s] self-attention to focus on the hard regions"; this needs further justification, as the method in its current form does not seem to have any mechanism for focusing on "hard regions".
    3. How do patch merging and patch expanding work? I cannot find any explanation of these two modules; the paper lacks significant details and references for the Transformer branch.
    4. For the L2G mechanism, the proposed method first applies very fine-grained self-attention in a window-based fashion and then applies global self-attention, so a single L2G block essentially applies self-attention twice (see the sketch after this list). Will this require significant computational resources?
    5. Does PP-guided self-attention generalize to multi-class segmentation? The current implementation relies heavily on the absolute value of the prediction logits, which does not seem applicable to multi-class problems.
    6. In the ablation study, please further clarify each setting, e.g., what exactly is the "backbone"? Also, what would the result be if simple self-attention were used instead of the PPFormer block?
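
    For reference on point 4, here is a minimal sketch of the two-stage pattern being described: self-attention within non-overlapping windows, followed by global self-attention. The window size, shapes, and use of PyTorch's scaled_dot_product_attention are illustrative assumptions; the paper's blocks reportedly use efficient convolutions and spatial reduction instead.

```python
import torch
import torch.nn.functional as F

def l2g_attention(x, h, w, window=7):
    """Hypothetical local-to-global block. x: (B, H*W, C) tokens;
    h and w must be divisible by window."""
    B, N, C = x.shape
    # local stage: partition tokens into (window x window) groups and attend within each
    xw = x.view(B, h // window, window, w // window, window, C)
    xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    local = F.scaled_dot_product_attention(xw, xw, xw)
    # undo the window partition
    local = local.reshape(B, h // window, w // window, window, window, C)
    local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
    # global stage: a second, full self-attention over all N tokens
    return F.scaled_dot_product_attention(local, local, local)

y = l2g_attention(torch.randn(2, 14 * 14, 32), h=14, w=14)  # (2, 196, 32)
```

    The global stage costs O(N^2) per block while the windowed stage costs only O(N·window^2), so the second pass dominates; this is exactly the cost the rebuttal says is mitigated with spatial reduction, following CvT.
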
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes two new strategies to enhance polyp segmentation, from the perspectives of improving local information and enhancing the boundary feature representation. However, some major design motivations and details are missing, and the writing is neither concise nor precise. Thus, the paper needs further improvement.

  • Number of papers in your stack

    8

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a Transformer-based method for polyp segmentation that is shown to perform better than the SOTA. The paper is well written and the presented method is novel. Based on the reviewers' feedback, an early accept is recommended. The authors should incorporate the reviewers' feedback in the camera-ready version.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

Dear Reviewers: We appreciate your efforts in reviewing our manuscript, "Using Guided Self-Attention with Local Information for Polyp Segmentation" (ID 962).

To Reviewer #1: Comment #1: Thanks for your constructive comment; we will consider multi-scale fusion in future work to further improve segmentation performance.

To Reviewer #3: Comment #1: Thanks for your valuable suggestions on our experiments. We will supplement the measurements of dispersion. For the qualitative analyses, we selected four polyps of different scales from unseen and seen datasets; the visualization results show the effectiveness and generalization capability of our method. Comment #2: Thanks for your careful review; we apologize for the language mistakes and promise to check and revise them in the final version. Comment #3: The multi-scale strategy refers to random scaling. To ensure fairness, we used unified data augmentation across experiments, and random vertical flipping was not adopted.

To Reviewer #4: Comment #1: This is a valuable comment. PPFormer's performance is influenced by alpha in PP-guided self-attention. We set alpha to 1e-2: when alpha is set to 0, 1e-4, 1e-3, 1e-2, and 1e-1, the mean Dice on ClinicDB is 0.933, 0.936, 0.941, 0.946, and 0.935, respectively. In our work, the self-attention matrix is normalized by softmax, and we set alpha below 1 to reduce the negative influence of the non-normalized guidance term. Inspired by your suggestions, we could further apply softmax or min-max normalization to improve our method. Comment #2: Thanks for your advice. To ensure fairness, we trained TransFuse from the released code, which is a smaller model than the unreleased TransFuse_L; this explains the differences. Parameters: 32.5M (PraNet), 26.3M (TransFuse), and 36.0M (PPFormer). FLOPs (input size 352x352): 210G (PraNet), 350G (TransFuse), and 110G (PPFormer). PPFormer achieves stronger performance with only slightly more parameters. Comments #3, #4: We visualize the attention map using the approach in "Transformer Interpretability Beyond Attention Visualization". We appreciate the suggestions, which are constructive for improving and applying our method.

To Reviewer #5: Comment #1: We agree that "frequently" is not suitable for describing ViT-based methods, since vanilla ViT adopts only a single patch projection operation; we will revise it. However, subsequent Transformer methods, such as PVT and CvT, do frequently adopt patch projection operations to achieve pyramid structures or spatial reduction. Comment #2: Thanks for your careful review. We adopt self-attention to extract global context for each pixel and use PP-guided self-attention to enhance the model's perception of boundary regions; we believe this whole process makes the model focus on the hard regions. The qualitative results demonstrate that our method accurately locates polyps and discriminates boundaries. Comment #3: We are sorry that we did not introduce patch expanding and patch merging in detail. Patch merging is a down-sampling operation that uses a convolutional layer to merge patches and extend the feature dimension; patch expanding up-samples patches and applies a linear layer to reduce the feature dimension (a sketch consistent with this description appears after this feedback). We will add these details in the final version. Comment #4: The L2G mechanism does require extra computational resources; we therefore adopt efficient convolutions and spatial reduction (following CvT) to prevent a significant increase. Comment #5: This is a valuable and constructive point. For multi-class tasks, we could convert the predicted result into a binary mask in PP-guided self-attention, or create multiple masks according to the number of classes. Comment #6: The backbone is the combination of CvT and VGG16; the backbone result in the ablation study corresponds to using plain self-attention instead of the PPFormer block. We will explain this clearly in the final version.

Sincerely yours.
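
For reference, here is a minimal sketch consistent with the patch merging / patch expanding description in the reply to Reviewer #5's Comment #3. The kernel sizes, the factor-of-2 scaling, and the channel ratios are assumptions; the paper does not state them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchMerging(nn.Module):
    """Down-sampling: a strided convolution merges 2x2 patches and
    extends (here: doubles) the feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x):               # x: (B, C, H, W)
        return self.conv(x)             # (B, 2C, H/2, W/2)

class PatchExpanding(nn.Module):
    """Up-sampling: interpolate patches, then a linear (1x1) projection
    reduces (here: halves) the feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim // 2, kernel_size=1)

    def forward(self, x):               # x: (B, C, H, W)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.proj(x)             # (B, C/2, 2H, 2W)

x = torch.randn(1, 64, 32, 32)
down = PatchMerging(64)(x)              # (1, 128, 16, 16)
up = PatchExpanding(128)(down)          # (1, 64, 32, 32)
```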


