
Authors

Jinfeng Wang, Qiming Huang, Feilong Tang, Jia Meng, Jionglong Su, Sifan Song

Abstract

Colonoscopy, currently the most efficient and widely recognized technology for colon polyp detection, is necessary for early screening and prevention of colorectal cancer. However, accurate segmentation of polyps remains challenging due to their varying sizes and complex morphological features, as well as the indistinct boundary between polyps and mucosa. Deep learning has become popular for polyp segmentation tasks, with excellent results. However, owing to the structure of polyp images and the varying shapes of polyps, existing deep learning models easily overfit the current dataset; as a result, a model may fail on unseen colonoscopy data. To address this, we propose a new state-of-the-art model for medical image segmentation, SSFormer, which uses a pyramid Transformer encoder to improve the generalization ability of the model. Specifically, our proposed Progressive Locality Decoder can be adapted to the pyramid Transformer backbone to emphasize local features and restrict attention dispersion. SSFormer achieves state-of-the-art performance in both learning and generalization assessments.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_11

SharedIt: https://rdcu.be/cVRsX

Link to the code repository

https://github.com/Qiming-Huang/ssformer

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a transformer-based segmentation architecture, SSFormer. The main novelty of SSFormer is the local emphasis operator, which forces the attention to put more weight on nearby patches. SSFormer achieves good empirical performance on the polyp segmentation task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Local emphasis operator (LE). Although there are similar techniques to emphasize locality, e.g. swin transformer and NesT, the LE module is somewhat different.
    2. The experimental results are good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The main claim is that the LE module works because it can reduce “attention dispersion” (I guess this is more commonly known as “feature oversmoothing”?). However, the intuition for why LE reduces “attention dispersion” is not clearly explained; the authors support this claim only with visualizations (Fig. 2).
    2. Important baselines are missing from the experiments, especially ViT-based models such as Segtran (IJCAI 2021), TransUNet (arXiv:2102.04306), Swin-UNet (arXiv:2105.05537) and SETR. In addition, PraNet should be compared in Table 1.
    3. Citations are missing for Segtran, TransUNet and Swin-UNet.
    4. Various writing problems (see Section 8).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I didn’t spot any reproducibility issues.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Writing issues: 1) Don’t capitalize State-Of-The-Art. 2) “Attention dispersion” is a term invented in this paper. If you propose a new term, please define/explain it in detail; otherwise, please follow common terminology. 3) In the caption of Fig. 1, “(a) is the …” => “(a) the”. 4) In the caption of Fig. 1, “emphasized features” is an awkward term. 5) In Section 2.2, (“such as … etc.”) please don’t use “such as” and “etc.” at the same time. 6) In Equation (1), what are Ci, C and i? Please define them. 7) In “Stepwise Feature Aggregation”, “information interacted by …” is an awkward expression.

    2. Fig. 2 is not clearly explained. In particular, since transformer attention is pairwise, you have to select a query point and visualize the attention of all pixels with respect to that query point. What are the query points used for the attention maps in Fig. 2?
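
    For instance, a toy sketch of the kind of slicing such a visualization requires, with an assumed 14x14 patch grid (illustrative only, not the paper's code):

    ```python
    import torch

    # Pairwise attention must be sliced at a chosen query patch before it
    # can be displayed as a 2-D heatmap over the image.
    h, w = 14, 14                                    # assumed patch grid
    n = h * w
    attn = torch.softmax(torch.randn(n, n), dim=-1)  # one head's attention
    q = (h // 2) * w + (w // 2)                      # query = centre patch
    attn_map = attn[q].reshape(h, w)                 # row q -> 2-D heatmap
    print(attn_map.shape)                            # torch.Size([14, 14])
    ```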

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The empirical performance of SSFormer seems good. Therefore, I lean towards acceptance.
    2. However, a few important baselines are missing, and there are many writing issues. The intuition for how LE counters “attention dispersion” is not clearly explained. I can only recommend “weak accept”. The authors should heavily revise the manuscript to address these issues.
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    In order to segment polyps more accurately, this work designs a new framework, SSFormer, which exploits a pyramid transformer architecture as the encoder and proposes a multi-stage aggregate decoder (PLD) to progressively fuse information from different stages. It shows better performance on different benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well organized
    • A new pipeline is designed for polyp segmentation. The experiments show it has better learning and generalization abilities
    • Ablation study on different combinations of encoder/decoder is conducted to show the effectiveness of PLD.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • To verify the effectiveness of the LE module, it would be better to have an ablation study in which the conv layers are removed. In Fig. 1(b), does this mean that the output sizes are always H(W)/4?
    • The framework exploits a transformer encoder for enhanced capability, but the whole pipeline is similar to UNet. Overall, the technical novelty is marginal. As for the results, the improvement over SegFormer is marginal.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think this paper has provided enough details about the models and experiment settings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • For equations (1) and (2), it would be better to explain the meaning of the notations, such as C and F_i.
    • In Table 4, it seems the last sentence is incomplete (“the CVT is”).
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes a new framework for polyp segmentation and achieves competitive results. However, the technical novelty is marginal. I prefer weak reject at the current stage.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I have read the authors’ rebuttal and the comments from the other reviewers. I still have concerns about the similarity between this work and UNet, although there are some differences in the design, such as different architectures (transformer) and different upsampling strategies. Technically speaking, the novelty is marginal. R3 also points out the issue of novelty. Also, considering some missing experiments, I keep my original score unchanged.



Review #3

  • Please describe the contribution of the paper

    The paper provides a pyramid-transformer-based model for 2D medical image segmentation, with a novel progressive locality decoder for multi-stage feature aggregation, in order to improve generalizability/robustness and better capture local features in a challenging segmentation task (polyp segmentation). Comprehensive experiments with promising performance are reported.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper has a clear motivation for its method. The pyramid transformer encoder is selected for better model generalizability, especially for challenging polyp segmentation with varying sizes and shapes, and the progressive locality decoder (PLD) is designed to emphasize the local features of small structures.
    • The model is evaluated on both an in-domain testing set and unseen datasets, and the results demonstrate its robust generalizability.
    • The model is further trained and tested on other segmentation tasks, i.e. skin lesion and nucleus segmentation, which shows its potential for general medical segmentation tasks.
    • Comprehensive horizontal comparisons and ablation studies are reported in the Results, which further support the claimed improvements.
    • The paper provides attention maps for each scale and method to better visualize the local feature extraction ability of PLD.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper’s novel PLD contains two parts: LE and SFA. Although the results show that the PLD decoder improves the segmentation mDice score on multiple datasets, there is no ablation study exploring the effects of LE and SFA separately, i.e., what is the performance gain with only LE combined with traditional parallel feature fusion, and what is the gain with SFA alone paired with simple upsampling, without the two preceding conv layers? Without these ablation studies, it is hard to verify whether the performance gain of the whole model comes from LE, SFA, or both.
    • The scores of the other methods in Tables 2 and 3 (generalization tests) are taken directly from [9, 14, 16, 25] rather than obtained by reproducing those papers’ methods under the same experimental settings, especially the data augmentation used in this paper. Data augmentation by itself can improve a model’s generalizability/robustness on unseen datasets, as has been shown in both robust learning [A1] and domain generalization [A2]. Therefore, if the scores are taken from the original SOTA papers, readers cannot tell whether the improved generalizability comes from the proposed model or from the different data augmentation.
    • The paper uses mDice and mIoU as evaluation metrics, but these two metrics evaluate the same aspect of segmentation by measuring overlapping area, so keeping only mDice is enough, and this is widely accepted in medical segmentation. In addition, to better demonstrate segmentation performance, the Hausdorff distance, another commonly used segmentation metric, should be reported to evaluate the accuracy of the predicted mask shape.
    • The novelty of the method is slightly limited: the pyramid transformer encoder design comes from [20, 22]. As for the two novelties in PLD, compared to SegFormer [22], the Local Emphasis simply adds two convolution layers before upsampling the feature maps of each scale, and the Stepwise Feature Aggregation, instead of parallel fusion, inserts a linear layer between each aggregation level and progressively fuses multi-scale features from deep to shallow; this progressive fusion scheme is also used in many other segmentation decoders, and even the widely used UNet employs a more elaborate version of it. Thus, the method reads as a set of small increments over modules from previous works, which somewhat limits the novelty of the paper.

    Minor problems:

    • The convolution kernel size of the LE module is missing.
    • The size of C in PLD is missing.
    • The caption of Fig. 2 has no explanation of (c).
    • The caption of Table 4 has no definition of CvT.
    • Eq. (1) does not contain the upsampling operation in LE.

    Ref: [A1] Hendrycks, Dan, et al. “AugMix: A simple data processing method to improve robustness and uncertainty.” ICLR 2020. [A2] Volpi, Riccardo, et al. “Continual adaptation of visual representations via domain randomization and meta-learning.” CVPR 2021.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors indicate that the code and models will be released upon acceptance.

    The datasets used in the experiments are all openly accessible. The paper also provides the data splits, training parameters and model structure; although some model parameters are missing (the convolution kernel size in the LE module), the results of the paper should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    For the Experiments:

    • In order to better explore and present the impacts of LE and SFA separately, I’d suggest the authors add another group of ablation studies with the following settings: LE + parallel fusion, simple upsampling + SFA, LE + SFA, and a baseline without LE or SFA.
    • Keep the mDice scores, remove the mIoU scores, and add the Hausdorff distance as an extra evaluation metric for a more comprehensive segmentation evaluation (see the sketch after this list).
    • Double-check the data augmentation methods in the papers [9, 14, 16, 25] from which the scores in Tables 2 and 3 are taken. As mentioned above, different data augmentation will affect the robustness and generalizability of a model. It would be even better to reproduce the models from those papers under exactly the same experimental settings, especially the data augmentation, and compare the resulting scores in Tables 2 and 3.
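
    As a concrete aid, a minimal sketch of the two suggested metrics for binary NumPy masks, using SciPy's directed_hausdorff (illustrative only, not the paper's evaluation code):

    ```python
    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def dice_score(pred, gt):
        # Dice = 2|A∩B| / (|A| + |B|). mIoU is redundant alongside it,
        # since IoU = Dice / (2 - Dice) is a monotone function of Dice.
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

    def hausdorff_distance(pred, gt):
        # Symmetric Hausdorff distance (in pixels) between the
        # foreground pixel sets of two binary masks.
        p, g = np.argwhere(pred), np.argwhere(gt)
        return max(directed_hausdorff(p, g)[0],
                   directed_hausdorff(g, p)[0])

    # Toy masks: a square prediction vs. a slightly shifted ground truth.
    pred = np.zeros((64, 64), dtype=np.uint8); pred[20:40, 20:40] = 1
    gt = np.zeros((64, 64), dtype=np.uint8); gt[22:42, 22:42] = 1
    print(dice_score(pred, gt), hausdorff_distance(pred, gt))
    ```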

    Other issues in paper writing:

    • From the paper, it is not clear how the authors generate the attention maps at different scales. Please briefly explain how the attention maps in Fig. 2 are generated.
    • Go through the paper, fix the minor writing issues mentioned under “main weaknesses”, and correct the remaining spelling and grammar typos.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the clear motivation, reasonable contributions, comprehensive experiments and promising results, I think the paper has the potential to be accepted. Even though a few deficiencies in the experiments and results weaken the paper, its overall rating should be above the borderline.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a transformer-based segmentation architecture, SSFormer. The main novelty of SSFormer is the local emphasis operator, which forces the attention to put more weight on nearby patches. SSFormer achieves good empirical performance on the polyp segmentation task.

    The innovation lies in the Local Emphasis operator (LE), although there are similar techniques for emphasizing locality.

    The main claim is that the LE module works because it can reduce “attention dispersion”, but the intuition for why it reduces “attention dispersion” is not clearly explained.

    The model is evaluated on both an in-domain testing set and unseen datasets, and the results demonstrate its robust generalizability.

    Although the results show that the PLD decoder improves the segmentation mDice score on multiple datasets, there is no ablation study exploring the effects of LE and SFA separately.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8




Author Feedback

Dear Meta-Reviewer and Reviewers #1, #2 and #3,

Thank you for your constructive feedback. I apologize for the missing details, grammatical mistakes and unclear explanations in the paper.

I would like to emphasize the contribution of our method, which introduces a pyramid Transformer encoder to improve generalization. Furthermore, we propose PLD, comprising the LE and SFA modules, to suppress the attention dispersion of the Transformer and smoothly fuse features from different depths. The proposed SSFormer achieves SOTA or outstanding performance on multiple lesion segmentation benchmarks.

I summarize the reviewers’ feedback and concerns into the following points and respond to each separately: 1) the design ideas and working principle of the LE module; 2) the novelty of PLD; 3) ablation experiments and data augmentation; 4) baseline selection and model performance.

1) The concept of “attention dispersion” is related to “attention collapse” in DeepViT ([24] in the paper). Using the local receptive field of the convolution kernel to force the model to further capture local spatial context, and thereby reduce semantic ambiguity in the attention mechanism, was proposed in CvT ([21] in the paper). Furthermore, we argue that the attention matrix in the self-attention mechanism can be viewed as a global, non-preset “convolution kernel”. Therefore, we use the local receptive field of the convolution kernel to increase the macro weights of the patches around each query patch, refocusing attention on neighboring features and thus reducing attention dispersion. I apologize for the missing citation and unclear explanation.
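
As a toy numerical illustration of this argument (illustrative only; shapes and weights are assumptions, not taken from the paper): a near-uniform attention row averages away the query patch's own features, whereas weights concentrated on neighbouring patches keep the output close to the local feature.

```python
import torch

torch.manual_seed(0)
tokens = torch.randn(16, 8)            # 16 patches, 8-dim features
q = 8                                  # index of the query patch

dispersed = torch.full((16,), 1 / 16)  # near-uniform attention row
local = torch.zeros(16)
local[q - 1:q + 2] = torch.tensor([0.2, 0.6, 0.2])  # neighbour weights

# Attention output = weighted sum of patch features. The dispersed row
# typically drifts far from the query's own features; the local row
# stays close, i.e. local detail is preserved.
print((dispersed @ tokens - tokens[q]).norm())  # larger
print((local @ tokens - tokens[q]).norm())      # smaller
```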

2) The LE module differs from methods that mix Transformer and CNN backbones or modify the self-attention mechanism to emphasize local information. To reduce computational cost and the information gap, we use convolution operations in the decoder to emphasize local information in features from different depths. Although PLD and UNet both perform scaling and multi-stage feature aggregation, I cannot agree that they are similar: they have essentially different design purposes and modes of interaction. UNet uses continuous upsampling so that the decoder can interact with the encoder through skip connections. In contrast, the proposed SFA module uniformly scales the features from different depths to the target size. Furthermore, in the feature fusion part we adopt a parallel fusion method, which is different from the serial method in UNet.
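
For concreteness, a minimal PyTorch sketch of one LE block and one SFA fusion step, following Reviewer #3's reading of PLD (kernel sizes, channel widths and the 1x1 mixing convolution are assumptions; the released repository contains the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalEmphasis(nn.Module):
    # Sketch of LE: convolutions (3x3 assumed) emphasize local context
    # before each scale is upsampled to a common target size.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )

    def forward(self, x, size):
        return F.interpolate(self.convs(x), size=size,
                             mode="bilinear", align_corners=False)

class StepwiseFusion(nn.Module):
    # Sketch of one SFA step: the running fused map is mixed with the
    # next emphasized map (a 1x1 conv stands in for the linear layer).
    def __init__(self, ch):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, fused, nxt):
        return self.mix(torch.cat([fused, nxt], dim=1))

# Usage: emphasize four pyramid scales, then fuse them deep -> shallow.
feats = [torch.randn(1, c, s, s)
         for c, s in [(64, 56), (128, 28), (320, 14), (512, 7)]]
les = [LocalEmphasis(c, 64) for c in (64, 128, 320, 512)]
emphasized = [le(f, (56, 56)) for le, f in zip(les, feats)]
fuse = StepwiseFusion(64)
out = emphasized[-1]
for e in reversed(emphasized[:-1]):
    out = fuse(out, e)
print(out.shape)  # torch.Size([1, 64, 56, 56])
```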

3) I apologize for the incomplete experimental content. Due to the page limit and the focus of the paper, we were not able to show the effects of LE and SFA separately. However, our paper mainly focuses on the positive effect of the entire PLD on the local information loss problem of the Transformer backbone. Therefore, limited by the length requirements, we chose to emphasize the overall capability of PLD and its internal processing of feature streams. If there is an opportunity to conduct follow-up research on SSFormer, we will certainly do so. Furthermore, the data augmentation strategies of the models trained on the different benchmark datasets are kept as consistent as possible with the baseline models under the same learning strategy.

4) Since our paper focuses on the lesion segmentation task, we chose baseline models that have performed well on the benchmark datasets used in the paper. At the same time, since we adopt two learning strategies in this paper, for the sake of fairness we prefer models with strategies similar to ours. For benchmark model selection, we referred to the SOTA records of multiple benchmark datasets. In addition, in terms of model performance, SSFormer improves the mDice score by about 3% (in the ablation experiments) on both the ClinicDB and Kvasir benchmarks compared to SegFormer, as mentioned by a reviewer. We believe this improvement is not trivial.

I appreciate your comments and hope my responses address your concerns.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although all reviewers appreciated the quality of the results, the recommendations remained divergent after the rebuttal. The rebuttal adequately addresses most reviewer concerns, specifically regarding technical novelty. The final version should incorporate all reviewer comments and suggestions.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received two weak accept ratings and one weak reject rating. The authors submitted a strong rebuttal that further explains the novelty of the LE module and PLD and answers the other reviewer questions. If the authors incorporate some of the replies into the final version, the paper should be acceptable.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    11


