
Authors

Yanglan Ou, Ye Yuan, Xiaolei Huang, Stephen T.C. Wong, John Volpi, James Z. Wang, Kelvin Wong

Abstract

We present a new encoder-decoder Vision Transformer architecture, Patcher, for medical image segmentation. Unlike standard Vision Transformers, it employs Patcher blocks that segment an image into large patches, each of which is further divided into small patches. Transformers are applied to the small patches within a large patch, which constrains the receptive field of each pixel. We intentionally make the large patches overlap to enhance intra-patch communication. The encoder employs a cascade of Patcher blocks with increasing receptive fields to extract features from local to global levels. This design allows Patcher to benefit from both the coarse-to-fine feature extraction common in CNNs and the superior spatial relationship modeling of Transformers. We also propose a new mixture-of-experts (MoE) based decoder, which treats the feature maps from the encoder as experts and selects a suitable set of expert features to predict the label for each pixel. The use of MoE enables better specializations of the expert features and reduces interference between them during inference. Extensive experiments demonstrate that Patcher outperforms state-of-the-art Transformer- and CNN-based approaches significantly on stroke lesion segmentation and polyp segmentation. Code for Patcher is released to facilitate related research.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_46

SharedIt: https://rdcu.be/cVRy0

Link to the code repository

https://github.com/YanglanOu/patcher.git

Link to the dataset(s)

https://datasets.simula.no/kvasir-seg/


Reviews

Review #1

  • Please describe the contribution of the paper

    This work presents a new transformer-based network with three branches to predict a Gauss map, a boundary map, and a contour map. Then, a MoE-based decoder is presented to disentangle features. Experimental results on stroke lesion segmentation dataset and polyp segmentation dataset show that the developed network outperforms state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors present a transformer-based network for medical image segmentation that embeds a mixture of experts, and it achieves superior performance over state-of-the-art methods.
    2. The writing of this work is easy to follow.
    3. Two datasets are employed to evaluate the proposed segmentation method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is unclear why the authors include the mixture-of-experts (MoE) into the decoder for boosting the segmentation method. What about the performance of removing the MoE from the decoder?
    2. According to Section 2.4, the authors simply treat the upsampled MLP features as the expert features and then utilize an attention block to weight these so-called expert features for predicting the segmentation result. Hence, the novelty of the MoE-based decoder is limited.
    3. The technical novelty of the Patcher block is unclear. It seems that Patcher is based on the vision transformer block. What is the main difference? An ablation study is required to evaluate the effectiveness of that difference.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have claimed that they will release code, trained models, and results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The technical novelty of the Patcher encoder is not clearly explained.
    2. It seems that the MoE-based decoder tends to be over-claimed. It just utilizes an attention block to weight features at different CNN layers.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the weakness.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The response mostly addressed my concerns about this work, and thus I tend to follow the other two reviewers in accepting this work.



Review #2

  • Please describe the contribution of the paper
    1. The paper proposes a novel transformer-based model that is composed of multi-scale Patcher blocks and a Mixture-of-Experts module.
    2. The method achieves SOTA results compared with existing methods.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed model structure, including Patcher and MoE, is novel and interesting.
    2. The experimental results are comprehensive and convincing. I appreciate Figure 4 showing the function of each expert.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Can you explain or show the limitations of the proposed method?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All the code and models will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Please make the sub-subsection titles in Section 3.1 (and others, if applicable) consistent in format.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factors for my rating are mentioned in Questions 3 and 4.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors present a new neural network architecture for image segmentation, combining ideas from convolutions and transformers for feature extraction with a mixture-of-experts approach for the reconstruction. They show that their model beats SOTA on two medical image segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The design choices of the presented architecture are justified properly, the paper is well written and organized. The ideas are either novel, a novel combination of existing ideas or well executed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiments do not accurately depict the performance of the model. All the presented results are provided by the authors, which does not guarantee that each baseline was set up to perform under optimal conditions.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will release the code upon paper acceptance. The implementation details in the paper could be enough to reproduce the network architecture. The stroke lesion dataset is private, making future reproduction and comparison on that dataset impossible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    As the main contribution of the paper is a new neural network architecture, it would have been welcome to compare it against established benchmarks in the field of medical imaging. Comparing this new architecture with SOTA methods on a new dataset, or a dataset where other state-of-the-art techniques have not yet been applied, does not tell us whether the model is globally better, merely excellent at the two selected tasks, or whether the other methods were not applied with the same amount of effort. To clear up any doubt, please evaluate your model on a segmentation benchmark where other techniques have already been applied. As an example, [1] does not report the same result for UNet depending on training conditions.

    For the ablation study in Table 3, reporting the relative increase/decrease in accuracy would be more enlightening than the absolute results achieved by each combination. Try to answer: “What is the relative increase in DSC/IoU when swapping SETR’s decoding method with ours?” Absolute results are helpful, but they are not enough to give the full picture.

    [1] Huang, C. H., Wu, H. Y., & Lin, Y. L. (2021). Hardnet-mseg: a simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv preprint arXiv:2101.07172.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clear and presents a novel model. It is not certain that it beats all other SOTA methods, as some papers claim higher scores on the KVASIR-seg dataset, but it contains many nice ideas that could be used in future research and good justification for the different design choices of the presented network architecture.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The architecture proposed by the authors is partially new, and they claim to beat SOTA. They promise to clarify their data split to make it clearer why some other methods claim higher results.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviewers agree on the technical novelty of the paper and the quality of the results. However, they also express severe concerns and provide diverging recommendations. The rebuttal should thoroughly address all the points raised by the reviewers. In particular: (1) novelty, motivation, and contribution of the MoE decoder and Patcher block (R1), (2) discussion of the limitations of the method (R2), and (3) flaws in the experimental setup and missing comparisons against the state-of-the-art on established benchmarks (R3).

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We thank the AC and reviewers for their helpful feedback and for highlighting the novelty of our work.

==== R1 ==== First, we’d respectfully point out that the summary comment by R1 — “This work presents a new transformer-based network with three branches to predict a Gauss map, a boundary map, and a contour map.” — is unrelated to our paper. Our work does not have these branches.

Q1. Novelty of MoE: Although our weight maps (Fig. 4) make the MoE decoder look like attention, there are major differences between them. In short, the effect of MoE cannot be easily achieved by attention as we elaborate below.

First, in image attention, each output pixel looks at all pixels across the image to obtain attention weights (size HxW), where the weights for pixels at different locations sum to 1. In contrast, in our MoE decoder, for each output pixel we only generate 4 weights for combining the features of the 4 expert feature maps at the same pixel location. The 4 weights are obtained from the entire image via convolutional layers, which also differs from attention. Moreover, the effect of MoE is difficult to produce using attention: since we are only combining 4 pixel features for each output pixel, the size of “Value” in attention (if we wanted to use it) would be 4, but MoE uses the entire image to generate the weights, so the size of “Key” would be 4xHxW, i.e., the size of the 4 feature maps. Such a discrepancy between the sizes of “Value” and “Key” is a problem for attention, since they typically have the same size.
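The per-pixel gating described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the 4 expert feature maps have already been upsampled to a common resolution, and it stands in for the convolutional gating network with precomputed logits.

```python
import numpy as np

def moe_combine(experts, gate_logits):
    """Combine expert feature maps with per-pixel gating weights.

    experts:     (E, C, H, W) array of E upsampled expert feature maps
    gate_logits: (E, H, W) per-pixel logits (in the paper these would come
                 from a small convolutional network over the whole image)
    returns:     (C, H, W) gated combination
    """
    # Softmax over the expert axis: E weights per pixel, summing to 1.
    # Note the normalization is across experts at one location, not across
    # spatial locations as in image attention.
    w = np.exp(gate_logits - gate_logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # Weighted sum of the expert features at each pixel location.
    return (w[:, None, :, :] * experts).sum(axis=0)

rng = np.random.default_rng(0)
experts = rng.normal(size=(4, 8, 16, 16))   # 4 experts, 8 channels, 16x16
logits = rng.normal(size=(4, 16, 16))
out = moe_combine(experts, logits)
print(out.shape)  # (8, 16, 16)
```

With equal logits at a pixel, the output there is simply the mean of the expert features; skewed logits let one expert dominate, which is the specialization effect visualized in the paper's Fig. 4.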

Second, MoE selects a suitable set of hierarchical expert features (with coarse-to-fine spatial context) to produce more accurate segmentation, which is another main novelty of our approach. As appreciated by R2, our visualization of MoE weights shows MoE enables the specialization of each expert.

Q2. Ablation of removing MoE: We’d like to point out that the ablation of removing MoE is already in the original submission (Table 3, Lines 3&4), where we replace our MoE decoder with the one used in SETR and UNet, which leads to DSC drops of 1.05% and 2.48%, respectively.

Q3. Novelty of Patcher block & its ablation: Although Patcher uses vision transformer blocks as one of its components, it has several important new designs that address known limitations. The main contribution of Patcher is the use of large and small patches to focus on global and local context, as well as the simple and effective context padding (Fig. 2): we divide the input into large patches and use padded context to enhance the communication between large patches, which focuses on global context modeling. Each large patch is further divided into smaller patches, which have a limited receptive field as defined by the large patch, therefore focusing on local context modeling. Also, we’d respectfully point out that the ablation of replacing Patcher with other vision transformers is already included in the original submission. The results are shown in Table 1, Lines 4-7, where Patcher outperforms SOTA vision transformers significantly.

==== R2 ==== Q4. Limitations: One limitation of our method is that it costs more memory than CNN-based methods such as U-Net due to the use of transformers in multiple layers.

==== R3 ==== Q5. Experiment setup: We also notice that some papers claim higher scores on KVASIR-seg [1, 2]. However, it’s not a fair comparison: [1, 2] use different train/test splits and they don’t have a validation split, i.e., they tune models on the test set. In contrast, we use a validation split to select the best model to avoid overfitting to testing data. To enable fair comparison, we will release the data splits and the code for our method and all the baselines.

[1] Huang et al. “HarDNet-MSEG: a simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps.” arXiv 2021.
[2] Zhang et al. “TransFuse: Fusing transformers and CNNs for medical image segmentation.” MICCAI 2021.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addresses most reviewer concerns adequately, particularly concerning motivation and technical clarity. As a result, all reviewers recommend acceptance. The final version should include all reviewer comments and suggestions, specifically regarding missing experimental details and analyses.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The response addressed my concerns. The proposed model is novel and interesting.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed reviewers’ concerns adequately. All reviewers now agree to accept this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


