Authors

Ziheng Wang, Xiongkuo Min, Fangyu Shi, Ruinian Jin, Saida S. Nawrin, Ichen Yu, Ryoichi Nagatomi

Abstract

Vision transformers (ViTs) have become the favored paradigm in medical image segmentation since last year, surpassing their traditional CNN counterparts on quantitative metrics. The key advantage of ViTs is their use of attention layers to model global relations between tokens. However, this increased representation capacity comes with corresponding shortcomings: ViTs lack the inductive biases of CNNs, namely locality, translation invariance, and the hierarchical structure of visual information. Consequently, well-trained ViTs require more data than CNNs. As high-quality data in medical imaging is always limited, we propose SMESwin Unet. First, based on the Channel-wise Cross fusion Transformer (CCT), we fuse multi-scale semantic features and attention maps by designing a compound structure with CNN and ViTs (named MCCT). Second, we introduce superpixels by dividing the pixel-level features into district-level features to avoid interference from meaningless parts of the image. Finally, we use External Attention to consider the correlations among all data samples, which may further reduce the limitation of small datasets. According to our experiments, the proposed superpixel and MCCT-based Swin Unet (SMESwin Unet) achieves better performance than CNNs and other Transformer-based architectures on three medical image segmentation datasets (nucleus, cells, and glands).
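As a rough illustration of the External Attention idea mentioned in the abstract, the sketch below follows the commonly described form of external attention: two small learnable memories, shared across all samples, replace the per-sample keys and values. This is a minimal NumPy sketch under assumptions, not the authors' implementation; the names `external_attention`, `mem_key`, and `mem_value`, the memory size, and the feature dimension are all hypothetical.

```python
import numpy as np

def external_attention(features, mem_key, mem_value, eps=1e-9):
    """Sketch of external attention with double normalization.

    features:  (n_tokens, d) token features of one sample
    mem_key:   (s, d) external key memory, shared by every sample (assumed)
    mem_value: (s, d) external value memory, shared by every sample (assumed)
    """
    attn = features @ mem_key.T                      # (n_tokens, s)
    # Softmax over the token axis (shifted by the max for numerical stability).
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # L1 normalization over the memory-slot axis.
    attn = attn / (eps + attn.sum(axis=1, keepdims=True))
    return attn @ mem_value                          # (n_tokens, d)

rng = np.random.default_rng(0)
mem_key = rng.standard_normal((4, 8))    # hypothetical memory: 4 slots, d = 8
mem_value = rng.standard_normal((4, 8))
out_small = external_attention(rng.standard_normal((10, 8)), mem_key, mem_value)
out_large = external_attention(rng.standard_normal((25, 8)), mem_key, mem_value)
```

Because the memories depend only on the feature dimension, the same pair can be applied to inputs with different token counts, as the two calls above show.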

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_50

SharedIt: https://rdcu.be/cVRy4

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors proposed an image segmentation method that combines a CNN with a Transformer. As part of the network architecture, they introduced superpixels to reduce redundancy and noise in the images. The superpixel images are fed into a CNN to generate features of the input images, which are later combined with multi-scale features from the Transformer. The proposed method is evaluated on two datasets, demonstrating superior performance in comparison to other related models.
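To make the superpixel step concrete, here is a minimal sketch of replacing each pixel by the mean of its region, which is one simple way superpixels can suppress pixel-level noise. It is an assumption-laden illustration: the review does not name the superpixel algorithm, so a regular grid (`grid_superpixel_labels`, a hypothetical helper) stands in for a real method such as SLIC, and `superpixel_mean` is likewise a hypothetical name.

```python
import numpy as np

def grid_superpixel_labels(height, width, cell):
    """Hypothetical stand-in for a superpixel algorithm: square grid regions."""
    rows = np.arange(height) // cell
    cols = np.arange(width) // cell
    n_cols = -(-width // cell)  # ceiling division
    return rows[:, None] * n_cols + cols[None, :]

def superpixel_mean(image, labels):
    """Replace every pixel by the mean intensity of its superpixel."""
    flat = labels.ravel()
    counts = np.bincount(flat)
    sums = np.bincount(flat, weights=image.ravel().astype(float))
    means = sums / np.maximum(counts, 1)
    return means[labels]  # broadcast region means back to pixel positions

image = np.arange(16, dtype=float).reshape(4, 4)
labels = grid_superpixel_labels(4, 4, cell=2)
smoothed = superpixel_mean(image, labels)
```

Any integer label map of the same shape as the image can be passed in place of the grid labels.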
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors proposed an improved design of a Transformer-based method for image segmentation. In particular, it combines a CNN with a Transformer with the help of superpixels and a multi-scale fusion module. As the results show, the addition of superpixels improves performance, which is effective and useful because superpixels are a simple yet generic methodology that can be applied to similar problems. The multi-scale fusion module incorporates multiple features from both the CNN and the Transformer; this is one of the key aspects of the proposed work, since the combination of CNN and Transformer happens in this module. The authors obtained good results on two datasets and conducted a full ablation study on one dataset. These show the superiority of the proposed work and the effect of its different design components.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Although the proposed work shows good results and proposes a new design of a Transformer-based segmentation method, it is built upon several pre-existing methods. Hence, there is limited technical novelty: the method is a simple adoption and combination of Swin Unet and CCT, plus the addition of a CNN layer. The authors evaluated the proposed work on three datasets. For MoNuSeg, the results were inferior to UNet++ on both metrics.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

likely to be reproducible
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The authors examined their work on three datasets. They obtained the best results on two datasets (GlaS and WBCs). However, the results on MoNuSeg were inferior to UNet++, and there is no explanation for this result. The authors may provide an extended discussion of their results. Such a difference may arise from the different characteristics of the datasets. All three are similar in the sense that the method needs to segment objects that are, in general, circular or elliptical. The objects in MoNuSeg may be the smallest, which may affect the performance of the method. If so, it indicates that the method has difficulty dealing with small objects. The authors may investigate their results from this perspective. Also, the comparative methods are not the state-of-the-art methods for the three datasets. The authors may compare their method to the current SOTA methods for these datasets. For these datasets, Dice and mIoU may not be the optimal choice of evaluation metrics. PQ, which has been widely used recently, would be an alternative, in particular for nuclei segmentation.
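For readers less familiar with the metrics discussed here, the following is a small sketch of Dice, IoU, and panoptic quality (PQ). PQ is computed in its usual form: the sum of IoUs over matched instance pairs divided by TP + 0.5·FP + 0.5·FN, where a predicted and a ground-truth instance match when their IoU exceeds 0.5. The function names and the toy label maps are hypothetical, label 0 is assumed to be background, and the IoU here covers the foreground class only, so it may differ from the averaging convention used in the paper.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice and IoU for binary foreground masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    union = total - inter
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

def panoptic_quality(pred_labels, true_labels, iou_threshold=0.5):
    """PQ for instance label maps (0 = background, assumed)."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    true_ids = [i for i in np.unique(true_labels) if i != 0]
    matched_ious = []
    for t in true_ids:
        t_mask = true_labels == t
        # Only predictions overlapping this ground-truth instance can match.
        for p in np.unique(pred_labels[t_mask]):
            if p == 0:
                continue
            p_mask = pred_labels == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            iou = inter / union
            if iou > iou_threshold:  # above 0.5 the match is unique
                matched_ious.append(iou)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp
    fn = len(true_ids) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return float(sum(matched_ious) / denom) if denom else 0.0

# Toy example: one correct instance plus one spurious prediction.
pred = np.array([[1, 1, 0, 0], [0, 0, 2, 2]])
true = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
dice, iou = dice_and_iou(pred > 0, true > 0)
pq = panoptic_quality(pred, true)
```

Unlike Dice and IoU, PQ penalizes each missed or spurious instance individually, which is why it is often preferred when many small objects such as nuclei must be separated.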
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Simple combination of the existing methods that limits the technical novelty of the method. Limited discussion on the experimental results. No comparison to SOTA on these datasets used here.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper presents a new network architecture, SMESwin Unet, that merges CNN and Transformer for medical image segmentation. It fuses multi-scale semantic features and attention maps, then introduces superpixels to avoid the interference of meaningless parts of the image, and finally uses external attention to consider the correlations among all data samples. The proposed network achieved better results than CNNs and other Transformer-based architectures on three medical image segmentation datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1: The proposed architecture seems general enough for other segmentation tasks. I think it will work for many applications.

2: The experimental results are sufficient for the evaluation and give good evidence for the authors' claims.

3: The paper is also well written; most of the parts are easy to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1: The technical contribution is the main concern. I feel that it may not be enough for a MICCAI paper.

2: I think the use of superpixels may hurt the feasibility of the method. If this procedure could be removed with only a slight degradation in performance, then I would vote for publication.

3: It lacks an evaluation on 3D datasets; for a general network, such an evaluation may be required.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Possible to reproduce.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

1: I feel that too much text is devoted to superpixel generation. The current version does not give enough arguments and motivation for using it.

2: The MCCT description is a bit mathematically intensive; a more conceptual description would help readers understand it.

3: Pointing out the limitations and future directions for further development of the method would help readers.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I think that the knowledge contribution may not be enough for a MICCAI paper, and the presentation quality is also a concern for publication.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The authors proposed a deep learning segmentation system based on a hybrid of Swin Transformer and Channel-wise Cross fusion Transformer (CCT). They replaced one skip connection of CCT with a CNN branch that processes superpixel-processed raw images and named the result MCCT. An External Attention mechanism was deployed to further refine the features. The system was tested on GlaS, MoNuSeg, and WBC for gland, nuclei, and cell segmentation, respectively, and achieved promising performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The manuscript is well written and well organized. It is easy to follow, and the system should be easy to reimplement.
2. The system was comprehensively tested on three datasets with different segmentation targets.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. As the system is based on CCT, it is essential to compare Swin Transformer with CCT against the final system in order to prove that the added CNN branch is useful. If the CNN branch can be proven truly useful, this could have a bigger impact on the research field by informing better deep learning design.
2. The system is a hybrid of Swin Transformer, CCT, and EA (External Attention), so the work is relatively lacking in technical novelty.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

As the code will be made public after acceptance and the datasets are already public, reimplementing and reproducing the results should be possible and easy.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. Since T1, T2, T3, and T4 all have different numbers of channels and different numbers of patches/tokens, how are they concatenated?
2. In Section 2.3, please double-check the dimensions of all matrices mentioned. T_i should have shape d × C_i; if W_Qi has shape C_i × d, then Q_i should have shape d × d.
3. Could the authors provide more insight into why they think CNN analysis of raw images cannot deal with the ‘influence of constructed defect and noise’?
4. In Eq. 5, since the F_i have different sizes, how can they share the same M_k?
5. The CNN-generated features account for only 64/(64+128+256+512) ≈ 0.07 of T_sum in the attention mechanism. I wonder whether this can be a significant contributor to the final results. In addition, Swin Transformer with CCT should be compared with MCCT to prove that the CNN branch is useful. Also, the original CCT paper used four skip connections, whereas the authors of this paper use three plus one CNN branch. It would be better to provide more experiments and explanation on this choice: replacing one skip connection versus adding an extra one.
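The shape and channel-share arithmetic behind points 1, 2, and 5 can be checked with a short NumPy sketch. It only verifies the matrix algebra the reviewer describes and assumes nothing about the paper's actual layers: the token count `d = 196` and the random tensors are hypothetical, and the channel widths are the ones quoted in point 5.

```python
import numpy as np

d = 196                          # hypothetical number of tokens per scale
channels = [64, 128, 256, 512]   # channel widths quoted in point 5
rng = np.random.default_rng(0)

# One token matrix per scale, each of shape (d, C_i).
tokens = [rng.standard_normal((d, c)) for c in channels]

# Channel-wise concatenation works only if every scale has the same token count d.
t_sum = np.concatenate(tokens, axis=1)      # shape (d, 64 + 128 + 256 + 512)

# If T_1 is (d, C_1) and W_Q1 is (C_1, d), the product is (d, d), as in point 2.
w_q1 = rng.standard_normal((channels[0], d))
q1 = tokens[0] @ w_q1

# Share of the concatenated channels contributed by the 64-channel branch.
cnn_share = channels[0] / sum(channels)     # 64 / 960
```

If the scales had different token counts, `np.concatenate` along the channel axis would raise an error, which is the practical form of the question in point 1.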
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This is an interesting and convincing manuscript. My only main concern is that experiments on Swin Transformer with CCT should be included to prove that MCCT, that is, the added CNN branch, is useful, as mentioned in point #5.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposes a new image segmentation model that combines a CNN with a Transformer. One of the novel ideas is the use of superpixels and an attention mechanism in the skip connection of the model. The authors demonstrated that the proposed additions improve performance by up to 2% in Dice and 4% in mIoU. The pros and cons of this paper are summarized below:

Pros: novel extension of a CNN and Transformer-based model (R1, R2); good results demonstrated via comprehensive evaluation (R1, R2, R3); paper is well written and organized (R2, R3).

Cons: weak technical novelty, as the method is based on existing works (R1, R2, R3); results are not consistently good, since UNet++ is better on MoNuSeg (R1); lack of 3D evaluation (R2).

Even though the novelty/contribution is somewhat weak, all the reviewers seem to agree that the idea and results seem promising and the paper is worth presenting at MICCAI.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

4

Author Feedback

N/A

SMESwin Unet: Merging CNN and Transformer for Medical Image Segmentation