
Authors

Yufan He, Vishwesh Nath, Dong Yang, Yucheng Tang, Andriy Myronenko, Daguang Xu

Abstract

Transformers for medical image segmentation have attracted broad interest. Unlike convolutional networks (CNNs), transformers use self-attention, which does not have a strong inductive bias. This gives transformers the ability to learn long-range dependencies and stronger modeling capacity. Although transformers such as SwinUNETR achieve state-of-the-art (SOTA) results on some benchmarks, the lack of inductive bias makes them harder to train: they require much more training data and are sensitive to training recipes. In many clinical scenarios and challenges, transformers can still perform worse than SOTA CNNs like nnUNet. A transformer backbone and corresponding training recipe that can achieve top performance across different medical image segmentation scenarios still needs to be developed. In this paper, we enhance SwinUNETR with convolutions, which results in a surprisingly stronger backbone, SwinUNETR-V2, for 3D medical image segmentation. It achieves top performance on a variety of benchmarks of different sizes and modalities, including the Whole abdominal ORgan Dataset (WORD), the MICCAI FLARE2021 dataset, and the MSD pancreas, prostate, and lung cancer datasets, all using the same training recipe with minimal changes across tasks.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_40

SharedIt: https://rdcu.be/dnwDO

Link to the code repository

https://github.com/Project-MONAI/MONAI/blob/dev/monai/networks/nets/swin_unetr.py

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a segmentation method based on the Swin-UNETR method. The original idea in the UNETR method is to substitute the convolution operations in the U-Net architecture with the more general transformers. Swin Transformers are an evolution of the transformer operation that introduces shifted windows in order to mimic the sliding-window concept from convolutions that is missing in transformers. The present paper presents yet another improvement on the Swin-UNETR architecture, consisting of introducing convolutions before each Swin Transformer block. The paper defends the idea that the inductive bias of convolutions in combination with Swin Transformers is better than Swin Transformers alone. The authors tried several combination approaches and demonstrate the improvement of the proposed one in extensive experiments on 3 datasets.
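The shifted-window mechanism mentioned above can be sketched in a few lines (a minimal 2D NumPy illustration of the Swin concept, not the paper's code; the function names `window_partition` and `shifted_windows` are hypothetical):

```python
import numpy as np

def window_partition(x, ws):
    # Split an (H, W) feature map into non-overlapping ws x ws windows;
    # self-attention is then computed independently inside each window.
    H, W = x.shape
    return x.reshape(H // ws, ws, W // ws, ws).transpose(0, 2, 1, 3).reshape(-1, ws, ws)

def shifted_windows(x, ws):
    # Cyclically shift by ws // 2 before partitioning (as in Swin's SW-MSA),
    # so the new windows straddle the previous partition's boundaries and
    # information can flow between adjacent windows across layers.
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)
```

Alternating plain and shifted partitions across consecutive layers is what gives windowed attention a sliding-window-like receptive field at much lower cost than global attention.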

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • the idea of hybrid architectures combining convolutions and transformers is interesting
    • the authors have tried different combination options and provide comparison results of each variant, which is important for future research directions
    • extensive comparison on 3 well-known datasets against numerous strong baselines
    • the method clearly outperforms the baselines
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    • This is an incremental contribution on the reference Swin-UNETR architecture. Although substantial work has been done in evaluating different combinations of convolutions and transformers, all of them are somewhat “minor” modifications to the reference Swin-UNETR architecture
    • Lacks a principled explanation of why the proposed changes lead to improvements while other similar ones do not (this applies to most architectural research, actually). The reasons alluded to in the motivation for the paper, namely the benefits of introducing inductive biases into attention-based architectures, would also apply to all the variants tried. Nevertheless, only the chosen one succeeds in improving performance.
    • On which ground were the 3 sub-tasks on MSD selected?
    • Information about important parameters such as token and window sizes is missing
    • Authors claim that the method is robust to changes in hyperparameters, but no evidence is provided (e.g., against token and window sizes)
    • Authors claim that the baseline Swin-UNETR requires extensive pre-training and hyperparameter tuning to perform well, although the experiments show that its performance is also excellent compared to other baselines under similar conditions to the proposed method (i.e., no pre-training, same hyperparameters)

    Minor comments:

    • Sub-figure numbers in Fig 1 are wrong
    • Sub-caption 2.b in Fig 1 should replace “transformers” with “CNNs”?
    • “Our method belongs to the second category” in Page 3 should be “Our method belongs to the third category” ?
    • Typo in swin transformer equation, where “i” should be replaced by “i+1” ?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Datasets are publicly available. Training recipes are based on previous work, with citation to corresponding repos. Code not yet available. Value of some main hyperparameters not described (see comments)

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • try to understand why the proposed architectural changes work compared to the closely related variants. Is it that the features are more useful, or that training is easier (or both)?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Pros:

    • The paper explores the interesting idea of a hybrid convolution-transformer architecture.
    • Evaluations are extensive and results are excellent.

    Cons:

    • “unexciting” incremental contribution over a reference architecture (Swin-UNETR)
    • lack of justification for why it works
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The transformer architecture SwinUNETR can achieve state-of-the-art results on some benchmarks, but its lack of inductive bias makes it harder to train and requires more data, leading to inferior performance in many clinical scenarios compared to convolutional neural networks (CNNs) like nnUNet. The paper presents SwinUNETR-V2, which enhances SwinUNETR with convolutions and results in a stronger backbone for 3D medical image segmentation. SwinUNETR-V2 demonstrates top performance on various benchmarks with minimal changes in the training recipe.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Authors introduced stage-wise convolution into the SwinUNETR backbone to improve performance. The network is evaluated extensively on a variety of benchmarks and achieved top performance on the WORD, FLARE2021, MSD prostate, MSD lung cancer, and MSD pancreas cancer datasets. It is clearly a step towards improving the performance and training stability of transformer-based models relative to CNN models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The major weakness of the study is that the authors propose only minimal changes to the network architecture. Although the results on various datasets show improvements, they are not far from the SwinUNETR results.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is clearly written and has enough information to reproduce the results

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Although the results on various datasets show improvements, they are not far from the SwinUNETR results. As improving the computational efficiency and reducing the extensive hyperparameter tuning of the SOTA SwinUNETR is the main objective of the paper, the authors could discuss these aspects in detail in the paper, in addition to the performance of the proposed method on various datasets.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method adds some novelty to the SOTA SwinUNETR and demonstrates some improvements over SwinUNETR.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Authors propose SwinUNETR-V2, a transformer-based architecture for 3D medical image segmentation. It builds upon SwinUNETR by introducing stage-wise convolutions into the backbone. The authors add a residual convolution (ResConv) block at the beginning of each resolution level to regularize the features for the following attention blocks. Numerical experiments are performed on five datasets: WORD, FLARE2021, and three MSD datasets.
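The stage-wise ResConv design described above can be sketched as follows (a minimal NumPy sketch; the exact layer ordering, kernel size, and the names `instance_norm`, `conv3d`, and `res_conv_block` are illustrative assumptions, not the authors' implementation — the rebuttal only states that InstanceNorm is used in the convolution blocks):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # Normalize each channel over the spatial dims (InstanceNorm),
    # as the rebuttal says is done in all convolution blocks.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv3d(x, w):
    # Naive 3x3x3 "same" convolution; x: (C_in, D, H, W), w: (C_out, C_in, 3, 3, 3).
    c_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros((c_out,) + x.shape[1:])
    for o in range(c_out):
        for i in range(x.shape[0]):
            for dz in range(3):
                for dy in range(3):
                    for dx in range(3):
                        out[o] += w[o, i, dz, dy, dx] * xp[i, dz:dz + x.shape[1],
                                                           dy:dy + x.shape[2],
                                                           dx:dx + x.shape[3]]
    return out

def res_conv_block(x, w1, w2):
    # Residual conv block applied before the attention blocks of a stage:
    # out = x + Conv(ReLU(Conv(InstanceNorm(x)))).
    h = np.maximum(conv3d(instance_norm(x), w1), 0.0)  # ReLU
    return x + conv3d(h, w2)
```

Note that with zero-initialized weights the block reduces to the identity, one reason residual convolutions are easy to drop into an existing backbone.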

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Authors propose to combine convolutions with window-based self-attention, addressing limitations in the original SwinUNETR and other ViT-based models.
    2. SwinUNETR-V2 demonstrates improvements in 3D medical image segmentation tasks, surpassing baseline methods on five CT and MRI datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Authors select 3 datasets from the MSD challenge, omitting the other MSD datasets for no apparent reason.
    2. No experiments on the BraTS dataset, the only dataset on which SwinUNETR reported results. Such experiments would allow a direct comparison between the two architectures in the same setup.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors report results on 5 public datasets. Authors do not provide code to reproduce the results, nor the neural network weights.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. It would be nice to unify the number format in Tables 1, 3, 4, and 5 (keep the same number of decimal digits; either multiply Dice by 100 or not, but be consistent). Also, using \toprule instead of \hline at the top of the tables gives a nicer look.
    2. What is the reasoning behind using different normalizations in the attention blocks (LayerNorm) and the res-blocks (InstanceNorm), given that both Swin and SwinUNETR use LayerNorm?
    3. It would be good to have experiments on the BraTS dataset. First, it would provide a direct comparison with SwinUNETR, of which you claim to be a successor. Second, it would allow you to assess (to some extent) how your method scales with increased training data size, since it is known that convolution-based architectures work better on smaller samples due to inductive biases, while attention-based architectures scale better with increased sample size.
    4. Your idea of convolutional res-blocks at the beginning of the attention blocks is similar (but not identical) to the CvT architecture: https://arxiv.org/pdf/2103.15808.pdf . It would be interesting to see how your method compares with a similar architecture but a CvT backbone.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is nicely written and has no particular flaws; however:

    • it is a marginal improvement over the baselines
    • it has a strong sense of “yet another Unet” type of work
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a very simple improvement over SwinUNETR, namely a res-block at each stage. Experiments show that SwinUNETR-V2 is surprisingly strong. This work is simple but effective, and it is easy to reproduce. One weakness is the limited novelty. Also, as R3 mentioned, the authors select 3 datasets from the MSD challenge, omitting the other MSD datasets for no apparent reason. The authors also omitted results on BraTS, on which SwinUNETR had reported results. The computational complexity should be better discussed in the paper. Although there are several issues, I am inclined to accept this paper on the condition that the authors address the issues raised by me and the other reviewers.




Author Feedback

We thank the reviewers and AC for the comments. We first address the main issues from the AC: 1) limited novelty; 2) MSD dataset selection, and why no BraTS; 3) computational complexity. For 1), we agree that adding convolutions to transformers is not new, but our main goal is to find a “go-to” segmentation transformer and training recipe able to replace nnUNet. Answering “why”, “how”, and “where” to add convolutions to a 3D Swin transformer, developing a widely applicable and easy-to-use recipe/backbone, and making it state-of-the-art for 3D medical images is not trivial and requires extensive research and experiments. Recent works like MOAT [31] and 3D UX-Net [14] (ICLR 2023) are simple but effective, supported by thorough experiments. A limited backbone change should not be considered a limited novelty or contribution. For 2), the selection of MSD tasks: our main experiments are WORD/FLARE with fair comparison (using the best tuned results of the baselines from their papers), showing state-of-the-art performance. With MSD, we aim to show the method’s robustness to different dataset sizes and challenges. Prostate (small, 32 training cases), lung tumor (medium, 63), and pancreas tumor (large, 281) are challenging and representative of varying data sizes, compared to the remaining 7 MSD tasks. #R3 wants BraTS21 results for a direct comparison between Swin and Swin-V2. However, BraTS21 results are not reported in the SwinUNETR paper, only on GitHub with a modified recipe. Our Swin-V2 and the original Swin are trained with exactly the same recipe that was tuned for Swin, so comparisons between these two are completely fair. We already have five datasets with thorough benchmark and ablation experiments; thus, we leave BraTS21 for future work. For 3), the last line of the “WORD result” sub-section of the “Results” section lists the parameter counts and FLOPs for both Swin and Swin-V2. All other baseline complexities are in ref. [14].

For the other comments: #R1 questioned why the proposed change worked while other similar ones did not. Our discussion briefly addresses this. The high-level intuition is that introducing convolutional inductive bias helps, but how many convolutions to add, and where, is very hard to analyze theoretically, since it interacts in complicated ways with the training recipe and the dataset properties. This needs to be determined by experiments, as in other architecture papers. #R1 also asked about missing hyperparameters (window and token size) and their robustness analysis. All those hyperparameters are the same as in the original Swin in the GitHub repo. We did not claim that Swin-V2 is insensitive to those hyperparameters; we claim that the current recipe/hyperparameters are robust across different datasets and can be used as the “go-to” method. #R1 also pointed out that the original Swin performs well without tuning or pretraining. That is true, but it can still have inferior results compared to nnUNet. #R3 asked why we use instance norm in the res-block. We use instance norm in all convolution blocks in both Swin and Swin-V2, which we observed to give better stability. #R3 asked about CvT, which is highly related to our work, but CvT is a 2D model for natural images and uses convolution for token embedding, making fair comparisons hard, since its recipe and architecture are tuned for 2D natural image classification. We therefore used other conv+transformer works in medical imaging as our baselines for fair comparison.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has mostly addressed my concerns, and the 3 reviewers all agree to accept it. So I prefer to accept it.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper enhances the SwinUNETR with convolutions, which results in a surprisingly stronger backbone, for 3D medical image segmentation. The whole paper is well written and verifies the effectiveness and novelty of the proposed method.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Three reviewers are positive to accept this work. And the authors well addressed the main issues raised by reviewers. Hence, I think this work can be accepted now.


