
Authors

Himashi Peiris, Munawar Hayat, Zhaolin Chen, Gary Egan, Mehrtash Harandi

Abstract

We propose a Transformer architecture for volumetric segmentation, a challenging task that requires maintaining a delicate balance between encoding local and global spatial cues and preserving information along all axes of the volume. The encoder of the proposed design uses self-attention to simultaneously encode local and global cues, while the decoder employs a parallel self- and cross-attention formulation to capture fine details for boundary refinement. Empirically, we show that the proposed design choices result in a computationally efficient model with competitive and promising results on the Medical Segmentation Decathlon (MSD) brain tumor segmentation (BraTS) task. We further show that the representations learned by our model are robust against data corruptions.
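As an editorial aside for readers skimming this page: the decoder idea summarized above (and elaborated in the reviews below, i.e., one shared query projection feeding a self-attention branch and a cross-attention branch over encoder keys/values, fused by a convex combination weighted by alpha) can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch under assumed module names, shapes, and a simplified tokenized interface (ParallelAttentionBlock, alpha, etc. are not taken from the paper); see the linked code repository for the authors' actual implementation.

import torch
import torch.nn as nn

class ParallelAttentionBlock(nn.Module):
    """Illustrative sketch (not the authors' code): one shared query projection
    feeds a self-attention branch and a cross-attention branch whose keys/values
    come from the matching encoder stage; the two branch outputs are fused by a
    convex combination weighted by alpha."""

    def __init__(self, dim, num_heads, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(dim)
        self.q_proj = nn.Linear(dim, dim)        # shared queries for both branches
        self.kv_self = nn.Linear(dim, 2 * dim)   # keys/values from decoder tokens
        self.kv_cross = nn.Linear(dim, 2 * dim)  # keys/values from encoder tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, dec_tokens, enc_tokens):
        # dec_tokens, enc_tokens: (batch, num_tokens, dim) flattened 3D windows
        x = self.norm(dec_tokens)
        q = self.q_proj(x)                                 # one shared set of queries
        k_s, v_s = self.kv_self(x).chunk(2, dim=-1)
        k_c, v_c = self.kv_cross(enc_tokens).chunk(2, dim=-1)
        sa, _ = self.self_attn(q, k_s, v_s)                # self-attention branch
        ca, _ = self.cross_attn(q, k_c, v_c)               # cross-attention branch
        fused = self.alpha * ca + (1.0 - self.alpha) * sa  # convex combination
        return dec_tokens + self.out_proj(fused)

# Toy usage with assumed sizes: 64 tokens of dimension 96, 4 attention heads.
block = ParallelAttentionBlock(dim=96, num_heads=4)
dec, enc = torch.randn(2, 64, 96), torch.randn(2, 64, 96)
print(block(dec, enc).shape)  # torch.Size([2, 64, 96])

Sharing the query projection keeps both branches aligned on the same decoder queries, while the convex combination trades off encoder context against decoder self-context.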

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_16

SharedIt: https://rdcu.be/cVRyu

Link to the code repository

https://github.com/himashi92/VT-UNet

Link to the dataset(s)

http://medicaldecathlon.com/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a 3D pure transformer architecture, called VT-UNet, for volumetric medical image segmentation. The network can work directly on 3D volumes. The authors design an encoder block with two consecutive self-attention layers for feature extraction and a decoder block with parallel cross-attention and self-attention to recover the learned features for segmentation. Experimental results on a large MRI dataset demonstrate its superior performance compared to other baselines. The authors also conducted a robustness analysis showing that VT-UNet is more robust to artifacts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea of designing a 3D pure transformer for medical image segmentation is novel, given that 3D transformers are not easy to train and require careful design choices such as the parallel cross-attention and self-attention.

    2) The effectiveness of the proposed method is demonstrated on a large MRI dataset with better performance than all the other pure CNN baselines and CNN+transformer baselines.

    3) The paper is well written and structured, and the figures help a lot in illustrating the ideas.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) In the introduction, the authors mention that the proposed method has fewer model parameters and lower FLOPs compared to existing methods. However, Fig. 1 gives no information about FLOPs, and the number of parameters and the FLOPs are not necessarily the same thing.

    2) It seems strange that the authors did not mention any data augmentation in the experimental setting. If no data augmentation is used for any of the compared methods, this is not a “fair” comparison, given that data augmentation is a standard technique for training DNN models nowadays. With data augmentation, the gap between different methods could be smaller because almost all the methods would improve.

    3) Transformers usually need a large dataset for training. When only limited data are available, the proposed method may not be better than CNNs, which could limit its usefulness.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is given and the data is public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1) Some baselines’ results deviate considerably from their original papers. For example, TransUNet and UNETR both outperform UNet and nnUNet in their original papers, yet in Table 1 their accuracies are much lower than nnUNet’s. The authors should briefly discuss these results.

    2) Why are Fourier feature positional encodings used in the decoding process? Why not use no positional information, or a standard positional encoding? The authors should provide an ablation study on this design choice.

    3) In Equation 5, why is alpha = 0.5?
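To make the Fourier positional encoding question above concrete: the generic Fourier-feature technique maps normalized token coordinates through sines and cosines at several frequencies, yielding a smooth, parameter-free positional signal. The sketch below is an editorial illustration of that generic technique only; the function name, frequency schedule, and shapes are assumptions and may differ from the paper's formulation.

import torch

def fourier_positional_encoding(coords, num_bands=8):
    """Generic Fourier-feature positional encoding (illustrative sketch only).

    coords: (num_tokens, 3) normalized (z, y, x) positions in [0, 1].
    Returns: (num_tokens, 3 * 2 * num_bands) sine/cosine features.
    """
    freqs = 2.0 ** torch.arange(num_bands, dtype=coords.dtype)  # 1, 2, 4, ...
    scaled = coords[..., None] * freqs * torch.pi               # (N, 3, num_bands)
    feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1)     # (N, 3, 2 * num_bands)
    return feats.flatten(start_dim=-2)                          # (N, 6 * num_bands)

# Toy usage with an assumed 4 x 4 x 4 grid of token positions.
axes = [torch.linspace(0, 1, 4)] * 3
zyx = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
print(fourier_positional_encoding(zyx).shape)  # torch.Size([64, 48])

Compared with a learned positional table, such features add no parameters and extend naturally to token grids of other sizes, which is one plausible motivation for using them in the decoder.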

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is well written. The idea of designing a pure 3D transformer for medical image segmentation is interesting, and this paper could appeal to the MICCAI 2022 audience. My biggest concern is that no data augmentation is used in the experiments, which would affect all the comparisons with other baselines.

  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a transformer architecture for volumetric segmentation with a new design that encodes local and global features. The method obtains competitive performance on the BraTS data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) VT-UNet has far fewer parameters and lower computational complexity than other recent works while achieving better performance.

    2) The concept of introducing parallel cross-attention and self-attention in the expansive path to create a bridge between queries from the decoder and keys & values from the encoder is good.

    3) The proposed method is purely transformer based, improves upon previous works, and is clearly distinguished from them.

    4) The paper is neatly organized and clear to understand.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) An ablation study is missing. I believe experiments showing the performance gains from the 3D shifted windows and the K, V skip connections would be useful.

    2) Not much novelty in terms of the encoder.

    3) I am still surprised by the drop in complexity. Structure-wise, I do not see anything substantially different that would actually reduce the complexity as reported in the paper. I read the supplementary material, where the authors discuss this. A clearer discussion of the complexity reduction and the reasons behind it should be included in the main paper.
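As editorial context for the complexity point above: with window-based attention (the 3D shifted windows mentioned elsewhere in the reviews), the quadratic cost of attending over all N tokens is replaced by a cost proportional to N times the window size M, which by itself accounts for a large drop in FLOPs. A back-of-the-envelope comparison under assumed token counts (not figures from the paper):

def attention_macs(num_tokens, dim, window=None):
    """Approximate multiply-accumulates for the QK^T and attention-times-V
    products of one attention layer; with windowing, each token attends only
    within its window. Illustrative only: ignores projections, heads, softmax."""
    attend_to = window if window is not None else num_tokens
    return 2.0 * num_tokens * attend_to * dim

# Assumed example: a 32 x 32 x 32 token grid of dimension 96, 7 x 7 x 7 windows.
n, d, m = 32 ** 3, 96, 7 ** 3
print(f"global attention:   {attention_macs(n, d):.2e}")     # ~2.06e+11
print(f"windowed attention: {attention_macs(n, d, m):.2e}")  # ~2.16e+09

Under these assumed numbers the windowed cost is roughly two orders of magnitude lower; the actual reduction reported in the paper will also depend on channel widths, the number of stages, and the decoder branches.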

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is provided in the supplementary material.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    An ablation study and more discussion of the reasons behind the reduction in parameters and complexity could be added to strengthen the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Performance and Efficiency.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a decoder block that uses one shared projection of the queries and independently computes cross- and self-attention. In addition, the authors combine many mature and well-known structures, and the network outperforms other SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The high performance compared with other methods is the key strength of this paper, and the model size is also excellent.
    2. The shared query projection with independently computed cross- and self-attention is an interesting contribution.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors spend a lot of space introducing existing work, even though these structures are only components of the network. For example, shifted windows, relative positional bias, patch merging, and patch expanding are well-known existing techniques, yet the authors devote subsections to them.
    2. The authors state that they propose a convex combination approach along with Fourier positional encoding, yet the paper gives little detail about the Fourier positional encoding.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Great. The dataset used is public, and the code is available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be better to devote more space to the newly designed components and the true contributions of this paper.
    2. The authors should add some explanation and analysis of the model parameters and size.
    3. The paper gives little detail about the Fourier positional encoding. The authors should add more details about it.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The performance of this network is great, but the paper does not focus on its true contributions.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a 3D pure transformer architecture, called VT-UNet, for volumetric medical image segmentation. Experimental results on a large MRI dataset demonstrate improved performance compared to other baselines, and the model has a reduced number of parameters. All reviewers agree that the paper is well organized and that the work is of interest to the MICCAI audience. Please add details of the Fourier positional encoding in the revision, and also discuss the reasons behind the reduction in parameters and complexity.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We thank all the reviewers and meta-reviewers for their invaluable comments. We begin our author feedback with answers to the reviewers’ general comments and then answer specific questions (Q) from each reviewer (R). We use REV for “revised version”.

General comments: Due to space restrictions, we had to omit some key content from the main paper. In our REV, we will include an in-depth discussion of the Fourier positional encoding (FPE), the convex combination, and the reduction in parameters and computational complexity, as well as an ablation study on design choices.

R1-Q1 (FLOPs in Fig. 1) In Fig. 1, the circle size indicates computational complexity in FLOPs. For a clearer comparison, we will add a FLOP count label to each circle in our REV.

R1-Q2 (Data Augmentation) We follow data augmentation techniques similar to those in previous works (UNETR and nnFormer). The experimental results for the SOTA methods in the paper were taken from the recent works UNETR [10] and nnFormer [33]. We will include these missing details in our REV.

R1-Q3 (Small Datasets) Thank you for the comment. We agree that transformers are more data-hungry than CNN models. According to our latest results, this version of the transformer can, to some extent, extract fine features even on small datasets, on par with CNN models. We will address this further in future work.

R1-Q4 (Results of TransUNet & UNETR) The results for UNETR were taken from the original work [10]. Regarding the experimental results for TransUNet, it is a hybrid 2D convolutional architecture; the deviation in its results on the MSD BraTS dataset could therefore be due to its inability to preserve information along all axes of the volume. We will include a discussion of these experimental results in our REV.


