
Authors

Wentao Liu, Tong Tian, Weijin Xu, Huihua Yang, Xipeng Pan, Songlin Yan, Lemeng Wang

Abstract

The success of Transformers in computer vision has attracted increasing attention in the medical imaging community. Especially for medical image segmentation, many excellent hybrid architectures based on convolutional neural networks (CNNs) and Transformers have been presented and have achieved impressive performance. However, most of these methods, which embed modular Transformers into CNNs, struggle to reach their full potential. In this paper, we propose a novel hybrid architecture for medical image segmentation called PHTrans, which hybridizes Transformer and CNN in parallel in the main building blocks to produce hierarchical representations from global and local features and adaptively aggregate them, aiming to fully exploit their strengths to obtain better segmentation performance. Specifically, PHTrans follows the U-shaped encoder-decoder design and introduces the parallel hybrid module in deep stages, where convolution blocks and the modified 3D Swin Transformer learn local features and global dependencies separately; a sequence-to-volume operation then unifies the dimensions of the outputs to achieve feature aggregation. Extensive experimental results on both the Multi-Atlas Labeling Beyond the Cranial Vault and Automated Cardiac Diagnosis Challenge datasets corroborate its effectiveness, consistently outperforming state-of-the-art methods. The code is available at: https://github.com/lseventeen/PHTrans

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_23

SharedIt: https://rdcu.be/cVRyB

Link to the code repository

https://github.com/lseventeen/PHTrans

Link to the dataset(s)

https://acdc.creatis.insa-lyon.fr/description/databases.html

https://www.synapse.org/#!Synapse:syn3193805/wiki/217789


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a mixed network architecture that combines convolution and transformer for better performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The studied problem is important.
    • The designed trans&conv block is neat.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Missing reference to closely related work.
    • Datasets are small.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think the paper is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This paper presents a network architecture that combines the design of convolution and transformer blocks so that both local and global visual representations are captured. The motivation is clear and experiments validate the effectiveness of the proposed approach.

    I have the following concerns.

    1. The relationship to prior work is not well analyzed. In particular, an ICCV 2021 paper, "Conformer: Local features coupling global representations for visual recognition", delivered a very similar idea to this work. The authors should add a discussion of it. In addition, it would help to provide some visualization (or other analytical results) showing how the mixed architecture helps recognition; please also refer to the above paper for examples.

    2. The improvements on both datasets (BCV and ACDC) are marginal compared to the baseline and other competitors. While I understand that the baselines are already high, I strongly suggest that the authors provide additional results to show that the improvement is solid (e.g. through qualitative or other quantitative studies). This is very important considering the small size of the studied datasets.

    Overall, introducing a novel architecture is helpful for medical image analysis. The paper is well written and I recommend weak acceptance.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the above comments.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel hybrid architecture for medical image segmentation called PHTrans, which hybridizes Transformer and CNN in parallel in the main building blocks to produce hierarchical representations from global and local features and adaptively aggregate them, aiming to fully exploit their strengths to obtain better segmentation performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It outperforms nnUNet and some medical image segmentation transformer models such as TransUNet and CoTr.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper cites UNETR but does not show its numerical results, although it is a very important and necessary SOTA method to compare against.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think some numerical results for the baselines are not consistent with those reported elsewhere. For example, on the same BCV dataset, nnUNet (the most important medical image segmentation model) scores 88.8 in the UNETR paper (https://arxiv.org/pdf/2103.10504.pdf, Table 1), while in this paper it scores only 87.75 and PHTrans only 88.55. I have added comments below in case the training and validation sets are not split identically.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Recently, many Transformer papers for medical image segmentation have been published. However, most of them do not make fair comparisons. I am curious why the authors cited UNETR but do not show its performance. I also wonder why the authors followed the experimental setting of nnFormer (https://arxiv.org/pdf/2109.03201.pdf), in which nnFormer outperforms nnUNet on the BCV/Synapse dataset, yet in this PHTrans paper nnUNet is better than nnFormer. In addition, the results reported in UNETR (https://arxiv.org/pdf/2103.10504.pdf) are better than those of PHTrans, so I suggest that the authors add detailed results for UNETR (https://arxiv.org/pdf/2103.10504.pdf) and Swin UNETR (https://arxiv.org/pdf/2201.01266.pdf), because I think those two Transformer models are the SOTA right now.

    Returning to the framework of this paper, I think it lacks novelty. The most novel part of PHTrans is adding a Conv Block in parallel in the “Trans&Conv Block”. Too many parts of PHTrans are borrowed from the Swin Transformer and UNet (encoder-decoder) architectures.

    Moreover, some SOTA models such as CoTr and nnUNet are also trained from scratch, and they outperform certain Transformer models that use pretrained models. So PHTrans not needing a pretrained model is great, but it is not enough.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    If the authors add detailed experiments with UNETR and Swin UNETR to make the comparison fair, I would change my opinion and tolerate the lack of novelty, provided PHTrans proves to be a real SOTA that clearly outperforms UNETR and Swin UNETR. Besides, I would appreciate it if the authors released the code on GitHub so that its reproducibility can be checked. Because this paper has been released on arXiv, the whole community will test its real performance, and so will I.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This work looked at a popular and important problem in medical image segmentation: how to efficiently hybridize CNN and ViT. To this end, the authors proposed a hybrid architecture in which convolution and self-attention (from ViT) are performed simultaneously at each downsampled and upsampled scale of a U-shaped architecture. The manuscript is well written, with extensive experiments demonstrating the benefits of the proposed method on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work looked at a popular and important problem in medical image segmentation: how to efficiently hybridize CNN and ViT.
    • A reasonably novel method has been proposed.
    • Extensive experiments have demonstrated substantial benefits of the proposed method on two datasets.
    • The manuscript is overall well written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This submission is well finished, so I am in general satisfied with acceptance. If I have to list weaknesses, the following is minor rather than major:

    • The authors present the Volume-to-Sequence (V2S) and Sequence-to-Volume (S2V) operations as one of their key contributions. However, unless I have missed it, it is not clear to me what those operations are. Are they learnable or simple transformations?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Implementation details are provided; code is not submitted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The manuscript will benefit from a clear definition of Volume-to-Sequence (V2S) and Sequence-to-Volume (S2V) operations;
    • The definitions of W-MSA and SW-MSA are not given;
    • The authors state that PHTrans w/o ST has essentially the same architecture as nnU-Net. However, we observe a difference between Table 4 (PHTrans w/o ST has DSC=87.71 and HD=14.37) and Table 2 (nnU-Net has DSC=87.75 and HD=9.83). Can the authors provide some intuition about what caused this difference, particularly in HD?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work looked at a popular and important problem in medical image segmentation and proposed a reasonably novel hybrid architecture, with extensive experiments that justify the proposed method. The work is also well written. Hence I recommend “6: accept”.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper receives mixed scores: 1 accept, 1 weak accept and 1 reject. The most concerning issue lies in the performance comparison, in particular, why nnUNet performs better than UNETR in this paper whereas in previous works UNETR outperforms nnUNet, and why UNETR is not compared. In addition, how this work differs from the previous ICCV 2021 paper “Conformer: Local features coupling global representations for visual recognition” is not clear. The performance improvement on the BCV and ACDC datasets also appears to be quite marginal. Given that the newly designed network could be of great potential for medical image segmentation problems, the meta-reviewer decides to invite this paper for a rebuttal, and hopes that the authors carefully address all concerns from the reviewers during the rebuttal phase to improve the quality of this paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We would like to thank all the reviewers for their constructive comments.

Q1: Reproducibility of the paper
A1: We have released our code on GitHub (https://github.com/lseventeen/PHTrans).

Reviewer#1:

Q2: How does our work differ from Conformer?
A2: The idea of local features coupling global representations for visual recognition proposed in Conformer is similar to ours, but the framework is quite different. Conformer consists of a CNN branch and a transformer branch, which respectively follow the design of ResNet and ViT. The two branches continuously interact to achieve information fusion. However, ViT and ResNet require resampling to make the dimensions consistent with each other due to the single-resolution property of ViT. Our PHTrans utilizes the hierarchical property of the Swin Transformer, integrates it with a CNN into a single block, and constructs an encoder-decoder architecture for medical image segmentation.

Q3: Datasets are small and the improvement is marginal.
A3: Most previous works (e.g., Swin-Unet, CoTr, and nnFormer) used the BCV and ACDC datasets to evaluate model performance. Because these datasets have been studied extensively, we evaluated PHTrans on them, which makes the comparison more convincing. Table 2 shows that the results of all models are close on ACDC. On BCV, however, PHTrans improved over nnFormer, CoTr and 3D Swin-Unet by 2.1, 2.22 and 5.08 in DSC. Besides, we compared against the latest SOTA methods (UNETR and Swin UNETR) as suggested by Reviewer#2. PHTrans outperforms them by 9.13 and 3.00 in DSC (see A5). Overall, we do not consider the improvement marginal.

Reviewer#2:

Q4: Why are the results of UNETR not compared?
A4: In UNETR, the authors employed five-fold cross validation, which is different from our dataset partition of 18 training cases and 12 test cases, which follows Swin-Unet, nnFormer, etc. Besides, UNETR provides two sets of results. The higher DSC (0.888 and 0.891) for nnUNet and UNETR was obtained by training on additional data, in which the training cases were increased to 80 volumes. These two points can be found in Section 4.3 and Figure 1 of UNETR’s paper.

Q5: Add detailed experiments of UNETR and Swin UNETR
A5: For a fair comparison, we employed the code framework of nnUNet to evaluate the performance of PHTrans, in the same way as CoTr and nnFormer. All experiments were performed under the default configuration of nnUNet. Similarly, we evaluated UNETR and Swin UNETR in the same way, using the same dataset partition as ours. We used the official code of UNETR. The official code of Swin UNETR has not yet been released, so we reproduced it by replacing the ViT in the encoding stage of UNETR with a Swin Transformer. The relevant code has been added to our repository. The DSC and HD of UNETR obtained from these experiments are 79.42 and 29.27, which are similar to the results reproduced in nnFormer’s paper (79.56 and 22.97). Swin UNETR’s DSC and HD are 85.55 and 16.91. Although the results of Swin UNETR are better than those of UNETR, the model complexity increases greatly, to 69.93M parameters and 407.21G FLOPs. The DSC and HD of PHTrans (the results in the manuscript) are 88.55 and 8.68, which outperform both Swin UNETR and UNETR.
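
The margins quoted in A3 (9.13 and 3) can be re-derived from the DSC values listed in A5. The snippet below is only an arithmetic check on the numbers stated in this rebuttal; the dictionary layout and variable names are illustrative assumptions, and no new results are introduced.

```python
# (DSC, HD) pairs exactly as quoted in A5 of the rebuttal.
results = {
    "UNETR": (79.42, 29.27),
    "Swin UNETR": (85.55, 16.91),
    "PHTrans": (88.55, 8.68),
}

ref_dsc, ref_hd = results["PHTrans"]

# Higher DSC is better, lower HD is better, so the two margins
# are computed in opposite directions.
margins = {
    name: (round(ref_dsc - dsc, 2), round(hd - ref_hd, 2))
    for name, (dsc, hd) in results.items()
    if name != "PHTrans"
}

for name, (dsc_gain, hd_drop) in margins.items():
    print(f"PHTrans vs {name}: +{dsc_gain:.2f} DSC, -{hd_drop:.2f} HD")
```

The DSC margins come out as 9.13 over UNETR and 3.00 over Swin UNETR, matching the figures given in A3.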

Q6: In nnFormer’s paper, it outperforms nnUNet in BCV.
A6: According to Table 5 (b) in the latest version of nnFormer (https://arxiv.org/pdf/2109.03201.pdf), the DSC of nnUNet and nnFormer are 86.99 and 86.57, i.e., nnUNet outperforms nnFormer, which is consistent with our results.

Reviewer#3:

Q7: The clear definition of V2S and S2V operations
A7: V2S is used to reshape the entire volume (3D image) into a sequence of 3D patches with a window size. S2V is the opposite operation. They are not learnable.
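
To make the description in A7 concrete, the two operations can be sketched as pure index rearrangements with no parameters. This is a minimal sketch and not the authors' implementation: it assumes non-overlapping windows, a volume size divisible by the window size, and plain nested lists instead of tensors. The function names `volume_to_sequence` and `sequence_to_volume` are hypothetical.

```python
from itertools import product


def volume_to_sequence(volume, window):
    """V2S sketch: split a D x H x W nested list into non-overlapping windows.

    Returns a list of windows; each window is a flat list of voxels.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    wd, wh, ww = window
    if depth % wd or height % wh or width % ww:
        raise ValueError("volume size must be divisible by the window size")
    sequence = []
    for d0, h0, w0 in product(range(0, depth, wd),
                              range(0, height, wh),
                              range(0, width, ww)):
        # Flatten one window into a token sequence.
        sequence.append([
            volume[d0 + d][h0 + h][w0 + w]
            for d, h, w in product(range(wd), range(wh), range(ww))
        ])
    return sequence


def sequence_to_volume(sequence, window, shape):
    """S2V sketch: inverse of volume_to_sequence for the same window and shape."""
    depth, height, width = shape
    wd, wh, ww = window
    volume = [[[None] * width for _ in range(height)] for _ in range(depth)]
    origins = product(range(0, depth, wd),
                      range(0, height, wh),
                      range(0, width, ww))
    for tokens, (d0, h0, w0) in zip(sequence, origins):
        offsets = product(range(wd), range(wh), range(ww))
        for value, (d, h, w) in zip(tokens, offsets):
            volume[d0 + d][h0 + h][w0 + w] = value
    return volume


# Toy 4x4x4 volume whose voxels store their own coordinates.
shape, window = (4, 4, 4), (2, 2, 2)
volume = [[[(d, h, w) for w in range(shape[2])]
           for h in range(shape[1])]
          for d in range(shape[0])]
sequence = volume_to_sequence(volume, window)
restored = sequence_to_volume(sequence, window, shape)
```

Because both functions only move values between positions, applying one after the other returns the original volume, which is consistent with the statement that the operations are not learnable. In a tensor framework the same effect would typically be obtained with reshape and axis-permutation calls.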

Q8: The difference between Table 4 (PHTrans w/o ST) and Table 2 (nnU-Net)
A8: The architecture is the same, but the former uses PHTrans’s parameter settings. The number of channels and the activation functions are different.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The performance comparison with UNETR and Swin UNETR addressed the reviewers’ concerns well.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work received diverging initial recommendations. The rebuttal addresses most of the reviewers’ concerns regarding reproducibility, technical novelty, and experimental details. Overall, the contributions seem of interest to the MICCAI community. The final version should include all reviewer suggestions and comments.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The key strength of this paper is a seemingly novel method combining CNNs and Transformers in an image segmentation architecture. While all reviewers mentioned the contribution favorably, reviewer concerns were around the performance comparison details, especially regarding discrepancies in comparison with other recent methods and the lack of comparison with UNETR. The authors addressed these concerns in the rebuttal and clarified the discrepancies, and indicated that the proposed method still gave better results in a comparison with UNETR. One reviewer was concerned about the closeness to the ICCV Conformer paper; however, the authors described how the proposed architecture differs, and it seems to be a valid novel contribution in the area of medical image segmentation. Regarding the performance differences, I tend to agree with the authors that those are reasonable. Overall, the segmentation architecture seems to be of interest to the MICCAI community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10


