
Authors

Junjia Huang, Haofeng Li, Guanbin Li, Xiang Wan

Abstract

Self-supervised learning methods based on image patch reconstruction have witnessed great success in training auto-encoders, whose pre-trained weights can be transferred to fine-tune other downstream image-understanding tasks. However, existing methods seldom study the varying importance of reconstructed patches and the symmetry of anatomical structures when applied to 3D medical images. In this paper, we propose a novel Attentive Symmetric Auto-encoder (ASA) based on the Vision Transformer (ViT) for 3D brain MRI segmentation tasks. We conjecture that forcing the auto-encoder to recover informative image regions can harvest more discriminative representations than recovering smooth image patches. We therefore adopt a gradient-based metric to estimate the importance of each image patch. In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing the informative patches according to the gradient metrics. Moreover, we resort to the prior of brain structures and develop a Symmetric Position Encoding (SPE) method to better exploit the correlations between long-range but spatially symmetric regions and obtain effective features. Experimental results show that our proposed attentive symmetric auto-encoder outperforms state-of-the-art self-supervised learning methods and medical image segmentation models on three brain MRI segmentation benchmarks. All codes and model weights will be made available.
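As an illustrative reader's sketch (not the authors' released code), the gradient-based patch-importance idea from the abstract can be demonstrated by scoring each patch with its mean gradient magnitude and normalizing the scores into reconstruction weights; the function name and patch size here are hypothetical:

```python
import numpy as np

def patch_importance(volume, patch=8):
    """Score each non-overlapping patch by its mean gradient magnitude.

    A sketch of the idea: smooth patches get low weight, edge-rich
    patches get high weight. `volume` is a 3D array whose sides are
    divisible by `patch`.
    """
    gz, gy, gx = np.gradient(volume.astype(np.float64))
    mag = np.sqrt(gz**2 + gy**2 + gx**2)
    d, h, w = (s // patch for s in volume.shape)
    # Average the gradient magnitude within each patch block.
    scores = mag.reshape(d, patch, h, patch, w, patch).mean(axis=(1, 3, 5))
    return scores / (scores.sum() + 1e-8)  # normalized importance weights

# A smooth volume vs. one containing a sharp step edge:
smooth = np.zeros((16, 16, 16))
edged = np.zeros((16, 16, 16))
edged[:, :, 4:] = 1.0  # step edge inside the first patch column
w_smooth = patch_importance(smooth)
w_edged = patch_importance(edged)
```

Under this sketch, patches straddling the edge receive all of the weight, while the uniform volume yields zero weights everywhere; a reconstruction loss could then multiply each patch's MSE by its weight.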

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_20

SharedIt: https://rdcu.be/cVRyy

Link to the code repository

N/A

Link to the dataset(s)

http://adni.loni.usc.edu/

https://www.oasis-brains.org/

http://www.braintumorsegmentation.org/

https://www.nitrc.org/projects/ibsr/

https://wmh.isi.uu.nl/data/


Reviews

Review #1

  • Please describe the contribution of the paper

The paper proposes an attentive symmetric auto-encoder based on ViT for MRI segmentation. The method performs self-supervised pre-training and exploits the prior of brain structures by developing a Symmetric Position Encoding.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Paper is clearly organized and neatly written.

    2) Masked Pre-training for medical segmentation is novel and makes sense as well.

    3) Attentive reconstruction loss and symmetric position encoding help use some key properties of medical data to do the pre-training.

    4) Experiments show a decent improvement in performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) Novelty in terms of network architecture is not great. The authors use shifted windows in the UNETR architecture.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors state that the code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

This paper shows that masked pre-training can improve segmentation performance. Experimental results validate it, and I believe it is a good finding for 3D medical image segmentation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Pre-training, new additions to the SSL method.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

The paper proposed a 3D brain segmentation model based on the Masked Auto-Encoder self-supervised learning scheme, with a novel symmetric positional encoding (SPE) to add an anatomical symmetry prior, and a novel attentive reconstruction loss based on histogram-of-gradients reweighting in order to emphasize the informative patches. Experiments show SOTA performance on 3 benchmarks and the effectiveness of each novelty.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper has a clear motivation for its novel ARLoss (adding importance to informative patch reconstruction) and SPE (inducing an anatomical symmetry prior in the positional encoding).
    • The proposed model/methods are evaluated on multiple benchmarks and with different metrics (Dice & HD95), and achieve SOTA performance compared with other SOTA transformer-based models and SSL methods.
    • The attentive reconstruction loss computes an importance score for each patch using 3D VHOG, which is reasonable and easy to implement.
    • Ablation studies demonstrate the effectiveness of SPE and ARLoss separately.
    • The 3D brain MRI is aligned to a template anatomical space during pre-training so that left-right symmetry can be guaranteed.
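One plausible reading of the SPE idea praised above (a hypothetical sketch only — the paper's exact formulation may differ) is to fold the left-right axis of the patch grid so that mirrored positions share a position index, and hence a position encoding:

```python
import numpy as np

def symmetric_position_ids(d, h, w):
    """Hypothetical illustration: assign left-right mirrored patch
    positions the same index, so spatially symmetric regions share a
    position encoding (mirroring along the last, left-right axis).
    """
    z, y, x = np.meshgrid(np.arange(d), np.arange(h), np.arange(w),
                          indexing="ij")
    x_sym = np.minimum(x, w - 1 - x)      # fold the x axis at the midline
    half_w = (w + 1) // 2
    return (z * h + y) * half_w + x_sym   # unique id per folded cell

ids = symmetric_position_ids(2, 2, 4)
```

With this folding, a patch and its sagittal mirror look up the same learned position embedding, which is one way to encourage the encoder to correlate long-range but symmetric regions.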
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are several concerns about the paper:

    • The paper makes a strong assumption that the input 3D brain MRI is left-right symmetric, and from this point the authors develop SPE to induce the anatomical symmetry prior in the SW-ViT encoder. This is fine for the pre-training phase if all the brain volumes are aligned to BIDS. However, I'm concerned about brain MRIs with diseases/lesions that can significantly change the local regions and appearance of a brain. For example, Alzheimer's Disease and brain tumors can both affect the brain structure in a certain region on one side of the brain, and cerebral atrophy can even severely change one side of the brain structure. When the brain structure is asymmetrically affected by some neurodegenerative disease and the anatomical symmetry assumption no longer holds, does SPE still have any advantage over the vanilla PE? There is no discussion of this in the paper, yet this issue deserves attention, since the proposed ASA is a general brain segmentation model and should be able to handle diseased brains.
    • One question about the downstream fine-tuning on BraTS 2021: this benchmark has 4 MRI modalities, so the input volume has 4 channels; however, the SW-ViT is pre-trained using only T1 MRIs, which are single-channel. When fine-tuning the pre-trained model on BraTS, how does the model handle 4-channel input? More specifically, how is the input patch embedding layer initialized, given the different channel numbers?
    • From the Table 4 ablation study, comparing the SPE&SSL with the SSL-only setting, there are only marginal improvements in the ET Dice and HD95 scores, and the Dice scores for WT and TC are even lower in the SPE&SSL setting. Therefore, it looks like there is no obvious advantage to using SPE under the SSL pre-training scheme.
    • Although the paper emphasizes that SPE improves the symmetric details in segmentation, there is no evaluation of how well the symmetric structures are preserved and segmented in the 3 downstream tasks. I suggest adding some segmentation results which contain symmetric brain structures and are segmented better than those from other SOTA models/methods.
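A common way to bridge the channel gap raised in the BraTS question above (this is a sketch of one standard technique, not necessarily what the authors did) is to inflate the single-channel patch-embedding kernel by replicating it across the new input channels and rescaling by the channel count, so that when all modalities carry the same image the initial activations match the single-channel case:

```python
import numpy as np

def inflate_patch_embed(weight_1ch, in_ch=4):
    """Replicate single-channel patch-embedding weights across `in_ch`
    input channels, rescaling by 1/in_ch so that feeding the same image
    in every channel reproduces the original single-channel output.

    weight_1ch: (embed_dim, 1, pz, py, px) convolution kernel.
    """
    return np.repeat(weight_1ch, in_ch, axis=1) / in_ch

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 1, 2, 2, 2))   # toy 1-channel kernel
w4 = inflate_patch_embed(w1, in_ch=4)        # (8, 4, 2, 2, 2)

# Patch embedding as a dot product over a single patch:
patch_1ch = rng.standard_normal((1, 2, 2, 2))
patch_4ch = np.repeat(patch_1ch, 4, axis=0)  # identical modalities
out1 = (w1 * patch_1ch).sum(axis=(1, 2, 3, 4))
out4 = (w4 * patch_4ch).sum(axis=(1, 2, 3, 4))
```

The rescaling preserves the pre-trained feature statistics at initialization; fine-tuning then lets each modality channel diverge from the shared starting point.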
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

As the authors indicate in the Abstract, the code and model will be released after acceptance. The datasets are all openly accessible, and the hyper-parameters of pre-training, fine-tuning, and the ASA model are also provided in the paper, so this work should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I’d suggest some discussion, and possibly some evaluations, of the model’s ability to handle challenging diseased brains with asymmetric structures, and exploring/explaining whether SPE is still effective on these brain volumes.
    • It would be better to add some visualization results to demonstrate that the proposed model can generate better segmentations of symmetric structures compared with other SOTA methods, in order to verify the motivation and assumptions of the proposed method.
    • Please explain in the Methods part how the SW-ViT backbone pre-trained on 1-channel input is adapted to the 4-channel BraTS images.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the SOTA performance on multiple benchmarks, the clear motivation as well as the novelties in SPE and ARLoss, although the paper does not fully verify and demonstrate the effectiveness and advantage of its symmetric novelty, I think the paper is definitely above borderline and has the potential for acceptance. I would like to see the rebuttal response from the author and adjust my rating of the paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. The authors proposed an attentive reconstruction loss function.
    2. The symmetric position encoding seems useful for this task.
    3. The experimental performance looks good.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Considering that MAE needs a lot of data, it is difficult to apply to medical images. The high performance reported in this paper may be its main strength.
    2. Symmetric Position Encoding is an interesting work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. As we know, ViT is a patch-based structure, and the Swin Transformer is a pixel-based structure. It would be better to add more details about the Shifted Linear Window-based Multi-head Self-attention (SLW-MSA). It may be confusing for the reader whether it is computed at the pixel level or the patch level, so the authors should add more details.
    2. This paper is similar to “Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis” (https://arxiv.org/pdf/2111.14791v1.pdf). The authors should add more discussion to distinguish it from that paper.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors did sufficient experiments on different public datasets, so it should be easy to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be better to add an ablation study on the influence of P.
    2. It would be better to add more details about the Shifted Linear Window-based Multi-head Self-attention (SLW-MSA). I wonder whether it is computed at the patch level or the pixel level.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is an interesting work, and the experiment is sufficient.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose an attentive symmetric auto-encoder based on ViT for 3D brain segmentation from MRI, which makes use of symmetric position encoding and self-supervised pre-training. Additionally, a novel attentive reconstruction loss emphasizing informative patches is proposed.

The paper is well-written and clearly organized. The motivation is intuitive, and the proposed attentive reconstruction loss and symmetric position encoding help pre-training on medical data. The experiments show an increase in performance compared to other SOTA methods and are evaluated on multiple benchmarks and with multiple metrics. Lastly, the paper includes an interesting ablation study.

As pointed out by the reviewers, there are some minor weaknesses in the paper. It is unclear whether the shifted linear window-based multi-head self-attention is computed at a pixel or patch level. Additionally, the authors could have put more focus on distinguishing their work from “Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis”. Lastly, it would be interesting to see in future work how the proposed method performs in the presence of non-symmetric brains (e.g., influenced by diseases such as tumors), along with an evaluation of whether symmetric structures are actually segmented better than with other SOTA methods. However, the method's strengths outweigh its weaknesses, which is why I opt for accept.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

N/A


