
Authors

Jue Jiang, Neelam Tyagi, Kathryn Tringale, Christopher Crane, Harini Veeraraghavan

Abstract

Vision transformers, with their ability to more efficiently model long-range context, have demonstrated impressive accuracy gains in several computer vision and medical image analysis tasks, including segmentation. However, such methods need large labeled datasets for training, which are hard to obtain for medical image analysis. Self-supervised learning (SSL) has demonstrated success in medical image segmentation using convolutional networks. In this work, we developed a self-distillation learning with masked image modeling method (SMIT) to perform SSL for vision transformers, applied to 3D multi-organ segmentation from CT and MRI. Our contribution is a dense pixel-wise regression within masked patches, called masked image prediction, which we combined with masked patch token distillation as a pretext task to pre-train vision transformers. We show our approach is more accurate and requires less fine-tuning data than other pretext tasks. Unlike prior medical image methods, which typically used image sets arising from the disease sites and imaging modalities corresponding to the target tasks, we used 3,643 CT scans (602,708 images) arising from head and neck, lung, and kidney cancers as well as COVID-19 for pre-training, and applied the pre-trained model to abdominal organ segmentation from MRI of pancreatic cancer patients as well as segmentation of 13 different abdominal organs from publicly available CT. Our method showed clear accuracy improvements (average DSC of 0.875 on MRI and 0.878 on CT) with a reduced requirement for fine-tuning data compared to commonly used pretext tasks focusing on full image reconstruction. Extensive comparisons against multiple current SSL methods were performed. Code will be made available upon acceptance for publication.
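
For concreteness, here is a minimal sketch of the masked image prediction pretext task the abstract describes: a dense pixel-wise regression restricted to masked patches. The function name `mip_loss`, the L1 penalty, the patch size, and the masking rate are illustrative assumptions, not the authors' implementation (the paper's exact regression loss may differ):

```python
# Minimal sketch of masked image prediction: regress original voxel values,
# but average the loss over masked patches only.
import torch
import torch.nn.functional as F

def mip_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 regression between predicted and original volumes, masked regions only.

    pred, target: (B, C, D, H, W) reconstructed and original volumes.
    mask:         (B, 1, D, H, W) binary mask, 1 where a patch was masked.
    """
    diff = (pred - target).abs() * mask          # zero out visible regions
    return diff.sum() / mask.sum().clamp(min=1)  # mean over masked voxels

if __name__ == "__main__":
    b, c, d, h, w = 2, 1, 32, 64, 64
    target = torch.randn(b, c, d, h, w)
    pred = torch.randn(b, c, d, h, w)          # stand-in for a ViT decoder output
    # mask whole 8x8x8 patches at an (assumed) 70% rate, then expand to voxels
    patch_mask = (torch.rand(b, 1, d // 8, h // 8, w // 8) < 0.7).float()
    mask = F.interpolate(patch_mask, scale_factor=8)
    print(mip_loss(pred, target, mask))
```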


Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_53

SharedIt: https://rdcu.be/cVRwE

Link to the code repository

https://github.com/harveerar/SMIT.git

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduced knowledge distillation into masked autoencoders (MAE), wherein the teacher network takes all patches of a 3D volume and the student network takes only the visible patches. Apart from the reconstruction loss in the original MAE, the authors proposed to distill the [CLS] token and patch tokens from the teacher to the student network, enforcing global (volume-level) and local (voxel-level) constraints, respectively. The Vision Transformer (ViT) was pre-trained on 3,643 CT scans from a variety of body regions. The efficacy of the pre-trained ViT was evaluated on two datasets, but the description and results of the MRI upper abdominal organ segmentation were unclear in the paper. The results on the BTCV dataset showed that the proposed pre-training approach outperformed existing self-supervised methods developed for CNNs and Transformers.
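
    As a concrete reading of this setup, the sketch below shows hypothetical token-level distillation losses in the style the review describes: the [CLS] token gives a global (volume-level) constraint, masked patch tokens a local one, with an EMA-updated teacher. The temperatures, momentum, and all names are assumptions rather than the paper's settings:

```python
# Hypothetical self-distillation sketch: a momentum teacher embeds the full
# input, the student embeds the masked input, and teacher token distributions
# supervise the student ([CLS] globally, masked patch tokens locally).
import torch
import torch.nn.functional as F

def distill_loss(student_tokens, teacher_tokens, t_student=0.1, t_teacher=0.04):
    """Cross-entropy from teacher to student token distributions.
    tokens: (N, dim) projection-head outputs for N tokens."""
    p_teacher = F.softmax(teacher_tokens / t_teacher, dim=-1).detach()
    log_p_student = F.log_softmax(student_tokens / t_student, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights track the student via an exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)

# toy usage: 1 [CLS] token + 64 patch tokens, 512-dim projections
student_out = torch.randn(65, 512)  # from the masked input
teacher_out = torch.randn(65, 512)  # from the full input
cls_loss = distill_loss(student_out[:1], teacher_out[:1])    # global constraint
masked = torch.rand(64) < 0.7                                # masked positions
patch_loss = distill_loss(student_out[1:][masked], teacher_out[1:][masked])
print((cls_loss + patch_loss).item())
```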

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Open dataset: Most results were obtained from the public BTCV dataset. Recent top solutions on the BTCV benchmark were reported and compared.
    • Clear illustration: The description and illustration of the proposed knowledge distillation approach (both local and global) are clear and easy to implement.
    • Sufficient comparison: The proposed method is compared with several up-to-date self-supervised methods under both CNN and Transformer backbones.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The relationship between [CLS] token and global image embedding is unclear.
    • The conclusion of 1-layer vs. multi-layer decoder needs to be clarified.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is easy to implement the idea based on the method description.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I’m willing to increase the rating if the authors could clarify the following two aspects.

    1. What is the reason that the [CLS] token carries global information in the image reconstruction task? This makes sense for an image classification task (as stated in [24]), but it is not appropriate to directly borrow this assumption for an image reconstruction task (a pixel-wise task) without justification. What is the role of the [CLS] token in an image reconstruction task? Why does a global image embedding matter for an image reconstruction task?
    2. The conclusion of Fig. 5 remains unclear. A 1-layer decoder seems to produce better reconstructed images than a multi-layer decoder, but does a lower MSE loss mean a better representation? Fine-tuning results for the 1-layer vs. multi-layer decoders should be presented along with the reconstruction quality.

    Here are some suggestions for improving the paper:

    • The authors assembled several CT datasets for pre-training by image reconstruction. One possible issue is the differences across these CT datasets, such as contrast enhancement, because restoring pixel intensities can be strongly influenced by imaging protocols. This domain gap might make image reconstruction more challenging to accomplish. Please comment on this.
    • For target tasks, have the authors applied the same pre-processing to Dataset I (CT scans) as the one used in pre-training? What about pre-processing for Dataset II (MRI scans)? How to address the domain difference (in terms of data) between pre-training and fine-tuning? The pre-processing steps for the target tasks should be included in the paper.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, it is a great study that incorporates local and global information into masked autoencoders, yielding performance improvements on two datasets. The authors also benchmarked against several representative self-supervised methods built on CNNs and Transformers. However, several aspects need to be clarified, such as (1) the relationship between the [CLS] token and global information and (2) the conclusion of the 1-layer vs. multi-layer decoder comparison.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a new self-supervised learning method, SMIT, for 3D multi-organ segmentation. Specifically, they use a ViT with masked image modeling (MIM) to learn dense patch features and use MT self-distillation to train the model. Extensive experiments demonstrate the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method is novel. It combines MIM and MT to train a ViT.
    2. Experiments and ablation study are thorough.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Missing one ablation study: What is the performance of using MIM only? This would help us understand the respective roles that MIM and self-distillation play in the pre-training process.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors provide enough information on method details and experimental settings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See the major weakness above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is well presented and sound. Extensive experiments demonstrate the effectiveness of the proposed method.

    It would be better if one more ablation study (see above) were added.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposed a self-supervised learning method for 3D multi-organ segmentation. They presented a self-distilled masked image transformer to pre-train the segmentation network and used the pre-trained model to initialize the segmentation model for better performance. The method has been validated on two public datasets with good experimental results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It is interesting to exploit powerful self-supervised learning techniques to improve the performance of target downstream tasks.

    2. The improvements over other SSL methods on two datasets look nice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. As a matter of fact, I am more curious about the performance of applying the CT pre-trained model to MR segmentation tasks. Please also provide the segmentation performance of all methods trained from scratch, as well as the results of SMIT using the proposed SSL method (essentially the MR version of Table 1).

    2. When fine-tuning SMIT for target downstream tasks, what computational time and memory consumption are needed to achieve satisfying segmentation performance? How does this compare to other SSL methods?

    3. When fine-tuning SMIT on the MR segmentation task, will it require more samples and more training time to obtain good results compared with fine-tuning it on the CT segmentation task?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It would be easier to reproduce this work if the code were released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be better to provide more description of the proposed method in the caption of Fig. 1.

    2. I would suggest briefly discussing the finetune difficulties among different downstream tasks (e.g., MR organ segmentation and CT organ segmentation).

    3. Please provide more details on the dataset split, such as the train/validation/test ratio.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The improvements over other SSL methods on two public datasets are significant.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviewers agree that there is some novelty in the method design (e.g., incorporating local and global information into masked autoencoders) and that the evaluations show improvements compared to some recent approaches. However, compared to the online leaderboard of BTCV (the standard challenge set), the reported accuracies are not as good as some of the latest methods, e.g., pancreas 0.898 vs. 0.851, esophagus 0.875 vs. 0.822. In addition, the reviewers point out some unclear points and provide constructive comments, which can help improve the paper's quality.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank the AC and reviewers for their insightful comments. We are encouraged by the AC's and reviewers' enthusiasm for this work. We have addressed their concerns in the paper; our responses to the major concerns are below:

[R1] Relationship of the [CLS] token and the global image embedding: The [CLS] token is a way of representing the global image embedding, which we have clarified in the paper.

[R1] Reason the [CLS] token helps image reconstruction (a pixel-wise task) ... justification: The [CLS] token is derived using a multi-headed transformer, where each head attends to different local parts of the image. As a result, [CLS] tokens capture the global embedding from multiple local contexts, and this global embedding provides the semantic information needed for inference on local parts, including pixels, similar to self-attention. Hence, the [CLS] token used by itself leads to more accurate segmentations for small and highly variable organs than image reconstruction losses.

[R1] Unclear conclusion of Fig. 5: A lower MSE loss indicates better reconstruction, as shown using the 1-layer vs. multi-layer decoder. Segmentation accuracies with the 1-layer (MIP) vs. multi-layer (ML-MIP) decoders were already included in Fig. 4. We have now clarified this in the paper.

[R1] The domain gap might make image reconstruction more challenging, please comment: To address this, we updated our discussion as, "Transformers, which extract the global embedding of the image, are known to be robust to domain differences (Naseer et al., NeurIPS 2021). Our results showed robust reconstruction for CT despite large differences in image acquisition, and the approach was also easily translatable to an entirely different MRI modality due to the ability of transformers to reliably extract the anatomic embedding in these images."

[R1] Pre-processing for Dataset I (CT) and Dataset II (MRI): We clarified this in the paper as, "The same intensity pre-processing, including rescaling to [-175 HU, 250 HU], was used for Dataset I as in pre-training. Dataset II (MRI) was first subjected to histogram standardization to a randomly selected scan, followed by intensity clipping to [0, 2000] (2000 corresponds to the 95th percentile of the MR intensity distribution) and intensity normalization to [0, 1]."

[R2] Performance using MIM only: We have now clearly stated that MIP (Fig. 4) corresponds to MIM.

[R3] Segmentation performance of all methods from scratch ... the MR version of Table 1: While interesting, we cannot include this due to space constraints. It is planned for an expanded journal version of the paper.

[R3] Computational time and memory consumption: Extracting the computational times would require redoing all of the analyses, which is not feasible at this time. Hence, we will include this in the expanded journal version of the paper.

[R3] More description of the proposed method in the caption of Fig. 1: We have updated the caption of Fig. 1.

[R3] Discuss fine-tuning difficulties among different downstream tasks (MR ... CT): We updated the discussion as, "Larger intensity variations between MRI scans than between CT scans were addressed by performing histogram standardization. Also, T2-weighted MRI captures anatomic information like CT and has higher soft-tissue contrast, which aids fine-tuning despite pre-training with CT datasets."

[R3] Details on the split of the dataset into train, validation, and test: Details were included in Lines 13-15 and 22-23 on Page 5, Sec. 3, Training dataset.
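
For readers who want the quoted pre-processing made concrete, here is an illustrative NumPy sketch under the stated intensity ranges. The quantile-mapping stand-in for histogram standardization, the [0, 1] rescaling of CT, and all function names are assumptions, not the authors' actual pipeline:

```python
# Illustrative sketch of the rebuttal's pre-processing: CT windowing for
# Dataset I and histogram standardization + clipping for Dataset II (MRI).
import numpy as np

def preprocess_ct(volume_hu: np.ndarray) -> np.ndarray:
    """Dataset I (CT): clip intensities to the [-175, 250] HU window used in
    pre-training, then (assumed) min-max rescale to [0, 1]."""
    v = np.clip(volume_hu, -175.0, 250.0)
    return (v + 175.0) / 425.0

def preprocess_mr(volume: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Dataset II (MRI): histogram-standardize to a reference scan, clip to
    [0, 2000] (~95th percentile of the MR intensity distribution), then
    normalize to [0, 1]."""
    quantiles = np.linspace(0.0, 1.0, 256)
    src_q = np.quantile(volume, quantiles)     # source intensity landmarks
    ref_q = np.quantile(reference, quantiles)  # reference intensity landmarks
    matched = np.interp(volume, src_q, ref_q)  # quantile (histogram) mapping
    return np.clip(matched, 0.0, 2000.0) / 2000.0

if __name__ == "__main__":
    ct = preprocess_ct(np.random.uniform(-1000, 1500, (16, 64, 64)))
    mr = preprocess_mr(np.random.uniform(0, 3000, (16, 64, 64)),
                       np.random.uniform(0, 2500, (16, 64, 64)))
    print(ct.min(), ct.max(), mr.min(), mr.max())
```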


