Authors

Nurislam Tursynbek, Marc Niethammer

Abstract

Inspired by findings that generative diffusion models learn semantically meaningful representations, we use them to discover the intrinsic hierarchical structure in biomedical 3D images using unsupervised segmentation. We show that features of diffusion models from different stages of a U-Net-based ladder-like architecture capture different hierarchy levels in 3D biomedical images. We design three losses to train a predictive unsupervised segmentation network that encourages the decomposition of 3D volumes into meaningful nested subvolumes that represent a hierarchy. First, we pretrain 3D diffusion models and use the consistency of their features across subvolumes. Second, we use the visual consistency between subvolumes. Third, we use the invariance to photometric augmentations as a regularizer. Our models perform better than prior unsupervised structure discovery approaches on challenging biologically-inspired synthetic datasets and on a real-world brain tumor MRI dataset. Code is available at https://github.com/uncbiag/diffusion-3D-discovery.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_31

SharedIt: https://rdcu.be/dnwcI

Link to the code repository

https://github.com/uncbiag/diffusion-3D-discovery

Link to the dataset(s)

N/A

Reviews

Review #3

Please describe the contribution of the paper

This paper presents a multi-stage unsupervised pipeline for 3D semantic segmentation. In the first step, the authors train a diffusion model. They then apply three self-supervised objectives (visual consistency, local feature consistency, and photometric invariance) on top of the features extracted from the diffusion model. Experiments are performed on both real and synthetic data, and both qualitative and quantitative results are presented. The results show the superiority of the proposed method over existing methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Unsupervised semantic segmentation of 3D volumetric data has considerable clinical significance and is therefore an important research problem.

The paper is generally well written. The literature survey from the medical domain seems adequate. A few important works from computer vision are also mentioned.

The motivation to use a diffusion model to extract features seems convincing.

Experimental results compared with the existing methods show the superiority of the method.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

This paper employs a diffusion model trained on the medical dataset. However, neither qualitative nor quantitative results of this model are presented. It is hard to judge how well the diffusion model is trained. How crucial would it be to improve the performance of the diffusion model in order to improve the segmentation performance?

From the ablation study, it seems that the contributions of the local feature consistency and visual consistency objectives are redundant. The performance gap is only marginal. Do we really need both objectives to train the model?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Essential network configurations and hyper-parameters are shared.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The section on the proposed method is rather short. The authors could condense the introduction and experimental sections and elaborate on the method section. The main contributions of the paper are presented in a few equations, which makes the article read rather like a report.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The application is important. The idea of using a diffusion model is also interesting, and the results show the effectiveness of the proposed method.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #4

Please describe the contribution of the paper

The paper proposes a new form of unsupervised segmentation of 3D medical images using fine-tuning of diffusion models with a clustering-like loss. They perform experimental evaluation of their method on synthetic, cell-like images and tumor segmentation on BraTS.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The usage of generative (diffusion) models for unsupervised segmentation / part discovery is novel and well motivated through the lack of publicly available ImageNet-like discriminative models for 3D medical data.
2. The proposed method outperforms the compared baselines significantly and the qualitative results look convincing.
3. The description of the methods and experiments is clear and precise.
4. The authors provided an ablation experiment of the different loss components
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Comment 1: Clustering baselines are probably weak

The k-means baseline was trained with only one input feature (intensity). This is not state of the art. Also, Gaussian mixture models are long-established clustering methods and should be compared with the proposed method.
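
For illustration only, the following is a minimal sketch of what a k-means baseline on a single intensity feature amounts to: Lloyd's algorithm applied to scalar voxel values, with no spatial or learned features. The function name `kmeans_1d`, the evenly spaced initialization, and the toy intensities are assumptions made for this example; they are not taken from the paper or its code.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar intensities; returns (centers, labels).

    Hypothetical helper for illustration, not the baseline used in the paper.
    """
    lo, hi = min(values), max(values)
    # Deterministic init (an assumption): centers spread evenly over the range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each voxel goes to the nearest center by intensity.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its assigned voxels.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, labels


# Toy example: flattened voxel intensities with two obvious intensity groups.
voxels = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
centers, labels = kmeans_1d(voxels, k=2)
```

Because the only input is intensity, two voxels with the same value always receive the same label regardless of where they lie in the volume, which is the limitation this comment points at.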

Comment 2: Easy experimental tasks

Meissen et al. (https://arxiv.org/pdf/2109.06023.pdf) have shown that tumors in BraTS FLAIR images can be segmented successfully using simple thresholding, reaching a Dice score that outperforms the proposed method. Considering this result, the selected tasks are relatively easy. Discuss this and / or investigate the reliance on unique intensity values via evaluation on BraTS T1.
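
As a rough sketch of the kind of baseline referred to here, the snippet below thresholds intensities into a binary mask and scores it with the Dice coefficient, 2|A∩B| / (|A| + |B|). The helper names, the threshold value, and the toy data are assumptions for illustration; this does not reproduce the preprocessing or threshold selection of Meissen et al.

```python
def threshold_mask(intensities, thr):
    """Binary mask: True where the voxel intensity exceeds the threshold."""
    return [v > thr for v in intensities]


def dice(pred, target):
    """Dice coefficient between two equal-length binary masks (flattened voxels)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: five voxels, two of which belong to the reference mask.
intensities = [0.1, 0.8, 0.9, 0.2, 0.7]
reference = [0, 1, 1, 0, 0]
pred = threshold_mask(intensities, thr=0.5)
score = dice(pred, reference)
```

In this toy case the threshold yields one false positive, so the intersection is 2 and the mask sizes sum to 5.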

Comment 3: Missing information for Reproducibility

See “Comment on Reproducibility”
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Reproducibility is rather weak and significantly lower than promised in the reproducibility checklist. The checklist contains only “yes” answers, even for questions that are not applicable to this work. However, the manuscript is missing much of the promised information, such as a description of the compute hardware, a description of results with central tendency (e.g. mean) and variation, details on how baseline methods were implemented and tuned, a description of the memory footprint, a discussion of clinical significance, the average runtime for each result or estimated energy cost, a note that code will be released, and more.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Please explain the difference between a “ladder-like” UNet and a regular one if there is any. If not, consider removing the term.
2. A limitation of the proposed method is the selection of K. Although this limitation is shared with many other clustering methods, it needs to be stated.
3. For a full picture, information about memory requirements and compute should be included. Diffusion models are known to be slow in training and inference, and 3D UNets with 128x128x128 volumes require a lot of memory.
4. Consider replacing the term “parts” in the Methods section with “clusters”.
5. In a possible journal extension, more ablations would be interesting, for example on the choice of t=25 and the degradation of performance if the number of clusters K is increased.
6. Also, the method has only been tested on the FLAIR images in BraTS. To test its reliance on intensity values, evaluation on BraTS T1 is required.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper shows a novel approach to unsupervised clustering that is interesting enough to be presented at MICCAI. However, the experimentation seems weak. Also, the untrue answers in the reproducibility checklist lower the trustworthiness of the work.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The authors propose an innovative approach to unsupervised 3D structure discovery by pre-training a 3D diffusion model as a feature extractor and designing specific loss functions. The authors demonstrate that the features obtained from different stages of ladder-like U-Net-based diffusion models effectively capture distinct hierarchical levels within 3D biomedical volumes. Importantly, the proposed method outperforms previous 3D unsupervised discovery techniques on challenging synthetic datasets and even on a real-world brain tumor segmentation dataset (BraTS’19).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper introduced the use of generative diffusion models to uncover intrinsic hierarchical structures in biomedical 3D images through unsupervised segmentation. This approach addresses issues such as the lack of good feature extractors for 3D biomedical images and ImageNet pre-training networks that operate only on 2D images.
2. This paper leverages features extracted from different stages of a U-Net based ladder-like architecture to capture different hierarchical levels in 3D biomedical images. This original use of features demonstrates a unique approach to analyzing and interpreting data.
3. In this paper, three losses are designed to train a predictive unsupervised segmentation network that encourages the decomposition of 3D volumes into meaningful nested volumes representing hierarchies. The use of pre-training, visual consistency, and invariance to photometric augmentation as regularization techniques enhances the robustness and accuracy of the proposed model.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The description of the proposed method needs to be provided in more detail. It only describes how the three loss functions are set up, but does not explain the details of the main model.
2. Diffusion models have been applied to image segmentation in 2D medical images, but this paper does not explain how its diffusion model improves over 2D methods.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The parameters given in this paper are only the size of the input image, the learning rate and the number of iterations, while the specific parameters of the model (such as the size of the convolution kernel and the weight of the loss function) are not explained.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The 3D diffusion model is one of the main contributions of this paper, but instead of giving a detailed description in the Methods section, this paper gives only a general introduction to the diffusion model mechanism. Therefore, it is recommended that the authors clarify the differences and commonalities between the 3D diffusion model and its 2D counterpart.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

In this paper, the writing is fluent, the difficulties of applying 2D unsupervised image methods to 3D and the problems solved by the proposed model are explicitly presented, and the results obtained by the unsupervised model are valid. However, the novelty of this paper is somewhat weak. While it proposes three loss functions, it only transfers the 2D diffusion model to 3D and does not explain specific innovations in going from 2D to 3D.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors propose an innovative approach to unsupervised 3D structure discovery by pre-training a 3D diffusion model as a feature extractor and designing specific loss functions. The proposed method outperforms previous 3D unsupervised discovery techniques on challenging synthetic datasets and even on a real-world brain tumor segmentation dataset (BraTS’19). The three reviewers also affirmed the merits of this paper. The issues include adding details to the description of the proposed method, explaining the quality of the diffusion model, and some other details mentioned by the reviewers. Please address these concerns in the final version.

Author Feedback

We are grateful for the feedback provided by the reviewers and the meta-reviewer. Thank you for the encouraging words about the novelty and the innovation of our approach.

The two main review concerns were the brevity of the method description and the lack of implementation details. In the camera-ready version we will provide more methodological details, including a discussion of our network architecture. We will also provide implementation details of our method, including hardware specifications and training/inference times to improve reproducibility.

Other concerns:

Difference between 2D and 3D diffusion models: The main difference between 2D and 3D diffusion models is that the former work in pixel space, while the latter work in voxel space. Working in voxel space should make it easier to capture 3D spatial context. It would be interesting to explore the impact of 2D versus 3D diffusion modeling for our unsupervised segmentation task in future work.

Quality of generated images: We found that increasing generation quality (i.e., training diffusion models for more epochs) initially increases the quality of extracted features and then plateaus after some point. We suspect that this is because at later points diffusion models may start focusing on generation details, while coarser features may be more beneficial for segmentation.

Redundancy of some of the losses: We agree that disabling some losses does not decrease the performance much. However, to obtain the best overall performance all three losses are necessary.

BraTS T1 experiments: For a fair comparison we followed the experimental procedure of (Hsu et al.), where the FLAIR images of BraTS are used.

Comparison with the mentioned thresholding technique: The thresholding technique of (Meissen et al.) is semi-supervised, as it includes training on healthy scans without anomalies. We only compared our method with unsupervised methods that use only the given BraTS data. We showed a fully supervised method as an approximate upper bound.

Difference between U-Nets and “ladder-like” U-Nets: All U-Net decoders have “ladder-like” designs, meaning the step structure of the upsampling layers. Thus, there is no difference between U-Nets and “ladder-like” U-Nets. We will clarify this in the camera-ready version.

“Clusters” instead of “Parts”: We agree that in the Methods section we should use clusters to describe the segmented parts.

Ablation studies for the choice of t and K: We found that changing the timestep does not significantly affect the extracted features. For using different K (i.e., a different number of clusters/parts) we do not have ground truth / gold standard segmentation labels. That the number of clusters, K, needs to be specified is a shortcoming of our approach. We will state this shortcoming in the camera-ready version of the paper.

Unsupervised Discovery of 3D Hierarchical Structure with Generative Diffusion Features