
Authors

Lingting Zhu, Zeyue Xue, Zhenchao Jin, Xian Liu, Jingzhen He, Ziwei Liu, Lequan Yu

Abstract

Cross-modality medical image synthesis is a critical topic and has the potential to facilitate numerous applications in the medical imaging field. Despite recent successes in deep-learning-based generative models, most current medical image synthesis methods rely on generative adversarial networks and suffer from notorious mode collapse and unstable training. Moreover, 2D backbone-driven approaches easily result in volumetric inconsistency, while 3D backbones are challenging and impractical due to the tremendous memory cost and training difficulty. In this paper, we introduce a new paradigm for volumetric medical data synthesis by leveraging 2D backbones and present a diffusion-based framework, Make-A-Volume, for cross-modality 3D medical image synthesis. To learn the cross-modality slice-wise mapping, we employ a latent diffusion model and learn a low-dimensional latent space, resulting in high computational efficiency. To enable 3D image synthesis and mitigate volumetric inconsistency, we further insert a series of volumetric layers into the 2D slice-mapping model and fine-tune them with paired 3D data. This paradigm extends the 2D image diffusion model to a volumetric version with only a slightly increased number of parameters and computation, offering a principled solution for generic cross-modality 3D medical image synthesis. We showcase the effectiveness of our Make-A-Volume framework on an in-house SWI-MRA brain MRI dataset and a public T1-T2 brain MRI dataset. Experimental results demonstrate that our framework achieves superior synthesis results with volumetric consistency.
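
The volumetric-layer insertion described in the abstract (run the frozen 2D backbone per slice, then mix information across slices with inserted depth-wise layers) can be illustrated with a shape-level sketch. All shapes and variable names below are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

# Hypothetical latent shapes: B volumes, C channels, D slices of H x W.
B, C, D, H, W = 2, 4, 8, 16, 16
volume = np.random.randn(B, C, D, H, W)

# 2D slice-mapping stage: fold the depth axis into the batch so every
# slice passes through the 2D backbone independently.
slices = volume.transpose(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
assert slices.shape == (B * D, C, H, W)

# Inserted volumetric stage: fold the spatial axes into the batch so a
# 1D layer (conv/attention along depth) can mix information across slices.
depth_seq = volume.transpose(0, 3, 4, 1, 2).reshape(B * H * W, C, D)
assert depth_seq.shape == (B * H * W, C, D)

# Unfold back to the original volume layout after the depth-wise layer.
restored = depth_seq.reshape(B, H, W, C, D).transpose(0, 3, 4, 1, 2)
assert np.array_equal(restored, volume)
```

The same fold/unfold pattern is what lets the inserted layers reuse the 2D model's weights untouched: only the small depth-wise layers see the third dimension.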



Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_56

SharedIt: https://rdcu.be/dnww9

Link to the code repository

N/A

Link to the dataset(s)

https://rire.insight-journal.org/index.html


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes an approach for cross-modality 3D medical image synthesis which (i) uses denoising autoencoders on slice-wise models, and (ii) applies a 3D supervised model on slices for 3D consistency. Experiments on two brain MRI datasets show it outperforming one recent and three older baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method performs well qualitatively.
    • I found the paper easy to follow.
    • The authors are honest about the limitations of the experimental results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Baselines and Experimental Results

    • Pix2Pix, Pix2Pix 3D, and CycleGAN are all old methods (≤ 2017) – the authors should employ more recent methods that improve on these techniques. Generally, we need at least two recent, functioning baselines to show that the presented approach cannot be outperformed by an off-the-shelf technique.
    • ‘We do not include naive 3D diffusion-based models as we fail to train an efficient backbone’ - this is a very large weakness of the paper. It is important to compare 3D Diffusion vs 2D Diffusion + harmonizing slices.
    • As the 3D data is paired, there are approaches in the MICCAI / medical imaging community that could be directly applied. For example, [Blumberg, Ning] use patch-based approaches; another approach is [Torbati]. None of these approaches use GANs, which disproves the claim that ‘medical image synthesis methods rely on generative adversarial networks and suffer from notorious mode collapse and unstable training’ (abstract).
    • ‘For the Palette method, we implemented the 2D version but were unable to produce high-quality slices stably and failure cases dramatically affected the metrics results.’ I understand it is difficult to implement these baselines. However, the Palette paper showed large improvements over the CycleGAN - and this paper shows the complete opposite. I suggest the authors select other baselines if they cannot get Palette to work.

    Questions

    • Are the 2D outputs of the 2D-CycleGAN and Palette fed as inputs to a 3D-U-Net – for fair comparison?
    • Why did you resize the data – this might compromise the quality if you use e.g. interpolation.
    • Is the RIRE data paired as well (I suppose it is, as you use pix2pix)?
    • Did you consider using approaches from domain adaptation [Domain Adaptation Papers List]?

    I look forward to the authors’ responses.

    References:
    [Blumberg] Blumberg et al., "Multi-stage prediction networks for data harmonization," MICCAI 2019.
    [Ning] Ning et al., "Cross-scanner and cross-protocol multi-shell diffusion MRI data harmonization: Algorithms and results," NeuroImage 2020.
    [Torbati] Torbati et al., "Multi-scanner harmonization of paired neuroimaging data via structure preserving embedding learning."
    [Domain Adaptation Papers List] https://paperswithcode.com/task/domain-adaptation/latest

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good. Consider providing anonymized code for the reviewers.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The notation of eqs. 5–7 is very strange; add commas between the different dimensions.
    • No point labeling the above equations if they are not referred to later on.
    • CyclaGan -> CycleGAN in table 1.
    • Zoom-in regions in figure 2.
    • ‘The drawbacks of the generating and fooling fashion’ doesn’t make sense.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Issues with the baselines and experimental results (see weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Thank you for the reply.

    ‘Q1… no 3D network backbone available in literature’ This is a good explanation. I still think more analysis on this would be useful, e.g. quantitative experiments on memory usage – it could go into the supplementaries.

    ‘added two recent baselines Reg-GAN [1] and CUT [2] for comparison… [results]’ I am pleased with these new results, but am concerned that these recent baselines are still worse than Pix2Pix 3D (from 2017). I still find it difficult to believe the best baseline performance (out of all the many options) is by this old approach.

    ‘mentioned harmonization methods … may be insufficient for complex cross-modality tasks due to different targets.’ The papers perform multiple-input-to-target harmonization, so it is unclear what you mean by complex cross-modality tasks. All metrics calculated in this paper could easily be computed for the harmonization methods.

    ‘reproducibility, our code is put [link]’ The code is clearly explained and appears well-structured. I think it will be a valuable addition to the community. I hope the authors also consider releasing the weights.

    ‘Only 2D/3D networks are used for 2D/3D methods’ This is an unfair comparison then - as suggested in my comment, the 2D slices should be combined together as input to a 3D network.

    ‘Paired’ [following my comment on dataset types] This should be made more explicit in the paper. Why discuss GANs if they could be avoided by other simpler techniques on paired data?

    I am raising my score given (i) the explanation of why 3D techniques are not feasible, and (ii) the provided code. However, I am still concerned about the baselines used, their implementations, and the experimental settings. Therefore I cannot raise it to accept.



Review #2

  • Please describe the contribution of the paper

    In this work, the authors propose to improve slice-to-slice consistency in volumetric synthesis and name their method “Make-A-Volume”. The technique involves two stages; the first stage learns a standard latent diffusion model (LDM) on 2D slices to map across contrasts, while the second stage refactors features from the 2D model into 3D and fine-tunes to map volumes across contrasts. This novel yet straightforward formulation is shown in this work to improve slice-to-slice consistency in two MR synthesis tasks: SWI-to-MRA and T1-to-T2. One model is trained for each task with appropriate train/test splits, and the authors show improved qualitative and quantitative results compared to 2D-slice-independent and 3D GAN-based baselines. Furthermore, the authors show in Figure 4 that their “volumetric layer insertion” technique does indeed improve slice-to-slice consistency in a small ablation experiment.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work is overall valuable, since there is much evidence to show that 2D models often perform better (there is also more training data), yet slice-to-slice consistency remains a problem. This paper addresses that problem directly.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major concerns:

    • The primary missing experiment to make this work convincing is the 3D LDM baseline. The authors are able to fit their 3D model in stage 2, so why not train that model directly? The primary benefit of LDM is to learn a lower-dimensional manifold in the latent space of a pre-trained autoencoder, which is available. Without this comparison, it is not possible to conclude that the two-stage approach of “learn 2D” then “fine-tune 3D” is what led to improvements over the baselines rather than just the LDM framework overall.

    Minor concerns:

    • Typo in Table 1: “CyclaGAN”
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This method is not reproducible. Although the method is explained fairly clearly, the pre-trained autoencoder is not available, nor is its training well-described. Also, without code, no reproducibility guarantees can be made.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see the missing experiment outlined above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the summary statement and concerns outlined above.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My decision has not changed, but I want to provide reasoning why. My concern was the following: “fine-tuning the 3D model in stage two is possible with the authors’ hardware. Why not train that model from scratch, rather than fine-tuning?” The rebuttal from the authors seems to have misunderstood my concern, or at least have not resolved this ambiguity. My decision remains as “weak accept”.



Review #3

  • Please describe the contribution of the paper

    The paper proposes Make-a-Volume, a latent diffusion model (LDM) for 3D volumetric synthesis that can be directly fine-tuned from the 2D LDM without touching any of the weights of the 2D model. The method is applied to medical image translation tasks, where the method outperforms other comparison methods by a large margin.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and easy to understand, with clear motivations. Interleaving a pre-trained diffusion model with additional layers to be fine-tuned has been one of the key building blocks of recent diffusion models that scale image models toward video models. In a similar spirit, the proposed method does something similar but adapts it well to the volumetric generation case. The method is intuitive to understand, but seems to be a product of hard engineering, including modifying the noise sampling schedule. The authors did a good job of putting this effort together. Finally, it is worth noting that conditional diffusion models for medical image-to-image translation, especially for 3D volumetric data, have been underexplored. The proposed method would be a good cornerstone for future approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think the authors are missing some relevant citations and discussions of related work. For one, [1] is also a 3D LDM, which is not fine-tuned from a 2D image LDM but learned directly from abundant sources of 3D data. I do believe that Make-a-Volume has advantages over that method, but the discussion seems unavoidable. Another method that leverages 2D diffusion for 3D generation without fine-tuning is [2]. Since [2] is not designed for translation but as a general generative model, the two models are inherently different, yet strongly related, such that it is worth mentioning.

    [1] Pinaya, Walter HL, et al. “Brain imaging generation with latent diffusion models.” Deep Generative Models: Second MICCAI Workshop, DGM4MICCAI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022, Proceedings. Cham: Springer Nature Switzerland, 2022.

    [2] Lee, Suhyeon, et al. “Improving 3D Imaging with Pre-Trained Perpendicular 2D Diffusion Models.” arXiv preprint arXiv:2303.08440 (2023).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper seems to be reproducible. However, everything is marked N/A for open-sourcing the code, which is a bit worrying on the reproducibility side.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The comments in the weakness section are mostly minor. I believe this is a strong paper that would benefit the community. Well done.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has no particular weaknesses. It is a strong contribution to the community.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes an adaptation of the 2D latent diffusion model to the 3D brain MR synthesis problem. However, reviewers (particularly R1 and R2) raised concerns about the experimental design and the persuasiveness of the comparisons made in the paper. Additionally, the feasibility of training a 3D diffusion model directly needs to be addressed. Reproducibility is also a major issue that should not be overlooked in evaluating the merit of this paper. Anonymous code release would be a favorable option to ensure reproducibility.




Author Feedback

We appreciate reviewers’ favorable comments on the significance (R2, R3), novelty (R2), and clear writing (R1, R3). We clarify the key issues below.

Q1: Feasibility of directly training 3D diffusion models (AC, R1, R2, R3)

  • It is difficult or impractical to directly train 3D diffusion models. The main reason is that no powerful 3D network backbone is available for diffusion models in the related literature, and developing such a backbone is extremely challenging. Simply replacing 2D convolutions with 3D convolutions in existing 2D backbones makes training infeasible due to GPU memory limits. Our current 2D and pseudo-3D models have already reached the limit of what common hardware (A100) can handle.
  • Though we could seek lightweight backbones, where a simpler 3D network or a pseudo-3D network (e.g., stage 2 in our paper) could be applied, another major challenge arises. In diffusion models, a random diffusion timestep (out of 1000 steps) is sampled for each input. If we build the diffusion model on a 3D backbone (one timestep per volume), sampling efficiency is extremely low, whereas sampling timesteps per slice is 100x more efficient when the z-axis size is 100. That is also why we need to adopt a dual-stage approach. (R2 mentioned)
  • In our preliminary experiments, directly training our second-stage model or adopting a simple 3D U-Net could not produce meaningful results (MAE on S2M: 100.91 and 115.32). Instead, our two-stage solution provides a practical way (though only a first step) to implement pseudo-3D diffusion for medical data.
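
The timestep-sampling argument above can be made concrete with a toy sketch (the numbers and names are illustrative, not from the paper): per-volume sampling supervises one noise level per training pass, while per-slice sampling covers as many noise levels as there are slices.

```python
import random

random.seed(0)
T = 1000          # diffusion timesteps, as in the rebuttal
num_slices = 100  # illustrative z-axis size of one volume

# Per-volume sampling: one training pass over a volume draws a single
# timestep, so the network is supervised at only one noise level.
per_volume = [random.randrange(T)]

# Per-slice sampling: the same volume yields one independent timestep
# per slice, covering num_slices noise levels in a single pass.
per_slice = [random.randrange(T) for _ in range(num_slices)]

assert len(per_slice) == num_slices * len(per_volume)  # 100x more draws
assert all(0 <= t < T for t in per_slice)
```

This is the 100x efficiency gap the rebuttal refers to, and why the slice-wise stage is trained first before the volumetric fine-tuning stage.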

Q2: Compared baselines (AC, R1) We have added two recent baselines, Reg-GAN [1] and CUT [2], for comparison. The MAE on S2M of Reg-GAN and CUT is 6.83/7.08 (worse than ours). Although the mentioned harmonization methods are effective for cross-scanner data, they may be insufficient for complex cross-modality tasks due to different targets. Some of them adopt DL and can be roughly treated as encoder-decoder models w/ or w/o adversarial losses. We implemented Pix2Pix w/o adversarial loss on S2M and the performance dropped hugely (MAE 11.63), suggesting the crucial role of adversarial losses in medical image synthesis. (R1 mentioned)

[1] Kong et al. Breaking the dilemma of medical image-to-image translation. NeurIPS 2021. [2] Park et al. Contrastive learning for unpaired image-to-image translation. ECCV 2020.

Q3: Reproducibility (AC, R1, R2, R3) To show reproducibility, our code is put on anonymous.4open.science with the postfix /r/anonymous_miccai23-E143 (combining the website link with the postfix).

Q4: Failed Palette runs (R1) There could be several reasons, including the use of unofficial code and different application domains. Despite these limitations, we still included this method as it provides a clear visualization of volumetric inconsistency.

Q5: Other comments of R1

  • We include both 2D and 3D methods. Only 2D/3D networks are used for 2D/3D methods.
  • We resize all the data to 256 due to memory limitation.
  • Paired.
  • Our work (synthesis) is related to domain adaptation (DA) in the context of domain shift, but they are two different tasks.

Q6: Training 3D LDM (R2) Thanks for your concern. The reason our approach cannot be simplified to one stage is timestep sampling efficiency. Besides, a low-dimensional latent strikes a balance between computational efficiency and preserving plenty of information. Please also refer to Q1.

Q7: Discussion of related works (R3) Thanks for your kind suggestions. We will include them in the revision. Pinaya et al. focus on the unconditional setting; though it directly generates 3D data, performance may not be good for complex conditional tasks (due to the high compression factor in 3D). Ours focuses on pseudo-3D generation with manageable memory in a conditional setting. Lee et al. is concurrent with ours (made public after the MICCAI deadline), and both focus on leveraging 2D diffusion for 3D generation (ours directly enhances 2D models with the same targets).




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed approach of incorporating an adapter layer to enable 2D-to-3D conversion demonstrates significant value. Extensive studies support the notion that 2D models generally outperform their 3D counterparts due to larger training datasets; however, the issue of slice-to-slice consistency has persisted as a challenge. This paper directly tackles this problem.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After reading the rebuttal as well as the reviewers’ feedback, I agree that even though the authors are addressing a very important problem, the current method/results do not convince me that this work can bring new insights to the community. The core concerns have not been addressed at all. I recommend reject.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes Make-a-Volume, a latent diffusion model (LDM) for 3D volumetric synthesis that can be directly fine-tuned from the 2D LDM without touching any of the weights of the 2D model. The method is applied to medical image translation tasks, and the results outperform the comparison methods. Interleaving a pre-trained diffusion model with additional layers to be fine-tuned has been one of the key building blocks of recent diffusion models that scale image models toward video models, and the authors have made a good attempt in this work. However, there are still some open questions, e.g., the generality of the proposed model, whether it can be used for multiple datasets and even multiple body parts, and whether it is easy to reproduce. More description of the model is needed. Combining the comments of the reviewers and myself, it is an interesting paper whose merits slightly outweigh its weaknesses.


