
Authors

Qingqiao Hu, Hongwei Li, Jianguo Zhang

Abstract

Medical image synthesis has attracted increasing attention because it can generate missing image data, improving diagnosis and benefiting many downstream tasks. However, the synthesis models developed so far are not adaptive to unseen data distributions that present domain shift, limiting their applicability in clinical routine. This work focuses on exploring domain adaptation (DA) of 3D image-to-image synthesis models. First, we highlight the technical differences in DA between classification, segmentation, and synthesis models. Second, we present a novel efficient adaptation approach based on a 2D variational autoencoder that approximates 3D distributions. Third, we present empirical studies on the effect of the amount of adaptation data and the key hyper-parameters. Our results show that the proposed approach can significantly improve the synthesis accuracy on unseen domains in a 3D setting. The code is publicly available at https://github.com/WinstonHuTiger/2D_VAE_UDA_for_3D_sythesis

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_47

SharedIt: https://rdcu.be/cVRTG

Link to the code repository

https://github.com/WinstonHuTiger/2D_VAE_UDA_for_3D_sythesis

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an unsupervised domain adaptation method based on a 2D VAE approximating 3D distributions. The proposed domain adaptation approach is applied to the image synthesis problem, and the authors demonstrate the effectiveness of the 2D VAE method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of treating a 3D volume as stacked 2D slices and feeding them as a mini-batch rather than as different channels is very interesting. In this way, the model size can be dramatically reduced compared to a 3D-based model.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The evaluation could be performed in a cross-validation manner.
    • Only mean values were reported in the quantitative results, so it is hard to tell whether the improvement is statistically significant. Please add standard deviations or perform statistical tests to show the significance of the proposed method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Lacks an explanation of the hyper-parameter selection and the average run time.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Please use the term ‘data augmentation’ instead of ‘data argumentation’.
    • Please double check the notations and equations in page 4. Some are not consistent or not fully explained.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method sounds interesting and novel, but it is hard to decide whether the improvement is statistically significant. Also, there is no comparison with other competing methods.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors have addressed most of my comments by showing statistical significance with cross-validation.



Review #3

  • Please describe the contribution of the paper

    This paper presents an unsupervised domain adaptation strategy for image translation/synthesis. A VAE is pre-trained on the output domain of the paired training set and used to approximate its distribution. The synthesis network is trained on the paired training domain (S) (in a supervised fashion) and on the unpaired shifted domain (T) (KL between the output domain of S and the output domain of T, under the VAE). In order to work with a small training set, a 2D VAE, rather than a 3D one, is used. The latent code of a 3D volume is obtained by concatenating the latent codes of all its 2D slices.
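    For illustration, below is a minimal PyTorch-style sketch of the slice-batching idea described above. The toy encoder, the 32-dimensional per-slice latent size, and the volume shape are illustrative assumptions, not the authors' actual architecture.

    ```python
    import torch

    # Hypothetical 2D encoder: maps a batch of 2D slices to per-slice latent
    # means and log-variances (architecture and sizes are illustrative).
    encoder2d = torch.nn.Sequential(
        torch.nn.Flatten(),                  # (D, 1, 128, 128) -> (D, 16384)
        torch.nn.Linear(128 * 128, 2 * 32),  # 32 latent dims per slice (assumed)
    )

    def encode_volume(volume):
        """Encode a 3D volume by treating its D slices as one mini-batch;
        the per-slice codes are concatenated into a single code for the volume."""
        slices = volume.unsqueeze(1)               # (D, 1, H, W): slices as the batch
        stats = encoder2d(slices)                  # (D, 64)
        mu, logvar = stats.chunk(2, dim=1)         # (D, 32) each
        return mu.reshape(-1), logvar.reshape(-1)  # (D*32,): volume-level code

    mu, logvar = encode_volume(torch.randn(64, 128, 128))  # 64-slice toy volume
    print(mu.shape)  # torch.Size([2048])
    ```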

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novelty: Does not try to transform the inputs from one domain to the other, but directly trains the network on both domains.
    • Significance: Unsupervised and easy to implement, and can potentially be used on a number of tasks (e.g. semi-supervised training on datasets where only a small proportion of the data is paired)
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Performance: The qualitative results in Fig. 3 are underwhelming. It’s difficult to gauge without an example of what a “good” synthesis (in the absence of domain shift) would look like.
    • Writing: The paper lacks clarity and is difficult to follow at times.
    • Evaluation: Given that the adaptation strategy is quite generic, the paper could have more impact if the strategy was evaluated on multiple tasks. For example, it seems that the exact same architecture could be used on a segmentation task, where the VAE would be trained on one-hot labels.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The code is not provided, but maybe it will be made available upon acceptance?
    • No statistical tests nor measures of variance are provided, making it difficult to evaluate the significance of the quantitative results.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • In Fig. 3 I would suggest adding a column with one of the supervised DA results, which I expect to look much better. Currently, it is difficult to evaluate how much is “lost” by going unsupervised.
    • I would train/test a network without domain shift (e.g., train and test on TCIA) in order to get a feel for the best possible synthesis. This would provide an additional, even higher upper bound.
    • Fig 3. shows that even after DA, the synthesized images can be quite unusable. The authors may want to acknowledge this and comment on it: what’s missing to get networks that are truly domain agnostic?
    • In the 2D VAE, I don’t see why the fact that batch order follows the slice order should have an impact. I imagine that the VAE could be trained on a subset of shuffled slices (even from different subjects). What matters is that the concatenated code for a full volume (in a single batch, or split over minibatches) is computed during fine-tuning.
    • I did not understand the paragraph about “impact of the amount of volumes” at all. What does “the first continuing training batch in the UDA process contributes more to the results” mean?
    • The authors stop short of claiming that they provide a solution to unsupervised domain adaptation, and merely claim to “explore domain adaptation for medical image-to-image synthesis models”. I completely agree that some problems are hard and stay unsolved, but in this case, a better paper would evaluate different unsupervised approaches and discuss the factors that keep the problem unsolved. Is the VAE just not a good distribution encoder? Are there not enough training examples for the VAE? Are there more than intensity differences between the two domains (e.g., different pathologies, different anatomies) that make matching distributions a poor surrogate form of supervision?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The concept makes sense to me.
    • It does beat the baselines, but results are not completely convincing.
  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    First, I would like to commend the authors for the quality of their rebuttal.

    However, the additional experiments provided do not fundamentally change my appreciation of the paper: the method and experiments are very sound, but the results are slightly underwhelming and unlikely to have a strong impact or shift the current practice for UDA. I will not change my decision (weak accept).

    If the paper is accepted, I would advise the authors to further comment on the following points:

    • Could the authors provide downstream dice scores in the non-adaptive case? There does not seem to be a significant difference in Dice between the 2D and 3D VAE cases.
    • I am sorry to only spot this now: what are the latent space sizes for the 2D and 3D VAE? I hope that the 3D VAE’s latent size is 256 times the 2D VAE’s so that the latent code for a full volume has the same length in both cases.



Review #4

  • Please describe the contribution of the paper

    The authors explore a new topic of unsupervised domain adaptation (UDA) for image synthesis. The key difference from the previously well-researched UDA classification and segmentation tasks is the discrepancy between objectives. Here, the authors suggest an approach based on two existing ideas: image synthesis and matching of domain distributions (generated by a VAE). Nevertheless, the ideas are combined in a novel way and used in a new setup.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors suggest a novel problem formulation that potentially enhances the development of DA and domain generalization methods.

    2. The motivation, structure, experimental setup, and analysis (including the study of different factors) are properly detailed, creating a complete picture of the problem and approach to solve it.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The clinical applicability of the study is limited. Among publicly available datasets, I could not find one that contains some of the modalities only partially. And the authors do not explicitly describe the application scenarios. [More details are given in Sec. “Detailed comments”.]

    2. The metrics SSIM and PSNR do not explicitly measure the quality of an algorithm. Furthermore, a frequent metric in image synthesis is the quantitative assessment of the impact of generated images on a downstream task. More specifically, the authors could add the Dice score of the glioma segmentation task using the original modalities as an upper bound, and present the Dice score of the same model using the generated modality instead of the original one (or, alternatively, trained from scratch using the generated modality). This difference in Dice score quantitatively assesses the image synthesis algorithm, with a clear motivation in terms of further applicability.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I would like to suggest that the authors make their code publicly available in an anonymous form (e.g., by creating an anonymous account on GitHub) along with the submission of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    On major weaknesses:

    M.W. 1 (a) From my practice, it is typical to filter out cases with missing modalities when creating a dataset, yet in practice such cases are not rare. The authors need to describe that point properly (maybe the description can be found in the literature). (b) Moreover, I have seen a few works that address the problem of missing modalities and use a synthetic experimental setup on the BraTS dataset, stochastically removing modalities. Such work could enhance the authors’ message and method motivation. Unfortunately, I could not recover the papers due to limited time, but I believe this hint would encourage the authors to find these works and strengthen their message further.

    Other comments:

    1. It seems like contribution (1) is enough, and (2) and (3) are the implementation and evaluation details, respectively. The authors could formulate their contribution as a plain paragraph with one strong message (1) and supporting details (2) and (3).

    2. Fig. 2 floats; no reference from the text is given. Besides, Fig. 2 is self-explanatory, but it would still be better to link it with the text. What is the floating “n” in the caption?

    3. (major comment) Why do the authors not train a 3D VAE in the same fashion on small patches? As far as I understand, the only reason to discard the 3D case is the diminishing difference in distribution with the increase in size (e.g., equalizing anatomical structure). But reducing 3D images to small patches also solves the problem of equalizing distributions. Moreover, the same procedure of learning a structured (along one of the axes) latent representation for 2D images, as in [16], could be applied to learn a structured representation for 3D images, switching from 1-axis structuring to 3-axis structuring. In my opinion, the authors should develop the 3D method in as much depth as the 2D one. It might underperform due to the lack of the fine-tuning that the 2D case has.

    4. Work [3] has been published as [https://dl.acm.org/doi/abs/10.5555/3304415.3304514], thus its citation should be replaced with the appropriate form.

    5. In Sec. 2, par. “2D s-VAE for modeling 3D distribution”, it would be clearer to use the word “call” instead of “nickname”.

    6. How do the authors use the order of slices in their method? It seems like the order does not impact the training procedure…

    7. Why do we need regularization of the distribution to N(0, 1) with the KL-divergence loss (Eq. 1)? It seems that the task does not require this exact form of distribution. (A minimal sketch of the loss in question is given after this list.)

    8. N(0, 1) is a multi-dimensional distribution, so the authors should replace 0 and 1 with the zero vector and the identity matrix, respectively.

    9. Why do the authors use an L2 loss to train the VAE (Eq. 1), while training the CNN for synthesis with an L1 loss? In both cases the task is the same (image generation), so this choice is unmotivated. Also, super-resolution surveys (e.g., [https://arxiv.org/pdf/1902.06068.pdf]) indicate that L1 is perceptually the better choice. The authors might use this to motivate their decision.

    10. In Tab. 1, the authors might indicate the unavailability of labels (e.g., in row 2, col 2) with a specific symbol (e.g., *) to enhance readability.

    11. The authors should number their figures in the Supplementary Materials starting from 5. Providing explicit links to them from the text (e.g., backbone in Fig. 5) would also increase readability. Hyperlinks between two files would be unclickable, but it is still a visual improvement.

    12. The authors could additionally report SSIM and PSNR for the ground truth (using the original modalities) as the upper-bound.

    13. In Sec. 4.2, the authors could describe the procedure of [16] in a few lines, so the paper becomes self-contained. The clarification “sampling infinite number of 2D slices” seems to be misleading.

    14. When describing the results in Fig. 4(b), the authors should specify the domain (target or source) of the volumes that they vary. (The Fig. 4(b) caption indicates the target one.)

    15. The use of the term “batch” in Sec. 4.2 diverges from the term “iteration” in Sec. 3.
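    For concreteness on comments 7-9 above, here is a minimal sketch of the standard VAE objective being discussed (cf. Eq. 1 in the paper). The closed-form KL term is the usual diagonal-Gaussian result, and the `recon` flag marks the L2 vs. L1 choice raised in comment 9; names and weighting are illustrative assumptions, not the paper's implementation.

    ```python
    import torch

    def vae_loss(x, x_hat, mu, logvar, recon="l2"):
        """Standard VAE objective: reconstruction + KL(q || N(0, I)).

        For a diagonal Gaussian q = N(mu, diag(exp(logvar))), the KL against
        the standard normal prior has the closed form
            KL = -0.5 * sum(1 + logvar - mu^2 - exp(logvar)).
        """
        if recon == "l2":
            rec = torch.sum((x_hat - x) ** 2)      # L2, as in Eq. 1 for the VAE
        else:
            rec = torch.sum(torch.abs(x_hat - x))  # L1, as for the synthesis CNN
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl
    ```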

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors address a novel topic, clearly formulating their contribution. Their setup and evaluation are properly designed and, from my perspective, do not contain explicit mistakes. Most of my comments are directed at minor improvements of the paper that could be addressed at the proof-reading stage.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors have addressed most of my concerns satisfactorily. However, the metric (Dice score) on a downstream segmentation task indicates no difference between the 3D VAE and the proposed method. I consider the latter an open minor weakness and hold the same opinion of the paper. I further encourage the authors to add the Dice score for the other methods, e.g., supervised DA.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper has split reviews. Reviewers think that the paper has substantial merits related to a novel combination of image synthesis and domain distribution matching. However, reviewers also pointed out significant weaknesses related to the presentation and clinical relevance. Specifically, reviewers think that the clinical applicability of this work is limited, some metrics might not be the most appropriate, the qualitative results are not convincing, and some parts of the paper have clarity issues. After considering all reviewers’ comments, the area chair would like to invite the authors to submit a rebuttal addressing the reviewers’ concerns.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12




Author Feedback

We appreciate the valuable comments. All reviewers acknowledged the technical contribution (the idea is very novel and interesting) in addressing unsupervised domain adaptation for 3D image synthesis. All discussion points will be fully addressed in our revision.

(1) Quantitative results. Five-fold cross-validation (R1): we further evaluate our model in a 5-fold CV manner and include the standard deviation when reporting the results. For TCIA -> CBICA, SSIM and PSNR are 0.848 ± 0.0257 and 20.031 ± 3.293, respectively. For CBICA -> TCIA, SSIM and PSNR are 0.844 ± 0.0268 and 21.404 ± 2.272, respectively. The results are consistent with those of the initial setting. We will include these results in the revision. Statistical tests (R1, R3): we have used Wilcoxon signed-rank tests to compare our method with other methods on the validation set. The p-values for a) ours vs. lower bound and b) ours vs. 3D-VAE are all less than 0.0001, indicating that our method significantly outperforms the two methods.
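For reference, such a paired non-parametric comparison can be run with SciPy as in the sketch below; the per-case scores are placeholders for illustration, not the numbers reported above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-case SSIM scores for two methods on the same validation
# cases (paired samples); the real values come from the cross-validation runs.
ssim_ours   = np.array([0.85, 0.84, 0.86, 0.83, 0.85, 0.87])
ssim_3d_vae = np.array([0.80, 0.79, 0.82, 0.78, 0.81, 0.80])

stat, p = wilcoxon(ssim_ours, ssim_3d_vae)  # Wilcoxon signed-rank test
print(f"statistic={stat}, p={p:.4g}")
```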

New upper bound (R3): we have trained the models in a supervised manner on the target domains (without domain shift) with five-fold cross-validation. On CBICA, SSIM and PSNR are 0.896 ± 0.0207 and 24.656 ± 2.907. On TCIA, SSIM and PSNR are 0.911 ± 0.0263 and 25.519 ± 3.630. These results are higher than the existing upper bound (using the initial settings) and will be included in the revision.

(2) Qualitative results and code (R3, R4). We provide more qualitative results to highlight our method and have uploaded our code to an anonymous repository: https://github.com/Auser173/2D_VAE_UDA_for_3D_sythesis .

(3) Metrics and downstream tasks (R3, R4). We further use the Dice score as an additional metric in a downstream segmentation task. We use an nnU-Net model pre-trained on the BraTS’20 dataset to segment the images generated by the different methods. The Dice score for each case is averaged over three brain structures. For CBICA -> TCIA, the mean Dice scores of the three methods, i.e., 3D-VAE, ours, and using real images, are 0.772, 0.773, and 0.904. This indicates that the performance of our method is still promising on downstream tasks.
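As a reference for how such a per-case score is computed, below is a minimal sketch of the Dice metric averaged over tumor sub-structures; the BraTS label ids (1, 2, 4) and the toy arrays are assumptions for illustration.

```python
import numpy as np

def dice(pred, gt, label):
    """Dice = 2|A ∩ B| / (|A| + |B|) for one structure (label id)."""
    a, b = (pred == label), (gt == label)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Average over three tumor sub-structures (BraTS label ids 1, 2, 4 assumed).
pred = np.random.randint(0, 5, size=(8, 8, 8))  # toy predicted segmentation
gt = np.random.randint(0, 5, size=(8, 8, 8))    # toy ground-truth segmentation
mean_dice = np.mean([dice(pred, gt, lbl) for lbl in (1, 2, 4)])
print(round(mean_dice, 3))
```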

(4) Clinical applicability and discussion (R4, Meta-Reviewer). Thanks; we further clarify this. Missing modalities are a common issue in multi-modal neuroimaging, e.g., due to motion during the acquisition process [1]. Given space constraints, we will discuss more related work in the revision. [1] Generative Adversarial Networks to Synthesize Missing T1 and FLAIR MRI Sequences for Use in a Multisequence Brain Tumor Segmentation Model, Radiology 2021.

(5) Slice order of the 2D VAE (R3, R4). We appreciate the reviewers’ comments. We re-trained a 2D VAE with shuffled slice order; SSIM and PSNR are 0.8472 ± 0.0194 and 20.284 ± 2.971 on TCIA -> CBICA. Comparing the new results with the ones in the manuscript, we find that the slice order does not affect the ability of the 2D VAE to encode the desired distribution. We will include the results in the revision.

(6) Limitation and discussion (R3). We would like to discuss some of the factors contributing to the difficulty of domain adaptation in 3D image synthesis. In our approach, we translate the whole volume from one domain to another instead of using a patch-based method. Although whole-volume approaches can capture the full spatial information, they suffer from limited training data. As shown in Fig. 3, even after domain adaptation, we observed that the domain gap is challenging to overcome. Recent disentangled learning, which can separate domain-specific and shared features, might improve the current results. Contrastive learning could also be explored to capture the representations of the source and target domains more effectively. Due to space limitations, we will include more discussion of potential techniques to address the problem in the revision.

(7) Minor points on clarification, citations, and others (R3, R4). We will address them in the revision.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers seem to be satisfied with the rebuttal. The AC recommends acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes an unsupervised domain adaptation technique in the context of 3D image synthesis, relying on training a sliding-window 2D VAE. The rebuttal clarifies the significance of the results and provides a cross-validation (not initially present in the paper). Three reviewers have voted for acceptance and have retained their support after the rebuttal. In my opinion the novelty builds directly on recent work, and I still find the lack of comparison to other existing unsupervised domain adaptation methods a major weakness (many papers have been published on domain adaptation recently). Given the support of the reviewers, but considering the above weakness, I suggest acceptance while putting the paper at the bottom of my acceptance list.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After the rebuttal, the reviewers are mostly satisfied with the answers, and the majority votes for weak accept (2x) and strong accept (1x). Additional results show statistical significance over the baseline. The additional points raised by R3 after the rebuttal should be considered during the paper revision. My vote is (weak) accept, but I found that the supplemental material exceeded the allowed length of 2 pages, which needs to be considered in the overall decision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


