Authors

Reuben Dorent, Nazim Haouchine, Fryderyk Kogl, Samuel Joutard, Parikshit Juvekar, Erickson Torio, Alexandra J. Golby, Sébastien Ourselin, Sarah Frisken, Tom Vercauteren, Tina Kapur, William M. Wells III

Abstract

We introduce MHVAE, a deep hierarchical variational auto-encoder (VAE) that synthesizes missing images from various modalities. Extending multi-modal VAEs with a hierarchical latent structure, we introduce a probabilistic formulation for fusing multi-modal images in a common latent representation while having the flexibility to handle incomplete image sets as input. Moreover, adversarial learning is employed to generate sharper images. Extensive experiments are performed on the challenging problem of joint intra-operative ultrasound (iUS) and Magnetic Resonance (MR) synthesis. Our model outperformed multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT) for synthesizing missing images, demonstrating the advantage of using a hierarchical latent representation and a principled probabilistic fusion operation. Our code is publicly available.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_43

SharedIt: https://rdcu.be/dnwwW

Link to the code repository

https://github.com/ReubenDo/MHVAE

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This article introduces the first multi-modal VAE approach with a hierarchical latent representation for unified medical image synthesis. The study extends MVAEs with a hierarchical structure to improve the expressiveness of the model. It uses a probabilistic fusion operation to support missing modalities and achieve image synthesis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is clear and has a rigorous logical structure. The MHVAE model adopts a hierarchical latent variable approach to express image features, introduces a new probabilistic fusion method to handle incomplete image sets, and uses adversarial learning to generate more realistic synthetic images. These innovations enable the MHVAE model to achieve better performance in medical image synthesis, providing strong support for medical image analysis and diagnosis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For the fusion of multi-modal images, this method uses a probability-based approach, which may cause image blurring in some cases. Further exploration may therefore be needed to improve the sharpness of the images. In the experimental section, the method was tested in only one application scenario (a combination of iUS and MR), and more extensive testing is needed to verify its applicability and robustness.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper mentions that the dataset used in this experiment will be published on TCIA in 2023, and the code is publicly available. A detailed description of the algorithm and model structure is provided. However, the supplementary material shows only one modality encoder.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, I think your work is very interesting and well written. For a more detailed comparison with previous methods, consider comparing the performance of each method on different metrics, such as image quality or similarity to real images. To discuss the limitations of your method thoroughly, please consider conducting additional experiments to evaluate its performance under different conditions, such as low-quality input images.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the list of main weaknesses of the paper above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a deep hierarchical VAE called MHVAE, an extension of multi-modal VAEs for synthesizing missing images from various modalities. They propose a hierarchical latent structure and create a common latent representation for fusing multi-modal images, with the ability to handle incomplete image sets. The experiments show that MHVAE outperformed multi-modal VAEs, conditional GANs, and the current SOTA methods for synthesizing missing images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The methodology section is very well explained, with a theoretical foundation. 2) The hierarchical representation of latent variables for improving the method’s expressivity is interesting and shows improved performance. 3) The proposed method is capable of supporting missing modalities and image synthesis. 4) The proposed method is carefully compared with various baselines and SOTA methods and shows improved performance. 5) The paper is well written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) It is unclear how well the proposed method generalizes to new datasets, since the model is trained and tested on one small dataset. 2) The proposed approach involves a hierarchical structure and a probabilistic fusion operation, which may increase its complexity and decrease its interpretability compared to simpler methods. It is unclear whether the proposed method increases the complexity and training cost. 3) Figure 2 needs revision, and the label “ours” could be changed to MHVAE.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will provide the code and dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The methodology section could benefit from an overview accompanied by an outline figure.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Authors propose an interesting and novel approach for synthesizing missing images across various modalities which is carefully evaluated.

  • Reviewer confidence

    Not confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #5

  • Please describe the contribution of the paper

    This paper proposed a multi-modal hierarchical VAE, which is a combination of previously proposed hierarchical VAE and multi-modal VAE, for MR-ultrasound image translation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors adapted the multi-modal VAE framework but applied it to an image-to-image (i.e., iUS-MR) translation problem with only two modalities (M=2). In this case, the multi-modal VAE essentially becomes a conventional VAE model (or a VAE-GAN if an adversarial loss is used). The proposed method is demonstrated as an application of a conventional (hierarchical) VAE or VAE-GAN to image-to-image translation. As for the multi-modal hierarchical VAE formulation, the proposed MHVAE is a combination of previous works on MVAE and HVAE. The product-of-experts (PoE) used to handle missing modalities has already been proposed in ref. 29.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will make their code available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors try to solve an image-to-image (i.e., iUS-MR) translation problem (at least this is what has been demonstrated in the paper). In my opinion, the multi-modal VAE does not fit this problem well, and the benefit of the proposed method is unconvincing. The authors could consider other SOTA image-to-image translation methods, such as diffusion models (Diffusion Models for Medical Image Analysis: A Comprehensive Survey. 2023).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty part is the major concern

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Thanks for the clarification. In the case of two modalities, the proposed multi-modal VAE becomes a hierarchical VAE (with adversarial loss). I still maintain my opinion that this paper is a combination of MVAE, HVAE and adversarial training. But I agree that bridging the gap and providing theoretical foundations are interesting contributions to the image synthesis field. Ultrasound/MR synthesis is indeed a difficult task, and the dataset will be another exciting contribution in this area. Therefore, I raise my score to weak accept.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a deep hierarchical VAE for synthesizing missing images from various modalities. The authors present a hierarchical latent structure and create a common latent representation for fusing multi-modal images with the ability to handle incomplete image sets. However, the paper still has the following issues: 1) The novelty needs to be strengthened. The proposed method is a combination of previous works on MVAE and HVAE, while the product-of-experts used to handle missing modalities has been presented in previous works. 2) It is unclear whether the proposed method increases the complexity and training cost. 3) It is not clear how well the proposed method generalizes to new datasets, since the model is trained and tested on one small dataset. 4) More discussion is needed of the performance under different conditions, such as low-quality input images.




Author Feedback

We thank the reviewers for their insight. Reviewers found our work ‘rigorous’, ‘very well explained’ and ‘very interesting’ (R2), ‘novel’ and ‘carefully evaluated’ (R3). We are confident that minor edits can address most of the expressed concerns.

NOVELTY: ‘The novelty part is the major concern’; ‘The proposed MHVAE is a combination of previous works from MVAE and HVAE.’ (R5). The novelty of this work is three-fold. Firstly, we introduced the first principled approach for unified image synthesis that generates high-quality synthetic images. While previous approaches, such as multi-modal variational autoencoders (MVAE), provided a principled framework for handling incomplete sets of input images, they generated blurry images, as shown in Figure 2. In contrast, SOTA unified image synthesis frameworks leverage recent advancements in computer vision (e.g. Transformers for ResViT), allowing them to synthesize high-quality images, but they lack theoretical foundations. By integrating a hierarchical latent representation into the multi-modal variational setting and using adversarial learning, our novel approach bridges the gap, enabling the synthesis of high-quality images while establishing a mathematically grounded formulation for unified image synthesis.

Secondly, we respectfully rebut the claim from R5 that ‘the multi-modal VAE essentially becomes a conventional VAE model’. Firstly, our approach exploits a complex latent space structure spanned over several resolutions allowing for high-resolution image reconstruction, which is fundamentally different from VAEs that rely on a low-resolution latent representation. This hierarchical probabilistic formalism is largely unconventional in medical image analysis. Secondly, unlike standard VAEs, our multi-modal VAE exhibits flexibility in handling incomplete sets of images. Consequently, a single multi-modal network can accommodate each potential combination of inputs, eliminating the need for a VAE per combination, as demonstrated in our experiments (i.e., 1 vs. 3 networks). Hence, our method combining the hierarchical and multi-modal formalisms is fundamentally different from a conventional VAE model.

Thirdly, we conducted extensive experiments on the difficult task of ultrasound/MR synthesis, an area that has remained relatively unexplored due to the absence of a large dedicated dataset for validation. Note that we will release our dataset alongside our paper on TCIA, facilitating further research in this area.

To enhance clarity, these three points will be elucidated in the introduction and abstract, ensuring that the novelty of the proposed approach is readily apparent.
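The point above about a single network accommodating every combination of inputs can be illustrated with a minimal product-of-experts (PoE) sketch. This is not taken from the authors' code: it assumes one-dimensional Gaussian experts and a standard normal prior, and the function name `poe_fuse` and the per-modality means and variances are hypothetical values chosen for illustration only.

```python
from itertools import combinations


def poe_fuse(experts, prior=(0.0, 1.0)):
    """Fuse 1-D Gaussian experts, given as (mean, variance), with a prior.

    A product of Gaussian densities is proportional to a Gaussian whose
    precision is the sum of the precisions and whose mean is the
    precision-weighted average of the means.
    """
    terms = [prior] + list(experts)
    precision = sum(1.0 / var for _, var in terms)
    mean = sum(mu / var for mu, var in terms) / precision
    return mean, 1.0 / precision


# Hypothetical per-modality posteriors (mean, variance); not real data.
experts = {"ius": (2.0, 1.0), "mr": (4.0, 0.5)}

# With M = 2 modalities there are 2**M - 1 = 3 non-empty input subsets.
# The same fusion rule covers all of them; a missing modality is simply
# left out of the product.
fused = {}
for size in range(1, len(experts) + 1):
    for subset in combinations(sorted(experts), size):
        fused[subset] = poe_fuse([experts[name] for name in subset])

for subset, (mean, var) in fused.items():
    print(subset, round(mean, 4), round(var, 4))
```

In this toy setting, adding an expert can only increase the fused precision, so the fused variance with both modalities is smaller than with either one alone.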

COMPLEXITY/TRAINING COST: As requested by R3, a comprehensive discussion on the model complexity will be added: “Our approach demonstrates significantly lighter computational demands when compared to the current SOTA unified image synthesis framework (ResViT), both in terms of time complexity (8G MACs vs. 487G MACs) and model size (10M vs. 293M parameters). Compared to MVAEs, our hierarchical multi-modal approach only incurs a marginal increase in time complexity (19%) and model size (4%)”.
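As a quick check of what these figures imply, the short sketch below computes the ratios between ResViT and the proposed model from the numbers quoted above. The variable names are ours, and the MVAE baseline values are only back-calculated from the stated 19% and 4% increases, so they are rough estimates subject to rounding rather than reported measurements.

```python
# Figures quoted in the rebuttal (G MACs and millions of parameters).
resvit_gmacs, mhvae_gmacs = 487, 8
resvit_mparams, mhvae_mparams = 293, 10

# How many times heavier ResViT is than the proposed model.
mac_ratio = resvit_gmacs / mhvae_gmacs        # about 61x
param_ratio = resvit_mparams / mhvae_mparams  # about 29x

# Assumption: the MVAE baseline is inferred from the stated relative
# increases (19% in MACs, 4% in parameters); these are estimates only.
mvae_gmacs_est = mhvae_gmacs / 1.19
mvae_mparams_est = mhvae_mparams / 1.04

print(f"MACs ratio (ResViT / MHVAE): {mac_ratio:.1f}")
print(f"Parameter ratio (ResViT / MHVAE): {param_ratio:.1f}")
print(f"Implied MVAE cost: ~{mvae_gmacs_est:.1f}G MACs, ~{mvae_mparams_est:.1f}M parameters")
```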

GENERALIZATION: Since there is currently no publicly available paired T2/iUS dataset, quantitative assessment of our approach on an additional dataset remains infeasible. However, we have found that our approach exhibits strong generalization capabilities to MR scans from BraTS 2020. We will include qualitative results of synthetic iUS generated from BraTS 2020 images in the supplementary materials, further substantiating our claims.

ROBUSTNESS TO LOW-QUALITY: Future work will investigate the use of low-quality images for image synthesis. In particular, we plan to apply our approach to low-to-high resolution problems. A discussion on this point will be added.




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the feedback of the authors and the combined comments of the reviewers, we have decided to accept this paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper is VERY similar to Hierarchical Multimodal Variational Autoencoders, ICLR 2022 (Published: 29 Jan 2022), and the authors did not cite it.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a principled probabilistic approach for multi-view image synthesis. The reviewers had concerns about the novelty, complexity/training cost, generalization to new datasets, and performance under different conditions. The authors have addressed these concerns in their rebuttal. The paper can be accepted after minor edits that address the reviewers’ concerns.


