
Authors

Hadrien Reynaud, Mengyun Qiao, Mischa Dombrowski, Thomas Day, Reza Razavi, Alberto Gomez, Paul Leeson, Bernhard Kainz

Abstract

Image synthesis is expected to provide value for the translation of machine learning methods into clinical practice. Fundamental problems like model robustness, domain transfer, causal modelling, and operator training become approachable through synthetic data. In particular, heavily operator-dependent modalities like ultrasound imaging require robust frameworks for image and video generation. So far, video generation has only been possible by providing input data that is as rich as the output data, e.g., image sequence plus conditioning in, video out. However, clinical documentation is usually scarce and only single images are reported and stored, thus retrospective patient-specific analysis or the generation of rich training data becomes impossible with current approaches. In this paper, we extend elucidated diffusion models for video modelling to generate plausible video sequences from single images and arbitrary conditioning with clinical parameters. We explore this idea within the context of echocardiograms by looking into the variation of the Left Ventricle Ejection Fraction, the most essential clinical metric gained from these examinations. We use the publicly available EchoNet-Dynamic dataset for all our experiments. Our image-to-sequence approach achieves an R² score of 93%, which is 38 points higher than recently proposed sequence-to-sequence generation methods. Code and weights are available at https://github.com/HReynaud/EchoDiffusion.
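
The interface the abstract describes (one anchor frame plus a scalar clinical parameter, the LVEF, as input; a video as output) can be sketched as below. This is a hypothetical, minimal illustration and not the EchoDiffusion architecture: the toy 3D-convolutional backbone, the FiLM-style conditioning, and all layer sizes are placeholders chosen only to keep the example self-contained and runnable.

```python
import torch
import torch.nn as nn


class ToyConditionalVideoDenoiser(nn.Module):
    """Denoise a video conditioned on a single anchor frame and a scalar LVEF."""

    def __init__(self, channels: int = 1, hidden: int = 32):
        super().__init__()
        # Embed the scalar ejection fraction (e.g. a value in [0, 1]).
        self.ef_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        # Toy 3D-conv backbone; input = noisy video + anchor frame tiled over time.
        self.conv_in = nn.Conv3d(2 * channels, hidden, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.conv_out = nn.Conv3d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, anchor_frame, ef):
        # noisy_video: (B, C, T, H, W), anchor_frame: (B, C, H, W), ef: (B, 1)
        t = noisy_video.shape[2]
        anchor = anchor_frame.unsqueeze(2).expand(-1, -1, t, -1, -1)
        h = self.act(self.conv_in(torch.cat([noisy_video, anchor], dim=1)))
        # FiLM-style modulation of the features by the LVEF embedding.
        scale = self.ef_embed(ef).view(-1, h.shape[1], 1, 1, 1)
        return self.conv_out(h * (1 + scale))


# Usage: "denoise" a random clip conditioned on its first frame and an LVEF of 0.55.
model = ToyConditionalVideoDenoiser()
video = torch.randn(1, 1, 16, 32, 32)               # (batch, channel, frames, H, W)
out = model(video, video[:, :, 0], torch.tensor([[0.55]]))
print(out.shape)                                     # torch.Size([1, 1, 16, 32, 32])
```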

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_14

SharedIt: https://rdcu.be/dnwws

Link to the code repository

https://github.com/HReynaud/EchoDiffusion

Link to the dataset(s)

https://echonet.github.io/dynamic/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper extends the Elucidated Diffusion Model (EDM) and Cascaded Diffusion Model (CDM) for generating ultrasound video clips. The extended EDM accepts the LVEF as its condition instead of text, and the trained models can generate realistic echocardiograms that improve LVEF regression performance and downstream task balancing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-organized, easy to follow, and contains comprehensive experimental results and visualizations.
    2. The proposed method is novel and outperforms previous state-of-the-art methods for US video generation to a large extent.
    3. This work illustrates the potential for diffusion models to model the distribution of medical data, especially US videos, which may be important for downstream medical tasks that lack sufficient training data.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I don’t think this work has a major weakness, but there are still some minor issues:

    1. For the algorithm pipeline in Fig. 1, it would be better to illustrate both the EDM, which generates sub-sampled US videos, and the CDM, which up-samples the generated videos. The model architecture could be simplified instead.
    2. In Section 2, the introduction of the EDM contains too many details that could have been omitted: 1) the formula of the probability flow ODE in the EDM framework; 2) the preconditioning parameters c_skip, c_in, c_out, and c_noise; 3) the formulas of the detailed prediction and correction steps (see the sketch after this list for a summary of these quantities).
    3. In Section 2, the introduction to the CDM takes only a few lines, which may confuse readers who are not familiar with diffusion models about its settings.
    4. Tab. 2 appears earlier in the text than Fig. 2, but is placed after Fig. 2. It is also recommended that the tables and figures be placed next to the relevant text rather than on a separate page, to make them more readable.
    5. In Tab. 2, the authors only compare the proposed method to D'artagnan. More methods could also be included (e.g., "Handheld Ultrasound Video High-Quality Reconstruction Using a Low-Rank Representation Multipathway Generative Adversarial Network", Zixia Zhou et al.).
    6. The proposed method seems to take much more time than the previous SOTA (over 100x), but the authors did not consider more advanced sampling methods to accelerate inference and make the method more practical.
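
    For context on the quantities mentioned in point 2, the following sketch restates the preconditioning terms and a first-order step of the probability-flow ODE as defined by Karras et al. (2022), the EDM framework the paper builds on. It is an illustrative reimplementation of the published formulation, not the paper's code: sigma_data is a dataset-dependent constant (0.5 is used here only as a placeholder), and Heun's second-order correction is omitted for brevity.

```python
import torch


def edm_precondition(raw_net, x, sigma, sigma_data=0.5, **cond):
    """Wrap a raw network F_theta into the EDM denoiser D_theta (Karras et al., 2022)."""
    sigma = sigma.view(-1, *([1] * (x.ndim - 1)))                  # broadcast over x
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.flatten().log() / 4
    # D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise)
    return c_skip * x + c_out * raw_net(c_in * x, c_noise, **cond)


@torch.no_grad()
def euler_sample(raw_net, shape, sigmas, **cond):
    """First-order probability-flow ODE sampler; dx/dsigma = (x - D(x; sigma)) / sigma."""
    x = torch.randn(shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        sigma_b = torch.full((shape[0],), float(sigma))
        d = (x - edm_precondition(raw_net, x, sigma_b, **cond)) / sigma
        x = x + (sigma_next - sigma) * d
    return x


# Usage with a dummy network that predicts zeros, on 8-frame 32x32 "videos".
dummy = lambda z, t: torch.zeros_like(z)
sigmas = torch.tensor([80.0, 10.0, 1.0, 0.1, 0.0])
sample = euler_sample(dummy, (1, 1, 8, 32, 32), sigmas)
print(sample.shape)  # torch.Size([1, 1, 8, 32, 32])
```

    In this formulation, the prediction and correction steps the reviewer refers to correspond to the Euler prediction above plus an optional second-order (Heun) correction evaluated at sigma_next.
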
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors said that all the code, experiments, and weight files would be released by the time of the conference. In addition, the data used in this work is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Revise the Methods section to make its structure more informative and easier to read. For example, present the training and sampling procedures as pseudocode algorithms instead of describing them in the text. In addition, an explanation of how the conditions are encoded would be welcome.
    2. Discuss the generation quality w.r.t. different sampling steps.
    3. Use tables to show the experimental results for expert evaluation of generated videos and downstream performance.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend publishing this paper at MICCAI since it proposes a novel and effective method for US video generation based on diffusion models. To the best of my knowledge, this is the first work that verifies the effectiveness of diffusion for generating medical videos. The overall quality of the paper is satisfactory.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    1) Attempting TSSR and conditioned video generation in medical imaging through a Cascaded Diffusion Model. 2) Showing a significant performance improvement in LVEF estimation compared to the previous SOTA paper: the R² score reaches 93% (an improvement of 38 points), although the SSIM value decreases. 3) The code will be released.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A. The approach of generating US images in medical imaging through a cascaded EDM is very challenging. The attempt to create medical images from conditions is a risky yet interesting endeavor. B. The conditioning signals were designed and adapted specifically for the task and were well executed. C. The results were compared using various indicators and analyzed accordingly, ultimately showing an improvement over the SOTA in the regression domain. D. The theoretical explanation is robust, with references to and practical applications of various networks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A. The omitted parts made it difficult to understand the context. The explanations of the counterfactual images, the model used in the downstream task, and the method used to obtain the LVEF values in Table 1 were insufficient. B. The low similarity of the generated images is certainly a significant issue, and the lack of full video generation results for each model makes accurate interpretation and assessment difficult.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The author promises to make the code publicly available, which enhances reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Creating a video corresponding to a given image using the LVEF parameter is a very interesting and innovative challenge. The latest deep learning techniques were applied effectively, and the results are very significant. However, generating data that does not exist poses a significant risk in the medical field, and there are concerns. Although similarity to real data was claimed through the qualitative evaluation, the low quantitative metrics could be problematic. If these issues can be addressed, the method could be used in various applications.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The research has some noteworthy strengths, such as achieving high accuracy in video generation using medical images, which is a relatively unexplored area. To support their findings, the authors conducted comprehensive comparisons and analyses using various metrics and methods. To provide a more thorough and detailed analysis, the authors also utilized qualitative evaluation methods in addition to quantitative metrics.
    I also consider the challenges associated with the research topic to be a strength. There is still considerable skepticism regarding the generation of synthetic videos in the medical field. However, due to the challenges associated with acquiring data in the medical domain, such research must continue to advance and evolve. Therefore, I believe that pursuing this line of research is desirable.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper applies elucidated diffusion models to video modelling in order to generate ultrasound video sequences from single images and arbitrary conditioning with clinical parameters. The proposed method is validated on a public dataset and is demonstrated to outperform sequence-to-sequence generation methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a novel method for ultrasound video generation from single images. The approach is valuable given that the number of public ultrasound datasets is quite limited, and it should be of interest to other researchers. Nice online demo.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The structure and writing of the paper are not easy to follow, e.g., most of the equations are embedded in the text. The paper does not define several abbreviations before using them, such as 4SCM and CDMs, which may confuse readers.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will release their code and experiments by the time of the conference.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Refine the structure and writing of the paper.
    • Define all abbreviations before using them.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the topic and approach are valuable to the research field, the presentation of the paper needs to be refined further.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a method to generate plausible, continuous echocardiogram videos from a discrete set of US images.

    Strengths:

    • Interesting solution to a challenging problem
    • Reviewers acknowledge that the results are promising and that the method is novel

    Weaknesses:

    • Reviewers highlight that some aspects of the general text presentation and structure could be improved

    Overall, all reviewers agree on the merit of this paper's contribution, including the method and experimental results. Very minor issues are raised about the text presentation, which, in my opinion, do not warrant a rebuttal.




Author Feedback

We are grateful for the positive and insightful feedback received from the reviewers and the AC. We would like to answer the following points raised in the reviews.

Organization and clarity: We acknowledge the reviewers’ point about the density of the paper and its dependency on the reader’s understanding of diffusion models. To address these concerns, we simplified the Methods section, as suggested by R1, by condensing certain details and employing pseudocode where applicable. We are confident that these changes improve readability and alleviate the concerns raised by reviewers R2 and R3 about the paper’s clarity.

We also acknowledge R3’s query regarding the lack of explicit details for the counterfactual setup and the downstream task. We revised our explanations for these points in the “Results” and “Downstream Task” sections, respectively.

R1 also proposed presenting the “Qualitative study” and “Downstream task” results in tabular form. We argue that the “Qualitative study” is better presented by a clear and unambiguous description of its setup rather than by a space-consuming table. The “Downstream task” does have an extended table (Table 4) provided in the supplementary material.

Illustration of Results: R3 highlighted the lack of visual examples. Our online anonymous demo (https://huggingface.co/spaces/anon-SGXT/echocardiogram-video-diffusion) is available for the duration of the review process and a de-anonymized version of the demo along with a dedicated Github page, presenting hundreds of examples, will be shared along with the publication.

Quality of Samples and Processing Time: R1 recommended discussing the quality of samples in relation to the number of sampling steps. We have conducted such experiments and even explored different samplers before selecting the EDM. However, due to space constraints and the substantial amount of context needed to present such results, we were unable to include these findings. We aim to delve into this in the near future.

Concerning R1’s suggestion to adjust Figure 1 to better illustrate the EDM steps and the Cascaded Models, we agree that a revision is warranted. We will aim to strike a balance between detail and clarity in the revised figure.

LVEF Estimation and SSIM Performance: In response to R3’s concern about our presentation of LVEF estimation, we maintain that our concise LVEF section provides sufficient information, especially considering the ease of reproduction via our open-source code and the public EchoNet-Dynamic dataset.

R1 and R3’s observations about our SSIM performance are noteworthy. We suspect that the limited results stem from the metric’s limited suitability for the task [1] rather than from the quality of the generated videos (particularly for the 1SCM model). SSIM, being a pair-wise metric, differs fundamentally from FID and FVD, which analyse overall data distributions. We initially used SSIM to compare results with D’artagnan [2], but, as outlined in the paper, D’artagnan primarily focuses on reconstructing input videos, while our model generates plausible videos for the given anatomy+LVEF pair.
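
To make the distinction in the paragraph above concrete, a minimal sketch of the two kinds of metrics is given below. SSIM is computed frame by frame against one specific reference clip, whereas FID (and, analogously, FVD) compares Gaussian fits to two sets of features. The feature extraction step (Inception for FID, I3D for FVD) is omitted; real_feats and fake_feats are placeholder arrays, and this is not the evaluation code used in the paper.

```python
import numpy as np
from scipy.linalg import sqrtm
from skimage.metrics import structural_similarity


def pairwise_ssim(real_video, fake_video):
    """Mean frame-wise SSIM between two temporally aligned (T, H, W) grayscale videos."""
    return float(np.mean([
        structural_similarity(r, f, data_range=1.0)
        for r, f in zip(real_video, fake_video)
    ]))


def fid_from_features(real_feats, fake_feats):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```

A generated clip can therefore score a low SSIM against its specific reference video while its frames are still drawn from the right distribution, which is the behaviour that FID and FVD reward.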

We deeply appreciate the comments offered by the reviewers and the AC, and we look forward to incorporating these suggestions in the final version of our paper.

[1] Metrics Reloaded - A new recommendation framework for biomedical image analysis validation, Reinke et al.
[2] D’artagnan: Counterfactual video generation, Reynaud et al.


