
Authors

Masoud Mokhtari, Teresa Tsang, Purang Abolmaesumi, Renjie Liao

Abstract

Ejection fraction (EF) is a key indicator of cardiac function, allowing identification of patients prone to heart dysfunctions such as heart failure. EF is estimated from cardiac ultrasound videos known as echocardiograms (echo) by manually tracing the left ventricle and estimating its volume on certain frames. These estimations exhibit high inter-observer variability due to the manual process and varying video quality. Such sources of inaccuracy and the need for rapid assessment necessitate reliable and explainable machine learning techniques. In this work, we introduce EchoGNN, a model based on graph neural networks (GNNs) to estimate EF from echo videos. Our model first infers a latent echo-graph from the frames of one or multiple echo cine series. It then estimates weights over nodes and edges of this graph, indicating the importance of individual frames that aid EF estimation. A GNN regressor uses this weighted graph to predict EF. We show, qualitatively and quantitatively, that the learned graph weights provide explainability through identification of critical frames for EF estimation, which can be used to determine when human intervention is required. On the public EchoNet-Dynamic EF dataset, EchoGNN achieves EF prediction performance on par with the state of the art while providing explainability, which is crucial given the high inter-observer variability inherent in this task.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_35

SharedIt: https://rdcu.be/cVRv0

Link to the code repository

https://github.com/MasoudMo/echognn

Link to the dataset(s)

https://echonet.github.io/dynamic/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors describe a computational framework to automate EF calculation from ultrasound cines. This is achieved in three main stages - a video encoder, an attention encoder and a regressor - with neural networks embedded at different stages. For training and testing, a large sample of echo cines (10,000+) containing ‘ground truth’ data is used. Comparison and validation against other methods is also provided, showing relatively good agreement with ground truth for some EF sub-cohorts, along with superior computational complexity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • I found the paper to be strong and clinically relevant. Methodology, experiments and size of datasets are sound and sufficient.
    • Particularly enjoyable is the apparent simplicity of the method, at least from an architectural point of view. In a way, the more difficult aspects of the methodology are hidden in plain sight, facilitating clear objective functions and measures. I think the combination of tools is very clever, and yields simple and effective results.
    • The paper is well written and organised. The statistical methods used are simple as a first cut, and also well presented. The supplementary material is helpful.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some of the limitations of the method could be explicitly summarised at the end. There is half a page available.
    • Some may argue that the improvement overall is incremental, but I agree with the authors that the lower complexity of the problem described is important for clinical translation.
    • Clinical aspects could be further analysed. The authors have a wealth of data available, many more experiments are possible. Please note that echo is not the only modality where EF is assessed. This is also routinely done with CMRI, and other imaging techniques. Might add a comment.
    • What are the clinical implications of having an unbalanced training set? Could you tease this out in the Discussion? How could you ‘correct’ this in future?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Please address Ethics / IRB approval - you can use a placeholder for now. Otherwise I have no comments, seems sufficient.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Minor style issues. For example, I would not start a sentence with a reference such as “[18] did this and that”.
    • Typo: “EF error and increase THE model’s ability”
    • How could this method be used in longitudinal studies? e.g. progressive heart disease.
    • Table 1. Would it be possible to include running time?
    • Justify T_fixed = 64.
    • Rephrase “necessitate the need” (abstract)
    • What about other sources of bias / noise in the data, e.g. image protocol, intra-observer bias, etc.
    • The selection of the thresholds (unstated weights value and block width of 55) for ED/ES frame approximation is not explained (supp. Fig 3), but this selection will directly impact the aFDs. Can you please comment.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Many strengths and minor weaknesses, as detailed above. The paper reads easily, is elegant and has the benefit of using well executed and combined past contributions from the field. A MICCAI exemplar.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this work, the authors integrate an attention encoder, which learns the adjacency matrix describing the relationships between video frames, with a subsequent GNN that leverages the learned attention matrix to predict ejection fraction (EF) from AP4/AP2 ultrasound images. The particular formulation of the attention encoder and the GNN is perhaps unique and interesting, as it offers some explainability, which the authors claim is lacking in many works in this domain. They report very good results and explainability as well, which is good to see.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work is novel in its formulation of the GNN, and in using it to generate weights for each frame of the echo video. The adjacency matrix is used for the graph convolution. These frame weights seem to align well with the cardiac cycle when the videos are of good quality, and poorly when the video quality is not. So, these weights can be used to guide a clinical user on when to trust the result and when not to.

    • Another advantage is the flexibility this model provides where a single video can be used or multiple if necessary.

    • They test their algorithm on a nice standard set (EchoNet), which makes it easy for comparison.

    • The model is also pretty light-weight, making it practically usable in POCUS settings and other resource constrained settings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • While the model does offer some explainability, it’s not clear how impactful it can be for clinicians. It is unlikely that clinicians will take the time to look at adjacency matrices and frame weight curves to make an assessment. It may be a good idea to use this more complex information to synthesize a single number, or metric, to be used as a proxy for quality and/or confidence. In terms of clinical workflow, it’s much more convenient to report a number or two and keep track of those numbers, as opposed to having a qualitative description of quality, etc.

    • While the possibility of using multiple echo videos is interesting, it may not be a good idea clinically. Videos are sometimes acquired at slightly different angles, with different imaging parameters, etc. Your model may have to learn not only to be robust to these variations, but also to the interplay between different variations when multiple videos are used together. Frame sampling may also not work well when there are too many frames from too many videos.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The authors describe their methods, architecture, etc. reasonably well, but some more details would have been desirable (although some more is included in the supplementary material).
    • They provide an anonymized link to the code on GitHub.
    • One thing that reduces reproducibility is the use of custom architecture blocks. For example, the video encoder could be replaced with a generic frame-based image encoder. Perhaps the performance suffers a little? It would be good to know why they made this choice.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Continuing my point from above: since the temporal relationships between the frames are learned using the attention encoder + GNN, why do we still need the temporal relationship handled by 3D convolution in the video encoder? It would be conceptually cleaner/simpler if those were separated out. This would also increase modularity/reproducibility.

    • When defining the edge/node relationships, why not write the equation for the last one (Ws), since it seems important?

    • One thing I was curious about: for the GCN, could you not leverage the sequential nature of the data and establish stronger edge relationships, essentially baking in a stronger inductive bias? Perhaps you’re already doing that?

    • Your model does have fewer parameters, but what about training time, given the 3D convolutions?

    • For Table 1, what is your hypothesis for the good aFD performance on ED but not on ES?

    • Another question I have: EF is also calculated from AP2-view US images. Do you think this work would translate well if used there as well?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • This is good work overall, with the added bonus of explainability and good results. However, as I’ve written above, the explanation provided by the model is perhaps not very interesting clinically.

    • Some technical portions could also use more clarification and consideration.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a Graph Neural Network for explainable ejection fraction (EF) estimation from cardiac US imaging (echo), which they call EchoGNN. The weakly supervised training pipeline does not directly rely on ES/ED ground truth annotations and benefits from a low number of parameters, reducing computational requirements. EchoGNN consists of three main components, which are explained in detail: a video encoder, an attention encoder and a graph regressor. The authors show that their framework accurately predicts EF while also correctly identifying end systole and end diastole. They tested their method on a large dataset of AP4 echo cines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors succeed in highlighting three advantages over competing models: 1) EchoGNN produces explainable indicators as to why a model fails or succeeded, which is why the authors claim that their model can indicate whether human intervention is required; 2) the framework does not rely on accurate ground truth ED and ES labels; 3) the model has a lower number of parameters to reduce computational time. They achieved this by implementing a Graph Neural Network, which they claim was the first time such a network was applied to echo cines in the context of EF estimation. Quantitative results appear convincing in terms of MAE, R² and F1.

    The paper is well written, and each component explained with detail. The supplementary material and the illustrations are helpful to the reader.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Generally, there are no major limitations. The paper would, however, benefit from a more critical discussion on results that were presented in the supplementary material. A few comments are detailed in (8) but these can be addressed during the rebuttal period.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors have taken great care to make their work reproducible. In the reproducibility disclosure, authors state they will make code available. The methods are described in great detail in the paper. The EchoNet Dynamic dataset is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In the supplementary material (Fig. 2), it appears that coefficients for EF are low on the diagonal for 30%<EF<40% and 40%<EF<50%. Patients below 40% require medical care, and these data suggest that about a third would be misclassified as above 40%. There should be a discussion of this point in the main text. Generally, the Conclusion section is missing a detailed discussion of limitations and would benefit from a more specific outlook on future work. I would appreciate it if the authors could discuss why they think EchoGNN struggles with ES aFD in comparison to other methods. Also, are aFDs of 3+ frames not considered a poor outcome? Coming from a background of segmenting MR images, I view the temporal resolution of echo imaging as superior and thought that the identification of ED and ES should be more accurate (especially considering the closure and opening of the valves).

    Small editing comments:

    • first line in Section 2: ConvoLutional Neural Networks
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A well written submission that clearly describes the network and its advantages over previous work. The clinical application is highly relevant, as non-expert users of echo (point-of-care) will benefit from reliable detection of EF and insight on explainability. The paper could be more convincing if more emphasis were put on the discussion of the results in a clinical context.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work presents a novel method (video encoder, attention encoder and GNN) for EF estimation and cardiac key frame detection. The reviewers agree that the paper has merit, the work is sufficiently reproducible, compares against other works and clarity of the paper is already very good.

    However, several reviewers proposed detailing the limitations of the work in the discussion. Reviewer 3 expressed that the aFD values are higher than those of the state of the art; e.g., compare with Dezaki et al., Cardiac Phase Detection in Echocardiograms With Densely Gated Recurrent Neural Networks and Global Extrema Loss, IEEE TMI. Please also consider the other remarks made by the reviewers; e.g., further comparisons to works on CMR cine should be made.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR




Author Feedback

We thank all reviewers for their insightful comments and provide some discussion on the concerns that were brought forth. Please rest assured that we will consider all your valuable suggestions when making the final adjustments to the paper.

Common Concern: Lack of discussion on the limitations of the work

A: A discussion with the following key points will be added: (1) The model does not produce pixel-level explainability maps. (2) For long videos (uncommon in echo), the initial complete-graph assumption over frames has a large memory footprint.
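
To make the memory point concrete, the quadratic growth of the initial complete frame-graph can be illustrated with a small sketch (the helper name below is ours, not from the paper):

```python
def complete_graph_edges(num_frames: int) -> int:
    # An undirected complete graph (no self-loops) over T frames
    # has T * (T - 1) / 2 edges, so edge storage grows quadratically in T.
    return num_frames * (num_frames - 1) // 2

# Doubling the clip length roughly quadruples the edge count,
# which is what makes very long videos memory-hungry.
for t in (64, 128, 512):
    print(t, complete_graph_edges(t))
```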

Reviewer #1 and Reviewer #2

Q: Possibility of adding running time

A: In our own experiments, we observed that our model ran faster than prior work. However, since running time depends on factors beyond the model’s architecture, including the hardware used (GPU memory, number of GPUs, etc.) and how well-optimized and parallelizable the code is, we decided not to include such data. Additionally, we observed inconsistencies in the running times reported in prior work. For example, the running time reported by the authors of [18] for their model differs from what is reported in [21] for the same model. We may obtain the number of floating-point operations per second (FLOPS) for each model for a fairer comparison.

Reviewer #2:

Q1: Why use 3D convolutions when the GNN already captures temporal information?

A1: The 3D convolutions capture the temporal relationships between frames at the patch level. Since the GNN uses embedded features (rather than the original frames), it works at a more abstract level and captures frame-level temporal relationships. The 3D convolutions and the GNN attention encoder are therefore complementary, capturing both patch-level and frame-level relationships.

Q2: Usefulness of the explainability provided through node/edge weights; would it be better to summarise it into a single number?

A2: Our explainability weights can easily be mapped to a single number. For instance, since confident examples show more concentrated weights and uncertain samples show diffuse weights, the entropy of the weights could produce such a number. However, we abstained from doing so because we believe single-number confidence indicators fail to provide explainability and do not give enough insight into why the model is or is not confident. In a clinical setting, clinicians would not have to closely inspect the adjacency matrix or frame weights. Instead, the weights produced by the model can be superimposed on ECG plots to create easy-to-reference, understandable visualisations.
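
The entropy summary mentioned in A2 could be sketched as follows; `weight_entropy` is a hypothetical helper (not part of the EchoGNN codebase), assuming the frame weights are non-negative and normalizable:

```python
import math

def weight_entropy(frame_weights):
    """Shannon entropy of normalized frame weights.

    Low entropy: weights concentrated on a few frames (confident sample).
    High entropy: diffuse, near-uniform weights (uncertain sample).
    Hypothetical helper for illustration only.
    """
    total = sum(frame_weights)
    # Skip zero weights to avoid log(0); they contribute no entropy.
    return -sum((w / total) * math.log(w / total)
                for w in frame_weights if w > 0)

# Concentrated weights yield lower entropy than uniform (diffuse) weights.
peaked = weight_entropy([0.9, 0.05, 0.03, 0.02])
diffuse = weight_entropy([0.25, 0.25, 0.25, 0.25])
assert peaked < diffuse
```

A threshold on such a score could flag studies for human review, though, as the authors argue, it discards the frame-level detail that makes the weights explainable.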

Reviewer #3:

Q: ES/ED localization performance

A: The focus of our work is the accurate and explainable prediction of EF using a weakly supervised learning framework. Our framework learns to predict ES/ED frame locations without leveraging the ground truth location labels; the only supervision we have is the ground truth EF values. Prior works on EF estimation either lack ED/ES detection (predicting only EF values) or learn this secondary task in a fully supervised manner (using ground truth labels for each frame). Note that our model performs on par with or better than those fully supervised ones (as shown in Table 1). As mentioned in the meta-review, Dezaki et al. perform significantly better than our model at finding these frame locations. However, their work is strictly focused on detecting these frames using per-frame ground truth labels in a fully supervised approach, and they do not estimate any other clinical metric. We also see that ES detection performance suffers compared to ED even in such a fully supervised setting, which agrees with our results. Lastly, the main reason we report our model’s ES/ED frame detection performance is to quantify how well our explainability maps capture useful information, rather than cherry-picking a few examples in the form of figures.


