Authors

Rand Muhtaseb, Mohammad Yaqub

Abstract

Learning spatiotemporal features is an important task for efficient video understanding, especially in medical images such as echocardiograms. Convolutional neural networks (CNNs) and more recent vision transformers (ViTs) are the most commonly used methods, each with its own limitations. CNNs are good at capturing local context but fail to learn global information across video frames. On the other hand, vision transformers can incorporate global details and long sequences but are computationally expensive and typically require more data to train. In this paper, we propose a method that addresses the limitations we typically face when training on medical video data such as echocardiographic scans. The algorithm we propose (EchoCoTr) utilizes the strengths of vision transformers and CNNs to tackle the problem of estimating the left ventricular ejection fraction (LVEF) on ultrasound videos. We demonstrate how the proposed method outperforms state-of-the-art work to date on the EchoNet-Dynamic dataset with MAE of 3.95 and R2 of 0.82. These results show noticeable improvement compared to all published research. In addition, we show extensive ablations and comparisons with several algorithms, including ViT and BERT. The code is available at https://github.com/BioMedIA-MBZUAI/EchoCoTr
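
For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how they are commonly computed. It assumes that "R2" denotes the standard coefficient of determination; the function names and the toy LVEF values are hypothetical and are not taken from the paper or its repository.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example with made-up LVEF values (in percent), for illustration only.
reference = [50.0, 60.0, 70.0]
predicted = [52.0, 58.0, 71.0]
mae = mean_absolute_error(reference, predicted)
r2 = r2_score(reference, predicted)
```

A lower MAE and an R2 closer to 1 both indicate predictions that track the reference LVEF values more closely.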

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_36

SharedIt: https://rdcu.be/cVRv1

Link to the code repository

https://github.com/BioMedIA-MBZUAI/EchoCoTr

Link to the dataset(s)

https://echonet.github.io/dynamic/index.html#dataset

Reviews

Review #1

Please describe the contribution of the paper

This paper aims to estimate the left ventricular ejection fraction (LVEF) from 2D echo sequences. To do so, the authors adapt the existing UniFormer architecture with the objective of overcoming the limitations of CNNs and vision transformers for this type of task, leading to a convolutional transformer. They demonstrate their method on 10,000+ sequences from the EchoNet public database, focusing on 4CH views. Extensive comparisons with state-of-the-art methods and architectural choices are performed, demonstrating improved performance in terms of MAE and correlation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Relevant architecture overcoming the limitations of two types of state-of-the-art methods
- Use of a known and publicly available large dataset
- Extensive evaluation against state-of-the-art methods and architectural choices
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Strong reservations about the applicative impact of this work, given the current performance of LV segmentation, including for echocardiography.
- Limited methodological originality, although the authors perform extensive evaluation.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- Use of a large, well-known, and public dataset (EchoNet).
- Code will be available if the paper is accepted.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
As said in the weaknesses, the applicative impact is in my opinion limited, given the performance of current segmentation and tracking methods. I also wonder if better impact would be reached by focusing on assessing the dynamics along the sequences, instead of trying to (slightly) improve the performance on a rather classic problem (estimating LVEF).

The network is supposed to select the most representative frames to estimate LVEF. I wonder how it behaves on cases with abnormal motion, and in particular little motion.

Writing could be revised on several aspects:
- The Title and in particular “spatiotemporal echocardiographic assessment” may be revised to better fit what is actually proposed.
- Abstract: the sentence “However, according …” is rather vague and could be revised.
- I have similar remarks for other parts of the paper: beginning of the Introduction, section on LVEF in the center of p.2
- p.2: “adopted” should be “adapted”
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

3
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Mainly motivated by the limited methodological originality and applicative impact of this work, as mentioned above.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

Thank you for the rather complete coverage of our comments. I agree that not having to identify ED and ES frames is an asset (still arguable in case of disease), and that there is an interesting adaptation of existing architectures with somewhat better accuracy and speed. However, the dataset and clinical question addressed still may not be optimal to demonstrate the potential of this method. Going further with motion abnormalities could be tested on other existing datasets (real or synthetic) or even toy experiments, although I understand that this is too much for a MICCAI paper. In conclusion, the technical part, although not fully original, may raise interest in the MICCAI audience, but I wonder about the clinical relevance and impact of the addressed problem.

Regarding the title, I would replace the generic term “assessment” to better refer to the methods (“convolutional transformer”).

Review #2

Please describe the contribution of the paper

In this article, a convolutional transformer (EchoCoTr) is proposed as a method that combines vision transformers and CNNs to analyze echocardiogram video sequences and generate an LVEF prediction. Deep learning networks require a fixed number of video frames; to obtain them, frames are sampled at a uniform frequency, and the authors also propose using frames from end-systole and end-diastole. The EchoCoTr architecture learns local features while avoiding redundancy in adjacent frames, and captures global information across the video. The results show that EchoCoTr can train with little information and give better or comparable results to other models such as EchoNet-Dynamic, BERT, DistilBERT and ViT, although they also show that the model results are affected by the way the frames are sampled from the video.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

There is a novel formulation in the use of transformers to obtain the ejection fraction of the heart’s left ventricle. The authors compare their new proposal with known methods, give quantitative results for all of them, and demonstrate the advantages of using transformers. They also clarify the importance of the clinical background.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Although this is a first use of transformers to calculate the ejection fraction, it looks like the method does not yet achieve convincing results when compared to the other methods. In addition, the theoretical part is not strong enough to explain the reasoning behind the different experiments.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

If the authors publish their code, the work can be reproduced. The paper is strong in this aspect, and the database is also publicly available. Details on the language and platform are given.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The database has been extensively used, and the authors now provide a new method and a good number of experiments, which makes the paper strong and illustrative. I believe it can be reproduced relatively easily, and it will help others to have access to your code and to compare new methods against it.
Also, the contribution mentioned in the introduction is well written and gives a very good idea of the paper. Perhaps it would be stronger if the authors went deeper into the reasons why their method is better than the others and discussed its specific disadvantages.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors explain their paper, show all the experiments and work done, and compare and quantify their method against other existing methods. They also show the clinical importance of their method.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Review #3

Please describe the contribution of the paper

The authors present an application of the UniFormer network to the task of LVEF prediction. The results outperform existing approaches by a very small margin.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The results of the paper outperform the state-of-the-art approach, which was using CNNs. The architecture used here, while not new, uses a combination of transformer and convolutional blocks. It beats the existing transformer-based approaches by a large margin. The results are supported by a solid ablation study and compared to the relevant literature. The network architecture used by the author had never been tested for this specific task before.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The novelty of the work is limited. The authors took the existing UniFormer [1] architecture and adapted it to their task. It is unclear how efficient this approach is compared to the other state of the art.

[1] Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unified transformer for efficient spatiotemporal representation learning (2022)
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The data used by the authors is public and the code will be made available upon acceptance. The higher-level parameters used by the authors are listed in the paper. It is mentioned that the UniFormer model was adapted, but it is not clear what changes were actually performed. The paragraph detailing this (end of page 4) is not clear enough and may be not complete. Overall, the reproducibility should be excellent once the github is made public.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

It would be welcome to have a comparison of the different models’ efficiency in Table 1. This may highlight one of the limitations of EchoCoTr or reinforce its usability in a clinical setup, where compute resources are limited. Efficiency could be estimated by looking at how much time it takes for each model to analyze all the videos from the Validation or Test set, or by using the standard FLOPs metric.

The data sampling (page 3, section 3.1) looks well described, but EchoNet-Dynamic contains videos of arbitrary length. In the case where the video is longer than the clip covered by the sampling, how is the clip starting point selected? And how do the authors handle the case where the source video is too short for the selected sampling method?

The changes made to the UniFormer model are not very clear. The model is partially described at the end of page 4, but it would be welcome to know what differs from the UniFormer and what was taken as-is. If all the mentioned parameters are different from the ones used in the UniFormer paper, please clarify it.

Authors mention that their model can predict the LVEF with a good accuracy when it is given just one end-systolic and one end-diastolic frame. This is very interesting, but it should be stated that this is an important bias, as the model usually has to determine the position of these key frames.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper beats the state of the art on the EchoNet dynamic dataset, which is the new standard for the LVEF prediction on 2D echocardiograms, but there is not much novelty.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

7
[Post rebuttal] Please justify your decision

This paper is a solid work and sets a new performance record on the task of predicting LVEF. Authors promised to integrate the missing parts that were raised at the first review stage in the camera ready paper. The only remaining weakness is the lack of novelty in the approach.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The authors propose a transformer-based framework for LVEF estimation. All reviewers agree on the clarity of the paper and have praised the extensive evaluation. However, there are concerns about the novelty. I would suggest that the authors focus on addressing:
- Technical novelty: clearly state the novel contributions and why they matter/what is the impact
- Application novelty: how does their approach make a difference that is important/relevant in the target application
I would also recommend that the authors read the reviewers’ comments carefully and try to address them all.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

6

Author Feedback

We thank the reviewers (R1-3) and the meta reviewer (MR) for providing the valuable feedback and comments. Please find our answers (A) below.

MR + R1: Originality + Strong reservations about the applicative impact of this work + R1: Assessing the dynamics + R3: Novelty A: We agree with the reviewers that our solution combines existing methods to tackle this problem. However, the combination and adaptation of these methods is original, and this is reflected in a more accurate and faster solution: the MAE is reduced from 4.22 (the current SOTA that does not require segmentation) to 3.95, with a speedup of at least 20%. In addition, our technique learns from weak supervision (only the LVEF value), whereas other works require segmentation of the LV. We concur that LVEF estimation is a classical problem and that predicting other biomarkers, e.g., GLS, is important. Unfortunately, we (and other researchers) are limited by the available datasets and their annotations (only LVEF labels exist). In addition, what makes our contribution impactful is that it does NOT require

1) information regarding the position of ES and ED,

2) segmentation masks, as EchoNet-Dynamic’s beat-to-beat pipeline does, and

3) a pre-defined length of the cardiac scan.

We will edit our contribution section in the paper to make these points clearer.

R1: Abnormal motion + R2: Disadvantages A: Assessing our method with abnormal heart motion cases is ideal but we are limited to the available dataset that may not have such abnormal cases. However, we believe that if the dataset contains such cases, the proposed algorithm is likely to learn discriminative features for the abnormal motion of the heart. We will add this point to the conclusion as a possible improvement.

R1: Writing could be revised A: Small issues shall be fixed, and the unclear parts of the text shall be rephrased as requested. Regarding the change of paper title, we could revise it, but it would be helpful if the reviewer could suggest what exactly in the title is not appropriate.

R2: Convincing results + theoretical part is not strong A: As shown in Table 1, our results show better performance compared to SOTA (MAE reduced from 4.22 to 3.95). Please refer to the contributions stated in the 1st point above. Since the work combines multiple ideas, some theoretical description can be found in [4, 5, 8].

R2: access to the code A: Our code will be available upon acceptance

R3: data sampling A: The starting frame to process is not necessarily frame 0. We randomly select the starting frame from the range [0, (number of original video frames) - ((number of sampled video frames) - 1) * (sampling frequency)]. This sampling makes the network insensitive to the starting frame. In the case of short videos, the starting frame is frame 0, and zero-filled frames are padded to the end of the video if needed. We will ensure that this is clear in the paper.
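
The sampling rule in the answer above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_clip_indices` and its parameters are hypothetical, the "sampling frequency" is read as the step between consecutive sampled frames, and the upper end of the range is treated as exclusive so that every sampled index exists in the video.

```python
import random


def sample_clip_indices(num_video_frames, num_clip_frames, period, rng=random):
    """Pick frame indices for one clip (hypothetical helper, not the authors' code).

    Returns (indices, num_padded): the video frames to read, and how many
    zero-filled frames would have to be appended to reach num_clip_frames.
    """
    # Distance between the first and the last sampled frame.
    span = (num_clip_frames - 1) * period
    # Assumption: the upper end of the start range is exclusive, so the last
    # sampled index (start + span) never exceeds num_video_frames - 1.
    upper = num_video_frames - span
    if upper >= 1:
        start = rng.randrange(upper)
        return [start + i * period for i in range(num_clip_frames)], 0
    # Short video: start at frame 0, take what exists, pad the rest with zeros.
    indices = list(range(0, num_video_frames, period))[:num_clip_frames]
    return indices, num_clip_frames - len(indices)


# Illustrative values only: a 100-frame video and a 40-frame video,
# sampling 32 frames with a step of 2.
long_indices, long_pad = sample_clip_indices(100, 32, 2, rng=random.Random(0))
short_indices, short_pad = sample_clip_indices(40, 32, 2)
```

In the first call the start frame is drawn at random and no padding is needed; in the second call the video is too short, so the clip starts at frame 0 and the remaining positions would be zero-filled.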

R3: bias for the experiment when ES and ED are assumed known A: It is a good point, so we added that part in the paper with “…as the location of ES and ED frames are already known beforehand…”.

R3: Changes made to the UniFormer model A: We mentioned in the paper that we have utilized the UniFormer model as a base to our architecture. The focus of our work was not on improving the UniFormer model but rather to develop a novel method which could be applied efficiently to the problem of estimating LVEF in temporal video data. We shall revise the text in the paper to make this point clearer.

R3: Efficiency A: We have compared the inference speed of our proposed models against EchoNet-Dynamic’s R2Plus1D. Our EchoCoTr-S and EchoCoTr-B are approximately 50% and 20% faster than R2Plus1D, respectively. This shall be reported in the paper with exact timing on the reported hardware, which can also be reproduced once the code is made available.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors addressed most comments from the reviewers. Although there are still some reservations about novelty and utility, the reviewers have raised their scores, and overall the paper seems suitable for publication.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

7

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The reviewers tend to accept this paper. R1 also increased their score, stating that the only reason why they didn’t go higher is that other papers in their stack were rejected despite being ranked higher. Thus, I would also count R1 as a borderline accept, and the paper should be accepted for MICCAI’22.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

7

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

According to the reviewers, the extensive experiments conducted seem to outweigh the somewhat limited technical novelty of the proposed approach. I therefore think that the paper is interesting for the community. I vote for acceptance.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

8

EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography