
Authors

Hooman Vaseli, Ang Nan Gu, S. Neda Ahmadi Amiri, Michael Y. Tsang, Andrea Fung, Nima Kondori, Armin Saadat, Purang Abolmaesumi, Teresa S. M. Tsang

Abstract

Aortic stenosis (AS) is a common heart valve disease that requires accurate and timely diagnosis for appropriate treatment. Most current automatic AS severity detection methods rely on black-box models with a low level of trustworthiness, which hinders clinical adoption. To address this issue, we propose ProtoASNet, a prototypical network that directly detects AS from B-mode echocardiography videos, while making interpretable predictions based on the similarity between the input and learned spatio-temporal prototypes. This approach provides supporting evidence that is clinically relevant, as the prototypes typically highlight markers such as calcification and restricted movement of aortic valve leaflets. Moreover, ProtoASNet utilizes an abstention loss to estimate aleatoric uncertainty by defining a set of prototypes that capture ambiguity and insufficient information in the observed data. This provides a reliable system that can detect and explain when it may fail. We evaluate ProtoASNet on a private dataset and the publicly available TMED-2 dataset, where it outperforms existing state-of-the-art methods with an accuracy of 80.7% and 79.7%, respectively. Furthermore, ProtoASNet provides interpretability and an uncertainty measure for each prediction, which can improve transparency and facilitate the interactive use of deep networks to aid clinical decision-making. Our source code is available at: https://github.com/hooman007/ProtoASNet.
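For readers who want the mechanism described in the abstract in code form, the following is a minimal sketch: an embedding is scored against learned class prototypes, while a separate pool of uncertainty prototypes yields an abstention score. All names, shapes, and the choice of cosine similarity are assumptions for illustration and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeSimilarityHead(nn.Module):
    """Illustrative prototype-similarity head (not the ProtoASNet code):
    class prototypes provide interpretable evidence per AS class, and a
    separate pool of uncertainty prototypes yields an abstention score."""

    def __init__(self, embed_dim=512, n_classes=3, protos_per_class=10, n_uncertainty=10):
        super().__init__()
        total = n_classes * protos_per_class + n_uncertainty
        # Learned prototypes, flattened to vectors for simplicity.
        self.prototypes = nn.Parameter(torch.randn(total, embed_dim))
        self.n_classes = n_classes
        self.protos_per_class = protos_per_class

    def forward(self, z):
        # z: (batch, embed_dim) pooled spatio-temporal embedding of a video clip.
        sim = F.cosine_similarity(z.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1)
        n_class_protos = self.n_classes * self.protos_per_class
        class_sim = sim[:, :n_class_protos].reshape(-1, self.n_classes, self.protos_per_class)
        class_scores = class_sim.mean(dim=-1)                 # evidence per AS severity class
        abstain_score = sim[:, n_class_protos:].mean(dim=-1)  # proxy for aleatoric uncertainty
        return class_scores, abstain_score
```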

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_36

SharedIt: https://rdcu.be/dnwJR

Link to the code repository

https://github.com/hooman007/ProtoASNet

Link to the dataset(s)

N/A


Reviews

Review #5

  • Please describe the contribution of the paper

    With their submission, the authors propose a prototype-based aortic stenosis classification algorithm using time-variant prototypes. The authors demonstrate the value of the method on two datasets, one private and one public. Moreover, the authors state that they will release their code upon acceptance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall, the manuscript is very well written and was a pleasure to read. It follows a highly clear structure, emphasizing its contributions as well as all relevant steps to evaluate the approach. The use of figures, tables, and captions is exemplary, as is the submission of supplementary material and even a video demonstration of the approach. The approach is novel and interesting and has high clinical value. The achieved results are state-of-the-art.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    To me, the paper has two minor weaknesses: a) the lack of inferential statistical measures, and b) the missing description of how the weighting of the different loss terms was derived.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility checklist matches the submission well. The authors agreed to publish their code upon acceptance. Additionally, one of the datasets used is public. Furthermore, the authors added descriptive supplementary material. Therefore, reproducibility of the results is most likely given.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, the paper felt mostly exemplary to me. Therefore, my recommendations will only address two minor points:

    Method & Evaluation:

    • The concrete choice of the lambda hyperparameters as 0.7, 0.8, 0.08, 1e-2, 1e-3, and 1e-4 felt somewhat unmotivated. In particular, they span several orders of magnitude, which calls into question the chosen values for L_trans and L_norm. Could the authors add one or two sentences regarding the choice of these parameters?

    • Regarding the evaluation, I felt that it was mostly exemplary. However, I would strongly recommend that the authors add inferential statistical measures; I would recommend having a look at the work by Efron [1] (a brief illustrative sketch follows the reference below).
    • By introducing confidence intervals, the authors might be better able to demonstrate the superiority of their method, as well as the contribution of each component to this result. Notably, looking at Table 2, confidence intervals might demonstrate the added value of the push phase.

    References: [1] Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171-185.
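For concreteness, a minimal sketch of the kind of resampling-based interval being suggested, using a plain percentile bootstrap over per-sample correctness; Efron's BCa interval [1] adds bias and acceleration corrections on top of this, and all names here are illustrative only:

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a vector of 0/1 outcomes.
    Simplified illustration; BCa (Efron, 1987) refines these percentiles."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    boot_means = np.array([
        rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Example with synthetic per-sample outcomes:
# acc, (lo, hi) = bootstrap_accuracy_ci(np.random.binomial(1, 0.8, size=500))
```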

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As stated above, in my view the paper is written in an exemplary fashion, addresses two clinically highly relevant fields (XAI and AS detection), succeeds in demonstrating the method's value, and is strongly reproducible. The use of tables, figures, and captions is exemplary, while the weaknesses are very minor.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors have proposed a method for classifying the severity of aortic stenosis using a video-based prototype network that uses the similarity between inputs and learned prototypes to make predictions. They have evaluated this method on two echocardiography datasets and compared it to other methods from the literature; their method outperforms the rest.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novelty of the method and use of video data instead of image data for the task
    • Well-written and easy to follow
    • Extensive validation by comparing to methods from the literature
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The material needs reorganizing so that methods content appears only under Methods, … .
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have used one public dataset, and they will make their code public by sharing their GitHub repository, so I believe their work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It seems that the authors have used the supplementary material to report the results on the public dataset, which they could not fit in the paper because of the page limit. I suggest that they either remove this section or summarize the results and present them in the same way as for the private dataset.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have used an innovative approach for detecting AS severity. They have validated their work by comparing it to a large number of existing methods, and their method outperforms these baselines.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This work presents a novel method for aortic stenosis classification in echocardiography based on a dynamic prototypical network model that leverages uncertainty-oriented prototypes to capture class ambiguity (namely due to suboptimal imaging). The method was validated in one private and one public database, showing superior performance against state-of-the-art and baseline variants.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novelty: the study presents both methodological and application novelty. On the one hand, the authors propose a dynamic prototypical network to process medical videos and embed uncertainty prototypes to deal with classification ambiguity. On the other hand, it is the first time this type of method is used for AS severity classification, outperforming other SoTA methods and opening new avenues into explainable AI methods in echocardiographic interpretation that may be applicable to other similar problems.

    Interpretability: a key aspect for clinical practice integration, which is here accounted for with the use of prototypical networks. Interestingly, this strength is further supported and evidenced by the very illustrative and appealing supplementary videos submitted.

    Ablation study and proof of algorithmic reasoning: the authors have adequately compared their strategy against simpler variants of the proposed method, including a frame-only variant, no uncertainty prototypes, or no prototype clustering, all of which reinforce the validity of the algorithmic decisions made.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Methodology description: despite being globally well written and very complete, small details of the methodology seem to be missing, or require reading cited references or seminal (uncited) works.

    Missing experimental details: in the SoTA comparison (on both the private and public databases), certain methodological details seem to have been modified for a fairer comparison, but some details of these experimental nuances are missing. Similarly, dataset details (such as the number of studies per set or the acquisition details [single or multiple US machines/transducers, frame rate, etc.]) are missing.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors employ a private dataset of 2572 studies. However, limited description is given regarding image acquisition (equipment used, frame rate, or other imaging characteristics relevant in the context of the proposed video classifier), the methods employed for quality control (whether the Doppler-based assignment was performed by a single observer or multiple, whether consensus was required), etc. Similarly, several methodological/experimental details are lacking, hampering the adequate comprehension or reproduction of the authors’ methods/results (note that most details can probably be understood once the code is made public, but the document should be as self-explanatory as possible). See specific comments below.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In addition to the comments raised above, some comments follow:

    • Please complete the description of the “push” mechanism or include a reference for it. How is the projection/minimization performed? When you say “closest training examples”, is a fixed number defined? How are these prototypes initialized? Consider adding pseudocode of the training mechanism (ProtoNet + “push”), even if in the supplementary material.
    • How many convolutional layers are used in the feature and ROI modules? Do all layers have the same number of filters (equal to the output depth of the respective branch)? What activation function was used in-between? Were there any normalization layers?
    • How were the results in Table 1 obtained? Please explicitly mention whether official code implementations were used (even if changes were made for the sake of comparison) or your own.
    • Still regarding Table 1, was the modification of the backbone performed for the prototypical methods only or for all SoTA methods? If the latter, although I understand that it makes the comparison fairer, it also means that you are not necessarily comparing against the authors’ original method (which could have tested multiple networks for example and reached the conclusion that another worked better for their pipeline). Consider including the results with both modified and original backbone.
    • A similar question could be asked about the change in backbone for TMED-2. Are the results very different if a ResNet-18 were used like in the private database? Consider adding such a result.
    • What were the thresholds set for TMED-2? Why is a threshold required for the uncertainty component: couldn’t one assume that a sample is uncertain if the probability assigned to it is greater than that assigned to any other AS class?
    • How were the hyperparameters of the loss function determined? Were they determined experimentally on the validation set (in which case one must carefully reassess them when the method is employed in a distinct application), or are these values “known” (or at least often used) in the literature? If the latter, include the appropriate reference.

    Minor remarks:

    • On page 3, include the meaning of “X” (the set of training images?).
    • Table 1 appears too early. Consider placing it only after it is referenced in the text.
    • On page 7, correct to “pump blood to the rest”.
    • On page 8, “explained by a low”.
    • In supplementary Table 1, for clarity, consider replacing “our public dataset” with “the TMED-2 dataset”.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript presents sufficient methodological novelty in a clinically relevant task. It is well written, presents adequate experiments, a good discussion of the method, the algorithmic decisions made, and the results, and very appealing figures/videos. Moderate weaknesses exist, but these seem easily correctable in a rebuttal phase, after which the paper would be, in my opinion, a “strong accept”.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The manuscript has been accepted by the reviewers due to the novelty and interest of the approach, its high clinical relevance, and the exemplary level of writing and presentation. The use of video data, instead of traditional image data, for the given task has been recognized as a strong point of the manuscript. The reviewers also commend the extensive validation against existing literature and the provision of supplementary material, including a video demonstration.

    However, while the overall positive feedback is apparent, the reviewers have raised several points of concern that require some attention. A significant portion of these concerns stems from missing or incomplete methodological details, particularly with respect to the employed datasets, the convolutional layers used in the feature and ROI modules, and the derivation of the weighting of different loss terms. In addition, the concrete choice of hyperparameters and the lack of inferential statistical measures were also noted as weaknesses that could be addressed.

    Recommendation:

    Given the strengths of the paper and the agreement among reviewers to accept it, along with the manageable nature of the identified weaknesses, the recommendation is provisional acceptance.




Author Feedback

We would like to thank the (meta-) reviewers for their time and effort in assessing our paper. We gained valuable insight from your constructive feedback and will use your advice to substantially improve the quality and clarity of the paper. In this response, we will summarize and respond to the feedback in different themes, and point readers to any part of the paper that was changed as a result.

Firstly, we have added more information regarding our private dataset. AS severity was graded based on the ACC/AHA 2006 guidelines: Mild (AVA >1.5 cm², peak transaortic velocity of 2.00-2.99 m/s, mean transaortic gradient <20 mmHg), Moderate (AVA 1.0-1.5 cm², peak velocity 3.00-3.99 m/s, mean gradient 20-39.9 mmHg), and Severe (AVA <1.0 cm², peak velocity ≥4 m/s, mean gradient ≥40 mmHg), where AVA is the aortic valve area. Cines were acquired on Philips iE33, Vivid i, and Vivid E9 ultrasound machines, with frame rates ranging from 12 to 147 fps (average 37 fps). AS grading was performed by a single level III echocardiographer. We followed an 80-10-10 ratio to split the dataset and ensured patient exclusivity across splits. We have added these details in the revision. It would be interesting in future work to derive ground truth from more recent editions of the guidelines and to compare our results against inter-rater agreement for clinical tests, should such data be available.
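For illustration only, the quoted thresholds can be written as a simple lookup; actual grading reconciles discordant measurements and was performed by an expert echocardiographer, so this sketch is not the labeling pipeline:

```python
def grade_as_severity(ava_cm2, peak_velocity_m_s, mean_gradient_mmhg):
    """Toy encoding of the ACC/AHA 2006 thresholds quoted above.
    Returns the most severe grade supported by any single parameter."""
    if ava_cm2 < 1.0 or peak_velocity_m_s >= 4.0 or mean_gradient_mmhg >= 40:
        return "Severe"
    if ava_cm2 <= 1.5 or peak_velocity_m_s >= 3.0 or mean_gradient_mmhg >= 20:
        return "Moderate"
    return "Mild"

# e.g. grade_as_severity(0.8, 4.2, 45) -> "Severe"
```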

Secondly, we have elaborated further on the methodology. We used two convolutional layers in the feature module, both with the same number of filters, D, and three convolutional layers in the ROI module, with D, D/2, and P filters respectively, to avoid an abrupt channel reduction to P. In both modules, all convolutional layers except the last, which has a linear activation, are followed by a ReLU activation; we did not use any normalization layers. Furthermore, the “push” mechanism simply copies the embedding values of the closest feature representation into p^c_k. We decided to keep the term “push” to be consistent with the existing prototypical-network literature, describing the copying action as a “projection”. We have modified Eq. (1) for more clarity in the revision.
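As a rough sketch of the description above (kernel sizes, the use of 3D convolutions, and the helper names are assumptions made for illustration, not the released code):

```python
import torch
import torch.nn as nn

def make_roi_module(D, P):
    # Three conv layers with D, D/2, and P filters; ReLU in between,
    # linear output on the last layer, no normalization layers.
    return nn.Sequential(
        nn.Conv3d(D, D, kernel_size=1), nn.ReLU(),
        nn.Conv3d(D, D // 2, kernel_size=1), nn.ReLU(),
        nn.Conv3d(D // 2, P, kernel_size=1),
    )

@torch.no_grad()
def push_prototypes(prototypes, candidate_embeddings):
    # "Push"/projection step as described: each prototype is replaced by
    # (copied from) the closest embedding seen over the training set.
    # Simplified: ignores the per-class constraint on candidates.
    dists = torch.cdist(prototypes, candidate_embeddings)  # (num_protos, num_candidates)
    nearest = dists.argmin(dim=1)
    prototypes.copy_(candidate_embeddings[nearest])
```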

Regarding the choice of hyperparameters, we selected the optimal lambda_abs based on the mean F1 score on the validation set through a search over five values (0.1, 0.3, 0.5, 0.9, and 1.0). Lambda_clst, lambda_sep, and lambda_norm are derived from the hyperparameter selection of ProtoPNet [1] and are kept constant throughout the experiments. Lambda_norm is of a different order of magnitude since the normalization penalty is based on the L1-norm of the weights; increasing this weight adversely affected model performance. Lambda_trans was set to 1e-3 since we empirically found that larger values sometimes led to convergence at suboptimal minima. For lambda_orth, we tried two low values, 1e-2 and 1e-3, to avoid overly constraining the prototype learning, and found 1e-2 to be better empirically.

As for the inferential statistical measures, we will aim to run more iterations with the same hyperparameters to obtain confidence intervals. Due to the computational load of the task, we will prioritize the video-based models first.

The reviewers have made a valuable point regarding the choice of backbone for each of the experiments in Table 1. For the two non-interpretable AS classification models we compare to, Huang et al. and Ginsberg et al., we use their original backbones, WideResNet and ResNet(2+1)D-18, respectively. Since the contributions of previous prototypical methods are intended to be architecture-agnostic, we compare the different methods on the same backbones: ResNet-18 for images and the first three blocks of ResNet(2+1)D-18 for videos. We have added a comment specifying this in the revision.
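An assumed construction of the shared video backbone, truncating torchvision's ResNet(2+1)D-18 after its third residual stage; the exact truncation point and weight loading used in the paper may differ:

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

def make_video_backbone():
    # Keep the stem and the first three residual stages of ResNet(2+1)D-18,
    # dropping layer4 and the classification head. Pass weights=... (or
    # pretrained=..., depending on the torchvision version) to use pretraining.
    net = r2plus1d_18()
    return nn.Sequential(net.stem, net.layer1, net.layer2, net.layer3)
```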

[1] Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32 (2019)


