
Authors

Anlan Sun, Zhao Zhang, Meng Lei, Yuting Dai, Dong Wang, Liwei Wang

Abstract

Breast ultrasound videos contain richer information than ultrasound images, so it is more meaningful to develop video models for this diagnosis task. However, collecting ultrasound video datasets is much harder than collecting image datasets. In this paper, we explore the feasibility of enhancing the performance of ultrasound video classification using a static image dataset. To this end, we propose KGA-Net and a coherence loss. KGA-Net is trained on both video clips and static images. The coherence loss uses the feature centers generated from the static images to guide the frame attention in the video model. Our KGA-Net boosts performance on the public BUSV dataset by a large margin, and the visualization of the frame attention demonstrates the explainability of our method. We release the code and model weights at https://github.com/PlayerSAL/KGA-Net.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_43

SharedIt: https://rdcu.be/dnwHo

Link to the code repository

https://github.com/PlayerSAL/KGA-Net

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a method to enhance the performance of ultrasound video classification using a static image dataset: KGA-Net, which uses a coherence loss to estimate frame attention weights. The model improves performance on the publicly available BUSV dataset.


  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clearly motivated problem with good related work.
    • Well written
    • I like the idea of the coherence loss and the ablation studies. They show that attention has the highest impact of all the modules.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Ablations over other design choices would be helpful, such as the estimation of the attention weights, the aggregation strategy, etc.
    • See questions below.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Ok.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Questions:

    • Introduction, 1st paragraph: I don’t agree that it is ‘essential’, but it might be helpful to aggregate information from the entire video to perform accurate automatic lesion diagnosis. Please also add references for multi-view US works on breast cancer detection.
    • Did you try using LSTM-based aggregation instead of a weighted sum using attention weights?
    • What is the idea behind using a Gram matrix for the coherence loss?
    • Does it play a role if the videos have different numbers of frames? Are the weights normalised?
    • Can you also give information about runtimes (training, inference, etc.)?
    • How separable are the benign and malignant centres? Is there any ablation study on that?
    • Do you think your method is applicable to other problems?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, I find the idea of the paper good. The paper is well written and the results are well communicated.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces KGA-Net (Keyframe Guided Attention Network), a novel approach for improving ultrasound video classification. To address the challenge of limited ultrasound video datasets, the authors leverage a static image dataset and propose coherence loss to guide frame attention. The KGA-Net model combines video clips and static images during training, with the frame attention mechanism determining the contribution of each frame for diagnosis. Experimental results on the BUSV dataset demonstrate the superiority of KGA-Net over other video classification models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study presents a compelling investigation with robust quantitative findings that hold significant potential impact. The paper exhibits clear and coherent writing, ensuring accessibility for readers. The authors are commended for their accomplished work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I would like to offer a few comments for further consideration. First, it is crucial to address the implications of this study on clinical practice and workflow, as they represent a vital component currently missing from the paper. The authors should elaborate on the practical implications arising from their research. Furthermore, as the model’s outcome appears to be binary classification, it is important to provide contextual information on the clinical relevance and utility of such an outcome in real-world clinical practice. This will help readers understand the practical value of the proposed approach. Moreover, it is worth noting that when working with 3D data, conventional wisdom suggests that 2D models may not yield optimal performance. Hence, I recommend that the authors showcase the effectiveness of their strategy using a backbone architecture specifically designed to handle 3D data, thus reinforcing the credibility of their approach. Lastly, an integral aspect contributing to the success of the proposed methodology is its reliance on the expertise of sonographers, who actively select the key static images to guide video model training. While this factor is not a weakness, I suggest that the authors emphasize and elucidate this point to underscore the importance of domain expertise in their approach.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reasonably reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Reiterating the comments listed above under strengths and weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a strong paper with clear clinical relevance. The paper could use some clarifications on the clinical side, but it is on track for acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The authors propose the Keyframe Guided Attention Network (KGA-Net) for breast ultrasound video classification. KGA-Net uses a 2D feature extractor for each individual video frame and weights the frames according to a predicted “attention weight” (instead of, e.g., averaging) before classification. The attention weight prediction is learned by means of image feature centers learned from a breast ultrasound static image dataset.
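
    For concreteness, the pipeline described here can be sketched in PyTorch as follows. This is an illustrative sketch only, not the authors’ released code: all names (AttentionAggregator, coherence_loss, centers) are assumptions, and the paper’s actual coherence loss (which, per Review #2, involves a Gram matrix) may be formulated differently.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AttentionAggregator(nn.Module):
            """Per-frame 2D features, predicted attention weights, weighted sum."""
            def __init__(self, feat_dim: int = 128, num_classes: int = 2):
                super().__init__()
                self.backbone = nn.Sequential(            # stand-in for a 2D CNN backbone
                    nn.Conv2d(3, feat_dim, kernel_size=7, stride=4),
                    nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),
                    nn.Flatten(),
                )
                self.attn_head = nn.Linear(feat_dim, 1)   # scalar attention score per frame
                self.classifier = nn.Linear(feat_dim, num_classes)

            def forward(self, video):                     # video: (B, T, 3, H, W)
                b, t = video.shape[:2]
                feats = self.backbone(video.flatten(0, 1)).view(b, t, -1)   # (B, T, D)
                attn = F.softmax(self.attn_head(feats).squeeze(-1), dim=1)  # (B, T)
                fused = (attn.unsqueeze(-1) * feats).sum(dim=1)             # (B, D)
                return self.classifier(fused), attn, feats

        def coherence_loss(feats, attn, centers, labels):
            """Push attention toward frames whose features lie close to the
            feature center of the ground-truth class; centers has shape
            (num_classes, D) and would be learned with a center loss on the
            static image branch."""
            dist = (feats - centers[labels].unsqueeze(1)).norm(dim=-1)   # (B, T)
            target = F.softmax(-dist, dim=1)   # nearer frames get larger target weight
            return F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")

    A training step would then combine a cross-entropy loss on the returned logits with coherence_loss(feats, attn, centers, labels).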

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The general idea (using static images to improve breast ultrasound video classification) is appealing and useful. The idea of learning important keyframe features from a static breast ultrasound dataset and using these for video breast ultrasound classification seems novel.
    • SOTA results on a public benchmark dataset.
    • The predicted attention weights can be used for model explainability.
    • The technical solution (learn to predict frame attention using a coherence loss on feature distances to feature centers learned with a center loss) seems novel.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors claim that there are many breast ultrasound datasets with static images, but have only used one of these, and do not motivate why.
    • The authors write that “Recent research demonstrated the potential of deep learning for breast lesion classification tasks (…) while few works focus on the video modality.” However, they do not cite, summarize, or compare to any of these previous methods, nor explain why; they merely state that “Since the research on ultrasound video classification is lacking, we compare our method with other strong video baselines on natural images.”
    • The ablation study is not very exhaustive, and it is not very clear what the different setups include. Two obvious setups seem to be missing: (1) training a standard video classification network with a 2D backbone and e.g. frame averaging and (2) option 1 with a 2D backbone pre-trained on the static US images. The ablation study also lacks evaluation of the impact of the CE loss in the image classification networks, and the impact of the shared weights of the 2D backbones.
    • The authors do not report the statistical significance of the results.
    • The experimental setup for the compared methods is not clearly reported (in-house implementations, or implementations by the original works? Choice of hyperparameters, etc.?). Further, the authors state that “For a fair comparison, we use both the video and image data to train these models.” without support that this setup actually gives a fair comparison (i.e., compared to using only videos).
    • There is no discussion of failure cases.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Generally, the details on the experimental setup for the compared methods and ablation study are lacking.
    • The choice of model, training strategy, and hyperparameters is not motivated, despite the reproducibility checklist stating otherwise.
    • The authors state that “The average runtime for each result, or estimated energy cost” and “An analysis of situations in which the method failed” are not applicable to this study, which I do not agree with.
    • The authors state that they have included a “Discussion of clinical significance”, which I do not agree with.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The abstract should be rewritten with more care. E.g. “Breast ultrasound videos contain richer information than ultrasound images, therefore it is more meaningful to develop video models for this diagnosis task.” - which task? “However, the collection of ultrasound video datasets is much harder.” - why? in what way? “In this paper, we explore the feasibility of enhancing the performance of ultrasound video classification using the static image dataset.” - What kind of classification? Which dataset? And so on…
    • Figure 2 is somewhat chaotic and should be polished: there are several arrows that are neither intuitive nor very well explained. The caption should be extended to help the reader.
    • The authors should cite, summarize and compare to previous method on ultrasound video classification, or motivate why they do not.
    • The ablation study should include more setups and be explained more clearly. Missing setups: (1) training a standard video classification network with a 2D backbone and, e.g., frame averaging; (2) option 1 with a 2D backbone pre-trained on the static US images; (3) the impact of the CE loss in the image classification networks; and (4) the impact of the shared weights of the 2D backbones. Also, it is not clear what the setup “w/o coherence loss” means: here it seems like attention is still used, but how is this attention computed if no coherence loss is used for training an attention prediction network? This needs to be clarified.
    • The authors should report the statistical significance of the results.
    • The authors should report the experimental setup for the compared methods.
    • The authors write that “During inference, we use the video classification network individually. We sample up to 128 frames of each video to form a video clip and predict its classification result using our model.” They should motivate why they do not use the full videos and why they chose this particular length.
    • Fig. 3 (a-d) should include information on the GT and predicted class.
    • It would be very informative if a brief discussion on failure cases and/or some examples would be included in the paper.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The usefulness of the overall approach (enhancing BUS video classification by means of BUS static image datasets), the novel technical solution (learning to predict frame attention using a coherence loss on feature distances to feature centers learned with a center loss), the SOTA results, and the explainability aspect outweigh the paper’s weaknesses (inferior reproducibility; lack of motivation for using/not using BUS datasets/previous methods; lack of analysis of statistical and clinical significance and of failure cases).

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All three reviewers agree to accept this work; hence, it is accepted. Please prepare the final version based on their comments.




Author Feedback

Dear Reviewer #1,

We appreciate your efforts in reviewing our paper and providing constructive feedback. We will make the necessary modifications based on your comments. We would also like to offer some clarifications on the issues you raised. Firstly, regarding the lack of comparison with the previous ultrasound video classification method, we would like to explain that the previous work relied on a private dataset with keyframe annotations for supervised training. The released code does not include keyframe detection, which makes a direct comparison impossible. We will provide the full experimental setup and details, including hyperparameter settings and related strategies, by publicly sharing the paper’s code.

Dear Reviewer #2,

We sincerely thank you for your valuable feedback. We will revise the paper accordingly. However, we have a few points to clarify. Regarding the question on LSTM-based aggregation, we believe that the weighted sum is a natural fit for our hypothesis that each frame contributes differently to the diagnosis. Moreover, an LSTM-based method aggregates information across multiple frames, making the contribution of individual frames less distinctive. As for the question on the number of frames, our method supports a dynamic number of frames; the fixed length was chosen to meet the requirement of MViT. Finally, regarding the applicability of our method to other problems, our method is transferable: for instance, it could be applied to video classification tasks that rely heavily on keyframes.
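
On the normalization point: a softmax over the frame axis yields attention weights that sum to one for any clip length, which is how a weighted sum can support a dynamic number of frames. A hypothetical illustration (not taken from the paper’s code):

    import torch
    import torch.nn.functional as F

    scores_short = torch.randn(1, 32)    # attention logits for a 32-frame clip
    scores_long = torch.randn(1, 128)    # attention logits for a 128-frame clip

    w_short = F.softmax(scores_short, dim=1)
    w_long = F.softmax(scores_long, dim=1)
    print(w_short.sum().item(), w_long.sum().item())   # both sum to ~1.0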

Dear Reviewer #3,

We are grateful for your constructive feedback, which will help us improve the quality of our paper. We would like to comment on the suggestion to use 3D convolution. While 3D models can effectively aggregate information from adjacent frames using three-dimensional convolution, our research focuses on utilizing 2D ultrasound frames to enhance the classification capability for ultrasound videos. As each frame contributes differently to discrimination, we used 2D convolution for feature extraction to avoid information confusion between frames and to preserve their distinctive contributions. Moreover, our attention mechanism fuses multiple frames, so it would be inappropriate to compare our model directly with purely 2D models; instead, our method is more akin to 2+1D or spatiotemporally separable methods.
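
As a shape-level sketch of this distinction (hypothetical code, not from the paper): a 3D convolution mixes neighbouring frames along the temporal axis, whereas a shared 2D convolution applied frame by frame keeps each frame’s features independent until the attention step fuses them.

    import torch
    import torch.nn as nn

    video = torch.randn(1, 3, 16, 64, 64)                  # (B, C, T, H, W)

    conv3d = nn.Conv3d(3, 8, kernel_size=3, padding=1)
    out3d = conv3d(video)                                  # (1, 8, 16, 64, 64); each output
                                                           # frame sees its temporal neighbours

    conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
    frames = video.permute(0, 2, 1, 3, 4).flatten(0, 1)    # (B*T, C, H, W)
    out2d = conv2d(frames)                                 # (16, 8, 64, 64); frames
                                                           # processed independently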


