Authors
D. Hudson Smith, John Paul Lineberger, George H. Baker
Abstract
Many medical ultrasound video recognition tasks involve identifying key anatomical features regardless of when they appear in the video, suggesting that modeling such tasks may not benefit from temporal features. Correspondingly, model architectures that exclude temporal features may have better sample efficiency. We propose a novel multi-head attention architecture that incorporates these hypotheses as inductive priors to achieve better sample efficiency on common ultrasound tasks. We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings: one where we expect not to require temporal features and one where we do. In the former setting, our model outperforms the 3D CNN, especially when we artificially limit the training data. In the latter, the outcome reverses. These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime. Code is available at https://github.com/MedAI-Clemson/pda_detection.
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43895-0_70
SharedIt: https://rdcu.be/dnwzC
Link to the code repository
https://github.com/MedAI-Clemson/pda_detection
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
The paper explores which kinds of temporal interdependencies are required to analyse medical ultrasound data. The authors use their hypotheses about the data analysis to design different architectures and explore their use on appropriate tasks. The results confirm the authors’ hypotheses and round out the paper.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is extremely well written. Research ideas are clear and easy to follow. Architectural considerations follow clear hypotheses and are extremely convincing. Experimental validation is solid and appropriate.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I could not spot any major weaknesses
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The software will be published as open source, so the work should be reproducible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Paper is extremely well written and it was a pleasure to review. It is clearly the best paper in my stack.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
8
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
All requirements of a great MICCAI Paper are met. I hope the other reviewers will agree on this.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #2
- Please describe the contribution of the paper
The authors study machine learning based ‘recognition’ from ultrasound image data sequences. They consider image-level detection of Patent Ductus Arteriosus and ejection fraction prediction from echocardiography. For the image-level detection task, their approach, which treats frames as independent of order, performs well, while a spatio-temporal architecture is preferable for predicting the ejection fraction. This is not surprising.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors illustrate that the network architecture should be chosen according to the problem at hand. They demonstrate that when temporal features are irrelevant for interpreting a ‘video’, they will not add to the performance.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The title does not quite match the study and results. Essentially, the authors are pooling over a set of frames for a classification problem without temporal meaning. It is not surprising that this kind of ‘ensembling’ results in (relatively) good performance. However, choosing a task that does not depend on temporal features to conclude that temporal features are not relevant for the task does not seem to be helpful.
There are further limitations in the comparison, particularly regarding the size of the respective models which will have an impact on training and performance for different tasks. A careful ablation study considering size and backbone would potentially provide interesting insights.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Should be okay if code is provided.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The authors make a rather bold and general statement regarding ‘time-independent’ models being preferable for recognition in US data. As indicated above, I am not surprised by their results and I don’t quite see what we learn. If ultrasound (US) images do not show some time-varying process, their order does not carry much meaning. This is quite natural in some US imaging scenarios, e.g., as clinicians move the probe to obtain a better view of an otherwise stationary structure. For the 2D ultrasound studied by the authors, this implicitly provides spatial context without temporal meaning. Their pooling approach could also be considered a kind of ‘ensembling’ over different hypotheses from a set of images. It seems reasonable for the task and the improvements would be expected. However, I don’t see what we learn from comparison to 2D+t methods which try to interpret time. Also, the size of the models is likely different and this may contribute to the relative performance. As an ablation w.r.t. different model sizes is missing, it is hard to interpret the few results provided.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
3
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The results are limited and neither the methods nor the conclusions carry much novelty.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
3
- [Post rebuttal] Please justify your decision
The authors state that they want to demonstrate that the wrong choice of an ML approach for US video recognition may lead to poor results. This is quite obvious. Multi-head attention is not novel at all, and with results for two problems and essentially two architectural choices (as explained by the authors, max and average pooling are special cases), I still don’t think this manuscript provides any general new insights. It also remains unclear whether their approach performs well for PDA, given that simple average pooling is very close. The paper is well written, but the content is not convincing. It would be great for an applied ultrasound conference, but not for MICCAI.
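One way to read the remark that max and average pooling are special cases is sketched below in Python. It assumes attention pooling is a softmax-weighted average of frame embeddings; the functions `attention_pool` and `sharp_pool` are hypothetical and are not taken from the paper or its repository. With equal scores the weights are uniform and the result is the frame average; with per-feature scores that grow with the feature value, the weights concentrate on the largest value and the result approaches the per-feature maximum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Weighted average of frame embeddings, with softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def sharp_pool(frames, sharpness):
    """Per-feature attention whose scores are sharpness * feature value.

    As sharpness grows, each feature's weights concentrate on the frame
    with the largest value, so the output approaches max pooling.
    """
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        column = [f[d] for f in frames]
        weights = softmax([sharpness * x for x in column])
        pooled.append(sum(w * x for w, x in zip(weights, column)))
    return pooled


# Three toy frame embeddings with two features each (illustrative values only).
frames = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]

# Equal scores give uniform weights, i.e. average pooling.
mean_like = attention_pool(frames, [0.0, 0.0, 0.0])

# Very sharp per-feature scores approximate max pooling.
max_like = sharp_pool(frames, sharpness=50.0)

print(mean_like, max_like)
```

A learned attention mechanism sits between these two extremes, which is consistent with the observation that average pooling alone can come close on some tasks.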
Review #3
- Please describe the contribution of the paper
The paper provides a time-independent ultrasound video recognition method based on a multi-head attention architecture for the tasks of PDA detection and EF prediction, and analyzes the effect of temporal features on model performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper provides a perspective on ultrasound video recognition by analyzing the effectiveness of temporal features in different tasks. The paper provides a novel US Video Network (USVN) that treats frames as independent and unordered and combines information from multiple frames using a novel multi-head attention mechanism. The multi-head attention, cast in a MIL formalism, focuses on different subspaces of the image-level embeddings extracted by a shared encoder.
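As a rough illustration of the mechanism described in this review, the following Python sketch pools a set of frame embeddings with several attention heads, each acting on its own slice of the embedding. It is a minimal sketch and not the authors' USVN implementation: the function names, the linear scoring vectors (`head_queries`), and the way the embedding is split across heads are all assumptions made for illustration. Because the pooled output is a weighted sum over frames, shuffling the frames leaves it unchanged up to floating-point rounding.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention_pool(frames, head_queries):
    """Pool an unordered set of frame embeddings into one video embedding.

    frames: list of embeddings, each of length num_heads * head_dim
            (standing in for the output of a shared image encoder).
    head_queries: one scoring vector of length head_dim per head
                  (hypothetical learned parameters).
    """
    head_dim = len(head_queries[0])
    pooled = []
    for h, query in enumerate(head_queries):
        # Each head attends over its own subspace of the frame embeddings.
        sub = [f[h * head_dim:(h + 1) * head_dim] for f in frames]
        scores = [sum(q * x for q, x in zip(query, s)) for s in sub]
        weights = softmax(scores)
        # Attention-weighted average over frames; frame order plays no role.
        for d in range(head_dim):
            pooled.append(sum(w * s[d] for w, s in zip(weights, sub)))
    return pooled


random.seed(0)
num_frames, num_heads, head_dim = 5, 2, 3
frames = [[random.gauss(0, 1) for _ in range(num_heads * head_dim)]
          for _ in range(num_frames)]
head_queries = [[random.gauss(0, 1) for _ in range(head_dim)]
                for _ in range(num_heads)]

video_embedding = multi_head_attention_pool(frames, head_queries)

# Shuffling the frames should give the same embedding (up to rounding).
shuffled = frames[:]
random.shuffle(shuffled)
shuffled_embedding = multi_head_attention_pool(shuffled, head_queries)
print(all(math.isclose(a, b, abs_tol=1e-9)
          for a, b in zip(video_embedding, shuffled_embedding)))
```

In a full model, a classification or regression head would follow the pooled embedding; that part is omitted here.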
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The comparison experiments are confusing. First of all, the main idea of this work is that a time-independent model based on multi-head attention will have better sample efficiency. The paper only compares with a 3D CNN, not with other common temporal feature extraction methods such as those based on RNNs. The role of temporal features depends heavily on the intrinsic properties of the task, such as the color blood region for PDA detection and the ED and ES frames for EF prediction. From this viewpoint, the paper’s finding is not new. The motivation and the contribution are not clear enough. State-of-the-art methods for the two tasks should be included for comparison so that convincing conclusions can be drawn. The result in the top panel of Figure 2 shows that the proposed method is more efficient in the low-data regime. Can you explain the underlying reason? It seems that the method has no specific design to deal with limited training data.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
NA
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Make the motivation and the contribution of the paper clearer. Add a comparison with existing state-of-the-art methods. Reorganize the content in Section 2. Other detailed comments: In the introduction, the description of the innovation/contribution of this paper is not clear enough. In Section 2.1, the description of the multi-head attention processing could be clearer. In Section 2.2, the first paragraph lacks focus; it is not related to the subsection title on benchmark implementation and makes the content difficult to understand. Finally, what is meant by “The advantage of the nature of common US recognition tasks”?
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
4
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The contribution is not clear, and there is no comparison with SOTA methods.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
6
- [Post rebuttal] Please justify your decision
My previous concern about the motivation can be addressed in the revised paper. As for the experiments, the authors’ response is convincing to me.
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
This manuscript has several issues that are raised by the reviewers. A key concern is the potential mismatch between the title and the content of the manuscript, with the authors’ choice of tasks for evaluation seeming to favor their proposed model. This manuscript might benefit from a more diversified comparison, considering factors such as the size of the respective models. Further concerns include bold and potentially overly general statements about time-independent models for US data recognition. As pointed out, this might not be entirely surprising considering the nature of some US imaging scenarios, and thus, it’s unclear what new knowledge is being provided here. Moreover, the lack of comparison with state-of-the-art methods and other common temporal feature extraction methods, such as RNNs, limits the paper’s scope and depth. This limitation, coupled with some sections’ lack of clarity, detracts from the overall impact of the paper. To improve the manuscript’s quality, the authors should address the following key points in their rebuttal:
- Reconsider the title to better reflect the manuscript’s content.
- Provide more diversified comparisons, including variations in model sizes and structures.
- Clarify the implications of your findings regarding time-independent models and US data recognition.
- Clarify the manuscript’s motivation and contribution, possibly through a reorganization of the content, especially in Section 2.
Author Feedback
We thank all reviewers for their thoughtful comments in response to our manuscript. Concerns fall into two main areas: 1) lack of clarity around the motivation and contribution of the manuscript and 2) concerns about the experimental design. We will speak to these concerns and argue that our manuscript still contributes something meaningful to the MICCAI community.
Firstly, the reviews pointed out a lack of clarity on the purpose of this study. For instance, Reviewer 2 stated, “I don’t see what we learn from comparison to 2D+t methods which try to interpret time?”. Reviewer 3 said simply that “The motivation and the contribution are not clear enough.” Our purpose was not to suggest that a time-independent approach like the one we propose is appropriate for all ultrasound tasks. Instead, we wanted to demonstrate that under some ordinary circumstances temporal dependence is not essential, provide a clear rationale for why, and then present an effective architecture for that situation. We included the EchoNet task to offset the impression that time-independent methods are appropriate for all scenarios. This point comes through clearly in our “Conclusions and Discussion” section. In our experience, many applications of deep learning to medical ultrasound use temporal models as the default simply because of the video data structure. We hope that our concrete demonstration that this can lead to poor sample efficiency, together with our model architecture, can be a constructive work that helps applied researchers improve efficiency in data-constrained settings. Upon reviewing our introduction, we agree with the reviewers that this purpose is unclear. We propose to address this by 1) rewording the title to avoid the impression that we oppose temporal features in all cases, and 2) rewording the introduction (first two paragraphs and last paragraph) to emphasize the aforementioned points.
Secondly, the reviews expressed concerns about our experimental design. For instance, Reviewer 2 raised the concern that the differences in sample efficiency might arise from the size of the respective models, while Reviewer 3 argued that it is necessary to compare with a greater variety of temporal models than R(2+1)D alone. Regarding Reviewer 2’s concern, the R(2+1)D network we used had roughly 31.5M parameters, while the ResNet-50 backbone had 25.6M. For R(2+1)D, the temporal and spatial convolutions are pretrained on a large human action recognition dataset. Though the difference in parameters may explain some differences in sample efficiency, it doesn’t adequately explain the contrast between the PDA task (where USVN excels) and the EchoNet task (where USVN suffers). It is more natural to explain this difference in terms of the temporal independence prior built into USVN, which aids training for one task but leads to systematic error for the other. We agree that this rationale could be expressed more clearly in the paper and will add this language to our discussion. In response to Reviewer 3’s question, while we agree that additional reference models (and tasks for that matter) would strengthen our point, we still believe that the comparisons in our current manuscript sufficiently demonstrate our primary hypothesis. R(2+1)D has proven to be a very efficient model across a wide range of US tasks, so we consider it a strong baseline. We also believe that clarifying our motivation, as described previously, will help explain the relevance of our experimental design.
In conclusion, our concrete demonstration of the importance of tailoring the model architecture to the US task at hand and the introduction of USVN together form a constructive contribution to computer vision applications to Ultrasound. On the other hand, we agree with the reviewers that the clarity of the document can be improved in some key areas. We believe relatively minor changes to the language can improve these areas. We again thank our reviewers for their constructive comments.
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
I think this paper should be rejected, as I (and the reviewers) agree that the authors’ choice of tasks for evaluation seems to favor their proposed model.
In the authors’ response, they explained that the proposed work is designed around a specific task, and it works under specific circumstances. “we wanted to demonstrate that under some ordinary circumstances temporal dependence is not essential, provide a clear rationale for why, and then present an effective architecture for that situation.”
Given that the paper does not have any technical novelty, and the fact it is task-specific, I do not think it provides enough contribution to the MICCAI community. Also, it seems reviewers did not change their score/rating, and they both decided to reject the paper.
Finally, I want to note that this paper got an average score of 5 (which, according to the guideline, should be accepted). Please note that this paper received an average of 5 only because the first reviewer gave a definite accept, while the other two reviewers gave a reject and a weak reject.
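The average quoted in this meta-review can be checked from the scores listed on this page. The short Python snippet below does that arithmetic for the initial scores and for the post-rebuttal entries; treating Review #1's post-rebuttal entry of N/A as an unchanged score of 8 is an assumption.

```python
from statistics import mean

# Initial scores as listed under each review on this page.
initial_scores = {"Review #1": 8, "Review #2": 3, "Review #3": 4}

# Post-rebuttal entries; Review #1 is listed as N/A and is assumed unchanged.
post_rebuttal_scores = {"Review #1": 8, "Review #2": 3, "Review #3": 6}

initial_mean = mean(initial_scores.values())
post_rebuttal_mean = mean(post_rebuttal_scores.values())

print(f"initial mean: {initial_mean:.2f}")
print(f"post-rebuttal mean: {post_rebuttal_mean:.2f}")
```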
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The rebuttal addressed most of the critical concerns raised by the reviewers, including clarification of the motivation and the experiments. Although concerns about limited new technical insight remain, given the contributions of the paper, I recommend acceptance.
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The main contribution of this paper is a demonstration that video recognition tasks do not always require or benefit from the default use of temporal models. The paper is definitely well written but the contributions remain unclear. I do not think the findings would be surprising to the MICCAI community, which is well-versed with the need to tailor model architecture to the task at hand. The novelty of a multi-head attention mechanism (USVN) is also limited.
In my impression, making this work ready for publication would require either additional methodological novelty or a repositioning as a comprehensive characterization study (e.g., across more diverse video recognition tasks and network types). My recommendation is to reject.