
Authors

Haojun Yu, Youcheng Li, QuanLin Wu, Ziwei Zhao, Dengbo Chen, Dong Wang, Liwei Wang

Abstract

During ultrasonic scanning processes, real-time lesion detection can assist radiologists in accurate cancer diagnosis. However, this essential task remains challenging and underexplored. General-purpose real-time object detection models can mistakenly report obvious false positives (FPs) when applied to ultrasound videos, potentially misleading junior radiologists. One key issue is their failure to utilize negative symptoms in previous frames, denoted as negative temporal contexts (NTC). To address this issue, we propose to extract contexts from previous frames, including NTC, with the guidance of inverse optical flow. By aggregating extracted contexts, we endow the model with the ability to suppress FPs by leveraging NTC. We call the resulting model UltraDet. The proposed UltraDet demonstrates significant improvement over previous state-of-the-art methods and achieves real-time inference speed. We release the code, checkpoints, and high-quality labels of the CVA-BUS dataset used in our experiments at https://github.com/HaojunYu1998/UltraDet.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_1

SharedIt: https://rdcu.be/dnwJj

Link to the code repository

https://github.com/HaojunYu1998/UltraDet

Link to the dataset(s)

https://github.com/HaojunYu1998/UltraDet


Reviews

Review #4

  • Please describe the contribution of the paper

    This paper proposes a real-time lesion detection model for ultrasound videos. The approach addresses the issue of false positives that arises when applying frame-level models to ultrasound videos. The model leverages negative temporal contexts to mitigate false positives by extracting and aggregating relevant contextual information from previous frames. The aggregation approach relies on a novel optical-flow based method for aligning information from the context frames and the current frame. The proposed model significantly outperforms previous works and achieves real-time inference speed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well organized, clear, and has good use of visualization. Specifically, the network architecture diagrams are logically laid out with all of the necessary details needed to understand their moderately complicated architecture approach with a bit of work on the reader’s part. This plus the lucid writing greatly aid in general understanding and potentially with reproducibility.
    2. The optical flow-based feature alignment technique is novel and could have applications beyond the dataset under consideration. The core contribution is the idea of aligning information from previous frames with the current frame in order to make real time inferences. Based on the ablation study, this is very effective in the current case, and I expect this simple idea can be applied in many situations.
    3. The authors performed a very thorough comparison with state-of-the-art techniques and an adequate ablation study of their core architecture contributions.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Though the method seems quite general, the authors only present results for the CVA-BUS dataset. This weakens the broader impact of the paper somewhat because it leaves open the possibility that the NTCA idea is highly specialized to the real-time lesion detection problem. To their credit, the authors do not make unsupported claims about the applicability of the model in more general scenarios.
    2. The authors introduce new annotations for the CVA-BUS dataset, but give almost no information about how the labels are produced.
    3. The authors report on the inference speed of their model but give very little detail on how this speed was measured. What hardware was used?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work is highly reproducible. The authors use a public dataset. Though they produce custom annotations, the annotations will be made available (are included in provided supplementary material). Code is likewise available. Lastly, paper is detailed and clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • In the introduction, the reference to Figure 1(b) should probably be to Figure 1(a).
    • In the inverse optical flow align section, the reference to Figure 1(a) should probably be to Figure 1(b).
    • In Sec. 4.3 (UltraDet settings), the paper claims that FlowNet generalizes well to US datasets without citing a source. Citation(s) should be supplied.
    • In Figure 4(b), the red and blue boxes are almost invisible. They should be made more prominent.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    None of the weaknesses described significantly undermine the contributions of the paper. The novel architecture, thorough execution, and clear presentation make this a valuable work for the community.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes to aggregate features of the current and previous frames in ultrasound videos for real-time lesion detection. The key idea is to use negative temporal contexts, particularly for false positive suppression. Experiments conducted on the CVA-BUS dataset show better performance of the proposed approach compared with existing approaches for object detection. The effectiveness of incorporating negative temporal contexts is demonstrated using ablation studies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of aggregating features of the current and previous frames in ultrasound videos for lesion detection is interesting.
    • Authors have refined the labels of the publicly available data and conducted relevant experiments to convincingly demonstrate the performance of the proposed method.
    • The method is able to achieve near real-time performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper claims significant improvements without conducting any hypothesis tests, which is not justified.
    • Section 3.2 mentions “We sample T_ctxt context frames from T previous frames”. How the context frames are selected is not described.
    • Section 3.3 mentions “To improve training efficiency, we apply auxiliary losses L_aux = L to all previous T frames.” No further detail is provided on the auxiliary losses. How do the auxiliary losses improve efficiency?
    • More details should have been provided on the role of the relation operation in Eq. (2) and on why more than one NTCA module is required.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is shared.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Some important details and justifications are missing; adding them would improve the quality of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has merit; however, some important information and details are missing.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Authors propose a novel Negative Temporal Context Aggregation (NTCA) module, imitating radiologists’ diagnosis processes to suppress false positives.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper are the proposed UltraDet model and the Negative Temporal Context Aggregation (NTCA) module, which addresses the challenge of real-time lesion detection in ultrasound videos. The model leverages negative temporal contexts (NTC) from previous frames, including TC of lesion-like regions exhibiting negative symptoms, resulting in improved detection performance and significant reduction in false positives (FPs). This is a novel formulation as previous works have only considered inter-object relationships and not utilized NTC. The NTCA module extracts TC features by applying inverse optical flow to the original regular grids in previous frames, which is an original way to use data. The proposed UltraDet model provides reliable and interpretable improvement in real-time lesion detection in ultrasound videos and achieves real-time inference speed. The paper also demonstrates the clinical feasibility of the proposed model, which is essential for accurate cancer diagnosis. The study releases code, checkpoints, and high-quality labels of the CVA-BUS dataset to facilitate future research. The evaluation metrics used in the paper are also particularly strong, including frame-level precision values, lesion-level FP rates, AP50, and R@16. Overall, the proposed UltraDet model and NTCA module are novel and have strong potential to assist radiologists in more accurate cancer diagnosis in clinical practice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The study focuses on addressing the clinical challenge of real-time lesion detection in ultrasound videos and proposes a novel model to improve the detection performance and reduce false positives. The document also includes a discussion of related works and an ablation study that evaluates the impact of each module on the detection performance. However, referring to other relevant studies or conducting further research is recommended to identify any potential weaknesses or limitations of the proposed model.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Given the released code, reproducibility seems to be guaranteed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors should be commended for their comprehensive and innovative approach to address the challenge of real-time lesion detection in ultrasound videos through the proposed UltraDet model. The paper is well-written and structured, and the methodology is described thoroughly, making it easy to understand their proposed solution. The use of negative temporal contexts (NTC) to suppress false positives (FPs) is particularly interesting, and the NTCA module’s effective application is also commendable. Additionally, the release of high-quality labels of the CVA-BUS dataset and code checkpoints is a significant contribution to future research. However, there are some aspects that the authors should consider to improve their work. Firstly, the dataset used in the experiments is relatively small, and further evaluation on larger and more diverse datasets will enhance the generalizability of the UltraDet model’s performance. Secondly, although the results presented indicate that the UltraDet model outperforms previous state-of-the-art models, the comparison is limited to just a few models. A broader comparison with more methods would increase the study’s robustness and credibility. Lastly, the ablation study results are not discussed in-depth, and the authors should provide more details to support their argument. Overall, the authors’ work is well-conceived and executed, and their contributions are valuable to the field. The suggested improvements will help enhance the paper’s quality and strengthen its impact.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed UltraDet model provides a novel solution to the challenge of real-time lesion detection in ultrasound videos, leveraging negative temporal contexts to suppress false positives and outperforming previous state-of-the-art models.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper proposes a new model called UltraDet for real-time ultrasound lesion detection that utilizes negative temporal contexts (NTC) to suppress false positives (FPs) caused by non-lesion anatomies. The model extracts temporal contexts from previous frames using inverse optical flow and aggregates them to suppress FPs by leveraging NTC. The proposed method demonstrates significant improvement over previous state-of-the-art models in terms of reducing FPs while achieving real-time inference speed. The authors have also released the code, checkpoints, and high-quality labels of the CVA-BUS dataset used in the experiments to facilitate future research.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed UltraDet model is a novel method for real-time lesion detection in ultrasound videos that leverages negative temporal contexts to suppress false positives. Evaluated on the CVA-BUS dataset, it is shown to significantly outperform previous state-of-the-art methods, reducing false positives by more than 50% at a recall rate of 0.90. The paper explores real-time ultrasound video lesion detection, which is an underexplored area compared to lesion detection in still images or offline videos. Real-time lesion detection can assist radiologists in accurate cancer diagnosis during scanning, and the proposed method achieves real-time inference speed, making it clinically feasible. To facilitate future research, the authors have released the code, checkpoints, and high-quality labels of the CVA-BUS dataset used in the experiments. They also reproduced all baselines using the high-quality labels to ensure a fair comparison, which is a lot of work. The paper as a whole reads smoothly and stays focused.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper does not compare the proposed UltraDet model with other methods that are specifically designed for ultrasound lesion detection, which makes it difficult to assess the overall performance of the proposed method in the context of ultrasound images. The proposed method is evaluated on a single dataset, CVA-BUS, which may limit the generalizability of the results to other datasets or clinical scenarios. I think some core parts of the methodology should be supplemented. For example, in Fig. 1, are the “ROI regions” indeed the predicted boxes Bτ mentioned in the following texts? How does the method separate the ROI region into multiple ones to conduct inverse optical flow? What is the number of the separated regions for each frame and why? The Temporal Relation Module is also very important for readers to understand the methodology. I think it should be described in a more detailed way.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have released the code, checkpoints, and high-quality labels of the CVA-BUS dataset used in the experiments. The paper is highly reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    There are some mistakes (not grammatical ones) in the paper that may mislead readers. For example, on page 1, “the red box in Figure 1(b)” should be “the red box in Figure 1(a)”.

    Some statements in the paper confuse me; I hope the authors can explain them in detail or correct the potential mistakes:
    • On page 4, the paper says “The RPN generates proposals consisting of boxes Bτ and proposal features Qτ using RoI Align and average pooling”. Why, then, does Equation (1) express that Fτ and Bτ generate Qτ?
    • In Fig. 2, all previous T frames provide temporal information for the inference of frame I_t, so why does the caption say that “The yellow and green frames are sampled as context frames”, with no mention of the red frames?

    The authors only briefly mention the dataset used in the experiments without providing sufficient information. It would be helpful to provide more details about the dataset, such as the number of patients, the number of lesions, and the type of lesions. By addressing the issues mentioned above, the authors can further improve the quality of this work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Comprehensively taking into account the contributions, strengths and weaknesses of this paper, I confirm the score of this paper. In general, the proposed UltraDet model is a novel method for real-time lesion detection in ultrasound videos that leverages negative temporal contexts (NTC) to suppress FPs.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All four reviewers recommend accepting this work. Having checked the paper, I also vote for acceptance.




Author Feedback

[To Reviewer 1] Inputs of the RPN are feature maps Fτ and the outputs are proposals (RoI regions Bτ and corresponding features Qτ) and Eq(1) indicates how Qτ are extracted from Fτ using boxes Bτ. We do not separate Bτ into multiple ones to conduct IOFAlign. We calculate the inverse optical flow between Ft and Fτ, warp the features and extract RoI-level features from the calibrated feature map. The Temporal Relation module uses the Relation operation to extract inter-frame inter-object relation information, which is similar to the Attention operation, except the proposal features and box positions are calculated separately. The number of patients and the type of lesions are not mentioned in the CVA-Net. We think the number of patients is the same as the number of videos and the lesion type is mass.
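The "warp, then extract RoI-level features" order described in this reply can be illustrated with a toy example. The following is a minimal single-channel sketch in plain Python, not the authors' IOFAlign implementation: the function names, the `(dy, dx)` flow convention, the border clamping, and the integer boxes are all assumptions made for illustration, and a real model would operate on multi-channel tensors with RoI Align.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2-D map (list of rows) at a fractional location, clamped to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def warp(feat, flow):
    """Backward warp: output[y][x] is feat sampled at (y + dy, x + dx)."""
    h, w = len(feat), len(feat[0])
    return [[bilinear_sample(feat, y + flow[y][x][0], x + flow[y][x][1])
             for x in range(w)] for y in range(h)]

def roi_average(feat, box):
    """Average the values inside an integer box (y0, x0, y1, x1), end-exclusive."""
    y0, x0, y1, x1 = box
    vals = [feat[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)

# Hypothetical previous-frame feature map with one active cell at (row 1, col 2).
prev_feat = [[0.0] * 4 for _ in range(4)]
prev_feat[1][2] = 1.0

# Hypothetical flow: every current-frame location looks one column to the right.
flow = [[(0.0, 1.0) for _ in range(4)] for _ in range(4)]

aligned = warp(prev_feat, flow)              # active cell moves to (row 1, col 1)
roi_feature = roi_average(aligned, (0, 0, 2, 2))
```

In this toy setup the box is applied only once, to the already-aligned map, which mirrors the statement that the boxes are not split into multiple regions before alignment.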

[To Reviewer 2] We conduct hypothesis tests on Pr80, Pr90, FP80 and FP90 of BasicDet and UltraDet and the p-value < 0.0005 holds for all. We stack the NTCA module in each block to increase the expressiveness of the network. Auxiliary losses can provide more supervision signals in each iteration to speed up convergence. We will supply ablation studies on the Relation operation, the number of NTCA modules and the auxiliary losses.
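The reply does not state which hypothesis test was used. As one common option for comparing two detectors evaluated on the same videos, a paired sign-flip permutation test can be sketched as follows. The function name and the per-video counts are hypothetical and serve only to show the mechanics; they are not results from the paper.

```python
from itertools import product

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null hypothesis, each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Made-up false-positive counts per video for two hypothetical models.
baseline_fp = [5, 7, 6, 9, 4, 8]
proposed_fp = [3, 4, 5, 6, 2, 5]
p_value = paired_sign_flip_pvalue(baseline_fp, proposed_fp)
```

With only six pairs the smallest achievable two-sided p-value is 2/64, so reaching a threshold such as 0.0005 requires considerably more paired samples. The exact enumeration grows as 2^n, so larger samples would call for random sign flips instead.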

[To Reviewer 3] As a potential weakness, the ability to utilize NTC from the distant past is limited by the real-time inference speed requirement. We will conduct more ablation studies; other concerns are discussed in [To All Reviewers].

[To Reviewer 4] We apologize for the missing citations in Sec 4.3. We list some relevant works [4,12,13] in Sec 2 and will correctly cite them in Sec 4.3. Although their tasks are not detection, they still indicate that FlowNet generalizes well to US datasets. We will make the red and blue boxes more prominent in Fig 4(b). Thank you for your advice. For the new annotations, an experienced radiologist first annotates the videos in CVA-BUS. Then, we review the differences with another expert to determine the final annotations. For inference speed, we use one NVIDIA GeForce RTX 3090 GPU to run inference on the test set and exclude the influence of the data loader.
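Excluding the data loader from a speed measurement usually means loading all inputs before the clock starts and timing only the model calls. The sketch below shows that pattern under stated assumptions: `measure_fps`, the warm-up count, and the stand-in model are hypothetical, and this is not the authors' benchmarking script.

```python
import time

def measure_fps(model, frames, warmup=2):
    """Time only the model calls on pre-loaded frames; return frames per second."""
    for frame in frames[:warmup]:
        model(frame)  # warm-up calls are not timed
    start = time.perf_counter()
    for frame in frames:
        model(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Stand-in for a detector: any callable that does some work per frame.
def dummy_model(frame):
    return sum(range(1000)) + frame

preloaded_frames = list(range(50))  # loaded before timing starts
fps = measure_fps(dummy_model, preloaded_frames)
```

On a GPU, kernels run asynchronously, so a measurement of this kind would also need a device synchronization (for example `torch.cuda.synchronize()` in PyTorch) before each clock reading; otherwise the timer may stop before the work has finished.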

[To All Reviewers] We sincerely appreciate the hard work of the dedicated reviewers in reviewing our paper and providing valuable feedback. We are grateful for having you working alongside us during this journey. We apologize for the typos: (1) in Sec 1, the reference to “Figure 1(b)” should be “Figure 1(a)”; (2) in Sec 3.2, the reference to “Figure 1(a)” should be “Figure 1(b)”; (3) in Sec 4.1, the number of lesions in the test split should be “39” instead of “149”. For methodology, we randomly sample T_ctxt frames (illustrated as yellow and green frames in Fig. 2) from the T previous frames to provide context information in the NTCA module. All T previous frames provide temporal information in the Temporal Relation module, which is different from the NTCA module. For experiments, we reproduce CVA-Net and Track-YOLO as ultrasound-specific baselines, and we look forward to your further feedback about important baselines that we missed. We recognize that conducting experiments on only CVA-BUS limits the generalizability of our method. However, to the best of our knowledge, other ultrasound datasets consist of 2D images. We will verify the effectiveness of UltraDet on new US datasets as soon as they are released and look forward to further discussion with you about the potential usage of the NTCA idea. Finally, we want to thank all the reviewers again for your helpful and insightful feedback. We are pleased that our work was recognized as a novel real-time ultrasound lesion detection method and we will work hard to broaden its impact for social good in the future. Please let us know if you have any further questions or suggestions; we would be glad to discuss them.
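The random selection of T_ctxt context frames from the T previous frames can be sketched as below. This is an illustration only: the function name, sampling without replacement, and the handling of the first few frames of a video (where fewer than T previous frames exist) are assumptions that the feedback does not specify.

```python
import random

def sample_context_indices(t, num_prev, num_ctxt, rng):
    """Pick up to num_ctxt distinct indices from the num_prev frames preceding frame t."""
    start = max(0, t - num_prev)
    candidates = list(range(start, t))
    k = min(num_ctxt, len(candidates))
    return sorted(rng.sample(candidates, k))

rng = random.Random(0)  # seeded only to make the example repeatable
context = sample_context_indices(t=10, num_prev=8, num_ctxt=2, rng=rng)
early = sample_context_indices(t=1, num_prev=8, num_ctxt=2, rng=rng)
```

Under these assumptions, the context frames are a random subset of the window, while a module that uses all previous frames would simply take every index in `range(start, t)`.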

With best regards, Authors of Paper197


