
Authors

Zhi Lin, Junhao Lin, Lei Zhu, Huazhu Fu, Jing Qin, Liansheng Wang

Abstract

Breast lesion detection in ultrasound is critical for breast cancer diagnosis. Existing methods mainly rely on individual 2D ultrasound images or combine unlabeled video and labeled 2D images to train models for breast lesion detection. In this paper, we first collect and annotate an ultrasound video dataset (188 videos) for breast lesion detection. Moreover, we propose a clip-level and video-level feature aggregated network (CVA-Net) for addressing breast lesion detection in ultrasound videos by aggregating video-level lesion classification features and clip-level temporal features. The clip-level temporal features encode local temporal information of ordered video frames and global temporal information of shuffled video frames. In our CVA-Net, an inter-video fusion module is devised to fuse local features from original video frames and global features from shuffled video frames, and an intra-video fusion module is devised to learn the temporal information among adjacent video frames. Moreover, we learn video-level features to classify the breast lesions of the original video as benign or malignant lesions to further enhance the final breast lesion detection performance in ultrasound videos. Experimental results on our annotated dataset demonstrate that our CVA-Net clearly outperforms state-of-the-art methods. The corresponding code and dataset are publicly available at https://github.com/jhl-Det/CVA-Net.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_59

SharedIt: https://rdcu.be/cVRuK

Link to the code repository

https://github.com/jhl-Det/CVA-Net

Link to the dataset(s)

https://pan.baidu.com/s/1yYME7-DvvIEZzCb72NXaJA?pwd=jnie


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors report on the development of a network for breast lesion detection in ultrasound video clips. In addition, the authors collected and annotated a video dataset consisting of 188 videos for use in breast lesion detection and classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The dataset and its annotations would be of value to many investigators. The detection algorithm performs well in cases where the lesion is known to be in the video.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The use of the method for detecting a lesion when a lesion is known to be present is not clear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    OK

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Experimental results: In the introduction, you mentioned the need for detection and classification. However, for detection you would need to include video clips without lesions. Since all the training videos contained lesions, the task you are solving is detection in the case where a lesion is known to be present in the frames. It would be good to make that clear. Also, to use your algorithm, a physician would need to identify a lesion first and then use your method to detect it in the frame. Since the physician would need to collect a video of a region with a lesion, its identification has already been done and an algorithm is not needed. Thus, the value of a detection method would be if the algorithm could detect lesions in videos with and without lesions with a high true-positive rate and low false-negative rate. An explanation of the value of detecting a lesion in videos in which it is known that a lesion is present would help to clarify the value of the method.

    How many images were typically collected in the videos?

    How did the physicians annotate the lesions as benign or malignant? Were the lesions classified by a biopsy or visually? If visually, what is the variability in generating the annotations of the images?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The use of the method for detecting breast lesions in cases where it is known that a lesion is present requires justification.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors focus on deep learning (DL) for breast nodule detection and classification in sequences of ultrasound images. They provide a data set of 188 expert-annotated videos. They also design and propose a network that uses local information from consecutive US frames and from shuffled ones, aiming to use both local and global features. The network is well explained. The results are nicely analyzed through a comparative and an ablation study. The authors offer the data set and their code to the community.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is novel, well justified, and presented in detail. The data set is nicely gathered and annotated by experts. The experiments are well done and the results are valuable. The authors offer their data and their code to the scientific community.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For a conference paper, I do not see a major weakness. For a journal paper, more explanation of the cost functions would be nice.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There seems to be no problem with reproducibility. The random shuffling and then taking the consecutive frames within the randomly ordered frames could be improved. The fact that the authors will provide data and code definitely supports the reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please either try to better justify your choice of the random shuffling or consider providing a more systematic approach that ensures the method can always take advantage of the global features, if you strongly believe that these play crucial roles. Thanks for providing enough details and doing the ablation studies.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well justified and novel methodology. Excellent annotated data. Well done experiments and ablation studies. Offering the annotated data and code to the community.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents an empirical study on detection and classification of breast lesions in ultrasound videos using a transformer. The general topic of the paper is very interesting, especially in the domain of ultrasound videos. The authors further propose to release their annotated dataset upon acceptance of their work. The paper proposes inter- and intra-video fusion blocks, based on an attention mechanism, placed before the transformer block.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Releasing of the data upon acceptance
    • Use of transformers for lesion detection
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Use of private dataset
    • Lack of intuitive reasons for the proposed structure design, especially the proposed inter- and intra-VFB modules
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • No downloadable link was provided for the dataset used in the paper.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Despite the authors’ interesting comparison and ablation study, the fact that they used a private dataset, even though the dataset is supposed to be released upon acceptance, makes it difficult to judge the true effectiveness of the proposed structure, especially given the ever-growing number of studies in the field. Of course, half of the paper’s contribution is devoted to releasing the dataset; the other half is devoted to the proposed structure design, which lacks specific explanations of how and why the proposed modules are useful. Although the main focus of the paper is on ultrasound videos, it would be highly beneficial to the impact of the paper if the authors evaluated their proposed architecture design on different video datasets (either natural or medical).

    Because several aspects of the methodology were difficult to follow, giving more specifics would be advantageous. For example, it was stated that: “For a current video frame (I_t), our CVA-Net takes three neighboring images (denoted as I_k, I_{k−1}, and I_{k+1})”, and “Figure 3(c) shows the details of our intra-video fusion module, which further integrates three output features (P_{k−1}, P_k, and P_{k+1}) of inter-video fusion from three adjacent video frames.” The use of subscripts (k−1, k, k+1) and superscripts (1, 2, 3) for “neighboring images” and “adjacent video frames”, respectively, is confusing. If subscripts relate to consecutive frames/images, please make sure your manuscript is consistent in this regard.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The anticipated dataset release was the main reason of my comments. The justification of the effectiveness of the proposed structure is limited.

  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors contribute a new dataset of 188 ultrasound videos containing annotations for breast lesion detection and classification. A new network, CVA-Net, is proposed for the detection and classification of lesion and cancer categories. The paper is well written, and the dataset and proposed method are clearly presented. The experiments are thorough. The code and dataset that the authors plan to release with this paper are a valuable contribution for the research community. I recommend a provisional accept and encourage the authors to incorporate the reviewers’ feedback, clarifying their explanations and design choices to further improve the presentation of the paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

Dear ACs and Reviewers, We thank you very much for provisionally accepting our paper #1016 entitled “A New Dataset and A Baseline Model for Breast Lesion Detection in Ultrasound Videos”. We greatly appreciate the reviewers’ and meta-reviewers’ positive and constructive comments and suggestions on our manuscript. The responses to the reviewers’ comments are as follows:

Reviewer #1: Q1. How many images were typically collected in the videos? A: Our dataset has 188 videos with 25,272 images in total. The number of ultrasound images in each video varies from 28 to 413. We will clarify this in the final version. Q2. How did the physicians annotate the lesions as benign or malignant? Were the lesions classified by a biopsy or visually? If visually, what is the variability in generating the annotations of the images? A: Pathology results indicating whether lesions are benign or malignant were obtained by ultrasound-guided biopsy or surgical excision, and two pathologists with eight years of experience in breast pathology were invited to perform the histopathological examination and analysis.

Reviewer #2: Q1. Please either try to justify better your choice of the random shuffling or think of providing a more systematic approach making sure that the method can always take advantage of the global features, if you strongly believe that these play crucial roles. Thanks for providing enough details and doing the ablation studies. A: Thanks. Our method devises an inter-video fusion module to enhance the breast lesion detection performance on local frames from the original video using global feature information, which is extracted from frames of the shuffled video. Here, we simply apply a random shuffling operation to the original video to obtain the shuffled video, since random shuffling is an easy and effective way to acquire global long-range features. In addition, we apply data augmentation techniques with varied parameters to the selected frames of the shuffled video, which is equivalent to introducing regularization terms for training the model. In the ablation study and Table 2 of Section 3.2 of the original manuscript, we have already experimentally shown the effectiveness of applying random shuffling and the data augmentations. We consider a more systematic approach, or more complicated video shuffling operations, to be one of the future directions of our work.
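The clip sampling discussed in this exchange can be illustrated with a small sketch. It is not taken from the released CVA-Net code: the function name `sample_clip_indices`, the border clamping, and the choice of a random start position in the shuffled order are all assumptions made for illustration. It only shows the idea of pairing an ordered local clip around frame k with consecutive entries from a randomly shuffled frame order.

```python
import random


def sample_clip_indices(num_frames, k, clip_len=3, seed=None):
    """Hypothetical helper: return (local, global_) frame indices for frame k.

    local   : ordered neighbours of frame k (k-1, k, k+1 for clip_len=3),
              clamped at the video borders (an assumption of this sketch).
    global_ : clip_len consecutive entries taken from a random permutation
              of all frame indices, i.e. frames from the "shuffled video".
    """
    if not 0 <= k < num_frames:
        raise ValueError("k must index a frame of the video")
    if num_frames < clip_len:
        raise ValueError("video is shorter than the clip length")

    half = clip_len // 2
    # Ordered local clip centred on frame k.
    local = [min(max(k + off, 0), num_frames - 1)
             for off in range(-half, clip_len - half)]

    # Shuffle the frame order, then read consecutive positions from it.
    rng = random.Random(seed)
    shuffled = list(range(num_frames))
    rng.shuffle(shuffled)
    start = rng.randrange(num_frames - clip_len + 1)
    global_ = shuffled[start:start + clip_len]
    return local, global_


if __name__ == "__main__":
    # 28 is the shortest video length reported in the author feedback.
    local, global_ = sample_clip_indices(num_frames=28, k=10, seed=0)
    print("local clip :", local)
    print("global clip:", global_)
```

Because the global clip is read from a permutation, its frames are distinct and typically far apart in time, which is the sense in which they carry long-range context.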

Reviewer #3: Q1. Explore different video datasets (either natural or medical). A: Thanks. There is no annotated dataset for video breast lesion detection. Hence, this work collected the first high-quality video dataset by annotating each video frame. We hope that our new dataset and benchmark will promote the development of the breast lesion detection community. Meanwhile, we will try our best to explore different video datasets, and further work includes the collection of more data. Q2. “Because several aspects of the methodology were difficult to follow, giving more specifics would be advantageous. …The use of subscripts (k-1,k,k+1) and superscripts (1,2,3) for “neighboring images” and “adjacent video frames”, respectively, is confusing. If subscripts relate to consecutive frames/images, please make sure your manuscript is consistent” A: Thanks. First, we admit that there is a typo in “For a current video frame (I_t)”: I_t should be corrected to I_k. We will modify the manuscript to make the writing consistent. In our original manuscript, we use subscripts (k-1, k, k+1) to denote adjacent video frames or their features. Each feature (i.e., P_k) contains three CNN feature maps from an encoder, and thus we use the superscripts (1,2,3) to denote the three feature maps (i.e., P_k^1, P_k^2, P_k^3). We will double-check the final version to make the notation consistent and give more specifics in the methodology to make it easier to follow.
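The notation clarified in the response to Reviewer #3 can be written out compactly. The block below only restates the convention described in the response (subscripts index adjacent frames, superscripts index the three encoder feature maps); the set notation is a presentational choice and is not quoted from the manuscript.

```latex
\begin{align*}
  &I_{k-1},\; I_k,\; I_{k+1}
    && \text{three adjacent input frames, with } I_k \text{ the current frame} \\
  &P_{k-1},\; P_k,\; P_{k+1}
    && \text{inter-video fusion outputs for those frames} \\
  &P_k = \{\, P_k^{1},\; P_k^{2},\; P_k^{3} \,\}
    && \text{the three CNN feature maps that make up } P_k
\end{align*}
```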


