Paper Info

Authors

Junhao Lin, Qian Dai, Lei Zhu, Huazhu Fu, Qiong Wang, Weibin Li, Wenhao Rao, Xiaoyang Huang, Liansheng Wang

Abstract

Breast lesion segmentation in ultrasound (US) videos is essential for diagnosing and treating axillary lymph node metastasis. However, the lack of a well-established and large-scale ultrasound video dataset with high-quality annotations has posed a persistent challenge for the research community. To overcome this issue, we meticulously curated a US video breast lesion segmentation dataset comprising 572 videos and 34,300 annotated frames, covering a wide range of realistic clinical scenarios. Furthermore, we propose a novel frequency and localization feature aggregation network (FLA-Net) that learns temporal features in the frequency domain and additionally predicts lesion locations to assist breast lesion segmentation. We also devise a localization-based contrastive loss to reduce the lesion location distance between neighboring video frames within the same video and enlarge the location distances between frames from different ultrasound videos. Our experiments on our annotated dataset and two public video polyp segmentation datasets demonstrate that our proposed FLA-Net achieves state-of-the-art performance in breast lesion segmentation in US videos and video polyp segmentation while significantly reducing time and space complexity. Our model and dataset are available at https://github.com/jhl-Det/FLA-Net.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_48

SharedIt: https://rdcu.be/dnwBI

Link to the code repository

https://github.com/jhl-Det/FLA-Net

Link to the dataset(s)

https://github.com/jhl-Det/FLA-Net

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a dataset and a model for breast lesion segmentation. The proposed dataset contains 572 videos, and the model is composed of an encoder, a feature aggregation module, and a two-branch decoder.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed dataset is larger than the datasets proposed before.
- The proposed model performs better than baseline methods.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- It is not clear whether the authors will make their code and dataset publicly available. There is no evidence of the authenticity of the dataset.
- The novelty of the proposed model is limited. Methods for both lesion segmentation and video analysis have been well explored.
- The hyper-parameters, FLOPs, and sizes of the nine state-of-the-art methods are not reported, so it is not clear whether the comparison in Table 2 is fair. It is also not clear whether the nine methods were well trained.
- The compared method [8] was published in 2021 and is apparently no longer the state-of-the-art method for polyp segmentation. Even so, the proposed method does not consistently outperform [8], which suggests a limitation of the proposed method.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Please refer to 6. for more information
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The dataset is private for now and the experimental details are not clear;
- The novelty of the proposed model is limited.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper presents a new dataset of 572 US videos with 34,300 annotated frames for the study of breast lesion segmentation. A new method, named frequency and localization feature aggregation network (FLA-Net), is proposed. The method enhances spatial-temporal feature aggregation with Fourier transforms. A comprehensive numerical study has been conducted for model evaluation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper presents a new dataset for breast lesion segmentation in ultrasound videos. The number of videos (572) roughly triples that of the previous work (Lin et al. 2022, 188 videos).
2. The use of Fourier transforms for spatial-temporal feature aggregation is interesting. The ablation study clearly shows the performance improvement from this module.
3. The numerical study is well conducted. The comparison with SOTA works convincingly demonstrates the performance of the proposed approach.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The motivation for, and related work on, applying the Fourier transform in feature space are not well presented. The paper focuses on the methodological details of this part, but the rationale behind it is not convincingly presented.
2. A discussion of the limitations of the work is missing. One potential issue is the speed of the method. Unlike other modalities, ultrasound videos are usually acquired at a high frame rate. The speed of the method determines whether the approach can be used in real time during treatment or only offline for diagnostics.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors will release the dataset and model.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Overall, the paper is well written. I only have minor comments.
1. I see a challenge regarding model speed. Many previous works run fast, in the range of 30-90 FPS (see Li et al. [10], Table 1), which makes them suitable for real-time applications such as lesion assessment during treatment. The proposed approach, first, uses a large input size of 352 × 352 and, second, applies Fourier transforms to features from a ResNet50 encoder (which can have many channels); this may slow down model inference.
2. I recommend that the authors report standard deviations in Table 2, as in previous work (Li et al. [10]), so that one can evaluate the robustness of the model. As the paper writes, ‘The spectral convolution theorem in Fourier theory suggests that updating a single value in the spectral domain affects all the original input features globally’; hence, a single feature-encoding failure can have non-local side effects in the following stages of such a model.
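The non-local effect mentioned in the second comment follows from the linearity of the inverse Fourier transform and can be checked numerically. In the NumPy sketch below, the 8 × 8 random feature map, the edited bin (1, 2), the edit size, and the helper name spatial_change_from_one_bin are arbitrary assumptions for illustration; editing one spectral value changes every spatial position by the same magnitude, delta / (H·W).

```python
import numpy as np


def spatial_change_from_one_bin(shape, u: int, v: int, delta: float = 1.0) -> np.ndarray:
    """Spatial-domain change caused by adding `delta` to the 2D Fourier bin (u, v).

    By linearity of the inverse DFT this is ifft2 of an impulse at (u, v):
    (delta / (H * W)) * exp(2j * pi * (u * m / H + v * n / W)), a complex wave
    whose magnitude is identical at every spatial position (m, n).
    """
    impulse = np.zeros(shape, dtype=complex)
    impulse[u, v] = delta
    return np.fft.ifft2(impulse)


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))   # toy single-channel feature map
spec = np.fft.fft2(feat)
spec[1, 2] += 4.0                    # edit one spectral value only
change = np.fft.ifft2(spec) - feat   # effect on every spatial position

print(np.allclose(np.abs(change), 4.0 / 64))                                # True
print(np.allclose(change, spatial_change_from_one_bin((8, 8), 1, 2, 4.0)))  # True
```

Because no spatial position is left untouched, an error in a single spectral coefficient propagates to the whole feature map, which is the robustness concern raised above.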
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written and organized. The numerical study clearly shows the good performance of the model. A new dataset, larger than those of previous works, is presented.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This paper proposed a breast lesion segmentation method for ultrasound videos and introduced a new dataset with bounding boxes, segmentation masks, and classification labels for lesions and lymph node metastasis. The proposed method outperforms previous methods on the introduced dataset. Ablation studies are provided for the proposed modules.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The introduced dataset is large and comprehensive, containing multiple types of labels compared with previous datasets. This should help the community explore large models.
2. The proposed method achieves better performance than previous methods.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Since the method leverages videos, can the authors explain why a 3D or video backbone is not used?
2. From Table 3, the proposed contrastive loss seems to bring only a small performance improvement.
3. In Section 3.2, the authors mention that the decoder is able to incorporate temporal features from nearby frames. Can the authors give some quantitative or qualitative results to support this?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors claim to open-source the code and data.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

It would be great to have some visualizations when adding or removing each part of the model, to better understand the contribution of each module. It would also be interesting if the authors could propose some new tasks given the rich annotations introduced in this dataset, such as clinically relevant disease-severity measurement based on size, temporal length, etc., to better fit clinical needs.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Please refer to the above comments.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposed a frequency-based feature aggregation framework for tumour segmentation in ultrasound videos. Compared with previous works, a large-scale dataset is introduced, which can benefit the community. As pointed out by the reviewers, the motivation for the frequency strategy is unclear, and other related video segmentation methods are not fully investigated. The overall technical contribution of this work needs further explanation. Please carefully prepare the rebuttal to address the issues raised by the reviewers.

Author Feedback

We appreciate the reviewers’ positive comments on our dataset, the largest to date, and on the novelty of our approach. Below, we clarify the main issues raised by the reviewers.

Q1: Dataset and code release (R1) A: We shall release our dataset and code upon acceptance.

Q2: Technical novelty (R1) A: Apart from building the largest dataset, our work makes two main technical contributions: (1) a frequency-aware feature aggregation module that integrates neighboring video frames to predict both the breast lesion segmentation result and a location map of breast lesions for the current video frame $I_t$; (2) a contrastive loss on the predicted location maps that makes the locations of neighboring video frames similar and the locations from different videos dissimilar. To the best of our knowledge, this is the first contrastive loss for video breast lesion detection. Moreover, these two technical contributions were recognized by R2, R3, and the AC.
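The rebuttal describes the localization-based contrastive loss only qualitatively. The NumPy sketch below is a hypothetical illustration, not the authors' formulation: it assumes each predicted location map is summarized by its probability-weighted centroid, treats all same-video frames in a batch as neighbors, and applies a standard margin-based pairwise contrastive loss. The names soft_centroid, location_contrastive_loss, and point_map, and the 8-pixel margin, are illustrative choices.

```python
import numpy as np


def soft_centroid(loc_map: np.ndarray) -> np.ndarray:
    """Probability-weighted (row, col) centroid of a non-negative H x W location map."""
    h, w = loc_map.shape
    weights = loc_map / (loc_map.sum() + 1e-8)
    rows, cols = np.mgrid[0:h, 0:w]
    return np.array([(weights * rows).sum(), (weights * cols).sum()])


def location_contrastive_loss(maps, video_ids, margin: float = 8.0) -> float:
    """Pairwise margin-based contrastive loss on lesion centroids (hypothetical).

    Pairs from the same video (assumed to be neighboring frames) are pulled
    together with a squared distance; pairs from different videos are pushed
    apart until their centroids are at least `margin` pixels apart.
    """
    centers = [soft_centroid(m) for m in maps]
    total, count = 0.0, 0
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            d = float(np.linalg.norm(centers[i] - centers[j]))
            if video_ids[i] == video_ids[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            count += 1
    return total / max(count, 1)


def point_map(h: int, w: int, r: int, c: int) -> np.ndarray:
    """Toy location map with all of its mass at pixel (r, c)."""
    m = np.zeros((h, w))
    m[r, c] = 1.0
    return m


maps = [point_map(16, 16, 2, 2), point_map(16, 16, 2, 3), point_map(16, 16, 12, 12)]
good = location_contrastive_loss(maps, ["a", "a", "b"])  # grouping matches positions
bad = location_contrastive_loss(maps, ["a", "b", "a"])   # grouping contradicts positions
print(good, bad)
```

With the consistent grouping, only the two nearby same-video frames contribute, so the loss stays small; with the mismatched grouping, the far-apart same-video pair and the close different-video pair are both penalized.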

Q3: Additional experimental details (AC, R1, R2) A: To ensure accurate and fair comparisons with the nine state-of-the-art methods, we obtained their results using either their public implementations or our own implementations, and retrained these networks while tuning their hyper-parameters to produce their best results. Moreover, following your suggestion, we report the inference speed (FPS) and size of our network and the nine SOTA methods in the table below for a 352×352 video frame. Our method achieves the highest Dice score with an FPS and parameter count comparable to those of existing methods.

Method        Dice    FPS   Params (M)   Backbone
UNet          0.745    56           33   Res2Net-50
UNet++        0.749    33           49   Res2Net-50
TransUNet     0.733    23          105   R50-ViT
SETR          0.709    27          159   T-Large
STM           0.741    20           38   ResNet-50
AFB-URR       0.750    24           33   ResNet-50
PNS+          0.754   105           27   Res2Net-50
DPSTT         0.755    23           40   ResNet-50
DCFNet        0.762    12           72   ResNet-101
Our FLA-Net   0.789    43           46   Res2Net-50
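R2's real-time concern can be made concrete by converting the FPS column of this table into per-frame latency. The sketch below copies the FPS values from the table; the 30 FPS stream rate and the helper names latency_ms and keeps_up are assumptions for illustration, and the FPS figures depend on hardware that the rebuttal does not specify.

```python
# FPS values copied from the rebuttal table (352×352 input, authors' unspecified hardware).
FPS = {
    "UNet": 56, "UNet++": 33, "TransUNet": 23, "SETR": 27, "STM": 20,
    "AFB-URR": 24, "PNS+": 105, "DPSTT": 23, "DCFNet": 12, "FLA-Net": 43,
}

# Hypothetical acquisition rate of the ultrasound stream; not given in the paper or reviews.
STREAM_FPS = 30


def latency_ms(frames_per_second: float) -> float:
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / frames_per_second


def keeps_up(frames_per_second: float, stream_fps: float = STREAM_FPS) -> bool:
    """True if frames are processed at least as fast as they arrive."""
    return frames_per_second >= stream_fps


for name, fps in sorted(FPS.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {latency_ms(fps):6.1f} ms/frame  keeps up with {STREAM_FPS} FPS: {keeps_up(fps)}")
```

Under this assumed stream rate, FLA-Net's 43 FPS corresponds to roughly 23 ms per frame, which leaves headroom; a higher acquisition rate would change the conclusion.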

Q4: Motivation of the Fourier transform (R2) A: Fourier theory suggests that updating a single value in the spectral domain affects all the original input features globally. Hence, we use the Fourier transform to learn global features for aggregating neighboring video features. Compared with classical transformers, which learn global context features at a high computational cost, the fast Fourier transform (FFT) largely reduces computation time, which matters because video breast lesion segmentation often requires fast inference. As shown in the table in Q3, compared with the transformer-based methods (TransUNet, SETR, PNS+), our method runs at 43 FPS for a 352×352 video frame and achieves the highest Dice score. We will clarify this in the manuscript.
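To show why frequency-domain mixing gives every output position a global receptive field at FFT cost, here is a hypothetical global-filter-style sketch; it is not the FLA module itself. It assumes the features of T neighboring frames are stacked as a (T, C, H, W) array and that each frame and channel has its own spectral filter (learnable in a network, fixed here). An element-wise product in the frequency domain equals a circular convolution in space with a kernel as large as the feature map, and each rfft2 costs on the order of C·HW·log(HW) operations, whereas global self-attention over HW tokens scales with (HW)²·C.

```python
import numpy as np


def spectral_aggregate(frames: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Fuse the features of T neighboring frames with per-frame spectral filters.

    frames:  real array (T, C, H, W); features of the current frame and its neighbors.
    filters: complex array (T, C, H, W // 2 + 1); one filter per frame and channel,
             which would be learnable in a network but is fixed here.
    Returns a real (C, H, W) feature map for the current frame.
    """
    _, _, h, w = frames.shape
    spectra = np.fft.rfft2(frames, axes=(-2, -1))         # (T, C, H, W//2+1)
    fused = (spectra * filters).sum(axis=0)               # filter each frame, then sum over T
    return np.fft.irfft2(fused, s=(h, w), axes=(-2, -1))  # back to (C, H, W)


rng = np.random.default_rng(0)
T, C, H, W = 3, 4, 16, 16
frames = rng.standard_normal((T, C, H, W))

# Sanity check: uniform filters of 1/T reduce to a plain temporal average.
avg_filters = np.full((T, C, H, W // 2 + 1), 1.0 / T, dtype=complex)
out = spectral_aggregate(frames, avg_filters)
print(out.shape, np.allclose(out, frames.mean(axis=0)))  # (4, 16, 16) True
```

The sanity check shows that this kind of spectral fusion strictly generalizes frame averaging; learned filters can instead weight frames differently per frequency and channel.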

Q5: Model’s inference speed (R2) A: Our FLA-Net achieves a real-time inference speed of 43 frames per second (FPS) owing to the efficient fast Fourier transform (FFT). As shown in Q3, our method has the highest Dice score and an inference time comparable to or shorter than that of SOTA methods.

Q6: Why not a 3D or video backbone? (R3) A: For fair comparisons, we use the same 2D backbones as the compared SOTA methods to extract features; see the backbones listed in Q3.

Q7: Incremental improvement (R3) A: We computed p-values of 9.1e-44, 2.1e-47, 6.6e-44, and 4.6e-44 for the four metrics. All are smaller than 0.05, indicating that the improvement from the contrastive loss is statistically significant.
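The rebuttal does not name the test behind these p-values. As one hypothetical way to check such a claim, the sketch below runs a paired sign-flip permutation test on per-frame scores of the model with and without the contrastive loss; the function name paired_sign_flip_test and the synthetic scores are assumptions, not data from the paper. A Monte Carlo test with 10,000 resamples cannot report p-values below about 1e-4, so values on the order of 1e-44 presumably come from a parametric test (for example, a paired t-test) over many frames.

```python
import numpy as np


def paired_sign_flip_test(a, b, n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided Monte Carlo sign-flip permutation test on paired scores.

    a, b: scores of two model variants on the same cases (e.g. per-frame Dice).
    Under the null hypothesis each paired difference is equally likely to have
    either sign, so random sign flips generate the null distribution of |mean|.
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # Add-one correction: the smallest attainable p-value is 1 / (n_perm + 1).
    return (np.count_nonzero(null >= observed) + 1) / (n_perm + 1)


# Synthetic per-frame Dice scores (invented for illustration, not from the paper).
rng = np.random.default_rng(1)
without_loss = rng.uniform(0.6, 0.9, size=200)
with_loss = without_loss + 0.02 + rng.normal(0.0, 0.01, size=200)
p = paired_sign_flip_test(with_loss, without_loss)
print(p)
```

Pairing matters here: both variants are evaluated on the same frames, so the test operates on per-frame differences rather than on two independent samples.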

Q8: Results to support that the decoder incorporates temporal features… (R3) A: Our FLA module learns the temporal feature O_t, and O_t is the input of the decoder, so the decoder incorporates temporal features. To support this, we removed the FLA module from our network and report the results in the table below. Our superior results over ‘w/o-FLA’ show that the decoder with the temporal features O_t from the FLA module outperforms the one without O_t. We will add these results to the paper.

Method    Dice    Jaccard   F1-score   MAE
w/o-FLA   0.768   0.658     0.797      0.035
Ours      0.789   0.687     0.815      0.033
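For readers reproducing such a table, the sketch below shows standard per-frame definitions of Dice, Jaccard, and MAE on a binary mask; the function names and the toy 4 × 4 masks are illustrative assumptions, not the authors' evaluation code. Per-frame scores are averaged over frames, so the mean Jaccard cannot be derived from the mean Dice through the per-mask identity J = D / (2 - D). The F1-score is omitted because its exact protocol (such as thresholding) is not stated here; on a single binary mask, F1 and Dice coincide.

```python
import numpy as np


def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks (1.0 if both are empty)."""
    p, g = pred.astype(bool), gt.astype(bool)
    return float((2.0 * np.logical_and(p, g).sum() + eps) / (p.sum() + g.sum() + eps))


def jaccard(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Jaccard (IoU) = |P ∩ G| / |P ∪ G| for binary masks (1.0 if both are empty)."""
    p, g = pred.astype(bool), gt.astype(bool)
    return float((np.logical_and(p, g).sum() + eps) / (np.logical_or(p, g).sum() + eps))


def mae(prob: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a [0, 1] prediction map and the binary mask."""
    return float(np.abs(prob.astype(float) - gt.astype(float)).mean())


gt = np.zeros((4, 4))
gt[0:2, 0:2] = 1      # 4 lesion pixels
pred = np.zeros((4, 4))
pred[0:2, 1:3] = 1    # 4 predicted pixels, 2 of them overlapping the lesion
print(round(dice(pred, gt), 3), round(jaccard(pred, gt), 3), mae(pred, gt))  # 0.5 0.333 0.25
```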

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The main framework of this submission is technically sound. However, the reviewers pointed out limitations in the motivation and the comparisons. Most of these issues were answered in the rebuttal. Therefore, I recommend acceptance of this manuscript.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Strengths: a new dataset and baseline results.
Weaknesses: the technical contribution is relatively weak, although this may be somewhat compensated by the new dataset that is to be made available; limitations of the work are not discussed; convincing empirical comparison with the SOTA methods.
How the rebuttal informed your decision: the authors’ rebuttal states that the code and dataset will be made publicly available.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The use of spectral-domain features for segmentation was not clear, even after the authors’ feedback. The additional experiments in the feedback partially validate the method empirically; however, they cannot be taken into account in the assessment of the paper.


Shifting More Attention to Breast Lesion Segmentation in Ultrasound Videos