Authors

Jialu Li, Qingqing Zheng, Mingshuang Li, Ping Liu, Qiong Wang, Litao Sun, Lei Zhu

Abstract

Automatic breast lesion segmentation in ultrasound (US) videos is an essential prerequisite for early diagnosis and treatment. This challenging task remains under-explored due to the lack of annotated US video datasets. Though recent works have achieved better performance in natural video object segmentation by introducing promising Transformer architectures, they still suffer from spatial inconsistency as well as huge computational costs. Therefore, in this paper, we first present a new benchmark dataset designed for US video segmentation. Then, we propose a dynamic parallel spatial-temporal Transformer (DPSTT) to improve the performance of lesion segmentation in US videos with higher computational efficiency. Specifically, the proposed DPSTT disentangles the non-local Transformer along the temporal and spatial dimensions, respectively. The temporal Transformer attends to temporal lesion movement across different frames at the same regions, and the spatial Transformer focuses on similar context information between the previous and the current frames. Furthermore, we propose a dynamic selection scheme to effectively sample the most relevant frames from all the past frames, and thus prevent out-of-memory errors during inference. Finally, we conduct extensive experiments to evaluate the efficacy of the proposed DPSTT on the new US video benchmark dataset.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_38

SharedIt: https://rdcu.be/cVRv3

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

A new benchmark dataset for automatic breast lesion segmentation in ultrasound video is presented. A dynamic parallel spatial-temporal Transformer (DPSTT), built from temporally and spatially decoupled Transformer blocks, is proposed. A dynamic memory selection scheme is presented to dynamically update the memory frames of the DPSTT. The dataset and network are assessed through a comprehensive ablation study and comparison with other SoTA models.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The authors present a breast lesion segmentation dataset of ultrasound videos. The temporal information contained in the proposed dataset contributes to accuracy in automatic breast lesion segmentation.
2. A dynamic parallel and spatial-decoupled transformer framework is presented. The DPSTT achieves computational efficiency through the proposed non-local spatial and temporal transformer module.
3. A dynamic memory selection scheme is proposed to eliminate unnecessary features of the past frames. Through ablation studies, the authors verify that the scheme contributes to the segmentation accuracy.
4. Quantitative assessment is conducted by comparing DPSTT with other SoTA models, including image-based and video-based segmentation models. The results demonstrate that the DPSTT outperforms other SoTA models by a large margin.
5. A comprehensive ablation study is provided. The study clearly explains the effectiveness of the proposed temporal and spatial transformer model and the dynamic memory selection scheme.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. It would have been valuable to describe some details of the proposed video ultrasound segmentation dataset, e.g., the types of lesions and the number of subjects.
2. The description of the loss function is insufficient, e.g., how the binary cross-entropy loss and the dice loss are weighted.
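For illustration, the author feedback further down this page states that the BCE and dice losses are each weighted 0.5. A minimal sketch of such a weighted combination is given here in plain Python. The function names, the flattened-list inputs, and the smoothing constant in the dice term are assumptions made for this example and are not taken from the paper.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss; `smooth` is an assumed stabilising constant."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

def combined_loss(probs, targets, w_bce=0.5, w_dice=0.5):
    """Weighted sum of BCE and dice; 0.5/0.5 follows the author feedback."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)
```

A perfect prediction drives both terms to (almost) zero, while an uninformative prediction of 0.5 everywhere is penalised by both terms.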
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors will release the dataset and the code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- The spatial resolution of the dataset is down-sampled to 300x200. One concern is that such down-sampling affects the reconstruction accuracy.
- Additional results on the relationship between the number of memory frames and accuracy would help in understanding the effectiveness of the proposed spatially and temporally decoupled transformer module.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written. A novel video ultrasound lesion segmentation dataset is presented. It is a nice application of Transformer modules to automated lesion segmentation.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #4

Please describe the contribution of the paper

The authors publish the first annotated breast lesion segmentation dataset using ultrasound video. The paper presents a dynamic parallel temporal and spatial-decoupled transformer. The neural network reduces the amount of computation and enhances performance. Through extensive comparative and ablation studies, it is shown that the proposed network outperforms existing methods in accuracy.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

A. The paper introduces an automated breast lesion segmentation dataset using ultrasound video. Recently, a line of work has demonstrated that proper utilization of spatio-temporal information enhances the reconstruction accuracy. The proposed temporal breast data is expected to be applicable to diverse neural networks that use temporal information, and to contribute to the accuracy of breast lesion segmentation.
B. The proposed spatial and temporal transformer reduces computation compared to the baseline while showing enhanced performance.
C. Extensive quantitative comparison and ablation studies are provided. The proposed decoupled spatial and temporal transformer network outperforms other state-of-the-art neural networks with reduced inference time.
D. Overall, the paper is well written and well structured.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

A. The results do not report the computational complexity of the proposed dynamic selection algorithm. The sorting and cosine similarity calculation is carried out for every frame, which might have an adverse impact on the inference time.
B. The description and qualitative assessment of the dynamic memory selection algorithm are insufficient.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors will provide the code, and the results seem reproducible.

Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The proposed temporal decoupled transformer splits the memory and query keys into s^2 non-overlapping patches. Reporting the relationship between the accuracy and the value of s would provide a deeper understanding of the temporal transformer and local similarity.
The dynamic memory selection scheme employs cosine similarity to calculate the similarity between frames. Other similarity metrics, such as Euclidean distance, could be an alternative option. It would be good if the reason for choosing cosine similarity were explained.
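As background for this comment, the two metrics can be contrasted with a small sketch. It shows one well-known difference: cosine similarity ignores feature magnitude, whereas Euclidean distance does not. This is offered only as a general property of the metrics. It is not the authors' stated reason for their choice, and the flat feature vectors below are made-up values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (scale-invariant)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (scale-sensitive)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical features: same direction, but the second has twice the magnitude.
query = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]

cos = cosine_similarity(query, scaled)    # close to 1: treated as identical
dist = euclidean_distance(query, scaled)  # clearly non-zero: treated as different
```

Whether magnitude differences between frame features should count as dissimilarity is a design question that an ablation over metrics would answer.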

Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The introduced breast lesion segmentation dataset using ultrasound video is novel. The proposed decoupled spatial and temporal transformer is efficiently formulated and properly assessed.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #5

Please describe the contribution of the paper

The paper introduces an ultrasound video dataset with pixel-wise annotations for breast lesion segmentation. It additionally proposes a video segmentation method based on general segmentation architectures, i.e., STM. The results of the model on the dataset look good.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The dataset is probably a valuable resource for the community.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The methodological contributions are not significant. Most of the components are well studied, especially in the natural video segmentation community. The model largely combines existing techniques.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Most technical details are well provided. I believe that the model can easily be implemented. However, more training details should be given, such as the optimizer and the learning rate scheduler.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

My first major concern about the paper is the baseline model for comparison. The reviewer assumes that the baseline is STM; however, it is hard to align the results of STM with the ablation results. I suggest that the authors clarify the baseline and then incrementally demonstrate the contributions of the components proposed in the article. This is essential since the segmentation framework has been explored elsewhere.

My second concern is about the generalization of the method. Is the method also applicable to other medical video segmentation tasks? What makes it unique to US video segmentation?

Memory-based networks have been explored in medical segmentation, for example in Quality-Aware Memory Network for Interactive Volumetric Image Segmentation. This related work should therefore be carefully discussed.

The ablation study is not sufficient. Some detailed investigation of memory hyperparameters (e.g., K in Eq.5) should be performed.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Though the novelty of the methodology is not very significant, the dataset is a contribution. Overall, I think that the merits outweigh the weaknesses and recommend “weak accept”.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

Reviewers recommend acceptance of the paper unanimously. They appreciate the contribution of a new experimental framework to the community, consider the technical novelty adequate, and agree on the adequacy of the experiments. The final version should take into account all reviewers’ comments and suggestions. In particular: (1) Additional dataset statistics (R1), (2) Additional technical details (R1, R2), (3) Complexity analysis in the experiments (R2), and (4) additional experiments and discussions (R2, R3).
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

1

Author Feedback

We thank the reviewers for their positive feedback, for judging our paper to be “novel”, “well written” and “valuable”, and for their constructive comments. We address the major concerns below.

(1) Additional dataset statistics. The breast US dataset involves 63 subjects with one video sequence per subject; thus, 63 video sequences were collected, with 4619 frames annotated with pixel-level ground truth by experts. The dataset does not label specific types of lesions.

(2) Additional technical details. We use the binary cross-entropy (BCE) loss and the dice loss, each with a weight of 0.5, during training. We use the Adam optimizer with a weight decay of 1e-5.

(3) Complexity analysis for the dynamic memory selection. Our method sets the size of the memory buffer to K. During inference, the dynamic selection module is activated if the index of the query frame t exceeds K. It only computes the similarity between the features of the query frame and those of the previous frame, as well as those in the memory (K frames). It then updates the memory frames by adding the previous frame at the tail and removing the frame in the original memory with the lowest similarity value, or keeps the original memory if the previous frame has the lowest similarity value. In this way, the complexity of dynamic selection is O(n), where n is the video length.

(4) Additional experiments and discussions. We evaluate different settings of the memory hyperparameter K. The quantitative results (Jaccard) for K = 3, 5, 10 are 73.55%, 73.64%, and 73.65%, respectively. There is no significant difference between the results for different K. Therefore, we conclude that our method is robust to the choice of K.

(5) Generalization of our method. The proposed method is general and applicable to other video segmentation tasks with an adapted implementation. We plan to extend our experiments to natural video segmentation or other medical video segmentation tasks in a follow-up journal paper.

(6) Additional discussion of [Ref1]. Thanks for introducing the related work [Ref1] to us. There are three significant differences between [Ref1] and ours. 1) Motivation: We are the first to propose a novel DPSTT method as well as a new benchmark dataset for medical video segmentation, while [Ref1] studies a quality-aware memory network for 3D medical image segmentation. 2) Technical novelty: [Ref1] proposes a quality assessment module to automatically select slices for human iterative correction; our DPSTT improves the performance of the memory reading module and disentangles the non-local Transformer along the temporal and spatial dimensions. 3) Inference: [Ref1] saves a new memory item every 5 slices, which results in memory overflow when applied to video segmentation tasks.

[Ref1] Zhou T, Li L, Bredell G, et al. Quality-aware memory network for interactive volumetric image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2021: 560-570.
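The memory update described in point (3) of the feedback can be sketched as follows. This is an illustrative reading of the rebuttal text, not the authors' implementation: frame features are assumed to be flat vectors, cosine similarity is used as the metric (as mentioned by the reviewers), the function names are hypothetical, and the warm-up behaviour while the buffer holds fewer than K frames and the handling of ties are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def update_memory(memory, prev_feat, query_feat, k):
    """Return the memory (a list of at most k features) for the next query frame."""
    if len(memory) < k:
        # Assumed warm-up phase: fill the buffer before any selection happens.
        return memory + [prev_feat]

    # Similarity of the query frame to each memory frame and to the previous frame.
    sims = [cosine_similarity(query_feat, m) for m in memory]
    prev_sim = cosine_similarity(query_feat, prev_feat)

    worst = min(range(len(memory)), key=lambda i: sims[i])
    if prev_sim <= sims[worst]:
        # The previous frame is the least similar one: keep the memory unchanged.
        return memory

    # Drop the least similar memory frame and append the previous frame at the tail.
    return memory[:worst] + memory[worst + 1:] + [prev_feat]
```

Under these assumptions each call performs K + 1 similarity computations, so for a fixed K the total cost grows linearly with the number of frames, which is consistent with the O(n) stated in the feedback.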

Rethinking Breast Lesion Segmentation in Ultrasound: A New Video Dataset and A Baseline Network