
Authors

Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad Anwer, Fahad Shahbaz Khan

Abstract

Detecting breast lesions in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores useful local information. To tackle these issues, we propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both encoder and decoder. To further accelerate detection, we introduce an encoder feature shuffle strategy for multi-frame prediction during inference. In our encoder feature shuffle strategy, we share the backbone and encoder features, and shuffle encoder features for the decoder to generate the predictions of multiple frames. Experiments on the public breast lesion ultrasound video dataset show that our STNet obtains state-of-the-art detection performance while operating at twice the inference speed. The code and model are available at https://github.com/AlfredQin/STNet.
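The encoder feature shuffle idea in the abstract can be illustrated with a toy sketch. All names here are hypothetical and this is not the authors' implementation; it only mirrors the control flow the abstract describes: the backbone and encoder run once per window of frames, and the shared encoder features are re-ordered per target frame before each decoder pass.

```python
# Hedged sketch of multi-frame prediction via encoder feature shuffling.
# Strings stand in for feature tensors; names are hypothetical.

def encode(frames):
    # Stand-in for the shared backbone + encoder: one "feature" per frame.
    return [f"enc({f})" for f in frames]

def decode(encoder_feats, target_index):
    # Stand-in for the decoder: consumes the shared encoder features with
    # the target frame's features moved to the front (the "shuffle").
    shuffled = ([encoder_feats[target_index]]
                + encoder_feats[:target_index]
                + encoder_feats[target_index + 1:])
    return f"pred_from({','.join(shuffled)})"

def predict_window(frames):
    feats = encode(frames)  # backbone + encoder run ONCE per window
    # One cheap decoder pass per frame reuses the shared encoder output.
    return [decode(feats, i) for i in range(len(frames))]

preds = predict_window(["f0", "f1", "f2"])
```

The speed-up comes from amortizing the expensive encode step over all frames of the window, leaving only the decoder to run per frame.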

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_45

SharedIt: https://rdcu.be/dnwyY

Link to the code repository

https://github.com/AlfredQin/STNet

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

The paper proposes a spatial-temporal deformable attention-based network for ultrasound video-based breast lesion detection. The main contributions are the temporal and local feature fusion module and the multi-frame prediction with encoder feature shuffle during inference.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The spatial-temporal deformable attention module is useful, as it can exploit both intra-frame spatial information and inter-frame temporal information. The encoder feature shuffle strategy is interesting, as it can predict multiple frames with the same encoder and accelerate inference. The results are promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The selection of three consecutive frames and three random frames needs justification and explanation. It is not clear how the authors divide the training and testing datasets.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used a public dataset and claimed they would publish their code upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Please provide details of the classification and detection modules, which are an important aspect of the detection task. It seems that your classification also produces a confidence level (percentage); please explain the process.
    2. The classification is based on each frame. However, in practice, the classification should be determined based on the patient's whole video. Please justify your choice and explain how you would combine the per-frame decisions into a decision for the whole video of each patient.
    3. Duplicate references 9 and 10
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of using spatial-temporal deformable network is novel. The results are promising.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • This work proposes an approach based on spatial-temporal deformable attention, which aims to detect breast lesions in ultrasound videos.
    • The proposed method achieves SOTA performance on a large video dataset with a significant reduction in inference time.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Overall writing is clear and organized.
    • The performance has increased with a large margin compared to the baselines.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty of the proposed model structure is limited.
    • Only one dataset was used for verification of the model’s effectiveness.
    • The model evaluation is focused on AP without other metrics.
    • The inference time is compared with only one of the baselines.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed model architecture and the implementation details are well provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • I would like to request further clarification on the topic of spatio-temporal deformable attention, as it appears that there are existing studies on this subject. I kindly ask the authors to elucidate the differences between their proposed model and the following models, explaining how their method is superior for the breast lesion detection task compared to adopting these models as other baselines (for example, ‘RetinaNet’ is used as baseline in this paper):
      1. Deformable VisTR: Spatio Temporal Deformable Attention for Video Instance Segmentation (https://ieeexplore.ieee.org/document/9746665)
      2. DeVIS: Making Deformable Transformers Work for Video Instance Segmentation (this model applies temporal multi-scale deformable attention in a Transformer encoder-decoder architecture.) (https://arxiv.org/pdf/2207.11103.pdf)
    • I noticed that the model evaluation seems to focus on AP. I suggest that the authors provide additional accuracy, recall, and precision of detecting frames for the proposed model and the baselines in order to support the quantitative results shown in Fig. 3. As an example, the frame detection accuracy could be calculated as ‘(the number of correctly detected frames by the model)/(the total number of frames).’

    • I suggest the authors compare the inference time with various baselines by adding the results to Table 1.

    • Minor issues to be addressed:
      1. A small typo was found in the Introduction section: ‘[9])’
      2. In Table 1, the third column appears unnecessary, as all the models adopt the same backbone. Consider removing or modifying this column for clarity.
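The frame-level accuracy suggested in the comments above reduces to a one-line ratio. A minimal sketch follows; the IoU-based notion of a "correctly detected" frame is an assumption here, not something the review specifies:

```python
def frame_detection_accuracy(correct_flags):
    # correct_flags[i] is True when the detection on frame i matches the
    # ground truth (e.g., box IoU above a chosen threshold -- an assumption).
    return sum(correct_flags) / len(correct_flags)

acc = frame_detection_accuracy([True, True, False, True])  # 3 of 4 frames correct
```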
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The novelty of the proposed method appears to be weak, given the existence of other models that employ spatio-temporal deformable attention in detection and segmentation tasks. It is crucial for the authors to demonstrate how their approach contributes to the field in a unique and innovative manner, differentiating it from previously published works. Nevertheless, the comparison with the two existing types of models is a strong point, and the paper is well-organized overall. If the authors address the concerns mentioned above and provide appropriate clarification, I would be willing to increase my score for this paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    In the manuscript a method for breast lesion detection in ultrasound videos is presented. The method is based on a spatial-temporal deformable attention module that performs local spatial-temporal feature fusion.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript has some strengths:

    • a new and interesting way to use the self-attention module, typical of transformer methods;
    • integration of spatial and temporal information in a video;
    • accurate ablation study;
    • interesting performance compared to the state of the art.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some details could be added to the method description and the experiments. In detail:

    • clarify some points, for example what is meant by reference points;
    • add information about the distribution of the dataset into training and validation sets.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper gives many details that allow the reproducibility of the method. However, the distribution of data into training/validation sets is not clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper is well written, even if some sentences are redundant; for example, the abstract sentences “To address these issues, we propose a spatial-temporal deformable attention based framework, named STNet, based on a spatial-temporal deformable attention module that performs local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both encoder and decoder.” could be re-written. So, I suggest a thorough rereading to improve some sentences. I also suggest clarifying some points, for example what is meant by reference points, cited in subsection 2.1, and the distribution of the data into training/validation/test sets.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and the method is described in a clear way. However, a thorough rereading is suggested to improve some sentences. Moreover, it would be better to clarify the distribution of the data into training/validation/test sets.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The proposed work utilizes a spatial-temporal deformable attention module, which proves to be highly effective based on the evaluation performance.




Author Feedback

We sincerely appreciate the reviewers for dedicating their time to review our paper and for providing constructive feedback.

Training and Testing Datasets Division: We adopted the same dataset splits as the previous work CVA-Net to ensure a fair comparison. The testing set comprises 38 videos, randomly selected from the dataset and representing about 20% of the total; the remaining videos form the training set. This division detail will be added in the camera-ready version.

Duplicate References 9 and 10: We acknowledge the duplicate references. This will be corrected in the camera-ready version.

Specification of Spatio-Temporal Deformable Attention: Our decoder uses a global query in cross-attention to decode the encoder features of three sequential and three global frames. This is in contrast to Deformable VisTR and DeVIS, which assign a query for each frame. The global query allows encoder feature shuffling to accelerate prediction in the inference phase. Our attention module is multi-scale, enabling it to use multi-scale information for improved performance. In contrast, Deformable VisTR replaces the vanilla Deformable DETR’s multi-scale features with multi-frame features, limiting their spatio-temporal attention’s ability to use multi-scale features.
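The distinction between per-frame queries and a single global query can be seen at the shape level. The rough numpy sketch below uses hypothetical dimensions, and vanilla softmax cross-attention stands in for deformable attention; it is not the paper's code.

```python
import numpy as np

np.random.seed(0)
T, N, Q, D = 6, 100, 50, 256   # frames, tokens per frame, queries, feature dim
enc = np.random.rand(T, N, D)  # encoder features for a window of T frames

# Per-frame queries (Deformable VisTR / DeVIS style): T separate query sets,
# each tied to one frame's features.
per_frame_queries = np.random.rand(T, Q, D)

# Global query (the rebuttal's description of STNet): ONE query set attends
# over the concatenated features of all frames, so the per-frame encoder
# features can be shuffled at inference without retraining query-frame ties.
global_queries = np.random.rand(Q, D)
keys = enc.reshape(T * N, D)                        # flatten frames into one key set
logits = global_queries @ keys.T                    # (Q, T*N) cross-attention logits
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)             # softmax over all frames' tokens
decoded = attn @ keys                               # (Q, D) decoded features
```

Because the global queries never index a specific frame, permuting the T slices of `enc` before flattening changes only which frame the decoder predicts, which is the property the shuffle strategy exploits.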

Typographical error in the Introduction section: ‘[9])’: We acknowledge this typographical error and will correct it in the camera-ready version.


