
Authors

Mengya Xu, Mobarakol Islam, Hongliang Ren

Abstract

Surgical captioning plays an important role in surgical instruction prediction and report generation. However, the majority of captioning models still rely on a computationally heavy object detector or feature extractor to extract regional features. In addition, the detection model requires extra bounding-box annotation, which is costly and needs skilled annotators. These limitations lead to inference delay and prevent captioning models from being deployed in real-time robotic surgery. For this purpose, we design an end-to-end detector- and feature-extractor-free captioning model by utilizing the patch-based shifted window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP), with faster inference speed and less computation. SwinMLP-TranCAP replaces the multi-head attention module with a window-based multi-head MLP. Such designs have primarily been explored for image understanding tasks, and very few works investigate the caption generation task. SwinMLP-TranCAP is also extended into a video version for video captioning tasks using 3D patches and windows. Compared with previous detector-based or feature-extractor-based models, our models greatly simplify the architecture design while maintaining performance on two surgical datasets.
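For readers unfamiliar with the window-based multi-head MLP mentioned in the abstract, below is a minimal, illustrative PyTorch sketch of the token-mixing idea: within each local window of M x M patch tokens, tokens are mixed by a grouped linear layer (one group per head) instead of self-attention. The class name, dimensions, and the omission of window partitioning/shifting, normalization, residual connections, and the channel MLP are simplifications for illustration only; this is not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class WindowMultiHeadMLP(nn.Module):
        """Illustrative window-based multi-head MLP (token-mixing) block."""

        def __init__(self, dim, window_size=7, num_heads=4):
            super().__init__()
            assert dim % num_heads == 0
            self.window_size = window_size
            self.num_heads = num_heads
            n_tokens = window_size * window_size
            # Grouped 1x1 convolution mixes the M*M tokens within each window,
            # independently per head, replacing window self-attention.
            self.spatial_mlp = nn.Conv1d(n_tokens * num_heads,
                                         n_tokens * num_heads,
                                         kernel_size=1,
                                         groups=num_heads)

        def forward(self, x):
            # x: (num_windows * batch, M*M, C) window-partitioned patch embeddings
            b, n, c = x.shape
            h = self.num_heads
            # (b, n, h, c//h) -> (b, h, n, c//h) -> (b, h*n, c//h)
            x = x.view(b, n, h, c // h).permute(0, 2, 1, 3).reshape(b, h * n, c // h)
            x = self.spatial_mlp(x)  # token mixing within each window, per head
            x = x.view(b, h, n, c // h).permute(0, 2, 1, 3).reshape(b, n, c)
            return x

    # Toy usage: 8 windows of 7x7 tokens with 96 channels.
    tokens = torch.randn(8, 49, 96)
    block = WindowMultiHeadMLP(dim=96, window_size=7, num_heads=4)
    out = block(tokens)  # same shape as input: (8, 49, 96)

Because the grouped linear layer has a fixed number of weights per window, its cost does not grow with the attention map as self-attention does, which is the intuition behind the paper's efficiency claim.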

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_36

SharedIt: https://rdcu.be/cVRXb

Link to the code repository

https://github.com/XuMengyaAmy/SwinMLP_TranCAP

Link to the dataset(s)

https://endovissub2018-roboticscenesegmentation.grand-challenge.org/Data/

https://engineering.purdue.edu/starproj/_daisi/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a window-based MLP Transformer (with patch-based shifted windows) for surgical captioning on video data. Two surgical datasets are adopted for benchmarking purposes, where comparable results are obtained with a much smaller computational burden.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors propose to borrow the patch-based shifted window technique to enable real-time captioning in robotic surgery.

    • Surgical video captioning is an interesting topic that is still not well explored. It is encouraging to see studies proposing to apply advanced vision techniques to this domain.

    • The paper is well-written with clear demonstration.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Although the authors specified their differences from ViT and the Swin Transformer, I still find the work lacking in its own uniqueness and technical contribution. This judgement somewhat reduces the overall impression, as the Swin Transformer may be the main reason for the reduction in computational burden.

    • The authors claim that their design has a lower computational cost; however, I did not find this well justified in the results section. A runtime comparison with other counterparts and an efficiency analysis (GPU, #FLOPs) is suggested, so that the efficiency improvements can be quantitatively demonstrated. Besides, the authors mention “real-time robotic surgery”; could this be justified by numerical results as well?

    • Qualitative comparisons with SOTAs on captioning are preferred but missing from the manuscript. The quantitative results in Table 1 do not look convincing enough to demonstrate the superiority of the proposed method.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors included their source code in the supplementary material, and given the clear specification in the manuscript, I do not doubt its reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors did not promise to release the code, which might be somewhat disappointing to the community, I suppose.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering both its advantages and drawbacks, I find it an interesting paper overall.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for preparing the rebuttal; they partially addressed my concerns. I would like to change my rating to “accept”; please revise and improve the manuscript carefully as discussed and promised. Regarding the results presented in the rebuttal, I think they need to be included in the final version; if there is a page limit, please consider including them in the supplementary document or on an additional public project page.



Review #2

  • Please describe the contribution of the paper

    This paper proposes an architecture for surgical captioning without the need for an intermediate feature extraction or detection step. The authors evaluate the use of Swin Transformers with MLPs and design an encoder-decoder captioning architecture. They evaluate their method on two datasets, including qualitative results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Surgical captioning is an interesting research direction that goes beyond simple phase, tool, or activity recognition.
    • Novel architectural choices for the task of surgical caption generation.
    • The addition of video is interesting and adds to the value of the work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. “Nonetheless, the feature extractor still exists as an intermediate module which unavoidably leads to inefficient training and long inference delay at the prediction stage”
    2. “Despite the impressive performance, most of the approaches are required heavy computational resources for the surgical captioning task which limits the real-time deployment”
    3. “take the image patches directly, to eliminate the object detector and feature extractor for real-time application”
    4. “These lead to inference delay and limit the captioning model to deploy in real-time robotic surgery.”
    5. “reduces the training parameters, and improves the inference speed.”

    In this work the authors highlight the advantage of their approach in terms of efficiency and speed (points 1 to 5). However, no comparison regarding FLOPs or FPS between detection-based and detection-free models is provided. The model's speed and efficiency are not necessarily restricted if it is not end-to-end. In summary, I cannot follow the argument of reduced complexity and efficiency.

    In fact, ResNet has significantly fewer parameters and FLOPs than the Swin-L model used: ResNet-18 has 11M parameters and 2 GFLOPs (at 224x224), whereas Swin-L has 197M parameters and 34.5 GFLOPs (at 224x224).

    The metrics for SwinMLP are significantly improved compared to Swin. However, the performance of Swin and SwinMLP should be comparable, and Swin should also be comparable to Transformer [5]. It would be good to see whether there is any rationale for this difference, as it makes the results less trustworthy.

    An evaluation is missing against: Zhang, J., Nie, Y., Chang, J., Zhang, J.J.: Surgical instruction generation with transformers. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 290–299. Springer (2021).

    An ablation is missing between 2D, 3D, and 2D/3D. Also, this should be compared to a 3D baseline, e.g., 3D ResNet.
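    As a quick sanity check of the ResNet-18 figure quoted above, a minimal sketch using the standard torchvision ResNet-18 (which may not be the exact variant used in the paper) reproduces the parameter count; FLOPs would require a separate profiler and are not computed here.

        import torchvision.models as models

        # Standard torchvision ResNet-18; the parameter count is ~11.7M,
        # matching the reviewer's "11mio" figure.
        resnet18 = models.resnet18()
        n_params = sum(p.numel() for p in resnet18.parameters())
        print(f"ResNet-18 parameters: {n_params / 1e6:.1f}M")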

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility Response is set to “Yes” for every question.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Table 1: why does “Ours” not have FPS (frames per second)?

    “especially in robotic surgery”: why is this especially a problem for robotic surgery? Isn't it also a problem for, e.g., minimally invasive surgery?

    In Table 1, for SwinMLP-TranCAP, Swin-TranCAP, and V-SwinMLP-TranCAP, I think it is better to write SwinMLP-TranCAP-Encoder (likewise for the others) in the FE column, and to add an X to the Det. column for the same group.

    “Self-sequence and AOA originally take the region features extracted from the object detector with feature extractor as input. In our work, we design the hybrid style for them by sending image features extracted by the feature extractor only”: doesn't this modification make the AOA and Self-sequence methods less capable?

    How does the window size of 14, instead of 7, influence the results, and what is the rationale for this change to the baseline?

    Revisit these sentences: “The shifted window and multi-head MLP architecture design make our model less computation” and “Replacing the multi-head attention module with a multi-head MLP also reveals that the generic transformer architecture is the core design instead of the attention-based module.”

    Avoid using these terms in academic writing:

    • extremely simple
    • very compute heavy

    (see https://www.scribbr.com/academic-writing/taboo-words/)
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting work on surgical captioning using the novel Swin architecture. The evaluation and ablation could be more detailed, e.g., comparing to other works on the same task and dataset. The main concern is that there is no evidence that this work is more suitable for real-time applications than other methods with a separate detection and feature extraction step; this concern should be addressed.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    In the authors' answer, point 2 is my main concern. There, it seems the proposed model only helps in GFLOPs and not in the number of parameters or FPS. The other metrics are also not very convincing. A comparison to, e.g., MobileViT, an optimized version of the transformer architecture, has not been performed (but was also not requested by the reviewers).



Review #3

  • Please describe the contribution of the paper

    This paper designs an end-to-end detector- and feature-extractor-free captioning model by utilizing the patch-based shifted window technique from the recent Swin Transformer. Compared to the conventional Swin Transformer, it replaces the multi-head attention with a window-based multi-head multi-layer perceptron. It removes the need for human annotation of bounding boxes and boosts real-time performance. The authors validate the model on the surgical video captioning task and compare it with baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Unified structure: this work removes the requirement that a transformer-based captioning model needs a feature extractor or detector ahead of it, yielding a transformer-only unified network structure for the surgical video captioning task. This makes the network not limited by pre-trained detection or feature extraction models.
    - Parameter efficiency: interestingly, the model replaces the multi-head attention with group convolution to keep the transformer design while saving computational cost. A time complexity comparison between the conventional Swin Transformer and this work is discussed and provided. This operation is based on the observation that it is the structure of the transformer that makes things work, not the self-attention.
    - Evaluation: this paper evaluates on the surgical video captioning task on large public datasets and compares with suitable baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - Limited novelty: the paper's main improvement comes from the visual backbone, which is the Swin Transformer. However, there is little discussion about why a unified structure is needed and whether other, more powerful ConvNets are unsuitable for the captioning task.
    - Experiments: the paper replaces the multi-head self-attention with a multi-head MLP and provides a time complexity comparison. However, the third row in Table 1 does not show the FPS, and Table 3 does not show any parameter improvement compared to the Swin Transformer.
    - Baselines: the baseline methods in the experiments are not state-of-the-art. I think there are many captioning methods using fully transformer-based architectures.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper appears reproducible, and the datasets used for training and testing are public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    - This is great work; however, I would like to see more discussion. For example, the difference between Swin's encoder and this work's encoder is the multi-head MLP implemented by group convolution; how does that differ from a conventional convolutional network?
    - The decoder is a standard transformer structure; can we replace it with other sequence modeling methods, such as LSTM or GRU?
    - For the video model, is there any fundamental improvement compared to the 2D-based method? Can we simply use a 2D-based method to handle video captioning?
    - The comparison should be complete and fair. This work uses a larger window size than the Swin Transformer; I think it is better to evaluate Swin with the same setting in Table 1. Otherwise, I am left unsure whether the window size affects the performance or the modification boosts the performance.
    - Inference speed: FPS problem as indicated above.
    - In the ablation in Table 2, why show the patch size for the vanilla Transformer but not for Swin-MLP? Also, it seems that a patch size of 16 works better, so why is a patch size of 4 finally chosen in Table 3? The Param column in Table (b) does not show any difference between this work and the Swin Transformer; is there any explanation?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Experiments

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors still need to clarify why the unified, extractor-free structure is needed and what its use case is in the medical field, e.g., easier hardware design. Also, the FPS improvement is not major, despite being one of the main contributions.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper addresses an interesting topic that is still not well explored. It is encouraging to see studies proposing to apply advanced vision techniques to this domain. The paper is well written with clear demonstrations.

    There are also several weaknesses that the authors should try to address in their final submission. The authors claim that their design has a lower computational cost; however, this is not well justified in the results section. A runtime comparison with other counterparts and an efficiency analysis is suggested, so that the efficiency improvements can be quantitatively demonstrated. Besides, the authors mention “real-time robotic surgery”; could this be justified by numerical results as well? Qualitative comparisons with SOTAs on captioning are preferred but missing from the manuscript. The quantitative results in Table 1 do not look convincing enough to demonstrate the superiority of the proposed method. The metrics for SwinMLP are significantly improved compared to Swin; however, the performance of Swin and SwinMLP should be comparable, and Swin should also be comparable to Transformer. It would be good to see whether there is any rationale for this difference, as it makes the results less trustworthy.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank the reviewers for their mostly positive feedback and critical assessment of our work. It is encouraging that they all found our approach clear and interesting, and that our research topic has not been well explored before [R1, AC]. We address the reviewer comments below and will incorporate all the feedback and the code in the final version.

  1. Lack of technical contribution. As recognized by [R3], we designed a pioneering end-to-end captioning model, SwinMLP-TranCAP, which aims to overcome the limitation that conventional captioning models need an object detector and feature extractor. Our model eliminates these intermediate models by using patches as input. Meanwhile, our model reduces the computational cost by adopting the shifted window and replacing self-attention with MLP, revealing that the overall structure of the Transformer is crucial rather than the self-attention module. To adapt the shifted window to the captioning task, which is more challenging than simple classification tasks [R1], we must consider the interaction between the vision encoder and the language decoder. Moreover, as [R1, R2, R3] noticed, we also design a video captioning model using 3D patches.

  2. Lack of quantitative proof of lower computational cost and real-time applicability. We will add the following overall metrics to Table 1. Our Swin-Tran-L is closer to Swin-B [10] [R2]. Our approach achieves better captioning performance/efficiency trade-offs.

     Model: FPS, N_Parameters (M), GFLOPs
     YLv5x(Res)+Tran: 9.368, 97.88+46.67, 1412.8+25.88
     FasRCNN(Res)+Tran: 8.418, 28.32+46.67, 251.84+25.88
     Res+Tran: 11.083, 11.69+46.67, 1.82+25.88
     Our SwinTran-TranCAP: 10.604, 165.51, 19.59
     Our SwinMLP-TranCAP: 12.107, 99.11, 14.15

     Model: B4, MET, SPI, CID
     FasRCNN(Res)+Self-Seq: 0.295, 0.283, 0.496, 1.801
     FasRCNN(Res)+AOA: 0.377, 0.371, 0.580, 1.811
     FasRCNN(Res)+Tran: 0.363, 0.323, 0.512, 2.017

  3. The results in Table 1 do not look convincing.
    Our purpose is not to provide better results but to get rid of the object detector and feature extractor from the conventional captioning system training pipeline for more flexible training, less computation cost, and faster inference speed without sacrificing performance. Surprisingly, our approach also obtained a slightly better quantitative performance.

  4. Why is SwinMLP better than Swin and Transformer? Avoiding overfitting, correct use of local biases, pyramid structures, and control of computational complexity are the keys to designing good vision models [21]. Compared with Transformer/ViT, Swin-TranCAP injects local bias back into the network by constraining the self-attention operation within the local window. This setup also controls the computational complexity (the standard complexity expressions behind this argument are sketched after this list). SwinMLP-TranCAP replaces self-attention with an MLP, reducing the number of parameters, which helps avoid overfitting and lowers computational complexity. It also allows the use of pyramid structures and multi-stage processing. These operations help the model learn good visual features. [21] Tang et al., “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”

  5. Comparison with 3D ResNet. Can we use a 2D method? “3DRes18+Tran”: MET=0.345, SPI=0.586, CID=2.757. A “spatial encoder + temporal encoder + decoder” design can handle video captioning with a 2D-based method; the limitation is that it requires employing more encoders.

  6. Comparison with SOTA [20]. Table 1 contains the results from [20], which is composed of “Res[7]+Tran[5]” (B4=0.454, CID=4.283). [20] employs Tran[5] and does not make any modifications to the model. We used the same code library as [20].

  7. Window size M of 14 vs. 7. On the DAISI dataset, SwinMLP-Tran-L (M=7): B4=0.434, CID=4.046; SwinMLP-Tran-L (M=14): B4=0.459, CID=4.272. We have found that larger window sizes provide slightly better results.
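  As supplementary context for points 2 and 4 above (these expressions come from the Swin Transformer paper rather than from the rebuttal itself), the standard complexity comparison illustrates why restricting token mixing to local M x M windows keeps computation linear in the number of tokens. For a feature map of h x w tokens with C channels:

      \Omega(\mathrm{MSA}) = 4\,hwC^{2} + 2\,(hw)^{2}C
      \Omega(\mathrm{W\text{-}MSA}) = 4\,hwC^{2} + 2\,M^{2}hwC

  The first term accounts for the linear projections and the second for token mixing, which drops from quadratic to linear in hw once mixing is restricted to windows of M^2 tokens. Replacing windowed attention with a windowed MLP additionally removes the query/key/value and output projections, which is consistent with the lower GFLOPs reported for SwinMLP-TranCAP above.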




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I would like to thank the authors for their convincing response and their willingness to add these results to the final paper. I think this is an interesting work and of interest to the MICCAI readership.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper generally received positive reviews from all reviewers. Although comparison with other SOTA algorithms is missing, especially in terms of quantitative performance metrics, this paper has sufficient novelty and clinical applicability that would interest the CAI community working on image-guided surgeries.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents an end-to-end detector- and feature-extractor-free captioning model for surgical videos. The framework is validated on two public datasets and compared against baseline approaches. The paper is well written, tackles an interesting and less-explored topic, and the choice of architectures is novel. The reviewers' comments around model efficiency, speed, and other result metrics should be incorporated in the final version of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8


