Authors

Rohan Raju Dhanakshirur, K. N. Ajay Shastry, Kaustubh Borgavi, Ashish Suri, Prem Kumar Kalra, Chetan Arora

Abstract

Surgical tool classification and instance segmentation are crucial for minimally invasive surgeries and related applications. Though most state-of-the-art methods for instance segmentation in natural images use transformer-based architectures, they have not been successful for medical instruments. In this paper, we investigate the reasons for the failure. Our analysis reveals that this is due to incorrect query initialization, which is unsuitable for fine-grained classification of highly occluded objects in a low data setting, typical for medical instruments. We propose a class-agnostic Query Proposal Network (QPN) to improve the query initialization provided to the decoder layers. Towards this, we propose a deformable-cross-attention-based learnable Query Proposal Decoder (QPD). The proposed QPN improves the recall rate of the query initialization by 44.89% at 0.9 IOU. This leads to an improvement in segmentation performance by 1.84% on Endovis17 and 2.09% on Endovis18 datasets, as measured by ISI-IOU. The source code can be accessed at https://aineurosurgery.github.io/learnableQPD.
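
To make the "recall rate at 0.9 IOU" figure in the abstract concrete, the following is a minimal sketch of how such a recall could be computed. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and counts a ground-truth box as recalled when at least one proposal overlaps it with IoU at or above the threshold. The function names and the toy boxes are hypothetical, and the paper's exact evaluation protocol is not described on this page.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(gt_boxes, proposals, threshold=0.9):
    """Fraction of ground-truth boxes matched by any proposal at the given IoU."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for gt in gt_boxes
        if any(box_iou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(gt_boxes)


# Toy data (assumed for illustration): the first instrument is covered exactly,
# the second only by a box that captures half of it.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(0, 0, 10, 10), (20, 20, 30, 25)]

strict_recall = recall_at_iou(gt_boxes, proposals, threshold=0.9)
loose_recall = recall_at_iou(gt_boxes, proposals, threshold=0.5)
print(strict_recall, loose_recall)
```

A high threshold such as 0.9 only rewards proposals that are already nearly correct, which is why it is a demanding test of query initialization.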

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_70

SharedIt: https://rdcu.be/dnwQh

Link to the code repository

https://github.com/AINeurosurgery/Learnable-QPD-for-maskDINO

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

This paper developed a framework based on MaskDino to handle the surgical instrument instance segmentation task. Specifically, a Query Proposal Network (QPN) is proposed for better region proposal generation. Experimental results on Endovis17 and Endovis18 demonstrate that it can achieve higher IoU compared to other existing methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Non-Maximal Suppression (NMS) is used for region proposal generation, which seems to be better than using the classification logits as a threshold.
- Experiments on public datasets (Endovis 17 and 18) make the results more reliable compared to an in-house dataset.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The novelty of the paper is limited: neither MaskDino nor Non-Maximal Suppression (NMS) is new. The paper reads more like a technical report than a scientific paper.
- Since this paper proposes a framework for surgical instrument instance segmentation based on video frames, it is necessary to report the FPS metric to see whether this model could be used in a real scenario.
- This manuscript should be further polished and some typos need to be fixed.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors claim that they will make the code public after the paper is accepted.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- It is not clear how the deformable cross attention module is implemented; a detailed description of this module should be included.
- It would be interesting to discuss why the performance decreases when adding more QPD layers without the help of NMS.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper utilizes NMS to generate more region proposals compared to using classification logits. The generated region proposals are further refined by the Query Proposal Decoder (QPD) for better instance segmentation results. The method is simple, yet seems to be effective according to the experimental results on both the Endovis 17 and 18 datasets.
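
The contrast drawn here between selecting proposals by classification score and selecting them with NMS can be illustrated with a small sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form with one score each; the helper names, the 0.5 IoU threshold, and the toy boxes are hypothetical and are not taken from the paper's implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def top_k_by_score(boxes, scores, k):
    """Indices of the k highest-scoring boxes, regardless of overlap."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def greedy_nms(boxes, scores, iou_threshold=0.5, k=None):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if k is not None and len(keep) == k:
                break
    return keep


# Three near-duplicate boxes on one object plus a low-scoring box elsewhere.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 1, 10, 11), (50, 50, 60, 60)]
scores = [0.9, 0.85, 0.8, 0.3]

by_score = top_k_by_score(boxes, scores, k=2)
by_nms = greedy_nms(boxes, scores, iou_threshold=0.5, k=2)
print(by_score, by_nms)
```

In this toy case, score-based selection spends both slots on the same object, while NMS keeps one box per region, which matches the diversification effect the reviews attribute to NMS.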
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #1

Please describe the contribution of the paper

The authors propose a query proposal network for improved query initialization as an input into decoders for surgical instrument segmentation in video images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors identify incorrect query initialization as the main reason for the poor performance of transformer-based object detectors.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Compared to the reported state-of-the-art performance of alternate algorithms, the demonstrated improvement on Endovis 17 and 18 is 2.09% or less. That is not very impressive.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors intend to make all relevant materials available once the paper is published.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Overall, this is a well-written paper. My only real complaint is that the improvements offered are marginal.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The improvements reported are marginal.
Reviewer confidence

Not confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

For the medical instrument instance segmentation tasks, the authors investigate the cause of failure of transformer-based object detectors. The result of their analysis is that incorrect query initialization is the cause. In order to spread the proposals over the whole image, they propose to switch back to NMS-based proposal selection in transformers. They show a 1.84% improvement over the best performing SOTA technique on Endovis17 and 2.09% on Endovis18, measured by ISI-IOU.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors investigate the reason for the failure of transformer-based object detectors for medical instrument instance segmentation tasks. They identify the reason as incorrect query initialization. Their solution is relatively simple. However, their proposed method outperforms the state-of-the-art on two datasets.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Some technical details are unclear.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors declare that the codes and models will be published after acceptance of the paper. They use public datasets, EndoVis17 and EndoVis18. The hyperparameters are reported in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

[Major Comments] (1) The proposed method uses Mask DINO as the backbone architecture as mentioned in section 3. Besides Query Proposal Network (QPN), does the proposed method use the exact same architecture? For example, Mask DINO has “GT+Noise” as input to its decoder, but this is not shown in the proposed architecture in Figure 1. Please clarify how this differs from Mask DINO (besides QPN).

(2) Page 6 states “We set λ = [0.19, 0.24, 0.1, 0.24, 0.24]”. How did the authors determine these values?

(3) In Figure 2, the colors are not in accordance with the classes, so readers cannot tell whether the classification is correct. Please correct the colors.

(4) On page 7, it says, “After running NMS, the recall rate drops to 0.00%, but the queries are diversified. After running the same through the Query Proposal Proposal Decoder (QPD), the recall rate is observed to be 52.38%”. It is not clear why the recall rate increases from 0% to 52.38%. If the diversity of the queries is what matters, would it be acceptable to initialize the queries by providing the NMS with random region proposals in the query proposal network?

[Minor comments] (1) Page 6 says “It can be observed that the proposed technique outperforms the best performing SOTA methods by 5.26% in terms of challenge IOU for EV17”. But it should be 1.84% because the proposed method shows 77.80% and Mask DINO shows 75.96%.

(2) Typo: On page 5, “thruogh” should be “through”.
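
The correction requested in minor comment (1) can be checked with a few lines of arithmetic. The sketch below uses only the two challenge IOU values quoted in that comment (77.80% for the proposed method and 75.96% for Mask DINO); the variable names are illustrative.

```python
proposed_iou = 77.80   # challenge IOU of the proposed method on EV17, as quoted
mask_dino_iou = 75.96  # challenge IOU of Mask DINO on EV17, as quoted

# Absolute gain in percentage points.
absolute_gain = round(proposed_iou - mask_dino_iou, 2)

# Relative gain as a percentage of the Mask DINO score.
relative_gain = round(100 * (proposed_iou - mask_dino_iou) / mask_dino_iou, 2)

print(absolute_gain)  # 1.84
print(relative_gain)  # 2.42
```

Neither the absolute nor the relative difference between these two numbers gives 5.26%, which supports reporting the gain over Mask DINO as 1.84%.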
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors investigate the reason for the failure of transformer-based object detectors for the medical instrument instance segmentation tasks and identify the reason as incorrect query initialization. Their solution is relatively simple, applying NMS, but the proposed method outperforms the state-of-the-art (Mask DINO) on two datasets.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

All reviewers found the paper interesting and relevant to the field of surgical instrument instance segmentation tasks. The paper proposes a solution for the failure of transformer-based object detectors by identifying incorrect query initialization as the cause and applying NMS to spread the proposal over the whole image. The proposed method outperformed the state-of-the-art on two datasets.

Some areas do require clarification and improvement. It is essential to address all major comments from the reviewers, including adding discussions and justifications for completeness, improving the writing, and adding details about the experiments and a discussion of the limitations of the proposed method.

Author Feedback

We thank the reviewers for their insightful comments. We shall make modifications to the manuscript based on the reviewers’ comments. We shall add details of the proposed methodology, its differences from MaskDINO, and the implementation of deformable cross-attention. We shall maintain colour consistency in accordance with the classes in Figure 2. Further, experiments have shown that random query initialization in the “query proposal network” gives sub-optimal performance (76.2 Ch. IOU on Endovis 18 dataset). Results of the same, along with the ablation study for the choice of lambda (λ), shall be added to the manuscript. We observe a testing speed of 40 FPS on a standard 40GB Nvidia A100 GPU. The results of the same will be added to the manuscript. We will correct all the typos before the camera-ready version of the manuscript.
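
For readers unfamiliar with how a frames-per-second figure like the one above is typically obtained, the following is a generic timing sketch. It is not the authors' benchmarking procedure: `measure_fps`, `dummy_model`, and the synthetic frames are hypothetical stand-ins, and a real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import time


def measure_fps(process_frame, frames, warmup=2):
    """Return frames processed per second, after a few untimed warm-up calls."""
    for frame in frames[:warmup]:
        process_frame(frame)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_model(frame):
    """Stand-in for a segmentation model; it just sums the pixel values."""
    return sum(frame)


# Synthetic "frames": 50 flat lists of 1000 values each (assumed for illustration).
frames = [[i] * 1000 for i in range(50)]
fps = measure_fps(dummy_model, frames)
print(f"{fps:.1f} frames per second")
```

The measured value depends on hardware, input resolution, and batch size, so an FPS figure is most useful when reported together with those settings.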

Learnable Query Initialization for Surgical Instrument Instance Segmentation