
Authors

Wenjun Lin, Yan Hu, Luoying Hao, Dan Zhou, Mingming Yang, Huazhu Fu, Cheekong Chui, Jiang Liu

Abstract

Instrument-tissue interaction detection in surgical videos is a fundamental problem for surgical scene understanding which is of great significance to computer-assisted surgical systems. However, few works focus on this fine-grained surgical activity representation. In this paper, we propose to represent instrument-tissue interaction as ⟨instrument bounding box, tissue bounding box, instrument class, tissue class, action class⟩ quintuples. We present a novel quintuple detection network (QDNet) to address the instrument-tissue interaction quintuple detection task in cataract surgery videos. Specifically, a spatiotemporal attention layer (STAL) is proposed to aggregate spatial and temporal information of the regions of interest between adjacent frames. We also propose a graph-based quintuple prediction layer (GQPL) to reason the relationship between instruments and tissues. Our method achieves an mAP of 42.24% on a cataract surgery video dataset, significantly outperforming other methods.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_38

SharedIt: https://rdcu.be/cVRXd

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a neural network architecture that jointly localizes and classifies the instruments and the tissues interacting with them, and classifies the action type. To improve instrument and tissue detection performance, the authors exploit joint spatio-temporal information via a spatiotemporal attention layer (STAL). A graph convolutional network is adopted to boost quintuple detection by reasoning about the relations between instruments and tissues.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed spatio-temporal attention layer can effectively take advantage of the domain-specific spatio-temporal features to boost quintuple detection performance. The experimental results confirm the effectiveness of these additional components in enhancing instrument and tissue localization performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The proposed neural network architecture is not reproducible since many important implementation details are missing. These details include: a) The dimensions of input and output feature maps of different sub-networks and layers (STAL, FC, and GQPL). b) The operations’ details (size and number of the kernels in all convolutional layers after the ROI align operations, the location and type of the adopted activation functions). c) The number of trainable parameters in the proposed architecture compared to the rival approaches, and the number of trainable parameters in each evaluated network in the ablation study.

    2) Regarding STAL (the spatio-temporal attention layer), it seems that feature aggregation is mistakenly formulated using the term “concatenation”. The authors should note that concatenating N feature maps of size $C \times H \times W$ results in a new feature map with dimensions $(N*C)\times H \times W$. Besides, the authors mention that the visual features of the current frame are augmented by addition. Hence, I would expect that the features which are added together should have the same dimensionality. All mentioned feature maps and vectors (including queries, keys, and value vectors) should be formulated using the details of the layers they pass through (e.g., linear layers, convolutional layers, or activation functions).

    3) It seems that the proposed multi-task learning approach is inspired by the previous work related to instrument-tissue interaction [16]. However, the authors have not provided any comparative results to show the superiority of the proposed network over this important reference. Although this competing approach does not localize the instruments and tissues, the proposed quintuple detection network would be expected to outperform it in instrument, tissue, and action classification.

    4) The evaluation metrics adopted in this paper cannot reveal the network’s performance in quintuple detection. The two metrics used in this paper only consider the average precision for the instruments, tissues, or joint instruments-tissues. Indeed, while “action” appears to be the main component of the quintuple, no metric is used to demonstrate the network’s ability to classify the actions. I would expect the authors to provide comparative results of mAP for joint instrument-tissue-action (as in triplet recognition performance measurement in [16], but for quintuple).

    5) Regarding the dataset, the authors have mentioned that the frames which do not include any instrument-tissue interaction are removed. However, the prepared network exploits a number of consecutive frames before each keyframe for spatio-temporal feature extraction and refinement. Which strategy is adopted when the reference frames are removed? In case the frames with no instrument-tissue interaction should be removed before evaluations, this method has a major weakness of relying on manual annotations for the test sets.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Because many details are missing (the convolutional and fully connected operations, activation functions, sizes of the output feature maps of different layers, number of trainable parameters, and other important details), the paper is not reproducible.

    In case the dataset will not be released: Since the dataset will not be released with the acceptance of the paper, there is no possibility to reproduce the results and explore the subject further. Indeed, only the authors themselves can improve the results provided, which is regarded as a major weakness of the current paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I would suggest that the authors try to address the mentioned weaknesses. In particular, the authors should formulate all operations in the STAL and STAM, with a detailed description of convolutional and fully connected layers and feature maps’ dimensions.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See main weaknesses.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose to represent instrument-tissue interaction as ⟨instrument bounding box, tissue bounding box, instrument class, tissue class, action class⟩ quintuples by extending the earlier works that represent them as triplets. Moreover, they localize these quintuples. They propose QDNet, which aggregates spatial and temporal information through the use of a spatiotemporal attention layer (STAL) and a graph-based quintuple prediction layer (GQPL) which is able to infer tool-tissue relationships.

    As part of QDNet, they propose a spatiotemporal attention layer (STAL) to aggregate spatial and temporal information of the regions of interest between adjacent frames, and a graph-based quintuple prediction layer (GQPL) to infer the relationship between instruments and tissues.

    They build a cataract surgery video dataset with annotations named Cataract Quintuple Dataset. According to what is stated in the Reproducibility checkbox list, the authors intend to share this dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method extends the state-of-the-art models of representing instrument-tissue interaction as triplets to quintuples of ⟨instrument bounding box, tissue bounding box, instrument class, tissue class, action class⟩. Moreover, they model these quintuples in a localized manner.

    As part of QDNet, they propose a spatiotemporal attention layer (STAL) to aggregate spatial and temporal information of the regions of interest between adjacent frames. STAL is a modification of the commonly used spatio-temporal attention module (STAM) but aggregates spatial and temporal information of the ROIs instead.

    They build a cataract surgery video dataset with annotations named Cataract Quintuple Dataset. According to what is stated in the Reproducibility checkbox list, the authors intend to share this dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    They propose a graph-based quintuple prediction layer (GQPL) to infer the relationship between instruments and tissues, and this is emphasized as a novel contribution; however, there are examples of similar works in the surgical domain (such as the somewhat recent work (2021) by Islam et al. titled STAN: Spatio-Temporal Attention Network for Next Location Recommendation).

    A literature review on both the generic object-object interaction graph representations, and particularly tool-tissue interaction graph representations is missing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have checked the boxes relating to the release of the source code and the dataset, which is specifically built and annotated for this study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    A literature review on both the generic object-object interaction graph representations and particularly tool-tissue interaction graph representation should be added. It should be clarified that a similar approach was proposed by some earlier works in the surgical domain, and the authors should distinguish their approach in comparison to these works.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method extends the state-of-the-art models of representing instrument-tissue interaction as triplets to quintuples adding localization. The stated contribution STAL is a modification of the commonly used spatio-temporal attention module (STAM) but aggregates spatial and temporal information of the ROIs instead.

    Although a graph-based quintuple prediction layer (GQPL) to infer the relationship between instruments and tissues is proposed as a contribution, the authors do not address the state-of-the-art models in generic object-object graph representations, nor the tool-tissue interaction graph representations in the surgical domain, such as the somewhat recent work (2021) by Islam et al. titled STAN: Spatio-Temporal Attention Network for Next Location Recommendation.

    It should be clarified that a similar approach was proposed by some earlier works in the surgical domain, and the authors should distinguish their approach in comparison to these works.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    I genuinely apologize for citing the wrong paper. Even though I cited the correct author, I somehow wrote the wrong title (possibly a copy-paste issue). However, a quick literature search by topic or by author finds the relevant paper: https://arxiv.org/abs/2007.03357 “Learning and Reasoning with the Graph Structure Representation in Robotic Surgery” by Mobarakol Islam, Lalithkumar Seenivasan, Lim Chwee Ming, Hongliang Ren



Review #3

  • Please describe the contribution of the paper

    The paper presents an approach for instrument-tissue interaction detection in surgery videos. In doing so, the authors use a Quintuple detection network (QDNet) and apply it to cataract surgery videos.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Detailed method description
    • Real data
    • Ablation study
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Minor improvement
    • Limited data
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methods have been described in detail and should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    For me, this paper is more of a “pipeline” paper, where several known methods have been stacked together for a specific application. Per se, that is not a bad thing; not every paper has to invent an absolutely novel algorithm. However, I am not so impressed by the results; they seem rather minor to me compared to previous works. For me, it is hard to judge what “effect” these better results have on the surgery, and the authors should comment on that. Another limitation is the dataset, which seems to be an in-house one: “we build a cataract surgery video dataset” “labeled frame by frame … under the direction of ophthalmologists”. With this little information, it is hard to say how valid this dataset is. In summary, the paper is borderline for me.

    Minor comment: Mean Average Precision (mAP) should be defined earlier in the manuscript (actually it should be defined when it first appears in the text).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and easy to follow. There has been an evaluation on real data and an ablation study. However, the results could be better relative to the state of the art, and it is not clear to me how the improved results affect the application. The dataset is limited in my opinion.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The manuscript received quite detailed reviews, and there is general agreement that the method is a good fit for the considered clinical task, that the methods work well, and that the dataset is relevant. However, the reviewers raise various concerns that should be clarified. 1) There is conflicting perception of the nature of the dataset: while some reviewers appreciate the fact that it will be made public with the annotations produced in this effort, others understood that the dataset will remain private, which was perceived as a major concern (especially since the method was not evaluated on other public datasets, suggesting that others may not be able to properly benchmark against this method). This needs to be clarified, ideally including an estimated timeline should the dataset be made public. 2) There is conflicting perception of the adequate level of detail: while some reviewers perceived the details to be sufficient, others felt that a quite substantial amount of detail was missing. This needs to be clarified and addressed. 3) The evaluation is perceived as a weakness: choice and omission of baselines, inadequate evaluation metrics, and a perceived minor improvement over other methods. 4) Some concerns about the completeness of the related work, which is easily addressed (minor).

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4/17 (~20th percentile)




Author Feedback

We thank the reviewers for their high-quality comments. Below we provide point-by-point responses to the comments, which will be integrated into the final version.

[Q] Dataset (R1/R3) [A] We plan to make the dataset publicly available after our journal paper is published.

[Q] Performance comparison with citation [16] (R1) [A] With localization information, our QDNet outperforms Tripnet [16] by a large margin. We add an experiment using mAP_IVT [16] to compare with Tripnet. On our dataset, Tripnet achieves 33.14% mAP_IVT, while our QDNet achieves 58.84% mAP_IVT, 25.70 percentage points higher than Tripnet.

[Q] Evaluation metrics (R1) [A] Our mAP_ITI is based on role mAP [7], which is commonly used in human-object interaction detection. A true positive detection should satisfy two conditions: 1. correct instrument-tissue-action class; 2. both instrument and tissue boxes have an Intersection over Union (IoU) higher than 0.5 with the corresponding ground truth. We adopt two evaluation metrics: mAP_IT for instrument and tissue detection, and mAP_ITI for instrument-tissue-interaction quintuple detection.
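The two true-positive conditions above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the dictionary keys, the `(x1, y1, x2, y2)` box format, and the function names `iou` and `is_true_positive` are assumptions made for illustration, and the matching of predictions to ground truth and the averaging into mAP are omitted.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, iou_thr=0.5):
    """Condition 1: classes match. Condition 2: both boxes have IoU > iou_thr.

    The key names below are hypothetical, chosen only for this sketch.
    """
    classes_match = all(pred[k] == gt[k] for k in ("instrument", "tissue", "action"))
    boxes_match = (
        iou(pred["instrument_box"], gt["instrument_box"]) > iou_thr
        and iou(pred["tissue_box"], gt["tissue_box"]) > iou_thr
    )
    return classes_match and boxes_match


# Made-up example values, not taken from the dataset.
gt = {
    "instrument": "forceps", "tissue": "cornea", "action": "grasp",
    "instrument_box": (0, 0, 10, 10), "tissue_box": (20, 20, 40, 40),
}
exact = dict(gt)
wrong_action = dict(gt, action="cut")
shifted_box = dict(gt, instrument_box=(5, 0, 15, 10))  # IoU with gt box is 1/3
```

With these values, `exact` counts as a true positive, while `wrong_action` fails condition 1 and `shifted_box` fails condition 2.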

[Q] Minor improvement (R3) [A] There are two significant improvements in this paper. First, to the best of our knowledge, we are the first to introduce instrument-tissue interaction quintuple detection. Our work is inherently innovative and the baseline is created on our own, so the whole experiment is novel. Second, our QDNet improves quintuple detection by about 5.35% in mAP_ITI compared with the Faster RCNN baseline, which is significant.

[Q] Strategy on the test set (R1) [A] There is no special strategy adopted on the test set. For easier model training and comparison, non-interaction frames are removed first as there is no valid interaction to calculate the loss. We select r frames that are kept in front of the key frame as reference frames. Here, we add an experiment on a test set consisting of full videos, in which non-interaction frames are included. Our QDNet still works well. Constrained by space, we only choose three competitive methods and report the mAP_ITI:

Method / mAP_ITI
Faster RCNN / 32.35%
iCAN / 32.61%
Zhang et al. / 34.92%
Our QDNet / 36.51%
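The reference-frame selection described in this response (drop non-interaction frames, then take the r kept frames in front of the key frame) can be sketched as below. This is an illustration under assumptions, not the authors' pipeline: the function name `reference_frames`, the per-frame boolean flags, and the behaviour when fewer than r kept frames precede the key frame (return what is available) are all hypothetical.

```python
def reference_frames(frame_ids, has_interaction, key_id, r=3):
    """Return up to r kept frames immediately preceding the key frame.

    frame_ids and has_interaction are parallel lists; frames whose flag is
    False are removed before the reference frames are chosen.
    """
    kept = [f for f, flag in zip(frame_ids, has_interaction) if flag]
    pos = kept.index(key_id)  # ValueError if the key frame itself was removed
    # Assumption: near the start of a video, fewer than r frames are returned.
    return kept[max(0, pos - r):pos]


# Made-up example: frames 1 and 4 contain no instrument-tissue interaction.
frame_ids = list(range(8))
flags = [True, False, True, True, False, True, True, True]
refs_for_6 = reference_frames(frame_ids, flags, key_id=6)
refs_for_2 = reference_frames(frame_ids, flags, key_id=2)
```

Here the kept frames are 0, 2, 3, 5, 6, 7, so key frame 6 receives reference frames 2, 3, 5, and key frame 2 receives only frame 0.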

[Q] Implementation details (R1) [A]

  1. Model details For better comparison, we will release the code and add the following details in the final version. We follow the Faster RCNN with RoI Align and linear box head to extract RoI features whose size is n×c, as input for STAL. In STAL, RoI features are passed through linear layers (c × c/N). Also, binary box maps are reshaped to n × hw vectors and passed through linear layers (hw × c/N). After concatenating N STAM outputs and addition in equation (1), the output feature size of STAL is n×c. For GQPL, the input is the RoI features of detected boxes (n’×c). In GQPL, (c2×1) fully-connected layers are used to calculate weight, and (c2×a) fully-connected layers are used to predict action. The size of GQPL action outputs is n’×a, where a is 1 + the number of action classes. In the experiments, n is the number of boxes (n=512 for the keyframe, n=512*r=1536 for 3 reference frames), and c, N, h, and w are set to 1024, 8, 64, and 80 respectively.

  2. Feature aggregation in equation (1) Each STAM output has 1/N of the number of input channels. To keep the same dimension for the skip connection in equation (1), the N STAM outputs must therefore be aggregated by concatenation, as the paper states.
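The dimension argument above can be checked with a small shape sketch. It only illustrates the bookkeeping (N outputs of width c/N concatenated back to width c, then added to the n×c RoI features); it is not the STAL implementation, the attention computation is replaced by constant placeholder values, and the helper names `concat_heads` and `add_skip` are hypothetical. Small sizes stand in for the values quoted in the rebuttal (n=512, c=1024, N=8).

```python
def concat_heads(head_outputs):
    """Concatenate per-head (n x c/N) matrices along the channel axis."""
    n = len(head_outputs[0])
    return [sum((head[i] for head in head_outputs), []) for i in range(n)]


def add_skip(x, y):
    """Element-wise addition; fails if the two feature matrices differ in shape."""
    assert len(x) == len(y) and all(len(a) == len(b) for a, b in zip(x, y))
    return [[a + b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


n, c, num_heads = 4, 16, 8          # stand-ins for n=512, c=1024, N=8
head_dim = c // num_heads           # each STAM output has c/N channels

roi_features = [[1.0] * c for _ in range(n)]
# Placeholder STAM outputs: head k is filled with the constant k.
stam_outputs = [[[float(k)] * head_dim for _ in range(n)] for k in range(num_heads)]

merged = concat_heads(stam_outputs)      # n x (N * c/N) = n x c
stal_out = add_skip(roi_features, merged)  # same shape as the input, n x c
```

Summing or averaging the N outputs instead would leave n × c/N features, which could not be added to the n×c skip branch without a further projection.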

[Q] Related work (R2) [A] We find the paper titled STAN: Spatio-Temporal Attention Network for Next Location Recommendation written by Luo instead of Islam. This paper focuses on location-based services in the information system area, not the surgical domain. The proposed method does not include any graph representation, either. So there is no similarity with our GQPL. We will add related works about generic object-object interaction graph representations and tool-tissue interaction graph representations in the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The strengths of this work include the method’s fit for the considered clinical task, its performance, and the relevant dataset (which will supposedly be made publicly available, but only after a journal publication, making this aspect somewhat irrelevant for this decision). The primary weaknesses raised in the initial review were minor performance increases and limited novelty. Both of these shortcomings, to my understanding, were addressed convincingly by arguing for quintuple detection as a new task (implying novelty per se, with none of the reviewers objecting) and by adding a new baseline method that was suggested during review. It is thus my perception that the paper can be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper generally received positive reviews from all reviewers. The main concerns from the reviewers were the lack of novelty and minimal performance improvement. The authors have addressed the reviewers’ concerns in the rebuttal. Reviewer 2 comments on the details of the implementation can be addressed in the final version of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I thank the authors for their effort in addressing the questions raised by the reviewers. I encourage the authors to incorporate their answers on Strategy on the test set, Implementation details and Related work into the final version. The reviewer has also corrected the reference, the correct one is Islam, M, et al. “Learning and reasoning with the graph structure representation in robotic surgery.” MICCAI 2020.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


