
Authors

Natalia Valderrama, Paola Ruiz Puentes, Isabela Hernández, Nicolás Ayobi, Mathilde Verlyck, Jessica Santander, Juan Caicedo, Nicolás Fernández, Pablo Arbeláez

Abstract

Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset’s multi-level annotations as it benefits from the learned representation on the instrument detection task to improve its classification capacity. Our experimental results in both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
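A minimal sketch of the multi-level idea described above: long-term heads (phase, step) read a global clip embedding from a video transformer backbone, while short-term heads (instrument, atomic action) combine that embedding with per-detection box features. This is not the authors' implementation; all dimensions, class counts, and module names are illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Illustrative multi-level classification heads (hypothetical sizes)."""
    def __init__(self, video_dim=768, box_dim=256,
                 n_phases=11, n_steps=20, n_instruments=7, n_actions=16):
        super().__init__()
        # Long-term heads use only the global clip embedding.
        self.phase_head = nn.Linear(video_dim, n_phases)
        self.step_head = nn.Linear(video_dim, n_steps)
        # Short-term heads combine the clip embedding with each detected-box feature.
        self.instrument_head = nn.Linear(video_dim + box_dim, n_instruments)
        self.action_head = nn.Linear(video_dim + box_dim, n_actions)

    def forward(self, clip_feat, box_feats):
        # clip_feat: (B, video_dim) embedding from a video transformer backbone
        # box_feats: (B, N, box_dim) features of N instrument detections per clip
        phase_logits = self.phase_head(clip_feat)
        step_logits = self.step_head(clip_feat)
        fused = torch.cat(
            [clip_feat.unsqueeze(1).expand(-1, box_feats.size(1), -1), box_feats],
            dim=-1)
        return (phase_logits, step_logits,
                self.instrument_head(fused), self.action_head(fused))

# Toy usage with random features standing in for backbone/detector outputs.
heads = MultiTaskHeads()
outputs = heads(torch.randn(2, 768), torch.randn(2, 5, 256))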

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42

SharedIt: https://rdcu.be/cVRXh

Link to the code repository

https://github.com/BCV-Uniandes/TAPIR

Link to the dataset(s)

https://github.com/BCV-Uniandes/TAPIR

https://www.synapse.org/#!Synapse:syn21776936/wiki/601701

https://endovissub-instrument.grand-challenge.org/Data/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a novel dataset and method for benchmarking in the domain of surgical data science. The dataset, called “Phase, Step, Instrument, and Atomic Visual Action (PSI-AVA)”, contains eight radical prostatectomy surgeries (performed with the Da Vinci SI3000 Surgical System). The annotations go beyond currently available datasets, comprising hierarchical labels from atomic actions and surgical tool bounding boxes to surgical phase detection. The proposed method (TAPIR) establishes a strong baseline on this dataset and is shown to make use of the hierarchical annotations in PSI-AVA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • PSI-AVA: Novel dataset and benchmark code, shared publicly (license unclear, please indicate in paper)
    • Hierarchical/multi-level annotation and learning targets: compared to other datasets, a very comprehensive set of target labels (see Table 1, incl. phase/step/instrument recognition, instrument detection, action/task annotation, spatial annotations).
    • Novel method: “Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR)”, able to leverage multi-level annotations, from tool localization through task classification to surgical phase recognition. Experiments underline TAPIR’s ability to make use of hierarchical knowledge. This is shown in Table 3, where TAPIR performs better than e.g. the SOTA model “SlowFast”, both on the (non-hierarchically labeled) EndoVis challenge data and disproportionately better on the PSI-AVA dataset with hierarchical labels.
    • PSI-AVA intrinsic validation: the dataset is also briefly validated by comparing two non-TAPIR models (Faster R-CNN vs. Deformable DETR), which yield consistent performance on PSI-AVA (at similar FLOPs/#Params). This likely indicates high quality and consistency of the annotations in PSI-AVA.
    • Appropriate and informative supplementary material.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The dataset contains only 8 surgeries (but a total runtime of 19.1 hours, which is average compared to other datasets; see Table 1). The cross-validation is 2-fold, with 4 surgeries for training and 4 for validation, which further reduces the amount of training data.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The Reproducibility statement is filled out correctly, to my impression.
    • Very high reproducibility overall, as the authors state: “we will make publicly available the PSI-AVA dataset, annotations, and evaluation code, as well as pre-trained models and source code for TAPIR”, which is fully transparent.
    • The only missing information was the chosen license for dataset+annotations, TAPIR code, and pre-trained weights.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The paper is very strong on both the PSI-AVA and TAPIR aspects, and even stronger in combination. Needless to say, 8 radical prostatectomies is not much data, and it is arguable whether this covers a large range of variability, especially at the highest annotation level (surgical workflow deviations, anatomical anomalies causing backup/recovery workflow steps, etc.). The annotation effort at the PSI-AVA level must be enormous - nonetheless, if at all possible, it would be great to further increase this number, e.g. towards a journal extension, or by making this a yearly growing challenge, similar to how the BraTS dataset has grown over the years.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Very complex dataset with timely challenges for teams participating in benchmarking (hierarchical annotations, spatio-temporal inference from video, etc.).
    • Very strong reference model (TAPIR) that sets a solid baseline performance for competing teams once the dataset goes public.
    • Overall, an extremely valuable contribution to the community.
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a new dataset (PSI-AVA) with annotations for phase and step recognition, instrument detection, and the novel task of atomic action recognition in surgical scenes. The dataset is novel and unique and can serve as a new benchmark for multiple tasks in surgical video understanding.

    The paper also proposes TAPIR, a transformer-based method that leverages the multi-level annotations of PSI-AVA dataset. The authors show superior performance of their method compared to other baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The first strength of the paper is the introduction of a new dataset (PSI-AVA) with annotations for phase and step recognition, instrument detection, and the novel task of atomic action recognition in surgical scenes. The dataset is novel and unique and can serve as a new benchmark for multiple tasks in surgical video understanding.

    The paper is well written, with a sufficient level of detail on the dataset, methods, and validation strategy.

    The authors also propose a new transformer-based model for feature extraction in the spatio-temporal domain from surgical videos. Given the evidence in the computer vision community regarding the merit of vision transformer models, the paper and its validation can benefit the MICCAI community.

    The experimental validation is sufficient to support the claims made by the authors and can serve as a guideline for others to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    TimeSformer and Swin Transformer are two of the state-of-the-art models recently introduced in the computer vision community for video analysis tasks. For completeness, the authors should compare TAPIR with such methods to show that their method is actually state of the art.

    It would be interesting to see how TAPIR performs on other publicly available datasets like Cholec80.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is reproducible if the authors release the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    TimeSformer and Swin Transformer are two of the state-of-the-art models recently introduced in the computer vision community for video analysis tasks. For completeness, the authors should compare TAPIR with such methods to show that their method is actually state of the art.

    It would be interesting to see how TAPIR performs on other publicly available datasets like Cholec80.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper makes strong contributions in both the novelty of its dataset and its methods. The MICCAI community can benefit from such a dataset for various video analysis tasks.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces and validates the PSI-AVA dataset, with annotations for phase and step recognition, instrument detection, and the novel task of atomic action recognition in surgical scenes. In addition, the paper proposes TAPIR, a transformer-based method that leverages the multi-level annotations of the PSI-AVA dataset and establishes a stepping stone for future work on this holistic benchmark.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    There are two main strengths of this paper. First, the paper proposes a new dataset for phase and step recognition, instrument detection, and the novel task of atomic action recognition in surgical scenes. This dataset supports a variety of tasks and is publicly available, which is very helpful for researchers pursuing follow-up work in this field. Second, the authors propose a transformer-based framework.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The structure of the transformer is not shown in Figure 2. This is not conducive to the reproduction of the model.
    2. The tables have no bottom border, which detracts from the presentation.
    3. The authors compare against too few methods to convincingly verify the performance of the proposed method.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The structure of the transformer is not shown in Figure 2. This is not conducive to the reproduction of the model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The structure of the transformer is not shown in Figure 2. This is not conducive to the reproduction of the model.
    2. The tables have no bottom border, which detracts from the presentation.
    3. The authors compare against too few methods to convincingly verify the performance of the proposed method.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The dataset presented in this paper is valuable.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper contributes a new dataset (PSI-AVA) with fine-grained annotations of phase, step, instrument, and atomic visual action for 8 prostatectomy surgeries. A new method, TAPIR, is also proposed, based on transformers for spatio-temporal feature extraction. The paper is well written and clearly presents the dataset and the method. The model is validated through thorough comparison with the SOTA. Based on the reviewers’ feedback, a provisional accept is recommended. I encourage the authors to further improve the paper by incorporating the reviewers’ comments and providing justifications for the experimental design choices, i.e., only 8 surgical videos with 2-fold cross-validation, and only a limited set of SOTA methods compared.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers for their careful evaluation of our work and their interest in our contributions. All reviewers appreciate our novel benchmarks and methodology to jointly address all proposed tasks, including the challenging Atomic Action recognition problem. We welcome their constructive feedback, and we will include their comments and suggestions in the final version of the paper.

Concerning R1’s question, all the resources of this project (the PSI-AVA dataset, annotations, TAPIR source code, and pre-trained weights) will be publicly available under the MIT license on our group’s GitHub page. Also, we appreciate R3’s suggestion to illustrate the transformer structure in Fig. 2 and will include a more detailed architecture figure in the Supplementary Material. Moreover, we thank R3 for the comments on the presentation of the document, and we will add a bottom border to all of the document’s tables. The final version will also include extended comparisons with state-of-the-art models, particularly with the novel Video Swin Transformer approach.

We would like to assuage reviewers’ concerns about possible practical shortcomings of our experimental methodology, particularly regarding the total number of videos. PSI-AVA results from extensive data acquisition and annotation efforts within an academic setting, requiring our three medical co-authors’ specific expertise and commitment throughout the project. To assess the adequacy of our benchmarks, we conducted extensive comparative experimentation on referential public datasets for surgical scene analysis. Our results in Table 3 show complete consistency in the relative order of methods across tasks and datasets, strongly supporting the suitability of PSI-AVA as the first public testbed for simultaneous instrument detection and hierarchical action classification in surgical videos.

We look forward to extending our work in the future to a broader validation setting, including the public release of more annotated surgeries. For the time being, we hope that PSI-AVA and TAPIR will provide a solid grounding for our community in building the path towards holistic surgical scene understanding.


