
Authors

Aditya Murali, Deepak Alapatt, Pietro Mascagni, Armine Vardazaryan, Alain Garcia, Nariaki Okamoto, Didier Mutter, Nicolas Padoy

Abstract

Recently, spatiotemporal graphs have emerged as a concise and elegant manner of representing video clips in an object-centric fashion, and have shown to be useful for downstream tasks such as action recognition. In this work, we investigate the use of latent spatiotemporal graphs to represent a surgical video in terms of the constituent anatomical structures and tools and their evolving properties over time. To build the graphs, we first predict frame-wise graphs using a pre-trained model, then add temporal edges between nodes based on spatial coherence and visual and semantic similarity. Unlike previous approaches, we incorporate long-term temporal edges in our graphs to better model the evolution of the surgical scene and increase robustness to temporary occlusions. We also introduce a novel graph-editing module that incorporates prior knowledge and temporal coherence to correct errors in the graph, enabling improved downstream task performance. Using our graph representations, we evaluate two downstream tasks, critical view of safety prediction and surgical phase recognition, obtaining strong results that demonstrate the quality and flexibility of the learned representations.
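As a rough illustration of the pipeline described in the abstract, the following is a minimal, hypothetical sketch (not the authors' implementation) of multi-horizon temporal edge construction: nodes in frames separated by each horizon are linked when they share a predicted semantic class and a combined spatial-coherence (IoU) and visual-similarity (cosine) score clears a threshold. All function names, weights, and thresholds here are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def temporal_edges(frames, horizons=(1, 5, 15), thresh=0.5):
    """Connect nodes across frames at multiple temporal horizons.

    frames: list of frame graphs; each node is a dict with keys
            'box' ([x1, y1, x2, y2]), 'feat' (np.ndarray), 'cls' (int).
    Returns a list of edges ((t, i), (t + h, j), h). The equal weighting
    of spatial and visual terms is an illustrative choice.
    """
    edges = []
    for t, frame in enumerate(frames):
        for h in horizons:
            if t + h >= len(frames):
                continue
            for i, u in enumerate(frame):
                for j, v in enumerate(frames[t + h]):
                    if u['cls'] != v['cls']:  # semantic-similarity gate
                        continue
                    score = 0.5 * iou(u['box'], v['box']) + \
                            0.5 * cosine(u['feat'], v['feat'])
                    if score >= thresh:
                        edges.append(((t, i), (t + h, j), h))
    return edges
```

Including horizons larger than 1 is what gives the graph robustness to temporary occlusions: an object missing from a few frames can still be re-linked to its earlier instance.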

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_62

SharedIt: https://rdcu.be/dnwP9

Link to the code repository

https://github.com/CAMMA-public/SurgLatentGraph

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper contributes a method to encode surgical videos as latent spatiotemporal graphs that can be used without modification for two diverse downstream tasks. It also presents a framework for effectively modeling long-range relationships in surgical videos via multiple-horizon temporal edges, and introduces a Graph Editing Module that can correct errors in the predicted graph based on temporal coherence cues and prior knowledge.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel formulation: The paper introduces a method to encode surgical videos as latent spatiotemporal graphs, which allows for more effective learning and reasoning based on surgical anatomy. This object-centric approach retains implicit visual features and can be fine-tuned for various downstream tasks while maintaining differentiability.

    Graph Editing Module: The paper proposes a Graph Editing Module that leverages the spatiotemporal graph structure and predicted object semantics to efficiently correct errors in object detection. This module is capable of incorporating prior knowledge and constraints, providing robustness to a wide range of input graphs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Reliance on high-quality object detection: The object-centric approach employed in this paper relies on accurate object detection, which could be challenging, particularly for surgical videos. Although the authors propose a Graph Editing Module to address this issue, the method’s overall performance may still be limited by the quality of object detection.

    Limited evaluation tasks: The method is evaluated on two downstream tasks (CVS clip classification and surgical phase recognition), which, although diverse, might not fully represent the range of possible applications in surgical video analysis. Additional evaluations on other tasks or in different surgical domains would further demonstrate the method’s generalizability.

    No direct comparison with alternative methods: The paper lacks a detailed comparison with alternative approaches for surgical video analysis, such as other object-centric models or techniques that use different feature representations. This makes it harder to assess the relative strengths and weaknesses of the proposed method compared to the existing state-of-the-art techniques.

    Potential scalability issues: The method involves constructing latent spatiotemporal graphs for entire surgical videos, which could lead to scalability issues when processing longer videos or videos with high frame rates. The paper does not provide a thorough analysis of the method’s computational complexity or scalability.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors do not state in the paper that they will release the code. The video demo looks interesting and may be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Object detection quality: The proposed method relies heavily on accurate object detection, which could be challenging for surgical videos. While the Graph Editing Module addresses this issue to some extent, it would be helpful if you could further discuss the potential limitations of your approach due to the quality of object detection and explore possible solutions to overcome these limitations.

    Evaluation tasks and generalizability: Your method is evaluated on two downstream tasks, which, although diverse, might not fully represent the range of possible applications in surgical video analysis. It would be beneficial to evaluate the method on additional tasks or in different surgical domains to further demonstrate its generalizability and applicability in various scenarios.

    Comparison with alternative methods: The paper lacks a detailed comparison with alternative approaches for surgical video analysis, such as other object-centric models or techniques that use different feature representations. Providing a comparison with existing state-of-the-art techniques would strengthen the paper and better highlight the advantages of your proposed method.

    Scalability and computational complexity: The method involves constructing latent spatiotemporal graphs for entire surgical videos, which could lead to scalability issues when processing longer videos or videos with high frame rates. Please provide a thorough analysis of the method’s computational complexity and scalability, and discuss any potential limitations or strategies to address these challenges.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend this paper for its novel approach to encoding surgical videos as latent spatiotemporal graphs, enabling effective modeling of long-range relationships in surgical videos and facilitating downstream tasks such as CVS clip classification and surgical phase recognition. However, some aspects could be improved, such as further discussing the limitations due to object detection quality, evaluating the method on additional tasks or domains, providing a comparison with alternative methods, analyzing the computational complexity and scalability. Addressing these points would further strengthen the paper and showcase its potential impact in the field of surgical video analysis.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel approach to represent surgical videos using latent spatiotemporal graphs, which capture the relationships between anatomical structures and tools as they evolve over time. The method predicts frame-wise graphs, connects nodes with temporal edges based on spatial coherence and visual and semantic similarity, and incorporates long-term temporal edges for better scene evolution modeling. Additionally, a new graph-editing module corrects errors in the graph using prior knowledge and temporal coherence, improving performance in downstream tasks such as critical view of safety prediction and surgical phase recognition.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting technical solution for encoding spatiotemporal information in surgical videos using graphs
    • Tested on two tasks: critical view of safety (CVS) prediction and surgical phase detection
    • Tested on public datasets
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Graph neural networks have already been used for phase recognition and event detection along with TCN, as used by the authors. The authors did not elaborate on the potential of the framework on other tasks
    • The code and some implementation details haven’t been reported
    • Not extensively tested
    • Lack of statistical test
    • Graph improvement is not very well highlighted in the results
    • The manuscript should be a bit reorganized
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors rely on public datasets, but it is not reported whether the code or the training settings will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper presents a novel approach for representing surgical videos using latent spatiotemporal graphs. Furthermore, a new graph-editing module corrects errors in the graph using prior knowledge and temporal coherence, improving performance in downstream tasks such as critical view of safety prediction and surgical phase recognition. The manuscript is of interest to the MICCAI audience and focuses on a task of interest to the clinical community.

    Major strengths:

    The paper introduces an interesting technical solution for encoding spatiotemporal information in surgical videos using graphs. The proposed method has been tested on two tasks: critical view of safety prediction and surgical phase detection. The authors have tested their approach on public datasets, which enhances the reproducibility of their work.

    Major weaknesses:

    Graph neural networks have been used previously for phase recognition and event detection in conjunction with temporal convolutional networks (TCN), but the authors did not elaborate on the potential of their framework for other tasks. The code and some implementation details have not been reported, which may limit the reproducibility of the results. The method has not been extensively tested, potentially limiting its generalizability. The paper lacks statistical tests to validate the significance of the reported results. The improvement brought about by the graph-editing module is not well highlighted in the results section. The manuscript could benefit from reorganization to improve clarity and readability. The authors are encouraged to provide more graphical results or details to strengthen the evidence of the effectiveness of the proposed solution.

    Minor issues:

    The video in the supplementary material exhibits glitches at the beginning. The author should consider fixing this issue to ensure a smoother presentation of their work. In Table 2, the performance metrics for the F1 score should be consistent in terms of the number of digits displayed (e.g., 80.2 should be written as 80.20).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method proposed here seems to be a natural evolution of the SOTA and is an interesting solution for the tasks chosen by the authors. However, additional effort in results presentation and more experiments on other tasks (e.g., object detection/segmentation) would be worthwhile.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. Proposes a novel latent spatiotemporal graph representation for entire surgical videos that can be adopted for various downstream tasks such as critical view of safety (CVS) and surgical phase recognition.
    2. Incorporates temporal edges at multiple horizons: short-term (prior works exist) and long-term (introduced).
    3. Introduces a novel graph editing module that corrects errors in the graph (due to errors in object detection) by leveraging spatial-temporal graph structure and predicted object semantics.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work introduces a latent spatiotemporal graph representation of surgical videos that can be adopted for various downstream tasks. [novelty] a. The latent graphs include novel temporal edges at multiple horizons. [incremental novelty] b. A novel graph editing module edits the graph based on prior knowledge and temporal coherence. [novelty]
    2. The flexibility of the proposed latent spatiotemporal graph representation for surgical downstream tasks is quantitatively proven. a. The latent graph model outperforms (in most cases) or is on par with (in some cases) SOTA models on both CVS and phase recognition tasks, with and without single-frame finetuning.
    3. Ablation study: a. The ablation study in Table 3 quantitatively shows the significance of the modules introduced in this work (multiple-horizon temporal edges and the graph editing module) in improving the graph model's performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lacks qualitative analysis: No qualitative analysis/visualization of the module’s latent graph or output for downstream tasks is provided in the manuscript.
    2. Limited quantitative analysis: a. The performance of the latent spatiotemporal graph is only compared against 2 SOTA models on both tasks. The model could be benchmarked against other spatial-temporal graph models from the computer-vision domain. b. For each task, the model is evaluated on a single dataset and against fewer than 3 SOTA models. This raises concerns about dataset bias. In the absence of more SOTA model comparisons, a multi-fold cross-validation test could help strengthen the quantitative analysis. c. Quoting results of DeepCVS-R18 and LG-CVS from [15] in the Single Frame section of Table 1 doesn’t add much value to the manuscript nor provide any significant insights. If the authors wish to show the difference between single-frame vs. temporal, quoting one result would suffice. Furthermore, as these results were not reproduced and only quoted, it raises doubts about any performance change that could arise due to the library environment / initial random weight settings [minor weakness]. d. Quoting results of TeCNO [2] from [20] instead of reproducing the results also raises doubts about any performance change that could arise due to the change in the library environment / initial random weight settings [minor weakness]. Furthermore, the quoted result is inaccurate (‘80.2’ instead of the actual ‘80.3’ in [20]).
    3. Figure quality: Figure 1 is not manuscript ready. Its text and image/shape alignment must be improved.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Most of the training parameters are defined (in the manuscript/supplementary). Training/test code is not available at present. I assume it would be made public by the authors if the manuscript is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Improve Figure 1: a. Space out the components within the figure evenly. b. Improve the “graph encoder” block. In the reader’s view, it appears too small and cramped compared to the available blank spaces in the figure. c. Re-position the graph network (behind the GCN block) and the GCN block to avoid any overlap. d. Try to include a real latent graph of the processed video instead of a graphical representation of a latent graph. e. Space out the node feature block evenly.
    2. Include qualitative analysis: a. In my view, adding qualitative analysis of your model performance (latent graph output vs GT / output for downstream tasks) will significantly improve the manuscript. The table position can be adjusted or the introduction can be condensed to include qualitative analysis within the manuscript.
    3. Improve quantitative analysis: a. Reproduce SOTA models instead of quoting from paper for fair analysis (to remove any bias from system environment/weights initialization). b. Perform multi-fold cross-validation tests or evaluate the model on other datasets for CVS / phase recognition tasks. c. Compare against other spatial-temporal SOTA graph models from the computer vision domain for video graph representation or task-specific SOTA models.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main manuscript lacks qualitative analysis (available in the supplementary video), and in my view, the quantitative analysis against SOTA models for the downstream tasks is limited (fewer than 3 SOTA model comparisons, and some SOTA model results are quoted rather than reproduced). However, taking into consideration (i) the technical novelty (latent spatiotemporal graph representation with two incremental novelties), (ii) the ablation study that demonstrates improvement from the incremental novelties, and (iii) the flexibility of the model to be employed for various downstream tasks, I am inclined towards weak accept.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors’ feedback was useful and clarified a few things. However, in my view, it wasn’t significant enough to change my evaluation.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces a method for encoding surgical videos into latent spatiotemporal graphs, addressing two downstream tasks. The reviewers have raised several concerns about insufficient experimental validation both qualitatively and quantitatively, lack of discussion about the efficacy of graph-edit module, questionable generalizability, etc. I invite the authors to submit the rebuttal addressing these comments raised by reviewers.




Author Feedback

We thank the reviewers for their thorough and constructive feedback. To reiterate, we propose to represent surgical videos as latent spatiotemporal graphs that model each object’s evolution over time. This is in stark contrast to status-quo approaches to surgical video understanding, which largely rely on implicit feature representations from 2D or 3D backbones. We therefore need to answer two key questions through our experiments: (1) what benefits do object-centric methodologies offer compared to the status quo and (2) how does our approach compare to existing object-centric approaches in computer vision. R1-3 note that comparison to SOTA is not extensive, but by focusing on one method to represent the status quo and one object-centric approach, we are able to deeply address these conceptual questions. We test two vastly different tasks (CVS prediction, phase recognition), conduct ablation studies for both tasks, and explore various evaluation settings (ground truth availability for CVS, training complexity for phase), in each case analyzing why performance differs across approaches. Moreover, we carefully select (or construct) these baselines to be representative of the SOTA: STRG is a seminal work in object-centric classification that remains popular, and more recent methods like STIN and ORViT mostly consist of improvements to the classification decoder, retaining STRG’s object-centric representation. CVS is a new task, and CVS clip classification is unexplored, so we construct a baseline that extends DeepCVS, a SOTA non-object-centric approach for single-frame CVS prediction, by adding a Transformer stage. Finally, while numerous methods have been proposed for phase recognition, most are multi-stage approaches that iterate on the temporal model used to process frame-wise feature vectors; TeCNO [2], the most popular approach, thus well-represents the status quo. 
R2 notes that GNNs have been used for phase recognition [30], but this approach builds a graph out of frame-wise feature vectors, thereby falling into the same category as TeCNO, with the GNN replacing TCN. Our graph representations individually model each object and its evolution over time, and are therefore much richer, manifesting in particularly strong performance for phase recognition without finetuning and CVS prediction in the box setting. To illustrate our arguments, we evaluate ORViT on both tasks and obtain similar performance to STRG: Phase – 77.92; CVS Box – 61.04; CVS Seg – 61.31. R3 also mentions that reproducing rather than quoting results would be preferable: we do so and obtain similar results: DeepCVS-R18 – 52.72/60.47; LG-CVS – 58.46/60.89; TeCNO FT – 80.8. R1 also cites limited evaluation tasks as a weakness, but as mentioned above, we intentionally select the two tasks to be vastly different, testing different aspects of our approach. R2 and R3 concur, citing evaluation on 2 tasks as a strength. Similarly, R2 notes that the graph-editing module is not well-highlighted, but we ablate this component in Table 3, which R3 mentions as a strength. We also plan on extending this work to a journal submission, where we will have the space to expand our evaluation (more tasks, baselines). Finally, R1 cites reliance on high-quality object detection as a weakness; however, by employing a modular detect-then-classify approach, we enable compatibility with any object detector, an important strength of our approach. Also, object detection need not be high quality; our results: Endoscapes: 53.8 Obj. Det. mAP@50; CholecSeg8k: 54.2 Obj. Det. mAP@50. CholecSeg8k in particular represents only 3 of the possible 7 phases, so most frames are actually out-of-distribution for the trained detector. We leave the study of varying object detection methodology/quality to future work. 
We will release all code+train/test configs and address manuscript organization (move qualitative results to main manuscript) if accepted. [30] Kadkhodamohammadi et al. (2022). PATG…IJCARS.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The Meta-R appreciates the efforts made to address many crucial points in the rebuttal. However, some critical concerns remain, such as the insufficient comparison (especially for the well-studied surgical phase recognition task; TeCNO is unfortunately not the state of the art now) and the lack of statistical tests. Therefore, I recommend rejection.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a graph representation embedding as an alternative to more conventional CNNs for image feature encoding. Crucially, this graph embedding provides a convenient way to model temporal relationships in continuous video, while still keeping it as a frame-based embedding. This idea was tested on two different tasks: CVS clip classification and surgical workflow segmentation, showing advantages of the graph embedding when compared to conventionally used CNN embeddings.

    Strengths:

    • All reviewers acknowledge that the idea is novel and interesting
    • Experiments demonstrate advantages of the new embedding on two different tasks

    Weaknesses:

    • Reviewer concerns around computational complexity were not addressed in rebuttal
    • Reviewer concerns about object detection dependency were addressed in rebuttal, but perhaps need further detailed experiments to be fully verified. This is a minor concern though.

    Overall, the authors did a good job answering to some of the reviewer criticisms. I believe most concerns around missing experiments are addressed:

    • I don’t see additional baselines for workflow segmentation as strictly necessary to demonstrate the idea in this paper. While a new embedding is proposed, other SOTA methods (TransSVNet, Opera, etc.) all use a traditional CNN encoding, with most of the differences being in temporal models and training strategies, which are orthogonal to the focus of this paper
    • I agree with the authors that other approaches using GNNs as a subsequent block to process CNN features, such as for temporal modeling, are fundamentally different from what this paper is trying to do. It doesn’t make sense to compare them 1-to-1
    • Testing the idea on more tasks would be a “nice-to-have”, but this is a conference paper with limited size, not a lengthy journal publication, so I find the experimental scope of 2 tasks adequate

    Therefore, I am leaning to accept this submission.


