Authors

Ege Özsoy, Tobias Czempiel, Felix Holm, Chantal Pellegrini, Nassir Navab

Abstract

Modern surgeries are performed in complex and dynamic settings, including ever-changing interactions between medical staff, patients, and equipment. The holistic modeling of the operating room (OR) is, therefore, a challenging but essential task, with the potential to optimize the performance of surgical teams and aid in developing new surgical technologies to improve patient outcomes. The holistic representation of surgical scenes as semantic scene graphs (SGG), where entities are represented as nodes and relations between them as edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results, achieving a +5% increase and a new SOTA of 0.88 in macro F1. This work opens the path for representing the entire surgery history with memory scene graphs and improves the holistic understanding in the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks. We will publish our code upon acceptance.
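
The macro F1 score quoted in the abstract is the unweighted mean of the per-class F1 scores, so rare relation classes count as much as frequent ones. The following is a minimal sketch of that metric in plain Python; the function name and the relation labels in the usage note are illustrative assumptions, not taken from the paper or the 4D-OR evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with illustrative labels `y_true = ["holding", "holding", "drilling", "none"]` and `y_pred = ["holding", "drilling", "drilling", "none"]`, the per-class F1 scores are 2/3, 2/3, and 1, giving a macro F1 of 7/9.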

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_29

SharedIt: https://rdcu.be/dnwO3

Link to the code repository

https://github.com/egeozsoy/LABRAD-OR

Link to the dataset(s)

https://github.com/egeozsoy/4D-OR

Reviews

Review #1

Please describe the contribution of the paper

This work proposes to solve the task of generating future semantic scene graphs with LABRAD-OR (Lightweight Memory Scene Graphs for Accurate Bimodal ReAsoning in Dynamic Operating Rooms). LABRAD-OR contains a buffer that stores the sequence of feature vectors extracted from scene graphs. The buffer is processed in different modes to integrate temporality, boosting the final scene prediction performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The task of anticipating the scene graph is novel and not well explored in holistic OR understanding.
2. Compared to latent temporality, the paper proposes a memory buffer that stores the feature vectors of all past scene graphs and then applies a transformer with different modes (“long-short”, etc.) to attend to that information.
3. The experiments show that the proposed module is effective for the scene anticipation task.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The paper uses the memory encoder to enhance the current timestamp’s representations. I am wondering how the non-enhanced representation performs in scene graph prediction, for example, when the bottom part of Fig. 1 is removed.
2. Is the transformer in the memory encoder applied with a causal masking strategy?
3. The training and testing procedure is not clear. I am curious why ground truth training is necessary here. If you train sequentially, as with an LSTM or RNN, can you obtain predicted scene graphs during training?
4. Also, how do you perform inference? Is it still based on the ground truth?
5. In the introduction, one insight is “certain design choices must be made regarding which feature from every time-point should be used as a temporal summary.” Is this the motivation for using Graphormer to condense a scene graph into a single feature vector? Can you explain further why Graphormer is used to extract feature vectors from scene graphs?
6. The conclusion that the “all” mode leads to overfitting is not convincing in the paper. Do you have other evidence supporting the choice of the “long-short” mode, such as inference time?
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper is reproducible, as it is evaluated on a public dataset.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The implicit part of this paper is the choice of Graphormer and the training and testing pipeline. This work proposes multiple modules; however, the experimental results do not support the choice of these modules (see points 5 and 6 in the weaknesses). It would be better to illustrate the training and testing pipeline: is ground truth as input really necessary here? During testing, you need at least one scene graph at the starting time. How do you obtain this scene graph? Do you still use ground truth, or predicted scene graphs from 4D-OR?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I made the recommendation based on the novel tasks and methodology.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

1) Introduces LABRAD-OR (Lightweight Memory Scene Graphs for Accurate Bimodal ReAsoning in Dynamic Operating Rooms), a novel lightweight end-to-end model for generating scene graphs based on input visual features (point cloud and image) and temporal scene graphs. Introduces:
a. Memory scene graph: temporal scene graph knowledge.
b. Augmentation in memory scene graph: increases model robustness to wrong scene predictions (which will be added to the memory scene graph during subsequent frame processing) during inference.
c. Time-of-interest positional Ids: encode absolute long/short relation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) Application novelty: Use of temporal information for holistic OR modelling.
2) Technical novelty: Novel end-to-end model that generates scene graphs based on the visual features and temporal human-interpretable scene graphs.
a. Introduces memory scene graphs (scene graphs from previous timeframes) and memory modes (“longshort”) to observe the short-term context in detail and long-term context sparingly.
b. Memory augmentation allows the model to be robust to wrong scene graph predictions during inference.
c. Time-of-interest positional Ids: Encoding absolute position (frame) of scene graph in memory scene graph to encode short/long relation.
3) Quantitative analysis:
a. The proposed temporal model outperforms the SOTA base model (4D-OR+) by 5% and the latent-based temporal baseline model by 2%.
b. The use of temporal features is also proven to improve consistency.
4) Ablation studies also show the improvement in model performance from memory augmentation, Time-of-interest Ids, end-to-end models, multi-task training and memory module.
5) Qualitative analysis (Figure 3) aligns with quantitative analysis.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) Lack of comparison with SOTA models
a. While the proposed model is compared with the SOTA application base model (4D-OR+), its comparison with the temporal-based model remains limited. While the model is compared against a latent-based temporal (LBT) baseline model, the LBT model is not cited or related to existing models, making it difficult to benchmark the performance of the proposed model against temporal models. Citing the LBT and benchmarking against other SOTA spatial-temporal models from the computer vision domain could significantly improve qualitative analysis.
2) Lack of multi-fold cross-validation test
a. As the performance increment is less than ~3% (in ablation studies) and 5% (in SOTA comparison), it raises doubts if model performance is affected by any bias in the test set. Given the limited dataset in the OR domain, a multi-fold cross-validation test is needed for in-depth quantitative analysis.
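
The multi-fold cross-validation requested above could be set up by splitting at the level of whole surgeries, so that frames from one recording never appear in both training and test data. The following is a minimal sketch under that assumption; the function name, the integer surgery ids, and the choice of k are hypothetical and do not reflect the official 4D-OR train/val/test division.

```python
def surgery_folds(surgery_ids, k=5):
    """Split whole surgeries into k folds; returns a list of (train, test) id lists."""
    ids = sorted(surgery_ids)
    # Round-robin assignment keeps fold sizes balanced
    test_folds = [ids[i::k] for i in range(k)]
    return [([s for s in ids if s not in test], test) for test in test_folds]


# Hypothetical ids for a dataset of ten recorded surgeries
for train, test in surgery_folds(range(1, 11), k=5):
    print("train:", train, "test:", test)
```

Reporting the mean and spread of macro F1 across such folds would indicate whether a gain of a few points depends on the particular test split.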
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

At present, code is not made available. However, the author has indicated that the code will be made public upon paper acceptance.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Improve figure quality: a. Maintain consistency in text alignment (left or center) for block heading. b. Maintain consistency in the image (scene) border within a figure.
2. Text edit: a. Consider replacing “Where the “Short” mode can often lead to insufficient memory” in section 4, para 3 (page 8) with “Where the “Short” mode can often lead to insufficient contextual information”. Using “insufficient memory” could mislead the readers into thinking, “short” could lead to insufficient memory space.
3. Multi-fold cross-validation test: a. Given limited time, I would highly recommend performing a multi-fold cross-validation test for key results.
4. SOTA comparison: a. Cite or relate the LBT model to existing closest temporal models. b. If possible, add additional SOTA spatial-temporal model performance.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The quantitative analysis lacks proper/in-depth comparison against SOTA temporal models, making it difficult to benchmark the proposed model. Furthermore, the lack of a multi-fold cross-validation test also raises concerns if the model improvement is affected by any test-set bias. However, taking into account both application and technical novelty, performance against base (4D-OR) model and clear ablation study, I recommend “weak accept”. I am willing to increase my recommendation if (i) the LBT model can be cited / closely related to the latest SOTA temporal models and (ii) a multi-fold cross-validation test is performed for key comparisons.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper presents LABRAD-OR, a novel and lightweight approach for generating accurate and consistent scene graphs using temporal information available in Operating Room (OR) recordings. The authors introduce the concept of memory scene graphs, which serve as both input and output, integrating temporality into the scene graph generation process. The proposed end-to-end architecture fuses temporal information with visual information, leading to significantly higher scene graph generation accuracy than the state-of-the-art, as well as better inter-timepoint consistency.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Novel and lightweight approach: LABRAD-OR proposes a memory scene graph-based temporal modeling approach that is innovative and computationally efficient.
2. Bimodal architecture: The end-to-end design fuses both temporal and visual information, leading to higher scene graph generation accuracy and better inter-timepoint consistency.
3. Memory Modes: The introduction of different memory modes allows for better control over computational overhead and prevention of overfitting.
4. Memory Augmentations and Timepoint of Interest (ToI) positional ids: The use of memory augmentations and ToI positional ids significantly contributes to the model’s performance.
5. Improved downstream task performance: LABRAD-OR demonstrates improvements in the downstream task of clinical role prediction, indicating the practical applicability of the approach.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Limited scope of dataset: The experiments are conducted on the 4D-OR dataset, which consists of only ten simulated knee surgeries. This may limit the generalizability of the results to other surgical procedures or real-world settings.
2. Lack of comparison to alternative temporal models: The paper does not thoroughly compare the proposed approach to other temporal models that may be used for the same purpose.
3. Surgical duration variability: The proposed memory modes may not perform optimally for surgical procedures with durations that differ significantly from those in the 4D-OR dataset.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors claim, “We will publish our code upon acceptance.” No code is found currently.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Clarity in methodology: While the methodology section provides a comprehensive description of the proposed approach, it could benefit from more clarity and succinctness. Specifically, the explanation of memory modes and the motivation behind their design could be improved. It would be helpful if the authors could provide a clear rationale for choosing the specific memory modes and the settings of their parameters (e.g., S).

Memory scene graphs vs. latent temporality: The paper would benefit from a more detailed comparison between memory scene graphs and latent temporality approaches, highlighting the advantages and limitations of each method. This would provide a better understanding of the reasons for choosing memory scene graphs and their potential impact on the results.

Limitations and generalizability: The authors should discuss the limitations of the proposed method and its generalizability to other surgical procedures, datasets, and tasks beyond the 4D-OR dataset. This would provide readers with a better understanding of the potential applicability and usefulness of the proposed approach in real-world scenarios.

Computational overhead analysis: The paper claims that the proposed method only adds a 40% overhead to the computational cost. A detailed analysis of the computational overhead, including a comparison with other methods, would be helpful to support this claim and demonstrate the efficiency of the proposed method.

Dataset limitations: The 4D-OR dataset used in this study consists of simulated knee surgeries. The authors should discuss the limitations of the dataset and how they might affect the results. Additionally, it would be useful to explore the applicability of the proposed method to other types of surgeries or even different medical domains.

More visual examples: The paper would benefit from additional visual examples, showcasing the effectiveness of the proposed method in various surgical scenarios. This would allow readers to better grasp the improvements achieved by LABRAD-OR compared to other methods.

Related work: Although the paper cites relevant literature, a more comprehensive review of related work, especially in the context of scene graph generation and temporality in computer vision, would provide a better understanding of the research landscape and the novelty of the proposed method.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The well-defined task and concept, and the interesting proposed methods with improved results compared to previous works.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a novel and lightweight approach that leverages temporal information for generating accurate and consistent scene graphs in the operating room. The developed method is interesting and the results are promising. However, the reviewers still have a few concerns. Please address these, especially the questions about insufficient experiments and comparisons and about method generalizability; some details of the method and experimental setup should also be clarified.

Author Feedback

We are grateful to all the reviewers for their meaningful and valuable comments. All reviewers appreciate the novelty of our work (R1, R2, R3), the advantages of the proposed memory modes for “better control” (R2), and the effectiveness of our “memory augmentations” (R2, R3). Finally, they appreciate the improved results on scene graph generation as well as on downstream tasks (R1, R2, R3).
In addressing the comparison of latent feature-based temporality to memory scene graph-based temporality (R1, R2, R3), we crafted a Latent-Based Temporal (LBT) model (R3) comparable to the established phase recognition method OperA [1] to serve as a fair baseline. Existing SOTA methods were not directly applicable, as they require a latent representation for each scene, which is not present in current scene graph generation pipelines, which only have latent representations per object pair. Both LABRAD-OR and LBT employ transformers for temporal processing, yet LABRAD-OR leads not only to better results but also to higher efficiency, owing to the low dimensionality of the scene graphs.
Regarding our decision to opt for the “Long-short” mode (R1, R2), the slight advantage over the “All” mode in our ablation study (Table 4) likely results from less overfitting. Additionally, the “Long-short” mode is not only marginally faster than “All”; crucially, its reduced memory requirements also enable our model to handle substantially lengthier surgical procedures than those in 4D-OR. We can use the memory mode variable “S” to allow longer context to be processed efficiently. Thus, while there is a modest accuracy benefit, the primary advantage of our design choice is its flexibility and scalability.
In terms of evaluation on 4D-OR (R1, R2, R3), we adhered to the standard train/val/test division as proposed by the authors of 4D-OR (R3). At the time of this writing, this is the sole holistic surgery understanding dataset providing an external view.
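
To make the scalability argument for the “Long-short” mode concrete, the sketch below selects which past time steps are kept in memory. It assumes that “Short” keeps the last S scene graphs, “Long” keeps every S-th scene graph, and “Long-short” is their union; these exact semantics, the function name, and the default value of S are assumptions for illustration and are not confirmed by the text above.

```python
def select_memory_indices(t, mode="longshort", S=5):
    """Return the sorted indices of past time steps (all < t) kept in memory."""
    if t < 0 or S < 1:
        raise ValueError("t must be >= 0 and S must be >= 1")
    if mode == "all":
        return list(range(t))
    short = set(range(max(0, t - S), t))  # assumed: the S most recent steps
    long_ = set(range(0, t, S))           # assumed: every S-th step of the history
    if mode == "short":
        keep = short
    elif mode == "long":
        keep = long_
    elif mode == "longshort":
        keep = short | long_
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(keep)
```

Under these assumptions, a history of 1000 time steps with S = 10 is reduced to 109 memory entries, whereas the “All” mode would keep all 1000.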
Regarding the implementation (R1), our memory encoder solely processes preceding scene graphs (causal masking), rendering LABRAD-OR suitable for real-time applications. On ground truth usage during training (R1), “teacher forcing” is an established strategy for training transformer architectures. With “teacher forcing”, the training can be parallelized, which makes it more efficient. Importantly, no ground truth is used during inference. We use a common NLP technique by padding all our scene graph sequences with a token. This enables our model to commence predictions on a fresh surgery devoid of any prior scene graphs (R1), contributing to the versatility and adaptability of our approach.
We extend our sincere gratitude to the reviewers for their constructive suggestions regarding figure and manuscript improvements. We will incorporate this valuable feedback into the camera-ready version of our work. Overall, as acknowledged by all reviewers, LABRAD-OR efficiently employs novel memory scene graphs for temporal processing, enhancing holistic surgical scene understanding, and thus delivering superior outcomes in both scene graph generation and downstream tasks. We are confident that this work constitutes a significant contribution to the surgical data science community, fostering consistent and precise analysis of dynamic surgeries.
[1] Czempiel, T. et al. OperA: Attention-regularized transformers for surgical phase recognition. MICCAI 2021.
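
The combination of teacher forcing, a padding token, and causal masking described in the feedback can be sketched in a few lines of plain Python. This is an illustration of the general technique only: the token value `"<start>"` and both function names are hypothetical, and the actual LABRAD-OR implementation may differ.

```python
START = "<start>"  # hypothetical placeholder; the real token is not specified here


def teacher_forcing_inputs(gt_graphs):
    """Shift ground-truth graphs right by one so step i only sees graphs before i."""
    return [START] + list(gt_graphs[:-1])


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


gt = ["g0", "g1", "g2"]
inputs = teacher_forcing_inputs(gt)
mask = causal_mask(len(inputs))
for i, row in enumerate(mask):
    visible = [tok for tok, ok in zip(inputs, row) if ok]
    print(f"predicting step {i} from {visible}")
```

Because the inputs are shifted by one position, the first prediction depends only on the start token, which is what allows inference to begin on a new surgery without any prior scene graph. During training, all positions can be processed in one parallel pass, since the mask already hides later time steps.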

LABRAD-OR: Lightweight Memory Scene Graphs for Accurate Bimodal Reasoning in Dynamic Operating Rooms