
Authors

Lennart Bastian, Daniel Derkacz-Bogner, Tony D. Wang, Benjamin Busam, Nassir Navab

Abstract

The digitization of surgical operating rooms (OR) has gained significant traction in the scientific and medical communities. However, existing deep-learning methods for operating room recognition tasks still require substantial quantities of annotated data. In this paper, we introduce a method for weakly-supervised semantic segmentation for surgical operating rooms. Our method operates directly on 4D point cloud sequences from multiple ceiling-mounted RGB-D sensors and requires less than 0.01% of annotated data. This is achieved by incorporating a self-supervised temporal prior, enforcing semantic consistency in 4D point cloud video recordings. We show how refining these priors with learned semantic features can increase segmentation mIoU to 10% above existing works, achieving higher segmentation scores than baselines that use four times the number of labels. Furthermore, the 3D semantic predictions from our method can be projected back into 2D images; we establish that these 2D predictions can be used to improve the performance of existing surgical phase recognition methods. Our method shows promise in automating 3D OR segmentation with a 20 times lower annotation cost than existing methods, demonstrating the potential to improve surgical scene understanding systems.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_6

SharedIt: https://rdcu.be/dnwOG

Link to the code repository

https://github.com/bastianlb/segmentOR

Link to the dataset(s)

https://bastianlb.github.io/segmentOR/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents an interesting learning-based approach for OR semantic segmentation from wall/ceiling-mounted RGBD cameras. The topic is interesting and very active in the CAI community.

    Key contributions are: i) the adaptation of the OTOC method to this problem, and ii) its extension to incorporate temporal consistency of the RGBD point clouds using a simple (flow, nearest-neighbour) prior, which appears to improve results.

    iii) Making the dataset and annotations open access is also an important contribution.

    The proposed method, SegmentOR, is able to produce good semantic segmentation performance with minimal annotation effort (one click), which is the main contribution of the paper.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow, with most concepts adequately explained and discussed. Ideas that can provide accurate OR semantic segmentation with minimal manual annotation requirements are very interesting.

    The key strengths of the paper are summarised as:

    i) The integration of temporal consistency for point cloud semantic segmentation. Although it is somewhat expected that temporal consistency will improve 3D semantic segmentation, the paper contributes the formulation and investigation of the OTOC+T method.

    ii) Experimentation is thorough, and the results presented are properly discussed and justify the methodology followed. The evaluation of the method on downstream tasks, showing the improvement yielded, is also very interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses I have identified:

    1) Benchmarking and comparison is only against a weakly-supervised method (OTOC) without temporal consistency.

    2) More information on data generation is required.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors commit to release the dataset and relevant code. Reproducibility should be straightforward.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In my opinion, addressing the following points will improve the paper.

    1) Without doubt, the proposed method outperforms the baseline (OTOC). However, OTOC is also a weakly supervised method, and a comparison against a fully supervised baseline is not provided. I appreciate the huge benefit of having to annotate only 0.005 - 0.0217% of points, but it would be very helpful if there were a comparison against a fully supervised baseline to better contextualize the obtained levels of performance.

    2) I would also appreciate more information on the generation of the GT point clouds. It is mentioned that: “The full dataset consists of RGB-D video acquisitions from four co-registered ceiling-mounted Azure Kinect cameras.” Are all 4 cameras required to get the point clouds? Were there any considerations for OR room coverage?

    Also: how many persons (i.e., medical personnel) appear in the procedures, and how are the object classes determined?

    3) How is the expectation threshold of 0.9 defined? If it helps to avoid erroneous pseudo-labelling, why not set it higher?

    4) The per-class results (Table 3 supplementary material) are interesting and should be briefly discussed.

    5) On page 8, in the result discussion of Table 3: what are the two cameras (surgical and workflow) mentioned?

    6) Please explain the statement on page 5, “reducing computational complexity to an additional comparison per supervoxel, instead of n”, and clarify how the reduction in complexity is achieved.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Ideas that can provide accurate OR semantic segmentation with minimal manual annotation requirements are very interesting.

    The paper proposes the SegmentOR method, which, although heavily based on a previous one-click method (OTOC), does contain novel contributions.


  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The additional experimentation with FS and WS methods is convincing, and I am confident in my original score. I also agree that this comparison has to be added to the paper, even if only in summary.

    I still believe the authors must address comment 5 from my review “5) In page 8, result discussion of Table 3. What are the two cameras (surgical and workflow) mentioned ?”



Review #2

  • Please describe the contribution of the paper

    This paper introduces a weakly-supervised method for semantically segmenting surgical operating rooms, which are represented as 4D point cloud sequences, with only 0.01% of the data being annotated. To achieve this, a self-supervised temporal prior is used to enforce semantic consistency in 4D point cloud video recordings. Moreover, the 3D semantic predictions can be further used to improve the performance of existing surgical phase recognition methods. Based on quantitative comparisons to the baseline method and associated ablation studies, the paper shows the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and easy to follow.

    2. The paper presents a novel way of using temporal consistency to achieve weakly-supervised semantic segmentation.

    3. The paper demonstrates its potential in improving surgical phase recognition. Future research on similar topics or associated downstream tasks can benefit from its code and data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The experimental part is not comprehensive: (i) the experiments are conducted on only one dataset, comparing against the baseline method OTOC and the authors' own variants; and (ii) only a set of qualitative results is presented in the supplementary material.

    2. The technical contribution is limited, since the approach is heavily derived from OTOC, adapted for temporal point cloud sequence data.

    3. Moreover, the practical value of the proposed setting and method remains unknown: (i) the paper only presents results based on partially annotated data, while results using fully supervised data are not reported – if the results are far behind fully supervised methods, the value would be limited (e.g., the OTOC paper achieves 69.1% using 0.02% of the data, while a fully supervised method achieves 72.5%); and (ii) according to Tab. 3, the surgical phase recognition improvement on Camera 02 data is very limited.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Could be reproduced once the code and data are released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. More quantitative and qualitative experiments on other datasets are expected.

    2. The technical values in practical cases should be better demonstrated.

    Please refer to the above “weaknesses”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the comments above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a 4D weakly-supervised semantic segmentation method for the operating room scenario, which extends the OTOC method with temporal label propagation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well motivated and works on an important topic.

    2. The proposed method is well built upon the OTOC method and shows improved empirical performance compared to OTOC.

    3. The code and dataset will be released.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Major: The description of the essential part of this paper, the temporal propagation, is not clear. Eq 2 and Eq 3 are not rigorous or explained well. The two terms on the right-hand side of Eq 2 are not explained. \hat{m} is not shown in Eq 2 or Eq 3, so its usage is unclear. Why could this propagation method be more efficient than OTOC+T? Due to these ambiguous descriptions, I am not able to identify this part as a good contribution.

    2. OTOC is the only method for comparison. Some more recent methods are not included.

    3. Minors:

      • The pseudo-label is generated according to confidence, E(Y|S)>0.9. Will it be affected by the mis-calibration problem?
      • What is the frame rate of the data?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is good given that the code and dataset will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is mainly based on the fact that the dataset would be a good contribution to the field after release.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This work introduces weakly supervised 3D semantic segmentation of the operating room. It integrates temporal consistency with OTOC (one-think-one-click) [18] to improve the performance of semantic segmentation while requiring less than 0.01% annotated data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Application novelty: introduces weakly supervised 3D semantic segmentation of the OR. 2) Proposes segmentOR: OTOC [18] coupled with temporal consistency, which outperforms the baseline model (OTOC [18]) by ~10% mIoU in a multi-fold cross-validation test. 3) Performing a multi-fold cross-validation test shows that the model performance is independent of any dataset bias. 4) Shows that the segmentation output obtained using the proposed technique can improve other downstream tasks such as surgical phase recognition. 5) The ablation studies on the inclusion of colours also give valuable insights.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) While segmentOR is benchmarked against OTOC [18], it lacks benchmarking against any other SOTA weakly supervised 3D semantic segmentation methods. 2) The work also lacks benchmarking against other techniques that incorporate temporal consistency. It is also unclear how dissimilar the temporal consistency (employed in this work to generate pseudo-labels) is to existing techniques employed in the computer vision domain for driverless-car applications. 3) As the dataset is declared to be proprietary (in the acknowledgement section), I assume it is not publicly available. Therefore, this work lacks any results on a public dataset, making it difficult to benchmark the model performance. A doubt about superior performance due to dataset bias may also arise. However, taking into account the multi-fold cross-validation test (which removes any dataset bias), I consider this a minor weakness. 4) The manuscript lacks qualitative analysis. However, it is included in the supplementary.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code and dataset are not available. However, the author has declared that the code will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) I highly recommend the authors condense the introduction to make space for the qualitative analysis figure in the manuscript; adding it will significantly improve the manuscript's quality. 2) Clearly state the difference between the temporal consistency technique used in this work and temporal consistency techniques employed in the computer vision domain. This will highlight additional technical novelty (if any) in this work. 3) Minor typo: please rewrite the sentence “The densely labelled validation annotations comprise approximately 93% of the on average one million points per point cloud.” on page 6, section 4, para 2, line 2. 4) Compare with at least one or two more SOTA weakly supervised 3D semantic segmentation techniques that incorporate temporal consistency priors.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In my view, this work has equal merits and areas to improve. On one side, it has application novelty, in that it introduces weakly supervised 3D semantic segmentation for the OR. The proposed method outperforms the baseline method by a significant margin in a multi-fold cross-validation test. On the other side, benchmarking of the proposed approach seems limited, as it lacks comparison against existing SOTA weakly supervised methods or methods that employ a temporal prior. It also lacks benchmarking on a public dataset (assuming the dataset used here is not public, as it is declared to be proprietary). However, in my view, the application need and novelty slightly outweigh the drawbacks. Furthermore, taking into consideration the annotated dataset that will be released (as claimed by the authors), I propose to weakly accept the paper. I would be happy to increase my rating if benchmarking against other SOTA weakly supervised techniques or on public datasets is provided.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Insufficient benchmarking and limited technical novelty. However, in my view, the application need (dataset) and application novelty still slightly outweigh the drawbacks.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a weakly-supervised method for 3D semantic segmentation of the operating theatre. Promising performance is achieved with less than 0.01% of annotations needed. The reviewers have raised several concerns, such as the limited innovation compared with OTOC, the practical value of the proposed setting, and insufficient comparison with SOTA weakly-supervised methods. I therefore invite the authors to submit a rebuttal focusing on addressing the reviewers' comments.




Author Feedback

We would like to thank the reviewers for their constructive feedback, which has helped us strengthen the quality of the manuscript. We appreciate that the paper has been perceived as well-written, easy to follow (R1, R2), and well-motivated, addressing an important topic (R1, R3).

We would like to clarify that although the data is proprietary, we’ve obtained approval for its release, along with the implementation. Upon release, this would be the first indoor semantic scene segmentation dataset with temporal point cloud sequences, which will benefit not only the medical community but also the broader robotics community. A reference describing omitted dataset details will be added after acceptance of the paper to preserve anonymity throughout the review process. This encompasses specifics of point cloud acquisition and camera placement (R1, R3). We indeed optimize for room coverage; cameras are re-mounted for every surgical acquisition. Semantic classes, as well as the original surgical phase labels, were defined under clinical guidance.

Regarding additional datasets (R2, R4), the general vision community performs indoor segmentation on reconstructions of static scenes (ScanNet, S3DIS) with no temporal information. While evaluation in autonomous driving settings is possible, 3D segmentation in autonomous driving typically demands specific architectures (i.e., Cylinder3D or PointPillars), that consider the acquisition traits of a moving lidar sensor. These assumptions do not hold for indoor acquisitions. In LESS [17, Fig 1.], the benefits of a cylindrical voxelization are contextualized with OTOC [18], the latter of which was designed in an indoor context.

However, we conducted additional experiments in both fully supervised (FS) (R1, R2) and weakly supervised (WS) (R3, R4, MR) settings. We perform a three-fold cross-validation of the U-Net backbone of our model on the complete labels in an FS setting (results: 77.13 +- 3.36 mIoU). While not trained on the same splits (our training splits only have sparse labels), these results provide valuable context, showing that segmentOR significantly closes the gap to FS methods. Furthermore, we evaluate ContrastiveSceneContext (competitive for WS segmentation, see LESS [17, Fig. 1]) on our dataset (results: 67.65 +- 2.0 mIoU), demonstrating that while it outperforms baseline OTOC, a significant gap to segmentOR remains. As suggested, we will include these results in the final version as they provide valuable context for the method and dataset.

Regarding model calibration (R1, R3), the expectation threshold E(Y|S) incorporates both networks’ feature representations, color, and position (x, y, z) (see Suppl. Sec. 1). Previous works have shown that such co-training can limit overconfident predictions. However, the threshold generally still requires tuning to balance false-positive and false-negative predictions; setting it higher would also omit correct predictions, limiting the supervision signal.

Equations 2 and 3 formulate our general framework for temporal matching based on OTOC’s supervoxel graph propagation (R3, R4). Each entry in the matrix M describes the probability that belief can be propagated between supervoxels from two timestamps. OTOC never propagates information temporally, while OTOC+T propagates information for all supervoxels (M is initialized densely). The proposed method segmentOR undergoes a sparse initialization (hence fewer calculations; R1) from a temporal prior, updated during training based on probabilities from the RelationNet. Updated matchings (\hat{m}) are then used in graph propagation. This is detailed in Suppl. Sec. 1; however, we will clarify the notation in the paper.
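The sparse-versus-dense distinction above can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: all function names, shapes, and the use of centroid nearest neighbours as the temporal prior are assumptions made for illustration only.

```python
import numpy as np

def dense_matches(cent_t, cent_t1):
    """OTOC+T-style dense initialization: every supervoxel at time t is a
    candidate match for every supervoxel at t+1 (n comparisons each)."""
    # Pairwise squared distances between centroids, shape (n_t, n_t1)
    d = ((cent_t[:, None, :] - cent_t1[None, :, :]) ** 2).sum(-1)
    return d < np.inf  # dense boolean matching matrix M (all candidates kept)

def sparse_matches(cent_t, cent_t1):
    """segmentOR-style sparse initialization from a nearest-neighbour
    temporal prior: one candidate per supervoxel, i.e. a single additional
    comparison per supervoxel instead of n."""
    d = ((cent_t[:, None, :] - cent_t1[None, :, :]) ** 2).sum(-1)
    m = np.zeros_like(d, dtype=bool)
    m[np.arange(len(cent_t)), d.argmin(axis=1)] = True  # keep only the NN
    return m

rng = np.random.default_rng(0)
c_t, c_t1 = rng.random((5, 3)), rng.random((5, 3))
assert dense_matches(c_t, c_t1).sum() == 25   # n * n candidate pairs
assert sparse_matches(c_t, c_t1).sum() == 5   # one candidate per supervoxel
```

In the actual method, the sparse matrix entries would then be updated during training from RelationNet probabilities rather than kept fixed.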

Finally, following suggestions (R2, R4), we’ll condense the introduction and add a qualitative figure in the final version, alongside a brief discussion on class-specific segmentation metrics, as they provide valuable context.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The concerns of the reviewers around the validation experiments and the method novelty remain, however given the contributions of the paper, i.e., application novelty, I recommend acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a weakly-supervised approach to perform semantic segmentation of an OR room from point clouds sensed with RGBD cameras. The method is a modification of OTOC, and is compared favorably against this baseline.

    Strengths:

    • All reviewers acknowledge the paper is well written, the problem is very interesting and relevant to MICCAI, and the approach is novel
    • The dataset is interesting. The authors state this will be released which is great, although it may be impossible to confirm this before decision.

    Weaknesses

    • All reviewers would like to see comparisons against supervised methods to better understand how well the proposed methods are performing.

    In my opinion this paper should be accepted as the strengths outweigh the weaknesses. The lack of comparison against supervised methods (see comment below) is an acceptable compromise for a conference paper given how challenging it is to obtain fully supervised manually labeled data in high numbers, so this would be a challenge to any competing algorithm tackling the same problem.

    NOTE: This paper introduced significant new experiments in the rebuttal. This is against MICCAI guidelines, and therefore my assessment of this paper ignores them, i.e., I believe the paper should be accepted without taking these results into account. They cannot be considered to be peer-reviewed / reproducible to the same degree as the rest of the paper:

    • The very minimalistic rebuttal format is - by design - unsuitable for describing new experiments with sufficient detail. In this particular case, some examples: the new WS baseline is only reported in mIoU, not all metrics of table 1; it is also unclear if it utilises RGB or not (both proposed and OTOC are tested with/without colour features but we only have 1 result for new baseline); what is the backbone encoder for U-Net? etc. There is also no opportunity for reviewers to request info about these.
    • There is no guarantee that adding the new results and appropriately describing them in the text can fit under the page limit, which could cascade into further non-peer-reviewed changes to the paper
    • We would be advising authors to disregard MICCAI guidelines, which gives the wrong incentives to the community



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper performs semantic segmentation on videos of the operating room. The authors have clarified in the rebuttal that their (in my point of view) valuable dataset (the first indoor semantic scene segmentation dataset with temporal point cloud sequences) will be released to the public, which is great given that CAI datasets are often not easily accessible. Reviewers raised the point that methodological advancement is not the main strength of the paper. I find the work solid, and the authors explain their rationale for the evaluation nicely in the rebuttal. In summary, the need for innovation in these application areas outweighs the other weaknesses for me.


