
Authors

Hao Ding, Jintan Zhang, Peter Kazanzides, Jie Ying Wu, Mathias Unberath

Abstract

Vision-based segmentation of the robotic tool during robot-assisted surgery enables downstream applications, such as augmented reality feedback, while allowing for inaccuracies in robot kinematics. With the introduction of deep learning, many methods have been presented to solve instrument segmentation directly and solely from images. While these approaches made remarkable progress on benchmark datasets, fundamental challenges pertaining to their robustness remain. We present CaRTS, a causality-driven robot tool segmentation algorithm designed based on a complementary causal model of the robot tool segmentation task. Rather than directly inferring segmentation masks from observed images, CaRTS iteratively aligns tool models with image observations by updating the initially incorrect robot kinematic parameters through forward kinematics and differentiable rendering to optimize image feature similarity end-to-end. We benchmark CaRTS against competing techniques on both synthetic and real data from the dVRK, generated in precisely controlled scenarios to allow for counterfactual synthesis. On training-domain test data, CaRTS achieves a Dice score of 93.4 that is preserved well (Dice score of 91.8) when tested on counterfactually altered test data exhibiting low brightness, smoke, blood, and altered background patterns. This compares favorably to Dice scores of 95.0 and 86.7, respectively, for the SOTA image-based method. Future work will involve accelerating CaRTS to achieve video framerate and estimating the impact occlusion has in practice. Despite these limitations, our results are promising: in addition to achieving high segmentation accuracy, CaRTS provides estimates of the true robot kinematics, which may benefit applications such as force estimation. Code is available at: https://github.com/hding2455/CaRTS
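
The optimization loop described in the abstract can be illustrated with a minimal, self-contained PyTorch sketch. Everything here is a toy stand-in rather than the paper's implementation: `render_tool_mask` plays the role of forward kinematics plus differentiable rendering on a 1D strip of pixels, a frozen random convolution stands in for the learned feature extractor, and a single scalar offset stands in for the kinematic parameters being corrected.

```python
import torch

def render_tool_mask(offset, width=64, tool_center=32.0, tool_halfwidth=6.0):
    # Toy differentiable "renderer": a soft tool silhouette on a 1D pixel
    # strip whose position depends on the kinematic offset.
    xs = torch.arange(width, dtype=torch.float32)
    return torch.sigmoid(tool_halfwidth - (xs - (tool_center + offset)).abs())

# Frozen random convolution standing in for the pretrained feature extractor.
torch.manual_seed(0)
feature_net = torch.nn.Conv1d(1, 8, kernel_size=5, padding=2)
for p in feature_net.parameters():
    p.requires_grad_(False)

def features(mask):
    return feature_net(mask.view(1, 1, -1))

# "Observed" image, rendered at the true kinematics (unknown to the optimizer).
observed = render_tool_mask(torch.tensor(5.0)).detach()

# Start from incorrect kinematics and refine by gradient descent on a
# feature-space similarity loss, analogous to how CaRTS corrects kinematics.
offset = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([offset], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(
        features(render_tool_mask(offset)), features(observed))
    loss.backward()
    opt.step()

print(f"estimated offset: {offset.item():.2f} (true: 5.00)")
segmentation = (render_tool_mask(offset) > 0.5).float()  # final binary mask
```

The property this toy shares with CaRTS is that gradients of a feature-space similarity flow back through the renderer into the kinematic parameters, so the segmentation mask emerges from corrected kinematics rather than from a direct image-to-mask mapping.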

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_37

SharedIt: https://rdcu.be/cVRXc

Link to the code repository

https://github.com/hding2455/CaRTS

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed to perform semantic segmentation via 3D pose estimation and rendering of surgical instruments. They argue that this makes the method robust to challenging image-based situations such as smoke or blood.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of using 3D model tracking to provide a segmentation mask is not really new, but it is a different approach from what people commonly try, and the authors correctly identify that it could provide some benefits in challenging cases where purely image-based methods break.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think fundamentally this paper is solving a more difficult problem than segmentation, which is 3D pose estimation. Although the idea is interesting, I am not convinced that this approach will yield better results in the long run. The challenges of 3D pose estimation, such as instruments moved into complex articulations or fast tool movement, will cause this method to fail. Additionally, this method requires access to the kinematics (not guaranteed), availability of a 3D CAD model (again, not guaranteed), and a very powerful GPU to even get close to real time (which the authors admit is far away). It also won’t be able to handle laparoscopic instruments, which are often used in robotic procedures.

    I don’t see the comparison to a vanilla UNet as convincing. For this approach to be comparable, it should be compared with a UNet (or ideally a newer architecture) that has at least been trained with augmentation for smoke and other artifacts so that the fall-off can be compared. Looking at the failures of the UNet in Figure 1, I don’t see these as hugely challenging cases representing a fundamental limit of the state of the art. These are quite easy images that a well-trained network should be able to handle.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Paper seems like it would be reproducible since the code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I think the authors should try training a state-of-the-art segmentation model with augmentation before attempting to perform segmentation via 3D pose estimation.

    I think the approach would be useful and interesting if evaluated as what it is, which is 3D pose estimation. There are some previous works (e.g., “Real-time 3D Tracking of Articulated Tools for Robotic Surgery,” Ye et al., MICCAI, and “3-D Pose Estimation of Articulated Instruments in Robotic Minimally Invasive Surgery,” Allan et al., TMI) which would be good comparison points.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the idea is fairly creative, but I don’t see a clear path to improving the state of the art in segmentation (which is what this paper addresses) using this approach.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel framework for robot tool segmentation. The key novelty of the proposed model is that instead of assuming a causal relation between an image and its segmentation mask, it assumes a direct causal link between robot kinematics measurements and the segmentation mask. Removing the direct link between the image and its segmentation mask, which is the standard assumption in prior works, is hypothesized to render the segmentation model robust to image-domain shifts. The latter is experimentally validated on both real and simulated data captured under controlled settings using the dVRK platform.
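
    To make this causal contrast concrete, it can be written schematically as below. The notation is illustrative rather than taken from the paper: I is the image, K the measured kinematics, R forward kinematics plus rendering, and \phi a feature extractor.

```latex
% Contemporary causal model: the mask is inferred directly from the image.
\hat{S} = f_\theta(I)

% CaRTS-style model: the mask follows from corrected kinematics \hat{K};
% the image only supplies evidence for correcting the measurement K.
\hat{K} = \arg\min_{K'} \mathcal{L}\big(\phi(R(K')),\, \phi(I)\big),
\qquad \hat{S} = R(\hat{K})
```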

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper explores a model that goes beyond the standard paradigm for robot instrument segmentation. Removing the direct causal link between the image and the segmentation mask is hypothesized to improve performance when unseen image-space conditions are encountered (e.g., low brightness, smoke). This is supported by the experimental results of the paper, as the proposed model is tested on several unseen image domains with varying conditions. It outperforms a standard U-Net baseline and a method that also leverages kinematic parameters combined with a convolutional network.

    2) The proposed model links various parameters of the robotic platform (camera, tool semantics, kinematics). Thus it constitutes a flexible framework that can be used to estimate missing parameters via gradient descent, given all involved operations are differentiable. This is leveraged to iteratively refine the measurements of the kinematic parameters using gradient descent while also resulting in improved segmentation performance across domains.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Despite the fact that the model does not directly link the image to the segmentation mask, it requires a pretrained segmentation network (a UNet) to extract semantically rich features, over which the difference between the rendered and observed images forms the utilized loss function (page 6). This network leverages the standard paradigm of mapping input images directly to segmentation masks. Consequently, the overall method implicitly employs the “contemporary” causal model, in the form of a pretrained feature extractor. This merits some discussion in the paper.
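
    For illustration, a feature-based loss of the kind described above could be assembled as follows. The tiny encoder here is a hypothetical stand-in for the first stages of the pretrained UNet; the architecture, layer choice, and cosine-similarity objective are our assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical frozen encoder standing in for a trained UNet's early layers.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)  # frozen: only the rendered image receives gradients

def feature_loss(rendered, observed):
    # Compare images in feature space rather than pixel space, so nuisance
    # appearance changes matter less than tool geometry.
    return -nn.functional.cosine_similarity(
        encoder(rendered).flatten(1), encoder(observed).flatten(1)).mean()

rendered = torch.rand(1, 3, 64, 64, requires_grad=True)  # stands in for the rendering
observed = torch.rand(1, 3, 64, 64)                      # stands in for the camera image
feature_loss(rendered, observed).backward()  # gradients reach the rendered image,
                                             # and from there the kinematics
```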

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that the code and dataset will be publicly available upon publication; thus, it can be assumed that reproducing the results in the paper will be possible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1) The baseline UNet’s performance drops significantly when tested in unseen image-space conditions. However, it is trained without any augmentation (as stated in the supplementary). It would be interesting to explore the limits of this baseline when data augmentation is employed, in the form of simulated image-space alterations such as smoke or bleeding; see the sketch after this list. Including a data augmentation pipeline is essential in most standard tool segmentation approaches and would provide a stronger baseline to compare the proposed method against.

    2) On page 6, it is mentioned that the feature extractor is trained on collected images and hybrid images, where the average image background is added to rendered images. The latter should be justified in the paper.

    3) It would greatly benefit the reader’s understanding of the method section if a figure describing the various parts of the robotic setup, linking them to the variables referenced in the text, were added to the paper.
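
    As an illustration of the augmentation pipeline suggested in point 1, the sketch below composites simple synthetic smoke and a blood-like tint onto a training image. The blending recipes are our own illustrative approximations, not taken from the paper or from any established library.

```python
import torch

def add_synthetic_smoke(img, density=0.4):
    """Blend a smooth gray haze over the image (img: float CxHxW in [0, 1])."""
    c, h, w = img.shape
    haze = torch.rand(1, h // 8, w // 8)  # coarse noise -> smooth when upsampled
    haze = torch.nn.functional.interpolate(
        haze[None], size=(h, w), mode="bilinear", align_corners=False)[0]
    alpha = density * haze  # spatially varying smoke opacity
    return (1 - alpha) * img + alpha * 0.8  # 0.8 approximates light-gray smoke

def add_blood_tint(img, strength=0.3):
    """Shift colors toward red to roughly mimic blood in the scene."""
    out = img.clone()
    out[0] = torch.clamp(out[0] + strength, 0, 1)          # boost red
    out[1:] = torch.clamp(out[1:] - 0.5 * strength, 0, 1)  # suppress green/blue
    return out

img = torch.rand(3, 256, 256)  # placeholder for a training image
augmented = add_blood_tint(add_synthetic_smoke(img))
```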

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel and flexible framework that questions the standard approach to robot tool segmentation. Both the novelty and the adequate experimental validation outweigh the lack of justification and discussion around some methodological choices. Therefore, I incline towards acceptance.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision
    • The authors responded with a detailed rebuttal and added more experimental results. The comparisons with stronger segmentation baselines (HRNet, Swin Transformer) and the fact that the proposed method, using a simpler UNet, still compares favourably under unseen conditions (smoke, blood, etc.) strengthen the paper’s claims of robustness. I also appreciate the authors’ statement that they will improve the justification of training choices and the presentation of the variables and setup parameters mentioned in the method section.

    • My concern regarding the implicit use of the “contemporary causal model of segmentation” remains. I agree with the authors that a self-supervised feature extractor (instead of the currently used supervised network) could be used for the proposed feature-based loss. However, in the paper as it stands, there is no empirical evidence that doing so would sustain the method’s competitive results relative to the baselines. In my view, this should either be stated as a clear limitation in the main paper or be supported by empirical evidence.

    • Overall, I believe that, despite the abovementioned limitation, the paper is adequately novel and proposes an interesting direction for the robot tool segmentation task while leveraging multimodal information (kinematics and vision). Therefore, my final score is “Weak accept”.



Review #3

  • Please describe the contribution of the paper

    The paper describes CaRTS, a surgical instrument segmentation algorithm. CaRTS proposes a novel framework for estimating the segmentation of surgical instruments using images and kinematics. The results on binary segmentation seem to be well above those of other existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed framework, which combines image information and kinematics, is interesting, though not entirely novel, as existing works have already explored this direction
    • The proposed algorithm seems to obtain high segmentation performance, including under different settings such as low brightness, bleeding, smoke, background change, and simulated smoke
    • The chosen dataset and metrics are adequate
    • The paper is well presented and easy to read
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors state that the proposed model is very flexible, potentially capable of “direct estimation of robot kinematic parameters, including joint angles, base frame and camera transformations, in an end-to-end fashion.” However, the authors cast the model experiments as surgical instrument segmentation. When it comes to the evaluation, the authors compare the model against only one segmentation model, U-Net, which is already 7 years old. I encourage the authors to compare against more recent segmentation models (e.g., EfficientDet, HRNet, Swin Transformers, …).
    • In addition, the model seems to be designed to work only with binary segmentation (background and instrument) and in the absence of occlusions. These are important limitations which make the proposed framework interesting but leave its potential and real applicability unclear
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It seems that the code and data will not be released

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In addition to the suggestions above, the authors might refine the paper by improving the readability of the figures, defining all variables in their captions (for example, in Figure 2).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed framework is very interesting as it defines and employs multiple sources of information that are commonly available in real settings (kinematics and vision). Upcoming works should follow this trend and employ all available information to generate algorithms that are more robust and reliable in ‘any’ scenario.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents an interesting idea for tool segmentation via 3D pose estimation with adequate novelty. However, all the reviewers agree that the performance evaluation should be enhanced by including comparison to related 3D pose estimation methods and to recent segmentation models as suggested by R1 and R3, respectively. In addition, for a fair comparison to the baseline UNet, the model should be trained with augmentation for smoke and other artifacts such as blood. Clarifications in the methodology suggested by R2, would improve the paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We first summarize the review outcome and address concerns on a high level and then provide more detailed responses to selected queries.

All reviewers perceived our method to be interesting and appreciated its approach of differentiably connecting kinematic parameters with the vision pipeline. Further, R2 and R3 commended the performance on unseen data and applauded the combination of multiple data sources, i.e., vision and kinematics, to achieve tool segmentation.

Regarding weaknesses:

All reviewers expressed varying levels of concern that our primary claim, the robustness of segmentation performance in the presence of unseen image corruptions, may have been demonstrated using weak baselines. To address this concern, we now benchmark our method against stronger baselines suggested by the reviewers, including HRNet and Swin Transformer, and by training conventional image-based methods with simulated smoke as an additional augmentation. The resulting Dice scores across test domains are:

| Method | Regular | Low brightness | Blood | Smoke | Background change |
|---|---|---|---|---|---|
| HRNet | 95.2 | 86.3 | 56.3 | 77.2 | 92.1 |
| Swin Transformer | 95.0 | 93.0 | 76.5 | 82.4 | 94.8 |
| CaRTS (UNet feature extractor) | 93.4 | 92.4 | 90.8 | 91.6 | 92.3 |

While the use of newer architectures and smoke augmentation indeed establishes stronger baselines, CaRTS, using a simple UNet as its feature extractor, still compares favorably. Although the newer methods achieve Dice scores higher than or comparable to UNet's, their performance still deteriorates substantially on some unseen domains, which was among the primary motivations for developing our technique. These results will be included in the final manuscript.

Further, R1 felt that our paper addressed 3D pose estimation and, because comparisons to such prior work were missing, recommended rejection. We emphasize that the claims made in the paper pertain to segmentation alone and believe that, especially after adding stronger baselines, we have adequate justification backing this claim. We agree that the differentiable connection between images and robot parameters has similarities to 3D pose estimation work and will modify the manuscript to acknowledge this connection. We would appreciate the reviewer’s reconsideration in light of this clarification.

For more details:

R2 suggests that CaRTS implicitly relies on the contemporary causal model of segmentation. While it is true that the current model is trained in this way for convenience, we clarify that the feature extractor might not have to be trained for segmentation, but could rely on other techniques, such as unsupervised representation learning.

Further, we include hybrid images to ensure the feature extractor can extract reasonable features from both real and rendered images. The average background replaces the zeros in the rendered image to avoid potential numerical issues. All hybrid images and the average background can be obtained “for free” from the training set. This will be clarified in the final version, as per R2.
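
A minimal sketch of this hybrid-image compositing, under our reading of the description above (the exact recipe is not specified beyond replacing zeros with the average background):

```python
import torch

def make_hybrid(rendered, avg_background):
    """Fill the empty (all-zero) pixels of a rendered tool image with the
    average training-set background."""
    background = (rendered == 0).all(dim=0, keepdim=True)  # CxHxW -> 1xHxW
    return torch.where(background, avg_background, rendered)

rendered = torch.zeros(3, 64, 64)
rendered[:, 20:40, 20:40] = 0.7                # toy tool silhouette
avg_background = torch.full((3, 64, 64), 0.3)  # stands in for the dataset average
hybrid = make_hybrid(rendered, avg_background)
```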

R3 criticizes the use of binary, instead of multi-class, segmentation. We clarify that extending the approach to multi-class tool segmentation is straightforward as it would merely involve a tool model with finer annotation.

As R2 and R3 suggested, we will include more detailed figure captions and, if necessary, an illustration of the robot setup in the supplementary material.

R1 raises concerns that CaRTS may not work for laparoscopy or when robot kinematics is unavailable. As we indicate already in the title of our contribution, CaRTS is specifically designed for robot tool segmentation. While many recent challenges have indeed focused on purely image-based tool segmentation, which certainly is an interesting problem, we strongly believe that a method to reliably segment robotic tools should consider all information that is easily available when integrated with a robotic system. This is also supported by R3.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments and added more experimental results. I recommend acceptance of the paper as it presents a novel method and proposes an interesting robot tool segmentation direction.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received positive reviews except from R1, and the authors provided a good rebuttal to that review. The AC recommends acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors addressed all major comments from the reviewers and provided additional experimental results to justify their method. The provided results, justifications and clarifications should be included in the camera ready.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6


