Authors

Yuehao Wang, Yonghao Long, Siu Hin Fan, Qi Dou

Abstract

Reconstruction of the soft tissues in robotic surgery from endoscopic stereo videos is important for many applications such as intra-operative navigation and image-guided robotic surgery automation. Previous works on this task mainly rely on SLAM-based approaches, which struggle to handle complex surgical scenes. Inspired by recent progress in neural rendering, we present a novel framework for deformable tissue reconstruction from binocular captures in robotic surgery under the single-viewpoint setting. Our framework adopts dynamic neural radiance fields to represent deformable surgical scenes in MLPs and optimize shapes and deformations in a learning-based manner. In addition to non-rigid deformations, tool occlusion and poor 3D clues from a single viewpoint are also particular challenges in soft tissue reconstruction. To overcome these difficulties, we present a series of strategies of tool mask-guided ray casting, stereo depth-cueing ray marching and stereo depth-supervised optimization. With experiments on DaVinci robotic surgery videos, our method significantly outperforms the current state-of-the-art reconstruction method for handling various complex non-rigid deformations. To our best knowledge, this is the first work leveraging neural rendering for surgical scene 3D reconstruction with remarkable potential demonstrated.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_41

SharedIt: https://rdcu.be/cVRXg

Link to the code repository

https://github.com/med-air/EndoNeRF

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper adopts a leading-edge deep learning 3D rendering approach, neural rendering, to reconstruct 3D surfaces of surgical scenes in robot-assisted procedures that employ a stereo endoscopic camera. It seeks to provide an improved rendering of the visualized surface based on a known depth model from the stereo camera. Unlike the traditional 3D rendering approach, where the 3D surface is approximated using polygons and projected back to the camera using physical optical principles, the neural rendering technique trains a predictor as a function of camera position that directly predicts the r, g, b, and alpha (transparency) values for all camera rays. In the paper, the authors use an existing stereo-matching algorithm, STTR-light, to obtain a coarse depth map for surface reconstruction, and then propose an improved method for neural rendering based on the existing D-NeRF model.
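
    As an illustration of how per-ray predictions become a pixel, the sketch below composites density and colour samples along one camera ray in the way the original NeRF formulation does. It is a minimal sketch, not the authors' code: the function name composite_ray, its arguments, and the treatment of the last interval as very large are assumptions made for this example.

```python
import math

def composite_ray(sigmas, colors, ts):
    """Alpha-composite samples along one ray (NeRF-style volume rendering).

    sigmas: volume densities at the sample points (hypothetical network output)
    colors: (r, g, b) tuples at the sample points (hypothetical network output)
    ts:     sample distances along the ray, in ascending order
    Returns the rendered colour, the expected depth and the per-sample weights.
    """
    transmittance = 1.0          # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weights = []
    for i in range(len(ts)):
        # Spacing to the next sample; the last interval is assumed to be very large.
        delta = ts[i + 1] - ts[i] if i + 1 < len(ts) else 1e10
        alpha = 1.0 - math.exp(-sigmas[i] * delta)   # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            rgb[c] += weight * colors[i][c]
        depth += weight * ts[i]
        transmittance *= 1.0 - alpha
    return rgb, depth, weights


# An empty sample followed by an opaque green surface at distance 2.
rgb, depth, weights = composite_ray(
    sigmas=[0.0, 1e9, 5.0],
    colors=[(1, 0, 0), (0, 1, 0), (0, 0, 1)],
    ts=[1.0, 2.0, 3.0],
)
```

    The same weights yield both the colour and a depth estimate, which is what makes it possible to supervise the rendering with a stereo depth map.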

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In addition to small optimization improvements and surface smoothness improvements, the biggest advance of this paper appears to be the incorporation of a frame-to-frame deformation model and a surgical tool mask to reconstruct soft tissue surfaces lying beneath the surgical tool. Figure 2 suggests that the algorithm is very effective in achieving this task, and the video in the supplementary material reinforces this conclusion. The video in particular clearly demonstrates the ability of this approach to render regions with which other methods have problems.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Clarity. This paper is quite difficult to follow, particularly with respect to the details of the method. Some acronyms are not defined, and some methodological descriptions are vague. Workflow and block diagrams describing the methodology could be very helpful. Moreover, photometric error measurements are not clearly defined.
    2. It would help to describe the unmet clinical need more concisely, particularly with respect to the need for 3D reconstruction during a typical procedure, for example a radical laparoscopic robotic prostatectomy (RALP). How would RALP, for example, be improved if the proposed technique (or any other approach, for that matter) made reliable surface reconstruction of the surgical scene available? I do agree, however, that “robust scene reconstruction is important for augmented reality, surgical environment simulation, immersive education, and robotic surgery automation”.
    3. What is the clinical impact of being able to remove the surgical tools or reconstruct the 3D scene during tissue deformation in real time? It would be helpful if the paper indicated explicitly what makes the proposed method superior for dealing with tissue deformation.
    4. Why is “single viewpoint” important, given that in a RALP procedure the tools are in constant motion and the camera is moved frequently to track them? In the reconstruction of tissue deformation during an actual procedure, I would have thought that eliminating the tools from the image would be counter-intuitive. Is the purpose to superimpose the stereo representation of the real tools onto the reconstructed deformed scene?
    5. Validation: If I understand the validation method correctly, the photometric error measurements are not definitive, especially for tissues under the surgical tool. While it is acknowledged that ground truth is difficult if not impossible to obtain in clinical cases, could a phantom study not have been employed to obtain ground truth on some examples?
    6. Some phrases are very obscure. For example, what does “To capture high frequency details, we use positional encoding γ(·) to map the input coordinates and time into Fourier features before fed to the networks” mean?
    7. How long does the algorithm take to reconstruct each frame? Is it feasible for real time application?
    8. Is E-DSSR the only available competing algorithm?
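
    For readers with the same question as point 6, the sketch below shows what positional encoding usually means in the NeRF literature: each input coordinate (and the time value) is expanded into sines and cosines at geometrically increasing frequencies before being passed to the MLP. This follows the original NeRF formulation; the number of frequencies and the function names are assumptions for illustration and are not taken from the paper under review.

```python
import math

def positional_encoding(x, num_freqs):
    """Map a scalar x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_freqs-1."""
    features = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features

def encode_point(coords, num_freqs):
    """Encode every component of an input such as (x, y, z, t) and concatenate the results."""
    encoded = []
    for value in coords:
        encoded.extend(positional_encoding(value, num_freqs))
    return encoded


# A 3D position plus a time value, expanded with 4 frequency bands
# (4 inputs * 4 bands * 2 functions = 32 features).
features = encode_point([0.1, 0.2, 0.3, 0.5], num_freqs=4)
```

    The higher-frequency terms let a small MLP represent fine texture and sharp geometric detail that it would otherwise smooth out.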
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is not clear whether the data will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In spite of the paper being difficult to read, the results are impressive. More clarity in writing as detailed above, along with a comment on using the neural rendering technique in conjunction with arbitrary conventional reconstruction techniques to improve performance would improve the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Impressive results that seem to be superior to the competition. Enthusiasm is dampened somewhat by the number of unanswered questions.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this work, the authors adopt neural radiance fields (NeRF) to reconstruct dynamic surgical scenes from single-view (left view) stereo endoscope videos. The proposed framework is based on D-NeRF and applies STTR-light to estimate depth maps as prior knowledge for 3D scene reconstruction. Results indicate that this method is promising and provides good dynamic scene reconstruction even when non-rigid deformation and tool occlusion exist.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Tool mask-guided ray casting is designed for eliminating tool occlusion. 2) Depth-cueing resampling and depth-map loss are introduced to make NeRF effective on single-view reconstruction. 3) The proposed method is complete and the results are impressive.
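
    To make strength 1) concrete, the sketch below shows the simplest form of mask-guided ray selection: training rays are drawn only from pixels that the tool mask marks as tissue, so tool pixels never supervise the radiance field. This is a simplified stand-in rather than the authors' scheme; the mask convention (1 for tool, 0 for tissue), the uniform sampling, and the name sample_tissue_pixels are assumptions for this example.

```python
import random

def sample_tissue_pixels(tool_mask, num_rays, rng=None):
    """Pick up to num_rays pixel coordinates (u, v) that are not covered by a tool.

    tool_mask: 2D list of 0/1 values, assumed here to be 1 where a tool is visible.
    rng:       optional random.Random instance for reproducible sampling.
    """
    rng = rng or random.Random(0)
    candidates = [
        (u, v)
        for v, row in enumerate(tool_mask)
        for u, is_tool in enumerate(row)
        if is_tool == 0
    ]
    if not candidates:
        return []            # the whole frame is occluded by tools
    count = min(num_rays, len(candidates))
    return rng.sample(candidates, count)


# A 3x3 frame whose centre column is occluded by a tool.
mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
pixels = sample_tissue_pixels(mask, num_rays=4)
```

    Because the scene is modelled over time, tissue hidden in one frame can still be recovered from frames in which the tool has moved away.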

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed method requires per-scene optimization, as the authors state in “Implementation Details”, which means that it apparently cannot reconstruct tool-occluded areas in real time. That is, surgical tools in the intraoperative video cannot be removed from the reconstructed scenes during the procedure. What, then, is the significance of the proposed method for robotic surgery, in which tool occlusion often occurs? The motivation of this article should be made clearer.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is easy to reproduce the work of this paper with released data and code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1) In the supplementary materials, some areas are occluded throughout the videos, such as in the pulling, cutting, and tearing cases. How can these areas be reconstructed accurately, and how is the evaluation performed on them?

    2) The authors point out that they perform the evaluation following Ref. 10. However, the evaluation method in Ref. 10 is only applicable when the occluded areas change frequently. How, then, is a reference obtained for areas that are occluded throughout the whole video?

    3) A stereo endoscope captures both left and right images, but only left images are used in the proposed method. Can the method be applied to right images? If so, how would its performance on right images be evaluated?

    4) Some format issues: – Equation (1): What is the relationship between M_i and M_j? If M_j is a subset of M_i, M_j can be 0 (M_j is tissue). – Equations (1) and (3): The meaning of the variable j is unclear; please give an interpretation or a reference. – Equation (2): Is D_i[u,v] similar to {D_i}_{i=1}^T used in Ref. 10? If so, why is {D_i}_{i=1}^T called “coarse depth maps” rather than a reference, considering that it is used as sampling guidance in Subsection 2.4 and as depth supervision in Subsection 2.5? – Equation (4): What is I_i[u,v]? I suppose it is the image containing tissue information only; if so, please give the definition explicitly. – Subsection 2.5: “we firstly find residual maps …”: what is D_i? Is D_i equivalent to {D_i}_{i=1}^T?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1) Tool mask-guided ray casting is designed for eliminating tool occlusion. 2) Depth-cueing resampling and depth-map loss are introduced to make NeRF effective on single-view reconstruction. 3) The proposed method is complete and the results are impressive.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a NeRF-based 3D reconstruction framework for deformable tissues during robot-assisted surgery. The framework uses a neural implicit field for dynamic scene representation and incorporates mask-guided ray casting to handle occlusion, as well as depth-cueing ray marching and a depth-supervised optimization scheme. The method was evaluated on a clinical dataset from the DaVinci surgical robot and significantly outperformed E-DSSR, the most recent SOTA method.
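
    The idea behind depth-cueing ray marching can be sketched as follows: instead of spreading samples uniformly between the near and far planes, sample positions along a ray are concentrated around the coarse stereo depth of that pixel. The reviews do not state the sampling distribution, so the Gaussian used here, its standard deviation, and the function name are assumptions for illustration only.

```python
import random

def depth_cued_samples(stereo_depth, num_samples, sigma, near, far, rng=None):
    """Draw sample distances along a ray, concentrated around a coarse depth estimate.

    stereo_depth: depth of this pixel from stereo matching (for example STTR-light)
    sigma:        assumed spread of the samples around that depth
    near, far:    ray bounds; samples are clamped to this interval
    """
    rng = rng or random.Random(0)
    samples = []
    for _ in range(num_samples):
        t = rng.gauss(stereo_depth, sigma)       # assumption: Gaussian around the depth cue
        samples.append(min(max(t, near), far))   # keep the sample inside the ray bounds
    return sorted(samples)                        # compositing expects ascending distances


# Sixteen samples around a surface estimated at depth 50 on a ray spanning [0, 100].
ts = depth_cued_samples(stereo_depth=50.0, num_samples=16, sigma=1.0, near=0.0, far=100.0)
```

    Concentrating samples near the expected surface is one way to compensate for the weak 3D cues available from a single viewpoint.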

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The authors propose an innovative and effective design. For example, the NeRF structure is combined with mask-guided ray casting, which addresses the tool occlusion issue. 2) The authors also carry out a thorough evaluation by comparing against the most recent SOTA method. Furthermore, an ablation study investigates the contribution of each customized module to the significant improvement in performance. Lastly, the fact that the method was tested on a clinical dataset is another highlight of the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In Section 2.5, Equations (3) and (4) are not well explained. Equation (3) is too dense; consider splitting it, explaining each part clearly, and making sure the meaning of each parameter is described in the text. For example, T in (3) and I in (4) are not defined in the text.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be available which is great. Thanks to the authors.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In Section 2.5, Equations (3) and (4) are not well explained. Equation (3) is too dense; consider splitting it, explaining each part clearly, and making sure the meaning of each parameter is described in the text. For example, T in (3) and I in (4) are not defined in the text.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In Section 2.5, Equations (3) and (4) are not well explained. Equation (3) is too dense; consider splitting it, explaining each part clearly, and making sure the meaning of each parameter is described in the text. For example, F, G, and T in (3) and I in (4) are not defined in the text.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors use a deep learning 3D rendering algorithm for 3D reconstruction of deformable soft tissues from endoscopic stereo video during robot-assisted surgery. Specifically, neural implicit fields are used for dynamic scene representation, incorporating mask-guided ray casting to deal with occlusions as well as depth-cueing ray marching and depth-supervised optimization. The method was evaluated on a clinical dataset from the DaVinci surgical robot and compared against the state-of-the-art E-DSSR method, which it outperformed with impressive results.

    Although all reviewers agree about the significance and the impressive results, there are some issues, specifically in terms of the clarity of the methods and the framing of the work in terms of clinical context and needs. For the paper to be acceptable, the authors should address all the reviewers’ concerns, including:

    • adding details, defining acronyms and clarifying the methodology
    • defining the terms and explaining and formatting equations better
    • commenting on other state-of-the-art methods
    • clarifying the impact of removing surgical tools
    • etc.

    Please refer to the detailed reviews of the reviewers and revise the manuscript as needed.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the reviewers and the AC for recommending our work. All three reviewers recognize our novel neural rendering framework for surgical scene reconstruction. To further support the significance of this paper, we summarize and answer the reviewers’ concerns in the following three points.

  • Motivation and Applications: We initially designed our method for endoscopic surgeries, e.g., robotic prostatectomy. We observe that camera movements are constrained during endoscopic operation. Therefore, we regard “single-viewpoint” as one of our problem settings. Currently, our method can be used to construct virtual surgery environments from real endoscopic videos for surgical robot learning and AR/VR surgery training. Specifically, AR/VR endoscopic surgery training requires high-quality and diverse models of soft tissues to demonstrate real surgical scenarios. The significance of this clinical application has been addressed in many recent studies. However, realistic modeling of soft tissues is not easy due to the complex 3D structures and textures of in-vivo environments. Our method can overcome this issue by automatically reconstructing realistic shapes and textures of soft tissues from real videos. Notably, surgical instruments usually appear in the captured videos, but we only need the reconstruction of soft tissues. Therefore, we include “removing tool occlusion” as a crucial part of the problem setting. Since our goal in this paper is mainly to produce high-fidelity results, the method is not optimized for efficiency. Our current implementation takes around 10h to reconstruct all frames of one surgical scene. Thus, intraoperative use is not achievable yet.

  • Validation: Evaluation on occluded areas is challenging in our experiments. We find it difficult to collect good ground truth for evaluating occluded areas even with phantoms. We therefore limit ourselves to qualitative evaluation on those occluded areas (Fig. 2). Regarding competing algorithms: among the few recent approaches addressing similar problems, E-DSSR is the only method that is open-sourced and achieves considerable performance. Thus, we compare with E-DSSR only to validate the effectiveness of our method.

  • Clarity: The method proposed in this paper is built upon the emerging NeRF framework. Due to the page limit, we omit many explanations of existing modules in NeRF, e.g., positional encoding and volume rendering, and mainly stress our novel modules designed for surgical scenes. We invite reviewers who are lost in the method details to read the related literature (Ref. 14) for a better understanding of our full paper. To make the paper easier to read, we also plan to simplify our notation in the final version.


