Authors

Ruyi Zha, Xuelian Cheng, Hongdong Li, Mehrtash Harandi, Zongyuan Ge

Abstract

Reconstructing soft tissues from stereo endoscope videos is an essential prerequisite for many medical applications. Previous methods struggle to produce high-quality geometry and appearance due to their inadequate representations of 3D scenes. To address this issue, we propose a novel neural-field-based method, called EndoSurf, which effectively learns to represent a deforming surface from an RGBD sequence. In EndoSurf, we model surface dynamics, shape, and texture with three neural fields. First, 3D points are transformed from the observed space to the canonical space using the deformation field. The signed distance function (SDF) field and radiance field then predict their SDFs and colors, respectively, with which RGBD images can be synthesized via differentiable volume rendering. We constrain the learned shape by tailoring multiple regularization strategies and disentangling geometry and appearance. Experiments on public endoscope datasets demonstrate that EndoSurf significantly outperforms existing solutions, particularly in reconstructing high-fidelity shapes. Code is available at https://github.com/Ruyi-Zha/endosurf.git.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_2

SharedIt: https://rdcu.be/dnwOC

Link to the code repository

https://github.com/Ruyi-Zha/endosurf.git

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

The article proposes a novel neural-field-based method, dubbed EndoSurf, to represent a deforming tissue surface from an RGB+D sequence acquired by a stereo endoscope. EndoSurf builds on the recent work of EndoNeRF [26] and, like its predecessor, it models surface deformation (or dynamics), shape (or geometry), and color (or appearance) while accounting for occlusions caused by surgical tools and instruments. The key difference between EndoNeRF and EndoSurf is the shape representation, which is carried in the former by a density field and in the latter by an SDF field following the NeuS formulation [25]. The SDF field makes it possible to extract higher-quality surfaces and provides normals that serve as an additional input to the radiance field to disentangle geometry and appearance learning [27]. The experimental validation shows that EndoSurf consistently outperforms EndoNeRF in appearance and geometry recovery.
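To make the density-versus-SDF distinction concrete, the following is a minimal sketch of how a NeuS-style renderer turns SDF samples along a ray into opacities and a depth value. It is a forward pass only, with a toy analytic sphere standing in for the learned SDF field; the function names, the sharpness `s`, and the sampling settings are illustrative assumptions, not EndoSurf's actual implementation.

```python
import math

def sigmoid(x: float, s: float) -> float:
    """Logistic CDF Phi_s(x) = 1 / (1 + exp(-s * x)) with sharpness s."""
    return 1.0 / (1.0 + math.exp(-s * x))

def sphere_sdf(p, radius=1.0):
    """Toy signed distance (assumption): negative inside a sphere at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def render_depth(origin, direction, near=0.0, far=4.0, n=256, s=64.0):
    """Composite depth along one ray from NeuS-style opacities."""
    ts = [near + (far - near) * i / n for i in range(n + 1)]
    cdf = [
        sigmoid(sphere_sdf([o + t * d for o, d in zip(origin, direction)]), s)
        for t in ts
    ]
    transmittance, depth, weight_sum = 1.0, 0.0, 0.0
    for i in range(n):
        # alpha_i = max((Phi_s(f_i) - Phi_s(f_{i+1})) / Phi_s(f_i), 0)
        alpha = max((cdf[i] - cdf[i + 1]) / cdf[i], 0.0)
        weight = transmittance * alpha
        depth += weight * 0.5 * (ts[i] + ts[i + 1])  # interval midpoint
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return depth, weight_sum

# A ray starting at z = -3 and looking along +z meets the unit sphere at distance 2.
depth, weight_sum = render_depth((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
```

Because the weights peak where the SDF crosses zero, the rendered depth lands on the surface itself, which is the property that makes the zero level set a well-defined surface to extract.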
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is in general clear and easy to follow.
- The differences and modifications with respect to the baseline method (EndoNeRF) are clear and sound.
- The experiments show beyond reasonable doubt that EndoSurf is superior to EndoNeRF. The discussion of the results and the explanations for the observed differences are convincing.
- The code is made available on a companion website.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The work is competent but incremental in nature. It essentially builds on EndoNeRF by improving the shape representation using NeuS and, to a lesser extent, the learning of the appearance using [27].
- I understand the space limitations, but would like to see additional ablation studies in particular to understand if the observed improvements mostly come from the SDF field, the disentangling, or both.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The submission has a companion website with data, code and results.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- It would be useful to elaborate on the masking of surgical instruments, as it seems to be different from [25]. EndoNeRF uses importance maps / probability mass functions Vi that balance the sampling rate of frequently occluded pixels, taking advantage of the camera barely moving with respect to the scene (single-viewpoint stereo). EndoSurf seems to use the binary tool mask at each frame. What are the reasons for and practical implications of these differences?
- Minor comments: I believe that, unlike what is said on page 2, 2nd paragraph, EndoNeRF trains two (and not three) separate MLPs: one for the deformation, and the other for both occupancy and color, as in the original NeRF article.
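The difference between a per-frame binary mask and an importance map can be illustrated with a small sketch. The weighting rule below is an assumption chosen for illustration and is not EndoNeRF's exact formula: tool pixels get zero probability in the current frame, and visible pixels that are frequently occluded in other frames are sampled more often.

```python
def sampling_pmfs(tool_masks):
    """Build one pixel-sampling PMF per frame.

    tool_masks[f][p] is 1 where a tool hides pixel p in frame f, else 0.
    Illustrative weighting rule (assumption), not EndoNeRF's exact formula.
    """
    n_frames = len(tool_masks)
    n_pixels = len(tool_masks[0])
    # Fraction of frames in which each pixel is hidden by a tool.
    occlusion_rate = [
        sum(mask[p] for mask in tool_masks) / n_frames for p in range(n_pixels)
    ]
    pmfs = []
    for mask in tool_masks:
        # Visible pixels: base weight 1 plus a bonus for being rarely visible.
        weights = [(1 - mask[p]) * (1.0 + occlusion_rate[p]) for p in range(n_pixels)]
        total = sum(weights)
        pmfs.append([w / total if total > 0 else 0.0 for w in weights])
    return pmfs

# Three frames of a flattened 3-pixel image.
masks = [[1, 0, 0], [0, 0, 0], [1, 0, 1]]
pmfs = sampling_pmfs(masks)
```

Rays for a frame could then be drawn with `random.choices(range(n_pixels), weights=pmfs[f], k=batch_size)`. With a plain binary mask, by contrast, all visible pixels would be sampled uniformly regardless of how rarely they are observed.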
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This is a competently executed work that describes a pipeline that outperforms the current state of the art. In my view it is a solid contribution that should feature at the next MICCAI. I do not rank it higher because of its incremental nature.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #4

Please describe the contribution of the paper

This paper proposes a method for 3D reconstruction of endoscopic scenes in the presence of deformations. The approach uses neural implicit fields to learn dynamic scenes from stereo frames. This method improves over previous work by generating more accurate smooth surfaces.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- 3D surface reconstruction of endoscopic scenes is a challenging yet important problem. The scenes are complex due to the limited FoV, small stereoscope baseline, illumination, occlusions and most importantly soft tissue deformation. This paper makes a step forward in solving this problem.
- The related work section is clear and cites similar papers while positioning the work correctly. Most of the related work is dated post-2020.
- The method is well described, is technically sound and the design is adequate.
- The figures are helpful to understand the pipeline.
- The results are convincing, with ablation study, comparison with state-of-the-art and tests of real in-vivo data. The metrics are adequate.
- The performance of this method outperforms related works.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- This paper, despite bringing a new method, is incremental. It improves over existing methods. This is not a major weakness in my opinion since the results outperform previous work. The methodology and the results outweigh the relatively low novelty.
- We do not easily understand the necessity of each MLP. Why do we need a deformation network and a geometry network?
- The assumptions in the problem settings should be discussed. The foreground mask is not trivial to obtain, and the Projection matrix implies a calibration. This needs to be addressed.
- A more detailed ablation study would have been welcome. The impact of each loss on the overall results is addressed, but a table with all metrics would have been clearer than Figure 5.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Code is available publicly. An anonymous link is provided and is functional (I didn’t test the code). This paper uses a public dataset.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- A better explanation of why three MLPs are needed would improve the paper's clarity
- The assumptions need to be discussed in terms of feasibility and clinical practicality
- Please clarify if one model is trained for each case and discuss this limitation
- Authors should detail the ablation study on the losses
- References 16 and 17 are the same
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper introduces a new method to solve the challenging problem of 3D surface reconstruction of endoscopic scenes. The paper is technically sound, well-written and the results are convincing. Code and data are available. Overall, this is a valuable contribution to the MICCAI community.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #5

Please describe the contribution of the paper

Summary: The authors propose a new method they call “EndoSurf”, which is a neural-field-based (i.e. NERF) technique for reconstructing deforming surfaces from RGBD sequences captured by stereo endoscope videos. This method uses three neural fields to model surface dynamics, shape, and texture. This is a different approach from previous methods that struggled to produce high-quality geometry and appearance due to inadequate representations of 3D scenes. The work builds on EndoNERF with the key contribution: modelling dynamic scenes.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. New Application: The authors apply Neural Radiance Fields (NeRFs) to endoscope-based surface reconstruction.
2. Three-field architecture: The authors are using three separate neural fields, each for deformation, geometry, and appearance, which might be a new approach in this specific context. It seems that the existing solutions either don’t segregate these fields or don’t manage them as effectively.
3. Innovative regularization strategies: To enforce the geometry network to learn a solid surface, the authors design various regularization strategies, which appear to be novel contributions.
4. Disentanglement of appearance from geometry: The authors involve positions and normals as extra clues to separate the appearance from the geometry, which might be a new approach in the reconstruction from endoscope videos
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Model Complexity: The use of three separate neural fields must introduce a level of complexity that could potentially make the model difficult to train on new scenes. Given how NERFs are trained, they must overfit on each scene. So how does it handle optimisation on different endoscopic videos? The paper only experiments with a limited set of scenes.
2. Also, what is the effect on reconstruction of the presence of blood, smoke, etc.? I would like to get the authors’ view on this, as this has typically been a huge barrier to the development of AI-based 3D reconstruction techniques.
3. Overfitting Risk: The mention of “various regularization strategies” might suggest a risk of overfitting, especially if the model is heavily tuned to the specific datasets used for development and testing. I understand the authors introduce different strategies; I would like to see an ablation of the impact on the results with and without such strategies. No quantitative ablation is provided.
4. Computational Resources: Methods like NERF often require substantial computational resources as each model is unique to a particular sequence (they don’t generalise), which could limit their practical application. This model took 9 hours on an RTX 3090 per sequence (I believe… from implementation details). For real-world use, is it reasonable to expect that a model will have to be re-trained on every new endoscopic video sequence for 9 hours before a result becomes useful?
5. Lack of competing methods in Validation: The claim of superior performance over existing solutions is made by the authors themselves. However, validation was only performed against EndoNERF. Have other, more recent NERFs been considered, especially NERFs that deal with dynamic scenes like NSFF (https://arxiv.org/abs/2011.13084) and D-NERF (https://openaccess.thecvf.com/content/CVPR2021/papers/Pumarola_D-NeRF_Neural_Radiance_Fields_for_Dynamic_Scenes_CVPR_2021_paper.pdf)?
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Paper is well written and code provided + implementation details provide satisfactory description.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

A complete evaluation as suggested in the weaknesses section would bump the paper up with higher merit. The paper is indeed innovative in how it combines different areas of NERF and deep learning literature to create a cocktail of a solution for dynamic scenes. However, it lacks certain evaluation and clarification points for its claims of performance and applicability in real surgical scenarios. Overall:

Model Complexity and Generalization: Your approach uses three separate neural fields which could introduce substantial complexity, possibly making the model challenging to train on new scenes. I’d appreciate if you could delve deeper into the optimization of the model across different endoscopic videos. I also wonder how the model performs with a broader and more diverse set of scenes.

Challenging Scenarios: In practical endoscopy, the presence of elements such as blood or smoke can pose challenges. It would be insightful to understand your views on how these situations affect the model’s performance and how your method copes with them. I appreciate this can’t be solved directly from your method, however highlighting this is beneficial for further research in this space.

Risk of Overfitting and Ablation Study: The use of various regularization strategies is noted, and given that NERF-based methods are unique to each video sequence, overfitting is normal. However, it is vital to understand quantitatively the impact of these regularisation strategies and how they improve on previous methodologies. An ablation study showing the impact of these strategies quantitatively on the results could provide a better understanding of their necessity and effectiveness.

Computational Resources: Considering the computational resources required for training and the time taken (9 hours per sequence on an RTX 3090), I wonder about the feasibility of this method in real-world scenarios. Clarifying how to make this approach more computationally efficient or how it could be adapted for use in real-time applications would be valuable.

Competing Methods in Validation: While your results are promising, expanding the validation to include other recent NeRF-based methods that deal with dynamic scenes, such as NSFF and D-NeRF, would make the comparison more robust and convincing.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I commend the novel three-field NERF approach and the application of it to endoscope-based surface reconstruction. The complex tasks tackled by the models, such as handling deformation, geometry, and appearance, truly demonstrate the potential of the authors’ work.

While recognising these significant advancements, I propose further exploration to ensure the model’s performance across diverse scenarios, its resilience in challenging conditions often encountered in endoscopy (i.e. blood, smoke, etc.), and the implications of the applied unique regularization strategies. This is key for understanding the applicability of such research to the real world. Additionally, addressing the computational demands of the model and including a broader range of contemporary methods in the validation study will enhance the real-world applicability and scientific rigor of this research. I believe such considerations will underscore the study’s strengths and contribute meaningfully to the field.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes EndoSurf, a neural-field-based method which effectively learns to represent a deforming surface from an RGBD sequence. The reviewers agree that the paper is well written and theoretically sound. Although the contribution of the proposed work is incremental, the performance evaluation study is strong. All the reviewers suggest that the validation could be strengthened by including a more detailed ablation study. According to R3, the performance evaluation study can be strengthened by including a comparison to recent NERF models for dynamic scene reconstruction and by testing under challenging realistic scenarios, such as in the presence of occlusions, blood, and smoke. The authors should address the points raised by the reviewers to further clarify details about the presented work.

Author Feedback

We thank Reviewers #2, #4, #5 and Meta-reviewer for their valuable feedback. All comments are carefully considered and will be reflected in the final version.

R2: Masking: We use the same mask-guided sampling strategy as [26] EndoNeRF, as illustrated in section 2.2.2 pipeline.

R2: Three MLPs: We believe both two and three MLPs make sense. On the one hand, NeRF can be considered as one MLP that outputs the occupancy from the middle layer and color from the last layer. On the other hand, NeRF can be composed of two MLPs, one that inputs the position and returns the occupancy and a feature vector, and another MLP that inputs the direction and feature vector and outputs the color. In our paper, we use three MLPs to highlight their different regression purposes. We will explain it in our camera-ready version.

R2&R4&R5: Explanation of three MLPs: Behind the design of three MLPs is the idea that all deforming 3D models are warped from a canonical one. Based on that, we use one MLP for deformation, and two (or one, according to R2’s comment) MLPs for the canonical shape and texture. Such a structure has been validated in papers like EndoNeRF and D-NeRF. While a single MLP could theoretically directly regress the deforming 3D model by adding time as an input, papers like D-NeRF have shown that it is infeasible. We will explain the motivation for our design in detail in our final version.
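The canonical-space idea described above can be sketched as a composition of two functions. Both functions below are hypothetical analytic stand-ins for the learned deformation and geometry MLPs (a rigid shift and a unit sphere), chosen only to show how a time-dependent shape is obtained by warping query points into a single static canonical shape.

```python
import math

def deformation(p, t):
    """Hypothetical stand-in for the deformation MLP.

    Returns the displacement that carries an observed point at time t back to
    canonical space. Toy motion (assumption): the tissue shifts along x by
    0.5 * t, so the displacement undoes that shift.
    """
    return (-0.5 * t, 0.0, 0.0)

def canonical_sdf(p):
    """Hypothetical stand-in for the geometry MLP: unit sphere in canonical space."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - 1.0

def observed_sdf(p, t):
    """SDF of the deforming shape at time t: warp to canonical, then query."""
    d = deformation(p, t)
    p_canonical = tuple(a + b for a, b in zip(p, d))
    return canonical_sdf(p_canonical)

# The surface point (1, 0, 0) at t = 0 has moved to (1.5, 0, 0) at t = 1.
on_surface_t0 = observed_sdf((1.0, 0.0, 0.0), 0.0)
on_surface_t1 = observed_sdf((1.5, 0.0, 0.0), 1.0)
```

In this arrangement only `deformation` depends on time, so the shape (and, by extension, the texture) is shared across all frames; the single-MLP alternative mentioned above would instead have to regress the geometry from position and time jointly.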

R2&R4&R5: Ablation study: The results in Fig 5 are obtained by removing one loss at a time. It may not be obvious from them which loss plays the most important role. In our final version, we will improve the ablation study by adding one loss at a time and listing the performance increase instead of absolute results.
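The two ablation protocols mentioned here differ only in how the loss configurations are enumerated. The sketch below generates both; the loss names are placeholders (the rebuttal does not list them), and `gains` simply reports the change in a metric between consecutive cumulative configurations.

```python
def leave_one_out(losses):
    """Configurations that each drop exactly one loss (the Fig 5 style protocol)."""
    return [[name for name in losses if name != removed] for removed in losses]

def cumulative(losses):
    """Configurations that add one loss at a time, in the given order."""
    return [losses[:k] for k in range(1, len(losses) + 1)]

def gains(scores):
    """Change in a metric between consecutive cumulative configurations."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]

# Placeholder loss names (assumption); substitute the method's real loss terms.
losses = ["loss_a", "loss_b", "loss_c"]
removed_one = leave_one_out(losses)
added_one = cumulative(losses)
```

Note that the cumulative protocol depends on the order in which losses are added, so the reported gain for a loss is conditional on the losses already included before it.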

R4: Assumptions: We follow the same assumption as EndoNeRF where foreground masks and projection matrices are obtained in advance. We agree that obtaining masks and matrices is also very important. Considering the page limit and our research focus, we will leave it to future work. In the future, we can develop a fully automatic reconstruction pipeline by adding segmentation and pose estimation modules.

R5: Training time: Like other neural radiance field approaches, we have to train models for each case. Computational efficiency is an important research aspect in 3D reconstruction, however, it is not the main focus of this paper and we will leave it to future work.

R4&R5: Limitations: The long training time is the main limitation of our method and introducing the Neural SDF field may further increase training time. We will discuss them in our final version.

R4: Repeated reference: Thanks for noticing. We will fix it.

R5: Generalization and challenging scenarios: Unfortunately, we cannot acquire datasets other than ENDONERF and SCARED for stereo endoscope reconstruction tasks. While the authors of the EndoNeRF paper evaluated their method using 6 cases from their in-house dataset, we have access to a total of eight cases: 2 public cases provided by ENDONERF and an additional 6 cases from the SCARED dataset. We believe these two datasets are sufficient to validate our model since they cover a large diversity of organs. Our model may not handle challenging scenarios with blood and smoke because they do not have canonical shapes (our core assumption).

R5: Comparison with others: We compared our method with EndoNeRF, which shares a similar modeling approach with D-NERF. Specifically, both EndoNeRF (D-NERF) and NSFF utilize a density-field method for scene geometry representation. These density-field-based methods are primarily designed for novel view synthesis tasks. In contrast, our method focuses on generating high-quality geometry and a solid surface using SDF representations.

EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos