
Authors

Adam Schmidt, Omid Mohareri, Simon DiMaio, Septimiu E. Salcudean

Abstract

Deformable tracking and real-time estimation of 3D tissue motion is essential to enable automation and image guidance applications in robotically assisted surgery. Our model, Sparse Efficient Neural Depth and Deformation (SENDD), extends prior 2D tracking work to estimate flow in 3D space. SENDD introduces novel contributions of learned detection, and sparse per-point depth and 3D flow estimation, all with less than half a million parameters. SENDD does this by using graph neural networks of sparse keypoint matches to estimate both depth and 3D flow anywhere. We quantify and benchmark SENDD on a comprehensively labelled tissue dataset, and compare it to an equivalent 2D flow model. SENDD performs comparably while enabling applications that 2D flow cannot. SENDD can track points and estimate depth at 10fps on an NVIDIA RTX 4000 for 1280 tracked (query) points and its cost scales linearly with an increasing/decreasing number of points. SENDD enables multiple downstream applications that require estimation of 3D motion in stereo endoscopy.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_23

SharedIt: https://rdcu.be/dnwOX

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper
    1. Novel Model for 3D Depth and Flow Estimation: The paper presents SENDD, an innovative approach that tracks 3D deformation from endoscopic cameras. The model extracts a sparse set of keypoints for stereo reconstruction and estimates the depth map using a neural interpolation method. The 3D flow is then estimated using a neural network, given the images and depth maps from two time steps. This unique approach contributes to the field of robotically assisted surgery and related applications.

    2. Dataset Generation for Performance Evaluation: The authors have created a new dataset using fluorescent paint and IR lights to capture the deformation of the surgical scene. This dataset allows for comprehensive evaluation of SENDD’s performance in realistic surgical environments, making it a valuable resource for researchers working on similar problems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Model Design: The paper presents a new approach to track 3D deformation from endoscopic cameras. By extracting a sparse set of keypoints, the model estimates the depth map using a neural interpolation method and estimates the 3D flow using a graph neural network. The model addresses significant limitations of existing techniques, such as computation speed.

    2. Model Efficiency: With less than half a million parameters, SENDD achieves comparable performance to equivalent 2D flow models while enabling 3D motion estimation applications. This highlights the model’s efficiency in terms of parameter usage and demonstrates its potential for real-world implementation in various surgical scenarios.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Clarity and Implementation Details: The paper lacks clarity on certain aspects of the model’s training, such as how image warping is performed and what the data flow is in the training procedure. Providing more information on these aspects would help readers better understand and replicate the proposed method.

    2. Robustness to Large Deformation: As shown in the supplementary video, SENDD may not be robust to large-deformation scenarios. The paper could benefit from addressing this limitation and exploring potential solutions or improvements to enhance the model’s performance in such situations.

    3. Limited Comparison: The paper primarily compares SENDD with an equivalent 2D flow model, which may not provide a comprehensive understanding of its performance. Including additional comparisons with other 3D motion estimation techniques could offer a more complete assessment of SENDD’s capabilities in relation to existing methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    To enhance the reproducibility and accessibility of the proposed method, it is recommended that the authors open-source their implementation. By providing the source code and detailed documentation, the research community can more easily replicate and build upon the findings presented in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The reliance on a sparse set of keypoints for depth and flow estimation might lead to issues in capturing small or sharp deformations. The paper could discuss the potential limitations of using sparse keypoints and explore ways to mitigate these issues, such as incorporating additional keypoints or refining the interpolation method.

    2. The current presentation of Figures 3 and 4, with small points, makes them difficult to read and interpret. A suggestion for improvement would be to convert the plots to tables and represent the data numerically, which could help readers better understand the information being conveyed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are some weaknesses regarding the performance and evaluation, this paper presents a novel approach and demonstrates great potential for deformation tracking. I would recommend accepting this paper for presentation at MICCAI.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Thank you to the authors for the clarification. I keep my original rating, but it would be good to also include quantitative results in 3D and compare with other 3D tracking methods.



Review #3

  • Please describe the contribution of the paper

    The manuscript presents SENDD, a neural approach for estimating sparse scene flow across frames. The proposed method builds on top of SuperPoint [3] and shares many similarities with SuperGlue (not cited). The approach is designed to be efficient and end-to-end differentiable. To validate SENDD, a dataset collected using infrared light is presented.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Idea: the idea of using sparse keypoints for depth and scene flow for efficient computation is theoretically sound
    • Dataset: the effort towards properly evaluating SENDD using an IR camera is sound, compared to image-based metrics such as PSNR or SSIM
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While I enjoy the high-level ideas of the paper, there are several main weaknesses. Please find the detailed comments:

    • Missing experiments: In the summary of contributions, SENDD is claimed to be robust as it “detects salient points based on learning objectives rather than warped homographies”. Why is end-to-end training “robust”? Are there experiments supporting such a claim? At a minimum, a citation should be provided to support the claim.

    • Potential improper evaluation: (a) In section “Quantification”, “centers of each segmentation region” are used to validate SENDD. Does this mean that the locations are provided and the detection component of SENDD is not used? (b) SENDD builds on top of SuperPoint [3], which already includes a “keypoint descriptor”. Why would SENDD use ReTRo [18] as the “keypoint descriptor”? Combining the above two points, if during evaluation the keypoints are NOT “detected” but given, and are described by ReTRo [18], then SENDD is not evaluated end-to-end as the manuscript claims. A proper evaluation would have been to use SENDD for keypoint detection.

    • Missing baselines: The only baseline that SENDD compares against is a 2D variant of SENDD. Yet, the biggest contribution of SENDD is sparsity. What are the accuracy differences between sparse and dense approaches? What are the inference speed differences between SENDD and dense networks? This would provide better context for readers.

    • Methodology confusion:
      – Abuse of notation: In Section 3, “Ga(.)” denotes the graph attention mechanism. However, “Ga(.)” sometimes outputs a set of features and sometimes outputs flow/disparity. How are the flow/disparity computed? Based on a weighted average of attended locations? This is a core component of the approach and should be clarified.
      – In the “Loss” section, a photometric loss is computed between images A and B. What are A and B? Are they the left images of two stereo frames at different times? Are they the left and right images of the same stereo frame? Or are A and B all of the above?

    • Results confusion: As SENDD already backprojects the 2D pixels to 3D for position encoding, it should be theoretically possible to report the error in metric scale (mm) instead of pixels. This will help readers understand the accuracy of SENDD.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper in current form is hard to reproduce and the code will not be made available. I have concerns about reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would encourage changes of the manuscript to address the concerns raised in the weakness section. Furthermore, there are several minor points that do not contribute to the rating yet may be useful:

    • Citation [17] for graph-based attention is inappropriate. In my opinion, SuperGlue [A] should get the credit.

    • It is stated that the 2D approach performs similarly to SENDD as “lens smudges or specularities can corrupt the depth map, leading to errors in the 3D model that the purely photometric 2D model might not encounter.” I would not agree with this statement as the 2D approach also suffers from matching ambiguities due to specularities. Can the manuscript be clarified further?

    • Manuscript clarity:
      – The manuscript may benefit from stating from the beginning that the input is stereo video, as endoscopy can be monocular as well.
      – The “3D Flow Network” section should really be introduced after the “Sparse depth interpolation” section, as 3D flow depends on depth estimates.
      – In Section “3D Flow Network”, the features “b_i” positionally encode the point distance “p_i”. Is this in 2D or 3D?
      – What is “Neural Spin”? Why is this needed at a high level?
      – Why is the total variation loss computed only for flow, and not for depth to encourage smoothness?
      – SENDD is claimed to consistently outperform the 2D model in Figure 3. However, it is quite hard to comprehend the figures. A better way may be a bar/violin plot over all frames instead of the scatter plot.
      – In section “Dataset”, what does it mean by “human labeling after the fact”?
      – In section “Benchmarking and Model Size”, it is claimed that the approach runs at 10fps when stereo features are re-used. Are the results reported in the manuscript with reusing enabled or not?

    [A] Sarlin, Paul-Edouard, et al. “Superglue: Learning feature matching with graph neural networks.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the weaknesses of the manuscript and concerns about reproducibility, I think further revisions are required to accept the manuscript.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I appreciate the additional clarifications on the method and I have raised my rating from reject to accept.

    However, I strongly recommend revisions to the manuscript to improve its clarity. The paper seems to have been put together at the last minute and is very hard to understand in its current form. The suggested changes in the original review feedback may be helpful.



Review #4

  • Please describe the contribution of the paper

    This paper aims at tracking tissue and organs in surgical endoscopy, which is important for downstream tasks in image guidance and motion compensation. The author proposes a model, SENDD, to achieve feature detection, sparse depth estimation, and 3D flow estimation. The feature detection part of SENDD is a simplified SuperPoint, and it is trained by the downstream loss, namely the final photometric reconstruction loss. In the 3D flow estimation part, the author used ReTRo keypoints to obtain initial matches between two images. Based on these matches, sparse depth can be estimated by the sparse depth estimation part of SENDD. Then, the author used a GNN to optimize the initial matches and depth for the final 3D flow. In the sparse depth estimation part, the author used another lightweight GNN to estimate depth sparsely. In training SENDD, the author mainly adopted a photometric loss on warped stereo images and a smoothness loss on the flow.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novelty: The author presents a framework to estimate the scene flow (deformation) of dynamic surgical scenes. The framework is designed around GNNs, uses few parameters, and runs at 10 fps. Therefore, the framework can be applied to a surgical robotic system in real time.
    • Novelty: A new labelled surgical dataset for feature matching is proposed. The author used an ICG dye technique instead of manual labeling to make the dataset.
    • Novelty: The author did extensive experiments, quantitatively and qualitatively, to evaluate the proposed method. The supplementary video also demonstrates its efficiency in feature extraction and matching.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Limitation of Experiment: The proposed framework focuses on depth and scene flow estimation of the dynamic scene, but the experimental results are more about feature matching, and the tracking error is also calculated in image pixels.
    • Limitation of New Dataset: As shown in Fig. 2, the number of labelled points in the image is not large.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The author did not provide code for the proposed method. However, the new dataset will be publicly released before MICCAI.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    More experiments related to depth accuracy and scene flow accuracy should be provided.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Deformable tissue tracking is a really interesting topic. The author proposed a method that preliminarily achieves soft-tissue tracking.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    As claimed in the paper, the novelty of the method includes estimating scene flow using a GNN on salient points in 3D space. However, the paper did not provide an evaluation of 3D scene flow and depth estimation. In the rebuttal, the author also did not describe how the method would be evaluated on 3D scene flow and depth estimation. Based on the experiments, it is obvious that the method focuses on feature matching and point tracking. In this case, the current version can be weakly accepted.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a deep learning model to estimate 3D scene flow and depth based on sparse keypoint matches. The approach has been designed to be efficient and end-to-end differentiable. In addition, a new dataset has been generated using fluorescent paint and infrared light to capture the deformation of the surgical scene. The reviewers agree that this is an interesting work. However, the performance evaluation needs to be strengthened by including validation of the depth and scene flow estimation and providing the calculated error in mm instead of pixels only. Also, comparison of the proposed model with other 3D motion estimation techniques would make the validation study more robust. Details about the implementation and the methodology should be clarified as suggested by the reviewers. The limitations of the method to the degree of deformation should be explained.




Author Feedback

We thank the reviewers for suggesting ways to clarify our paper and what sets SENDD apart.

3D Evaluation: We evaluate SENDD in 2D to demonstrate its performance relative to a proven 2D counterpart. In Fig. 3, for endpoint error over our IR-labelled dataset, we projected the 3D error to 2D for comparison but omitted the 3D error calculated with depth from the exact same experiment. SENDD has a 7.9mm 3D endpoint error on tracking segment centers, outperforming RAFT (33.0mm) and CSRT (56.7mm). Refer to the SurgT challenge for the justification of CSRT as a SoTA comparison (the best submission utilized CSRT). We will include these RAFT/CSRT numbers, adjust Figs. 3 and 4, and add a table as desired. Although some images may contain few labelled points (R4), our dataset has a total of >10,000 points.
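
For readers unfamiliar with the metric, below is a minimal sketch of how a 3D endpoint error and its 2D-projected counterpart could be computed for a single tracked point. The pinhole intrinsics (fx, fy, cx, cy) and helper names are illustrative assumptions, not the authors' evaluation code.

```python
# Illustrative sketch (not the authors' evaluation code): 3D endpoint error and
# its 2D projection for one tracked point, assuming a pinhole camera model.
import numpy as np

def project(p3d, fx, fy, cx, cy):
    """Project a 3D point (camera frame) to 2D pixel coordinates."""
    x, y, z = p3d
    return np.array([fx * x / z + cx, fy * y / z + cy])

def endpoint_errors(pred_3d, gt_3d, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0):
    err_3d = np.linalg.norm(pred_3d - gt_3d)                    # same units as input (e.g. mm)
    err_2d = np.linalg.norm(project(pred_3d, fx, fy, cx, cy)
                            - project(gt_3d, fx, fy, cx, cy))   # pixels
    return err_3d, err_2d

# Example: compare a predicted and a ground-truth 3D point (units: mm).
e3, e2 = endpoint_errors(np.array([10.0, 5.0, 80.0]), np.array([11.0, 5.5, 81.5]))
```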

Comparison to 3D works: The nearest sparse-keypoint approach is SuPer Deep, which is not real-time. Prior work, SuPer, is faster but requires 2 computers (~500ms). For dense CNNs, RAFT-3D is another alternative, but it is slower (45M params, 386ms) than SENDD (<0.5M params, <100ms).

Statement on limitations: SENDD is unable to cope with occlusion or relocalization, and like all methods is vulnerable to drift. These issues could be amended by integrating a SLAM system. SENDD does not perform as well in 3D with lens smudges; a single camera artifact can corrupt the depth map and, in turn, the resulting flow. The 2D method is less likely to be affected by this, as it only uses one image.

R3:

First, we would like to clarify that SENDD is sparse in that it uses sparse keypoints to control an underlying parameterization that can evaluate flow at any location in space (sparse or dense).
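
To make the sparse-control-point idea concrete, here is a minimal illustrative stand-in: flow stored at sparse keypoints is queried at arbitrary locations by interpolation. SENDD itself learns this interpolation with a GNN; the function name (query_flow) and the inverse-distance weighting below are assumptions chosen only to illustrate the concept.

```python
# Minimal stand-in for a sparse parameterization: flow estimated at sparse
# keypoints can be queried anywhere. Inverse-distance weighting is illustrative
# only; SENDD learns the interpolation with a graph neural network.
import numpy as np

def query_flow(query_xy, keypoints_xy, keypoint_flow, k=4, eps=1e-6):
    """Interpolate 3D flow at arbitrary 2D query points from sparse keypoints."""
    d = np.linalg.norm(keypoints_xy[None] - query_xy[:, None], axis=-1)  # (Q, K) distances
    idx = np.argsort(d, axis=1)[:, :k]                                   # k nearest keypoints
    w = 1.0 / (np.take_along_axis(d, idx, 1) + eps)                      # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum('qk,qkc->qc', w, keypoint_flow[idx])                # (Q, 3) interpolated flow

kp = np.random.rand(100, 2) * [1280, 720]       # sparse keypoint locations (hypothetical)
kp_flow = np.random.randn(100, 3) * 0.5         # 3D flow at those keypoints
queries = np.random.rand(1280, 2) * [1280, 720]
dense_flow = query_flow(queries, kp, kp_flow)   # flow at any (sparse or dense) query points
```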

Notation: Great point on the ‘Ga(.)’ operator; we missed describing a layer here. There is a weighted average (attention) of the neighbor features, followed by an MLP layer that changes the dimension to 3D/1D for flow/stereo. The loss is computed between stereo and temporal frame pairs; A and B are arguments to the L_p function (see L_total).
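
A simplified sketch of the clarified step, attention-weighted pooling of neighbour features followed by an MLP head that maps to 3D flow or 1D disparity, is shown below. The class name, layer sizes, and single-head dot-product attention are assumptions for illustration, not the SENDD configuration.

```python
# Sketch: attention over neighbour features, then an MLP head producing 3D flow
# (out_dim=3) or 1D disparity (out_dim=1). Not the SENDD architecture; sizes are assumed.
import torch
import torch.nn as nn

class GraphAttentionHead(nn.Module):
    def __init__(self, feat_dim=64, out_dim=3):
        super().__init__()
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, feat_dim)
        self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, out_dim))

    def forward(self, node_feat, neigh_feat):
        # node_feat: (N, F) query nodes; neigh_feat: (N, K, F) their K neighbours
        attn = torch.softmax(
            (self.q(node_feat)[:, None] * self.k(neigh_feat)).sum(-1)
            / node_feat.shape[-1] ** 0.5, dim=-1)                 # (N, K) attention weights
        pooled = (attn[..., None] * self.v(neigh_feat)).sum(1)    # (N, F) weighted average
        return self.head(pooled)                                  # (N, out_dim) flow/disparity
```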

Concern on improper evaluation: Centers of each segmentation region are used to evaluate SENDD, but these are not used as keypoints. In the supplementary material, the example video includes a section showing SENDD tracking labelled and random points, which are not detections.

SENDD is end-to-end: Only a small part of our network architecture (the detector) is similar to SuperPoint, and the ReTRo network architecture (not pre-trained) is used for generating descriptors to maintain efficiency. These components are differentiable and trained end-to-end instead of using synthetic warping. We believe this makes SENDD robust, as it is trained on real-world deformations that happen in surgery. That said, we agree there is no specific ablation experiment to support the claim; we will adjust said statement and mention an ablation study to test our hypothesis as future work.

Regarding reuse of features possibly changing results: The results are the same for both; only the timing differs. When estimating flow for a single pair, we need to calculate two sets of keypoint features, but for a video, we can reuse the features from the previous frame instead of recalculating the full pair each time.
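
The reuse pattern can be illustrated with a small caching loop. The encoder and flow-estimator functions below are placeholder stubs, not the SENDD API; the point is only that per-frame keypoint features are computed once and shared across consecutive frame pairs.

```python
# Illustrative caching of per-frame keypoint features for video tracking.
# extract_keypoint_features and estimate_flow are stand-in stubs, not SENDD code.
import numpy as np

def extract_keypoint_features(frame):            # placeholder encoder
    return np.random.randn(2048, 64)             # (keypoints, feature dim)

def estimate_flow(feats_prev, feats_curr):       # placeholder flow head
    return np.zeros((feats_prev.shape[0], 3))    # (keypoints, 3D flow)

video_frames = [np.zeros((720, 1280, 3)) for _ in range(5)]  # dummy frames

prev_feats = None
for frame in video_frames:
    curr_feats = extract_keypoint_features(frame)        # computed once per frame
    if prev_feats is not None:
        flow_3d = estimate_flow(prev_feats, curr_feats)  # reuses cached features;
                                                         # results match pairwise mode
    prev_feats = curr_feats
```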

SuperGlue: We thank R3 for pointing out that we did not refer to SuperGlue, which is a great paper for keypoint matching. The paper we reference for our GNN model [17] references SuperGlue, but we will add the citation for clarity. SuperGlue calculates matching scores between sets of keypoints in image pairs using GNNs. SENDD, instead, takes in already completed matches (nearest neighbors in feature space), skipping the SuperGlue step. SuperGlue could be integrated as another layer, but we chose not to do so due to computational constraints. For 2048 keypoints, SuperGlue alone takes ~270ms to run. Instead of estimating match candidates, SENDD outputs new features and displacement estimates for each keypoint pair. The only similarity is our use of a graph neural network on keypoints.
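
For clarity, the kind of precomputed matching SENDD consumes could look like the nearest-neighbour sketch below. The mutual-consistency filter and L2-normalized descriptors are illustrative assumptions; the rebuttal only specifies nearest neighbours in feature space.

```python
# Sketch of nearest-neighbour matching in descriptor space (the input SENDD
# consumes), in contrast to SuperGlue's learned matching layer. Mutual filtering
# and normalized descriptors are assumptions for illustration.
import numpy as np

def nn_matches(desc_a, desc_b):
    """Return index pairs (i, j) where i and j are each other's nearest neighbour."""
    sim = desc_a @ desc_b.T                      # cosine similarity, shape (Na, Nb)
    a_to_b = sim.argmax(axis=1)                  # best match in B for each A keypoint
    b_to_a = sim.argmax(axis=0)                  # best match in A for each B keypoint
    matches = [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]
    return np.array(matches)
```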




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments. One of the reviewers increased the rating and now all the reviewers recommend acceptance of the paper. The authors should enhance the camera ready paper following the reviewers’ suggestions.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The response addresses some of the concerns regarding baselines and 3D evaluation. However, it appears that these experiments were conducted as part of the rebuttal, and the gap between strong baselines (RAFT3D) and the proposed method is large, which may indicate that some of the experimental conditions were not fair. It is impossible to say given the information provided in the rebuttal. Further, several reviewers, while recommending acceptance, are still critical of the manuscript's quality and urge the authors to make sure it is in a much better presentation state upon subsequent submission.

    Overall, while the reviewers have reached a more optimistic consensus about this manuscript, I personally believe that due to the shortcomings of the initial submission combined with the unclearities introduced by additional experiments of the rebuttal, this is a borderline paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The manuscript presents a technique to estimate scene flow (i.e. dense 3D displacement fields) in endoscopic videos. The method is a relatively simple combination of monocular depth estimation and DNN-based dense tracking (SuperPoint), and I agree it’s pretty similar to SuperGlue. The method validation is a strong point, with the use of fluorescent markers à la the Middlebury dataset. All reviewers agree that this paper should be accepted, a decision I agree with.

    For the camera-ready, please try to address the clarity concerns expressed by the reviewers and especially R2.


