
Authors

O. León Barbed, José M. M. Montiel, Pascal Fua, Ana C. Murillo

Abstract

Endoscopy is the gold standard procedure for early detection and treatment of numerous diseases. Obtaining 3D reconstructions from real endoscopic videos would facilitate the development of assistive tools for practitioners, but it is a challenging problem for current Structure From Motion (SfM) methods. Feature extraction and matching are key steps in SfM approaches, and these are particularly difficult in the endoscopy domain due to deformations, poor texture, and numerous artifacts in the images. This work presents a novel learned model for feature extraction in endoscopy, called SuperPoint-E, which improves upon existing work using recordings from real medical practice. SuperPoint-E is based on the SuperPoint architecture but is trained with a novel supervision strategy. The supervisory signal used in our work comes from features extracted with existing detectors (SIFT and SuperPoint) that can be successfully tracked and triangulated in short endoscopy clips (building a 3D model using COLMAP). In our experiments, SuperPoint-E obtains more and better features than any of the baseline detectors used as supervision. We validate the effectiveness of our model for 3D reconstruction in real endoscopy data.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_56

SharedIt: https://rdcu.be/dnwdD

Link to the code repository

https://github.com/LeonBP/SuperPointTrackingAdaptation

Link to the dataset(s)

https://arxiv.org/abs/2204.14240

https://durrlab.github.io/C3VD/


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper addresses the problem of 3D reconstruction on endoscopic images. The main idea is to use a SuperPoint-based approach where part of the supervision for training the model comes from a previously computed handcrafted reconstruction using COLMAP. The authors aim to tackle problems related to the low-quality reconstruction with state-of-the-art approaches due to specular reflections and the lack of rich image textures in endoscopic images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well motivated as the image artifacts they describe in endoscopic images pose problems for accurate 3D reconstruction.
    • The paper provides a sufficient review of the state of the art.
    • Using COLMAP reconstruction as a supervision signal aligns with many methods in the state-of-the-art for various tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper is incremental, and thus there are concerns about its novelty. This can be observed in section 3, subsect “Deep feature extraction for endoscopy,” where most of the paper contribution is compressed in the last part of the subsection.
    • In Eq. 2 and 3, some symbols are not defined, making the loss function difficult to understand.
    • The tables contain a lot of information that is not well explained, either in the manuscript or in the table captions.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experimental setup needs to be described more clearly. Without that, it is difficult to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • I suggest including a video illustrating this claim (from section 3): “This is a very challenging domain, and existing SfM pipelines fail in longer videos” to emphasize the need for a new approach.

    • Put paragraph titles in bold letters.

    • Suggestion: In Fig. 1, if you put the subfigure labels on the bottom, you can make the images slightly better and easier to understand since Fig. 1 is critical for the method.

    • What is d^{\prime}_{b} in equation 3?

    • Report the number of frames and point correspondences the authors use from EndoMAP. The dataset seems small for a large deep model.

    • Is the SuperPoint modification related only to the loss function, or is it something else?

    • In Tab. 1, the authors describe SP-E v1,2,3. However, they need to mention this in the paper and the differences between these methods. I suspect it has to do with columns 2, 3, and 4, but this is not clear from the table or the manuscript.

    • Why do the authors use only 93% of the subsequence images in the SP baseline but more images in their method? It is a slight difference in percentage, but it makes the comparison unfair. Same for the number of points.

    • The tables contain a lot of information that needs to be better explained in the manuscript and in the table captions.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My decision is based on the novelty concerns plus the difficulty of understanding the way the authors presented the results in Tables 1 and 2.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors answered my questions, and I consider that the manuscript should be accepted.



Review #3

  • Please describe the contribution of the paper

    The paper suggests a learned feature (keypoint) extraction method for SfM called SuperPoint-E, which is tailored specifically for the endoscopy domain and improves upon the vanilla SuperPoint method. The supervision signal for training the method is based on short clips of real data and on existing feature extraction methods. Those features are tracked using existing methods to provide ground-truth matches for training SuperPoint-E. This is in contrast to the vanilla SuperPoint, which uses matching based on homography-warped instances of a single image, relying on a planar surface assumption that doesn’t hold in the endoscopic domain. Thus, the suggested method utilises the temporal nature of the training data and overcomes the planar surface assumption. The benefits of the method are examined in the task of 3D reconstruction with real endoscopy data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • 3D understanding of endoscopic videos is a challenging and important problem, and the paper aims to address the unique challenges of this domain.
    • The paper presents a novel and simple method to obtain seemingly better supervision signals for training a deep feature extractor on video clip data, compared to the vanilla SuperPoint that utilises only single images.
    • The experimental results demonstrate extraction of a larger number of keypoints with low reprojection error on real colonoscopy data.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some of the phrasing is unclear (see Minor Comments). The paper writing can be improved.
    • The paper does not discuss how the method performs with specularities, which are a major challenge for feature extraction in endoscopy.
    • More experiments are needed to demonstrate the utility of the method. For example, an experiment on simulated data of camera motion tracking or 3D surface reconstruction, where the estimations can be compared to the simulator ground truth.
    • The existing experiments are lacking, in my opinion: the track length improvement is marginal and falls within the error margin, and the fact that more keypoints are extracted with low reprojection error is not “hard evidence” for the benefit of the method (we would also like the keypoints to spread out across the image to improve the 3D understanding from the images, but this is not demonstrated).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I believe the work can be reproduced based on the authors’ explanation

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    On page 1:

    • The sentence is not clear “reconstruction strategies have been studied for long, being the feature detection and matching a key step to feed Structure from Motion (SfM) pipelines.”
    • “drawbacks for these tasks” - do you mean “challenges for these tasks”?

    On page 4 - the sentence is not clear “We consider this happens when there were originally detections of the point in at least a previous and a posterior frame”

    Page 3 - “deteiled” ??

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see section 6 on weaknesses. In addition, the paper is hard to follow; I suggest improving the exposition (see my comments in section 9).

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I thank the authors for addressing my concerns. I believe that a revised manuscript with those added experiments and improved clarity can be a stronger publication. Since the required changes are not minor, I will keep the score unchanged.



Review #4

  • Please describe the contribution of the paper

    This paper describes a method for 3d-reconstruction from endoscopy videos, inspired by the popular SuperPoint structure-from-motion method.

    A key adaption to the original (henceforth referred to as ‘vanilla’) SuperPoint method is the introduction of ‘tracking adaption’, using successfully reconstructed videos via conventional methods to produce ground-truth correspondences for training rather than homographic warps of a single image as in the ‘homographic adaption’ of vanilla SuperPoint.

    The method is tested on successfully COLMAP-reconstructed sequences from the publicly available EndoMapper dataset, where it shows improvements across many different sequences compared to SIFT and vanilla SuperPoint features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is an interesting task and likely of significant clinical benefit.

    The concept of tracking adaption is clever, (to my knowledge) novel, fairly well motivated and carefully explained.

    Experimental results are quite compelling, with the method achieving superior performance compared to relevant baseline feature extractors (SIFT, vanilla SuperPoint).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are a few details missing which makes this paper harder to read, particularly in Table 2 (I have expanded on this point in Qu. 9)

    The authors make reference to a key limitation to applying existing structure-from-motion methods in endoscopy videos - that anatomy is deformable. However, I could not find an explanation of how this is handled by their method. I may have missed the relevant argument, however, if not, this argument should either be explicitly made or acknowledged as also a limitation of the proposed approach.

    The proposed method is only assessed on a single dataset so it is difficult to assess the generalisability of the method - that said, I believe this issue is not practically easy to remedy, and should not be considered a major weakness of the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper appears reproducible, with clear reference to the training data, software and methods used.

    The reproducibility checklist implies that code will be released for this method, however this is not mentioned in the paper. The intention may be to add this sentence to the camera ready version.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    A slightly more detailed explanation of what guided matching in COLMAP is would be useful - this doesn’t need to be in great detail, but an overview would help the readability of the paper.

    The graphs showing a cumulative distribution of reconstruction error in Figure 4 are very useful; however, the text could be made much larger. Also, would it make sense to have this graph aggregated across all the subsequences in the test dataset?

    In table 2, you should label the columns as subsequences in the training data. Also what are the units of the mean track length?

    It would be nice to have a more exhaustive description of the dataset used - how many training frames/correspondences were there? What proportion of EndoMapper dataset was COLMAP able to reconstruct?

    In the ablation study section, ‘3dIm Number of Images’ should be ‘3dIm Fraction of Images’. Likewise, ‘Number of points’ should be ‘Number of points per subsequence’, although I think this biases the results towards the longer subsequences, so it should perhaps be the number of points per frame. Furthermore, in Table 1, the 3DPts values don’t really need 2 decimal places and should be aligned on the decimal point.

    I think if Tr-N always represents 4 frames in table 2, it would be better to represent it as Tr-4. Is there a particular reason why 4 was chosen?

    The quotation marks around ‘originally’ on page 4 need fixing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting adaption to a well-known method in structure from motion, and is applied to an important clinical problem. Experimental comparison to existing methods for feature extraction from video is quite compelling and the paper is clear. There are a few details not well explained currently, although these should be easily fixed.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths: it works on an important problem of 3D structure & motion (or visual SLAM) from endoscopic image sequences. Weaknesses: presentation needs significant improvement; results are limited; lack of more thorough evaluations and comparison with more SOTA methods.




Author Feedback

We thank the reviewers for appreciating the importance of adequate supervision for boosting SP feature detection in endoscopy (R2,R3) and the increase in performance (R2,R4), the originality of the tracking adaptation (R4), and for signaling the room for improvement in clarity.

[R2] Concerns about novelty: Our main contribution is a novel approach to automatically generate reliable training data from video sequences by tracking feature points discovered by older detection methods, which do not require training. When used to train SuperPoint, it yields a self-supervised method that outperforms current ones.

R3 and R4 recognize it as an important contribution, especially for endoscopy where self-supervision is a must.

If accepted, we will rewrite the intro and related work to make more explicit that the novelty lies in the proposed supervision by SfM tracking and its improvements to the application in short real endoscopy shots.

[R2, R3, R4] Concerns about convincing or more clear experimental validation. If accepted, we will include the following content/clarifications to the manuscript:

  • [R3, R4] Additional results for camera motion estimation with simulated data (5 short sequences of 100-150 frames from [Bobrow et al., Colonoscopy 3D Video Dataset with Paired Depth from 2D-3D Registration, arXiv 2022])

We computed the camera trajectory for all methods and obtained similarly low RMSE for trajectory alignment with respect to the simulation ground-truth trajectory positions (SIFT: 4.61 mm; SP: only reconstructed 3/5 sequences; SP-E: 4.71 mm).

Hence, in relatively easy scenarios (simulated environments lack some of the biggest challenges such as deformation, wet surfaces and liquids) SIFT and SP-E perform equally well. But our other results (Table 2 and Fig. 3, 4) show that SP-E can reconstruct more parts of the sequences.
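
For readers unfamiliar with this kind of trajectory evaluation, the following is a minimal sketch of one common way to compute an alignment RMSE between estimated and ground-truth camera positions. It is not taken from the rebuttal or the authors' code: it assumes a similarity alignment (scale, rotation, translation) in the style of Umeyama's closed-form solution, since SfM reconstructions are only defined up to scale, and the function names `align_similarity` and `trajectory_rmse` are hypothetical.

```python
import numpy as np


def align_similarity(est, gt):
    """Least-squares similarity transform mapping est onto gt.

    est, gt: (N, 3) arrays of corresponding camera positions.
    Returns (scale, R, t) such that gt ~= scale * R @ est_i + t.
    """
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    # Cross-covariance between target (gt) and source (est) points.
    cov = g.T @ e / len(est)
    U, S, Vt = np.linalg.svd(cov)
    # Reflection guard: force a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_e = (e ** 2).sum(axis=1).mean()  # mean squared norm of centred est
    scale = np.trace(np.diag(S) @ D) / var_e
    t = mu_g - scale * R @ mu_e
    return scale, R, t


def trajectory_rmse(est, gt):
    """RMSE of position residuals after similarity alignment."""
    gt = np.asarray(gt, dtype=float)
    scale, R, t = align_similarity(est, gt)
    aligned = scale * np.asarray(est, dtype=float) @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```

With positions expressed in millimetres, `trajectory_rmse(est, gt)` returns a value in millimetres, comparable in kind (though not necessarily in protocol) to the figures quoted above.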

  • [R3] Spread of the features: We observed qualitatively in the examples of the initial submission that the features were more spread. We have now evaluated this quantitatively: We defined a 16x16 grid over each image and computed the percentage of those cells that have at least one reconstructed point. On average, for all frames in all test sequences we get: SIFT: 43.9%; SP: 56.9%; SP-E (Ours): 67.5%.

  • [R3] Performance with specularities: Our hypothesis is that our model manages to implicitly tackle this, because we train to extract features like those that remain from different existing detectors (SIFT and SuperPoint) after the global optimization step in a conventional SfM pipeline (bundle adjustment run in COLMAP). Non-reliable features, like most ones on top of specularities, are filtered out, so our supervision favors ignoring reflections. To confirm this, we computed the average percentage of points that fall on top of specularities (if value(pixel) > 180) per frame in all test sequences for all methods: SIFT: 28.6%; SP: 19.6%; SP-E (Ours): 9.9%.

  • [R2, R4] EndoMapper dataset usage. This dataset contains 100 full endoscopy sequences, from which we chose a subset (14 for training, 6 for testing). Existing SfM methods (e.g., COLMAP) are only able to reconstruct short parts from these videos, around 5 sec. long clips. So, we got 65 reconstructions corresponding to 4-7 sec. long clips obtained with COLMAP. They amount to 11260 frames. For testing, we use 7 different reconstructions amounting to 838 frames.
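
The feature-spread and specularity measurements described in the items above can be sketched as follows. This is an illustrative reading of the rebuttal text, not the authors' implementation: it assumes that "16x16 grid" means 16 by 16 cells per image, that `value(pixel) > 180` refers to 8-bit grayscale intensity, and that points are given as (x, y) pixel coordinates. The function names are hypothetical.

```python
import numpy as np


def grid_coverage(points_xy, width, height, grid=16):
    """Percentage of grid x grid cells containing at least one point."""
    pts = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    # Map pixel coordinates to cell indices, clamping border points.
    cols = np.clip((pts[:, 0] * grid / width).astype(int), 0, grid - 1)
    rows = np.clip((pts[:, 1] * grid / height).astype(int), 0, grid - 1)
    occupied = np.zeros((grid, grid), dtype=bool)
    occupied[rows, cols] = True
    return 100.0 * occupied.sum() / occupied.size


def specular_percentage(points_xy, gray, threshold=180):
    """Percentage of points lying on pixels brighter than the threshold."""
    pts = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    if len(pts) == 0:
        return 0.0
    h, w = gray.shape
    # Round to the nearest pixel and keep indices inside the image.
    cols = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
    return 100.0 * float(np.mean(gray[rows, cols] > threshold))
```

Both functions operate on a single frame; averaging their outputs over all frames of all test sequences would give per-method summaries of the kind reported in the rebuttal.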

[R2, R4] Clarity

  • Eq. 3: We meant d_{b} instead of d^{\prime}_{b}.
  • Table 1: Columns 2, 3, 4 give the different configurations used for training. After training, all methods are tested with the same number of frames. Columns 5 through 9 show testing results. 93.91% means that the fine-tuned vanilla SP method can only register 93.91% of the frames into the reconstruction. Our method improves upon this.
  • Table 2: Columns will be labeled as subsequences on the test set. The units of track length are “number of images”.

We thank the reviewers and hope we have addressed all their main concerns.




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper aims to address an important surgical scene understanding task, namely 3D structure & motion (or visual SLAM) from endoscopic image sequences. Overall the work is promising and has its merits. The rebuttal helps in clarifying some concerns. Meanwhile, as pointed out by e.g. R3, it still requires non-trivial effort to revise the current manuscript. I suggest the authors take the opportunity to revise the paper by closely following the comments, to be ready for a resubmission to the next MICCAI or a closely related venue.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studied the problem of 3D reconstruction from real endoscopic videos and proposed an improved reconstruction method based on SuperPoint and SIFT features for matching. The method reported higher performance than the baseline method. The rebuttal provides additional method and experiment details to support the effectiveness of the method.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper focuses on tackling the challenge of 3D reconstruction from endoscopic images. The main approach involves utilizing a SuperPoint-based method, where the model is trained with supervision derived from a handcrafted reconstruction generated using COLMAP. However, it is an incremental work that lacks novelty. In addition, it lacks sufficient comparison experiments with other SOTA methods, which makes it an unconvincing study. In summary, I believe it should be rejected.


