
Authors

Baoru Huang, Jian-Qing Zheng, Anh Nguyen, Chi Xu, Ioannis Gkouzionis, Kunal Vyas, David Tuch, Stamatia Giannarou, Daniel S. Elson

Abstract

Depth estimation is a crucial step for image-guided intervention in robotic surgery and laparoscopic imaging systems. Since per-pixel depth ground truth is difficult to acquire for laparoscopic image data, it is rarely possible to apply supervised depth estimation to surgical applications. As an alternative, self-supervised methods have been introduced to train depth estimators using only synchronized stereo image pairs. However, most recent work has focused on left-right consistency in 2D and ignored the valuable inherent 3D information on the object in real-world coordinates, meaning that the left-right 3D geometric structural consistency is not fully utilized. To overcome this limitation, we present M3Depth, a self-supervised depth estimator that leverages the 3D geometric structural information hidden in stereo pairs while keeping monocular inference. The method also removes the influence of border regions unseen in at least one of the stereo images via masking, to enhance the correspondences between left and right images in overlapping areas. Extensive experiments show that the method outperforms previous self-supervised approaches on both a public dataset and a newly acquired dataset by a large margin, indicating good generalization across different phantoms and laparoscopes.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_2

SharedIt: https://rdcu.be/cVRUI

Link to the code repository

The link is being prepared and will be released once the paper is published.

Link to the dataset(s)

The link is being prepared and will be released once the paper is published.


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a self-supervised laparoscopic image depth estimation approach that, in addition to left-right consistency, also invokes the inherent geometric structural consistency of real-world objects, as well as optimizing mutual information between stereo pairs. The authors demonstrate their approach using public and locally-acquired datasets that show the ability of this approach to generalize across different instruments and imaging environments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and presents a convincing demonstration that the proposed approach outperforms a number of other techniques, as shown by Fig. 2. Although the quantitative results are presented in Table 1, Fig. 2 would be more informative if a) a colour scale was provided, and b) the colour showed the deviation from ground truth instead of absolute depth. Some more details of the collection of the LATTE dataset (perhaps provided in supplementary material) would be appreciated, along with some indication of its quality. The 3D geometric consistency loss and blind masking approaches are quite novel, but it would be helpful to understand their impact more clearly. For example, how does left and right 3D model consistency contribute to a more accurate disparity map? In experiments, is it possible to show a comparison with and without the 3D consistency loss to demonstrate its effectiveness?

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There was a very similar paper published in MedIA early this year: Bardozzo et al., “A stacked and siamese disparity estimation network for depth reconstruction in modern 3D laparoscopy”, which is also a self-supervised network minimizing a 2D reconstruction loss using SSIM as the metric. Can the authors refer to this paper and describe how the submitted work differs from it?

    It is not clear from the text what the primary motivation for this work is. The authors mention the acknowledged problem of providing “ground truth” for laparoscopic 3D reconstruction, against which various reconstruction algorithms can be evaluated, and the way the paper is presented, the reader could understand that this was its objective. However, further reading reveals a structured-light-based generation of ground truth used to validate their new approach. Perhaps the intro could be revised to make the objectives clearer, perhaps along with a clinically-oriented statement of the “unmet need” that is being addressed.

    The training/testing scheme in this paper is somewhat problematic. “Hence, only key-frame ground truth depth maps were used from this test dataset while the remainder of the RGB data formed the training set.” If I understood this correctly, the authors used n-1 frames for training and 1 frame for testing for each sub-dataset. Would this not cause a bias towards overfitting, since all images in the sub-dataset are very similar and have similar depths? Would it not be more appropriate to use several subsets for training and 1 or 2 for testing, as was the case for the challenge?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Overall a well written paper that contributes to the field of 3D reconstruction from stereo endoscopic images. The paper would benefit from having the points made above addressed concisely.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper adds useful information to the standard methods of stereo reconstruction from endoscopes, and I believe makes a valuable contribution to the literature.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose to use stereo images to train a depth-from-mono method. They propose to generate depth maps via monocular depth estimation from each view and use a consistency loss to ensure these depth maps are consistent. The method is validated on a well known benchmark dataset and compared with some previously published mainstream computer vision methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper's method is well written, clear, and easy to follow.

    The validation is clear and shows a good improvement over some previous methods. The authors are also using a benchmark dataset which makes comparison more interpretable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses of this paper are:

    • the utility of this method, given that a) many laparoscopes are stereo and b) 3D information is not really obtainable from mono depth methods;
    • the motivations are not clear;
    • whether this is the right community for this paper.

    I elaborate more on these points in the constructive comments.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim they will release data + code so this should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    A weakness for me is that this method is quite general and not really specific to the computer-assisted surgery/MICCAI community. I feel like it could have been submitted to a mainstream computer vision conference and validated on the larger datasets those communities use; can the authors explain why they have submitted this paper to MICCAI?

    The motivation for depth-from-mono is not well established; many laparoscopes are now stereo, and the authors should improve this motivation in the introduction. The authors also do not really explain how they propose to use a depth-from-mono method to solve the types of applications they suggest in the introduction. Without a known object, a mono system cannot predict true depth. The authors should explain how they see depth from mono being used.

    I have concerns about the accuracy of the proposed method. Although it is clearly better than the previous monocular methods from mainstream computer vision, the error is still very large. Perhaps too high to be useful? Is this simply off by a scale factor since it’s using a mono method? Can the authors comment on this?

    The authors could/should have compared with some previous medical depth-from-mono papers for a stronger comparison, for example ‘Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy’, Liu et al., MICCAI 2018; this paper has code available and would provide a stronger comparison.

    Is this statement actually correct: “However, all of these methodologies employed left-right consistency and smoothness constraints in 2D, e.g. [3],[5], ignored the important 3D geometric structural consistency from the stereo images.”? I see the self-supervised methods as implicitly using the 3D information, whereas this work is more explicit about it. The authors should clarify this point unless they disagree with my assessment.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I feel this was one of the stronger papers in my stack, it has its limitations but at least proposes something fairly novel and validates on a well known dataset. I would definitely not argue strongly for it to be included, but it was at least interesting and the idea makes sense.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This work proposed a framework for laparoscopic image depth estimation. The estimator was trained in a self-supervised manner with stereo images, using a 3D ICP loss and blind masking, and achieved good performance on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Writing is clear and easy to follow.
    • Applying geometric constraints to endoscopic depth estimation is worth studying. This work can inspire future work to further explore this direction.
    • The framework yields significantly better results compared to the listed previous work.
    • A new dataset with ground-truth depth is collected, which will certainly benefit future research.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major concerns:

    • Novelty. The two claimed methodological contributions of this paper, the 3D geometric consistency loss from ICP and the blind mask, have already been well studied in previous depth estimation work. The ICP loss for monocular depth estimation was proposed in [13], and the idea of using geometric constraints in endoscopic image depth estimation is not new [a]. The blind mask is a widely applied technique/trick; for example, it was applied in [13] and in previous medical implementations such as [b]. Although this work re-implemented the approaches in a new application scenario, laparoscopic stereo images, the approaches are not novel.
    • Necessity of 3GC. The proposed framework has three differences from Mono1, i.e., the additional ICP loss (3GC), the blind mask (BM), and the decoder structure (FD). The selected results in the ablation study in Table 3 only show combinations that are in favor of this paper, but the full ablation in the supplementary material raises concerns about the efficacy of the main contribution, 3GC. As shown in Supp. Table 1, modifying the decoder (Mono1 w/ FD) already brings a significant improvement over the baseline Mono1, but further adding the ICP loss (Mono1 w/ 3GC, FD) decreases the performance greatly (the second-worst result in the table). This is an important finding, showing that the ICP loss is not as useful as claimed for a decent baseline, and the authors should have pointed it out.

    Minors:

    • The three baselines compared in Tables 1 & 2 are all photometric-based methods. It would be more convincing if some geometric-based baselines were compared, such as general depth estimation baselines [c, d] and the medical baseline [a].
    • There is an error in Eq. 2.

    [a] Liu, Xingtong, et al. “Dense depth estimation in monocular endoscopy with self-supervised learning methods.” IEEE Transactions on Medical Imaging 39.5 (2019): 1438-1447.
    [b] Ma, Ruibin, et al. “RNNSLAM: Reconstructing the 3D colon to visualize missing regions during a colonoscopy.” Medical Image Analysis 72 (2021): 102100.
    [c] Yang, Zhenheng, et al. “LEGO: Learning edge with geometry all at once by watching videos.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
    [d] Bian, Jiawang, et al. “Unsupervised scale-consistent depth and ego-motion learning from monocular video.” Advances in Neural Information Processing Systems 32 (2019).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Implementation details were provided. The data collection process was described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work focused on the important direction of using geometric consistency to improve laparoscopic image depth estimation. Yet, the scientific novelty of the paper is limited, and the experiment results are not able to prove the merit of one of the major contributions.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper received three highly detailed reviews with varied recommendations. Reviewers generally agreed that the method was well and clearly presented and commended the evaluation on both internal and public data. However, the reviews also revealed several major concerns that require attention:

    1. The motivation of the work is unclear; this is important to better contextualize the method’s performance and adequacy (also in light of the comment that many laparoscopes are nowadays stereo, questioning the need for monocular techniques in this space).
    2. There are strong concerns around the novelty of this work, and these concerns are backed up by adequate references. A central claim of this work, the use of geometric consistency, is not new, and it appears that other components are also taken from prior work. Clarification of the specific novelty is needed, especially since it appears that the ablation study presented in the appendix fails to convincingly demonstrate the necessity of one of the core contributions, the 3GC module.
    3. While the performance of the method is perceived as strong, the comparisons omit strong baselines such as the ones that previously proposed and used geometric consistency.
    4. There are concerns that the experimental setup may have introduced leakage/bias, so that the reported performance metrics may be optimistic.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6/17 (~30th percentile)




Author Feedback

We thank the reviewers for their constructive comments and have addressed all the points.

Reviewer 1:

  1. The requested visualizations (color scales, error maps, a data example) have been added to the final version. The dataset and code will be released.

  2. Effectiveness of 3D consistency loss: The loss penalized the difference between the left and right point clouds after registration and further constrained the consistency of the left and right disparity maps. Table 2 (the ablation study) in the supplementary material confirmed that adding the 3D geometric consistency loss boosted the performance.
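     To make this concrete, a minimal PyTorch-style sketch of such a left-right 3D geometric consistency term is given below. This is an illustrative sketch only, not the paper's released implementation: the back-projection step, the Chamfer-style nearest-neighbour distance used here in place of a full ICP registration, and all function and variable names are assumptions made for exposition.

        import torch

        def backproject(depth, K_inv):
            # Back-project a depth map (B, 1, H, W) into a camera-frame point
            # cloud (B, 3, H*W), given inverse intrinsics K_inv (B, 3, 3).
            b, _, h, w = depth.shape
            ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
            pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3, H, W)
            pix = pix.view(1, 3, -1).expand(b, -1, -1).to(depth.device)       # (B, 3, H*W)
            rays = K_inv @ pix                                                 # unit-depth rays
            return rays * depth.view(b, 1, -1)                                 # scale rays by depth

        def geometric_consistency_loss(depth_l, depth_r, K_inv, T_rl):
            # Symmetric nearest-neighbour (Chamfer) distance between the point
            # clouds reconstructed from the left and right depth predictions.
            # T_rl (B, 4, 4) maps right-camera points into the left-camera frame
            # and is known from the stereo calibration. In practice the clouds
            # would be subsampled (or registered with ICP) and blind-masked
            # border pixels excluded before this distance is computed.
            pc_l = backproject(depth_l, K_inv)                                           # (B, 3, N)
            pc_r = T_rl[:, :3, :3] @ backproject(depth_r, K_inv) + T_rl[:, :3, 3:]       # (B, 3, N)
            d = torch.cdist(pc_l.transpose(1, 2), pc_r.transpose(1, 2))                  # (B, N, N)
            return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

     Minimizing such a term alongside the usual 2D photometric and left-right consistency losses is what couples the two disparity maps through the 3D structure of the scene.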

  3. Differences with Bardozzo et al.: 1. The Bardozzo model was based on stereo rectified images for both training and testing, whereas we only need stereo for training and mono for testing. 2. We proposed a more advanced loss combining 2D and 3D terms, leveraging not only the left-right consistency in 2D but also the inherent geometric structural consistency of real-world objects in 3D. 3. Our model is lightweight, achieving an inference speed of 105 fps, faster than Bardozzo et al.

  4. Motivation of the work: We proposed a new real-time framework that achieves SOTA results and released a new dataset. Fast and accurate depth estimation can provide 3D tissue surface data, allow pre- and multimodal image registration, and enable tracking of novel surgical diagnostic instrument data, which is our main current motivation.

  5. Train/test splits: We used the keyframes of the training dataset, where accurate depth ground truth was available, for testing and removed them (and similar adjacent frames) from the training set (9491 images). Following this suggestion, we also retrained our model and other comparator methods on the MICCAI challenge split; our method outperformed Mono2 and Mono1 by 22.3% and 19.7% on Abs Rel. The new results will be added to the paper.

Reviewer 2:

  1. Suitability of the work for MICCAI community: Please see Reviewer 1 Q. 4.

  2. The motivation for depth-from-mono: Mono laparoscopes remain prevalent and will remain so for applications with a limited aperture size. Many SOTA methods require stereo for inference, while we only need a single mono image, which requires less tedious calibration and allows faster inference.

  3. Accuracy of the method: Our method estimates metric depth because it is trained with stereo information, so no scale factor is needed. It significantly outperformed other SOTA methods and provided acceptable errors for our application. We are routinely and continually collecting human tissue data, and extensive training on bigger datasets will further improve the performance.

  4. Comparison to geometric-based methods: Following this suggestion, we compared our model with geometric methods, including the geometric-based mode of Mono2, which relies on preceding and succeeding (temporally neighboring) frames, and Bian et al. Our method outperformed them by 22.3% and 10.2% on Abs Rel using the original SCARED split. These results will be added to the paper.

  5. Self-supervised methods: We agree and want to stress that most of the previous work utilized consistency in 2D space, while we want to use this in 3D. We will clarify this in the final version.

Reviewer 3:

  1. Novelty of ICP and blind mask: The mask and the 3D geometric consistency loss from ICP in [a, b, 13] required camera poses between keyframes. However, in laparoscopic applications, tool-tissue interaction creates a dynamic scene, leading to failure of local photometric and geometric consistency across consecutive frames in both 2D and 3D. The ICP loss in our paper was instead applied to stereo image pairs, where the 3D geometry inferred from the left and right images is assumed identical, allowing both 2D and 3D losses to be adopted.
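     As an illustration of this point, a minimal sketch of how such a blind mask could be derived directly from the predicted disparity of a stereo pair is given below; the simple bounds check and all names are assumptions made for exposition, not the authors' code.

        import torch

        def blind_mask_from_disparity(disp, direction=-1):
            # Mark pixels of one image whose stereo correspondence, given the
            # predicted disparity (B, 1, H, W) in pixels, falls outside the
            # border of the other image. direction = -1 shifts left-image
            # pixels leftwards to reach their right-image match. Returns a
            # float mask (B, 1, H, W): 1 = usable pixel, 0 = blind region.
            b, _, h, w = disp.shape
            xs = torch.arange(w, device=disp.device).float().view(1, 1, 1, w).expand(b, 1, h, w)
            target_x = xs + direction * disp                 # matched column in the other view
            inside = (target_x >= 0) & (target_x <= w - 1)   # valid only if it lands inside the image
            return inside.float()

     Both the 2D photometric terms and the 3D consistency term would then be gated by such a mask so that border pixels visible in only one view do not contribute to the training signal.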

  2. Necessity of 3GC: The combination of FD and the ICP loss decreased the performance because 3GC requires blind masking to achieve its best performance. We will clarify this in the final version.

  3. Comparison to geometric-based methods: Please see Reviewer 2 Question 4.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After rebuttal, the major strengths of the work include its clear presentation as well as the evaluation on both internal and public data. The rebuttal has mitigated some of the initial weaknesses with regard to clarification and adds another baseline method that reviewers suggested, but several shortcomings seem to remain. Several competing approaches to monocular depth estimation were identified by the reviewers that are not used as strong baselines for the work. This is especially true for methods that relied on highly similar ideas around consistency, and could thus be considered foundational for some of the work presented here, making them a natural baseline choice. In addition, while the ideas around ICP-based 3D geometric consistency assessment based on stereo vision are valid (the two views will, neglecting errors due to imperfect synchronicity, show the same scene), the use of stereo concepts, such as disparity, for a monocular depth estimation paper is a little artificial, because disparity is not defined for a monocular camera setup.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper generally received positive reviews from all reviewers. There is a need to estimate depth from mono laparoscopes, since most standard-of-care non-robotic MIS procedures are performed with a monocular laparoscope. R3 raised a number of key concerns, which I think were addressed in the rebuttal. Overall, I think this paper would be of sufficient interest to the CAI community working on image-guided surgery.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received mostly positive reviews except R3. The rebuttal wasn’t particularly convincing but the AC believes the paper is of good quality for MICCAI and recommends acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


