
Authors

Hao Yue, Yun Gu

Abstract

The depth and pose estimations from monocular images are essential for computer-aided navigation. Since the ground truth of depth and pose is difficult to obtain, unsupervised training methods have broad prospects in endoscopic scenes. However, endoscopic datasets lack sufficient diversity of visual variations, and appearance inconsistency is also frequently observed in image triplets. In this paper, we propose a triplet-consistency-learning framework (TCL) consisting of two modules: a Geometric Consistency (GC) module and an Appearance Inconsistency (AiC) module. To enrich the diversity of endoscopic datasets, the GC module generates synthetic triplets and enforces geometric consistency via specific losses. To reduce the appearance inconsistency in the image triplets, the AiC module introduces a triplet-masking strategy that acts on the photometric loss. TCL can be easily embedded into various unsupervised methods without adding extra model parameters. Experiments on public datasets demonstrate that TCL effectively improves the accuracy of unsupervised methods even with a limited number of training samples.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_14

SharedIt: https://rdcu.be/dnwOO

Link to the code repository

https://github.com/EndoluminalSurgicalVision-IMR/TCL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a triplet-consistency-learning framework (TCL) for unsupervised depth and pose estimation of monocular endoscopes. The framework consists of two modules that address the challenges of insufficient dataset diversity and inconsistent appearance. The proposed method can be easily embedded into previous SfM methods without adding model parameters.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of preserving multiple consistencies (geometric, depth, pose and appearance consistency, etc.) is innovative in this paper.

    The comparative experiments are very comprehensive, comparing with many existing advanced methods and demonstrating superiority.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No significant weaknesses to comment on.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The pose results in Figure 2 are somewhat confusing. It is hoped that they can be improved to make them clear.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well organized and well written. The experiments are well constructed and thoroughly conducted.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    A triplet-consistency-learning framework (TCL) is proposed to improve the accuracy of unsupervised depth and pose estimation in monocular endoscopy. Two modules are proposed to enrich the diversity of endoscopic datasets and reduce appearance inconsistency. The proposed method is evaluated on public datasets and compared with a number of related algorithms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Well written
    • Well structured
    • Rigorous evaluation
    • Good performance of the proposed method compared with competing methods

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The limitations and future work are not discussed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Although the code is not released, a clear description of the method, the software framework used, and the implementation details are provided. The data for evaluation are publicly available. Thus, it is possible to re-implement the method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, the paper is well written and organized. The proposed method is useful for improving the accuracy of unsupervised monocular endoscopic depth and pose estimation. The method is fairly evaluated through comparison with state-of-the-art methods such as the original MonoDepth2 and SC-SfMLearner. Some minor comments are provided as follows:

    • In the last sentence on page 1 “In this framework…”, the reference is missing.
    • In the second paragraph on page 2 “This can easily lead to the overfitting of modern SfM methods based on deep neural networks.”, the reference is missing.
    • For section 2.1, it is better to put “Framework Architecture” at the beginning.
    • Clarify the term “video triplets”, in the provided references, this term is not found.
    • The discussion on the limitation of the method is missing.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is solving an important problem in endoscopic image computing. According to the performance reported in the paper, the proposed method can be very useful for improving monocular endoscopic depth and pose estimation. The paper is well written and structured.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I think the authors’ responses have answered the major concerns from R3, and the minor concerns from other reviewers, such as confusing expressions in the manuscript, will be addressed in their final version.



Review #3

  • Please describe the contribution of the paper

    This paper focuses on depth and pose estimation from a monocular endoscope video. The author proposes two modules, namely the Geometric Consistency module (GC) and the Appearance Inconsistency module (AiC), which can be easily embedded into an unsupervised monocular depth estimation network, to solve two issues in the endoscopic scene: insufficient visual diversity of endoscopic datasets and appearance inconsistency in endoscope video triplets. In the GC module, the author uses perspective view synthesis to improve the visual diversity of endoscope video triplets, which is under-explored by other methods. In the AiC module, the author designs a triplet-masking strategy to reduce appearance inconsistency. Compared with previous methods, the proposed approach does not involve extra model parameters. The author also performs experiments to validate the accuracy of depth and pose estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novelty: The author found two problems unique to the endoscopic scene, namely insufficient visual diversity and appearance inconsistency. These problems do not exist in autonomous driving datasets such as KITTI and are not considered by other unsupervised depth estimation algorithms.

    Novelty: To enrich the diversity of the endoscope video, the author generated synthetic triplets with varied camera views based on the perspective view synthesis method, which is performed on video triplets. Furthermore, several corresponding loss functions are introduced to preserve depth consistency and pose consistency. Besides, the pose estimation results are quite good.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limitation of Method: Based on the quantitative and qualitative evaluations, the two proposed modules do not fundamentally address the two problems in the endoscopic scene.

    Limited Description of Method: In the GC module, the author randomly set the perturbation pose P0, which is used for further perspective view synthesis. From Fig. 1, it is apparent that the synthetic triplet has black areas without any information, which may influence the subsequent depth and pose estimation and the network training, so the perturbation pose should be carefully designed. In the AiC module, the author eliminated the appearance inconsistency by measuring the differences between the triplet-level and the frame-level representations. However, the author did not describe in detail why this method could reduce the inconsistency.

    Limitation on Experiments: As shown in Table 1 and Fig. 2, the depth estimation accuracy of the method is limited. It is confusing that the method has better depth estimation accuracy on the SERV-CT data than on the SCARED data, since it is only trained on SCARED data. In the ablation studies on different dataset amounts, as described in Fig. 3(b), the trend of RMSE and ATE for AF-SfMLearner and the proposed method is similar, which means AF-SfMLearner could also improve accuracy with a limited number of training samples.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The author has given the code of the proposed method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The GC and AiC modules should be carefully designed again to improve the depth estimation results.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Based on the quantitative and qualitative evaluations, the two proposed modules do not fundamentally address the two problems in the endoscopic scene.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    In the rebuttal, the authors described the strengths of the two proposed modules and explained the experiments. However, based on the depth evaluation accuracy listed in Table 1 and Fig. 2, the two proposed modules provide only limited improvement in depth estimation. Therefore, I think the paper can be weakly accepted.



Review #4

  • Please describe the contribution of the paper

    This paper proposes two data augmentation methods for improving the performance of unsupervised depth and pose estimation. The first augmentation shifts the pose under perspective transformations using the Geometric Consistency (GC) module, enforcing pose estimates and depth values to stay equivalent. The second augmentation (Appearance Inconsistency, AiC) helps avoid issues in the photometric loss (e.g., specularity) by masking the loss with features in a frame triplet that differ from the weighted-average feature. Using both of these, the authors show the performance increase from adding these augmentation methods to existing models and evaluate ablations of each augmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is strong in its simplicity allowing its incorporation into existing models. By adding two relatively simple data augmentation methods, performance of depth and pose estimation can be improved. Their formulation of using triplets for pose and the depth map alignment under augmentation is logical and clean, along with their quantitative and qualitative evaluations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I hold no large concerns about weaknesses. Under possible future work, it would be of note to see the parameters chosen for the Triplet Mask weights.

    In terms of language, I disagree with the use of ‘video triplet’, and would recommend using ‘triplet’ or ‘image triplet’ (or something else to emphasize the sequential aspect). Other works use the term video triplet to mean three videos, but it is used here to refer to a set of three temporally related images.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducing this work should be relatively quick. The math for the augmented losses is presented clearly, and there are no new models required to implement it.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper was clear, and the concept was very well presented. Fig. 1 was especially helpful for understanding the data augmentation process.

    Text corrections:

    -On the use of the word ‘synthesis’. I recommend calling them synthetic rather than synthesis.

    -Fig. 1. “Triple Masks” should be “Triplet Masks”

    • “However, raw and novel samples are used separately in the previous works” (p.4) What does this mean? Elaborate if you have the space.

    • “Since the movement of camera is normally slow, the same appearance inconsistency cannot exist multiple times within an endoscopic video triplet.” (p.5) By this I presume you mean that the images are changing and artifacts such as specularity are unlikely to occur in the same spot. Thus they can be detected with the average filtering the paper proposes. I would say appearance inconsistency is ‘unlikely’ to exist multiple times, but not necessarily that it cannot.

    • “TCL: Triplet Consistent Learning for Odometry Estimation of Monocular Endoscope” I recommend changing the title to something more grammatically clear such as: “TCL: Triplet Consistent Learning for Odometry Estimation in Monocular Endoscopy”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes two data augmentation methods that can be used in monocular endoscopy. Their evaluation is performed well, and the method should integrate easily into other works. Thus I recommend to accept for the sake of having this information available to those working in monocular endoscopy.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Authors have addressed the primary concern I had regarding language ‘video triplet’->’image triplet’. I recommend accept.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This submission proposes to solve unsupervised depth and pose estimation of monocular endoscopic images using a triplet-consistency-learning framework (TCL).

    The reviews were generally positive (R1 - accept, R2 - strong accept, R4 - weak accept), but R3 recommended weak reject. Since the reviewers were not unanimous, and R3 has raised some important concerns, I recommend that the authors be invited to rebut the major negative comments from all of the reviewers. I recommend that the emphasis be put on the various points raised by R3 (item 6) concerning method limitations, lack of clarity in the description, and experimental limitations. A clear description of method limitations (R1), clarification of Figure 2 (R1), and a response to parameter sensitivity (R4) should ideally be provided. Space permitting, a defence of the term ‘video triplet’ (R4) should also be given. Note that, following the MICCAI rules, new experimental results should not be presented in the rebuttal.




Author Feedback

We thank the reviewers for their constructive comments.

Method limitations & Description lack of clarity [R2, R3]: Insufficient dataset diversity and inconsistent appearances are challenges in endoscopic imaging. This manuscript proposes a parameter-free method to alleviate these problems.

1. The visual diversity of endoscopic datasets is insufficient compared to road datasets such as KITTI. While road datasets mostly exhibit 3DOF motion (2DOF translation and 1DOF rotation in the road plane), endoscopy involves 6DOF motion within 3D anatomical structures. Therefore, SfM algorithms for endoscopic imaging are designed to estimate complicated trajectories with limited data diversity. Our GC module leverages perspective view synthesis that emulates camera motion to generate additional perspectives of endoscopic scenes. In addition, the GC module facilitates training by introducing loss functions that enforce consistency between the augmented and raw triplets. The synthesis method may inherently generate invalid (black) areas in the augmented samples; this arises from the single-image-based augmentation process, which lacks the information needed to fill the new areas created by the viewpoint transformation. The invalid regions have been masked in the loss function. In Fig. 3(b)-upper, Ours (5k) exhibits superior depth estimation compared to MonoDepth2 (5k) and shows comparable performance to MonoDepth2 (15k). In Fig. 3(b)-lower, our pose results surpass MonoDepth2 for all dataset sizes. The results for both depth and pose demonstrate that our method effectively enhances dataset diversity.
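
For concreteness, the following is a minimal PyTorch sketch of the kind of invalid-region masking described above: a source frame is inversely warped into the target view using the target depth and a relative pose, and the photometric loss is averaged only over pixels whose sampling coordinates fall inside the source image, so the out-of-view (black) regions do not contribute. This is an illustration under standard pinhole-camera assumptions, not the authors' implementation; the function names and tensor shapes are assumed.

    # Illustrative sketch (not the authors' code): inverse warping with a
    # validity mask so that out-of-view (black) regions are excluded from
    # the photometric loss. Names and shapes are assumptions.
    import torch
    import torch.nn.functional as F

    def warp_to_target(src_img, tgt_depth, K, K_inv, T_tgt2src):
        """Warp src_img (B,C,H,W) into the target view using the target
        depth (B,1,H,W), intrinsics K/K_inv (B,3,3), and the relative pose
        T_tgt2src (B,4,4). Returns the warped image and a validity mask."""
        b, _, h, w = src_img.shape
        device = src_img.device
        ys, xs = torch.meshgrid(torch.arange(h, device=device),
                                torch.arange(w, device=device), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)        # (3,H,W)
        pix = pix.float().view(1, 3, -1).expand(b, -1, -1)             # (B,3,HW)
        cam = (K_inv @ pix) * tgt_depth.reshape(b, 1, -1)              # back-project
        cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)   # (B,4,HW)
        proj = K @ (T_tgt2src @ cam_h)[:, :3]                          # re-project
        uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
        u = 2.0 * uv[:, 0] / (w - 1) - 1.0                             # to [-1, 1]
        v = 2.0 * uv[:, 1] / (h - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
        warped = F.grid_sample(src_img, grid, padding_mode="zeros",
                               align_corners=True)
        valid = (grid.abs().max(dim=-1).values <= 1.0).float().unsqueeze(1)
        return warped, valid

    def masked_photometric_loss(warped, target, valid):
        """L1 photometric error averaged only over valid (in-view) pixels."""
        err = (warped - target).abs().mean(dim=1, keepdim=True)
        return (err * valid).sum() / valid.sum().clamp(min=1.0)

The same validity mask can be reused for losses computed on the synthetic (augmented) triplets, where the viewpoint change is what introduces the black areas.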

  2. The main idea of the SfM method is to use the photometric loss to guide the alignment of adjacent frames. Appearance-inconsistent areas may generate substantial photometric losses even in well-aligned adjacent frames. These photometric losses caused by the inconsistent appearance impede the training process and cannot be optimized away. Our AiC module offers a straightforward and effective solution that reduces the effect of the photometric loss in the inconsistent areas by measuring the differences between the triplet-level and the frame-level representations. In Fig. 3(a), the AiC module effectively identifies the inconsistent areas within the triplet. Furthermore, the quantitative results presented in Tab. 2 demonstrate that the AiC module improves the depth and pose accuracy by reducing these inconsistencies.
  3. The issues above will be included in the revised version of the manuscript.

Limitation on Experiments [R3]: 1. Due to the uncertain depth scale of monocular unsupervised SfM methods, we follow the recognized depth evaluation process (MonoDepth2), which aligns the predicted depth map with the median value of the ground truth before computing metrics. Therefore, the range of the depth metrics is related to the scale of the ground truth. Since the SCARED and SERV-CT datasets provide different ranges and scales of depth ground truth, the numerical depth metrics of the same method are not comparable across the two datasets. 2. AF-SfMLearner improves MonoDepth2 by introducing two additional network modules. In Fig. 3(b), we achieve more substantial improvements over MonoDepth2 than AF-SfMLearner does, without involving additional network parameters. Based on the experimental results, our method integrates seamlessly with SfM methods in endoscopic navigation tasks, benefiting algorithms that have exhibited impressive performance on road datasets (MonoDepth2, SC-SfMLearner).

Clarification of Fig. 2(b) [R1]: The left 3D graph provides a visual comparison among all methods. The right graphs offer more details of each method’s trajectory in 3D and in the three projection planes.

Future work (parameter sensitivity) [R2, R4]: We will consider adaptive weights for the Triplet Mask and the other loss functions.

Writing issues [R2, R4]: The ‘image triplet’ mentioned by R4 is more accurate. All writing issues will be corrected in the revised paper.
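
The median alignment referred to above under "Limitation on Experiments" is the standard MonoDepth2 evaluation convention: the predicted depth is rescaled by the ratio of the ground-truth median to the predicted median before the metrics are computed. A minimal NumPy sketch is given below for illustration; the function name, depth clamp range, and metric subset are assumptions rather than the authors' exact evaluation code. Because the prediction inherits the ground-truth scale, metrics such as RMSE are expressed in each dataset's depth units, which is why the SCARED and SERV-CT numbers of the same method are not directly comparable.

    # Illustrative sketch of MonoDepth2-style median scaling before depth
    # metrics; the clamp range and metric subset are assumptions.
    import numpy as np

    def evaluate_depth(pred_depth, gt_depth, min_d=1e-3, max_d=150.0):
        """Median-align the prediction to the ground-truth scale, then
        compute Abs Rel and RMSE over valid ground-truth pixels."""
        valid = (gt_depth > min_d) & (gt_depth < max_d)
        pred, gt = pred_depth[valid], gt_depth[valid]
        pred = pred * np.median(gt) / np.median(pred)   # scale alignment
        pred = np.clip(pred, min_d, max_d)
        abs_rel = float(np.mean(np.abs(pred - gt) / gt))
        rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
        return abs_rel, rmse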




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have done a good job in the rebuttal, and all reviewers agree this work should be accepted. The main strengths of this work are the technical innovation with the TCL approach for unsupervised depth and pose estimation, and a high-quality method evaluation. Due to the unanimously positive reviews, this work should clearly be accepted at MICCAI.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors do a good job addressing the critiques of the reviewers. R3 had the most significant critiques for the paper. They have also agreed to accept the paper. I recommend accepting this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed all major concerns highlighted by the reviewers. An accept is recommended for this paper.


