
Authors

Wenda Li, Yuichiro Hayashi, Masahiro Oda, Takayuki Kitasaka, Kazunari Misawa, Kensaku Mori

Abstract

Depth values are essential information to automate surgical robots and achieve Augmented Reality technology for minimally invasive surgery. Although depth-pose self-supervised monocular depth estimation performs impressively for autonomous driving scenarios, it is more challenging to predict accurate depth values for laparoscopic images due to the following two aspects: (i) the laparoscope’s motions contain many rotations, leading to pose estimation difficulties for the depth-pose learning strategy; (ii) the smooth surface reduces photometric error even if the matching pixels are inaccurate between adjacent frames. This paper proposes a novel self-supervised monocular depth estimation method for laparoscopic images with geometric constraints. We predict the scene coordinates as an auxiliary task and construct dual-task consistency between the predicted depth maps and scene coordinates under a unified camera coordinate system to achieve pixel-level geometric constraints. We extend the pose estimation into a Siamese process to provide stronger and more balanced geometric constraints in a depth-pose learning strategy by leveraging the order of the adjacent frames in a video sequence. We also design a weight mask for depth estimation based on our consistency to alleviate the interference from predictions with low confidence. The experimental results showed that the proposed method outperformed the baseline on depth and pose estimation. Our code is available at https://github.com/MoriLabNU/GCDepthL.
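For readers unfamiliar with the depth-pose learning strategy the abstract refers to, the training signal comes from view synthesis: each pixel is lifted to 3D with the predicted depth, moved by the predicted relative camera pose, and re-projected into the adjacent frame, where the photometric error is measured. A minimal NumPy sketch of this reprojection under a pinhole camera model (function names and values are illustrative, not taken from the paper's repository):

```python
import numpy as np

def backproject(p, depth, K):
    """Lift pixel p = (u, v) with predicted depth into camera coordinates."""
    u, v = p
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def reproject(P, T, K):
    """Transform a 3D point by the relative pose T (4x4) and project it
    into the source view with intrinsics K."""
    Pc = (T @ np.append(P, 1.0))[:3]
    u = K[0, 0] * Pc[0] / Pc[2] + K[0, 2]
    v = K[1, 1] * Pc[1] / Pc[2] + K[1, 2]
    return np.array([u, v])

# Illustrative intrinsics; with the identity pose a pixel maps back onto itself.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = (100.0, 80.0)
P = backproject(p, depth=2.0, K=K)
p_back = reproject(P, np.eye(4), K)
```

During training, the photometric difference between the warped source image and the target image supervises both the depth and the pose network.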

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_45

SharedIt: https://rdcu.be/cVRww

Link to the code repository

https://github.com/MoriLabNU/GCDepthL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    A self-supervised monocular depth estimation framework is presented for surgical video datasets. The framework includes a scene coordinate prediction branch in addition to the depth and pose estimation branches. The pose estimation branch is also extended with a Siamese optimization process. A weighting mask based on the dual-task consistency is used to reduce the effect of unreliable predictions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is overall well-written. The flow of the paper is easy for readers to follow. The Introduction section has included a good review of the literature.
    2. The technical contribution of this paper is adequate, given the inclusion of scene coordinates and a consistency-based weighting mask.
    3. The authors provide a good comparison study against several state-of-the-art approaches, and the results show that the proposed framework performs better.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper is missing some explanations and methodological insights in the technical sections.
    2. There are also some mistakes in Fig. 1 and in the notation, which make it harder for readers to capture the gist of the proposed architecture.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have confirmed in the checklist that they will share the code publicly. The dataset used in the work is from a public challenge.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In the first paragraph of the Introduction, the authors state that ARAMIS and RAMIS have become the preferred approach for laparoscopic surgery. This reviewer agrees that RAMIS has become preferred; however, ARAMIS is still questionable nowadays. The provided references do not offer enough clinical evidence that ARAMIS has become the preferred approach yet. The authors should therefore rephrase this statement.

    2. There is probably a mistake in Fig. 1. Given Source Image I_{t’} as the only input, can the Scene Coordinate Network generate the scene coordinates of both I_{t} and I_{t’}? Can the authors double-check this figure and make the corresponding corrections as needed? Also, in the paragraph below Fig. 1, please change “I_{t} and I_{s} are the input of pose estimation” to “I_{t} and I_{t’} …”. Please describe the function D(p_t) below Equation (1). Furthermore, regarding Equation (1), how are empty pixels (those without a projection) handled?

    3. Given that scene coordinates can be generated from the predicted depth of the image, can the authors explain in detail why a separate Scene Coordinate Network has to be trained explicitly? I understand that an ablation study has been provided, but an insightful discussion on this matter is currently missing, e.g., why will this additional branch help on smooth surfaces? In addition, are the scene coordinates bounded? Are they normalized? How is the performance of scene coordinate prediction affected when a different camera (with a different focal length, etc.) is used?

    4. What is the unit of the rotation errors in Table 2? Radians or degrees? Can the authors describe how the errors were computed?

    Minor comments:

    Please use “Siamese” instead of “siamese”; “Siamese” is misspelled in a few places (e.g., the Abstract). On Page 3, change “To overcome the challenging” to “To overcome the challenge”. There are other noticeable grammar mistakes and typos as well; please proofread the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The addressed application is very relevant to the MICCAI topics. The technical novelty of this paper is adequate and validation has been conducted rigorously which also includes an ablation study. The authors have compared their approach to several state-of-the-art approaches and the results have demonstrated that their approach performs better.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • A self-supervised approach to monocular depth estimation with promising results
    • Depth prediction and pixel-wise 3D coordinate prediction are handled as two distinct tasks whose results support each other via a consistency loss
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The monocular depth estimation is treated slightly differently from previous approaches, which enhances the results.
    • The system can be applied to real data relatively easily, since no ground truth is required for training (only a camera calibration matrix) and it works on monocular data, which is easy to acquire
    • Evaluation on real data
    • A thorough comparison against many other methods is performed, as well as an ablation study.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The prediction of the scene coordinates S and the prediction of the depth D are very similar tasks (almost the same except for the representation?). A short discussion of how they differ and why they complement each other is missing. Would training multiple depth-prediction networks (and using them for the dual-consistency check) yield similar results?
    • Section 2.2 is difficult to grasp without reading it multiple times (see details below)
    • Were the runs for the ablation study performed multiple times? Some of the values are so close to each other that I feel it may be a coincidence that the “full” model performed best. Maybe report the average over N runs?
    • Maybe I missed it, but: monocular depth estimation is under-determined, i.e., there is no way to recover the absolute distance (unless objects of known size are in the scene). How do you handle this unknown scale factor? Does this mean that you would have to train on a per-patient basis?
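On the scale question, a common convention in self-supervised monocular depth evaluation (hypothetical here; the paper may handle scale differently) is to align each predicted depth map to the ground truth by the ratio of medians before computing metrics, which sidesteps per-patient training:

```python
import numpy as np

def median_scale(pred, gt):
    """Align a scale-ambiguous depth prediction to ground truth by the
    ratio of medians, a standard trick in monocular depth evaluation."""
    s = np.median(gt) / np.median(pred)
    return pred * s

# A prediction that is correct up to an unknown global scale factor
gt = np.array([1.0, 2.0, 4.0])
pred = gt * 0.37
aligned = median_scale(pred, gt)   # recovers the ground-truth scale
```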
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mention code being made available, which is great, but please remember to place a link in the paper once it’s no longer anonymous.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I don’t understand Equation (3). Are you really searching the t’ for which E becomes minimal? Does that mean you’re selecting the neighboring frame with the lower loss? Or is that a typo? Either way, I think this equation should be explained or removed.
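One possible reading (an assumption on the editor's part, not confirmed by the paper) is that the minimum over t’ denotes a per-pixel minimum of the photometric error over the adjacent source frames, as in Monodepth2's minimum reprojection loss, rather than selecting a single frame. A NumPy sketch with illustrative error maps:

```python
import numpy as np

def min_reprojection_loss(errors):
    """Per-pixel minimum photometric error over the source frames.

    errors: array of shape (n_sources, H, W); the minimum over the first
    axis keeps, at each pixel, the best-matching source frame, which is
    robust to occlusions and out-of-view pixels."""
    return np.min(errors, axis=0).mean()

# Illustrative 2x2 error maps for the previous and next frame
e_prev = np.array([[0.1, 0.9], [0.5, 0.2]])
e_next = np.array([[0.3, 0.4], [0.6, 0.1]])
loss = min_reprojection_loss(np.stack([e_prev, e_next]))
```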

    Fig. 1 implies that only Source image I_t’ is given to the Scene Coordinate Network; however, the network predicts both ^c’S_t’ AND ^cS_t - is an arrow for the second input missing? I think this network would require both images?

    I find section 2.2 relatively hard to follow. I had to read it a few times before I (think I) understood. I understand this is likely because of limited space to describe a complex topic, but here are a few things I stumbled across:

    • The term “world coordinate system” is used, however as far as I can tell the system never really uses any consistent world coordinate system? Everything is done in (local) camera coordinates. If I understood this correctly, then I suggest removing the term “world coordinate system”.
    • “from the camera coordinate system c to camera coordinate system t’” should this be c’ instead of t’?
    • “the camera system coordinate c” -> the camera coordinate system c (?)
    • I think ^cP (introduced for Fig. 3 and in the text on page 5) represents the same as ^cS_t(p_t), correct? As the latter representation is used in Equation 5, maybe replace ^cP by ^cS_t(p_t) (and similarly for ^c’P)?
    • I think you could maybe drop the superscripts c, c’ and c’->c? As far as I can tell they’re always consistent with the subscripts t, t’ and t’->t and this would be one less thing to have to understand/interpret. I think it’s clear from the text that camera coordinate system c is where the camera is at at time t and c’ for t’.

    Minor:

    • “and scene coordinate prediction with novel consistency loss functions under a camera coordinate system.” The “under a camera coordinate system” wasn’t clear to me at this point - maybe remove here, or explain?
    • Fig. 2 referenced before Fig. 1
    • “I t and I s are the input of pose estimation to estimate the transformation matrix” -> I_s should probably be I_t’?
    • If Fig. 4 represents absolute values, could you show the range instead of “low” and “high”?
    • Since you show rotational errors in Table 2, I assume that SCARED includes ground truth poses? If so, I would mention this in 3.1 where SCARED is introduced (since the ground truth depth is mentioned).

    Language (suggestions only, I’m not a native English speaker):

    • “…the smooth surface causes the photometric error to reduce even the corresponding positions between adjacent frames are inaccurate.” Words missing? add “for” and remove “are inaccurate”?
    • “We predict the scene coordinate prediction as an auxiliary task.” predict… prediction (maybe “estimate the scene coordinates” instead?)
    • “an unified” -> “a unified”?
    • “siamses” -> “Siamese”? (2x)
    • “depth estimation had become” -> “depth estimation has become”
    • “To overcome the challenging of pose estimation” -> the challenge of
    • “Then the minimized photometric error can be deformed as” -> can be formulated as (?)
    • “in the previous researches” -> in previous research?
    • “two consistency” -> two consistencies? Is this the same as “dual consistency”? If so, maybe replace with the same term everywhere?
    • “an weight mask” -> a weight mask
    • “ours w/o scene coords represent s the proposed method was evaluated with the siamese optimized pose process.” -> ?
    • “on test datasets overall the whole scenes.” -> on test datasets over all scenes. (?)
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method seems sound, and it is interesting that the two presented tasks seem to complement each other even though they are very similar. As mentioned above, the only larger thing I’m missing is a brief discussion of why this works (or at least references explaining it). The explanation is mostly clear and can likely be made clearer with minor changes to the variable names etc. There are multiple language errors, but nothing serious.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes a self-supervised framework for laparoscopic depth estimation. The authors use a multi-task training strategy, adding scene coordinate prediction to train the network with dual-task consistency. A confidence mask is also computed from the scene coordinate prediction. The authors further extend the pose estimation with a Siamese process, which improves the pose rotation prediction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Writing is clear and easy to follow.
    • Adding scene coordinate prediction as an auxiliary task is a novel idea for monocular depth estimation.
    • Averaging the forward/backward poses in training seems to be a useful trick to boost the performance a bit.
    • The performance was compared to the state of the art both in general depth estimation and in the medical domain, which makes the merit of this work more convincing.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The intuition behind using Siamese poses is unclear. The authors mention that the complex rotations in laparoscopic applications make the problem harder than in autonomous driving, but averaging the forward and backward poses has no direct relation to dealing with this issue. If the purpose of this operation was to better optimize the pose, an additional loss function could be an alternative, such as constraining T^t’_t and (T^t_t’)^-1 to be the same. A better explanation and more analysis of alternative solutions would be helpful.
    • Regarding camera pose evaluation, the authors only report the rotation results and omit the translation results. Since both translation and rotation determine the camera pose quality, a camera translation evaluation is necessary; at least a demonstration that the proposed method does not degrade this aspect is needed.
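The alternative suggested above, constraining the forward pose and the inverse of the backward pose to agree, could be sketched as a simple Frobenius-norm penalty (illustrative only, not the paper's loss):

```python
import numpy as np

def pose_consistency_loss(T_fwd, T_bwd):
    """Penalty on the disagreement between the forward pose T_fwd and the
    inverse of the backward pose T_bwd (both 4x4 homogeneous matrices)."""
    return np.linalg.norm(T_fwd - np.linalg.inv(T_bwd))

# Perfectly consistent forward/backward poses give (numerically) zero loss.
T_fwd = np.eye(4)
T_fwd[:3, 3] = [0.1, 0.0, 0.2]   # a small illustrative translation
T_bwd = np.linalg.inv(T_fwd)
loss = pose_consistency_loss(T_fwd, T_bwd)
```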
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Implementation details were provided. The evaluation dataset was cited.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Only the encoders’ structures were described. It would be better to also provide details of the decoders.
    • “Siamese” is misspelled several times.
    • Page 3, wrong notation “I_s”.
    • Ambiguous sentence at the top of Page 8.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a new framework to deal with the special challenges of laparoscopic depth estimation. Although some aspects need more clarification, this work can provide insights for the area, especially through its novel use of dual-task consistency from scene coordinate prediction to improve the depth quality.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is about monocular depth estimation in laparoscopy. It uses a neural network trained with self-supervision. All reviewers were positive about the paper, which brings an incremental but well-defined technical contribution and thorough experimental results. The AC concurs and recommends early acceptance, trusting that the authors can implement the reviewers’ suggestions for the final version without difficulty.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR




Author Feedback

We sincerely thank all the reviewers and ACs for their encouraging feedback and constructive comments. We will incorporate the reviewers’ suggestions in the camera-ready version and carefully proofread our paper. We are conducting the suggested experiments and will report the results in our code repository.


