Authors

Wenda Li, Yuichiro Hayashi, Masahiro Oda, Takayuki Kitasaka, Kazunari Misawa, Kensaku Mori

Abstract

This paper proposes a novel self-supervised monocular depth estimation approach for laparoscopic scenes. Previous methods independently predicted depth maps ignoring spatial coherence in local regions and temporal correlation between adjacent images. The proposed approach leverages spatio-temporal coherence to address the challenges of textureless areas and homogeneous colors in such scenes. This approach utilizes a multi-view depth estimation model to guide monocular depth estimation when predicting depth maps. Moreover, the minimum reprojection error is extended to construct a cost volume for the multi-view model using adjacent images. A cycled prediction learning for view synthesis and relative poses is also designed to exploit the temporal correlation between adjacent images fully. To benefit from spatial coherence, deformable patch-matching is introduced to the monocular and multi-view models to smooth depth maps in local regions. Additionally, the 3D consistency of the point cloud back-projected from predicted depth maps is optimized for the monocular depth estimation model. Experimental results show that the proposed method outperforms existing methods in both qualitative and quantitative evaluations.
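
The abstract refers to a point cloud back-projected from predicted depth maps. A minimal sketch of that step under a standard pinhole camera model is given below. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project(depth: Sequence[Sequence[float]],
                 fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift every pixel (u, v) with depth d to a 3D point in the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); these names are hypothetical, not taken from the paper.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v indexes image rows
        for u, d in enumerate(row):        # u indexes image columns
            if d <= 0:                     # skip pixels without a valid depth
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points


# Usage: a 2x2 depth map at constant depth 2 with unit focal length.
cloud = back_project([[2.0, 2.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

A consistency term of the kind described in the abstract would then compare such point clouds across adjacent views after applying the estimated relative pose.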

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_41

SharedIt: https://rdcu.be/dnwPm

Link to the code repository

https://github.com/MoriLabNU/MGMDepthL

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

The paper proposes a multi-view-stereo-guided, self-supervised monocular depth estimation approach for laparoscopic scenes. It uses a pose network to exploit the geometric constraints between consecutive images. The pipeline provides a complete structure for real-world applications.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The proposed multi-view depth guided pipeline enhances the performance of the proposed monocular depth prediction network. An extended deformable patch matching promotes the efficiency of the network.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. I am confused about the role of ground-truth (GT) depth. Fig. 1 and all losses suggest that the proposed method is a label-free approach, as no GT depth is presented or needed. However, the 23.687 training images (no idea why it is not an integer) seem to suggest that it uses GT depth. As the validation section does not raise scale issues, I believe the training is done on a labeled dataset. Please explain whether GT depth is used in the training step.
2. Another weakness is reproducibility. The authors do not clearly introduce the backbone network structures, including the monocular depth network, pose network, offset network, and feature extractor. A training framework and set of hyperparameters of this size make the method almost impossible to reproduce.
3. The paper says it used a testing data size of 2.405. A testing data size of 2-3 is not convincing. More experiments, especially with different textures and illumination, should be performed.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The four backbone networks are not clearly referenced. Considering the complicated structure and heavy hyperparameters, this manuscript is very difficult to reproduce.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Please cite the 4 backbone networks in the framework.

More experiments could be run to validate the performance of the method. It would be interesting to see whether the trained network transfers to datasets with different illumination and texture.

A demo video will significantly benefit readers.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The overall impression, presentation and completeness of the work.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

The rebuttal confirms my evaluation of the research. I will keep my previous rating.

Review #3

Please describe the contribution of the paper

This paper focuses on the monocular depth estimation of the laparoscope. The author uses spatio-temporal coherence in images to help the depth estimation. Based on the traditional self-supervised monocular depth estimation (MDE) network, the author first adopts a multi-view depth estimation (MVDE) model to guide the training of the MDE network. Here, the input of the MVDE network has three images, so it leverages the temporal information in the short clip. Then, deformable patch matching is introduced into MDE to explore the spatial coherence in local regions of the image. Besides, point cloud consistency is considered during the training. The author also did experiments to validate the proposed method.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Novelty of Method: The temporal coherence between adjacent images and the spatial information in the images are integrated into the training of the MDE network. The temporal relation between adjacent images can improve MDE, so the author adopts an MVDE model and constructs a cost volume in the MVDE to guide the training of the MDE network. Texture-less areas and reflective parts are common in surgical scenes. Therefore, the author introduces deformable patch-matching-based local spatial propagation to MDE. Specifically, an offset network based on an encoder-decoder is proposed to estimate the offset map of the image. Then, the patch-matching-based reprojection error based on the offset map is added to the final training loss. Experiments: The author compared the proposed method with several existing methods and performed extensive ablation studies to validate each component of the proposed method.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Complexity of Method: The author introduces two additional modules to the training of the MDE network, namely, the multi-view depth network and the offset network. As shown in Table 2, the efficiency of these two modules is limited. On the contrary, the point cloud consistency (PCC), which does not add any complexity to the MDE, can improve the depth estimation accuracy greatly. Limitation of Experiment: As shown in Fig. 4, the author qualitatively shows the depth estimation results. However, the depth maps are too small to see the depth results.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The author provides the code of the proposed method.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

In the experiment, the author needs to improve the design of the two modules or provide more losses to make them more efficient. Besides, the ablation studies should be performed on the validation dataset instead of the test dataset. The standard deviations of scores need to be reported.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The author considers the spatial information in the single image and the temporal coherence among adjacent images during the training of the monocular depth estimation network. However, the whole model is too complex, and the final evaluation results show that the two proposed modules improve the depth estimation accuracy only marginally.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

As shown in Table 2, the first row represents the method without point cloud consistency and the last row represents the proposed method with point cloud consistency. Thanks to the point cloud consistency, the depth estimation accuracy improves considerably, from 9.217 mm to 6.441 mm of RMSE. However, the improvements from the other modules are limited. For example, the MRE only reduces the RMSE from 6.576 mm to 6.441 mm. In addition, all ablation studies should be evaluated on validation data instead of test data.

Review #4

Please describe the contribution of the paper

This manuscript presents a self-supervised monocular depth estimation method that exploits spatio-temporal correspondence during training. A multi-view depth model is used as guidance during training, with a minimum reprojection error proposed for cost volume construction. A point cloud consistency module is used as a geometric constraint between frames. Deformable patch matching is used to handle the challenge of spatially coherent local regions. Cycled prediction learning is applied to exploit more temporal information. The results show that the proposed method performs better than previous works.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper is clearly organized and written with well-made figures.
2. The proposed method has a decent amount of novelty, specifically the multi-view cost volume guidance and deformable patch matching are very interesting.
3. The experiments are thorough and well-designed including a comparison study with recent works and an ablation study demonstrating the contribution of each proposed module.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Some minor typos will need corrections.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper should be reproducible given that the paper is clearly written with enough details.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. first paragraph of page 2, typo, should be “uncertainty”
2. section 2.3, typo, should be “patch”
3. last sentence of page 5, I think the author means “i” stands for the index number of the view
4. section 3.4, typo, should be “patch”. It would be good if authors could further proofread the manuscript to correct those typos.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method design has decent novelty. The experiment results are good. The paper clarity is excellent with well-made figures.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a self-supervised monocular depth estimation approach which leverages spatio-temporal coherence to address the challenges of textureless and homogeneous areas in laparoscopic scenes. Details regarding the methodology, implementation, and training of the model should be clarified, as suggested by the reviewers. R1 suggests that the generalisability of the method to scenes with different illumination and texture should be evaluated. According to R2, the proposed modules do not improve the performance of the method significantly.

Author Feedback

We appreciate the meta-reviewer and the reviewers for their constructive feedback. We address major concerns raised by reviewers.

Reviewer 1:

Q1: The validation section does not raise scale issues, please explain whether GT depth is used in the training step. A1: We did not use GT depth during training, following the other self-supervised methods we compared in our experiments. GT depth was only used to evaluate the predicted depth maps. The outputs of self-supervised methods are depth values defined only up to scale. To handle the scale issue, we followed the previous methods listed in Table 1 and adopted the standard, widely used process: during evaluation, each prediction is rescaled using the ratio between the median ground-truth and median predicted depth values.
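
The median scaling mentioned in A1 can be sketched as follows. This is a minimal illustration of the procedure commonly used when evaluating self-supervised depth methods, not the authors' evaluation code. The function name and the convention that a ground-truth value of 0 or less marks a pixel without ground truth are assumptions.

```python
from statistics import median
from typing import List, Sequence


def median_scale(pred: Sequence[float], gt: Sequence[float]) -> List[float]:
    """Rescale up-to-scale predictions so their median matches the GT median.

    `pred` and `gt` are flattened depth maps of equal length. Pixels whose
    GT value is <= 0 are treated as missing (an assumption of this sketch)
    and are ignored when computing the ratio.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    # ratio = median(ground truth) / median(prediction), over valid pixels only
    ratio = median(g for _, g in valid) / median(p for p, _ in valid)
    return [p * ratio for p in pred]


# Usage: predictions are correct up to an unknown factor of 10.
scaled = median_scale([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Because the ratio is computed per image from the ground truth, this step removes the scale ambiguity at evaluation time only; no ground-truth depth enters training.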

Q2: The training framework and hyperparameters in such size make it almost impossible to reproduce. A2: As R2 mentioned, we commit to releasing the source code and trained models upon paper acceptance, which should make the method straightforward to reproduce. As outlined in Section 3.2, following Manydepth, the MDE, MVDE, and pose networks are based on the regular ResNet-18, and the offset network consists of two 2D convolution layers.

Q3: The paper used 2.405 testing data size. 2-3 testing data size is not convincing. More experiments in different texture and illumination should be tested. A3: As mentioned in section 3.1, 2,405 and 23,687 are the number of images in the test and training dataset. Our test dataset of 2,405 images covers nine different scenes (Section 3.1), offering varied texture and illumination conditions.

Reviewer 2:

Q1: The limited efficiency of the multi-view depth and offset network modules is concerning, given the significantly improved depth estimation accuracy provided by the point cloud consistency (PCC) without any additional complexity. The author needs to improve the design of the two modules or provide more losses to make them more efficient. A1: First, the offset network was used to realize the proposed deformable patch matching (DPM). As shown in the third row of Table 2, DPM contributed to the proposed method on all metrics, especially Sq Rel, RMSE, and RMSE log. For the multi-view depth network, we introduced the minimum reprojection error (MRE) to optimize the construction of the cost volume. As shown in the last three rows of Table 2, the combination of the proposed MDE and multi-view depth estimation (MVDE) performed better than the proposed MDE part alone on all metrics, which means that the optimized MVDE model (with MRE) also improved MDE greatly. In addition, the proposed method had stable results, with standard deviations of 0.002, 0.018, 0.038, and 0.001 for Abs Rel, Sq Rel, RMSE, and RMSE log, respectively. Second, this paper focused on improving MDE based on spatio-temporal correspondence. MVDE leveraged the temporal information, and DPM considered the spatial coherence in local regions to improve the results of MDE. According to the Sq Rel metric (sensitive to large errors) in Table 2, these two modules greatly enhanced the stability of the predictions. Third, PCC was also one of our proposed modules and fit the combination of MDE and MVDE well. As shown in Table 2, the other proposed components also clearly enhanced the results of the proposed method on all metrics. Further improvements (network pruning and so on) will be considered in future work. Fourth, as mentioned in Section 3.1, only the MDE network was used to predict depth maps during inference, without additional complexity from the other proposed modules.
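
For readers unfamiliar with the four metrics named in this response, the sketch below uses the definitions that are standard in the monocular depth estimation literature. Whether the paper's evaluation code matches these formulas exactly is an assumption, and the function name is illustrative.

```python
import math
from typing import Dict, Sequence


def depth_errors(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Abs Rel, Sq Rel, RMSE and RMSE log over valid pixels.

    `pred` and `gt` are flattened, strictly positive depth values of equal
    length (invalid pixels are assumed to have been filtered out already).
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gt)
    diffs = [p - g for p, g in zip(pred, gt)]
    # Mean absolute error relative to the true depth.
    abs_rel = sum(abs(d) / g for d, g in zip(diffs, gt)) / n
    # Squaring the error makes this metric sensitive to large deviations.
    sq_rel = sum(d * d / g for d, g in zip(diffs, gt)) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "rmse_log": rmse_log}


# Usage: one pixel is off by a factor of two, the other is exact.
errors = depth_errors([2.0, 4.0], [1.0, 4.0])
```

The squared term in Sq Rel is why the response describes that metric as sensitive to large errors: a single badly wrong pixel contributes quadratically rather than linearly.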

Q2: The depth maps are too small to see the depth results. A2: We will leverage the extra space and increase the size of Figure 1 in the final version.

Q3: The ablation studies should be performed on the validation dataset instead of the test dataset. A3: We used the training and validation dataset to adjust and finalize the hyperparameters for variants of the proposed method. All the proposed method variants were evaluated on the test dataset in Table 2.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors responded adequately to the reviewers’ comments. R2 still has concerns regarding the significance of the performance improvement achieved with the proposed modules. However, I would recommend acceptance of the paper and encourage the authors to enhance the camera-ready paper following the reviewers’ suggestions.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have sufficiently addressed the reviewers’ comments in the rebuttal. There is sufficient value in the paper to merit acceptance to MICCAI.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work presents a self-supervised method to reconstruct multi-view depth maps (up to scale) from a set of laparoscopic images. The approach is based heavily on existing approaches [4,12], with some adaptations in the loss functions. Specifically, it uses geometric constraints acting between neighbouring frames to provide extra self-supervision information. The method is evaluated on the SCARED dataset (commonly used for benchmarking reconstruction methods), with encouraging performance compared to baselines [12] and [25]. Reviewers 2 and 4 favour acceptance; however, Reviewer 3 recommends weak rejection, with important concerns about the limited contributions of the various proposed adaptations (CPL, DPM, MRE). Indeed, it seems that the only modification to baseline [12] with a major impact on performance is the point-cloud consistency term. Without a statistical comparison of results (not presented, although difficult with this kind of dataset), it is very hard to say whether the impact of the other adaptations is numerically significant and, more importantly, significant for relevant downstream tasks. At the same time, it is unfair to reject this method based on a lack of statistical significance analysis, when that is often the case (sadly!) for research on this problem. Based on the fact that the results do indeed seem better with a proposed extension on [14], and because the problem is very relevant in CAI, I would be in favour of this work being accepted. However, I find that some of the extra modifications, apart from the point-cloud consistency loss, will likely not have a tangible impact on performance in real-world downstream tasks.

Multi-view Guidance for Self-supervised Monocular Depth Estimation on Laparoscopic Images via Spatio-temporal Correspondence