
Authors

Samyakh Tukra, Stamatia Giannarou

Abstract

Accurate stereo depth estimation is crucial for 3D reconstruction in surgery. Self-supervised approaches are preferable to supervised approaches when limited training data is available, but they cannot learn clear, discrete data representations. In this work, we propose a two-phase training procedure: (1) performing Contrastive Representation Learning (CRL) on the left and right views to learn discrete stereo features, and (2) using the trained CRL model to learn disparity via self-supervised training based on the photometric loss. For efficient and scalable CRL training on stereo images, we introduce a momentum pseudo-supervised contrastive loss. Qualitative and quantitative evaluation on minimally invasive surgery and autonomous driving data shows that our approach achieves higher image reconstruction scores and lower depth error than state-of-the-art self-supervised models. This verifies that contrastive learning is effective in optimising stereo depth estimation with self-supervised models.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_58

SharedIt: https://rdcu.be/cVRXw

Link to the code repository

https://github.com/CVRS-Hamlyn/Stereo-Depth-Estimation-via-Self-Supervised-Contrastive-Representation-Learning/

Link to the dataset(s)

Kitti: http://www.cvlibs.net/datasets/kitti/raw_data.php

CityScapes: https://www.cityscapes-dataset.com/dataset-overview/

Hamlyn: http://hamlyn.doc.ic.ac.uk/vision/

SCARED: https://endovissub2019-scared.grand-challenge.org/Downloads/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the problem of stereo depth estimation. Authors introduce a self-supervised contrastive representation learning method for two-stage stereo depth estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Methods are novel and interesting. Authors formulate CRL as a dictionary look-up problem as the contrastive learning for self-supervision. Methods are pretty well presented and easy to understand.
    • Results are convincing. Following Figure 1, we notice that such a method can show the differences between different categories. Fig 5 also shows that such a method can be extended to general images.
    • The authors have promised to release the code used in the paper.
    • This paper is well written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I do not have major concerns for this paper.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors promise to release their code as well as the datasets used in this paper. Thus, I believe the results should be reproducible by following the descriptions and the released repo.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to 4 and 5.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, performance and reproducibility.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper proposes the first contrastive representation learning (CRL) method for stereo depth estimation, based on a momentum pseudo-supervised contrastive loss, which achieves state-of-the-art performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and easy to follow in general.
    • The authors proposed a novel contrastive learning algorithm tailored for the stereo depth estimation problem. Self-supervised stereo depth estimation is an important problem given the lack of ground-truth labels, especially in medical domain.
    • The proposed method is evaluated on two surgical datasets and one non-surgical dataset and demonstrates state-of-the-art performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Despite the novel application of CRL to stereo depth estimation, the proposed method seems to be a combination/modification of multiple prior works, which limits the technical novelty. For example, the proposed momentum supervised contrastive loss is a combination of the MoCo loss [8] and the supervised contrastive loss [9], and the decoder is modified from DispNet and PWCNet [14].
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide implementation details in the paper and also agree to release code and pretrained models in the reproducibility checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • More ablation studies may be needed to understand the contribution of each component of the proposed method. For example, how does the proposed momentum supervised contrastive loss compare to the vanilla CRL loss / MoCo loss?
    • It is unclear how the hyper-parameters were chosen, such as the temperature \tau and the weights for L_pe.
    • Can the authors explain more about multi-scale disparity estimation in the decoder?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The application is novel and the performance is good, but the technical novelty is somewhat limited.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The paper can be interesting to the MICCAI community and the rebuttal addresses most of my concern.



Review #3

  • Please describe the contribution of the paper

    This paper proposes an approach for stereo depth estimation based on a two-phase training procedure: contrastive representation learning of features, followed by self-supervised disparity learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper focuses on self-supervised stereo depth estimation. This is important, as it is hard to collect abundant ground-truth clinical data for supervised learning.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • It is unclear what the major difference is between the proposed method and existing methods. It seems the two-stage framework, with the first stage using contrastive representation learning, is the main difference.
    • If so, it is necessary to study other approaches to learning the representation in the first stage, since that stage only requires class labels rather than ground-truth disparity maps, and class labels are easy to obtain. There are other options to learn/initialise the encoder, such as finetuning a pretrained model.
    • Both the memory dictionary and the momentum-based key encoder appear in the contrastive representation learning literature. It is hard to identify any new method/contribution in the proposed stereo contrastive representation learning.
    • It is unclear what the difference is between the proposed decoder and the decoder in DispNet, as the details of DispNet are missing.
    • It is unclear what image data, feature representation, model, etc. are used to obtain the t-SNE visualisations in Fig. 1. Thus, it is impossible to draw any conclusion from the figure.
    • It is unclear how the encoder is trained when using only Lpe. Does the encoder use weights pretrained on another dataset? Is the encoder initialised randomly and jointly trained with the decoder?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Details on how to train the encoder are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The legend in Fig. 1 is too small. A reference should be added for DispNet. Equation (3) is confusing: I understand that the losses Lmo and Lpe are applied at different stages of model training, but the equation suggests both losses are applied at the same time.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution of this paper is limited, and many details needed to understand the proposed method are unclear (see weaknesses).

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper introduces a new method for stereo depth estimation, comprising contrastive representation learning (CRL) and self-supervised disparity learning. The method is assessed on two surgical datasets and one non-surgical dataset and shows SOTA performance. The paper has pros and cons, according to the reviewers. The positive aspects are: 1) the method is interesting; 2) the paper is well written; 3) the results are convincing; and 4) the paper addresses an important real-world problem. There are also a few negative points: 1) the method seems to be a combination/modification of multiple prior works, which limits the technical novelty; 2) is the main contribution of the approach the first stage, which uses contrastive representation learning? 3) what is the difference between the proposed decoder and the decoder in DispNet, given that the details of DispNet are missing? 4) it is unclear what image data, feature representation and model are used to obtain the t-SNE visualisations in Fig. 1; and 5) it is unclear how the encoder is trained when using only Lpe. Given the scores and comments, I'd like to invite the authors to write a rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the reviewers for their constructive comments. Minor issues will be addressed in the revised paper.

Novelty: The main novelty of our work is the CRL loss function proposed in Eq.1. Existing CRL loss functions [2,9] require high computational resources to learn discriminative representations between highly similar images. Our proposed function is tailored to learning differences between highly similar images such as stereo pairs, and is scalable even on single-GPU hardware. Our loss utilises pseudo-labels as a stabiliser to learn CRL effectively on stereo images, acting as an anchor that forces the learning of discriminative features between the two views. We refactor CRL as a dictionary look-up problem to reduce GPU resource demand. Without these advances, CRL would not be effective on stereo images, as verified in the ablation study below.
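For illustration, below is a minimal PyTorch sketch of a MoCo-style dictionary look-up loss with pseudo-label supervision, i.e. a momentum-encoder queue combined with a supervised-contrastive term over all dictionary entries sharing the query's pseudo-label. The function name, tensor shapes and queue bookkeeping are our assumptions for illustration; the exact formulation is Eq.1 of the paper.

  import torch
  import torch.nn.functional as F

  def momentum_pseudo_supervised_contrastive_loss(q, k, queue, labels, queue_labels, tau=0.07):
      """Illustrative sketch of a MoCo-style loss with pseudo-label supervision.

      q:            (N, D) query features from the online encoder (e.g. left view)
      k:            (N, D) key features from the momentum encoder (e.g. right view), detached
      queue:        (K, D) memory dictionary of past keys, detached
      labels:       (N,)   pseudo-labels of the current batch
      queue_labels: (K,)   pseudo-labels stored alongside the queue entries
      """
      q = F.normalize(q, dim=1)
      keys = F.normalize(torch.cat([k, queue], dim=0), dim=1)        # (N+K, D)
      key_labels = torch.cat([labels, queue_labels], dim=0)          # (N+K,)

      logits = q @ keys.t() / tau                                    # (N, N+K)
      log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

      # Positives: every dictionary entry sharing the query's pseudo-label
      pos_mask = (labels.unsqueeze(1) == key_labels.unsqueeze(0)).float()
      loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
      return loss.mean()

As in MoCo [8], the key encoder would be updated as a momentum average of the query encoder, and the new keys enqueued after each step.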

Ablation study: We enhanced our ablation study by training two new variants of StereoCRL with:

  1. Vanilla CRL loss [2]
  2. Vanilla MoCo loss [8]

Both variants are then finetuned for disparity estimation via Eq.2. Variant 1 achieves an average depth error of 24.5 and 21.4 mm on SCARED datasets 1 and 2, respectively; Variant 2 achieves 23.2 and 23.6 mm on the same datasets. Both perform worse than the proposed StereoCRL method, which achieves errors of 16.9 and 11.1 mm. The results of the new variants are comparable to those of a model trained purely on Eq.2. This study further verifies the superiority of our method in learning discrete, discriminative representations.

Hyperparameter selection: hyperparameters were selected following the original sources: temperature (tau) = 0.07 [8], and deltas 1, 2 and 3 = 0.85, 1 and 1, respectively [6].

StereoCRL decoder details:

  1. Cost volume: We build a cost volume at each decoder block by warping the encoder feature maps of the left and right views, followed by the correlation operation from PWCNet (avoiding high computational cost). The cost volume encodes features for disparity by discretising the disparity space and comparing features along epipolar lines (a minimal sketch of this operation is given after this list). This overcomes the issue of the normalised disparity predicted by other self-supervised models [13, 17], which is affected by changes in image scale. For our cost volume, we set the search range to [-5, 4], allowing a sufficient radius for negative and positive disparities. DispNet, conversely, does not build a cost volume and outputs a normalised disparity.

  2. Multi-scale cost: StereoCRL estimates a cost volume in each decoder block alongside the disparity. At each block, the feature maps are upsampled by a factor of 2. Thus, StereoCRL iteratively refines its disparity estimate by adjusting its cost-volume weights at each scale.
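To make the decoder description concrete, the sketch below builds a horizontal correlation cost volume over a [-5, 4] search range, as referenced in item 1. The shifting convention and zero-padding at the borders are our assumptions rather than the paper's exact implementation; in a multi-scale decoder such a volume would be recomputed at each block after upsampling the features by a factor of 2.

  import torch

  def correlation_cost_volume(feat_left, feat_right, disp_range=(-5, 4)):
      """Illustrative horizontal correlation cost volume.

      feat_left, feat_right: (B, C, H, W) feature maps at the current decoder scale.
      Returns a (B, D, H, W) volume, where D is the number of candidate disparities.
      """
      costs = []
      for d in range(disp_range[0], disp_range[1] + 1):
          shifted = torch.zeros_like(feat_right)
          if d > 0:        # shift right-view features to the right by d pixels
              shifted[..., d:] = feat_right[..., :-d]
          elif d < 0:      # shift to the left by |d| pixels
              shifted[..., :d] = feat_right[..., -d:]
          else:
              shifted = feat_right
          # Channel-wise correlation, as in PWC-Net
          costs.append((feat_left * shifted).mean(dim=1, keepdim=True))
      return torch.cat(costs, dim=1)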

Pretrained encoder model: Taking a pre-trained model from supervised learning for finetuning will carry over inductive biases from the original data and training process. This is detrimental for surgery, where models trained on simulated or natural scenes do not transfer well due to the inherent data mismatch. Finetuning such models via the photometric error still suffers from the limitations of self-supervised learning. With StereoCRL, we show that our proposed CRL learns generalisable discrete features from stereo images, for both natural and surgical scenes. This pre-trained StereoCRL encoder, when paired with a decoder, then achieves state-of-the-art downstream performance with purely self-supervised learning.

Training with Eq.2 only: we randomly initialise the weights of the encoder and jointly train it with the decoder.
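For context, a standard choice for such a photometric objective is the SSIM + L1 reconstruction error used in self-supervised depth methods such as [6]. The sketch below assumes that form and a 0.85/0.15 weighting, which may differ from the paper's exact Eq.2.

  import torch.nn.functional as F

  def photometric_loss(target, reconstructed, alpha=0.85):
      """Illustrative SSIM + L1 photometric error between a view and its warped reconstruction."""
      l1 = (target - reconstructed).abs().mean(dim=1, keepdim=True)

      # Simplified SSIM computed with 3x3 average pooling
      mu_x = F.avg_pool2d(target, 3, 1, 1)
      mu_y = F.avg_pool2d(reconstructed, 3, 1, 1)
      sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
      sigma_y = F.avg_pool2d(reconstructed ** 2, 3, 1, 1) - mu_y ** 2
      sigma_xy = F.avg_pool2d(target * reconstructed, 3, 1, 1) - mu_x * mu_y
      c1, c2 = 0.01 ** 2, 0.03 ** 2
      ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
             ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
      ssim_term = ((1 - ssim) / 2).clamp(0, 1).mean(dim=1, keepdim=True)

      return (alpha * ssim_term + (1 - alpha) * l1).mean()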

Fig.1 details: 200 test images from the KITTI 2015 data were used. Fig.1 displays projections of the final-layer representations of the model's encoder for each of the stereo input frames. A is generated from the encoder of a pre-trained model [10]; B from the proposed StereoCRL encoder trained with Eq.1 and then finetuned with Eq.2; and C from the StereoCRL encoder trained on the KITTI raw split (42,382 stereo frames) via Eq.2 only.
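For reference, such a projection could be produced along the lines of the sketch below: global-average-pool the encoder's final feature maps for the left and right frames and run t-SNE on the pooled vectors. The pooling choice, perplexity and assumed encoder output shape are ours for illustration, not the paper's exact procedure.

  import numpy as np
  import torch
  from sklearn.manifold import TSNE

  @torch.no_grad()
  def tsne_of_encoder_features(encoder, left_images, right_images, perplexity=30):
      """Project final-layer encoder features of stereo pairs to 2D with t-SNE (sketch).

      left_images, right_images: (N, 3, H, W) tensors of the test frames.
      Returns (2N, 2) t-SNE points and a (2N,) view label (0 = left, 1 = right).
      """
      feats = []
      for imgs in (left_images, right_images):
          f = encoder(imgs)                    # assumed (N, C, h, w) final feature map
          feats.append(f.mean(dim=(2, 3)))     # global average pooling -> (N, C)
      feats = torch.cat(feats, dim=0).cpu().numpy()
      view = np.repeat([0, 1], len(left_images))
      points = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(feats)
      return points, view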




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Paper strengths: 1) interesting method; 2) the paper is well written; 3) the results are convincing; and 4) the paper addresses an important real-world problem.

    Paper weaknesses: 1) limited novelty: the method seems to be a combination/modification of multiple prior works; 2) is the main contribution of the approach the first stage, which uses contrastive representation learning? 3) what is the difference between the proposed decoder and the decoder in DispNet, given that the details of DispNet are missing? 4) it is unclear what image data, feature representation and model are used to obtain the t-SNE visualisations in Fig. 1; and 5) it is unclear how the encoder is trained when using only Lpe.

    Except for the t-SNE graph, all other questions were convincingly answered, so I recommend that the paper be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The manuscript presents a method for estimating depth from stereo cameras, using a contrastive representation learning approach to learn an embedding space where the left and right views are clustered in an appropriate manner. Overall, I found the layout of the methods a bit confusing, especially since it was mostly focused on intuition rather than the technical construction and details.

    While the authors justify the novel contributions (I disagree with R3 on this point), many things presented in the manuscript lack the details necessary to put specific components, such as Fig. 1, into context (as mentioned by R3). I also think it is important to add ablation studies to help justify the multiple changes made to the CRL network and the two-stage approach, which are not clearly justified at the moment. Finally, I don't see how Eq. 3 can possibly work, as the first term applies only to the encoder block and the second only to the decoder block; this was raised by R3 and the meta-reviewer but was not addressed by the authors in the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I believe the authors have addressed the major concerns of the reviewers with clarifications of the methods as well as addition of ablation studies. Given this I would lean towards accepting the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


