
Authors

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Tom Drummond, Zhiyong Wang, Zongyuan Ge

Abstract

The self-attention mechanism, successfully employed in the transformer architecture, has shown promise in many computer vision problems, including image recognition and object detection. Despite this surge, the use of transformers for the problem of stereo matching remains relatively unexplored. In this paper, we comprehensively investigate the use of transformers for stereo matching, especially for laparoscopic videos, and propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design. To be specific, we investigate several ways to introduce transformers to volumetric stereo matching pipelines by analyzing the loss landscapes of the designs and their in-domain/cross-domain accuracy. Our analysis suggests employing transformers for feature representation learning, while using CNNs for cost aggregation. Our extensive experiments on the SceneFlow, SCARED2019 and dVPN datasets demonstrate the superior performance of our HybridStereoNet.
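
For readers who want a concrete picture of the design, the sketch below illustrates the dataflow the abstract describes: transformer-based feature extraction, a concatenation cost volume, 3D-CNN cost aggregation, and soft-argmin disparity regression. It is a minimal PyTorch illustration with placeholder module sizes and names (e.g. HybridStereoSketch), not the authors' implementation; see the code repository below for that.

    # Minimal sketch of the hybrid idea: transformer features + CNN cost
    # aggregation. All sizes are illustrative, not the paper's architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridStereoSketch(nn.Module):
        def __init__(self, feat_dim=32, max_disp=24):
            super().__init__()
            self.max_disp = max_disp
            # Stand-in "FeatureNet": patch embedding + transformer encoder.
            self.embed = nn.Conv2d(3, feat_dim, kernel_size=4, stride=4)
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            # Stand-in "MatchNet": 3D CNN aggregating the 4D cost volume.
            self.agg = nn.Sequential(
                nn.Conv3d(2 * feat_dim, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(32, 1, 3, padding=1))

        def extract(self, img):
            f = self.embed(img)                    # B x C x H/4 x W/4
            b, c, h, w = f.shape
            tokens = self.encoder(f.flatten(2).transpose(1, 2))  # B x HW x C
            return tokens.transpose(1, 2).reshape(b, c, h, w)

        def forward(self, left, right):
            fl, fr = self.extract(left), self.extract(right)
            b, c, h, w = fl.shape
            # Concatenation-based cost volume over candidate disparities.
            vol = fl.new_zeros(b, 2 * c, self.max_disp, h, w)
            for d in range(self.max_disp):
                vol[:, :c, d, :, d:] = fl[:, :, :, d:]
                vol[:, c:, d, :, d:] = fr[:, :, :, :w - d]
            cost = self.agg(vol).squeeze(1)        # B x D x H x W
            prob = F.softmax(-cost, dim=1)         # soft-argmin regression
            disp = torch.arange(self.max_disp, device=prob.device,
                                dtype=prob.dtype).view(1, -1, 1, 1)
            return (prob * disp).sum(dim=1)        # disparity at 1/4 resolution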

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_44

SharedIt: https://rdcu.be/cVRXj

Link to the code repository

https://github.com/XuelianCheng/HybridStereoNet-main

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a deep-learning-based method to solve the stereo matching problem involving laparoscopic images. An existing framework (reference [4]) is modified to achieve superior performance for laparoscopic applications. The authors' main contributions are the following:

    1. Use of transformer networks in place of convolutional NNs in cost aggregation. The authors justify this architectural change in terms of loss landscape, learning trajectory, and accuracy.

    2. The proposed architecture is compared to the state-of-the-art using publicly available datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a deep-learning-based method to solve the stereo matching problem involving laparoscopic images. An existing framework (reference [4]) is modified to achieve superior performance for laparoscopic applications. The authors' main contributions are the following:

    1. Use of transformer networks in place of convolutional NNs in cost aggregation. The authors justify this architectural change in terms of loss landscape, learning trajectory, and accuracy.

    2. The proposed architecture is compared to the state-of-the-art using publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Nothing as far as MICCAI conference is concerned. For minor suggestions, see detailed comments.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The architecture is clearly described, and references are given where further details can be found. The framework used for implementation is given. The metrics used to compare different methods are adequately described. However, the paper does not contain details about the running time, memory footprint and the computational platform in which the implementation was tested. Moreover, no failure cases have been included and discussed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The paper is well-written: adequate background to the problem is given, with references to state-of-the-art methods in learning-based stereo matching. The rationale behind the use of transformer networks is given, and the proposed architecture is studied in terms of loss landscape, learning trajectory, and accuracy. Finally, the experiments are adequately described and the results are clearly presented.

    2. The proposed method is compared to the state-of-the-art using publicly available datasets. The mean absolute error has been used to compare performance. I would suggest reporting either the variance together with the mean or the RMS error; other descriptive statistics, such as the 95th percentile, would be useful to the serious reader (see the computation sketch after this list).

    3. Significant value can be added to the paper by including details on the running time and memory requirements of the proposed architecture.
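
    A minimal sketch of the suggested descriptive statistics (not from the paper): `pred` and `gt` are NumPy disparity arrays, and zero ground truth is assumed to mark invalid pixels, a common convention for semi-dense ground truth.

        # Descriptive statistics of the absolute disparity error.
        import numpy as np

        def disparity_error_stats(pred, gt, valid=None):
            if valid is None:
                valid = gt > 0  # assumption: 0 marks pixels without ground truth
            err = np.abs(pred[valid] - gt[valid])
            return {
                "MAE": err.mean(),                # mean absolute error (reported)
                "std": err.std(),                 # spread around the mean, as suggested
                "RMSE": np.sqrt((err ** 2).mean()),
                "p95": np.percentile(err, 95),    # 95th percentile, as suggested
            }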

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very well written. It has some novelty in the methods, and the authors provide adequate justification for the proposed architectural changes. The superiority of the method is shown both quantitatively and qualitatively using publicly available datasets. Moreover, the performance of the method is compared to the state-of-the-art demonstrating its competitiveness. The application novelty combined with thorough validation makes this paper acceptable for presentation in MICCAI 2022.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper targets the task of stereo matching between two rectified images. The authors try to replace CNNs with transformers for the learnable components of a cost-volume-based network architecture from the prior work named LEAStereo. Various visualizations of the loss landscape upon convergence have been produced. Experiments on the SceneFlow, SCARED2019, and dVPN datasets have been conducted, and the authors claim that the proposed model performs better than some of the prior works. They claim that by replacing the CNN with a transformer for feature extraction, the model achieves faster convergence, higher accuracy, and better generalization.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of visualizing the loss landscape to demonstrate the differences among various architectures is interesting, though the way the results are presented could be further improved.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited novelty in terms of method design. The authors simply try replacing the 2D and 3D CNNs with equivalent transformer architectures taken from prior works. No special adjustment has been attempted to better exploit the attention mechanism for the task of stereo matching.
    2. The loss landscape visualization is not convincing. The authors try to use Figures 4 and 5 to show that the proposed model can achieve a lower local minimum with a flatter region around the optimal point. However, it is hard to appreciate the claim based on these figures. The scales of the figures are not controlled, and it is also difficult to see the value of the optimal point. Additionally, many factors besides the network architecture can affect the loss landscape upon convergence, for example the choice of optimizer. There are works showing that transformers can form a steeper loss landscape around the local minimum but, with a better optimizer, can achieve a flatter neighboring region [1].
    3. The experiments are not thorough enough to validate the performance of the proposed method. The authors only use 10 frames in SCARED2019 to evaluate the performance. With such a small sample set, it is difficult to say if there is any statistical significance in the performance of various methods. The methods named “Supervised” are also not clearly cited, so it is difficult to tell which prior works the authors have compared to. LEAStereo is also not included in Table 3. The authors claim that the proposed method works better than LEAStereo in texture-less regions, but in Fig. 1 of the supplementary material I don’t see a significant difference between the methods. More learning-based stereo matching methods need to be evaluated on a larger dataset to validate the claims made in the manuscript.

    [1] Chen, Xiangning, Cho-Jui Hsieh, and Boqing Gong. “When Vision Transformers Outperform ResNets without Pre-Training or Strong Data Augmentations.” arXiv preprint arXiv:2106.01548 (2021). http://arxiv.org/abs/2106.01548
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Okay reproducibility because datasets are public and they will open-source the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In Fig. 1, the wrong disparity map is used; replace it with the one associated with the stereo input. Also, “Transformer” instead of “Tranformer” in Fig. 1.
    2. 4th line in Sec. 2, “natural the stereo matching task”.
    3. In Sec. 2.1, besides the number of layers, you will also need to maintain a relatively similar number of learnable parameters to make a fair comparison.
    4. 3rd line on page 4, “we”.
    5. In Fig. 4, it is difficult to see the value of the lowest point. Also, there need to be some quantitative measurements to justify your claim about the shape here. A larger set of hyperparameters should also be tried in order to remove the randomness in the loss landscape caused by a single set of settings. What do the axes stand for?
    6. In Fig. 5, the numbers are too small to see. Also, the scales of the axes are different across different methods. It is very difficult to obtain any useful information from these plots. What do the axes stand for?
    7. The DPI of Fig. 6 is too low. Also, this figure is pointless if you want to show an accuracy comparison, because the models have not converged yet.
    8. In Table 1, why not fine-tune the learning-based methods on this dataset and show those results as well? References for these comparison methods need to be added in Table 1.
    9. The comparison with STTR is not fair because this method tries to predict an occlusion mask and does not produce valid disparity values in those occluded regions, as can be seen in Fig. 7. You should also compare the performance on the non-occluded regions.
    10. In Table 3, the direction of the arrow for SSIM seems to be wrong and you did not explain what these arrows stand for.
    11. DSSR uses a similar architecture to STTR, and you should explain this in the paper.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Very limited novelty, unconvincing claims, and experimental results that are neither thorough nor well validated.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors addressed many of my original concerns with explanations and some additional evaluation results. I hope the authors will modify the content correspondingly based on their rebuttal to strengthen the paper, especially the additional results and the quality of the figures.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a stereo matching architecture consisting of a feature extraction network using a transformer and a matching network using a CNN. It performs extensive experiments on ablations, learning behaviours, and comparisons to other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Extensive experiments;
    • Combining feature extraction using a transformer and matching using a CNN.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Maybe I missed it somewhere, but I was wondering how fast the inference is, as this will be quite important for use in practice.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It looks reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I just have a question other than the one above. In Fig 6, why are the errors lower on SCARED2019 than on the in-domain data?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As aforementioned, extensive experiments on various aspects were performed, which is very helpful for the readers. Also, the results seem promising.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Introduction: This paper investigates a deep-learning-based method to solve the stereo matching problem involving laparoscopic images. Specifically, the authors propose the use of the transformer for the problem of stereo matching, together with a deep stereo matching framework (HybridStereoNet).

    Strengths:
    • The paper has two contrasting reviews. On one hand, R1 has highlighted the novelty of the approach of replacing the CNN with the transformer for learning.
    • R1 has also highlighted the validation of the proposed approach against other SOTA approaches in the literature.
    • R2 seems to have a positive view of the loss landscape visualization.

    Weaknesses:
    • R2 has a major concern with the novelty of the paper. Specifically, the replacement of the CNN with transformers seems to be incremental in the reviewer’s opinion.
    • R2 also has concerns about the validation of the proposed approach and asks for a more exhaustive comparison with the SOTA algorithms.

    Points to be addressed by authors:
    • I would encourage the authors to rebut the suggestions/criticisms raised by R2.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the AC and reviewers for their valuable feedback, e.g., the positive view of the loss landscape visualization (AC&R2), thorough experiments (AC&R1&R3), and promising results (R3).

Q1. Novelty of the paper (AC&R2)
Unlike natural stereo images, laparoscopic images have large textureless regions. Adding to this the difficulties associated with limited training data, there is very little evidence on how to employ the transformer structure for laparoscopic stereo matching.

We thoroughly study transformer-based designs (please see Fig. 2 and Fig. 3) for the problem at hand. We also study hybrid structures and justify our choices (e.g., the use of a CNN for MatchNet). This goes beyond a simple adaptation and requires effort in model design, along with a battery of analyses, as summarized by R1: loss landscape visualization, learning trajectory, and accuracy comparison.

Based on these analyses, we further propose a new hybrid network that combines the advantages of transformers and CNNs and achieves superior performance on three datasets.

Q2. Comparison with SOTA algorithms (AC&R2)
In Table 1, we evaluate models on 5907 stereo pairs, rather than just 10 frames. As mentioned in Section 3.3, all the “Supervised” methods are taken from the official paper [1]. Per your comment, we will also add citation [1] to each method in Table 1 to avoid any confusion. Moreover, the compared methods exploit cutting-edge stereo matching architectures, e.g., PSMNet, GWC-Net, HSM, and DeepPruner. Note that their results include highly engineered solutions, e.g., interpolation to remove noisy points. In contrast, our solution competes with them comfortably without hefty engineering. Per your comment, we also compared with STTR on non-occluded regions: STTR (3px Error 3.69, EPE 1.57) vs. HybridStereoNet (3px Error 2.73, EPE 1.34). Our method outperforms STTR by a clear margin.
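
For reference, the two metrics quoted above are standard and typically computed along the following lines. This is a generic sketch, not the authors' evaluation code; `noc_mask` is a hypothetical boolean mask of non-occluded pixels. (Some benchmarks additionally require the error to exceed 5% of the true disparity before counting a pixel as bad.)

    # End-point error (EPE) and 3-pixel error rate on non-occluded pixels.
    import torch

    def epe_and_3px(pred, gt, noc_mask):
        err = (pred - gt).abs()[noc_mask]
        epe = err.mean()                           # mean absolute disparity error
        bad3 = (err > 3.0).float().mean() * 100.0  # % of pixels off by > 3 px
        return epe.item(), bad3.item()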

Q3. Loss landscape visualization (R2)
The loss landscape is defined purely by the structure and the data and has nothing to do with the optimizer. We agree with you, however, that a different optimizer could lead to a different minimum with different behavior, i.e., a different trajectory. Please note that Fig. 4 shows that the three transformer variants already converge to a flatter region than the pure CNN structure (LEAStereo). A more thorough investigation, including randomness in the loss landscape, requires dedicated work and goes beyond the scope of our paper. We appreciate your comment and will reflect the above discussion in a revised paper.

We will add the value of the optimal point to Fig 4. We will tighten up our language regarding the claim, as it is an observation based on our experiments.

Q4. Memory footprint (R1&R2&R3)
We test the variants on a Quadro GV100 with an input size of 504×840. We provide details below and will include them in a revised draft.

             LEAStereo   Type II   Type III   HybridStereoNet
Params [M]   1.81        9.54      9.62       1.89
Runtime [s]  0.30        0.48      0.50       0.32
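
For context, figures like these are commonly obtained by counting parameters and timing the forward pass with CUDA synchronization, as in the sketch below (illustrative, not the authors' benchmarking script), here with the 504×840 input mentioned above:

    # Parameter count (in millions) and average per-frame runtime (seconds)
    # for a two-input stereo model on a CUDA device.
    import time
    import torch

    @torch.no_grad()
    def params_and_runtime(model, h=504, w=840, device="cuda",
                           warmup=5, iters=20):
        params_m = sum(p.numel() for p in model.parameters()) / 1e6
        model = model.to(device).eval()
        left = torch.randn(1, 3, h, w, device=device)
        right = torch.randn(1, 3, h, w, device=device)
        for _ in range(warmup):                    # warm up CUDA kernels
            model(left, right)
        torch.cuda.synchronize()                   # wait for pending GPU work
        start = time.time()
        for _ in range(iters):
            model(left, right)
        torch.cuda.synchronize()
        return params_m, (time.time() - start) / iters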

Q5. LEAStereo performance on dVPN (R2)
Results for LEAStereo: mean SSIM 53.93, mean PSNR 14.84. Our HybridStereoNet achieves the highest scores on this dataset.
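
Since dVPN provides no ground-truth disparity, SSIM/PSNR scores such as these are presumably reconstruction-based: one view is synthesized from the other via the predicted disparity and compared to the real image. A sketch of the scoring step using scikit-image (the warping itself and the percentage scaling of SSIM are assumptions on our part):

    # Reconstruction quality: SSIM and PSNR between a disparity-warped view
    # and the corresponding real view (float images in [0, 1], H x W x 3).
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def reconstruction_scores(recon, target):
        psnr = peak_signal_noise_ratio(target, recon, data_range=1.0)
        ssim = structural_similarity(target, recon, channel_axis=-1,
                                     data_range=1.0)
        return ssim * 100.0, psnr   # SSIM scaled to a percentage, as in Q5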

Q6. Figures (R2)
We will fix these per your suggestions. In Fig. 4, the two axes are two random directions with filter-wise normalization. Please see the details of Fig. 5 in our paper. For Fig. 6, we trained the variant models for longer and observed similar trends. We will show the EPE on the disparity (LEAStereo 0.7063 vs. HybridStereoNet 0.6754) for a better demonstration.
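
For readers unfamiliar with the phrase, “two random directions with filter-wise normalization” refers to the visualization of Li et al., “Visualizing the Loss Landscape of Neural Nets” (NeurIPS 2018): two random parameter-space directions are rescaled filter-by-filter to the norms of the trained filters, and the loss is evaluated on a 2D grid around the trained weights. A minimal sketch, where `loss_fn(model)` is a hypothetical closure that evaluates the loss on a fixed batch:

    # 2D loss surface along two filter-normalized random directions.
    import torch

    def filter_normalized_direction(model):
        direction = []
        for p in model.parameters():
            d = torch.randn_like(p)
            if p.dim() > 1:
                # Rescale each filter of d to the norm of the trained filter.
                for df, pf in zip(d, p):
                    df.mul_(pf.norm() / (df.norm() + 1e-10))
            else:
                d.zero_()  # skip 1D params (biases, norms), as in Li et al.
            direction.append(d)
        return direction

    @torch.no_grad()
    def loss_surface(model, loss_fn, steps=11, span=1.0):
        base = [p.detach().clone() for p in model.parameters()]
        dx = filter_normalized_direction(model)
        dy = filter_normalized_direction(model)
        alphas = torch.linspace(-span, span, steps)
        surface = torch.zeros(steps, steps)
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, p0, x, y in zip(model.parameters(), base, dx, dy):
                    p.copy_(p0 + a * x + b * y)    # perturb the trained weights
                surface[i, j] = loss_fn(model)
        for p, p0 in zip(model.parameters(), base):
            p.copy_(p0)                            # restore trained weights
        return surface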

Q7. Failure cases (R1)
Thanks for the suggestion; we will add a discussion section. The algorithm may fail to estimate very close objects whose disparity exceeds the pre-specified range.

Q8. Why lower errors on SCARED2019? (R3)
We conjecture that this is because SceneFlow has dense ground-truth disparity maps, while SCARED2019 comes with semi-dense ones.

Q9. Other issues (R1&R2)
We will revise the typos carefully and add the RMS error and 95th-percentile metrics in a revised version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The original reviews of the paper were generally positive for the paper (except R2). R2 had a number of constructive criticisms for the paper in their original review, which has been addressed by the authors in their rebuttal to the satisfaction of R2. I recommend acceptance of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received mixed reviews. R2 raised important questions regarding novelty and experiments. The rebuttal clarified that there was no misunderstanding. The authors provided a quick additional experimental result, which shouldn’t be taken into account, as per the MICCAI evaluation rules. The AC recommends rejection.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I believe the authors have done a good job of addressing the concerns of R2, and I agree that the experiments are well done in terms of comparison with the state of the art and the use of publicly available datasets. Given this, I would lean towards accepting the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


