
Authors

Rema Daher, O. León Barbed, Ana C. Murillo, Francisco Vasconcelos, Danail Stoyanov

Abstract

Feature detection and matching is a computer vision problem that underpins different computer assisted techniques in endoscopy, including anatomy and lesion recognition, camera motion estimation, and 3D reconstruction. This problem is made extremely challenging due to the abundant presence of specular reflections. Most of the solutions proposed in the literature are based on filtering or masking out these regions as an additional processing step. There has been little investigation into explicitly learning robustness to such artefacts with single-step end-to-end training. In this paper, we propose an augmentation technique (CycleSTTN) that adds temporally consistent and realistic specularities to endoscopic videos. Such videos can act as ground truth data with known texture occluded behind the added specularities. We demonstrate that our image generation technique produces better results than a standard CycleGAN model. Additionally, we leverage this data augmentation to re-train a deep-learning based feature extractor (SuperPoint) and show that it improves. CycleSTTN code is made available at https://github.com/RemaDaher/CycleSTTN.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_54

SharedIt: https://rdcu.be/dnww7

Link to the code repository

https://github.com/RemaDaher/CycleSTTN

Link to the dataset(s)

https://www.synapse.org/#!Synapse:syn26707219/wiki/615178

https://doi.org/10.17605/OSF.IO/MH9SJ


Reviews

Review #1

  • Please describe the contribution of the paper

    Specularity can be a hurdle in medical image understanding. This paper proposes a CycleGAN-based augmentation technique that adds temporally consistent and realistic specularities to endoscopic videos. It is shown that, compared with an image-based CycleGAN, the temporal information exploited by the video-based CycleGAN can help other tasks, such as feature point extraction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The addressed task of specularity synthesis and removal is important in medical imaging, such as endoscopy.

    2. The experiment results show some improvement compared with existing methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall, technical contributions are very limited, and the real implications of the improvements reported are not clear.

    1. The video extension of CycleGAN is claimed as the main contribution. Yet, this has been extensively studied in computer vision research, e.g., ‘Mocycle-GAN: Unpaired Video-to-Video Translation’.

    2. The technical components in Fig. 1 are also widely known; the method is basically an integration of existing techniques, trained on a task-specific dataset.

    3. More fundamentally, specular regions are almost always saturated and do not contain any information. The reviewer cannot see why the proposed method should work better, in terms of accuracy, than simply removing or inpainting these regions. Admittedly, one can argue that such pre-processing takes more time at inference than a model trained directly with specularity augmentation.

    4. Some results in Table 3 actually undermine the claims of the paper. Also, a direct comparison with other technical threads (for example, specularity removal as pre-processing rather than specularity synthesis for end-to-end training) is not given.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It should be possible to reproduce the work with some effort.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The claims should be reconsidered in the context of broader computer vision research. Other video extensions of CycleGAN should be reviewed and compared in the experiment section.

    2. The performance of a standard point detector on videos after specularity removal/inpainting should be directly compared.

    3. The real benefit might lie in the nearly saturated regions around specularities, and the authors should carefully analyze this point.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical contributions are limited, and the real benefits are unclear. A detailed analysis of when and why the proposed method works better than existing pipelines is missing, especially considering that saturation almost always occurs in specular regions.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Other reviewers like this paper, and it seems that this technique can be useful in practice. Although I cannot completely agree with the claims in the rebuttal, I will not insist on rejection. Please reshape the contribution statement in the final paper and highlight the adaptations specific to this task.



Review #2

  • Please describe the contribution of the paper

    This work proposes the CycleSTTN model, which can add temporally consistent and realistic specularities to endoscopic videos. The proposed augmentation method outperformed the standard CycleGAN model and can improve the feature extraction performance of the SuperPoint model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The work is well motivated. Specular reflections cause problems in surgical vision. The authors propose to solve the problem from the data augmentation aspect.
    2. The model makes sense by combining the STTN_R and STTN_A submodels, which are used to remove and add specular reflections, respectively, in a CycleGAN-like framework.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Fig. 1 is not well designed. Some descriptions of the initialization, losses, etc., are not very clear. The input and output maps in Fig. 1 are not linked to the specific symbols in the text.
    2. The metrics in the experiments are not clearly defined; for example, the rotation error metric and its unit are not given in Sec. 4.3.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The video and many visualization results are provided. The experiment is carried out on a public dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The pseudo-evaluation experiments are not very meaningful. I suggest conducting a surgical image/video segmentation experiment to demonstrate the benefits of the method on different surgical vision tasks.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work is well motivated. The problem of specular reflection is addressed in a reasonable way. More experiments on different kinds of surgical vision tasks will make this paper more valuable and convincing.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Although the authors did not take up my suggestion of adding experiments for better demonstration, the proposed method is inspiring and extensible to other surgical vision problems (e.g., smoke, exudation). That said, the authors should not ignore one point: specular highlights could also be generated in images with a handcrafted algorithm (perhaps much more easily?).



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors developed a deep learning framework to model the specular regions in endoscopy videos. The CycleGAN model was expanded with the temporal-awareness structure STTN. The proposed CycleSTTN is able to remove the specular regions from input frames or add more specular spots. The authors showed that using this model as a data augmentation technique can help downstream tasks such as camera pose estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The motivation for building the specular removal/addition model using CycleGAN is sound. Moreover, considering the temporal relation between frames and using a temporally aware model is the right direction.
    • It is a meaningful direction to add specular augmentation in downstream tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses: My biggest concern is the notion of adding specularity to real endoscopy images. Specularity is caused by water reflecting light, and it encodes the shape of the surface and the point light/camera direction. For a real video V_A, the original specular regions carry this information, while adding more specularity is like introducing artifacts to the image, which I suspect would not benefit the downstream application. As shown in Fig. 2, the added specular regions on V_A using STTN_A1 look unrealistic. Also, the best camera pose result in Table 3 comes from row #2, which simply removes the specular regions; adding specularity degrades the performance.

    Nit: formatting error in Table 2.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think the results could be reproduced if proper code were provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weaknesses section.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The notion and experiments are still not satisfactory to me.



Review #4

  • Please describe the contribution of the paper

    The authors propose a method for augmenting endoscopic frames with artefacts such as specular reflections using an extension of the traditional CycleGAN model that takes temporal information into account. The purpose of such augmentation is to explicitly introduce robustness into the training process of models for tasks such as camera motion estimation and 3D reconstruction. The authors leverage the generated data to re-train a deep-learning-based feature detection method (SuperPoint) and show that it improves feature extraction over a model trained with CycleGAN-based implementations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    a) This paper presents a well-motivated problem and describes in some detail the limitations of current data augmentation systems used for endoscopic imaging, especially targeted for generating realistic images with artifacts (specular reflections herein)

    b) The authors substantiate the need for such an approach and present a data-centric framework based on extension to the CycleGAN architecture to tackle this problem. The state of the art has been thoroughly investigated and compared, highlighting areas of opportunity

    c) The method proposed by the authors is one of the first such approaches for generating specular reflections in highly dynamic scenes found in endoscopy currently in the literature.

    d) The architecture is a lightweight model, although the authors do not discuss inference times or other related metrics.

    e) Extensive experiments using the proposed model and dataset, as well as a case study in feature matching demonstrate the feasibility of the proposed approach for endoscopic imaging applications.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is overall well written and well organized; however, certain things can be improved (more comments in point 9).

    Some of the figures are not very clear and would benefit from a more detailed caption, instead of describing the details in the main text. The same applies to some of the tables, which are poorly organized and confusing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is based on publicly available datasets and should be easily reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    a) The authors provide a schematic representation and an overview of the proposed solution, although this could be improved in my opinion: Figure 1 is well thought out and the different blocks are well described, but a longer caption explaining the figure could make the contribution easier to understand.

    b) Figure 2 is very hard to see. The supplementary materials and the video really help in the review process, but for the final paper a more carefully designed figure could do justice to the authors’ work.

    c) For me, the tables are very difficult to decode. Table 2 could use some other color encoding (two shades of grey, for instance) to highlight which metrics refer to which case (temporal vs. non-temporal). The same applies to Table 3: in this case, improving the way the results are summarized could really help, and including more discussion in the caption could make it easier to understand. In Table 3 it would also be nice to spell out the contractions and acronyms for the general reader.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is a solid contribution and could be a good paper for MICCAI. However, the writing and organization could be improved to simplify certain aspects, enhance the readability of the paper, and make it easier to understand the contributions made by the authors.

    I especially think the captions of tables and figures could be improved.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work presents an approach to augment endoscopic videos with simulated specularities using a temporally consistent conditional GAN (called CycleSTTN). The main motivation is to allow ML models that take endoscopic videos as input to be trained with specularity-augmented videos, thus making them more robust to this phenomenon.

    The reviews were mixed. R1 and R3 recommended reject and weak-reject, but R2 and R4 recommended weak-accept and accept. Due to the inconsistent verdicts, I recommend that the authors are invited to carefully consider the reviews, and provide a rebuttal. In particular, the authors should carefully consider all the reviewer comments, and address the main negative comments in the rebuttal, which can be broadly categorised as follows:

    • Lack of technical novelty, unsubstantiated innovation claims, and lack of comparison with prior video-consistent conditional GANs (R1 - main weakness #1 and #2).

    • Lack of comparison / unclear justification against specularity post-processing (R1).

    • Unconvincing experimentation (R1, R2, R3), and non-informative experimentation (R2)

    • Lack of simulated data physical realism (R3), inadequate definition of performance metrics (R2), and lack of clarity of figures and tables (R4)

    Recall that according to the MICCAI rules, new experiments should not be added or described in the rebuttal.




Author Feedback

A1: Novelty w.r.t. video-2-video translation (R1)

Our problem formulation has key differences from conventional unpaired video-to-video translation (beyond the training data). While a conventional CycleGAN’s two cycle generators perform 1-to-1 inverse operations, our problem is different:

  • When removing specularities, STTN_R targets specific regions (detected highlights), while STTN_A is not explicitly conditioned to add specularities in specific regions.
  • Our networks are trained such that: 1) applying STTN_R to an image without specularities does not alter it; 2) applying STTN_A to an image that already has specularities adds more specularities. This asymmetry is unconventional in the context of unpaired video-to-video translation and requires key modifications to conventional CycleGAN-style training:
  • The cycle identity loss for STTN_A is different from STTN_R (eq 4).
  • The mask input for STTN_R is generated by a separate specularity detection algorithm, forcing the network to focus on specific regions. However, the mask input for STTN_A is left blank to let the network learn which locations to add specularities. This should be better highlighted in the manuscript.
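
For illustration, a minimal PyTorch-style sketch of this asymmetric conditioning is given below. The module, tensor shapes, and loss terms are simplified placeholders and do not reproduce the actual STTN architecture or the paper’s Eq. 4; they only show how STTN_R receives a detected-highlight mask while STTN_A receives a blank one, and how the identity-style constraints differ between the two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for an STTN generator; the real models are
# spatio-temporal transformers that operate on sequences of frames.
class DummySTTN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 3, kernel_size=3, padding=1)  # RGB + mask channel

    def forward(self, frames, masks):
        # frames: (B, 3, H, W), masks: (B, 1, H, W)
        return self.net(torch.cat([frames, masks], dim=1))

sttn_r = DummySTTN()  # removes specularities inside the masked regions
sttn_a = DummySTTN()  # adds specularities; its mask input is left blank

frames = torch.rand(2, 3, 64, 64)                     # frames with real specularities
spec_mask = (torch.rand(2, 1, 64, 64) > 0.9).float()  # from a separate specularity detector
blank_mask = torch.zeros_like(spec_mask)

# Asymmetric conditioning: STTN_R is told where the highlights are,
# STTN_A must learn on its own where to place new ones.
clean = sttn_r(frames, spec_mask)
re_spec = sttn_a(clean, blank_mask)

# Asymmetric identity-style terms (illustrative only):
# 1) STTN_R applied to an already specular-free frame should leave it unchanged.
identity_r = F.l1_loss(sttn_r(clean.detach(), blank_mask), clean.detach())
# 2) STTN_A applied to a frame that already contains specularities should still
#    add more, so its identity constraint is defined differently in the paper.
cycle = F.l1_loss(re_spec, frames)
```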

A2: Highlight removal pre-processing (R1)

Removing specularities at inference time with pre-processing would be slower (2 networks instead of 1) and also, for good results, intrinsically offline, since STTN_R typically takes as input a sequence of several frames (both future and past). This would be a significant limitation compared to our augmented end-to-end SuperPoint, which receives only 2 input frames. Furthermore, [3] also shows the advantages of end-to-end SuperPoint vs. pre-processing, even without our augmentation scheme.

A3: Experimental result concerns (R1, R3)

All augmentation combinations consistently produce better results than the baseline (no augmentation). This suggests that our proposed augmentation, as a general practice, is useful. We note that even removing specularities (as augmentation) is different from works like [6], where this is used as offline pre-processing (see A2). Looking at our top 3 augmentations, we see small differences in performance (rotation medians: 10.4, 11.0, 12.2) when compared to no augmentation (rotation median: 20.1). We should be conservative with small differences, since the ground truth is an estimate from SfM (COLMAP) that, while reliable, is expected to have small errors. We still believe that the general trend is interesting.

A4: Physics-based realism (R3)

We agree that we are not explicitly enforcing physical realism, but:

  • As an augmentation for SuperPoint, this is still “more realistic” than the baseline (a global homography warping, which moves specularities in the same way as the scene). Empirically, this is supported by the results, which improve with augmentation.
  • The pseudo-evaluation (see A5) demonstrates that, to some degree, our specularities are generated in more reasonable locations after training.

A5: Pseudo evaluation (R2)

We first remove specularities from an image (STTN_R), then add them again (STTN_A), and finally compare the result against the original image (with real specularities). Better scores here mean that the generated specularities are closer to the real ones in both appearance and location.
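
As a simplified picture of this procedure, the sketch below runs a remove-then-re-add round trip and scores it with PSNR. The function names are placeholders standing in for STTN_R and STTN_A, and PSNR is just one example similarity metric, not necessarily the exact set of metrics used in the paper.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images of the same shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(max_val ** 2 / mse)

def pseudo_evaluate(frames, remove_specularities, add_specularities):
    """Remove specularities, re-add them, and compare against the real originals."""
    scores = []
    for frame in frames:
        cleaned = remove_specularities(frame)      # stands in for STTN_R
        regenerated = add_specularities(cleaned)   # stands in for STTN_A
        scores.append(psnr(frame, regenerated))    # compare to the original with real specularities
    return float(np.mean(scores))
```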

A6: Specular highlight saturation (R1)

While many specular regions are saturated, this is not always the case (see the upper-left region of the images in Fig. 2). SuperPoint benefits from both the saturated regions and their surroundings, since the augmented images as a whole are used in end-to-end training.

A7: Metrics definitions (R2)

All metrics are the same as in [3] and [6] for compatibility. The rotation error is the geodesic angle (in degrees) between the SuperPoint-based estimate obtained via RANSAC and the ground truth from COLMAP. Metric definitions will be added to the paper.
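
For readers who want the definition spelled out, a small sketch of the geodesic rotation error is shown below, assuming the estimated and ground-truth rotations are available as 3x3 matrices (e.g., recovered from the RANSAC-estimated essential matrix and from COLMAP, respectively).

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_est @ R_gt.T
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```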

A8: More experiments (R2)

Testing the augmentation on other downstream tasks such as segmentation would be interesting, but is more appropriate for a journal extension due to length limitations.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work still has several major problems in my opinion, as spotted by R1 and R3, namely unconvincing results and a lack of appropriate baseline comparisons.

    The most important quantitative results by far are in Table 3. This table quantifies the effect of the proposed data augmentation method on an important downstream application (relative pose estimation using point correspondences). This was also the only presented application.

    What is most striking is that the best results are achieved when specularities are automatically removed from the SuperPoint training images. Importantly, the differences in Table 3, row 2 (temporal vs. non-temporal) are tiny, so the temporal extension, proposed as a major contribution, does not bring a meaningful improvement. The authors state that the lack of improvement from specularity addition in training may be due to a domain shift in the training data (fewer specularities in the test dataset). But that is speculation, and the experimental results needed to validate the method do not do so.

    The results therefore do not provide sufficient methodological support. There is also a reluctance to compare with specular removal pre-processing at both test and training time, e.g. using [10], due to the fact that two networks must be run at inference time. Sure, that is not desirable, but specular removal can be performed very quickly on modern GPUs (hundreds of times per second), so the overhead is minimal. Consequently, I do not accept the argument for not comparing against this clearly relevant baseline. The decision not to compare with [3] is also not well motivated: “While [3] is originally trained with a specularity loss term that encourages the network to ignore specularity regions, we do not include this term in our training. We want our models to be robust to specularities, rather than just avoiding them.” This is not a good justification for skipping the quantitative comparison.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a novel data augmentation strategy to remove and add specularities in endoscopic videos to improve robustness against those artifacts. The reviewers expressed concern about the experimentation and comparison to baselines. The rebuttal did not convince 3 out of 4 reviewers, although R#1 changed their recommendation to weak accept (“Although I do not completely agree with the claims in the rebuttal…”). My main criticism is the lack of baseline comparisons. The authors claim that their data augmentation strategy makes the method more robust to artifacts; however, other strategies, such as artifact removal, are not compared. In the rebuttal the authors argue that inference would be much slower, but this is not an argument for not comparing both approaches. Despite these concerns about the baseline comparisons, the method seems interesting and useful, and I recommend acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Upon careful evaluation of the authors’ rebuttal, the consensus among the reviewers is that this paper slightly surpasses the acceptance threshold of MICCAI. The authors have adequately addressed major concerns regarding the technical novelty and have provided clarifications regarding the experimental design, leading to an improved score by R1, raising it to a weak acceptance. Despite this meta reviewer’s recommendation for paper acceptance, it is strongly advised that the authors carefully incorporate all questions and concerns raised by the reviewers into a revised version of the paper.


