
Authors

Yannik Frisch, Moritz Fuchs, Antoine Sanner, Felix Anton Ucar, Marius Frenzel, Joana Wasielica-Poslednik, Adrian Gericke, Felix Mathias Wagner, Thomas Dratsch, Anirban Mukhopadhyay

Abstract

Cataract surgery is a frequently performed procedure that demands automation and advanced assistance systems. However, gathering and annotating data for training such systems is resource intensive. The publicly available data also comprises severe imbalances inherent to the surgical process. Motivated by this, we analyse cataract surgery video data for the worst-performing phases of a pre-trained downstream tool classifier. The analysis demonstrates that imbalances deteriorate the classifier’s performance on underrepresented cases. To address this challenge, we utilise a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG). Our model can synthesise diverse, high-quality examples based on complex multi-class multi-label conditions, such as surgical phases and combinations of surgical tools. We affirm that the synthesised samples display tools that the classifier recognises. These samples are hard to differentiate from real images, even for clinical experts with more than five years of experience. Further, our synthetically extended data can improve the data sparsity problem for the downstream task of tool classification. The evaluations demonstrate that the model can generate valuable unseen examples, allowing the tool classifier to improve by up to 10% for rare cases. Overall, our approach can facilitate the development of automated assistance systems for cataract surgery by providing a reliable source of realistic synthetic data, which we make available for everyone.
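For orientation, the following is a minimal sketch of the two techniques the abstract combines: deterministic DDIM sampling steered by classifier-free guidance. It is not the authors’ released code (see the repository linked below); `eps_model`, the noise schedule `alphas_cumprod`, and the condition encoding are illustrative assumptions.

```python
# Minimal sketch of DDIM sampling with classifier-free guidance (CFG).
# Assumptions: eps_model(x, t, cond) predicts noise, cond=None selects the
# unconditional (label-dropped) branch, alphas_cumprod is a 1-D tensor.
import torch

@torch.no_grad()
def ddim_sample_cfg(eps_model, cond, shape, alphas_cumprod, steps=50, guidance_scale=3.0):
    x = torch.randn(shape)  # start from pure Gaussian noise x_T
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps_c = eps_model(x, t, cond)   # conditional noise prediction
        eps_u = eps_model(x, t, None)   # unconditional noise prediction
        eps = eps_u + guidance_scale * (eps_c - eps_u)      # CFG combination
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step (eta = 0)
    return x
```

Here `cond` would encode the multi-class, multi-label condition (surgical phase plus tool combination); raising `guidance_scale` trades sample diversity for closer adherence to that condition.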

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_34

SharedIt: https://rdcu.be/dnwPd

Link to the code repository

https://github.com/MECLabTUDA/CataSynth

Link to the dataset(s)

https://cataracts.grand-challenge.org/


Reviews

Review #1

  • Please describe the contribution of the paper

This paper describes networks to classify instruments in cataract surgery video images. The method uses diffusion networks to generate images and classifier-free guidance to do so efficiently.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Addressed a good question; use of synthetic data is interesting. Evaluated user perception of synthetic data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The write-up is unclear, likely because too much information was included in too little space, which obscured the experimental details needed to evaluate the validity of the findings.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The data and hyperparameters will reportedly be accessible to all.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The interpretation of instruments is misleading in Figure 2. For example, a capsulorhexis forceps is not used for suturing as assumed in Figure 2.

    2. Metrics for evaluation are not clearly explained, e.g., KID and FID. Why is inception score a measure of “tool realism”?

    3. What are mode collapses?

4. What are the numbers shown after the “+/-” in Table 1? How are they computed?

    5. Unclear qualifying terms such as “superior”. What criteria make this qualifier appropriate to use here?

    6. How were the images for the user study selected?

    7. How was the estimate 61% computed? How was the “on average” estimate computed?

8. Matthews correlation coefficient - is it the appropriate measure to show in Table 2? I would have expected a more classical measure of inter-rater reliability, e.g., Fleiss kappa. The false classification rate in Table 2 - is it false positives or false negatives?

9. Table 3 - what do “Original” and “Extended” mean for “Data”?

    10. Figure 5 is unclear. What metrics were used for “phase-wise performance”? What was the experiment?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting work, but experiment details are not sufficiently clear to assess validity of findings. Metrics for evaluation are unclear or inappropriate.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

The authors provided a carefully considered rebuttal. The claim of being the first is not of much relevance in my opinion. It is not clear why the FP and FN measure the same quantity, given a conventional 2×2 table. The rest of the comments are clarified, except some, such as generalizability, that cannot be adequately addressed in one conference paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a method to generate synthetic data for cataract surgery using a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG). The generated data can help improve the performance of a tool classifier by addressing the data sparsity problem and imbalances in the original data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is the first to combine CFG with diffusion models to generate realistic cataract surgery data with a complex label structure. The authors of this paper analyze cataract video data to identify the phases in which a pre-trained tool usage classifier performs poorly. They use a conditional denoising diffusion model to create new samples for these phases, which are recognized by the tool classifier and are difficult to distinguish from real images, even by clinicians with over five years of experience.

    The paper also demonstrates how the artificially extended data can address the data sparsity problem in downstream tasks. Overall, the paper’s evaluations show that their model can generate valuable examples that help bridge the gap between research and clinical applications. The strength of this paper lies in its innovative approach and its potential to improve clinical practice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Generalization capabilities: The proposed approach may generate unreasonable samples, such as completely wrong tools for a given phase. This limitation suggests that the model may not generalize well to unseen data or scenarios.

    2) Quality improvement: While tool realism is significantly better for the proposed method, the CF1 and CAS scores indicate that it can be further improved. The paper acknowledges that there is room for improvement in terms of image quality and tool preservation.

    3) Lack of data: The underlying class imbalances and lack of available data are even more severe for the downstream task of anatomy and tool segmentation, suggesting that more data may be needed to improve the performance of the proposed model.

    4) Limited scope: The proposed method focuses only on generating high-quality cataract surgery images; the paper does not address other challenges associated with computer-assisted cataract surgery, such as safety and efficacy.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors elaborate well on the experiments performed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

If the manuscript could provide higher-resolution images, readers would be able to inspect the generated images more clearly.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My decision has been made based on the above comments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

In the rebuttal, the authors address my earlier review comments well.



Review #3

  • Please describe the contribution of the paper

This paper tackles the problem of imbalanced datasets in the context of surgical videos. The authors propose a solution using a conditional generative model based on Denoising Diffusion Implicit Models and Classifier-Free Guidance to synthesize diverse, high-quality examples of surgical phases and combinations of surgical tools. The synthetically extended data mitigates data sparsity and improves the tool classifier’s performance on rare cases by up to 10% on the CATARACTS dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • To the best of our knowledge, this is the first attempt to generate decent synthetic images using diffusion models in a surgical video analysis context. This method is one of the solutions for handling rare cases in surgical video analysis.

    • The authors present a detailed analysis of the quality of the synthetic images, which is decent. This demonstrates the ability of their diffusion model to generate good-looking images under predefined conditions (i.e., specifying the tools and the phase).

    • A decent improvement is presented on the downstream task applied to the CATARACTS dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • No analysis was done on the correctness of the generated samples. A table showing the number of unreasonable samples, e.g., completely wrong tools for a given phase, is recommended.

    • An analysis of the possible reasons for the negative ΔF1 values of ImplantEjection and Positioning is missing.

    • The analysis was done on only one dataset. To validate its performance, the method should be tested on another dataset (e.g., Cholec80).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Nothing to note on this section

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    • It would be interesting to note the image size used when presenting the images to clinicians (in Section 3.3), since this strongly affects how well details in the images can be seen.

    • In Section 3.4, the authors present the performance gain for the extended version of the dataset. It would be interesting to see the impact of the other synthetic-image generation methods on the downstream task, even though the quality of their generated images is not the best.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

To the best of our knowledge, this is the first attempt to use diffusion models to generate synthetic images in the context of surgical video analysis. There is nothing new in terms of method novelty; however, the quality of the generated images is very good, and the analysis of these images shows promising solutions for hard problems in this domain while improving performance on a downstream task.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

The authors replied with clear answers to some of the inquiries raised by the reviewers. To the best of our knowledge, this is the first decent attempt to augment a training dataset with synthetic images generated using CF+DM in the context of surgical workflow analysis. It shows promising results and a potential solution for a huge problem in the medical field (i.e., class imbalance).




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper proposes a diffusion-model-based method to synthesize cataract surgery frames. Through tackling the class imbalance problem, the phase recognition accuracy is improved. The reviewers have raised several concerns about unclear experimental details and evaluation metrics, limited methodological novelty, and the need for a more detailed analysis of the results. I invite the authors to submit a rebuttal focusing on addressing the reviewers’ comments.




Author Feedback

We thank all reviewers and the meta-reviewer for their kind reviews, especially for appreciating the quality of our generated samples and the experiments we conducted to improve the downstream task model. We are grateful for their constructive suggestions on further improving our manuscript. In the following, we summarize and respond to the reviewers’ comments.

Novelty@R3,MR

Our manuscript is the first to combine DDIM and CFG for conditionally synthesizing surgical data with a complex underlying label structure. We are the first to show how training on samples synthesized this way improves the performance of a surgical tool classifier for rare/underrepresented cases of cataract surgeries. Finally, we conduct a visual Turing test in the form of a user study with clinical experts to validate the realism of our generated samples.

Presentation of Results@R1,R3,MR

We thank you for pointing out potential misunderstandings in the description of Table 3. Here, “Original” refers to the original CATARACTS training data, while “Extended” denotes the original data extended with synthesized samples. In Figure 5, we display the change in F1 scores for underrepresented phases, which improved the overall performance. We excluded the baselines’ samples because the quantitative findings on image quality suggest significantly inferior performance. The +/- values in Table 1 represent the standard deviation of the metrics.
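As a brief editorial illustration of that last point, “mean +/- std” entries like those in Table 1 are typically produced as follows; the values below are placeholders, not numbers from the paper.

```python
# Hypothetical illustration: mean and sample standard deviation over repeated runs.
import numpy as np

fid_runs = np.array([31.2, 29.8, 30.5])  # placeholder metric values, not from the paper
print(f"{fid_runs.mean():.1f} +/- {fid_runs.std(ddof=1):.1f}")  # ddof=1: sample std
```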

Generative Models Metrics and Terms@R1,MR

Using a pre-trained tool classifier for the Inception Score (IS) yields a measure of tool realism, since good values indicate that the classifier can reliably identify the tools in the synthesized images. For mathematical definitions of the Fréchet Inception Distance (FID) and the Kernel Inception Distance (KID), we kindly refer to the related literature on generative models [1]. “Mode collapse” describes a common failure mode of GANs that reduces image variability; score-based methods like our diffusion model are less prone to it.
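A minimal sketch of the IS computation described above, assuming `tool_probs` holds softmax-style outputs of the pre-trained tool classifier on the synthetic images; this is an editorial illustration, not the authors’ evaluation code, and the paper’s actual multi-label setting would need a suitable adaptation.

```python
# Inception Score: IS = exp( E_x[ KL(p(y|x) || p(y)) ] ).
# High values mean the classifier assigns confident, diverse labels.
import numpy as np

def inception_score(tool_probs, eps=1e-12):
    """tool_probs: (N, C) array of per-image class probabilities."""
    p_y = tool_probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
    kl = (tool_probs * (np.log(tool_probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```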

User Study Metrics and Details@R1,R3,MR

We thank you for hinting at potential misunderstandings in the descriptions of the user study. We presented images of roughly 235 x 132 pixels, randomly chosen from each phase and from different tool combinations. We computed a very low Fleiss’ kappa of 0.049, but our objective is not to evaluate inter-rater agreement; instead, we evaluate the validity of subjective binary decisions against a known ground truth. Therefore, we consider the average MCC, computed over all participants, an appropriate metric. In our case, the FP and FN values measure the same quantity, which we denote as the false classification rate (FR); the average FR is 61%. We are happy to share the user study images upon acceptance.
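A minimal sketch of the two user-study metrics as described above, assuming each participant gives a binary real/synthetic answer per image; function and variable names are hypothetical, not taken from the study materials.

```python
# Per-participant Matthews correlation coefficient (MCC) against the known
# real/synthetic ground truth, plus the false classification rate (FR),
# each averaged over participants.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def user_study_metrics(y_true, answers_per_rater):
    """y_true: 0/1 array (real/synthetic); answers_per_rater: one 0/1 array per participant."""
    y_true = np.asarray(y_true)
    mccs = [matthews_corrcoef(y_true, np.asarray(a)) for a in answers_per_rater]
    frs = [np.mean(np.asarray(a) != y_true) for a in answers_per_rater]  # fraction misjudged
    return float(np.mean(mccs)), float(np.mean(frs))
```

An average FR near 50% would correspond to chance-level guessing, so the reported 61% indicates participants misjudged the real/synthetic origin more often than not.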

Model Generalization and Result Analysis@ALL

All reviewers raised questions about the distribution of tools and the generalization capabilities to unlikely combinations of phases and tools. Figure 2 displays the ground-truth annotations of the dataset. The generalization of our approach to different phase and tool combinations is bounded by the annotation quality; therefore, generating wrong tool combinations for a phase, e.g., capsulorhexis forceps during suturing, is possible if such combinations are also present in the dataset. In the future, we aim to improve this by incorporating prior knowledge. R3 suggests testing on other data, e.g., Cholec80; we agree that this is a great suggestion and invite researchers to collaborate. We appreciate R2’s valuable suggestion for a more thorough examination of the performance declines depicted in Figure 5. We attribute these to a phenomenon akin to “catastrophic forgetting” in continual learning. This is a minor impediment to the goals of the manuscript, but we are committed to presenting a comprehensive solution in the future.

We hope we could answer all relevant questions and thank all reviewers.

[1] Theis, L., et al. (2015). A note on the evaluation of generative models. arXiv:1511.01844.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal addressed most of the critical concerns raised by the reviewers. Although some concerns around the FP and FN measures remain, given the contributions of the paper, especially the application of diffusion models to SDS with good results, I recommend acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a framework for combining classifier-free guidance with diffusion models to generate realistic cataract surgery data with complex underlying label structures. The paper is well-motivated and the proposed approach is innovative and of interest. The reviewers highlight the paper’s evaluations (including analysis of the quality of the synthetic data), and improvements on the downstream task as the strengths.

    The main comments/concerns from reviewers regarding clarification on experimental details, results and evaluation metrics, and novelty have been addressed by the rebuttal, and additional comments regarding improvements to figures (including Fig 2 and Fig 5) should be incorporated into the final version if accepted.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper conducts surgical video data synthesis based on denoising diffusion implicit models in order to address the challenge of data sparsity for the recognition task. The research topic is novel and interesting, and the paper is well-written and clearly presented. Overall, the paper received two positive ratings and one negative rating (which changed from reject to weak reject after the rebuttal). Despite some remaining issues with experimental details, the paper’s contributions in terms of task novelty and decent diffusion-model-based surgical data synthesis are worthy of acceptance.


