
Authors

Puxun Tu, Hongfei Ye, Jeff Young, Meng Xie, Ce Zheng, Xiaojun Chen

Abstract

Phacoemulsification cataract surgery (PCS) is typically performed under a surgical microscope following standard procedures. The success of this surgery depends heavily on the seniority and experience of the ophthalmologist performing it. In this study, we developed an augmented reality (AR) guidance system to enhance the intraoperative skills of ophthalmologists by proposing a two-stage spatiotemporal learning network for surgical microscope video recognition. In the first stage, we designed a multi-task network that recognizes surgical phases and segments the limbus region to extract limbus-focused spatial features. In the second stage, we developed a temporal pyramid-based spatiotemporal feature aggregation (TP-SFA) module that uses causal and dilated temporal convolution for smooth and online surgical phase recognition. To provide phase-specific AR guidance, we designed several intraoperative visual cues based on the parameters of the fitted limbus ellipse and the recognized surgical phase. The results of comparison experiments indicate that our method outperforms several strong baselines in surgical phase recognition. Furthermore, ablation experiments show the positive effects of the multi-task feature extractor and the TP-SFA module. Our developed system has the potential for clinical application in PCS to provide real-time intraoperative AR guidance.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_64

SharedIt: https://rdcu.be/dnwMo

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors present an intraoperative AR guidance system for phacoemulsification cataract surgery (PCS). The authors designed a two-stage spatiotemporal learning network for surgical microscope video recognition consisting of (1) a multi-task network to recognize surgical phases and segment specific regional features; and (2) a temporal pyramid-based spatiotemporal feature aggregation (TP-SFA) module which uses temporal convolution for online surgical phase recognition. Phase-specific AR guidance is presented through intraoperative visual cues based on the fitted parameters of the limbus ellipse and the predicted surgical phase, and comparison and ablation experiments on a benchmark dataset were performed to demonstrate the positive impact of the multi-task feature extractor.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors set out to create an augmented reality-based guidance system to enhance intraoperative skills of ophthalmologists. Previous strategies focused on frame-wise processing of surgical video data, potentially leading to loss of temporal information and challenges with surgical scene recognition. Further, the surgical phase is typically not considered, leading to the presentation of augmented visual information which is not relevant to the current task and may overwhelm the surgeon. A novel aspect of this work is in the author’s regional focus by their spatial feature extraction network on the limbus region – a key structure that is targeted during the cataract surgery procedure. The authors performed a rigorous analysis of their approach against prior strategies in the literature and included additional ablation experiments to demonstrate the impact of the multi-task feature extractor and TP-SFA module.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Many of the networks involved in this work (ResNet-50 backbone, U-net segmentation architecture, etc.) are not novel on their own; however, they are novel in their combination, specific tuning to cataract surgical video, and focus on online recognition for real-time intraoperative guidance. The preliminary AR guidance, presented as visual cues to support the 9 phases of the PCS procedure, is not evaluated in a population of surgeon users and instead serves as a proof of concept.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Software for training the described network is not presented. Detailed descriptions of the networks involved are included, and benchmarks were performed on publicly available datasets. Additional supplementary video is included to show performance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    When viewing the included supplemental video demonstrating online performance of your algorithm, I found the virtually augmented information somewhat distracting from the underlying anatomy. Were surgeons involved in the design/selection of the relevant guidance information to be presented at each surgical phase? In this work the authors focused on PCS; I am wondering whether you anticipate that a feature extraction approach focused on the limbus region could generalize well to other eye surgeries. In your dataset description, please clarify the number of labeled frames that were manually delineated by the non-M.D. experts. Please also comment on the accuracy of the limbus regions identified by your network: is the error within a suitable range for guiding microsurgery (< 1 mm)? If not, what steps are required to improve the system's accuracy?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting idea for phase-specific augmented guidance, well written paper, and detailed analysis and ablation on a benchmark dataset. Additional clarification on the augmented content selected for visual cue selection and discussion on generalization to other procedures would be beneficial.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Thank you to the authors for your responses to our suggestions. I believe this paper fits well within the scope of MICCAI submissions (focus on feasibility and initial evaluation of a novel idea). I look forward to your future insights as to the benefit of contextual references when assessed during physician-led studies of AR guidance during PCS.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose an AR-guided system for phacoemulsification cataract surgery (PCS). Their system consists of a network that performs limbus region segmentation and phase recognition at the same time, and an algorithm to generate the visual cues for AR guidance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The concept of this system is novel and interesting, especially in integrating AI and AR for PCS and defining phase-specific AR guidance. I am eager to see how it can be used in clinical application.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although promising, there are some major weaknesses that remain to be further addressed:

    1. Unclear novelty of the proposed method: The paper presents a novel intraoperative AR guidance system for PCS, but the novelty of the method is not clearly demonstrated. The use of a two-stage spatiotemporal network for surgical microscope video recognition appears to be inspired by existing methods, such as multi-task CNNs [16] and deep CNN-based methods [17]. The paper should provide a more explicit discussion of the novelty of the proposed method compared to these previous works.
    2. Unclear focus of the paper: The whole paper is framed around an augmented reality system; however, its focus is not clear, as the evaluation mainly covers surgical phase recognition and segmentation. There is no evaluation of the AR guidance itself, such as AR overlay accuracy in 2D or 3D, nor an initial user study of the system. In short, the organization of this paper is ambiguous.
    3. Lack of comparison with more recent state-of-the-art methods: The paper compares its method with several strong baselines, but it is unclear whether the most recent state-of-the-art methods have been included in this comparison. For example, the paper does not mention any comparison with [4][27], which are transformer-based models for aggregating spatial features. Providing a comparison with these recent methods would strengthen the paper’s claims of superiority over existing approaches.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors do not state that they will release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Methodology clarification: The paper would benefit from a more detailed explanation of the proposed spatiotemporal network, particularly the TP-SFA module. It would be helpful to include a clear explanation of the motivation behind the specific architecture choices and how they address the current limitations in AR-guided phacoemulsification systems.
    2. AR Guidance Evaluation: It is mentioned that your method achieves real-time intraoperative microscope video processing at 36 fps. While this is promising, a more comprehensive evaluation of the system’s performance in a clinical setting would be useful. This might include feedback from ophthalmologists, an analysis of the impact of the system on surgical outcomes, or a comparison with existing AR-guided systems.
    3. Figure 2 (a) is very similar to the figure in ref [12]. A different style is strongly suggested.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall paper organization, method novelty, and concept.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a deep-learning-based framework for spatiotemporal analysis of phacoemulsification cataract surgery videos to be utilized in an augmented reality guidance system.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very well written and organized and includes very professional and informative visualizations of the proposed framework and qualitative results. The proposed framework is able to jointly recognize the surgical phase and segment the limbus region.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I have some concerns regarding the applicability of the proposed framework, the generalizability of the trained network, dataset, baseline methods, and the experimental results, which I list in the following.

    1) Generalizability: The results of the current trained network for phase recognition cannot be regarded as a guidance tool for novice surgeons. In fact, the phases in a phaco cataract surgery do not follow a particular order. There are several intra-operative irregularities that affect the workflow, such as pupil reactions and intraocular lens rotation. Besides, depending on the hardness of the lens and other factors, phases such as viscoelastic and phaco occur more than once during the surgery. In such cases, which are not rare, the next phase indicated by the proposed framework differs from the real next phase, which can be distracting for novice surgeons. In addition, surgeons use a microscope that provides a 3D view of the eye; adding boundaries in the AR software can even degrade the 3D surgical scene.

    2) Applicability: The authors should have conducted an expert review with a qualitative study to verify the suitability of the current developed framework as an augmented reality aid for real-time surgery.

    3) Applicability: It is mentioned that the limbus is annotated for all frames. Since frame-based segmentation of the limbus (even at 1 fps) is very time-consuming, the current settings limit applicability in real-world conditions. Indeed, reproducing the results with such an amount of annotations for a new hospital/camera with a domain shift from the current dataset is not possible.

    4) Dataset: The authors mention that all videos in the dataset are subsampled to a temporal resolution of one frame per second. However, it is argued in the paper that the proposed network can detect the phases at a speed of 36 fps. How can a network trained on videos at 1 fps detect phases at 36 fps? Can the authors justify their argument by providing the detailed settings of the inference step?

    5) Baseline methods: References [9,11] are very old. On the other hand, there are several SOTA approaches that have not been listed and considered as baselines. With the two keywords “phase recognition” and “cataract surgery”, I found the following papers that have a similar objective:

    Garcia Nespolo R, et al., Evaluation of Artificial Intelligence–Based Intraoperative Guidance Tools for Phacoemulsification Cataract Surgery.

    N. Ghamsarian, et al., Relevance Detection in Cataract Surgery Videos by Spatio-Temporal Action Localization.

    Ting Wang, et al., Intelligent cataract surgery supervision and evaluation via deep learning.

    6) Experimental results: performance improvement over the two SOTA baselines is marginal.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have not mentioned that they will release the limbus annotations (considering the reproducibility response and the paper). Hence, the results cannot be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would suggest that the authors compare the results with more SOTA methods, provide expert review results, and delineate the inference setting regarding frame rate.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Necessity of expert review considering the mentioned problems in applicability, marginal improvement over current baselines, and lack of comparisons with important SOTA as baselines are the main reasons for the current decision.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors have made commendable efforts to address various concerns and provide justifications. I particularly appreciate the authors’ explanations regarding the incorporation of boundaries into 3D surgical scenes. However, after careful consideration, I stand by my previous decision for the following reasons:

    (1) I find the limited comparisons with state-of-the-art methods due to the lack of source codes unacceptable. Many state-of-the-art methods offer sufficient details to enable reproducibility, and the absence of this information hinders the assessment of the proposed approach.

    (2) The requirement for a very large number of annotations (24,613) raises concerns about the practicality of the method in real-world conditions with domain-shift problems.

    (3) Predicting at a rate of 1 fps results in a very low temporal resolution, which is particularly unsuitable for analyzing surgical videos. Surgical procedures often involve rapid and intricate movements, and a higher temporal resolution would be necessary to capture fine details and precise actions.

    (4) I maintain my previous concern regarding the order of phases in cataract surgery. Phacoemulsification cataract surgery can involve several unpredictable intra-operative complications that require immediate decision-making by the surgeon regarding the next surgical phase. These complexities limit the applicability of the proposed method, as it may not account for such dynamic scenarios.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a system for phacoemulsification cataract surgery (PCS) that is guided by augmented reality (AR). The proposed system comprises a network capable of simultaneously conducting limbus region segmentation and phase recognition. The concept of integrating AI and AR for surgery guidance is interesting; however, the reviewers have raised several concerns about unclear clinical applicability, lack of experimental evaluation of the AR guidance, unclear method innovation, etc. I invite the authors to submit a rebuttal focusing on addressing the reviewers' comments.




Author Feedback

We thank all reviewers (R1, R2, R3, Meta-R) for their constructive suggestions. We categorize their main concerns and respond to each in the following.

  1. Clinical applicability (R1, R3): Despite variations in fine-grained actions in clinical practice, PCS follows a consistent pattern of surgical phases, making it a quasi-standardized procedure. Our method utilizes a global temporal aggregation module to achieve smooth prediction and merge repeated fine-grained actions into neighboring major phases. We combine the current recognized phase with the previous phase to predict and show the next phase, enabling more accurate predictions and reducing distractions. Our spatial feature extractor shows potential for application in other ocular anterior segment surgeries, but it is not suitable for posterior segment surgeries.

  2. AR guidance system evaluation (R1, R2, R3): This study focuses on describing and evaluating the algorithmic aspects of the AR system. Thus, we evaluated the efficiency, segmentation accuracy, and phase recognition accuracy of our preliminary AR system and identified several instances of failed AR scenes. However, we acknowledge that these evaluations are insufficient to demonstrate the full clinical applicability of the system. Therefore, we plan to conduct a user study, following approval by the ethics committee. The user study will employ the ICO-Ophthalmology Surgical Competency Assessment Rubric to further assess the system’s performance in clinical settings.

  3. Method innovation (R2): Our motivation is based on the insight that global temporal features can offer contextual references, while incorporating local temporal features can provide fine-grained information for accurate phase recognition. Thus, we designed the TP-SFA module to aggregate multi-scale temporal features. The SOTA method [8] only aggregates global temporal features; [16] and [17] do not incorporate temporal information for phase recognition.

  4. Comparison with SOTA methods (R2, R3): We compare our approach with existing methods that have publicly available source code. [8] is the SOTA transformer-based method with open-source code. In comparison with [8], our method shows improved performance, with smoother prediction results and better accuracy on challenging local frames (Fig. 2 (a)). We include [9] and [11] in our comparison, as they serve as baseline methods for our feature extractor.

  5. Distractibility (R1, R3): Modifying the color, transparency, and style of visual cues can eliminate potential distractibility to underlying anatomy. Regarding 3D surgical scene, a practical method involves using two cameras, such as the Sony MCC-1000MD. AR is then applied to each camera, and the synthesized 3D scene is displayed on an external 3D surgical monitor. Further evaluation is necessary to assess its suitability.

  6. Inference setting (R3): During online inference, the spatial feature of the current frame is stored into a historic queue. We then sample the historical queue at 1 fps, with the current spatial feature as the last element, to get a sub-queue for temporal aggregation and phase prediction. This approach enables real-time inference since we only need to compute spatial features for the most recent frame rather than all past frames.

  7. Limbus segmentation (R1, R3): Two non-M.D. experts manually delineated 24,613 frames using a hierarchical data annotation method (DOI: 10.1038/s41467-022-29637-2) to reduce workload. The segmentation error of the image (1920×1080 pixels) corresponds to the actual physical error in the microscope, which is significantly less than 1 mm. Note that most surgical video recognition methods encounter domain-shift challenges. Our method focuses on an open dataset. To enhance generalization, incorporating a mixed dataset from different hospitals can be beneficial.

  8. Style of Fig. 2 (a) (R2): We adopt the common practice in surgical phase recognition by using a color-coded ribbon to show the results.
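As an illustrative aside on item 3 above: the causal, dilated temporal convolution at the heart of the TP-SFA module can be sketched in a few lines. This single-channel, pure-Python version is our simplification (the function name and signature are hypothetical; the paper's module operates on multi-channel feature sequences and stacks several dilation rates into a pyramid):

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Causal dilated 1-D convolution over one feature channel.

    x:        list of floats (a feature value per frame, oldest first)
    weights:  kernel taps, applied to the current and past frames only
    dilation: gap between taps; stacking layers with dilations
              1, 2, 4, ... grows the temporal receptive field
              without ever reading future frames (online-safe).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            # tap i looks back i * dilation frames; j never exceeds t,
            # so no future frame is touched (causality)
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out
```

Because each output at time t depends only on frames at or before t, the operator is safe for online recognition; widening the dilation per layer is what gives the "temporal pyramid" its multi-scale receptive field.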
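The inference setting described in item 6 can be made concrete with a small sketch (class and method names are ours, not the paper's): a spatial feature is pushed once per incoming frame, and each prediction samples the history at 1 fps with the current frame guaranteed to be the last element of the sub-queue.

```python
from collections import deque

class OnlinePhaseBuffer:
    """Sketch of the rebuttal's online inference setting (hypothetical API).

    Spatial features are computed once per incoming frame and appended to
    a bounded history; before each phase prediction, the history is
    sampled at 1 fps, keeping the newest feature as the last element.
    """

    def __init__(self, fps=36, max_seconds=600):
        self.fps = fps
        # bounded so memory stays constant during long surgeries
        self.history = deque(maxlen=fps * max_seconds)

    def push(self, feature):
        # one spatial-feature computation per frame; no recomputation of
        # past frames is ever needed, which is what enables real-time use
        self.history.append(feature)

    def sample_1fps(self):
        # walk backwards in steps of `fps` so the current frame is always
        # included, then restore chronological order (newest last)
        feats = list(self.history)
        return feats[::-1][::self.fps][::-1]
```

The sub-queue returned by `sample_1fps` is what would feed the temporal aggregation stage, matching the 1 fps sampling used at training time.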
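For the sub-millimetre claim in item 7, the pixel-to-physical conversion is a simple proportion. The field-of-view width below is an assumed placeholder, not a figure from the paper; the point is that with any FOV of this order, a boundary error of a few pixels in a 1920-pixel-wide image stays well under 1 mm:

```python
def pixel_error_to_mm(err_px, fov_mm=15.0, width_px=1920):
    """Convert a segmentation boundary error from pixels to millimetres.

    fov_mm is an assumed microscope field-of-view width (illustrative
    only); each pixel then spans fov_mm / width_px millimetres.
    """
    return err_px * fov_mm / width_px
```

Under these assumptions, a 5-pixel boundary error maps to roughly 0.04 mm, far inside the < 1 mm range the reviewer asks about.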




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed the most critical concerns raised by the reviewers, especially clinical applicability and method motivation. Though some concerns remain, such as the insufficient evaluation in a clinical setting, which the authors plan to explore in future work, given the contributions of the paper I recommend acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents an augmented reality-guided system for phacoemulsification cataract surgery stage identification. The proposed framework consists of a multi-task learning module, a spatial feature aggregation module, and a spatiotemporal feature aggregation module. The idea of the proposed method is interesting and the experimental results outperform the compared methods. However, the reviewers raised questions about the clinical applicability, comparative experiments, and systematic evaluation of the paper. After the rebuttal, R3 states that the authors have still not addressed the practicality and comparative-experiment aspects. In this regard, I agree with R3 and am inclined to reject the paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After reading the reviews and rebuttal: this paper presents an intraoperative AR guidance system for phacoemulsification cataract surgery. Although concerns about comparison experiments and clinical application were raised, the rebuttal has addressed those concerns. I recommend acceptance.


