
Authors

Barak Ariel, Yariv Colbeci, Judith Rapoport Ferman, Dotan Asselmann, Omri Bar

Abstract

An accurate estimation of a surgical procedure’s time to completion (ETC) is a valuable capability that has significant impact on operating room efficiency, and yet remains challenging to predict due to significant variability in procedure duration. This paper studies the ETC task in depth; rather than focusing on introducing a novel method or a new application, it provides a methodical exploration of key aspects relevant to training machine learning models to automatically and accurately predict ETC. We study four major elements related to training an ETC model: evaluation metrics, data, model architectures, and loss functions. The analysis was performed on a large-scale dataset of approximately 4,000 surgical videos including three surgical procedures: Cholecystectomy, Appendectomy, and Robotic-Assisted Radical Prostatectomy (RARP). This is the first demonstration of ETC performance using video datasets for Appendectomy and RARP. Even though AI-based applications are ubiquitous in many domains of our lives, some industries are still lagging behind. Specifically, today, ETC is still done by a mere average of a surgeon’s past timing data without considering the visual data captured in the surgical video in real time. We hope this work will help bridge the technological gap and provide important information and experience to promote future research in this space. The source code for models and loss functions is available at: https://github.com/theator/etc.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_16

SharedIt: https://rdcu.be/dnwOQ

Link to the code repository

https://github.com/theator/etc

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper investigates the task of predicting the remaining surgery duration. It focuses on the effect of different loss functions and different model architectures on several different datasets. The paper seems to be primarily exploring the effect of different design decisions, rather than proposing a solution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The methods are evaluated on three different datasets
    • A large number of loss functions are evaluated
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Paper objective(s) is/are not clearly stated
    • It is unclear if the differences in performance are clinically meaningful (no measures of spread provided)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Pending adding a few more details about the loss functions, I believe this paper is reproducible. It makes use of several common datasets. Code will be released on publication.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper presents an exploration of design decisions related to predicting the remaining surgery duration. But it was unclear to me what the objective of the exploration is. For example, why study loss and model type? Why not study feature representation or model size? For example, why compare LSTM and transformer-based models? Why not compare TCN models too? I appreciate that it is not feasible to explore every design decision, but given it is an exploratory study, I think this paper could benefit from an explanation on why the particular aspects were explored and not others.

    My two takeaways from this paper are (1) the best loss function depends on the architecture and (2) using an ensemble of models for predicting remaining surgery duration has added value. But it is unclear if these differences are clinically meaningful. It would be helpful to add measures of spread to the reported values to see if there is a large effect.

    Given the above two takeaways are fairly common for machine learning tasks, it is unclear to me if this paper will have considerable impact in the space of remaining surgery duration prediction.

    The claim that SMAPE can be utilized as a better metric to compare ETC models is lacking evidence. It is certainly an interesting quantity to measure, but I believe that claiming it is better would require studying whether optimizing it leads to better OR efficiency than optimizing MAE.

    Description of the loss functions could use more information. For corridor loss, what is pi? For L_squared_error, what is S?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors leading to this score are: (1) the objectives of the exploration are not well defined and (2) it is unclear if the differences in performance for different design decisions are clinically meaningful.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    My primary concerns with the paper are: (1) the objectives of the exploration are not well defined and (2) it is unclear if the differences in performance for different design decisions are clinically meaningful.

    These concerns were not sufficiently addressed in the rebuttal. Thus, my review of the paper remains the same.



Review #2

  • Please describe the contribution of the paper

    This paper compares the impact of different components used to estimate surgical time to completion (ETC): 2 evaluation metrics, 5 loss functions, and 3 ETC models. The comparison is done on three datasets from different surgical procedures: Cholecystectomy, Robotic-Assisted Radical Prostatectomy (RARP), and Appendectomy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Comparison of different combinations of ETC components
    • Evaluation on different datasets
    • Comparison with one SOTA method

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Lack of novelty: as explained by the authors, no new models are presented
    • Incomplete dataset description

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility checklist for the dataset is not consistent with the paper. The checklist states that information for a public dataset is provided, but the paper does not mention the use of any such dataset. Furthermore, the checklist marks the new-dataset items as not applicable, whereas the paper presents a new dataset, and its description is incomplete (see comments).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Abstract: The acronym RARP is not defined.
    2. Section 3.1: What is the consequence of the fact that “MAE does not consider the actual video duration or the temporal location for which the predictions are made”? Explaining why this might be a problem would make the use of SMAPE clearer.
    3. Section 3.2: The description of the dataset is missing the following information: the number of surgeons and surgical sites for each surgical procedure. The detailed duration information for the training, validation, and test sets is provided in the supplementary material, but specifying the mean and standard deviation of the durations in the article would better support the claim that “RARP is almost four times longer on average”.
    4. Section 3.2: How has the duration of the surgery been defined? Is it based on the camera (inserted into/removed from the patient) or measured from the first incision to the final suture? It would be a plus to discuss the limitations of this duration definition in the context of integration into clinical practice.
    5. Section 3.3: Equation 8: What is the meaning of the term y_{t-S}?
    6. Section 3.3: To improve the clarity of the loss presentation, it would be better to present the Internal L1 loss before the Total Variation Denoising loss, as the total variation loss depends on it.
    7. Section 4.1: On which part of the dataset is this study performed?
    8. Section 4.1: Error analysis: What loss combination has been used for this study?
    9. Section 4.1: Baseline comparison. The authors specified that they use a different learning rate compared to the initial RSDNet, without specifying the value.
    10. Section 4.1: what is snorm?
    11. Section 4.2: It is not clear what the set presented in this section contains. Does it include all 4 models or only some of them? Does the averaging give the same weight to each model or not?
    12. Table 2: what is 90p SMAPE?
    13. Table 1,2,4: From which part of the dataset are these results obtained?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The comparison of the impact of the different ETC components is interesting and could aid global understanding. However, several pieces of information about the study are not presented or defined. Clarifying these points would greatly improve the paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors’ rebuttal has responded to all comments.



Review #3

  • Please describe the contribution of the paper

    Using a feature extraction network to provide features for LSTM and Transformer models, the authors conduct multiple experiments to estimate the time to completion (ETC) of different surgical procedures. The authors study four major elements: evaluation metrics, data, model architectures, and loss functions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The authors applied different combinations of loss functions for training their methods. The authors also share detailed information for different loss functions in Section 3.3.

    (2) The authors evaluate their methods on multiple datasets.

    (3) While LSTM is the most common method for solving ETC or RSD tasks, the authors also conduct experiments with a Transformer.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) While the authors conduct many experiments, some of the results are not very clear to the reader. (a) May I ask why validation-set results are reported in Table 1 instead of test-set results? (b) In Table 1, the MAE/Mean SMAPE changes are very small for ETC-LSTM and ETCformer. In Table 2, once again, the MAE changes for Cholecystectomy and RARP are very small; if possible, could the authors please elaborate on this?

    (2) Lack of details about the feature extraction process; please provide more details. For example: what are the video clip lengths and number of frames used in VTN for feature extraction? Is the VTN also pre-trained on RARP or Appendectomy before feature extraction?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Please add more details for the feature extraction part of the paper for better reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) The authors compared their method with RSDNet. As single-model methods do not consistently outperform RSDNet across procedures, the authors resort to ensemble methods, which they acknowledge require substantial computational resources. It would be great to at least share the inference speeds of the different methods in the paper.

    (2) I think one future-work direction for the authors could be to propose more advanced and novel methods for ETC.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an application paper in my eyes. Although the novelty of the paper is not strong, I agree that the information in this study will assist researchers in developing new methods. I hope the authors open-source part of their dataset if it is possible.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors did not address the feature extraction process well enough. VTN is a video classifier. The authors mention that the VTN model uses 1 FPS and that every second of the video is represented by one feature vector. It seems the authors used the ViT backbone from VTN to extract the features. If ViT is used, why pre-train VTN instead of ViT? If VTN is used, what is its input? Are 32 frames sent to VTN every second, or are more frames used for feature extraction?

    The SMAPE score differences in Figure 1 are big, not small, so the authors did not address this point.

    I still hope the authors can address my points if the paper gets accepted. Considering that the weaknesses I raised were not addressed well, I cannot raise my score and still rate the paper as a weak accept. If the other reviewers still think the paper should be rejected, then I hope the authors will address the comments from all reviewers and resubmit this work in the future.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present their work exploring estimated time to completion for surgery based on endoscopic video. The authors do this by exploring the contributions of 2 evaluation metrics, 5 Loss functions, and 3 ETC models.

    There is clinical utility in estimating time to completion from a hospital logistics standpoint. The authors explore several different elements that may impact the performance of models that predict ETC and do so on datasets of different types of operations.

    The paper, however, does not provide any additional novelty, as it largely explores the contributions of the aforementioned elements to ETC performance. That is ok and is clearly recognized and stated by the authors. However, several weaknesses merit consideration by the authors and present opportunities for clarification:

    1) What is the consequence of the fact that “MAE does not consider the actual video duration or the temporal location for which the predictions are made”? Explaining why this might be a problem would make the use of SMAPE clearer.
    2) Additional information on the dataset is of interest, given that surgeons vary in their technique and their procedure lengths may cluster. What was the number of surgeons and surgical sites for each surgical procedure? What is the authors’ definition of case length in the dataset?
    3) For the ablation study in Section 4.1, on which part of the dataset was the study performed? Additionally, what loss combination was used in this section? Also, the authors specified that they use a different learning rate compared to the initial RSDNet, without specifying the value.
    4) For Table 1, the MAE/Mean SMAPE changes are very small for ETC-LSTM and ETCformer. In Table 2, once again, the MAE changes for Cholecystectomy and RARP are very small; if possible, can the authors explain this further?
    5) Additional details of the feature extraction process would be helpful; please provide more. For example: what are the video clip lengths and numbers of frames used in VTN for feature extraction? Is VTN also pre-trained on RARP or Appendectomy before feature extraction?
    6) Reviewer #2 also notes several areas where terms should be clarified to improve understanding:

    • Equation 8: What is the meaning of the term y_{t-S}?
    • Table 2: what is 90p SMAPE?




Author Feedback

We thank the reviewers and the meta-reviewer for reading our manuscript and providing us with these constructive comments. Below we address the weaknesses raised and describe the changes we will make in the revised version.

(1): MAE only considers the difference between the ground truth and the prediction while ignoring two important properties: (A) the length of the video and (B) where in the video the errors occur (e.g., the beginning vs. the end). An example for (A): a 10-minute absolute error in a 30-minute video is more significant than a 10-minute error in a 1000-minute video. An example for (B): in a 1000-minute video, a 10-minute error in the final minutes of the video is more significant than a 10-minute error at the beginning. SMAPE solves both problems by dividing the absolute error by the sum of the ground truth and the prediction. This property enables a fair comparison between videos of different lengths.
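The normalization described here can be illustrated with a small sketch. The `mae` and `smape` helpers below are illustrative, not the paper's code, and SMAPE conventions vary (some definitions halve the denominator); this follows the rebuttal's description of dividing the absolute error by the sum of ground truth and prediction:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the original time units (e.g., minutes)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def smape(y_true, y_pred):
    """SMAPE as described above: absolute error divided by the sum of
    ground truth and prediction, averaged and expressed in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / (y_true + y_pred)))

# The same 10-minute error in a short case and in a long case:
print(mae([30.0], [20.0]), mae([1000.0], [990.0]))  # 10.0 10.0 -- MAE cannot tell them apart
print(smape([30.0], [20.0]))    # 20.0 -- large relative error
print(smape([1000.0], [990.0])) # ~0.5 -- small relative error
```

The same absolute error yields very different SMAPE values, which is the property the rebuttal argues makes comparisons across video lengths fair.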

(2): A summary of Medical Centers (MC) and surgeons per procedure:

  • Cholecystectomy: 14 MCs and 118 surgeons (MC1: 47, MC2: 23, and the rest with fewer than 15 surgeons each)
  • Appendectomy: 5 MCs and 61 surgeons (the largest MC has 53 surgeons, and the rest have fewer than five each)
  • RARP: 2 MCs and 14 surgeons (MC1: 11 surgeons, MC2: 3)

The case duration is defined as the difference between surgery start and end times, which is the time interval between scope-in and scope-out.

(3): The ablation study in Section 4.1 is done on the Cholecystectomy dataset. This is described in Section 3.2: “The first dataset is Laparoscopic Cholecystectomy… This dataset was utilized for the development and ablation study”. We first explore various loss variations, as shown in Table 1, and the rest of the analysis is done with the best performing model for each variation. We will add a clarification in the revised manuscript to ensure this is better explained.

We used the same initial learning rate as in RSDNet (0.001) but changed the LR scheduler as noted in the paper: “only changing the learning rate reduction policy to match the same epoch proportion in our dataset” - That is, while RSDNet reduced the learning rate by a factor of 10 every 10K iterations, we reduced it by a factor of 10 every 180K iterations due to the longer schedule.
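The schedule change described above can be sketched as follows (`lr_at` is a hypothetical helper for illustration, not the authors' training code):

```python
def lr_at(iteration, base_lr=1e-3, step_period=180_000, gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma once per step_period
    iterations. RSDNet used step_period=10_000; the variant described in
    the rebuttal keeps the same rule with step_period=180_000."""
    return base_lr * gamma ** (iteration // step_period)

print(lr_at(0))         # 0.001 (initial LR, same as RSDNet)
print(lr_at(179_999))   # 0.001 (still before the first drop)
print(lr_at(180_000))   # first 10x reduction
print(lr_at(10_000, step_period=10_000))  # RSDNet's original schedule drops here already
```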

(4) We agree that it can be difficult to directly judge the significance of numerical differences in MAE/SMAPE metrics. To build intuition, we visualized examples in Figure 1(a) that show predictions differing by 1-2 SMAPE. In our experience reading these figures, even slight differences in SMAPE can reveal qualitatively significant improvements in the prediction curves, as shown in Figure 1(a). To further sharpen the reader's intuition, we will include more examples in the supplementary material.

(5): The feature extraction done with the VTN model uses 1 FPS, i.e., every second in the video is represented by one feature vector. VTN was pre-trained on the surgical step task for all three datasets.

(6): Eq. 8: S is an interval that represents the time span (jump) between two timestamps. Thus, L_squared_error is the sum of squared errors for each pair of points (t, t-S), computed independently. Table 2: 90p stands for the 90th percentile; we will adjust the table text to clarify this.

  • measures of spread: we will add the standard deviation to the tables for the reported mean SMAPE values.
  • pi in the corridor loss is a wrapper function that weights the loss according to the location of the predicted value and whether it lies inside or outside the corridor borderlines (Figure 2).
  • RARP acronym: We will adjust in the revised manuscript.
  • Clarity of loss presentation: We will change the loss order appearance to clarify this.
  • snorm is a hyperparameter described in the RSDNet paper. It is a regularization factor that normalizes the elapsed time by dividing it by a constant value, snorm.
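The snorm normalization described in the last bullet amounts to a simple scaling of the elapsed-time input. A minimal sketch (the snorm value below is a placeholder chosen for illustration; the actual constant is the hyperparameter reported in the RSDNet paper):

```python
import numpy as np

def elapsed_feature(elapsed_seconds, snorm):
    """Divide the elapsed time by the constant snorm so the auxiliary
    progress input stays in a small numeric range for the model."""
    return np.asarray(elapsed_seconds, dtype=float) / snorm

# Placeholder snorm of 60.0 (i.e., elapsed time in minutes), purely for illustration:
feats = elapsed_feature([0, 300, 3600], snorm=60.0)
print(feats.tolist())  # [0.0, 5.0, 60.0]
```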




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I found that the authors adequately addressed the concerns raised in the initial meta-review. In fact, one reviewer upgraded their score from a 4 to a 6, while the other two reviewers maintained their scores of 4 and 5. On balance, I found the work interesting, if not entirely novel: it offers a look at a multi-institutional dataset, it performed extensive testing, and it explored a topic of clinical relevance. As such, I lean toward accept.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper aims to predict the estimated time to surgical procedure completion, i.e., predicting the remaining surgery duration from video data. This is an interesting and under-explored research task in surgical data science. The paper conducts extensive experiments on three types of procedures. The rebuttal successfully addressed several concerns about method and experimental details, and R2 changed their score from weak reject to accept. Overall, after the rebuttal, two reviewers are positive and one is negative. The remaining concern is the unclear clinical meaningfulness of the task, which is somewhat understandable, but this is quite a subjective assessment. The meta-reviewer considers the method design and task formulation reasonable with respect to current research progress in the field. This would be an important research task in CAI, and related work on it should be encouraged. The overall recommendation of acceptance is based on the majority consensus among the reviewers and meta-reviewer.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal did not adequately address several critical concerns raised by the reviewers, such as the unclear definition of the objectives and the method design; therefore, I recommend rejection.


